Web Server Overloads

In December 2023, TWiki was upgraded to 6.0.1 and migrated from ServerJustice to ServerPetra. Promptly thereafter, we started seeing kernel OOM (out-of-memory) alerts, and the resulting oom-killer action, on petra.

Symptoms

Kernel OOM

The kernel OOMs typically included lines that looked like this:

view invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
CPU: 0 PID: 226139 Comm: view Not tainted 5.10.0-26-amd64 #1 Debian 5.10.197-1
Tasks state (memory values in pages):
oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/apache2.service,task=view,pid=226125,uid=33
Out of memory: Killed process 226125 (view) total-vm:38780kB, anon-rss:30296kB, file-rss:0kB, shmem-rss:0kB, UID:33 pgtables:116kB oom_score_adj:0

The Perl script that renders pages for TWiki is called view. It gets run any time a TWiki page is requested by a browser.

Eventually, Apache itself gets killed off as the cause of the problem. I'm not sure if that's the OOM killer being smart, or systemd being smart, or just dumb luck. Regardless, our monitoring sends notifications that the web server is not answering, and the Apache service needs to be restarted.

Worse still, sometimes some other process would actually be the one to trip OOM, after TWiki ate almost all the RAM. So we'd lose the fail2ban daemon or something, and then Apache a little later.

With process limits

With per-process memory limits in place, the kernel alerts stopped, and instead view would abort when a render got too big. This at least limited the damage to that TWiki request, but it meant some pages could never render. Examples of the log messages one might see (across several log files) include things like:

Out of memory! in the global Apache error log.

[cgid:error] End of script output before headers: view, referer: https://wiki.gnhlug.org/TWiki/WebTopicList in the vhost-specific error log.

| 2023-12-26 - 12:00:38 | guest | view | TWikiDocumentation | Mozilla | x.y.z.w | in the TWiki access log.

Nothing in the TWiki warn log. I believe the Perl process running TWiki just coughs and dies, without a chance to log anything.

Causes

Obviously, something is using much more memory than before; petra has twice as much memory as justice, yet it's the one running out. The software on petra is much newer, and 64-bit vs. 32-bit, but that's not enough to justify the dramatic increase. Neither server has much load most of the time, and petra has no other significant sites hosted at this time. It seems likely that TWiki 6.x is using significantly more memory due to internal changes, but that much more?

Crawlers and robots

Current theory is, the configuration on petra is allowing more rude robots to make requests. The TWiki configuration was not carried over from justice. In particular, petra is not (yet) running the TWiki BlacklistPlugin. (That plugin has its own issues, so I'm not in a rush to add it.)

Just blocking everything that's not human also blocks crawlers from Google, Bing, etc. We want to be found. We just don't want to be slammed. So casting too broad a net is bad, too.

The recommended /robots.txt was put in place. I don't think this helped much, if at all, but it can't hurt, and it's a good idea for non-rude robots.
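
For what it's worth, the general shape of such a file is to steer polite crawlers away from the expensive script URLs. The paths below are illustrative, not a copy of our actual file:

User-agent: *
Disallow: /twiki/bin/edit/
Disallow: /twiki/bin/attach/
Disallow: /twiki/bin/rdiff/
Disallow: /twiki/bin/search/
# Crawl-delay is non-standard and only honored by some crawlers
Crawl-delay: 10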

In the LocalSite.cfg file, RobotsAreWelcome is set to true. Setting it to false just causes TWiki to emit HTML META tags saying "no robots" for all pages. TWiki always adds nofollow tags to links which should not be crawled (e.g., edit buttons). Checking the rendered HTML, it seems both old and new sites are putting nofollow tags in where they should. So that's not it.
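
For reference, the standard markup for these looks roughly like this (illustrative, not copied from our rendered pages; the link path is made up):

<meta name="robots" content="noindex, nofollow" />
<a href="/twiki/bin/edit/Main/SomeTopic" rel="nofollow">Edit</a>

The META tag is the page-wide "no robots" signal that setting RobotsAreWelcome to false would produce; the rel="nofollow" attribute is the per-link tag TWiki adds to edit buttons and the like.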

From the logs, it looks like some robots are just crawling the site as fast as they can, ignoring robots.txt and nofollow and whatever. So apparently the problem is rude robots - asking nicely doesn't help.

Adding the worst offenders to the User-Agent blacklist in the Apache config (badbots.inc) seems to be helping a lot.
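
I won't reproduce the real list here, but a User-Agent blacklist of this sort (using mod_setenvif) generally looks something like the following; the bot names are made up:

BrowserMatchNoCase "EvilBot|SomeOtherScraper" bad_bot
<Location "/">
    <RequireAll>
        Require all granted
        Require not env bad_bot
    </RequireAll>
</Location>

Matched clients get an HTTP 403 and never reach the CGI at all.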

Piggish TWiki pages

Some TWiki pages simply use tons of resources to render. Typically because they have lots of content, or pull in lots of content from other pages. More complicated markup or plugins also use more resources to render. Combined with rude robots, this quickly leads to large memory consumption.

TWikiDocumentation was the worst offender. It seems that page tries to include every TWiki doc page, combining them into one giant page. It appears more than 128 MB was consumed for each page render request. (Run a few of those at once and you can easily blow our 1 GB out of the water.) I neutered the page. Search is a thing.

TWikiHistory is another noteworthy one. It doesn't have as many includes, but it's very long and full of markup. It takes more memory and a lot of CPU time. For the moment I'm leaving it.

Mitigation

There are things we can do to significantly reduce the impact of this. They don't really address the root cause, but they do limit the damage. See Prevention for things that actually address the cause.

Resource limits are mostly about mitigation, but are complex enough to warrant their own section.

Memory and swap space

We could add more RAM, of course. It's a cloud VM; all we have to do is increase the performance tier. However, informal analysis of memory use during a lower level of induced flooding suggests that most of the time, we have plenty of memory. The problem is rude robots effectively DoSing us. Trying to buy our way out of that is a poor way to spend money, assuming we could even afford it.

Adding a 2 GB swap file to petra gave us some breathing room. This also allows rarely/never used memory pages to be moved out to disk, saving RAM for things that are actually useful (including caching/buffering).
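
For the record, the usual recipe for a swap file is roughly the following (the exact commands used on petra may have differed slightly):

fallocate -l 2G /swapfile    # or dd, if the filesystem doesn't support fallocate for swap
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# plus a line like "/swapfile none swap sw 0 0" in /etc/fstab so it survives reboots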

Caching

Looking into TWiki:Plugins.CacheAddOn is likely a good idea (FIXME). It should help performance generally -- from what I've seen, it speeds up TWiki.org dramatically. Since robots don't log in, all their page requests should be cached, avoiding the render process entirely. However, this ultimately just moves the problem to CPU or bandwidth, or allows more robot requests at once before we run out of RAM.

Prevention

Ideally, what we really want is a way to identify rude client behavior, and then block just that. Nothing is nice enough to set the evil bit, alas, but some kind of per-IP rate limiting would be very good, I think. It would prevent rude robots (or even just a really busy NAT exit) from overwhelming the site, without blocking them completely, and without affecting polite sources at all.
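
We haven't deployed anything like this yet, but as an illustration of the idea, mod_evasive does crude per-IP rate limiting with a handful of directives. The numbers below are guesses, not tested values:

DOSPageCount      10     # hits on the same URL allowed per interval, per IP
DOSPageInterval   1      # that interval, in seconds
DOSSiteCount      50     # total hits allowed per interval, per IP
DOSSiteInterval   1      # that interval, in seconds
DOSBlockingPeriod 60     # how long an offending IP gets 403s, in seconds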

Resource limits are mostly about mitigation, but if they are smart enough, can be so good as to be nearly indistinguishable from preventing the problem in the first place. Unfortunately ours are not that good, but maybe they could be.

Resource limits

Limiting the consumption of resources (like memory, or number of running TWiki processes) will mitigate the damage very well, and if done with enough precision, can effectively act as a preventive.

If we could limit just TWiki's resource use, that would be very good. Everything else would be unaffected. If something is hammering TWiki, then TWiki gets choked back, which at least means only the problem area will be affected. And maybe the cause will get tired of waiting, and go bother someone else.

Failing that, limiting all of Apache at least limits the damage to the web server, and doesn't take out other daemons.

Resource limits above Apache

These were set in the apache2.service systemd unit file:

MemoryHigh=500M
MemoryMax=700M

These limits apply to Apache itself, as well as all child processes, regardless of user changes. They apply cumulatively across all of those processes. (More precisely, they apply to the kernel cgroup (control group) created for the Apache service.) This at least keeps Apache/TWiki from consuming so many resources that it takes down other things. Since Apache is invoking TWiki, and Apache normally has multiple processes running in the background already, it's also more likely for TWiki to be the thing that gets denied more memory. Unfortunately, it is still possible for TWiki to use up so much memory that Apache itself is affected.
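
If you're reproducing this, the usual way to add such settings is a systemd drop-in rather than editing the packaged unit file. A sketch (the drop-in path is the standard location, not necessarily exactly what petra uses):

# /etc/systemd/system/apache2.service.d/override.conf
[Service]
MemoryHigh=500M
MemoryMax=700M

Followed by systemctl daemon-reload and a restart of the apache2 service.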

These were tried after attempts at setting limits within Apache, and have been much more successful overall, despite being less precise than desired.

Resource limits within Apache

Apache has three RLimit directives for resource limiting. They initially looked promising, but did not play out as well as hoped.

The docs claim these apply just to children spawned to run CGI, not Apache itself. Sounds like what we want, right? But the docs also imply RLimitNPROC may apply to the server itself, so who knows? The docs are also rather vague on how these actually work -- in particular, whether they are cumulative across all processes, or per process. From the Apache docs, these smell like setrlimit(2), so I am guessing per process, but again, who knows? I didn't feel like digging into the sources to be sure.

Each directive takes two arguments: the first is a soft limit, the second a hard limit. The Apache docs don't really explain the difference, but this is one of the things that makes me think setrlimit(2) is in play: with setrlimit(2), a process can raise its own soft limit, but only up to the hard limit.

RLimitCPU is cumulative seconds of CPU time used. Assuming this is, in fact, implemented in terms of setrlimit(2), exceeding the first limit results in SIGXCPU (which can be caught), and the second in SIGKILL (instant death). Since our problem is memory, this doesn't really apply. I set them anyway, to 15 and 20 seconds, on general principles. A process might get stuck in a loop somehow, and killing such a thing off is useful. If a CGI script is chewing CPU for that long, something is broken regardless.

RLimitMEM is bytes of memory. It's not clear whether that means virtual size, resident size, or data segment size (all of which can be limited by setrlimit(2) independently). This is not as useful as one might hope. Setting it to 25 MB meant many pages did not render. 50 MB was enough for most. 75 MB was enough for all but TWikiDocumentation. 150 MB was enough for all. But at 150 MB each, 5 running requests could still exhaust all available RAM. Eventually I gave up on this and went to the cgroup limits via systemd (see above).

RLimitNPROC is allegedly the number of processes. But the Apache docs say it applies to Apache itself, too, if Apache is running as the same user ID (and it is, at least for now -- see Segregating CGI). Setting it to 50 resulted in many "can't fork" errors, and we sure did not have 50 processes running. Again, I think this is really setrlimit with RLIMIT_NPROC, which on Linux means "total number of threads running per user (across all processes/sessions)". We are using mpm_event, which uses threads. So this was no help. Reverted to unset.
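
Concretely, the directives as tried would have looked roughly like this (values reconstructed from the notes above; don't copy them blindly):

RLimitCPU 15 20
RLimitMEM 157286400 157286400    # 150 MB; earlier attempts used 25, 50, and 75 MB
# RLimitNPROC deliberately left unset (see above)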

Worker limits within Apache

Instead of limiting Apache resource consumption generally, can we limit how many CGI processes Apache runs at once? If only a handful of TWiki processes are allowed to run at once, it should reduce the likelihood of them exhausting memory. Unfortunately, I have not found a way to do this directly. But we can tune the number of Apache worker threads. This limits the overall number of requests Apache can be serving at once. That means it will not be as good at things that we can otherwise handle well (like serving up a bunch of small files at once), but maybe it's worth the tradeoff.

I am currently using:

MaxRequestWorkers 32
ThreadsPerChild 8
ListenBackLog 256
MinSpareThreads 4
MaxSpareThreads 32

MaxRequestWorkers limits the number of threads serving requests, simultaneously, across all Apache processes. Additional requests will queue, subject to ListenBackLog. Setting this low is what we're after here; it prevents too many CGI requests from running at once (or any other requests, as noted). ListenBackLog limits the size of that wait queue; past that point, new connections are refused or dropped at the TCP level rather than being answered. This prevents resource exhaustion in Apache due to a flood of requests.

Setting MaxRequestWorkers brought in some other directives.

MaxRequestWorkers has to be an integer multiple of ThreadsPerChild, which is the number of worker threads each Apache child process runs. The quotient of MaxRequestWorkers divided by ThreadsPerChild must also not exceed ServerLimit. ServerLimit is the maximum number of child processes (each running ThreadsPerChild threads) Apache will spawn to serve requests. For us, the quotient is 4, and ServerLimit defaults to 16, so we are good.

The *SpareThreads directives control how many idle threads Apache will try to keep around for new requests. The default maximum is higher, so we reduce it -- we can only serve 32 requests at once, so having more than that spare is pointless. We also reduce the minimum significantly, since we'll be running relatively close to the maximum even when idle.

These values are all mostly educated guesses, and the whole thing may be a bad idea, so don't take this as gospel.

Resource limits below Apache

Can we do anything within Perl and/or TWiki to limit the resources they use? Just throwing ulimit(1) calls in a wrapper script is unlikely to help (those limits are per-process, so it has the same problems as RLimitMEM above), but perhaps there is something more suitable?
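
For completeness, the wrapper idea would look something like this. The paths and values are hypothetical; we don't actually run this:

#!/bin/sh
# Hypothetical wrapper around the TWiki view script.
ulimit -v 153600    # cap virtual memory at 150 MB (ulimit -v takes kB)
ulimit -t 20        # cap CPU time at 20 seconds
exec /var/www/twiki/bin/view "$@"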

Even better would be segregating CGI out of the Apache process tree entirely; see below.

Segregating CGI

Ideally, we run TWiki CGI scripts as a separate user, in a separate kernel cgroup, with limits specific to those things. Done right, this would increase security, too. But how?

  • running multiple httpd instances would do it
    • but we'd need a reverse proxy and rewrites, yuck
  • suEXEC is not suitable
    • our TWiki is carefully locked down with file ownership and permissions
    • suEXEC insists that CGI scripts and directories be owned by the running user, which increases exposure for us
  • CGIWrap has the same problems as suEXEC
  • mod_perl might save some resources
    • but I suspect no amount of saved (or added) resources would be sufficient
    • rude robots will keep using resources until we run out, no matter what
  • ditto for SpeedyCGI / PersistentPerl, plus it requires TWiki code changes
  • FastCGI, see below

FastCGI

  • FastCGI is a protocol for web servers to call out to separately-running external programs
  • fcgiwrap can run an arbitrary CGI script over FastCGI
  • Apache has two options, mod_proxy_fcgi and mod_fcgid
  • mod_proxy_fcgi
    • Newer, nominally preferred Apache module
    • Already in-use on the same server for a PHP app
    • Has a known bug that breaks some scripts, but TWiki might be OK
    • Had lots of trouble getting it hooked into the Apache request chain (see the sketch after this list)
    • Started to see Apache thread/process hangs
    • Gave up and removed it all
    • Still saw a few thread/process hangs afterwards, so maybe that was unrelated, or a bad diagnostic?
  • mod_fcgid
    • This is an evolution of mod_fastcgi, the original FastCGI implementation
    • Older, deprecated, not as well documented
    • Looked into it, but could not see a way to hook it into fcgiwrap
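
To give a sense of what the hookup mentioned above would look like, a minimal mod_proxy_fcgi + fcgiwrap configuration is usually sketched something like the following. The socket path and TWiki bin directory are assumptions, and this is not what petra currently runs:

# Requires mod_proxy and mod_proxy_fcgi, plus fcgiwrap listening on a Unix socket
# (the Debian fcgiwrap package provides one via systemd socket activation).
<Directory "/var/www/twiki/bin">
    SetHandler "proxy:unix:/run/fcgiwrap.socket|fcgi://localhost/"
</Directory>

fcgiwrap then uses SCRIPT_FILENAME from the FastCGI environment to locate and execute the actual CGI program (view, and friends).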