Web Server Overloads
In December 2023, TWiki was upgraded to 6.0.1 and migrated from ServerJustice to ServerPetra. Promptly thereafter, we started seeing kernel OOM (out-of-memory) alerts and resulting oom-killer action on petra.
Symptoms
Kernel OOM
The kernel OOMs typically included lines that looked like this:
view invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
CPU: 0 PID: 226139 Comm: view Not tainted 5.10.0-26-amd64 #1 Debian 5.10.197-1
Tasks state (memory values in pages):
oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/apache2.service,task=view,pid=226125,uid=33
Out of memory: Killed process 226125 (view) total-vm:38780kB, anon-rss:30296kB, file-rss:0kB, shmem-rss:0kB, UID:33 pgtables:116kB oom_score_adj:0
The Perl script that renders pages for TWiki is called view. It gets run any time a TWiki page is requested by a browser.
Eventually, Apache gets killed off as the cause of the problem. I'm not sure if that's the OOM killer being smart, or systemd being smart, or just dumb luck. Regardless, our monitoring sends notifications that the web server is not answering, and the Apache service needs to be restarted.
Worse still, sometimes some other process would actually be the one to trip OOM, after TWiki ate almost all the RAM. So we'd lose the fail2ban daemon or something, and then Apache a little later.
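For reference, recent oom-killer activity can be dug out of the kernel log after the fact. A sketch, assuming the systemd journal and the usual Debian tools (the grep patterns are just guesses at matching the messages above):
journalctl -k --since "24 hours ago" | grep -iE 'oom-killer|Out of memory'
dmesg -T | grep -i 'Killed process'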
With process limits
With per-process memory limits in place, the kernel alerts stopped, and instead view would abort when a render got too big. This at least limited the damage to that TWiki request, but it meant some pages could never render. Examples of the log messages one might see (across several log files) include things like:
Out of memory!
in the global Apache error log.
[cgid:error] End of script output before headers: view, referer: https://wiki.gnhlug.org/TWiki/WebTopicList
in the vhost-specific error log.
| 2023-12-26 - 12:00:38 | guest | view | TWikiDocumentation | Mozilla | x.y.z.w |
in the TWiki access log.
Nothing in the TWiki warn log. I believe the Perl process running TWiki just coughs and dies, without a chance to log anything.
Causes
Obviously, something is using much more memory than before. petra has twice as much memory as justice. The software on petra is much newer, and 64-bit vs. 32-bit, but that's not enough to justify the dramatic increase. Neither server has much load most of the time, and petra has no other significant sites hosted at this time. It seems likely that TWiki 6.x is using significantly more memory due to internal changes, but that much more?
Crawlers and robots
The current theory is that the configuration on petra is allowing more rude robots to make requests. The TWiki configuration was not carried over from justice. In particular, petra is not (yet) running the TWiki BlacklistPlugin. (That plugin has its own issues, so I'm not in a rush to add it.)
Just blocking everything that's not human also blocks crawlers from Google, Bing, etc. We want to be found. We just don't want to be slammed. So casting too broad a net is bad, too.
The recommended /robots.txt was put in place. I don't think this helped much/any, but it can't hurt, and is a good idea for non-rude robots.
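For the record, the rules in question are roughly of this shape (a sketch only, not a copy of our actual file; the script paths are assumptions based on a stock TWiki URL layout). The idea is to keep polite crawlers away from the expensive non-view scripts while leaving normal page views crawlable:
User-agent: *
Disallow: /twiki/bin/edit
Disallow: /twiki/bin/attach
Disallow: /twiki/bin/search
Disallow: /twiki/bin/rdiff
Disallow: /twiki/bin/oops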
In the LocalSite.cfg file, RobotsAreWelcome is set to true. Setting it to false just causes TWiki to emit HTML META tags saying "no robots" for all pages. TWiki always adds nofollow tags to links which should not be crawled (e.g., edit buttons). Checking the rendered HTML, it seems both old and new sites are putting nofollow tags in where they should. So that's not it.
From the logs, it looks like some robots are just crawling the site as fast as they can, ignoring robots.txt and nofollow and whatever. So apparently the problem is rude robots - asking nicely doesn't help.
Adding the worst offenders to the User-Agent blacklist in the Apache config (badbots.inc) seems to be helping a lot.
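The mechanism is nothing fancy. A minimal sketch of the idea (not the actual contents of badbots.inc, and the agent strings are placeholders), using SetEnvIfNoCase plus the Apache 2.4 Require directives:
SetEnvIfNoCase User-Agent "ExampleBadBot" bad_bot
SetEnvIfNoCase User-Agent "AnotherRudeCrawler" bad_bot
<Location "/">
    <RequireAll>
        Require all granted
        Require not env bad_bot
    </RequireAll>
</Location>
Blocked agents get HTTP 403, which is cheap to serve compared to a TWiki render.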
Piggish TWiki pages
Some TWiki pages simply use tons of resources to render. Typically because they have lots of content, or pull in lots of content from other pages. More complicated markup or plugins also use more resources to render. Combined with rude robots, this quickly leads to large memory consumption.
TWikiDocumentation was the worst offender. It seems that page tries to include every TWiki doc page, combining them into one giant page. It appears more than 128 MB was consumed for each page render request. (Run a few of those at once and you can easily blow our 1 GB out of the water.) I neutered the page. Search is a thing.
TWikiHistory is another noteworthy one. It doesn't have as many includes, but it's very long and full of markup. It takes more memory and a lot of CPU time. For the moment I'm leaving it.
Mitigation
There are things we can do to reduce the impact of this significantly. They don't really address the root cause, but they limit the damage. See Prevention for things that actually address the cause.
Resource limits are mostly about mitigation, but are complex enough to warrant their own section.
Memory and swap space
We could add more RAM, of course. It's a cloud VM; all we have to do is increase the performance tier. However, informal analysis of memory use during a lower level of induced flooding suggests that most of the time, we have plenty of memory. The problem is rude robots effectively DoS'ing us. Trying to buy our way out of that is a poor way to spend money, assuming we could even afford it.
Adding a 2 GB swap file to petra gave us some breathing room. This also allows rarely/never-used memory pages to be moved out to disk, saving RAM for things that are actually useful (including caching/buffering).
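The usual Debian recipe for that, roughly (a sketch; the file name and exact steps are assumptions, not a record of what was actually run):
fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab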
Caching
Looking into TWiki:Plugins.CacheAddOn is likely a good idea (FIXME). It should help performance generally -- from what I've seen, it speeds up TWiki.org dramatically. Since robots don't log in, all their page requests should be cached, avoiding the render process entirely. However, this ultimately will just move the problem to CPU or bandwidth, or just allow more robot requests at once before we run out of RAM.
Prevention
Ideally, what we really want is a way to identify rude client behavior, and then block just that. Nothing is nice enough to set the evil bit, alas, but some kind of per-IP rate limiting would be very good, I think. It would prevent rude robots (or even just a really busy NAT exit) from overwhelming the site, without blocking them completely, and without affecting polite sources at all.
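Nothing along these lines is deployed yet, but to make the idea concrete: mod_evasive (packaged on Debian as libapache2-mod-evasive) does per-IP request-rate blocking. A sketch with made-up thresholds (the config file path is the Debian default for that package):
# e.g. in /etc/apache2/mods-available/evasive.conf
DOSPageCount        5
DOSPageInterval     1
DOSSiteCount       50
DOSSiteInterval     1
DOSBlockingPeriod  60
Clients exceeding the thresholds get HTTP 403 for the blocking period instead of tying up a TWiki render.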
Resource limits are mostly about mitigation, but if they are smart enough, they can be nearly indistinguishable from preventing the problem in the first place. Unfortunately ours are not that good, but maybe they could be.
Resource limits
Limiting the consumption of resources (like memory, or number of running TWiki processes) will mitigate the damage very well, and if done with enough precision, can effectively act as a preventive.
If we could limit just TWiki's resource use, that would be very good. Everything else would be unaffected. If something is hammering TWiki, then TWiki gets choked back, which at least means only the problem area will be affected. And maybe the cause will get tired of waiting, and go bother someone else.
Failing that, limiting all of Apache at least limits the damage to the web server, and doesn't take out other daemons.
Resource limits above Apache
These were set in the apache2.service systemd unit file:
MemoryHigh=500M
MemoryMax=700M
These limits apply to Apache itself, as well as all child processes, regardless of user changes. They apply cumulatively across all of those processes. (More precisely, they apply to the kernel cgroup (control group) created for the Apache service.) This at least keeps Apache/TWiki from consuming so many resources that it takes down other things. Since Apache is invoking TWiki, and Apache normally has multiple processes running in the background already, it's also more likely for TWiki to be the thing that gets denied more memory. Unfortunately, it is still possible for TWiki to use up so much memory that Apache may be affected.
These were tried after attempts at setting limits within Apache, and have been much more successful overall, despite being less precise than desired.
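For reference, the cleaner way to carry such settings on Debian is a systemd drop-in rather than editing the packaged unit file (systemctl edit apache2 creates one, typically /etc/systemd/system/apache2.service.d/override.conf); the content is just:
[Service]
MemoryHigh=500M
MemoryMax=700M
followed by systemctl daemon-reload and a restart of apache2.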
Resource limits within Apache
Apache has three RLimit directives for resource limiting. They initially looked promising, but did not play out as well as hoped.
The docs claim these apply just to children spawned to run CGI, not Apache itself. Sounds like what we want, right? But the docs also imply RLimitNPROC may apply to the server itself, so who knows? The docs are also rather vague on how these actually work -- in particular, as to whether these are cumulative across all processes, or per process. From the Apache docs, these smell like setrlimit(2), so I am guessing per process, but again, who knows? I didn't feel like digging into the sources to be sure.
Each parameter takes two arguments, the first a soft limit, the second a hard limit. The Apache docs don't really explain the difference, but this is one of the things that makes me think setrlimit(2) is in play.
RLimitCPU is cumulative seconds of CPU time used. Assuming this is, in fact, implemented in terms of setrlimit(2), exceeding the first would result in SIGXCPU (which can be caught), and the second in SIGKILL (instant death). Since our problem is memory, this doesn't really apply. I set them anyway, to 15 and 20 seconds, on general principles. A process might get stuck in a loop somehow, and killing such a thing off is useful. If a CGI script is running the CPU for that long, something is broken regardless.
RLimitMEM is bytes of memory. It's not clear if they mean virtual size, resident size, or data segment size (all of which can be limited by setrlimit(2) independently). This is not as useful as one might hope. Setting it to 25 MB meant many pages did not render. 50 MB was enough for most. 75 MB was enough for all but TWikiDocumentation. 150 MB was enough for all. But at 150 MB each, 5 running requests could still exhaust all available RAM. Eventually I gave up on this and went to the cgroup limits via systemd (see above).
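For the record, the directives themselves are trivial; roughly what was being tried looked like this (placement in the TWiki vhost config is an assumption, and RLimitMEM takes bytes, so 150 MB is 157286400):
RLimitCPU 15 20
RLimitMEM 157286400 157286400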
RLimitNPROC is allegedly the number of processes. But the Apache docs say it applies to Apache itself, too, if Apache is running as the same user ID (and it is, at least for now -- see Segregating CGI). Setting it to 50 resulted in many "can't fork" errors, and we sure did not have 50 processes running. Again, I think this is really setrlimit with RLIMIT_NPROC, which on Linux means "total number of threads running per user (across all processes/sessions)". We are using mpm_event, which uses threads. So this was no help. Reverted to unset.
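A quick sanity check of that theory is to count how many threads the Apache user already has; UID 33 in the OOM report above is www-data on Debian. A rough sketch:
ps -eLf | grep -c '^www-data'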
Worker limits within Apache
Instead of limiting Apache resource consumption generally, can we limit how many CGI processes Apache runs at once? If only a handful of TWiki processes are allowed to run at once, it should reduce the likelihood of them exhausting memory. Unfortunately, I have not found a way to do this directly. But we can tune the number of Apache worker threads. This limits the overall number of requests Apache can be serving at once. That means it will not be as good at things that we can otherwise handle well (like serving up a bunch of small files at once), but maybe it's worth the tradeoff.
I am currently using:
MaxRequestWorkers 32
ThreadsPerChild 8
ListenBacklog 256
MinSpareThreads 4
MaxSpareThreads 32
MaxRequestWorkers limits the number of threads serving requests simultaneously, across all Apache processes. Additional requests will queue and be subject to ListenBacklog. Setting this low is what we're after here; this prevents too many CGI requests at once (or any other requests, as noted).
ListenBacklog limits the size of the wait queue. Past that point, new connections are simply not accepted at the TCP level (clients see timeouts or connection failures rather than an HTTP error). This prevents resource exhaustion in Apache due to a flood of requests.
Setting MaxRequestWorkers brought in some other directives. MaxRequestWorkers has to be an integer multiple of ThreadsPerChild, which is the number of worker threads each Apache child process runs. The quotient of MaxRequestWorkers divided by ThreadsPerChild must also be no more than ServerLimit. ServerLimit is the maximum number of child processes (each with ThreadsPerChild threads) Apache will spawn to serve additional requests. For us, the quotient is 4, and ServerLimit defaults to 16, so we are good.
The *SpareThreads directives are how many idle threads Apache will try to keep around for new requests. The default max is higher, so we reduce it -- we can only serve 32 requests at once, so having more than that spare is pointless. We also reduce the minimum significantly, since we'll be running relatively close to the maximum even when idle.
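On Debian these settings normally live in /etc/apache2/mods-available/mpm_event.conf; assuming that layout, the relevant stanza looks something like:
<IfModule mpm_event_module>
    MinSpareThreads       4
    MaxSpareThreads      32
    ThreadsPerChild       8
    MaxRequestWorkers    32
    ListenBacklog       256
    # ServerLimit defaults to 16; 32/8 = 4 child processes, so no change needed
</IfModule>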
These values are all mostly educated guesses, and the whole thing may be a bad idea, so don't take this as gospel.
Resource limits below Apache
Can we do anything within Perl and/or TWiki to limit the resources they use? Just throwing ulimit(1) calls in a wrapper script is unlikely to help, but perhaps there is something more suitable?
Even better would be segregating CGI out of the Apache process tree entirely; see below.
Segregating CGI
Ideally, we run TWiki CGI scripts as a separate user, in a separate kernel cgroup, with limits specific to those things. Done right, this would increase security, too. But how?
- running multiple httpd instances would do it
  - but we'd need a reverse proxy and rewrites, yuck
- suEXEC is not suitable
  - our TWiki is carefully locked down with file ownership and permissions
  - suEXEC insists that CGI scripts and directories be owned by the running user, which increases exposure for us
- CGIWrap has the same problems as suEXEC
- mod_perl might save some resources
  - but I suspect no amount of added resources is sufficient
  - rude robots will keep using resources until we run out, no matter what
- ditto for SpeedyCGI / PersistentPerl, plus it requires TWiki code changes
- FastCGI, see below
- FastCGI is a protocol for web servers to call out to separately-running external programs
- fcgiwrap can run an arbitrary CGI script for FastCGI
- Apache has two options, mod_proxy_fcgi and mod_fcgid
- mod_proxy_fcgi
  - Newer, nominally preferred Apache module
  - Already in use on the same server for a PHP app
  - Has a known bug that breaks some scripts, but TWiki might be OK
  - Had lots of trouble getting it hooked into the Apache request chain
  - Started to see Apache thread/process hangs
  - Gave up and removed it all
  - Still saw a few thread/process hangs afterwards, so maybe that was unrelated, or a bad diagnostic?
- mod_fcgid
  - This is an evolution of mod_fastcgi, the original FastCGI implementation
  - Older, deprecated, not as well documented
  - Looked into it, but could not see a way to hook it into fcgiwrap
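For reference, the kind of wiring attempted with mod_proxy_fcgi plus fcgiwrap looked roughly like this (a sketch only; the socket path is the Debian fcgiwrap default, the script directory is an assumption, and as noted above this did not work out for us):
<Directory "/var/www/twiki/bin">
    SetHandler "proxy:unix:/run/fcgiwrap.socket|fcgi://localhost/"
</Directory>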