Your corrections of technical and grammatical errors are very welcome. You are encouraged to help me improve this guide. If you have something to contribute, please send it directly to me.
To make the user's Web browsing experience as painless as possible, every effort must be made to wring the last drop of performance from the server. There are many factors which affect Web site usability, but speed is one of the most important. This applies to any webserver, not just Apache, so it is very important that you understand it.
How do we measure the speed of a server? Since the user (and not the computer) is the one that interacts with the Web site, one good speed measurement is the time elapsed between the moment when she clicks on a link or presses a Submit button and the moment when the resulting page is fully rendered.
The requests and replies are broken into packets. A request may be made up of several packets; a reply may consist of many thousands. Each packet has to make its own way from one machine to another, perhaps passing through many interconnection nodes. We must measure the time starting from when the first packet of the request leaves our user's machine to when the last packet of the reply arrives back there.
A webserver is only one of the entities the packets see along their way. If we follow them from browser to server and back again, they may travel by different routes through many different entities. Before they are processed by your server the packets might have to go through proxy (accelerator) servers. If the request contains more than one packet, the packets might arrive at the server by different routes and at different times, so packets that arrive earlier may have to wait for the others before they can be reassembled into a chunk of the request message that is then read by the server. Then the whole process is repeated in reverse.
You could work hard to fine tune your webserver's performance, but a slow Network Interface Card (NIC) or a slow network connection from your server might defeat it all. That's why it's important to think about the Big Picture and to be aware of possible bottlenecks between the server and the Web.
Of course there is little that you can do if the user has a slow connection. You might tune your scripts and webserver to process incoming requests ultra quickly, so you will need only a small number of working servers, but you might find that the server processes are all busy waiting for slow clients to accept their responses.
But there are techniques to cope with this. For example, you can compress the response before delivering it. If you are delivering pure text, gzip compression can reduce the size of the message considerably, in favorable cases by as much as ten times.
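As a rough illustration of the compression step only, here is a minimal sketch using the IO::Compress::Gzip module. The payload and the compress_body() helper are made up for this example, and the sketch does not show how a server decides whether a client accepts compressed content. The ratio you get depends entirely on the content being compressed.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

# Hypothetical payload: repetitive plain text standing in for a text response.
my $body = join '',
    map { "Line $_: the quick brown fox jumps over the lazy dog.\n" } 1 .. 500;

# Hypothetical helper: returns the gzip-compressed form of a string.
sub compress_body {
    my ($text) = @_;
    my $compressed;
    gzip(\$text => \$compressed)
        or die "gzip failed: $GzipError";
    return $compressed;
}

my $gz = compress_body($body);
printf "original: %d bytes, compressed: %d bytes, ratio: %.1f:1\n",
    length($body), length($gz), length($body) / length($gz);
```

Highly repetitive text like this compresses far better than typical pages, so treat the printed ratio as an upper bound rather than an expectation.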
You should analyze all the involved components when you try to create the best service for your users, and not just the web server or the code that the web server executes. A Web service is like a car: if one of the parts or mechanisms is broken the car may not go smoothly, and it can even stop dead if pushed too far without first fixing it.
And let me stress it again: if you want to succeed in the web service business you should start worrying about the client's browsing experience and not only about how good your code benchmarks are.
[ TOC ]
Before we try to solve a problem we need to identify it. In our case we want to get the best performance we can with as little monetary and time investment as possible.
[ TOC ]
Covered in the section ``Choosing an Operating System''.
[ TOC ]
(META: Only partial analysis. Please submit more points. Many points are scattered around the document and should be gathered here, to represent the whole picture. It also should be merged with the above item!)
You need to analyze all of the problem's dimensions. There are several things that need to be considered:
How long does it take to process each request?
How many requests can you process simultaneously?
How many simultaneous requests are you planning to get?
At what rate are you expecting to receive requests?
The first one is probably the easiest to optimize. Following the performance optimization tips in this and other documents allows a perl (mod_perl) programmer to exercise their code and improve it.
The second one is a function of RAM. How much RAM is in each box, how many boxes do you have, and how much RAM does each mod_perl process use? Multiply the first two and divide by the third. Ask yourself whether it is better to switch to another, possibly just as inefficient language or whether that will actually cost more than throwing another powerful machine into the rack.
Also ask yourself whether switching to another language will even help. In some applications, for example to link Oracle runtime libraries, a huge chunk of memory is needed so you would save nothing even if you switched from Perl to C.
The last two are important. You need a realistic estimate. Are you really expecting 8 million hits per day? What is the expected peak load, and what kind of response time do you need to guarantee? Remember that these numbers might change drastically when you apply code changes and your site becomes popular. Remember that when you get a very high hit rate, the resource requirements may grow much faster than linearly!
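The two estimates discussed above (how many processes fit in RAM, and what request rate a daily hit count implies) can be sketched as follows. The sub names and all the figures are hypothetical. The capacity estimate deliberately ignores memory shared between processes and memory needed by the rest of the system, so treat it as a first approximation only.

```perl
use strict;
use warnings;

# Rough number of mod_perl processes that fit in RAM:
# (RAM per box * number of boxes) / RAM per process.
sub max_processes {
    my ($ram_per_box_mb, $boxes, $ram_per_process_mb) = @_;
    return int($ram_per_box_mb * $boxes / $ram_per_process_mb);
}

# Average requests per second implied by a daily hit count.
sub avg_requests_per_sec {
    my ($hits_per_day) = @_;
    return $hits_per_day / (24 * 60 * 60);
}

# Hypothetical figures, for illustration only.
printf "processes: %d\n", max_processes(512, 2, 10);
printf "average rate: %.1f requests/sec\n", avg_requests_per_sec(8_000_000);
```

Note that the second number is only an average over the whole day. The peak rate, which is what the server actually has to survive, will be higher and has to be estimated separately.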
More coverage is provided in the section ``Choosing Hardware''.
[ TOC ]
In order to improve performance we need measurement tools. The main tool categories are benchmarking and code profiling.
It's important to understand that in most of the benchmarking tests that we will execute we will not look at the absolute result numbers but at the relation between two or more result sets, since in most cases we are trying to show which coding approach is preferable. You shouldn't try to compare the absolute results in this Guide with those collected while running the same benchmarks on your machine, since you won't have the exact same hardware and software setup anyway, so this kind of comparison would be misleading. Compare the relative results from the tests running on your machine; don't compare your absolute results with those in this Guide.
[ TOC ]
How much faster is mod_perl than mod_cgi (aka plain perl/CGI)? There are
many ways to benchmark the two. I'll present a few examples and numbers
below. Check out the benchmark directory of the mod_perl distribution for more examples.
If you are going to write your own benchmarking utility, use the
Benchmark module for heavy scripts and the Time::HiRes module for very fast scripts (faster than 1 sec) where you will need better
time precision.
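Here is a minimal sketch of a home-grown comparison using the Benchmark module. The two subs are hypothetical stand-ins for two coding approaches that you want to compare; only the relative figures in the resulting table are meaningful.

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

# Two hypothetical ways of building the same string.
sub with_join   { return join '', map { "item $_\n" } 1 .. 100 }
sub with_concat { my $s = ''; $s .= "item $_\n" for 1 .. 100; return $s }

# A negative count asks Benchmark to run each sub for at least that
# many CPU seconds; the 'none' style suppresses the per-test report.
my $results = timethese(-1, {
    join   => \&with_join,
    concat => \&with_concat,
}, 'none');

# Print a table of rates and the relative difference between the two.
cmpthese($results);
```

Before trusting the numbers, it is worth checking that both candidates really produce the same result, otherwise you are comparing different work.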
There is no need to write a special benchmark though. If you want to
impress your boss or colleagues, just take some heavy CGI script you have
(e.g. a script that crunches some data and prints the results to STDOUT),
open 2 xterms and call the same script in mod_perl mode in one xterm and in
mod_cgi mode in the other. You can use lwp-get
from the LWP package to emulate the browser. The benchmark
directory of the mod_perl distribution includes such an example.
See also two tools for benchmarking: ApacheBench and the crashme test.
[ TOC ]
Perrin Harkins writes on benchmarks or comparisons, official or unofficial:
I have used some of the platforms you mentioned and researched others. What I can tell you for sure, is that no commercially available system offers the depth, power, and ease of use that mod_perl has. Either they don't let you access the web server internals, or they make you use less productive languages than Perl, sometimes forcing you into restrictive and confusing APIs and/or GUI development environments. None of them offers the level of support available from simply posting a message to [the mod-perl] list, at any price.
As for performance, beyond doing several important things (code-caching, pre-forking/threading, and persistent database connections) there isn't much these tools can do, and it's mostly in your hands as the developer to see that the things which really take the time (like database queries) are optimized.
The downside of all this is that most manager types seem to be unable to believe that web development software available for free could be better than the stuff that cost $25,000 per CPU. This appears to be the major reason most of the web tools companies are still in business. They send a bunch of suits to give PowerPoint presentations and hand out glossy literature to your boss, and you end up with an expensive disaster and an approaching deadline.
But I'm not bitter or anything...
Jonathan Peterson adds:
Most of the major solutions have something that they do better than the others, and each of them has faults. Microsoft's ASP has a very nice objects model, and has IMO the best data access object (better than DBI to use - but less portable). It has the worst scripting language. PHP has many of the advantages of Perl-based solutions, and is less complicated for developers. Netscape's Livewire has a good object model too, and provides good server-side Java integration - if you want to leverage Java skills, it's good. Also, it has a compiled scripting language - which is great if you aren't selling your clients the source code (and a pain otherwise).
mod_perl's advantage is that it is the most powerful. It offers the greatest degree of control with one of the more powerful languages. It also offers the greatest granularity. You can use an embedding module (eg eperl) from one place, a session module (Session) from another, and your data access module from yet another.
I think the Apache::ASP module looks very promising. It has very easy to use and adequately powerful state maintenance, a good embedding system, and a sensible object model (that emulates the Microsoft ASP one). It doesn't replicate MS's ADO for data access, but DBI is fine for that.
I have always found that the developers available make the greatest impact on the decision. If you have a team with no Perl experience, and a small or medium task, using something like PHP, or Microsoft ASP makes more sense than driving your staff into the vertical learning curve they'll need to use mod_perl.
For very large jobs, it may be worth finding the best technical solution, and then recruiting the team with the necessary skills.
[ TOC ]
Here are the numbers from Michael Parker's mod_perl presentation at the Perl Conference (Aug, 98). (Sorry, there used to be links here to the source, but they went dead one day, so I removed them). The script is a standard hits counter, but it logs the counts into a MySQL relational database:
Benchmark: timing 100 iterations of cgi, perl... [rate 1:28]
cgi: 56 secs ( 0.33 usr 0.28 sys = 0.61 cpu)
perl: 2 secs ( 0.31 usr 0.27 sys = 0.58 cpu)
Benchmark: timing 1000 iterations of cgi,perl... [rate 1:21]
cgi: 567 secs ( 3.27 usr 2.83 sys = 6.10 cpu)
perl: 26 secs ( 3.11 usr 2.53 sys = 5.64 cpu)
Benchmark: timing 10000 iterations of cgi, perl [rate 1:21]
cgi: 6494 secs (34.87 usr 26.68 sys = 61.55 cpu)
perl: 299 secs (32.51 usr 23.98 sys = 56.49 cpu)
We don't know what server configurations were used for these tests, but I guess the numbers speak for themselves.
The source code of the script was available at http://www.realtime.net/~parkerm/perl/conf98/sld006.htm. It's now a dead link. If you know its new location, please let me know.
[ TOC ]
If you want to get the benchmark results in microseconds you will have to use the Time::HiRes module. Its usage is similar to Benchmark's.
use Time::HiRes qw(gettimeofday tv_interval);
my $start_time = [ gettimeofday ];
sub_that_takes_a_teeny_bit_of_time();
my $end_time = [ gettimeofday ];
my $elapsed = tv_interval($start_time,$end_time);
print "The sub took $elapsed seconds.\n";
See also the crashme test.
[ TOC ]
The Apache::Timeit module does PerlHandler benchmarking. With the help of this module you can log the time taken to process the request, just like you'd use the Benchmark module to benchmark a regular Perl script. Of course you can extend this module to perform more advanced processing, like putting the results into a database for later processing. But all it takes is adding this configuration directive inside httpd.conf:
PerlFixupHandler Apache::Timeit
Since scripts running under Apache::Registry are running inside the PerlHandler these are benchmarked as well.
An example of the lines which show up in the error_log file:
timing request for /perl/setupenvoff.pl:
0 wallclock secs ( 0.04 usr + 0.01 sys = 0.05 CPU)
timing request for /perl/setupenvoff.pl:
0 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU)
The Apache::Timeit package is a part of the Apache-Perl-contrib
files collection available from CPAN.
[ TOC ]
The profiling process helps you to determine which subroutines or just snippets of code take the longest time to execute and which subroutines are called most often. Probably you will want to optimize those.
When do you need to profile your code? You do that when you suspect that some part of your code is called very often and may be there is a need to optimize it to significantly improve the overall performance.
For example, you may have used the diagnostics pragma, which extends the terse diagnostics normally emitted by both the Perl compiler and the Perl interpreter, augmenting them with the more verbose and endearing descriptions found in the perldiag manpage. You may have heard that it can slow your code down tremendously, so let's first verify that this is true.
We will run a benchmark, once with diagnostics enabled and once disabled, on a subroutine called test_code.
The code inside the subroutine does a numeric comparison of two strings and assigns one string to another if the condition tests true. To demonstrate the diagnostics overhead the comparison operator is intentionally wrong: it should be a string comparison, not a numeric one. Note that both strings evaluate to 0 in numeric context, so the wrong comparison tests true (and generates warnings), while the correct string comparison would test false.
use Benchmark;
use diagnostics;
use strict;
my $count = 50000;
disable diagnostics;
my $t1 = timeit($count,\&test_code);
enable diagnostics;
my $t2 = timeit($count,\&test_code);
print "Off: ",timestr($t1),"\n";
print "On : ",timestr($t2),"\n";
sub test_code{
my ($a,$b) = qw(foo bar);
my $c;
if ($a == $b) {
$c = $a;
}
}
For only a few lines of code we get:
Off: 1 wallclock secs ( 0.81 usr + 0.00 sys = 0.81 CPU)
On : 13 wallclock secs (12.54 usr + 0.01 sys = 12.55 CPU)
With diagnostics enabled, the subroutine test_code() is about 15 times slower (12.55 vs. 0.81 CPU seconds) than with diagnostics disabled!
Now let's fix the comparison the way it should be, by replacing ==
with eq, so we get:
my ($a,$b) = qw(foo bar);
my $c;
if ($a eq $b) {
$c = $a;
}
and run the same benchmark again:
Off: 1 wallclock secs ( 0.57 usr + 0.00 sys = 0.57 CPU)
On : 1 wallclock secs ( 0.56 usr + 0.00 sys = 0.56 CPU)
Now there is no overhead at all. The diagnostics pragma slows things down only when warnings are generated.
After we have verified that using the diagnostics pragma might add a big overhead to execution runtime, let's use code profiling to understand why this happens. We are going to use Devel::DProf to profile the code. Let's use this code:
diagnostics.pl
--------------
use diagnostics;
print "Content-type:text/html\n\n";
test_code();
sub test_code{
my ($a,$b) = qw(foo bar);
my $c;
if ($a == $b) {
$c = $a;
}
}
Run it with the profiler enabled, and then create the profiling statistics with the help of dprofpp:
% perl -d:DProf diagnostics.pl
% dprofpp
Total Elapsed Time = 0.342236 Seconds
User+System Time = 0.335420 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
92.1 0.309 0.358 1 0.3089 0.3578 main::BEGIN
14.9 0.050 0.039 3161 0.0000 0.0000 diagnostics::unescape
2.98 0.010 0.010 2 0.0050 0.0050 diagnostics::BEGIN
0.00 0.000 -0.000 2 0.0000 - Exporter::import
0.00 0.000 -0.000 2 0.0000 - Exporter::export
0.00 0.000 -0.000 1 0.0000 - Config::BEGIN
0.00 0.000 -0.000 1 0.0000 - Config::TIEHASH
0.00 0.000 -0.000 2 0.0000 - Config::FETCH
0.00 0.000 -0.000 1 0.0000 - diagnostics::import
0.00 0.000 -0.000 1 0.0000 - main::test_code
0.00 0.000 -0.000 2 0.0000 - diagnostics::warn_trap
0.00 0.000 -0.000 2 0.0000 - diagnostics::splainthis
0.00 0.000 -0.000 2 0.0000 - diagnostics::transmo
0.00 0.000 -0.000 2 0.0000 - diagnostics::shorten
0.00 0.000 -0.000 2 0.0000 - diagnostics::autodescribe
It's not easy to see what is responsible for this enormous overhead, even if main::BEGIN seems to be running most of the time. To get the full picture we must see the subroutine call tree, which shows us who calls whom, so we run:
% dprofpp -T
and the output is:
main::BEGIN
diagnostics::BEGIN
Exporter::import
Exporter::export
diagnostics::BEGIN
Config::BEGIN
Config::TIEHASH
Exporter::import
Exporter::export
Config::FETCH
Config::FETCH
diagnostics::unescape
.....................
3159 times [diagnostics::unescape] snipped
.....................
diagnostics::unescape
diagnostics::import
diagnostics::warn_trap
diagnostics::splainthis
diagnostics::transmo
diagnostics::shorten
diagnostics::autodescribe
main::test_code
diagnostics::warn_trap
diagnostics::splainthis
diagnostics::transmo
diagnostics::shorten
diagnostics::autodescribe
diagnostics::warn_trap
diagnostics::splainthis
diagnostics::transmo
diagnostics::shorten
diagnostics::autodescribe
So we see that two executions of diagnostics::BEGIN and 3161 of
diagnostics::unescape are responsible for most of the running overhead.
META: but we see that it might be run only once in mod_perl, so the numbers are better. Am I right? check it!
If we comment out the diagnostics module, we get:
Total Elapsed Time = 0.079974 Seconds
User+System Time = 0.059974 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
0.00 0.000 -0.000 1 0.0000 - main::test_code
It is possible to profile code running under mod_perl with the
Devel::DProf module, available on CPAN. However, you must have apache version 1.3b3 or
higher and the PerlChildExitHandler enabled during the httpd build process. When the server is started,
Devel::DProf installs an END block to write the tmon.out
file. This block will be called at server shutdown. Here is how to start
and stop a server with the profiler enabled:
% setenv PERL5OPT -d:DProf
% httpd -X -d `pwd` &
... make some requests to the server here ...
% kill `cat logs/httpd.pid`
% unsetenv PERL5OPT
% dprofpp
The Devel::DProf package is a Perl code profiler. It will collect information on the
execution time of a Perl script and of the subs in that script (remember
that print() and map() are just like any other subroutines you write, but they come bundled with
Perl!)
Another approach is to use Apache::DProf, which hooks Devel::DProf into mod_perl. The Apache::DProf module will run a Devel::DProf profiler inside each child server and write the tmon.out file in the directory $ServerRoot/logs/dprof/$$ when the child is shut down (where $$ is the process ID of the child). All it takes is to add to httpd.conf:
PerlModule Apache::DProf
Remember that any PerlHandler that was pulled in before Apache::DProf in the httpd.conf or startup.pl will not have its code debugging information inserted. To run dprofpp, chdir to $ServerRoot/logs/dprof/$$ and run:
% dprofpp
[ TOC ]
With the help of Apache::Status you can find out the size of each and every subroutine.
<Location /perl-status>
SetHandler perl-script
PerlHandler Apache::Status
order deny,allow
#deny from all
#allow from ...
</Location>
PerlSetVar StatusOptionsAll On
PerlSetVar StatusTerse On
PerlSetVar StatusTerseSize On
PerlSetVar StatusTerseSizeMainSummary On
PerlModule B::TerseSize
Now you can start to optimize your code, or test which of several implementations is the smallest.
For example let's compare CGI.pm's OO vs. procedural interfaces:
As you will see below the first OO script uses about 2k bytes while the second script (procedural interface) uses about 5k.
Here are the code examples and the numbers:
cgi_oo.pl
---------
use CGI ();
my $q = CGI->new;
print $q->header;
print $q->b("Hello");
cgi_mtd.pl
---------
use CGI qw(header b);
print header();
print b("Hello");
After executing each script in single server mode (-X) the results are:
Totals: 1966 bytes | 27 OPs
handler 1514 bytes | 27 OPs
exit 116 bytes | 0 OPs

Totals: 4710 bytes | 19 OPs
handler 1117 bytes | 19 OPs
basefont 120 bytes | 0 OPs
frameset 120 bytes | 0 OPs
caption 119 bytes | 0 OPs
applet 118 bytes | 0 OPs
script 118 bytes | 0 OPs
ilayer 118 bytes | 0 OPs
header 118 bytes | 0 OPs
strike 118 bytes | 0 OPs
layer 117 bytes | 0 OPs
table 117 bytes | 0 OPs
frame 117 bytes | 0 OPs
style 117 bytes | 0 OPs
Param 117 bytes | 0 OPs
small 117 bytes | 0 OPs
embed 117 bytes | 0 OPs
font 116 bytes | 0 OPs
span 116 bytes | 0 OPs
exit 116 bytes | 0 OPs
big 115 bytes | 0 OPs
div 115 bytes | 0 OPs
sup 115 bytes | 0 OPs
Sub 115 bytes | 0 OPs
TR 114 bytes | 0 OPs
td 114 bytes | 0 OPs
Tr 114 bytes | 0 OPs
th 114 bytes | 0 OPs
b 113 bytes | 0 OPs
Note that the above is correct only if you didn't precompile all of CGI.pm's methods at server startup. If you did, the procedural interface in the second test will take up to 18k and not 5k as we saw. That's because the whole of CGI.pm's namespace is inherited and it already has all its methods compiled, so it doesn't really matter whether you attempt to import only the symbols that you need. So if you have:
use CGI qw(-compile :all);
in the server startup script, then having:
use CGI qw(header);
or
use CGI qw(:all);
is essentially the same. All the symbols precompiled at startup will be imported even if you ask for only one symbol. It seems to me like a bug, but probably that's how CGI.pm works.
BTW, you can check the number of opcodes in the code by a simple command line run. For example comparing 'my %hash' vs. 'my %hash = ()'.
% perl -MO=Terse -e 'my %hash' | wc -l
-e syntax OK
4
% perl -MO=Terse -e 'my %hash = ()' | wc -l
-e syntax OK
10
The first one has fewer opcodes.
[ TOC ]
In order to get the best performance it helps to get intimately familiar with the Operating System (OS) the web server is running on. There are many OS specific things that you may be able to optimise which will improve your web server's speed, reliability and security.
The following sections will reveal some of the most important details you should know about your OS.
[ TOC ]
The sharing of memory is one very important factor. If your OS supports it (and most sane systems do), you might save memory by sharing it between child processes. This is only possible when you preload code at server startup. However, during a child process' life its memory pages tend to become unshared.
There is no way we can make Perl allocate memory so that (dynamic) variables land on different memory pages from constants, so the copy-on-write effect (we will explain this in a moment) will hit you almost at random.
If you are pre-loading many modules you might be able to trade off the
memory that stays shared against the time for an occasional fork by tuning MaxRequestsPerChild. Each time a child reaches this upper limit and dies it should release its
unshared pages. The new child which replaces it will share its fresh pages
until it scribbles on them.
The ideal is a point where your processes usually restart before too much
memory becomes unshared. You should take some measurements to see if it
makes a real difference, and to find the range of reasonable values. If you
have success with this tuning the value of
MaxRequestsPerChild will probably be peculiar to your situation and may change with changing
circumstances.
It is very important to understand that your goal is not to have MaxRequestsPerChild set to 10000. Having a child serve 300 requests on precompiled code is already a huge overall speedup, so whether it is 100 or 10000 probably does not really matter if you can save RAM by using a lower value.
Do not forget that if you preload most of your code at server startup, the newly forked child gets ready very very fast, because it inherits most of the preloaded code and the perl interpreter from the parent process.
During the life of the child its memory pages (which aren't really its own to start with, it uses the parent's pages) gradually get `dirty' - variables which were originally inherited and shared are updated or modified -- and the copy-on-write happens. This reduces the number of shared memory pages, thus increasing the memory requirement. Killing the child and spawning a new one allows the new child to get back to the pristine shared memory of the parent process.
The recommendation is that MaxRequestsPerChild should not be too large, otherwise you lose some of the benefit of sharing
memory.
See Choosing MaxRequestsPerChild for more about tuning the MaxRequestsPerChild parameter.
[ TOC ]
You've probably noticed that the word shared is repeated many times in relation to mod_perl. Indeed, shared memory might save you a lot of money, since with sharing in place you can run many more servers than without it. See the Formula and the numbers.
How much shared memory do you have? You can see it by either using the
memory utility that comes with your system or you can deploy the
GTop module:
use GTop ();
print "Shared memory of the current process: ",
GTop->new->proc_mem($$)->share,"\n";
print "Total shared memory: ",
GTop->new->mem->share,"\n";
When you watch the output of the top utility, don't confuse the
RES (or RSS) columns with the SHARE column. RES is RESident memory, which is the size of pages currently swapped in.
[ TOC ]
I have shown how to measure the size of the process' shared memory, but we still want to know what the real memory usage is. Obviously this cannot be calculated simply by adding up the memory size of each process because that wouldn't account for the shared memory.
On the other hand we cannot just subtract the shared memory size from the total size to get the real memory usage numbers, because in reality each process has a different history of processed requests, therefore the shared memory is not the same for all processes.
So how do we measure the real memory size used by the server we run? It's probably too difficult to give the exact number, but I've found a way to get a fair approximation, which I verified in the following way. I calculated the real memory used, by the technique you will see in a moment, then stopped the Apache server and saw that the memory usage report indicated that the total used memory went down by almost the same number I had calculated. Note that some OSs do smart memory page caching, so you may not see the memory usage decrease as soon as it actually happens when you quit the application.
This is a technique I've used:
For each process sum up the difference between its total size and its shared memory. To calculate the difference for a single process use:
use GTop;
my $proc_mem = GTop->new->proc_mem($$);
my $diff = $proc_mem->size - $proc_mem->share;
print "Difference is $diff bytes\n";
Now if we add the shared memory size of the process with maximum shared memory, we will get all the memory that actually is being used by all httpd processes, except for the parent process.
Finally, add the size of the parent process.
Please note that this might be incorrect for your system, so you use this number at your own risk.
I've used this technique to display real memory usage in the module Apache::VMonitor, so instead of trying to manually calculate this number you can use this module to do it automatically. In fact in the calculations used in this module there is no separation between the parent and child processes; they are all counted in the same way using the following code:
use GTop ();
my $gtop = GTop->new;
my $total_real = 0;
my $max_shared = 0;
# @mod_perl_pids is initialized by Apache::Scoreboard, irrelevant here
my @mod_perl_pids = some_code();
for my $pid (@mod_perl_pids) {
    my $proc_mem = $gtop->proc_mem($pid);
    my $size = $proc_mem->size($pid);
    my $share = $proc_mem->share($pid);
    # accumulate the unshared part of each process
    $total_real += $size - $share;
    # remember the largest shared size seen so far
    $max_shared = $share if $max_shared < $share;
}
# count the shared memory only once
$total_real += $max_shared;
As you can see, we accumulate the difference between the reported size and the shared memory of each process:
$total_real += $size-$share;
and at the end add the biggest shared size:
$total_real += $max_shared;
So now $total_real contains approximately the memory really used.
[ TOC ]
How do you find out if the code you write is shared between the processes or not? The code should be shared, except where it is on a memory page with variables that change. Some variables are read-only in usage and never change. For example, you may have some variables that use a lot of memory and that you want to remain read-only. As you know, a variable becomes unshared when the process modifies its value.
So imagine that you have this 10Mb in-memory database that resides in a
single variable, you perform various operations on it and want to make sure
that the variable is still shared. For example if you do some matching
regular expression (regex) processing on this variable and want to use the
pos() function, will it make the variable unshared or not?
The Apache::Peek module comes to the rescue. Let's write a module called MyShared.pm which we preload at server startup, so all the variables of this module are initially shared by all children.
MyShared.pm
---------
package MyShared;
use Apache::Peek;
my $readonly = "Chris";
sub match { $readonly =~ /\w/g; }
sub print_pos{ print "pos: ",pos($readonly),"\n";}
sub dump { Dump($readonly); }
1;
This module declares the package MyShared, loads the
Apache::Peek module and defines the lexically scoped $readonly
variable which is supposed to be a variable of large size (think about a
huge hash data structure), but we will use a small one to simplify this
example.
The module also defines three subroutines: match() that does a
simple character matching, print_pos() that prints the current
position of the matching engine inside the string that was last matched and
finally the dump() subroutine that calls the Apache::Peek module's Dump() function to dump a raw Perl data-type of the $readonly
variable.
Now we write the script that prints the process ID (PID) and calls all
three functions. The goal is to check whether pos() makes the
variable dirty and therefore unshared.
share_test.pl
-------------
use MyShared;
print "Content-type: text/plain\r\n\r\n";
print "PID: $$\n";
MyShared::match();
MyShared::print_pos();
MyShared::dump();
Before you restart the server, in httpd.conf set:
MaxClients 2
for easier tracking. You need at least two servers to compare the print outs of the test program. Having more than two can make the comparison process harder.
Now open two browser windows and issue the request for this script several times in both windows, so that you get different process IDs reported in the two windows and each process has processed a different number of requests to the share_test.pl script.
In the first window you will see something like this:
PID: 27040
pos: 1
SV = PVMG(0x853db20) at 0x8250e8c
REFCNT = 3
FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK)
IV = 0
NV = 0
PV = 0x8271af0 "Chris"\0
CUR = 5
LEN = 6
MAGIC = 0x853dd80
MG_VIRTUAL = &vtbl_mglob
MG_TYPE = 'g'
MG_LEN = 1
And in the second window:
PID: 27041
pos: 2
SV = PVMG(0x853db20) at 0x8250e8c
REFCNT = 3
FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK)
IV = 0
NV = 0
PV = 0x8271af0 "Chris"\0
CUR = 5
LEN = 6
MAGIC = 0x853dd80
MG_VIRTUAL = &vtbl_mglob
MG_TYPE = 'g'
MG_LEN = 2
We see that all the addresses of the supposedly big structure are the same (0x8250e8c and 0x8271af0), therefore the variable data structure is almost completely shared. The only difference is in the SV.MAGIC.MG_LEN record, which is not shared.
So given that the $readonly variable is a big one, its value is still shared between the processes, while part of the variable data structure is non-shared. But that part is almost insignificant because it takes very little memory.
Now if you need to compare more than one variable, doing it by hand can be
quite time consuming and error prone. Therefore it's better to modify the
testing script to dump the Perl data-types into files (e.g.
/tmp/dump.$$, where $$ is the PID of the process) and then use the diff(1) utility to
see whether there is some difference.
Modifying the dump() function to write the info to a
file will do the job. Notice that we use Devel::Peek and not
Apache::Peek. The two are almost the same, but Apache::Peek
prints its output directly to the opened socket, so we cannot intercept it and
redirect the result to a file. Since Devel::Peek dumps results to the STDERR stream, we can use the old trick of saving away
the default STDERR filehandle and opening a new filehandle under the name STDERR. In
our example, when Devel::Peek now prints to STDERR it actually prints to our file. When we are done, we
make sure to restore the original STDERR filehandle.
Here is the resulting code:
MyShared2.pm
---------
package MyShared2;
use Devel::Peek;
my $readonly = "Chris";
sub match { $readonly =~ /\w/g; }
sub print_pos{ print "pos: ",pos($readonly),"\n";}
sub dump{
my $dump_file = "/tmp/dump.$$";
print "Dumping the data into $dump_file\n";
open OLDERR, ">&STDERR";
open STDERR, ">".$dump_file or die "Can't open $dump_file: $!";
Dump($readonly);
close STDERR ;
open STDERR, ">&OLDERR";
}
1;
|
Now we modify the script to use the modified module:
share_test2.pl
-------------
use MyShared2;
print "Content-type: text/plain\r\n\r\n";
print "PID: $$\n";
MyShared2::match();
MyShared2::print_pos();
MyShared2::dump();
|
When we run it as before (with MaxClients 2), two dump files are created in the directory /tmp. In our test these were created as
/tmp/dump.1224 and /tmp/dump.1225. When we run diff(1):
% diff /tmp/dump.1224 /tmp/dump.1225
12c12
< MG_LEN = 1
---
> MG_LEN = 2
|
We see that the two padlists (of the variable $readonly) are different, as we have observed before when we did a manual
comparison.
In fact, if we think about these results again, we come to the conclusion
that there is no need for two processes to find out whether the variable
gets modified (and therefore unshared). It's enough to check the
data structure before the script is executed and again afterwards. You can modify
the MyShared2 module to dump the padlists into a different file after each invocation and
then run diff(1) on the two files.
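The single-process check can be sketched as follows. This is only a minimal sketch, assuming a plain Perl script run outside mod_perl: the helper name dump_to_string() and the use of File::Temp are our own illustrative choices and are not part of the MyShared2 module. It reuses the same STDERR redirection trick, dumps the variable before and after the match, and compares the two dumps as strings instead of running diff(1).

```perl
use strict;
use warnings;
use Devel::Peek ();
use File::Temp qw(tempfile);

# Hypothetical helper: return the Devel::Peek dump of the scalar that
# $ref points to, as a string.
sub dump_to_string {
    my ($ref) = @_;
    my ($tmp_fh, $tmp_file) = tempfile(UNLINK => 1);
    close $tmp_fh;

    # Save STDERR, point it at the temporary file, dump, then restore it.
    open my $old_err, '>&', \*STDERR or die "Can't dup STDERR: $!";
    open STDERR, '>', $tmp_file      or die "Can't open $tmp_file: $!";
    Devel::Peek::Dump($$ref);
    close STDERR;
    open STDERR, '>&', $old_err      or die "Can't restore STDERR: $!";

    open my $in, '<', $tmp_file      or die "Can't read $tmp_file: $!";
    local $/;                        # slurp the whole file
    my $text = <$in>;
    close $in;
    return $text;
}

my $readonly = "Chris";

my $before = dump_to_string(\$readonly);
$readonly =~ /\w/g;                  # the operation under suspicion
my $after  = dump_to_string(\$readonly);

print $before eq $after
    ? "no change: the variable was not touched\n"
    : "the data structure was modified\n";
```

Comparing strings only tells you that something changed. To see what changed, write the two dumps to files as MyShared2 does and run diff(1) on them.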
If you want to watch whether some lexically scoped (with my())
variables in your Apache::Registry script inside the same process get changed between invocations, you can use
the
Apache::RegistryLexInfo module instead. It does exactly this: it makes a snapshot of the
padlist before and after the code execution and shows the difference
between the two. This specific module was written to work with Apache::Registry scripts, so it won't work for loaded modules. Use the technique we have
described above for any type of variables in modules and scripts.
Another way of ensuring that a scalar is readonly and therefore
sharable is to use either the constant pragma or the readonly
pragma. But then you won't be able to make calls that alter the variable
even a little, as in the example that we just showed, because it will be
a true constant and you will get a compile-time error if you try
this:
MyConstant.pm
-------------
package MyConstant;
use constant readonly => "Chris";
sub match { readonly =~ /\w/g; }
sub print_pos{ print "pos: ",pos(readonly),"\n";}
1;
|
% perl -c MyConstant.pm
|
Can't modify constant item in match position at MyConstant.pm line 5, near "readonly)"
MyConstant.pm had compilation errors.
|
However, this code is fine:
MyConstant1.pm
-------------
package MyConstant1;
use constant readonly => "Chris";
sub match { readonly =~ /\w/g; }
1;
|
You can use the PerlRequire and PerlModule directives to load commonly used modules such as CGI.pm, DBI, etc., when the server is started. On most systems, server children will
be able to share the code space used by these modules. Just add the
following directives into httpd.conf:
PerlModule CGI
PerlModule DBI
|
But an even better approach is to create a separate startup file (where you code in plain Perl) and put there things like:
use DBI ();
use Carp ();
|
Don't forget to prevent importing of the symbols exported by default by the
module you are going to preload, by placing empty parentheses
() after the module's name, unless you need some of these symbols in the startup file,
which is unlikely. This will save you a little more memory.
Then you require() this startup file in httpd.conf with the
PerlRequire directive, placing it before the rest of the mod_perl configuration
directives:
PerlRequire /path/to/start-up.pl |
CGI.pm is a special case. Ordinarily CGI.pm autoloads most of its functions on an as-needed basis. This speeds up the
loading time by deferring the compilation phase. When you use mod_perl,
FastCGI or another system that uses a persistent Perl interpreter, you will
want to precompile the functions at initialization time. To accomplish
this, call the package function compile() like this:
use CGI ();
CGI->compile(':all');
|
The arguments to compile() are a list of method names or sets, and are identical to those accepted by
the use() and import()
operators. Note that in most cases you will want to replace ':all'
with the tag names that you actually use in your code, since generally you
only use a subset of them.
Let's conduct a memory usage test to show that preloading reduces memory requirements.
In order to have an easy measurement we will use only one child process, so we will use this setting:
MinSpareServers 1
MaxSpareServers 1
StartServers 1
MaxClients 1
MaxRequestsPerChild 100
|
We are going to use the Apache::Registry script memuse.pl, which consists of two parts: the first one loads a bunch of modules
(most of which aren't going to be used); the second part reports the
memory size and the shared memory size used by the single child process
that we start, and of course it prints the difference between the two
sizes.
memuse.pl
---------
use strict;
use CGI ();
use DB_File ();
use LWP::UserAgent ();
use Storable ();
use DBI ();
use GTop ();
|
my $r = shift;
$r->send_http_header('text/plain');
my $proc_mem = GTop->new->proc_mem($$);
my $size = $proc_mem->size;
my $share = $proc_mem->share;
my $diff = $size - $share;
printf "%10s %10s %10s\n", qw(Size Shared Difference);
printf "%10d %10d %10d (bytes)\n",$size,$share,$diff;
|
First we restart the server and execute this CGI script when none of the above modules are preloaded. Here is the result:
Size Shared Diff
4706304 2134016 2572288 (bytes)
|
Now we take all the modules:
use strict;
use CGI ();
use DB_File ();
use LWP::UserAgent ();
use Storable ();
use DBI ();
use GTop ();
|
and copy them into the startup script, so they will get preloaded. The script remains unchanged. We restart the server and execute it again. We get the following:
Size Shared Diff
4710400 3997696 712704 (bytes)
|
Let's put the two results into one table:
Preloading Size Shared Diff
Yes 4710400 3997696 712704 (bytes)
No 4706304 2134016 2572288 (bytes)
--------------------------------------------
Difference 4096 1863680 -1859584
|
You can clearly see that when the modules weren't preloaded, the shared memory was about 1864K (1863680 bytes) smaller than when the modules were preloaded.
Assuming that you have 256M dedicated to the web server, if you didn't preload the modules, you could have:
268435456 = X * 2572288 + 2134016
|
X = (268435456 - 2134016) / 2572288 = 103
|
103 servers.
Now let's calculate the same thing with modules preloaded:
268435456 = X * 712704 + 3997696
|
X = (268435456 - 3997696) / 712704 = 371
|
You can have about 3.6 times as many servers!
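The calculation above can be repeated for your own measurements with a few lines of Perl. This is a minimal sketch: the helper name max_procs() is hypothetical, and the figures are simply the ones reported by memuse.pl in our test.

```perl
use strict;
use warnings;

# Hypothetical helper: how many processes fit into $ram bytes when each one
# needs $diff unshared bytes and all of them share $shared bytes.
sub max_procs {
    my ($ram, $shared, $diff) = @_;
    return int( ($ram - $shared) / $diff );
}

my $ram = 256 * 1024 * 1024;    # 268435456 bytes

my $without = max_procs($ram, 2134016, 2572288);   # modules not preloaded
my $with    = max_procs($ram, 3997696,  712704);   # modules preloaded

printf "not preloaded: %d servers\n", $without;
printf "preloaded:     %d servers\n", $with;
printf "ratio:         %.1f\n",       $with / $without;
```

Rounding down with int() is deliberate: a fraction of a server process is of no use.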
Remember that we mentioned before that memory pages get dirty and the size of the shared memory gets smaller with time? We have presented the ideal case, where the shared memory stays intact. Therefore the real numbers will be a little different, but not far from the numbers in our example.
Also, in your case the process size may well be bigger and the shared memory smaller, since you will use different modules and different code, so you won't get this fantastic ratio, but this example certainly helps to feel the difference.
What happens if you find yourself stuck with Perl CGI scripts and you
cannot or don't want to move most of the stuff into modules to benefit from
module preloading, so that the code will be shared by the children? Luckily you
can preload scripts as well. This time the
Apache::RegistryLoader module comes to your aid.
Apache::RegistryLoader compiles Apache::Registry scripts at server startup.
For example to preload the script /perl/test.pl which is in fact the file /home/httpd/perl/test.pl you would do the following:
use Apache::RegistryLoader ();
Apache::RegistryLoader->new->handler("/perl/test.pl",
"/home/httpd/perl/test.pl");
|
You should put this code either into <Perl> sections or into a startup script.
But what if you have a bunch of scripts located under the same directory
and you don't want to list them one by one? Put Perl
modules to good use: the File::Find
module will do most of the work for you.
The following code walks the directory tree under which all
Apache::Registry scripts are located. For each encountered file with extension .pl, it calls the
Apache::RegistryLoader::handler() method to preload the script in the parent server, before pre-forking the
child processes:
use File::Find qw(finddepth);
use Apache::RegistryLoader ();
{
my $scripts_root_dir = "/home/httpd/perl/";
my $rl = Apache::RegistryLoader->new;
finddepth
(
sub {
return unless /\.pl$/;
my $url = "$File::Find::dir/$_";
$url =~ s|$scripts_root_dir/?|/|;
warn "pre-loading $url\n";
# preload $url
my $status = $rl->handler($url);
unless($status == 200) {
warn "pre-load of `$url' failed, status=$status\n";
}
},
$scripts_root_dir);
}
|
Note that we didn't use the second argument to handler() here, as we did in the first example. To make the loader smarter about the URI to
filename translation, you might need to provide a trans() function to translate the URI to a filename. URI to filename translation
normally doesn't happen until HTTP request time, so the module is forced to
roll its own translation. If the filename is omitted and a
trans() function was not defined, the loader will try using the URI relative to ServerRoot.
A simple trans() function can be something like this:
sub mytrans {
my $uri = shift;
$uri =~ s|^/perl/|/home/httpd/perl/|;
return $uri;
}
|
You can easily derive the right translation by looking at the Alias
directive. The above mytrans() function matches our Alias:
Alias /perl/ /home/httpd/perl/ |
After defining the URI to filename translation function you should pass it
during the creation of the Apache::RegistryLoader object:
my $rl = Apache::RegistryLoader->new(trans => \&mytrans); |
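The mapping between script files and URIs can be checked without a running server. The following is a minimal sketch, not part of Apache::RegistryLoader: the helper name registry_pairs() is hypothetical, and its two arguments are assumed to be the two arguments of your Alias directive. It walks the scripts directory and returns the URI and the filename of every .pl file, which is the pair that the two-argument form of handler() expects.

```perl
use strict;
use warnings;
use File::Find qw(find);

# Hypothetical helper: return [uri, filename] pairs for every *.pl file
# under $root, where $alias is the URI prefix that maps onto $root.
sub registry_pairs {
    my ($alias, $root) = @_;
    $root  =~ s|/+$||;    # work without trailing slashes
    $alias =~ s|/+$||;
    my @pairs;
    find(
        sub {
            return unless -f $_ && /\.pl$/;
            my $file = $File::Find::name;
            (my $uri = $file) =~ s|^\Q$root\E|$alias|;
            push @pairs, [$uri, $file];
        },
        $root,
    );
    return sort { $a->[0] cmp $b->[0] } @pairs;
}

# Under mod_perl each pair would be handed to the loader, e.g.
#   $rl->handler($uri, $file);
# Here we only print the pairs, and only if the directory exists.
my ($alias, $root) = ('/perl/', '/home/httpd/perl/');
if (-d $root) {
    printf "%s => %s\n", @$_ for registry_pairs($alias, $root);
}
```

Passing the filename explicitly spares the loader from guessing, so neither a trans() function nor the ServerRoot fallback is needed in that case.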
I won't show any benchmarks here, since the effect is absolutely the same as with preloading modules.
See also BEGIN blocks
We have just learned that it's important to preload the modules and scripts at the server startup. It turns out that this is not enough for some modules, and you have to prerun their initialization code to get more memory pages shared. Basically you will find information about specific modules in their respective manpages. We will present a few examples of widely used modules where the code can be initialized.
The first example is the DBI module. As you know DBI works with many database drivers falling into the DBD:: category, e.g. DBD::mysql. It's not enough to preload DBI, you should initialize DBI with driver(s) that you are going to use (usually a single
driver is used), if you want to minimize memory use after forking the child
processes. Note that you want to do this under mod_perl and other
environments where the shared memory is very important. Otherwise you
shouldn't initialize drivers.
You probably know already that under mod_perl you should use the
Apache::DBI module to get the connection persistence, unless you open a separate
connection for each user--in this case you should not use this module. Apache::DBI automatically loads DBI and overrides some of its methods, so you should continue coding like there
is only a DBI module.
Just as with modules preloading our goal is to find the startup environment that will lead to the smallest "difference" between the shared and normal memory reported, therefore a smaller total memory usage.
And again, in order to have an easy measurement we will use only one child process, so we will use this setting in httpd.conf:
MinSpareServers 1
MaxSpareServers 1
StartServers 1
MaxClients 1
MaxRequestsPerChild 100
|
We are going to run memory benchmarks on five different versions of the startup.pl file: the four variations listed below, plus one that combines install_driver() with connect_on_init(). We always preload these modules:
use GTop ();
use Apache::DBI (); # preloads DBI as well
|
Leave the file unmodified.
Install MySQL driver (we will use MySQL RDBMS for our test):
DBI->install_driver("mysql");
|
It's safe to use this method, since just like with use(), if it can't be installed it'll die().
Preload MySQL driver module:
use DBD::mysql; |
Tell Apache::DBI to connect to the database when the child process starts (ChildInitHandler); no driver is preloaded before the child gets spawned!
Apache::DBI->connect_on_init('DBI:mysql:test::localhost',
"",
"",
{
PrintError => 1, # warn() on errors
RaiseError => 0, # don't die on error
AutoCommit => 1, # commit executes
# immediately
}
)
or die "Cannot connect to database: $DBI::errstr";
|
Here is the Apache::Registry test script that we have used:
preload_dbi.pl
--------------
use strict;
use GTop ();
use DBI ();
my $dbh = DBI->connect("DBI:mysql:test::localhost",
"",
"",
{
PrintError => 1, # warn() on errors
RaiseError => 0, # don't die on error
AutoCommit => 1, # commit executes
# immediately
}
)
or die "Cannot connect to database: $DBI::errstr";
my $r = shift;
$r->send_http_header('text/plain');
my $do_sql = "show tables";
my $sth = $dbh->prepare($do_sql);
$sth->execute();
my @data = ();
while (my @row = $sth->fetchrow_array){
push @data, @row;
}
print "Data: @data\n";
$dbh->disconnect(); # NOP under Apache::DBI
my $proc_mem = GTop->new->proc_mem($$);
my $size = $proc_mem->size;
my $share = $proc_mem->share;
my $diff = $size - $share;
printf "%8s %8s %8s\n", qw(Size Shared Diff);
printf "%8d %8d %8d (bytes)\n",$size,$share,$diff;
|
The script opens a connection to the database 'test' and issues a query to learn what tables the database has. When the data is
collected and printed, the connection would be closed in the regular case,
but Apache::DBI overrides disconnect() with an empty method. After the data is processed, the already
familiar code that prints the memory usage follows.
The server was restarted before each new test.
So here are the results of the five tests that were conducted, sorted by the Diff column:
After the first request:
Version Size Shared Diff Test type
--------------------------------------------------------------------
1 3465216 2621440 843776 install_driver
2 3461120 2609152 851968 install_driver & connect_on_init
3 3465216 2605056 860160 preload driver
4 3461120 2494464 966656 nothing added
5 3461120 2482176 978944 connect_on_init
|
After the second request (all the subsequent requests showed the same results):
Version Size Shared Diff Test type
--------------------------------------------------------------------
1 3469312 2609152 860160 install_driver
2 3481600 2605056 876544 install_driver & connect_on_init
3 3469312 2588672 880640 preload driver
4 3477504 2482176 995328 nothing added
5 3481600 2469888 1011712 connect_on_init
|
Now what do we conclude from looking at these numbers? First we see that only after the second request do we get the final memory footprint for the specific request in question (if you pass different arguments the memory usage might and will be different).
Both tables show the same pattern of memory usage. We can clearly see
that the real winner is the version of the startup.pl file where the MySQL driver was installed (1). Since we want to
have a connection ready for the first request made to the freshly spawned
child process, we generally use the second version (2), which uses somewhat
more memory but has almost the same number of shared memory pages. The
third version only preloads the driver, which results in less shared
memory. The last two versions have nothing initialized (4) or
only the connect_on_init() method used (5). The former is a
little better than the latter, but both are significantly worse than the
first two versions.
To remind you why we look for the smallest value in the Diff column, recall the real memory usage formula:
RAM_dedicated_to_mod_perl = diff * number_of_processes
+ the_processes_with_largest_shared_memory
|
Notice that the smaller the diff is, the bigger the number of processes you can have using the same amount of RAM. Therefore every 100K difference counts when you multiply it by the number of processes. If we take the numbers from versions (1) and (5) of the second table and assume that we have 256M of memory dedicated to mod_perl processes, we get the following numbers using the formula derived from the one above:
RAM - largest_shared_size
N_of Procs = -------------------------
Diff
|
268435456 - 2609152
(ver 1) N = ------------------- = 309
860160
|
268435456 - 2469888
(ver 5) N = ------------------- = 262
1011712
|
So you can tell the difference (about 18% more child processes in the first version).
CGI.pm is a big module that by default postpones the compilation of its methods
until they are actually needed, thus making it possible to use it under a
slow mod_cgi handler without adding a big overhead. That's not what we want
under mod_perl and if you use
CGI.pm you should precompile the methods that you are going to use at the server
startup in addition to preloading the module. Use the compile method for
that:
use CGI;
CGI->compile(':all');
|
where you should replace the tag group :all with the real tags and group tags that you are going to use if you want to
optimize the memory usage.
We are going to compare the shared memory footprint by using a script
which is backward compatible with mod_cgi. You will see that you can improve
the performance of this kind of script as well, but if you really want
fast code, think about porting it to use
Apache::Request for the CGI interface and some other module for HTML generation.
So here is the Apache::Registry script that we are going to use to make the comparison:
preload_cgi_pm.pl
-----------------
use strict;
use CGI ();
use GTop ();
|
my $q = new CGI;
print $q->header('text/plain');
print join "\n", map {"$_ => ".$q->param($_) } $q->param;
print "\n";
my $proc_mem = GTop->new->proc_mem($$);
my $size = $proc_mem->size;
my $share = $proc_mem->share;
my $diff = $size - $share;
printf "%8s %8s %8s\n", qw(Size Shared Diff);
printf "%8d %8d %8d (bytes)\n",$size,$share,$diff;
|
The script initializes the CGI object, sends the HTTP header and then prints all the arguments and values that
were passed to the script, if any. At the end, as usual, we print the
memory usage.
As usual we are going to use a single child process, so we will use this setting in httpd.conf:
MinSpareServers 1
MaxSpareServers 1
StartServers 1
MaxClients 1
MaxRequestsPerChild 100
|
We are going to run memory benchmarks on three different versions of the startup.pl file. We always preload this module:
use GTop ();
|
Leave the file unmodified.
Preload CGI.pm:
use CGI ();
|
Preload CGI.pm and pre-compile the methods that we are going to use in the script:
use CGI ();
CGI->compile(qw(header param));
|
The server was restarted before each new test.
So here are the results of the three tests that were conducted, sorted by the Diff column:
After the first request:
Version Size Shared Diff Test type
--------------------------------------------------------------------
1 3321856 2146304 1175552 not preloaded
2 3321856 2326528 995328 preloaded
3 3244032 2465792 778240 preloaded & methods+compiled
|
After the second request (all the subsequent requests showed the same results):
Version Size Shared Diff Test type
--------------------------------------------------------------------
1 3325952 2134016 1191936 not preloaded
2 3325952 2314240 1011712 preloaded
3 3248128 2445312 802816 preloaded & methods+compiled
|
The first version shows the results of the script execution when
CGI.pm wasn't preloaded, the second when the module was preloaded, and the third when
it was both preloaded and the methods that are going to be used were
precompiled at the server startup.
Comparing versions 1 and 2 of the second table, we can conclude that
preloading alone adds about 180K of shared size (2314240-2134016 = 180224 bytes). As we mentioned at the
beginning of this section, CGI.pm defers the compilation of its methods to reduce the load overhead, which means that preloading
the module covers only part of what the script ends up using. If we compare the second and the
third versions we see a further significant difference of about 209K
(1011712-802816 = 208896 bytes), and we have used only a few methods (the header
method loads a few more methods transparently for the user). Imagine how much
memory we are going to save if we precompile all the methods
that we use in other scripts that use CGI.pm and do a little more than the script that we used in the test.
But even in our very simple case using the same formula, what do we see? (assuming that we have 256MB dedicated for mod_perl)
RAM - largest_shared_size
N_of Procs = -------------------------
Diff
|
268435456 - 2134016
(ver 1) N = ------------------- = 223
1191936
|
268435456 - 2445312
(ver 3) N = ------------------- = 331
802816
|
If we preload CGI.pm and precompile a few methods that we use in the test script, we can have
almost 50% more child processes than when we don't preload the module and precompile the
methods that we are going to use.
META: I've heard that the 3.x generation will be less bloated, so probably I'll have to rerun this using the new version.
mergemem is an experimental utility for Linux, which looks very
interesting for us mod_perl users: http://www.complang.tuwien.ac.at/ulrich/mergemem/
It looks like it could be run periodically on your server to find and merge duplicate pages. There are caveats: it would halt your httpds during the merge (it appears to be very fast, but still ...).
This software comes with a utility called memcmp to tell you how much you might save.
[ReaderMeta]: If you have tried this utility, please let us know what do you think about it! Thanks
In general you should not fork from your mod_perl scripts, since when you do, you are forking the entire Apache Web server, lock, stock and barrel. Not only is your Perl code and Perl interpreter being duplicated, but so is mod_ssl, mod_rewrite, mod_log, mod_proxy, mod_speling (it's not a typo!) or whatever modules you have used in your server, all the core routines...
A much better approach would be to spawn a sub-process, hand it the
information it needs to do the task, and have it detach (close STDIN,
STDOUT and STDERR + execute setsid()). This is wise only if the parent which spawns this process immediately
continues, and does not wait for the sub-process to complete. This approach
is suitable for a situation when you want to use the Web interface to
trigger a process which takes a long time, such as processing lots of data
or sending email to thousands of registered users (no SPAM please!).
Otherwise, you should convert the code into a module, and call its
functions and methods from a Perl handler or a CGI script.
Just like with fork(), calling system() defeats the whole idea behind mod_perl. The Perl interpreter and modules
would be loaded again for this external program to run if it's a Perl
program. Remember that the backticks (`program`) and qx(program)
variants of system() behave in the same way.
If you really have to use system() then the approach to take
is this:
spawn_process.pl
----------------
use FreezeThaw ();
$params=FreezeThaw::freeze(
[all data to pass to the other process]
);
system("detach_program.pl", $params);
|
Notice that we do a system() call with arguments separated by
commas, rather than passing them all as a single argument. This calling
style keeps the shell from interpreting metacharacters (like
*, ?) if some happen to be in $params. We will talk about this in a moment.
And now the source of detach_program.pl:
detach_program.pl
-----------------
use POSIX qw(setsid);
use FreezeThaw ();
@params=FreezeThaw::thaw(shift @ARGV);
# check that @params is ok
# system() in the caller waits for this program to exit, so fork
# and let the original process finish right away
defined(my $pid = fork) or die "Cannot fork: $!";
exit 0 if $pid;
close STDIN;
close STDOUT;
close STDERR;
# you might need to reopen the STDERR, i.e.
# open STDERR, ">/dev/null";
setsid(); # to detach
# now do something time consuming
|
Because Perl's system() waits for the program it launched to exit, detach_program.pl forks and lets its original process exit at once. At this point, the forked copy is running in the "background"
while the system() call returns and permits Apache to get on with things.
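The detach step can be tried outside of Apache. The following is a minimal sketch, assuming a plain Perl script on a Unix-like system. The pipe that reports the child's process group back to the parent is our own addition, there only to make the effect of setsid() visible, and the standard streams are reopened on /dev/null rather than just closed.

```perl
use strict;
use warnings;
use POSIX qw(setsid);

pipe(my $reader, my $writer) or die "Cannot create pipe: $!";

defined(my $pid = fork) or die "Cannot fork: $!";

if ($pid == 0) {
    # Child: cut the ties to the parent's terminal and streams.
    close $reader;
    open STDIN,  '<', '/dev/null' or die "Cannot reopen STDIN: $!";
    open STDOUT, '>', '/dev/null' or die "Cannot reopen STDOUT: $!";
    open STDERR, '>', '/dev/null' or die "Cannot reopen STDERR: $!";
    setsid() or POSIX::_exit(1);    # become the leader of a new session

    # Report the new process group (illustration only); the
    # time-consuming work would go here.
    print {$writer} getpgrp(), "\n";
    close $writer;
    POSIX::_exit(0);
}

# Parent: does not wait for any real work, it only reads the report.
close $writer;
chomp(my $child_pgrp = <$reader>);
close $reader;
waitpid($pid, 0);    # the sketch's child exits at once, so reap it
print "child $pid now leads process group $child_pgrp\n";
```

After setsid() the child's process group ID equals its own PID, which is what the parent prints. In a real detached program the parent would not call waitpid() for a long-running child.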
This has obvious problems:
@params must not be bigger than whatever limit is imposed by your architecture.
The communication is one way only. Once the program is detached, the process that has spawned it has no control over it.
However, you might be trying to do the "wrong thing". If what you really want is to send information to the browser and then do
some post-processing, look into the PerlCleanupHandler directive. It lets you specify code that the child process runs after the request
has been processed and the user has received the response.
[META: example of PerlCleanupHandler code?]
Here is what actually happens when you fork() and call
system(). Let's take a simple fragment of code:
system("echo","Hi"),CORE::exit(0) unless fork();
|
The above code might be more familiar in this form:
if (fork){
#do nothing
} else {
system("echo","Hi");
CORE::exit(0);
}
|
Notice that we use CORE::exit() and not exit(), which would be automatically overridden by Apache::exit() if used in conjunction with Apache::Registry and similar modules.
In our example script the fork() call creates two execution
paths, one for the parent and the other for the child. The fork call
returns the PID of the spawned child to the parent process and the child
process receives 0. Therefore we can rewrite the example as:
if ($pid = fork){
# I'm parent
print "parent $pid\n";
#do nothing
} else {
# I'm child
print "child $pid\n";
system("echo","Hi");
CORE::exit(0);
}
|
For the sake of completeness, when you write real code you should check
the return value from fork(): if it returns undef, the call was unsuccessful and no process was spawned.
This can happen when the system is running too many processes and
cannot spawn new ones.
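Here is a minimal sketch of the three possible outcomes of fork(), written as a plain Perl script rather than a mod_perl handler. For that reason the child leaves with POSIX::_exit() instead of the CORE::exit() used in the examples above; the exit status of 7 is arbitrary and only there so the parent has something to report.

```perl
use strict;
use warnings;
use POSIX ();

my $exit_code;

my $pid = fork;
die "Cannot fork: $!" unless defined $pid;   # undef: no child was created

if ($pid) {
    # Parent: fork returned the PID of the child.
    waitpid($pid, 0);
    $exit_code = $? >> 8;
    print "parent $$: child $pid exited with status $exit_code\n";
} else {
    # Child: fork returned 0.
    POSIX::_exit(7);   # leave at once, skipping END blocks and buffer flushes
}
```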
The child shares its memory pages with the parent until it has to modify some
of them, which triggers a copy-on-write process that copies these pages to
the child's domain before the child is allowed to modify them. But this all
happens afterwards. At the moment the fork() call is executed,
the only work to be done before the child process goes on its separate way
is setting up the page tables for the virtual memory, which imposes almost
no delay at all.
Unfortunately mod_perl is a big process, with a big memory page table.
When the child process executes system(), lots of things get
loaded, which creates a big overhead and tremendously slows down the request
processing. The whole point of mod_perl is to avoid having to
fork() on every request. Perl can do just about anything by
itself.
Let's complete the explanation of our example: the parent will immediately continue with the code that comes after the fork (none in our case), while the forked (child) process will execute system("echo","Hi") and then quit.
Let's reuse the last example and concentrate on this call:
system("echo","Hi");
|
Perl will use the first argument as a program to execute, find
/bin/echo along the search path, and invoke it directly.
Perl's system() is not the system(3) call [C-library]. How do the arguments to system() get
interpreted? When there is a single argument to system(),
it is first checked for shell metacharacters
(like *, ?). If there are any, the Perl interpreter invokes a real shell program (/bin/sh -c on Unix platforms), which adds another delay. If there are none, the argument is split
into words and passed directly to execvp(), which is more efficient. If you pass a list of
arguments to system(), they are not checked at all but
passed directly to execvp(). That's a very nice optimization. In other words, only if you do something like:
system "sh -c 'echo *'"
|
will the operating system actually exec() a copy of /bin/sh to parse your command. But since one is almost certainly already running
somewhere, the system will notice that (via the disk inode reference) and
replace your virtual memory page table with one pointing to the existing
program code plus your data space.
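The difference is easy to see with arguments that a shell would otherwise rewrite. The following is a minimal sketch: the helper name run_without_shell() is hypothetical, and instead of /bin/echo it runs the current perl binary ($^X) with a one-liner that prints its first argument, so that it does not depend on any external program.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command given as a LIST and return its output.
# The list form of a piped open, like the list form of system(), hands the
# arguments straight to the program without involving a shell.
sub run_without_shell {
    my @cmd = @_;
    open(my $out, '-|', @cmd) or die "Cannot run $cmd[0]: $!";
    local $/;
    my $text = <$out>;
    close $out;
    return $text;
}

# $^X is the path of the running perl; the one-liner echoes its argument.
my @echo = ($^X, '-e', 'print "$ARGV[0]\n"');

print run_without_shell(@echo, '*');      # the "*" arrives untouched

# system() with a list behaves the same way and returns the wait status.
my $status = system(@echo, '$HOME; echo *');
print "exit status: ", $status >> 8, "\n";
```

Neither the * nor the $HOME; echo * string is expanded or executed, because no shell ever sees them. That is also why the list form is the safer choice when the arguments come from user input.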
Now let's talk about zombie processes.
Normally, every process has its parent. Many processes are children of the init process, whose PID is 1. When you fork a process you must wait() or
waitpid() for it to finish. If you don't wait()
for it, it becomes a zombie.
A zombie is a process that has terminated but whose exit status has not yet been collected by its parent. When the child quits, it
reports the termination to its parent. If the parent doesn't wait() to
collect the exit status of the child, the child stays around as a ghost process, which can be seen as a process but not killed.
It will go away only when you stop the parent process that spawned it!
Generally the ps(1) utility displays these processes with the
<defunct> tag, and you will see the zombies counter increment when running
top(1). These zombie processes can take up system resources and
are generally undesirable.
So the proper way to do a fork is:
my $r = shift;
$r->send_http_header('text/plain');
defined (my $kid = fork) or die "Cannot fork: $!";
if ($kid) {
waitpid($kid,0);
print "Parent has finished\n";
} else {
# do something
CORE::exit(0);
}
|
In most cases the only reason you would want to fork is when you need to spawn a process that will take a long time to complete. So if the Apache process that spawns this new child process has to wait for it to finish, you have gained nothing. You can neither wait for its completion (because you don't have the time to), nor continue without waiting, because you will get yet another zombie process. waitpid() here is a blocking call: the process cannot do anything else until the call completes.
The simplest solution is to ignore your dead children. Just add this line
before the fork() call:
$SIG{CHLD} = 'IGNORE';
|
When you set the CHLD (SIGCHLD in C) signal handler to
'IGNORE', the system discards the exit status of terminated children right away, so they are prevented from becoming zombies. This doesn't
work everywhere, however. It proved to work at least on Linux OS.
Note that you cannot localize this setting with local(). If you do, it won't have the desired effect.
[META: Anyone like to explain why it doesn't work?]
The other thing that you must do is to close all the pipes to the
connection socket that were opened by the parent process (i.e. STDIN and STDOUT) and inherited by the child, so the parent will be able to complete the
request and free itself for serving other requests. You may need to close
and reopen the
STDERR filehandle. It's opened to append to the error_log file as inherited from its parent, so chances are that you will want to
leave it untouched.
Of course if your child needs any of the STDIN, STDOUT or
STDERR streams you should reopen them. But you must untie the parent process, so
you should close them first.
So now the code would look like this:
my $r = shift;
$r->send_http_header('text/plain');
$SIG{CHLD} = 'IGNORE';
defined (my $kid = fork) or die "Cannot fork: $!\n";
if ($kid) {
print "Parent has finished\n";
} else {
close STDIN;
close STDOUT;
close STDERR;
# do something time-consuming
CORE::exit(0);
}
|
Note that the waitpid() call is gone. The $SIG{CHLD} = 'IGNORE'; statement protects us from zombies, as explained above.
Another, more portable, but slightly more expensive solution is to use a double fork approach.
my $r = shift;
$r->send_http_header('text/plain');
defined (my $kid = fork) or die "Cannot fork: $!\n";
if ($kid) {
    waitpid($kid,0);
} else {
    defined (my $grandkid = fork) or die "Kid cannot fork: $!\n";
    if ($grandkid) {
        CORE::exit(0);
    } else {
        # code here
        close STDIN;
        close STDOUT;
        close STDERR;
        # do something long lasting
        CORE::exit(0);
    }
}
|
Grandkid becomes a "child of init", i.e. the child of the process whose PID is 1.
Note that the previous two solutions do not allow you to know the exit status of the process, but in our example we didn't care about it.
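The double fork can be wrapped in a small helper. The following is only a sketch: the name spawn_detached() is hypothetical, the streams are reopened on /dev/null instead of being simply closed, and POSIX::_exit() is used in place of CORE::exit(0) so that the forked processes skip END blocks and destructors. Both of those are choices made for this sketch, not something the examples above require.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: run $job in a detached grandchild process.
# The caller gets control back as soon as the intermediate child exits.
sub spawn_detached {
    my ($job) = @_;

    defined(my $kid = fork) or die "Cannot fork: $!\n";
    if ($kid) {
        waitpid($kid, 0);    # the intermediate child exits at once
        return 1;
    }

    # Intermediate child: fork the real worker, then exit immediately,
    # so the grandchild is adopted by init.
    defined(my $grandkid = fork) or POSIX::_exit(1);
    POSIX::_exit(0) if $grandkid;

    # Grandchild: detach from the streams inherited from the parent.
    open STDIN,  '<', '/dev/null' or POSIX::_exit(1);
    open STDOUT, '>', '/dev/null' or POSIX::_exit(1);
    open STDERR, '>', '/dev/null' or POSIX::_exit(1);

    $job->();
    POSIX::_exit(0);
}
```

A call such as spawn_detached(sub { sleep 30 }) returns almost immediately, while the job keeps running in the background.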
Another solution is to use a different SIGCHLD handler:
use POSIX 'WNOHANG';
$SIG{CHLD} = sub { while( waitpid(-1,WNOHANG)>0 ) {} };
|
This is useful when you fork() more than one process. The handler could call wait() as well, but for a variety of reasons involving the handling of stopped processes and the rare event in which two children exit at nearly the same moment, the best technique is to call waitpid() in a tight loop with a first argument of -1 and a second argument of WNOHANG. Together these arguments tell waitpid() to reap the next child that's available, and prevent the call from blocking if there happens to be no child ready for reaping. The handler will loop until waitpid() returns a negative number or zero, indicating that no more reapable children remain.
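To see the reaping loop at work with several children, here is a self-contained sketch. The names reap_children(), run_children() and %exit_code are made up for the illustration, and the children do nothing but exit with a known code. Unlike the IGNORE and double fork approaches, this handler can record each child's exit status.

```perl
use strict;
use warnings;
use POSIX 'WNOHANG';

my %exit_code;    # pid => exit code, filled in by the handler

# SIGCHLD handler: reap every child that has already exited.
sub reap_children {
    local ($!, $?);    # don't clobber the interrupted code's values
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        $exit_code{$pid} = $? >> 8;
    }
}

# Hypothetical demo: fork $count children that exit with codes
# 0 .. $count-1, then idle until the handler has reaped them all.
sub run_children {
    my ($count) = @_;
    %exit_code = ();
    $SIG{CHLD} = \&reap_children;

    for my $code (0 .. $count - 1) {
        defined(my $pid = fork) or die "Cannot fork: $!\n";
        POSIX::_exit($code) unless $pid;    # child: exit at once
    }

    my $deadline = time + 10;               # safety net for the demo
    while (keys(%exit_code) < $count && time < $deadline) {
        select(undef, undef, undef, 0.1);   # the parent could work here
    }
    return {%exit_code};
}
```

Calling run_children(3) returns a hash that maps each child's PID to its exit code.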
While you test and debug code that uses one of the above examples, you might want to write some debug information to the error_log file so you know what happens.
Read the perlipc manpage for more information about signal handlers.
Check also Apache::SubProcess for better system() and exec() implementations
for mod_perl.
META: why is it a better thing to use?
Most mod_perl enabled servers sit behind a proxy front-end server. This keeps the mod_perl server from serving static objects, and it also prevents the heavy but very fast mod_perl processes from waiting idly while generated output trickles out to slow clients.
There are very important OS parameters that you might want to change in order to improve the server performance. This topic is discussed in the section: Setting the Buffering Limits on Various OSes.
[ TOC ]
Correct configuration of the MinSpareServers, MaxSpareServers, StartServers, MaxClients, and MaxRequestsPerChild parameters is very important. There are no values that suit every setup. If they are too low, you will under-use the system's capabilities. If they are too high, the chances are that the server will bring the machine to its knees.
All the above parameters should be specified on the basis of the resources you have. With a plain Apache server, it's no big deal if you run many servers since the processes are about 1MB and don't eat a lot of your RAM. Generally the numbers are even smaller with memory sharing. The situation is different with mod_perl. I have seen mod_perl processes of 20MB and more. Now if you have MaxClients set to 50: 50 x 20MB = 1GB. Do you have 1GB of RAM? Maybe not. So how do you tune the parameters? Generally by trying different combinations and benchmarking the server. Again, mod_perl processes can be much smaller with memory sharing.
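The arithmetic above is easy to script as a first estimate before benchmarking. This is a rough sketch with made-up function names. It ignores memory sharing and everything else running on the machine, so treat the result as a ceiling rather than a recommendation.

```perl
use strict;
use warnings;

# Memory (in MB) needed if every one of $max_clients children
# grows to $proc_mb megabytes and nothing is shared.
sub memory_needed_mb {
    my ($max_clients, $proc_mb) = @_;
    return $max_clients * $proc_mb;
}

# Largest MaxClients that fits into $avail_mb megabytes of RAM
# under the same no-sharing assumption.
sub max_clients_for {
    my ($avail_mb, $proc_mb) = @_;
    die "process size must be positive\n" unless $proc_mb > 0;
    return int($avail_mb / $proc_mb);
}

# The example from the text: 50 children of 20MB each.
printf "%d MB\n", memory_needed_mb(50, 20);
```

With 512MB available for mod_perl and 20MB processes, max_clients_for(512, 20) suggests no more than 25 children.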
Before you start this task you should be armed with the proper weapon. You need a crashme utility, which will load your server with requests for the mod_perl scripts you possess. It must be able to emulate a multiuser environment, that is, the behavior of multiple clients calling the mod_perl scripts on your server simultaneously. While there are commercial solutions, you can get away with free ones which do the same job. You can use the ApacheBench ab utility which comes with the Apache distribution, the crashme script which uses LWP::Parallel::UserAgent, or httperf (see the Download page).
It is important to make sure that you run the load generator (the client which generates the test requests) on a system that is more powerful than the system being tested. After all we are trying to simulate Internet users, where many users are trying to reach your service at once. Since the number of concurrent users can be quite large, your testing machine must be very powerful and capable of generating a heavy load. Of course you should not run the clients and the server on the same machine. If you do, your test results would be invalid. Clients will eat CPU and memory that should be dedicated to the server, and vice versa.
See also two tools for benchmarking: ApacheBench and crashme test
ApacheBench (ab) is a tool for benchmarking your Apache HTTP server. It is designed to give you an idea of the performance that your current Apache installation can give. In particular, it shows you how many requests per second your Apache server is capable of serving. The ab tool comes bundled with the Apache source distribution, and it's free. :)
Let's try it. We will simulate 10 users concurrently requesting a very
light script at www.example.com:81/test/test.pl. Each simulated user makes 10 requests.
% ./ab -n 100 -c 10 http://www.example.com:81/test/test.pl |
The results are:
Document Path: /perl/test.pl
Document Length: 16 bytes
Concurrency Level: 10
Time taken for tests: 1.683 seconds
Complete requests: 100
Failed requests: 0
Total transferred: 16100 bytes
HTML transferred: 1600 bytes
Requests per second: 59.42
Transfer rate: 9.57 kb/s received
Connnection Times (ms)
min avg max
Connect: 0 29 101
Processing: 77 124 1259
Total: 77 153 1360
|
The only numbers we really care about are:
Complete requests: 100
Failed requests: 0
Requests per second: 59.42
|
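When you run many benchmarks it helps to pull these three figures out of the report automatically. The sketch below assumes the report has already been captured as a string in the format shown above. The function name parse_ab_report() and the hash keys are invented for the illustration.

```perl
use strict;
use warnings;

# Pull the three figures of interest out of an ab report.
# Returns a hash reference; a field is undef if its line is missing.
sub parse_ab_report {
    my ($report) = @_;
    my %result;
    ($result{complete}) = $report =~ /^Complete requests:\s+(\d+)/m;
    ($result{failed})   = $report =~ /^Failed requests:\s+(\d+)/m;
    ($result{rps})      = $report =~ /^Requests per second:\s+([\d.]+)/m;
    return \%result;
}
```

The report could be captured with backticks around the ab command line and passed straight to the function.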
Let's raise the request load to 100 x 10 (10 users, each makes 100 requests):
% ./ab -n 1000 -c 10 http://www.example.com:81/perl/access/access.cgi
Concurrency Level: 10
Complete requests: 1000
Failed requests: 0
Requests per second: 139.76
|
As expected, nothing changes -- we have the same 10 concurrent users. Now let's raise the number of concurrent users to 50:
% ./ab -n 1000 -c 50 http://www.example.com:81/perl/access/access.cgi
Complete requests: 1000
Failed requests: 0
Requests per second: 133.01
|
We see that the server is capable of serving 50 concurrent users at 133 requests per second! Let's find the upper limit. Using -n 10000 -c 1000 failed to get results (Broken Pipe?). Using -n 10000 -c 500 resulted in 94.82 requests per second. The server's performance went down under the high load.
The above tests were performed with the following configuration:
MinSpareServers 8
MaxSpareServers 6
StartServers 10
MaxClients 50
MaxRequestsPerChild 1500
|
Now let's kill each child after it serves a single request. We will use the following configuration:
MinSpareServers 8
MaxSpareServers 6
StartServers 10
MaxClients 100
MaxRequestsPerChild 1
|
Simulate 50 users each generating a total of 20 requests:
% ./ab -n 1000 -c 50 http://www.example.com:81/perl/access/access.cgi |
The benchmark timed out with the above configuration. I watched the output of ps as I ran it: the parent process just wasn't capable of respawning the killed children at that rate. When I raised MaxRequestsPerChild to 10, I got 8.34 requests per second. Very bad, about 16 times slower than the 133 requests per second measured earlier! You can't benchmark the importance of MinSpareServers, MaxSpareServers and StartServers with this kind of test.
Now let's reset MaxRequestsPerChild to 1500, but reduce
MaxClients to 10 and run the same test:
MinSpareServers 8
MaxSpareServers 6
StartServers 10
MaxClients 10
MaxRequestsPerChild 1500
|
I got 27.12 requests per second, which is better but still 4-5 times
slower. (I got 133 with MaxClients set to 50.)
Summary: I have tested a few combinations of the server configuration variables (MinSpareServers, MaxSpareServers, StartServers, MaxClients and MaxRequestsPerChild). The results I got are as follows:
MinSpareServers, MaxSpareServers and StartServers are only important for user response times. Sometimes users will have to wait a bit.
The important parameters are MaxClients and MaxRequestsPerChild.
MaxClients should not be too big, so that it does not exhaust your machine's memory resources, and not too small, because then your users will be forced to wait for the children to become free to serve them.
MaxRequestsPerChild should be as large as possible, to get the full benefit of mod_perl, but watch your server at the beginning to make sure your scripts are not leaking memory, thereby causing your server (and your service) to die very fast.
Also it is important to understand that we didn't test the response times in the tests above, but the ability of the server to respond under a heavy load of requests. If the test script were heavier, the numbers would be different but the conclusions very similar.
The benchmarks were run with:
HW: RS6000, 1GB RAM
SW: AIX 4.1.5, mod_perl 1.16, Apache 1.3.3
Machine running only mysql, httpd docs and mod_perl servers.
Machine was _completely_ unloaded during the benchmarking.
|
After each server restart when I changed the server's configuration, I made sure that the scripts were preloaded by fetching a script at least once for every child.
It is important to notice that none of the requests timed out, even if it was kept in the server's queue for more than a minute! That is the way ab works, which is OK for testing purposes but will be unacceptable in the real world - users will not wait for more than five to ten seconds for a request to complete, and the client (i.e. the browser) will time out in a few minutes.
Now let's take a look at some real code whose execution time is more than a few milliseconds. We will do some real testing and collect the data into tables for easier viewing.
I will use the following abbreviations:
NR = Total Number of Requests
NC = Concurrency
MC = MaxClients
MRPC = MaxRequestsPerChild
RPS = Requests per second
|
Running a mod_perl script that makes lots of mysql queries, so that the script under test is limited by mysqld (http://www.example.com:81/perl/access/access.cgi?do_sub=query_form), with the configuration:
MinSpareServers 8
MaxSpareServers 16
StartServers 10
MaxClients 50
MaxRequestsPerChild 5000
|
gives us:
NR NC RPS comment
------------------------------------------------
10 10 3.33 # not a reliable figure
100 10 3.94
1000 10 4.62
1000 50 4.09
|
Conclusions: Here I wanted to show that when the application is slow (not due to perl loading, code compilation and execution, but limited by some external operation) it almost does not matter what load we place on the server. The RPS (Requests per second) is almost the same. Given that all the requests have been served, you have the ability to queue the clients, but be aware that anything that goes into the queue means a waiting client and a client (browser) that might time out!
Now we will benchmark the same script without using mysql, so that the code is limited by perl only (http://www.example.com:81/perl/access/access.cgi). It's the same script, but it just returns the HTML form without making SQL queries.
MinSpareServers 8
MaxSpareServers 16
StartServers 10
MaxClients 50
MaxRequestsPerChild 5000
|
NR NC RPS comment
------------------------------------------------
10 10 26.95 # not a reliable figure
100 10 30.88
1000 10 29.31
1000 50 28.01
1000 100 29.74
10000 200 24.92
100000 400 24.95
|
Conclusions: This time the script we executed was pure perl (not limited by I/O or mysql), so we see that the server serves the requests much faster. You can see that the number of requests per second is almost the same for any load, but drops when the number of concurrent clients goes beyond MaxClients. At 25 RPS, a batch of 400 concurrent clients will be served in 16 seconds (400/25). To be more realistic, assuming a maximum of 100 concurrent clients and 30 requests per second, a client will be served within about 3.3 seconds (100/30). Pretty good for a highly loaded server.
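The estimate used here is simply the number of concurrent clients divided by the sustained request rate. A tiny sketch, with a made-up function name, makes the assumption explicit: every client sends one request and the server keeps up the measured RPS.

```perl
use strict;
use warnings;

# Rough time (seconds) needed to serve a batch of $clients simultaneous
# requests when the server sustains $rps requests per second.
sub batch_time {
    my ($clients, $rps) = @_;
    die "RPS must be positive\n" unless $rps > 0;
    return $clients / $rps;
}

printf "%.1f seconds\n", batch_time(400, 25);    # 400 / 25 = 16.0
```

The last client in the batch waits this long, so the figure is a worst case rather than an average.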
Now we will use the server to its full capacity, by keeping all
MaxClients clients alive all the time and having a big
MaxRequestsPerChild, so that no child will be killed during the benchmarking.
MinSpareServers 50
MaxSpareServers 50
StartServers 50
MaxClients 50
MaxRequestsPerChild 5000
NR NC RPS comment
------------------------------------------------
100 10 32.05
1000 10 33.14
1000 50 33.17
1000 100 31.72
10000 200 31.60
|
Conclusion: In this scenario there is no overhead involving the parent server loading new children: all the servers are available, and the only bottleneck is contention for the CPU.
Now we will change MaxClients and watch the results: Let's reduce
MaxClients to 10.
MinSpareServers 8
MaxSpareServers 10
StartServers 10
MaxClients 10
MaxRequestsPerChild 5000
NR NC RPS comment
------------------------------------------------
10 10 23.87 # not a reliable figure
100 10 32.64
1000 10 32.82
1000 50 30.43
1000 100 25.68
1000 500 26.95
2000 500 32.53
|
Conclusions: Very little difference! Ten servers were able to serve with almost the same throughput as 50 servers. Why? My guess is that the CPU is the limit. It seems that each of the 10 servers was serving requests 5 times faster than each of the 50 servers did, because with 50 servers each child received its CPU time slice five times less frequently. So a big value for MaxClients doesn't mean that the performance will be better. You have just seen the numbers!
Now we will start to drastically reduce MaxRequestsPerChild:
MinSpareServers 8
MaxSpareServers 16
StartServers 10
MaxClients 50
|
NR NC MRPC RPS comment
------------------------------------------------
100 10 10 5.77
100 10 5 3.32
1000 50 20 8.92
1000 50 10 5.47
1000 50 5 2.83
1000 100 10 6.51
|
Conclusions: When we drastically reduce MaxRequestsPerChild, the performance starts to become closer to plain mod_cgi.
Here are the numbers of this run with mod_cgi, for comparison:
MinSpareServers 8
MaxSpareServers 16
StartServers 10
MaxClients 50
NR NC RPS comment
------------------------------------------------
100 10 1.12
1000 50 1.14
1000 100 1.13
|
Conclusion: mod_cgi is much slower. :) In the first test, when NR/NC was 100/10, mod_cgi was capable of 1.12 requests per second. In the same circumstances, mod_perl was capable of 32 requests per second, nearly 30 times faster! At these rates the first mod_cgi test took about 90 seconds to complete (100/1.12), and the second and third tests took almost 900 seconds each (1000/1.14)!
httperf is a utility written by David Mosberger. Just like ApacheBench, it measures the performance of the webserver.
A sample command line is shown below:
httperf --server hostname --port 80 --uri /test.html \
  --rate 150 --num-conn 27000 --num-call 1 --timeout 5
|
This command causes httperf to use the web server on the host with IP name hostname, running at port 80. The web page being retrieved is /test.html and, in this simple test, the same page is retrieved repeatedly. The rate at which requests are issued is 150 per second. The test involves initiating a total of 27,000 TCP connections and on each connection one HTTP call is performed. A call consists of sending a request and receiving a reply.
The timeout option defines the number of seconds that the client is willing to wait to hear back from the server. If this timeout expires, the tool considers the corresponding call to have failed. Note that with a total of 27,000 connections and a rate of 150 per second, the total test duration will be approximately 180 seconds (27,000/150), independently of what load the server can actually sustain. Here is a result that one might get:
Total: connections 27000 requests 26701 replies 26701 test-duration 179.996 s
Connection rate: 150.0 conn/s (6.7 ms/conn, <=47 concurrent connections)
Connection time [ms]: min 1.1 avg 5.0 max 315.0 median 2.5 stddev 13.0
Connection time [ms]: connect 0.3
Request rate: 148.3 req/s (6.7 ms/req)
Request size [B]: 72.0
Reply rate [replies/s]: min 139.8 avg 148.3 max 150.3 stddev 2.7 (36 samples)
Reply time [ms]: response 4.6 transfer 0.0
Reply size [B]: header 222.0 content 1024.0 footer 0.0 (total 1246.0)
Reply status: 1xx=0 2xx=26701 3xx=0 4xx=0 5xx=0
CPU time [s]: user 55.31 system 124.41 (user 30.7% system 69.1% total 99.8%)
Net I/O: 190.9 KB/s (1.6*10^6 bps)
Errors: total 299 client-timo 299 socket-timo 0 connrefused 0 connreset 0
Errors: fd-unavail 0 addrunavail 0 ftab-full 0 other 0
|
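The report can be cross-checked with a few lines of Perl. The sketch below uses invented function names and assumes the report is available as a string in the format shown above. It computes the expected test duration from the command-line options and compares the number of unanswered connections with the error total.

```perl
use strict;
use warnings;

# Expected run time in seconds: connections divided by connection rate.
sub expected_duration {
    my ($num_conn, $rate) = @_;
    return $num_conn / $rate;
}

# Extract the totals from an httperf report.
# Returns undef if the "Total:" line is missing.
sub httperf_totals {
    my ($report) = @_;
    my ($conns, $replies, $duration) = $report =~
        /^Total: connections (\d+) requests \d+ replies (\d+) test-duration ([\d.]+) s/m
        or return;
    my ($errors) = $report =~ /^Errors: total (\d+)/m;
    return {
        connections => $conns,
        replies     => $replies,
        duration    => $duration,
        errors      => $errors,
        unanswered  => $conns - $replies,
    };
}
```

For the sample above, 27000 connections at a rate of 150 give an expected duration of 180 seconds, and the 299 connections that received no reply match the 299 client timeouts reported in the Errors line.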
This is another crashme suite, originally written by Michael Schilli and located at http://www.linux-magazin.de/ausgabe.1998.08/Pounder/pounder.html. I made a few modifications, mostly adding my() operators. I also allowed it to accept more than one URL to test, since sometimes you want to test more than one script.
The tool provides the same results as ab above but it also allows you to set the timeout value, so requests will fail if not served within the timeout period. You also get values for Latency (seconds per request) and Throughput (requests per second). It can do a complete simulation of your favorite Netscape browser :) and give you a better picture.
I have noticed while running these two benchmarking suites, that ab gave me results from two and a half to three times better. Both suites were run on the same machine, with the same load and the same parameters, but the implementations were different.
Sample output:
URL(s): http://www.example.com:81/perl/access/access.cgi
Total Requests: 100
Parallel Agents: 10
Succeeded: 100 (100.00%)
Errors: NONE
Total Time: 9.39 secs
Throughput: 10.65 Requests/sec
Latency: 0.85 secs/Request
|
And the code:
lwp-bench.pl -- The LWP::Parallel::UserAgent benchmark
The MaxClients directive sets the limit on the number of simultaneous requests that can be
supported. No more than this number of child server processes will be
created. To configure more than 256 clients, you must edit the HARD_SERVER_LIMIT entry in httpd.h
and recompile. In our case we want this variable to be as small as
possible, because in this way we can limit the resources used by the server
children. Since we can restrict each child's process size (see
Limiting the size of the p