PHP strpos() vs. preg_match()

How expensive is the PHP strpos() implementation? When should I start using regular expressions for text matching? Is PHP faster than an implementation in C? The PHP manpage states that strpos() is always to be preferred over regular expressions. The following describes some simple experiments and results that shed some light on these questions.

Test Setup

The following measurements were taken using a clean installation of the 'php5-cgi' and 'libapache2-mod-php5' packages on Debian Lenny (both at package version 5.2.6-3). The Apache installation (apache2 package 2.2.9-7) was unchanged apart from the enabled PHP5 module. For some additional testing the Zend Platform was installed to see how it affects the basic string searching methods.

Please do not worry about the absolute measured times; only their relative factors are relevant.

Performance of strpos()

One question one might have is whether the native PHP5 strpos() and stripos() are more expensive than strstr() in C, and whether a PHP extension might be cheaper. Given how widely PHP is used, the expectation is that there is almost no overhead.

To get some meaningful values, let's search for a short string of 10 bytes in a larger one of 1 kB which doesn't contain the search string.

Here is some C code to test it:

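#include <sys/time.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int main(void) {
	int i;
	long duration;
	char *s;
	/* Needle that does not occur in the haystack (same as in the PHP test). */
	const char *search = "abcdef";
	struct timeval start, end;

	/* Build a 1 kB haystack of repeating digits plus a terminating NUL. */
	s = malloc(sizeof(char) * 1025);
	if (s == NULL)
		return 1;
	for (i = 0; i < 1024; i++)
		s[i] = '0' + (i % 10);
	s[1024] = 0;

	gettimeofday(&start, NULL);

	for (i = 0; i < 10000; i++)
		strstr(s, search);

	gettimeofday(&end, NULL);

	/* Elapsed time in microseconds. */
	duration = (end.tv_sec - start.tv_sec) * 1000000L;
	duration += (end.tv_usec - start.tv_usec);
	printf("Duration: %ld µs\n", duration);

	free(s);
	return 0;
}

Now the same in PHP for strpos():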
<?php
// Build a 1 kB haystack of repeating digits.
$s = "";
for ($i = 0; $i < 1024; $i++)
	$s = $s . ($i % 10);

// microtime(true) returns the current time as a float in seconds.
$start = microtime(true);

for ($i = 0; $i < 10000; $i++)
	strpos($s, "abcdef");

$end = microtime(true);

echo "Duration: " . ($end - $start) . " s\n";
?>

And here are the average results for 10000 searches in the 1 kB string on the test setup:

Setup            Function     Time
C                strstr()       0.08 ms
php5-cgi         strpos()      27.7 ms
mod_php5         strpos()      30.6 ms
mod_php5 + ZP    strpos()      37.2 ms
php5-cgi         stripos()    163.6 ms
mod_php5         stripos()    172.8 ms
mod_php5 + ZP    stripos()    177.3 ms
php5-cgi         strstr()      25.3 ms
mod_php5         strstr()      27.5 ms
mod_php5 + ZP    strstr()      42.7 ms
php5-cgi         stristr()    156.7 ms
mod_php5         stristr()    164.3 ms
mod_php5 + ZP    stristr()    174.0 ms

It seems that C is a lot faster, and that standalone PHP is somewhat faster than PHP inside Apache, especially with the overhead of the Zend Platform (ZP). The difference between the search variants is as expected: the case-insensitive search is more expensive (roughly 4 to 6 times in these measurements). The strstr()/stristr() functions, which the PHP manpage discourages because they are memory-intensive, are about as fast as strpos()/stripos().
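The relative factors can be recomputed directly from the table. The following is a minimal sketch in C that does this for the strpos()/stripos() rows. The `row` struct and the `slowdown()` and `print_slowdowns()` helpers are names made up for this illustration; the numbers are the measured values listed above.

```c
#include <stdio.h>

/* One setup from the results table: name plus measured times in ms. */
struct row {
    const char *setup;
    double strpos_ms;
    double stripos_ms;
};

/* Values copied from the strpos()/stripos() rows of the table. */
static const struct row rows[] = {
    { "php5-cgi",      27.7, 163.6 },
    { "mod_php5",      30.6, 172.8 },
    { "mod_php5 + ZP", 37.2, 177.3 },
};

/* How many times slower the case-insensitive search was for one setup. */
double slowdown(const struct row *r) {
    return r->stripos_ms / r->strpos_ms;
}

/* Print the slowdown factor for every setup in the table. */
void print_slowdowns(void) {
    size_t i;
    for (i = 0; i < sizeof rows / sizeof rows[0]; i++)
        printf("%-14s stripos()/strpos() = %.1f\n",
               rows[i].setup, slowdown(&rows[i]));
}
```

Calling print_slowdowns() from a main() function gives factors of about 5.9, 5.6 and 4.8 for the three PHP setups.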

Cost of Regular Expressions

Now let's compare the strpos() function with preg_match(). All tests were performed using 'php5-cgi'. The following table lists the results of several test runs with different numbers of repetitions (n).

n               1         2         3         10        100       1000      10000
strpos()        0.01 ms   0.02 ms   0.04 ms   0.2 ms    0.9 ms    2.6 ms    25.6 ms
preg_match()    0.2 ms    0.2 ms    0.3 ms    0.47 ms   0.95 ms   7.4 ms    72.2 ms
Ratio           1/20      1/10      1/7       1/2       1/1       1/3       1/3

From the numbers one can see that there is a significant overhead for the first preg_match() call. This is the regular expression compilation, which is only necessary on the first run. Thus the execution time ratio is 1/20 at first but settles at about 1/3 for larger numbers of executions.

Nonetheless, in all of these measurements preg_match() was more expensive than strpos().
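The one-time compilation cost described above can be made explicit in C, where compiling and matching are separate calls. This is only a rough sketch under stated assumptions: it uses the POSIX regex API (regcomp()/regexec()), not the PCRE library behind preg_match(), so its timings are not comparable to the table. The helper names `elapsed_us()` and `time_regex()` are made up for this illustration.

```c
#include <stddef.h>
#include <regex.h>
#include <sys/time.h>

/* Microseconds elapsed between two gettimeofday() samples. */
static long elapsed_us(const struct timeval *start, const struct timeval *end) {
    return (end->tv_sec - start->tv_sec) * 1000000L
         + (end->tv_usec - start->tv_usec);
}

/*
 * Compile `pattern` once, then run it `n` times against `haystack`.
 * Stores the compile time and the total match time (both in microseconds)
 * and returns the number of successful matches, or -1 if the pattern
 * does not compile.
 */
int time_regex(const char *haystack, const char *pattern, int n,
               long *compile_us, long *match_us) {
    regex_t re;
    struct timeval t0, t1, t2;
    int i, hits = 0;

    /* One-time cost: turn the pattern into a compiled matcher. */
    gettimeofday(&t0, NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    gettimeofday(&t1, NULL);

    /* Per-call cost: reuse the compiled matcher for every search. */
    for (i = 0; i < n; i++)
        if (regexec(&re, haystack, 0, NULL, 0) == 0)
            hits++;
    gettimeofday(&t2, NULL);

    regfree(&re);
    *compile_us = elapsed_us(&t0, &t1);
    *match_us = elapsed_us(&t1, &t2);
    return hits;
}
```

Calling time_regex() with the 1 kB digit string from the first test, the pattern "abcdef" and growing values of n shows how a fixed compile time is spread over more and more searches, which is the effect visible in the Ratio row.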

When To Still Prefer preg_match()

To be continued...