PHP strpos() vs. preg_match()

How expensive is the PHP strpos() implementation? When should I start using regular expressions for text matching? Is PHP faster than an implementation in C? The PHP manpage states that strpos() is always to be preferred over regular expressions. The following describes some simple experiments and results that shed some light on these questions.

Test Setup

The following measurements were taken using a clean installation of the 'php5-cgi' and a 'libapache2-mod-php5' package in Debian Lenny (both with package version 5.2.6-3 on top of an, besides for the enabled PHP5 module, unchanged installation of apache2 package 2.2.9-7). For some additional testing the Zend platform was installed to see how it affects the basic string searching methods.

Please do not worry about the absolute times measurement values, only their relative factors are relevant.

Performance of strpos()

One question one might have is wether the native PHP5 strpos() and stripos() is more expensive than a strstr() in C and if a PHP extension might be cheaper. Of course according to the wide use of PHP the expectation is that there is almost no overhead.

To determine some meaningful values lets search a short string of 10 bytes in a larger one of 1kB which doesn't contain the search string.

Here is some C code to test it:

#include <sys/time.h>
#include <string.h>
#include <stdio.h>

void main(void) {
int i, duration;
char *s;
struct timeval start, end;

s = malloc(sizeof(char)*1025);
for(i = 0; i < 1024; i++)
	s[i] = '0' + (i % 10);
s[1025] = 0;

gettimeofday(&start, NULL);

for(i = 0; i < 10000; i++)
	strstr(s, search);

gettimeofday(&end, NULL);

duration = (end.tv_sec - start.tv_sec) * 1000000;
duration += (end.tv_usec - start.tv_usec);
printf("Duration: %d ┬Ás\n", duration);
}
Now the same in PHP for strpos():
<?php
$s = "";
for($i = 0; $i < 1024; $i++) 
	$s = $s . ($i % 10);

$start = microtime();

for($i = 0; $i < 10000; $i++)
	strpos($s, "abcdef");

$end = microtime();

echo "Duration: " . ($end - $start) . " s\n";
?>
And here are the average results for 10000 1k string searches on the test setup:

C strstr() 0.08 ms
php5-cgi strpos() 27.7 ms
mod_php5 strpos() 30.6 ms
mod_php5 + ZP strpos() 37.2 ms
php5-cgi stripos() 163.6 ms
mod_php5 stripos() 172.8 ms
mod_php5 + ZP stripos() 177.3 ms
php5-cgi strstr() 25.3 ms
mod_php5 strstr() 27.5 ms
mod_php5 + ZP strstr() 42.7 ms
php5-cgi stristr() 156.7 ms
mod_php5 stristr() 164.3 ms
mod_php5 + ZP stristr() 174.0 ms

It seems like C is a lot faster and standalone PHP is somewhat better than a Apache with the overhead of the Zend platform (ZP). The difference between the different search variants is as expected, the case-insensitive search being more expensive (between 3 and 5 times). The strstr()/stristr() methods which are discouraged by the PHP manpage because they are memory-intensive are about as fast as the strpos()/stripos() methods.

Cost of Regular Expressions

Now let's compare the strpos() function with preg_match(). All tests will be performed using 'php5-cgi'. The following table lists results for several test runs with different numbers of repetitions.

n=1 2 3 10 100 1000 10000
strpos() 0.01 ms 0.02 ms 0.04 ms 0.2 ms 0.9 ms 2.6 ms 25.6 ms
preg_match() 0.2 ms 0.2 ms 0.3 ms 0.47 ms 0.95 ms 7.4 ms 72.2 ms
Ratio 1/20 1/10 1/7 1/2 1/1 1/3 1/3

From the numbers one can see there is a significant overhead for the first preg_match() call. This is for the regular expression compilation which is only necessary on the first run. Thus the execution time ratio is 1/20 at first but goes to 1/3 with larger numbers of executions.

Nonetheless preg_match() will always be more expensive than strpos().

When To Still Prefer preg_match()

How can preg_match() be useful with static matches when it will never be faster than strpos()? Well, imagine a use case where you have to match a set of static strings. Let's take the following example:

<?php
$s = "Why is string matching so expensive?";
if(strpos($s, "String matching is expensive")) {
   # ...
} else if(strpos($s, "Cars are quite expensive")) {
   # ...
} else if(strpos($s, "It's raining")) {
   # ...
} else if(strpos($s, "These rainy days make you sad.")) {
   # ...
} else if(strpos($s, "Travelling by train is fun.")) {
   # ...
}
?>

The cost of the above is determined by calling strpos() 5 times using the same input string with different comparison strings of a similar length. Running the code in the test setup 10000 times results in an average runtime of ~58 ms.

Now imagine the list of strpos() executions is a bit longer, maybe 100 comparisons serving as a comment spam filtering black list. Each new check sentence would add another 10ms (per 10000 executions). The question is how to avoid having linear increasing runtime costs for adding new checks. One possible answer is by clustering matches. For example: By checking wether a check string contains the word "rain" one can prevent executing the last three checks for "It's raining", "These rainy days make you sad." and "Travelling by train is fun.". Only if the word "rain" matches we have to run the specific checks.

<?php
$s = "Why is string matching so expensive?";
if(strpos($s, "String matching is expensive")) {
   # ...
} else if(strpos($s, "Cars are quite expensive")) {
   # ...
} else if(strpos($s, "rain")) {
   if(strpos($s, "It's raining")) {
      # ...
   } else if(strpos($s, "These rainy days make you sad.")) {
      # ...
   } else if(strpos($s, "Travelling by train is fun.")) {
      # ...
   }
}
?>

The extra checks naturally adds another strpos() execution. And if it is a common word that appears in most of the check string like "is" it would not be very useful. So the idea is to have a very long string ("rain" is pretty bad in that sense) that is very uncommon. Also if the number of "subchecks" should be larger than 2, otherwise there would be no gain when a subcheck is triggered (there would be still 2 strpos() calls).

So when you have a good clustering with rare match strings and the clustered subcheck group is quite large the performance gain might be significant. Imaging 100 test strings clustered into 3 main groups, each of those clustered into 4 other groups each containing only 5 to 10 checks. In such a case you would have a linear matching effort for non-matching string of 3 strpos() runs and a maximum matching effort for matching strings of 17 strpos() matches (3+4 for the clustering group matching, 10 at most for the subcheck matching).

Now the problem remains how to organise matches like this and if it is possible in all cases. With luck one might have similar sentences with really equal substrings, but of the match would need to be case insensitive and sometimes whitespaces or different word order might prevent grouping similar checks. In such case preg_match might help using a bit more complex expression bridging the gaps between word order and white spaces. When preg_match() is used on the upper check levels only the cost impact should stay moderate.

Finally the question remains wether this is a good idea at all? Are there cheaper solutions? Feel free to comment!