Thursday, November 29, 2012

The many ways to get a memory leak in PHP

I wrote a crawler whose job is to check a bunch of HTML pages 24/7. The crawler is run by cron. At each run, it checks which pages need to be refreshed, scrapes them, updates the DB, and exits.
The application crawls URLs in parallel using cURL's curl_multi_exec feature, as sketched below. This makes the crawling capacity easy to scale: as the number of URLs to refresh every day grows, I simply increase the number of pages crawled in parallel.
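For context, the parallel fetching loop looks roughly like this. This is a minimal sketch, not the crawler's actual code: the URL list, the batch handling and the error handling are placeholders.

    $urls = array('http://example.com/a', 'http://example.com/b');

    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[] = $ch;
    }

    // Run all transfers in parallel until none is active any more.
    $active = null;
    do {
        curl_multi_exec($mh, $active);
        curl_multi_select($mh); // wait for activity instead of busy-looping
    } while ($active > 0);

    foreach ($handles as $ch) {
        $html = curl_multi_getcontent($ch);
        // ... parse $html, update the DB ...
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);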
But it had a massive memory leak, which prevented it from actually scaling much.

The function memory_get_usage() tells you how much memory a PHP process is using. However, the issue turned out to be quite difficult to solve because the application actually had 3 separate leaks. Hereafter I list the cause of each of them and give the solution along with some references.
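A simple way to confirm a leak is to log the process memory between batches and watch whether it keeps climbing. A hypothetical sketch (crawl() and $batches are placeholders, not the crawler's actual code):

    foreach ($batches as $i => $batch) {
        crawl($batch); // stands in for the actual crawling code
        printf("batch %d: %.2f MB\n", $i, memory_get_usage(true) / 1048576);
    }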

1st leak:
The crawler is a Symfony task. Symfony tasks can use 2 types of configuration objects, sfProjectConfiguration and sfApplicationConfiguration. The latter activates Symfony's debug mode by default, and in debug mode Doctrine keeps a copy of every query sent to the database, as mentioned in this thread. If you do need to use a sfApplicationConfiguration, you can turn off the debug mode by adding sfConfig::set('sf_debug', false) at the start of your script, as mentioned in this other thread.
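Here is a minimal sketch of that fix, assuming a standard Symfony 1.x bootstrap; the application and environment names ('frontend', 'prod') are placeholders.

    require_once dirname(__FILE__) . '/../config/ProjectConfiguration.class.php';

    // The third argument of getApplicationConfiguration() is the debug flag.
    $configuration = ProjectConfiguration::getApplicationConfiguration('frontend', 'prod', false);

    // Belt and braces: force debug mode off so Doctrine does not
    // keep a copy of every executed query.
    sfConfig::set('sf_debug', false);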

2nd leak:
The crawler uses DOM and XPath to extract data from the HTML pages. Calling libxml_use_internal_errors(true) suppresses error output for badly formed HTML, but the errors accumulate in an internal buffer that grows with every page parsed. The solution is to call libxml_clear_errors() after each DOMDocument creation, as mentioned in that thread.
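In code, the pattern looks like this (a sketch; $htmlPages and the XPath extraction are placeholders):

    libxml_use_internal_errors(true);

    foreach ($htmlPages as $html) {
        $doc = new DOMDocument();
        $doc->loadHTML($html);
        libxml_clear_errors(); // flush the error buffer before the next parse
        $xpath = new DOMXPath($doc);
        // ... extract data with $xpath->query(...) ...
    }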

3rd leak:
PHP's garbage collection used to have a bug with circular references that has supposedly been fixed since version 5.3. Objects hydrated by Doctrine often contain circular references and were not properly disposed of by the garbage collector. One way to work around this with PHP < 5.3 is to call $doctrine_object->free() when the objects are not needed any more. But from what I observed, the circular reference bug has not been completely solved, even with PHP 5.4 (5.4.8 at the time of writing). So, do use free().
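With Doctrine 1.x (the version bundled with Symfony 1.x), the pattern looks like this; the Page model is a hypothetical example:

    $pages = Doctrine_Core::getTable('Page')->findAll();
    foreach ($pages as $page) {
        // ... update the record ...
        $page->free(true); // true also frees related objects
    }
    $pages->free(true); // Doctrine_Collection supports free() as well
    unset($pages);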

After fixing these 3 issues, my crawler now uses a constant amount of memory while crawling hundreds of URLs in parallel.