Parsing and processing a web page in PHP: choosing the best library

The task of parsing a third-party site and processing the information you need from it comes up for web developers quite often and for a variety of reasons: you can populate your project with content, load some information dynamically, and so on.
In such cases the programmer faces the question of which of the dozens of libraries to choose. In this article we have compiled the most popular options and picked the best of them.

Regular expressions

Even though regular expressions are the first thing that comes to mind, they are not worth using in real projects. Yes, regular expressions are the best tool for simple tasks, but they become much harder to use when you need to parse a large and complex piece of HTML, which moreover does not always follow any particular pattern and may contain syntax errors. Instead of reworking your regular expression after every slightest change in the code, we recommend using the tools below. It is easier, more convenient and more reliable.
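A minimal sketch of how fragile a regular expression can be. The sample markup and the helper name extractLinksWithRegex are invented for illustration and are not taken from any real project:

```php
<?php
// Hypothetical helper: pulls href values out of <a> tags with a regex.
function extractLinksWithRegex(string $html): array
{
    // Matches only <a href="..."> where href is the first attribute
    // and the value is wrapped in double quotes.
    preg_match_all('/<a href="([^"]*)">/i', $html, $matches);

    return $matches[1];
}

$original = '<a href="/news">News</a> <a href="/about">About</a>';

// The same two links after a trivial markup change:
// an extra class attribute and single quotes around a value.
$changed = '<a class="nav" href="/news">News</a> <a href=\'/about\'>About</a>';

var_dump(extractLinksWithRegex($original)); // both links are found
var_dump(extractLinksWithRegex($changed));  // nothing is found
```

The pattern could be extended to cover these two cases, but every further variation in the markup would call for yet another adjustment.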

XPath and DOM

DOM and XPath are not libraries in the usual sense of the word. They are standard modules that have been built into PHP since version 5. The fact that no third-party solution is needed is what makes them one of the best tools for parsing HTML pages.

At first glance it might seem that a low entry threshold is not something they can offer, as some parts are really quite complex. But this is only at first glance: once you get a little familiar with the syntax and the basic principles, XPath will immediately become your number one parsing tool. The one drawback of XPath and DOM is that the engine used for parsing was primarily intended for working with XML, and XML and HTML are very similar languages yet still different. This results in specific requirements for the markup. For example, all HTML tags should be closed.
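As a rough illustration of the built-in modules, the sketch below loads a small page with DOMDocument and selects links with DOMXPath. The sample markup, the helper name extractLinksWithXPath and the id "menu" are assumptions made for this example. The helper also assumes the container id contains no double quotes, since it is inserted into the XPath expression as is.

```php
<?php
// Hypothetical helper: returns the href of every link inside the <div> with the given id.
function extractLinksWithXPath(string $html, string $containerId): array
{
    // Collect libxml warnings about imperfect markup instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Select the href attribute of every <a> below <div id="...">.
    $nodes = $xpath->query(sprintf('//div[@id="%s"]//a/@href', $containerId));

    $links = [];
    foreach ($nodes as $attribute) {
        $links[] = $attribute->nodeValue;
    }

    return $links;
}

$html = '<html><body>'
    . '<div id="menu"><a class="nav" href="/news">News</a><a href=\'/about\'>About</a></div>'
    . '<a href="/skip">Skip</a>'
    . '</body></html>';

print_r(extractLinksWithXPath($html, 'menu'));
```

Because the query works on the parsed tree rather than on the raw text, attribute order and quoting style do not matter here, and the link outside the container is ignored.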

Simple HTML DOM

Simple HTML DOM is a PHP library that lets you parse HTML code using convenient jQuery-like selectors. It lacks the main disadvantage of XPath: the library can work even with invalid HTML code, which greatly simplifies the work. You can also forget about encoding problems, as all conversions are performed automatically.
Like jQuery, Simple HTML DOM can search for and filter nested elements, access their attributes and even select separate logical elements of the code, such as comments. Although it is not the fastest of the options, Simple HTML DOM has the largest Russian-speaking community and is the most widespread on the Runet, which makes writing code with it much easier for beginners.
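A short sketch of what this looks like in practice. It assumes the file simple_html_dom.php from the library's distribution sits next to the script. The sample markup and the helper name menuLinks are invented for this example:

```php
<?php
// Assumption: simple_html_dom.php from the Simple HTML DOM distribution
// is located next to this script; adjust the path to your setup.
require_once __DIR__ . '/simple_html_dom.php';

// Hypothetical helper: maps each link's href to its text for the element with id="menu".
function menuLinks(string $markup): array
{
    $html = str_get_html($markup);

    $links = [];
    // jQuery-like selector: every <a> inside the element with id="menu".
    foreach ($html->find('#menu a') as $link) {
        $links[$link->href] = $link->plaintext;
    }

    // Free the memory held by the parsed tree.
    $html->clear();

    return $links;
}

// Deliberately sloppy markup: the <li> and <p> tags are never closed.
$markup = '<ul id="menu"><li><a href="/news">News</a><li><a href="/about">About</a></ul>'
    . '<!-- footer starts here --><p>Bye';

print_r(menuLinks($markup));

// Comments can be selected too, using the special "comment" selector.
$html = str_get_html($markup);
foreach ($html->find('comment') as $comment) {
    echo $comment->outertext, PHP_EOL;
}
$html->clear();
```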

phpQuery

Like Simple HTML DOM, phpQuery is a PHP take on jQuery, but this time it is much closer to its "older JavaScript brother". Almost everything found in the JS framework has been ported, including support for selectors, attributes, manipulation, traversal, plug-ins, events (including simulated clicks and the like) and even AJAX. You can use it from PHP or from the command line as a separate application. In our benchmarks phpQuery was 8 times faster than Simple HTML DOM.
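For comparison, here is a minimal sketch of the same kind of selection with phpQuery. The require path, the sample markup and the helper name menuLinksWithPhpQuery are assumptions made for this example; adjust the path to wherever the library is installed:

```php
<?php
// Assumption: phpQuery has been downloaded and its main file is available at this path.
require_once __DIR__ . '/phpQuery/phpQuery.php';

// Hypothetical helper: maps each link's href to its text for the element with id="menu".
function menuLinksWithPhpQuery(string $markup): array
{
    // Build a document from an HTML string.
    $doc = phpQuery::newDocumentHTML($markup);

    $links = [];
    foreach ($doc->find('#menu a') as $node) {
        // Iteration yields plain DOM nodes; pq() wraps them in a jQuery-like object again.
        $link = pq($node);
        $links[$link->attr('href')] = $link->text();
    }

    return $links;
}

$markup = '<ul id="menu"><li><a href="/news">News</a></li><li><a href="/about">About</a></li></ul>';

print_r(menuLinksWithPhpQuery($markup));
```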

htmlSQL

htmlSQL is an experimental PHP library that lets you manipulate HTML markup through SQL-like queries. As with the ordinary mysql_ functions, the methods fetch_array() or fetch_objects() return the result of a query in the form of a familiar associative array or object. The high speed of htmlSQL is also worth mentioning: it is several times faster than phpQuery or Simple HTML DOM. Nevertheless, its functionality may not be enough for complex tasks, and development of the library stopped long ago. It is still of interest to web developers, as in some cases SQL is much more convenient to use than CSS selectors.

Conclusion

In our mini-study we came to the conclusion that in most cases it is better to use the phpQuery library for parsing, as it is fast, functional and up to date. For very simple tasks, on the other hand, it makes sense to use the standard PHP modules, XPath and DOM, or at the very least regular expressions.

Something else?

There are dozens of different libraries and tools for parsing in PHP, but in this article we have covered only the most interesting, functional and best-performing ones.