Parsing HTML: Selecting the Right Library (Part 2)

Here we compare the C# HTML parsers AngleSharp and HtmlAgilityPack to see where they are most helpful, where they fall short, and whether they suit your use case.
Last time, we looked over the various HTML parsers you can consider when working with Java. This time, we'll look at a couple of popular C# libraries and weigh their features, benefits, and drawbacks when processing HTML.
AngleSharp describes itself as "the ultimate angle brackets parser library parsing HTML5, MathML, SVG and CSS to construct a DOM based on the official W3C specifications."
AngleSharp is, quite simply, the default choice whenever you need a modern HTML parser for a C# project. In fact, it does not just parse HTML5, but also its most-used companions: CSS and SVG. There is also an extension that integrates scripting into the parsing of HTML documents, in both C# and JavaScript (the latter based on Jint). That means you can parse HTML documents after they have been modified by JavaScript, whether that is the JavaScript included in the page or a script you add yourself.
AngleSharp fully supports modern conventions for easy manipulation, like CSS selectors and jQuery-like constructs. It is also well integrated with the .NET world, with LINQ support for DOM elements. The author mentions that it may evolve into something more than a parser, but for the moment, it can do simple things like submitting forms.
The following example from the documentation shows a few features of AngleSharp.
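The documentation's own listing is not reproduced in this text. As a stand-in, here is a minimal sketch (not the documentation's example) of the two features described above: querying with a CSS selector and filtering DOM elements with LINQ. It assumes the AngleSharp NuGet package is referenced and a compiler that allows an async `Main` (C# 7.1 or later); the HTML snippet and the class name `item` are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using AngleSharp;

class AngleSharpSketch
{
    static async Task Main()
    {
        // Hypothetical input, invented for this sketch.
        const string html =
            "<ul>" +
            "<li class='item'>First</li>" +
            "<li class='item'>Second</li>" +
            "<li>Unclassified</li>" +
            "</ul>";

        // Create a browsing context and load the markup from a string
        // instead of fetching it from a URL.
        var context = BrowsingContext.New(Configuration.Default);
        var document = await context.OpenAsync(req => req.Content(html));

        // CSS selector: every <li> that carries the class "item".
        var viaSelector = document.QuerySelectorAll("li.item");

        // LINQ: the same query expressed over the document's elements.
        var viaLinq = document.All
            .Where(e => e.LocalName == "li" && e.ClassList.Contains("item"));

        foreach (var text in viaSelector.Select(e => e.TextContent))
            Console.WriteLine("selector: " + text);

        foreach (var text in viaLinq.Select(e => e.TextContent))
            Console.WriteLine("linq: " + text);
    }
}
```

Both queries should select the same two list items; which form to prefer is mostly a matter of whether your team thinks in CSS selectors or in LINQ.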
The documentation may contain all the information you need, but it certainly could use better organization. For the most part, it is delivered within the GitHub project, but there are also tutorials on CodeProject by the author of the library.
HtmlAgilityPack was once considered the default choice for HTML parsing with C#, although some say that was due to the lack of better alternatives, since the quality of the code was low. In any case, it had been essentially abandoned for the last few years until it was recently revived by ZZZ Projects.
In terms of features and quality, it is quite lacking, at least compared to AngleSharp. Support for CSS selectors, necessary for modern HTML parsing, and support for .NET Standard, necessary for modern C# projects, are on the roadmap. The same roadmap also lists a planned cleanup of the code.
If you need things like XPath, HtmlAgilityPack should be your best choice. In other cases, I do not think it is the best choice right now, unless you are already using it. That is especially true since there is no documentation. That said, the new maintainer and the prospect of better features are good reasons to keep using it if you are already a user.
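To make the XPath use case concrete, here is a minimal sketch that assumes the HtmlAgilityPack NuGet package is referenced. The HTML snippet is invented for illustration; the calls used (`HtmlDocument.LoadHtml`, `SelectNodes`, `GetAttributeValue`) are part of the library's long-standing API.

```csharp
using System;
using HtmlAgilityPack;

class HtmlAgilityPackSketch
{
    static void Main()
    {
        // Hypothetical input, invented for this sketch.
        const string html =
            "<html><body>" +
            "<a href='/one'>One</a>" +
            "<a href='/two'>Two</a>" +
            "<a>No target</a>" +
            "</body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath: every <a> element that has an href attribute.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");

        // SelectNodes returns null, not an empty collection,
        // when nothing matches, so guard before iterating.
        if (links == null)
        {
            Console.WriteLine("No links found.");
            return;
        }

        foreach (var link in links)
        {
            // The second argument is the fallback used if the attribute is missing.
            var href = link.GetAttributeValue("href", string.Empty);
            Console.WriteLine(link.InnerText + " -> " + href);
        }
    }
}
```

The null check is the detail most likely to trip up newcomers: an XPath query with no matches does not yield an empty list.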
Now that you’ve seen the heavy hitters in the C# world of parsing HTML, next time, we’ll take a look at the options out there for Python.