Articles

Scraping websites built for modern browsers is far more challenging than it was a decade ago. jsoup is a convenient API that makes scraping websites trivial via DOM traversal, CSS Selectors, JQuery-Like methods, and more. But it isn’t without its caveat. Every scraping API is a ticking time bomb.

Real-world HTML is flaky. It changes without notice since it isn’t a documented API. When our Java program fails in scraping, we’re suddenly stuck with a ticking time bomb. In some cases, this is a simple issue that we can reproduce locally and deploy. But some nuanced changes in the DOM tree might be harder to observe in a local test case. In those cases, we need to understand the problem in the parse tree before pushing an update. Otherwise, we might have a broken product in production.

Source de l’article sur DZONE