Scraping websites built for modern browsers is far more challenging than it was a decade ago. jsoup is a convenient API that makes scraping websites trivial via DOM traversal, CSS Selectors, JQuery-Like methods, and more. But it isn’t without its caveat. Every scraping API is a ticking time bomb.

Real-world HTML is flaky. It changes without notice since it isn’t a documented API. When our Java program fails in scraping, we’re suddenly stuck with a ticking time bomb. In some cases, this is a simple issue that we can reproduce locally and deploy. But some nuanced changes in the DOM tree might be harder to observe in a local test case. In those cases, we need to understand the problem in the parse tree before pushing an update. Otherwise, we might have a broken product in production.

Source de l’article sur DZONE

L’assistance proposée par ANKAA PMO

ANKAA PMO présent depuis plus de 20 ans sur le marché des services IT, accompagne les DSI dans leur recherche de compétences pour des besoins de renforts en mode régie ou l’externalisation de projets.
Vous souhaitez plus d’information ? Cliquez ici