Concept - Scraping dynamic / AJAX web pages
Two possible approaches:
1. Use a headless browser
- e.g. HtmlUnit for Java
- much slower, since the full page and its JavaScript must be loaded and executed
- easier for the target site to detect
2. Reverse engineer the undocumented API and call it directly
- find the API calls with the browser’s Developer Tools
- very fast
- usually returns structured data (XML or JSON)
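As a rough illustration of the second approach, the sketch below rebuilds a request the way it might appear in the Developer Tools. It uses the JDK's built-in `java.net.http` classes (Java 11+) rather than a third-party HTTP library, so it runs without extra dependencies. The endpoint, the `q` and `page` parameters, the header values, and the cookie are all hypothetical placeholders; a real site will need whatever its own requests show.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

class ApiRequestSketch {

    // Hypothetical endpoint; replace with the URL seen in the Developer Tools.
    static final String ENDPOINT = "https://example.com/api/search";

    // Rebuilds the request with the same parameters, headers and cookie
    // that the browser sent. All names and values here are assumptions.
    static HttpRequest buildRequest(String query, int page) {
        String url = ENDPOINT
                + "?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
                + "&page=" + page;
        return HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "application/json")
                .header("X-Requested-With", "XMLHttpRequest")
                .header("Cookie", "session=PLACEHOLDER")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Only builds and prints the request; nothing is sent over the network.
        HttpRequest request = buildRequest("laptop", 2);
        System.out.println(request.method() + " " + request.uri());
        request.headers().map()
                .forEach((name, values) -> System.out.println(name + ": " + values));
    }
}
```

Sending the request would be a single further call, `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, which returns the raw response body as a string.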
Concept - Steps
1. Open the page in your browser and find the API endpoint with the Developer Tools
2. Reverse engineer the API call (parameters, headers, cookies, etc.)
3. Replicate the API call with Unirest and parse the data (XML, JSON, sometimes HTML)
4. Extract the desired data
5. Export the results
4. Extract the desired data
5. Export the results
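The parse, extract and export steps might look roughly like the following once a response body has been received. This is a minimal sketch under stated assumptions: the XML payload, its `product`, `name` and `price` elements, and the CSV layout are invented for illustration, and the JDK's built-in XML parser is used so that no extra library is needed. A JSON response would need a JSON library instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ScrapeStepsSketch {

    // Hypothetical response body, standing in for what the API call returns.
    static final String SAMPLE_XML =
            "<products>"
            + "<product><name>Laptop A</name><price>999.00</price></product>"
            + "<product><name>Laptop B</name><price>1299.50</price></product>"
            + "</products>";

    // Parse the XML and extract one {name, price} row per <product> element.
    static List<String[]> extract(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList products = doc.getElementsByTagName("product");
            List<String[]> rows = new ArrayList<>();
            for (int i = 0; i < products.getLength(); i++) {
                Element product = (Element) products.item(i);
                String name = product.getElementsByTagName("name").item(0).getTextContent();
                String price = product.getElementsByTagName("price").item(0).getTextContent();
                rows.add(new String[] {name, price});
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalArgumentException("Response could not be parsed as XML", e);
        }
    }

    // Export the rows as CSV text with a header line.
    static String toCsv(List<String[]> rows) {
        StringBuilder csv = new StringBuilder("name,price\n");
        for (String[] row : rows) {
            csv.append(quote(row[0])).append(',').append(quote(row[1])).append('\n');
        }
        return csv.toString();
    }

    // Wrap a field in double quotes and escape embedded quotes by doubling them.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.print(toCsv(extract(SAMPLE_XML)));
    }
}
```

Keeping extraction and export in separate methods means the export format can change (CSV, database, JSON file) without touching the parsing code.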