I am writing a web scraper that has to visit a number of different websites and extract useful information using an LLM. However, some of my target websites are paginated, and some of those do not change the URL from one page to the next. That is, the current page number may be stored in session information, or the paging may be handled in JavaScript, etc.
In such situations, can you suggest how I should handle the different pages?
Note that since I am not scraping a single website, I cannot simply locate the "next" button element and click it with Selenium. Each target website may look totally different.
I am assuming this is for a text-based (NLP) LLM and not a multimodal one, i.e. one that scrapes the page visually and converts it to text. In the text-based case, all the paginated content would already be loaded in the HTML. You can chunk the HTML content and use a prompt like
"Remove all HTML tags and give me only information: html-text"
to get the text.
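As a rough illustration of the chunk-and-prompt idea, here is a minimal Python sketch. The names chunk_html, build_prompts, extract_text and the max_chars limit are my own placeholders, and ask_llm stands for whatever function you use to call your model; none of them come from a particular library. It splits on a fixed character count, so a tag may be cut in half at a chunk boundary; a real scraper would probably want to split on element boundaries instead.

```python
from typing import Callable, Iterator, List

# Prompt wording taken from the answer above; {html} is filled in per chunk.
PROMPT_TEMPLATE = "Remove all HTML tags and give me only information: {html}"


def chunk_html(html: str, max_chars: int = 4000) -> Iterator[str]:
    """Yield consecutive slices of `html`, each at most `max_chars` long."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    for start in range(0, len(html), max_chars):
        yield html[start:start + max_chars]


def build_prompts(html: str, max_chars: int = 4000) -> List[str]:
    """Build one prompt per HTML chunk."""
    return [PROMPT_TEMPLATE.format(html=chunk) for chunk in chunk_html(html, max_chars)]


def extract_text(html: str, ask_llm: Callable[[str], str], max_chars: int = 4000) -> str:
    """Send each prompt to `ask_llm` (hypothetical: your own model call) and join the replies."""
    return "\n".join(ask_llm(prompt) for prompt in build_prompts(html, max_chars))
```

You would call it as extract_text(page_source, ask_llm=my_model_call), where page_source is the HTML you fetched and my_model_call is your own wrapper around the LLM.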