python - Handling Pagination in LLM-based Scraping - Stack Overflow


I am writing a web scraper that has to visit a number of different websites and extract useful information using an LLM. However, some of my target websites are paginated, and some of them do not change the URL from one page to the next. For example, the current page number may be stored in session information, or the pagination may be handled entirely in JavaScript.

In such situations, can you suggest how I should handle different pages?

Note that since I am not scraping a single website, I cannot simply locate the "next" button element once and click it with Selenium. Each target website may look totally different.

Asked Jan 2 at 13:09 by Arash Vahabpour; edited Jan 2 at 15:59 by joanis

1 Answer


I am assuming you are using a text-based (NLP) LLM, not a multimodal one, i.e. one that scrapes from a visual rendering of the page. In the text-based case, all of the paginated content would already be loaded in the HTML. You can chunk the HTML content and use prompts like

"Remove all HTML tags and give me only information: html-text"

to get the text.
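
As a rough illustration of the chunk-and-prompt idea, here is a minimal sketch using only the standard library. The names `chunk_html`, `extract_text`, and `call_llm` are hypothetical, as is the 4000-character default chunk size. `call_llm` stands in for whatever LLM client you use, so you would need to supply it yourself. The `html-text` placeholder from the prompt above becomes `{html_text}` in the template.

```python
from typing import Callable, Iterator, List

# Prompt taken from the answer; {html_text} is where each chunk is substituted.
PROMPT_TEMPLATE = "Remove all HTML tags and give me only information: {html_text}"


def chunk_html(html: str, max_chars: int = 4000) -> Iterator[str]:
    """Yield consecutive pieces of `html`, each at most `max_chars` long.

    Where possible, a piece ends right after a '>' so that a tag is less
    likely to be split across two chunks. This is a heuristic, not a parser.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    start = 0
    while start < len(html):
        end = min(start + max_chars, len(html))
        if end < len(html):
            # Look for the last tag close inside the window and cut there.
            cut = html.rfind(">", start, end)
            if cut > start:
                end = cut + 1
        yield html[start:end]
        start = end


def extract_text(
    html: str,
    call_llm: Callable[[str], str],  # hypothetical: takes a prompt, returns the reply
    max_chars: int = 4000,
) -> List[str]:
    """Send one prompt per chunk and collect the replies in order."""
    return [
        call_llm(PROMPT_TEMPLATE.format(html_text=chunk))
        for chunk in chunk_html(html, max_chars)
    ]
```

You would call it as `extract_text(page_source, call_llm=my_client_function)`, where `page_source` is the HTML string you already fetched. The chunk size is counted in characters here for simplicity; in practice you would probably size chunks by your model's token limit instead.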
