I'm trying to make an image scraper for Danbooru images. I made a version using the Selenium web driver and it works fine, but for a large dataset it takes far too much time.
So I wanted to switch to bs4 (BeautifulSoup), but with this second version I'm getting this error:
Error processing get_images_srcs: HTTPSConnectionPool(host='danbooru.donmai.us', port=443): Max retries exceeded with url: /posts/'my url' (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1002)')))
The 'my url' part works fine if I try it in a browser, so the problem is not on my connection side.
Here is my simple function:
def get_image_src(self, post_id):
    image_src = []
    search_url = f"{self.base_url}/posts/{post_id}?q={quote(self.tag)}"
    try:
        response = self.session.get(search_url)  # self.session is initialized already
        if response.status_code != 200:
            print(f"{response.status_code} is the status code : not 200")
            return image_src
        soup = BeautifulSoup(response.text, 'html.parser')
        image = soup.find("img", class_="fit-width")
        if image:
            image_src.append(image.get("src"))
    except Exception as e:
        print(f"Error processing get_images_srcs {post_id}: {str(e)}")
    return image_src
This is my session initialization function:
def _make_session(self):
    session = requests.Session()
    adapter = HTTPAdapter(
        pool_connections=25,
        pool_maxsize=25,
        max_retries=Retry(
            total=4,
            backoff_factor=1,
            status_forcelist=[443, 503, 504]
        )
    )
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Referer': ''
    })
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session
The error you're getting usually means that the server abruptly closed the SSL/TLS connection before it finished the handshake or data transfer.
In your case it most likely means that you're hitting the server with many repeated requests and getting rate-limited or blocked on the server side.
What can you do about it?
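The most reliable fix is simply to slow down: put a pause between posts so you stay under whatever limit you're tripping. Here is a rough sketch of a wrapper method for your scraper class around your existing get_image_src (the get_all_image_srcs name and the one-second delay are just illustrative; tune them for your dataset):

import time

def get_all_image_srcs(self, post_ids, delay=1.0):
    # Fetch posts sequentially with a fixed pause between requests,
    # so the scraper does not hammer the server and trip rate limiting.
    srcs = []
    for post_id in post_ids:
        srcs.extend(self.get_image_src(post_id))
        time.sleep(delay)  # throttle: at most one request per `delay` seconds
    return srcs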
You could also adjust your session initialization code:
adapter = HTTPAdapter(
    pool_connections=5,   # fewer pooled connections means fewer parallel hits on the server
    pool_maxsize=5,
    max_retries=Retry(
        total=5,
        backoff_factor=2,  # increase the delay between retries
        # 429 (Too Many Requests) is the status that matters for rate limiting;
        # 443 was dropped from your original list because it is a port, not an HTTP status
        status_forcelist=[429, 503, 504]
    )
)
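Even with the adapter retrying for you, it can help to back off explicitly in your own code when the connection is still being dropped. A minimal sketch of that idea (the fetch_with_backoff helper, its timings and attempt count are made up for illustration):

import time
import requests

def fetch_with_backoff(session, url, attempts=4):
    # Back off exponentially when the server drops the TLS connection,
    # which is what the SSLEOFError in your traceback corresponds to.
    for attempt in range(attempts):
        try:
            return session.get(url, timeout=30)
        except requests.exceptions.SSLError:
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... between attempts
    return None

Catching requests.exceptions.SSLError specifically (instead of a bare Exception) also keeps unrelated bugs visible while you scrape.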