I'm trying to make an image scraper for Danbooru images. I made a version using the Selenium web driver and it works fine, but for a large dataset it takes far too much time.
So I wanted to switch to bs4 (BeautifulSoup), but with this second version I'm getting this error:
Error processing get_images_srcs: HTTPSConnectionPool(host='danbooru.donmai.us', port=443): Max retries exceeded with url: /posts/'my url' (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1002)')))
The 'my url' part works fine if I try it in a browser, so the problem is not on my connection side.
Here is my simple function:
def get_image_src(self, post_id):
    image_src = []
    search_url = f"{self.base_url}/posts/{post_id}?q={quote(self.tag)}"
    try:
        response = self.session.get(search_url)  # self.session is initialized already
        if response.status_code != 200:
            print(f"{response.status_code} is the status code : not 200")
            return image_src
        soup = BeautifulSoup(response.text, 'html.parser')
        image = soup.find("img", class_="fit-width")
        if image:
            image_src.append(image.get("src"))
    except Exception as e:
        print(f"Error processing get_images_srcs {post_id}: {str(e)}")
    return image_src
This is my session initialization function:
def _make_session(self):
    session = requests.Session()
    adapter = HTTPAdapter(
        pool_connections=25,
        pool_maxsize=25,
        max_retries=Retry(
            total=4,
            backoff_factor=1,
            status_forcelist=[443, 503, 504]
        )
    )
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Referer': ''
    })
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session
The error you're getting usually means that the server abruptly closed the SSL/TLS connection before it finished the handshake or data transfer.
In your case it most likely means that you're hitting the server with many repeated requests and getting rate-limited or blocked on the server side.
What can you do about it?
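The most reliable fix is simply to slow down: put a pause between posts so you stay under whatever limit you're tripping. Here is a rough sketch of a wrapper method for your scraper class around your existing get_image_src (the get_all_image_srcs name and the one-second delay are just illustrative; tune them for your dataset):

import time

def get_all_image_srcs(self, post_ids, delay=1.0):
    # Fetch posts sequentially with a fixed pause between requests,
    # so the scraper does not hammer the server and trip rate limiting.
    srcs = []
    for post_id in post_ids:
        srcs.extend(self.get_image_src(post_id))
        time.sleep(delay)  # throttle: at most one request per `delay` seconds
    return srcs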
You could also adjust your session initialization code:
adapter = HTTPAdapter(
    pool_connections=5,   # fewer pooled connections means fewer parallel hits on the server
    pool_maxsize=5,
    max_retries=Retry(
        total=5,
        backoff_factor=2,  # increase the delay between retries
        # 429 (Too Many Requests) is the status that matters for rate limiting;
        # 443 was dropped from your original list because it is a port, not an HTTP status
        status_forcelist=[429, 503, 504]
    )
)
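Even with the adapter retrying for you, it can help to back off explicitly in your own code when the connection is still being dropped. A minimal sketch of that idea (the fetch_with_backoff helper, its timings and attempt count are made up for illustration):

import time
import requests

def fetch_with_backoff(session, url, attempts=4):
    # Back off exponentially when the server drops the TLS connection,
    # which is what the SSLEOFError in your traceback corresponds to.
    for attempt in range(attempts):
        try:
            return session.get(url, timeout=30)
        except requests.exceptions.SSLError:
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... between attempts
    return None

Catching requests.exceptions.SSLError specifically (instead of a bare Exception) also keeps unrelated bugs visible while you scrape.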