Maybe `stream=True` with a `stream_timeout` could be optional params to `download()`?

Something like this might work; I monkey-patched `download()`:
```python
import time
import logging

import requests

import newspaper
from newspaper import network
from newspaper.article import ArticleDownloadState
from newspaper.configuration import Configuration
from newspaper.network import _get_html_from_response, get_request_kwargs
from newspaper.utils import extract_meta_refresh

log = logging.getLogger(__name__)


def get_html_2XX_only(url, config=None, response=None, stream=False,
                      stream_timeout=30):
    """Consolidated logic for http requests from newspaper. We handle error cases:
    - Attempt to find encoding of the html by using HTTP header. Fallback to
      'ISO-8859-1' if not provided.
    - Error out if a non 2XX HTTP response code is returned.
    """
    config = config or Configuration()
    useragent = config.browser_user_agent
    timeout = config.request_timeout
    proxies = config.proxies
    headers = config.headers

    if response is not None:
        return _get_html_from_response(response)

    if stream:
        response = requests.get(
            url=url, **get_request_kwargs(timeout, useragent, proxies, headers),
            stream=True)
        body = []
        start = time.time()
        for chunk in response.iter_content(1024):
            body.append(chunk)
            if time.time() - start > stream_timeout:
                logging.error(f"Stream timed out for url: {url}")
                break
        # stuff the (possibly partial) body back into the response so
        # _get_html_from_response() can decode it as usual
        response._content = b''.join(body)
    else:
        response = requests.get(
            url=url, **get_request_kwargs(timeout, useragent, proxies, headers))

    html = _get_html_from_response(response)

    if config.http_success_only:
        # fail if HTTP sends a non 2XX response
        response.raise_for_status()

    return html


def download(self, input_html=None, title=None, recursion_counter=0,
             stream=False, stream_timeout=30):
    """Downloads the link's HTML content, don't use if you are batch async
    downloading articles

    recursion_counter (currently 1) stops refreshes that are potentially
    infinite
    """
    if input_html is None:
        try:
            html = get_html_2XX_only(self.url, self.config, stream=stream,
                                     stream_timeout=stream_timeout)
        except requests.exceptions.RequestException as e:
            self.download_state = ArticleDownloadState.FAILED_RESPONSE
            self.download_exception_msg = str(e)
            log.debug('Download failed on URL %s because of %s' %
                      (self.url, self.download_exception_msg))
            return
    else:
        html = input_html

    if self.config.follow_meta_refresh:
        meta_refresh_url = extract_meta_refresh(html)
        if meta_refresh_url and recursion_counter < 1:
            return self.download(
                input_html=network.get_html(meta_refresh_url),
                recursion_counter=recursion_counter + 1)

    self.set_html(html)
    self.set_title(title)


# patch the new download() onto Article
newspaper.Article.download = download
```
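With the patch applied, the extra params pass straight through to `Article.download()`. For illustration (the URL is just a placeholder), usage would look something like:

```python
import newspaper

article = newspaper.Article('https://example.com/slow-article')
article.download(stream=True, stream_timeout=10)  # cap the body download at ~10s
article.parse()
print(article.title)
```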
The following doesn't time out, nor does it return anything.
Same with:
Related issue: psf/requests#1577
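For background, the `timeout` argument in `requests` only bounds the connection attempt and each individual socket read, never the total request time, so a server that keeps drip-feeding bytes will never trip it. That's what the wall-clock cap in the patch above works around. A minimal standalone sketch of the same idea (the function name and the 30-second default are mine, for illustration):

```python
import time
import requests

def get_with_hard_limit(url, hard_limit=30, chunk_size=1024):
    # stream=True defers the body download so it can be read chunk by chunk;
    # timeout=10 still guards the connect and each individual read
    response = requests.get(url, stream=True, timeout=10)
    start = time.time()
    chunks = []
    for chunk in response.iter_content(chunk_size):
        chunks.append(chunk)
        if time.time() - start > hard_limit:  # cap *total* elapsed time
            response.close()
            break
    return b''.join(chunks)
```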