python-requests cannot download an image, downloaded image is 0 bytes - python-3.x

I am trying to download a news epaper (each page of the epaper is an image). I am using Selenium to log in and get the image src, and the requests module to download the image.
Here is the code (the requests part) that I use:
def download(driver, pageNumber):
    page, filename = pageNumber, ""
    if page in range(1, 10):
        filename = str(currentDT) + "_kompas_{}" + str(page) + ".jpg"
        filename = filename.format(0)
    else:
        filename = str(currentDT) + "_kompas_" + str(page) + ".jpg"
    print("Downloading Page " + str(pageNumber) + " ...")
    div = driver.find_element_by_xpath("//div[@class='page-wrapper' and @page='" + str(pageNumber) + "']")
    img = div.find_element_by_tag_name("img")
    imgsrc = img.get_attribute("src")
    imgsrc2 = imgsrc.replace("getmedium", "getpreview")
    img.click()
    WebDriverWait(driver, 200).until(EC.visibility_of_element_located((By.XPATH, "//img[@src = '" + imgsrc2 + "']")))
    div2 = driver.find_element_by_xpath("//div[@class='page-wrapper' and @page='" + str(pageNumber) + "']")
    img2 = div2.find_element_by_tag_name("img")
    url = img2.get_attribute("src")
    url = url.replace("https", "http")
    print(url)
    url = img2.get_attribute("src")
    r = requests.get(url)
    if r.status_code == 200:
        with open(download_path + "1.jpg", 'wb') as f:
            f.write(r.content)
After running the code, the downloaded image is 0 bytes. When I check the headers using print(r.headers), it shows something like this:
{'Date': 'Fri, 28 Sep 2018 06:14:29 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Set-Cookie': '__cfduid=d2770acf5454bb72630a1936eda1930561538115268; expires=Sat, 28-Sep-19 06:14:28 GMT; path=/; domain=.epaper.id; HttpOnly, ci_session=db77e070cbe346e0ac183d686efae9989e8f2096; path=/; HttpOnly', 'X-Powered-By': 'PHP/5.6.37', 'Expires': 'Thu, 19 Nov 1981 08:52:00 GMT', 'Cache-Control': 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'Pragma': 'no-cache', 'Expect-CT': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"', 'Server': 'cloudflare', 'CF-RAY': '461411eeef1c31aa-SIN', 'Content-Encoding': 'gzip'}
What should I do to solve this problem? Please help me...
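The response headers offer a clue: Content-Type is text/html (not an image type) and the server sets a fresh ci_session cookie, which suggests requests is being served an HTML login or error page because it does not share Selenium's authenticated session. A minimal sketch of one common fix, copying the browser's cookies into a requests.Session before downloading (this is an assumption about the cause, not a confirmed diagnosis):
import requests

# Reuse Selenium's authenticated cookies in requests.
# Assumes `driver` is the logged-in WebDriver from the code above.
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])

r = session.get(url, headers={'Referer': driver.current_url})
content_type = r.headers.get('Content-Type', '')
if r.status_code == 200 and content_type.startswith('image/'):
    with open(download_path + "1.jpg", 'wb') as f:
        f.write(r.content)
else:
    # An HTML body here usually means the session was not accepted.
    print("Unexpected response:", r.status_code, content_type)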

Related

Curl HEAD request returns a last-modified header but Node https.request(..., {method: HEAD},...) does not

Apparently there's something I don't know about HEAD requests.
Here's the URL: 'https://theweekinchess.com/assets/files/pgn/eurbli22.pgn', which I'll refer to as <URL> below.
If I curl this, I see a last-modified entry in the headers:
curl --head <URL>
HTTP/2 200
last-modified: Sun, 18 Dec 2022 18:07:16 GMT
accept-ranges: bytes
content-length: 1888745
host-header: c2hhcmVkLmJsdWVob3N0LmNvbQ==
content-type: application/x-chess-pgn
date: Wed, 11 Jan 2023 23:09:14 GMT
server: Apache
But if I make a HEAD request in Node using https, that information is missing:
https.request(<URL>, { method: 'HEAD' }, res => {
  console.log([<URL>, res.headers])
}).end()
This returns:
[
  <URL>,
  {
    date: 'Wed, 11 Jan 2023 23:16:15 GMT',
    server: 'Apache',
    p3p: 'CP="NOI NID ADMa OUR IND UNI COM NAV"',
    'cache-control': 'private, must-revalidate',
    'set-cookie': [
      'evof3sqa=4b412b5913b38669fc928a0cca9870e4; path=/; secure; HttpOnly'
    ],
    upgrade: 'h2,h2c',
    connection: 'Upgrade, Keep-Alive',
    'host-header': 'c2hhcmVkLmJsdWVob3N0LmNvbQ==',
    'keep-alive': 'timeout=5, max=75',
    'content-type': 'text/html; charset=UTF-8'
  }
]
I tried axios instead of https:
const response = await axios.head('https://theweekinchess.com/assets/files/pgn/eurbli22.pgn');
console.log({response: response.headers})
And that works (incl. the proper MIME type):
date: 'Thu, 12 Jan 2023 20:00:48 GMT',
server: 'Apache',
upgrade: 'h2,h2c',
connection: 'Upgrade, Keep-Alive',
'last-modified': 'Sun, 18 Dec 2022 18:07:16 GMT',
'accept-ranges': 'bytes',
'content-length': '1888745',
'host-header': 'c2hhcmVkLmJsdWVob3N0LmNvbQ==',
'keep-alive': 'timeout=5, max=75',
'content-type': 'application/x-chess-pgn'
I also tried waiting for re2.on('end', console.log(res.headers)), but same output as before.
I'm going to close this issue and post it instead as a 'bug' on Node's site. I'm sure there's something that needs to be changed in how I'm executing the HEAD request.
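One thing worth ruling out before filing a Node bug: Node's bare http/https client sends no User-Agent header at all, while curl and axios both identify themselves, and some shared hosts serve a different (HTML, cookie-setting) response to clients without one — which would match the text/html content type and set-cookie seen above. That is only a hypothesis, but it is easy to test by toggling the header; a quick sketch in Python, since requests lets you omit one of its default headers by setting it to None:
import requests

URL = 'https://theweekinchess.com/assets/files/pgn/eurbli22.pgn'

# Compare the HEAD response with and without a User-Agent header.
for label, headers in [('no User-Agent', {'User-Agent': None}),
                       ('curl-like User-Agent', {'User-Agent': 'curl/7.85.0'})]:
    r = requests.head(URL, headers=headers)
    print(label, r.status_code, r.headers.get('Content-Type'),
          r.headers.get('Last-Modified'))
If the two responses differ, adding a User-Agent to the Node request's options should bring back the last-modified header.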

HTTP get request Access Denied

Trying to understand why I am getting Access Denied when attempting to download index.html from www.gamestop.com. I have figured out how to get around it; for example, this static asset URL downloads fine: https://www.gamestop.com/on/demandware.static/Sites-gamestop-us-Site/-/default/v1592871955944/js/main.js. I was wondering if anyone understood why the basic URL (www.gamestop.com) is rejected.
Code:
import requests
import http.client as http_client
import logging
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    'connection': 'keep-alive',
    'dnt': '1',
    'downlink': '10',
    'ect': '4g',
    'rtt': '50',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
}
http_client.HTTPConnection.debuglevel = 1
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
requests_log = logging.getLogger("requests.packages.urllib3")
requests_log.setLevel(logging.DEBUG)
requests_log.propagate = True
r = requests.get('https://www.gamestop.com', headers=headers)
print(r.text)
print(r.status_code)
print(r.headers)
Output:
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.gamestop.com:443
send: b'GET / HTTP/1.1\r\nHost: www.gamestop.com\r\nuser-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36\r\naccept-encoding: gzip, deflate, br\r\naccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9\r\nconnection: keep-alive\r\naccept-language: en-US,en;q=0.9\r\ncache-control: max-age=0\r\ndnt: 1\r\ndownlink: 10\r\nect: 4g\r\nrtt: 50\r\nsec-fetch-dest: document\r\nsec-fetch-mode: navigate\r\nsec-fetch-site: none\r\nsec-fetch-user: ?1\r\nupgrade-insecure-requests: 1\r\n\r\n'
reply: 'HTTP/1.1 403 Forbidden\r\n'
header: Server: AkamaiGHost
header: Mime-Version: 1.0
header: Content-Type: text/html
header: Content-Length: 265
header: Expires: Fri, 26 Jun 2020 19:54:19 GMT
header: Date: Fri, 26 Jun 2020 19:54:19 GMT
header: Connection: close
header: Server-Timing: cdn-cache; desc=HIT
header: Server-Timing: cdn-cache; desc=HIT
DEBUG:urllib3.connectionpool:https://www.gamestop.com:443 "GET / HTTP/1.1" 403 265
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access "http://www.gamestop.com/" on this server.<P>
Reference #18.19e8d93f.1593201259.5c2b9d0
</BODY>
</HTML>
403
{'Server': 'AkamaiGHost', 'Mime-Version': '1.0', 'Content-Type': 'text/html', 'Content-Length': '265', 'Expires': 'Fri, 26 Jun 2020 19:54:19 GMT', 'Date': 'Fri, 26 Jun 2020 19:54:19 GMT', 'Connection': 'close', 'Server-Timing': 'cdn-cache; desc=HIT, edge; dur=1'}
This is code from another of my projects.
By using a random user agent (the fake_useragent package) with headless Chrome, you can bypass this.
See the documentation of the modules used here to learn more:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from fake_useragent import UserAgent

ua = UserAgent()
userAgent = ua.random
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument(f'user-agent={userAgent}')
driver = webdriver.Chrome(
    executable_path=r'C:\Users\ASHIK\Desktop\chromedriver.exe',
    options=chrome_options)
driver.get("https://www.myntra.com/men?f=Categories%3ATshirts&p=1")
html_doc = driver.page_source
# The with block closes the file automatically; no explicit close() needed.
with open('myntra-ecom.html', 'w', encoding='utf-8') as hfile:
    hfile.writelines(html_doc)
print("Html file Downloaded...")

Python requests text only returning BOM and whitespace instead of HTML

I'm trying to scrape the link to a file from a website so that I can download the file later.
My code:
outage_page = 'https://www.oasis.oati.com/cgi-bin/webplus.dll?script=/woa/woa-planned-outages-report.html&Provider=MISO'
s = requests.Session()
req = s.get(outage_page, stream=True, verify='my cert path is here')
print(req, '\n', req.headers, '\n', req.raw, '\n', req.encoding, '\n', req.content, '\n', req.text)
This is the output I get:
{'Content-Type': 'text/html', 'Content-Encoding': 'gzip', 'Vary': 'Accept-Encoding', 'Server': 'Microsoft-IIS/7.5', 'X-Powered-By': 'ASP.NET', 'X-Content-Type-Options': 'nosniff', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains', 'Date': 'Mon, 26 Aug 2019 15:48:39 GMT', 'Content-Length': '136'}
ISO-8859-1
b'\xef\xbb\xbf\xef\xbb\xbf\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n \r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n'

Process finished with exit code 0
I expected req.text to return the HTML I could scrape, but it only returns the byte-order marks and blank lines shown above. The other print statements are just for reference here. What am I doing wrong?
I'm going to go ahead and post my solution. I converted my certificate file from .cer to .pem, attached the cert to the session instead of passing it to get, and added headers to the request. I also changed verify to False, because verify refers to the server-side certificate, not the client-side one.
# create the connection
s = requests.Session()
s.cert = 'path/to/cert.pem'
head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'
}
req = s.get(outage_page, headers=head, verify=False)
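One caveat on this solution: verify=False disables TLS certificate checking entirely, and urllib3 will emit an InsecureRequestWarning on every request. If that risk is acceptable for this host, the warning can be silenced explicitly:
import urllib3

# Acknowledge (and silence) the warning that verify=False triggers.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)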

Error 403 while using exchangelib to access Outlook Exchange server to read emails

I'm trying to read emails from a Microsoft Exchange server using EWS and exchangelib in Python, for an email classification problem, but I am unable to connect to the server.
I've tried specifying the version, the auth_type, using a certificate (which gives an SSL verify error), and using the SMTP address in place of the username; it still doesn't connect.
Here is my code:
from exchangelib import Credentials, Account, EWSDateTime, EWSTimeZone, Configuration, DELEGATE, IMPERSONATION, NTLM, ServiceAccount, Version, Build
USER_NAME = 'domain\\user12345'
ACCOUNT_EMAIL = 'john.doe@ext.companyname.com'
ACCOUNT_PASSWORD = 'John#1234'
ACCOUNT_SERVER = 'oa.company.com'
creds = Credentials(USER_NAME, ACCOUNT_PASSWORD)
config = Configuration(server=ACCOUNT_SERVER, credentials=creds)
account = Account(primary_smtp_address=ACCOUNT_EMAIL, config=config, autodiscover=False, access_type=DELEGATE)
print('connecting ms exchange server account...')
print(type(account))
print(dir(account))
account.root.refresh()
Here is the error I am getting:
TransportError: Unknown failure
Retry: 0
Waited: 10
Timeout: 120
Session: 26271
Thread: 15248
Auth type: <requests_ntlm.requests_ntlm.HttpNtlmAuth object at 0x00000259AA1BD588>
URL: https://oa.company.com/EWS/Exchange.asmx
HTTP adapter: <requests.adapters.HTTPAdapter object at 0x00000259AA0DB7B8>
Allow redirects: False
Streaming: False
Response time: 0.28100000000085856
Status code: 403
Request headers: {'User-Agent': 'python-requests/2.21.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'Keep-Alive', 'Content-Type': 'text/xml; charset=utf-8', 'Content-Length': '469', 'Authorization': 'NTLM TlRMTVNTUAADAAAAGAAYAG0AAAAOAQ4BhQAAAAwADABYAAAACQAJAGQAAAAAAAAAbQAAABAAEACTAQAANoKJ4gYBsR0AAAAP7Pyb+wBnMdrlhr4FKVqPbklDSUNJQkFOS0xURElQUlUzODE5MAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJTLmMBLPHowOJZ46XDs+4ABAQAAAAAAAAZdRefQLNUB+Fc6Z26oxvgAAAAAAgAYAEkAQwBJAEMASQBCAEEATgBLAEwAVABEAAEAEgBIAFkARABFAFgAQwBIADAAOAAEACAAaQBjAGkAYwBpAGIAYQBuAGsAbAB0AGQALgBjAG8AbQADADQASABZAEQARQBYAEMASAAwADgALgBpAGMAaQBjAGkAYgBhAG4AawBsAHQAZAAuAGMAbwBtAAUAIABpAGMAaQBjAGkAYgBhAG4AawBsAHQAZAAuAGMAbwBtAAcACAAGXUXn0CzVAQYABAACAAAACgAQAD9EWlwiiUs304wucsxnkyQAAAAAAAAAALfelDwG05hYOMUqY/e60PY=', 'Cookie': 'ClientId=SINZWMOJKWSKDGEKASFG; expires=Fri, 26-Jun-2020 10:13:02 GMT; path=/; HttpOnly'}
Response headers: {'Cache-Control': 'private', 'Server': 'Microsoft-IIS/8.5', 'request-id': 'ae4dee8d-34e0-471c-8252-b8c1056c8ea0', 'X-CalculatedBETarget': 'pqrexch05.domain.com', 'X-DiagInfo': 'PQREXCH05', 'X-BEServer': 'PQREXCH05', 'X-AspNet-Version': '4.0.30319', 'Set-Cookie': 'exchangecookie=681afc8a0905459182363cce9a98d021; expires=Sat, 27-Jun-2020 10:13:02 GMT; path=/; HttpOnly, X-BackEndCookie=S-1-5-21-1343024091-725345543-504838010-1766210=u56Lnp2ejJqBy87Iysqem5nSy8mbnNLLyZ7H0sfIysbSy5vMz8qdzcvPnpzHgYHNz87G0s/I0s3Iq87Pxc7Mxc/N; expires=Sat, 27-Jul-2019 10:13:02 GMT; path=/EWS; secure; HttpOnly', 'Persistent-Auth': 'true', 'X-Powered-By': 'ASP.NET', 'X-FEServer': 'PQREXCH05', 'Date': 'Thu, 27 Jun 2019 10:13:01 GMT', 'Content-Length': '0'}
Request data: b'<?xml version=\'1.0\' encoding=\'utf-8\'?>\n<s:Envelope xmlns:m="http://schemas.microsoft.com/exchange/services/2006/messages" xmlns:s="http://schemas.xmlsoap.org/soap/envelope/" xmlns:t="http://schemas.microsoft.com/exchange/services/2006/types"><s:Header><t:RequestServerVersion Version="Exchange2013_SP1"/></s:Header><s:Body><m:ResolveNames ReturnFullContactData="false"><m:UnresolvedEntry>ICICIBANKLTD\\IPRU38190</m:UnresolvedEntry></m:ResolveNames></s:Body></s:Envelope>'
Response data: b''
You might need to configure the access policy for EWS using PowerShell.
For example, to allow all apps to use REST and EWS:
Set-OrganizationConfig -EwsApplicationAccessPolicy EnforceBlockList -EwsBlockList $null
Taken from the Microsoft docs on Set-OrganizationConfig.
Search for EwsApplicationAccessPolicy in that article for more granular access-control examples.
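If the 403 persists after the policy change, it helps to see what the client is actually sending. exchangelib logs its HTTP traffic through Python's standard logging module, so a diagnostic sketch (not a fix) is simply:
import logging

# exchangelib uses the standard logging module; DEBUG shows the
# full request/response cycle, including auth negotiation.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger('exchangelib').setLevel(logging.DEBUG)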

Request email audit export fails with status 400 and "Premature end of file."

Following https://developers.google.com/admin-sdk/email-audit/#creating_a_mailbox_for_export, I am trying to request the email audit export of a user in G Suite this way:
def requestAuditExport(account):
    credentials = getCredentials()
    http = credentials.authorize(httplib2.Http())
    url = 'https://apps-apis.google.com/a/feeds/compliance/audit/mail/export/helpling.com/' + account
    status, response = http.request(url, 'POST', headers={'Content-Type': 'application/atom+xml'})
    print(status)
    print(response)
And I get the following result:
{'content-length': '22', 'expires': 'Tue, 13 Dec 2016 14:19:37 GMT', 'date': 'Tue, 13 Dec 2016 14:19:37 GMT', 'x-frame-options': 'SAMEORIGIN', 'transfer-encoding': 'chunked', 'x-xss-protection': '1; mode=block', 'content-type': 'text/html; charset=UTF-8', 'x-content-type-options': 'nosniff', '-content-encoding': 'gzip', 'server': 'GSE', 'status': '400', 'cache-control': 'private, max-age=0', 'alt-svc': 'quic=":443"; ma=2592000; v="35,34"'}
b'Premature end of file.'
I cannot see where the problem is, can someone please give me a hint?
Thanks in advance!
Kay
Fix it by going into the Admin Console, opening the Manage API client access page under Security, and adding the Client ID and the scope needed for the Directory API. For more information, check this document.
Okay, found out what was wrong and fixed it myself. Finally it looks like this:
http = getCredentials().authorize(httplib2.Http())
url = 'https://apps-apis.google.com/a/feeds/compliance/audit/mail/export/helpling.com/'+account
headers = {'Content-Type': 'application/atom+xml'}
xml_data = """<atom:entry xmlns:atom='http://www.w3.org/2005/Atom' xmlns:apps='http://schemas.google.com/apps/2006'> \
<apps:property name='includeDeleted' value='true'/> \
</atom:entry>"""
status, response = http.request(url, 'POST', headers=headers, body=xml_data)
Not sure if it was about the body or the header. It works now and I hope it will help others.
Thanks anyway.
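For anyone extending this: the response body is Atom XML, and (by analogy with the request body above and other GData APIs — an assumption, not something confirmed here) the useful fields arrive as <apps:property name='...' value='...'/> elements. A minimal sketch for listing them:
import xml.etree.ElementTree as ET

APPS_NS = '{http://schemas.google.com/apps/2006}'

def print_export_properties(response_body):
    # Print every apps:property name/value pair in the Atom response.
    root = ET.fromstring(response_body)
    for prop in root.iter(APPS_NS + 'property'):
        print(prop.get('name'), '=', prop.get('value'))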
