Stuck with xml download using python. How to handle that? - python-3.x

I need a hint from you about an issue I'm handling. I'm using requests to do some web scraping in Python. The URL points to a file download, but when I read the content of the response, I get the following result:
b'"PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiIHN0YW5kYWxvbmU9InllcyI/Pg0KPERhZG9zRWNvbm9taWNvRmluYW5jZWlyb3MgeG1sbnM6eHNpPSJodHRwOi8vd3d3LnczLm9yZy8yMDAxL1hNTFNjaGVtYS1pbnN0YW5jZSI+DQoJPERhZG9zR2VyYWlzPg0KCQk8Tm9tZUZ1bmRvPkZJSSBCVEdQIExPR0lTVElDQTwvTm9tZUZ1bmRvPg0KCQk8Q05QSkZ1bmRvPjExODM5NTkzMDAwMTA5PC9DTlBKRnVuZG8+DQoJCTxOb21lQWRtaW5pc3RyYWRvcj5CVEcgUGFjdHVhbCBTZXJ2acOnb3MgRmluYW5jZWlyb3MgUy5BLiBEVFZNPC9Ob21lQWRtaW5pc3RyYWRvcj4NCgkJPENOUEpBZG1pbmlzdHJhZG9yPjU5MjgxMjUzMDAwMTIzPC9DTlBKQWRtaW5pc3RyYWRvcj4NCgkJPFJlc3BvbnNhdmVsSW5mb3JtYWNhbz5MdWNhcyBNYXNzb2xhPC9SZXNwb25zYXZlbEluZm9ybWFjYW8+DQoJCTxUZWxlZm9uZUNvbnRhdG8+KDExKSAzMzgzLTI1MTM8L1RlbGVmb25lQ29udGF0bz4NCgkJPENvZElTSU5Db3RhPkJSQlRMR0NURjAwMDwvQ29kSVNJTkNvdGE+DQoJCTxDb2ROZWdvY2lhY2FvQ290YT5CVExHMTE8L0NvZE5lZ29jaWFjYW9Db3RhPg0KCTwvRGFkb3NHZXJhaXM+DQoJPEluZm9ybWVSZW5kaW1lbnRvcz4NCgkJPFJlbmRpbWVudG8+DQoJCQk8RGF0YUFwcm92YWNhbz4yMDIxLTEyLTE1PC9EYXRhQXByb3ZhY2FvPg0KCQkJPERhdGFCYXNlPjIwMjEtMTItMTU8L0RhdGFCYXNlPg0KCQkJPERhdGFQYWdhbWVudG8+MjAyMS0xMi0yMzwvRGF0YVBhZ2FtZW50bz4NCgkJCTxWYWxvclByb3ZlbnRvQ290YT4wLjcyPC9WYWxvclByb3ZlbnRvQ290YT4NCgkJCTxQZXJpb2RvUmVmZXJlbmNpYT5Ob3ZlbWJybzwvUGVyaW9kb1JlZmVyZW5jaWE+DQoJCQk8QW5vPjIwMjE8L0Fubz4NCgkJCTxSZW5kaW1lbnRvSXNlbnRvSVI+dHJ1ZTwvUmVuZGltZW50b0lzZW50b0lSPg0KCQk8L1JlbmRpbWVudG8+DQoJCTxBbW9ydGl6YWNhbyB0aXBvPSIiLz4NCgk8L0luZm9ybWVSZW5kaW1lbnRvcz4NCjwvRGFkb3NFY29ub21pY29GaW5hbmNlaXJvcz4="'
and these headers:
{'Date': 'Thu, 13 Jan 2022 13:25:03 GMT', 'Set-Cookie': 'dtCookie=v_4_srv_27_sn_A24AD4C76E5194F3DB0056C40CBABEF7_perc_100000_ol_0_mul_1_app-3A97e61c3a8a7c6a0b_1_rcs-3Acss_0; Path=/; Domain=.bmfbovespa.com.br, JSESSIONID=LWB+pcQEPreUbb+BtwZ9pyOm.sfnNODE01; Path=/fnet; Secure; HttpOnly, TS01871345=011d592ce1f641d52fa6af8d3b5a924eddc7997db2f6611d8d70aeab610f5e34ea2706a45b6f2c35f2b500d01fc681c74e5caa356c; Path=/; HTTPOnly, TS01e3f871=011d592ce1f641d52fa6af8d3b5a924eddc7997db2f6611d8d70aeab610f5e34ea2706a45b6f2c35f2b500d01fc681c74e5caa356c; path=/; domain=.bmfbovespa.com.br; HTTPonly, TS01d1c2dd=011d592ce1f641d52fa6af8d3b5a924eddc7997db2f6611d8d70aeab610f5e34ea2706a45b6f2c35f2b500d01fc681c74e5caa356c; path=/fnet; HTTPonly', 'X-OneAgent-JS-Injection': 'true', 'X-Frame-Options': 'SAMEORIGIN', 'Cache-Control': 'no-cache, no-store, must-revalidate', 'Pragma': 'no-cache', 'Expires': '0', 'Content-Disposition': 'attachment; filename="08706065000169-ACE28022020V01-000083505.xml"', 'Server-Timing': 'dtRpid;desc="258920448"', 'Connection': 'close', 'Content-Type': 'text/xml', 'X-XSS-Protection': '1; mode=block', 'Transfer-Encoding': 'chunked'}
But it works perfectly and downloads the .xml file when I point the browser to the https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=247031 URL address, for example, with the following data:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<DadosEconomicoFinanceiros xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<DadosGerais>
<NomeFundo>FII BTGP LOGISTICA</NomeFundo>
<CNPJFundo>11839593000109</CNPJFundo>
<NomeAdministrador>BTG Pactual Serviços Financeiros S.A. DTVM</NomeAdministrador>
<CNPJAdministrador>59281253000123</CNPJAdministrador>
<ResponsavelInformacao>Lucas Massola</ResponsavelInformacao>
<TelefoneContato>(11) 3383-2513</TelefoneContato>
<CodISINCota>BRBTLGCTF000</CodISINCota>
<CodNegociacaoCota>BTLG11</CodNegociacaoCota>
</DadosGerais>
<InformeRendimentos>
<Rendimento>
<DataAprovacao>2021-12-15</DataAprovacao>
<DataBase>2021-12-15</DataBase>
<DataPagamento>2021-12-23</DataPagamento>
<ValorProventoCota>0.72</ValorProventoCota>
<PeriodoReferencia>Novembro</PeriodoReferencia>
<Ano>2021</Ano>
<RendimentoIsentoIR>true</RendimentoIsentoIR>
</Rendimento>
<Amortizacao tipo=""/>
</InformeRendimentos>
</DadosEconomicoFinanceiros>
It seems to me that the data is encrypted somehow, but I have no idea how to extract the XML so I can use the data inside it. Can you help me?
Thank you very much.
EDIT:
The example code I've used is quite simple:
Python 3.8.10 (tags/v3.8.10:3d8993a, May 3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> url = 'https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=247031'
>>> xhtml = requests.get(url, verify=False, headers={'User-Agent': 'Mozilla/5.0'})
Then the xhtml.content command shows the string above. (There is an HTTPS warning due to verify=False, which I will handle later.)
I have tried a solution using urllib.request, but got the same result.

Data seems to be base64 encoded. Try to decode it:
import requests
import base64
url = 'https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=247031'
response = requests.get(url, verify=False, headers={'User-Agent': 'Mozilla/5.0'})
decoded = base64.b64decode(response.content)
print(decoded)
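Once decoded, the payload is ordinary XML, so it can be parsed with the standard library. A minimal sketch, using an inline sample in place of the network call (in the real code you would pass response.content to b64decode):

```python
import base64
import xml.etree.ElementTree as ET

# Stand-in for response.content: a base64-encoded XML document shaped
# like the one the endpoint returns.
sample = base64.b64encode(
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b"<DadosEconomicoFinanceiros><DadosGerais>"
    b"<CodNegociacaoCota>BTLG11</CodNegociacaoCota>"
    b"</DadosGerais></DadosEconomicoFinanceiros>"
)

xml_bytes = base64.b64decode(sample)   # bytes of the real XML
root = ET.fromstring(xml_bytes)        # parse into an element tree
ticker = root.find("DadosGerais/CodNegociacaoCota").text
print(ticker)  # BTLG11
```

Note that the stray double quotes wrapped around the response body (b'"PD94..."') are not base64 alphabet characters; by default b64decode discards such characters, so decoding response.content directly should still work.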

Related

Error 403 while using exchangelib to access Outlook Exchange server to read emails

I'm trying to read emails from a Microsoft Exchange server using EWS and exchangelib in Python for an email classification problem, but I am unable to connect to the server.
I've tried specifying the version, the auth_type, using a certificate (which gives an SSL verify error), and using the SMTP address in place of the username, and it still doesn't connect.
Here is my code:
from exchangelib import Credentials, Account, EWSDateTime, EWSTimeZone, Configuration, DELEGATE, IMPERSONATION, NTLM, ServiceAccount, Version, Build
USER_NAME = 'domain\\user12345'
ACCOUNT_EMAIL = 'john.doe@ext.companyname.com'
ACCOUNT_PASSWORD = 'John#1234'
ACCOUNT_SERVER = 'oa.company.com'
creds = Credentials(USER_NAME, ACCOUNT_PASSWORD)
config = Configuration(server=ACCOUNT_SERVER, credentials=creds)
account = Account(primary_smtp_address=ACCOUNT_EMAIL, config=config, autodiscover=False, access_type=DELEGATE)
print('connecting ms exchange server account...')
print(type(account))
print(dir(account))
account.root.refresh()
Here is the error I am getting:
TransportError: Unknown failure
Retry: 0
Waited: 10
Timeout: 120
Session: 26271
Thread: 15248
Auth type: <requests_ntlm.requests_ntlm.HttpNtlmAuth object at 0x00000259AA1BD588>
URL: https://oa.company.com/EWS/Exchange.asmx
HTTP adapter: <requests.adapters.HTTPAdapter object at 0x00000259AA0DB7B8>
Allow redirects: False
Streaming: False
Response time: 0.28100000000085856
Status code: 403
Request headers: {'User-Agent': 'python-requests/2.21.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'Keep-Alive', 'Content-Type': 'text/xml; charset=utf-8', 'Content-Length': '469', 'Authorization': 'NTLM TlRMTVNTUAADAAAAGAAYAG0AAAAOAQ4BhQAAAAwADABYAAAACQAJAGQAAAAAAAAAbQAAABAAEACTAQAANoKJ4gYBsR0AAAAP7Pyb+wBnMdrlhr4FKVqPbklDSUNJQkFOS0xURElQUlUzODE5MAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJTLmMBLPHowOJZ46XDs+4ABAQAAAAAAAAZdRefQLNUB+Fc6Z26oxvgAAAAAAgAYAEkAQwBJAEMASQBCAEEATgBLAEwAVABEAAEAEgBIAFkARABFAFgAQwBIADAAOAAEACAAaQBjAGkAYwBpAGIAYQBuAGsAbAB0AGQALgBjAG8AbQADADQASABZAEQARQBYAEMASAAwADgALgBpAGMAaQBjAGkAYgBhAG4AawBsAHQAZAAuAGMAbwBtAAUAIABpAGMAaQBjAGkAYgBhAG4AawBsAHQAZAAuAGMAbwBtAAcACAAGXUXn0CzVAQYABAACAAAACgAQAD9EWlwiiUs304wucsxnkyQAAAAAAAAAALfelDwG05hYOMUqY/e60PY=', 'Cookie': 'ClientId=SINZWMOJKWSKDGEKASFG; expires=Fri, 26-Jun-2020 10:13:02 GMT; path=/; HttpOnly'}
Response headers: {'Cache-Control': 'private', 'Server': 'Microsoft-IIS/8.5', 'request-id': 'ae4dee8d-34e0-471c-8252-b8c1056c8ea0', 'X-CalculatedBETarget': 'pqrexch05.domain.com', 'X-DiagInfo': 'PQREXCH05', 'X-BEServer': 'PQREXCH05', 'X-AspNet-Version': '4.0.30319', 'Set-Cookie': 'exchangecookie=681afc8a0905459182363cce9a98d021; expires=Sat, 27-Jun-2020 10:13:02 GMT; path=/; HttpOnly, X-BackEndCookie=S-1-5-21-1343024091-725345543-504838010-1766210=u56Lnp2ejJqBy87Iysqem5nSy8mbnNLLyZ7H0sfIysbSy5vMz8qdzcvPnpzHgYHNz87G0s/I0s3Iq87Pxc7Mxc/N; expires=Sat, 27-Jul-2019 10:13:02 GMT; path=/EWS; secure; HttpOnly', 'Persistent-Auth': 'true', 'X-Powered-By': 'ASP.NET', 'X-FEServer': 'PQREXCH05', 'Date': 'Thu, 27 Jun 2019 10:13:01 GMT', 'Content-Length': '0'}
Request data: b'<?xml version=\'1.0\' encoding=\'utf-8\'?>\n<s:Envelope xmlns:m="http://schemas.microsoft.com/exchange/services/2006/messages" xmlns:s="http://schemas.xmlsoap.org/soap/envelope/" xmlns:t="http://schemas.microsoft.com/exchange/services/2006/types"><s:Header><t:RequestServerVersion Version="Exchange2013_SP1"/></s:Header><s:Body><m:ResolveNames ReturnFullContactData="false"><m:UnresolvedEntry>ICICIBANKLTD\\IPRU38190</m:UnresolvedEntry></m:ResolveNames></s:Body></s:Envelope>'
Response data: b''
You might need to configure access policy for EWS using PowerShell.
For example (to allow all apps to use REST and EWS):
Set-OrganizationConfig -EwsApplicationAccessPolicy EnforceBlockList -EwsBlockList $null
Taken from Microsoft docs on Set-OrganizationConfig.
Please search for EwsApplicationAccessPolicy in the above link for more granular access control examples.

HTTP headers format using python's requests

I use python requests to capture a website's http headers. For example, this is a response header:
{'Connection': 'keep-alive',
 'Access-Control-Allow-Origin': '*',
 'cache-control': 'max-age=600',
 'Content-Type': 'text/html; charset=utf-8',
 'Expires': 'Fri, 19 Apr 2019 03:16:28 GMT',
 'Via': '1.1 varnish, 1.1 varnish',
 'X-ESI': 'on',
 'Verso': 'false',
 'Accept-Ranges': 'none',
 'Date': 'Fri, 19 Apr 2019 03:11:12 GMT',
 'Age': '283',
 'Set-Cookie': 'CN_xid=08f66bff-4001-4173-b4e2-71ac31bb58d7; Expires=Wed, 16 Oct 2019 03:11:12 GMT; path=/;, xid1=1; Expires=Fri, 19 Apr 2019 03:11:27 GMT; path=/;, verso_bucket=281; Expires=Sat, 18 Apr 2020 03:11:12 GMT; path=/;',
 'X-Served-By': 'cache-iad2133-IAD, cache-gru17122-GRU',
 'X-Cache': 'HIT, MISS',
 'X-Cache-Hits': '1, 0',
 'X-Timer': 'S1555643472.999490,VS0,VE302',
 'Content-Security-Policy': "default-src https: data: 'unsafe-inline' 'unsafe-eval'; child-src https: data: blob:; connect-src https: data: blob:; font-src https: data:; img-src https: data: blob:; media-src https: data: blob:; object-src https:; script-src https: data: blob: 'unsafe-inline' 'unsafe-eval'; style-src https: 'unsafe-inline'; block-all-mixed-content; upgrade-insecure-requests; report-uri https://l.com/csp/gq",
 'X-Fastly-Device-Detect': 'desktop',
 'Strict-Transport-Security': 'max-age=7776000; preload',
 'Vary': 'Accept-Encoding, Verso, Accept-Encoding',
 'content-encoding': 'gzip',
 'transfer-encoding': 'chunked'}
I noted that, in the several examples I tested, the headers I receive from requests are formatted as 'key': 'value' (please note the single quotes surrounding the key and the value). However, when I check the headers from Firefox -> Web Developer -> Inspector and choose to view them in raw format, I do not see quotes or commas:
HTTP/2.0 200 OK
date: Thu, 09 May 2019 18:49:07 GMT
expires: -1
cache-control: private, max-age=0
content-type: text/html; charset=UTF-8
strict-transport-security: max-age=31536000
content-encoding: br
server: gws
content-length: 55844
x-xss-protection: 0
x-frame-options: SAMEORIGIN
set-cookie: 1P_JAR=2019-05-09-18; expires=Sat, 08-Jun-2019 18:49:07 GMT; path=/; domain=.google.com
alt-svc: quic=":443"; ma=2592000; v="46,44,43,39"
X-Firefox-Spdy: h2
I need to know: does Python's requests module always add these quotes and commas? This is important for me, as I need to include/exclude them in the regex I use to analyze the headers.
The issue, I think, is that the request comes back as a dict, whereas the Firefox inspector gives you raw text. You can get mixed results if one of the value pairs has a numeric or boolean value, so when applying your regex you may want to use a try/except, strip the surrounding apostrophes, or just use the value as given.
It's not the requests module that's adding them. Requests represents headers as a dict, but you seem to be treating them as a string. When Python converts a dict to a string, you get the quotes, the colons, and the commas.
The right fix for your program is probably to treat the dictionary as a dictionary, not to convert it into a string. But if you really want the headers in raw string form, you should consider using a different tool, such as curl.
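To illustrate the difference: the quotes and commas only appear when the dict is converted to a string; indexing the dict gives the bare value. A small sketch with a made-up header dict:

```python
# A made-up response-header dict like the one requests returns.
headers = {"Content-Type": "text/html; charset=utf-8", "X-Cache": "HIT"}

as_string = str(headers)                 # quotes, colons, commas appear here
content_type = headers["Content-Type"]   # plain value, no quotes

print(as_string)
print(content_type)
```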

DNS query not specified or too small

I'm trying to make a python script to test if a server can answer in DNS-over-HTTPS.
So, I read this article and tried to make the same request in Python:
import requests
r=requests.get("https://cloudflare-dns.com/dns-query?name=example.com&type=A", headers={"accept":"application/dns-message"})
print(r.url)
print(r.headers)
print(r.status_code)
Here is the output
https://cloudflare-dns.com/dns-query?name=example.com&type=A
{'Access-Control-Allow-Origin': '*',
 'Vary': 'Accept-Encoding',
 'CF-RAY': '48b33f92aec83e4a-ZRH',
 'Expect-CT': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"',
 'Date': 'Tue, 18 Dec 2018 17:11:23 GMT',
 'Transfer-Encoding': 'chunked',
 'Server': 'cloudflare',
 'Connection': 'keep-alive'}
400
Based on what's written here, my request is "not specified or too small".
Does anyone see where I'm going wrong?
Thanks
The query-string form you are using to pass parameters needs application/dns-json as the Accept MIME type. With application/dns-message, the only query parameter is a dns key whose value is the full DNS message, encoded.
Compare:
curl -H 'accept: application/dns-json' 'https://cloudflare-dns.com/dns-query?name=example.com&type=AAAA'
(from https://developers.cloudflare.com/1.1.1.1/dns-over-https/json-format/)
with
curl -H 'accept: application/dns-message' -v 'https://cloudflare-dns.com/dns-query?dns=q80BAAABAAAAAAAAA3d3dwdleGFtcGxlA2NvbQAAAQAB' | hexdump
(from https://developers.cloudflare.com/1.1.1.1/dns-over-https/wireformat/)
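In Python, the equivalent of the first curl call only requires changing the Accept header in the original snippet. A sketch that builds the request without sending it, so the two pieces the answer mentions (query parameters and Accept header) are visible:

```python
import requests

# JSON wire format: name/type in the query string, application/dns-json
# in the Accept header (application/dns-message would instead require a
# single base64url-encoded `dns` parameter).
req = requests.Request(
    "GET",
    "https://cloudflare-dns.com/dns-query",
    params={"name": "example.com", "type": "AAAA"},
    headers={"accept": "application/dns-json"},
).prepare()

print(req.url)                # query string carries name and type
print(req.headers["accept"])  # application/dns-json
```

Sending it is then requests.Session().send(req), or simply requests.get with the same arguments.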

Request email audit export fails with status 400 and "Premature end of file."

According to https://developers.google.com/admin-sdk/email-audit/#creating_a_mailbox_for_export, I am trying to request the email audit export of a user in G Suite this way:
def requestAuditExport(account):
    credentials = getCredentials()
    http = credentials.authorize(httplib2.Http())
    url = 'https://apps-apis.google.com/a/feeds/compliance/audit/mail/export/helpling.com/' + account
    status, response = http.request(url, 'POST', headers={'Content-Type': 'application/atom+xml'})
    print(status)
    print(response)
And I get the following result:
{'content-length': '22', 'expires': 'Tue, 13 Dec 2016 14:19:37 GMT', 'date': 'Tue, 13 Dec 2016 14:19:37 GMT', 'x-frame-options': 'SAMEORIGIN', 'transfer-encoding': 'chunked', 'x-xss-protection': '1; mode=block', 'content-type': 'text/html; charset=UTF-8', 'x-content-type-options': 'nosniff', '-content-encoding': 'gzip', 'server': 'GSE', 'status': '400', 'cache-control': 'private, max-age=0', 'alt-svc': 'quic=":443"; ma=2592000; v="35,34"'}
b'Premature end of file.'
I cannot see where the problem is, can someone please give me a hint?
Thanks in advance!
Kay
Fix it by going into the Admin Console, opening the Manage API client access page under Security, and adding the Client ID and scope needed for the Directory API. For more information, check this document.
Okay, found out what was wrong and fixed it myself. Finally it looks like this:
http = getCredentials().authorize(httplib2.Http())
url = 'https://apps-apis.google.com/a/feeds/compliance/audit/mail/export/helpling.com/'+account
headers = {'Content-Type': 'application/atom+xml'}
xml_data = """<atom:entry xmlns:atom='http://www.w3.org/2005/Atom' xmlns:apps='http://schemas.google.com/apps/2006'> \
<apps:property name='includeDeleted' value='true'/> \
</atom:entry>"""
status, response = http.request(url, 'POST', headers=headers, body=xml_data)
Not sure if it was about the body or the header. It works now and I hope it will help others.
Thanks anyway.

Python 3.5.2 Iterating a get request

Hoping someone can tell me whether this script is functioning the way I intended it to, and if not explain what I am doing wrong.
The RESTful API I am using has a pageSize parameter ranging from 10 to 50; I used pageSize=50. There was another parameter, pageNumber, that I did not use.
So, I thought this would be the right way to make the get request:
# Python 3.5.2
import requests
r = requests.get(url, stream=True)
with open("file.txt", 'w', newline='', encoding='utf-8') as fd:
    text_out = r.text
    fd.write(text_out)
UPDATE
I think I understand a bit better. I read the documentation in more detail, but I am still missing how to get the entire data set from the API. Here is some more information:
verbs = requests.options(r.url)
print(verbs.headers)
{'Server': 'ninx', 'Date': 'Sat, 24 Dec 2016 22:50:13 GMT',
'Allow': 'OPTIONS,HEAD,GET', 'Content-Length': '0', 'Connection': 'keep-alive'}
print(r.headers)
{'Transfer-Encoding': 'chunked', 'Vary': 'Accept-Encoding',
'X-Entity-Count': '50', 'Connection': 'keep-alive',
'Content-Encoding': 'gzip', 'Date': 'Sat, 24 Dec 2016 23:59:07 GMT',
'Server': 'ninx', 'Content-Type': 'application/json; charset=UTF-8'}
Should I create a session and use the previously unused pageNumber parameter to create a new url until the 'X-Entity-Count' is zero? Or, is there a better way?
I found a discussion that helped clear this matter up for me...this updated question should probably be deleted...
API pagination best practices
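For the record, a sketch of the loop the update hints at: paging with pageNumber until the X-Entity-Count header reports an empty page. The parameter and header names come from the question; everything else (JSON list bodies, the stopping rule) is an assumption about this particular API. The HTTP function is passed in (typically requests.get) so the loop itself is easy to test:

```python
def fetch_all(base_url, get, page_size=50):
    """Collect every page from a pageSize/pageNumber-style API.

    `get` is the HTTP function to use, typically requests.get.
    Stops when X-Entity-Count says a page came back empty.
    """
    page, results = 1, []
    while True:
        r = get(base_url, params={"pageSize": page_size, "pageNumber": page})
        r.raise_for_status()
        if int(r.headers.get("X-Entity-Count", 0)) == 0:
            break                      # no rows left: done
        results.extend(r.json())       # assumes each page is a JSON list
        page += 1
    return results
```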
