KRS ekrs.ms.gov.pl get documents from requests - python-3.x

I want to get information about documents when I enter the company ID 0000000155.
Here is my pseudo code; I don't know where I should pass the company ID.
url = "https://ekrs.ms.gov.pl/rdf/pd/search_df"
payload = {}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
}
response = requests.request("GET", url, headers=headers, data=payload)
print(response.text)

First of all, you forgot to close the string in the 'Accept' dictionary value. That is to say, your headers should look like this:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
}
As for the payload, after checking the website you linked, I noticed that the ID is sent in the unloggedForm:krs2 parameter. You can add this to the payload like so:
payload = {
    # the KRS number must be a string: as a bare integer literal the leading
    # zeros would be a SyntaxError in Python 3, and they are significant here
    'unloggedForm:krs2': '0000000155'
}
However, in reality it's nearly impossible to scrape the website this way, because it is protected by reCAPTCHA. Your only options are either to use Selenium and hope that reCAPTCHA doesn't block you, or to somehow reverse engineer reCAPTCHA (unlikely).
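If you do try the Selenium route, a minimal sketch might look like this. It assumes the input field carries the unloggedForm:krs2 name observed in the network tab; the submit step, and whether reCAPTCHA lets it through, are untested.
# Minimal Selenium sketch (untested): fill in the KRS number and submit the form.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://ekrs.ms.gov.pl/rdf/pd/search_df")

# field name taken from the browser's network tab, as noted above
field = driver.find_element(By.NAME, "unloggedForm:krs2")
field.send_keys("0000000155")
field.submit()  # submits the enclosing form; reCAPTCHA may still intervene

print(driver.page_source)  # inspect the result page for document links
driver.quit()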

Related

Can not download excel file using requests python

Here is my attempt to download the Excel file; I can't get past the third step of posting the request. How do I make it work? Can someone please help me fix the last call?
import requests
from bs4 import BeautifulSoup
url = "http://lijekovi.almbih.gov.ba:8090/SpisakLijekova.aspx"
useragent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.76"
headers = {
    "User-Agent": useragent
}
session = requests.session() #session
r = session.get(url,headers=headers) #request to get cookies
soup = BeautifulSoup(r.text,"html.parser") #parsing values
viewstate = soup.find('input', {'id': '__VIEWSTATE'}).get('value')
viewstategenerator =soup.find('input', {'id': '__VIEWSTATEGENERATOR'}).get('value')
eventvalidation =soup.find('input', {'id': '__EVENTVALIDATION'}).get('value')
cookies = session.cookies.get_dict()
cookie=""
for k, v in cookies.items():
    cookie += k + "=" + v + ";"
cookie = cookie[:-1]
# headers copied from the captured browser request
headers = {
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.9',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.76',
    'X-KL-Ajax-Request': 'Ajax_Request',
    'X-MicrosoftAjax': 'Delta=true',
    'X-Requested-With': 'XMLHttpRequest',
    'Cookie': cookie
}
#post request data submission
data = {
    'ctl00$smMain': 'ctl00$MainContent$ReportGrid$ctl103$ReportGrid_top_4',
    '__EVENTTARGET': 'ctl00$MainContent$ReportGrid$ctl103$ReportGrid_top_4',
    '__VIEWSTATE': viewstate,
    '__VIEWSTATEGENERATOR': viewstategenerator,
    '__EVENTVALIDATION': eventvalidation,
    '__ASYNCPOST': 'true'
}
#need help with this part
result = requests.get(url,headers=headers,data=data)
print(result.headers)
data = {
    '__EVENTTARGET': 'ctl00$MainContent$btnExport',
    '__VIEWSTATE': viewstate,
}
#remove ajax request for the last call to download excel file
del headers['X-KL-Ajax-Request']
del headers['X-MicrosoftAjax']
del headers['X-Requested-With']
result = requests.post(url,headers=headers,data=data,allow_redirects=True)
print(result.headers)
print(result.status_code)
#print(result.text)
with open("test.xlsx","wb") as f:
f.write(result.content)
I am trying to export the Excel file without Selenium's help, but I can't get the last step to work. I need help converting the XMLHttpRequest call to pure requests in Python, without any Selenium.
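A pattern worth trying for that last call, sketched under the assumption that the server rotates its hidden ASP.NET fields between requests (untested against this site): keep every request on the same session, send the postback as a POST, and re-read __VIEWSTATE and friends from each response before the next call. Note that if the asynchronous response is an MS Ajax delta rather than full HTML, the fields have to be pulled out of its pipe-delimited payload instead of parsed with BeautifulSoup.
# Sketch only: reuse the session and refresh the hidden fields between postbacks.
def hidden_fields(html):
    soup = BeautifulSoup(html, "html.parser")
    return {
        name: soup.find('input', {'id': name}).get('value')
        for name in ('__VIEWSTATE', '__VIEWSTATEGENERATOR', '__EVENTVALIDATION')
    }

r = session.post(url, headers=headers, data=data)  # POST via the session, not requests.get
fields = hidden_fields(r.text)                     # may need delta parsing instead, see above
data = {'__EVENTTARGET': 'ctl00$MainContent$btnExport', **fields}
result = session.post(url, headers=headers, data=data)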

Steam Api Store Sales

I would like to write a little script with Node.js that tracks interesting Steam promos.
First of all I would like to retrieve the list of games on sale.
I tried several things, without success:
A GET request on the store.steampowered.com page (works, but only shows the first 50 results, because the rest only appear when you scroll to the bottom of the page).
Using the API, but that would mean retrieving the list of all games, and it would take too long to check whether each one is on sale.
If anyone has a solution, I'm interested.
Thanks a lot!
You can get the list of featured games by sending a GET request to https://store.steampowered.com/api/featuredcategories, though this may not give you all of the results you're looking for.
import requests
url = "http://store.steampowered.com/api/featuredcategories/?l=english"
res = requests.get(url)
print(res.json())
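The items currently on sale sit in the "specials" category of that JSON. Here is a small sketch of pulling them out; the field names are as observed in live responses, since the endpoint is not officially documented:
# 'specials', 'items' and 'discount_percent' are observed, undocumented fields.
data = res.json()
for game in data.get("specials", {}).get("items", []):
    print(game.get("name"), game.get("discount_percent"))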
You can also get all the games on sale by sending a GET request to https://steamdb.info/sales/ and doing some extensive HTML parsing. Note that SteamDB is not maintained by Valve at all.
Edit: The following script does the GET request; a rough parsing sketch follows it.
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'DNT': '1',
    'Alt-Used': 'steamdb.info',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Cache-Control': 'max-age=0',
    'TE': 'Trailers',
}
response = requests.get('https://steamdb.info/sales/', headers=headers)
print(response)
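As for the "extensive HTML parsing", here is a rough sketch. The tr.app rows and the data-appid attribute are guesses based on how the sales table has been rendered in the past and may have changed; SteamDB also sits behind aggressive bot protection, so the request itself may be refused:
from bs4 import BeautifulSoup

# Selector assumptions: each sale row has looked like <tr class="app" data-appid="...">.
soup = BeautifulSoup(response.text, "html.parser")
for row in soup.select("tr.app"):
    appid = row.get("data-appid")
    text = " ".join(row.get_text(" ", strip=True).split())
    print(appid, text[:80])  # app id plus a shortened dump of the row's cells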

Scraping links with BeautifulSoup from all pages in Amazon results in error

I'm trying to scrape product URLs from the Amazon Webshop, by going through every page.
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
products = set()
for i in range(1, 21):
    url = 'https://www.amazon.fr/s?k=phone%2Bcase&page=' + str(i)
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    print(soup)  # prints the HTML content saying Error on Amazon's side
    links = soup.select('a.a-link-normal.a-text-normal')
    for tag in links:
        url_product = 'https://www.amazon.fr' + tag.attrs['href']
        products.add(url_product)
Instead of getting the content of the page, I get a "Sorry, something went wrong on our end" HTML Error Page. What is the reason behind this? How can I successfully bypass this error and scrape the products?
Be aware that Amazon does not allow automated access to its data. You can double-check this via r.status_code, and the blocked response carries this message:
To discuss automated access to Amazon data please contact api-services-support@amazon.com
Therefore you can either use the Amazon API, or pass proxies to the GET request via the proxies argument (a sketch follows the code below).
Here's the correct way to pass headers to Amazon without getting blocked, and it works:
import requests
from bs4 import BeautifulSoup
headers = {
    'Host': 'www.amazon.fr',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'TE': 'Trailers'
}
for page in range(1, 21):
    r = requests.get(
        f'https://www.amazon.fr/s?k=phone+case&page={page}&ref=sr_pg_{page}', headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    # note the f-string prefix above: without it the literal '{page}' would be sent
    for item in soup.findAll('a', attrs={'class': 'a-link-normal a-text-normal'}):
        print(f"https://www.amazon.fr{item.get('href')}")

I have error The header content contains invalid characters when sending HTTP request

I have a function getBody, which gets the body from a URL. On some URL (I don't know exactly which one) I always get this error:
_http_outgoing.js:494
throw new TypeError('The header content contains invalid characters');
Those URLs mostly contain Danish accented characters, which may be the problem. I have set the header 'Content-Type': 'text/plain; charset=UTF-8', which sets the charset to UTF-8. Probably the Host header is the problem.
I have tried using punycode, or the url module, which converts the URL to ASCII, but those converted URLs did not work.
function getBody(n) {
    var url = n; // urls[n];
    url = (url.indexOf('http://') == -1 && url.indexOf('https://') == -1) ? 'http://' + url : url;
    // 'instance' is an HTTP client defined elsewhere (not shown)
    instance.get(url, {
        headers: {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
            'Content-Type': 'text/plain; charset=UTF-8'
        },
    });
}

HTTP header case

I am dealing with a server which does not accept uncapitalized headers, and unfortunately I can't do much about that.
var headers = {};
headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36';
headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8';
headers['Connection'] = 'keep-alive';
headers['Cache-Control'] = 'max-age=0';
headers['Upgrade-Insecure-Requests'] = '1';
headers['Accept-Encoding'] = 'gzip, deflate';
headers['Accept-Language'] = 'en-US,en;q=0.9,ru;q=0.8,hy;q=0.7';
request.post({url: 'http://10.10.10.10/login', headers: headers, ...
This in fact sends out the following:
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9,ru;q=0.8,hy;q=0.7
DNT: 1
host: 10.10.10.10
cookie: vvv=765936875155218941
cookie and host are lower-cased. How can I alter the request to send out capitalized headers?
This is not a Node.js issue but behavior of a particular library, request. In fact, it is not an issue at all, because HTTP headers are case-insensitive. request uses the caseless package to enforce lower-cased headers, so it's expected that user headers will be lower-cased where consistency is required.
These headers may be left as is, as they should be handled correctly by the remote server according to the specs.
It may be necessary to use specific header casing if a request is supposed to mimic a real client request. In that case the header object can be traversed manually before the request, e.g.:
const normalizeHeaderCase = require("header-case-normalizer");

const req = request.post('...', { headers: ... });
for (const [name, value] of Object.entries(req.headers)) {
    delete req.headers[name];
    req.headers[normalizeHeaderCase(name)] = value;
}
req.on('response', function(response) {...});
