I would like to write a little script with Node.js that tracks interesting Steam promos.
First of all, I would like to retrieve the list of games on sale.
I have tried several things, without success:
A GET request on the store.steampowered.com page (works, but only returns the first 50 results, because the rest only appears when you scroll to the bottom of the page)
Using the API, but that would require retrieving the list of all games, and it would take too long to check whether each one is on sale
If anyone has a solution, I'm interested.
Thanks a lot
You can get the list of featured games by sending a GET request to https://store.steampowered.com/api/featuredcategories, though this may not give you all of the results you're looking for.
import requests

# Fetch Steam's featured categories (specials, top sellers, new releases, ...)
url = "https://store.steampowered.com/api/featuredcategories/?l=english"
res = requests.get(url)
print(res.json())
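Since you are after promos specifically, the specials category in that JSON is the interesting part. A minimal sketch continuing from the snippet above (the key names are what the endpoint usually returns, but inspect res.json() to confirm before relying on them):

# "specials" holds the currently discounted featured games
# (key names assumed from the usual response shape -- verify against res.json())
for item in res.json().get("specials", {}).get("items", []):
    print(item.get("name"), item.get("discount_percent"), item.get("final_price"))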
You can also get all the games on sale by sending a GET request to https://steamdb.info/sales/ and doing some extensive HTML parsing. Note that SteamDB is not maintained by Valve at all.
Edit: The following script does the GET request.
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'DNT': '1',
    'Alt-Used': 'steamdb.info',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Cache-Control': 'max-age=0',
    'TE': 'Trailers',
}
response = requests.get('https://steamdb.info/sales/', headers=headers)
print(response)
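The parsing itself could look something like the sketch below. The selectors are assumptions rather than documented structure, so inspect the page markup in your browser and adjust them:

from bs4 import BeautifulSoup

# Parse the sales page fetched above; 'tr.app' is a guessed row selector --
# check steamdb's actual markup and adjust
soup = BeautifulSoup(response.text, 'html.parser')
for row in soup.select('tr.app'):
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    if cells:
        print(cells)  # e.g. name, discount, price ... (depends on the current markup)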
I want to get information about documents when I enter the company ID 0000000155.
Here is my pseudo code; I don't know where I should pass the company ID.
url = "https://ekrs.ms.gov.pl/rdf/pd/search_df"
payload={}
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
}
response = requests.request("GET", url, headers=headers, data=payload)
print(response.text)
First of all, you forgot to close the string after the 'Accept' dictionary value. That is to say, your headers should look like this:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
}
As for the payload, after checking the website you linked, I noticed that the ID is sent in the unloggedForm:krs2 parameter. You can add this to the payload like so:
payload = {
    # pass the ID as a string so the leading zeros are preserved
    # (a bare 0000000155 is not a valid integer literal in Python 3)
    'unloggedForm:krs2': '0000000155'
}
However, in reality, it's nearly impossible to scrape the website like so, because there is ReCaptcha built into the website. Your only options now are either to use Selenium and hope that ReCaptcha doesn't block you, or to somehow reverse engineer ReCaptcha (unlikely).
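If you do go the Selenium route, a minimal sketch might look like the following. The field name unloggedForm:krs2 comes from the payload above, while the submit-button selector is purely an assumption you would need to verify in the browser's dev tools:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # or webdriver.Chrome()
driver.get('https://ekrs.ms.gov.pl/rdf/pd/search_df')

# field name taken from the form above; adjust if the page changes
driver.find_element(By.NAME, 'unloggedForm:krs2').send_keys('0000000155')

# the submit-button selector below is a guess -- check it in dev tools
driver.find_element(By.CSS_SELECTOR, 'button[type="submit"]').click()

print(driver.page_source)  # parse the results from here, ReCaptcha permitting
driver.quit()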
I have an application running on AWS that makes a request to a page to pull meta tags using requests. I'm finding that page is allowing curl requests, but not allowing requests from the requests library.
Works:
curl https://www.seattletimes.com/nation-world/mount-st-helens-which-erupted-41-years-ago-starts-reopening-after-covid-closures/
Hangs Forever:
import requests
requests.get('https://www.seattletimes.com/nation-world/mount-st-helens-which-erupted-41-years-ago-starts-reopening-after-covid-closures/')
What is the difference between curl and requests here? Should I just spawn a curl process to make my requests?
Either of the agents below does indeed work. One can also use the user_agent module (available on PyPI) to generate random, valid web user agents.
import requests
agent = (
    "Mozilla/5.0 (X11; Linux x86_64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/85.0.4183.102 Safari/537.36"
)
# or can use
# agent = "curl/7.61.1"
url = ("https://www.seattletimes.com/nation-world/"
       "mount-st-helens-which-erupted-41-years-ago-starts-reopening-after-covid-closures/")
r = requests.get(url, headers={'user-agent': agent})
Or, using the user_agent module:
import requests
from user_agent import generate_user_agent
agent = generate_user_agent()
url = ("https://www.seattletimes.com/nation-world/"
"mount-st-helens-which-erupted-41-years-ago-starts-reopening-after-covid-closures/")
r = requests.get(url, headers={'user-agent': agent})
To explain further: requests sets a default user agent, and The Seattle Times is blocking this user agent. However, with python-requests one can easily change the header parameters in the request, as shown above.
To illustrate the default parameters:
r = requests.get('https://google.com/')
print(r.request.headers)
>>> {'User-Agent': 'python-requests/2.25.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
vs. the updated header parameter
agent = "curl/7.61.1"
r = requests.get('https://google.com/', headers={'user-agent': agent})
print(r.request.headers)
>>> {'user-agent': 'curl/7.61.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
I am trying to figure out a way to get the data out of the website using Splash, but I have had no results. I need your help figuring out a way to do that!
Edit:
Link: https://www.grubhub.com/search?location=10001 (link edited for brevity)
Similarly, I need this for different states based on ZIP code. The data I need is the name of each restaurant, its menu, and its ratings, for all available data.
I tried to reverse-engineer their API, but you might have to adjust it (and probably optimise it) according to your needs:
We will have to get an authentication bearer in order to use their API. To get the token we will first need a client_id:
Libraries I am using:
import requests
from bs4 import BeautifulSoup
import re
import json
Get the client_id:
session = requests.Session()
static = 'https://www.grubhub.com/eat/static-content-unauth?contentOnly=1'
soup = BeautifulSoup(session.get(static).text, 'html.parser')
client = re.findall("beta_[a-zA-Z0-9]+", soup.find('script', {'type': 'text/javascript'}).text)
# print(client)
Get the authentication bearer:
# define and add a proper header
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    'authorization': 'Bearer',
    'content-type': 'application/json;charset=UTF-8'
}
session.headers.update(headers)
# straight from networking tools. Device ID appears to accept any 10-digit value
data = '{"brand":"GRUBHUB","client_id":"' + client[0] + '","device_id":1234567890,"scope":"anonymous"}'
resp = session.post('https://api-gtm.grubhub.com/auth', data=data)
# refresh = json.loads(resp.text)['session_handle']['refresh_token']
access = json.loads(resp.text)['session_handle']['access_token']
# update header with new token
session.headers.update({'authorization': 'Bearer ' + access})
Make your request:
Here comes the part that is different from your request: their API uses a third-party service to get the longitude and latitude for a ZIP code (e.g. -73.99916077, 40.75368499 for your New York ZIP code). There might even be an option to change that: location=POINT(-73.99916077%2040.75368499) looks like it accepts other options as well.
grub = session.get('https://api-gtm.grubhub.com/restaurants/search/search_listing?orderMethod=delivery&locationMode=DELIVERY&facetSet=umamiV2&pageSize=20&hideHateos=true&searchMetrics=true&location=POINT(-73.99916077%2040.75368499)&facet=promos%3Atrue&facet=open_now%3Atrue&variationId=promosSponsoredRandom&sortSetId=umamiv3&countOmittingTimes=true')
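To pull out the restaurant names, ratings and menu data you asked about, you would then walk the JSON. The key names below are assumptions based on what a search listing response typically contains; print grub.json() and adjust accordingly:

# Key names here are assumptions -- inspect grub.json() to confirm the real structure
results = grub.json().get('results', [])
for restaurant in results:
    name = restaurant.get('name')
    ratings = restaurant.get('ratings', {})   # e.g. rating value / review count
    menu = restaurant.get('menu_items', [])   # hypothetical key -- the menu data may live elsewhere
    print(name, ratings, len(menu))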
I'm trying to scrape product URLs from the Amazon webshop by going through every page.
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
products = set()
for i in range(1, 21):
    url = 'https://www.amazon.fr/s?k=phone%2Bcase&page=' + str(i)
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content)
    print(soup)  # prints the HTML content saying Error on Amazon's side
    links = soup.select('a.a-link-normal.a-text-normal')
    for tag in links:
        url_product = 'https://www.amazon.fr' + tag.attrs['href']
        products.add(url_product)
Instead of getting the content of the page, I get a "Sorry, something went wrong on our end" HTML Error Page. What is the reason behind this? How can I successfully bypass this error and scrape the products?
Be informed that Amazon does not allow automated access to its data. You can double-check this by inspecting the response (e.g. r.status_code and the returned HTML), which will lead you to this error message:
To discuss automated access to Amazon data please contact api-services-support#amazon.com
Therefore you can either use the Amazon API, or pass proxies to the GET request via the proxies parameter. Note that requests expects a dict mapping scheme to proxy URL rather than a bare list, as sketched below.
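A minimal sketch of rotating proxies with requests (the proxy addresses are placeholders; you would supply your own working proxies):

import random
import requests

# Placeholder proxy addresses -- replace with real, working proxies of your own
list_proxies = [
    'http://203.0.113.10:8080',
    'http://203.0.113.11:3128',
]

proxy = random.choice(list_proxies)
r = requests.get(
    'https://www.amazon.fr/s?k=phone+case',
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0'},
    proxies={'http': proxy, 'https': proxy},  # dict mapping scheme to proxy URL
    timeout=10,
)
print(r.status_code)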
Here's the correct way to pass headers to Amazon without getting blocked, and it works:
import requests
from bs4 import BeautifulSoup
headers = {
    'Host': 'www.amazon.fr',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'TE': 'Trailers'
}
for page in range(1, 21):
    r = requests.get(
        f'https://www.amazon.fr/s?k=phone+case&page={page}&ref=sr_pg_{page}', headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    for item in soup.findAll('a', attrs={'class': 'a-link-normal a-text-normal'}):
        print(f"https://www.amazon.fr{item.get('href')}")
I'm creating an app in Python that uses the DocuSign API to delete user accounts. It appears I'll need the user's userId to accomplish this, so I need to make two calls: one to get the userId and one to delete the user.
The problem is that when I make a requests.get() for the user, I get every user account.
import sys
import requests

email = sys.argv[1]
account_id = "<account_id goes here>"
auth = 'Bearer <long token goes here>'
head = {
    'Accept': 'application/json',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8,fa;q=0.6,sv;q=0.4',
    'Cache-Control': 'no-cache',
    'Origin': 'https://apiexplorer.docusign.com',
    'Referer': 'https://apiexplorer.docusign.com/',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
    'Authorization': auth,
    'Content-Length': '100',
    'Content-Type': 'application/json'}
url = 'https://demo.docusign.net/restapi/v2/accounts/{}/users'.format(account_id)
data = {"users": [{"email": email}]}
response = requests.get(url, headers=head, json=data)
print(response.text)
Why does response.text contain every user? And how can I get just a single user's information based on the email address?
With the v2 API you can do this using the email_substring query parameter that allows you to search on specific users.
GET /v2/accounts/{accountId}/users?email_substring=youremail
https://developers.docusign.com/esign-rest-api/v2/reference/Users/Users/list
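As a rough sketch in Python (account_id, auth and email are the placeholders from your snippet above; the users/userId key names follow the Users:list response, so double-check them against the API reference linked above):

import requests

url = 'https://demo.docusign.net/restapi/v2/accounts/{}/users'.format(account_id)
headers = {'Accept': 'application/json', 'Authorization': auth}

# Filter the user list by email instead of sending a body with GET
response = requests.get(url, headers=headers, params={'email_substring': email})

# 'users' / 'userId' key names assumed from the Users:list response -- verify against the docs
users = response.json().get('users', [])
if users:
    print('userId:', users[0]['userId'])  # use this in the subsequent delete call
else:
    print('No user found for', email)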