Scraping an ASP page with Python - python-3.x

I currently work in Python. I need to collect information from this website:
http://appweb2.cndh.org.mx/SNA/ind_Autoridad_SM_3.asp?Id_Aut=1063&Id_Estado=10&valorEF=317
I tried using requests:
from bs4 import BeautifulSoup
import requests
import urllib.request
r = requests.post('http://appweb2.cndh.org.mx/SNA/ind_Autoridad_SM_3.asp?Id_Aut=1063&Id_Estado=10&valorEF=317')
print(r.url)
The output is
http://appweb2.cndh.org.mx/SNA/inicio.asp
It always returns the home page. I need help. Thanks.

The cookie has to be passed with the request to get the proper response, as the server identifies the requested page by its session cookie. Refer to the code below.
import requests
url = 'http://appweb2.cndh.org.mx/SNA/ind_Autoridad_SM_3.asp?Id_Aut=1063&Id_Estado=10&valorEF=317'
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Cookie': 'ASPSESSIONIDASTTTRTA=DIKNJIGADBLCIMCOKEMONJCH',  # session cookie copied from the browser
    'Host': 'appweb2.cndh.org.mx',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}
page = requests.get(url, headers=headers).content
print(page)
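A hard-coded ASPSESSIONID expires quickly. A minimal sketch of an alternative, assuming the server issues the session cookie when the home page (inicio.asp) is first visited: let a requests.Session capture the cookie automatically, then request the target page with the same session.
import requests

url = 'http://appweb2.cndh.org.mx/SNA/ind_Autoridad_SM_3.asp?Id_Aut=1063&Id_Estado=10&valorEF=317'

with requests.Session() as s:
    s.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'})
    # the server should set ASPSESSIONID on this first visit (an assumption)
    s.get('http://appweb2.cndh.org.mx/SNA/inicio.asp')
    page = s.get(url)  # the session cookie is sent automatically
    print(page.url)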

Related

Error 403 while scraping a website in Python using requests and Selenium

I am trying to scrape the website "https://coinatmradar.com/". I am using requests, BeautifulSoup, and Selenium (where required) to scrape data. But after a while, my IP got blocked by the website, which uses Cloudflare protection.
import requests
from bs4 import BeautifulSoup

country_url = "https://coinatmradar.com/country/226/bitcoin-atm-united-states/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
response = requests.get(country_url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
This is the part of the code I am using. I am getting response 403. Is there another way to make it work with both requests and Selenium?
Try setting your headers like this:
headers = {
    'Cookie': '_gcar_id=0696b46733edeac962b24561ce67970199ee8668',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
}
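Alternatively, a sketch that passes the cookie through requests' cookies parameter instead of a raw Cookie header (the _gcar_id value is just the example above and will differ per browser session):
import requests
from bs4 import BeautifulSoup

country_url = "https://coinatmradar.com/country/226/bitcoin-atm-united-states/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
cookies = {'_gcar_id': '0696b46733edeac962b24561ce67970199ee8668'}  # value copied from a real browser session

response = requests.get(country_url, headers=headers, cookies=cookies)
soup = BeautifulSoup(response.content, 'lxml')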

Getting 403 Forbidden status through Python requests

I am trying to scrape a website's content and getting a 403 Forbidden status. I have tried solutions like using sessions for cookies and mocking a browser through a 'User-Agent' header. Here is the code I have been using:
import requests

session = requests.Session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
}
page = session.get('https://www.sizeofficial.nl/product/zwart-new-balance-992/343646_sizenl/', headers=headers)
Note that this approach works on other websites; it is just this one that does not seem to work. I have even tried using the other headers my browser sends, and that does not work either. Another approach I tried was to first create a session cookie and then pass that cookie to session.get, but it still doesn't work. Is scraping the website not allowed, or am I still missing something?
I am using Python 3.8 and requests to achieve this.
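For reference, a minimal sketch of the "create a session cookie first" approach described above: visit the site root so any cookies land on the Session, then request the product page with the same session. (This can still return 403 if the block is based on bot detection rather than cookies.)
import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
})
session.get('https://www.sizeofficial.nl/')  # any cookies set here are stored on the session
page = session.get('https://www.sizeofficial.nl/product/zwart-new-balance-992/343646_sizenl/')
print(page.status_code)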

How can I send an HTTP POST request in Python with protobuf text as params?

I want to send an HTTP request using protobuf as the payload in Python. I copied the protobuf data from Charles Proxy (a web-debugging proxy tool).
The protobuf text of the request was:
1 { 1: "2345654456765" }
I tried this, but it is not working:
import requests
r = requests.post(
    'https://api.website.com/version/auth/login?locale=en',
    data={1: {1: '2345654456765'}},
    headers={
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4181.9 Safari/537.36',
        'platform': 'web',
    },
)
print(r.content)
I have no idea how to send this as the payload; I have always worked with JSON data. Does anyone know the solution?
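A hypothetical sketch of how this is usually done: protobuf is a binary format, so the message must be compiled from the service's .proto schema and serialized to bytes before sending. LoginRequest, login_pb2, and the field names below are made-up stand-ins for the numbered fields captured in Charles:
import requests
from login_pb2 import LoginRequest  # hypothetical module generated by protoc

msg = LoginRequest()
msg.credentials.device_id = '2345654456765'  # stands in for field 1 { 1: ... }

r = requests.post(
    'https://api.website.com/version/auth/login?locale=en',
    data=msg.SerializeToString(),  # raw protobuf bytes as the request body
    headers={
        'Content-Type': 'application/x-protobuf',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4181.9 Safari/537.36',
        'platform': 'web',
    },
)
print(r.content)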

Python: Access Denied at Random Points When Using Requests

I am using requests and BeautifulSoup to go through the popular comic store Comixology in order to make a list of all comic titles, issues, and release dates, so I am requesting a massive number of web pages. Unfortunately, partway through I will get the error:
you do not have access to (URL) on this server
I tried using a function that recursively retries the request, but this isn't working.
I'm not putting the whole code in because it is very long.
def getUrl(url):
    try:
        page = requests.get(url)
    except:
        getUrl(url)
    return page
The User-Agent request header contains a characteristic string that allows the network protocol peers to identify the application type, operating system, software vendor or software version of the requesting software user agent. Validating User-Agent header on server side is a common operation so be sure to use valid browser’s User-Agent string to avoid getting blocked.
(Source: http://go-colly.org/articles/scraping_related_http_headers/)
The only thing you need to do is set a legitimate user-agent, so add headers to emulate a browser:
# This is a standard user-agent of Chrome browser running on Windows 10
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }
Example:
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
resp = requests.get('http://example.com', headers=headers).text
soup = BeautifulSoup(resp, 'html.parser')
Additionally, you can add another set of headers to pretend to be a legitimate browser. Add some more headers like this:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip',
    'DNT': '1',  # Do Not Track request header
    'Connection': 'close'
}
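Separately, the getUrl helper in the question never assigns the result of its recursive call (so page is unbound after a failure) and has no recursion limit. A bounded retry loop is a safer sketch; the retry count and delay below are arbitrary choices:
import time
import requests

def get_url(url, headers=None, retries=3, delay=2):
    for attempt in range(retries):
        try:
            return requests.get(url, headers=headers)
        except requests.RequestException:
            time.sleep(delay)  # brief pause before retrying
    raise RuntimeError(f'failed to fetch {url} after {retries} attempts')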

Can't access certain sites requests.get in Python 3

>>> from requests import get
>>> get('http://www.fb.com')
<Response [200]>
>>> get('http://www.subscene.com')
<Response [403]>
I'm trying to build a web scraper to scrape and download subtitles, but I'm unable to request any subtitle pages as they return a response code 403.
HTTP status code 403 Forbidden means:
the server understood the request, but is refusing to fulfill it. (Source: RFC 2616)
The server identified your script as something other than a regular browser (Chrome, Firefox, etc.) and is refusing to "speak" with it. It's very common for sites to do this to deter scrapers, which is exactly what you're trying to build...
A workaround is to set a user-agent in your headers, like so:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
url = "http://www.subscene.com"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
response = requests.get(url, headers=headers)
print(response) # <Response [200]>
But I advise you to look for a site that provides some sort of API; relying on scraping isn't the best approach.
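Since the end goal is downloading subtitles, here is a sketch of reusing the same headers to stream a file to disk; the file URL below is a made-up placeholder:
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
file_url = 'http://www.subscene.com/some/subtitle.zip'  # hypothetical subtitle file URL

response = requests.get(file_url, headers=headers, stream=True)
with open('subtitle.zip', 'wb') as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)  # write the file to disk in 8 KB chunks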
