How can I circumvent bot protection when scraping full NYTimes articles?

I am trying to scrape full book reviews from the New York Times in order to perform sentiment analysis on them. I am aware of the NY Times API and am using it to get book review URLs, but I need to devise a scraper to get the full article text, as the API only gives a snippet. I believe that nytimes.com has bot protection to prevent bots from scraping the website but I know there are ways to circumvent it.
I found a Python scraper that works and can pull the full text from nytimes.com, but I would prefer to implement my solution in Go. Should I just port it to Go, or is that solution unnecessarily complex? I have already tried changing the User-Agent header, but everything I do in Go ends in an infinite redirect loop error.
Code:
package main
import (
    "io/ioutil"
    "log"
    "math/rand"
    "net/http"
    "time"
)
func main() {
    rand.Seed(time.Now().Unix())
    userAgents := [5]string{
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0",
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:41.0) Gecko/20100101 Firefox/41.0",
    }
    url := "http://www.nytimes.com/2015/10/25/books/review/the-tsar-of-love-and-techno-by-anthony-marra.html"
    client := &http.Client{}
    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        log.Fatalln(err)
    }
    req.Header.Set("User-Agent", userAgents[rand.Intn(len(userAgents))])
    resp, err := client.Do(req)
    if err != nil {
        log.Fatalln(err)
    }
    defer resp.Body.Close()
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Fatalln(err)
    }
    log.Println(string(body))
}
Results in:
2016/12/05 21:57:53 Get http://www.nytimes.com/2015/10/25/books/review/the-tsar-of-love-and-techno-by-anthony-marra.html?_r=4: stopped after 10 redirects
exit status 1
Any help is appreciated! Thank you!

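You just have to add cookies to your client. The default http.Client has no cookie jar, so it never sends back cookies set during a redirect; that is most likely why the site keeps redirecting. Create a jar (import "net/http/cookiejar") and attach it to the client:
cookieJar, err := cookiejar.New(nil)
if err != nil {
    log.Fatalln(err)
}
client := &http.Client{Jar: cookieJar}
resp, err := client.Do(req)
if err != nil {
    log.Fatalln(err)
}
// resp now contains the page; you can print it
// to the console or save it to a file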

Related

python request get url with predefined cookies and session id

From the Chrome browser I captured the following headers and values:
accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
user-agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36
cookies:_gcl_au=<some number>; _ga=<some number>; _gid=<some string>; csrftoken=<some string>; sessionid=<some string>
I want to fetch (HTTP GET) the same URL using the Python requests library with the same cookies and session ID. My csrftoken and sessionid are already defined, and I want to use them for requests.get. Is this possible, and if so, how?
Thanks

Can not download excel file using requests python, I can't get the third step of posting request to download excel file. here is my try

Here is my attempt to download the Excel file. How do I make it work? Can someone please help me fix the last call?
import requests
from bs4 import BeautifulSoup
url = "http://lijekovi.almbih.gov.ba:8090/SpisakLijekova.aspx"
useragent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.76"
headers = {
    "User-Agent": useragent
}
session = requests.session() #session
r = session.get(url,headers=headers) #request to get cookies
soup = BeautifulSoup(r.text,"html.parser") #parsing values
viewstate = soup.find('input', {'id': '__VIEWSTATE'}).get('value')
viewstategenerator =soup.find('input', {'id': '__VIEWSTATEGENERATOR'}).get('value')
eventvalidation =soup.find('input', {'id': '__EVENTVALIDATION'}).get('value')
cookies = session.cookies.get_dict()
cookie = ""
for k, v in cookies.items():
    cookie += k + "=" + v + ";"
cookie = cookie[:-1]  # drop the trailing semicolon
#header copied from the requests.
headers = {
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.9',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.76',
    'X-KL-Ajax-Request': 'Ajax_Request',
    'X-MicrosoftAjax': 'Delta=true',
    'X-Requested-With': 'XMLHttpRequest',
    'Cookie': cookie
}
#post request data submission
data = {
    'ctl00$smMain': 'ctl00$MainContent$ReportGrid$ctl103$ReportGrid_top_4',
    '__EVENTTARGET': 'ctl00$MainContent$ReportGrid$ctl103$ReportGrid_top_4',
    '__VIEWSTATE': viewstate,
    '__VIEWSTATEGENERATOR': viewstategenerator,
    '__EVENTVALIDATION': eventvalidation,
    '__ASYNCPOST': 'true'
}
#need help with this part
result = requests.get(url,headers=headers,data=data)
print(result.headers)
data = {
    '__EVENTTARGET': 'ctl00$MainContent$btnExport',
    '__VIEWSTATE': viewstate,
}
#remove ajax request for the last call to download excel file
del headers['X-KL-Ajax-Request']
del headers['X-MicrosoftAjax']
del headers['X-Requested-With']
result = requests.post(url,headers=headers,data=data,allow_redirects=True)
print(result.headers)
print(result.status_code)
#print(result.text)
with open("test.xlsx", "wb") as f:
    f.write(result.content)
I am trying to export the Excel file without Selenium, but I cannot get the last step to work. I need help converting the XMLHttpRequest into plain requests calls in Python, without any Selenium.

Aiohttp+Asyncio Seems To Be Inconsistent in Tripadvisor Travel Site

I was trying to asynchronously request page data from the Tripadvisor travel site using aiohttp+asyncio, but on multiple occasions the get() method gets stuck for almost a minute and then raises a TimeoutError.
I wrote a similar script using the requests library and confirmed that there are times when the requests version works while the aiohttp+asyncio version does not.
Here is the code:
Using aiohttp + asyncio
from aiohttp import ClientSession
import asyncio
home_url = 'https://www.tripadvisor.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/93.0.4577.63 Safari/537.36'
}
async def main():
    async with ClientSession(headers=headers) as session:
        tourist_sites_url = home_url + '/Attractions-g294245-Activities-a_allAttractions.true-Philippines.html'
        async with session.get(tourist_sites_url) as response:
            print(f'{response.status=}\n')
            print(await response.text())
if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
Using requests
from requests import Session
home_url = 'https://www.tripadvisor.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/93.0.4577.63 Safari/537.36'
}
def main():
    with Session() as session:
        tourist_sites_url = home_url + '/Attractions-g294245-Activities-a_allAttractions.true-Philippines.html'
        response = session.get(tourist_sites_url, headers=headers)
        print(f'{response.status_code=}\n')
        print(response.text)
if __name__ == '__main__':
    main()
What shall I do in order for the code with aiohttp+asyncio to work on tripadvisor website?
Thank you very much!

KRS ekrs.ms.gov.pl get documents from requests

I want to get information about a company's documents by entering the company ID 0000000155.
Here is my pseudocode; I don't know where I should pass the company ID.
url = "https://ekrs.ms.gov.pl/rdf/pd/search_df"
payload={}
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
}
response = requests.request("GET", url, headers=headers, data=payload)
print(response.text)

First of all, you forgot to close the string after the 'Accept' dictionary value. Your headers should look like this:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
}
As for the payload, after checking the website you linked, I noticed that the ID is sent in the unloggedForm:krs2 parameter. You can add this to the payload like so:
payload = {
    'unloggedForm:krs2': '0000000155'  # a string: leading zeros are not valid in a Python 3 integer literal
}
However, in practice it is nearly impossible to scrape the website this way, because it has reCAPTCHA built in. Your only options are either to use Selenium and hope that reCAPTCHA doesn't block you, or to somehow reverse engineer reCAPTCHA (unlikely).
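To show how the ID travels in the form data, here is a minimal sketch using only the standard library. It assumes the form is submitted as a URL-encoded POST body, which is my assumption and not something confirmed above. The helper name build_search_request is hypothetical, and the request is only built, never sent, so the captcha problem described above still applies to a real submission.

```python
from urllib.parse import urlencode
from urllib.request import Request

SEARCH_URL = "https://ekrs.ms.gov.pl/rdf/pd/search_df"


def build_search_request(krs_id: str) -> Request:
    """Build (but do not send) a form submission carrying the company ID.

    Hypothetical helper: assumes a URL-encoded POST that uses the
    'unloggedForm:krs2' field named in the answer.
    """
    # Keep the ID as a string so its leading zeros survive.
    payload = {"unloggedForm:krs2": krs_id}
    body = urlencode(payload).encode("ascii")
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    # Supplying data makes urllib treat the request as a POST.
    return Request(SEARCH_URL, data=body, headers=headers)


if __name__ == "__main__":
    req = build_search_request("0000000155")
    print(req.get_method())          # POST
    print(req.data.decode("ascii"))  # unloggedForm%3Akrs2=0000000155
```

Note that the colon in the field name is percent-encoded as %3A in the body; the server decodes it back to unloggedForm:krs2.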

hangs on open url with urllib (python3)

I am trying to open a URL with Python 3:
import urllib.request
fp = urllib.request.urlopen("http://lebed.com/")
mybytes = fp.read()
mystr = mybytes.decode("utf8")
fp.close()
print(mystr)
But it hangs on the second line.
What is the reason for this problem, and how can I fix it?

I suppose the reason is that the site does not respond to requests it identifies as coming from a bot. You need to imitate a browser visit by sending browser headers along with your request:
import urllib.request
url = "http://lebed.com/"
req = urllib.request.Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)
f = urllib.request.urlopen(req)
Tried this one on my system and it works.
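A slightly fuller sketch of the same idea, with a timeout added so that a stalled connection raises an error instead of hanging indefinitely. The function name fetch, the 10-second default, the optional opener parameter, and the throwaway local server used to check which header is actually sent are all my own assumptions for illustration; they are not part of the code above.

```python
import http.server
import threading
import urllib.request

BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36"
)


def fetch(url, user_agent=BROWSER_UA, timeout=10.0, opener=None):
    """Fetch url with a browser-like User-Agent and a timeout."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    open_url = opener.open if opener is not None else urllib.request.urlopen
    # With a timeout, a server that never answers raises an exception.
    with open_url(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class EchoUserAgent(http.server.BaseHTTPRequestHandler):
    """Local test handler that replies with the User-Agent it received."""

    def do_GET(self):
        body = self.headers.get("User-Agent", "").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def demo():
    """Fetch from a local echo server and return the User-Agent it saw."""
    server = http.server.HTTPServer(("127.0.0.1", 0), EchoUserAgent)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # Skip any configured proxy so the request really reaches localhost.
    direct = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        return fetch(f"http://127.0.0.1:{server.server_port}/", opener=direct)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo() == BROWSER_UA)
```

For the real site you would call fetch("http://lebed.com/") directly; if the server still does not answer, the call fails after the timeout rather than blocking forever.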

I agree with Arpit Solanki. Here are the request headers for a failed request versus a successful one.
Failed
GET / HTTP/1.1
Accept-Encoding: identity
Host: www.lebed.com
Connection: close
User-Agent: Python-urllib/3.5
Success
GET / HTTP/1.1
Accept-Encoding: identity
Host: www.lebed.com
Connection: close
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36
