Node horseman not working on AngularJS select options - node.js

I am trying to change AngularJS-based select options using horseman, but it is not working out for me.
The website is https://www.cars.com/.
I can't seem to change the make, model, and price drop-downs.
horseman
  .userAgent('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36')
  .open('https://www.cars.com/').catch(function(error){console.log(error)})
  .select('div.sw-input-group-make > select', car.make)
  .wait(5000)
  .select('.sw-input-group-model > select', car.model)
  .select('.sw-input-group-price > select', car.price)
  .type('.zip-field',car.zipcode)
  .screenshot("C:/Users/Himanshu/Desktop/upwork/car/big1.png").log()
  .click('.sw-input-group-submit >input').catch(function(error){console.log(error)})
  .waitForNextPage().catch(function(error){console.log(error)})
  .screenshot("C:/Users/Himanshu/Desktop/upwork/car/big.png").log()
  .close();
});

Related

Python requests: GET a URL with predefined cookies and session id

From the Chrome browser I fetched some headers and values, as below:
accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
user-agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36
cookies:_gcl_au=<some number>; _ga=<some number>; _gid=<some string>;
csrftoken=<some string>; sessionid=<some string>
I want to fetch (HTTP GET) the same URL using the Python requests library, with the same cookies and session id. My csrftoken and sessionid are already defined, and I want to use them with requests.get. Is this possible, and if so, how do I do it?
Thanks
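For reference, here is a minimal sketch of how this could look with the requests library; the URL and the cookie values are placeholders for whatever you already have:

import requests

url = "https://example.com/some/page"  # placeholder for the actual URL
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
}
# cookies are passed as a plain dict and sent along as a Cookie header
cookies = {
    "csrftoken": "<your csrftoken>",
    "sessionid": "<your sessionid>",
}
response = requests.get(url, headers=headers, cookies=cookies)
print(response.status_code)

If you make several requests, a requests.Session() with session.cookies.update(cookies) keeps the same cookies across calls.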

How to detect if an initialized Selenium driver is headless

Let's say I have this code
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument("window-size=1920,1080")
browser=webdriver.Chrome(options=options,executable_path=r"chromedriver.exe")
browser.execute_cdp_cmd('Network.setUserAgentOverride',
    {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36'})
How can I check programmatically whether the initialized browser is headless or not? For example, if I call
browser.get_window_size() I get {'width': 1920, 'height': 1080}, and if I run browser.execute_script('return navigator.languages') it returns ['en-US', 'en'].
What I'm looking for is something like browser.is_headless() that tells me whether a given browser is headless or not.
options = webdriver.ChromeOptions()
options.headless
Will return True if the --headless argument is set in ChromeOptions(); otherwise, it will return False.
Based on the official Selenium documentation
options.headless
should return whether headless is set or not
If you're using Firefox (tested on Firefox 106):
if driver.caps.get("moz:headless", False):
    print("Firefox is headless")

BeautifulSoup Python web scraping: missing HTML main body

I am using BeautifulSoup to scrape this web page: https://greyhoundbet.racingpost.com//#results-dog/race_id=1765914&dog_id=527442&r_date=2020-03-19&track_id=61&r_time=11:03
Result: I get the JavaScript and CSS.
Desired output: I need the main HTML.
I used this code:
import requests
from bs4 import BeautifulSoup
url = 'https://greyhoundbet.racingpost.com//#results-dog/race_id=1765914&dog_id=527442&r_date=2020-03-19&track_id=61&r_time=11:03'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
page = requests.get(url,headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
I'm afraid you won't be able to get it directly using BeautifulSoup, because the page loads first and then JavaScript loads the data.
That is one of the library's limitations; you may need to use Selenium.
Please check the answers on this question.
I think what you're looking for is this:
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
It will contain the text from the page, including the HTML tags.
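As the first answer suggests, the JavaScript has to run before the main HTML exists, so one option is to let a browser render the page and hand the result to BeautifulSoup. Here is a rough sketch with Selenium, assuming chromedriver is available, and using a fixed sleep as a crude stand-in for a proper wait:

import time
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://greyhoundbet.racingpost.com//#results-dog/race_id=1765914&dog_id=527442&r_date=2020-03-19&track_id=61&r_time=11:03'

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
try:
    driver.get(url)
    time.sleep(3)  # crude wait for the script-driven content to load
    # page_source holds the DOM after the scripts have run
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    print(soup.get_text())
finally:
    driver.quit()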

Hangs on opening a URL with urllib (Python 3)

I am trying to open a URL with Python 3:
import urllib.request
fp = urllib.request.urlopen("http://lebed.com/")
mybytes = fp.read()
mystr = mybytes.decode("utf8")
fp.close()
print(mystr)
But it hangs on the second line.
What is the reason for this problem, and how can I fix it?
I suppose the reason is that the site does not allow visits from robots. You need to fake a browser visit by sending browser headers along with your request:
import urllib.request
url = "http://lebed.com/"
req = urllib.request.Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)
f = urllib.request.urlopen(req)
Tried this one on my system and it works.
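From there the response can be read and decoded the same way as in the question's snippet:

mybytes = f.read()
mystr = mybytes.decode("utf8")
f.close()
print(mystr)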
Agree with Arpit Solanki. Below is the output for a failed request vs. a successful one.
Failed
GET / HTTP/1.1
Accept-Encoding: identity
Host: www.lebed.com
Connection: close
User-Agent: Python-urllib/3.5
Success
GET / HTTP/1.1
Accept-Encoding: identity
Host: www.lebed.com
Connection: close
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36

How can I circumvent bot protection when scraping full NYTimes articles?

I am trying to scrape full book reviews from the New York Times in order to perform sentiment analysis on them. I am aware of the NY Times API and am using it to get book review URLs, but I need to devise a scraper to get the full article text, as the API only gives a snippet. I believe that nytimes.com has bot protection in place to prevent scraping, but I know there are ways to circumvent it.
I found this python scraper that works and can pull full text from nytimes.com, but I would prefer to implement my solution in Go. Should I just port this to Go or is this solution unnecessarily complex? I have already played around with changing the User-Agent header but everything that I do in Go ends in an infinite redirect loop error.
Code:
package main

import (
    //"fmt"
    "io/ioutil"
    "log"
    "math/rand"
    "net/http"
    "time"
    //"net/url"
)

func main() {
    rand.Seed(time.Now().Unix())
    userAgents := [5]string{
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0",
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:41.0) Gecko/20100101 Firefox/41.0",
    }
    url := "http://www.nytimes.com/2015/10/25/books/review/the-tsar-of-love-and-techno-by-anthony-marra.html"
    client := &http.Client{}
    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        log.Fatalln(err)
    }
    req.Header.Set("User-Agent", userAgents[rand.Intn(len(userAgents))])
    resp, err := client.Do(req)
    if err != nil {
        log.Fatalln(err)
    }
    defer resp.Body.Close()
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Fatalln(err)
    }
    log.Println(string(body))
}
Results in:
2016/12/05 21:57:53 Get http://www.nytimes.com/2015/10/25/books/review/the-tsar-of-love-and-techno-by-anthony-marra.html?_r=4: stopped after 10 redirects
exit status 1
Any help is appreciated! Thank you!
You just have to add cookies to your client; the cookie jar comes from the "net/http/cookiejar" package:
import "net/http/cookiejar"

var cookieJar, _ = cookiejar.New(nil)
var client = &http.Client{Jar: cookieJar}

resp, err := client.Do(req)
if err != nil {
    log.Fatalln(err)
}
// now the response contains all you need and
// you can show it on the console or save it to a file
