Python Web data collecting

Python Web data collecting - python-3.x

I have 100 BOL (bill of lading) numbers that I need to look up on the site below. However, I can't find a URL that I can substitute each number into so the search runs automatically. Can anyone help?
Tracking codes:
MSCUZH129687
MSCUJZ365758
The page I'm working on:
https://www.msc.com/track-a-shipment
import requests
url = 'https://www.msc.com/track-a-shipment'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3346.9 Safari/537.36',
    'Referer': 'https://www.msc.com/track-a-shipment'
}
form_data = {'first': 'true',
             'pn': '1',
             'kd': 'python'}
def getJobs():
    res = requests.post(url=url, headers=HEADERS, data=form_data)
    result = res.json()
    jobs = result['Location']['Description']['responsiveTd']
    print(type(jobs))
    for job in jobs:
        print(job)
getJobs()

tl;dr: You'll likely need a browser automation tool such as Selenium, driving a headless browser, to open the page, enter the code, and click the search button.
The URL you want to retrieve is generated by the JavaScript that runs when you click Search.
The search button posts that generated value to their server, so when you are redirected to the link the server knows which response to return.
To generate the link yourself, you would have to analyze the JavaScript, work out how it builds the code, post that code to their server, and then make a follow-up GET request to retrieve the results, as the site's ASP.NET framework does.
Alternatively, you can use Selenium with a headless browser to open the page, enter the code, and click the search button. Once the browser has navigated to the results page, you can parse it from there.
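The Selenium approach could be sketched roughly as follows. This is a minimal sketch, not a tested scraper for this site: the two CSS selectors are hypothetical placeholders that would have to be replaced after inspecting the live page, and it assumes that pressing Enter in the search box submits the search. The helper names (lookup_with_selenium, track_all, main) are made up for illustration.

```python
TRACK_URL = "https://www.msc.com/track-a-shipment"

# HYPOTHETICAL selectors: inspect the live page and replace both of them.
INPUT_SELECTOR = "input#tracking-number"
RESULT_SELECTOR = "div.tracking-results"


def lookup_with_selenium(driver, code, timeout=20):
    """Enter one tracking code in the search box and return the result text."""
    # Imported here so track_all() below can be used without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(TRACK_URL)
    wait = WebDriverWait(driver, timeout)
    box = wait.until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, INPUT_SELECTOR))
    )
    box.clear()
    box.send_keys(code, Keys.ENTER)  # assumption: Enter submits the search
    result = wait.until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, RESULT_SELECTOR))
    )
    return result.text


def track_all(codes, lookup):
    """Run lookup(code) for every code; one failure does not stop the batch."""
    results = {}
    for code in codes:
        try:
            results[code] = lookup(code)
        except Exception as exc:
            results[code] = f"error: {exc}"
    return results


def main():
    """Look up the two codes from the question in a headless Chrome session."""
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        codes = ["MSCUZH129687", "MSCUJZ365758"]
        results = track_all(codes, lambda c: lookup_with_selenium(driver, c))
        for code, text in results.items():
            print(code, text)
    finally:
        driver.quit()
```

Calling main() runs the batch. Passing the lookup in as a function keeps the loop independent of the browser, so the batching and error handling can be tried out with a stub before the real selectors are known.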

Related

Trackjs: ignore rules by token in user agent

In TrackJS, some user agents are parsed as normal browsers, e.g.:
Mozilla/5.0 (Linux; Android 7.0; SM-G930V Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36 (compatible; Google-Read-Aloud; +https://support.google.com/webmasters/answer/1061943)
Chrome Mobile 59.0.3071
I tried to handle this with ignore rules in the settings, but it doesn't work.
So I need to filter errors by a token in the user agent.
Is it possible to do this without JS?
More similar user agents: https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers

The TrackJS UI doesn't allow you to create Ignore Rules against the raw UserAgent, only the parsed browser and operating system. Instead, use the client-side ignore capability with the onError function.
Build your function to detect the tokens you want to exclude, and return false from the function if you don't want it to be sent.

Web Scraping from Oddschecker using BeautifulSoup

I was previously able to scrape data from https://www.oddschecker.com/ using BeautifulSoup, however, now all I am getting is the following 403 status:
import requests
import bs4
result = requests.get("https://www.oddschecker.com/")
result.text
Output:
<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body bgcolor="white">\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n
I want to know if this is the same for all users on this website or if there is a way to navigate around this (via another web scraping package or other code) and access the actual data visible on the site.

Just add a User-Agent header. The site appears to reject requests that do not look like they come from a browser.
import requests
url = 'https://www.oddschecker.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
result = requests.get(url, headers=headers)
print(result.text)
You can also use Selenium.
from selenium import webdriver
driver = webdriver.Chrome()  # assumes Chrome and a matching driver are available
driver.get("https://www.oddschecker.com/")
print(driver.page_source)
driver.quit()

python 3: received 403:forbidden error when using request

Either of the following two calls results in HTTP Error 403: Forbidden.
requests.get('http://www.allareacodes.com')
urllib.request.urlopen('http://www.allareacodes.com')
However, I am able to browse this website in Chrome and view its source. wget in my Cygwin install is also able to fetch the HTML source.
Does anyone know how to fetch the source of this website using Python packages alone?

Your requests call discards the response. To see the HTML, store the response and print its text:
import requests
r = requests.get('http://www.allareacodes.com')
print(r.text)
In your case, however, the website appears to reject requests whose headers identify them as coming from a script rather than a browser. As a solution, send a browser-like User-Agent header so that the website treats the request as coming from an actual user.
Example:
import requests
r = requests.get('http://www.allareacodes.com', headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
})
print(r.text)
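Since the question also tried urllib.request.urlopen, the same header trick can be sketched with the standard library alone. This is a minimal sketch: build_request and fetch are illustrative helper names, and whether the site accepts this particular User-Agent string today is an assumption, not something verified here.

```python
import urllib.request

# Browser-like User-Agent string, copied from the requests example above.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Return a GET request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Download the page and decode it using the charset the server reports."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Calling print(fetch('http://www.allareacodes.com')) would then print the page source, provided the server accepts the header; urlopen raises urllib.error.HTTPError if it still answers 403.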

Getting content from page which checks for js

I am using the "request" module to get page contents with the following headers:
var headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:24.0) Gecko/20100101 Firefox/24.0'
};
Still, the page I am trying to fetch somehow returns different content than View > Source in the browser (it looks like it detects JavaScript support). Before diving into PhantomJS (which I want to avoid due to performance limitations), is there any way to get the HTML as it appears in the browser?
Thanks

windows phone browser in desktop mode

How can I force a Windows Phone to use the desktop view mode in the mobile browser?
In the settings it is possible to set the browser to use the desktop view. Some features seem to be missing in the mobile view, which causes my site to display incorrectly.

If you want to make websites display in desktop mode in the WebBrowser control, you must change its user agent. You can do so using this:
webBrowser.Navigate(new Uri("http://www.google.com"), null, "User-Agent: Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)");
That code changes the WebBrowser's user agent to that of desktop Internet Explorer 10.
However, it will only change the user agent for the page navigated to. When users click links, the user agent will be changed back. To fix this, handle the WebBrowser's Navigating event like this:
private void webBrowser_Navigating(object sender, WebBrowserNavigatingEventArgs e)
{
    string url = e.Uri.ToString();
    if (!url.Contains("#changedua"))
    {
        e.Cancel = true;
        url = url + "#changedua";
        webBrowser.Navigate(new Uri(url), null, "User-Agent: Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)");
    }
}
In this code, we check to see if the url contains a flag, "#changedua". If it does, we allow the navigation. If it does not, we cancel the navigation. Then, we navigate again using our custom user agent, and adding the flag to show that it is valid.
