I am using the "request" module to get page contents with the following headers:
var headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:24.0) Gecko/20100101 Firefox/24.0'
};
Still, the page I am trying to fetch somehow displays different content than View > Source in the browser (it looks like the site detects JavaScript support). Before diving into PhantomJS (which I want to avoid due to performance limitations), is there any way to get the HTML as it appears in the browser?
Thanks
I was previously able to scrape data from https://www.oddschecker.com/ using BeautifulSoup; however, now all I am getting is the following 403 status:
import requests
import bs4
result = requests.get("https://www.oddschecker.com/")
result.text
Output:
<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body bgcolor="white">\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n
I want to know if this is the same for all users on this website or if there is a way to navigate around this (via another web scraping package or other code) and access the actual data visible on the site.
Just add a User-Agent header. The site detects that you're a bot when the request doesn't look like it came from a real browser.
url = 'https://www.oddschecker.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
result = requests.get(url, headers=headers)
print(result.text)
You can also use Selenium:
from selenium import webdriver

driver = webdriver.Chrome()  # or webdriver.Firefox()
driver.get("https://www.oddschecker.com/")
print(driver.page_source)
driver.quit()
I am trying to make a script that downloads videos from a site. I can see the video URL, but when I try to open it, it gives `403 ERROR The request could not be satisfied`.
However, on the video page, when I choose View Page Info, Firefox can successfully download the video. In the Media tab there is a link location; when I try to access it, I get the same error.
I tried to download the video with pathlib, but it saves the error page instead. My question is: how can I download this video?
If you're using Python's requests module, you can create a browser-specific header to be included in the request. If you wanted to use Firefox for example:
mozhdr = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3'}
Then include it as an argument when you make the request:
requests.get("https://www.youtube.com", headers = mozhdr)
I have 100 BOL numbers that I need to search on the site below. However, I can't find a URL pattern to substitute each code into so I can keep searching automatically. Can anyone help?
Tracking codes:
MSCUZH129687
MSCUJZ365758
The page I'm working on:
https://www.msc.com/track-a-shipment
import requests

url = 'https://www.msc.com/track-a-shipment'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3346.9 Safari/537.36',
    'Referer': 'https://www.msc.com/track-a-shipment'
}
form_data = {
    'first': 'true',
    'pn': '1',
    'kd': 'python'
}

def getJobs():
    res = requests.post(url=url, headers=HEADERS, data=form_data)
    result = res.json()
    jobs = result['Location']['Description']['responsiveTd']
    print(type(jobs))
    for job in jobs:
        print(job)

getJobs()
TL;DR: You'll likely need to use a headless browser like Selenium to go to the page, input the code, and click the search button.
The URL to retrieve is generated by the JavaScript that runs when you click Search. The Search button posts the code to their server, so when it redirects you to the results link, the server knows what response to give you.
To auto-generate the link, you'd have to analyze that JavaScript, understand how it generates the code, generate it yourself, post it to their server, and then make a subsequent GET request to retrieve the results, just as the ASP.NET framework does.
Alternatively, you can use a headless browser like Selenium to go to the page, input the code, and click the search button. After the headless browser navigates to the results, you can parse them from there.
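As a rough sketch of that Selenium approach (the element locators below are hypothetical; you'd need to inspect the actual page to find the real input field and button selectors):

```python
def fetch_tracking(code):
    """Open the tracking page, submit a code, and return the rendered HTML.

    Sketch only: the locators below are hypothetical, not taken from the
    real page -- inspect the live DOM to find the actual ones.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get("https://www.msc.com/track-a-shipment")
        # Hypothetical locators for the search input and submit button.
        driver.find_element(By.NAME, "trackingNumber").send_keys(code)
        driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
        # After the click, the driver holds the JavaScript-rendered results.
        return driver.page_source
    finally:
        driver.quit()
```

You could then loop over your 100 BOL numbers, calling `fetch_tracking` for each and parsing the returned HTML with BeautifulSoup.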
HTTP Error 403: Forbidden is generated by either one of the following two commands:
requests.get('http://www.allareacodes.com')
urllib.request.urlopen('http://www.allareacodes.com')
However, I am able to browse this website in Chrome and view its source. Also, wget in my Cygwin is capable of grabbing the HTML source.
Does anyone know how to grab the source of this website using Python packages alone?
You have errors in your code for requests. It should be:
import requests
r = requests.get('http://www.allareacodes.com')
print(r.text)
In your case, however, the website rejects requests that don't look like they come from a real browser. As a solution, simply fake your headers so that the website thinks you're an actual user.
Example:
import requests

r = requests.get('http://www.allareacodes.com', headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
})
print(r.text)
Some sites rearrange their layout depending on whether they are accessed from a smartphone or a PC. I wonder how this is done (JavaScript? Reading browser data?). I would really appreciate some help. I am learning Java, thanks.
Each web browser request carries a user-agent string, which contains the necessary information. See this page for a description of the agent string: http://en.wikipedia.org/wiki/User_agent
The browser sends a header with each GET request containing a variety of information about itself. See here for an example; the particular information you are talking about (browser type) is sent in the User-Agent field. With some HTTP client libraries, you can control some of the fields sent in order to assume the identity of another type of client.
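For instance, with Python's standard-library urllib you can set the User-Agent field yourself before the request is sent (the UA string below is just an example value, not anything the site requires):

```python
import urllib.request

# Build a request that identifies itself as Firefox instead of urllib's
# default "Python-urllib/3.x" User-Agent. The UA string is an example value.
req = urllib.request.Request(
    "http://www.example.com/",
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) "
                           "Gecko/20100101 Firefox/115.0"},
)

# The header is attached to the Request object before anything is sent;
# urllib normalizes header names to capitalized form ("User-agent").
print(req.get_header("User-agent"))
```

Calling `urllib.request.urlopen(req)` would then send the request with that identity instead of the default.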
This is done by reading the user agent, usually using JavaScript (on websites).
JavaScript example here.
The website recognizes the browser via the user agent string. This is an identifier that tells the site the browser type and version.
It can be detected in JavaScript via navigator.userAgent.
It is also sent to the server in the GET request as a header field.
Example:
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.56 Safari/536.5
The Java Servlet code to get this would be (More Info Here):
public final void doGet(HttpServletRequest req, HttpServletResponse res)
        throws ServletException, IOException {
    String agent = req.getHeader("user-agent");
    if (agent != null && agent.indexOf("MSIE") > -1) {
        // Internet Explorer mode
    } else {
        // Non-Internet Explorer mode
    }
}
Obligatory Wikipedia Reference:
http://en.wikipedia.org/wiki/User_agent
The User-Agent string format is currently specified by Section 14.43 of RFC 2616 (HTTP/1.1). The format of the User-Agent string in HTTP is a list of product tokens (keywords) with optional comments. For example, if your product were called WikiBrowser, your user agent string might be WikiBrowser/1.0 Gecko/1.0. The "most important" product component is listed first. The parts of this string are as follows:
- Product name and version (WikiBrowser/1.0)
- Layout engine and version (Gecko/1.0)
Unfortunately, during the browser wars, many web servers were configured to only send web pages that required advanced features to clients that were identified as some version of Mozilla. For this reason, most web browsers use a User-Agent value as follows: Mozilla/[version] ([system and browser information]) [platform] ([platform details]) [extensions]. For example, Safari on the iPad has used the following:
Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405
The components of this string are as follows:
- Mozilla/5.0: Previously used to indicate compatibility with the Mozilla rendering engine
- (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us): Details of the system in which the browser is running
- AppleWebKit/531.21.10: The platform the browser uses
- (KHTML, like Gecko): Browser platform details
- Mobile/7B405: Used by the browser to indicate specific enhancements that are available directly in the browser or through third parties. An example of this is Microsoft Live Meeting, which registers an extension so that the Live Meeting service knows if the software is already installed, which means it can provide a streamlined experience for joining meetings.
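The product-token/comment structure described above can be sketched with a rough tokenizer (this is a simplification for illustration, not a robust user-agent parser; real-world UA strings are notoriously irregular):

```python
import re

def split_user_agent(ua):
    """Split a User-Agent string into product tokens and parenthesized comments.

    A rough sketch of the token/comment structure only -- it does not handle
    nested parentheses or the many malformed UA strings seen in the wild.
    """
    # Each piece is either a "(comment)" group or a run of non-space text.
    return re.findall(r'\([^)]*\)|\S+', ua)

ua = ("Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) "
      "AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405")
print(split_user_agent(ua))
# -> ['Mozilla/5.0', '(iPad; U; CPU OS 3_2_1 like Mac OS X; en-us)',
#     'AppleWebKit/531.21.10', '(KHTML, like Gecko)', 'Mobile/7B405']
```

For anything beyond illustration, a dedicated user-agent parsing library is the safer choice.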