TrackJS: ignore rules by token in user agent

In TrackJS, some crawler user agents are parsed as normal browsers. For example:
Mozilla/5.0 (Linux; Android 7.0; SM-G930V Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36 (compatible; Google-Read-Aloud; +https://support.google.com/webmasters/answer/1061943)
is parsed as Chrome Mobile 59.0.3071.
I tried to do this with Ignore Rules in the settings, but it doesn't work.
So I need to filter errors by a token in the user agent.
Is it possible to do this without JS?
More similar user agents: https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers

The TrackJS UI doesn't allow you to create Ignore Rules against the raw UserAgent, only the parsed browser and operating system. Instead, use the client-side ignore capability with the onError function.
Build your function to detect the tokens you want to exclude, and return false from it if you don't want the error to be sent.
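A minimal sketch of that approach; the token list is an example to extend with the crawler tokens you see, and the TrackJS token value is a placeholder:

```javascript
// Sketch: suppress TrackJS reports from user agents containing known bot tokens.
// The token list is an example; add whatever crawler tokens show up in your data.
var IGNORED_UA_TOKENS = ["Google-Read-Aloud", "Googlebot", "AdsBot-Google"];

function isIgnoredUserAgent(userAgent) {
  return IGNORED_UA_TOKENS.some(function (token) {
    return userAgent.indexOf(token) !== -1;
  });
}

// TrackJS agent configuration: returning false from onError drops the report
// before it is sent. Define this before loading the tracker script.
if (typeof window !== "undefined") {
  window._trackJs = {
    token: "YOUR-TRACKJS-TOKEN", // placeholder
    onError: function (payload) {
      return !isIgnoredUserAgent(window.navigator.userAgent);
    }
  };
}
```

Because the check runs client-side, the ignored errors never leave the page, so they don't count against your quota.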

Related

How does Google Chrome's dev tools emulate device mode? How to replicate it with extensions?

I need to open Instagram in mobile mode as it shows Messages tab only in mobile view. I want to know what Chrome does so I can replicate it in my Chrome extension.
I've already noticed that Google Chrome updates the User-Agent in device mode. I tried this myself and replaced the header with "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Mobile Safari/537.36", but this doesn't seem to be enough for Instagram to show the mobile view. I suspect Google Chrome does something extra, or Instagram has some safety measure?
let mobileAgent = "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Mobile Safari/537.36";
chrome.webRequest.onBeforeSendHeaders.addListener(
    function (details) {
        for (let i = 0; i < details.requestHeaders.length; ++i) {
            if (details.requestHeaders[i].name === 'User-Agent') {
                details.requestHeaders[i].value = mobileAgent;
                break;
            }
        }
        return {requestHeaders: details.requestHeaders};
    },
    {urls: ["*://*.instagram.com/*"]},
    ["blocking", "requestHeaders"]);
With the above code, I checked the headers and sure enough they are using the mobile agent, but Instagram still doesn't display the messages tab.

Python web data collecting

I have 100 BOL numbers that I need to search on the site below. However, I can't find a URL that I can substitute the codes into to search automatically. Can anyone help?
Tracking codes:
MSCUZH129687
MSCUJZ365758
The page I'm working on:
https://www.msc.com/track-a-shipment
import requests

url = 'https://www.msc.com/track-a-shipment'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3346.9 Safari/537.36',
    'Referer': 'https://www.msc.com/track-a-shipment'
}
form_data = {'first': 'true',
             'pn': '1',
             'kd': 'python'}

def getJobs():
    res = requests.post(url=url, headers=HEADERS, data=form_data)
    result = res.json()
    jobs = result['Location']['Description']['responsiveTd']
    print(type(jobs))
    for job in jobs:
        print(job)

getJobs()
tl;dr: You'll likely need to use a headless browser like Selenium to go to the page, input the code and click the search button.
The URL to retrieve is generated by the JavaScript that runs when you click search.
The search button posts the code to their server, so when it redirects you to the results link the server knows which response to give you.
To auto-generate the link, you'd have to analyze the JavaScript and understand how it builds the request, post the code to their server yourself, and then make a subsequent GET request to retrieve the results, just like the ASP.NET front end is doing.
Alternatively you can use a headless browser like selenium to go to the page, input the code and click the search button. After the headless browser navigates to the results you can parse it from there.
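A sketch of the Selenium route, assuming the `selenium` package and a Chrome driver are installed. The CSS selectors below are hypothetical placeholders; inspect the live page to find the real ones:

```python
# Sketch: drive a real browser to submit each tracking code.
# All selectors here are guesses -- adjust them against the actual page markup.

def fetch_tracking_result(code, timeout=15):
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://www.msc.com/track-a-shipment")
        # Wait for the search box, type the BOL number, and click search.
        box = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "#trackingNumber"))  # hypothetical id
        )
        box.send_keys(code)
        driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()  # hypothetical selector
        # Wait for the results to render, then hand the HTML to a parser.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".resultsSection"))  # hypothetical class
        )
        return driver.page_source
    finally:
        driver.quit()

# Usage (requires a working Chrome install):
#   html = fetch_tracking_result("MSCUZH129687")
```

Looping this over the 100 codes is then straightforward; parse each returned page with BeautifulSoup or similar.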

Python 3: received 403 Forbidden error when using requests

HTTP Error 403: Forbidden is generated by either one of the following two commands.
requests.get('http://www.allareacodes.com')
urllib.request.urlopen('http://www.allareacodes.com')
However, I am able to browse this website in Chrome and check its source. Besides, wget in my Cygwin is also capable of grabbing the HTML source.
Does anyone know how to grab the source of this website using Python packages alone?
You have errors in your code for requests. It should be:
import requests
r = requests.get('http://www.allareacodes.com')
print(r.text)
In your case, however, the website rejects requests whose headers don't look like they come from a real browser. As a solution, simply fake your headers so that the website thinks you're an actual user.
Example:
import requests
r = requests.get('http://www.allareacodes.com', headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
})
print(r.text)

windows phone browser in desktop mode

How can I force a Windows Phone to use the desktop view mode in the mobile browser?
In the settings it is possible to set the browser to use the desktop view, because some features seem to be missing in the mobile view, causing my site not to be displayed correctly.
If you want to make websites display in desktop mode in the WebBrowser control, you must change its user agent. You can do so using this:
webBrowser.Navigate(new Uri("http://www.google.com"), null, "User-Agent: Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)");
That code changes the WebBrowser's user agent to that of desktop Internet Explorer 10.
However, it will only change the User Agent for the page navigated to. When users click links, the user agent will be changed back. To fix this, set the WebBrowser's Navigating event to this:
private void webBrowser_Navigating(object sender, WebBrowserNavigatingEventArgs e)
{
    string url = e.Uri.ToString();
    if (!url.Contains("#changedua"))
    {
        e.Cancel = true;
        url = url + "#changedua";
        webBrowser.Navigate(new Uri(url), null, "User-Agent: Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)");
    }
}
In this code, we check to see if the url contains a flag, "#changedua". If it does, we allow the navigation. If it does not, we cancel the navigation. Then, we navigate again using our custom user agent, and adding the flag to show that it is valid.

How does a web site recognize the web browser?

Some sites rearrange their layout when accessed through a smartphone versus a PC. I wonder how it is done (JavaScript? reading the browser data?). I would really appreciate some help; I am learning Java, thanks.
Each request a web browser makes carries an agent string, which contains the necessary information. Look at this page for a description of the agent string: http://en.wikipedia.org/wiki/User_agent
The browser sends a header with each GET request containing a variety of information about itself. The particular information you are talking about (browser type) is sent in the User-Agent field. With some HTTP client libraries, you are able to control some of the fields sent in order to assume the identity of other types of client.
This is done by reading the user agent, usually with JavaScript (on websites).
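For instance, a common client-side check; the regex is a widely used heuristic, not an official API:

```javascript
// Heuristic mobile detection from a user agent string.
function isMobileUserAgent(ua) {
  return /Mobi|Android|iPhone|iPad/i.test(ua);
}

// In a browser you would call it with the live value:
// if (isMobileUserAgent(navigator.userAgent)) { /* load the mobile layout */ }
```

Sites that need more detail (exact browser and version) typically use a full parsing library rather than a one-line regex.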
The website recognizes the browser via the user agent string. This is an identifier that tells the site the browser type and version.
This can be detected in JavaScript via navigator.userAgent.
It is also sent to the server in the GET request as a header field.
Example:
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.56 Safari/536.5
The Java Servlet code to get this would be:
public final void doGet(HttpServletRequest req, HttpServletResponse res)
        throws ServletException, IOException {
    String agent = req.getHeader("user-agent");
    if (agent != null && agent.indexOf("MSIE") > -1) {
        // Internet Explorer mode
    } else {
        // Non-Internet Explorer mode
    }
}
Obligatory Wikipedia Reference:
http://en.wikipedia.org/wiki/User_agent
The User-Agent string format is currently specified by Section 14.43 of RFC 2616 (HTTP/1.1). The format of the User-Agent string in HTTP is a list of product tokens (keywords) with optional comments. For example, if your product were called WikiBrowser, your user agent string might be WikiBrowser/1.0 Gecko/1.0. The "most important" product component is listed first. The parts of this string are as follows:
Product name and version: WikiBrowser/1.0
Layout engine and version: Gecko/1.0
Unfortunately, during the browser wars, many web servers were configured to only send web pages that required advanced features to clients that were identified as some version of Mozilla. For this reason, most web browsers use a User-Agent value of the form:
Mozilla/[version] ([system and browser information]) [platform] ([platform details]) [extensions]
For example, Safari on the iPad has used the following:
Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405
The components of this string are as follows:
Mozilla/5.0: Previously used to indicate compatibility with the Mozilla rendering engine
(iPad; U; CPU OS 3_2_1 like Mac OS X; en-us): Details of the system in which the browser is running
AppleWebKit/531.21.10: The platform the browser uses
(KHTML, like Gecko): Browser platform details
Mobile/7B405: Used by the browser to indicate specific enhancements that are available directly in the browser or through third parties. An example of this is Microsoft Live Meeting, which registers an extension so that the Live Meeting service knows if the software is already installed, which means it can provide a streamlined experience for joining meetings.