Googlebot and the famous $_SERVER['HTTP_ACCEPT_LANGUAGE']

I used $_SERVER['HTTP_ACCEPT_LANGUAGE'] on my site, which has two languages, PL and EN.
I really didn't expect Google would reindex my site (meaning the TITLE and DESCRIPTION) from PL to EN that way.
Shouldn't it use PL, since it is crawling Polish domains? I can't understand it.
I could detect Googlebot and serve it PL again, but wouldn't that be cloaking or something?
Could anyone tell me what a good solution would be, so that both Google and I are happy?

HTTP_ACCEPT_LANGUAGE describes the language(s) the CLIENT (the client browser, in this case the crawler) supports; it is information the client sends with the request (like the IP, etc.), not something the server sends to the client.
To tell the client which languages your website supports, you can use meta tags. In your case, for example,
<meta http-equiv="content-language" content="pl, en" />
will tell the client that your site prefers PL but also supports EN.
This is the w3c page about it
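If you do want to drive the served language off that header, a common pattern is to fall back to your primary language (PL here) whenever the header is missing or doesn't match a language you support, so a crawler that sends no preference, or prefers EN, still gets the version you intend as the default, without any Googlebot-specific cloaking. Below is a minimal sketch of the idea in Node/Express terms (the route and language list are assumptions for illustration; the same logic carries over to PHP's $_SERVER['HTTP_ACCEPT_LANGUAGE']):

var express = require('express');
var app = express();

var SUPPORTED = ['pl', 'en'];   // languages the site actually serves
var DEFAULT_LANG = 'pl';        // primary language; clients with no usable preference get this

function pickLanguage(acceptLanguage) {
  if (!acceptLanguage) { return DEFAULT_LANG; }
  // e.g. "pl-PL,pl;q=0.9,en;q=0.8" -> ['pl', 'pl', 'en'] (q-values ignored for brevity)
  var requested = acceptLanguage.split(',').map(function (part) {
    return part.split(';')[0].trim().toLowerCase().split('-')[0];
  });
  var match = requested.filter(function (lang) {
    return SUPPORTED.indexOf(lang) !== -1;
  })[0];
  return match || DEFAULT_LANG;
}

app.get('/', function (req, res) {
  var lang = pickLanguage(req.headers['accept-language']);
  res.send('<html lang="' + lang + '">...</html>'); // render the PL or EN titles/descriptions here
});

app.listen(3000);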


Fingerprinting the browser to identify and ban annoying visitors

Excuse me in advance if this has already been asked, or if the problem is in the wrong section; please help me solve it, or move my topic to the appropriate section. I have an Arabic chat site; the chat is a script.
While searching for the most effective ways of banning annoying visitors from the chat, I found these files bearing the name "browser fingerprint". Link to the Fingerprint file that produces a distinctive fingerprint of the browser:
https://cdnjs.cloudflare.com/ajax/li...rprint2.min.js
The idea behind the file:
The basic idea of the file is to produce a distinctive fingerprint for the browser so members of the site can be distinguished even if a member changes his name and his IP. The file can also fetch a lot of information through the member's browser, such as the browser version, country and city, and the member's ISP. The only problem we have now is how to use the file to obtain the member's browser fingerprint, fetch the basic data from the browser such as country, city and ISP, and store this data in the database so it can be used to protect the site and the chat from spam and annoying members.
Thank you for your time.
Site Link:
https://www.3a-chat.com/chat
Unfortunately the library URL in your question does not work, but I would recommend using this existing solution and extending it a bit. For example you may add:
pixel ratio: window.devicePixelRatio || 1
languages: (navigator.languages || []).join(',')
math precision: `${((Math.exp(10) + 1 / Math.exp(10)) / 2)}${Math.tan(-1e300)}`
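As a rough illustration of how those extra signals could be folded into a fingerprint, here is a small browser-side sketch. The simpleHash function and the exact component list are assumptions for illustration only; in practice you would append these values to the components your fingerprinting library already collects before hashing:

// Collect a few extra signals and reduce them to a single fingerprint string.
// simpleHash is a toy 32-bit hash for illustration, not a cryptographic hash.
function simpleHash(str) {
  var hash = 0;
  for (var i = 0; i < str.length; i++) {
    hash = (hash * 31 + str.charCodeAt(i)) | 0;
  }
  return (hash >>> 0).toString(16);
}

var components = [
  navigator.userAgent,
  window.devicePixelRatio || 1,                                      // pixel ratio
  (navigator.languages || []).join(','),                             // languages
  '' + ((Math.exp(10) + 1 / Math.exp(10)) / 2) + Math.tan(-1e300)    // math precision quirks
];

var fingerprint = simpleHash(components.join('###'));
console.log('browser fingerprint:', fingerprint);

The fingerprint string can then be sent to the server and stored in the database alongside the member's name and IP.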

Is it possible to scrape any given URL with NodeJS?

I'll preface this by saying this is something that is new to me and is purely a learning exercise, so please excuse any naivety.
I've been looking through some articles on scraping and it seems that NodeJS, ExpressJS, Request and Cheerio would be my preferred method as a Front-End guy who is comfortable with JS/jQuery.
All the articles I've read so far focus on scraping data from a specific website in the absence of an API, whereas what I am looking to achieve to start with is a tool which takes any given URL and returns a true/false for a list of which common libraries are being used and which social networks are linked.
For example, a user enters a URL and the results return a "This website uses jQuery, MooTools, BackboneJS, AngularJS, etc" and "This website is linked with Facebook, Twitter, etc". Somewhat similar to Tregia: http://www.tregia.com/process?q=http://smashingmagazine.com.
Is my chosen setup (above) appropriate or limited to only scraping specific pages due to CSS selectors?
You should be able to scrape all pages and then find their tags and read which tools they're using (although keep in mind they may have renamed the files [e.g. angularjs3.1.0.js -> foobar.js] to keep people from knowing their stack). You should also be able to get the specific text within the rest of the tags that you feel is relevant.
You should try and pay attention to every page's robots.txt as well.
Edit: You probably won't be able to scrape "members"/"login only" areas of sites though.
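For what it's worth, here is a rough sketch of that idea using Request and Cheerio (assuming both packages are installed): fetch the page, collect the src of every script tag, and match the filenames against a few well-known library names. The library and social-network lists are assumptions for illustration, and as noted above this will miss renamed or bundled files:

// Rough sketch: fetch a page and guess which libraries / social networks it references.
var request = require('request');
var cheerio = require('cheerio');

var LIBRARIES = ['jquery', 'mootools', 'backbone', 'angular'];
var SOCIAL = ['facebook.com', 'twitter.com'];

function inspect(url) {
  request(url, function (err, res, body) {
    if (err) { return console.error(err); }
    var $ = cheerio.load(body);

    var scripts = $('script[src]').map(function () { return $(this).attr('src'); }).get();
    var links = $('a[href]').map(function () { return $(this).attr('href'); }).get();

    LIBRARIES.forEach(function (lib) {
      var used = scripts.some(function (src) { return src.toLowerCase().indexOf(lib) !== -1; });
      console.log(lib + ': ' + used);
    });

    SOCIAL.forEach(function (network) {
      var linked = links.some(function (href) { return href.indexOf(network) !== -1; });
      console.log(network + ': ' + linked);
    });
  });
}

inspect('http://smashingmagazine.com');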

How long does Google take to crawl a new page, and can we influence Google's crawler?

I want to submit my site to Google. How much time does it take to crawl a new post on the website?
Also, is there a way to feed this post to Google crawler as soon as a post is created?
Google has three modes of entering a website into its results - discover, crawl, index.
In order to 'discover' your site, it must be made aware of its existence - normally through back-links. If your site is brand new you can use the submit URL form - but this isn't really a trusted method. You're better off signing up for a Google Webmaster Tools account and submitting your site. An additional step is to submit an XML sitemap of your site. If you are publishing to your site in a blogging/posting way - you can always consider PubSubHubbub.
From there on, crawl frequency is normally based on site popularity (as measured by ye olde PageRank). Depth of crawl (crawl-budget) is also determined by PR.
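On the "feed the post to Google as soon as it is created" part: besides PubSubHubbub, Google has (at least historically) accepted a simple sitemap ping, so you can hit that endpoint right after publishing. A minimal sketch (the sitemap URL is a placeholder for your own; a successful ping only invites a crawl, it does not guarantee or schedule one):

// Minimal sketch: ping Google's sitemap endpoint after publishing a new post.
var https = require('https');

var sitemap = encodeURIComponent('https://example.com/sitemap.xml'); // replace with your sitemap URL
https.get('https://www.google.com/ping?sitemap=' + sitemap, function (res) {
  console.log('Google ping status: ' + res.statusCode); // 200 means the ping was received
});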
There are a couple ways to help "feed" the Google Crawler a URL.
The first way is to go here and submit a URL ---> www.google.com/webmasters/tools/submit-url/
The second way is to go to your Google Webmaster Tools, click "Fetch as GoogleBot", and then input the URL you want to add:
http://i.stack.imgur.com/Q3Iva.png
The URL will then appear similar to this:
http://example.site | Web | Success | URL submitted to index | 1/22/12 2:51 AM
As for how long it takes for a new page to appear on Google, there are many factors that go into this.
If the owners of the site use Google Webmasters Tools, the following setting is available:
http://i.stack.imgur.com/RqvOi.png
For a fast crawl you should submit your XML sitemap in Google Webmaster Tools and manually crawl and index your web pages' URLs through Google Webmaster Tools' fetch feature.
I have also used Google's crawl-and-index method, and this practice gave me the best results.
This is a great resource that really breaks down all the factors that affect a crawl budget and how to optimize your website to increase it. Cleaning up your broken links and removing outdated content, for example, can work wonders. https://prerender.io/crawl-budget-seo/ 
I acknowledged the error in my response by adding a comment to the original question a long time ago. Now I am updating this post to keep future readers from being misguided as I was. Please see the notes from other users below - they are correct: Google does not make use of the revisit-after meta tag. I am still keeping the original response text here so that anyone looking for a similar answer will find it along with this note confirming that this meta tag IS NOT VALID! Hope this helps someone.
You may use an HTML meta tag as follows:
<meta name="revisit-after" content="1 day">
Adjust the time period as necessary. There is no guarantee that robots will return within the given time frame, but this is how you tell robots how often a given page is likely to change.
The Revisit Meta Tag is used to tell search engines when to come back next.

How To Prevent GET Requests?

I am creating a site that will encourage users to visit again. Therefore, I'm afraid of people sending spam or bots to the site.
How can I block this type of spam? I've heard of spamming GET requests to make it look like there are more visits. What can I do to protect myself?
The main way to cut down on artificial bot traffic is to use a "captcha" image.
Look into reCAPTCHA or Securimage and integrate one of them. Whether you submit these forms via GET or POST, the captcha value will be checked on the server side, at which point you can admit or deny the request for the purpose of averting bots.
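For the server-side check, reCAPTCHA exposes a verification endpoint you call with your secret key and the token the browser submitted. A rough Node sketch (YOUR_SECRET_KEY and the surrounding request handler are placeholders; the siteverify endpoint and its success field are part of reCAPTCHA's documented verification API):

// Rough sketch: verify a reCAPTCHA token server-side before honouring the request.
var request = require('request');

function verifyCaptcha(token, callback) {
  request.post(
    'https://www.google.com/recaptcha/api/siteverify',
    { form: { secret: 'YOUR_SECRET_KEY', response: token } }, // YOUR_SECRET_KEY is a placeholder
    function (err, res, body) {
      if (err) { return callback(err); }
      var result = JSON.parse(body);
      callback(null, result.success === true); // true only when Google accepted the token
    }
  );
}

// Usage inside a request handler (Express-style, assumed):
//   verifyCaptcha(req.body['g-recaptcha-response'], function (err, ok) {
//     if (!ok) { return res.status(403).send('Captcha failed'); }
//     // ...serve the page...
//   });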
Hope this helps.

How do I block Web scraping without blocking well-behaved bots?

I'm building an e-commerce website with a large database of products. Of course, it is nice when Google indexes all of the website's products. But what if some competitor wants to Web Scrape the website and grab all the images and product descriptions?
I was observing some websites with similar lists of products, and they place a CAPTCHA, so "only humans" can read the list of products. The drawback is... it is invisible to Google, Yahoo and other "well-behaved" bots.
You can discover the IP addresses that Google and the others are using by checking visitor IPs with whois (on the command line or on a web site). Then, once you've accumulated a stash of legit search engines, allow them into your product list without the CAPTCHA.
If you're worried about competitors using your text or images, how about a watermark or customized text?
Let them take your images and you'd have your logo on their site!
Since a potential screen-scraping application can spoof the user agent and the HTTP referrer (for images) in the header, and can use a timing schedule similar to a human browser's, it is not possible to completely stop professional scrapers. But you can check for these things nevertheless and prevent casual scraping.
I personally find Captchas annoying for anything other than signing up on a site.
One technique you could try is the "honey pot" method: it can be done either by mining log files or via some simple scripting.
The basic process is that you build your own "blacklist" of scraper IPs by looking for IP addresses that look at 2+ unrelated products in a very short period of time. Chances are these IPs belong to machines. You can then do a reverse lookup on them to determine whether they are nice (like GoogleBot or Slurp) or bad.
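A very rough sketch of the scripted version (the time window, threshold and in-memory store are assumptions for illustration; in practice you would more likely mine your access logs offline):

// Rough sketch: flag IPs that hit several different product pages within a short window.
var hitsByIp = {};                 // ip -> { products: Set, firstSeen: timestamp }
var WINDOW_MS = 10 * 1000;         // "very short period of time"
var PRODUCT_THRESHOLD = 2;         // 2+ unrelated products

function recordProductView(ip, productId) {
  var now = Date.now();
  var entry = hitsByIp[ip];
  if (!entry || now - entry.firstSeen > WINDOW_MS) {
    entry = hitsByIp[ip] = { products: new Set(), firstSeen: now };
  }
  entry.products.add(productId);
  // A true result makes the IP a blacklist candidate; it still needs the reverse
  // lookup described above so GoogleBot or Slurp never gets blocked by mistake.
  return entry.products.size >= PRODUCT_THRESHOLD;
}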
Blocking web scrapers is not easy, and it's even harder to avoid false positives.
Anyway, you can add some netranges to a whitelist and not serve any captcha to them.
All those well-known crawlers - Bing, Googlebot, Yahoo, etc. - always use specific netranges when crawling, and all of those IP addresses resolve to specific reverse lookups.
A few examples:
Google IP 66.249.65.32 resolves to crawl-66-249-65-32.googlebot.com
Bing IP 157.55.39.139 resolves to msnbot-157-55-39-139.search.msn.com
Yahoo IP 74.6.254.109 resolves to h049.crawl.yahoo.net
So let's say that '*.googlebot.com', '*.search.msn.com' and '*.crawl.yahoo.net' addresses should be whitelisted.
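A rough Node sketch of that verification (the suffix list comes from the examples above; the extra forward lookup guards against spoofed reverse DNS records):

// Rough sketch: verify a claimed crawler IP via reverse DNS, then forward-confirm it.
var dns = require('dns');

var CRAWLER_SUFFIXES = ['.googlebot.com', '.search.msn.com', '.crawl.yahoo.net'];

function isWhitelistedCrawler(ip, callback) {
  dns.reverse(ip, function (err, hostnames) {
    if (err || !hostnames || !hostnames.length) { return callback(false); }
    var hostname = hostnames[0];
    var matches = CRAWLER_SUFFIXES.some(function (suffix) {
      return hostname.slice(-suffix.length) === suffix;
    });
    if (!matches) { return callback(false); }
    // Forward-confirm: the hostname must resolve back to the same IP address.
    dns.lookup(hostname, function (err2, address) {
      callback(!err2 && address === ip);
    });
  });
}

// usage: isWhitelistedCrawler('66.249.65.32', function (ok) { console.log(ok); });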
There are plenty of whitelists you can find on the internet and implement.
That said, I don't believe a Captcha is a solution against advanced scrapers, since services such as deathbycaptcha.com or 2captcha.com promise to solve any kind of captcha within seconds.
Please have a look at our wiki, http://www.scrapesentry.com/scraping-wiki/ - we wrote many articles there on how to prevent, detect and block web scrapers.
Perhaps I'm over-simplifying, but if your concern is about server performance, then providing an API would lessen the need for scrapers and save you bandwidth and processor time.
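As a sketch of what that might look like, here is a minimal read-only product endpoint (the route, fields and in-memory data are assumptions for illustration; in practice the data would come from the product database):

// Minimal sketch: a read-only product API so well-behaved consumers don't need to scrape HTML.
var express = require('express');
var app = express();

var products = [
  { id: 1, name: 'Example product', description: 'Example description', image: '/img/1.jpg' }
];

app.get('/api/products', function (req, res) {
  res.json(products); // cheap to serve, easy to cache and rate-limit
});

app.listen(3000);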
Other thoughts listed here:
http://blog.screen-scraper.com/2009/08/17/further-thoughts-on-hindering-screen-scraping/
