Get Google Search Results Content - search

I want to get or buy Google search results content (structured) from Google itself or from any other source that can sell Google data legally. For example, I want all results for a specific keyword over the last six months.
Even just getting the page content as raw text would be a good outcome for this stage.

Automated reading or scraping of Google SERPs is against Google's ToS. From that point of view, no one sells such data legally - any seller violates Google's ToS.
There are many offers on the market where you can get SERP data as JSON or full HTML through API access - just google for it.
The way every seller does SERP scraping is always the same, and you can do it on your own: run many proxies with IP addresses from the countries whose SERPs you need, and query Google with a headless browser. Use captcha-solving services to keep getting data even when an IP gets banned. Multithread your queries to fetch more data at once. That's the whole magic.
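A minimal sketch of that approach, assuming a proxy pool you control; the proxy URLs and the `div.g`/`h3` selectors are assumptions that break whenever Google changes its markup, and in practice you hit captchas without far more machinery than this:

```python
# Sketch only: scraping Google SERPs this way violates Google's ToS and will
# quickly run into captchas/bans without large proxy pools and captcha solving.
import random
import requests
from bs4 import BeautifulSoup

PROXIES = [  # hypothetical proxy pool, one entry per exit country/IP
    "http://user:pass@proxy-de-1.example.com:8080",
    "http://user:pass@proxy-us-1.example.com:8080",
]

def fetch_serp(keyword):
    proxy = random.choice(PROXIES)
    resp = requests.get(
        "https://www.google.com/search",
        params={"q": keyword, "num": 20},
        proxies={"http": proxy, "https": proxy},
        headers={"User-Agent": "Mozilla/5.0"},  # pretend to be a regular browser
        timeout=15,
    )
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    results = []
    for hit in soup.select("div.g"):           # result-container selector: an assumption
        link = hit.select_one("a[href]")
        title = hit.select_one("h3")
        if link and title:
            results.append({"title": title.get_text(), "url": link["href"]})
    return results

if __name__ == "__main__":
    for r in fetch_serp("example keyword"):
        print(r["title"], r["url"])
```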

Related

How does Mixpanel's Search Keyword work?

I'm curious how Mixpanel tracks which Search Keyword an event is affiliated with. Is this from organic search (vs. paid search ads)?
If yes, how do they do it? At a glance, I guess organic search works this way:
The result link goes to a proxy link with some query parameters that contain info about the (encrypted) search term and the real destination link.
It then redirects to the real destination link.
Google Analytics knows the organic search keyword used in a session because it intercepts it at that middle point. I'm not sure there's any way for someone outside of Google to intercept that info (including Mixpanel). Right? (Correct me if I'm wrong.)
If there is a way for the destination website to know the organic search keyword, can I be enlightened on the method?
I don't think this is coming from organic search or paid ads, for a couple of reasons:
Most organic traffic is now over HTTPS, which makes it hard to get the search parameters. Google Analytics shows this data through the Webmaster Tools console, which grabs keyword data in a different way (I assume through the Google backend and not the URL itself). Otherwise you are stuck with the "Not Provided" issue in Google Analytics.
Mixpanel only captures the default UTM parameters: utm_campaign, utm_source, utm_keyword, utm_medium and utm_content. Mixpanel also names these properties as expected: UTM Medium, UTM Source, etc. (see the sketch after this answer).
I can't tell from your screenshot, but it seems this might be a custom property that your Mixpanel setup is setting, perhaps from an internal search engine? Or perhaps you're grabbing a custom URL query parameter?
Can you provide more information as to how this event is being captured?
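For context, client-side trackers like Mixpanel's typically just read the utm_* parameters off the landing-page URL's query string. A minimal sketch of that idea, using the key names listed above (the function name is illustrative, not Mixpanel's actual code):

```python
from urllib.parse import urlparse, parse_qs

# UTM keys as listed in the answer above.
UTM_KEYS = ("utm_campaign", "utm_source", "utm_keyword", "utm_medium", "utm_content")

def extract_utm(landing_url):
    """Pull any utm_* parameters out of the landing-page URL."""
    query = parse_qs(urlparse(landing_url).query)
    return {key: values[0] for key, values in query.items() if key in UTM_KEYS}

print(extract_utm(
    "https://example.com/?utm_source=newsletter&utm_medium=email&utm_campaign=spring"
))
# {'utm_campaign': 'spring', 'utm_source': 'newsletter', 'utm_medium': 'email'}
```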

What sort of technologies were used to build NoHomophobes.com (real-time keyword tracker for Twitter)?

Do you think they plugged directly into the Twitter API, or do they have some sort of backend that connects to the Twitter API instead? I didn't realize this kind of functionality was available to standard users.
Link: NoHomophobes.com
This site has a (short) piece about the technology used - it does seem like they're using the standard, public API:
"Using Twitter's API, tweets [...] were pulled, tracked and displayed
in real time
[...]
We couldn't simply pull every tweet ... A lot of research and testing
was conducted to determine which words and phrases to capture, as well
as what parameters the tweets had to follow in order to be funneled
onto the site"
Also, the site's own T&Cs mention:
This website contains a licensed real time display of Tweets
At a guess, they're effectively continually searching for certain terms in public tweets (as any Twitter client can do) and displaying the results.
Basically, the site uses the Twitter Streaming API, which allows a persistent connection with Twitter. As filtered tweets come through, it processes the data and delivers it to website visitors over WebSockets via a third-party service called Pusher.
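A sketch of that pattern, assuming the v1.1 statuses/filter streaming endpoint that existed at the time (Twitter has since retired it); the credentials, tracked terms, and Pusher channel/event names are placeholders:

```python
# Sketch: filter a tweet stream by keywords and push matches to browsers over
# WebSockets via Pusher. Endpoint, credentials and channel names are illustrative.
import json
import requests
from requests_oauthlib import OAuth1
import pusher

auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
push = pusher.Pusher(app_id="APP_ID", key="KEY", secret="SECRET", cluster="eu")

TRACKED_TERMS = "keyword1,keyword2"  # the phrases the site chose to track

resp = requests.post(
    "https://stream.twitter.com/1.1/statuses/filter.json",  # retired v1.1 endpoint
    auth=auth,
    data={"track": TRACKED_TERMS},
    stream=True,  # keep the HTTP connection open and read tweets as they arrive
)

for line in resp.iter_lines():
    if not line:
        continue  # keep-alive newline
    tweet = json.loads(line)
    # Forward the interesting fields to connected browsers via Pusher.
    push.trigger("tweets", "new-tweet", {
        "user": tweet["user"]["screen_name"],
        "text": tweet["text"],
    })
```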

How do search engines recognize search boxes on websites?

I've noticed that a lot of the time when I search something on Google, Google seems to use the search function of relevant websites and returns the website's search results as if they were just another URL.
How do I let Google and other search engines know which form is the search box on my own website, and does OpenSearch have anything to do with it?
Do you maybe mean the site search function via the Google Chrome omnibar?
To get that, the root page of your domain just needs to have a form with:
method GET
a text input element
a submit button
If users go directly to your root page and search something there, Google learns of this form and adds it to the search engines accessible via the omnibar (the Google Chrome address bar).
Did you mean this?
Google doesn't use anyone's search forms - it just finds a link to search results. To make this possible, you need to (see the sketch after this answer):
Use GET for your search parameters
Create links to common/useful search-results pages
Make sure Google finds those links
Google makes it look like just another URL because that is exactly what it is.
Most of the time, though, Google will do a better job than your own search engine, so actually doing this could lower the quality of results from your site...
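A minimal sketch of the GET idea, assuming a Flask app (the route, parameter name and catalogue are illustrative): because the query lives in the URL, a page like /search?q=widget is a stable address that can be linked to and crawled like any other page.

```python
# Minimal sketch: a GET-based site search whose result pages have stable,
# crawlable URLs such as /search?q=widget. Names are illustrative.
from flask import Flask, request, render_template_string

app = Flask(__name__)

PRODUCTS = ["blue widget", "red widget", "green gadget"]  # stand-in catalogue

@app.route("/search")
def search():
    query = request.args.get("q", "").strip().lower()
    hits = [p for p in PRODUCTS if query and query in p]
    # Because the query is part of the URL, this exact results page can be
    # linked from category pages or sitemaps and crawled like any other URL.
    return render_template_string(
        "<h1>Results for {{ q }}</h1>"
        "<ul>{% for h in hits %}<li>{{ h }}</li>{% endfor %}</ul>",
        q=query, hits=hits,
    )

if __name__ == "__main__":
    app.run(debug=True)
```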
I don't think it does. It's impossible to spider sites in real time.
It's just an SEO technique some sites use to improve their ranking by spamming Google with fake results. They feed Googlebot an endless stream of links to bogus pages:
http://en.wikipedia.org/wiki/Spamdexing

Why does Google Analytics show less visits than One&One stats?

Comparing Google Analytics results to One&One's monthly hosting statistics shows a huge discrepancy.
For last month:
Google shows 1046 visits.
One&One stats show 15304 unique visits.
The Google code is in the footer, which appears on every page.
I'm aware GA only works with JS enabled, but can I really assume there are that many non-JS users?
Google Analytics is a good indicator of how many humans are visiting your website.
Here are some things to check:
How many bots are in your monthly stats? You can usually find something that says User-Agent on your stats page. Googlebot, Slurp, msnbot and others will be visiting every page on your site.
That you've read Google Analytics' definition of a visit.
That you've read what your statistics provider means by a unique visit. Does that mean a unique visitor, a page view, or something else?
Raw hits on servers can be misleading for a number of reasons:
If you have external style sheets, JavaScript, etc., each of those files can be counted as a hit in the web-server log.
RSS feed readers will periodically poll the site without a human asking them to.
Check the page views in Google Analytics - it's possible that 1&1 is tracking unique page views instead of the actual visits.
Google Analytics works for almost all users (I believe fewer than 5% have JS disabled). I have had the same discrepancy; in my case, the difference disappeared once I accounted for bots (which server-side statistics often count, since bots produce HTTP requests). You probably have the same "problem".
Neither set of stats is wrong; they just count different things. Google Analytics is the more "accurate" one, i.e. the numbers you want to look at. The hosting stats, which look only at HTTP requests, often without filtering, are less interesting.
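A rough sketch of that bot accounting, assuming an Apache/nginx combined-format access log; the log path and the bot substrings are assumptions, so extend the list for your own traffic:

```python
# Rough sketch: estimate how much of a raw server log is bot traffic by
# checking the User-Agent (the last quoted field in a "combined" log line).
import re

BOT_MARKERS = ("googlebot", "bingbot", "slurp", "msnbot", "baiduspider", "bot", "crawler")

USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')

bot_hits = other_hits = 0
with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = USER_AGENT_RE.search(line)
        agent = match.group(1).lower() if match else ""
        if any(marker in agent for marker in BOT_MARKERS):
            bot_hits += 1
        else:
            other_hits += 1

print(f"bot requests:   {bot_hits}")
print(f"other requests: {other_hits}")
```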
Blogger, and probably other sites, serve a different page template or skin to mobile visitors. In my case, that template didn't contain the Google Analytics snippet of code, so those hits went uncounted until I noticed and fixed it.

How do I block web scraping without blocking well-behaved bots?

I'm building an e-commerce website with a large database of products. Of course, it's nice when Google indexes all the products on the website. But what if some competitor wants to scrape the website and grab all the images and product descriptions?
I've observed some websites with similar product lists, and they place a CAPTCHA so that "only humans" can read the list of products. The drawback is that it is invisible to Google, Yahoo and other "well-behaved" bots.
You can discover the IP addresses that Google and the others are using by checking visitor IPs with whois (on the command line or on a website). Then, once you've accumulated a stash of legitimate search-engine addresses, allow them into your product list without the CAPTCHA.
If you're worried about competitors using your text or images, how about a watermark or customized text?
Let them take your images and you'd have your logo on their site!
Since a potential screen-scraping application can spoof the user agent and HTTP referrer (for images) in the headers, and can pace its requests like a human browsing, it is not possible to completely stop professional scrapers. But you can still check for these things and prevent casual scraping.
I personally find Captchas annoying for anything other than signing up on a site.
One technique you could try is the "honey pot" method: it can be done either by mining log files or via some simple scripting.
The basic process is that you build your own "blacklist" of scraper IPs by looking for IP addresses that hit 2+ unrelated products within a very short period of time. Chances are those IPs belong to machines. You can then do a reverse lookup on them to determine whether they are nice (like Googlebot or Slurp) or bad. A sketch of that rate check follows below.
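A small sketch of that blacklist-building idea, assuming request events of the form (ip, product_id, timestamp); the threshold and window length are arbitrary illustrative choices:

```python
# Sketch: flag IPs that request 2+ distinct products within a short window.
# The event format, threshold and window length are illustrative.
from collections import defaultdict, deque
import time

WINDOW_SECONDS = 5      # "very short period of time"
PRODUCT_THRESHOLD = 2   # distinct products within the window

recent = defaultdict(deque)   # ip -> deque of (timestamp, product_id)
suspected = set()             # candidate scraper IPs for later reverse lookup

def record_hit(ip, product_id, ts=None):
    ts = ts if ts is not None else time.time()
    hits = recent[ip]
    hits.append((ts, product_id))
    # Drop hits that have fallen outside the window.
    while hits and ts - hits[0][0] > WINDOW_SECONDS:
        hits.popleft()
    if len({pid for _, pid in hits}) >= PRODUCT_THRESHOLD:
        suspected.add(ip)

record_hit("203.0.113.7", "product-123")
record_hit("203.0.113.7", "product-987")
print(suspected)   # {'203.0.113.7'} -> verify with a reverse DNS lookup next
```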
Blocking web scrapers is not easy, and it's even harder to avoid false positives.
Anyway, you can add some netranges to a whitelist and not serve any CAPTCHA to them.
All the well-known crawlers - Bing, Googlebot, Yahoo, etc. - always crawl from specific netranges, and all those IP addresses resolve to specific reverse hostnames.
Few examples:
Google IP 66.249.65.32 resolves to crawl-66-249-65-32.googlebot.com
Bing IP 157.55.39.139 resolves to msnbot-157-55-39-139.search.msn.com
Yahoo IP 74.6.254.109 resolves to h049.crawl.yahoo.net
So, for example, addresses resolving to '*.googlebot.com', '*.search.msn.com' and '*.crawl.yahoo.net' should be whitelisted.
There are plenty of whitelists out on the internet that you can use.
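A minimal sketch of that verification (forward-confirmed reverse DNS) using only the standard library; the suffix list mirrors the examples above:

```python
# Sketch: verify a claimed crawler IP with forward-confirmed reverse DNS.
# 1) reverse-resolve the IP, 2) check the hostname suffix, 3) resolve the
# hostname forward again and confirm it maps back to the same IP.
import socket

CRAWLER_SUFFIXES = (".googlebot.com", ".search.msn.com", ".crawl.yahoo.net")

def is_verified_crawler(ip):
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # reverse lookup
    except socket.herror:
        return False
    if not hostname.endswith(CRAWLER_SUFFIXES):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]   # forward lookup
    except socket.gaierror:
        return False
    return ip in forward_ips

print(is_verified_crawler("66.249.65.32"))   # True if it really is Googlebot
```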
That said, I don't believe a CAPTCHA is a solution against advanced scrapers, since services such as deathbycaptcha.com or 2captcha.com promise to solve any kind of CAPTCHA within seconds.
Please have a look at our wiki, http://www.scrapesentry.com/scraping-wiki/, where we wrote many articles on how to prevent, detect and block web scrapers.
Perhaps I over-simplify, but if your concern is server performance, then providing an API would lessen the need for scraping and save you bandwidth and processor time.
Other thoughts listed here:
http://blog.screen-scraper.com/2009/08/17/further-thoughts-on-hindering-screen-scraping/