How does Mixpanel's Search Keyword work? - search

I'm curious about how Mixpanel tracks which Search Keywords an event is affiliated with. Does this come from organic search (as opposed to paid search ads)?
If so, how do they do it? At a glance, I guess organic search works this way:
The search result link points to a proxy URL whose query parameters carry the (encrypted) search term and the real destination link.
The proxy then redirects to the real destination link.
Google Analytics knows the organic search keyword used in a session because Google intercepts it at that midpoint. I'm not sure there is any way for someone outside Google (including Mixpanel) to intercept that information. Right? (Correct me if I'm wrong.)
If there is a way for the destination website to know the organic search keyword, could someone explain the method?

I don't think this is coming from organic search or paid ads, for a couple of reasons:
Most organic traffic is now over HTTPS, which makes it hard to get the search parameters. Google Analytics shows this data through the Webmaster Tools console, which is able to obtain keyword data in a different way (I assume through the Google backend and not the URL itself). Otherwise, you are stuck with the "Not Provided" issue in Google Analytics.
Mixpanel only captures the default UTM parameters: utm_campaign, utm_source, utm_term, utm_medium and utm_content. Mixpanel also names these properties as you would expect: UTM Medium, UTM Source, etc.
I can't tell from your screenshot, but this looks like a custom property that your Mixpanel setup is setting itself, perhaps from an internal search engine? Or perhaps you're grabbing a custom URL query parameter?
Can you provide more information as to how this event is being captured?
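To make the UTM point concrete, here is a minimal sketch of how a tracker could read the standard UTM parameters from a landing-page URL. This is not Mixpanel's code; the function name `extract_utm` and the example URL are hypothetical and only illustrate the idea.

```python
from urllib.parse import urlparse, parse_qs

# The five standard UTM parameters.
UTM_KEYS = ("utm_source", "utm_medium", "utm_campaign", "utm_content", "utm_term")


def extract_utm(url: str) -> dict[str, str]:
    """Return the UTM parameters that are present in the URL's query string."""
    query = parse_qs(urlparse(url).query)
    # parse_qs returns a list per key; keep the first value of each UTM key.
    return {key: query[key][0] for key in UTM_KEYS if key in query}


# Hypothetical landing-page URL, for illustration only.
landing = (
    "https://example.com/pricing"
    "?utm_source=newsletter&utm_medium=email&utm_term=analytics+tools"
)
print(extract_utm(landing))
```

A keyword only shows up this way if whoever built the link put it into the URL; an organic search click carries no such parameter.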

Related

Get Google Search Results Content

I want to get or buy google search results content (structured) from Google itself or any other source that can sell google data legally. I want all results about a specific keyword for the recent 6 months for example.
It would be a good start if I could get just the page content as raw text at this stage.
Automated reading or scraping of Google SERPs is against Google's ToS. From this point of view, nobody sells such data legally: any seller violates Google's ToS.
There are many offers on the market where you can get SERP data as JSON or full HTML through API access. Just google for it.
Every seller does SERP scraping the same way, and you can do it on your own. Run many proxies with IP addresses from the countries whose SERPs you need, and query Google with some kind of headless browser. Use captcha-solving services to keep getting data even if an IP gets banned. Multithread your queries to get more data at once. That's the whole magic.

Get/Show google search results in my app

I am facing a problem while developing an app in which I need to display search engine results directly on my app page, without redirecting to www.google.com.
This is how it should look: I enter the name of the RSS feed site in the search box, and I want the Google search results to appear on my app page so that I can easily extract the RSS feed website and perform the operation I intended to do.
I only intend to get the RSS feeds from the site, just by typing the site name.
Thank you!
Answer.
Almost working..,
Thank you #Chandan,#Suzi
Check under 2. A Better Approach
I haven't tried it out in practice and am not sure whether it has been deprecated by now.

How does tracking of the web traffic source work?

Maybe a stupid question, but I can't find any answer to it on the web.
In Google Analytics it is possible to check the origin of a connection to our website. My question is: how can Google track the origin of those connections?
If there is a value in document.referrer (for the JavaScript tracker; with the Measurement Protocol you have to pass the referrer as a parameter), Google identifies the source as a referral, unless the referrer is configured (in the defaults or via custom settings) as a search engine (which is really just a referrer with a known search parameter). You can also exclude URLs from the referrer reports via the settings, so that they appear as direct traffic.
If there are campaign parameters, Google uses those (or else a Google click ID (gclid) from auto-tagging in AdWords, which serves a similar purpose). If the campaign parameters or the gclid are stripped out (e.g. by redirects), AdWords ad clicks will be reported as organic search.
If there is no referrer and there are no campaign parameters or gclid (i.e. a direct type-in or a bookmark), Google will identify the source as a direct hit, unless you have clicked an AdWords ad before. In that case the acquisition report will show the source as CPC (cost per click); as Google puts it, they use the last known marketing channel as the source, and direct is not a marketing channel according to Google. The multichannel reports, however, identify those visits more correctly as direct, which is why the multichannel and acquisition reports usually do not quite match.
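The decision order described above can be sketched roughly as follows. This is an illustration only, not Google's actual logic: the function name `classify_source`, the small `SEARCH_ENGINES` table, the precedence of gclid over UTM parameters, and the `last_campaign` argument (standing in for the last known marketing channel) are all assumptions.

```python
from urllib.parse import urlparse, parse_qs

# Assumed table of search-engine hosts and their query parameter.
# Illustrative only; it is not Google's real default list.
SEARCH_ENGINES = {"www.google.com": "q", "www.bing.com": "q", "search.yahoo.com": "p"}


def classify_source(landing_url, referrer, last_campaign=None):
    """Return a {source, medium} dict for one visit."""
    params = parse_qs(urlparse(landing_url).query)

    # 1. Campaign information wins: gclid (auto-tagging) or UTM parameters.
    if "gclid" in params:
        return {"source": "google", "medium": "cpc"}
    if "utm_source" in params:
        medium = params.get("utm_medium", ["(not set)"])[0]
        return {"source": params["utm_source"][0], "medium": medium}

    # 2. Otherwise use the referrer; known search engines count as organic.
    if referrer:
        host = urlparse(referrer).netloc
        medium = "organic" if host in SEARCH_ENGINES else "referral"
        return {"source": host, "medium": medium}

    # 3. No referrer and no campaign: reuse the last known campaign if any,
    #    otherwise report the visit as direct.
    if last_campaign:
        return last_campaign
    return {"source": "(direct)", "medium": "(none)"}


print(classify_source("https://example.com/?gclid=abc123", "https://www.google.com/"))
print(classify_source("https://example.com/", "https://www.google.com/"))
print(classify_source("https://example.com/", ""))
```

The second call shows the stripped-gclid case from the answer: without the click ID, an ad click that arrives with a Google referrer is indistinguishable from organic search.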

How do search engines recognize search boxes on websites?

I've noticed that a lot of the time when I search for something on Google, Google automatically uses the search function of relevant websites and returns the result of the website search as if it were just another URL.
How do I let Google and other search engines know which search box is the one on my own website, and does OpenSearch have anything to do with it?
Do you perhaps mean the site search function in the Google Chrome omnibar?
To get there you just need to have
a form with method type GET
an input element of type text
a submit button
on the root page of your domain.
If users go directly to your root page and search for something there, Google learns of this form and adds it to the search engines accessible via the omnibar (the Google Chrome address bar).
Did you mean this?
Google doesn't use anyone's search forms. It just finds links to search results pages. You need to:
Use GET for your search parameters to make this possible
Create links to common/useful search results pages
Make sure google finds those links
Google makes it look like just another URL because that is exactly what it is.
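As a small illustration of why GET matters here: a GET search puts the query in the URL itself, so a results page is an ordinary address that can be linked to and crawled. The base URL and the parameter name `q` below are hypothetical.

```python
from urllib.parse import urlencode


def search_url(base: str, terms: str) -> str:
    """Build the URL a GET search form would request for the given terms."""
    # urlencode escapes the terms; spaces become '+'.
    return f"{base}?{urlencode({'q': terms})}"


# Hypothetical site search endpoint, for illustration only.
url = search_url("https://example.com/search", "red shoes")
print(url)  # https://example.com/search?q=red+shoes
```

A POST form would send the same terms in the request body, leaving no distinct URL for a crawler to find.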
Most of the time though Google will do a better job than your search engine so actually doing this could lower the quality of results from your site...
I don't think it does. It's impossible to spider sites in real time.
It's just an SEO technique some sites use to improve their ranking by spamming Google with fake results. They feed the Google bot an endless stream of links to bogus pages:
http://en.wikipedia.org/wiki/Spamdexing

How do I block web scraping without blocking well-behaved bots?

I'm building an e-commerce website with a large database of products. Of course, it is nice when Google indexes all the products of the website. But what if a competitor wants to scrape the website and get all the images and product descriptions?
I was looking at some websites with similar product lists, and they place a CAPTCHA so that "only humans" can read the list of products. The drawback is that the list is invisible to Google, Yahoo and other "well-behaved" bots.
You can discover the IP addresses that Google and the others are using by checking visitor IPs with whois (on the command line or on a website). Then, once you've accumulated a stash of legitimate search engine addresses, allow them into your product list without the CAPTCHA.
If you're worried about competitors using your text or images, how about a watermark or customized text?
Let them take your images and you'd have your logo on their site!
Since a potential screen-scraping application can spoof the user agent and HTTP referrer (for images) in the header and use a request schedule similar to that of a human browsing, it is not possible to completely stop professional scrapers. But you can check for these things nevertheless and prevent casual scraping.
I personally find Captchas annoying for anything other than signing up on a site.
One technique you could try is the "honey pot" method: it can be done either by mining log files or via some simple scripting.
The basic process is that you build your own "blacklist" of scraper IPs by looking for IP addresses that look at 2+ unrelated products in a very short period of time. Chances are these IPs belong to machines. You can then do a reverse lookup on them to determine whether they are nice (like GoogleBot or Slurp) or bad.
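A minimal sketch of that log-mining step, under a few assumptions: the hits are already parsed into (ip, timestamp, product_id) tuples, "unrelated" is simplified to "distinct product IDs", and the 2-second window is an arbitrary threshold. The function name `suspicious_ips` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def suspicious_ips(hits, window=timedelta(seconds=2), min_products=2):
    """Return the IPs that viewed at least `min_products` distinct products
    within any time span of length `window`.

    `hits` is an iterable of (ip, timestamp, product_id) tuples.
    """
    by_ip = defaultdict(list)
    for ip, ts, product in hits:
        by_ip[ip].append((ts, product))

    flagged = set()
    for ip, views in by_ip.items():
        views.sort(key=lambda view: view[0])  # order by timestamp
        start = 0
        for end in range(len(views)):
            # Shrink the window from the left until it spans <= `window`.
            while views[end][0] - views[start][0] > window:
                start += 1
            distinct = {product for _, product in views[start:end + 1]}
            if len(distinct) >= min_products:
                flagged.add(ip)
                break
    return flagged


# Made-up sample data using documentation IP ranges.
t0 = datetime(2024, 1, 1, 12, 0, 0)
hits = [
    ("203.0.113.5", t0, "A"),
    ("203.0.113.5", t0 + timedelta(seconds=1), "B"),
    ("198.51.100.7", t0, "A"),
    ("198.51.100.7", t0 + timedelta(seconds=60), "B"),
]
print(suspicious_ips(hits))  # {'203.0.113.5'}
```

Flagged IPs are only candidates; the reverse lookup mentioned above still decides whether an address belongs to a legitimate crawler.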
Blocking web scrapers is not easy, and it's even harder to avoid false positives.
Anyway, you can add some net ranges to a whitelist and not serve any captcha to them.
All the well-known crawlers (Bing, Googlebot, Yahoo, etc.) always use specific net ranges when crawling, and all those IP addresses resolve to specific reverse lookups.
A few examples:
Google IP 66.249.65.32 resolves to crawl-66-249-65-32.googlebot.com
Bing IP 157.55.39.139 resolves to msnbot-157-55-39-139.search.msn.com
Yahoo IP 74.6.254.109 resolves to h049.crawl.yahoo.net
So let's say that '*.googlebot.com', '*.search.msn.com' and '*.crawl.yahoo.net' addresses should be whitelisted.
There are plenty of whitelists available on the internet that you can implement.
That said, I don't believe a captcha is a solution against advanced scrapers, since services such as deathbycaptcha.com or 2captcha.com promise to solve any kind of captcha within seconds.
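The reverse-lookup check can be sketched as below. The suffix list comes from the examples above; the forward-confirmation step (resolving the hostname back and checking that it yields the same IP) is an addition not mentioned in the answer, included because a reverse DNS record alone can be set by whoever controls the IP block. The function names are hypothetical, and the resolvers are passed in as parameters so the logic can be tried without network access.

```python
import socket

# Hostname suffixes taken from the examples above.
TRUSTED_SUFFIXES = (".googlebot.com", ".search.msn.com", ".crawl.yahoo.net")


def is_trusted_hostname(hostname: str) -> bool:
    """True if the hostname ends with one of the whitelisted crawler suffixes."""
    return hostname.lower().rstrip(".").endswith(TRUSTED_SUFFIXES)


def is_verified_crawler(ip, reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the suffix, then forward-confirm it."""
    try:
        hostname = reverse(ip)[0]
        if not is_trusted_hostname(hostname):
            return False
        # Assumption: require the hostname to resolve back to the same IP.
        return ip in forward(hostname)[2]
    except OSError:  # covers socket.herror and socket.gaierror
        return False


# Suffix check only; no DNS lookup happens here.
print(is_trusted_hostname("crawl-66-249-65-32.googlebot.com"))  # True
print(is_trusted_hostname("googlebot.com.evil.example"))        # False
```

Calling `is_verified_crawler("66.249.65.32")` with the default resolvers performs real DNS lookups, so in production the result is worth caching per IP.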
Please have a look into our wiki http://www.scrapesentry.com/scraping-wiki/ we wrote many articles on how to prevent, detect and block web-scrapers.
Perhaps I oversimplify, but if your concern is server performance, then providing an API would lessen the need for scrapers and save you bandwidth and processor time.
Other thoughts listed here:
http://blog.screen-scraper.com/2009/08/17/further-thoughts-on-hindering-screen-scraping/

Resources