Suspiciously high number of web visits from "exotic" countries

I set up a small business website that only displays information about the services we offer and some contact details. It is not interactive at all, and no user can submit any data.
We are now monitoring the visits and IPs with the tools offered by Google. Since the first days after going public we have been observing a lot of IPs from places in the world we have absolutely no relation to (like Russia, China, Brazil, even some African states). Also, the overall number of visits is much higher than we expected.
Now I'm wondering where these "exotic" visitors may come from, and whether this is some kind of attack we should be aware of and protect against somehow. Does anybody know what might be happening here?

This is a common situation. Websites with the default Google Analytics tracking code (like UA-XXXXXXX-1) have been receiving hits from what is known as "ghost referrals". These ghosts often come from Russia through different sources such as forum.topic59010277.darodar.com, humanorightswatch.org, o-o-6-o-o.com and s.click.aliexpress.com.
Most recently I have noticed another source, simple-share-buttons.com, coming from different countries such as the USA, China, Finland, Singapore and Argentina.
They distort metrics like bounce rate and session duration. Google might deliver a solution soon; meanwhile, you can use view filters to block them from appearing in your GA reports.
Create a filter that excludes only ghosts from your view. Go to your view and set up the filter as follows:
Filter type: Custom
Exclude
Filter field: Referral
Filter pattern: use the following regex:
.*spammer1\.tld|.*spammer2\.tld|.*spammer3\.tld|.*spammer4\.tld
Check the TLD (com, net, co, etc.) of each spammer domain and change it accordingly inside the regex (note that the dots are escaped so they match literally). Find the list of spammers in Google Analytics in the Acquisition > All Traffic > Referrals report (you will need to monitor this section in case new spammers arrive).
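If you want to sanity-check the pattern outside GA, here is a minimal sketch in Python that tests referrer hostnames against such an exclusion regex, built here from the ghost domains named above:

    import re

    # Minimal sketch: test referrer hostnames against an exclusion regex built
    # from the ghost domains named above; the dots are escaped to match literally.
    GHOST_REGEX = re.compile(
        r".*darodar\.com|.*humanorightswatch\.org|.*o-o-6-o-o\.com|.*aliexpress\.com"
    )

    for referrer in ("forum.topic59010277.darodar.com", "www.example.com"):
        print(referrer, "->", "exclude" if GHOST_REGEX.search(referrer) else "keep")
    # forum.topic59010277.darodar.com -> exclude
    # www.example.com -> keep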

Your domain may be a reason, if it was used on another site before, or if someone used it earlier. Look at the backlinks for your domain. It's only my humble opinion.

Related

How does Mixpanel's Search Keyword work?

I'm curious about how Mixpanel tracks which search keywords an event is affiliated with. Is this from organic search (as opposed to paid search ads)?
If yes, how do they do it? At a glance, I guess organic search works this way:
The search result link goes to a proxy link with some query parameters which contain info about the (encrypted) search term and the real destination link.
The proxy then redirects to the real destination link.
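For concreteness, a minimal sketch of that idea (the URL below is illustrative, not an actual Google redirect):

    from urllib.parse import urlparse, parse_qs

    # My reconstruction of the proxy-link idea above: the search result points
    # at a redirect URL whose query carries the real destination link.
    proxy = "https://www.google.com/url?q=https://example.com/page&sa=U"
    destination = parse_qs(urlparse(proxy).query)["q"][0]
    print(destination)  # https://example.com/page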
Google Analytics knows the organic search keyword used in a session because Google intercepts it at that middle point. I'm not sure if there's any way for someone outside of Google to intercept that info (including Mixpanel). Right? (Correct me if I'm wrong.)
If there is a way for the destination website to know the organic search keyword, can I be enlightened on the method?
I don't think this is coming from organic search or paid ads, for a couple of reasons:
Most organic traffic is now over HTTPS, which makes it hard to get the search parameters. Google Analytics shows this data through the Webmaster Tools console, which is able to grab keyword data in a different way (I assume through the Google backend and not the URL itself). Otherwise, you are stuck with the "Not Provided" issue in Google Analytics.
Mixpanel only captures the default UTM parameters: utm_campaign, utm_source, utm_term, utm_medium and utm_content. Mixpanel also names these properties as expected: UTM Medium, UTM Source, etc.
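As a rough illustration of what that capture amounts to (my sketch, not Mixpanel's code), extracting the default UTM parameters from a landing URL looks like this:

    from urllib.parse import urlparse, parse_qs

    # Rough sketch (not Mixpanel's actual code): pull the default UTM
    # parameters out of a landing URL's query string.
    UTM_KEYS = ("utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content")

    def extract_utm(url):
        query = parse_qs(urlparse(url).query)
        return {key: query[key][0] for key in UTM_KEYS if key in query}

    print(extract_utm("https://example.com/?utm_source=google&utm_medium=cpc&utm_term=shoes"))
    # {'utm_source': 'google', 'utm_medium': 'cpc', 'utm_term': 'shoes'}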
I can't tell from your screenshot, but it seems this might be a custom property that your Mixpanel setup is setting, perhaps from an internal search engine? Or perhaps you're grabbing a custom URL query?
Can you provide more information as to how this event is being captured?

Fast or rolling contact importer

I'm trying to add a feature to my website that involves the typical "invite your friends" flow with help from a contact importer (CloudSponge). It's pretty popular and gets the job done, but I need something faster.
The problem with CloudSponge is that they request all contacts in one call, which could mean a long wait for someone with a lot of contacts.
I looked at their REST calls and there doesn't seem to be a way to load contacts in pieces. Do any of these contact importing services allow you to pull in a few contacts at a time (let's say 50), so that we can show our user the first 50 contacts and load the rest while updating the view? That way they don't have to wait forever for all the contacts to be pulled.
I've looked at other APIs like Context.IO but can't seem to find a solution to this one.
I built the CloudSponge API.
Early on, we decided to support imports across a variety of providers while exposing a simple and consistent interface. Pagination and rolling or real-time access to contacts were excluded in order to do that. To provide end-user feedback on the progress of the import, we added the /events endpoint.
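For example, a client can poll that endpoint to drive a progress indicator. The sketch below is only illustrative; the URL path, auth parameter and response fields are placeholders, not the documented API:

    import time
    import requests

    # Illustrative polling sketch; the URL path, auth parameter and response
    # fields below are placeholders, not the documented API.
    def poll_import_events(import_id, api_key):
        url = f"https://api.cloudsponge.com/events/{import_id}"  # placeholder path
        while True:
            event = requests.get(url, params={"api_key": api_key}).json()
            print(f"status: {event.get('status')}, contacts so far: {event.get('count')}")
            if event.get("status") in ("complete", "error"):
                break
            time.sleep(1)  # avoid hammering the endpoint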
So far import speed hasn't been a major issue, for a couple of reasons:
In general, end users with an address book of 10000+ contacts are rare (although this may not be the case for certain niches).
End users who do have this many contacts in their address book usually understand that it will take a while to import.
Having said that, speed is something that we can definitely improve upon. Here are a few ideas:
We can allow for returning only a subset of all contacts by default. For example, we currently return all contacts for Gmail, which is usually a much larger number of contacts than are actually stored in 'my contacts'.
We can implement parallel paginated imports on the server side. This will make our server processes work harder and faster to download the user's contacts from, say, Gmail. This adds complexity on our side but keeps the API untouched.
We can implement your suggestion: add rolling or real-time access to contacts in our API, either in an extended endpoint or a new version of our interface.
I'm happy to work with you on exploring these to improve our service. Send us an email: support#cloudsponge.com
Graeme

How does tracking of the web traffic source work?

Maybe a stupid question, but I can't find any answer to it on the web.
In Google Analytics it is possible to check the origin of a connection to our website. My question is: how can Google track the origin of those connections?
If there is info in document.referrer (for the JavaScript tracker; with the measurement protocol you'd have to pass a referrer as a parameter), Google identifies the source as a referrer, unless it is configured (in the defaults or per custom settings) as a search engine (which is really just a referrer with a known search parameter). Also, via the settings you can exclude URLs from the referrer reports so they will appear as direct traffic.
If there are campaign parameters, Google uses those (or else a Google click ID (gclid) from autotagging in AdWords, which serves a similar purpose). If campaign parameters or the gclid are stripped out (e.g. by redirects), AdWords ad clicks will be reported as organic search.
If there is no referrer and no campaign parameters/gclid (i.e. a direct type-in or a bookmark), Google will identify the source as a direct hit, unless you have clicked an AdWords ad before. In that case the acquisition report will report the source as CPC (cost per click); as Google puts it, they will use the last known marketing channel as the source, and direct is not a marketing channel according to Google. However, the multi-channel reports will identify those more correctly as direct visits (which is why multi-channel and acquisition reports usually do not quite match).
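Summarised as pseudo-logic (my reading of the rules above, not Google's actual implementation):

    # My summary of the attribution order described above, not Google's
    # actual implementation. `params` is the parsed landing-page query string.
    KNOWN_ENGINES = ("google.", "bing.", "yahoo.")  # illustrative list

    def classify_source(referrer, params):
        if "gclid" in params or "utm_source" in params:
            return "campaign"            # tagged traffic (autotagging or manual tags)
        if referrer:
            if any(engine in referrer for engine in KNOWN_ENGINES):
                return "organic search"  # referrer configured as a search engine
            return "referral"
        return "direct"                  # no referrer, no tags: type-in or bookmark

    print(classify_source("https://www.google.com/", {}))  # organic search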

Why does Google Analytics show fewer visits than One&One stats?

Comparing Google Analytics results to One&One's monthly hosting statistics shows a huge discrepancy.
For last month:
Google shows 1046 visits.
One&One stats show 15304 unique visits.
The Google code is in the footer, which appears on every page.
I'm aware GA only works with JavaScript enabled, but can I really assume there are that many non-JS users?
Google Analytics is a good indicator of how many humans are visiting your website.
Here are some things to check:
how many bots are in your monthly stats? You can usually find something that says User-Agent in your stats page. GoogleBot, Slurp, msnbot and others will be visiting every page on your site (a rough way to tally these is sketched after the lists below).
that you've read Google Analytics' definition of a visit.
that you have read what your statistics provider means by unique visit. Does that mean unique visitor, page view or something else?
Raw hits on servers can be misleading for a number of reasons:
If you have external style sheets, JavaScript, etc., each of them can be counted as a hit in the web server log.
RSS feed readers will periodically update without being asked to by a human
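If you want to quantify the bot share yourself, here is a rough sketch that tallies bot versus human requests in a raw access log (the file name and bot patterns are assumptions to adapt):

    import re
    from collections import Counter

    # Rough sketch: count bot vs. human requests in a raw access log.
    # The file name and the bot patterns are assumptions; extend as needed.
    BOT_PATTERN = re.compile(r"Googlebot|Slurp|msnbot|bingbot|spider|crawler", re.IGNORECASE)

    def tally(log_path):
        counts = Counter()
        with open(log_path) as log:
            for line in log:
                counts["bot" if BOT_PATTERN.search(line) else "human"] += 1
        return counts

    print(tally("access.log"))  # e.g. Counter({'bot': 9800, 'human': 1100})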
Check the page views in Google Analytics; it's possible that 1&1 is tracking unique page views instead of actual visits.
Google Analytics works for almost all users (I believe fewer than 5% have JavaScript disabled). I have had the same discrepancy; in my case the difference was zeroed out when I took the bots into account (server-side statistics usually count them, since bots produce HTTP requests). You probably have the same "problem".
Neither set of stats is wrong; they just count different things. Google Analytics is the more "accurate" one, i.e. the numbers you want to look at. The hosting stats, which look only at HTTP requests, often without filtering, are less interesting.
Blogger, and probably other sites, serve a different page template or skin to mobile visitors. In my case, that template didn't contain the Google Analytics snippet, so those hits went uncounted until I noticed and fixed it.

How do I block web scraping without blocking well-behaved bots?

I'm building an e-commerce website with a large database of products. Of course, it is nice when Google indexes all the products on the website. But what if some competitor wants to web-scrape the website and grab all the images and product descriptions?
I was observing some websites with similar lists of products, and they place a CAPTCHA, so "only humans" can read the list of products. The drawback is that it is invisible to Google, Yahoo and other "well-behaved" bots.
You can discover the IP addresses Google and others are using by checking visitor IPs with whois (on the command line or on a website). Then, once you've accumulated a stash of legit search engines, allow them into your product list without the CAPTCHA.
If you're worried about competitors using your text or images, how about a watermark or customized text?
Let them take your images and you'd have your logo on their site!
Since a potential screen-scraping application can spoof the user agent and HTTP referrer (for images) in the header and use a request schedule that is similar to a human browser's, it is not possible to completely stop professional scrapers. But you can check for these things nevertheless and prevent casual scraping.
I personally find Captchas annoying for anything other than signing up on a site.
One technique you could try is the "honey pot" method: it can be done either by mining log files or via some simple scripting.
The basic process is that you build your own "blacklist" of scraper IPs by looking for IP addresses which hit two or more unrelated products in a very short period of time. Chances are these IPs belong to machines. You can then do a reverse lookup on them to determine if they are nice (like GoogleBot or Slurp) or bad.
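A minimal sketch of that check (the window and threshold are assumptions to tune for your traffic):

    import time
    from collections import defaultdict

    # Minimal sketch of the blacklist idea above: flag IPs that request more
    # than a couple of distinct products within a short window. The window
    # and threshold are assumptions to tune for your traffic.
    WINDOW_SECONDS = 5
    MAX_DISTINCT_PRODUCTS = 2

    recent_hits = defaultdict(list)  # ip -> [(timestamp, product_id), ...]

    def looks_like_scraper(ip, product_id):
        now = time.time()
        hits = [(t, p) for t, p in recent_hits[ip] if now - t < WINDOW_SECONDS]
        hits.append((now, product_id))
        recent_hits[ip] = hits
        return len({p for _, p in hits}) > MAX_DISTINCT_PRODUCTS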
Blocking web scrapers is not easy, and it's even harder to avoid false positives.
Anyway, you can add some netranges to a whitelist and not serve any CAPTCHA to them.
All the well-known crawlers (Bing, Googlebot, Yahoo, etc.) always use specific netranges when crawling, and all those IP addresses resolve to specific reverse lookups.
A few examples:
Google IP 66.249.65.32 resolves to crawl-66-249-65-32.googlebot.com
Bing IP 157.55.39.139 resolves to msnbot-157-55-39-139.search.msn.com
Yahoo IP 74.6.254.109 resolves to h049.crawl.yahoo.net
So let's say that '*.googlebot.com', '*.search.msn.com' and '*.crawl.yahoo.net' addresses should be whitelisted.
There are plenty of whitelists you can find out on the internet.
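In practice the check is: reverse-resolve the IP, match the whitelisted suffixes, then forward-confirm the hostname so a spoofed PTR record doesn't fool you. A hedged sketch:

    import socket

    # Sketch of the whitelist check above: reverse lookup, suffix match,
    # then forward confirmation so a spoofed PTR record doesn't fool us.
    WHITELISTED_SUFFIXES = (".googlebot.com", ".search.msn.com", ".crawl.yahoo.net")

    def is_whitelisted_crawler(ip):
        try:
            hostname = socket.gethostbyaddr(ip)[0]              # reverse (PTR) lookup
        except socket.herror:
            return False
        if not hostname.endswith(WHITELISTED_SUFFIXES):
            return False
        try:
            forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward confirmation
        except socket.gaierror:
            return False
        return ip in forward_ips

    print(is_whitelisted_crawler("66.249.65.32"))  # True for a genuine Googlebot IP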
That said, I don't believe a CAPTCHA is a solution against advanced scrapers, since services such as deathbycaptcha.com or 2captcha.com promise to solve any kind of CAPTCHA within seconds.
Please have a look at our wiki, http://www.scrapesentry.com/scraping-wiki/, where we wrote many articles on how to prevent, detect and block web scrapers.
Perhaps I over-simplify, but if your concern is about server performance, then providing an API would lessen the need for scrapers and save you bandwidth and processor time.
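As a minimal sketch of that idea (Flask is an arbitrary choice here, not something the answer prescribes):

    from flask import Flask, jsonify, request

    # Minimal sketch of the API idea above: a read-only, paginated products
    # endpoint, so there is less incentive to scrape the rendered pages.
    app = Flask(__name__)
    PRODUCTS = [{"id": i, "name": f"Product {i}"} for i in range(1, 501)]  # demo data

    @app.route("/api/products")
    def products():
        page = int(request.args.get("page", 1))
        per_page = 50
        start = (page - 1) * per_page
        return jsonify(PRODUCTS[start:start + per_page])

    if __name__ == "__main__":
        app.run()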
Other thoughts listed here:
http://blog.screen-scraper.com/2009/08/17/further-thoughts-on-hindering-screen-scraping/
