Downloading all web hosts on the Internet

I'm working on a project where we run a distributed crawler to find hosts with web content on them and download it. We have a few million hosts at this point, but we're realizing it's not exactly cheap: crawling takes time, computing power, and so on. So instead of doing all of this ourselves, we're looking into whether we can leverage an outside service to get URLs.
My question is: are there services out there that provide massive lists of web hosts, or just massive lists of constantly updated URLs (which we could then parse to get hosts)? Stuff I've already looked into:
1) Search engine APIs - understandably, none of these will just let you download their entire index.
2) DMOZ and the Alexa top 1 million - these don't have nearly enough sites for what we're trying to do, though they're a good start for seed lists.
Anyone have any leads? How would you solve the problem?

Maybe Common Crawl helps: http://commoncrawl.org/
Common Crawl is a huge open repository of crawled web data; both the raw crawls and their URL indexes are available for bulk download.
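If Common Crawl fits, you don't have to crawl at all to get hosts: you can stream its published URL index and pull the hostname out of each record. A minimal sketch in Python, assuming the requests package; the crawl ID and shard path below are assumptions, so check the current index listings on commoncrawl.org before relying on them:

    # Sketch: stream one shard of the Common Crawl URL index and collect hostnames.
    # ASSUMPTION: the crawl ID and shard path are illustrative -- look up the
    # current index listings on commoncrawl.org first.
    import gzip
    import json
    from urllib.parse import urlparse

    import requests

    INDEX_SHARD = ("https://data.commoncrawl.org/cc-index/collections/"
                   "CC-MAIN-2024-10/indexes/cdx-00000.gz")  # assumed path

    hosts = set()
    with requests.get(INDEX_SHARD, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with gzip.GzipFile(fileobj=resp.raw) as gz:
            for raw_line in gz:
                # Each index line is "<SURT key> <timestamp> <JSON record>".
                try:
                    record = json.loads(raw_line.decode("utf-8").split(" ", 2)[2])
                    hosts.add(urlparse(record["url"]).netloc)
                except (IndexError, KeyError, ValueError):
                    continue  # skip malformed lines
    print(len(hosts), "unique hosts in this shard")

One shard already yields a large host list; iterating over all shards of a crawl gives you hostnames at the scale you're describing, without running the crawler yourself.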


Block traffic from referral spam bots in Azure Web App with DNN

I am sure many of you have found fake referral traffic in your Google Analytics reports/views. This makes it difficult for low- to medium-traffic sites to have accurate data for marketing. I am wondering what others are doing to exclude this traffic from their analytics reports.
If you go to your Analytics account and open Acquisition -> All Traffic -> Referrals, you will see sites like floating-share-buttons.com. These are the sites I want to filter out, which you can do by setting up a custom filter for the view as described at the bottom of this page. I have done this and it works.
I would rather block these bots from hitting the site altogether. Just a note: my sites are running as Web Apps in Azure.
I am not sure whether setting up the URL rewrite rules described here will work in Azure Web Apps, or whether it will interfere with the existing URL rewrite functions of the CMS I am using (DotNetNuke/DNN Platform 7).
I am really just looking to hear what others have done to block bots, rather than setting up filters in the analytics view's settings.
Thanks
PS: for those who are interested, this is the current filter list I am using:
webmonetizer\.net|trafficmonetizer\.org|success-seo\.com|event-tracking\.com|Get-Free-Traffic-Now\.com|buttons-for-website\.com|4webmasters\.org|floating-share-buttons\.com|free-social-buttons\.com|e-buyeasy\.com
With regard to this issue, there are a number of things you can do. You are going the route I see most commonly used, which is to block the traffic using filters in Google Analytics.
You can go the route of an IIS filter as well, just as you have linked. DNN's Friendly URLs will not necessarily be impacted by this, as those rules are processed BEFORE DNN gets the request. There is a marginal performance impact from having two things process rewrites, but nothing to be concerned about until you reach incredibly high traffic volumes.
This is also a great collection of options.
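If you do want to stop the crawler-type bots at the server rather than in Analytics, the underlying idea is just a referrer check against your blacklist. The sketch below expresses that idea as Python WSGI middleware purely as an illustration; on an Azure Web App running DNN you would express the same check as an IIS URL Rewrite rule in web.config instead, and (as the next answer explains) it only helps against bots that actually request your pages.

    # Illustration only: a server-side referrer block written as WSGI middleware.
    # ASSUMPTION: this is not DNN/IIS-specific code -- on Azure/IIS the equivalent
    # is a URL Rewrite rule -- and it only affects bots that really hit the server.
    import re

    SPAM_REFERRERS = re.compile(
        r"buttons-for-website\.com|floating-share-buttons\.com|"
        r"free-social-buttons\.com|4webmasters\.org|e-buyeasy\.com",
        re.IGNORECASE)

    class BlockSpamReferrers:
        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            referrer = environ.get("HTTP_REFERER", "")
            if SPAM_REFERRERS.search(referrer):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden"]
            return self.app(environ, start_response)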
First you need to know that there are mainly two types of spam affecting GA right now: ghosts and crawlers.
The first (ghosts) never interact with your page, so server-side solutions such as HTTP rules or an .htaccess file have no effect and will only clutter your config files.
The crawlers, as the name implies, do access your website and can be blocked this way, but there are only a few of them compared with the ghosts. To give you an idea, there are around 8 active crawlers, while there are more than 100 ghosts and the number grows every week.
This is because the ghost method is easier for the spammers to implement.
From your expression, only success-seo is a crawler; the rest are ghosts and should be handled with filters. There is a better way to get rid of all the ghosts with just one filter based on your valid hostnames (e.g. an include filter whose pattern matches only the hostnames that legitimately serve your tracking code, such as yoursite\.com), instead of creating or updating an exclude filter every week.
You can find more information about ghost spam and the solution here:
https://stackoverflow.com/a/28354319/3197362
https://moz.com/ugc/stop-ghost-spam-in-google-analytics-with-one-filter
Hope it helps.

Search engine components

I'm a middle school student learning computer programming, and I just have some questions about search engines like Google and Yahoo.
As far as I know, these search engines consist of:
Search algorithm & code
(Example: a search.py file that accepts a search query from the web interface and returns the search results)
Web interface for querying and showing results
Web crawler
What I am confused about is the Web crawler part.
Do Google's and Yahoo's web crawlers immediately search through every single webpage that exists on the WWW? Or do they:
First download all the existing webpages on the WWW, save them on their huge servers, and then search through these saved pages?
If the latter is the case, wouldn't the results appearing in a Google search be outdated, since I suppose searching through all the webpages on the WWW would take a tremendous amount of time?
PS. One more question: how exactly does a web crawler retrieve all the web pages that exist on the WWW? For example, does it try every possible web address, like www.a.com, www.b.com, www.c.com, and so on? (Although I know this can't be how it works.)
Or is there some way to get access to all the existing webpages on the World Wide Web? (Sorry for asking such a silly question.)
Thanks!!
The crawlers search through pages, download them and save (parts of) them for later processing. So yes, you are right that the results search engines return can easily be outdated, and a couple of years ago they really were quite outdated. Only relatively recently did Google and others start doing more real-time search by collaborating with large content providers (such as Twitter) to get data from them directly and frequently, though they took that real-time search offline again in July 2011. They also take note of how often a web page changes, so they know which pages to crawl more often than others, and they have special systems for this, such as the Caffeine web indexing system. See also their blog post "Giving you fresher, more recent search results".
So what happens is:
Crawlers retrieve pages
Backend servers process them
Parse text, tokenize it, and index it for full-text search (see the sketch after this list)
Extract links
Extract metadata such as schema.org for rich snippets
Later they do additional computation based on the extracted data, such as
PageRank computation
In parallel they can be doing lots of other stuff such as
Entity extraction for Knowledge graph information
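To make the "tokenize and index" step above concrete, here is a toy inverted index in Python: each token maps to the set of pages containing it, and a query is answered by intersecting those sets. Real engines shard, compress and rank this structure, but the core lookup idea is the same; the example pages are made up.

    # Toy inverted index: token -> set of page URLs containing that token.
    import re
    from collections import defaultdict

    index = defaultdict(set)

    def add_page(url, text):
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(url)

    def search(query):
        token_sets = [index[t] for t in re.findall(r"[a-z0-9]+", query.lower())]
        return set.intersection(*token_sets) if token_sets else set()

    add_page("http://example.com/a", "Search engines crawl and index the web")
    add_page("http://example.com/b", "Crawlers follow links between pages")
    print(search("index the web"))  # -> {'http://example.com/a'}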
Discovering what pages to crawl happens simply by starting with a page, following its links to other pages, following the links on those pages, and so on. In addition to that, search engines have other ways of learning about new web sites - for example, if people use their public DNS server, they learn about the pages those people visit, and the same goes for links shared on G+, Twitter, etc.
There is no way of knowing what all the existing web pages are. Some may not be linked from anywhere, and if no one publicly shares a link to them (and doesn't use Google's DNS, etc.), the search engine has no way of knowing those pages exist. Then there's the problem of the Deep Web. Hope this helps.
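The link-following loop itself can be sketched in a few lines of Python (assuming the requests and beautifulsoup4 packages; a real crawler adds robots.txt handling, politeness delays, deduplication and scheduling by change frequency):

    # Minimal breadth-first "follow the links" discovery loop.
    from collections import deque
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    def crawl(seeds, max_pages=50):
        seen = set(seeds)
        queue = deque(seeds)
        fetched = 0
        while queue and fetched < max_pages:
            url = queue.popleft()
            try:
                resp = requests.get(url, timeout=10)
            except requests.RequestException:
                continue
            fetched += 1
            soup = BeautifulSoup(resp.text, "html.parser")
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"])
                if urlparse(link).scheme in ("http", "https") and link not in seen:
                    seen.add(link)
                    queue.append(link)
        return seen

    print(len(crawl(["https://example.com/"])), "URLs discovered")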
Crawling is not an easy task (for example, Yahoo now outsources crawling to Microsoft's Bing). You can read more about it in Page and Brin's own paper: The Anatomy of a Large-Scale Hypertextual Web Search Engine.
You can find more details about storage, architecture, etc. on the High Scalability website, for example: http://highscalability.com/google-architecture

I need to speed up my site and reduce the number of file calls

My webhost is asking me to speed up my site and reduce the number of file calls.
OK, let me explain a little. My website is used about 95% of the time as a bridge between my database (on the same hosting) and my Android applications (I have around 30 that need information from the DB). The information only goes one way (for now); the apps call a JSON string like this one on the site:
http://www.guiasitio.com/mantenimiento/applinks/prlinks.php
and this webpage, shown in a WebView as a welcome message:
http://www.guiasitio.com/movilapp/test.php
This page has some images and jQuery, so I think these are what cause the heavy usage. They have told me to use some code to make the visitor's browser cache those files (which is all Greek to me, since I don't understand it). Can someone give me an idea and point me to a tutorial on how to get this done? Can the WebView in an Android app also cache these files?
All your help is highly appreciated. Thanks
Using a CDN (content delivery network) would be an easy solution if it works well for you. Essentially you are offloading the work of storing and serving static files (mainly images and CSS files) to another server. In addition to reducing the load on your current server, it will speed up your site because files will be served from a location closest to each visitor.
There are many good CDN choices. Amazon CloudFront is one popular option, though in my opinion the prize for the easiest service to set up goes to CloudFlare: they offer a free plan; simply fill in the details, change the DNS settings on your domain to point to CloudFlare, and you will be up and running.
With some fine-tuning, you can expect to reduce the requests hitting your server by up to 80%.
I use both Amazon and CloudFlare, with good results. I have found that the main thing to be careful about is to check all the scripts on your site and make sure they still work as expected. CloudFlare also has a simple setting where you can specify the cache behaviour, so that covers another detail on your list.
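If you want to see where the site stands before (or after) putting a CDN in front of it, checking the response headers tells you whether the images and jQuery files already allow browser caching; an Android WebView keeps an HTTP cache too, so the same headers help there. A minimal check in Python, assuming the requests package (the URLs are simply the two from the question):

    # Print the caching-related response headers for a few URLs.
    import requests

    urls = [
        "http://www.guiasitio.com/movilapp/test.php",
        "http://www.guiasitio.com/mantenimiento/applinks/prlinks.php",
    ]

    for url in urls:
        headers = requests.get(url, timeout=10).headers
        print(url)
        for name in ("Cache-Control", "Expires", "ETag", "Last-Modified"):
            print("  {}: {}".format(name, headers.get(name, "(not set)")))

If Cache-Control and Expires come back "(not set)", that is exactly what your host wants you to fix, either in the server configuration or via the CDN's cache settings.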
Good luck!

Difference between ad company statistics, Google Analytics and AWStats on adult sites

I have this problem: I have a web page with adult content, and for the past several months I have had PPC advertisements on it. I've noticed a big difference between the ad company's statistics for my page, the Google Analytics data, and the AWStats data on my server.
For example, the ad company tells me I have 10K pageviews per day, Google Analytics says 15K, and AWStats shows around 13K. Which system should I trust? Should I write my own (and reinvent the wheel again)? If so, how? :)
The funny thing is that I have another web page with "normal" content (an MMORPG fan site), and those numbers are roughly equal in all three systems (ad company, GA, AWStats). Do you think the difference is because the first page is adult-oriented?
And a final question that is totally off-topic: do you know of an ad company that pays per impression and doesn't mind adult sites?
Thanks for the answers!
First, you should make sure not to mix up »hits«, »files«, »visits« and »unique visits«. They all have different meanings and are sometimes named differently. I recommend looking up some definitions if you are confused about the terms.
AWStats probably has the most accurate statistics, because it has access to the web server's access.log. Unfortunately, a cached page (cached by the browser, by an ISP's proxy, or by your own caching server) might not produce a hit on the web server at all. Especially if your site is served with good caching hints that don't force revalidation and you run your own web cache (e.g. Squid) in front of your site, the number will be considerably lower, because it only measures the work done by the web server.
Google Analytics, on the other hand, can only count requests from users who haven't blocked Google Analytics and have JavaScript enabled (but it will count pages served from a web cache). So this count can be influenced by the user, but isn't affected by web caches.
The ad company is probably simply counting the number of requests it receives from your site (probably based on its own access.log). So, to be counted there, the ad must not be cached and must not be blocked by the user.
So, as you can see, it's not that easy to get a single correct value. But as long as you compare the measured values with those from previous months, you should at least get a (nearly) correct rate of growth.
And your porn site probably serves a large amount of static content (e.g. images from disk), and most web servers are really good at sending caching hints automatically for static files. Your MMORPG site, on the other hand, probably consists mostly of dynamic scripts (PHP?) which don't send any caching hints at all, and web servers can't determine caching headers for dynamic content automatically. That's at least my explanation, without knowing your application and server configuration :)
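For the "should I write my own?" part: a log-based counter of the kind AWStats uses can be approximated in a few lines. The sketch below assumes the common Apache/nginx "combined" log format (an assumption about your server), counts successful GET requests for pages (not images/CSS/JS) and skips obvious bots by user agent; it's only meant as a sanity check against the other three numbers, not a replacement for them.

    # Rough pageview count from an access.log in "combined" format (assumption).
    import re

    LINE = re.compile(
        r'\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
        r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"')
    ASSETS = re.compile(r"\.(png|jpe?g|gif|css|js|ico|svg|woff2?)(\?|$)", re.I)
    BOTS = re.compile(r"bot|crawler|spider|curl|wget", re.I)

    pageviews = 0
    with open("access.log") as log:
        for line in log:
            m = LINE.match(line)
            if (m and m["method"] == "GET" and m["status"].startswith("2")
                    and not ASSETS.search(m["path"]) and not BOTS.search(m["agent"])):
                pageviews += 1
    print(pageviews, "pageviews")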

Plone CMS: how heavy are search requests compared to typical CMS GET requests?

I fear that on a large site (0.5 million documents) enabling search capability is asking for a DoS. If so, how can this threat be mitigated? Can search run on a different ZEO instance?
Plone's portal_catalog is rather efficient/fast/optimized. It's not like an SQL Query where you can construct searches that take minutes to complete.
The heavy part is usually "waking up" objects when presenting the search results; you should work as much as possible with the metadata (so-called "brains") that the catalog returns. This is what Plone tries to do by default anyway.
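To make that concrete, here is roughly what the difference looks like inside a Plone script or browser view (a sketch assuming a stock portal_catalog with its usual metadata columns; context is the usual acquisition context you get in a view or Script (Python)):

    # Sketch: query the catalog and stay on the cheap "brains" as long as possible.
    from Products.CMFCore.utils import getToolByName

    catalog = getToolByName(context, "portal_catalog")  # `context` as in a view/script
    brains = catalog(SearchableText="annual report")

    # Cheap: metadata is read straight from the catalog, no objects are loaded.
    for brain in brains:
        print(brain.Title, brain.getURL())

    # Expensive: getObject() loads the full content object from the ZODB
    # ("wakes it up") -- avoid this in result listings when metadata is enough.
    first_objects = [brain.getObject() for brain in brains[:10]]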
But still, you can use a separate ZEO instance for handling search requests if you feel this may become a bottleneck. Just make sure requests for /search and /search_form (or generically, /search*) end up at that specific ZEO instance. How you do this depends on your current load-balancing setup (Apache, Squid, nginx, etc.).
With that many documents you will want to investigate a dedicated search system - Plone's text indexes are really not that great. Take a look at http://plone.org/products/collective.solr for a Plone integration with http://lucene.apache.org/solr/, or at http://pypi.python.org/pypi/collective.gsa if you have a Google Search Appliance.
Plone's search engine is great in that it is fully integrated and ships with the default install, but when your site grows to the area of 500k documents you generally want a more solid search.
We have used Solr with great success in large projects, and several integrations with Plone already exist:
http://plone.org/products/collective.solr
http://plone.org/products/alm.solrindex
http://plone.org/products/collective.recipe.solrinstance
