Automating the RapidShare piracy file takedown process - mozilla

I found a new search engine that speeds up finding pirated files on RapidShare. How could I automate a tool that searches for our product with this engine and outputs a list of the RapidShare URLs, which would then be sent to abuse#rapidshare.com?
Search engine:
http://rapidlibrary.com/
(Note: the captcha image appears just once there.)
Below is a nice script that could perhaps do this pretty easily:
http://www.nasser.me/ubiquity/rapidsharecom-link-checker/

I thought about this in the past and, being a "tv show pirate" myself, it kinda annoys me that free torrent sites like The Pirate Bay and Mininova are being taken down while other, not so free sites like Rapidshare, Megaupload and so on host the files and continue to make millions out of piracy.
The marketing model of those sites is viral: the more a user spreads his link, the more points he receives and the less he has to pay for his "subscription" in the future, so it is reasonable to suppose that those same links are well spread over the Internet.
I would just search and scrape all the major warez forums out there for a week or two; after that, a web search should find the remaining blogs and sites that still point to the pirated file.
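Whatever pages end up being searched or scraped, pulling the links out of the downloaded text is the easy part. Here is a minimal sketch in Python, assuming file links follow the rapidshare.com/files/<id>/<filename> shape; the pattern and the function name are illustrative guesses, not something taken from the tools linked above.

```python
import re

# Assumed link shape: http(s)://[www.]rapidshare.com/files/<numeric id>/<filename>
RAPIDSHARE_LINK = re.compile(
    r"https?://(?:www\.)?rapidshare\.com/files/\d+/[^\s\"'<>]+"
)

def extract_rapidshare_links(page_text):
    """Return the unique matching links in page_text, in order of first appearance."""
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(dict.fromkeys(RAPIDSHARE_LINK.findall(page_text)))
```

Running this over every fetched forum page and concatenating the results would give the list to forward to the abuse address. Fetching the pages, and dealing with the captcha, is left out here.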

Related

Sort domains by number of public web pages?

I'd like a list of the top 100,000 domain names sorted by the number of distinct, public web pages.
The list could look something like this
Domain Name 100,000,000 pages
Domain Name 99,000,000 pages
Domain Name 98,000,000 pages
...
I don't want to know which domains are the most popular. I want to know which domains have the highest number of distinct, publicly accessible web pages.
I wasn't able to find such a list in Google. I assume Quantcast, Google or Alexa would know, but have they published such a list?
For a given domain, e.g. yahoo.com you can google-search site:yahoo.com; at the top of the results it says "About 141,000,000 results (0.41 seconds)". This includes subdomains like www.yahoo.com, and it.yahoo.com.
Note also that some websites generate pages on the fly, so they might, in fact, have infinite "pages". A given page will be calculated when asked for, and forgotten as soon as it is sent. Each can have a link to the next page. Since many websites compose their pages on the fly, there is no real difference (except that there are infinite pages, which you can't find out unless you ask for them all).
Keep in mind a few things:
Many websites generate pages dynamically, leaving a potentially infinite number of pages.
Pages are often behind security barriers.
Very few companies are interested in announcing how much information they maintain.
Indexes go out of date as they're created.
What I would be inclined to do for specific answers is mirror the sites of interest using wget and count the pages.
wget -m --wait=9 --limit-rate=10K http://domain.test
Keep it slow, so that the company doesn't recognize you as a Denial of Service attack.
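Once the mirror finishes, counting the pages is a directory walk. A rough sketch in Python, assuming wget wrote the mirror into a directory named after the host (its default behaviour when mirroring) and that a "page" means a file ending in .html or .htm; the function name is made up for illustration.

```python
import os

def count_pages(mirror_dir, extensions=(".html", ".htm")):
    """Count files under mirror_dir whose names end with one of extensions."""
    total = 0
    for _root, _dirs, files in os.walk(mirror_dir):
        # str.endswith accepts a tuple, so one call checks every extension.
        total += sum(1 for name in files if name.lower().endswith(extensions))
    return total
```

For the command above, that would be count_pages("domain.test"). Pages that wget saved without an extension are missed by this check, so treat the result as a lower bound.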
Most search engines will also let you search their index by site, though the counts on the result pages are only good for a rough order of magnitude, and there's no way to know how much of the site they've indexed.
At a glance I don't see where they keep or expose their databases, but if you go down the search engine path, you might also be interested in the Seeks and YaCy search engine projects.
The only organization I can think of that might (a) have the information easily available and (b) be friendly and transparent enough to want to share it would be the folks at The Internet Archive. Since they've been archiving the web with their Wayback Machine for a long time and are big on transparency, they might be a reasonable starting point.

Extracting user interests from social profiles

This is my first time dabbling in NLP so please excuse my ignorance. I'm looking for a method to extract interests/likes/hobbies from users' social profiles. Here is an example where all the interests/likes/hobbies are in bold:
"I consider myself a pretty diverse character... I'm a professional
wrestler, but I'd take a bullet for Wall•E. I train like a one-man genocide machine in the gym, but I cried at
"Armageddon." I'll head bang to AC/DC, and I'm seriously
considering getting a Legend of Zelda tattoo. I'm 420-friendly. I
like to party it up with the frat crowd one night, hang out with
my Burning Man friends the next, play Halo and World of
Warcraft the next, and jam with friends that aren't any younger than
40 the next. My youngest friend is 16, my oldest friend is 66. I'll
sing karaoke at the bars, and I'm my friends' collective
psychiatrist/shoulder."
The profiles are plain text. There are no meta tags or ids associated with any of it, it's just a paragraph of text.
My naive idea was to take each noun and match it against Freebase to see if it's an activity/artist/movie/book etc. The problem is that although most entities mentioned will be things the user likes, she will also mention things she doesn't like, and I have no means of distinguishing the two.
I have 2 questions:
What sub field of NLP should I be looking at? Some googleable algorithms/techniques/authors would be greatly appreciated.
How hard is this problem?
Thanks!
First, unless using NLP to do this is a particular objective for you, check your problem domain to see if you can avoid it completely.
For instance:
Do these profiles have tags (supplied either by the Site or by the user)?
What does the Site's API make available (assuming that's how you are accessing this data; if you are scraping it, then this of course doesn't apply)? A good example is Facebook. If you read a user's posts, you'll see words like "wrestler", "karaoke", etc., but if you look at what fields are exposed via the Graph API, you'll see that these activities nearly always have an associated FB ID.
I am not a specialist in this field, but I can recommend a couple of resources directed to NLP and which are accessible to the non-specialist or novice. The first is a text processing API. This simple web service uses REST and JSON IO. It is free and seems to have a fairly large rate limit.
This API appears to rely heavily on the excellent Natural Language Toolkit (NLTK), a mature, stable Python library that includes modules directed at the problem in your Question, e.g., Sentiment Analysis, Tagging and Chunk Extraction, etc.
Which particular sub-domain is most relevant to solving the Question in the OP? I don't know, but I suspect there's a module somewhere in NLTK that does what you need. Finding that module is hopefully just a matter of skimming the API Documentation (which is organized by module) and reading the Getting Started section, which contains an excellent survey of NLTK's modules as well as demos for each of them.
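To make the like/dislike problem from the Question concrete, here is a deliberately crude sketch in plain Python. It is not NLTK and not a real sentiment model: KNOWN_ENTITIES is a hypothetical stand-in for a Freebase lookup, NEGATIVE_CUES is an invented word list, and a whole sentence is marked negative as soon as any cue appears in it.

```python
import re

# Hypothetical stand-in for a knowledge-base lookup such as Freebase.
KNOWN_ENTITIES = {
    "ac/dc": "artist",
    "armageddon": "movie",
    "halo": "game",
    "karaoke": "activity",
}

# Invented cue list; real sentiment analysis needs far more than this.
NEGATIVE_CUES = ("hate", "dislike", "can't stand", "don't like")

def extract_interests(text):
    """Split text into sentences and sort known entities into likes and dislikes."""
    likes, dislikes = [], []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        negative = any(cue in lowered for cue in NEGATIVE_CUES)
        for name, kind in KNOWN_ENTITIES.items():
            if name in lowered:  # plain substring match, so false hits are possible
                (dislikes if negative else likes).append((name, kind))
    return likes, dislikes
```

Even this toy version shows where the difficulty lies: the entity lookup is a dictionary access, while deciding polarity is the part that needs actual sentiment analysis.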

The New Google Takeout

Is there a way to include one's search history within Google Takeout?
https://www.google.com/takeout/
Takeout purports to let you download everything stored within your Google account.
As far as I can tell, no. It's an obvious and mysterious gap in service.
You can download your recent Google searches via an RSS feed:
https://www.google.com/history/?output=rss
You can add parameters to the URL. The maximum per query is 1000. num is the number of searches to return and start is how far back to begin drawing from. Like so:
https://www.google.com/history/?output=rss&num=1000&start=4000
Unfortunately, the feed becomes incomplete (as in not actually all of your searches) after a few thousand. I have over 40,000 searches on Google, but I can only go back 7000 on this RSS feed. Bummer. This means we still don't have access to all the data they hold on us.
Please prove me wrong!
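Paging through the feed with num and start can be scripted. A small sketch in Python that only builds the URLs, assuming start=0 is the most recent page; the function name is made up, and actually fetching the feed needs a signed-in session, which is not shown.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.google.com/history/"

def history_feed_urls(total, page_size=1000):
    """Yield feed URLs covering `total` searches, `page_size` searches per URL."""
    for start in range(0, total, page_size):
        query = urlencode({"output": "rss", "num": page_size, "start": start})
        yield f"{BASE_URL}?{query}"
```

Calling list(history_feed_urls(7000)) gives seven URLs, which matches how far back the feed went for me.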
Today is the last day to delete your search history, as suggested by the EFF (https://www.eff.org/deeplinks/2012/02/how-remove-your-google-search-history-googles-new-privacy-policy-takes-effect), before the new Google terms come into force, linking that history with all the other Google products. So if you can't grab it today, delete it so that it is at least partially anonymised eventually, or be tentacularised.

I can't figure out where to start with GIS application development, or which technology to select

I am very new to GIS development, and to be frank I have no background in it at all. I searched the web, but the tutorials I found seemed to assume the reader has some background knowledge.
The thing is that I am confused about what to read or learn. There seem to be lots of technologies, and I feel lost, since some speak about OpenLayers, GeoServer, MapServer, Google Maps, and OpenStreetMap.
So here is what I am supposed to develop. I hope you can give me advice about which technology to use and where I should start reading, given that I know almost nothing.
Case 1: a closed system for about 20 users only, who can specify locations on the map; the web application will store the latitude and longitude of the locations and show the markers. I wanted to use the Google Maps API, but I dropped that since their license requires you to purchase the service if the system is a closed one. So what technology should I use in such a case? I need a free option. Also, I will only be using a web server, so if the solution involves running my own GeoServer or something like that, I won't be able to do it.
Case 2: I am supposed to display the roads and routes between two given points, and probably add some notes on the map. For this case I can use my own map server/geo server, but again I want your suggestions.
Of course, the solution needs to be open source.
Finally, I hope you can tell me what to start reading first.
Start by looking over at https://gis.stackexchange.com/, starting with the tags [web-mapping] and
Some topics in particular you may want to look at are:
https://gis.stackexchange.com/questions/8113/steps-to-start-web-mapping
https://gis.stackexchange.com/questions/8238/where-how-to-learn-about-getting-started-with-web-gis
https://gis.stackexchange.com/questions/13868/looking-for-a-developer-friendly-web-gis
As for skills and tutorials, look at:
https://gis.stackexchange.com/questions/17227/free-gis-workshops-tutorials-and-applied-learning-material
https://gis.stackexchange.com/questions/913/web-gis-development-skill-sets

How to Prevent the Google Duplicate Content Problem | Multi Site

I'm about to launch a set of multi-domain affiliate sites which have one thing in common: their content. Having read about duplicate content and Google, I'm a little worried that the parent domain or the sub-sites could get banned from the search engine for duplicated content.
If I have 100 sites with similar look and feel and basically same content with some minor element changes, how will I go on preventing banning, indexing these correctly?
Should I just prevent the sub-sites from being indexed completely with robots?
If so, how will people be able to find their site? I actually think the parent is the only one that should be indexed to avoid this, but I would love to hear other expert thoughts.
Google recently released an update that lets you include a link tag in the head of pages that use duplicated content, pointing to the original version. These are called canonical links, and they exist for the exact reason you mention: to be able to use duplicated content without penalisation.
For more information, look here:
http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
This doesn't mean that your sites with duplicated content will be ranked well for that content, but it does mean the original is "protected". For decent ranking of the duplicated sites you will need to provide unique content.
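The canonical link itself is a single element in the head of each duplicate page. A minimal sketch in Python of generating it from a template, where the function name and the example.com URL are placeholders rather than anything from your setup:

```python
from html import escape

def canonical_link_tag(original_url):
    """Return the <link> element that points a duplicate page at its original."""
    # Escape the URL so characters such as & are valid inside the attribute.
    return f'<link rel="canonical" href="{escape(original_url, quote=True)}">'

print(canonical_link_tag("http://www.example.com/product?id=1&lang=en"))
```

Each sub-site page would emit this tag with the URL of the matching page on the parent site.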
If I have 100 sites with similar look
and feel and basically same content
with some minor element changes, how
will I go on preventing banning,
indexing these correctly?
Unfortunately for you, this is exactly what Google downgrades in its search listings, to make search results more relevant, and less rigged / gamed.
Fortunately for us (i.e. users of Google), their techniques generally work.
If you want 100s of sites, to be properly ranked, you'll need to make sure they each have unique content.
You won't get banned straight away. You will have to be reported by a person.
I would suggest launching with the duplicate content and then iterating over it in time, creating unique content that is dispersed across your network. This will ensure that not all sites are spammy copies of each other and will result in Google picking up the content as fresh.
I would say go ahead with it, but try to work in as much unique content as possible, especially where it matters most (page titles, headings, etc).
Even if the sites did get banned (more likely they would just have results omitted, but a ban is certainly possible in your situation), you would be in basically the same spot you would have been in if you had decided to "noindex" all the sites.
