The New Google Takeout - search

Is there a way to include one's search history within Google Takeout?
https://www.google.com/takeout/
Takeout purports to let you download everything stored within your Google account.

As far as I can tell, no. It's an obvious and mysterious gap in the service.
You can download your recent Google searches via an RSS feed:
https://www.google.com/history/?output=rss
You can add parameters to the URL: num is the number of searches to return (1000 is the maximum per query) and start is how far back to start from. Like so:
https://www.google.com/history/?output=rss&num=1000&start=4000
Unfortunately, it starts to become somewhat reduced (as in, not actually all of your searches) after a few thousand. I have over 40,000 searches on Google, but I can only go back 7,000 with this RSS feed. Bummer. This means we still don't have access to all the data of ours that they hold.
Please prove me wrong!
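For anyone who wants to script the export before it degrades, here is a rough sketch of paging through that feed with the num/start parameters described above. It assumes you have exported your authenticated Google session cookies to a Netscape-format cookies.txt file (the feed does not work anonymously), and the filenames are just my own choices:

```python
# Rough sketch: walk back through the search-history RSS feed in blocks of 1000.
# Assumes an exported cookies.txt with your Google session cookies.
import http.cookiejar
import urllib.request

BASE = "https://www.google.com/history/?output=rss&num=1000&start={start}"

jar = http.cookiejar.MozillaCookieJar("cookies.txt")   # hypothetical cookie export
jar.load()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

for start in range(0, 8000, 1000):                     # 1000 searches per request
    with opener.open(BASE.format(start=start)) as resp:
        data = resp.read()
    with open(f"history_{start}.xml", "wb") as out:
        out.write(data)
```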

Today is the last day to delete your search history, as suggested by the EFF (https://www.eff.org/deeplinks/2012/02/how-remove-your-google-search-history-googles-new-privacy-policy-takes-effect), before the new Google terms come into force, linking that history with all the other Google products. So if you can't grab it today, delete it so as to partially anonymise it eventually, or be tentacularised.

Related

Searching YouTube for videos with a specific range of views, e.g. between 9,000,000 and 11,000,000

First time posting.
I wanted to ask if anyone knows how I can search YouTube for, let's say, music videos that have been viewed a set number of times. Like the title says, for example, between 9 and 11 million times.
One reason I want to do this is that I want to find good music I haven't heard before. The logic I'm working on is that the Got Talent-type videos that get viewed millions of times are generally viewed that many times for one of two reasons: 1) they're amazing, or 2) they're embarrassingly horrible.
And though I don't think a song being popular will necessarily mean I'll like it, I'm hoping this method will be successful to some degree.
Another reason is to look for trailers for independent films, with similar logic as above. With these movies, though, I think I only hear about them six months to a year after they've been released because they're flying under the radar.
If I were able to search for movie trailers with 'x' number of views, though... for example, between 500,000 and a million, maybe I'd be able to find movies I'll like more quickly than by waiting for a friend to mention them to me.
Any help would be greatly appreciated, as I've wanted to be able to perform these kinds of searches for a while now.
Thanks.
You will need to use the YouTube Data API v3.
I haven't written this exact request, but it looks like you can list videos and then filter with chart=mostPopular:
https://developers.google.com/youtube/v3/docs/videos/list
Perhaps a bit of background reading on the API would help too:
https://developers.google.com/youtube/v3/
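A minimal sketch of that videos.list chart=mostPopular request, assuming the google-api-python-client package and a placeholder API key:

```python
# Minimal sketch of listing the 'mostPopular' chart via the YouTube Data API v3.
# YOUR_API_KEY is a placeholder; install google-api-python-client first.
from googleapiclient.discovery import build

youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")

response = youtube.videos().list(
    part="snippet,statistics",
    chart="mostPopular",   # the chart filter mentioned above
    regionCode="US",
    maxResults=25,
).execute()

for video in response["items"]:
    print(video["statistics"]["viewCount"], video["snippet"]["title"])
```

Note that the mostPopular chart on its own doesn't take a view-count range, so you would still filter the results yourself.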
First off, you would need the YouTube Data API. "v3" means nothing in particular; it's simply the current version, like "Windows 10".
The API lets you get a video's view count, but it doesn't let you filter by a range like 9 million to 11 million.
YouTube's own search function is pretty sophisticated. For instance:
https://www.youtube.com/results?search_query=movie+trailer&search_sort=video_view_count&filters=month. This gives all results for "movie trailer" within the last month, sorted by view count. You can customize the URL, e.g. "week" instead of "month" would return only trailers from the last week, or "year", etc. Essentially this is a "Videos.list mostPopular" query with a subject filter.
I have a few YouTube API scripts, and I hardly think it's worth the hassle to do it that way when YouTube's advanced search gets you 99% of the way there. If you did, you would need to do a Search.list query for a given subject (e.g. "movie trailer"), limited to a given time frame (e.g. the last month). Then, for each video ID, make a Videos.list query to get its view count. Then print them all, sorted by views.
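A hedged sketch of that two-step flow (Search.list for the subject and time frame, then Videos.list for statistics, then filter to the view range), again assuming google-api-python-client and a placeholder API key; the query, date, and range are just examples:

```python
# Sketch: search for a subject, then fetch view counts and keep a view range.
from googleapiclient.discovery import build

youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")

# Step 1: Search.list for recent videos matching the subject.
search = youtube.search().list(
    q="movie trailer",
    part="id",
    type="video",
    order="viewCount",
    publishedAfter="2014-01-01T00:00:00Z",   # example time window
    maxResults=50,
).execute()
video_ids = [item["id"]["videoId"] for item in search["items"]]

# Step 2: Videos.list for statistics, then filter to the desired range.
videos = youtube.videos().list(
    part="snippet,statistics",
    id=",".join(video_ids),
).execute()

in_range = [
    (int(v["statistics"]["viewCount"]), v["snippet"]["title"])
    for v in videos["items"]
    if 9_000_000 <= int(v["statistics"]["viewCount"]) <= 11_000_000
]
for views, title in sorted(in_range, reverse=True):
    print(f"{views:>12,}  {title}")
```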

Using GMail as an interface to my database

What if I chose to use GMail's awesome mail-archive search capabilities on my database? What if, for every transaction my database is responsible for, I emailed the details of that transaction to a GMail address that exists for the sole purpose of searching and retrieving transactions?
Anyone logged into that account could search by labels, invoice numbers, customer names - whatever - using Google's search engine. The results are presented as 'email messages'.
Imagine a user working from the standard (web-based) GMail account who searches for an invoice number via GMail's search box - he's returned every instance where the db did anything involving that unique number. Opening any of these 'email messages' would show the static text included at the time of the transaction (historical and tracking gold), but could also carry a Gadget that could transform the 'message' into an editor so as to execute a new transaction on that invoice.
Imagine further that I wasn't the first one to think of this - because surely I'm not - and even if I were, I'm not smart enough to execute the idea alone.
Are you aware of efforts similar to this?
thx
[?belongs on superuser instead?]
An interesting idea; however, given your search parameters, it might be unreliable. Although Gmail's search is great, I have found issues when searching for partial terms. Case in point: I had an email whose subject line was "stuffas". When I searched for "stuffa" I got no results; when I searched for "stuffas" I got the email in the search results. Additionally, I had an email with an 8-digit number in the body. When I searched for 7 of the 8 digits, I got no results, but when I entered all 8 digits, the email appeared in the results. So search in Gmail may not be as powerful a solution as you think. Again, this is my experience; I'd love to hear if someone is able to partial-search numbers in Gmail.
I just had the same idea, four years after you. It still doesn't look like this has 'been done before' in any production sense. But now, in 2014, I really don't see why not. Python packages for interfacing with Gmail are already there and dead simple to use. It does not take a whole lot of abstraction to turn this into a generalized key-value store.
It's probably not the fastest database, and not the best solution for everything; but as an easy-to-use, easy-to-search, trivial-to-configure, 100%-uptime, cloud-stored-and-backed-up, free-as-in-beer database, it's pretty epic as far as I can see.
Has anyone else seen examples of this having been done before?
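For what it's worth, here is a minimal sketch of the idea using only the Python standard library. The address, app password, and the "KV:" subject convention are my own assumptions, not an existing package:

```python
# Toy sketch of "Gmail as a key-value store": put() mails the value to yourself
# with the key in the subject; get() searches IMAP for that subject.
import email
import imaplib
import smtplib
from email.message import EmailMessage

ADDR, APP_PASSWORD = "you@gmail.com", "app-password"   # hypothetical credentials

def put(key, value):
    msg = EmailMessage()
    msg["From"] = msg["To"] = ADDR
    msg["Subject"] = f"KV:{key}"        # key lives in the subject line
    msg.set_content(value)              # value lives in the body
    with smtplib.SMTP_SSL("smtp.gmail.com") as smtp:
        smtp.login(ADDR, APP_PASSWORD)
        smtp.send_message(msg)

def get(key):
    with imaplib.IMAP4_SSL("imap.gmail.com") as imap:
        imap.login(ADDR, APP_PASSWORD)
        imap.select("INBOX")
        _, data = imap.search(None, f'(SUBJECT "KV:{key}")')
        ids = data[0].split()
        if not ids:
            return None
        _, raw = imap.fetch(ids[-1], "(RFC822)")   # newest write wins
        return email.message_from_bytes(raw[0][1]).get_payload()
```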
Edit: having thought about it some more, there are several reasons why this is a bad idea:
Gmail does not permit random access from different locations; it will block your account. Quite a showstopper.
Amazon SimpleDB also gives you a simple key-value store with the same characteristics (plus good Python support), and it isn't THAT big of a pain to set up if you are willing to spend a day wrapping your head around it. It is also effectively free for the kind of traffic you'd be able to cram into a Gmail account.
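For comparison, a rough sketch of the SimpleDB route using boto 2's sdb module; the region, domain name, credentials, and item data are all placeholders:

```python
# Rough sketch of the SimpleDB alternative with boto 2 (placeholders throughout).
import boto.sdb

conn = boto.sdb.connect_to_region(
    "us-east-1",
    aws_access_key_id="YOUR_KEY",
    aws_secret_access_key="YOUR_SECRET",
)
domain = conn.create_domain("my-kv-store")     # no-op if it already exists

domain.put_attributes("invoice-1001", {"status": "paid", "total": "49.95"})
print(domain.get_attributes("invoice-1001"))
```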

What is the correct way to implement a massive hierarchical, geographical search for news?

The company I work for is in the business of sending press releases. We want to make it possible for interested parties to search for press releases based on a number of criteria, the most important being location. For example, someone might search for all news sent to New York City, Massachusetts, or ZIP code 89134, sent from a governmental institution, under the topic of "traffic". Or whatever.
The problem is, we've sent literally hundreds of thousands of press releases. Searching is slow and complex. For example, a press release sent to Queens, NY should show up in the search I mentioned above even though it wasn't specifically sent to New York City, because Queens is a subset of New York City. We may also want to add "and", "or", negation, and text search to the query to create complex searches. These searches also have to be fast enough to power dynamic RSS feeds.
I really don't know anything about search theory, or how this is properly done. The way we're getting by right now is using a data mart that stores the locations each release was sent to in a single table. However, because of the subset issue mentioned above, the data mart is gigantic, with millions of rows. And we haven't even implemented cities yet; there are about 50,000 cities in the United States, which will increase the size of the data mart so much that I'm afraid it just won't work anymore.
Anyway, I realize this is not a simple question and there won't be a "do this" answer. However, I'm hoping one of you can point me in the right direction so I can learn how massive searches are done, because I really know nothing about it, and such a search engine is turning out to be incredibly difficult to make. Thanks! I know there must be a way, because if Google can search the entire internet, we must be able to search our own database :-)
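To make the subset problem concrete, here is a toy sketch of what our data mart effectively does: each release is indexed under its target location plus every location that contains it, so a query for "New York City" also matches a release sent only to Queens. The hierarchy and release data are made-up examples:

```python
# Toy sketch: denormalize location ancestors at index time so subset queries
# become a simple lookup. Hierarchy and releases below are made-up examples.
from collections import defaultdict

PARENT = {                      # child -> containing location
    "Queens": "New York City",
    "New York City": "New York",
    "New York": "USA",
}

def ancestors(location):
    """Yield the location itself plus everything that contains it."""
    while location is not None:
        yield location
        location = PARENT.get(location)

index = defaultdict(set)        # location -> set of release ids

def index_release(release_id, sent_to):
    for loc in sent_to:
        for place in ancestors(loc):
            index[place].add(release_id)

index_release("PR-1001", ["Queens"])
index_release("PR-1002", ["New York City"])

print(index["New York City"])   # both releases match the broader query
```

The flip side is that every release is stored once per containing location, which is exactly what makes the table so large.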
Google can search the entire internet - and your data too, via a Google Search Appliance!

Automating the RapidShare piracy file takedown process

I found a new search engine that speeds up finding pirated copies of files on RapidShare. How could I automate a tool that finds our product using this engine and outputs the list of RapidShare URLs, which would then be sent to abuse#rapidshare.com?
search engine:
http://rapidlibrary.com/
(Note: the captcha image appears just once there.)
Below is a nice script that could perhaps do this pretty easily:
http://www.nasser.me/ubiquity/rapidsharecom-link-checker/
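A rough sketch of the kind of scraper I have in mind; the query URL pattern and the idea of pulling rapidshare.com links straight out of the result HTML are assumptions, so treat it as an outline rather than a working tool:

```python
# Outline: query the search engine for a product name and collect any
# rapidshare.com links found in the result page. SEARCH_URL is hypothetical.
import re
import urllib.parse
import urllib.request

SEARCH_URL = "http://rapidlibrary.com/index.php?q={query}"   # hypothetical pattern
RAPIDSHARE_LINK = re.compile(r"https?://rapidshare\.com/files/\S+", re.IGNORECASE)

def find_pirated_links(product_name):
    url = SEARCH_URL.format(query=urllib.parse.quote_plus(product_name))
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return sorted(set(RAPIDSHARE_LINK.findall(html)))

if __name__ == "__main__":
    for link in find_pirated_links("Our Product Name 1.0"):
        print(link)   # this list would go into the abuse report
```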
I have thought about this in the past, and being a "TV show pirate" myself, it kind of annoys me that free torrent sites like The Pirate Bay and Mininova get taken down while other, not-so-free sites like RapidShare, Megaupload and so on host the files and continue to make millions out of piracy.
The marketing model of those sites is viral: the more a user spreads his link, the more points he receives and the less he has to pay for his "subscription" in the future, so it's only natural to suppose that those same links are well spread over the Internet.
I would just search and scrape all the major warez forums out there for a week or two, and after that a web search should find all the remaining blogs/sites that still point to the pirated file.

How to prevent the Google duplicate content problem with multiple sites

I'm about to launch multi-domain affiliate sites which have one thing in common: the content. Reading about the problem of duplicate content and Google, I'm a little worried that the parent domain or the sub-sites could get banned from the search engine for duplicated content.
If I have 100 sites with a similar look and feel and basically the same content with some minor element changes, how do I go about preventing a ban and getting these indexed correctly?
Should I just prevent the sub-sites from being indexed completely with robots.txt?
If so, how will people be able to find their site... I actually think the parent is the only one that should be indexed to avoid this, but I would love to hear other expert thoughts.
Google have recently released an update that lets you include a link tag in the head of pages that use duplicated content, pointing to the original version. They're called canonical links, and they exist for exactly the reason you mention: to be able to use duplicated content without penalisation. For example, each duplicate page would carry something like <link rel="canonical" href="http://www.example.com/original-page" /> in its <head>.
For more information look here:
http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
This doesn't mean that your sites with duplicated content will rank well for that content, but it does mean the original is "protected". For decent ranking of the duplicate sites, you will need to provide unique content.
"If I have 100 sites with a similar look and feel and basically the same content with some minor element changes, how do I go about preventing a ban and getting these indexed correctly?"
Unfortunately for you, this is exactly what Google downgrades in its search listings, to make search results more relevant and less rigged/gamed.
Fortunately for us (i.e. users of Google), their techniques generally work.
If you want hundreds of sites to be properly ranked, you'll need to make sure they each have unique content.
You won't get banned straight away; you would have to be reported by a person first.
I would suggest launching with the duplicate content and then iterating over it in time, creating unique content that is dispersed across your network. This will ensure that not all the sites are spammy copies of each other, and will result in Google picking up the content as fresh.
I would say go ahead with it, but try to work in as much unique content as possible, especially where it matters most (page titles, headings, etc.).
Even if the sites did get banned (more likely they would just have results omitted, but it is certainly possible they would be banned in your situation), you're basically at the same spot you would have been in if you had decided to "noindex" all the sites.
