I am a beginner in web technologies, so sorry if this is a lame question.
Wikipedia and other such sites host millions of web pages. How does search work across these pages? Do they store all the HTML pages in memory? If so, what data structures could be used to hold the pages in memory and search them so fast?
Wikipedia uses the Lucene full text search engine. Another popular full text search engine is Sphinx.
These pages describe in quite some detail how search engines work:
Wikipedia: Web search
How Internet Search Engines Work
How Search Engines Work (rather nice explanation)
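On the data structure part of the question: full-text engines such as Lucene are built around an inverted index, a map from each term to the documents that contain it, so a query never has to scan the stored pages. Below is a toy sketch of that idea in Python. It is not how Lucene itself is implemented (real engines add on-disk storage, compression, stemming and ranking), and the names `tokenize`, `build_index` and `search` plus the sample pages are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map every term to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in tokenize(text):
            index[term].add(page_id)
    return index


def search(index, query):
    """Return the ids of pages that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Start from the pages matching the first term, then intersect.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample data standing in for real pages.
pages = {
    "Lucene": "Lucene is a full text search library",
    "Sphinx": "Sphinx is a full text search server",
    "Python": "Python is a programming language",
}
index = build_index(pages)
print(sorted(search(index, "full text search")))
print(sorted(search(index, "search library")))
```

A lookup costs one dictionary access per query term plus a set intersection, which is why search stays fast even when the number of pages is large.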
Greetings fellow developers,
I would like to ask for help regarding the following problem: Is there a way to request removal of stored website data from search engines? Most of the links that show up when searching my domain are old and non-existent.
What I've found from personal research regarding this question/problem:
Removal requests can be made individually to the well-known search engines such as Google, Yahoo and Bing, but this is not what I am looking for, since I am well aware that it would take a lot of time for the requests to be processed and the data to be removed. Also, I wasn't able to find this "removal request" page for the other search engines.
To be more precise/clear...
... I want to request this website data removal from all (or most) search engines at once, so that when I upload my new website (to the same domain), only working links (URLs) are displayed. Can this be achieved somehow and, if so, how? Also, how long would the removal take?
Hope my question is clear enough, and any answer/help would be very much appreciated.
No, there is no way to do this for all search engines at once. You will have to request it from each search engine individually. For the smaller search engines you can try to find contact information or customer support; however, there is a chance they will ignore your request (heck, some sites ignore the robots.txt file and just crawl your site anyway... it's just a part of being on the web).
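For reference, this is roughly how a well-behaved crawler reads a robots.txt file, sketched with Python's standard `urllib.robotparser`. The rules, the `/old/` path, the bot name and the example.com URLs are all made up for illustration. Note that a `Disallow` rule only asks crawlers not to fetch those paths; it is not a removal request, and as said above nothing forces a crawler to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content asking all crawlers to skip /old/.
rules = """\
User-agent: *
Disallow: /old/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A polite crawler checks each URL before fetching it.
old_allowed = parser.can_fetch("ExampleBot", "https://example.com/old/page.html")
new_allowed = parser.can_fetch("ExampleBot", "https://example.com/new/page.html")
print(old_allowed, new_allowed)
```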
If I use the Jade template engine with Node.js, will the app be crawlable by search engines and Facebook without using the _escaped_fragment_?
If your application outputs HTML, it is no different than if you had written that HTML in a file and simply served the file. The wider Web doesn't generally know or care what you're using to generate your HTML.
(It is possible to infer what tech a page is using by inspecting headers and looking for common idioms that are unique to a particular technology, but these are just clues, not a fundamental difference in what your Web page is.)
I'm looking for some suggestions from search experts. I want to implement site search and don't really know where to start. What are the pros and cons of the various search technologies for a PHP / MySQL website?
For many websites, I've been using Solr. Solr is a great search server platform. It runs in a Java servlet container such as Tomcat and needs some configuration, but it's an Apache project and pretty easy to get going.
Take a look: http://lucene.apache.org/solr/
I am looking into the simplest way to integrate Wikipedia into a node.js app.
The requirements are to be able to search for entries and find entities in each entry.
Any known existing libs/methods for that?
Thanks
There's a newly available open source parser for wiki text (http://sweble.org/) that might be useful to you if you roll your own solution. Of course, that would require you to download the Wikipedia data dump, parse it, and store the entities in a db.
You could also look at DBpedia (http://dbpedia.org/About), though that would require integrating the RDF stack into your app (either running a local RDF repository or communicating with the often flaky online version via SPARQL).
One easy approach is to use a search engine API and restrict it to site:wikipedia.org, e.g.:
http://www.google.com/search?q=node.js+site%3Awikipedia.org
I've found that can work really well.
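A small sketch of building that kind of site-restricted query URL, written in Python for brevity (the same idea carries over to Node.js). The helper name `wikipedia_search_url` is made up; the base URL and the `site:` operator come from the example above. It only builds the URL and does not fetch or parse any results.

```python
from urllib.parse import urlencode


def wikipedia_search_url(query, site="wikipedia.org"):
    """Build a search URL restricted to one site via the site: operator."""
    params = {"q": f"{query} site:{site}"}
    # urlencode turns spaces into '+' and ':' into '%3A'.
    return "http://www.google.com/search?" + urlencode(params)


url = wikipedia_search_url("node.js")
print(url)
```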
Spider, for scraping with jQuery, is fantastic:
https://github.com/mikeal/spider
Mikeal is the man
Presumably you'd be using this for a side (personal) project though. Not sure how kosher it is to run wild on Wikipedia with a scraper.
I want to write a mobile app which takes a picture, searches Google Images for similar pictures, and then displays the results.
However, with Google image search I can only search for text strings, and with the search API there seems to be no way to search for similar pictures; this feature seems to be available only through the web interface.
Any idea how I can solve this problem?
thanks,
Christoph
There is a way you can do this now, but it's not officially supported, and there are probably some restrictions on the number of queries you can perform.
http://images.google.com/searchbyimage?hl=en&biw=1060&bih=766&gbv=2&site=search&image_url={{URL To your image}}&sa=X&ei=H6RaTtb5JcTeiALlmPi2CQ&ved=0CDsQ9Q8
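Whatever goes into `{{URL To your image}}` has to be URL-encoded. A minimal sketch of filling in that parameter in Python follows; the helper name `search_by_image_url` and the example image URL are made up, and it assumes the session-looking parameters (`sa`, `ei`, `ved` and so on) can be left out, which I have not verified. The endpoint is unofficial, so it may change or stop working.

```python
from urllib.parse import urlencode

# Unofficial endpoint taken from the URL above.
BASE_URL = "http://images.google.com/searchbyimage"


def search_by_image_url(image_url, language="en"):
    """Build a search-by-image URL for a publicly reachable image."""
    # Assumption: only hl and image_url are needed.
    params = {"hl": language, "image_url": image_url}
    return BASE_URL + "?" + urlencode(params)


url = search_by_image_url("http://example.com/cat.jpg")
print(url)
```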
There is also a Google Image Search API, which is being officially deprecated, but it will work for now.
http://code.google.com/apis/imagesearch/
Use the Google Vision API:
https://cloud.google.com/vision/
It is simple and powerful.
I did something like that recently for a mobile app. This is the code for it; it uses Google's search-by-image feature and returns either the "best guess" or the whole page.
You could use that and modify it to do what you want. Once you have the best guess for the image, you could search for any image with that title, etc.
https://github.com/hbattat/search-by-image
I don't think it's possible. If you click the link to find similar images from the images result page you get a link with the original query included:
google.com/images?q=ORIGINAL_QUERY&imgtype=i_similar&sa=...
If you remove that GET param manually, the search does not work; it only shows the image search form.
I don't think it is possible to find similar images with Google if you do not know what's in the picture.
I was looking for an answer to this some time ago, and found TinEye. You have to pay for it, though. Currently (Jan 2012) USD300 for 5K searches, USD1.5K for 30K searches...
SerpAPI lets you search Google Images and returns clean JSON.
URL example:
https://serpapi.com/search.json?q=Apple&tbm=isch&ijn=0
Documentation:
https://serpapi.com/images-results
The service has integrations for most programming languages: Python, PHP, Java, Go, Node.js...
Google limits the number of searches per day, but this service provides unlimited searches...