Is there a way to get all complete sentences that a search engine (e.g. Google) has indexed that contain two search terms? - search

as the question says: "Is there a way to get all complete sentences that a search engine (e.g. Google) has indexed that contain two search terms?"
I would like to use the (e.g. Google) search syntax: BMW AND Toyota. (<-- this is just an example)
And I would then like to have returned all sentences that mention BMW and Toyota. They must be in a single (ideally: short) sentence though.
Is that possible?
Many thanks!
PS.: Sorry - I have difficulties finding the right tags for my question... Please feel free to suggest more appropriate ones and I will update the question.
PPS.: Let me rephrase my question: If it is not readily possible with an existing search engine, are there any programmatical ways to do that? Would one have to write a crawler for that purpose?

No this may not be possible, as google stores this info based on keywords and other algorithms.
For any given keyword or set of keywords, google must be maintaining a reference to one or many matching (some accurate, some not so accurate) titles.
I do not work for google, but that could one way they are maintaining their search results.

Related

Natural language processing keywords for building search engine

I'm recently interested in NLP, and would like to build up search engine for product recommendation. (Actually I'm always wondering about how search engine for Google/Amazon is built up)
Take Amazon product as example, where I could access all "word" information about one product:
Product_Name Description ReviewText
"XXX brand" "Pain relief" "This is super effective"
By applying nltk and gensim packages I could easily compare similarity of different products and make recommendations.
But here's another question I feel very vague about:
How to build a search engine for such products?
For example, if I feel pain and would like to search for medicine online, I'd like to type-in "pain relief" or "pain", whose searching results should include "XXX brand".
So this sounds more like keyword extraction/tagging question? How should this be done in NLP? I know corpus should contain all but single words, so it's like:
["XXX brand" : ("pain", 1),("relief", 1)]
So if I typed in either "pain" or "relief" I could get "XXX brand"; but what about I searched "pain relief"?
I could come up with idea that directly call python in my javascript for calculate similarities of input words "pain relief" on browser-based server and make recommendation; but that's kind of do-able?
I still prefer to build up very big lists of keywords at backends, stored in datasets/database and directly visualized in web page of search engine.
Thanks!
Even though this does not provide a full how-to answer, there are two things that might be helpful.
First, it's important to note that Google does not only treat singular words but also ngrams.
More or less every NLP problem and therefore also information retrieval from text needs to tackle ngrams. This is because phrases carry way more expressiveness and information than singular tokens.
That's also why so called NGramAnalyzers are popular in search engines, be it Solr or elastic. Since both are based on Lucene, you should take a look here.
Relying on either framework, you can use a synonym analyser that adds for each word the synonyms you provide.
For example, you could add relief = remedy (and vice versa if you wish) to your synonym mapping. Then, both engines would retrieve relevant documents regardless if you search for "pain relief" or "pain remedy". However, you should probably also read this post about the issues you might encounter, especially when aiming for phrase synonyms.

How do search engines implement keyword suggestion?

This is a very important programming question (and a popular interview question), yet I wasn't able to find any direct answer on the internet.
All I know is that they are:
Based on real searches
Different for different languages and regions
Censored
Based on your search history
Duplicates removed
Sorted by most relevant and popular
So we're looking at a data structures that are:
Easily sortable
Elements are unique
The word you type in is somehow connected to the suggestions
My first guess would be: it's a graph.
Another option is an adjacency list. Each element contains a link list of suggestions etc.
Does anyone know it's really done?

Improving type-ahead suggestions in Search using SOLR

What are the possible ways you could improve the type-ahead (auto-complete) suggestions that appear in a free-form search?
From my understanding, all the suggestions that appear for keywords are stored in a SOLR table.
How do you ensure that it covers all the industry specific relevant type-ahead suggestions?
Can you automate including most recent user generated queries that are not currently providing search results to lead to relevant ones?
In preprocessing, the documents fed into the search engine need to be enriched with whatever is sensible and provides help to find them. E.g. a document containing the string paris may be enriched by french capital, capital of france, ile-de-france, … You will need a dictionary to do so. You can take data from dbpedia.org or—for English only—WordNet. For not to over-generalize you will need to implement some disambiguation (meaning discovery) in the first step, since paris—for example—could equally be expanded with alexandros, alaksandu of wilusa, king of troy, depending on the context.

smart search by first/last name

I have to build a search facility capable of searching members by their first name/last name and may be some other search parameters (i.e. address).
The search should provide a list of match candidates so that the user can select whatever he/she seems the "correct" match.
The search should be smart enough so that the "correct" result would be among the first few items on the list. The search should also be tolerant to typos and misspellings and, may be, even be aware of name shortcuts i.e. Bob vs. Robert or Bill vs. William.
I started investigating Lucene and the family (like elastic search) as a tool for the job. While it has an impressive array of features addressing similar problems for the full text search, I am not so sure how to use them for my task - up to the point that maybe Lucene is not the right tool here at all.
What do you guys think - how can I harness Elastic Search to solve my problem? Or should I look elsewhere?
Lucene supports edit distance queries so that your search query will tolerate some typos, you define this as the allowed edit distance for a term.
for instance:
name:johnni~0.8
would return "johnny"
Also Solr provides a wide array of ready made search filters and analyzers you can use for search.
In your case I would probably chain several filter factories together:
TrimFilterFactory - trim the query
LowerCaseFilterFactory - to get rid of case differences
ISOLatin1AccentFilterFactory - to remove accents from letters (most people don't search with the accent anyway)
PhoneticFilterFactory - for matching sounds like queries like: kris -> chris
look at the documentation under the link it is pretty straight forward how to set up a new solr instance with an Analyzer that uses all the above filters. I used something similar for searching city names and it worked fairly well.
Lucene can be made tolerant of typos and misspellings, and can use synonyms. As for
The search should be smart enough so that the "correct" result would be among the first few items on the list
Are there any search engines which don't try to do this?
As far as Bob/Robert goes, that can be done with synonyms, but you need to get the synonym data from some reliable source.
In addition to what #Asaf mentioned, you might try to use N-gram indexing to deal with spelling variants. See the CJKAnalyzer for an example of how to do that.

Why doesn't Google offer partial search? Is it because the index would be too large?

Google/GMail/etc. doesn't offer partial or prefix search (e.g. stuff*) though it could be very useful. Often I don't find a mail in GMail, because I don't remember the exact expression.
I know there is stemming and such, but it's not the same, especially if we talk about languages other than English.
Why doesn't Google add such a feature? Is it because the index would explode? But databases offer partial search, so surely there are good algorithms to tackle this problem.
What is the problem here?
Google doesn't actually store the text that it searches. It stores search terms, links to the page, and where in the page the term exists. That data structure is indexed in the traditional database sense. I'd bet using wildcards would make the index of the index pretty slow and as Developer Art says, not very useful.
Google does search partial words. Gmail does not though. Since you ask what's the problem here, my answer is lack of effort. This problem has a solution that enables to search in constant time and linear space but not very cache friendly: Suffix Trees. Suffix Arrays is another option that is more cache-friendly and still time efficient.
It is possible via the Google Docs - follow this article:
http://www.labnol.org/internet/advanced-gmail-search/21623/
Google Code Search can search based on regular expressions, so they do know how to do it. Of course, the amount of data Code Search has to index is tiny compared to the web search. Using regex or wildcard search in the web search would increase index size and decrease performance to impractical levels.
The secret to finding anything in Google is to enter a combination of search terms (or quoted phrases) that are very likely to be in the content you are looking for, but unlikely to appear together in unrelated content. A wildcard expression does the opposite of this. Just enter the terms you expect the wildcard to match, keeping in mind that Google will do stemming for you. Back in the days when computers ran on steam, Lycos (iirc) had pattern matching, but they turned it off several years ago. I presume it was putting too much load on their servers.
Because you can't sensibly derive what is meant with car*:
Cars?
Carpets?
Carrots?
Google's algorithms compare document texts, also external inbound links to determine what a document is about. With these wildcards all these algorithms go into junk

Resources