Is there a search engine for discovering phrases with a given word?

For example, let's say I want to find common phrases containing the word beholder, like "in the eye of the beholder". But I don't know these phrases in advance; I have just one word and want to discover which words it usually occurs with on the internet. Is there a search engine for such queries?
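What's being asked for here can be approximated over any plain-text corpus you have access to. A minimal sketch in pure Python (`neighbor_counts` is a hypothetical helper written for illustration, not an existing tool):

```python
from collections import Counter

def neighbor_counts(text, target):
    """Count words that appear directly before or after `target`."""
    words = text.lower().split()
    counts = Counter()
    for i, w in enumerate(words):
        if w == target:
            if i > 0:
                counts[words[i - 1]] += 1
            if i + 1 < len(words):
                counts[words[i + 1]] += 1
    return counts

corpus = ("beauty is in the eye of the beholder . "
          "the eye of the beholder sees what it wants .")
print(neighbor_counts(corpus, "beholder").most_common(3))
```

On a large corpus, the high-count neighbors are exactly the "words it usually occurs with"; nltk's collocation finders do the same thing properly, with association measures instead of raw counts.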

Related

What is the best text analyzer to use on generic data in Azure Cognitive Search when searching for more than 1 word?

I have been looking over the different text analyzers which Azure Cognitive Search offers with this api.
The type of data I have is generic and can be either an email address / name, these are just some examples.
Which is the best analyzer to use on this type of data (generic)?
Also, does the text analyzer in use affect how search works when looking for more than 1 word?
What is the best way to do a fuzzy search for more than one word, i.e. "joe blogs" but all fuzzy?
I don't want "somename blogs" to show up, because somename is not a fuzzy match on joe.
I do want "joe clogs" to show up, because joe would fuzzy-match joe and clogs would fuzzy-match blogs.
What is the best practice for fuzzy search with more than one word, so that the end user gets fewer hits as they provide more words?
If you have generic content and don't want to use linguistic processing, you can use a generic analyzer like the Whitespace analyzer. See
https://learn.microsoft.com/en-us/azure/search/index-add-custom-analyzers#built-in-analyzers
How searches for single or multiple words work is determined by the searchMode parameter. The recommended practice is to use all, instead of any. When you specify more terms, you are more specific and you want fewer (more precise) results.
You can specify multi-word queries where individual search terms are fuzzy by using the tilde syntax. E.g. to do a fuzzy search for joe but an exact match on blogs, you could use something like:
joe~ blogs
You can also control how fuzzy you want it to be. See
https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax#bkmk_fuzzy
PS: From your use case it sounds like proximity matching is also something you could consider using:
https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax#bkmk_proximity
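Putting the pieces above together, the query can be built programmatically. A sketch in Python (the parameter names `search`, `queryType`, and `searchMode` come from the Azure Cognitive Search query API; `fuzzy_query` is a hypothetical helper):

```python
def fuzzy_query(terms, max_edits=1):
    """Append ~N to each term so every term matches fuzzily (Lucene syntax)."""
    return " ".join(f"{t}~{max_edits}" for t in terms)

# Query parameters for a search request: queryType=full enables the Lucene
# syntax (the ~ operator), searchMode=all requires every term to match,
# so each extra term narrows the result set.
params = {
    "search": fuzzy_query(["joe", "blogs"]),
    "queryType": "full",
    "searchMode": "all",
}
print(params["search"])
```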

Natural language processing keywords for building search engine

I've recently become interested in NLP and would like to build a search engine for product recommendation. (Actually, I've always wondered how search engines like Google's or Amazon's are built.)
Take an Amazon product as an example, where I can access all the "word" information about a product:
Product_Name Description ReviewText
"XXX brand" "Pain relief" "This is super effective"
By applying nltk and gensim packages I could easily compare similarity of different products and make recommendations.
But here's another question I feel very vague about:
How to build a search engine for such products?
For example, if I feel pain and would like to search for medicine online, I'd like to type-in "pain relief" or "pain", whose searching results should include "XXX brand".
So this sounds more like a keyword extraction/tagging question? How should this be done in NLP? I know the index should contain nothing but single words, so it's like:
["XXX brand" : ("pain", 1),("relief", 1)]
So if I typed in either "pain" or "relief" I could get "XXX brand"; but what if I searched for "pain relief"?
One idea is to call Python directly from my JavaScript to calculate similarities for the input words "pain relief" on the server and make recommendations; but is that actually doable?
I would still prefer to build large keyword lists on the backend, store them in a database, and surface them directly in the search engine's web page.
Thanks!
Even though this does not provide a full how-to answer, there are two things that might be helpful.
First, it's important to note that Google does not only index single words but also n-grams.
More or less every NLP problem, and therefore also information retrieval from text, needs to tackle n-grams, because phrases carry far more expressiveness and information than single tokens.
That's also why so-called NGramAnalyzers are popular in search engines, whether Solr or Elasticsearch. Since both are based on Lucene, you should take a look here.
Relying on either framework, you can use a synonym analyzer that adds, for each word, the synonyms you provide.
For example, you could add relief = remedy (and vice versa if you wish) to your synonym mapping. Then, both engines would retrieve relevant documents regardless of whether you search for "pain relief" or "pain remedy". However, you should probably also read this post about the issues you might encounter, especially when aiming for phrase synonyms.
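To make the two ideas concrete, here is a toy sketch of n-gram generation and synonym expansion in plain Python. Real engines do this inside the analyzer chain; the `SYNONYMS` mapping below is invented for illustration, following the relief = remedy example:

```python
def ngrams(tokens, n):
    """All contiguous n-token phrases."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy synonym mapping (illustrative only).
SYNONYMS = {"relief": ["remedy"], "remedy": ["relief"]}

def expand(tokens):
    """Add each token's synonyms alongside it, as a synonym filter would."""
    out = []
    for t in tokens:
        out.append(t)
        out.extend(SYNONYMS.get(t, []))
    return out

print(ngrams(["pain", "relief", "cream"], 2))
print(expand(["pain", "relief"]))
```

With expansion applied at index time, a document containing "pain relief" also matches the query "pain remedy", and vice versa.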

Is there a way to get all complete sentences that a search engine (e.g. Google) has indexed that contain two search terms?

as the question says: "Is there a way to get all complete sentences that a search engine (e.g. Google) has indexed that contain two search terms?"
I would like to use the (e.g. Google) search syntax: BMW AND Toyota. (<-- this is just an example)
And I would then like to have returned all sentences that mention BMW and Toyota. They must be in a single (ideally: short) sentence though.
Is that possible?
Many thanks!
PS.: Sorry - I have difficulties finding the right tags for my question... Please feel free to suggest more appropriate ones and I will update the question.
PPS.: Let me rephrase my question: If it is not readily possible with an existing search engine, are there any programmatical ways to do that? Would one have to write a crawler for that purpose?
No, this may not be possible, as Google stores this information based on keywords and other algorithms.
For any given keyword or set of keywords, Google must maintain a reference to one or many matching titles (some accurate, some less so).
I don't work for Google, but that could be one way they maintain their search results.
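Regarding the PPS: programmatically, once you have page text (from your own crawl, or from a corpus such as Common Crawl), filtering for sentences that contain both terms is straightforward. A rough sketch in Python with naive sentence splitting:

```python
import re

def sentences_with_both(text, a, b):
    """Split text into rough sentences; keep those mentioning both terms."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences
            if a.lower() in s.lower() and b.lower() in s.lower()]

page = "BMW and Toyota both make cars. BMW is German. Toyota is Japanese."
print(sentences_with_both(page, "BMW", "Toyota"))
```

The hard part is obtaining the text in the first place: you would indeed need a crawler, since Google does not expose its index sentence by sentence.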

What features do NLP practitioners use to pick out English names?

I am trying named entity recognition for the first time. I'm looking for features that will pick out English names. I am using the methods outlined in the coursera nlp course (week three) and the nltk book. In other words: I am defining features, identifying features of words and then running those words/features through a classifier that I train on labeled data.
What features are used to pick out English names?
I can imagine that you'd look for two capitalized words in a row, or a capitalized word, then an initial, then another capitalized word (e.g. John Smith or James P. Smith).
But what other features are used for NER?
Some common features:
Word lists for common names (John, Adam, etc)
casing
contains symbol or numeric characters (names generally don't)
person prefixes (Mr., Mrs., etc...)
person postfixes (Jr., Sr., etc...)
single-letter abbreviation (i.e. (J.) Smith).
analysis of surrounding words (you may find some words have a high probability of appearing near names).
Named Entities previously recognized (often it is easy to identify NE in some parts of the corpus based on context, but very hard in other parts. If previously identified, this is an excellent hint towards NER)
Depending on what language you are working with, there may be more language-specific features as well. Frankly, you can turn up a wealth of information with a simple Google query; I'm not sure why you haven't tried that. Some starting points, however:
Google
A survey of named entity recognition and classification
Named entity recognition without gazetteers
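The feature list above translates almost directly into the kind of feature-extractor function the nltk classifier API expects. A simplified sketch (the prefix/postfix sets here are tiny placeholders, not complete lists):

```python
def name_features(words, i):
    """Dict of features for words[i], following the list above (simplified)."""
    w = words[i]
    prev_w = words[i - 1] if i > 0 else ""
    next_w = words[i + 1] if i + 1 < len(words) else ""
    return {
        "is_capitalized": w[:1].isupper(),
        "has_digit_or_symbol": any(not c.isalpha() for c in w),
        "is_initial": len(w) == 2 and w[0].isupper() and w[1] == ".",
        "prev_is_prefix": prev_w in {"Mr.", "Mrs.", "Dr."},
        "next_is_postfix": next_w in {"Jr.", "Sr."},
        "prev_capitalized": prev_w[:1].isupper(),
    }

print(name_features(["Mr.", "John", "Smith"], 1))
```

Feature dicts like this can be fed, paired with labels, straight into e.g. nltk's NaiveBayesClassifier for training.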
I did something similar back in school using machine learning. I assume you will use a supervised algorithm and classify every single word independently, not words in combination. In that case I would choose some features for the word itself, like the ones you mentioned (whether the word begins with a capital letter, whether it is an abbreviation), but I would add more, such as whether the previous or next word also starts with a capital letter or is an abbreviation. This way you add some context and overcome the problems related to your basic independence assumption.
If you want, have a look here. In the machine learning section you can find some more information and examples (the problem is slightly different, but the method should be similar).
Whatever features you choose, it is important to use some measure to evaluate their relevance and possibly reduce them to the useful ones to avoid over-fitting. One measure you can use is the gain ratio, but there are many more. Here you can find some basic information about feature extraction.
Hope it helps!

How to extract meaningful keywords from a query?

I am working on a web-intelligence project in which I have to build a system that accepts a user query and extracts meaningful keywords. Say, for example, a user enters the query "How to do socket programming in Java"; then I have to ignore "how", "to", "do", "in" and keep "socket", "programming", "java" for further processing and clustering. E.g. "socket" and "programming" are two different meaningful keywords, but used together as one keyword they produce a different meaning. I am looking for an algorithm like TF-IDF to approach this problem. Any help will be appreciated.
What you are looking for is a text analytics solution.
I have only used R for this purpose, but one way to look at it is that you need a list of words you consider not to be meaningful keywords; these are often called "stop words". You can find online lists of stop words for almost any popular language. After that, you might take a couple hundred inputs and calculate the frequency of every keyword (having already removed stop words and punctuation, and lower-cased the text), then try to identify other keywords you consider irrelevant and add them to your list of words to remove.
After this there are a ton of options you can explore. One example is stemming, which extracts the core term of each word so that "pages" and "page" are treated as the same keyword. (As you go deeper you will find plenty of material online to fine-tune your approach.)
Hope this helps.
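The stop-word removal and stemming steps described above fit in a few lines. A minimal sketch (the stop-word list here is a tiny illustrative subset; in practice, use a published list, and a real stemmer such as nltk's PorterStemmer instead of the crude rule below):

```python
STOP_WORDS = {"how", "to", "do", "in", "a", "the", "of"}

def extract_keywords(query):
    """Lower-case, strip punctuation, drop stop words."""
    tokens = [t.strip(".,?!").lower() for t in query.split()]
    return [t for t in tokens if t and t not in STOP_WORDS]

def crude_stem(word):
    """Very rough stand-in for a stemmer: strip a plural 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

print(extract_keywords("How to do socket programming in Java?"))
print(crude_stem("pages"))
```

From here, TF-IDF weights over the surviving keywords give you the ranking signal the question asks about.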
