Lucene phrase search - search

I have large text documents. Say, if I search for "computer m", then I want to get "computer monitor", "computer memory", and "computer market share". How can I get matched phrases only?
Should I index files using ShingleAnalyzerWrapper?
Should I use SpellChecker for this purpose?
How can I do this?

org.apache.lucene.search.highlight.Highlighter is used to extract the best-matching text from a found document, much like how Google highlights (or displays in bold) the matching text in your search results.
This blog entry might help you get started:
http://hrycan.com/2009/10/25/lucene-highlighter-howto/

You can use MultiPhraseQuery for that.
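As a rough illustration of the idea behind MultiPhraseQuery (several alternative terms allowed at one phrase position), here is a plain-Python sketch. It is not Lucene code and does not use Lucene's API: the function names `tokenize` and `prefix_phrase_matches` and the in-memory list of documents are assumptions made for the example. In Lucene itself you would typically add the fixed term ("computer"), enumerate the indexed terms that start with the prefix ("m"), and add those as the alternatives for the next position.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def prefix_phrase_matches(documents, query):
    """Treat the last query word as a prefix and return the distinct matching phrases."""
    query_tokens = tokenize(query)
    if not query_tokens:
        return []
    *head, prefix = query_tokens

    # Step 1: expand the prefix into every known term that starts with it.
    vocabulary = {tok for doc in documents for tok in tokenize(doc)}
    expansions = {term for term in vocabulary if term.startswith(prefix)}

    # Step 2: look for the fixed words followed by any one of the expansions.
    matches = set()
    n = len(head)
    for doc in documents:
        tokens = tokenize(doc)
        for i in range(len(tokens) - n):
            if tokens[i:i + n] == head and tokens[i + n] in expansions:
                matches.add(" ".join(tokens[i:i + n + 1]))
    return sorted(matches)

docs = [
    "The computer monitor broke.",
    "Computer memory is cheap; computer market share grew.",
]
print(prefix_phrase_matches(docs, "computer m"))
```

Note that this sketch only returns the fixed words plus the one expanded word ("computer market", not "computer market share"); getting longer phrases would need shingles or a highlighter on top of the match.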

Related

Do I need to add updated phoneme sequence of words to .dict file while adapting AM using cmusphinx?

I am trying to adapt the en-us acoustic model with Indian English accent recordings. Since many words are pronounced differently in this accent, do I need to add the updated phoneme representation of those words? Currently I am following this link: https://cmusphinx.github.io/wiki/tutorialadapt/#accumulating-observation-counts and it does not mention updating the .dict file.
PS: Should I add new words directly to the dictionary?
There is an Indian English model in the downloads; you should use it instead. It comes with an Indian English dictionary.

Search the sentence in large text sentence corpus

I am a beginner and I want to know if there is a way to search for a sentence in a large corpus of text data (say 1 million), so that when a user types:
I shouldn't be there
it also searches for a sequence like this:
I should not be there
Similarly, it should go from:
I gonna go there.
to
I going to go there.
I have been thinking for a couple of days about how to solve this problem.
If you know anything about how to deal with this problem, please provide a solution; even a hint would be more than enough. Thank you.
I would first go through both the sentence and the text and replace all contractions with their long forms. After that, use Knuth-Morris-Pratt to search for the sentence in the text.
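A minimal sketch of that approach, assuming a small hand-written contraction table: the `CONTRACTIONS` mapping and the helper names `normalize`, `kmp_find`, and `contains_sentence` are hypothetical, and a real table would be much larger. Both the corpus and the query are normalized the same way before the Knuth-Morris-Pratt search runs.

```python
import re

# Hypothetical, deliberately tiny mapping; extend it for real use.
CONTRACTIONS = {
    "shouldn't": "should not",
    "don't": "do not",
    "can't": "cannot",
    "i'm": "i am",
    "gonna": "going to",
}

def normalize(text):
    """Lowercase the text and replace known contractions with their long form."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(CONTRACTIONS.get(w, w) for w in words)

def build_failure_table(pattern):
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure

def kmp_find(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    failure = build_failure_table(pattern)
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1

def contains_sentence(corpus, query):
    """True if the normalized query occurs in the normalized corpus."""
    return kmp_find(normalize(corpus), normalize(query)) != -1
```

For a corpus that is searched many times, normalizing it once up front and reusing the result avoids repeating that work for every query.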

Is it possible to find which page and/or line number a given text was found on using full text search and filestream?

I just started using the filestream and full text search technologies available in Microsoft SQL Server. I can index and search txt and pdf files; however, when I get the results I can't see the text, nor which page and/or line number the text was found on inside the pdf, for example. Is it possible to at least retrieve the text from the document when a search is made? I believe it's not possible to return a "region" of text, but maybe there is something I can use to look it up in the file afterwards?
I'm trying to figure out the advantages of doing a search like this if I can't see the text that was found.
After doing a lot of research I concluded that it isn't possible to search for a given page in an indexed pdf, so I decided to use Solr instead and index the information the way I need to search it later.

Spell checker or search for best fit for a word/phrase?

Can anyone help me figure out a way to replace something like "credits card" with the correct spelling, "credit card", or "hame imprvment" with "home improvement"? I have a custom word/phrase list in which I would like to find the closest match for a misspelled word or phrase. I do not want to include a master dictionary; I want to use my own master word list only. I tried to use aspell but could not figure out how to use just my own dictionary without the master English dictionary.
Thanks for any help,
John
I just discovered "agrep". It works for me.
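agrep is a command-line tool for approximate matching. As an alternative sketch of the same idea, restricted to a custom phrase list, Python's standard-library `difflib.get_close_matches` can pick the closest entry. This is not what agrep does internally; the `PHRASES` list, the `closest_phrase` helper, and the 0.6 cutoff are assumptions for illustration.

```python
from difflib import get_close_matches

# Hypothetical custom word/phrase list; no master dictionary involved.
PHRASES = ["credit card", "home improvement", "car loan", "home insurance"]

def closest_phrase(text, phrases=PHRASES, cutoff=0.6):
    """Return the phrase most similar to text, or None if nothing is close enough."""
    matches = get_close_matches(text.lower(), phrases, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(closest_phrase("credits card"))
print(closest_phrase("hame imprvment"))
```

Raising the cutoff makes the matcher stricter, at the cost of returning no suggestion for heavily misspelled input.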

Semantic difference between "Find" and "Search"?

When building an application, is there any meaningful difference between the idea of "Find" vs "Search" ? Do you think of them more or less as synonymous?
I'm asking in terms of labeling for application UI as well as API design.
Finding is the completion of searching.
If you might not succeed in finding something, call the feature "Search". For example text search in an editor can fail due to no matches - then calling it "Find" would be lying.
On the other hand: in an established job searching site, you can say "Find a PHP job" because you know that for (almost) anything your users want, there will be offerings. This also makes it sound confident, positive and energetic.
According to Steve Krug in Don't Make Me Think, when talking about usability for a publicly-facing web site, use the word Search for a search box and nothing else. (He specifically prohibits "Find", "Quick Find", "Quick Search", and all variations.)
The rationale is that "Search" is the most commonly understood term, so it's what people will look for when they aren't thinking, and you don't want your users to have to think (at all).
I would say that "find" is focused on getting a single, exact match. As in the example above, you "find" the perfect PHP job.
OTOH, you "search" for jobs that meet your criteria. Searching is what you do when you want to graze through several results. "Search" returns pages of results. "Find" is closer to "I'm feeling lucky."
Of course, the terms get used interchangeably sometimes. But, I think that's the essence of the difference.
In many applications, find means "find on the current page/screen", while search means "search the entire database/Internet." Web browsers, online help, and other applications seem to make this distinction.
Within most applications...
Find typically refers to locating text within the document at hand and jumps to the next occurrence.
Search typically refers to locating multiple documents (or other objects) and returns a list.
I wrote the built-in Find command in Acrobat 1.0 and worked on the full text Search engine for Acrobat 2.0 and 3.0.
Most software at that point that handled large amounts of text had a way to locate an exact match to a single word or phrase and called it Find/Find Next. This is what we called it in Acrobat 1.0. We knew from the start that this wasn't enough to handle entire repositories of documents, so we needed a way to scan across a whole set. We couldn't use Find since that was already in the UI and had established behavior, so we settled on Search. The decision was based on little more than the relatively small set of common words that convey the action.
Even harder is to come up with a reasonable icon for it. Our initial take was to use something similar to the old Yellow Pages logo, but the lawyers shot that down - it was too close. We couldn't use a magnifying glass as we had zoom functions tied to that. We went with binoculars.
I don't think that there is any difference.
But then again, I'm Portuguese. :P
Find = Discover exact
Example: We write "Please find attached" in an email. We don't write "Please search attached".
Search = Discover exact + Related match
Example: Google Search
"Seek and ye shall find"
"Search and you will find"
One angle that (surprisingly) no one has mentioned is that in English, when you say you search something, that something is the thing you're searching within, not the thing you're trying to find. So unless you add the word 'for' (as in, to search for something), the two words are fundamentally different.
It becomes obvious with an example:
Find the room.
Search the room.
Two very different tasks! The first defines the object of your search. The second defines the scope of your search.
That's not completely irrelevant when talking about UIs. If your app has a search feature where the user can specify both the source and the object of their search, you might choose to use the words this way. For example:
Search: Current document
Find: "positive and energetic"
Yes, as some others have pointed out, the word 'Find' does imply a successful search, but let's not start calling app designers liars for using it when success isn't guaranteed. It's become a pretty standard term for searching a document for a particular string.
I think search is more generic and more suitable for text search. Find sounds more like 'find a specific record or a group of records'
After searching, you find something.
Search for an answer on Stack Overflow so that you may find it.
For me, Find is the success of a Search; that is, to Find is to identify the location of something that's known to exist.
Search should always be used when you have no control over what the user is looking for.
Find talks about a specific one.
Search does not talk about a specific one.
Did you find the picture I requested yet?
No? Please search on the internet. I need to present it in an hour.
Another example is below:
Please find the attachment in this email.
(or)
You'll find the attachment below.
(or)
Please find attached.
Here, we use "find" because it is a specific document that is attached to the email.
We don't use "search" here, as there is no larger domain to search in.
Search is the primary interface to the Web for many users. Search should be global (not scoped to a subsite) and available from every page; booleans should be made intimidating since users usually use them wrong
Read this: https://www.nngroup.com/articles/search-and-you-may-find/

Resources