Solr, Special Chars, and Latin to Cyrillic char conversion - search

I am trying to set up a search engine using Solr (or Lucene) whose text may be in Latin script with special characters (for example Ö or Ç) or in Cyrillic (for example Б/б and Ж/ж).
I am trying to find a way to let users search for words containing these characters even when they do not have the corresponding keys on their keyboard.
Example would be (making up words here, hopefully won't offend anyone):
"BÖÖK" would be found when searching for "book"
"ЖRAY" would be found when searching for XRAY
"ЖRAY" would also be found when searching for ZRAY, ZHRAY, or žray (see GOST 16876-71 for details on the transliteration of Cyrillic to Latin characters).
So, how should I go about this? Some theories I have are:
Store multiple text fields for each original string: one in its original form, one after a first pass of transliteration (which, for example, would convert Ö to plain O, and Ж to ž but also to X), and one in a third form (ž converted to z or zh). This means I would be storing a LOT of data...
Store the text in Solr as is and let Solr do the magic. I don't know how well this would work, and I can't see anything in Solr that does this.
A magic bullet I have not found yet...
Any ideas? Has anyone tried this before?

Take a look at Solr's Analyzers, Tokenizers, and Token Filters which give you a good intro to the type of manipulation you're looking for.

You need to use an accent-folding filter in both your index-time and query-time text analysis, which converts accented characters to their closest ASCII equivalents.
You can use ISOLatin1AccentFilterFactory or ASCIIFoldingFilterFactory, depending on the Solr version you are using (ASCIIFoldingFilterFactory replaced ISOLatin1AccentFilterFactory in later releases).
e.g.
<filter class="solr.ASCIIFoldingFilterFactory" />
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ISOLatin1AccentFilterFactory
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory
So:
"BÖÖK" would be folded to "BOOK" and, with a lowercase filter in the same analyzer chain, indexed as "book" in Solr.
Because the same analysis is applied to queries, users can search for either book or BÖÖK and still get the document back.
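To make the folding step concrete, here is a minimal sketch in plain Java of what an accent-folding plus lowercase chain does to a token. It is not Solr's implementation: it uses java.text.Normalizer from the JDK, the class and method names are made up for illustration, and it does nothing for Cyrillic letters such as Ж, which would still need a separate transliteration step.

```java
import java.text.Normalizer;
import java.util.Locale;

final class AccentFoldingSketch {
    // Decompose accented letters (NFD), drop the combining marks, then lowercase.
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "B\u00D6\u00D6K" is the question's example word with two O-umlauts.
        System.out.println(fold("B\u00D6\u00D6K")); // prints "book"
        System.out.println(fold("book"));           // prints "book"
    }
}
```

Since the indexed token and the query token both end up as the same string, either spelling finds the document.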

Related

SOLR: Return missing words for multi word searches

I'm trying to receive the words of a search query in solr, which were not included in a match.
Let's say I'm searching for "Red Hat Linux chickpeas" (without quotes) and one of the hits is "Red Hat Enterprise Linux operating system". Then I'd like to get the information that the word "chickpeas" is missing from this result.
I think this should somehow be possible with Solr, but I couldn't come up with the right Google/Stack Overflow query to find a solution.
You could try using a facet to get the number of results containing each of the given terms:
q=Red+Hat+Linux+chickpeas&facet=true&facet.field={!terms=red,hat,linux,chickpeas}text
Where text is a catch-all field (tokenized, lowercase filtered). Note that the facets are case-sensitive.
The answer to my question is using an exists function query for each search term.
See here:
https://stackoverflow.com/a/26163945/467944
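As a rough sketch of that idea (not copied from the linked answer), the helper below builds an fl value with one boolean pseudo-field per search term, using Solr's exists() and query() function queries. The field name text is taken from the facet answer above; the has_ prefix, the class name, and the method name are assumptions for illustration. It assumes plain alphanumeric terms and does no escaping or URL encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class MissingTermsQuerySketch {
    // Builds an fl value that returns all stored fields plus one
    // true/false pseudo-field per term (hypothetical names: has_<term>).
    static String buildFieldList(String field, List<String> terms) {
        List<String> parts = new ArrayList<>();
        parts.add("*");
        for (String term : terms) {
            String t = term.toLowerCase(Locale.ROOT);
            parts.add("has_" + t + ":exists(query({!v='" + field + ":" + t + "'}))");
        }
        return String.join(",", parts);
    }

    public static void main(String[] args) {
        System.out.println(buildFieldList("text", List.of("Red", "Hat", "Linux", "chickpeas")));
    }
}
```

Each returned document would then carry a flag per term, and the terms whose flag is false are the ones missing from that hit.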

Solr custom wildcard

I am pretty new to Solr and I am looking for a way to port the search features of my web application, which currently runs against a regular database, to Solr indexes. My problem so far is that I have to customize the wildcard behaviour: for example, "?" should mean "0 or 1 characters" rather than exactly one character as it does now, "+" should mean any whitespace, "#" should mean any digit, and so on. Any good pointers?
Thanks!
There is no simple answer that I know of, I am afraid.
For 0 or 1 characters, you can replace the original query with an OR query. E.g. mp? in your database search use case becomes 'mp OR mp?' in Solr.
Text fields are split on whitespace by default. So, you can look at using a whitespace tokenizer as part of your custom 'text' field type. There are several examples: text_ws in the sample schema does only whitespace tokenizing. You'd want to read up on tokenizers.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
There is no digit equivalent - you can do term1* OR term2* OR term3* ... etc. You can also use function queries that support numerical functions. http://wiki.apache.org/solr/FunctionQuery
It looks like the best choice in this case is to use regular expressions in the search. More details can be found here: http://1opensourcelover.wordpress.com/2013/09/29/solr-regex-tutorial/
It's not exactly what I was looking for, as I will have to build my own Solr query on the back end, and I have a feeling that heavy use of regular expressions will create a bit more overhead on my server. In the tests I did, though, it looks pretty fast.
I will leave the question open for a while; maybe someone can come up with a better answer.
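A minimal sketch of the back-end translation step, assuming the three custom wildcards from the question and java.util.regex syntax. The class and method names are made up. Solr's regex queries use Lucene's own regex dialect, so the whitespace class in particular may need to be spelled differently there, and a whitespace-tokenized field never contains whitespace inside a single term anyway.

```java
import java.util.regex.Pattern;

final class CustomWildcardSketch {
    // Translates the question's custom wildcards into a regular expression.
    static String toRegex(String pattern) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            switch (c) {
                case '?': regex.append(".?"); break;    // zero or one character
                case '+': regex.append("\\s"); break;   // one whitespace character
                case '#': regex.append("[0-9]"); break; // one digit
                default:
                    // Escape ASCII punctuation so it is matched literally.
                    if (c < 128 && !Character.isLetterOrDigit(c)) regex.append('\\');
                    regex.append(c);
            }
        }
        return regex.toString();
    }

    public static void main(String[] args) {
        System.out.println(toRegex("mp?"));                          // prints "mp.?"
        System.out.println(Pattern.matches(toRegex("mp?"), "mp3"));  // prints "true"
        System.out.println(Pattern.matches(toRegex("mp?"), "mp34")); // prints "false"
    }
}
```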

Exact match in google search

I am trying to make an application that finds all the copied code in a project.
But basically my question is purely related to google search.
I made a search for the keyword "public void bubbleSort(int[] arr){"
and this was the result.
In the first page of search results, only the last url makes a perfect match with my keyword.
Can I tell Google, with some search keywords, to give more importance to pages with an exact match of my search keyword?
Although the plus sign (+) is no longer an available Google search operator, you can use quotes or, after running the query, select Search Tools and then Verbatim under the All Results drop-down.
You can also search the Google code archives, https://code.google.com/ or try some of the other code search engines around the Internet.
+"public void bubbleSort(int[] arr){"
The plus sign means to include this term no matter what. The quotes turn the loosely coupled words into a single term.
For a full list of Google syntax operators:
[web]: https://support.google.com/websearch/answer/136861?hl=en

How to compare different language String values in JAVA?

In my web application I am using two different Languages namely English and Arabic.
I have a search box in my web application. If we search by name or part of a name, it retrieves the values from the DB by comparing the "Hometown" of the user.
Explanation:
For example, if a user whose hometown is "California" searches for the name "Victor", my query first finds the people who have the same hometown, "California", and then searches that list for the *name* "Victor", retrieving the users who have "California" as their hometown and "victor" as their name or part of it.
The problem is that if the hometown "California" is saved in English, the comparison works and the values are retrieved. But "California" is saved as "كاليفورنيا" in Arabic. In this case the hometown comparison fails and it can't retrieve the values.
I want my query to recognize that both are the same hometown and retrieve the values. Is that possible?
What alternative should I consider for this comparison logic? I am confused. Any suggestions, please?
EDIT:
*I have an idea: once the hometown is known, is it possible to use Google translator or a transliterator to convert the hometown to the other language (from English to Arabic, or from Arabic to English) and return the search results for both joined together? Any suggestions?*
The problem you encounter is that you need information in two or more languages and you want the users of your application to be able to use both languages. One possible approach is to keep multiple records per item and include a language code as part of the primary key. For instance, if your record is
id hometown name
001 California Victor
you could introduce a language code and store
id lang hometown name
001 en California Victor
001 ar كاليفورنيا Victor
then your search would match either "California" or "كاليفورنيا", giving you the id 001, which you can then use to load all translations of your data (or just the data in the current output language). This scheme can be used with any number of languages and has the added advantage that you don't need to prefill the table. You can add new translations for records when they become known.
(Caveat: I just repeated your Arabic string, as I can't read it. 'ar' is the ISO 639-1 code for Arabic, but any consistent language code will do.)
Does the Arabic sound like "California"? If so you will need to compare on a "sounds-like"-basis which will most likely result in a phoneme conversion.
Transliterate all names into the same language (e.g. English) for searching, and use Levenshtein edit distance to compute the similarity between the phonetic representations of the names. This will be slow if you simply compare your query with every name, but if you pre-index all of the place names in your database into a Burkhard-Keller tree, they can be searched efficiently by edit distance from the query term.
This technique allows you to sort names by how closely they actually match. You're probably more likely to find a match this way than by using Metaphone or Double Metaphone, though this is more difficult to implement.
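Here is a small sketch of that combination, assuming the names have already been transliterated to Latin letters. The class name, method names, and sample place names are made up for illustration; it uses the standard two-row Levenshtein computation and a basic Burkhard-Keller tree keyed by distance.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BkTreeSketch {
    // Two-row dynamic-programming Levenshtein edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    private static final class Node {
        final String word;
        final Map<Integer, Node> children = new HashMap<>(); // keyed by distance to this node
        Node(String word) { this.word = word; }
    }

    private Node root;

    void add(String word) {
        if (root == null) { root = new Node(word); return; }
        Node node = root;
        while (true) {
            int d = levenshtein(word, node.word);
            if (d == 0) return; // already indexed
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(word)); return; }
            node = child;
        }
    }

    List<String> search(String query, int maxDistance) {
        List<String> matches = new ArrayList<>();
        if (root != null) collect(root, query, maxDistance, matches);
        return matches;
    }

    private static void collect(Node node, String query, int max, List<String> out) {
        int d = levenshtein(query, node.word);
        if (d <= max) out.add(node.word);
        // Triangle inequality: only children keyed in [d - max, d + max] can hold matches.
        for (int k = Math.max(1, d - max); k <= d + max; k++) {
            Node child = node.children.get(k);
            if (child != null) collect(child, query, max, out);
        }
    }

    public static void main(String[] args) {
        BkTreeSketch tree = new BkTreeSketch();
        for (String name : new String[] {"california", "kalifornia", "colorado", "carolina"}) {
            tree.add(name);
        }
        System.out.println(tree.search("kalifornya", 2));
    }
}
```

The search only descends into subtrees that can still contain a name within the allowed distance, which is what keeps it from comparing the query against every name.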
Your Google suggestion sounds like it might also be a good one, but you should play around with it, and be sure that you're happy with its accuracy. In testing how it worked going between Hebrew and English, I noticed that sometimes Google just leaves English place names in English letters when translating to Hebrew.
How about using some localization on the client side to display values? Or create a wrapper class for hometown that overrides equals(Object) so that the instance for California returns true for both "California" and "كاليفورنيا" (sorry if I made a mistake here, I just copy-pasted from above).
This sounds like a classic encoding problem. Whenever you transfer non-ASCII characters you need to make sure you're encoding them correctly. For Arabic and English you can use UTF-8, since it can encode every Unicode character (I don't know Arabic, though).
In your setup you will probably have the following points:
Browser <-> Servlet container <-> Database
                   |
               System.out
At any of the system interfaces where chars (16-bit) are converted to bytes (8-bit), you need to make sure the encoding is correct.
Browser to Servlet container
When you do GET or POST requests from a web page, the browser will look at 1) the HTTP headers from the server, especially Content-Type: text/html; charset=UTF-8, which, if present, overrides 2) the HTML meta header <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">.
On the servlet container side, HttpServletRequest.getParameter() decodes parameters using an encoding that you most likely need to set in the server settings.
Example: Tomcat's server.xml
<Connector port="8080" protocol="HTTP/1.1" URIEncoding="UTF-8"
maxThreads="2000"
connectionTimeout="20000"
redirectPort="8443" />
Servlet container to Database
The database needs to have the correct encodings, or sorting etc will not be right.
Example my.cnf for MySQL
[mysqld]
....
init_connect='SET collation_connection = utf8_general_ci'
init_connect='SET NAMES utf8'
default-character-set=utf8
character-set-server = utf8
collation-server = utf8_general_ci
[mysql]
....
default-character-set=utf8
Then the JDBC-driver needs to be set for UTF-8.
Example JDBC connect string
jdbc:mysql://localhost:3306/rimario?useUnicode=true&characterEncoding=utf-8
System.out
System.out.println() cannot be relied upon to verify things. First, it depends on the Java VM's default encoding, set using the system property -Dfile.encoding=UTF-8; second, the terminal in which you view System.out needs to be set to, and support, UTF-8. Don't trust System.out!
Once a String in the VM holds the proper characters, it will not be affected by encoding. In memory, every char in a String is a 16-bit UTF-16 code unit; characters outside the Basic Multilingual Plane take two chars (a surrogate pair), so a String can hold everything UTF-8 can encode. You can write the string to a file and inspect the file to really know whether you got the correct chars in your VM.
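To see why the char-to-byte boundaries are where things go wrong, here is a small stand-alone check. It is a sketch only: the class and method names are made up, and the Arabic spelling of California is written with Unicode escapes so that the source file's own encoding cannot interfere.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingRoundTripSketch {
    // Encode the text to bytes and decode it again with the same charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Ten Arabic letters spelling "California".
        String hometown = "\u0643\u0627\u0644\u064A\u0641\u0648\u0631\u0646\u064A\u0627";

        // UTF-8 can represent every character, so the round trip is lossless.
        System.out.println(roundTrip(hometown, StandardCharsets.UTF_8).equals(hometown)); // prints "true"

        // ISO-8859-1 has no Arabic letters; each one is replaced by '?'.
        System.out.println(roundTrip(hometown, StandardCharsets.ISO_8859_1)); // prints "??????????"
    }
}
```

If any hop between browser, servlet container, and database uses a charset like the second one, the stored hometown no longer equals the string being searched for, which is exactly the kind of comparison failure described in the question.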

How to find a match within a single term using Lucene

I am using the Lucene search engine but it only seems to find matches that occur at the beginning of terms.
For example:
Searching for "one" would match "onematch" or "one day a time" but not "loneranger".
The Lucene documentation says it doesn't support wildcards at the front of a search string, so I am not sure whether Lucene can match inside a term at all or can only match terms that start with the search string.
Is this a problem with how I have created my index, how I am building my search query or just a limitation of Lucene?
Found some info in another post here on Stack Overflow: "[LUCENE.NET] Leading wildcard throws an error".
You can call SetAllowLeadingWildcard(true) on your QueryParser (setAllowLeadingWildcard in Java Lucene) to allow leading wildcards during your search. This will of course have a large performance impact, but it will allow users to find matches within a term.
Lucene will find a document if the search term appears anywhere within it, but by default it doesn't allow wildcard queries where the wildcard is at the front of the search term, because they perform horribly. If that is functionality you care about, you will either have to change a config flag (thanks for the interesting link) rather than do low-level Lucene hacking, find a third-party library that has already done that hacking, or find a different search implementation (for small enough datasets, the built-in search of many RDBMS engines is sufficient).
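A rough model of why leading wildcards are slow, using a plain TreeSet as a stand-in for a sorted term dictionary. This is not Lucene's actual data structure, and the class and method names are made up; it only illustrates that a prefix can be located by order while an infix cannot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

final class TermLookupSketch {
    // Prefix lookup (like one*): jump straight to the matching range of sorted terms.
    static SortedSet<String> withPrefix(TreeSet<String> terms, String prefix) {
        return terms.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    // Infix lookup (like *one*): ordering does not help, so every term is inspected.
    static List<String> containing(TreeSet<String> terms, String infix) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (term.contains(infix)) hits.add(term);
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(List.of("one", "onematch", "loneranger", "day", "time"));
        System.out.println(withPrefix(terms, "one")); // prints "[one, onematch]"
        System.out.println(containing(terms, "one")); // prints "[loneranger, one, onematch]"
    }
}
```

With a handful of terms the difference is invisible, but on a large index the second method has to walk the whole term list for every query.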
Your query should be
Query query = new WildcardQuery(new Term("contents", "*one*"));
where contents is the name of the field you are searching.
The search text "one" should be enclosed in asterisks, with no space between the text and either asterisk.

Resources