I am using Solr 6.2.0 with fieldType text_ja.
I am facing a problem with the JapaneseTokenizer. It properly tokenizes ドラゴンボールヒーロー ↓
"ドラゴン"
"ドラゴンボールヒーロー"
"ボール"
"ヒーロー"
But it fails to tokenize ドラゴンボールヒーローズ properly:
ドラゴンボールヒーローズ
↓
"ドラゴン"
"ドラゴンボールヒーローズ"
"ボールヒーローズ"
Hence, searching for ドラゴンボール doesn't match in the latter case.
It also doesn't separate ディズニーランド into two words.
First, I'm fairly certain that it is working as intended. Looking into how the Kuromoji morphological analyzer works would probably be the best way to gain a better understanding of its rules and rationale.
There are a couple of things you could try. You could put the JapaneseAnalyzer into EXTENDED mode instead of SEARCH mode, which should give you significantly looser matching (though most likely at the cost of introducing more false positives, of course):
Analyzer analyzer = new JapaneseAnalyzer(
    null,
    JapaneseTokenizer.Mode.EXTENDED,
    JapaneseAnalyzer.getDefaultStopSet(),
    JapaneseAnalyzer.getDefaultStopTags()
);
Or you could try using the CJKAnalyzer instead.
(By the way, EnglishAnalyzer doesn't split "Disneyland" into two tokens either.)
I was able to solve this using the lucene-gosen Sen tokenizer, compiling the ipadic dictionary with custom rules and word weights.
I have three nested loops in a Python/Django webapp backend. all_recommended_services holds all the service info I need to go through. alternatives holds the search criteria entered in the search bar, including all special-character alternatives (for example, u is substituted with ú, ö with ő, and so on). Finally, the loop for value in alternative: goes through each search word individually, split on whitespace.
Some search keyword combinations yield millions of alternatives, which totally kills the webapp. Is there an efficient way to speed this up? I looked into itertools.product to use a Cartesian product, but it didn't really help me avoid more loops or speed up the process. Any help is much appreciated!
for service in all_recommended_services:
    county_str = get_county_by_id(all_counties, service['county_id'])
    for alternative in alternatives:
        something_found = False
        for value in alternative:
            something_found = search_in_service(service, value, county_str)
            if not something_found:
                break
        if something_found:
            if service not in recommended_services:
                recommended_services.append(service)
Since you are implementing search, I suggest the package django-haystack. It is easy to use and highly customizable to fit your needs. Since you didn't include more detail, I can't provide a more detailed demo, but the documentation is comprehensive.
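Independent of any search backend, the triple loop from the question can also be tightened in plain Python: all() short-circuits on the first non-matching value, any() on the first fully matching alternative, and since each service is then visited and appended at most once, the linear `service in recommended_services` scan disappears. A rough sketch (get_county_by_id and search_in_service are stand-in stubs here, since their real logic is not shown in the question):

```python
# Stub helpers standing in for the app's real functions (assumptions for
# this sketch); swap in the actual implementations.
def get_county_by_id(all_counties, county_id):
    return all_counties.get(county_id, "")

def search_in_service(service, value, county_str):
    # Placeholder match logic for illustration only.
    return value in service["name"] or value in county_str

def filter_services(all_recommended_services, alternatives, all_counties):
    recommended_services = []
    for service in all_recommended_services:
        county_str = get_county_by_id(all_counties, service["county_id"])
        # all() stops at the first failing value; any() stops at the
        # first alternative whose values all match.
        if any(
            all(search_in_service(service, value, county_str)
                for value in alternative)
            for alternative in alternatives
        ):
            # Each service is visited once, so no membership scan is needed.
            recommended_services.append(service)
    return recommended_services
```

This only trims constant factors, though; the dominant cost is still search_in_service itself, which is exactly what a real search index addresses.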
I'm working on multilingual word embedding code where I need to train my data on English and test it on Spanish. I'll be using the MUSE library by Facebook for the word-embeddings.
I'm looking for a way to pre-process both datasets the same way. I've looked into diacritics restoration to deal with the accents.
I'm having trouble coming up with a way to carefully remove stopwords and punctuation, and deciding whether or not I should lemmatize.
How can I uniformly pre-process both languages to create a vocabulary list which I can later use with the MUSE library?
Hi Chandana, I hope you're doing well. I would look into using the library spaCy (https://spacy.io/api/doc); its creator has a YouTube video in which he discusses the implementation of NLP in other languages. Below you will find code that will lemmatize and remove stopwords. As far as punctuation goes, you can always set specific characters, such as accent marks, to ignore.
Personally, I use KNIME, which is free and open source, for preprocessing. You will have to install NLP extensions, but what is nice is that they have different extensions for different languages, which you can install here: https://www.knime.com/knime-text-processing. The Stop Word Filter (since 2.9) and the Snowball Stemmer node can be applied to Spanish. Make sure to select the right language in the dialog of the node. Unfortunately, there is no part-of-speech tagger node for Spanish so far.
import gensim
import gensim.parsing.preprocessing
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Create functions to lemmatize, stem, and preprocess,
# e.g. turn beautiful, beautifully, beautified into the stem "beauti"
def lemmatize_stemming(text):
    stemmer = PorterStemmer()
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Parse docs into individual words, ignoring words of 3 letters or fewer
# and stopwords (him, her, them, for, there, etc.), since "their" is not a topic.
# Then append the tokens to a list.
def preprocess(text):
    result = []
    new_stop_words = ['your_stopword1', 'your_stop_word2']
    for token in gensim.utils.simple_preprocess(text):
        if (token not in gensim.parsing.preprocessing.STOPWORDS
                and token not in new_stop_words
                and len(token) > 3):
            result.append(lemmatize_stemming(token))
    return result
I hope this helps; let me know if you have any questions. :)
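Separately from the lemmatizing code above, the diacritics issue from the question can be handled the same way for English and Spanish with plain Unicode normalization (a small standalone sketch, not tied to MUSE, KNIME, or spaCy):

```python
import unicodedata

def strip_accents(text):
    # NFD decomposes accented characters into a base letter plus combining
    # marks; dropping the marks maps e.g. "canción" to "cancion", so both
    # languages are normalized identically before building a vocabulary.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Whether you should do this depends on the embeddings: if the pretrained vectors keep accented forms, apply the same normalization (or none at all) consistently to both the training and test data.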
I want to use the rule-based Matcher (spaCy version 2.0.12) to locate codes in text that consist of 4 letters followed by 4 digits (e.g. CAPA1234). I am trying to use a pattern with the SHAPE attribute:
pattern = [{'SHAPE': 'XXXXdddd'}]
You can test it yourself with the Rule-based Matcher Explorer.
It is finding the codes I am expecting, but also longer ones like CAPABCD1234 or CAPA1234567. XXXX seems to mean 4 capital letters or more, and the same goes for dddd.
Is there a setting to make the shape match the text exactly?
I found a workaround that solves my problem, but it doesn't really explain why spaCy behaves the way it does. I will leave the question open.
Use SHAPE and additionally specify LENGTH explicitly:
pattern = [{'LENGTH': 8, 'SHAPE': 'XXXXdddd'}]
Please note that the online Explorer seems to fail when LENGTH is used (no tokens are highlighted). It works fine on my machine.
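If all you need is exact 4-letter-plus-4-digit codes, a plain regular expression over the token text side-steps the SHAPE truncation entirely (a standalone sketch; later spaCy versions also add a REGEX operator for token patterns, but the check below works anywhere):

```python
import re

# Exactly 4 capital letters followed by exactly 4 digits, nothing more.
CODE_PATTERN = re.compile(r"[A-Z]{4}\d{4}")

def is_code(token_text):
    # fullmatch anchors the pattern to the whole token, so longer strings
    # like CAPABCD1234 or CAPA1234567 are rejected.
    return CODE_PATTERN.fullmatch(token_text) is not None
```

The names CODE_PATTERN and is_code are my own, chosen for the example.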
I was wondering about best practices regarding the extraction of selectors to constants. As a general rule, it is usually recommended to extract magic numbers and string literals to constants so they can be reused, but I am not sure whether this is really a good approach when dealing with selectors in Capybara.
At the moment, I have a file called "selectors.rb" which contains the selectors that I use. Here is part of it:
SELECTORS = {
  checkout: {
    checkbox_agreement: 'input#agreement-1',
    input_billing_city: 'input#billing\:city',
    input_billing_company: 'input#billing\:company',
    input_billing_country: 'input#billing\:country_id',
    input_billing_firstname: 'input#billing\:firstname',
    input_billing_lastname: 'input#billing\:lastname',
    input_billing_postcode: 'input#billing\:postcode',
    input_billing_region: 'input#billing\:region_id',
    input_billing_street1: 'input#billing\:street1',
    ....
  }
}
In theory, I put my selectors in this file, and then I could do something like this:
find(SELECTORS[:checkout][:input_billing_city]).click
There are several problems with this:
- If I want to know the selector that is used, I have to look it up
- If I change a name in selectors.rb, I could forget to change it somewhere else, which will result in find(nil).click
- With the example above, I can't use this selector with fill_in(SELECTORS[:checkout][:input_billing_city]), because fill_in requires an ID, name, or label
There are probably a few more problems with it, so I am considering getting rid of the constants. Has anyone been in a similar spot? What is a good way to deal with this situation?
Someone mentioned the SitePrism gem to me: https://github.com/natritmeyer/site_prism
A Page Object Model DSL for Capybara
SitePrism gives you a simple, clean and semantic DSL for describing
your site using the Page Object Model pattern, for use with Capybara
in automated acceptance testing.
It is very helpful in that regard and I have adjusted my code accordingly.
Consider the following code samples:
Sample 1:
var IsAdminUser = (User.Privileges == AdminPrivileges)
    ? 'yes'
    : 'no';
console.log(IsAdminUser);
Sample 2:
var IsAdminUser = (User.Privileges == AdminPrivileges) ? 'yes' : 'no';
console.log(IsAdminUser);
I am very comfortable with the 2nd sample and code in that style, but I was told it is the wrong way of doing it, without any supporting reasons.
Why is it recommended not to write a ternary operator on a single line in Node.js?
Can anyone shed some light on why this is so?
Thanks in advance for any help.
With all coding standards, they are generally for readability and maintainability. My guess is the author finds it more readable on separate lines. The compiler/interpreter for your language will handle it all the same. As long as you and your project have a set standard and stick to it, you'll be fine. I recommend that the standards be worked on, or at least reviewed, by everyone on the project before casting them in stone. I also think that if you're breaking it up onto separate lines like that, you may as well use an if/else conditional block instead.
Be wary of coding standards rules that do not have a justification.
Personally, I do not like the ternary operator as it feels unnatural to me and I always have to read the line a few times to understand what it's doing. I find separate if/else blocks easier for me to read. Personal preference of course.
It is in fact wrong to put the ? on a new line, even though it doesn't hurt in practice.
The reason is a JS feature called "Automatic Semicolon Insertion". When a var statement ends with a newline (without a trailing comma, which would indicate that more declarations are to follow), your JS interpreter should automatically insert a semicolon.
This semicolon would have the effect that IsAdminUser is assigned a boolean value (namely the result of User.Privileges == AdminPrivileges). After that, a new (invalid) expression would start with the question mark of what you think is a ternary operator.
As mentioned, most JS interpreters are smart enough to recognize that you have a newline where you shouldn't have one, and implicitly fix your ternary operator. And when minifying your script, the newline is removed anyway.
So there is no problem in practice, but you are relying on an implicit fix in common JS engines. It's better to write the ternary operator like this:
var foo = bar ? "yes" : "no";
Or, for larger expressions:
var foo = bar ?
"The operation was successful" : "The operation has failed.";
Or even:
var foo = bar ?
"Congratulations, the operation was a total success!" :
"Oh, no! The operation has horribly failed!";
I completely disagree with the person who made this recommendation. The ternary operator is a standard feature of all C-style languages (C, C++, Java, C#, JavaScript, etc.), and most developers who code in these languages are completely comfortable with the single-line version.
The first version just looks weird to me. If I were maintaining the code and saw this, I would change it back to a single line.
If you want verbose, use if/else. If you want neat and compact, use a ternary.
My guess is that the person who made this recommendation simply wasn't very familiar with the operator and so found it confusing.
Because it's easier on the eye and easier to read. It's much easier to see at a glance what your first snippet is doing; I don't even have to read to the end of a line. I can simply look at one spot and immediately know what value IsAdminUser will have under what conditions. It's much the same reason why you wouldn't write an entire if/else block on one line.
Remember that these are style conventions and are not necessarily backed up by objective (or technical) reasoning.
The reason for putting ? and : on separate lines is that it's easier to figure out what changed when your source control shows a line-by-line comparison.
If you've changed only the text between the ? and the : and everything is on a single line, the entire line is marked as changed (depending on your comparison tool).