How does the spaCy Matcher work? - nlp

I am trying to create a dataset for training using the spaCy Matcher, so I am using the Matcher explorer, but I don't understand exactly how it works.
URL: url-matcher
My idea is to correctly label the word "conti" in the text from the URL (malware news). However, when I try it using the spaCy Matcher, it recognizes "Costa Rica", "one", "attack" and other words as "Conti"!
Why is this? Can somebody clarify it? How should I do it so that only the word "conti" gets labeled?
Thank you

Solved!
I don't know how the official Matcher explorer works, but I did some tests in PyCharm and it matches correctly.
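For anyone else running into this, here is a minimal sketch of a pattern that only matches the literal word "conti", case-insensitively (the example sentence is my own, not taken from the article behind the URL):
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# LOWER compares the lowercased token text, so "Conti" and "conti" both match
matcher.add("CONTI", [[{"LOWER": "conti"}]])

doc = nlp("The Conti ransomware group attacked Costa Rica.")  # hypothetical example text
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # prints only "Conti"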

Related

Spacy Matcher isn't always matching

I can't figure out why the matcher isn't working. This works:
import spacy
from spacy.matcher import Matcher

test = ["14k"]
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("test", [[{"NORM": "14k"}]])
docs = []
for doc in nlp.pipe(test):
    matches = matcher(doc)
    print(matches)
but if I change 14k to 14K in both my matcher and my text, the matcher finds nothing. Why? I just want to understand the difference, why this doesn't work, and how I could go about troubleshooting this myself in the future. I've looked at the docs:
https://spacy.io/api/matcher
and can't figure out where I'm going wrong. I changed "NORM" to ORTH and TEXT and it still didn't find it. Thank you for any help.
EDIT
OK, so I did:
for token in doc:
    print(token)
and for the lowercase version, spaCy was treating it all as one token, but when I uppercased the K, spaCy said it was two different tokens. With this knowledge I did matcher.add("test", [[{"ORTH": "14"}, {"ORTH": "K"}]]) and it worked.
I still want to know why. Why does spaCy think 14k is one "word" but 14K is two "words"?
It looks like you may be running into differences in tokenization for this kind of sequence. In particular, note that things that look like temperatures (a number followed by F, C, or K) may get special treatment. This may seem odd, but it usually results in better compatibility with existing corpora.
You can find out why an input is tokenized a particular way using tokenizer.explain() like so:
import spacy
nlp = spacy.blank("en")
print(nlp.tokenizer.explain("14K"))
print("...")
print(nlp.tokenizer.explain("14k"))
That gives the output:
[('TOKEN', '14'), ('SUFFIX', 'K')]
...
[('TOKEN', '14k')]
You can read more about this at the tokenizer.explain docs.
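If you want a single matcher that catches the figure under either tokenization, a sketch along these lines should work (the pattern name "KARAT" is just my choice, not anything from the original question):
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# "14k" stays a single token, while "14K" is split into "14" + "K",
# so register one pattern for each tokenization
matcher.add("KARAT", [
    [{"LOWER": "14k"}],
    [{"ORTH": "14"}, {"LOWER": "k"}],
])

for text in ["14k", "14K"]:
    doc = nlp(text)
    print(text, matcher(doc))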

What are some of the data preparation steps or techniques one needs to follow when dealing with multi-lingual data?

I'm working on multilingual word embedding code where I need to train my data on English and test it on Spanish. I'll be using the MUSE library by Facebook for the word embeddings.
I'm looking for a way to pre-process both datasets the same way. I've looked into diacritics restoration to deal with the accents.
I'm having trouble coming up with a way to carefully remove stopwords and punctuation, and deciding whether or not I should lemmatize.
How can I uniformly pre-process both languages to create a vocabulary list which I can later use with the MUSE library?
Hi Chandana, I hope you're doing well. I would look into using the library spaCy (https://spacy.io/api/doc); the man who created it has a YouTube video in which he discusses the implementation of NLP in other languages. Below you will find code that will lemmatize and remove stopwords. As for punctuation, you can always set specific characters, such as accent marks, to ignore.
Personally, I use KNIME, which is free and open source, to do preprocessing. You will have to install NLP extensions, but what is nice is that they have different extensions for different languages, which you can install here: https://www.knime.com/knime-text-processing. The Stop Word Filter (since 2.9) and the Snowball Stemmer node can be applied to Spanish. Make sure to select the right language in the dialog of the node. Unfortunately, there is no part-of-speech tagger node for Spanish so far.
import gensim
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Create functions to lemmatize, stem, and preprocess
# turn beautiful, beautifully, beautified into the stem beauti
def lemmatize_stemming(text):
    stemmer = PorterStemmer()
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# parse docs into individual words, ignoring words that are 3 letters or fewer
# and stopwords: him, her, them, for, there, etc., since "their" is not a topic,
# then append the tokens to a list
def preprocess(text):
    result = []
    newStopWords = ['your_stopword1', 'your_stop_word2']
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and token not in newStopWords and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
I hope this helps, let me know if you have any questions :)
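If you go the spaCy route suggested above, a rough sketch of applying the same pre-processing to both languages could look like the following (it assumes the pretrained pipelines en_core_web_sm and es_core_news_sm are installed; whether to keep lemmas or surface forms for MUSE is a separate decision):
import spacy

# the same filtering logic applied to both languages
def preprocess(nlp, texts):
    cleaned = []
    for doc in nlp.pipe(texts):
        tokens = [
            tok.lemma_.lower()
            for tok in doc
            if not tok.is_stop and not tok.is_punct and not tok.is_space
        ]
        cleaned.append(tokens)
    return cleaned

nlp_en = spacy.load("en_core_web_sm")
nlp_es = spacy.load("es_core_news_sm")

print(preprocess(nlp_en, ["The dogs are running in the park."]))
print(preprocess(nlp_es, ["Los perros corren en el parque."]))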

spaCy's rule-based Matcher finds tokens longer than specified by the shape

I want to use the rule-based Matcher (spaCy version 2.0.12) to locate, in text, codes that consist of 4 letters followed by 4 digits (e.g. CAPA1234). I am trying to use a pattern with the SHAPE attribute:
pattern = [{'SHAPE': 'XXXXdddd'}]
You can test it yourself with the Rule-based Matcher Explorer.
It is finding the codes I am expecting, but also longer ones like CAPABCD1234 or CAPA1234567. XXXX seems to mean 4 capital letters or more, and the same goes for dddd.
Is there a setting to make the shape match the text exactly?
I found a workaround that solves my problem, but doesn't really explain why spaCy behaves the way it does. I will leave the question open.
Use SHAPE and additionally specify LENGTH explicitly:
pattern = [{'LENGTH': 8, 'SHAPE': 'XXXXdddd'}]
Please note that the online Explorer seems to fail when LENGTH is used (no tokens are highlighted). It works fine on my machine.
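As far as I can tell, the reason is that the SHAPE feature truncates runs of the same character class after four characters, so CAPABCD1234 and CAPA1234567 also end up with the shape XXXXdddd. You can check this with a quick sketch:
import spacy

nlp = spacy.blank("en")
for text in ["CAPA1234", "CAPABCD1234", "CAPA1234567"]:
    token = nlp(text)[0]
    # all three print the shape 'XXXXdddd'
    print(token.text, token.shape_)
That is why the extra LENGTH constraint is needed to pin the match down to exactly 8 characters.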

How to view and interpret the output of an LDA model using gensim

I am able to create the LDA model and save it. Now I am trying to load the model and pass it a new document:
lda = LdaModel.load('..\\models\\lda_v0.1.model')
doc_lda = lda[new_doc_term_matrix]
print(doc_lda)
On printing doc_lda I get the object <gensim.interfaces.TransformedCorpus object at 0x000000F82E4BB630>.
However, I want to get the topic words associated with it. Which method do I have to use? I was referring to this.
Not sure if this is still relevant, but have you tried get_document_topics()? Though I assume that would only work if you've updated your LDA model using update().
I don't think there is anything wrong with your code - the "Usage example" from the documentation link you posted uses doc2bow which returns a sparse vector - I don't know what new_doc_term_matrix consists of, but I'll assume it worked fine.
You might want to look at this Stack Overflow question: you are trying to print an "object" that isn't printable; the data you want is somewhere inside the object, and that in itself is printable.
Alternatively, you can also use your IDE's capabilities - the Variable explorer in Spyder, for example - to click yourself into the objects and get the info you need.
For more info on similarity analysis with gensim, see this tutorial.
Use this:
lda.print_topics(-1)
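For completeness, a small sketch (reusing lda and new_doc_term_matrix from the question) that maps the topic ids in the transformed corpus back to their top words via show_topic(); the choice of topn=5 is arbitrary:
# each entry of the transformed corpus is a list of (topic_id, probability) pairs
for doc_topics in lda[new_doc_term_matrix]:
    for topic_id, prob in doc_topics:
        top_words = [word for word, _ in lda.show_topic(topic_id, topn=5)]
        print(topic_id, round(prob, 3), top_words)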

Solr Japanese tokenizer not working for katakana

I am using Solr 6.2.0 and fieldType: text_ja.
I am facing a problem with the JapaneseTokenizer. It properly tokenizes ドラゴンボールヒーロー:
ドラゴンボールヒーロー
↓
"ドラゴン"
"ドラゴンボールヒーロー"
"ボール"
"ヒーロー"
But it fails to tokenize ドラゴンボールヒーローズ properly:
ドラゴンボールヒーローズ
↓
"ドラゴン"
"ドラゴンボールヒーローズ"
"ボールヒーローズ"
Hence searching for ドラゴンボール doesn't hit in the latter case.
It also doesn't separate ディズニーランド into two words.
First, I'm fairly certain that it is working as intended. Looking into how the Kuromoji morphological analyzer works would probably be the best way to gain a better understanding of its rules and rationale.
There are a couple of things you could try. You could put the JapaneseAnalyzer into EXTENDED mode instead of SEARCH mode, which should give you significantly looser matching (though most likely at the cost of introducing more false positives, of course):
Analyzer analyzer = new JapaneseAnalyzer(
    null,
    JapaneseTokenizer.Mode.EXTENDED,
    JapaneseAnalyzer.getDefaultStopSet(),
    JapaneseAnalyzer.getDefaultStopTags()
);
Or you could try using CJKAnalyzer, instead.
(By the way, EnglishAnalyzer doesn't split "Disneyland" into two tokens either.)
I was able to solve this using the lucene-gosen Sen tokenizer and compiling the IPADIC dictionary with custom rules and word weights.
