I'm working on multilingual word-embedding code where I need to train on English data and test on Spanish data. I'll be using Facebook's MUSE library for the word embeddings.
I'm looking for a way to pre-process both datasets the same way. I've looked into diacritics restoration to deal with the accents.
I'm having trouble coming up with a careful way to remove stopwords and punctuation, and deciding whether or not I should lemmatize.
How can I uniformly pre-process both languages to create a vocabulary list that I can later use with the MUSE library?
Hi Chandana, I hope you're doing well. I would look into using the spaCy library (https://spacy.io/api/doc). Its creator has a YouTube video in which he discusses applying NLP to other languages. Below you will find code that will lemmatize and remove stopwords. As for punctuation, you can always specify characters, such as accent marks, for the punctuation filter to ignore.
Personally, I use KNIME, which is free and open source, to do preprocessing. You will have to install the NLP extensions, but what is nice is that there are different extensions for different languages, which you can install here: https://www.knime.com/knime-text-processing. The Stop Word Filter node (since 2.9) and the Snowball Stemmer node can be applied to Spanish. Make sure to select the right language in the node's dialog. Unfortunately, there is no part-of-speech tagger node for Spanish so far.
import gensim
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Create functions to lemmatize, stem, and preprocess
# e.g. turn beautiful, beautifully, beautified into the stem "beauti"
def lemmatize_stemming(text):
    stemmer = PorterStemmer()
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Split a document into individual words, ignoring words of 3 letters or fewer
# and stopwords (him, her, them, for, there, etc.), since a word like "their" is not a topic,
# then append the remaining tokens to a list
def preprocess(text):
    new_stop_words = ['your_stopword1', 'your_stop_word2']
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in STOPWORDS and token not in new_stop_words and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
I hope this helps. Let me know if you have any questions :)
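For the "same steps for both languages" part of the question, here is a minimal sketch using only the standard library. It is not MUSE- or spaCy-specific: the function name `normalize_tokens` and the tiny stopword sets are assumptions made up for illustration, and in practice you would plug in full stopword lists for each language. It normalizes Unicode to NFC (so accented letters are kept intact rather than stripped), lowercases, replaces punctuation with spaces, and filters stopwords per language.

```python
import re
import unicodedata

# Tiny illustrative stopword sets (assumptions); use full lists in practice.
STOPWORDS_BY_LANG = {
    "en": {"the", "a", "of", "and", "is", "in"},
    "es": {"el", "la", "de", "y", "es", "en"},
}

def normalize_tokens(text, lang):
    """Apply identical cleaning steps to text in either language."""
    # NFC composes accented letters into single code points, so "canción" stays one word
    text = unicodedata.normalize("NFC", text).lower()
    # Replace anything that is neither a word character nor whitespace with a space
    text = re.sub(r"[^\w\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS_BY_LANG[lang]]

english_tokens = normalize_tokens("The song is in the garden!", "en")
spanish_tokens = normalize_tokens("¡La canción es de la niña!", "es")
print(english_tokens)
print(spanish_tokens)
```

Keeping the accents matters if the vocabulary is later looked up in pretrained embeddings, since the lookup has to match the surface form those embeddings were trained on.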
I am able to create the LDA model and save it. Now I am trying to load the model and pass it a new document:
lda = LdaModel.load('..\\models\\lda_v0.1.model')
doc_lda = lda[new_doc_term_matrix]
print(doc_lda)
Printing doc_lda just shows the object: <gensim.interfaces.TransformedCorpus object at 0x000000F82E4BB630>
However, I want to get the topic words associated with it. Which method do I have to use? I was referring to this.
Not sure if this is still relevant, but have you tried get_document_topics()? Though I assume that would only work if you've updated your LDA model using update().
I don't think there is anything wrong with your code - the "Usage example" from the documentation link you posted uses doc2bow which returns a sparse vector - I don't know what new_doc_term_matrix consists of, but I'll assume it worked fine.
You might want to look at this Stack Overflow question: you are trying to print an object that isn't directly printable. The data you want is somewhere inside the object, and that data is printable.
Alternatively, you can also use your IDE's capabilities - the Variable explorer in Spyder, for example - to click yourself into the objects and get the info you need.
For more info on similarity analysis with gensim, see this tutorial.
Call print_topics on the model itself rather than on the transformed corpus:
lda.print_topics(-1)
I am trying to transform English statements into SQL queries.
e.g. How many products were created last year?
This should get transformed to select count(*) from products
where manufacturing date between 1/1/2015 and 31/12/2015
I am not able to understand how to map the verb "created" to the "manufacturing date" attribute in my table. I am using the Stanford CoreNLP suite to parse my statement. I am also using WordNet taxonomies with the JWI framework.
I have tried to map the verbs to the attributes by defining simple rules, but that is not a very generic approach, since I cannot know all the verbs in advance. Is there a better way to achieve this?
I would appreciate any help in this regard.
I know this would require a tool change, but I would recommend checking out Adapt by Mycroft AI.
It is a very straightforward intent parser that transforms user input into a JSON semantic representation.
For example:
Input: "Put on my Joan Jett Pandora station."
JSON:
{
"confidence": 0.61,
"target": null,
"Artist": "joan jett",
"intent_type": "MusicIntent",
"MusicVerb": "put on",
"MusicKeyword": "pandora"
}
It looks like the rules are very easy to specify and expand so you would just need to build out your rules and then have whatever tool you want process the JSON and send the SQL query.
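As a rough sketch of that last step (JSON in, SQL out), something like the following could work. Everything here is an assumption for illustration: the intent fields ("Entity", "Verb", "StartDate", "EndDate"), the two lookup tables, and the products schema are hypothetical and are not Adapt's actual output format. Table and column names come from whitelists, and the dates are passed as query parameters, so no raw user text ends up in the SQL string.

```python
import json
import sqlite3

# Hypothetical lookup tables (assumptions): these are the rules you would grow over time
ENTITY_TO_TABLE = {"products": "products"}
VERB_TO_COLUMN = {"created": "manufacturing_date", "manufactured": "manufacturing_date"}

def intent_to_sql(intent_json):
    """Build a parameterized COUNT query from a parsed intent."""
    intent = json.loads(intent_json)
    table = ENTITY_TO_TABLE[intent["Entity"]]   # whitelist lookup
    column = VERB_TO_COLUMN[intent["Verb"]]     # maps "created" to the date column
    sql = f"SELECT COUNT(*) FROM {table} WHERE {column} BETWEEN ? AND ?"
    return sql, (intent["StartDate"], intent["EndDate"])

# A made-up intent for "How many products were created last year?"
intent_json = json.dumps({
    "intent_type": "CountIntent",
    "Entity": "products",
    "Verb": "created",
    "StartDate": "2015-01-01",
    "EndDate": "2015-12-31",
})

# Try the query against a small in-memory table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, manufacturing_date TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("a", "2015-03-01"), ("b", "2015-11-20"), ("c", "2016-02-02")],
)
sql, params = intent_to_sql(intent_json)
count = conn.execute(sql, params).fetchone()[0]
print(sql, params, count)
```

Dates are stored as ISO strings here so that BETWEEN compares them correctly as text.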
The official documentation of token.tag_ in spaCy is as follows:
A fine-grained, more detailed tag that represents the word-class and some basic morphological information for the token. These tags are primarily designed to be good features for subsequent models, particularly the syntactic parser. They are language and treebank dependent. The tagger is trained to predict these fine-grained tags, and then a mapping table is used to reduce them to the coarse-grained .pos tags.
But it doesn't list the full available tags and each tag's explanation. Where can I find it?
Finally I found it inside spaCy's source code: glossary.py. And this link explains the meaning of different tags.
Available values for token.tag_ are language specific. With language here, I don't mean English or Portuguese, I mean 'en_core_web_sm' or 'pt_core_news_sm'. In other words, they are language model specific and they are defined in the TAG_MAP, which is customizable and trainable. If you don't customize it, it will be default TAG_MAP for that language.
As of the writing of this answer, spacy.io/models lists all of the pretrained models and their labeling schemes.
Now, for the explanations. If you are working with English or German text, you're in luck! You can use spacy.explain() or consult the glossary on GitHub for the full list. If you are working with other languages, token.pos_ values are always those of Universal Dependencies and will work regardless.
To finish up, if you are working with other languages, for a full explanation of the tags, you are going to have to look for them in the sources listed in the models page for your model of interest. For instance, for Portuguese I had to track the explanations for the tags in the Portuguese UD Bosque Corpus used to train the model.
Here is the list of tags:
TAG_MAP = [
".",
",",
"-LRB-",
"-RRB-",
"``",
"\"\"",
"''",
":",
"$",
"#",
"AFX",
"CC",
"CD",
"DT",
"EX",
"FW",
"HYPH",
"IN",
"JJ",
"JJR",
"JJS",
"LS",
"MD",
"NIL",
"NN",
"NNP",
"NNPS",
"NNS",
"PDT",
"POS",
"PRP",
"PRP$",
"RB",
"RBR",
"RBS",
"RP",
"SP",
"SYM",
"TO",
"UH",
"VB",
"VBD",
"VBG",
"VBN",
"VBP",
"VBZ",
"WDT",
"WP",
"WP$",
"WRB",
"ADD",
"NFP",
"GW",
"XX",
"BES",
"HVS",
"_SP",
]
The list of tags and POS values that spaCy uses is at the link below.
https://spacy.io/api/annotation
Universal parts of speech tags
English
German
You can get an explanation using:
from spacy import glossary
tag_name = 'ADP'
glossary.explain(tag_name)
Version: 3.3.0
Source: https://github.com/explosion/spaCy/blob/master/spacy/glossary.py
You can use the following:
import spacy
dir(spacy.parts_of_speech)
I am new to the NLP domain, but my current research needs some text parsing (also called keyword extraction) of URL addresses, e.g. a fake URL:
http://ads.goole.com/appid/heads
My parsing has two constraints:
The first "ads" and the last "heads" should be treated differently, because the "ads" inside "heads" is just part of a longer word rather than an advertisement.
"appid" should be split into two parts, 'app' and 'id', both of which carry semantic meaning on the Internet.
I have tried the Stanford NLP toolkit and the Google search engine. The former tries to classify each word by its grammatical role, which falls short of what I expected. Google is smarter about "appid" and suggests "app id".
I cannot access Google's search history, which is presumably why it suggests "app id": many people have searched for those words. Are there any offline methods that perform similar parsing?
UPDATE:
Please skip the regex suggestions because there is a potentially unknown number of compositions of words like "appid" in even simple URLs.
Thanks,
Jamin
Rather than tokenization, what it sounds like you really want to do is called word segmentation. This is for example a way to make sense of asentencethathasnospaces.
I haven't gone through this entire tutorial, but this should get you started. They even give urls as a potential use case.
http://jeremykun.com/2012/01/15/word-segmentation/
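To give a feel for how word segmentation works, here is a minimal unigram-based sketch in the spirit of that approach. The word counts are made-up toy numbers and the unknown-word penalty is an arbitrary choice for illustration; a real system would load counts from a large corpus. It tries every split point, scores each candidate by the sum of log-probabilities of its words, and memoizes the results.

```python
import math
from functools import lru_cache

# Toy unigram counts (made-up numbers); a real system would load corpus counts
COUNTS = {"app": 500, "id": 400, "ads": 300, "heads": 200, "head": 250, "he": 600, "a": 900}
TOTAL = sum(COUNTS.values())

def word_logprob(word):
    """Log-probability of a single word under the toy unigram model."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    # Unknown words get a penalty that grows with their length
    return math.log(1.0 / (TOTAL * 10 ** len(word)))

@lru_cache(maxsize=None)
def segment(text):
    """Return (score, words) for the most probable segmentation of text."""
    if not text:
        return (0.0, ())
    candidates = []
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = segment(rest)
        candidates.append((word_logprob(first) + rest_score, (first,) + rest_words))
    return max(candidates)

for token in ["ads", "appid", "heads"]:
    print(token, segment(token)[1])
```

With these counts, "appid" is split into "app" and "id", while "heads" stays whole because the single word scores higher than "he" + "ads" or "head" + "s".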
The Python wordsegment module can do this. It's an Apache2 licensed module for English word segmentation, written in pure-Python, and based on a trillion-word corpus.
Based on code from the chapter “Natural Language Corpus Data” by Peter Norvig from the book “Beautiful Data” (Segaran and Hammerbacher, 2009).
Data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium. This module contains only a subset of that data. The unigram data includes only the most common 333,000 words. Similarly, bigram data includes only the most common 250,000 phrases. Every word and phrase is lowercased with punctuation removed.
Installation is easy with pip:
$ pip install wordsegment
Simply call segment to get a list of words:
>>> import wordsegment as ws
>>> ws.segment('http://ads.goole.com/appid/heads')
['http', 'ads', 'goole', 'com', 'appid', 'heads']
As you noticed, the old corpus doesn't rank "app id" very high. That's ok. We can easily teach it. Simply add it to the bigram_counts dictionary.
>>> ws.bigram_counts['app id'] = 10.2e6
>>> ws.segment('http://ads.goole.com/appid/heads')
['http', 'ads', 'goole', 'com', 'app', 'id', 'heads']
I chose the value 10.2e6 by doing a Google search for "app id" and noting the number of results.