Thank you for your assistance. I am using spacy to parse though documents to find instances of certain words and extract the sentence in a new df[column].
Here are some texts:
text = 'Many people like Germany. It is a great country. Germany exports lots of technology. France is also a great country. France exports wine. Europeans like to travel. They spend lot of time of beaches. Spain is one of their travel locations. Spain appreciates tourists. Spain's economy is strengthened by tourism. Spain has asked and Germany is working to assist with the travel of tourists to Spanish beaches. Spain also like to import French wine. France would like to sell more wine to Spain.'
My code works like this:
def sent_matcher(text: str) -> list:
doc = nlp(text)
sent_list = []
phrase_matcher = PhraseMatcher(nlp.vocab)
phrases = ['Germany', 'France']
patterns = nlp(data) for data in phrases]
phrase_matcher.add('EU entity', None, * patterns)
for sent in doc.sents:
for match_id, start, end in phrase_matcher(nlp(sent.text)):
if nlp.vocab.strings[match_id] in ['EU entity']:
sent_list.append(sent)
text = (sent_list)
return text
This code works fine and pulls all the sentences that include the EU entity.
However, I wanted to take this to the next level and pull out sentences where the EU entity is the actor and identify what type of action they were taking. I tried using POS/Dependency to pull out Proper nouns combined with the verb but the nsubj was not always correct or the nsubj was linked to another word in a compound noun structure. I tried extracting instances where the country was the first actor (if token == 'x') but I always threw a string error even if I tokenized the word. I also tried using noun_chunks but then I couldn't isolate the instance of the country or tie that chunk back to the verb.
I am pretty new to NLP so any thoughts would be greatly appreciated on how to code this and reap the desired output.
Thank you for your help!
It sounds like if you use merge_entities and follow the rule-based matching guide for the DependencyMatcher you should be able to do this pretty easily. It won't be perfect but you should be able to match many instances.
Related
I was trying semcor corpus in nltk.
I found this code here:
>>> list(map(str, semcor.tagged_chunks(tag='both')[:3]))
['(DT The)', "(Lemma('group.n.01.group') (NE (NNP Fulton County Grand Jury)))", "(Lemma('state.v.01.say') (VB said))"]
I tried the same on colab (check last cell in this notebook):
>>> list(map(str, semcor.tagged_chunks(tag='both')[:3]))
['(DT The)',
'(group.n.01 (NE (NNP Fulton County Grand Jury)))',
'(say.v.01 (VB said))']
Here is the screenshot from colab:
The problem
Note that on nltk page, for Fulton County Grand Jury output is given as Lemma('group.n.01.group'), but on colab, I am getting group.n.01. So I am not getting sense / synset lemma.
In group.n.01.group
first group is a "stem for sense word"
last group is "stem for input"
In group.n.01
(first and only) group is "stem for input"
no "stem for sense word" is returned
Weird thing is that it was giving me correct output yesterday. This notebook will clear the doubt as it has same two lines executed today and yesterday. Yesterday (2/9/2021), I was getting tags in format group.n.01.group, but today I am getting tags in group.n.01 format (NOTICE RED AND BLUE COMMENTS):
What I am missing here?
I knew that semcor uses wordnet senses to tag to subset of brown corpus. But I was not aware that semcor APIs can work with or without wordnet predownloaded and it will give tags in different format in these different scenarios. I honestly feel, at least semcor API documentation should have some mention of this.
So, without wordnet predownloaded, it does not return sense stems:
With wordnet pre-downloaded, it does return sense stems:
For example, I get I document that contains 2 sentences: I am a person. He also likes apples.
Do we need to count the cooccurrence of "person" and "He" ?
Each document is separated with a line break. Context windows of cooccurrences are limited to each document.
Based on the implementation here.
A newline is taken as indicating a new document (contexts won't cross newline).
So, depending on how you prepare sentences, you may get different results:
Setting 1: ('He', 'person') cooccurred
...
I am a person. He also likes apples.
...
Setting 2: ('He', 'person') not cooccurred
...
I am a person.
He also likes apples.
...
I'm working on multilingual word embedding code where I need to train my data on English and test it on Spanish. I'll be using the MUSE library by Facebook for the word-embeddings.
I'm looking for a way to pre-process both my data the same way. I've looked into diacritics restoration to deal with the accents.
I'm having trouble coming up with a way in which I can carefully remove stopwords, punctuations and weather or not I should lemmatize.
How can I uniformly pre-process both the languages to create a vocabulary list which I can later use with the MUSE library.
Hi Chandana I hope you're doing well. I would look into using the library spaCy https://spacy.io/api/doc the man that created it has a youtube video in which he discusses the implementation of of NLP in other languages. Below you will find code that will lemmatize and remove stopwords. as far as punctuation you can always set specific characters such as accent marks to ignore. Personally I use KNIME which is free and open source to do preprocessing. You will have to install nlp extentions but what is nice is that they have different extensions for different languages you can install here: https://www.knime.com/knime-text-processing the Stop word filter (since 2.9) and the Snowball stemmer node can be applied for Spanish language. Make sure to select the right language in the dialog of the node. Unfortunately there is no part of speech tagger node for Spanish so far.
# Create functions to lemmatize stem, and preprocess
# turn beautiful, beautifuly, beautified into stem beauti
def lemmatize_stemming(text):
stemmer = PorterStemmer()
return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))
# parse docs into individual words ignoring words that are less than 3 letters long
# and stopwords: him, her, them, for, there, ect since "their" is not a topic.
# then append the tolkens into a list
def preprocess(text):
result = []
for token in gensim.utils.simple_preprocess(text):
newStopWords = ['your_stopword1', 'your_stop_word2']
if token not in gensim.parsing.preprocessing.STOPWORDS and token not in newStopWords and len(token) > 3:
nltk.bigrams(token)
result.append(lemmatize_stemming(token))
return result
I hope this helps let me know if you have any questions :)
I'm trying to extract named entities (people, persons and organizations) using LingPipe and following this tutorial. Here is the full text that I am trying to extract names from and here is the code (exception handling omitted for brevity):
Chunker chunker = readChunker("/path-to-chunker"); // custom method for
reading the model
String article = "Some long news article spanning multiple lines...";
Chunking chunking = chunker.chunk(article);
Set<Chunk> chunkingSet = chunking.chunkSet();
for (Chunk chunk : chunkingSet) {
String name = article.substring(chunk.start(), chunk.end()));
System.out.println(name);
}
And this is (part of) the output I get:
Tony Abbott
Indonesia
Joko Widodo
Sukumaran
Andrew Chan
Bali.
pair
the Bali
Nusa Kambangan
Indonesian
Indonesian
I’
Widodo. I
” Abbott
Julie Bishop
Widodo
al-Jazeera
Sukumaran
Chan
Bishop
”
As you can see, there are a lot of mismatches/partial matches like Bali., pair, the Bali, I', Widodo. I, " Abbott, ". I assume library's NER is working just fine and the problem is that the above code is somehow misusing the classes/methods from this library. But I just can't seem to find what is wrong about the code?
Any ideas?
It seems that the problem is in the model of chunker that you read in the first line. Probably, it uses wrong tokenizer, which is the source of Bali., I', Widodo. I, " Abbott, ".
pair and the Bali can be explained by ordinary errors (taggers usually have not more than 80-90% precision). However, the source of such errors can also be explained by bad model - for example, it can be trained for another domain.
Btw, why do you use longpipe, but not Stanford NER? It shows better results, as a rule; e.g. the first available (i.e. random) article.
Also, here is a good step-by-step tutorial for Standford NER.
I am new in the NLP domain, but my current research needs some text parsing (or called keyword extraction) from URL addresses, e.g. a fake URL,
http://ads.goole.com/appid/heads
Two constraints are put on my parsing,
The first "ads" and last "heads" should be distinct because "ads" in the "heads" means more suffix rather than an advertisement.
The "appid" can be parsed into two parts; that is 'app' and 'id', both taking semantic meanings on the Internet.
I have tried the Stanford NLP toolkit and Google search engine. The former tries to classify each word in a grammar meaning which is under my expectation. The Google engine shows more smartness about "appid" which gives me suggestions about "app id".
I can not look over the reference of search history in Google search so that it gives me "app id" because there are many people have searched these words. Can I get some offline line methods to perform similar parsing??
UPDATE:
Please skip the regex suggestions because there is a potentially unknown number of compositions of words like "appid" in even simple URLs.
Thanks,
Jamin
Rather than tokenization, what it sounds like you really want to do is called word segmentation. This is for example a way to make sense of asentencethathasnospaces.
I haven't gone through this entire tutorial, but this should get you started. They even give urls as a potential use case.
http://jeremykun.com/2012/01/15/word-segmentation/
The Python wordsegment module can do this. It's an Apache2 licensed module for English word segmentation, written in pure-Python, and based on a trillion-word corpus.
Based on code from the chapter “Natural Language Corpus Data” by Peter Norvig from the book “Beautiful Data” (Segaran and Hammerbacher, 2009).
Data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium. This module contains only a subset of that data. The unigram data includes only the most common 333,000 words. Similarly, bigram data includes only the most common 250,000 phrases. Every word and phrase is lowercased with punctuation removed.
Installation is easy with pip:
$ pip install wordsegment
Simply call segment to get a list of words:
>>> import wordsegment as ws
>>> ws.segment('http://ads.goole.com/appid/heads')
['http', 'ads', 'goole', 'com', 'appid', 'heads']
As you noticed, the old corpus doesn't rank "app id" very high. That's ok. We can easily teach it. Simply add it to the bigram_counts dictionary.
>>> ws.bigram_counts['app id'] = 10.2e6
>>> ws.segment('http://ads.goole.com/appid/heads')
['http', 'ads', 'goole', 'com', 'app', 'id', 'heads']
I chose the value 10.2e6 by doing a Google search for "app id" and noting the number of results.