English word segmentation in NLP? - web

I am new to NLP, but my current research needs some text parsing (also called keyword extraction) of URLs, for example this fake URL:
http://ads.goole.com/appid/heads
I have two constraints on the parsing:
First, the leading "ads" and the trailing "heads" should be kept distinct, because the "ads" inside "heads" is merely a suffix of a longer word rather than an advertisement.
Second, "appid" should be split into two parts, 'app' and 'id', both of which carry meaning on the Internet.
I have tried the Stanford NLP toolkit and the Google search engine. The former tries to classify each word by its grammatical role, which falls short of what I need. Google is smarter about "appid" and suggests "app id".
I cannot access Google's search history; presumably it suggests "app id" because many people have searched for those words. Are there any offline methods that perform similar parsing?
UPDATE:
Please skip the regex suggestions because there is a potentially unknown number of compositions of words like "appid" in even simple URLs.
Thanks,
Jamin

Rather than tokenization, what it sounds like you really want to do is called word segmentation. This is for example a way to make sense of asentencethathasnospaces.
I haven't gone through this entire tutorial, but it should get you started. It even gives URLs as a potential use case.
http://jeremykun.com/2012/01/15/word-segmentation/
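To make the idea concrete, here is a minimal sketch of one common approach to word segmentation: try every split point, score each candidate by unigram log-probability, and keep the best one. The counts in COUNTS and the unknown-word penalty are made up for illustration; a real segmenter would use frequencies from a large corpus.

```python
import math
from functools import lru_cache

# Toy unigram counts (hypothetical numbers, for illustration only).
COUNTS = {"app": 500, "id": 400, "ads": 300, "heads": 200,
          "head": 250, "a": 1000, "he": 600}
TOTAL = sum(COUNTS.values())


def word_logprob(word):
    """Log-probability of a single word under the toy unigram model."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    # Unknown words get a penalty that grows with their length.
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


@lru_cache(maxsize=None)
def _best(text):
    """Return (score, words) for the best segmentation of text."""
    if not text:
        return (0.0, ())
    candidates = []
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = _best(rest)
        candidates.append((word_logprob(first) + rest_score,
                           (first,) + rest_words))
    return max(candidates)


def segment(text):
    """Split a string without spaces into its most probable words."""
    return list(_best(text)[1])


print(segment("appid"))
print(segment("heads"))
```

With these toy counts, "heads" stays whole because the single known word scores higher than "he" + "ads", which is the behaviour the question asks for.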

The Python wordsegment module can do this. It's an Apache2-licensed module for English word segmentation, written in pure Python, and based on a trillion-word corpus.
It is based on code from the chapter “Natural Language Corpus Data” by Peter Norvig from the book “Beautiful Data” (Segaran and Hammerbacher, 2009).
Data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium. This module contains only a subset of that data. The unigram data includes only the most common 333,000 words. Similarly, bigram data includes only the most common 250,000 phrases. Every word and phrase is lowercased with punctuation removed.
Installation is easy with pip:
$ pip install wordsegment
Simply call segment to get a list of words:
>>> import wordsegment as ws
>>> ws.segment('http://ads.goole.com/appid/heads')
['http', 'ads', 'goole', 'com', 'appid', 'heads']
As you noticed, the old corpus doesn't rank "app id" very high. That's ok. We can easily teach it. Simply add it to the bigram_counts dictionary.
>>> ws.bigram_counts['app id'] = 10.2e6
>>> ws.segment('http://ads.goole.com/appid/heads')
['http', 'ads', 'goole', 'com', 'app', 'id', 'heads']
I chose the value 10.2e6 by doing a Google search for "app id" and noting the number of results.

Related

Extracting sentences with Spacy POS/DEP : actor and action

Thank you for your assistance. I am using spaCy to parse through documents to find instances of certain words and extract the sentence into a new df[column].
Here is some text:
text = "Many people like Germany. It is a great country. Germany exports lots of technology. France is also a great country. France exports wine. Europeans like to travel. They spend lot of time of beaches. Spain is one of their travel locations. Spain appreciates tourists. Spain's economy is strengthened by tourism. Spain has asked and Germany is working to assist with the travel of tourists to Spanish beaches. Spain also like to import French wine. France would like to sell more wine to Spain."
My code works like this:
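import spacy
from spacy.matcher import PhraseMatcher

# Assumption: the small English model; the question does not say which one was loaded
nlp = spacy.load('en_core_web_sm')

def sent_matcher(text: str) -> list:
    doc = nlp(text)
    sent_list = []
    phrase_matcher = PhraseMatcher(nlp.vocab)
    phrases = ['Germany', 'France']
    patterns = [nlp(data) for data in phrases]
    # spaCy v3 signature; in v2 this was add('EU entity', None, *patterns)
    phrase_matcher.add('EU entity', patterns)
    for sent in doc.sents:
        # Collect every sentence that contains one of the phrases
        for match_id, start, end in phrase_matcher(nlp(sent.text)):
            if nlp.vocab.strings[match_id] in ['EU entity']:
                sent_list.append(sent)
    return sent_list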
This code works fine and pulls all the sentences that include the EU entity.
However, I wanted to take this to the next level and pull out sentences where the EU entity is the actor and identify what type of action they were taking. I tried using POS/dependency tags to pull out proper nouns combined with the verb, but the nsubj was not always correct, or the nsubj was linked to another word in a compound noun structure. I tried extracting instances where the country was the first actor (if token == 'x'), but it always threw a string error even when I tokenized the word. I also tried using noun_chunks, but then I couldn't isolate the instance of the country or tie that chunk back to the verb.
I am pretty new to NLP so any thoughts would be greatly appreciated on how to code this and reap the desired output.
Thank you for your help!
It sounds like if you use merge_entities and follow the rule-based matching guide for the DependencyMatcher you should be able to do this pretty easily. It won't be perfect but you should be able to match many instances.
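A minimal sketch of that suggestion, assuming spaCy v3. To keep it runnable without downloading a trained model, the parse of one sentence is written out by hand (the heads, deps and pos values are my own annotation, not model output). With a real pipeline you would load a model, add the component with nlp.add_pipe("merge_entities"), and call nlp(text) instead. The pattern name EU_ACTOR and the node names "action" and "actor" are illustrative.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-annotated parse of one sentence (assumption: stands in for model output).
words = ["Germany", "exports", "lots", "of", "technology", "."]
heads = [1, 1, 1, 2, 3, 1]  # absolute index of each token's head
deps = ["nsubj", "ROOT", "dobj", "prep", "pobj", "punct"]
pos = ["PROPN", "VERB", "NOUN", "ADP", "NOUN", "PUNCT"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps, pos=pos)

# Anchor on a verb, then require a direct child that is its subject
# and whose text is one of the countries of interest.
pattern = [
    {"RIGHT_ID": "action", "RIGHT_ATTRS": {"POS": "VERB"}},
    {
        "LEFT_ID": "action",
        "REL_OP": ">",
        "RIGHT_ID": "actor",
        "RIGHT_ATTRS": {"DEP": "nsubj", "ORTH": {"IN": ["Germany", "France"]}},
    },
]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("EU_ACTOR", [pattern])


def actor_actions(doc):
    """Return (actor, action) text pairs for every match in the doc."""
    pairs = []
    for match_id, token_ids in matcher(doc):
        # token_ids follow the order of the pattern: action first, then actor
        action, actor = doc[token_ids[0]], doc[token_ids[1]]
        pairs.append((actor.text, action.text))
    return pairs


pairs = actor_actions(doc)
print(pairs)
```

Merging entities first matters for multi-word names, since the subject then becomes a single token that the pattern can match directly.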

Getting sense stems for nltk semcor corpus words

I was trying out the semcor corpus in NLTK.
I found this code here:
>>> list(map(str, semcor.tagged_chunks(tag='both')[:3]))
['(DT The)', "(Lemma('group.n.01.group') (NE (NNP Fulton County Grand Jury)))", "(Lemma('state.v.01.say') (VB said))"]
I tried the same on colab (check last cell in this notebook):
>>> list(map(str, semcor.tagged_chunks(tag='both')[:3]))
['(DT The)',
'(group.n.01 (NE (NNP Fulton County Grand Jury)))',
'(say.v.01 (VB said))']
The problem
Note that on the NLTK page, the output for Fulton County Grand Jury is given as Lemma('group.n.01.group'), but on Colab I am getting group.n.01. So I am not getting the sense / synset lemma.
In group.n.01.group
first group is a "stem for sense word"
last group is "stem for input"
In group.n.01
(first and only) group is "stem for input"
no "stem for sense word" is returned
The weird thing is that it was giving me the correct output yesterday. The notebook has the same two lines executed today and yesterday. Yesterday (2/9/2021), I was getting tags in the format group.n.01.group, but today I am getting tags in the format group.n.01.
What am I missing here?
I knew that semcor uses WordNet senses to tag a subset of the Brown corpus. But I was not aware that the semcor API works both with and without WordNet downloaded beforehand, and that it gives tags in a different format in each case. I honestly feel the semcor API documentation should at least mention this.
So, without WordNet downloaded beforehand, it does not return sense stems (tags look like group.n.01).
With WordNet downloaded beforehand, it does return sense stems (tags look like Lemma('group.n.01.group')).

How to get a sort of inverse lemmatizations for every language?

I found the spaCy library, which lets me lemmatize words (blacks -> black in English, bianchi -> bianco in Italian). My work is to analyze entities, not verbs or adjectives.
I'm looking for something that gives me all the possible word forms starting from the canonical form.
For example, going from "black" to "blacks" in English, or from "bianco" in Italian to "bianca", "bianchi", "bianche", etc. Is there any library that does this?
I'm not clear on exactly what you're looking for, but if a list of English lemmas is all you need, you can extract that easily enough from a GitHub library I have. Take a look at LemmInflect. It starts with a dictionary approach to lemmatization, and there is a .csv file in the repository with all the different lemmas and their inflections. The file is LemmInflect/lemminflect/resources/infl_lu.csv.gz. You'll have to extract the lemmas from it. Something like...
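import gzip

# Path is relative to a local checkout of the LemmInflect repository
with gzip.open('LemmInflect/lemminflect/resources/infl_lu.csv.gz', 'rt') as f:
    for line in f:
        # The first two comma-separated fields are the lemma and its part of speech
        parts = line.strip().split(',')
        lemma = parts[0]
        pos = parts[1]
        print(lemma, pos)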
Alternatively, if you need a system to inflect words, this is what LemmInflect is designed to do. You can use it as a stand-alone library or as an extension to spaCy. There are examples of how to use it in the README.md and in the ReadTheDocs documentation.
I should note that this is for English only. I haven't seen a lot of code for inflecting words and you may have some difficulty finding this for other languages.
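For languages without such a library, one possible fallback is to invert any form-to-lemma table you can get hold of. The sketch below assumes you already have (form, lemma) pairs, for example collected by running a lemmatizer over a large vocabulary; the pairs, FORM_TO_LEMMA, and both function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (form, lemma) pairs; in practice these could come from
# lemmatizing every word of a large vocabulary.
FORM_TO_LEMMA = [
    ("black", "black"), ("blacks", "black"),
    ("bianco", "bianco"), ("bianca", "bianco"),
    ("bianchi", "bianco"), ("bianche", "bianco"),
]


def build_inflection_index(pairs):
    """Map each lemma to the set of surface forms seen for it."""
    index = defaultdict(set)
    for form, lemma in pairs:
        index[lemma].add(form)
    return index


INDEX = build_inflection_index(FORM_TO_LEMMA)


def inflections(lemma):
    """Return the known forms of a lemma, or just the lemma if it is unknown."""
    return sorted(INDEX.get(lemma, {lemma}))


print(inflections("bianco"))
print(inflections("black"))
```

The result only covers forms that actually occur in the input table, so coverage depends entirely on the vocabulary you lemmatize.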

What are some of the data preparation steps or techniques one needs to follow when dealing with multi-lingual data?

I'm working on multilingual word embedding code where I need to train on English data and test on Spanish data. I'll be using the MUSE library by Facebook for the word embeddings.
I'm looking for a way to pre-process both datasets the same way. I've looked into diacritics restoration to deal with the accents.
I'm having trouble coming up with a way to carefully remove stopwords and punctuation, and deciding whether or not I should lemmatize.
How can I uniformly pre-process both languages to create a vocabulary list that I can later use with the MUSE library?
Hi Chandana, I hope you're doing well. I would look into the spaCy library (https://spacy.io/api/doc); its creator has a YouTube video in which he discusses implementing NLP in other languages. Below you will find code that lemmatizes and removes stopwords. As for punctuation, you can always set specific characters, such as accent marks, to be ignored.
Personally, I use KNIME, which is free and open source, for preprocessing. You will have to install the NLP extensions, but what is nice is that there are different extensions for different languages, which you can install here: https://www.knime.com/knime-text-processing. The Stop word filter (since 2.9) and the Snowball stemmer node can be applied to Spanish. Make sure to select the right language in the node's dialog. Unfortunately, there is no part-of-speech tagger node for Spanish so far.
import gensim
from nltk.stem import PorterStemmer, WordNetLemmatizer  # the lemmatizer needs nltk.download('wordnet')

# Extra stopwords to drop on top of gensim's built-in English list
newStopWords = ['your_stopword1', 'your_stop_word2']

# Lemmatize (as a verb), then stem,
# e.g. turn beautiful, beautifully, beautified into the stem beauti
def lemmatize_stemming(text):
    stemmer = PorterStemmer()
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Split a document into tokens, skipping stopwords (him, her, them, for, there, etc.)
# and tokens of 3 characters or fewer, then collect the stemmed tokens in a list
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and token not in newStopWords and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
I hope this helps. Let me know if you have any questions :)
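The code above is English-specific (gensim's stopword list, the Porter stemmer). As a rough sketch of language-uniform preprocessing, the same normalization steps can be applied to both languages with only the stopword list swapped. This assumes the target embeddings keep accented characters, and the tiny stopword sets are placeholders for real per-language lists.

```python
import re
import unicodedata

# Tiny illustrative stopword lists (hypothetical); real lists are much longer.
STOPWORDS = {
    "en": {"the", "is", "a", "of", "and"},
    "es": {"el", "la", "es", "un", "una", "de", "y"},
}

# \w matches Unicode letters in Python 3, so accented characters are kept.
TOKEN_RE = re.compile(r"\w+")


def preprocess(text, lang):
    """Normalize, lowercase, tokenize, and drop stopwords and pure numbers."""
    # NFC composes characters such as n + combining tilde into a single code point.
    text = unicodedata.normalize("NFC", text).lower()
    tokens = TOKEN_RE.findall(text)
    return [t for t in tokens if t not in STOPWORDS[lang] and not t.isdigit()]


print(preprocess("The house is big.", "en"))
print(preprocess("La casa es grande, y el niño corre.", "es"))
```

Punctuation disappears as a side effect of the tokenizer, so both languages go through exactly the same steps. Whether to lemmatize on top of this is a separate choice that should match how the embedding vocabulary was built.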

How to programmatically access wordnet hierarchy?

Suppose that, for any word, I want to access its IS-A parent and its HAS-A values. Is this possible using any API?
You can use the Python API of the Natural Language Toolkit (NLTK). In WordNet, the IS-A relation is called hypernym (opposite: hyponym) and the HAS-A relation is called meronym (opposite: holonym).
>>> from nltk.corpus import wordnet
>>> book = wordnet.synsets('book')[0]
>>> book.hypernyms()
[Synset('publication.n.01')]
>>> book.part_meronyms()
[Synset('running_head.n.01'), Synset('signature.n.05')]
I also found the NodeBox Linguistics API easier to use:
>>> import en
>>> en.noun.hypernym('book')
[['publication']]
Shameless plug:
I'm writing a Scala library to access WordNet. While not all the similarity measures have been implemented, all the word senses and relations are available. I use it for my research, so it's in active development.
import com.github.mrmechko.swordnet._
SKey("book",SPos.Noun)
//> List(SKey("publication%1:10:00::"))
SKey("publication%1:10:00::").getRelation(SRelationType.hypernym) //Hypernyms
SKey("publication%1:10:00::").getRelation(SRelationType.hyponym) //Hyponyms etc
SWordNet is available on GitHub and Sonatype
You can also use the WordNet command-line tool. The command "wn book -hypen" gets the hypernyms of the noun book. For meronyms, use the command "wn book -meron".
The -o option also prints the synset offset.
Here is the link for further information.

Resources