I'm using Cloud Natural Language to analyze text from Google Speech, and it seems to be having trouble tokenizing contractions.
For example,
"I don't like you"
comes back as tokens whose content_text values are:
"I" "do" "n't" "like" "you"
Escaping the quote did not help; in that case it came back as
"I" "don" "\'t" "like" "you"
But I found that removing apostrophes did help: the text
I dont like you
came back with "dont" tagged as a verb (correct enough).
Is this the correct workaround for now?
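In the meantime, a minimal preprocessing sketch of that workaround (plain Python string handling, independent of the API client; the function name is mine):

import re

def strip_apostrophes(text):
    # Collapse contractions ("don't" -> "dont") before sending the
    # text to the Natural Language API, per the workaround above.
    # Also covers the typographic apostrophe U+2019.
    return re.sub(r"['\u2019]", "", text)

print(strip_apostrophes("I don't like you"))  # I dont like you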
Related
I'm doing some natural language processing with Arabic. Since I'm working with a couple of different NLP tools in tandem, I want to be able to give raw text to a StanfordCoreNLP pipeline but provide my own list of tokens rather than having it do the tokenization. Is there a way to do that?
The best thing to do is to merge your tokens with whitespace and then use the -tokenize.whitespace option.
So, for instance, if I had the raw text This is a sentence. and tokenized it into ("This", "is", "a", "sentence", "."), I would merge that back into the string "This is a sentence ." and use the tokenize.whitespace option, which will just split on whitespace.
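A minimal sketch of the round-trip (plain Python; the helper name is mine, and the property shown is the standard tokenize.whitespace option):

def to_whitespace_tokenized(tokens):
    # Re-join pre-computed tokens with single spaces so CoreNLP's
    # whitespace tokenizer reproduces them exactly.
    return " ".join(tokens)

tokens = ("This", "is", "a", "sentence", ".")
text = to_whitespace_tokenized(tokens)
print(text)  # This is a sentence .
# Feed `text` to the pipeline with tokenize.whitespace=true
# (-tokenize.whitespace on the command line).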
I have fields in my index with [Analyzer(<name>)] applied. This analyzer is of type CustomAnalyzer with tokenizer = Keyword. I assume it treats both the field value and the search text as one term each. E.g.
ClientName = My Test Client (in the index, kept as 1 term). Search term = My Test Client (kept as 1 term). Result = match.
But surprisingly, that's not the case unless I apply a phrase search (enclose the term in double quotes). Does anyone know why, and how to solve it? I'd rather treat the search term as a whole without having to enclose it.
Regards,
Sergei.
This is expected behavior. Query text is processed first by the query parser, and only individual query terms go through lexical analysis. When you issue a phrase query, the whole expression between quotes is treated as a phrase term and goes through lexical analysis as a single unit. You can find a complete explanation of this process here: How full text search works in Azure Search.
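A hedged sketch of the quoting workaround with the azure-search-documents Python SDK (the endpoint, index name, and key below are placeholders, not values from the question):

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient("https://<service>.search.windows.net",
                      "<index-name>", AzureKeyCredential("<api-key>"))

def as_phrase(term):
    # Escape embedded quotes, then wrap the whole term in double quotes
    # so the query parser hands it to the keyword analyzer as one term.
    return '"' + term.replace('"', '\\"') + '"'

results = client.search(search_text=as_phrase("My Test Client"))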
I need to extract pieces of text from a larger text. A very simple example actually, but it gives me quite some pain.
Here is the sample text; it is an email template:
{!Account.Name}
Hi hi there {!Account.Id + 'cool'}.
Very interesting stuff - {!Contact.Description}
Now we get {!Contact.Description + Contact.Email__c}
So I need all the occurrences of text like Account.Name, but only those within the opening "{!" and closing "}" tags.
What is the simplest approach to start with? Note that in the case of the last line, I need to get two occurrences: Contact.Description and Contact.Email__c.
Thanks a lot for any help!
I would just do a plain text search for {...} blocks and parse their content with a simple expression parser. Don't try to come up with a parser that consumes all the text and must be prepared to deal with any rubbish that can appear outside the blocks (which could ultimately lead to security problems).
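A minimal sketch of that approach (assuming the expressions only combine dotted field names and quoted literals with +, as in the sample):

import re

template = """{!Account.Name}
Hi hi there {!Account.Id + 'cool'}.
Very interesting stuff - {!Contact.Description}
Now we get {!Contact.Description + Contact.Email__c}"""

fields = []
for expr in re.findall(r'\{!(.*?)\}', template):
    # Drop quoted literals like 'cool', then pull dotted identifiers.
    expr = re.sub(r"'[^']*'", '', expr)
    fields.extend(re.findall(r'[A-Za-z_]\w*(?:\.\w+)+', expr))

print(fields)
# ['Account.Name', 'Account.Id', 'Contact.Description',
#  'Contact.Description', 'Contact.Email__c']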
I'm coding a Telegram bot for my clan. The bot should send a reply based on a few words in a text message. Suppose I type a text in the group containing the words "Thalia" and "love" and I want the bot to respond. The following works:
elif "thalia" in text.lower():
if "love" in text.lower():
reply("I love u too babe <3." "\nBut I love my maker even more ;).")
else:
reply("Say my name!")
(screenshot: a message containing "thalia" and "love" gets the bot's reply)
I coded it like this because when I use the "and" or "or" keywords, the statement doesn't work and the bot goes crazy. In the above, if I code elif "thalia" and "love" ..., it doesn't work.
If there is another way to code this, I would appreciate the tip!
Now I am trying the same technique on more words with "and" and "or", but it doesn't work. If I leave "and" and "or" out, it works fine. But of course then I can't use the combinations of words I want with this particular response:
elif "what" or "when" in text.lower():
if "time" or "do" in text.lower():
if "match" in text.lower():
reply ("If you need assistence with matches, type or press /matches")
This triggered the reply even without all three words appearing in one sentence.
How can I rewrite this code in a more "professional" way, and what do I need to change to get it to work? The bot should respond only when the combination of words is used, as in the thalia/love code above, instead of whenever "match" alone is used.
Python is much like natural language, but the interpreter cannot fill in what human listeners can: 'a and b in c' must be written out as 'a in c and b in c' (and likewise for or).
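A quick demonstration of why your original condition always fires (a non-empty string is truthy, so the or short-circuits before the membership test ever runs):

text = "no keywords here"
print("what" or "when" in text)           # prints 'what': the string itself, always truthy
print("what" in text or "when" in text)   # prints False: the intended test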
Before writing the if statements, you should lower-case text once, not repeatedly. Then turn it into a set of words, after removing punctuation and symbols, to avoid repeated linear searches of the lower-cased string. Here is an incomplete example for ASCII-only input.
d = str.maketrans('', '', '.,!') # 3rd arg is chars to delete
text = set(text.lower().translate(d).split())
Your 'matches' snippet can then be written as follows.
elif (("what" in text or "when" in text) and
("time" in text or "do" in text) and
"match" in text)
reply ("If you need assistence with matches, type or press /matches")
You could also use regular expression matching to do the same thing, but logic statements like the above are probably easier to start with.
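Putting the pieces together, a minimal runnable sketch (the reply and handler wiring are assumptions, not from your question; here respond just returns the string):

import string

def normalize(raw):
    # Lower-case once, strip punctuation, and build a set of words
    # so each membership test is O(1).
    table = str.maketrans('', '', string.punctuation)
    return set(raw.lower().translate(table).split())

def respond(raw):
    text = normalize(raw)
    if "thalia" in text and "love" in text:
        return "I love u too babe <3.\nBut I love my maker even more ;)."
    elif "thalia" in text:
        return "Say my name!"
    elif (("what" in text or "when" in text) and
          ("time" in text or "do" in text) and
          "match" in text):
        return "If you need assistance with matches, type or press /matches"
    return None

print(respond("Thalia, I love you!"))
print(respond("When do we play the match?"))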
I am new to the NLP domain, but my current research needs some text parsing (also called keyword extraction) from URL addresses, e.g. this fake URL:
http://ads.goole.com/appid/heads
Two constraints are put on my parsing:
The first "ads" and the last "heads" should be distinguished, because the "ads" inside "heads" is merely a suffix rather than an advertisement.
"appid" can be parsed into two parts, 'app' and 'id', both of which carry semantic meaning on the Internet.
I have tried the Stanford NLP toolkit and the Google search engine. The former tries to classify each word grammatically, which falls short of my expectations. The Google engine shows more smarts about "appid" and gives me suggestions for "app id".
I cannot inspect the search history that presumably lets Google suggest "app id" (many people have searched for these words). Are there offline methods to perform similar parsing?
UPDATE:
Please skip the regex suggestions, because there is a potentially unknown number of word compositions like "appid" in even simple URLs.
Thanks,
Jamin
Rather than tokenization, what it sounds like you really want to do is called word segmentation. It is, for example, a way to make sense of asentencethathasnospaces.
I haven't gone through this entire tutorial, but it should get you started. It even gives URLs as a potential use case.
http://jeremykun.com/2012/01/15/word-segmentation/
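To make the idea concrete, a toy dynamic-programming segmenter with a hand-made dictionary (the tutorial's frequency-based scoring is what you'd want in practice; this sketch just returns the first segmentation found):

from functools import lru_cache

def segment(s, words):
    # Memoized search: returns a segmentation of s[i:] into dictionary
    # words, or None if no split works.
    @lru_cache(maxsize=None)
    def go(i):
        if i == len(s):
            return []
        for j in range(i + 1, len(s) + 1):
            if s[i:j] in words:
                rest = go(j)
                if rest is not None:
                    return [s[i:j]] + rest
        return None
    return go(0)

print(segment("asentencethathasnospaces",
              {"a", "sentence", "that", "has", "no", "spaces"}))
# ['a', 'sentence', 'that', 'has', 'no', 'spaces']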
The Python wordsegment module can do this. It's an Apache2-licensed module for English word segmentation, written in pure Python and based on a trillion-word corpus.
It is based on code from the chapter “Natural Language Corpus Data” by Peter Norvig from the book “Beautiful Data” (Segaran and Hammerbacher, 2009).
Data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium. This module contains only a subset of that data. The unigram data includes only the most common 333,000 words. Similarly, bigram data includes only the most common 250,000 phrases. Every word and phrase is lowercased with punctuation removed.
Installation is easy with pip:
$ pip install wordsegment
Simply call segment to get a list of words:
>>> import wordsegment as ws
>>> ws.segment('http://ads.goole.com/appid/heads')
['http', 'ads', 'goole', 'com', 'appid', 'heads']
As you noticed, the old corpus doesn't rank "app id" very high. That's ok. We can easily teach it. Simply add it to the bigram_counts dictionary.
>>> ws.bigram_counts['app id'] = 10.2e6
>>> ws.segment('http://ads.goole.com/appid/heads')
['http', 'ads', 'goole', 'com', 'app', 'id', 'heads']
I chose the value 10.2e6 by doing a Google search for "app id" and noting the number of results.