Tokenization not working the same for both cases - nlp

I have a document
doc = nlp('x-xxmessage-id:')
When I extract the tokens of this one I get 'x', 'xx', 'message', 'id' and ':'. Everything goes well.
Then I create a new document
test_doc = nlp('id')
When I try to extract the tokens of test_doc, I get 'i' and 'd'. Is there any way to get past this problem? I want to get the same token as above, and this is creating problems in the text processing.

Just like language itself, tokenization is context-dependent and the language-specific data defines rules that tell spaCy how to split the text based on the surrounding characters. spaCy's defaults are also optimised for general-purpose text, like news text, web texts and other modern writing.
In your example, you've come across an interesting case: the string "x-xxmessage-id:" is split on punctuation, while the isolated lowercase string "id" is split into "i" and "d", because in written text, it's most commonly an alternate spelling of "I'd" or "i'd" ("I would", "I had" etc.). You can find the respective rules here.
If you're dealing with specific texts that are substantially different from regular natural language texts, you usually want to customise the tokenization rules or possibly even add a Language subclass for your own custom "dialect". If there's a fixed number of cases you want to tokenize differently that can be expressed by rules, another option would be to add a component to your pipeline that merges the split tokens back together.
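For example, if "id" should always survive as a single token, one special-case rule on the tokenizer is usually enough. A minimal sketch, assuming an English pipeline such as en_core_web_sm is installed and the spaCy v2-style API:

import spacy
from spacy.symbols import ORTH

nlp = spacy.load('en_core_web_sm')

# Override the default exception that splits the isolated string 'id'
nlp.tokenizer.add_special_case('id', [{ORTH: 'id'}])

print([t.text for t in nlp('id')])  # ['id']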
Finally, you could also try using the language-independent xx / MultiLanguage class instead. It still includes very basic tokenization rules, like splitting on punctuation, but none of the rules specific to the English language.
from spacy.lang.xx import MultiLanguage
nlp = MultiLanguage()
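A quick check on the two strings from the question (exact splits may vary by spaCy version, so treat the output as indicative):

doc = nlp('x-xxmessage-id:')
print([t.text for t in doc])       # only the basic punctuation splits apply

test_doc = nlp('id')
print([t.text for t in test_doc])  # 'id' stays whole: the English "i'd" exception is gone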

Related

Filtering specific words from a string based on a words position in a text

I have several part-of-speech rules, and they are triggered only if the text being looked at matches the rule. However, I'm curious whether there is a way to remove "any" word that appears within a phrase that would otherwise trigger the rule. I tried using stop words, but that strips the text too much, to the point where the rule becomes nonsensical. Here's an example.
Text: I want to attack this player's base.
attack_rule = [
    ('nn', 'i'),
    ('vbp', 'want'),
    ('to', 'to'),
    ('vb', ('exterminate', 'waste', 'attack', 'shoot'))
]
The text will trigger this rule, however if the text is written as such:
Text2: I f***ing want to attack this player's base.
Text2: I want to f***ing attack this player's base.
The rule won't trigger. So I'm wondering if there is a way to filter expletives/fillers from text that would otherwise trigger a rule, ideally by position.
I'm currently using nltk's POS tagger. There may be a way to make sure a word doesn't have a contextual effect on the sentence (like a superlative), but that seems much harder; alternatively, just remove a word if it appears between text that would otherwise trigger a rule.
I tried using stop words, but like I said, that filtered far too much, especially when the object of the sentence was one of the most important parts.
He will attack all of them <- Stop words present
he attack <- Filtered stop words
What does the logic that checks whether the POS tags from the sentence match your pattern look like? It feels like you could just keep ignoring a certain number of words with tags that don't match until the end of the sentence, and have a match if you found all the words with the tags you wanted (in the correct order) by the end. You could also enforce a maximum number of consecutive words with bad tags in a row.
Alternatively, you could ignore only words with a few kinds of tags, like adverbs or adjectives. A rough sketch of this idea follows.
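Here's a minimal sketch of that scanning loop. rule_matches and max_gap are made-up names, and the (word, tag) / (tag, word) conventions follow the question rather than any library API:

attack_rule = [('nn', 'i'), ('vbp', 'want'), ('to', 'to'),
               ('vb', ('exterminate', 'waste', 'attack', 'shoot'))]

def rule_matches(tagged, rule, max_gap=2):
    """tagged: [(word, tag), ...] from your tagger; rule: [(tag, word-or-words), ...]"""
    i, gap = 0, 0
    for word, tag in tagged:
        wanted_tag, wanted = rule[i]
        words = wanted if isinstance(wanted, tuple) else (wanted,)
        if tag == wanted_tag and word.lower() in words:
            i, gap = i + 1, 0            # matched this step, move on to the next
            if i == len(rule):
                return True
        else:
            gap += 1
            if gap > max_gap:            # too many consecutive non-matching words
                i, gap = 0, 0            # give up on this partial match
    return False

# Hand-tagged input following the question's lowercase-tag convention:
tagged = [('i', 'nn'), ('want', 'vbp'), ('to', 'to'),
          ('fucking', 'rb'), ('attack', 'vb')]
print(rule_matches(tagged, attack_rule))  # True: the filler word is skipped

To ignore only a few kinds of tags (adverbs, adjectives), the else branch could additionally check the tag before counting the word towards the gap.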
You can use dependency parsing to remove modifiers, as illustrated in the following code:
import spacy

nlp = spacy.load('en_core_web_lg', disable=['ner'])
sentences = ["I fucking want to attack this player's base.",
             "I want to fucking attack this player's base."]
for s in sentences:
    doc = nlp(s)
    print(s)
    # keep only the tokens whose dependency label doesn't end in 'mod'
    # (amod, advmod, npadvmod, ...), i.e. drop the modifiers
    print("=>", " ".join(t.text for t in doc if not t.dep_.endswith('mod')))
# I fucking want to attack this player's base.
# => I want to attack this player 's base .
# I want to fucking attack this player's base.
# => I want to attack this player 's base .

Solr exact search with a hyphen

I am trying to search for a term in Solr where the Title contains only the string 1604-04, but the results come back with anything that contains 1604 or 04. What would the syntax be to force Solr to search on the exact string 1604-04?
You can also use the Classic Tokenizer. The Classic Tokenizer preserves the same behavior as the Standard Tokenizer with the following exception: words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.
This means that if someone searches for 1604-04, this tokenizer won't break the search string into two tokens.
If you want exact matches only, use a string field or a text field with a KeywordTokenizer as the tokenizer. These will keep your tokens intact as one single entry, and won't break it up into multiple tokens.
The difference is that if you use a text field with a KeywordTokenizer, you can still apply other filters, such as a LowercaseFilter, while a string field will store anything verbatim, with no further processing possible.
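For example, a field type along these lines keeps the whole title as one token while still lowercasing it. This is a sketch for schema.xml; the fieldType name is made up:

<fieldType name="string_exact" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>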
Your analyzer is splitting "1604-04" into two terms, "1604" and "04". You've received answers on how to change your analysis to stop doing that.
Changing your analysis may not be the best solution (I can't be entirely sure based on what you've written). Using a phrase query would be the usual way to do this. You can make a phrase query by wrapping it in quotes:
field:"1604-04"
This will still analyze and split it into two terms, but it will look for those terms in sequence. So, that query would match "1604-04" and "1604 04", but not "1604 some other stuff 04".

text mining for filtering search

I'm developing a question answering system in Java, in which I have created templates manually that are matched against the user's question.
The problem is that after preprocessing I have a list of keywords, and I want to match these keywords against the keywords in the stored templates to filter the search. Is there any algorithm for this?
Example question: what are the features of Java?
Keywords: features, java
Extract the templates containing the keywords 'features' and 'java'.
From what I understand of your question, you have some keywords, and some patterns containing those keywords, in your lexicon, plus others that come into your system from the user's question. You then need an algorithm to find these patterns in your system's input.
As far as I know, in Java, if you define a pattern with the Pattern class, you can do something like the following to achieve what you want (a simple example):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Compile a pattern and test whether the whole input matches it
Pattern pat = Pattern.compile("[A-Z]+");
Matcher matcher = pat.matcher("ABCD");
if (matcher.matches()) {
    System.out.println("it matches.");
}

Extracting Important words from a sentence using Node

I admit that I haven't searched extensively in the SO database. I tried reading the natural npm package, but it doesn't seem to provide this feature. I would like to know whether the requirement below is possible.
I have a database that has a list of all the cities of a country. I also have ratings of these cities (best place to live, worst place to live, best rated city, worst rated city, etc.). Now, from the user interface, I would like to let the user enter free text, and from there I should be able to search my database.
For e.g Best place to live in California
or places near California
or places in California
From the above sentence, I want to extract only the nouns (maybe), as this will be the name of the city or country that I can search for.
Then extract 'best', which means I can sort in a particular order, etc.
Any suggestions or directions to look at?
I risk the chance that the question will be marked as 'debatable', but the reason I posted is to get some direction to proceed.
[I came across this question whilst looking for some use cases to test a module I'm working on. Obviously the question is a little old, but since my module addresses the question I thought I might as well add some information here for future searchers.]
You should be able to do what you want with a POS chunker. I've recently released one for Node that is modelled on the chunkers provided by the NLTK (Python) and Stanford NLP (Java) libraries (the chunk() and TokensRegex() methods, respectively).
The module processes strings that already contain parts-of-speech, so first you'll need to run your text through a parts-of-speech tagger, such as pos:
var pos = require('pos');
var words = new pos.Lexer().lex('Best place to live in California');
var tags = new pos.Tagger()
    .tag(words)
    .map(function(tag){ return tag[0] + '/' + tag[1]; })
    .join(' ');
This will give you:
Best/JJS place/NN to/TO live/VB in/IN California/NNP ./.
Now you can use pos-chunker to find all proper nouns:
var chunker = require('pos-chunker');
var places = chunker.chunk(tags, '[{ tag: NNP }]');
This will give you:
Best/JJS place/NN to/TO live/VB in/IN {California/NNP} ./.
Similarly you could extract verbs to understand what people want to do ('live', 'swim', 'eat', etc.):
var verbs = chunker.chunk(tags, '[{ tag: VB }]');
Which would yield:
Best/JJS place/NN to/TO {live/VB} in/IN California/NNP ./.
You can also match words, sequences of words and tags, use lookahead, group sequences together to create chunks (and then match on those), and other such things.
You probably don't have to identify nouns at all. Since you already have a list of the city and country names your system can handle, you just have to check whether the user input contains one of those names.
Well, first you'll need to find a way to identify nouns. There is no core Node module or anything like that which can do this for you. You'd need to loop through all the words in the string and compare them against some kind of dictionary database, so you can look up each word and check whether it's a noun.
I found this API, which looks pretty promising. You query the API for a word and it sends you back a blob of data like this:
<?xml version="1.0" encoding="UTF-8"?>
<results>
  <result>
    <term>consistent, uniform</term>
    <definition>the same throughout in structure or composition</definition>
    <partofspeech>adj</partofspeech>
    <example>bituminous coal is often treated as a consistent and homogeneous product</example>
  </result>
</results>
You can see that it includes a partofspeech member which tells you that the word "consistent" is an adjective.
Another (and better) option if you have control over the text being stored is to use some kind of markup language to identify important parts of the string before you save it. Something like BBCode. I even found a BBCode node module that will help you do this.
Then you can save your strings to the database like this:
Best place to live in [city]California[/city] or places near [city]California[/city] or places in [city]California[/city].
or
My name is [first]Alex[/first] [last]Ford[/last].
If you're letting users type whole sentences of text and then trying to figure out which parts of those sentences are data you should use in your app, you're making things unnecessarily hard on yourself. You should either ask them to input the important pieces of data into their own text boxes, or give them a formatting language such as the aforementioned BBCode syntax so they can identify the important bits for you. Finding out which parts of a free-text string are important is going to be a huge job, I think.

Structured format for bilingual texts?

I want to format a parallel text so that words and sentences are aligned in two or more languages. Most of the structured text formats I found are XML based and are used by translation tools or Bible software. I want to find or create a format suitable for reading foreign language texts. The reader will have the ability to select words and see their equivalent in the source or target language.
I've thought about using multidimensional arrays with words aligned by index. But the issue is that there are many words and phrases which do not have a one-to-one mapping. So then I thought about using a relational database, such as SQLite. I could have a table for each language with each word indexed by id and join tables for the alignment. But then the question is how to represent punctuation, paragraph breaks, and other necessary formatting.
Are there other data structures or formats I have not thought of? Ideally it would be a flat-file, markup format to facilitate editing.
Presumably you have one or more text files, one in Language A, one in Language B, etcetera, the latter being a translation of the first. With that assumption:
You could mark up your plain-text file(s) with uniquely numbered tags around words, phrases and/or punctuation, e.g.: "Dear Sir, How are you today?" translated to German becomes: "Sehr geehrter Herr, wie geht es dir heute?":
<Language-English:<11:<4:<1:Dear> <2:Sir><3:,>> <10:<7:<5:How are> <6:you>> <8:today><9:?>>>>
<Language-Deutsch:<11:<4:<1:Sehr geehrter> <2:Herr><3:,>> <10:<7:<5:wie geht> <6:es>> <8:dir heute><9:?>>>>
My German is fairly rusty, so I may not have the tags quite correct, but they should still show what I have in mind.
As you can see, the entire sentence and its parts each have their own tags. When displaying the text, each <n: > pair would be stripped out, and could be replaced with an underline or some other form of highlighting to indicate the groups. Of course, there could be multiple underlines/highlights (this example would have up to four). When clicking on (and visually emphasizing) the highlight on the text in Language A, the corresponding highlight(s) in Language B (and other languages if present) would also be emphasized.
Naturally, it would most likely be the job of a human translator to do the markup as automating the actual translation and applying tags at that point is a non-trivial task.
However, a UI where elements in each language could be simultaneously highlighted then marked as being equivalent could facilitate the process of generating the marked-up file(s).
As to your other considerations (arrays and databases), they seem to be something of an over-complication. You would still have to somehow mark up your texts so that they could be loaded into those structures, since words or even phrases in one language don't necessarily have a 1:1 translation to their equivalents in another language, and usually can't easily be translated by machine. Once you have the markup, the choice of array/dictionary/database/other structure becomes a bit irrelevant, and only of concern to the UI programmer.
EDIT:
On further consideration, the tags may not be perfectly nested, and may be split, so you may need a <n: :n> tag pair, to allow partially overlapping and split tagged areas. E.g.:
<1:The:1> <2:black:2> <1:dog:1> <3:and <4:the dog:3>'s puppies:4>
has the fragments: "The dog", "black", "and the dog", and "the dog's puppies".
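To make the format concrete, here is a minimal sketch of reading it back in Python. parse_tagged and the span representation are my own invention, not part of any library; the function strips the markers and records, for each tag ID, the character ranges its fragments cover in the plain text:

import re

def parse_tagged(text):
    token_re = re.compile(r'<(\d+):|:(\d+)>')
    plain = []        # plain-text pieces collected so far
    open_at = {}      # tag id -> start offset of the currently open fragment
    spans = {}        # tag id -> list of (start, end) fragments in plain text
    pos = 0
    for m in token_re.finditer(text):
        plain.append(text[pos:m.start()])
        pos = m.end()
        offset = sum(len(p) for p in plain)
        if m.group(1):                       # an opener like <1:
            open_at[int(m.group(1))] = offset
        else:                                # a closer like :1>
            tag = int(m.group(2))
            spans.setdefault(tag, []).append((open_at.pop(tag), offset))
    plain.append(text[pos:])
    return ''.join(plain), spans

text, spans = parse_tagged("<1:The:1> <2:black:2> <1:dog:1>")
print(text)   # The black dog
print(spans)  # {1: [(0, 3), (10, 13)], 2: [(4, 9)]}

Because each fragment is closed by its own :n> marker, split and partially overlapping tags fall out of this naturally.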
Edit 2:
You could support multi-texts by just having the tag pair IDs unique across all languages:
EN: "The Knight's coat of arms was defaced"
DE: "Das Wappen des Ritters wurde verunstaltet"
FR: "Le blason du Chevalier a été abîmé"
<1:The Knight's:1> <2:coat of arms:2> <5:<3:was:3> <4:defaced:4>:5>.
<2:Das Wappen:2> <1:des Ritters:1> <3:wurde:3> <4:verunstaltet:4>.
<2:Le blason:2> <1:du Chevalier:1> <5:a été abîmé:5>.
As you can see tags 1, 2, 3 & 4 are applicable to English and German, and tags 1, 2 & 5 are applicable to English and French. The tags could quite easily be split and partially overlapping.
