Bag-of-words in CRF++

What is the syntax for a bag-of-words feature in a CRF++ template file?
Template example:
#Unigrams
U00:%x[0,0]
U01:%x[0,1]
U02:%x[1,0]
#Bigrams
B

I think it is this way:
#Unigrams
U00:%x[0,0]
U00:%x[0,1]
U00:%x[1,0]
#Bigrams
B
Using the same identifier.

The syntax of bag-of-words might be like this:
#Unigrams
U00:%x[0,0]/%x[0,1]/%x[1,0]
#Bigrams
B
The CRF++ documentation describes this with a CoNLL 2000 template for bag-of-words.

Here's the correct template for using a bag of (3) words:
#Unigrams
U00:%x[-1,0]
U00:%x[0,0]
U00:%x[1,0]
#Bigrams
B
Note that the identifiers are the same (U00).
[-1,0] -> previous word
[0,0] -> current word
[1,0] -> next word
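To see why reusing the identifier yields a bag of words: CRF++ turns each template line into one feature string per token, prefixed by the identifier. The toy sketch below (a hypothetical helper, not CRF++ internals) mimics that expansion. With a single U00 prefix, the feature "U00:the" looks the same whether the word appeared at position -1, 0, or +1, so position information is discarded; with distinct identifiers (U00/U01/U02), position is preserved.

```python
def expand(template_id, offsets, sentence, i):
    """Expand one template line for token i of a sentence.

    offsets: list of (row, col) pairs from the %x[row,col] macros.
    sentence: list of tuples, one per token (word, tag, ...).
    """
    parts = []
    for row, col in offsets:
        j = i + row
        if 0 <= j < len(sentence):
            parts.append(sentence[j][col])
        else:
            parts.append("_B%+d" % row)  # CRF++-style out-of-range marker
    return template_id + ":" + "/".join(parts)

sentence = [("He", "PRP"), ("reckons", "VBZ"), ("the", "DT")]
# Three separate U00 lines -> three features sharing one identifier (bag of words):
for row in (-1, 0, 1):
    print(expand("U00", [(row, 0)], sentence, 1))
# -> U00:He, U00:reckons, U00:the
```

A combined line such as U00:%x[0,0]/%x[0,1] instead produces a single concatenated feature ("U00:reckons/VBZ"), which is a conjunction, not a bag of words.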

Related

If-Then-Else in Ruta

Is there something like if-then-else available in Ruta? I'd like to do something like:
if there's at least one term from catA, then label the document with "one"
else if there's at least one term from catB, then label the document with "two"
else label the document with "three".
All the best
Philipp
There is no language structure for if-then-else in UIMA Ruta (2.7.0).
You need to duplicate some parts of the rule in order to model the else part, e.g., something like the following:
Document{CONTAINS(CatA) -> One};
Document{-CONTAINS(CatA), CONTAINS(CatB) -> Two};
Document{-CONTAINS(CatA), -CONTAINS(CatB) -> Three};
You could also check if the previous rule has matched and depend on that.
How the rules should actually look depends mainly on the type system and on how you want to model the information (features?).
DISCLAIMER: I am a developer of UIMA Ruta
I think you are asking about if-else-if in Ruta. This is possible using "ONLYFIRST":
PACKAGE uima.ruta.example;

DECLARE CatA, CatB, CatC;
"CatA" -> CatA;
"CatB" -> CatB;
"CatC" -> CatC;

DECLARE one, two, three;
ONLYFIRST Document{}{
    Document{CONTAINS(CatA) -> one};
    Document{CONTAINS(CatB) -> two};
    Document{CONTAINS(CatC) -> three};
}

How to extract relationships from a text

I am new to NLP and need guidance on how to solve this problem.
I am working on a filtering technique where I need to brand data in a database as either correct or incorrect. I am given a structured data set, with columns and rows.
However, the filtering conditions are given to me in a text file.
An example filtering text file could be the following:
Values in the column ID which are bigger than 99
Values in the column Cash which are smaller than 10000
Values in the column EndDate that are smaller than values in StartDate
Values in the column Name that contain numeric characters
Any value that meets those conditions should be branded as bad.
However, I want to extract those conditions and append them to the program that I've made so far.
For instance, for the conditions above, I would like to produce
`if ID>99`
`if Cash<10000`
`if EndDate < StartDate`
`if Name LIKE %[1-9]%`
How can I achieve the above result using the Stanford NLP? (or any other NLP library).
This doesn't look like a machine learning problem; it's a simple parser. You have a simple syntax, from which you can easily extract the salient features:
column name
relationship
target value or target column
The resulting "action rule" is simply removing the "syntactic sugar" words and converting the relationship -- and possibly the target value -- to its symbolic form.
Enumerate all of your critical words for each position in a lexicon. Then use basic string manipulation operators in your chosen implementation language to find the three needed fields.
EXAMPLE
Given the data above, your lexicons might be like this:
column_trigger = "Values in the column"
relation_dict = {
    "are bigger than" : ">",
    "are smaller than" : "<",
    "contain" : "LIKE",
    ...
}
value_desc = {
    "numeric characters" : "%[1-9]%",
    ...
}
From here, use these items in standard parsing. If you're not familiar with that, please look up the basics of a simple sentence grammar in your favourite programming language, with rules such as
SENTENCE => SUBJ VERB OBJ
Does that get you going?
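A minimal sketch of that parsing idea in Python (the lexicons and the function name are illustrative assumptions, not a fixed API):

```python
import re

# Illustrative lexicons covering the example filter sentences above.
RELATIONS = {
    "are bigger than": ">",
    "are smaller than": "<",
    "contain": "LIKE",
}
VALUE_DESC = {"numeric characters": "%[1-9]%"}

def parse_condition(sentence):
    # Split off the column name, then match a relation phrase and its target.
    m = re.match(r"Values in the column (\w+) (?:which|that) (.+)", sentence)
    if not m:
        raise ValueError("unrecognized sentence: " + sentence)
    column, rest = m.groups()
    for phrase, op in RELATIONS.items():
        if rest.startswith(phrase):
            target = rest[len(phrase):].strip()
            # "values in X" means the target is another column, not a literal.
            if target.startswith("values in "):
                target = target[len("values in "):]
            target = VALUE_DESC.get(target, target)  # map value descriptions
            return "if %s %s %s" % (column, op, target)
    raise ValueError("no relation found in: " + sentence)

print(parse_condition("Values in the column EndDate that are smaller than values in StartDate"))
# -> if EndDate < StartDate
```

Real filter files will need a richer grammar, but this shows how far plain lexicon lookup plus string manipulation gets you before any NLP library is required.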

Replace words marked with offsets

I have a sentence like that:
"My name is Bond. It's a fake name."
and I have to replace some words, given in a list with the offsets of each word:
name, 29-33; Bond, 11-15; name, 3-7
In addition, each word must be replaced with a specific word:
name -> noun
Bond -> proper
I have to obtain this output:
"My noun is proper. It's a fake noun."
I tried to manage the offsets with a post-offset variable that I update after each replacement, but that does not work because the list is unordered. Note that the find method is not valid due to repeated names. Is there an algorithm to do this? Is there any vectorized implementation (string, NumPy, NLTK) that computes it in one step?
Check this one:
string = "My name is Bond. It's a fake name."
words = string.split(" ")  # break the string into words
Now traverse the list and replace matching words (stripping trailing punctuation for the comparison):
for i, w in enumerate(words):
    if w.strip(".") == "name":
        words[i] = w.replace("name", "noun")
    elif w.strip(".") == "Bond":
        words[i] = w.replace("Bond", "proper")
Now the list values are changed; use " ".join(words) to turn the list back into a string.
For more, please refer to https://www.tutorialspoint.com/python/python_strings.htm, which covers string processing in Python.
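Since the offsets are given, a more robust option is to use them directly and apply the replacements from the highest offset down, so that earlier offsets stay valid after each substitution. A sketch (the function name and argument layout are my own, and I read "3-7" as an end-exclusive span):

```python
def replace_by_offsets(text, spans, mapping):
    """Replace each (word, start, end) span using the word -> replacement mapping.

    Sorting by start offset in descending order means edits near the end of
    the string never shift the offsets of spans we have not processed yet,
    so the input list may arrive in any order.
    """
    for word, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        if text[start:end] == word:  # sanity-check the offset before editing
            text = text[:start] + mapping[word] + text[end:]
    return text

text = "My name is Bond. It's a fake name."
spans = [("name", 29, 33), ("Bond", 11, 15), ("name", 3, 7)]
print(replace_by_offsets(text, spans, {"name": "noun", "Bond": "proper"}))
# -> My noun is proper. It's a fake noun.
```

Because it works on character offsets rather than word matching, repeated words and attached punctuation are no problem.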

How to Interpret NLTK Brill Tagger Rules

For the generated Brill Tagger Rule:
Rule('016', 'CS', 'QL', [(Word([1, 2, 3]),'as')])
I know:
'CS' is subordinating conjunction
'QL' is qualifier
I guess:
[(Word([1, 2, 3]),'as')] is the condition of the rule. It stands for the word 'as' appearing at the first, second, or third position before the target word, where the target word is the word that is going to be tagged with a POS tag.
I do not know:
What is the meaning for '016'?
How to interpret the rule as a whole?
The documentation for the rules is here.
016 would be the templateid, i.e., the template that was used to create the rule.
You can also get a description for the rule:
from nltk.tbl.rule import Rule
from nltk.tag.brill import Word

q = Rule('016', 'CS', 'QL', [(Word([1, 2, 3]), 'as')])
q.format('verbose')
'CS -> QL if the Word of words i+1...i+3 is "as"'
In this case it is actually the words that come after the target word. (Indicated by i+1...)
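To make the i+1...i+3 reading concrete, here is a toy re-implementation of just this one rule's condition (not NLTK internals; the example sentence and tags are made up): retag a CS token as QL when the word "as" occurs one, two, or three positions after it.

```python
def apply_rule(tagged, from_tag="CS", to_tag="QL", word="as", offsets=(1, 2, 3)):
    """Apply the rule to a list of (word, tag) pairs; returns a retagged copy."""
    out = list(tagged)
    for i, (w, t) in enumerate(tagged):
        # Fire only if the target tag matches AND "as" appears at i+1, i+2 or i+3.
        if t == from_tag and any(
            i + k < len(tagged) and tagged[i + k][0] == word for k in offsets
        ):
            out[i] = (w, to_tag)
    return out

print(apply_rule([("as", "CS"), ("long", "RB"), ("as", "CS")]))
# -> [('as', 'QL'), ('long', 'RB'), ('as', 'CS')]
```

The first "as" is retagged because another "as" sits two positions to its right; the second is not, since nothing follows it.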

Using the Tokenizer in openNLP

I am getting the POS tagged text in R in the form of:
id type start end features
1 word 1 5 POS=NNP
2 word 7 8 POS=IN
.....
I want to retrieve the word that was tagged: for example, instead of the column 'type' having all values as "word", retrieve the actual words. I can use scan_tokenizer, but a problem arises with forms like "isn't": the POS tagger breaks it into "is" and "not", which is great, but scan_tokenizer doesn't tokenize that way; it keeps it as "isn't". Can anyone please help me retrieve the words that R has tokenized and used for POS tagging?
Thanks
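Since the annotation table already carries character offsets, the surface tokens can be recovered by slicing the original text with the start/end columns, which reproduces exactly the tagger's own tokenization (including split contractions). A sketch in Python, assuming the spans are 1-based and inclusive as they appear to be in the table above; in R the equivalent slicing call would be substr(text, start, end):

```python
def words_from_spans(text, spans):
    # spans: (start, end) pairs, 1-based and inclusive as in the table above.
    # Convert to Python's 0-based, end-exclusive slicing with [s-1:e].
    return [text[s - 1 : e] for s, e in spans]

print(words_from_spans("Bond is here", [(1, 4), (6, 7)]))  # -> ['Bond', 'is']
```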
Why don't you use the Illinois POS tagger? It is easy to use, and you can visualize the output:
http://cogcomp.cs.illinois.edu/page/software_view/3
http://cogcomp.cs.illinois.edu/demo/pos/?id=4
