Using the Tokenizer in openNLP - nlp

I am getting the POS tagged text in R in the form of:
id type start end features
1 word 1 5 POS=NNP
2 word 7 8 POS=IN
.....
I want to retrieve the words that were tagged; that is, instead of a 'type' column where every value is "word", I want the actual words. I can use scan_tokenizer, but the problem comes with forms like "isn't": the POS tagger breaks it into "is" and "not", which is great, but scan_tokenizer doesn't tokenize that way and keeps it as "isn't". Can anyone please help me retrieve the words that R has tokenized and used for POS tagging?
Thanks

Why don't you use Illinois POS tagger? It is easy to use and visualize:
http://cogcomp.cs.illinois.edu/page/software_view/3
http://cogcomp.cs.illinois.edu/demo/pos/?id=4
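If you stay with the annotation output shown in the question, the start and end columns look like character offsets into the original text, so the tagged words can be recovered by slicing. The idea is sketched below in Python rather than R, purely as an illustration. The input sentence is made up, and treating the offsets as 1-based and inclusive is an assumption based on the sample rows (1 to 5, then 7 to 8).

```python
# Hypothetical input text; the question does not show the original sentence.
text = "Paris in spring"

# Rows shaped like the annotation output in the question:
# (id, type, start, end, features)
annotations = [
    (1, "word", 1, 5, "POS=NNP"),
    (2, "word", 7, 8, "POS=IN"),
]


def covered_text(text, start, end):
    """Return the substring for 1-based, inclusive offsets (assumed convention)."""
    return text[start - 1:end]


# Pair each recovered word with its feature string.
tagged = [(covered_text(text, start, end), features)
          for _, _, start, end, features in annotations]
print(tagged)
```

Because the words come from the tagger's own offsets, they match its tokenization exactly, which a separate tokenizer such as scan_tokenizer cannot guarantee.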

Related

Need guidance with Regular Expression in Python

I need help with one of my current tasks, in which I am trying to pick only the table names out of a query using Python.
Let's say a query looks like this:
Create table a.dummy_table1
as
select a.dummycolumn1,a.dummycolumn2,a.dummycolumn3 from dual
Now I am passing this query into Python using StringIO and then reading only the strings that start with "a" and have "_" in them, like below:
table_list = set(re.findall(r'\ba\.\w+', str(data)))
Here data is the dataframe into which I have parsed the query using StringIO.
In table_list I am getting the output below:
a.dummy_table1
a.dummycolumn1
a.dummycolumn2
whereas the expected output is:
a.dummy_table1
Let me know how we can get this done. I have tried the regular expression above, but it is not working properly.
Any help would be highly appreciated.
Your current regex string r"\ba\.\w+" simply matches any string which:
Begins with "a" at a word boundary (the "\ba" part)
Followed by a literal period (the "\." part)
Followed by 1 or more word characters, i.e. letters, digits, or underscores (the "\w+" part).
If I've understood your problem correctly, you are looking to extract from str(data) any string fragments which match this pattern instead:
Begins with "a"
Followed by a period
Followed by 1 or more alphanumeric characters
Followed by an underscore
Followed by 1 or more alphanumeric characters
Thus, the regular expression should have "_\w+" added to the end to match criteria 4 and 5:
table_list = set(re.findall(r"\ba\.\w+_\w+", str(data)))
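A quick check of that pattern against the query from the question might look like the sketch below. The variable name query is only for illustration. The second pattern is a different technique, not the one above: it anchors on the "create table" keywords instead of relying on the underscore, under the assumption that every table of interest is introduced that way.

```python
import re

# Query text copied from the question.
query = """Create table a.dummy_table1
as
select a.dummycolumn1,a.dummycolumn2,a.dummycolumn3 from dual"""

# Pattern from the answer: schema "a", a period, then a name that
# contains at least one underscore.
table_list = set(re.findall(r"\ba\.\w+_\w+", query))
print(table_list)

# Alternative (assumption: tables always follow "create table"):
# capture the dotted name after the keywords, ignoring case.
created = re.findall(r"create\s+table\s+([\w.]+)", query, flags=re.IGNORECASE)
print(created)
```

The underscore-based pattern would also pick up any column that happens to contain an underscore, which is why anchoring on the keywords may be the safer choice if the naming convention is not guaranteed.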

Using REGEX to grab the information after the match

I ran a PDF through a series of processes to extract the text from it. I was successful in that regard. However, now I want to extract specific text from the documents.
The document is set up as a multi lined string (I believe. when I paste it into Word the paragraph character is at the end of each line):
Send Unit: COMPLETE
NOA Selection: 20-0429.07
Pni/Trk: 3 Panel / 3 Track
Panel Stack: STD
Width: 142.0000
The information I want to extract is the numbers following "NOA Selection:".
I know I can do a regex something to the effect of:
pattern = re.compile(r'NOA\sSelection:\s\d*-\d*\.\d*')
but I only want the numbers after the NOA selection, especially because NOA Selection will always be the same but the format of the numbers/letters/./-/etc. can vary pretty wildly. This looked promising but it is in Java and I haven't had much luck recreating it in Python.
I think I need to use (?<=...), but haven't been able to implement it.
Also, several of the examples show the string stored in the python file as a variable, but I'm trying to access it from a .txt file, so I might be going wrong there. This is what I have so far.
with open('export1.txt', 'r') as d:
    contents = d.read()
p = re.compile('(?<=NOA)')
s = re.search(p, contents)
print(s.group())
Thank you for any help you can provide.
With the samples you have shown, you could try the following. For the sample 20-0429.07 I have kept the .07 part optional in the regex, so that values like 20-0429 also match.
import re
val = """Send Unit: COMPLETE
NOA Selection: 20-0429.07"""
matches = re.findall(r'NOA\s+Selection:\s+(\d+-\d+(?:\.\d+)?)', val)
print(matches)
['20-0429.07']
Explanation: Adding detailed explanation(only for explanation purposes).
NOA\s+Selection:\s+ ##matching NOA spaces(1 or more occurrences) Selection: spaces(1 or more occurrences)
(\d+-\d+(?:\.\d+)?) ##Creating capturing group matching(1 or more occurrences) digits-digits(1 or more occurrences)
##and in a non-capturing group matching dot followed by digits keeping it optional.
Keeping it simple, you could use re.findall here:
inp = """Send Unit: COMPLETE
NOA Selection: 20-0429.07"""
matches = re.findall(r'\bNOA Selection: (\S+)', inp)
print(matches) # ['20-0429.07']
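Since the question asks about (?<=...) specifically, here is a minimal lookbehind sketch as well. A lookbehind on its own matches an empty string, so it needs a pattern for the value after it. The helper name noa_selection is made up for illustration, and the sample text stands in for the contents of export1.txt.

```python
import re

# Sample text copied from the question; in practice this would come from
# reading the .txt file, e.g. open('export1.txt').read().
contents = """Send Unit: COMPLETE
NOA Selection: 20-0429.07
Pni/Trk: 3 Panel / 3 Track
Panel Stack: STD
Width: 142.0000"""


def noa_selection(text):
    """Return the non-space run that follows 'NOA Selection: ', or None."""
    # (?<=...) is a lookbehind: the label must precede the match but is
    # not included in it. Python requires a fixed-width lookbehind, which
    # the literal label satisfies.
    match = re.search(r"(?<=NOA Selection: )\S+", text)
    return match.group() if match else None


print(noa_selection(contents))
```

Returning None when nothing matches avoids the AttributeError that s.group() raises when re.search finds no match. The fixed-width requirement means this assumes exactly one space after the colon; the capturing-group versions above are more forgiving.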

String matching keywords and key phrases in Python

I am trying to perform a smart dynamic lookup with strings in Python for a NLP-like task. I have a large amount of similar-structure sentences that I would like to parse through each, and tokenize parts of the sentence. For example, I first parse a string such as "bob goes to the grocery store".
I am taking this string in, splitting it into words and my goal is to look up matching words in a keyword list. Let's say I have a list of single keywords such as "store" and a list of keyword phrases such as "grocery store".
sample = 'bob goes to the grocery store'
keywords = ['store', 'restaurant', 'shop', 'office']
keyphrases = ['grocery store', 'computer store', 'coffee shop']
for word in sample.split():
    # do dynamic length lookups
Now the issue is this: sometimes my sentences might be simply "bob goes to the store" instead of "bob goes to the grocery store".
I want to find the keyword "store" for sure but if there are descriptive words such as "grocery" or "computer" before the word store I would like to capture that as well. That is why I have the keyphrases list as well. I am trying to figure out a way to basically capture a keyword at the very least then if there are words related to it that might be a possible "phrase" I want to capture those too.
Maybe an alternative is to have some sort of adjective list instead of a phrase list of multiple words?
How could I go about doing these sort of variable length lookups where I look at more than just a single word if one is captured, or is there an entirely different method I should be considering?
Here is how you can use a nested for loop and a formatted string:
sample = 'bob goes to the grocery store'
keywords = ['store', 'restaurant', 'shop', 'office']
keyphrases = ['grocery', 'computer', 'coffee']
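for kw in keywords:
    for kp in keyphrases:
        # Build the candidate phrase, e.g. "grocery store", and test for it
        if f"{kp} {kw}" in sample:
            # Do something with the match; printing is just a placeholder
            print(f"{kp} {kw}")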
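If you also want to fall back to the bare keyword when no descriptive word precedes it, one possible approach is to walk the tokens and look one word back. This is only a sketch: the function name find_matches is made up, it assumes phrases are exactly two words long, and it uses the keyphrases list from the question rather than the modifier list above.

```python
keywords = {"store", "restaurant", "shop", "office"}
keyphrases = {"grocery store", "computer store", "coffee shop"}


def find_matches(sentence, keywords, keyphrases):
    """Return the longest match for each keyword: the phrase if known, else the keyword."""
    tokens = sentence.split()
    found = []
    for i, token in enumerate(tokens):
        if token not in keywords:
            continue
        # Try the two-word phrase ending at this keyword first.
        phrase = f"{tokens[i - 1]} {token}" if i > 0 else None
        found.append(phrase if phrase in keyphrases else token)
    return found


print(find_matches("bob goes to the grocery store", keywords, keyphrases))
print(find_matches("bob goes to the store", keywords, keyphrases))
```

Using sets keeps each lookup constant-time, which matters with a large number of sentences. Longer phrases would need the look-back to cover more than one preceding token.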

replace 2 or 3 words in one sentence in one cell from another words in another cell on Excel

I was looking for a solution and I found it here
replacing many words every one with alternative word
But now I'm using an alternative code that I got from the link below that post, which is case-sensitive.
Function SubstituteMultipleCS(text As String, old_text As Range, new_text As Range)
    Dim i As Single
    For i = 1 To old_text.Cells.Count
        Result = Replace(text, old_text.Cells(i), new_text.Cells(i))
        text = Result
    Next i
    SubstituteMultipleCS = Result
End Function
I'm using it to make German Anki cards so I need to replace some words with ___. It's working with one single word or a bunch of words if they are together, but...
The problem is the following:
Some verb conjugations use a sentence structure in which the main verb comes after the noun and the particle, which belongs to the verb, goes at the end. Something like this:
As you can see in the picture, the verb "schaute an" is not replaced by the new word because "schaute" is separated from "an" in the original sentence.
Is there any way to fix this?
thank you.
Here is a formula you may use (which works for your current sample data):
Formula in C2:
=IFERROR(TRIM(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(" "&SUBSTITUTE(B2,"."," ")&" "," "&FILTERXML("<t><s>"&SUBSTITUTE(A2," ","</s><s>")&"</s></t>","//s[position() = 1]")&" ",D2,1),IFERROR(" "&FILTERXML("<t><s>"&SUBSTITUTE(A2," ","</s><s>")&"</s></t>","//s[position() = 2]")&" ",""),D2,1),IFERROR(" "&FILTERXML("<t><s>"&SUBSTITUTE(A2," ","</s><s>")&"</s></t>","//s[position() = 3]")&" ",""),D2,1))&".","")
The advantage of nested substitutes is that we can tell the function to replace only the first occurrence, in case a word occurs more than once in the sentence. Not sure if it's watertight.
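The logic of the formula, replacing the first occurrence of each word of the verb separately, can be sketched outside Excel too. The Python below is only an illustration of that idea, not a translation of the formula. The function name blank_out and the example sentence are made up, since the sample data from the picture is not available.

```python
import re


def blank_out(sentence, phrase, placeholder="___"):
    """Replace the first whole-word occurrence of each word in phrase."""
    for word in phrase.split():
        # \b keeps "an" from matching inside longer words such as "dann";
        # count=1 limits the replacement to the first occurrence.
        sentence = re.sub(rf"\b{re.escape(word)}\b", placeholder, sentence, count=1)
    return sentence


# Hypothetical sentence with a separable verb ("schaute ... an").
print(blank_out("Er schaute den Film an.", "schaute an"))
```

Like the formula, this is not watertight: if the particle also appears earlier in the sentence as a separate word, for example as a preposition, the first occurrence is replaced rather than the one belonging to the verb.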

Bag-of-words in CRF++

What is the syntax for a bag-of-words feature in CRF++ template file?
Template example:
#Unigrams
U00:%x[0,0]
U01:%x[0,1]
U02:%x[1,0]
#Bigrams
B
I think it is this way:
#Unigrams
U00:%x[0,0]
U00:%x[0,1]
U00:%x[1,0]
#Bigrams
B
Using the same identifier.
The syntax of bag-of-words might be like this:
#Unigrams
U00:%x[0,0]/%x[0,1]/%x[1,0]
#Bigrams
B
See the description of CRF++ using a CoNLL 2000 template for bag-of-words.
Here's the correct template for using a bag of (3) words:
#Unigrams
U00:%x[-1,0]
U00:%x[0,0]
U00:%x[1,0]
#Bigrams
B
Note that the identifiers are the same (U00).
[-1,0] -> previous word
[0,0] -> current word
[1,0] -> next word
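To see why sharing the identifier gives a bag of words, it may help to expand the templates by hand. The sketch below is a simplified model of template expansion, not CRF++'s actual implementation: it assumes each template line is an identifier, a colon, and a single %x[row,col] macro, and that the resulting feature is the string "identifier:value". The function name expand and the toy sentences are made up.

```python
def expand(templates, rows, pos):
    """Expand 'ID:%x[row,col]' templates into feature strings at token pos."""
    features = []
    for template in templates:
        ident, macro = template.split(":", 1)
        # Strip the leading '%x[' and trailing ']' to get "row,col".
        row, col = (int(v) for v in macro[3:-1].split(","))
        features.append(f"{ident}:{rows[pos + row][col]}")
    return features


bag = ["U00:%x[-1,0]", "U00:%x[0,0]", "U00:%x[1,0]"]
positional = ["U00:%x[-1,0]", "U01:%x[0,0]", "U02:%x[1,0]"]

# Two toy sentences with the neighbours of "cat" swapped; one column per token.
a = [["the"], ["cat"], ["sat"]]
b = [["sat"], ["cat"], ["the"]]

# With a shared identifier the position is lost: both orders give the same set.
print(set(expand(bag, a, 1)) == set(expand(bag, b, 1)))
# With distinct identifiers the two orders produce different features.
print(set(expand(positional, a, 1)) == set(expand(positional, b, 1)))
```

In this model a shared identifier makes "the" before the current word and "the" after it produce the same feature string, which is exactly the order-insensitive behaviour a bag of words asks for.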
