I am looking to do the opposite of what has been done here:
import re
text = '1234-5678-9101-1213 1415-1617-1819-hello'
re.sub(r"(\d{4}-){3}(?=\d{4})", "XXXX-XXXX-XXXX-", text)
output = 'XXXX-XXXX-XXXX-1213 1415-1617-1819-hello'
Partial replacement with re.sub()
My overall goal is to replace all XXXX within a text using a neural network. XXXX can represent names, places, numbers, dates, etc. that are in a .csv file.
The end result would look like:
XXXX went to XXXX XXXXXX
Sponge Bob went to Disney World.
In short, I am unmasking the text by replacing each mask with a value from a generated dataset, using fuzzy matching.
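For concreteness, here is a minimal sketch of that unmasking direction, assuming a hypothetical fake_values.csv with one replacement value per row (the file name, the X{4,} mask pattern and the random choice are illustrative assumptions, not part of the question):
import csv
import random
import re

# Load candidate replacement values from the (hypothetical) CSV file
with open("fake_values.csv", newline="") as f:
    values = [row[0] for row in csv.reader(f) if row]

masked = "XXXX went to XXXX XXXXXX"
# Replace every run of 4 or more X's with a randomly chosen value
unmasked = re.sub(r"X{4,}", lambda m: random.choice(values), masked)
print(unmasked)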
You can do it using named-entity recognition (NER). It's fairly simple, and there are off-the-shelf tools to do it, such as spaCy.
NER is an NLP task where a neural network (or other method) is trained to detect certain entities, such as names, places, dates and organizations.
Example:
Sponge Bob went to South beach, he payed a ticket of $200!
I know, Michael is a good person, he goes to McDonalds, but donates to charity at St. Louis street.
spaCy returns the detected entities (names, places, organizations, and so on) for each sentence.
Just be aware that this is not 100% accurate!
Here is a little snippet for you to try out:
import spacy

phrases = ['Sponge Bob went to South beach, he payed a ticket of $200!', 'I know, Michael is a good person, he goes to McDonalds, but donates to charity at St. Louis street.']

nlp = spacy.load('en_core_web_sm')

for phrase in phrases:
    doc = nlp(phrase)
    replaced = ""
    for token in doc:
        # token.ent_type_ is non-empty when the token is part of a named entity
        if token.ent_type_:
            replaced += "XXXX "
        else:
            replaced += token.text + " "
    print(replaced)
Read more here: https://spacy.io/usage/linguistic-features#named-entities
Instead of replacing every entity with XXXX, you could also replace based on the entity type, like:
if token.ent_type_ == "PERSON":
    replaced += "<PERSON> "
Then:
import re, random

personames = ["Jack", "Mike", "Bob", "Dylan"]
# note: the function is re.sub, not re.replace
phrase = re.sub("<PERSON>", random.choice(personames), phrase)
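If the text contains several <PERSON> placeholders and each one should get a (possibly) different name, re.sub also accepts a function as the replacement, which is called once per match. A small sketch reusing the personames list above (the sample phrase is made up):
phrase = "<PERSON> met <PERSON> at the station."
phrase = re.sub("<PERSON>", lambda match: random.choice(personames), phrase)
print(phrase)  # e.g. "Mike met Dylan at the station."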
Related
How do I get the correct NER using spaCy from text like "F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired - The New York Times SectionsSEARCHSkip to contentSkip to site"?
Here "Criticized Trump" is recognized as a person instead of just "Trump".
How can I pre-process and lower-case words like "Criticized" or "Texts" in the above string to overcome this issue, or is there another technique to do so?
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()
from pprint import pprint
sent = ("F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired - The New York Times SectionsSEARCHSkip to contentSkip to site")
doc = nlp(sent)
pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])
Result from the above code:
"Criticized Trump" is tagged as 'PERSON' and "Texts" as 'GPE'.
Expected result:
"Trump" tagged as 'PERSON' (rather than "Criticized Trump"), and "Texts" with no entity type (rather than 'GPE').
You can add more examples of named entities to tune the NER model. Here you have all the information needed to prepare training data: https://spacy.io/usage/training. You can use Prodigy (the annotation tool from the spaCy creators, https://prodi.gy) to mark named entities in your data.
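As a rough sketch of what such an update could look like with the spaCy v2-style training loop (the example sentence, the character offsets and the number of iterations are illustrative assumptions; see the training docs linked above for the authoritative recipe):
import random
import spacy

nlp = spacy.load("en_core_web_sm")

# One made-up annotated example; entities are (start_char, end_char, label)
TRAIN_DATA = [
    ("Peter Strzok criticized Trump in texts.",
     {"entities": [(0, 12, "PERSON"), (24, 29, "PERSON")]}),
]

optimizer = nlp.resume_training()  # keep the existing weights
other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.disable_pipes(*other_pipes):  # only update the NER component
    for _ in range(20):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, losses=losses)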
Indeed, you can pre-process the text with POS tagging in order to lower-case words like "Criticized" or "Texts" which are not proper nouns.
Proper capitalization (lower vs. upper case) helps the NER tagger.
sent = "F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired - The New York Times SectionsSEARCHSkip to contentSkip to site"
doc = nlp(sent)
words = []
spaces = []
for a in doc:
if a.pos_ != 'PROPN':
words.append( a.text.lower() )
else:
words.append(a.text)
spaces.append(a.whitespace_)
spaces = [len(sp) for sp in spaces]
docNew = Doc(nlp.vocab, words=words, spaces=spaces)
print(docNew)
# F.B.I. Agent Peter Strzok, who criticized Trump in texts, is fired - the New York Times SectionsSEARCHSkip to contentskip to site
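You can then run the pipeline again on the re-cased text and inspect the entities; what the model predicts on the cleaned string still depends on the model and version, so treat this as a check rather than a guarantee:
docNew2 = nlp(docNew.text)
print([(ent.text, ent.label_) for ent in docNew2.ents])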
I'm testing the following sentence to extract entity values:
s = "Height: 3m, width: 4.0m, others: 3.4 m, 4m, 5 meters, 10 m. Quantity: 6."
sent = nlp(s)
for ent in sent.ents:
    print(ent.text, ent.label_)
And got some misleading values:
3 CARDINAL
4.0m CARDINAL
3.4 m CARDINAL
4m CARDINAL
5 meters QUANTITY
10 m QUANTITY
6 CARDINAL
Namely, in "3m" the number is not paired with its unit m. This is the case for many of my examples, so I can't rely on this engine when I want to separate measurements in meters from plain quantities.
Should I do this manually?
One potential difficulty in your example is that it's not very close to natural language. The pre-trained English models were trained on ~2m words of general web and news text, so they're not always going to perform perfectly out of the box on text with a very different structure.
While you could update the model with more examples of QUANTITY in your specific texts, I think that a rule-based approach might actually be a better and more efficient solution here.
The example in this blog post is actually very close to what you're trying to do:
import spacy
from spacy.pipeline import EntityRuler
nlp = spacy.load("en_core_web_sm")
weights_pattern = [
    {"LIKE_NUM": True},
    {"LOWER": {"IN": ["g", "kg", "grams", "kilograms", "lb", "lbs", "pounds"]}}
]
patterns = [{"label": "QUANTITY", "pattern": weights_pattern}]
ruler = EntityRuler(nlp, patterns=patterns)
nlp.add_pipe(ruler, before="ner")
doc = nlp("U.S. average was 2 lbs.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('U.S.', 'GPE'), ('2 lbs', 'QUANTITY')]
The statistical named entity recognizer respects pre-defined entities and will "predict around" them. So if you're adding the EntityRuler before it in the pipeline, your custom QUANTITY entities will be assigned first and will be taken into account when the entity recognizer predicts labels for the remaining tokens.
Note that this example is using the latest version of spaCy, v2.1.x. You might also want to add more patterns to cover different constructions. For more details and inspiration, check out the documentation on the EntityRuler, combining models and rules and the token match pattern syntax.
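Adapted to the sentence from the question, a sketch using the same spaCy v2.1-style API might look like this (the unit list, the attached-unit regex and the QUANTITY label are assumptions; depending on how the tokenizer splits forms like "4.0m" you may need to tweak the patterns):
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")

meters_patterns = [
    # number token followed by a unit token, e.g. "3.4 m", "5 meters"
    {"label": "QUANTITY", "pattern": [{"LIKE_NUM": True}, {"LOWER": {"IN": ["m", "meter", "meters"]}}]},
    # single token with the unit attached, e.g. "4.0m"
    {"label": "QUANTITY", "pattern": [{"TEXT": {"REGEX": r"^\d+(\.\d+)?m$"}}]},
]
ruler = EntityRuler(nlp, patterns=meters_patterns)
nlp.add_pipe(ruler, before="ner")

doc = nlp("Height: 3m, width: 4.0m, others: 3.4 m, 4m, 5 meters, 10 m. Quantity: 6.")
print([(ent.text, ent.label_) for ent in doc.ents])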
I have a sample text like:
"I'm travelling from Spain to India i.e on 23/09/2017 to 27/09/2017"
From this type of text I want to extract the 'from' and 'to' countries and the dates.
How can I approach this?
To install spaCy, follow these steps: https://spacy.io/docs/usage/
string = "I'm travelling from Spain to India i.e on 23/09/2017 to 27/09/2017"
import re
import spacy
nlp = spacy.load('en')
doc = nlp(string)
sentence = doc.text
for ent in doc.ents:
if ent.label_ == 'GPE':
print ent.text
Output
Spain
India
Reference
https://spacy.io/docs/usage/entity-recognition
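The question also asks for the dates. A small, hedged addition to the same script: collect DATE spans from the same doc, and fall back to a plain regex if the model does not tag the slash-formatted dates (whether it does depends on the model; the dd/mm/yyyy format is an assumption):
import re

dates = [ent.text for ent in doc.ents if ent.label_ == 'DATE']
if not dates:
    # fallback: match dd/mm/yyyy-style strings directly
    dates = re.findall(r'\d{2}/\d{2}/\d{4}', doc.text)
print(dates)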
I'm trying to take a sentence and extract the relationship between Person(PER) and Place(GPE).
Sentence: "John is from Ohio, Michael is from Florida and Rebecca is from Nashville which is in Tennessee."
For the final person, she has both a city and a state that could get extracted as her place. So far, I've tried using nltk to do this, but have only been able to extract her city and not her state.
What I've tried:
import re
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.sem.relextract import extract_rels, rtuple
sentence = "John is from Ohio, Michael is from Florida and Rebecca is from Nashville which is in Tennessee."
chunked = ne_chunk(pos_tag(word_tokenize(sentence)))
ISFROM = re.compile(r'.*\bfrom\b.*')
rels = extract_rels('PER', 'GPE', chunked, corpus = 'ace', pattern = ISFROM)
for rel in rels:
    print(rtuple(rel))
My output is:
[PER: 'John/NNP'] 'is/VBZ from/IN' [GPE: 'Ohio/NNP']
[PER: 'Michael/NNP'] 'is/VBZ from/IN' [GPE: 'Florida/NNP']
[PER: 'Rebecca/NNP'] 'is/VBZ from/IN' [GPE: 'Nashville/NNP']
The problem is Rebecca. How can I extract that both Nashville and Tennessee are part of her location? Or even just Tennessee alone?
It seems to me that you first have to extract the intra-location relationship (Nashville in Tennessee). Then ensure that you transitively assign all locations to Rebecca (if Rebecca is in Nashville and Nashville is in Tennessee, then Rebecca is in Nashville and Rebecca is in Tennessee).
That means one more relationship type plus some logic for the above inference (things get complicated pretty quickly, but that is hard to avoid).
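A rough sketch of that idea, reusing the chunked tree and ISFROM pattern from the question (the IN pattern and the reliance on NLTK's reldict keys such as 'subjtext'/'objtext' are assumptions tailored to this one sentence, not a general solution):
# person -> set of locations, from the 'is from' relation already extracted
person_locs = {}
for rel in extract_rels('PER', 'GPE', chunked, corpus='ace', pattern=ISFROM):
    person_locs.setdefault(rel['subjtext'], set()).add(rel['objtext'])

# location -> containing location, from GPE "... in ..." GPE pairs
IN = re.compile(r'.*\bin\b.*')
contained_in = {}
for rel in extract_rels('GPE', 'GPE', chunked, corpus='ace', pattern=IN):
    contained_in[rel['subjtext']] = rel['objtext']

# transitive step: if a person's location is itself inside another location,
# credit the person with the containing location too
for person, locs in person_locs.items():
    for loc in list(locs):
        if loc in contained_in:
            locs.add(contained_in[loc])
    print(person, locs)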
I am using NLTK to analyze a corpus that has been OCRed. I'm new to NLTK. Most of the OCR is good -- but sometimes I come across lines that are plainly junk. For instance: oomfi ow Ba wmnondmam BE wBwHo<oBoBm. Bowman as: Ham: 8 ooww om $5
I want to identify (and filter out) such lines from my analysis.
How do NLP practitioners handle this situation? Something like: if 70% of the words in the sentence are not in WordNet, discard. Or if NLTK can't identify the part of speech for 80% of the words, then discard? What algorithms work for this? Is there a "gold standard" way to do this?
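For concreteness, the "percentage of words known to WordNet" idea could look something like this rough sketch (the 30%-known threshold and the token filtering are arbitrary assumptions to be tuned):
from nltk.corpus import wordnet

def looks_like_junk(line, min_known=0.3):
    # keep only purely alphabetic tokens; OCR junk is often mixed with symbols
    words = [w for w in line.split() if w.isalpha()]
    if not words:
        return True
    known = sum(1 for w in words if wordnet.synsets(w.lower()))
    return known / len(words) < min_known

for line in ["oomfi ow Ba wmnondmam BE wBwHo<oBoBm. Bowman as: Ham: 8 ooww om $5",
             "This is a standard English sentence"]:
    print(looks_like_junk(line), line)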
Using n-grams is probably your best option. You can use google n-grams, or you can use n-grams built into nltk. The idea is to create a language model and see what probability any given sentence gets. You can define a probability threshold, and all sentences with scores below it are removed. Any reasonable language model will give a very low score for the example sentence.
If you think that some words may be only slightly corrupted, you may try spelling correction before testing with the n-grams.
EDIT: here is some sample nltk code for doing this:
from nltk import NgramModel  # available in older NLTK releases (2.x); removed in NLTK 3
from nltk.corpus import brown
from nltk.util import ngrams
from nltk.probability import LidstoneProbDist

n = 2
est = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(n, brown.words(categories='news'), estimator=est)

def sentenceprob(sentence):
    # lower-case before building the bigrams so the model sees normalized tokens
    sentence = sentence.lower()
    bigrams = ngrams(sentence.split(), n)
    tot = 0
    for grams in bigrams:
        score = lm.logprob(grams[-1], grams[:-1])
        tot += score
    return tot

sentence1 = "This is a standard English sentence"
sentence2 = "oomfi ow Ba wmnondmam BE wBwHo<oBoBm. Bowman as: Ham: 8 ooww om $5"

print(sentenceprob(sentence1))
print(sentenceprob(sentence2))
The results look like:
>>> python lmtest.py
42.7436688972
158.850086668
Lower is better. (Of course, you can play with the parameters).