I am using spaCy to analyze terrorism articles, and it is strange that spaCy cannot find organizations such as Fatah. The code is below:
import spacy
from collections import defaultdict, Counter

nlp = spacy.load('en')

def read_file_to_list(file_name):
    with open(file_name, 'r') as file:
        return file.readlines()

terrorism_articles = read_file_to_list('data/rand-terrorism-dataset.txt')
terrorism_articles_nlp = [nlp(art) for art in terrorism_articles]

common_terrorist_groups = [
    'taliban',
    'al - qaeda',
    'hamas',
    'fatah',
    'plo',
    'bilad al - rafidayn'
]

common_locations = [
    'iraq',
    'baghdad',
    'kirkuk',
    'mosul',
    'afghanistan',
    'kabul',
    'basra',
    'palestine',
    'gaza',
    'israel',
    'istanbul',
    'beirut',
    'pakistan'
]

location_entity_dict = defaultdict(Counter)

for article in terrorism_articles_nlp:
    article_terrorist_groups = [ent.lemma_ for ent in article.ents if ent.label_ == 'PERSON' or ent.label_ == 'ORG']  # person or organization
    article_locations = [ent.lemma_ for ent in article.ents if ent.label_ == 'GPE']
    terrorist_common = [ent for ent in article_terrorist_groups if ent in common_terrorist_groups]
    locations_common = [ent for ent in article_locations if ent in common_locations]

    for found_entity in terrorist_common:
        for found_location in locations_common:
            location_entity_dict[found_entity][found_location] += 1

location_entity_dict
I simply get nothing from the file.
Here is the text data link.
Thank you!
I reproduced your example, and it looks like you get empty lists for article_terrorist_groups and terrorist_common, so you won't get the output I assume you expect. I changed the model (on my machine) to en_core_web_sm and observed that the entity labels are different from the ones you are checking for in the if statements inside your list comprehensions. I am almost certain this is the case whether you use spacy.load('en') or spacy.load('en_core_web_sm').
You are using if ent.label_ == 'PERSON' or ent.label_ == 'ORG', which leads to empty lists. You would need to change this for it to work: in your list comprehensions for article_terrorist_groups and terrorist_common, the for loop ends up iterating over an empty list.
If you look at the output that I posted, you will see that ent.label is not 'PERSON' or 'ORG'.
Note: I would recommend adding print statements (or using a debugger) in your code to check from time to time.
My Code
import spacy
from collections import defaultdict, Counter

nlp = spacy.load('en_core_web_sm')  # I changed this

def read_file_to_list(file_name):
    with open(file_name, 'r') as file:
        return file.readlines()

terrorism_articles = read_file_to_list('rand-terrorism-dataset.txt')
terrorism_articles_nlp = [nlp(art) for art in terrorism_articles]

common_terrorist_groups = [
    'taliban',
    'al - qaeda',
    'hamas',
    'fatah',
    'plo',
    'bilad al - rafidayn'
]

common_locations = [
    'iraq',
    'baghdad',
    'kirkuk',
    'mosul',
    'afghanistan',
    'kabul',
    'basra',
    'palestine',
    'gaza',
    'israel',
    'istanbul',
    'beirut',
    'pakistan'
]

location_entity_dict = defaultdict(Counter)

for article in terrorism_articles_nlp:
    print([(ent.lemma_, ent.label) for ent in article.ents])
Output
[('CHILE', 383), ('the Santiago Binational Center', 383), ('21,000', 394)]
[('ISRAEL', 384), ('palestinian', 381), ('five', 397), ('Masada', 384)]
[('GUATEMALA', 383), ('U.S. Marines', 381), ('Guatemala City', 384)]
(output truncated in the interest of keeping this answer short)
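Side note: the numbers in that output appear because ent.label is the integer hash of the label, while ent.label_ gives the readable string. If you want a quick sanity check of the string labels the model actually assigns (assuming the same en_core_web_sm model and the variables above), you can print both:

# Print the entity text together with the hash and the string form of its label.
for article in terrorism_articles_nlp:
    print([(ent.text, ent.label, ent.label_) for ent in article.ents])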
The groups and locations in common_terrorist_groups and common_locations are lowercase, while the entity texts collected into terrorist_common and locations_common keep their original casing. So just change if ent in common_terrorist_groups to if ent.lower() in common_terrorist_groups (and do the same for the locations).
common_terrorist_groups = [
    'taliban',
    'al - qaeda',
    'hamas',
    'fatah',
    'plo',
    'bilad al - rafidayn'
]

common_locations = [
    'iraq',
    'baghdad',
    'kirkuk',
    'mosul',
    'afghanistan',
    'kabul',
    'basra',
    'palestine',
    'gaza',
    'israel',
    'istanbul',
    'beirut',
    'pakistan'
]

location_entity_dict = defaultdict(Counter)

for article in terrorism_articles_nlp:
    article_terrorist_cands = [ent.lemma_ for ent in article.ents if ent.label_ == 'PERSON' or ent.label_ == 'ORG']
    article_location_cands = [ent.lemma_ for ent in article.ents if ent.label_ == 'GPE']
    terrorist_candidates = [ent for ent in article_terrorist_cands if ent.lower() in common_terrorist_groups]
    location_candidates = [loc for loc in article_location_cands if loc.lower() in common_locations]

    for found_entity in terrorist_candidates:
        for found_location in location_candidates:
            location_entity_dict[found_entity][found_location] += 1
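If you want to inspect the resulting counts as a table afterwards (optional, and assuming pandas is installed), something like this works:

import pandas as pd

# Rows are the matched groups, columns the locations, values the co-occurrence counts.
location_entity_df = pd.DataFrame.from_dict(location_entity_dict, orient='index')
location_entity_df = location_entity_df.fillna(0).astype(int)
print(location_entity_df)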
Related
I have a dict that looks like this:
TRAIN_DATA = {'here is some text': [('1', '4', 'entity_label')], 'here is more text': [('2', '7', 'entity_label_2')], 'and even more text': [('1', '4', 'entity_label')]}
I'm trying to convert this to the format required for spaCy's NER model, using the following:
import pandas as pd
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # load a new spacy model
db = DocBin()  # create a DocBin object

for text, annot in TRAIN_DATA:  # data in previous format
    doc = nlp.make_doc(text)  # create doc object from text
    ents = []
    for start, end, label in annot:  # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents  # label the text with the ents
    db.add(doc)

db.to_disk("train.spacy")  # save the docbin object
It yields ValueError: not enough values to unpack (expected 3, got 2)
When I try something slightly different:
nlp = spacy.blank("en")  # load a new spacy model
db = DocBin()  # create a DocBin object

for body, [(entities)] in TRAIN_DATA.items():
    doc = nlp(body)
    ents = []
    for start, end, label in entities:
        span = doc.char_span(int(start), int(end), label=label, alignment_mode='contract')
        ents.append(span)
    doc.ents = ents
    db.add(doc)

db.to_disk("train.spacy")
It yields the same error. When I remove the tuple and list notation (i.e. for body, entities... vs for body, [(entities)]), I get expected 2, got 3 instead of expected 3, got 2...
I've tried troubleshooting by unpacking the tuple manually (i.e. for i in entities.split(", "): print(i)), and that seems to find all the values in the tuple, so I'm not sure what I'm doing wrong.
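For what it's worth, here is a minimal sketch of the conversion I would expect to work, assuming the dict values are lists of (start, end, label) tuples and the string offsets are meant to be integer character positions (both of those are assumptions about the data):

import spacy
from spacy.tokens import DocBin

TRAIN_DATA = {
    'here is some text': [('1', '4', 'entity_label')],
    'here is more text': [('2', '7', 'entity_label_2')],
}

nlp = spacy.blank("en")
db = DocBin()

for text, annots in TRAIN_DATA.items():      # .items() yields (key, value) pairs
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annots:         # each annotation is one (start, end, label) tuple
        span = doc.char_span(int(start), int(end), label=label, alignment_mode="contract")
        if span is not None:                 # char_span returns None if no tokens fit the range
            ents.append(span)
    doc.ents = ents
    db.add(doc)

db.to_disk("train.spacy")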
I am very new to ML and also Spacy in general. I am trying to show Named Entities from an input text.
This is my method:
def run():
    nlp = spacy.load('en_core_web_sm')
    sentence = "Hi my name is Oliver!"
    doc = nlp(sentence)

    # Threshold for the confidence scores.
    threshold = 0.2
    beams = nlp.entity.beam_parse(
        [doc], beam_width=16, beam_density=0.0001)

    entity_scores = defaultdict(float)
    for beam in beams:
        for score, ents in nlp.entity.moves.get_beam_parses(beam):
            for start, end, label in ents:
                entity_scores[(start, end, label)] += score

    # Create a dict to store output.
    ners = defaultdict(list)
    ners['text'] = str(sentence)

    for key in entity_scores:
        start, end, label = key
        score = entity_scores[key]
        if score > threshold:
            ners['extractions'].append({
                "label": str(label),
                "text": str(doc[start:end]),
                "confidence": round(score, 2)
            })

    pprint(ners)
The above method works fine, and will print something like:
'extractions': [{'confidence': 1.0,
'label': 'PERSON',
'text': 'Oliver'}],
'text': 'Hi my name is Oliver'})
So far so good. Now I am trying to get the actual position of the found named entity. In this case "Oliver".
Looking at the documentation, there are ent.start_char and ent.end_char available, but if I use them:
"start_position": doc.start_char,
"end_position": doc.end_char
I get the following error:
AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'start_char'
Can someone guide me in the right direction?
If someone has come here wanting a simple answer to the question, I believe the following should do it:
nlp = spacy.load('en_core_web_sm')
sentence = "Hi my name is Oliver!"
doc = nlp(sentence)

for ent in doc.ents:
    print(f"Entity {ent} found with start at {ent.start_char} and end at {ent.end_char}")
So I actually found an answer right after posting this question (typical).
I found that I didn't need to save the information into entity_scores, but could instead just iterate over the actual found entities:
I ended up adding for ent in doc.ents: instead, and this gives me access to all the standard spaCy attributes. See below:
ners = defaultdict(list)
ners['text'] = str(sentence)

for beam in beams:
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for ent in doc.ents:
            if score > threshold:
                ners['extractions'].append({
                    "label": str(ent.label_),
                    "text": str(ent.text),
                    "confidence": round(score, 2),
                    "start_position": ent.start_char,
                    "end_position": ent.end_char
                })
My entire method ends up looking like this:
def run():
    nlp = spacy.load('en_core_web_sm')
    sentence = "Hi my name is Oliver!"
    doc = nlp(sentence)

    threshold = 0.2
    beams = nlp.entity.beam_parse(
        [doc], beam_width=16, beam_density=0.0001)

    ners = defaultdict(list)
    ners['text'] = str(sentence)

    for beam in beams:
        for score, ents in nlp.entity.moves.get_beam_parses(beam):
            for ent in doc.ents:
                if score > threshold:
                    ners['extractions'].append({
                        "label": str(ent.label_),
                        "text": str(ent.text),
                        "confidence": round(score, 2),
                        "start_position": ent.start_char,
                        "end_position": ent.end_char
                    })
I was trying to train a custom NER model in spaCy. Initially I had installed the latest spaCy version, but I was getting the following error during training:
ValueError: [E103] Trying to set conflicting doc.ents: A token can only be part of one entity, so make sure the entities you're setting don't overlap.
After that I installed spacy==2.0.11 and tried running my code. When I have around 10 rows of data to train, the model works fine and saves to my output directory. But with more data (5K rows), which is the original training data, my Jupyter kernel dies, or when I run it in Spyder the console just exits!!
I understand that the older spaCy version does not throw the ValueError, but it is still of no use, as I am unable to train my model.
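For context, E103 is raised when two of the entity spans you assign overlap (for example, matches for both 'blister' and 'blister pack' over the same tokens). Below is a minimal, hypothetical sketch of dropping overlapping PhraseMatcher matches with spacy.util.filter_spans before building annotations; note that filter_spans is available from spaCy 2.1 onwards, so not in 2.0.11:

import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

nlp = spacy.load('en_core_web_sm')
matcher = PhraseMatcher(nlp.vocab)
matcher.add('PRODUCT', None, nlp('blister'), nlp('blister pack'))  # hypothetical overlapping patterns

doc = nlp("the blister pack was returned")
spans = [doc[start:end] for match_id, start, end in matcher(doc)]
spans = filter_spans(spans)  # keeps the longest non-overlapping spans, which avoids E103
entities = [(span.start_char, span.end_char, 'PRODUCT') for span in spans]
print(entities)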
Sample data:
CarryBag 09038820815c.txt
Stopperneedle 0903882080f4.txt
Foilbags 09038820819.txt
I have around 700 files like this with data to be tagged and in each file multiple entities need tagging.
Code for reference:
import spacy
# import en_core_web_sm
import re
import csv
from spacy.matcher import PhraseMatcher
import plac
from pathlib import Path
import random
#Function to convert PhraseMatcher return value to string indexes
def str_index_conversion(lbl, doc, matchitem):
o_one = len(str(doc[0:matchitem[1]]))
subdoc = doc[matchitem[1]:matchitem[2]]
o_two = o_one + len(str(subdoc))
return (o_one, o_two, lbl)
# nlp = spacy.load('en')
nlp = spacy.load('en_core_web_sm')
if 'ner' not in nlp.pipe_names:
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
else:
ner = nlp.get_pipe('ner')
ner.add_label('PRODUCT')
DIR = 'D:/Docs/'
matcher = PhraseMatcher(nlp.vocab)
list_str_index = []
to_train_ents = []
with open(r'D:\ner_dummy_pack.csv', newline='', encoding ='utf-8') as myFile:
reader = csv.reader(myFile)
for row in reader:
try:
product = row[0].lower()
#print('K---'+ product)
filename = row[1]
file = open(DIR+filename, "r", encoding ='utf-8')
print(file)
filecontents = file.read()
for s in filecontents:
filecontents = re.sub(r'\s+', ' ', filecontents)
filecontents = re.sub(r'^https?:\/\/.*[\r\n]*', '', filecontents, flags=re.MULTILINE)
filecontents = re.sub(r"http\S+", "", filecontents)
filecontents = re.sub(r"[-\"#/#;:<>?{}*`• ?+=~|$.!‘?“”?,_]", " ", filecontents)
filecontents = re.sub(r'\d+', '', filecontents)#removing all numbers
filecontents = re.sub(' +', ' ',filecontents)
#filecontents = filecontents.encode().decode('unicode-escape')
filecontents = ''.join([line.lower() for line in filecontents])
if "," in product:
product_patterns = product.split(',')
product_patterns = [i.strip() for i in product_patterns]
for elem in product_patterns:
matcher.add('PRODUCT', None, nlp(elem))
else:
matcher.add('PRODUCT', None, nlp(product))
print(filecontents)
doc = nlp(filecontents)
matches = matcher(doc)
#print(matches)
list_str_index = [str_index_conversion('PRODUCT', doc, x) for x in matches]
to_train_ents.append((filecontents, dict(entities=list_str_index)))
break
except Exception as e:
print(e)
pass
to_train_entsfinal=to_train_ents
def main(model=None, output_dir=None, n_iter=100):
# nlp.vocab.vectors.name = 'spacy_pretrained_vectors'
optimizer = nlp.begin_training()
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes): # only train NER
for itn in range(10):
losses = {}
random.shuffle(to_train_entsfinal)
for item in to_train_entsfinal:
nlp.update([item[0]],
[item[1]],
sgd=optimizer,
drop=0.50,
losses=losses)
print(losses)
print("OUTTTTT")
if output_dir is None:
output_dir = "C:\\Users\\APRIL"
noutput_dir = Path(output_dir)
if not noutput_dir.exists():
noutput_dir.mkdir()
#nlp.meta['name'] = new_model_name
nlp.to_disk(output_dir)
random.shuffle(to_train_entsfinal)
if __name__ == '__main__':
main()
Can anyone help me solve this? Even when I removed the conflicting entities in a sample of 10+ rows, for example:
Blister abc.txt
Blisterpack abc.txt
Blisters abc.txt
the same issue happens and the model does not train.
Suggested changes:
def main(model=None, output_dir=None, n_iter=100):
    top_memory_precentage_use = 75  # or what ever number you choose

    def handle_memory(ruler):
        if psutil.virtual_memory().percent < top_memory_precentage_use:
            dump_ruler_nonascii(ruler)
            ruler = nlp.begin_training()  # or just init the nlp object again
        return ruler

    # This fitted for my use case
    def dump_ruler_nonascii(ruler):
        path = Path(os.path.join(self.data_path, 'config.jsonl'))
        pattern = ruler.patterns
        with open(path, "a", encoding="utf-8") as f:
            for line in pattern:
                f.write(json.dumps(line, ensure_ascii=False) + "\n")
        return ruler

    # nlp.vocab.vectors.name = 'spacy_pretrained_vectors'
    optimizer = nlp.begin_training()
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        for itn in range(10):
            losses = {}
            random.shuffle(to_train_entsfinal)
            for item in to_train_entsfinal:
                nlp.update([item[0]],
                           [item[1]],
                           sgd=optimizer,
                           drop=0.50,
                           losses=losses)
            print(losses)
            print("OUTTTTT")

    if output_dir is None:
        output_dir = "C:\\Users\\APRIL"

    noutput_dir = Path(output_dir)
    if not noutput_dir.exists():
        noutput_dir.mkdir()

    # nlp.meta['name'] = new_model_name
    nlp.to_disk(output_dir)
    random.shuffle(to_train_entsfinal)

if __name__ == '__main__':
    main()
It is hard to tell you why it is happening, but I can offer two helper functions for your training loop that you can adapt to your use case. In my case I was writing patterns, and I checked the memory use every iteration.
# add the following imports
import psutil
import os

top_memory_precentage_use = 75  # or what ever number you choose

def handle_memory(ruler):
    if psutil.virtual_memory().percent < top_memory_precentage_use:
        dump_ruler_nonascii(ruler)
        ruler = nlp.begin_training()  # or just init the nlp object again
    return ruler

# This fitted for my use case
def dump_ruler_nonascii(ruler):
    path = Path(os.path.join(self.data_path, 'config.jsonl'))
    pattern = ruler.patterns
    with open(path, "a", encoding="utf-8") as f:
        for line in pattern:
            f.write(json.dumps(line, ensure_ascii=False) + "\n")
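If you wire handle_memory into the training loop itself, it might look like this (a sketch, not tested; the names ruler, to_train_entsfinal and optimizer come from the code above and are assumptions about your setup):

for itn in range(10):
    ruler = handle_memory(ruler)  # check memory and dump patterns once per iteration
    losses = {}
    random.shuffle(to_train_entsfinal)
    for item in to_train_entsfinal:
        nlp.update([item[0]], [item[1]], sgd=optimizer, drop=0.50, losses=losses)
    print(losses)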
Below is the code
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()

for w in Wrd_Freq:
    print(ps.stem(w))
Output
read
peopl
say
work
I need the output as
['read',
'people',
'say',
'work']
Full code without Porter Stemmer
import nltk
import pandas as pd

lower = []
for item in df_text['job_description']:
    lower.append(item.lower())  # lowercase description

tokens = []
type(tokens)

token_string = [str(i) for i in lower]
string = "".join(token_string)
string = string.replace("-", "")

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"\W+", gaps=True)
tokens = tokenizer.tokenize(string)
tokens

from nltk.corpus import stopwords
stopwords_list = stopwords.words('english')
tokens = [token for token in tokens if token not in stopwords_list]
tokens

freq6000 = []
Wrd_Freq = nltk.FreqDist(tokens)
Wrd_Freq

df_WrdFreq = pd.DataFrame.from_dict(Wrd_Freq, orient='Index')
df_WrdFreq.columns = ['Word Frequency']
freq6000 = df_WrdFreq[(df_WrdFreq['Word Frequency'] >= 6000)]
freq6000.sort_values(by=['Word Frequency'], ascending=False).head(10)
I need to use the Porter stemmer separately to check whether there is any change to the count list. I need to perform the same steps after including the Porter stemmer and compare the output.
Use a list comprehension:
L = [ps.stem(w) for w in Wrd_Freq]
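For example, with a few made-up tokens (just to show that the comprehension returns a list):

from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ['reading', 'people', 'says', 'working']
print([ps.stem(w) for w in words])  # ['read', 'peopl', 'say', 'work']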
EDIT:
If you need the top values by counts:
from nltk.corpus import stopwords
stopwords_list = stopwords.words('english')
tokens = [token for token in tokens if token not in stopwords_list]
tokens
freq6000 = []
Wrd_Freq = nltk.FreqDist(tokens)
from collections import Counter
c = Counter(tokens)
top = [x for x, y in c.most_common(10)]
print (top)
['data', 'experience', 'business', 'work', 'science',
'learning', 'analytics', 'team', 'analysis', 'machine']
df_WrdFreq = pd.DataFrame.from_dict(Wrd_Freq, orient='Index')
df_WrdFreq.columns=['Word Frequency']
freq6000= df_WrdFreq[(df_WrdFreq['Word Frequency'] >= 6000)]
df = freq6000.sort_values(by=['Word Frequency'],ascending=False).head(10)
print (df)
            Word Frequency
data                124289
experience           59135
business             33528
work                 28146
science              26864
learning             26850
analytics            21828
team                 20825
analysis             20607
machine              20484
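If you also want the comparison the question asks about (counts before vs. after stemming), a sketch along these lines should work, assuming tokens, Wrd_Freq and the nltk import from the code above:

from nltk.stem import PorterStemmer

ps = PorterStemmer()
stemmed_tokens = [ps.stem(t) for t in tokens]
Stem_Freq = nltk.FreqDist(stemmed_tokens)

# Compare the two distributions side by side.
print(Wrd_Freq.most_common(10))
print(Stem_Freq.most_common(10))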
I am doing some text mining in Python and want to set up a new column with the value 1 if my search function finds the words and 0 if it does not.
I have tried various if statements, but cannot get anything to work.
A simplified version of what I'm doing is below:
import pandas as pd
import nltk
nltk.download('punkt')

df = pd.DataFrame(
    {
        'student number': [1, 2, 3, 4, 5],
        'answer': ['Yes, she is correct.', 'Yes', 'no', 'north east', 'No its North East']
        # I know there's an apostrophe missing
    }
)
print(df)

# change all text to lower case
df['answer'] = df['answer'].str.lower()

# split the answer into individual words
df['text'] = df['answer'].apply(nltk.word_tokenize)

# Check if given words appear together in a list of sentences
def check(sentence, words):
    res = []
    for substring in sentence:
        k = [w for w in words if w in substring]
        if len(k) == len(words):
            res.append(substring)
    return res

# Driver code
sentence = df['text']
words = ['no', 'north', 'east']
print(check(sentence, words))
This is what you want I think:
df['New'] = df['answer'].isin(words)*1
This one works for me:
for i in range(0, len(df)):
    if set(words) <= set(df.text[i]):
        df['NEW'][i] = 1
    else:
        df['NEW'][i] = 0
You don't need the function if you use this method.
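The same set-subset idea can also be written without the explicit loop (a sketch, assuming the tokenized 'text' column and the words list from the question):

df['NEW'] = df['text'].apply(lambda toks: int(set(words) <= set(toks)))
print(df)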