Convert NER SpaCy format to IOB format - nlp

I have data which is already labelled in SpaCy format. For example:
("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]})
But I want to try training it with any other NER model, such as BERT-NER, which requires IOB tagging instead. Is there any conversion code from SpaCy data format to IOB?
Thanks!

This is closely related to and mostly copied from https://stackoverflow.com/a/59209377/461847, see the notes in the comments there, too:
import spacy
# spaCy v2 import; in spaCy v3 the function was renamed to
# spacy.training.offsets_to_biluo_tags
from spacy.gold import biluo_tags_from_offsets
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]
nlp = spacy.load('en_core_web_sm')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    tags = biluo_tags_from_offsets(doc, annot['entities'])
    # then convert L->I and U->B to have IOB tags for the tokens in the doc
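The final step mentioned in the comment (L->I and U->B) can be sketched in plain Python. The helper name `biluo_to_iob` below is hypothetical and written only for illustration; it assumes the tags are strings such as `B-PERSON`, `U-LOC`, `O`, or `-` (the marker spaCy uses for tokens it could not align).

```python
def biluo_to_iob(tags):
    """Map BILUO tags to IOB: L- becomes I-, U- becomes B-, everything else is kept."""
    prefix_map = {"L": "I", "U": "B"}
    iob = []
    for tag in tags:
        prefix, sep, label = tag.partition("-")
        if sep and label and prefix in prefix_map:
            iob.append(prefix_map[prefix] + "-" + label)
        else:
            # "O", "-", "B-..." and "I-..." pass through unchanged
            iob.append(tag)
    return iob


biluo = ["O", "O", "B-PERSON", "L-PERSON", "O"]
print(biluo_to_iob(biluo))
```

The tags stay aligned with spaCy's own tokens, so this only helps when the downstream model uses the same tokenization.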

I am afraid you will have to write your own conversion, because the IOB encoding depends on the tokenization used by the pre-trained representation model (BERT, RoBERTa, or whatever pre-trained model you choose).
The SpaCy format specifies the character span of the entity, i.e.
"Who is Shaka Khan?"[7:17]
will return "Shaka Khan". You need to match that to tokens used by the pre-trained model.
Here are examples of how different models tokenize the example sentence when you use Huggingface's Transformers.
BERT: ['Who', 'is', 'S', '##hak', '##a', 'Khan', '?']
RoBERTa: ['Who', '_is', '_Sh', 'aka', '_Khan', '?']
XLNet: ['▁Who', '▁is', '▁Shak', 'a', '▁Khan', '?']
Once you know how the tokenizer works, you can implement the conversion. Something like this can work for BERT tokenization.
entities = [(7, 17, "PERSON")]
tokenized = ['Who', 'is', 'S', '##hak', '##a', 'Khan', '?']
cur_start = 0
state = "O" # Outside
tags = []
for i, token in enumerate(tokenized):
    # Deal with BERT's way of encoding spaces
    if token.startswith("##"):
        token = token[2:]
    elif i > 0:
        # assume exactly one space before every new word
        token = " " + token
    cur_end = cur_start + len(token)
    if entities and state == "O" and cur_start <= entities[0][0] < cur_end:
        # the entity starts inside this token
        state = "I-" + entities[0][2]
        tags.append("B-" + entities[0][2])
    else:
        tags.append(state)
    if entities and state != "O" and entities[0][1] <= cur_end:
        # the entity ends inside this token
        state = "O"
        entities.pop(0)
    cur_start = cur_end
Note that the snippet breaks if a single BERT token contains both the end of one entity and the start of the following one. The tokenizer also does not preserve how many spaces (or other whitespace characters) there were in the original string, which is another potential source of errors.
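One way to sidestep the whitespace problem is to look each word piece up in the original string instead of counting characters. The following is a minimal sketch under some assumptions: the function name `iob_tags_for_wordpieces` is hypothetical, the tokenizer is cased and does not alter characters (no lowercasing, accent stripping, or unknown-token replacement), and the entities do not overlap.

```python
def iob_tags_for_wordpieces(text, tokens, entities):
    """Assign IOB tags to BERT-style word pieces using character offsets."""
    tags = []
    cursor = 0          # where the search for the next piece starts
    prev_entity = None  # index of the entity the previous piece belonged to
    for token in tokens:
        # strip the "##" continuation marker to get the surface string
        piece = token[2:] if token.startswith("##") else token
        start = text.find(piece, cursor)
        if start == -1:
            raise ValueError(f"cannot locate {token!r} in the text")
        end = start + len(piece)
        cursor = end
        # find the first entity whose character span overlaps this piece
        hit = None
        for idx, (ent_start, ent_end, _label) in enumerate(entities):
            if start < ent_end and end > ent_start:
                hit = idx
                break
        if hit is None:
            tags.append("O")
        else:
            prefix = "I-" if hit == prev_entity else "B-"
            tags.append(prefix + entities[hit][2])
        prev_entity = hit
    return tags


text = "Who is Shaka Khan?"
tokens = ['Who', 'is', 'S', '##hak', '##a', 'Khan', '?']
print(iob_tags_for_wordpieces(text, tokens, [(7, 17, "PERSON")]))
```

Because every piece is located in the original text, extra spaces between words do not shift the offsets. A piece that straddles two entities is still assigned only to the first one.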

First, you need to convert your annotated JSON file to CSV.
Then you can run the code below to convert it into the spaCy v2 training format:
import pandas as pd
df = pd.read_csv('SC_CSV.csv')
l1 = []
l2 = []
for i in range(0, len(df['ner'])):
    l1.append(df['ner'][i])
    l2.append({"entities":[(0,len(df['ner'][i]),df['label'][i])]})
TRAIN_DATA = list(zip(l1, l2))
TRAIN_DATA
TRAIN_DATA is now in the spaCy v2 format.
The following code converts data from the old spaCy v2 format to the new spaCy v3 binary format:
import pandas as pd
from tqdm import tqdm
import spacy
from spacy.tokens import DocBin
nlp = spacy.blank("en") # load a new spacy model
db = DocBin() # create a DocBin object
for text, annot in tqdm(TRAIN_DATA): # data in previous format
    doc = nlp.make_doc(text) # create doc object from text
    ents = []
    for start, end, label in annot["entities"]: # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents # label the text with the ents
    db.add(doc)
db.to_disk("./train.spacy") # save the docbin object

I have faced this kind of problem.
What I did was transform the data to the spaCy binary format and then load it from the DocBin object using this code:
import spacy
from spacy.tokens import DocBin
db=DocBin().from_disk("your_docbin_name.spacy")
nlp=spacy.blank("language_used")
Documents=list(db.get_docs(nlp.vocab))
Then this code may help you extract the IOB format from it:
for elem in Documents[0]:
    if elem.ent_iob_ != "O":
        print(elem.text, elem.ent_iob_, "-", elem.ent_type_)
    else:
        print(elem.text, elem.ent_iob_)
Here is an example of my output:
عبرت O
الديناميكية B - POLITIQUE
النسوية I - POLITIQUE
التي O
تأسست O
بعد O
25 O
جويلية O
2021 O
عن O
رفضها O
القطعي O
لمشروع O
تنقيح B - POLITIQUE
المرسوم B - POLITIQUE
عدد O
88 O
لسنة O

import spacy
# data in spaCy format: a list of (text, {"entities": [(start, end, label), ...]}) tuples;
# the example from the question is used here
data = [("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]})]
nlp = spacy.blank("en")
for text, labels in data:
    doc = nlp(text)
    ents = []
    for start, end, label in labels["entities"]:
        ents.append(doc.char_span(start, end, label))
    doc.ents = ents
    for tok in doc:
        label = tok.ent_iob_
        if tok.ent_iob_ != "O":
            label += '-' + tok.ent_type_
        print(tok, label, sep="\t")
If you get a NoneType error, an entity span does not line up with token boundaries (char_span returns None in that case); add a try block or clean your dataset.

Related

Text Classification on a custom dataset with spacy v3

I am really struggling to make things work with the new spaCy v3 version. The documentation is extensive. However, I am trying to run a training loop in a script.
(I am also not able to perform text classification training with the CLI approach.)
The data are publicly available here.
import pandas as pd
from spacy.training import Example
import random
TRAIN_DATA = pd.read_json('data.jsonl', lines = True)
nlp = spacy.load('en_core_web_sm')
config = {
    "threshold": 0.5,
}
textcat = nlp.add_pipe("textcat", config=config, last=True)
label = TRAIN_DATA['label'].unique()
for label in label:
    textcat.add_label(str(label))
nlp = spacy.blank("en")
nlp.begin_training()
# Loop for 10 iterations
for itn in range(100):
    # Shuffle the training data
    losses = {}
    TRAIN_DATA = TRAIN_DATA.sample(frac = 1)
    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAIN_DATA.values, size=4):
        texts = [nlp.make_doc(text) for text, entities in batch]
        annotations = [{"cats": entities} for text, entities in batch]
        # uses an example object rather than text/annotation tuple
        print(texts)
        print(annotations)
        examples = [Example.from_dict(a)]
        nlp.update(examples, losses=losses)
    if itn % 20 == 0:
        print(losses)

Retraining pre-trained word embeddings in Python using Gensim

I want to retrain pre-trained word embeddings in Python using Gensim. The pre-trained embeddings I want to use are Google's Word2Vec vectors in the file GoogleNews-vectors-negative300.bin.
Following Gensim's word2vec tutorial, "it’s not possible to resume training with models generated by the C tool, load_word2vec_format(). You can still use them for querying/similarity, but information vital for training (the vocab tree) is missing there."
Therefore I can't use KeyedVectors, and for training a model the tutorial suggests using:
model = gensim.models.Word2Vec.load('/tmp/mymodel')
model.train(more_sentences)
(https://rare-technologies.com/word2vec-tutorial/)
However, when I try this:
from gensim.models import Word2Vec
model = Word2Vec.load('data/GoogleNews-vectors-negative300.bin')
I get an error message:
1330 # Because of loading from S3 load can't be used (missing readline in smart_open)
1331 if sys.version_info > (3, 0):
-> 1332 return _pickle.load(f, encoding='latin1')
1333 else:
1334 return _pickle.loads(f.read())
UnpicklingError: invalid load key, '3'.
I didn't find a way to properly convert the binary Google News file into a text file, and even if I did, I'm not sure whether that would solve my problem.
Does anyone have a solution to this problem or knows about a different way to retrain pre-trained word embeddings?
The Word2Vec.load() method can only load full models in gensim's native format (based on Python object-pickling) – not any other binary/text formats.
And, as per the documentation's note that "it’s not possible to resume training with models generated by the C tool", there's simply not enough information in the GoogleNews raw-vectors files to reconstruct the full working model that was used to train them. (That would require both some internal model-weights, not saved in that file, and word-frequency-information for controlling sampling, also not saved in that file.)
The best you could do is create a new Word2Vec model, then patch some/all of the GoogleNews vectors into it before doing your own training. This is an error-prone process with no real best-practices and many caveats about the interpretation of final results. (For example, if you bring in all the vectors, but then only re-train a subset using only your own corpus & word-frequencies, the more training you do – making the word-vectors better fit your corpus – the less such re-trained words will have any useful comparability to retained untrained words.)
Essentially, if you can look at the gensim Word2Vec source & work-out how to patch-together such a frankenstein-model, it may be appropriate. But there's no built-in support or handy off-the-shelf recipes that make it easy, because it's an inherently murky process.
I have already answered it here.
Save the Google News model as a text file in word2vec format using gensim.
Refer to this answer to save it as a text file.
Then try this code:
import os
import pickle
import numpy as np
import gensim
from gensim.models import Word2Vec, KeyedVectors
from gensim.models.callbacks import CallbackAny2Vec
import operator
os.makedirs("model_dir", exist_ok=True)
# class EpochSaver(CallbackAny2Vec):
#     '''Callback to save model after each epoch.'''
#     def __init__(self, path_prefix):
#         self.path_prefix = path_prefix
#         self.epoch = 0
#     def on_epoch_end(self, model):
#         list_of_existing_files = os.listdir(".")
#         output_path = 'model_dir/{}_epoch{}.model'.format(self.path_prefix, self.epoch)
#         try:
#             model.save(output_path)
#         except:
#             model.wv.save_word2vec_format('model_dir/model_{}.bin'.format(self.epoch), binary=True)
#         print("number of epochs completed = {}".format(self.epoch))
#         self.epoch += 1
#         list_of_total_files = os.listdir(".")
# saver = EpochSaver("my_finetuned")
# Function to load vectors from an existing model.
# I am loading glove vectors from a text file; the benefit of doing this is that I get the complete vocab of glove as well.
# If you are using a previous word2vec model, I would recommend saving that in txt format.
# In case you decide not to do it, you can tweak the function to get vectors for words in your vocab only.
def load_vectors(token2id, path, limit=None):
    embed_shape = (len(token2id), 300)
    freqs = np.zeros((len(token2id)), dtype='f')
    vectors = np.zeros(embed_shape, dtype='f')
    i = 0
    with open(path, encoding="utf8", errors='ignore') as f:
        for o in f:
            token, *vector = o.split(' ')
            token = str.lower(token)
            # skip short lines such as the word2vec header line
            if len(o) <= 100:
                continue
            if limit is not None and i > limit:
                break
            vectors[token2id[token]] = np.array(vector, 'f')
            i += 1
    return vectors
# path of the text file with your word vectors.
embedding_name = "word2vec.txt"
data = "<training data(new line separated text file)>"
# Dictionary to store a unique id for each token in vocab (in my case vocab contains both my vocab and glove vocab)
token2id = {}
# This dictionary will contain all the words and their frequencies.
vocab_freq_dict = {}
# Populating vocab_freq_dict and token2id from my data.
id_ = 0
training_examples = []
with open("{}".format(data), 'r', encoding="utf-8") as file:
    for line in file.readlines():
        words = line.strip().split(" ")
        training_examples.append(words)
        for word in words:
            if word not in vocab_freq_dict:
                vocab_freq_dict.update({word:0})
            vocab_freq_dict[word] += 1
            if word not in token2id:
                token2id.update({word:id_})
                id_ += 1
# Populating vocab_freq_dict and token2id from glove vocab.
max_id = max(token2id.items(), key=operator.itemgetter(1))[0]
max_token_id = token2id[max_id]
with open(embedding_name, encoding="utf8", errors='ignore') as f:
    for o in f:
        token, *vector = o.split(' ')
        token = str.lower(token)
        if len(o) <= 100:
            continue
        if token not in token2id:
            max_token_id += 1
            token2id.update({token:max_token_id})
            vocab_freq_dict.update({token:1})
with open("vocab_freq_dict","wb") as vocab_file:
    pickle.dump(vocab_freq_dict, vocab_file)
with open("token2id", "wb") as token2id_file:
    pickle.dump(token2id, token2id_file)
# converting vectors to keyedvectors format for gensim
vectors = load_vectors(token2id, embedding_name)
vec = KeyedVectors(300)
vec.add(list(token2id.keys()), vectors, replace=True)
# setting vectors(numpy_array) to None to release memory
vectors = None
# Parameter and attribute names below follow gensim 3.x; gensim 4 renamed several of them
# (e.g. size -> vector_size, iter -> epochs, index2entity -> index_to_key).
params = dict(min_count=1,workers=14,iter=6,size=300)
model = Word2Vec(**params)
# using build from vocab to build the vocab
model.build_vocab_from_freq(vocab_freq_dict)
# using token2id to create idxmap
idxmap = np.array([token2id[w] for w in model.wv.index2entity])
# Setting input weights (syn0 = between input layer and hidden layer) = your vectors arranged according to ids
model.wv.vectors[:] = vec.vectors[idxmap]
# Setting output weights (syn1neg = between hidden layer and output layer) = your vectors arranged according to ids
model.trainables.syn1neg[:] = vec.vectors[idxmap]
model.train(training_examples, total_examples=len(training_examples), epochs=model.epochs)
output_path = 'model_dir/final_model.model'
model.save(output_path)

determine most similar phrase using word2vec

I am trying to create a model that determines the most similar sentence to a given sentence using word2vec.
To represent a sentence, I compute the average vector of the words that make it up.
Then I want to predict the most similar sentence using these word embeddings.
My question is: how can I determine the most similar target sentence once I have the average vector of the source sentence?
Here is the code:
import gensim
from gensim import utils
import numpy as np
import sys
from sklearn.datasets import fetch_20newsgroups
from nltk import word_tokenize
from nltk import download
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
download('punkt') #tokenizer, run once
download('stopwords') #stopwords dictionary, run once
stop_words = stopwords.words('english')
def preprocess(text):
    text = text.lower()
    doc = word_tokenize(text)
    doc = [word for word in doc if word not in stop_words]
    doc = [word for word in doc if word.isalpha()] #restricts string to alphabetic characters only
    return doc
############ doc content -> num label -> string label
#note to self: texts[XXXX] -> y[XXXX] = ZZZ -> ng20.target_names[ZZZ]
# Fetch ng20 dataset
ng20 = fetch_20newsgroups(subset='all',
                          remove=('headers', 'footers', 'quotes'))
# text and ground truth labels
texts, y = ng20.data, ng20.target
corpus = [preprocess(text) for text in texts]
def filter_docs(corpus, texts, labels, condition_on_doc):
    """
    Filter corpus, texts and labels given the function condition_on_doc which takes
    a doc.
    The document doc is kept if condition_on_doc(doc) is true.
    """
    number_of_docs = len(corpus)
    print(number_of_docs)
    if texts is not None:
        texts = [text for (text, doc) in zip(texts, corpus)
                 if condition_on_doc(doc)]
    labels = [i for (i, doc) in zip(labels, corpus) if condition_on_doc(doc)]
    corpus = [doc for doc in corpus if condition_on_doc(doc)]
    print("{} docs removed".format(number_of_docs - len(corpus)))
    return (corpus, texts, labels)
corpus, texts, y = filter_docs(corpus, texts, y, lambda doc: (len(doc) != 0))
def document_vector(word2vec_model, doc):
    # remove out-of-vocabulary words
    #print("doc:")
    #print(doc)
    doc = [word for word in doc if word in word2vec_model.vocab]
    return np.mean(word2vec_model[doc], axis=0)
def has_vector_representation(word2vec_model, doc):
    """check if at least one word of the document is in the
    word2vec dictionary"""
    return not all(word not in word2vec_model.vocab for word in doc)
corpus, texts, y = filter_docs(corpus, texts, y, lambda doc: has_vector_representation(model, doc))
x =[]
for doc in corpus: #look up each doc in model
    x.append(document_vector(model, doc))
X = np.array(x) #list to array
model.most_similar(positive=X, topn=1)
Just use the cosine distance. It's implemented in scipy.
For better efficiency, you can implement it yourself and precompute the norms of vectors in X:
X_norm = np.expand_dims(np.linalg.norm(X, axis=1), 0)
Calling expand_dims ensures that the dimensions broadcast correctly. Then, for a matrix of vectors Y, you can get the most similar ones:
def get_most_similar_in_X(Y):
    Y_norm = np.expand_dims(np.linalg.norm(Y, axis=1), 1)
    similarities = np.dot(Y, X.T) / Y_norm / X_norm
    return np.argmax(similarities, axis=1)
And you get the indices of the vectors in X that are most similar to the vectors in Y.
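To make the shapes concrete, here is a small self-contained sketch of the same idea on toy data. The function name `most_similar_rows` and the example arrays are made up for illustration, and the sketch assumes that neither matrix contains an all-zero row (which would cause a division by zero).

```python
import numpy as np


def most_similar_rows(X, Y):
    """For each row of Y, return the index of the most cosine-similar row of X."""
    X_norm = np.linalg.norm(X, axis=1, keepdims=True)  # shape (n_x, 1)
    Y_norm = np.linalg.norm(Y, axis=1, keepdims=True)  # shape (n_y, 1)
    # dot products of unit-length rows are cosine similarities, shape (n_y, n_x)
    similarities = (Y / Y_norm) @ (X / X_norm).T
    return np.argmax(similarities, axis=1)


X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # candidate sentence vectors
Y = np.array([[0.9, 0.1], [2.0, 2.1]])              # query sentence vectors
print(most_similar_rows(X, Y))
```

In the question's setting, X would hold the averaged document vectors and Y the averaged vector(s) of the source sentence; the returned indices can then be used to look up the matching entries in texts.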

scikit-learn - Using a single string with RandomForestClassifier.predict()?

I'm an sklearn dummy... I'm trying to predict the label for a given string from a RandomForestClassifier() fitted with text, labels.
It's obvious I don't know how to use predict() with a single string. The reason I'm using reshape() is because I got this error some time ago "Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."
How can I predict the label of a single text string?
The script:
#!/usr/bin/env python
''' Read a txt file consisting of '<label>: <long string of text>'
to use as a model for predicting the label for a string
'''
from argparse import ArgumentParser
import json
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
def main(args):
    '''
    args: Arguments obtained by _Get_Args()
    '''
    print('Loading data...')
    # Load data from args.txtfile and split the lines into
    # two lists (labels, texts).
    data = open(args.txtfile).readlines()
    labels, texts = ([], [])
    for line in data:
        label, text = line.split(': ', 1)
        labels.append(label)
        texts.append(text)
    # Print a list of unique labels
    print(json.dumps(list(set(labels)), indent=4))
    # Instantiate a CountVectorizer class and fit the texts
    # and labels into it.
    cv = CountVectorizer(
        stop_words='english',
        strip_accents='unicode',
        lowercase=True,
    )
    matrix = cv.fit_transform(texts)
    encoder = LabelEncoder()
    labels = encoder.fit_transform(labels)
    rf = RandomForestClassifier()
    rf.fit(matrix, labels)
    # Try to predict the label for args.string.
    prediction = Predict_Label(args.string, cv, rf)
    print(prediction)
def Predict_Label(string, cv, rf):
    '''
    string: str() - A string of text
    cv: The CountVectorizer class
    rf: The RandomForestClassifier class
    '''
    matrix = cv.fit_transform([string])
    matrix = matrix.reshape(1, -1)
    try:
        prediction = rf.predict(matrix)
    except Exception as E:
        print(str(E))
    else:
        return prediction
def _Get_Args():
    parser = ArgumentParser(description='Learn labels from text')
    parser.add_argument('-t', '--txtfile', required=True)
    parser.add_argument('-s', '--string', required=True)
    return parser.parse_args()
if __name__ == '__main__':
    args = _Get_Args()
    main(args)
The actual learning data text file is 43663 lines long but a sample is in small_list.txt which consists of lines each in the format: <label>: <long text string>
The error is noted in the Exception output:
$ ./learn.py -t small_list.txt -s 'This is a string that might have something to do with phishing or fraud'
Loading data...
[
"Vulnerabilities__Unknown",
"Vulnerabilities__MSSQL Browsing Service",
"Fraud__Phishing",
"Fraud__Copyright/Trademark Infringement",
"Attacks and Reconnaissance__Web Attacks",
"Vulnerabilities__Vulnerable SMB",
"Internal Report__SBL Notify",
"Objectionable Content__Russian Federation Objectionable Material",
"Malicious Code/Traffic__Malicious URL",
"Spam__Marketing Spam",
"Attacks and Reconnaissance__Scanning",
"Malicious Code/Traffic__Unknown",
"Attacks and Reconnaissance__SSH Brute Force",
"Spam__URL in Spam",
"Vulnerabilities__Vulnerable Open Memcached",
"Malicious Code/Traffic__Sinkhole",
"Attacks and Reconnaissance__SMTP Brute Force",
"Illegal content__Child Pornography"
]
Number of features of the model must match the input. Model n_features is 2070 and input n_features is 3
None
You need to take the vocabulary of the first CountVectorizer (cv) and use it to transform the new single text before calling predict.
...
    cv = CountVectorizer(
        stop_words='english',
        strip_accents='unicode',
        lowercase=True,
    )
    matrix = cv.fit_transform(texts)
    encoder = LabelEncoder()
    labels = encoder.fit_transform(labels)
    rf = RandomForestClassifier()
    rf.fit(matrix, labels)
    # Try to predict the label for args.string.
    cv_new = CountVectorizer(
        stop_words='english',
        strip_accents='unicode',
        lowercase=True,
        vocabulary=cv.vocabulary_
    )
    prediction = Predict_Label(args.string, cv_new, rf)
    print(prediction)
...
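A related option, which is not taken from the answer above, is to reuse the already fitted vectorizer and call `transform` instead of `fit_transform` on the new string, so the feature columns match those seen during training. The sketch below uses a tiny made-up dataset; `predict_label` and the sample texts and labels are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy training data, invented for this illustration
texts = [
    "please verify your bank account password now",
    "click this link to claim your prize",
    "port scan detected from remote host",
    "repeated ssh login attempts from one address",
]
labels = ["Fraud", "Spam", "Attack", "Attack"]

cv = CountVectorizer(stop_words="english", lowercase=True)
matrix = cv.fit_transform(texts)  # learns the vocabulary once
rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(matrix, labels)


def predict_label(string, cv, rf):
    """Vectorize one string with the fitted vocabulary and predict its label."""
    features = cv.transform([string])  # transform only, no refitting
    return rf.predict(features)[0]


print(predict_label("password reset for your bank account", cv, rf))
```

Since `transform` always yields as many columns as the fitted vocabulary has entries, the mismatch between model n_features and input n_features cannot occur, and no reshape is needed.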

Classify text using NaiveBayesClassifier

I have a text file with a sentence on each line:
e.g. "Have you registered your email ID with your Bank Account?"
I want to classify it into interrogative or not. FYI these are sentences from bank websites.
I've seen this answer
with this nltk code block:
import nltk
nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()[:10000]
def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features
featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
So I did some preprocessing to my text file i.e. stemming words, removing stop words etc, to make each sentence into a bag of words. From the code above, I have a trained classifier. How do I implement it on my text file of sentences (either raw or preprocessed)?
Update: here is an example of my text file.
Assuming that you have preprocessed the document data as we discussed, you can do the following:
import nltk
nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()[:10000]
def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features
featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(featuresets)
print(nltk.classify.accuracy(classifier, test_set))
0.668
For your data, train the classifier once and then classify each line:
classifier = nltk.NaiveBayesClassifier.train(featuresets)
print(classifier.classify(dialogue_act_features(line)))
Doing this for all lines in the text file works:
# 'sentences.txt' is a placeholder name for your own file
with open('sentences.txt') as f:
    for line in f:
        print(classifier.classify(dialogue_act_features(line.strip())))
