spaCy custom NER: how to prepare multi-word entities?

:) Please help :)
I'm preparing custom Named Entity Recognition using a blank spaCy model. I use only one entity: Brand (we can name it 'ORG' for Organisation). I have short texts with ORGs and have prepared data like this (but I can change it):
train_data = [
    ('First text in string with Name I want', {'entities': [(START, END, 'ORG')]}),
    ('Second text with Name and Name2', {'entities': [(START, END, 'ORG'), (START2, END2, 'ORG')]})
]
START and END are the start and end indexes of the brand name in the text, of course.
This is working well, but...
The problem I have is how to prepare entities for Brands that are made of 2 (or more) words.
Let's say Brand Name is the full name of a company. How do I prepare an entity for it?
Consider the tuple itself for a single text:
text = 'Third text with Brand Name'
company = 'Brand Name'
Can I treat company as one word?
('Third text with Brand Name', {'entities': [(16, 26, 'ORG')]})
Or as 2 separate brands, 'Brand' & 'Name'? (That will not be useful in my case when using the model later :( )
('Third text with Brand Name', {'entities': [(16, 21, 'ORG'), (22, 26, 'ORG')]})
Or should I use a different labeling format, e.g. BIO?
So Brand will be B-ORG and Name will be I-ORG?
If so, can I prepare it like this for spaCy:
('Third text with Brand Name', {'entities': [(16, 21, 'B-ORG'), (22, 26, 'I-ORG')]})
or should I change the format of train_data because I also need the 'O' from BIO?
How? Like this?
('Third text with Brand Name', {'entities': ['O', 'O', 'O', 'B-ORG', 'I-ORG']})
The question is about the format of the train_data for 'Third text with Brand Name': how to label the entity. If I have the format, I will handle the code. :)
The same question applies to entities of 3 or more words. :)

You can just provide the start and end offsets for the whole entity. You describe this as "treating it as one word", but the character offsets don't have any direct relation to tokenization - they won't affect tokenizer output.
You will get an error if the start and end of your entity don't match token boundaries, but it doesn't matter if the entity is one token or many.
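As a minimal sketch of the point above, the offsets for a multi-word entity can be computed with plain string operations, with no spaCy involved. The helper name entity_offsets is hypothetical, and it assumes the brand occurs in the text exactly as written (it only finds the first occurrence):

```python
def entity_offsets(text, phrase, label="ORG"):
    """Return (start, end, label) for the first occurrence of phrase in text.

    Hypothetical helper for illustration; end is exclusive, as in the
    question's (16, 26, 'ORG') example.
    """
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    return (start, start + len(phrase), label)


text = "Third text with Brand Name"
# One span covers the whole multi-word brand.
example = (text, {"entities": [entity_offsets(text, "Brand Name")]})
print(example)
```

The same call works unchanged for names of three or more words, because only the first and last character positions matter.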
I recommend you take a look at the training data section in the spaCy docs. Your specific question isn't answered explicitly, but that's only because multi-token entities don't require special treatment. The examples there include multi-token entities.
Regarding BIO tagging, for details on how to use it with spaCy you can see the docs for spacy convert.
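To show how the two labeling schemes relate, here is a rough sketch that merges BIO tags into character offsets. It is not how spacy convert works internally; it assumes tokens separated by single spaces, and the function name bio_to_offsets is made up for this example:

```python
def bio_to_offsets(tokens, tags):
    """Merge B-/I- tagged tokens into (start, end, label) character spans.

    Assumes the original text is the tokens joined by single spaces.
    """
    spans = []
    position = 0    # character index where the current token starts
    current = None  # [start, end, label] of the entity being built
    for token, tag in zip(tokens, tags):
        start, end = position, position + len(token)
        if tag.startswith("B-"):
            if current:
                spans.append(tuple(current))
            current = [start, end, tag[2:]]
        elif tag.startswith("I-") and current and current[2] == tag[2:]:
            current[1] = end  # extend the open entity over this token
        else:
            # "O" (or an I- tag with no matching open entity) closes the entity
            if current:
                spans.append(tuple(current))
            current = None
        position = end + 1  # skip the single separating space
    if current:
        spans.append(tuple(current))
    return spans


tokens = "Third text with Brand Name".split()
tags = ["O", "O", "O", "B-ORG", "I-ORG"]
print(bio_to_offsets(tokens, tags))
```

For the example sentence, B-ORG plus I-ORG collapse into the same single span that the offset format uses for the whole brand.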

Related

Is there a way to randomly select multiple values from an array in python?

For example, in making a text-based game, I'm trying to select a few countries out of a whole array (only a few, not the whole thing), but the selection needs to be random and different every time. I'll try to display it in pseudocode:
Import random
nations = [UK, USA, France, Spain, Germany, Russia, Sweden, Norway, Austria, Turkey, KSA, UAE,
India, PRC, Japan, Mongolia, Kyrgyzstan, Egypt, Algeria, Morocco, Nigeria, Ghana, Laos,
Vietnam, Cambodia, Congo, Kenya, Somali, Sudan]
nationsForThisGame = nations.random(9)
// This gives me 9 random nations from that array
output(F"You are allies with {nationsForThisGame(1)}")
// This takes the first nation from the second array, nationsForThisGame, as the one you're allies with
Now how do I make this into Python? I tried a similar structure, but it says: 'list' object has no attribute 'random'
So from the first array, I just want a few random values to put into my second array. How do I do that?
Use the following:
import random
Nations = ["Nation1", "Nation2", "Nation3", "Nation4", "Nation5", "Nation6",
           "Nation7", "Nation8", "Nation9"]
print(random.sample(Nations, 6))
This works well.
random.sample returns the requested number of distinct elements from the array 'Nations', chosen at random, so you can pick as many values as you want (up to the length of the list). This helped me in a text-based game I'm making.
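Applied to the question's own list, a short sketch could look like this. The variable names are adapted to Python style, and note that Python indexes from 0, so the first sampled nation is at index 0:

```python
import random

nations = ["UK", "USA", "France", "Spain", "Germany", "Russia", "Sweden",
           "Norway", "Austria", "Turkey", "KSA", "UAE", "India", "PRC",
           "Japan", "Mongolia", "Kyrgyzstan", "Egypt", "Algeria", "Morocco",
           "Nigeria", "Ghana", "Laos", "Vietnam", "Cambodia", "Congo",
           "Kenya", "Somali", "Sudan"]

# Pick 9 distinct nations; the result differs from run to run.
nations_for_this_game = random.sample(nations, 9)

# Lists are indexed with square brackets, starting at 0.
print(f"You are allies with {nations_for_this_game[0]}")
```

random.sample leaves the original list untouched and never returns the same element twice, which suits picking a fresh set of nations for each game.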

Check if the words in a column matches the words in a dictionary

data snapshot
Hi all,
I am new to Python programming. I have data with a set of columns as shown in the picture. For each occupation there are relevant hard skills, but the column is actually a combination of both hard skills and technology skills. My aim is to create a new column which classifies each skill as hard or tech so that it is easier for me to filter later. I have created a dictionary of tech skills. I want to check whether there is a match between the dictionary words and the words in the hard skills column. Can you please help me with the code? The dictionary words are as follows:
dict = {'tech_skills':['Android (Operating System)','Apple IOS','Apple IPhone','Computer Keyboards','Computer Terminals','Corel Wordperfect Office', 'FaceTime','Gmail','Google Apps','Google Docs','Google Voice','Google+','Microsoft Excel', 'Microsoft Internet Explorer','Microsoft Office','Microsoft Outlook','Microsoft PowerPoint', 'Microsoft Visio','Microsoft Windows','Microsoft Windows NT','Microsoft Windows XP','Microsoft Word', 'Mobile Devices','Skype','Tableau (Business Intelligence Software)']}
# Copy hard_skills_name onto a new column
df['matched'] = df['hard_skills_name']
# Replace with 1 if matched with the tech skills
df['matched'] = df['matched'].replace(dict['tech_skills'], 1)
# Replace non-matched with 0 (use .loc to avoid chained assignment)
df.loc[df['matched'] != 1, 'matched'] = 0
Then you can check the rows of the data frame that matched:
print(df[df['matched'] == 1])
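Since the stated goal is a hard/tech label rather than a 0/1 flag, the same membership test can be sketched in plain Python. The function name classify_skill, the shortened skill set and the sample value "Welding" are all invented for illustration:

```python
# Shortened sample of the tech skills from the question (illustrative only).
TECH_SKILLS = {"Microsoft Excel", "Gmail", "Skype"}


def classify_skill(skill, tech_skills=TECH_SKILLS):
    """Return 'tech' if the skill is in the tech list, otherwise 'hard'."""
    # strip() guards against stray whitespace around the cell value
    return "tech" if skill.strip() in tech_skills else "hard"


skills = ["Microsoft Excel", "Welding", "Skype "]  # hypothetical column values
labels = [classify_skill(s) for s in skills]
print(labels)
```

With pandas, something like df['hard_skills_name'].map(classify_skill) should apply the same function to every row, assuming each cell holds exactly one skill name.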

String matching keywords and key phrases in Python

I am trying to perform a smart dynamic lookup with strings in Python for an NLP-like task. I have a large number of similarly structured sentences that I would like to parse one by one, tokenizing parts of each sentence. For example, I first parse a string such as "bob goes to the grocery store".
I am taking this string in, splitting it into words and my goal is to look up matching words in a keyword list. Let's say I have a list of single keywords such as "store" and a list of keyword phrases such as "grocery store".
sample = 'bob goes to the grocery store'
keywords = ['store', 'restaurant', 'shop', 'office']
keyphrases = ['grocery store', 'computer store', 'coffee shop']
for word in sample.split():
    pass  # do dynamic length lookups
Now the issue is this: sometimes my sentences might simply be "bob goes to the store" instead of "bob goes to the grocery store".
I want to find the keyword "store" for sure but if there are descriptive words such as "grocery" or "computer" before the word store I would like to capture that as well. That is why I have the keyphrases list as well. I am trying to figure out a way to basically capture a keyword at the very least then if there are words related to it that might be a possible "phrase" I want to capture those too.
Maybe an alternative is to have some sort of adjective list instead of a phrase list of multiple words?
How could I go about doing this sort of variable-length lookup, where I look at more than just a single word if one is captured, or is there an entirely different method I should be considering?
Here is how you can use a nested for loop and a formatted string. Note that keyphrases now holds only the descriptive words:
sample = 'bob goes to the grocery store'
keywords = ['store', 'restaurant', 'shop', 'office']
keyphrases = ['grocery', 'computer', 'coffee']
for kw in keywords:
    for kp in keyphrases:
        if f"{kp} {kw}" in sample:
            # Do something with the matched phrase
            print(f"{kp} {kw}")
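The nested loop only finds a descriptive word followed by a keyword. One possible way to also fall back to the bare keyword is a longest-match-first scan over the words. This is a sketch rather than a definitive method: find_matches is a made-up name, it uses the question's original lists (full phrases), and it assumes at least one keyword or phrase is given:

```python
def find_matches(sentence, keywords, keyphrases):
    """Scan left to right, preferring the longest keyword or phrase at each word."""
    vocab = set(keywords) | set(keyphrases)
    max_len = max(len(entry.split()) for entry in vocab)  # longest entry, in words
    words = sentence.split()
    matches = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in vocab:
                matches.append(candidate)
                i += n  # skip past the words that were consumed
                break
        else:
            i += 1  # nothing starts at this word
    return matches


keywords = ['store', 'restaurant', 'shop', 'office']
keyphrases = ['grocery store', 'computer store', 'coffee shop']
print(find_matches('bob goes to the grocery store', keywords, keyphrases))
print(find_matches('bob goes to the store', keywords, keyphrases))
```

Because the phrase is tried before the single word, "grocery store" is reported once as a phrase instead of also being reported as "store".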

How to extract relationships from a text

I am new to NLP and need guidance on how I can solve this problem.
I am currently doing a filtering technique where I need to brand data in a database as either being correct or incorrect. I am given a structured data set, with columns and rows.
However, the filtering conditions are given to me in a text file.
An example filtering text file could be the following:
Values in the column ID which are bigger than 99
Values in the column Cash which are smaller than 10000
Values in the column EndDate that are smaller than values in StartDate
Values in the column Name that contain numeric characters
Any value that follows those conditions should be branded as bad.
However, I want to extract those conditions and append them to the program that I've made so far.
For instance, for the conditions above, I would like to produce
`if ID>99`
`if Cash<10000`
`if EndDate < StartDate`
`if Name LIKE %[0-9]%`
How can I achieve the above result using the Stanford NLP? (or any other NLP library).
This doesn't look like a machine learning problem; it's a simple parser. You have a simple syntax, from which you can easily extract the salient features:
column name
relationship
target value or target column
The resulting "action rule" is simply removing the "syntactic sugar" words and converting the relationship -- and possibly the target value -- to its symbolic form.
Enumerate all of your critical words for each position in a lexicon. Then use basic string manipulation operators in your chosen implementation language to find the three needed fields.
EXAMPLE
Given the data above, your lexicons might be like this:
column_trigger = "Values in the column"
relation_dict = {
    "are bigger than": ">",
    "are smaller than": "<",
    "contain": "LIKE",
    # ...
}
value_desc = {
    "numeric characters": "%[0-9]%",
    # ...
}
From here, use these items in standard parsing. If you're not familiar with that, please look up the basics of a simple sentence grammar in your favourite programming language, with rules such as
SENTENCE => SUBJ VERB OBJ
Does that get you going?
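A minimal sketch of such a parser, using only string operations, might look like the following. The function name parse_rule and the handling of the filler words "which", "that" and "values in" are assumptions based on the four example sentences, not a general grammar:

```python
COLUMN_TRIGGER = "Values in the column "
RELATIONS = {
    "are bigger than": ">",
    "are smaller than": "<",
    "contain": "LIKE",
}
# [0-9] so that 0 also counts as a numeric character
VALUE_DESCRIPTIONS = {"numeric characters": "%[0-9]%"}


def parse_rule(sentence):
    """Turn one filtering sentence into a condition string."""
    if not sentence.startswith(COLUMN_TRIGGER):
        raise ValueError(f"Unrecognised rule: {sentence!r}")
    # The word right after the trigger is the column name.
    column, _, rest = sentence[len(COLUMN_TRIGGER):].partition(" ")
    # Drop the relative pronoun ("which" / "that").
    for filler in ("which ", "that "):
        if rest.startswith(filler):
            rest = rest[len(filler):]
            break
    # Find the relationship phrase and keep what follows as the target.
    for phrase, symbol in RELATIONS.items():
        if rest.startswith(phrase + " "):
            target = rest[len(phrase) + 1:]
            break
    else:
        raise ValueError(f"Unknown relationship in: {sentence!r}")
    # "values in StartDate" refers to another column.
    if target.startswith("values in "):
        target = target[len("values in "):]
    # Replace descriptive targets with their symbolic form.
    target = VALUE_DESCRIPTIONS.get(target, target)
    return f"if {column} {symbol} {target}"


rules = [
    "Values in the column ID which are bigger than 99",
    "Values in the column Cash which are smaller than 10000",
    "Values in the column EndDate that are smaller than values in StartDate",
    "Values in the column Name that contain numeric characters",
]
for rule in rules:
    print(parse_rule(rule))
```

New relationships or value descriptions are added by extending the two dictionaries; sentences that do not fit the pattern raise an error instead of being guessed at.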

Replace words marked with offsets

I have a sentence like this:
"My name is Bond. It's a fake name."
and I have to replace some words, given as a list with the offsets of each word:
name, 29-33; Bond, 11-15; name, 3-7
In addition, each word must be replaced with a specific word:
name -> noun
Bond -> proper
I have to obtain this output:
"My noun is proper. It's a fake noun."
I tried to manage the offsets with a post-offset variable that I update after each replacement, but that does not work because the list is unordered. Note that the find method does not work either, because names are repeated. Is there an algorithm to do this? Is there a vectorized implementation (String, NumPy, NLTK) that computes it in one step?
Check this one:
string = "My name is Bond. It's a fake name."
words = string.split(" ")  # break the string into words
Now traverse the list and replace the matching words:
replacements = {"name": "noun", "Bond": "proper"}
for idx, word in enumerate(words):
    core = word.rstrip(".")  # ignore a trailing full stop when comparing
    if core in replacements:
        words[idx] = replacements[core] + word[len(core):]
The list values are now changed; use the join() method to turn the list back into a string:
result = " ".join(words)
Note that this replaces every occurrence of each word and does not use the offsets.
For more, please refer to this website: https://www.tutorialspoint.com/python/python_strings.htm
That page covers string processing in Python.
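If the offsets themselves have to be honoured, one common approach is to sort the spans by start position and apply them from right to left, so that earlier offsets stay valid after each replacement. The sketch below assumes end offsets are exclusive (which matches the question's numbers) and that spans do not overlap; replace_by_offsets is a made-up name:

```python
def replace_by_offsets(text, spans, mapping):
    """Replace each (word, start, end) span with mapping[word].

    Spans are processed from the rightmost to the leftmost, so replacing
    one span never shifts the offsets of the spans still to be handled.
    """
    for word, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        if text[start:end] != word:
            raise ValueError(f"Offsets {start}-{end} do not cover {word!r}")
        text = text[:start] + mapping[word] + text[end:]
    return text


sentence = "My name is Bond. It's a fake name."
spans = [("name", 29, 33), ("Bond", 11, 15), ("name", 3, 7)]  # unordered, as given
mapping = {"name": "noun", "Bond": "proper"}
print(replace_by_offsets(sentence, spans, mapping))
```

The check against text[start:end] catches offsets that do not line up with the expected word, which is useful when the span list comes from another tool.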
