Custom Word Tagger - nlp

I am new to NLP and am getting to know NLTK, but am having some trouble getting off the ground on something I am trying to accomplish.
I would like to build my own word tagger such that if I pass a string like "The Porsche is red" the function would return ('Porsche','Car', 'red', 'Color').
I already have the dictionaries built that define the categories. I am just struggling with how to get started. Could anyone offer some assistance?
Thanks very much.
UPDATE: The dictionary at this time is a simple two column list in .csv format with the word and its corresponding category.
Example Link: http://www.filedropper.com/carexampledictionary
Sincerely,
Mick

I think a simple lookup might work. First tokenize the text, then iterate through the tokens and look up each token in your lists of categories.
One problem you might have is overlap between the categories. Is there any word which occurs in more than one category list? If so, you'd need a method to disambiguate which category a given token belongs to. If not, a simple list lookup should work.
More precisely, here is what I would do step-by-step:
1. Import the data into a dictionary
2. Tokenize the text
3. For each token, look up whether the token is in the keys of your dictionary
4. Tag the word according to what category it belongs to
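The steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: SAMPLE_CSV is a made-up stand-in for the two-column .csv from the question, load_categories and tag_words are hypothetical helper names, and a plain regular expression is used as a crude tokenizer where an NLTK tokenizer could be swapped in.

```python
import csv
import io
import re

# Hypothetical stand-in for the two-column .csv (word, category).
SAMPLE_CSV = """Porsche,Car
BMW,Car
red,Color
blue,Color
"""

def load_categories(csv_file):
    """Step 1: read word,category rows into a dict keyed by the lowercased word."""
    return {row[0].strip().lower(): row[1].strip()
            for row in csv.reader(csv_file) if len(row) >= 2}

def tag_words(text, categories):
    """Steps 2-4: tokenize, look each token up, and tag the ones that are found."""
    tokens = re.findall(r"\w+", text)  # crude tokenizer (assumption)
    return [(tok, categories[tok.lower()])
            for tok in tokens if tok.lower() in categories]

categories = load_categories(io.StringIO(SAMPLE_CSV))
print(tag_words("The Porsche is red", categories))
# [('Porsche', 'Car'), ('red', 'Color')]
```

Because the dictionary maps each word to exactly one category, this only works as-is when no word appears in more than one category list.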
Hope that helps.

Related

Dataframe text column with spelling mistakes

Sorry for bad title, I wasn't sure how best to describe the issue.
I have a dataframe with a column for occupation, df['occupation']. Users can enter their occupation during signup using whatever terms they like.
I'm attempting to do an EDA on the column; however, I'm not sure how to clean the column to get it from this,
Occupation
a-level student
a level
alavls
university physics student
physics student
6th form student
builder
Into something like this,
Occupation
a-levels
University student
Full time employment
Without writing out hundreds of lines renaming each unique entry.
TYIA
Any help or links to useful modules would be great.
The simplest way to do this is to apply a function that measures the similarity between each entry and a set of target categories. There are plenty of similarity measures that could be used here, such as the Hamming distance. However, they are all fairly limited, and in production you might eventually need a machine learning model for this task.
import pandas as pd

def hamming_distance(chaine1, chaine2):
    """Measure how much two character strings differ (lower means more similar).

    Note: this is very limited, as it only compares the letters at matching
    positions. Characters past the end of the shorter string are ignored.
    """
    return sum(c1 != c2 for c1, c2 in zip(chaine1, chaine2))

# Target values that every raw entry is mapped onto.
OCCUPATIONS = ["Occupation", "a-levels", "University student", "Full time employment"]

def get_most_similar(ocup, OCCUPATIONS):
    """Return the entry of OCCUPATIONS that is closest to the raw entry ocup."""
    return min(OCCUPATIONS, key=lambda oc: hamming_distance(ocup.lower(), oc.lower()))

column = ["Occupation", "a-level student", "a level", "alavls", "university physics student",
          "physics student", "6th form student", "builder"]
# This is just a reconstruction of your dataframe; you probably don't need this line.
df = pd.DataFrame(column, columns=['occupation'])
df['occupation'] = df['occupation'].apply(lambda ocup: get_most_similar(ocup, OCCUPATIONS))
df.head(100)
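As one example of another similarity measure, the standard library's difflib can stand in for the Hamming distance. This is only a sketch under assumptions: the normalize helper, the 0.6 cutoff, and the "Other" fallback are all hypothetical choices, not part of the original answer.

```python
import difflib

# Target categories from the question; "Other" is a hypothetical fallback.
CATEGORIES = ["a-levels", "University student", "Full time employment"]

def normalize(entry, categories=CATEGORIES, cutoff=0.6, fallback="Other"):
    """Map a free-text entry to the closest category, or to fallback."""
    lookup = {c.lower(): c for c in categories}
    matches = difflib.get_close_matches(entry.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else fallback

for raw in ["a levels", "A-Levels", "builder"]:
    print(raw, "->", normalize(raw))
```

Spelling-based measures only catch near-misses of the category name itself. An entry such as "builder" shares almost no characters with "Full time employment", so it falls through to the fallback and still needs some kind of keyword rule.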
Hi, one approach you could use for this problem is similar to the solution covered in "Using apply map with Regex".
The regex approach lets you use wildcards to cover cases you have not yet seen in your dataset.
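A minimal sketch of what such regex rules could look like for the occupation column is shown here. The patterns, the RULES list, the categorize helper, and the "Other" fallback are all assumptions made up for illustration; real data would need its own patterns.

```python
import re

# Hypothetical pattern -> category rules; the first matching rule wins.
RULES = [
    (re.compile(r"\ba[\s-]*l[ae]ve?ls?\b|6th\s*form", re.IGNORECASE), "a-levels"),
    (re.compile(r"universit|physics|undergrad", re.IGNORECASE), "University student"),
    (re.compile(r"builder|engineer|nurse", re.IGNORECASE), "Full time employment"),
]

def categorize(entry, rules=RULES, fallback="Other"):
    """Return the category of the first rule whose pattern occurs in entry."""
    for pattern, category in rules:
        if pattern.search(entry):
            return category
    return fallback

entries = ["a-level student", "a level", "alavls", "university physics student",
           "physics student", "6th form student", "builder"]
for entry in entries:
    print(entry, "->", categorize(entry))
```

On a dataframe, the same function could be applied with df['occupation'].apply(categorize). Checking which entries still land in the fallback shows where another rule is needed.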

Which technique for training should I use for the following dataset?

I have a dataset which is a csv having 2 columns "Text", "Name".
"Text" column contains the news article.
"Name" column contains the extracted name from the corresponding text.
I have to train a model on this dataset, which contains more than 4,000 unique news articles. Once the model is trained and validated, a user should be able to pass in any text and the model should fetch the proper name.
What technique should I use, and how should I implement it? Please suggest.
Thanks in advance.
It sounds like you are looking to search for an item by keywords. In a basic case you could use a bag-of-words approach, in which you tokenise the words in the Text field and index each document accordingly.
The relevance of each document to a query can then be calculated with some measure (for instance, cosine similarity).
You can find an example using the gensim library here: https://radimrehurek.com/gensim/tut3.html
It is quite basic; note, however, that it does use LSI.
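To make the bag-of-words and cosine-similarity idea concrete, here is a small standard-library sketch. It does not use gensim or LSI, the two example documents are invented, and the function names are hypothetical. It only ranks documents against a query, which is the part described above.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercased word tokens in text."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_documents(query, documents):
    """Return (index, score) pairs sorted from most to least similar."""
    q = bag_of_words(query)
    scores = [(i, cosine_similarity(q, bag_of_words(d))) for i, d in enumerate(documents)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Invented example documents standing in for the "Text" column.
docs = [
    "The central bank raised interest rates again",
    "The football club signed a new striker",
]
print(rank_documents("interest rates decision", docs))
```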

How can I find text between two headings from docx in python

I want to extract information from a resume. For this, I have to identify the headings and take the text underneath each heading.
I think you need to be more specific about your issue and the approach you want to take. As of now, for heading extraction, you can first define a corpus of all the headings after reading the document in with Beautiful Soup. Once such a corpus is created, you can match it against the headings of the resume and get each section by defining its starting and ending points. Then you can match skills, etc., or whatever else you want to do with it.
This is the simplest approach based on your current question. Be more specific so I can guide you with a more precise approach.
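A rough sketch of the "match against a corpus of headings" step might look like this. It works on a plain list of paragraph strings, so how the .docx is read is left open. The split_by_headings helper, the HEADINGS set, and the sample resume lines are all made up for illustration.

```python
def split_by_headings(paragraphs, known_headings):
    """Group paragraph texts under the most recent known heading.

    paragraphs: paragraph strings in document order.
    known_headings: set of lowercase heading names (the heading corpus).
    Text before the first recognized heading is ignored.
    """
    sections = {}
    current = None
    for text in paragraphs:
        stripped = text.strip()
        if not stripped:
            continue  # skip empty paragraphs
        name = stripped.rstrip(":")
        if name.lower() in known_headings:
            current = name  # a new section starts here
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(stripped)
    return sections

# Assumption: with the python-docx package the paragraph texts could be read as
#   [p.text for p in docx.Document("resume.docx").paragraphs]
HEADINGS = {"education", "skills", "experience"}
paragraphs = ["Jane Doe", "Education", "BSc Physics, 2015", "Skills:", "Python",
              "SQL", "", "Experience", "Data analyst, 2016-2020"]
print(split_by_headings(paragraphs, HEADINGS))
```

Each section runs from its heading to the next recognized heading, which is the starting and ending point mentioned above.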
Best,

Alteryx Analyse the similarity of the words

I am currently building a chart of the top 10 types of fault. The user keys in what the fault is about, e.g. light bulb fused. As it is a free-text box, the wording may not be the same each time. Is there any way to make Alteryx understand that some entries mean the same thing, so that I can find the top 10 types of fault? Thank you.
You have a couple of options. You can use the Fuzzy Match tools in the Join category to sort out slight spelling mistakes. You can find Alteryx examples of Fuzzy Match on YouTube.
You can also use the Record ID tool followed by Text to Columns (Split to Rows based on space) to get a list of single words.
For what you are trying to do, I would advise building up a lookup table. You can then use the Find-Replace Tool to append the category from the lookup depending on the words that are found.
How clean your data is and how different the categories are will determine how far down the above paths you should go.

Excel: Search for keywords within a matrix to build a relation in another

First of all, thanks to all who read through. It's not easy to explain in words, as I am not allowed to post pictures yet.
Here we go:
Imagine 2 sheets within an excel document.
Sheet1: The Matrix
In Column A you'll find a 'description text' for a specific problem
Column B (and the following columns) are filled with further information about the problem, but we don't need them for my current problem.
Sheet2: Relation
I've got another table with Error-Description and IDs. Unfortunately, those who write the description text don't use the same sentences as I do for my error-description. But I need to get a relation to further analyze the problem.
To do so, I created another sheet with 'keywords' and the ID of the error.
After trying several things, I am not able to....: (Now the problem starts, for all the tl;dr readers)
HOW do I build up the relation. Keywords are set as
"*test*" "*whaterrorcustomer*" and so on.
While the description text is more like
"We had a test here but it failed" and so on..
I want to add a column in the first sheet (sheet1: Matrix) with the ID.
So I need to search within the description text for the keywords and return the ID.
Any idea how I could do this?
Edit:
Here is a picture (just got my rep. points):
I want the text to be related to the ID by looking for the keywords.
I am very thankful for ANY help.
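For those who face the same problem: I got it solved.
=LOOKUP(2;1/SEARCH(Relation!A$2:A$5;Sheet1!A2);Relation!B$2:B$5)
SEARCH returns a position for every keyword that occurs in the description and an error for every keyword that does not, so 1/SEARCH(...) is a number for matches and an error otherwise. LOOKUP ignores the errors and, because 2 is larger than any of those numbers, returns the ID next to the last keyword that matched. The keyword range (Relation!A$2:A$5) and the ID range (Relation!B$2:B$5) should cover the same rows.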
