Multi-class text classification when having only one sample per class

I have a dataset of texts, each identified by an ID number. For upcoming new texts, I would like to predict the best-matching ID number. I considered multi-class text classification, but I am not sure it is the right approach, since most ID numbers have only one text; in that case, I wouldn't have any test set. Can up-sampling help? Or is there another approach besides classification for such a problem?
The data set looks like this:
id1 'text1',
id2 'text2',
id3 'text3',
id3 'text4',
id3 'text5',
id4 'text6',
.
.
id200 'text170'
I would appreciate any guidance to find the best approach for this problem.
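One alternative to classification that the question hints at is to treat this as retrieval: index every known text and return the ID of the most similar one, so a single example per ID is enough. A minimal sketch under made-up data (the `ids`/`texts` values are invented for illustration), using `TfidfVectorizer` and `NearestNeighbors` from scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# toy stand-ins for the id/text pairs in the question
ids = ["id1", "id2", "id3"]
texts = ["invoice and billing question",
         "parcel shipping delay",
         "how to return a broken item"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(texts)
nn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(matrix)

# a new text is matched to the ID of its nearest known text
_, idx = nn.kneighbors(vectorizer.transform(["my parcel shipping is delayed"]))
print(ids[idx[0][0]])
```

With this framing there is no train/test split to worry about; each new text is simply assigned the ID of its nearest neighbour.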

Related

Is there a logical function in Excel to extract unique text values from a range of similar texts?

I am working on a dataset with thousands of rows of text entries captured in different styles, as shown in the table below:
**School Name**
Abirem school
Abirem sec School
Abirem Secondary school
Abirem second. School
Metropolitan elementary
Metropolitan Element.
Metropolitan ele
I need help extracting the unique values within each group of similar entries, regardless of the style in which they were entered. The output I want should look like this:
**School Name**
Abirem school
Metropolitan elementary
I have tried the functions EXACT, UNIQUE, MATCH and even XLOOKUP (with the wildcard option), but none of them gives me the output I want.
Is there a logical function that can be used?
This will prove to be tricky. Excel cannot know whether or not two different names that look similar are actually meant to refer to the same thing. Even for us humans it is not trivial: would School1 ABC be the same as School1 DEF or not? Without knowing the geographical locations of these two schools, they could well be two different schools that merely share the first word of their names.
Either way, if you are willing to accept this ambiguity, you could match on the first word of each line and return only the first entry of each group:
Formula in C1:
=LET(a,A1:A7,UNIQUE(XLOOKUP(TEXTSPLIT(a," ")&" *",a&" ",a,,2)))
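For readers outside Excel, the same first-word grouping idea can be sketched in Python (the `entries` list simply copies the school names from the question; keeping the first spelling seen per group is an assumption of this sketch):

```python
entries = [
    "Abirem school", "Abirem sec School", "Abirem Secondary school",
    "Abirem second. School", "Metropolitan elementary",
    "Metropolitan Element.", "Metropolitan ele",
]

unique = {}
for entry in entries:
    # group case-insensitively on the first word, keep the first spelling seen
    unique.setdefault(entry.split()[0].lower(), entry)

print(list(unique.values()))
```

This carries the same ambiguity as the formula: two genuinely different schools sharing a first word would be collapsed into one.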

Trying to create a table where the columns are IDs, and the rows contain yes/no based on whether or not specific text is linked to the ID

I would like the result to look something like this.
         ID1    ID2    ID3
Text 1   yes    yes    no
Text 2   yes    no     no
Text 3   no     yes    yes
And I have a table that looks like this. Notice that the text/strings can exist under multiple different IDs. Is there a way to use index/match for this? I've tried to use index/match, but to be honest I don't understand it that well.
ID1      ID2      ID3
Text 1   Text 1   Text 3
Text 2   Text 3   null
null     null     null
See above. I tried to use index/match, and it's just not making sense to me.
Edit: updated the tables so that values match. Sorry!
Use INDEX with a nested MATCH to return the correct column, MATCH the text within that column, then test whether the result is numeric:
=IF(ISNUMBER(MATCH($E2,INDEX($A$2:$C$4,0,MATCH(F$1,$A$1:$C$1,0)),0)),"yes","no")
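Outside Excel, the same lookup can be sketched with pandas (the column contents are taken from the tables above; `source`, `texts` and `result` are names chosen for this example):

```python
import pandas as pd

# the input table: each ID column lists the texts linked to that ID
source = pd.DataFrame({
    "ID1": ["Text 1", "Text 2", None],
    "ID2": ["Text 1", "Text 3", None],
    "ID3": ["Text 3", None, None],
})
texts = ["Text 1", "Text 2", "Text 3"]

# "yes" where the text appears anywhere in that ID's column, else "no"
result = pd.DataFrame(
    {col: ["yes" if t in source[col].values else "no" for t in texts]
     for col in source.columns},
    index=texts,
)
print(result)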

Dataframe text column with spelling mistakes

Sorry for bad title, I wasn't sure how best to describe the issue.
I have a dataframe with a column for occupation, df['occupation']; users can enter their occupation during signup using whatever terms they like.
I'm attempting to do an EDA on the column; however, I'm not sure how to clean it to get it from this,
Occupation
a-level student
a level
alavls
university physics student
physics student
6th form student
builder
Into something like this,
Occupation
a-levels
University student
Full time employment
Without writing out hundreds of lines renaming each unique entry.
TYIA
Any help or links to useful modules would be great.
The simplest way to do this is to apply a function that measures the similarity between two strings. There are plenty of similarity measures that could be used in this context, like the Hamming distance; however, they are all relatively limited, and you might be forced at some point, if in production, to use a machine learning model for this task.
import pandas as pd

def hamming_distance(chaine1, chaine2):
    """Measure the dissimilarity between two strings.
    Note: this is very limited, as it only compares characters position by
    position (any tail beyond the shorter string is ignored).
    """
    return sum(c1 != c2 for c1, c2 in zip(chaine1, chaine2))

OCCUPATIONS = ["Occupation", "a-levels", "University student", "Full time employment"]

def get_most_similar(ocup, OCCUPATIONS):
    """Return the occupation from the unique values OCCUPATIONS most similar to the entry ocup."""
    return min([(oc, hamming_distance(ocup.lower(), oc.lower())) for oc in OCCUPATIONS],
               key=lambda item: item[1])[0]

column = ["Occupation", "a-level student", "a level", "alavls",
          "university physics student", "physics student", "6th form student", "builder"]
df = pd.DataFrame(column, columns=['occupation'])  # a reconstruction of your dataframe; you probably don't need this line
df['occupation'] = df['occupation'].apply(lambda ocup: get_most_similar(ocup, OCCUPATIONS))
df.head(100)
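As a standard-library alternative to the positional Hamming comparison above, `difflib` handles insertions and deletions better; a minimal sketch, reusing the target categories from the answer (the `closest` helper is invented here for illustration):

```python
import difflib

OCCUPATIONS = ["a-levels", "University student", "Full time employment"]

def closest(entry):
    # get_close_matches ranks candidates by SequenceMatcher ratio;
    # cutoff=0.0 forces a match even for poor entries
    lowered = [o.lower() for o in OCCUPATIONS]
    match = difflib.get_close_matches(entry.lower(), lowered, n=1, cutoff=0.0)[0]
    return OCCUPATIONS[lowered.index(match)]

print(closest("a level"))
```

Like the Hamming version, this still forces every entry into one of the listed categories, so genuinely new occupations would be mislabelled.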
Hi, an approach you could possibly use for this problem is similar to the solution covered here: using applymap with regex.
The regex approach will allow you to use wildcards for cases you have not yet seen in your dataset.
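As an illustrative sketch of that regex idea (the sample values and the `a[- ]?l` pattern are invented for this example, not taken from the linked solution):

```python
import pandas as pd

df = pd.DataFrame({"occupation": ["a-level student", "a level", "alavls", "builder"]})

# one case-insensitive pattern normalises several misspelled variants at once
df["occupation"] = df["occupation"].str.replace(r"(?i)^a[- ]?l.*", "a-levels", regex=True)
print(df["occupation"].tolist())
```

A handful of such patterns can cover large families of variants without renaming each unique entry by hand.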

What is the input for adjusted_rand_score and adjusted_mutual_info_score?

I read the documentation and I get that it is labels_true and labels_pred but in what form do they have to be?
For labels_pred I use kmeans.labels_, which is a numpy array containing 0, 1, 2 or 3.
For labels_true I use top100.Genre; top100 is a dataframe containing movies with different genres, and there are 4 genres in total. I read somewhere that I can just pass in these two label sets, and that the two score measures will find a way to assign the numbers to a genre.
Is this true? The values I get are slightly above 0 (like 0.014) so that doesn't say much to me.
Or do I have to convert top100.Genre to a numpy array as well and change the names of the genres to 0, 1, 2 and 3? And how would I assign each number?
I think the score measures will find a way to match the assigned labels to the predicted ones. As per the scikit-learn documentation, the input should be array-like; you can assign class labels for the genres such as 0, 1, 2, 3 like you mentioned, and they can be assigned randomly, since both metrics are invariant to permutations of the label values.
Also note that the scikit-learn documentation says random label assignments get a score close to 0.0, while perfect agreement scores 1.0. A score of 0.014 therefore suggests your clustering agrees with the genres only slightly better than chance.
The link below has some good information:
https://esigma6.wordpress.com/2018/11/03/2-3-9-clustering-performance-evaluation/
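A minimal illustration (with toy labels invented here) showing that both metrics accept raw label arrays directly, so there is no need to manually map genre names to numbers:

```python
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

labels_true = ["Drama", "Drama", "Comedy", "Horror"]   # e.g. top100.Genre values
labels_pred = [0, 0, 1, 2]                             # e.g. kmeans.labels_

# identical partitions score 1.0 regardless of how the classes are named
print(adjusted_rand_score(labels_true, labels_pred))
print(adjusted_mutual_info_score(labels_true, labels_pred))
```

Renaming or permuting either side's labels leaves both scores unchanged, which is exactly why a score near 0 signals chance-level agreement rather than a labelling mismatch.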

Which technique for training should I use for the following dataset?

I have a dataset, a CSV with 2 columns: "Text" and "Name".
The "Text" column contains the news article.
The "Name" column contains the name extracted from the corresponding text.
I have to train a model on this dataset, which contains 4000+ unique news articles, so that once the model is trained and validated, a user can pass any text and it fetches the proper name.
What technique should I use, and how should I implement it? Please suggest.
Thanks in advance.
It sounds like you are looking to search for an item by keywords. In a basic case you could use a bag-of-words approach, in which you tokenise the words in the Text field and index each document accordingly.
The relevance of each document can then be calculated given some measure (for instance cosine similarity).
You can find an example using the gensim library here: https://radimrehurek.com/gensim/tut3.html
It is quite basic; note, however, that it does use LSI.
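The same bag-of-words retrieval idea can be sketched with scikit-learn instead of gensim (the articles and names below are invented stand-ins for the "Text" and "Name" columns):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# toy stand-ins for the "Text" and "Name" columns
texts = [
    "the mayor opened the new bridge in springfield",
    "the striker scored twice in the cup final",
]
names = ["Jane Mayor", "John Striker"]

vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(texts)

# score the query against every indexed article and return the best name
query = vectorizer.transform(["who scored in the cup final"])
best = cosine_similarity(query, index).argmax()
print(names[best])
```

Cosine similarity here plays the role of the relevance measure mentioned above; swapping in an LSI projection, as the gensim tutorial does, changes the vector space but not the retrieval logic.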
