Format of the input dataset for Google AutoML Natural Language multi-label text classification - google-cloud-nl

What should the format of the input dataset be for Google AutoML Natural Language multi-label text classification? I know that for multi-class classification I need a column of text and another column for labels, where the labels column includes one label per row.
I have multiple labels for each text and I want to do multi-label classification. I tried having one column per label with one-hot encoding, but I got this error message:
Max 1000 labels supported. Found 9823 labels.

It was very confusing at first, but I eventually found the format in the documentation: a CSV file like:
text1, label1, label2
text2, label2
text3, label3, label2, label1
The parser doesn't understand a table with NULL cells saved as a standard CSV file, like:
text1, label1, label2,
text2, label2,,
text3, label3, label2, label1
I had to manually remove the extra commas from the CSV file generated by pandas.
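One way to produce that ragged format directly, sketched with the standard csv module (the file name and rows are illustrative): writing rows of varying length avoids the trailing commas that a fixed-width pandas DataFrame pads in for empty cells.

```python
import csv

# Hypothetical rows: each text followed by a variable number of labels.
rows = [
    ["text1", "label1", "label2"],
    ["text2", "label2"],
    ["text3", "label3", "label2", "label1"],
]

# csv.writer emits exactly the fields given per row, so short rows
# get no trailing commas for missing label cells.
with open("automl_dataset.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```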

Google AutoML has since updated its parser. The following format is now accepted:
text1, label1, label2, label3,
text1, label1, label2, ,
text1, label1, label2, , ,
At least, that worked for me as of 27 January 2019.

One column per label is the way to go. If you have fewer than 1000 labels, you probably have a mistake in your CSV file that is confusing the parser into treating some tokens in the example text as labels. Make sure your text is correctly escaped with surrounding quotes.
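To illustrate the quoting point, a minimal sketch with the standard csv module (the example row is made up): any text field that itself contains commas must be quoted, or a parser will read the fragments after each comma as extra labels.

```python
import csv
import io

# A text containing commas, plus two labels.
row = ["This movie was long, slow, and boring", "negative", "review"]

buf = io.StringIO()
# QUOTE_MINIMAL (the default) quotes only fields that need it,
# i.e. those containing the delimiter or quote characters.
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerow(row)
print(buf.getvalue().strip())
# '"This movie was long, slow, and boring",negative,review'
```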

Related

How to preprocess string columns of NSL-KDD?

I have the NSL-KDD dataset, which has some string columns.
I have checked some code on GitHub, most of which just uses a one-hot encoder to transform them, as follows.
I want to know whether I can transform these string columns into numerical columns (3 columns rather than 84 one-hot encoded columns). After all, one-hot encoding creates a very wide, sparse space, which may hurt training accuracy.
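One way to get 3 numeric columns instead of ~84 one-hot columns is integer (ordinal) encoding, sketched below with pandas on toy stand-in data (the values are invented; in NSL-KDD the string columns are protocol_type, service, and flag). Note the caveat: integer codes impose an artificial ordering that tree-based models tolerate but linear models may not.

```python
import pandas as pd

# Toy stand-in for the three NSL-KDD string columns.
df = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "icmp", "tcp"],
    "service": ["http", "domain_u", "ecr_i", "ftp"],
    "flag": ["SF", "SF", "REJ", "S0"],
})

# Integer-encode each string column in place: one numeric column per
# original column (3 total) instead of a wide one-hot matrix.
for col in ["protocol_type", "service", "flag"]:
    df[col] = pd.factorize(df[col])[0]

print(df.dtypes)
```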

Get closest text entry of a list from a string

I am trying to build an RNN model for text classification, and I am currently building my dataset.
I am trying to do some of the work automatically, using an API that returns some information for each text I send to it.
So basically:
For each text in my dataframe, I have a df['label'] entry that contains a 1-to-3-word string.
I have a list of vocabulary terms (my future classes), and for each df['label'] item I want to assign the vocabulary item that is closest in meaning.
So I need to measure how close in meaning each label is to the items in my vocabulary list.
Any help?
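A minimal starting point, sketched with the standard library (the vocabulary and label are made up): difflib matches on character overlap rather than meaning, so for true semantic closeness you would instead compare word embeddings (e.g. spaCy vectors or sentence-transformers) with cosine similarity.

```python
import difflib

# Hypothetical vocabulary (future classes) and one free-text label.
vocabulary = ["politics", "sports", "technology", "economy"]
label = "sport news"

# get_close_matches ranks candidates by string similarity;
# cutoff=0.0 guarantees at least one match is returned.
matches = difflib.get_close_matches(label, vocabulary, n=1, cutoff=0.0)
print(matches[0])  # "sports"
```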

Data missing when importing a text file into excel

I'm trying to import a text file of CSV data into Excel. The data contains mostly integers, but there's one column with strings. I'm using the Data tab of Excel Professional Plus 2019. However, when I select comma as the delimiter, I lose 5 of the 16 columns, starting with the one containing strings. The data looks like the row below; the date and the 7 numbers are in their own columns (just whitespace separated). Can anyone help or explain? Many thanks.
2143, Wed, 6,Jul,2016, 38,20,03,39,01,24,04, 2198488, 0, Lancelot , 6
full data is on https://github.com/CH220/textfileforexcel
Your problem stems from the very first line of data in your text file:
40,03,52,02,07,20,14, 13137760, 1, Lancelot , 7
As you can see, there are only eleven "segments". Hence, when you try to use the import dialog to separate by comma, there will only be 11 columns even though subsequent rows have 16 columns.
Possible solutions:

1. Correct the text file so the first line has the desired number of segments.
2. Import with the comma delimiter as you did, then Transform, and edit the second line of the generated M-code in the Advanced Editor to change Columns=11 to Columns=16:
Source = Csv.Document(File.Contents("C:\Users\ron\Desktop\new 2.txt"),[Delimiter=",", Columns=16, Encoding=1252]),
3. Change the Fixed Width "argument" from 0,23 to 0, then Transform and Split Column by Delimiter (using the comma) in Power Query.

To me, the "best" way would be to correct the text file.
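The same ragged-first-line problem can be handled in pandas by supplying explicit column names, which forces the full column count even when the first row is short (a sketch with in-memory data mimicking the file; the generic column names are invented):

```python
import io
import pandas as pd

# First line has only 11 comma-separated fields; the second has 16.
raw = """40,03,52,02,07,20,14, 13137760, 1, Lancelot , 7
2143, Wed, 6,Jul,2016, 38,20,03,39,01,24,04, 2198488, 0, Lancelot , 6
"""

# Explicit names make pandas allocate all 16 columns and pad the
# short first row with NaN instead of inferring 11 columns.
names = [f"col{i}" for i in range(16)]
df = pd.read_csv(io.StringIO(raw), header=None, names=names,
                 skipinitialspace=True)
print(df.shape)  # (2, 16)
```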

how to read text from excel file in python pandas?

I am working on an Excel file with large text data. Two columns have a lot of text, like descriptions and job duties.
When I import my file in Python with df=pd.read_excel("form1.xlsx"), it shows the columns with text data as NaN.
How do I import all the text in those columns?
I want to do analysis on job title, description, and job duties. Description and job title are long text. I have over 150 rows.
Try converting the file from .xlsx to .csv.
I had the same problem with text columns, so I tried converting to CSV (comma delimited) and it worked. Not very elegant, but worth a try.
You can pass a dictionary of column names and datatypes to read_excel with the dtype keyword:
col_dict = {'a': str, 'b': int}
pd.read_excel("form1.xlsx", dtype=col_dict)
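Combining the two suggestions: after converting to CSV, the same dtype mapping works with read_csv. A self-contained sketch with in-memory data (the column names and values are invented):

```python
import io
import pandas as pd

# In-memory stand-in for the converted CSV file.
raw = "job_title,description,years\nAnalyst,Long description text,3\n"

# Force text columns to str so pandas never coerces them to NaN,
# while keeping numeric columns numeric.
col_dict = {"job_title": str, "description": str, "years": int}
df = pd.read_csv(io.StringIO(raw), dtype=col_dict)
print(df["years"].dtype)  # int64
```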

how to create and label feature set in nltk

I want to do sentiment analysis on Twitter tweets. I have a test dataset in a CSV file with a tweet column as input, but how do I create the training data, how do I label it, and how do I use it as a feature set?
Please correct me if I'm wrong here: do I have to manually create the feature set and label each word?
So far I have applied word tokenization, stop-word removal, and Porter stemming to my test dataset. From here I want to apply my training set and get positive/negative results for each tweet. Please point me in the right direction.
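You label each tweet (not each word), then turn each tweet into a feature dict. A minimal sketch of NLTK-style feature sets in plain Python (the example tweets and labels are made up); with nltk installed, train_set in this shape can be passed straight to nltk.NaiveBayesClassifier.train:

```python
# Hand-labeled training examples: (tweet text, sentiment label).
labeled_tweets = [
    ("i love this phone", "positive"),
    ("worst service ever", "negative"),
]

def tweet_features(text):
    # Word-presence features, the format used in the NLTK book's
    # classification examples: one boolean feature per word.
    return {f"contains({word})": True for word in text.split()}

# Each training item is a (feature_dict, label) pair.
train_set = [(tweet_features(text), label) for text, label in labeled_tweets]
print(train_set[0][1])  # "positive"
```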
