I have an NSL-KDD dataset that contains some string columns.
I have checked some code on GitHub, and most of it just uses a one-hot encoder to transform them, as follows.
I want to know whether I can transform these string columns into 3 numerical columns rather than 84 one-hot encoded columns. After all, one-hot encoding creates a very wide, sparse feature space, which may hurt training accuracy.
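One way to keep three columns is ordinal (label) encoding: NSL-KDD's string features are protocol_type, service, and flag, and each can be mapped to a single integer column. A minimal sketch with scikit-learn, using made-up toy values; note that ordinal codes impose an arbitrary order on the categories, which tree-based models tolerate better than linear or distance-based models do:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy rows mimicking NSL-KDD's three string columns (values are illustrative)
df = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "icmp", "tcp"],
    "service": ["http", "domain_u", "eco_i", "ftp"],
    "flag": ["SF", "SF", "REJ", "S0"],
})

cat_cols = ["protocol_type", "service", "flag"]
enc = OrdinalEncoder()
# Each string column becomes one numeric column instead of many one-hot columns
df[cat_cols] = enc.fit_transform(df[cat_cols])
print(df.shape)  # (4, 3): still three columns, now numeric codes
```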
I am trying to build an RNN model for text classification, and I am currently building my dataset.
I am trying to automate some of the work, and I'm using an API that returns some information for each text I send to it.
So basically:
For each text in my dataframe, I have a df['label'] entry that contains a 1-to-3-word string.
I have a vocabulary list (my future classes), and for each df['label'] item I want to assign one item from the vocabulary list, depending on which is closest in meaning.
So I need to measure how close in meaning each label is to each item in my vocabulary list.
Any help?
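As a rough baseline, you can score each label against every vocabulary item with the standard library's difflib and keep the best match; the vocabulary below is made up. This measures string overlap, not meaning, so for genuine semantic closeness you would swap the scoring function for cosine similarity over pretrained embeddings (e.g. spaCy or sentence-transformers), keeping the same selection logic:

```python
from difflib import SequenceMatcher

vocab = ["finance", "sports", "politics"]  # hypothetical class list

def closest_class(label, vocab):
    # Return the vocabulary item with the highest string-similarity score.
    return max(vocab,
               key=lambda v: SequenceMatcher(None, label.lower(), v.lower()).ratio())

print(closest_class("sport news", vocab))  # prints "sports"
```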
I'm trying to import a text file of CSV data into Excel. The data contains mostly integers, but there's one column with strings. I'm using the Data tab of Excel Professional Plus 2019. However, when I select comma as the delimiter I lose 5 of the 16 columns, starting with the one containing strings. The data looks like the sample below; the date and the 7 numbers are in their own columns (just whitespace separated). Can anyone help or explain? Many thanks.
2143, Wed, 6,Jul,2016, 38,20,03,39,01,24,04, 2198488, 0, Lancelot , 6
The full data is at https://github.com/CH220/textfileforexcel
Your problem stems from the very first line of data in your text file:
40,03,52,02,07,20,14, 13137760, 1, Lancelot , 7
As you can see, there are only eleven "segments". Hence, when you try to use the import dialog to separate by comma, there will only be 11 columns even though subsequent rows have 16 columns.
Possible solutions:

1. Correct the text file so the first line has the desired number of segments.

2. Change the Import Dialog, as you did, to comma, then Transform, then edit the second line of the generated M-code (in the Advanced Editor) to change Columns=11 to Columns=16:

   Source = Csv.Document(File.Contents("C:\Users\ron\Desktop\new 2.txt"),[Delimiter=",", Columns=16, Encoding=1252]),

3. Change the Fixed Width "argument" from 0,23 => 0, then Transform, then Split Column by delimiter (using the comma) in Power Query.

To me, the "best" way would be to correct the text file.
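As an aside, if you'd rather repair this outside Excel, pandas can be told up front to expect 16 columns, so the short first row is simply NaN-padded; a hedged sketch using the two sample rows from the thread inline (for the real file you would pass its path instead of the StringIO buffer):

```python
import io
import pandas as pd

# Two sample rows: the first has only 11 comma-separated fields, the second 16.
raw = """40,03,52,02,07,20,14, 13137760, 1, Lancelot , 7
2143, Wed, 6,Jul,2016, 38,20,03,39,01,24,04, 2198488, 0, Lancelot , 6
"""

# names=... forces 16 columns, so pandas pads short rows with NaN
# instead of inferring the width from the first line.
df = pd.read_csv(io.StringIO(raw), header=None, names=list(range(16)),
                 skipinitialspace=True)
print(df.shape)  # (2, 16)
```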
I am working on an Excel file with a lot of text data. Two columns contain long text, such as descriptions and job duties.
When I import the file in Python with df = pd.read_excel("form1.xlsx"), it shows the text columns as NaN.
How do I import all the text in those columns?
I want to do analysis on job title, description, and job duties. Description and job title are long text fields. I have over 150 rows.
Try converting the file from .xlsx to .csv.
I had the same problem with text columns, so I tried converting to CSV (comma delimited) and it worked. Not very elegant, but worth a try.
You can pass a dictionary of column names and datatypes to read_excel with the dtype keyword:
col_dict = {'a': str, 'b': int}
pd.read_excel("form1.xlsx", dtype=col_dict)
I want to do sentiment analysis on Twitter tweets. I have a test dataset in a CSV file with a tweet column as input, but how do I create training data, how do I label it, and how do I use it as a feature set?
Please correct me if I'm wrong here: do I have to manually create the feature set and label each word?
So far I have applied word tokenization, stop-word removal, and Porter stemming to my test dataset. From here I want to apply my training set and, as output, get positive and negative tweet results. Please point me in the right direction.
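You don't label individual words: you label whole tweets, and a vectorizer builds the feature set from them automatically. A minimal sketch with scikit-learn as one possible direction (the four hand-labelled training tweets are made up; in practice you would label a few hundred tweets yourself or reuse a public labelled corpus such as Sentiment140):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hand-labelled training set: one sentiment label per tweet, not per word.
train_texts = ["i love this", "great day", "this is awful", "i hate it"]
train_labels = ["positive", "positive", "negative", "negative"]

# TfidfVectorizer turns each tweet into a feature vector;
# LogisticRegression learns positive vs negative from those vectors.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

# Classify unseen tweets (your preprocessed test CSV column would go here).
print(clf.predict(["what a great product", "i hate this day"]))
```

Before trusting predictions on the unlabelled test CSV, hold out some of your labelled tweets and check accuracy on them.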