How to create a Pandas Dataframe from the comma separated values in txt file - python-3.x
I have a txt file that contains data in the following format:
"column1,column2,column3,column4,column5,column6,column7,column8"
"abc,abc,abc,10,datetime,abc,abc,abc"
"xyz,xyz,""xyz1,xyz2"",2,datetime2,xyz,xyz,xyz"
"xyz,xyz,""xyz1 , xyz2"",2,datetime2,xyz,xyz,xyz"
I want to convert it into a Pandas DataFrame with 8 columns, using row 1 as the header.
This is different from the usual DataFrame question.
I tried the following code:
df = pd.read_csv('tst.txt')
But the output was:
column1,column2,column3,column4,column5,column6,column7,column8
0 abc,abc,abc,10,datetime,abc,abc,abc
1 xyz,xyz,"xyz1,xyz2",2,datetime2,xyz,xyz,xyz
2 xyz,xyz,"xyz1 , xyz2",2,datetime2,xyz,xyz,xyz
I also tried other approaches, such as:
df1 = pd.DataFrame([line.replace(' , ','$$$').replace('"','').replace('\n','').split(',') for line in open('tst.txt')])
but the output was not what I expected:
0 1 2 3 4 5 6 7 8
0 column1 column2 column3 column4 column5 column6 column7 column8 None
1 abc abc abc 10 datetime abc abc abc None
2 xyz xyz xyz1 xyz2 2 datetime2 xyz xyz xyz
3 xyz xyz xyz1$$$xyz2 2 datetime2 xyz xyz xyz None
As you can see, there should be only 8 columns, not 9, and datetime should be in the 5th column.
The expected output is:
column1 column2 column3 column4 column5 column6 column7 column8
0 abc abc abc 10 datetime abc abc abc
1 xyz xyz xyz1,xyz2 2 datetime2 xyz xyz xyz
2 xyz xyz xyz1 , xyz2 2 datetime2 xyz xyz xyz
Try passing the quote character explicitly with quotechar:
df = pd.read_csv('tst.txt', quotechar='"', sep=',')
column1 column2 column3 column4 column5 column6 column7 column8
0 abc abc abc 10 datetime abc abc abc
1 xyz xyz xyz1,xyz2 2 datetime2 xyz xyz xyz
2 xyz xyz xyz1 , xyz2 2 datetime2 xyz xyz xyz
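One thing to watch for: quotechar='"' and sep=',' are already the defaults of read_csv, so if every line of the file really is wrapped in double quotes as shown in the question, the call above may still return a single column. Below is a minimal sketch of a two-pass workaround, assuming the file looks exactly like the sample. The inline raw string is a stand-in for the contents of tst.txt and is only there to make the example self-contained.

```python
import csv
import io

import pandas as pd

# Stand-in for the contents of tst.txt (assumption: matches the sample in the question).
raw = '''"column1,column2,column3,column4,column5,column6,column7,column8"
"abc,abc,abc,10,datetime,abc,abc,abc"
"xyz,xyz,""xyz1,xyz2"",2,datetime2,xyz,xyz,xyz"
"xyz,xyz,""xyz1 , xyz2"",2,datetime2,xyz,xyz,xyz"
'''

# Pass 1: each physical line is a single quoted CSV field, so the csv module
# strips the outer quotes and turns the doubled quotes back into single ones.
unwrapped = [row[0] for row in csv.reader(io.StringIO(raw)) if row]

# Pass 2: the unwrapped lines are ordinary CSV that read_csv can parse.
df = pd.read_csv(io.StringIO("\n".join(unwrapped)))
print(df)
```

With a real file, the first pass could read from open('tst.txt', newline='') instead of the in-memory string.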
Related
How can I replace the column names in a pandas dataframe
I have an Excel file like this:
Old_name  new_name
xyz       abc
opq       klm
And I have a dataframe like this:
Id  timestamp   xyz  opq
1   04-10-2021  3    4
2   05-10-2021  4    9
As you can see, the column names are the old names, and I would like to map and replace them with the new names from my csv file. How can I do that?
Try with rename:
df.rename(columns=col_names.set_index('Old_name')['new_name'], inplace=True)
# verify
print(df)
Output:
   Id   timestamp  abc  klm
0   1  04-10-2021    3    4
1   2  05-10-2021    4    9
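A self-contained sketch of the rename approach, assuming the mapping file has already been loaded into a dataframe named col_names (built inline here for illustration). Converting the mapping to a plain dict first is a choice made for this sketch to keep the lookup explicit.

```python
import pandas as pd

# Assumed contents of the mapping file, built inline for illustration.
col_names = pd.DataFrame({"Old_name": ["xyz", "opq"],
                          "new_name": ["abc", "klm"]})

df = pd.DataFrame({"Id": [1, 2],
                   "timestamp": ["04-10-2021", "05-10-2021"],
                   "xyz": [3, 4],
                   "opq": [4, 9]})

# Build an old -> new lookup; columns missing from the mapping are left unchanged.
mapping = col_names.set_index("Old_name")["new_name"].to_dict()
renamed = df.rename(columns=mapping)
print(renamed)
```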
Dropping columns with high missing values
I need to drop many columns from my dataframe because they have a high share of missing values. I have created a new dataframe that gives the missing count and the ratio of missing values for my original data set.
My original data set, data_merge2, looks like this:
A    B    C    D
123  ABC  X    Y
123  ABC  X    Y
NaN  ABC  NaN  NaN
123  ABC  NaN  NaN
245  ABC  NaN  NaN
345  ABC  NaN  NaN
The count data set, which gives the missing count and ratio, looks like this:
   missing_count  missing_ratio
C  4              0.10
D  4              0.66
The code I used to create the count data set:
# Only check the columns that have missing values, since there are a lot of columns
new_df = (data_merge2.isna()
          .sum()
          .to_frame('missing_count')
          .assign(missing_ratio = lambda x: x['missing_count']/len(data_merge2)*100)
          .loc[data_merge2.isna().any()]
)
print(new_df)
Now I want to drop the columns from the original dataframe whose missing ratio is above 50%. How should I achieve this?
Use:
data_merge2.loc[:, data_merge2.count().div(len(data_merge2)).ge(0.5)]
# Alternative
# df[df.columns[df.count().mul(2).gt(len(df))]]
or DataFrame.drop with the new_df DataFrame:
data_merge2.drop(columns=new_df.index[new_df['missing_ratio'].gt(50)])
Output:
       A    B
0  123.0  ABC
1  123.0  ABC
2    NaN  ABC
3  123.0  ABC
4  245.0  ABC
5  345.0  ABC
Another way, with query and XOR:
data_merge2[data_merge2.columns ^ new_df.query('missing_ratio>50').index]
Note: in newer pandas versions (2.0 and later), ^ on an Index no longer performs a set operation, so use Index.symmetric_difference there instead.
Or, the pandas way, using Index.difference:
data_merge2[data_merge2.columns.difference(new_df.query('missing_ratio>50').index)]
       A    B
0  123.0  ABC
1  123.0  ABC
2    NaN  ABC
3  123.0  ABC
4  245.0  ABC
5  345.0  ABC
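For reference, a compact sketch that skips the intermediate count dataframe. It assumes the sample data from the question and computes the missing share directly with isna().mean(), which is a fraction between 0 and 1 rather than a percentage.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question.
data_merge2 = pd.DataFrame({
    "A": [123, 123, np.nan, 123, 245, 345],
    "B": ["ABC"] * 6,
    "C": ["X", "X", np.nan, np.nan, np.nan, np.nan],
    "D": ["Y", "Y", np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column (0.0 to 1.0).
missing_ratio = data_merge2.isna().mean()

# Keep only the columns where at most half of the values are missing.
result = data_merge2.loc[:, missing_ratio <= 0.5]
print(result)
```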
How to get the column names along with the rows by using pivot stage in datastage
I have a table in this format:
ID   Sev1  Sev2  Sev3
ABC  0.45  1     1
PQR  0.45  1     2
XYZ  0.45  1     1
I want to change this to the new format below by using a horizontal pivot. How can I get the column names (severity) as well, along with the data?
Expected output:
ID   Severity  Values
ABC  Sev1      0.45
ABC  Sev2      1
ABC  Sev3      1
PQR  Sev1      0.45
PQR  Sev2      1
PQR  Sev3      2
XYZ  Sev1      0.45
XYZ  Sev2      1
XYZ  Sev3      1
That is, I want to bring the column names in as rows, alongside the corresponding values.
After converting with the horizontal pivot, I use a CDC stage to determine whether each record is an insert or an update, with CDC change code = 1 or 3. Sample data:
ID   Severity  Values  CDCCode
ABC  Sev1      0.45    1
ABC  Sev2      1       1
ABC  Sev3      1       1
PQR  Sev1      0.45    3
PQR  Sev2      1       3
PQR  Sev3      2       3
XYZ  Sev1      0.45    3
XYZ  Sev2      1       3
XYZ  Sev3      1       3
Once I get this output from the CDC stage (1 = insert record, 3 = update record), I want to convert it back using a vertical pivot.
Expected output after the vertical pivot:
CDC code = 1 (insert records; this is the output I am trying to get):
ID Sev1 Sev2 Sev3
ABC 0.45 1
PQR 1
CDC code = 3 (update records; this is the output I am trying to get):
ID Sev1 Sev2 Sev3
ABC 1
PQR 0.45 2
XYZ 0.45 1 1
Check the Pivot Index checkbox on the Pivot Properties tab. Then add the additional column to your output on the Mapping tab. This generates a column containing a number: the index of the column that was pivoted. In an additional step you can then transform the numbers back into column names.
Unstacking a pandas dataframe
Suppose I have a dataframe with two columns called 'column' and 'value' that looks like this:
Dataframe 1:
    column    value
0   column1   1
1   column2   1
2   column3   1
3   column4   1
4   column5   2
5   column6   1
6   column7   1
7   column8   1
8   column9   8
9   column10  2
10  column1   1
11  column2   1
12  column3   1
13  column4   3
14  column5   2
15  column6   1
16  column7   1
17  column8   1
18  column9   1
19  column10  2
20  column1   5
..  ...       ...
I want to transform this dataframe so that it looks like this:
Dataframe 2:
    column1  column2  column3  column4  column5  column6  column7  column8  column9  column10
0   1        1        1        1        2        1        1        1        8        2
1   1        1        1        3        2        1        1        1        1        2
2   5        ..       ..       ..       ..       ..       ..       ..       ..       ..
..  ..       ..       ..       ..       ..       ..       ..       ..       ..       ..
Now, I know how to do it the other way around. If you have a dataframe called df that looks like dataframe 2, you can stack it with the following code:
df = (df.stack().reset_index(level=0, drop=True).rename_axis(['column']).reset_index(name='value'))
Unfortunately, I don't know how to go back!
Question: How do I manipulate dataframe 1 (unstack it, if that's a word) so that it looks like dataframe 2?
Create a MultiIndex with set_index, using a counter Series from cumcount, and reshape with unstack:
g = df.groupby('column').cumcount()
df1 = df.set_index([g, 'column'])['value'].unstack(fill_value=0)
print (df1)
column  column1  column10  column2  column3  column4  column5  column6  \
0             1         2        1        1        1        2        1
1             1         2        1        1        3        2        1
2             5         0        0        0        0        0        0

column  column7  column8  column9
0             1        1        8
1             1        1        1
2             0        0        0
Finally, if you need the columns sorted by the numeric part of their names, use extract to get the integers, convert them, and get the column positions with argsort; then reorder with iloc:
df1 = df1.iloc[:, df1.columns.str.extract(r'(\d+)', expand=False).astype(int).argsort()]
print (df1)
column  column1  column2  column3  column4  column5  column6  column7  \
0             1        1        1        1        2        1        1
1             1        1        1        3        2        1        1
2             5        0        0        0        0        0        0

column  column8  column9  column10
0             1        8         2
1             1        1         2
2             0        0         0
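A runnable sketch of the same reshape on the sample data from the question. As an alternative to the regex-based sort, it restores the column order with reindex and the order of first appearance in the long frame; this is a swapped-in variation, not part of the answer above, and it only helps when the first appearances are already in the order you want.

```python
import pandas as pd

names = [f"column{i}" for i in range(1, 11)]

# Long-format sample data reconstructed from the question (21 rows).
df = pd.DataFrame({
    "column": names + names + ["column1"],
    "value": [1, 1, 1, 1, 2, 1, 1, 1, 8, 2]
             + [1, 1, 1, 3, 2, 1, 1, 1, 1, 2]
             + [5],
})

# Occurrence counter per column name: 0 for the first time a name appears,
# 1 for the second, and so on. It becomes the row label of the wide frame.
g = df.groupby("column").cumcount()
wide = df.set_index([g, "column"])["value"].unstack(fill_value=0)

# unstack sorts the labels as strings (column1, column10, column2, ...);
# reindex puts them back in order of first appearance.
wide = wide.reindex(columns=df["column"].unique())
print(wide)
```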
New column in a Pandas DataFrame based on duplicates in a given column
Hi, I have a dataframe with a column "id" like below:
id
abc
def
ghi
abc
abc
xyz
def
I need a new column "id1" with the number 1 appended to the id, and the number should be incremented for every duplicate. The output should look like this:
id   id1
abc  abc1
def  def1
ghi  ghi1
abc  abc2
abc  abc3
xyz  xyz1
def  def2
Can anyone suggest a solution for this?
Use groupby.cumcount to count the ids, add 1 and convert to strings:
df['id1'] = df['id'] + df.groupby('id').cumcount().add(1).astype(str)
print (df)
    id   id1
0  abc  abc1
1  def  def1
2  ghi  ghi1
3  abc  abc2
4  abc  abc3
5  xyz  xyz1
6  def  def2
Detail:
print (df.groupby('id').cumcount())
0    0
1    0
2    0
3    1
4    2
5    0
6    1
dtype: int64
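The same approach as a self-contained sketch, with the sample ids from the question built inline so it can be run directly:

```python
import pandas as pd

df = pd.DataFrame({"id": ["abc", "def", "ghi", "abc", "abc", "xyz", "def"]})

# cumcount numbers the rows inside each id group starting at 0,
# so adding 1 gives the running occurrence number of each id.
occurrence = df.groupby("id").cumcount().add(1)

# Append the occurrence number to the id as text.
df["id1"] = df["id"] + occurrence.astype(str)
print(df)
```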