How to create a pandas DataFrame from comma-separated values in a txt file - python-3.x

I have a txt file which contains data in the format below:
"column1,column2,column3,column4,column5,column6,column7,column8"
"abc,abc,abc,10,datetime,abc,abc,abc"
"xyz,xyz,""xyz1,xyz2"",2,datetime2,xyz,xyz,xyz"
"xyz,xyz,""xyz1 , xyz2"",2,datetime2,xyz,xyz,xyz"
I want to convert it into a pandas DataFrame with 8 columns, using row 1 as the header.
This is different from the usual DataFrame question.
I tried the following code:
df = pd.read_csv('tst.txt')
But output was
column1,column2,column3,column4,column5,column6,column7,column8
0 abc,abc,abc,10,datetime,abc,abc,abc
1 xyz,xyz,"xyz1,xyz2",2,datetime2,xyz,xyz,xyz
2 xyz,xyz,"xyz1 , xyz2",2,datetime2,xyz,xyz,xyz
I tried with other things as well like
df1 = pd.DataFrame([line.replace(' , ','$$$').replace('"','').replace('\n','').split(',') for line in open('tst.txt')])
but output was different and not expected
0 1 2 3 4 5 6 7 8
0 column1 column2 column3 column4 column5 column6 column7 column8 None
1 abc abc abc 10 datetime abc abc abc None
2 xyz xyz xyz1 xyz2 2 datetime2 xyz xyz xyz
3 xyz xyz xyz1$$$xyz2 2 datetime2 xyz xyz xyz None
So you can see that there should be only 8 columns, not 9.
datetime should be in the 5th column.
The actual output should look like this:
column1 column2 column3 column4 column5 column6 column7 column8
0 abc abc abc 10 datetime abc abc abc
1 xyz xyz xyz1,xyz2 2 datetime2 xyz xyz xyz
2 xyz xyz xyz1 , xyz2 2 datetime2 xyz xyz xyz

Try passing quotechar='"':
df=pd.read_csv('tst.txt', quotechar='"', sep=',')
column1 column2 column3 column4 column5 column6 column7 column8
0 abc abc abc 10 datetime abc abc abc
1 xyz xyz xyz1,xyz2 2 datetime2 xyz xyz xyz
2 xyz xyz xyz1 , xyz2 2 datetime2 xyz xyz xyz
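If every line of the file really is wrapped in an outer pair of double quotes, as the sample in the question shows, the CSV parser treats each whole line as one quoted field (the double quote is already the default quotechar). A minimal sketch, assuming that layout, parses each line twice with the standard csv module: the first pass removes the outer quotes, the second splits the inner text into fields. The inline string raw is only a stand-in for the contents of tst.txt.

```python
import csv
import io

import pandas as pd

# Stand-in for the contents of tst.txt (assumption: every line is wrapped in outer quotes).
raw = '''"column1,column2,column3,column4,column5,column6,column7,column8"
"abc,abc,abc,10,datetime,abc,abc,abc"
"xyz,xyz,""xyz1,xyz2"",2,datetime2,xyz,xyz,xyz"
"xyz,xyz,""xyz1 , xyz2"",2,datetime2,xyz,xyz,xyz"
'''

# First pass: each line is a single quoted field; reading it strips the outer quotes
# and turns the doubled quotes ("") back into single quotes.
outer = [row[0] for row in csv.reader(io.StringIO(raw)) if row]

# Second pass: split the unwrapped text into real fields, honouring the inner quotes.
rows = list(csv.reader(outer))

# The first parsed row is the header; all values are kept as strings.
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
```

With a real file, the first pass could read from open('tst.txt', newline='') instead of the StringIO object. Column types stay as strings here, so numeric columns would still need converting.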

Related

How can I replace the column names in a pandas DataFrame

I have an Excel file like this:
Old_name new_name
xyz abc
opq klm
And my DataFrame looks like this:
Id timestamp xyz opq
1 04-10-2021 3 4
2 05-10-2021 4 9
As you can see, the old names are my column names, and I would like to map and replace them with the new names from that file. How can I do that?
Try with rename:
df.rename(columns=col_names.set_index('Old_name')['new_name'], inplace=True)
# verify
print(df)
Output:
Id timestamp abc klm
0 1 04-10-2021 3 4
1 2 05-10-2021 4 9
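The answer does not show how col_names is created. A minimal sketch, assuming col_names is the mapping table from the question loaded into a DataFrame (built inline here; in practice it might come from something like pd.read_excel), turns it into a dict and passes that to rename:

```python
import pandas as pd

# Assumption: col_names holds the mapping table from the question.
col_names = pd.DataFrame({'Old_name': ['xyz', 'opq'],
                          'new_name': ['abc', 'klm']})

df = pd.DataFrame({'Id': [1, 2],
                   'timestamp': ['04-10-2021', '05-10-2021'],
                   'xyz': [3, 4],
                   'opq': [4, 9]})

# Build an {old: new} dict; columns not in the mapping are left unchanged.
mapping = col_names.set_index('Old_name')['new_name'].to_dict()
renamed = df.rename(columns=mapping)
print(renamed)
```

Returning a new DataFrame instead of using inplace=True keeps the original df available for comparison.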

Dropping columns with high missing values

I have a situation where I need to drop a lot of my dataframe columns where there are high missing values. I have created a new dataframe that gives me the missing values and the ratio of missing values from my original data set.
My original data set - data_merge2 looks like this :
A B C D
123 ABC X Y
123 ABC X Y
NaN ABC NaN NaN
123 ABC NaN NaN
245 ABC NaN NaN
345 ABC NaN NaN
The count data set, which gives me the missing count and ratio, looks like this:
missing_count missing_ratio
C 4 0.10
D 4 0.66
The code that I used to create the count data set looks like this:
#Only check those columns where there are missing values as we have got a lot of columns
new_df = (data_merge2.isna()
          .sum()
          .to_frame('missing_count')
          .assign(missing_ratio=lambda x: x['missing_count'] / len(data_merge2) * 100)
          .loc[data_merge2.isna().any()])
print(new_df)
Now I want to drop the columns from the original DataFrame whose missing ratio is greater than 50%.
How can I achieve this?
Use:
data_merge2.loc[:,data_merge2.count().div(len(data_merge2)).ge(0.5)]
#Alternative
#df[df.columns[df.count().mul(2).gt(len(df))]]
Or use DataFrame.drop with the new_df DataFrame:
data_merge2.drop(columns = new_df.index[new_df['missing_ratio'].gt(50)])
Output
A B
0 123.0 ABC
1 123.0 ABC
2 NaN ABC
3 123.0 ABC
4 245.0 ABC
5 345.0 ABC
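Another way uses query with a symmetric difference (XOR). Note that the ^ operator on an Index was deprecated as a set operation in pandas 1.x and performs an element-wise logical XOR since pandas 2.0, so Index.symmetric_difference is the safer spelling:
data_merge2[data_merge2.columns.symmetric_difference(new_df.query('missing_ratio>50').index)]
Or use Index.difference: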
data_merge2[data_merge2.columns.difference(new_df.query('missing_ratio>50').index)]
A B
0 123.0 ABC
1 123.0 ABC
2 NaN ABC
3 123.0 ABC
4 245.0 ABC
5 345.0 ABC
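For reference, the same idea can be checked end to end without the helper frame. This is a minimal sketch, assuming the sample data from the question and a threshold of 0.5 on the fraction (not percentage) of missing values per column; the names missing_ratio, to_drop and result are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
data_merge2 = pd.DataFrame({
    'A': [123, 123, np.nan, 123, 245, 345],
    'B': ['ABC'] * 6,
    'C': ['X', 'X', np.nan, np.nan, np.nan, np.nan],
    'D': ['Y', 'Y', np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column (mean of the boolean mask).
missing_ratio = data_merge2.isna().mean()

# Columns whose missing fraction exceeds 50%.
to_drop = missing_ratio[missing_ratio > 0.5].index

result = data_merge2.drop(columns=to_drop)
print(result)
```

Using isna().mean() gives the ratio directly, so there is no need to divide by len(data_merge2) separately.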

How to get the column names along with the rows by using the Pivot stage in DataStage

I have a table in format
Data:
ID Sev1 Sev2 Sev3
ABC 0.45 1 1
PQR 0.45 1 2
XYZ 0.45 1 1
I want to change this to the new format below by using a horizontal pivot. How can I get the column names (severity) as well, along with their data?
Expected Output:
ID Severity Values
ABC Sev1 0.45
ABC Sev2 1
ABC Sev3 1
PQR Sev1 0.45
PQR Sev2 1
PQR Sev3 2
XYZ Sev1 0.45
XYZ Sev2 1
XYZ Sev3 1
I want to bring the column names in as rows, alongside the corresponding values.
After converting with the horizontal pivot, I use a CDC stage to mark each record as an insert or an update, with CDC_Changecode = 1 or CdcChangeCode = 3. Please find sample data below:
ID Severity Values CDCCode
ABC Sev1 0.45 1
ABC Sev2 1 1
ABC Sev3 1 1
PQR Sev1 0.45 3
PQR Sev2 1 3
PQR Sev3 2 3
XYZ Sev1 0.45 3
XYZ Sev2 1 3
XYZ Sev3 1 3
Once I get this output from the CDC stage (1 = insert record, 3 = update record), I want to convert the rows back to columns using a vertical pivot. When
Expected output after vertical pivot
CDC Code = 1 (insert record; I am trying to get the data below as output)
ID Sev1 Sev2 Sev3
ABC 0.45 1
PQR 1
CDC Code = 3 (update record; I am trying to get the data below as output)
ID Sev1 Sev2 Sev3
ABC 1
PQR 0.45 2
XYZ 0.45 1 1
Check the Pivot Index checkbox on the Pivot Properties tab.
Then add the additional column to your output on the mapping tab.
This will generate a column with a number - the index of the column pivoted.
In an additional step you could then transform the numbers back into column names.
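This is not DataStage, but for comparison with the other questions on this page, the target shape of the horizontal pivot can be sketched in pandas. A minimal illustration, assuming the sample table from the question; melt keeps the original column name in the Severity column directly, which corresponds to the "numbers back into column names" step above.

```python
import pandas as pd

# Sample table from the question.
data = pd.DataFrame({'ID': ['ABC', 'PQR', 'XYZ'],
                     'Sev1': [0.45, 0.45, 0.45],
                     'Sev2': [1, 1, 1],
                     'Sev3': [1, 2, 1]})

# Wide to long: one row per (ID, severity column), keeping the column name.
long_df = data.melt(id_vars='ID', var_name='Severity', value_name='Values')

# Order rows by ID, then by severity name, to match the expected output.
long_df = long_df.sort_values(['ID', 'Severity']).reset_index(drop=True)
print(long_df)
```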

Unstacking a pandas dataframe

Suppose I have a dataframe with two columns called 'column' and 'value' that looks like this:
Dataframe 1:
column value
0 column1 1
1 column2 1
2 column3 1
3 column4 1
4 column5 2
5 column6 1
6 column7 1
7 column8 1
8 column9 8
9 column10 2
10 column1 1
11 column2 1
12 column3 1
13 column4 3
14 column5 2
15 column6 1
16 column7 1
17 column8 1
18 column9 1
19 column10 2
20 column1 5
.. ... ...
I want to transform this dataframe so that it looks like this:
Dataframe 2:
column1 column2 column3 column4 column5 column6 column7 column8 column9 column10
0 1 1 1 1 2 1 1 1 8 2
1 1 1 1 3 2 1 1 1 1 2
2 5 .. .. .. .. .. .. .. .. ..
.. .. .. .. .. .. .. .. .. .. ..
Now I know how to do it the other way around. If you have a dataframe called df that looks like dataframe 2 you can stack it with the following code:
df = (df.stack().reset_index(level=0, drop=True).rename_axis(['column']).reset_index(name='value'))
Unfortunately, I don't know how to go back!
Question: How do I manipulate dataframe 1 (unstack it, if that's a word) so that it looks like dataframe 2?
Create a MultiIndex with set_index, using a counter Series from cumcount as the first level, then reshape with unstack:
g = df.groupby('column').cumcount()
df1 = df.set_index([g, 'column'])['value'].unstack(fill_value=0)
print (df1)
column column1 column10 column2 column3 column4 column5 column6 \
0 1 2 1 1 1 2 1
1 1 2 1 1 3 2 1
2 5 0 0 0 0 0 0
column column7 column8 column9
0 1 1 8
1 1 1 1
2 0 0 0
Finally, if the columns need to be sorted by the numeric part of their names, use extract to pull out the integers, convert them, get the column positions with argsort, and reorder with iloc:
df1 = df1.iloc[:, df1.columns.str.extract(r'(\d+)', expand=False).astype(int).argsort()]
print (df1)
column column1 column2 column3 column4 column5 column6 column7 \
0 1 1 1 1 2 1 1
1 1 1 1 3 2 1 1
2 5 0 0 0 0 0 0
column column8 column9 column10
0 1 8 2
1 1 1 2
2 0 0 0
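The two steps can be run together on a small reconstruction of Dataframe 1. This is a minimal sketch, assuming the first 21 rows shown in the question; the names names, wide and order are only illustrative.

```python
import pandas as pd

# Reconstruct the first 21 rows of Dataframe 1 from the question.
names = [f'column{i}' for i in range(1, 11)]
df = pd.DataFrame({
    'column': names * 2 + ['column1'],
    'value': [1, 1, 1, 1, 2, 1, 1, 1, 8, 2,
              1, 1, 1, 3, 2, 1, 1, 1, 1, 2,
              5],
})

# Counter per column name: 0 for its first occurrence, 1 for the second, ...
g = df.groupby('column').cumcount()

# Use (counter, column name) as the index and pivot the names into columns.
# Missing combinations are filled with 0.
wide = df.set_index([g, 'column'])['value'].unstack(fill_value=0)

# Reorder columns by the number in their name (column1, column2, ..., column10).
order = wide.columns.str.extract(r'(\d+)', expand=False).astype(int).argsort()
wide = wide.iloc[:, order]
print(wide)
```

The counter from cumcount is what makes each (row, column) pair unique, which unstack requires.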

New column in a pandas DataFrame with respect to duplicates in a given column

Hi, I have a DataFrame with a column "id" like below:
id
abc
def
ghi
abc
abc
xyz
def
I need a new column "id1" with the number 1 appended to the id, and the number should be incremented for every duplicate. The output should look like this:
id id1
abc abc1
def def1
ghi ghi1
abc abc2
abc abc3
xyz xyz1
def def2
Can anyone suggest a solution for this?
Use groupby.cumcount to count the ids, add 1, and convert to strings:
df['id1'] = df['id'] + df.groupby('id').cumcount().add(1).astype(str)
print (df)
id id1
0 abc abc1
1 def def1
2 ghi ghi1
3 abc abc2
4 abc abc3
5 xyz xyz1
6 def def2
Detail:
print (df.groupby('id').cumcount())
0 0
1 0
2 0
3 1
4 2
5 0
6 1
dtype: int64
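A self-contained version of the above, assuming the sample ids from the question, can be used to check the result:

```python
import pandas as pd

# Sample ids from the question.
df = pd.DataFrame({'id': ['abc', 'def', 'ghi', 'abc', 'abc', 'xyz', 'def']})

# cumcount numbers each occurrence within its group starting at 0,
# so add 1 before converting to string and appending to the id.
occurrence = df.groupby('id').cumcount().add(1)
df['id1'] = df['id'] + occurrence.astype(str)
print(df)
```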
