Groupby id and get each string from an id, in each different column - pandas-groupby

Hello, I just want to group the elements by id and show each string in a separate column.
Original dataframe:
id|elements|
1|a
1|b
1|c
1|d
2|a
2|b
2|b
3|a
3|a
3|b
3|c
3|c
3|c
Desired output:
id|column1|column2|column3|column4|column5|column6|
1|a|b|c|d| | |
2|a|b|b| | | |
3|a|a|b|c|c|c|
Any ideas? Thank you very much in advance

Given your original data frame, you can simply do:
df.groupby('id').apply(lambda x: x['elements'].to_list()).apply(pd.Series)
Output:
0 1 2 3 4 5
id
1 a b c d NaN NaN
2 a b b NaN NaN NaN
3 a a b c c c
If you do not want id to be the index, use .reset_index().
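As an alternative sketch (not the approach from the answer above), the same wide layout can be built by numbering the rows inside each id with cumcount and then pivoting. The column names column1, column2, ... are only an assumption taken from the desired output in the question.

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3],
    "elements": list("abcdabbaabccc"),
})

# Number each row within its id group: 0, 1, 2, ...
pos = df.groupby("id").cumcount()

# Spread the values across columns, one column per position
wide = df.assign(pos=pos).pivot(index="id", columns="pos", values="elements")

# Hypothetical naming scheme following the desired output
wide.columns = [f"column{i + 1}" for i in wide.columns]
wide = wide.reset_index()
print(wide)
```

Shorter groups are padded with NaN, as in the output above.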

Try this
import pandas as pd
import numpy as np
F = {'id': [1,1,1,1,2,2,2,3,3,3,3,3], 'element': ['a','b','c','d','a','b','b','a','a','b','c','c']}
df = pd.DataFrame(data = F)
df2 = df.set_index('id').stack().groupby(level=[0,1]).apply(list).unstack()
df3 = pd.DataFrame(df2["element"].to_list(), columns=['element1', 'element2','element3', 'element4','element5'])

Related

How to map/replace multiple values in a column for each row in pandas dataframe

I have this sample
col1 result
1 A
1,2,3
2 B
2,3,4
3,4
4 D
1,3,4
3 C
Here's my map variable.
vals_to_replace = {'1':'A', '2':'B', '3':'C' , '4':'D'}
I map this to col1, but only some values show up in the result column. I'm not sure why only the single values got mapped.
Any ideas on how to solve it?
Thanks
Maybe this is what works for you:
import pandas as pd
df = pd.DataFrame({'col1': ['1', '1,2,3', '2', '2,3,4', '3, 4', '4', '1,3,4', '3']})
translation = {'1':'A', '2':'B', '3':'C' , '4':'D'}
df['result'] = df.col1.str.translate(str.maketrans(translation))
print(df)
Result:
col1 result
0 1 A
1 1,2,3 A,B,C
2 2 B
3 2,3,4 B,C,D
4 3, 4 C, D
5 4 D
6 1,3,4 A,C,D
7 3 C
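Note that str.translate replaces character by character, so it only fits when every key in the mapping is a single character. If the keys could be longer (for example '10'), a rough sketch would be to split each cell into tokens and look each one up. The helper name map_tokens is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["1", "1,2,3", "2", "2,3,4", "3, 4", "4", "1,3,4", "3"]})
vals_to_replace = {"1": "A", "2": "B", "3": "C", "4": "D"}

def map_tokens(cell, mapping):
    # Split on commas, strip spaces, and look up each token;
    # tokens missing from the mapping are kept unchanged
    tokens = [t.strip() for t in cell.split(",")]
    return ",".join(mapping.get(t, t) for t in tokens)

df["result"] = df["col1"].apply(map_tokens, mapping=vals_to_replace)
print(df)
```

Unlike the translate version, this normalizes '3, 4' to 'C,D' because the spaces are stripped before the lookup.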

Sum in Column based on condition in rows in pandas dataframe [duplicate]

I have a dataframe which I want to plot with matplotlib, but the index column is the time and I cannot plot it.
This is the dataframe (df3):
but when I try the following:
plt.plot(df3['magnetic_mag mean'], df3['YYYY-MO-DD HH-MI-SS_SSS'], label='FDI')
I'm getting an error obviously:
KeyError: 'YYYY-MO-DD HH-MI-SS_SSS'
So what I want to do is to add a new extra column to my dataframe (named 'Time') which is just a copy of the index column.
How can I do it?
This is the entire code:
#Importing the csv file into df
df = pd.read_csv('university2.csv', sep=";", skiprows=1)
#Changing datetime
df['YYYY-MO-DD HH-MI-SS_SSS'] = pd.to_datetime(df['YYYY-MO-DD HH-MI-SS_SSS'],
format='%Y-%m-%d %H:%M:%S:%f')
#Set index from column
df = df.set_index('YYYY-MO-DD HH-MI-SS_SSS')
#Add Magnetic Magnitude Column
df['magnetic_mag'] = np.sqrt(df['MAGNETIC FIELD X (μT)']**2 + df['MAGNETIC FIELD Y (μT)']**2 + df['MAGNETIC FIELD Z (μT)']**2)
#Subtract Earth's Average Magnetic Field from 'magnetic_mag'
df['magnetic_mag'] = df['magnetic_mag'] - 30
#Copy interesting values
df2 = df[[ 'ATMOSPHERIC PRESSURE (hPa)',
'TEMPERATURE (C)', 'magnetic_mag']].copy()
#Hourly Average and Standard Deviation for interesting values
df3 = df2.resample('H').agg(['mean','std'])
df3.columns = [' '.join(col) for col in df3.columns]
df3.reset_index()
plt.plot(df3['magnetic_mag mean'], df3['YYYY-MO-DD HH-MI-SS_SSS'], label='FDI')
Thank you !!
I think you need reset_index:
df3 = df3.reset_index()
Possible solution, but I think inplace is not good practice:
df3.reset_index(inplace=True)
But if you need new column, use:
df3['new'] = df3.index
I think you can use read_csv better:
df = pd.read_csv('university2.csv',
                 sep=";",
                 skiprows=1,
                 index_col='YYYY-MO-DD HH-MI-SS_SSS',
                 parse_dates=['YYYY-MO-DD HH-MI-SS_SSS']) #if this doesn't work, use pd.to_datetime
And then omit:
#Changing datetime
df['YYYY-MO-DD HH-MI-SS_SSS'] = pd.to_datetime(df['YYYY-MO-DD HH-MI-SS_SSS'],
format='%Y-%m-%d %H:%M:%S:%f')
#Set index from column
df = df.set_index('YYYY-MO-DD HH-MI-SS_SSS')
EDIT: If the MultiIndex or Index comes from a groupby operation, possible solutions are:
df = pd.DataFrame({'A':list('aaaabbbb'),
                   'B':list('ccddeeff'),
                   'C':range(8),
                   'D':range(4,12)})
print (df)
A B C D
0 a c 0 4
1 a c 1 5
2 a d 2 6
3 a d 3 7
4 b e 4 8
5 b e 5 9
6 b f 6 10
7 b f 7 11
df1 = df.groupby(['A','B']).sum()
print (df1)
      C   D
A B
a c   1   9
  d   5  13
b e   9  17
  f  13  21
Add parameter as_index=False:
df2 = df.groupby(['A','B'], as_index=False).sum()
print (df2)
A B C D
0 a c 1 9
1 a d 5 13
2 b e 9 17
3 b f 13 21
Or add reset_index:
df2 = df.groupby(['A','B']).sum().reset_index()
print (df2)
A B C D
0 a c 1 9
1 a d 5 13
2 b e 9 17
3 b f 13 21
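For the resample case in the question, here is a minimal sketch with made-up data. The index name 'time' and the values are hypothetical, and '60min' is used as the hourly frequency. The point is that reset_index returns a new frame, so its result has to be assigned back before the former index can be used as a column.

```python
import numpy as np
import pandas as pd

# Made-up readings every 20 minutes; the index name 'time' is hypothetical
idx = pd.date_range("2020-01-01", periods=6, freq="20min", name="time")
df2 = pd.DataFrame({"magnetic_mag": np.arange(6.0)}, index=idx)

# Hourly mean and standard deviation, then flatten the two-level column names
df3 = df2.resample("60min").agg(["mean", "std"])
df3.columns = [" ".join(col) for col in df3.columns]

# Assign the result; calling df3.reset_index() alone leaves df3 unchanged
df3 = df3.reset_index()
print(df3.columns.tolist())
```

After this, df3['time'] can be passed to plt.plot like any other column.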
You can access the index directly and plot it. Here is an example:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
#Get index in horizontal axis
plt.plot(df.index, df[0])
plt.show()
#Get index in vertical axis
plt.plot(df[0], df.index)
plt.show()
You can also use eval to achieve this:
In [2]: df = pd.DataFrame({'num': range(5), 'date': pd.date_range('2022-06-30', '2022-07-04')}, index=list('ABCDE'))
In [3]: df
Out[3]:
num date
A 0 2022-06-30
B 1 2022-07-01
C 2 2022-07-02
D 3 2022-07-03
E 4 2022-07-04
In [4]: df.eval('index_copy = index')
Out[4]:
num date index_copy
A 0 2022-06-30 A
B 1 2022-07-01 B
C 2 2022-07-02 C
D 3 2022-07-03 D
E 4 2022-07-04 E

Delete row from dataframe having "None" value in all the columns - Python

I need to delete the row completely in a dataframe having "None" value in all the columns. I am using the following code -
df.dropna(axis=0,how='all',thresh=None,subset=None,inplace=True)
This does not bring any difference to the dataframe. The rows with "None" value are still there.
How to achieve this?
These Nones are probably strings, so use replace first:
df = df.replace('None', np.nan).dropna(how='all')
df = pd.DataFrame({
'a':['None','a', 'None'],
'b':['None','g', 'None'],
'c':['None','v', 'b'],
})
print (df)
a b c
0 None None None
1 a g v
2 None None b
df1 = df.replace('None', np.nan).dropna(how='all')
print (df1)
a b c
1 a g v
2 NaN NaN b
Or test for values not equal to 'None' with DataFrame.ne and DataFrame.any:
df1 = df[df.ne('None').any(axis=1)]
print (df1)
a b c
1 a g v
2 None None b
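To see why the original dropna call had no effect, here is a small sketch (with made-up frames named real and text) comparing actual None objects with the string 'None'. Only the former count as missing values.

```python
import pandas as pd

# real holds actual None objects, text holds the string 'None'
real = pd.DataFrame({"a": [None, "a"], "b": [None, "g"]})
text = pd.DataFrame({"a": ["None", "a"], "b": ["None", "g"]})

# Actual None objects are treated as missing, so the first row is dropped
print(len(real.dropna(how="all")))

# The string 'None' is ordinary text, so no row is dropped
print(len(text.dropna(how="all")))

# Inspecting the type of a cell shows which case applies
print(type(text.loc[0, "a"]))
```

If the last line prints str for your data, the replace step shown above is needed before dropna.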
If you want to drop columns rather than rows, drop along axis 1. Use the how keyword to drop columns with any or all NaN values. Check the docs.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2,3], 'b':[-1, 0, np.nan], 'c':[np.nan, np.nan, np.nan]})
df
a b c
0 1 -1.0 NaN
1 2 0.0 NaN
2 3 NaN NaN
df.dropna(axis=1, how='any')
a
0 1
1 2
2 3
df.dropna(axis=1, how='all')
a b
0 1 -1.0
1 2 0.0
2 3 NaN

Pandas insert alternate blank rows

Given the following data frame:
import pandas as pd
import numpy as np
df1=pd.DataFrame({'A':['a','b','c','d'],
'B':['d',np.nan,'c','f']})
df1
A B
0 a d
1 b NaN
2 c c
3 d f
I'd like to insert blank rows before each row.
The desired result is:
A B
0 NaN NaN
1 a d
2 NaN NaN
3 b NaN
4 NaN NaN
5 c c
6 NaN NaN
7 d f
In reality, I have many rows.
Thanks in advance!
I think you could change your index like @bananafish did and then use reindex:
df1.index = range(1, 2*len(df1)+1, 2)
df2 = df1.reindex(index=range(2*len(df1)))
In [29]: df2
Out[29]:
A B
0 NaN NaN
1 a d
2 NaN NaN
3 b NaN
4 NaN NaN
5 c c
6 NaN NaN
7 d f
Use numpy and pd.DataFrame
def pir(df):
    nans = np.where(np.empty_like(df.values), np.nan, np.nan)
    data = np.hstack([nans, df.values]).reshape(-1, df.shape[1])
    return pd.DataFrame(data, columns=df.columns)
pir(df1)
Testing and Comparison
Code
def banana(df):
    df1 = df.set_index(np.arange(1, 2*len(df)+1, 2))
    df2 = pd.DataFrame(index=range(0, 2*len(df1), 2), columns=df1.columns)
    return pd.concat([df1, df2]).sort_index()
def anton(df):
    df = df.set_index(np.arange(1, 2*len(df)+1, 2))
    return df.reindex(index=range(2*len(df)))
def pir(df):
    nans = np.where(np.empty_like(df.values), np.nan, np.nan)
    data = np.hstack([nans, df.values]).reshape(-1, df.shape[1])
    return pd.DataFrame(data, columns=df.columns)
Results
pd.concat([f(df1) for f in [banana, anton, pir]],
          axis=1, keys=['banana', 'anton', 'pir'])
Timing
A bit roundabout but this works:
df1.index = range(1, 2*len(df1)+1, 2)
df2 = pd.DataFrame(index=range(0, 2*len(df1), 2), columns=df1.columns)
df3 = pd.concat([df1, df2]).sort_index()

Pandas take value from columns if not NaN

Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':['One','Two',np.nan],
'B':[np.nan,np.nan,'Three'],
})
df
A B
0 One NaN
1 Two NaN
2 NaN Three
I'd like to create one column ('C') that takes the value of either 'A' or 'B' if it is not NaN like this:
A B C
0 One NaN One
1 Two NaN Two
2 NaN Three Three
Thanks in advance!
You can use combine_first:
df['C'] = df.A.combine_first(df.B)
print(df)
A B C
0 One NaN One
1 Two NaN Two
2 NaN Three Three
Or fillna:
df['C']= df.A.fillna(df.B)
print(df)
A B C
0 One NaN One
1 Two NaN Two
2 NaN Three Three
Or use np.where with a default value (e.g. 1) for rows where both A and B are NaN:
df['C'] = np.where(df.A.notnull(), df.A, np.where(df.B.notnull(), df.B, 1))
print(df)
A B C
0 One NaN One
1 Two NaN Two
2 NaN Three Three
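If there are more than two candidate columns, one possible sketch (not from the answer above) is to back-fill across the columns and keep the first one, which picks the first non-NaN value in each row. The list cols is just an illustrative name for the columns to check, in priority order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["One", "Two", np.nan],
                   "B": [np.nan, np.nan, "Three"]})

# Columns to check, in priority order (illustrative name)
cols = ["A", "B"]

# Back-fill along each row, then take the first column:
# this yields the first non-NaN value among cols for every row
df["C"] = df[cols].bfill(axis=1).iloc[:, 0]
print(df)
```

Rows where every listed column is NaN stay NaN in C.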
