How to explode/split a nested list, inside a list inside a pandas dataframe column and make separate columns out of them? - python-3.x

I have a dataframe. I want to split the Options column into id, AUD, and ud.
id col1 col2 Options
1 A B [{'id':25,'X': {'AUD': None, 'ud':0}}]
2 C D [{'id':27,'X': {'AUD': None, 'ud':0}}]
3 E F [{'id':28,'X': {'AUD': None, 'ud':0}}]
4 G H [{'id':29,'X': {'AUD': None, 'ud':0}}]
Expected output dataframe:
id col1 col2 id Aud ud
1 A B 25 None 0
2 C D 27 None 0
3 E F 28 None 0
4 G H 29 None 0
How do you go about it using python3.6 and pandas dataframe?

Use a list comprehension with json_normalize to build one DataFrame per row and join them together with concat. DataFrame.add_prefix avoids duplicated column names:
import ast
import pandas as pd
from pandas import json_normalize  # pandas < 1.0: from pandas.io.json import json_normalize
L = [json_normalize(x) for x in df.pop('Options')]
# if the column holds strings instead of lists of dicts
#L = [json_normalize(ast.literal_eval(x)) for x in df.pop('Options')]
df = df.join(pd.concat(L, ignore_index=True, sort=False).add_prefix('opt_'))
print (df)
id col1 col2 opt_id opt_X.AUD opt_X.ud
0 1 A B 25 None 0
1 2 C D 27 None 0
2 3 E F 28 None 0
3 4 G H 29 None 0
Another solution extracts the values of the nested X dictionaries (this version assumes the column holds strings, hence ast.literal_eval):
L = [{k: v for y in ast.literal_eval(x) for k, v in {**y.pop('X'), **y}.items()}
for x in df.pop('Options')]
df = df.join(pd.DataFrame(L, index=df.index).add_prefix('opt_'))
print (df)
id col1 col2 opt_AUD opt_ud opt_id
0 1 A B None 0 25
1 2 C D None 0 27
2 3 E F None 0 28
3 4 G H None 0 29
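The same flattening can also be sketched with DataFrame.explode followed by a single json_normalize call. This is a minimal sketch, not the answer's method: it assumes pandas 1.0 or later and that Options holds real lists of dicts (not strings); the names exploded, flat and result are illustrative.

```python
import pandas as pd

# Small sample in the same shape as the question's data
df = pd.DataFrame({
    "id": [1, 2],
    "col1": ["A", "C"],
    "col2": ["B", "D"],
    "Options": [
        [{"id": 25, "X": {"AUD": None, "ud": 0}}],
        [{"id": 27, "X": {"AUD": None, "ud": 0}}],
    ],
})

# One row per dict in each list, with a fresh 0..n-1 index
exploded = df.explode("Options").reset_index(drop=True)

# Flatten the nested dicts; nested keys become dotted names such as "X.AUD"
flat = pd.json_normalize(exploded["Options"].tolist()).add_prefix("opt_")

# Join on the shared positional index
result = exploded.drop(columns="Options").join(flat)
print(result)
```

Because explode produces one row per list element, this variant should also cope with lists that hold more than one dict, at the cost of repeating the other columns.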

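Try this:
for idx, options in df['Options'].items():  # iteritems() was removed in pandas 2.0
    first = options[0]
    # note: assigning to 'id' overwrites the existing id column; pick another name to keep both
    df.loc[idx, 'id'] = first['id']
    df.loc[idx, 'Aud'] = first['X']['AUD']
    df.loc[idx, 'ud'] = first['X']['ud']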

Related

Sum in Column based on condition in rows in pandas dataframe [duplicate]

I have a dataframe which I want to plot with matplotlib, but the index column is the time and I cannot plot it.
This is the dataframe (df3):
but when I try the following:
plt.plot(df3['magnetic_mag mean'], df3['YYYY-MO-DD HH-MI-SS_SSS'], label='FDI')
I'm getting an error obviously:
KeyError: 'YYYY-MO-DD HH-MI-SS_SSS'
So what I want to do is add an extra column to my dataframe (named 'Time') which is just a copy of the index column.
How can I do it?
This is the entire code:
#Importing the csv file into df
df = pd.read_csv('university2.csv', sep=";", skiprows=1)
#Changing datetime
df['YYYY-MO-DD HH-MI-SS_SSS'] = pd.to_datetime(df['YYYY-MO-DD HH-MI-SS_SSS'],
format='%Y-%m-%d %H:%M:%S:%f')
#Set index from column
df = df.set_index('YYYY-MO-DD HH-MI-SS_SSS')
#Add Magnetic Magnitude Column
df['magnetic_mag'] = np.sqrt(df['MAGNETIC FIELD X (μT)']**2 + df['MAGNETIC FIELD Y (μT)']**2 + df['MAGNETIC FIELD Z (μT)']**2)
#Subtract Earth's Average Magnetic Field from 'magnetic_mag'
df['magnetic_mag'] = df['magnetic_mag'] - 30
#Copy interesting values
df2 = df[[ 'ATMOSPHERIC PRESSURE (hPa)',
'TEMPERATURE (C)', 'magnetic_mag']].copy()
#Hourly Average and Standard Deviation for interesting values
df3 = df2.resample('H').agg(['mean','std'])
df3.columns = [' '.join(col) for col in df3.columns]
df3.reset_index()
plt.plot(df3['magnetic_mag mean'], df3['YYYY-MO-DD HH-MI-SS_SSS'], label='FDI')
Thank you !!
I think you need reset_index:
df3 = df3.reset_index()
Another possible solution, although I think inplace is not good practice:
df3.reset_index(inplace=True)
But if you need new column, use:
df3['new'] = df3.index
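To make the two options concrete, here is a minimal sketch on made-up data. The index name timestamp and the sample values are assumptions for illustration, not the asker's file; the resample and column-flattening steps follow the question's code, using "60min" as the hourly frequency.

```python
import numpy as np
import pandas as pd

# Hypothetical half-hourly readings with a named DatetimeIndex
idx = pd.date_range("2022-01-01", periods=6, freq="30min")
df = pd.DataFrame({"magnetic_mag": np.arange(6, dtype=float)}, index=idx)
df.index.name = "timestamp"

# Hourly mean/std, then flatten the two-level column labels
df3 = df.resample("60min").agg(["mean", "std"])
df3.columns = [" ".join(col) for col in df3.columns]

# Option 1: copy the index into a new column, keeping the index as it is
df3["Time"] = df3.index

# Option 2: move the index into a regular column (returns a new frame)
flat = df3.reset_index()
print(flat)
```

Note that reset_index returns a new DataFrame, which is why the bare df3.reset_index() call in the question had no visible effect.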
I think you can simplify the read_csv call:
df = pd.read_csv('university2.csv',
                 sep=";",
                 skiprows=1,
                 index_col='YYYY-MO-DD HH-MI-SS_SSS',
                 parse_dates=['YYYY-MO-DD HH-MI-SS_SSS'])  # parse_dates takes a list of columns; if parsing fails, use pd.to_datetime
And then omit:
#Changing datetime
df['YYYY-MO-DD HH-MI-SS_SSS'] = pd.to_datetime(df['YYYY-MO-DD HH-MI-SS_SSS'],
format='%Y-%m-%d %H:%M:%S:%f')
#Set index from column
df = df.set_index('YYYY-MO-DD HH-MI-SS_SSS')
EDIT: If MultiIndex or Index is from groupby operation, possible solutions are:
df = pd.DataFrame({'A':list('aaaabbbb'),
'B':list('ccddeeff'),
'C':range(8),
'D':range(4,12)})
print (df)
A B C D
0 a c 0 4
1 a c 1 5
2 a d 2 6
3 a d 3 7
4 b e 4 8
5 b e 5 9
6 b f 6 10
7 b f 7 11
df1 = df.groupby(['A','B']).sum()
print (df1)
C D
A B
a c 1 9
d 5 13
b e 9 17
f 13 21
Add parameter as_index=False:
df2 = df.groupby(['A','B'], as_index=False).sum()
print (df2)
A B C D
0 a c 1 9
1 a d 5 13
2 b e 9 17
3 b f 13 21
Or add reset_index:
df2 = df.groupby(['A','B']).sum().reset_index()
print (df2)
A B C D
0 a c 1 9
1 a d 5 13
2 b e 9 17
3 b f 13 21
You can access the index directly and plot it. The following is an example:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
#Get index in horizontal axis
plt.plot(df.index, df[0])
plt.show()
#Get index in vertical axis
plt.plot(df[0], df.index)
plt.show()
You can also use eval to achieve this:
In [2]: df = pd.DataFrame({'num': range(5), 'date': pd.date_range('2022-06-30', '2022-07-04')}, index=list('ABCDE'))
In [3]: df
Out[3]:
num date
A 0 2022-06-30
B 1 2022-07-01
C 2 2022-07-02
D 3 2022-07-03
E 4 2022-07-04
In [4]: df.eval('index_copy = index')
Out[4]:
num date index_copy
A 0 2022-06-30 A
B 1 2022-07-01 B
C 2 2022-07-02 C
D 3 2022-07-03 D
E 4 2022-07-04 E

Easily generate edge list from specific structure using pandas

This is a question about how to do things properly with pandas (I use version 1.0).
Let's say I have a DataFrame of missions, each of which contains an origin and one or more destinations:
mid from to
0 0 A [C]
1 1 A [B, C]
2 2 B [B]
3 3 C [D, E, F]
E.g., for the mission mid=1, people will travel from A to B, then from B to C, and finally from C back to A. Note that I have no control over the data model of the input DataFrame.
I would like to compute metrics on each travel of the mission. The expected output would be exactly:
tid mid from to
0 0 0 A C
1 1 0 C A
2 2 1 A B
3 3 1 B C
4 4 1 C A
5 5 2 B B
6 6 2 B B
7 7 3 C D
8 8 3 D E
9 9 3 E F
10 10 3 F C
I have found a way to achieve my goal. Please find the MCVE below:
import pandas as pd
# Input:
df = pd.DataFrame(
[["A", ["C"]],
["A", ["B", "C"]],
["B", ["B"]],
["C", ["D", "E", "F"]]],
columns = ["from", "to"]
).reset_index().rename(columns={'index': 'mid'})
# Create chain:
df['chain'] = df.apply(lambda x: list(x['from']) + x['to'] + list(x['from']), axis=1)
# Explode chain:
df = df.explode('chain')
# Shift to create travel:
df['end'] = df.groupby("mid")["chain"].shift(-1)
# Remove extra row, clean, reindex and rename:
df = df.dropna(subset=['end']).reset_index(drop=True).reset_index().rename(columns={'index': 'tid'})
df = df.drop(['from', 'to'], axis=1).rename(columns={'chain': 'from', 'end': 'to'})
My question is: is there a better/easier way to do this with pandas? By better I mean not necessarily more performant (though it can be, of course), but more readable and intuitive.
Your operation is basically explode and concat:
# turn series of lists in to single series
tmp = df[['mid','to']].explode('to')
# new `from` is concatenation of `from` and the list
df1 = pd.concat((df[['mid','from']],
tmp.rename(columns={'to':'from'})
)
).sort_index()
# new `to` is concatenation of the list and `from`
df2 = pd.concat((tmp,
df[['mid','from']].rename(columns={'from':'to'})
)
).sort_index()
df1['to'] = df2['to']
Output:
mid from to
0 0 A C
0 0 C A
1 1 A B
1 1 B C
1 1 C A
2 2 B B
2 2 B B
3 3 C D
3 3 D E
3 3 E F
3 3 F C
If you don't mind reconstructing the entire DataFrame, you can clean it up a bit with np.roll to get the pairs of destinations, and then assign the value of mid based on the number of trips (the length of each sublist in l):
import pandas as pd
import numpy as np
from itertools import chain
l = [[fr]+to for fr,to in zip(df['from'], df['to'])]
df1 = (pd.DataFrame(data=chain.from_iterable([zip(sl, np.roll(sl, -1)) for sl in l]),
columns=['from', 'to'])
.assign(mid=np.repeat(df['mid'].to_numpy(), [*map(len, l)])))
from to mid
0 A C 0
1 C A 0
2 A B 1
3 B C 1
4 C A 1
5 B B 2
6 B B 2
7 C D 3
8 D E 3
9 E F 3
10 F C 3
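For readability, the edge list can also be built with a plain Python loop before handing the rows to pandas. This is only a sketch of the same idea (close each route by appending the origin, then pair consecutive stops); the names stops, rows and edges are illustrative and not part of the answers above.

```python
import pandas as pd

df = pd.DataFrame({
    "mid": [0, 1, 2, 3],
    "from": ["A", "A", "B", "C"],
    "to": [["C"], ["B", "C"], ["B"], ["D", "E", "F"]],
})

rows = []
for mid, origin, dests in zip(df["mid"], df["from"], df["to"]):
    # Closed route: origin -> each destination -> back to origin
    stops = [origin, *dests, origin]
    # Each pair of consecutive stops is one travel
    rows.extend((mid, a, b) for a, b in zip(stops, stops[1:]))

# The running row number becomes the travel id (tid)
edges = (pd.DataFrame(rows, columns=["mid", "from", "to"])
           .rename_axis("tid")
           .reset_index())
print(edges)
```

Unlike list(x['from']) in the question's code, wrapping the origin as [origin] does not split multi-character labels into single letters.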

Column label of max in pandas

I am trying to extract the maximum value in each row, and the label of the column it comes from, from a pandas dataframe. For example,
A B C D
index
x 0 1 2 3
y 3 2 1 0
I expect the following output,
A B C D Maxv Con
index
x 0 1 2 3 3 D
y 3 2 1 0 3 A
I tried the following,
df['Maxv'] = df.apply(max,axis=1)
df['Con'] = df.idxmax(axis='rows')
It returned only the max column and 'NaN' for Con column. What is the error here?
Thanks in Advance.
AP
You need axis='columns' or axis=1 in DataFrame.idxmax:
df['Con'] = df.idxmax(axis='columns')
print (df)
A B C D Maxv Con
index
x 0 1 2 3 3 D
y 3 2 1 0 3 A
Or:
df['Con'] = df.idxmax(axis=1)
print (df)
A B C D Maxv Con
index
x 0 1 2 3 3 D
y 3 2 1 0 3 A
You get NaNs because idxmax(axis='rows') returns a Series indexed by the column labels, which does not align with the row index of df:
print (df.idxmax(axis='rows'))
A y
B y
C x
D x
dtype: object
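A self-contained version of the fix might look like the sketch below. Restricting both calls to an explicit list of value columns (value_cols, a name introduced here) is an extra precaution so that the freshly added Maxv column never takes part in the comparison.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [0, 3], "B": [1, 2], "C": [2, 1], "D": [3, 0]},
    index=pd.Index(["x", "y"], name="index"),
)

# Only these columns take part in the row-wise comparison
value_cols = ["A", "B", "C", "D"]

df["Maxv"] = df[value_cols].max(axis=1)    # largest value in each row
df["Con"] = df[value_cols].idxmax(axis=1)  # column label of that value
print(df)
```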

iterate through rows and columns in excel using pandas-Python 3

I have an excel spreadsheet that I read with this code:
df=pd.ExcelFile('/Users/xxx/Documents/Python/table.xlsx')
ccg=df.parse("CCG")
With the sheet that I want inside the spreadsheet being CCG
The sheet looks like this:
col1 col2 col3
x a 1 2
x b 3 4
x c 5 6
x d 7 8
x a 9 10
x b 11 12
x c 13 14
y a 15 16
y b 17 18
y c 19 20
y d 21 22
y a 23 24
How would I write code that gets the values of col2 and col3 for rows that contain both a and x? The expected output for this table would be: col2=[1,9], col3=[2,10]
Try this:
df = pd.read_excel('/Users/xxx/Documents/Python/table.xlsx', 'CCG', index_col=0, usecols=['col1','col2']) \
.query("index == 'x' and col1 == 'a'")
Demo:
Excel file:
In [243]: fn = r'C:\Temp\.data\41718085.xlsx'
In [244]: pd.read_excel(fn, 'CCG', index_col=0, usecols=['col1','col2']) \
.query("index == 'x' and col1 == 'a'")
Out[244]:
col1 col2
x a 1
x a 9
You can do:
df = pd.read_excel('/Users/xxx/Documents/Python/table.xlsx', sheet_name='CCG', index_col=0)
filtered = df[(df.index == 'x') & (df.col1 == 'a')]
Then from here, you can return the values as NumPy arrays with:
filtered['col2'].to_numpy()
filtered['col3'].to_numpy()
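The filtering step can be tried without the Excel file. The sketch below rebuilds part of the sheet in memory (an assumption made so the example is runnable) and selects the rows with a boolean mask; ccg, mask and the two result lists are illustrative names.

```python
import pandas as pd

# In-memory stand-in for part of the CCG sheet: the x/y labels form the index
ccg = pd.DataFrame(
    {
        "col1": ["a", "b", "c", "d", "a", "b", "c", "a", "b"],
        "col2": [1, 3, 5, 7, 9, 11, 13, 15, 17],
        "col3": [2, 4, 6, 8, 10, 12, 14, 16, 18],
    },
    index=["x"] * 7 + ["y"] * 2,
)

# Rows whose index label is 'x' and whose col1 value is 'a'
mask = (ccg.index == "x") & (ccg["col1"] == "a")

col2_values = ccg.loc[mask, "col2"].tolist()
col3_values = ccg.loc[mask, "col3"].tolist()
print(col2_values, col3_values)
```

A vectorized mask like this replaces explicit iteration over rows and columns, which is usually slower and harder to read in pandas.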
I managed to solve it with a counter that increments as it iterates over the rows: whenever it finds a, it appends the current position to a list of indices, but only if that position lies within the range of rows that x covers. Once I have the indices, I search through col2 and col3 and pull out the values at those indices.

Pandas Dynamic Stack

Given the following data frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'foo':['a','b','c','d'],
'bar':['e','f','g','h'],
0:['i','j','k',np.nan],
1:['m',np.nan,'o','p']})
df=df[['foo','bar',0,1]]
df
foo bar 0 1
0 a e i m
1 b f j NaN
2 c g k o
3 d h NaN p
...which resulted from a previous procedure that produced columns 0 and 1 (and may have produced more or fewer columns than 0 and 1, depending on the data).
I want to somehow stack (if that's the correct term) the data so that each value of 0 and 1 (ignoring NaNs) produces a new row like this:
foo bar
0 a e
0 a i
0 a m
1 b f
1 b j
2 c g
2 c k
2 c o
3 d h
3 d p
You probably noticed that the common field is foo.
It will likely occur that there are more common fields in my actual data set.
Also, I'm not sure how important it is that the index values repeat in the end result across values of foo. As long as the data is correct, that's my main concern.
Update:
What if I have 2+ common fields like this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'foo':['a','a','b','b'],
'foo2':['a2','b2','c2','d2'],
'bar':['e','f','g','h'],
0:['i','j','k',np.nan],
1:['m',np.nan,'o','p']})
df=df[['foo','foo2','bar',0,1]]
df
foo foo2 bar 0 1
0 a a2 e i m
1 a b2 f j NaN
2 b c2 g k o
3 b d2 h NaN p
You can use set_index, stack and reset_index:
print(df.set_index('foo').stack().reset_index(level=1, drop=True).reset_index(name='bar'))
foo bar
0 a e
1 a i
2 a m
3 b f
4 b j
5 c g
6 c k
7 c o
8 d h
9 d p
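If you need to keep the original index, use melt:
print(pd.melt(df.reset_index(),
              id_vars=['index', 'foo'],
              value_vars=['bar', 0, 1],
              value_name='value')  # newer pandas versions may reject a value_name that matches an existing column
        .sort_values('index')
        .set_index('index', drop=True)
        .dropna()
        .drop('variable', axis=1)
        .rename(columns={'value': 'bar'})
        .rename_axis(None))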
foo bar
0 a e
0 a i
0 a m
1 b f
1 b j
2 c g
2 c k
2 c o
3 d h
3 d p
Or use the lesser-known lreshape:
print(pd.lreshape(df.reset_index(), {'bar': ['bar', 0, 1]})
        .sort_values('index')
        .set_index('index', drop=True)
        .rename_axis(None))
foo bar
0 a e
0 a i
0 a m
1 b f
1 b j
2 c g
2 c k
2 c o
3 d h
3 d p
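The update in the question (two or more common fields) is not covered above. One possible sketch, assuming every column other than the common fields should be stacked, passes all common fields as id_vars to melt. The names id_cols and stacked are illustrative, and a stable sort is used so the values of each original row keep their column order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': ['a', 'a', 'b', 'b'],
                   'foo2': ['a2', 'b2', 'c2', 'd2'],
                   'bar': ['e', 'f', 'g', 'h'],
                   0: ['i', 'j', 'k', np.nan],
                   1: ['m', np.nan, 'o', 'p']})

# Common fields that should be repeated on every output row
id_cols = ['foo', 'foo2']

stacked = (df.reset_index()
             .melt(id_vars=['index', *id_cols], value_name='value')
             .dropna(subset=['value'])              # ignore missing cells
             .sort_values('index', kind='mergesort')  # stable: keeps bar, 0, 1 order
             .set_index('index')
             .rename_axis(None)
             .drop(columns='variable')
             .rename(columns={'value': 'bar'}))
print(stacked)
```

Since value_vars is left out, melt stacks every column that is not listed in id_vars, so the same code should work when the earlier procedure produces more or fewer numbered columns.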
