Parsing a pandas DataFrame column holding dictionary-like strings into new columns for each dictionary key - python-3.x

In Python 3 with pandas, imagine there is a DataFrame df with a column x:
df = pd.DataFrame(
    [
        {'x': '{"a":"1","b":"2","c":"3"}'},
        {'x': '{"a":"2","b":"3","c":"4"}'}
    ]
)
The column x has data that looks like a dictionary. How can I parse it into a new DataFrame, so that each key becomes a new column?
The desired output DataFrame is:
x,a,b,c
'{"a":"1","b":"2","c":"3"}',1,2,3
'{"a":"2","b":"3","c":"4"}',2,3,4
None of the solutions in this post seem to work in this case:
parsing a dictionary in a pandas dataframe cell into new row cells (new columns)
df1=pd.DataFrame(df.loc[:,'x'].values.tolist())
print(df1)
This results in the same DataFrame; it didn't separate the column into one column per key.
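Checking the cell type confirms the values are plain strings rather than dicts, which is presumably why tolist() changes nothing:
import json
# The cells are strings, not dicts, so tolist() yields a list of strings
print(type(df.loc[0, 'x']))        # <class 'str'>
# Parsing one cell by hand does produce a real dict
print(json.loads(df.loc[0, 'x']))  # {'a': '1', 'b': '2', 'c': '3'}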
Any 2 cents?
Thanks!

You can also map json.loads and convert to a DataFrame like this:
import json
df1 = pd.DataFrame(df['x'].map(json.loads).tolist(),index=df.index)
print(df1)
a b c
0 1 2 3
1 2 3 4
This tests faster than evaluating via ast; below is a benchmark over 40K rows:
m = pd.concat([df]*20000,ignore_index=True)
%%timeit
import json
df1 = pd.DataFrame(m['x'].map(json.loads).tolist(),index=m.index)
#256 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
import ast
df1 = pd.DataFrame(m['x'].map(ast.literal_eval).tolist(),index=m.index)
#1.32 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
import ast
df1 = pd.DataFrame(m['x'].apply(ast.literal_eval).tolist(),index=m.index)
#1.34 s ± 71.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
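On pandas 1.0+ you could also try json_normalize after parsing the strings; a quick sketch, not benchmarked here:
import json
# json.loads turns each string into a dict; json_normalize expands the keys
df1 = pd.json_normalize(df['x'].map(json.loads).tolist())
print(df1)
#    a  b  c
# 0  1  2  3
# 1  2  3  4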

Because the values are string representations of dictionaries, it is necessary to convert them to real dictionaries first:
import ast, json
# performance measured on repeated sample data; real data may differ
m = pd.concat([df]*20000,ignore_index=True)
In [98]: %timeit pd.DataFrame([json.loads(x) for x in m['x']], index=m.index)
206 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#anky_91 solution
In [99]: %timeit pd.DataFrame(m['x'].map(json.loads).tolist(),index=m.index)
210 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [100]: %timeit pd.DataFrame(m['x'].map(ast.literal_eval).tolist(),index=m.index)
903 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [101]: %timeit pd.DataFrame(m['x'].apply(ast.literal_eval).tolist(),index=m.index)
893 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
print(df1)
a b c
0 1 2 3
1 2 3 4
Last, to append to the original:
df = df.join(df1)
print(df)
x a b c
0 {"a":"1","b":"2","c":"3"} 1 2 3
1 {"a":"2","b":"3","c":"4"} 2 3 4

Related

How to make a calculation in a pandas dataframe depending on the value of a certain column

I have this dataframe and I want to make a calculation depending on a condition, like below:
count prep result
0 10 100
10 100 100
I want to create a new column evaluated that is:
if df['count'] == 0:
    df['evaluated'] = df['result'] / df['prep']
else:
    df['evaluated'] = df['result'] / df['count']
expected result is:
count prep result evaluated
0 10 100 10
100 10 100 1
What's the best way to do it? My real dataframe has 30k rows.
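Assuming the intended input had count values of 0 and 100 (the sample rows above look inconsistent with the expected output), a reproducible frame would be:
import pandas as pd
# Presumed intended sample data (an assumption; the question's input looks garbled)
df = pd.DataFrame({'count': [0, 100], 'prep': [10, 10], 'result': [100, 100]})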
You can use where or mask:
df['evaluated'] = df['result'].div(df['prep'].where(df['count'].eq(0), df['count']))
Or:
df['evaluated'] = df['result'].div(df['count'].mask(df['count'].eq(0), df['prep']))
Output (assuming there was an error in the provided input):
count prep result evaluated
0 0 10 100 10.0
1 100 10 100 1.0
You can also use np.where from numpy to do that:
import numpy as np

df['evaluated'] = np.where(df['count'] == 0,
                           df['result'] / df['prep'],    # count == 0
                           df['result'] / df['count'])   # count != 0
Performance over 30k rows (the differences are not really significant):
>>> %timeit df['result'].div(df['prep'].where(df['count'].eq(0), df['count']))
652 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit df['result'].div(df['count'].mask(df['count'].eq(0), df['prep']))
638 µs ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit np.where(df['count'] == 0, df['result'] / df['prep'], df['result'] / df['count'])
462 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
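For reference, a plausible 30k-row setup for the timings above (an assumed frame, not given in the original post):
import numpy as np
import pandas as pd
# Hypothetical benchmark data: random zero/non-zero counts over 30k rows
rng = np.random.default_rng(0)
df = pd.DataFrame({'count': rng.choice([0, 100], size=30_000),
                   'prep': 10,       # scalar is broadcast to every row
                   'result': 100})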

Pandas using map or apply to make a new column from adjustments using a dictionary

I have data from a sporting event, and it is known that there is a bias at each home arena that I want to adjust for. I have already created a dictionary where the arena is the key and the value is the adjustment I want to make.
So for each row, I want to take the home team, get the adjustment, and then subtract that from the distance column. I have the following code but I cannot seem to get it working.
# Making the dictionary; this part is working properly
teams = df.home_team.unique().tolist()
adj_shot_dict = {}
for team in teams:
    df_temp = df[df.home_team == team]
    average = round(df_temp.event_distance.mean(), 2)
    adj_shot_dict[team] = average

def make_adjustment(df):
    team = df.home_team
    distance = df.event_distance
    adj_dist = distance - adj_shot_dict[team]
    return adj_dist

df['adj_dist'] = df['event_distance'].apply(make_adjustment)
IIUC, you already have the dict and simply want to subtract the adj_shot_dict values from the event_distance column:
df['adj_dist'] = df['event_distance'] - df['home_team'].map(adj_shot_dict)
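For completeness: the posted function fails because Series.apply passes scalar values, not rows. Applying it row-wise over the frame would work, though it is much slower than the map above:
# Row-wise apply: each row arrives as a Series, so row.home_team and
# row.event_distance resolve correctly (correct but slow)
df['adj_dist'] = df.apply(make_adjustment, axis=1)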
Old answer
Group by home_team, compute the average of event_distance, then subtract the result from event_distance:
df['adj_dist'] = df['event_distance'] \
    - df.groupby('home_team')['event_distance'] \
        .transform('mean').round(2)

# OR
df['adj_dist'] = df.groupby('home_team')['event_distance'] \
    .apply(lambda x: x - x.mean().round(2))
Performance
>>> len(df)
60000
>>> df.sample(5)
home_team event_distance
5 team3 60
4 team2 50
1 team2 20
1 team2 20
0 team1 10
def loop():
    teams = df.home_team.unique().tolist()
    adj_shot_dict = {}
    for team in teams:
        df_temp = df[df.home_team == team]
        average = round(df_temp.event_distance.mean(), 2)
        adj_shot_dict[team] = average

def loop2():
    df.groupby('home_team')['event_distance'].transform('mean').round(2)
>>> %timeit loop()
13.5 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit loop2()
3.62 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Total process
>>> %timeit df['event_distance'] - df.groupby('home_team')['event_distance'].transform('mean').round(2)
3.7 ms ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pandas speedup when working with transposed numpy matrix

I was trying to figure out which is faster for standardizing data, numpy or pandas, and whether to work on the whole matrix/DataFrame or column by column, and I found the strange behavior shown in the code below.
import pandas as pd
import numpy as np
def stand(df):
    res = pd.DataFrame()
    for col in df:
        res[col] = (df[col] - df[col].min()) / df[col].max()
    return res
matrix = pd.DataFrame(np.random.randint(0,174000,size=(1000000, 100)))
matrix.shape
(1000000, 100)
%timeit res = (matrix - matrix.min(axis=0))/ matrix.max(axis=0)
2.64 s ± 22.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit stand(matrix)
5.32 s ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
But when starting from a "flipped" numpy matrix and transposing it to create the DataFrame
matrix = pd.DataFrame(np.random.randint(0,174000,size=(100, 1000000)).T)
matrix.shape
(1000000, 100)
%timeit res = (matrix - matrix.min(axis=0))/ matrix.max(axis=0)
2.37 s ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit stand(matrix)
1.2 s ± 8.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The column-by-column standardization gets ~4 times faster.
This behavior persists when using .values or numpy operations, as shown below:
%timeit res = (matrix.values - matrix.min(axis=0).values)/ matrix.max(axis=0).values
2.58 s ± 417 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit stand(matrix)
5.26 s ± 42.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res = np.divide(np.subtract(matrix.values, matrix.min(axis=0).values), matrix.max(axis=0).values)
2.17 s ± 7.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#Flipped matrix transpose
matrix = pd.DataFrame(np.random.randint(0,174000,size=(100, 1000000)).T)
matrix.shape
(1000000, 100)
%timeit res = (matrix.values - matrix.min(axis=0).values)/ matrix.max(axis=0).values
2.2 s ± 8.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit stand(matrix)
1.33 s ± 190 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res = np.divide(np.subtract(matrix.values, matrix.min(axis=0).values), matrix.max(axis=0).values)
2.46 s ± 166 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Can someone explain why starting from a flipped matrix and transposing it before creating the DataFrame changes the performance compared with starting from a non-flipped matrix?
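One detail that may be relevant (just a guess on my part): the two constructions leave the underlying array with different memory order, which can be checked like this:
# The C-ordered source stores rows contiguously...
a = np.random.randint(0, 174000, size=(1000000, 100))
print(a.flags['C_CONTIGUOUS'], a.flags['F_CONTIGUOUS'])  # True False
# ...while the transposed source is Fortran-ordered, so each column
# is a contiguous block, which favors column-by-column access
b = np.random.randint(0, 174000, size=(100, 1000000)).T
print(b.flags['C_CONTIGUOUS'], b.flags['F_CONTIGUOUS'])  # False True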

Optimizing pandas operation for column intersection

I have a DataFrame with 2 columns (event and events). The event column contains a particular event ID and the events column contains a list of event IDs.
Example:
df

event  events
'a'    ['x','y','abc','a']
'b'    ['x','y','c','a']
'c'    ['a','c']
'd'    ['b']
I want to create another column (eventoccured) indicating whether event is in events.
eventoccured
1
0
1
0
I am currently using
df['eventoccured']= df.apply(lambda x: x['event'] in x['events'], axis=1)
which gives the desired result but is slow; I want a faster solution for this.
Thanks
One idea is to use a list comprehension:
#40k rows
df = pd.concat([df] * 10000, ignore_index=True)
In [217]: %timeit df['eventoccured']= df.apply(lambda x: x['event'] in x['events'], axis=1)
1.15 s ± 36.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [218]: %timeit df['eventoccured1'] = [x in y for x, y in zip(df['event'], df['events'])]
15.2 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
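The comprehension yields booleans; if you want the 0/1 integers shown in the desired output, a small variation is to cast inside the comprehension:
# int(...) maps the membership test's True/False to 1/0
df['eventoccured'] = [int(x in y) for x, y in zip(df['event'], df['events'])]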

Converting a pandas dataframe into a dictionary?

I have a pandas dataframe news_dataset, where column ID is an article ID and column Content is the article content (large text), given as:
ID Content
17283 WASHINGTON — Congressional Republicans have...
17284 After the bullet shells get counted, the blood...
17285 When Walt Disney’s “Bambi” opened in 1942, cri...
17286 Death may be the great equalizer, but it isn’t...
17287 SEOUL, South Korea — North Korea’s leader, ...
Now, I want to convert the pandas dataframe into a dictionary such that ID is the key and Content is the value. Basically, what I did at first was something like:
dd = {}
for i in news_dataset['ID']:
    for j in news_dataset['Content']:
        dd[j] = i
This piece of code is pathetic and takes far too long (> 4 minutes) to run. So, after checking for better approaches (on Stack Overflow), what I finally did is:
id_array = []
content_array = []
for id_num in news_dataset['ID']:
    id_array.append(id_num)
for content in news_dataset['Content']:
    content_array.append(content)
news_dict = dict(zip(id_array, content_array))
This code takes nearly 15 seconds to execute.
What I want to ask is:
i) What's wrong with the first code, and why does it take so much time to run?
ii) Is using a for loop inside another for loop the wrong way to iterate over large text data?
iii) What would be the right way to create the dictionary in a single piece of code?
Generally, loops in pandas should be avoided whenever a non-loop, vectorized alternative exists. (As for the first snippet: the nested loops make it quadratic in the number of rows, and it also builds the mapping backwards, with Content as the key and every key overwritten with the last ID.)
You can create an index from column ID and call Series.to_dict:
news_dict=news_dataset.set_index('ID')['Content'].to_dict()
Or zip:
news_dict=dict(zip(news_dataset['ID'],news_dataset['Content']))
#alternative
#news_dict=dict(zip(news_dataset['ID'].values, news_dataset['Content'].values))
Performance:
np.random.seed(1425)
# 1000-row sample
news_dataset = pd.DataFrame({'ID': np.arange(1000),
                             'Content': np.random.choice(list('abcdef'), size=1000)})
#print (news_dataset)
In [98]: %%timeit
    ...: dd = {}
    ...: for i in news_dataset['ID']:
    ...:     for j in news_dataset['Content']:
    ...:         dd[j] = i
    ...:
61.7 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [99]: %%timeit
    ...: id_array = []
    ...: content_array = []
    ...: for id_num in news_dataset['ID']:
    ...:     id_array.append(id_num)
    ...: for content in news_dataset['Content']:
    ...:     content_array.append(content)
    ...: news_dict = dict(zip(id_array, content_array))
    ...:
251 µs ± 3.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [100]: %%timeit
...: news_dict=news_dataset.set_index('ID')['Content'].to_dict()
...:
584 µs ± 9.69 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [101]: %%timeit
...: news_dict=dict(zip(news_dataset['ID'],news_dataset['Content']))
...:
106 µs ± 3.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [102]: %%timeit
...: news_dict=dict(zip(news_dataset['ID'].values, news_dataset['Content'].values))
...:
122 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
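One caveat worth knowing (not raised in the question): if ID contains duplicates, all of the approaches above silently keep only the last Content seen for a given ID, because dict keys are unique:
# With a duplicated ID, the later row wins
tmp = pd.DataFrame({'ID': [1, 1], 'Content': ['first', 'second']})
print(dict(zip(tmp['ID'], tmp['Content'])))  # {1: 'second'}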
