Spread dataframe issue in Python pandas? - python-3.x

I am trying to reformat / spread my dataframe from key, value columns to wide format:
test = pd.DataFrame({
'id': [0,1,2,3,4,5,6,7],
'a':['aa','bb', 'a', 'k', 'aa','bb', 'a', 'k'],
'value': ['zzuz', 44, 'DE', 55, 'zdfdz', 454, 'SE', 155]})
test.pivot(columns='a', values='value',index='id')
I want the outcome to be:
aa bb a k
zzuz 44 DE 55
zdfdz 454 SE 155
I am trying to do this with .pivot without luck, please guide me what I am missing here?
test.pivot(columns='a', values='value')

Try via groupby()+cumcount() to tracking position and that will act as index of pivot() and rename_axis() to renaming axis(a bit of cleanup/if needed):
test['key']=test.groupby('a').cumcount()
out=test.pivot(columns='a', values='value',index='key')
out=out.rename_axis(columns=None,index=None)
OR(in 1 step)
out=(test.assign(key=test.groupby('a').cumcount())
.pivot(columns='a', values='value',index='key')
.rename_axis(columns=None,index=None))
output of out:
a aa bb k
0 DE zzuz 44 55
1 SE zdfdz 454 155

Related

The output of my code comes out too slowly.. How can i speed up my process

Thanks to the help from some users of this sites.
My code seems to work fine, but it's taking too long..
I'm trying to compare two data frames.(df1 has 1,291,250 rows / df2 has 1,286,692 rows)
if df1.iloc[0,0] == df2.iloc[0,0] and df1.iloc[0,1] == df2.iloc[0,1], then compare df1.iloc[0,2], df2.iloc[0,2].
If the first(df1.iloc[0,2]) is larger, I want to put the first index into the list, and if the second(df2.iloc[0,2]) is larger, I want to put the second index into the list.
Example DataFrame
In [1]: df1 = pd.DataFrame([[0, 1, 98], [1, 1, 198], [2, 2, 228]], columns = ['A1', 'B1', 'C1'])
In [2]: df1
Out[3]:
A1 B1 C1
0 0 1 98
1 1 1 198
2 2 2 228
In [4]: df2 = pd.DataFrame([[0, 1, 228], [1, 2, 110], [2, 2, 130]], columns = ['A2', 'B2', 'C2'])
In [5]: df2
Out[6]:
A2 B2 C2
0 0 1 228
1 1 2 110
2 2 2 130
In [7]: def find_high(df1, df2) # def function code is below
Out[8]: ([2], [0]) # The result what i want
This is just simple example. my data is bigger than this
my code is:
for i in range(60):
setattr(mod, f'df_1_{i}', np.array_split(df1, 60)[i])
getattr(mod, f'df_1_{i}').to_pickle(f'df_1_{i}')
import glob
files = glob.glob('df_1_*')
def find_high_pre(df1 = files, df2):
subtract_df2 = []
subtract_df1 = []
same_data = []
for df1_index, line in enumerate(df1.to_numpy()):
for df2_idx, row in enumerate(df2.to_numpy()):
if (line[0:2] == row[0:2]).all():
if line[2] < row[2]:
subtract_df2.append(df2_idx)
break
elif line[2] > row[2]:
subtract_df1.append(df1_idx)
break
else:
continue
break
return df1.iloc[subtract_df1].index.tolist(), df2.iloc[subtract_df2].index.tolist(), df1.iloc[same_data].index.to_list();
data_1 = []
for i in files:
e_data = pd.read_pickle(i)
num_cores = 30
df_split = np.array_split(e_data, num_cores)
data_1 += parmap.map(find_high_pre, df_split, pm_pbar=True, pm_processes =num_cores)
My code seems to work fine, but it's taking too long..
Chances are that replacing your nested for loops with a DataFrame.merge operation will take less time:
keys = ['A', 'B']
df1.columns = [*keys, 'C1']
df2.columns = [*keys, 'C2']
df = df1.reset_index().set_index(keys).merge(
df2.reset_index().set_index(keys), on=keys)
# now we have a merged dataframe like this:
# index_x C1 index_y C2
# A B
# 0 1 0 98 0 228
# 2 2 2 228 2 130
# therefrom we can easily extract the wanted indexes
data = [df.loc[df['C1'] > df['C2'], 'index_x'].values,
df.loc[df['C1'] < df['C2'], 'index_y'].values]

Structural Question Regarding pandas .drop method

df2=df.drop(df[df['issue']=="prob"].index)
df2.head()
The code immediately below works fine.
But why is there a need to type df[df[ rather than the below?
df2=df.drop(df['issue']=="prob"].index)
df2.head()
I know that the immediately above won't work while the former does. I would like to understand why or know what exactly I should google.
Also ~ any advice on a more relevant title would be appreciated.
Thanks!
Option 1: df[df['issue']=="prob"] produces a DataFrame with a subset of values.
Option 2: df['issue']=="prob" produces a pandas.Series with a Boolean for every row.
.drop works for Option 1, because it knows to just drop the selected indices, vs. all of the indices returned from Option 2.
I would use the following methods to remove rows.
Use ~ (not) to select the opposite of the Boolean selection.
df = df[~(df.treatment == 'Yes')]
Select rows with only the desired value
df = df[(df.treatment == 'No')]
import pandas as pd
import numpy as np
import random
# sample dataframe
np.random.seed(365)
random.seed(365)
rows = 25
data = {'a': np.random.randint(10, size=(rows)),
'groups': [random.choice(['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']) for _ in range(rows)],
'treatment': [random.choice(['Yes', 'No']) for _ in range(rows)],
'date': pd.bdate_range(datetime.today(), freq='d', periods=rows).tolist()}
df = pd.DataFrame(data)
df[df.treatment == 'Yes'].index
Produces just the indices where treatment is 'Yes', therefore df.drop(df[df.treatment == 'Yes'].index) only drops the indices in the list.
df[df.treatment == 'Yes'].index
[out]:
Int64Index([0, 1, 2, 4, 6, 7, 8, 11, 12, 13, 14, 15, 19, 21], dtype='int64')
df.drop(df[df.treatment == 'Yes'].index)
[out]:
a groups treatment date
3 5 6-25 No 2020-08-15
5 2 500-1000 No 2020-08-17
9 0 500-1000 No 2020-08-21
10 3 100-500 No 2020-08-22
16 8 1-5 No 2020-08-28
17 4 1-5 No 2020-08-29
18 3 1-5 No 2020-08-30
20 6 500-1000 No 2020-09-01
22 6 6-25 No 2020-09-03
23 8 100-500 No 2020-09-04
24 9 26-100 No 2020-09-05
(df.treatment == 'Yes').index
Produces all of the indices, therefore df.drop((df.treatment == 'Yes').index) drops all of the indices, leaving an empty dataframe.
(df.treatment == 'Yes').index
[out]:
RangeIndex(start=0, stop=25, step=1)
df.drop((df.treatment == 'Yes').index)
[out]:
Empty DataFrame
Columns: [a, groups, treatment, date]
Index: []

Create Multiple New Columns From Multiple Dictionaries

I can successfully create a single new attribute on a dataframe called df using a single dictionary as follows:
Create Precursor DataFrame mye2_men:
In [13]: mye2_men = pd.read_csv("~/03_Maps_March_2020/mye2_men.csv",index_col="Code")
...: mye2_men.head()
Out[13]:
Name Geography1 All ages 0 1 2 3 4 5 6 7 8 9 ... 78 79 80 81 82 83 84 85 86 87 88 89 90
Code ...
K02000001 UNITED KINGDOM Country 32790202 382332 395273 408684 408882 412553 421934 434333 427809 419161 414994 ... 192839 186251 175626 160475 146314 132941 116050 103669 93155 81174 68110 55652 183486
K03000001 GREAT BRITAIN Country 31864002 370474 382933 395754 396181 399764 409061 420947 414613 406062 401647 ... 188073 181546 171350 156506 142851 129815 113306 101194 91038 79342 66699 54387 179629
K04000001 ENGLAND AND WALES Country 29215251 343642 355122 366722 366885 370156 379046 389944 382853 375940 370701 ... 172046 166392 157065 143896 131207 119193 104143 93055 83798 73224 61794 50297 167009
E92000001 ENGLAND Country 27667942 327309 338368 349229 349199 352148 360688 370995 363496 356965 351790 ... 161540 156343 147733 135514 123492 112133 98000 87528 79030 69067 58264 47498 157788
E12000001 NORTH EAST Region 1305486 13992 14423 15124 15159 15542 15839 16314 16283 16068 15748 ... 8130 8108 7601 6977 6118 5723 4958 4383 3889 3360 2747 2148 6822
[5 rows x 94 columns]
Create Target DataFrame df
In [14]: df = pd.DataFrame({"A":[num for num in range(0,430)],
...: "B":[num**2 for num in range(0,430)],
...: "Code":mye2_men.index})
...: df.head()
Out[14]:
A B Code
0 0 0 K02000001
1 1 1 K03000001
2 2 4 K04000001
3 3 9 E92000001
4 4 16 E12000001
Create Dictionary To Use In Mapping:
In [15]: male_counts = mye2_men["All ages"].to_dict()
...: male_counts
Out[15]:
{'K02000001': 32790202,
'K03000001': 31864002,
'K04000001': 29215251,
'E92000001': 27667942,
'E12000001': 1305486,
'E06000047': 259299,
'E06000005': 51919,
'E06000001': 45524 ....}
Map the male_counts Dictionary to DataFrame df To Create New Column "male_count":
In [19]: # CREATE NEW male_count COLUMN IN df
...: df["male_count"] = df["Code"].map(male_counts)
...: df.head()
Out[19]:
A B Code male_count
0 0 0 K02000001 32790202
1 1 1 K03000001 31864002
2 2 4 K04000001 29215251
3 3 9 E92000001 27667942
4 4 16 E12000001 1305486
For the 2nd dictionary:
In [20]: female_counts = (mye2_men["All ages"]+10).to_dict()
...: female_counts
Out[20]:
{'K02000001': 32790212,
'K03000001': 31864012,
'K04000001': 29215261,
'E92000001': 27667952,
'E12000001': 1305496,
'E06000047': 259309,
'E06000005': 51929 ...}
I can successfully produce a second attribute called df["female_count"] by repeating step 4 above, but this time using the female_counts dictionary.
How can I create multiple new df columns (ie df["male_count"] and df["female_count"]) in a single step?
Many thanks
Note:
The mye2_men data is from the "MYE2 - Males" tab of the following excel doc:
https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fpopulationandmigration%2fpopulationestimates%2fdatasets%2fpopulationestimatesforukenglandandwalesscotlandandnorthernireland%2fmid2019april2020localauthoritydistrictcodes/ukmidyearestimates20192020ladcodes.xls
Create DataFrame from dictionaries and then use DataFrame.join:
new = pd.DataFrame({'male_count':male_counts, 'female_count':female_count})
df = df.join(new, on='Code')

Pandas - Fastest way indexing with 2 dataframes

I am developing a software in Python 3 with Pandas library.
Time is very important but memory not so much.
For better visualization I am using the names a and b with few values, although there are many more:
a -> 50000 rows
b -> 5000 rows
I need to select from dataframe a and b (using multiples conditions)
a = pd.DataFrame({
'a1': ['x', 'y', 'z'] ,
'a2': [1, 2, 3],
'a3': [3.14, 2.73, -23.00],
'a4': [pd.np.nan, pd.np.nan, pd.np.nan]
})
a
a1 a2 a3 a4
0 x 1 3.14 NaN
1 y 2 2.73 NaN
2 z 3 -23.00 NaN
b = pd.DataFrame({
'b1': ['x', 'y', 'z', 'k', 'l'],
'b2': [2018, 2019, 2020, 2015, 2012]
})
b
b1 b2
0 x 2018
1 y 2019
2 z 2020
3 k 2015
4 l 2012
So far my code is like this:
for index, row in a.iterrows():
try:
# create a key
a1 = row["a1"]
mask = b.loc[(b['b1'] == a1) & (b['b2'] != 2019)]
# check if exists
if (len(mask.index) != 0): #not empty
a.loc[[index], ['a4']] = mask.iloc[0]['b2']
except KeyError: #not found
pass
But as you can see, I'm using for iterrows that is very slow compared to other methods and I'm changing the value of the DataFrame I'm iterating, that is not recommended.
Could you help me find a better way? The results should be like this:
a
a1 a2 a3 a4
0 x 1 3.14 2018
1 y 2 2.73 NaN
2 z 3 -23.00 2020
I tried things like this below, but I didnt made it work.
a.loc[ (a['a1'] == b['b1']) , 'a4'] = b.loc[b['b2'] != 2019]
*the real code has more conditions
Thanks!
EDIT
I benchmark using: iterrows, merge, set_index/loc. Here is the code:
import timeit
import pandas as pd
def f_iterrows():
for index, row in a.iterrows():
try:
# create a key
a1 = row["a1"]
a3 = row["a3"]
mask = b.loc[(b['b1'] == a1) & (b['b2'] != 2019)]
# check if exists
if len(mask.index) != 0: # not empty
a.loc[[index], ['a4']] = mask.iloc[0]['b2']
except: # not found
pass
def f_merge():
a.merge(b[b.b2 != 2019], left_on='a1', right_on='b1', how='left').drop(['a4', 'b1'], 1).rename(columns={'b2': 'a4'})
def f_lock():
df1 = a.set_index('a1')
df2 = b.set_index('b1')
df1.loc[:, 'a4'] = df2.b2[df2.b2 != 2019]
#variables for testing
number_rows = 100
number_iter = 100
a = pd.DataFrame({
'a1': ['x', 'y', 'z'] * number_rows,
'a2': [1, 2, 3] * number_rows,
'a3': [3.14, 2.73, -23.00] * number_rows,
'a4': [pd.np.nan, pd.np.nan, pd.np.nan] * number_rows
})
b = pd.DataFrame({
'b1': ['x', 'y', 'z', 'k', 'l'] * number_rows,
'b2': [2018, 2019, 2020, 2015, 2012] * number_rows
})
print('For: %s s' % str(timeit.timeit(f_iterrows, number=number_iter)))
print('Merge: %s s' % str(timeit.timeit(f_merge, number=number_iter)))
print('Loc: %s s' % str(timeit.timeit(f_iterrows, number=number_iter)))
They all worked :) and the time to run is:
For: 277.9994369489998 s
Loc: 274.04929955067564 s
Merge: 2.195712725706926 s
So far Merge is the fastest.
If another option appears I will update here, thanks again.
IIUC
a.merge(b[b.b2!=2019],left_on='a1',right_on='b1',how='left').drop(['a4','b1'],1).rename(columns={'b2':'a4'})
Out[263]:
a1 a2 a3 a4
0 x 1 3.14 2018.0
1 y 2 2.73 NaN
2 z 3 -23.00 2020.0

Convert list of Pandas Dataframe JSON objects

I have a Dataframe with one column where each cell in the column is a JSON object.
players
0 {"name": "tony", "age": 57}
1 {"name": "peter", age": 46}
I want to convert this to a data frame as:
name age
tony 57
peter 46
Any ideas how I do this?
Note: the original JSON object looks like this...
{
"players": [{
"age": 57,
"name":"tony"
},
{
"age": 46,
"name":"peter"
}]
}
Use DataFrame constructor if types of values are dicts:
#print (type(df.loc[0, 'players']))
#<class 'str'>
#import ast
#df['players'] = df['players'].apply(ast.literal_eval)
print (type(df.loc[0, 'players']))
<class 'dict'>
df = pd.DataFrame(df['players'].values.tolist())
print (df)
age name
0 57 tony
1 46 peter
But better is use json_normalize from jsons object as suggested #jpp:
json = {
"players": [{
"age": 57,
"name":"tony"
},
{
"age": 46,
"name":"peter"
}]
}
df = json_normalize(json, 'players')
print (df)
age name
0 57 tony
1 46 peter
This can do it:
df = df['players'].apply(pd.Series)
However, it's slow:
In [20]: timeit df.players.apply(pd.Series)
1000 loops, best of 3: 824 us per loop
#jezrael suggestion is faster:
In [24]: timeit pd.DataFrame(df.players.values.tolist())
1000 loops, best of 3: 387 us per loop

Resources