I have the following dataframes:
> df1
id begin conditional confidence discoveryTechnique
0 278 56 false 0.0 1
1 421 18 false 0.0 1
> df2
concept
0 A
1 B
How do I merge on the indices to get:
id begin conditional confidence discoveryTechnique concept
0 278 56 false 0.0 1 A
1 421 18 false 0.0 1 B
I ask because it is my understanding that merge(), i.e. df1.merge(df2), uses columns to do the matching. In fact, doing this I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 4618, in merge
copy=copy, indicator=indicator)
File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 58, in merge
copy=copy, indicator=indicator)
File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 491, in __init__
self._validate_specification()
File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 812, in _validate_specification
raise MergeError('No common columns to perform merge on')
pandas.tools.merge.MergeError: No common columns to perform merge on
Is it bad practice to merge on index? Is it impossible? If so, how can I shift the index into a new column called "index"?
Use merge, which is an inner join by default:
pd.merge(df1, df2, left_index=True, right_index=True)
Or join, which is a left join by default:
df1.join(df2)
Or concat, which is an outer join by default:
pd.concat([df1, df2], axis=1)
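If you would rather merge on a column than on the index, as the end of the question asks, reset_index moves an unnamed index into a column called "index". A minimal sketch, using small stand-in frames:
import pandas as pd

df1 = pd.DataFrame({'id': [278, 421], 'begin': [56, 18]})
df2 = pd.DataFrame({'concept': ['A', 'B']})

# reset_index turns the (unnamed) RangeIndex into a column named 'index'
merged = df1.reset_index().merge(df2.reset_index(), on='index').drop('index', axis=1)
print(merged)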
Samples:
df1 = pd.DataFrame({'a':range(6),
'b':[5,3,6,9,2,4]}, index=list('abcdef'))
print (df1)
a b
a 0 5
b 1 3
c 2 6
d 3 9
e 4 2
f 5 4
df2 = pd.DataFrame({'c':range(4),
'd':[10,20,30, 40]}, index=list('abhi'))
print (df2)
c d
a 0 10
b 1 20
h 2 30
i 3 40
# Default inner join
df3 = pd.merge(df1, df2, left_index=True, right_index=True)
print (df3)
a b c d
a 0 5 0 10
b 1 3 1 20
# Default left join
df4 = df1.join(df2)
print (df4)
a b c d
a 0 5 0.0 10.0
b 1 3 1.0 20.0
c 2 6 NaN NaN
d 3 9 NaN NaN
e 4 2 NaN NaN
f 5 4 NaN NaN
# Default outer join
df5 = pd.concat([df1, df2], axis=1)
print (df5)
a b c d
a 0.0 5.0 0.0 10.0
b 1.0 3.0 1.0 20.0
c 2.0 6.0 NaN NaN
d 3.0 9.0 NaN NaN
e 4.0 2.0 NaN NaN
f 5.0 4.0 NaN NaN
h NaN NaN 2.0 30.0
i NaN NaN 3.0 40.0
You can use concat([df1, df2, ...], axis=1) in order to concatenate two or more DFs aligned by indexes:
pd.concat([df1, df2, df3, ...], axis=1)
Or merge for concatenating by custom fields / indexes:
# join by _common_ columns: `col1`, `col3`
pd.merge(df1, df2, on=['col1','col3'])
# join by: `df1.col1 == df2.index`
pd.merge(df1, df2, left_on='col1', right_index=True)
Or join for joining by index:
df1.join(df2)
By default:
join is a column-wise left join
pd.merge is a column-wise inner join
pd.concat is a row-wise outer join
pd.concat:
takes Iterable arguments. Thus, it cannot take DataFrames directly (use [df,df2])
Dimensions of DataFrame should match along axis
Join and pd.merge:
can take DataFrame arguments
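As a small illustration of those two points (an iterable for concat, multiple frames for join), here is a sketch with made-up frames:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]}, index=['x', 'y'])
df2 = pd.DataFrame({'b': [3, 4]}, index=['x', 'z'])
df3 = pd.DataFrame({'c': [5, 6]}, index=['y', 'z'])

# concat needs an iterable of frames and aligns on the index (outer join by default)
wide = pd.concat([df1, df2, df3], axis=1)

# join can also accept a list of frames (left join on df1's index by default)
wide2 = df1.join([df2, df3])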
This question has been answered for a while and all the available options are already out there. However, in this answer I'll attempt to shed a bit more light on these options to help you understand when to use what.
This post will go through the following topics:
Merging with index under different conditions
options for index-based joins: merge, join, concat
merging on indexes
merging on index of one, column of other
effectively using named indexes to simplify merging syntax
Index-based joins
TL;DR
There are a few options, some simpler than others depending on the use
case.
DataFrame.merge with left_index and right_index (or left_on and right_on using named indexes)
DataFrame.join (joins on index)
pd.concat (joins on index)
merge
• PROS: supports inner/left/right/full joins; supports column-column, index-column, and index-index joins
• CONS: can only join two frames at a time

join
• PROS: supports inner/left (default)/right/full joins; can join multiple DataFrames at a time
• CONS: only supports index-index joins

concat
• PROS: specializes in joining multiple DataFrames at a time; very fast (concatenation is linear time)
• CONS: only supports inner/full (default) joins; only supports index-index joins
Index to index joins
Typically, an inner join on index would look like this:
left.merge(right, left_index=True, right_index=True)
Other types of joins (left, right, outer) follow similar syntax (and can be controlled using how=...).
Notable Alternatives
DataFrame.join defaults to a left outer join on the index.
left.join(right, how='inner')
If you happen to get ValueError: columns overlap but no suffix specified, you will need to specify the lsuffix= and rsuffix= arguments to resolve this. Since the column names are the same, a differentiating suffix is required.
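A minimal sketch of that situation, with made-up frames that share a column name:
import pandas as pd

left = pd.DataFrame({'value': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'value': [3, 4]}, index=['a', 'b'])

# both frames have a 'value' column, so suffixes are required for the join
joined = left.join(right, how='inner', lsuffix='_left', rsuffix='_right')
# resulting columns: value_left, value_right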
pd.concat joins on the index and can join two or more DataFrames at once. It does a full outer join by default.
pd.concat([left, right], axis=1, sort=False)
For more information on concat, see this post.
Index to Column joins
To perform an inner join using the index of the left frame and a column of the right frame, you will use DataFrame.merge with a combination of left_index=True and right_on=....
left.merge(right, left_index=True, right_on='key')
Other joins follow a similar structure. Note that only merge can perform index to column joins. You can join on multiple levels/columns, provided the number of index levels on the left equals the number of columns on the right.
join and concat are not capable of mixed merges. You will need to set the index as a pre-step using DataFrame.set_index.
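For example, a sketch of emulating an index-to-column join with join by first moving the right frame's key into its index (made-up frames):
import pandas as pd

left = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'key': ['a', 'b'], 'y': [10, 20]})

# set_index as a pre-step, then an ordinary index-to-index join
result = left.join(right.set_index('key'))
# roughly equivalent to left.merge(right, left_index=True, right_on='key'),
# up to how the resulting index is labelled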
This post is an abridged version of my work in Pandas Merging 101. Please follow this link for more examples and other topics on merging.
A silly bug that got me: the joins failed because index dtypes differed. This was not obvious as both tables were pivot tables of the same original table. After reset_index, the indices looked identical in Jupyter. It only came to light when saving to Excel...
I fixed it with: df1[['key']] = df1[['key']].apply(pd.to_numeric)
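A small illustrative sketch of the failure mode (the frames here are made up): the indices print identically but have different dtypes, so an index join matches nothing until one side is coerced.
import pandas as pd

df1 = pd.DataFrame({'val': [10, 20]}, index=pd.Index(['1', '2']))  # string labels
df2 = pd.DataFrame({'other': [1, 2]}, index=pd.Index([1, 2]))      # integer labels

print(df1.index.dtype, df2.index.dtype)   # object vs int64

# coerce one side so the key dtypes agree before joining
df1.index = df1.index.astype('int64')
print(df1.join(df2))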
Hopefully this saves somebody an hour!
If you want to join two dataframes in Pandas, you can simply use the available functions like merge or concat.
For example, if I have two dataframes df1 and df2, I can join them by:
newdataframe = pd.merge(df1, df2, left_index=True, right_index=True)
You can try these few ways to merge/join your dataframe.
merge (inner join by default)
df = pd.merge(df1, df2, left_index=True, right_index=True)
join (left join by default)
df = df1.join(df2)
concat (outer join by default)
df = pd.concat([df1, df2], axis=1)
Related
I need to forward fill values in a column of a dataframe within groups. I should note that the first value in a group is never missing by construction. I have the following solutions at the moment.
df = pd.DataFrame({'a': [1,1,2,2,2], 'b': [1, np.nan, 2, np.nan, np.nan]})
# desired output
a b
1 1
1 1
2 2
2 2
2 2
Here are the three solutions that I've tried so far.
# really slow solutions
df['b'] = df.groupby('a')['b'].transform(lambda x: x.fillna(method='ffill'))
df['b'] = df.groupby('a')['b'].fillna(method='ffill')
# much faster solution, but more memory intensive and ugly all around
tmp = df.drop_duplicates('a', keep='first')
df.drop('b', inplace=True, axis=1)
df = df.merge(tmp, on='a')
All three of these produce my desired output, but the first two take a really long time on my data set, and the third solution is more memory intensive and feels rather clunky. Are there any other ways to forward fill a column?
You need to sort by both columns df.sort_values(['a', 'b']).ffill() to ensure robustness. If an np.nan is left in the first position within a group, ffill will fill that with a value from the prior group. Because np.nan will be placed at the end of any sort, sorting by both a and b ensures that you will not have np.nan at the front of any group. You can then .loc or .reindex with the initial index to get back your original order.
This will obviously be a tad slower than the other proposals... However, I contend it will be correct where the others are not.
demo
Consider the dataframe df
df = pd.DataFrame({'a': [1,1,2,2,2], 'b': [1, np.nan, np.nan, 2, np.nan]})
print(df)
a b
0 1 1.0
1 1 NaN
2 2 NaN
3 2 2.0
4 2 NaN
Try
df.sort_values('a').ffill()
a b
0 1 1.0
1 1 1.0
2 2 1.0 # <--- this is incorrect
3 2 2.0
4 2 2.0
Instead do
df.sort_values(['a', 'b']).ffill().loc[df.index]
a b
0 1 1.0
1 1 1.0
2 2 2.0
3 2 2.0
4 2 2.0
special note
This is still incorrect if an entire group has missing values
Using ffill() directly will give the best results. Here is the comparison
%timeit df.b.ffill(inplace = True)
best of 3: 311 µs per loop
%timeit df['b'] = df.groupby('a')['b'].transform(lambda x: x.fillna(method='ffill'))
best of 3: 2.34 ms per loop
%timeit df['b'] = df.groupby('a')['b'].fillna(method='ffill')
best of 3: 4.41 ms per loop
What about this?
df.groupby('a').b.transform('ffill')
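For completeness, a sketch of assigning that back onto the frame from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2, 2], 'b': [1, np.nan, 2, np.nan, np.nan]})

# 'ffill' as a named transform forward-fills within each group of 'a'
df['b'] = df.groupby('a')['b'].transform('ffill')
print(df)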
I have the following dataframes:
df1=pd.DataFrame({'fr':[1,2,3],'to':[4,5,6],'R':[0.1,0.2,0.3]})
df2=pd.DataFrame({'fr':[1,5,3],'to':[4,2,6],'X':[0.4,0.5,0.6]})
I would like to merge these two dataframes on fr and to, ignoring the order of fr and to, i.e., (2,5) is the same as (5,2). The desired output is:
dfO=pd.DataFrame({'fr':[1,2,3],'to':[4,5,6],'R':[0.1,0.2,0.3],'X':[0.4,0.5,0.6]})
or
dfO=pd.DataFrame({'fr':[1,5,3],'to':[4,2,6],'R':[0.1,0.2,0.3],'X':[0.4,0.5,0.6]})
I can do the following:
pd.merge(df1,df2,on=['fr','to'],how='left')
However, as expected, the X value of the second row is NaN.
Thank you for your help.
You need to do a numpy sort first:
df1[['fr','to']] = np.sort(df1[['fr','to']].values,1)
df2[['fr','to']] = np.sort(df2[['fr','to']].values,1)
out = df1.merge(df2,how='left')
out
Out[44]:
fr to R X
0 1 4 0.1 0.4
1 2 5 0.2 0.5
2 3 6 0.3 0.6
You can create a temp field and then join on it
df1['tmp'] = df1.apply(lambda x: ','.join(sorted([str(x.fr), str(x.to)])), axis=1)
df2['tmp'] = df2.apply(lambda x: ','.join(sorted([str(x.fr), str(x.to)])), axis=1)
This will give the result that you expect
pd.merge(df1,df2[['tmp', 'X']],on=['tmp'], how='left').drop(columns=['tmp'])
I am trying to create a dataframe in which I have Timestamps as the index, but it throws an error. I can use the same methodology to create a dataframe when the index is not a Timestamp. The following code is a bare minimum example.
Works fine:
pd.DataFrame.from_dict({'1':{'a':1,'b':2,'c':3},'2':{'a':1,'c':4},'3':{'b':6}})
output:
1 2 3
a 1 1.0 NaN
b 2 NaN 6.0
c 3 4.0 NaN
BREAKS
o=np.arange(np.datetime64('2017-11-01 00:00:00'),np.datetime64('2017-11-01 00:00:00')+np.timedelta64(3,'D'),np.timedelta64(1,'D'))
pd.DataFrame.from_records({o[0]:{'a':1,'b':2,'c':3},o[1]:{'a':1,'c':4},o[2]:{'b':6}})
output:
KeyError Traceback (most recent call last)
<ipython-input-627-f9a075f611c0> in <module>
1 o=np.arange(np.datetime64('2017-11-01 00:00:00'),np.datetime64('2017-11-01 00:00:00')+np.timedelta64(3,'D'),np.timedelta64(1,'D'))
2
----> 3 pd.DataFrame.from_records({o[0]:{'a':1,'b':2,'c':3},o[1]:{'a':1,'c':4},o[2]:{'b':6}})
~/anaconda3/envs/dfs/lib/python3.6/site-packages/pandas/core/frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
1617 if columns is None:
1618 columns = arr_columns = ensure_index(sorted(data))
-> 1619 arrays = [data[k] for k in columns]
1620 else:
1621 arrays = []
~/anaconda3/envs/dfs/lib/python3.6/site-packages/pandas/core/frame.py in <listcomp>(.0)
1617 if columns is None:
1618 columns = arr_columns = ensure_index(sorted(data))
-> 1619 arrays = [data[k] for k in columns]
1620 else:
1621 arrays = []
KeyError: Timestamp('2017-11-01 00:00:00')
Please help me understand the behavior and what I am missing. Also, how do I go about creating a dataframe from records that has Timestamps as indices?
Change from_records to from_dict (just like in your working example) and everything executes fine.
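A minimal sketch of that fix, reusing the o array from the question:
import numpy as np
import pandas as pd

o = np.arange(np.datetime64('2017-11-01 00:00:00'),
              np.datetime64('2017-11-01 00:00:00') + np.timedelta64(3, 'D'),
              np.timedelta64(1, 'D'))

# from_dict accepts the Timestamp-keyed outer dict directly:
# each outer key becomes a column, each inner key a row label
df = pd.DataFrame.from_dict({o[0]: {'a': 1, 'b': 2, 'c': 3},
                             o[1]: {'a': 1, 'c': 4},
                             o[2]: {'b': 6}})
print(df)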
Another, optional hint: since you are creating a pandas DataFrame, use the pandas-native way to create the datetime values used as column names:
o = pd.date_range(start='2017-11-01', periods=3)
Edit
I noticed that if you create the o object the way I proposed (as a date_range), you can even use from_records.
Edit 2
You wrote that you want datetime objects as the index, whereas
your code attempts to set them as column names.
If you want datetime objects as the index, run something like:
df = pd.DataFrame.from_records({'1': {o[0]: 1, o[1]: 2, o[2]: 3},
                                '2': {o[0]: 1, o[2]: 4}, '3': {o[1]: 6}})
The result is:
1 2 3
2017-11-01 1 1.0 NaN
2017-11-02 2 NaN 6.0
2017-11-03 3 4.0 NaN
Another way to create the above result is:
df = pd.DataFrame.from_records([{'1':1, '2':1}, {'1':2, '3':6}, {'1':3, '2':4}], index=o)
I have several csv files with approximately the following structure:
name,title,status,1,2,3
name,title,status,4,5,6
name,title,status,7,8,9
Most of the name column values are the same in all files; only the columns 1,2,3,4... are different.
I need to add the new columns in turn to existing and new rows, as well as update the remaining rows each time.
For example, I have 2 tables:
name,title,status,1,2,3
Foo,Bla-bla-bla,10,45.6,12.3,45.2
Bar,Too-too,13,13.4,22.6,75.1
name,title,status,4,5,6
Foo,Bla-bla-bla,14,25.3,125.3,5.2
Fobo,Dom-dom,20,53.4,2.9,11.3
And at the output I expect a table:
name,title,status,1,2,3,4,5,6
Foo,Bla-bla-bla,14,45.6,12.3,45.2,25.3,125.3,5.2
Bar,Too-too,13,13.4,22.6,75.1,,,
Fobo,Dom-dom,20,,,,53.4,2.9,11.3
I did not find anything similar; can someone tell me how I can do this?
It looks like you want to keep just one version of ['name', 'title', 'status'] and from your example, you prefer to keep the last 'status' encountered.
I'd use pd.concat and follow that up with a groupby to filter out duplicate status.
df = pd.concat([
pd.read_csv(fp, index_col=['name', 'title', 'status'])
for fp in ['data1.csv', 'data2.csv']
], axis=1).reset_index('status').groupby(level=['name', 'title']).last()
df
status 1 2 3 4 5 6
name title
Bar Too-too 13 13.4 22.6 75.1 NaN NaN NaN
Fobo Dom-dom 20 NaN NaN NaN 53.4 2.9 11.3
Foo Bla-bla-bla 14 45.6 12.3 45.2 25.3 125.3 5.2
Then df.to_csv() produces
name,title,status,1,2,3,4,5,6
Bar,Too-too,13,13.4,22.6,75.1,,,
Fobo,Dom-dom,20,,,,53.4,2.9,11.3
Foo,Bla-bla-bla,14,45.6,12.3,45.2,25.3,125.3,5.2
Keep merging them:
df = None
for path in ['data1.csv', 'data2.csv']:
sub_df = pd.read_csv(path)
if df is None:
df = sub_df
else:
df = df.merge(sub_df, on=['name', 'title', 'status'], how='outer')
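The same idea can be sketched more compactly with functools.reduce (file names as in the example above):
import functools
import pandas as pd

paths = ['data1.csv', 'data2.csv']
frames = [pd.read_csv(p) for p in paths]

# outer-merge all frames pairwise on the shared key columns
df = functools.reduce(
    lambda left, right: left.merge(right, on=['name', 'title', 'status'], how='outer'),
    frames)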
I have a very large dataframe (around 1 million rows) with data from an experiment (60 respondents).
I would like to split the dataframe into 60 dataframes (a dataframe for each participant).
In the dataframe, data, there is a variable called 'name', which is the unique code for each participant.
I have tried the following, but nothing happens (or execution does not stop within an hour). What I intend to do is to split the data into smaller dataframes, and append these to a list (datalist):
import pandas as pd
def splitframe(data, name='name'):
    n = data[name][0]
    df = pd.DataFrame(columns=data.columns)
    datalist = []
    for i in range(len(data)):
        if data[name][i] == n:
            df = df.append(data.iloc[i])
        else:
            datalist.append(df)
            df = pd.DataFrame(columns=data.columns)
            n = data[name][i]
            df = df.append(data.iloc[i])
    return datalist
I do not get an error message, the script just seems to run forever!
Is there a smart way to do it?
Can I ask why you don't just do it by slicing the data frame? Something like:
#create some data with Names column
data = pd.DataFrame({'Names': ['Joe', 'John', 'Jasper', 'Jez'] *4, 'Ob1' : np.random.rand(16), 'Ob2' : np.random.rand(16)})
#create unique list of names
UniqueNames = data.Names.unique()
#create a data frame dictionary to store your data frames
DataFrameDict = {elem : pd.DataFrame() for elem in UniqueNames}
for key in DataFrameDict.keys():
    DataFrameDict[key] = data[:][data.Names == key]
Hey presto you have a dictionary of data frames just as (I think) you want them. Need to access one? Just enter
DataFrameDict['Joe']
Firstly, your approach is inefficient because appending to the list on a row-by-row basis will be slow, as the list has to be periodically grown when there is insufficient space for a new entry. List comprehensions are better in this respect, as the size is determined up front and allocated once.
However, I think fundamentally your approach is a little wasteful as you have a dataframe already so why create a new one for each of these users?
I would sort the dataframe by column 'name', set the index to be this and if required not drop the column.
Then generate a list of all the unique entries; you can then perform a lookup using these entries and, crucially, if you are only querying the data, use the selection criteria to return a view on the dataframe without incurring a costly data copy.
Use pandas.DataFrame.sort_values and pandas.DataFrame.set_index:
# sort the dataframe
df.sort_values(by='name', inplace=True)
# set the index to be this and don't drop
df.set_index(keys=['name'], drop=False,inplace=True)
# get a list of names
names=df['name'].unique().tolist()
# now we can perform a lookup on a 'view' of the dataframe
joe = df.loc[df.name=='joe']
# now you can query all 'joes'
You can convert the groupby object to tuples and then to a dict:
df = pd.DataFrame({'Name':list('aabbef'),
'A':[4,5,4,5,5,4],
'B':[7,8,9,4,2,3],
'C':[1,3,5,7,1,0]}, columns = ['Name','A','B','C'])
print (df)
Name A B C
0 a 4 7 1
1 a 5 8 3
2 b 4 9 5
3 b 5 4 7
4 e 5 2 1
5 f 4 3 0
d = dict(tuple(df.groupby('Name')))
print (d)
{'b': Name A B C
2 b 4 9 5
3 b 5 4 7, 'e': Name A B C
4 e 5 2 1, 'a': Name A B C
0 a 4 7 1
1 a 5 8 3, 'f': Name A B C
5 f 4 3 0}
print (d['a'])
Name A B C
0 a 4 7 1
1 a 5 8 3
It is not recommended, but it is possible to create DataFrames by groups:
for i, g in df.groupby('Name'):
    globals()['df_' + str(i)] = g
print (df_a)
Name A B C
0 a 4 7 1
1 a 5 8 3
Easy:
[v for k, v in df.groupby('name')]
Groupby can help you:
grouped = data.groupby(['name'])
Then you can work with each group like with a dataframe for each participant. DataFrameGroupBy object methods such as apply, transform, aggregate, head, first, and last return a DataFrame object.
Or you can make a list from grouped and get all the DataFrames by index:
l_grouped = list(grouped)
l_grouped[0][1] is the DataFrame for the first group (the first name).
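A small sketch of both access patterns, with made-up data:
import pandas as pd

data = pd.DataFrame({'name': ['Joe', 'Joe', 'Ann'], 'score': [1, 2, 3]})
grouped = data.groupby('name')

# pull one participant's rows directly by the group key
joe_df = grouped.get_group('Joe')

# or materialize all groups as (name, DataFrame) pairs
l_grouped = list(grouped)
first_name, first_df = l_grouped[0]   # l_grouped[0][1] is the first group's DataFrame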
In addition to Gusev Slava's answer, you might want to use groupby's groups:
{key: df.loc[value] for key, value in df.groupby("name").groups.items()}
This will yield a dictionary with the keys you have grouped by, pointing to the corresponding partitions. The advantage is that the keys are maintained and don't vanish in the list index.
The method in the OP works, but isn't efficient. It may have seemed to run forever, because the dataset was long.
Use .groupby on the 'method' column, and create a dict of DataFrames with unique 'method' values as the keys, with a dict-comprehension.
.groupby returns a groupby object that contains information about the groups, where g is the unique value in 'method' for each group, and d is the DataFrame for that group.
The value of each key in df_dict will be a DataFrame, which can be accessed in the standard way, df_dict['key'].
The original question wanted a list of DataFrames, which can be done with a list-comprehension
df_list = [d for _, d in df.groupby('method')]
import pandas as pd
import seaborn as sns # for test dataset
# load data for example
df = sns.load_dataset('planets')
# display(df.head())
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009
# Using a dict-comprehension, the unique 'method' value will be the key
df_dict = {g: d for g, d in df.groupby('method')}
print(df_dict.keys())
[out]:
dict_keys(['Astrometry', 'Eclipse Timing Variations', 'Imaging', 'Microlensing', 'Orbital Brightness Modulation', 'Pulsar Timing', 'Pulsation Timing Variations', 'Radial Velocity', 'Transit', 'Transit Timing Variations'])
# or a specific name for the key, using enumerate (e.g. df0, df1, etc.)
df_dict = {f'df{i}': d for i, (g, d) in enumerate(df.groupby('method'))}
print(df_dict.keys())
[out]:
dict_keys(['df0', 'df1', 'df2', 'df3', 'df4', 'df5', 'df6', 'df7', 'df8', 'df9'])
df_dict['df0'].head(3) or df_dict['Astrometry'].head(3)
There are only 2 in this group
method number orbital_period mass distance year
113 Astrometry 1 246.36 NaN 20.77 2013
537 Astrometry 1 1016.00 NaN 14.98 2010
df_dict['df1'].head(3) or df_dict['Eclipse Timing Variations'].head(3)
method number orbital_period mass distance year
32 Eclipse Timing Variations 1 10220.0 6.05 NaN 2009
37 Eclipse Timing Variations 2 5767.0 NaN 130.72 2008
38 Eclipse Timing Variations 2 3321.0 NaN 130.72 2008
df_dict['df2'].head(3) or df_dict['Imaging'].head(3)
method number orbital_period mass distance year
29 Imaging 1 NaN NaN 45.52 2005
30 Imaging 1 NaN NaN 165.00 2007
31 Imaging 1 NaN NaN 140.00 2004
For more information about the seaborn datasets
NASA Exoplanets
Alternatively
This is a manual method to create separate DataFrames using pandas: Boolean Indexing
This is similar to the accepted answer, but .loc is not required.
This is an acceptable method for creating a couple extra DataFrames.
The pythonic way to create multiple objects is by placing them in a container (e.g. dict, list, generator, etc.), as shown above.
df1 = df[df.method == 'Astrometry']
df2 = df[df.method == 'Eclipse Timing Variations']
In [28]: df = DataFrame(np.random.randn(1000000,10))
In [29]: df
Out[29]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 10 columns):
0 1000000 non-null values
1 1000000 non-null values
2 1000000 non-null values
3 1000000 non-null values
4 1000000 non-null values
5 1000000 non-null values
6 1000000 non-null values
7 1000000 non-null values
8 1000000 non-null values
9 1000000 non-null values
dtypes: float64(10)
In [30]: frames = [ df.iloc[i*60:min((i+1)*60,len(df))] for i in xrange(int(len(df)/60.) + 1) ]
In [31]: %timeit [ df.iloc[i*60:min((i+1)*60,len(df))] for i in xrange(int(len(df)/60.) + 1) ]
1 loops, best of 3: 849 ms per loop
In [32]: len(frames)
Out[32]: 16667
Here's a groupby way (and you could do an arbitrary apply rather than sum)
In [9]: g = df.groupby(lambda x: x/60)
In [8]: g.sum()
Out[8]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16667 entries, 0 to 16666
Data columns (total 10 columns):
0 16667 non-null values
1 16667 non-null values
2 16667 non-null values
3 16667 non-null values
4 16667 non-null values
5 16667 non-null values
6 16667 non-null values
7 16667 non-null values
8 16667 non-null values
9 16667 non-null values
dtypes: float64(10)
Sum is cythonized, which is why this is so fast:
In [10]: %timeit g.sum()
10 loops, best of 3: 27.5 ms per loop
In [11]: %timeit df.groupby(lambda x: x/60)
1 loops, best of 3: 231 ms per loop
A method based on list comprehension and groupby, which stores all the split dataframes in a list variable that can be accessed using the index.
Example
ans = [pd.DataFrame(y) for x, y in DF.groupby('column_name', as_index=False)]
ans[0]
ans[0].column_name
You can use the groupby command, if you already have some labels for your data.
out_list = [group[1] for group in in_series.groupby(label_series.values)]
Here's a detailed example:
Let's say we want to partition a pd.Series using some labels into a list of chunks.
For example, in_series is:
2019-07-01 08:00:00 -0.10
2019-07-01 08:02:00 1.16
2019-07-01 08:04:00 0.69
2019-07-01 08:06:00 -0.81
2019-07-01 08:08:00 -0.64
Length: 5, dtype: float64
And its corresponding label_series is:
2019-07-01 08:00:00 1
2019-07-01 08:02:00 1
2019-07-01 08:04:00 2
2019-07-01 08:06:00 2
2019-07-01 08:08:00 2
Length: 5, dtype: float64
Run
out_list = [group[1] for group in in_series.groupby(label_series.values)]
which returns out_list, a list of two pd.Series:
[2019-07-01 08:00:00 -0.10
2019-07-01 08:02:00 1.16
Length: 2, dtype: float64,
2019-07-01 08:04:00 0.69
2019-07-01 08:06:00 -0.81
2019-07-01 08:08:00 -0.64
Length: 3, dtype: float64]
Note that you can use some parameters from in_series itself to group the series, e.g., in_series.index.day
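For example, a sketch of grouping by the Series' own index attribute (synthetic data):
import numpy as np
import pandas as pd

idx = pd.date_range('2019-07-01 08:00', periods=6, freq='12H')
in_series = pd.Series(np.random.randn(6), index=idx)

# group on an attribute of the index instead of an external label series
out_list = [group for _, group in in_series.groupby(in_series.index.day)]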
Here's a small function which might help some (efficiency is probably not perfect, but it's compact and more or less easy to understand):
def get_splited_df_dict(df: 'pd.DataFrame', split_column: 'str'):
    """
    splits a pandas.DataFrame on split_column and returns it as a dict
    """
    df_dict = {value: df[df[split_column] == value].drop(split_column, axis=1)
               for value in df[split_column].unique()}
    return df_dict
It converts a DataFrame to multiple DataFrames by selecting each unique value in the given column and putting all those entries into a separate DataFrame.
The .drop(split_column, axis=1) is just for removing the column that was used to split the DataFrame. The removal is not necessary, but it can help a little to cut down on memory usage after the operation.
The result of get_splited_df_dict is a dict, meaning one can access each DataFrame like this:
splitted = get_splited_df_dict(some_df, some_column)
# accessing the DataFrame with 'some_column_value'
splitted[some_column_value]
The existing answers cover all the good cases and explain fairly well how the groupby object is like a dictionary with keys and values that can be accessed via .groups. Yet more methods to do the same job as the existing answers are:
Create a list by unpacking the groupby object and casting it to a dictionary:
dict([*df.groupby('Name')]) # same as dict(list(df.groupby('Name')))
Create a tuple + dict (this is the same as #jezrael's answer):
dict((*df.groupby('Name'),))
If we only want the DataFrames, we could get the values of the dictionary (created above):
[*dict([*df.groupby('Name')]).values()]
I had a similar problem: a time series of daily sales for 10 different stores and 50 different items. I needed to split the original dataframe into 500 dataframes (10 stores * 50 items) to apply machine learning models to each of them, and I couldn't do it manually.
This is the head of the dataframe:
I have created two lists:
one for the names of the dataframes,
and one for the pairs [item_number, store_number].
list = []
for i in range(1, len(items)*len(stores)+1):
    global list
    list.append('df' + str(i))

list_couple_s_i = []
for item in items:
    for store in stores:
        global list_couple_s_i
        list_couple_s_i.append([item, store])
And once the two lists are ready you can loop on them to create the dataframes you want:
for name, it_st in zip(list, list_couple_s_i):
    globals()[name] = df.where((df['item'] == it_st[0]) &
                               (df['store'] == it_st[1]))
    globals()[name].dropna(inplace=True)
In this way I have created 500 dataframes.
Hope this will be helpful!
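For reference, the same 500-way split can usually be expressed without globals() by keying a dict on the (item, store) pair, in the spirit of the groupby answers above. A sketch with made-up columns:
import pandas as pd

df = pd.DataFrame({'item': [1, 1, 2], 'store': [1, 2, 1], 'sales': [10, 20, 30]})

# one DataFrame per (item, store) pair, keyed by that pair
frames = {key: grp for key, grp in df.groupby(['item', 'store'])}
frames[(1, 2)]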