Flatten json column in Python [duplicate] - python-3.x

I have data saved in a postgreSQL database. I am querying this data using Python2.7 and turning it into a Pandas DataFrame. However, the last column of this dataframe has a dictionary of values inside it. The DataFrame df looks like this:
Station ID Pollutants
8809 {"a": "46", "b": "3", "c": "12"}
8810 {"a": "36", "b": "5", "c": "8"}
8811 {"b": "2", "c": "7"}
8812 {"c": "11"}
8813 {"a": "82", "c": "15"}
I need to split this column into separate columns, so that the DataFrame df2 looks like this:
Station ID a b c
8809 46 3 12
8810 36 5 8
8811 NaN 2 7
8812 NaN NaN 11
8813 82 NaN 15
The major issue I'm having is that the dictionaries are not all the same length. But each one only contains up to the same 3 keys: 'a', 'b', and 'c', and they always appear in the same order ('a' first, 'b' second, 'c' third).
The following code USED to work and return exactly what I wanted (df2).
objs = [df, pandas.DataFrame(df['Pollutant Levels'].tolist()).iloc[:, :3]]
df2 = pandas.concat(objs, axis=1).drop('Pollutant Levels', axis=1)
print(df2)
I was running this code just last week and it was working fine. But now my code is broken and I get this error from line [4]:
IndexError: out-of-bounds on slice (end)
I made no changes to the code but am now getting the error. I feel this is due to my method not being robust or proper.
Any suggestions or guidance on how to split this column of lists into separate columns would be super appreciated!
EDIT: I think the .tolist() and .apply methods are not working on my code because it is one Unicode string, i.e.:
#My data format
u{'a': '1', 'b': '2', 'c': '3'}
#and not
{u'a': '1', u'b': '2', u'c': '3'}
The data is imported from the postgreSQL database in this format. Any help or ideas with this issue? Is there a way to convert the Unicode?

To convert the string to an actual dict, you can do df['Pollutant Levels'].map(eval). Afterwards, the solution below can be used to convert the dict to different columns.
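If you'd rather not run eval on data pulled from a database, a safer sketch of the same step (assuming every cell really is a stringified dict) uses ast.literal_eval instead:
from ast import literal_eval
# parse each stringified dict without executing arbitrary code
df['Pollutant Levels'] = df['Pollutant Levels'].map(literal_eval)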
Using a small example, you can use .apply(pd.Series):
In [2]: df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})
In [3]: df
Out[3]:
a b
0 1 {u'c': 1}
1 2 {u'd': 3}
2 3 {u'c': 5, u'd': 6}
In [4]: df['b'].apply(pd.Series)
Out[4]:
c d
0 1.0 NaN
1 NaN 3.0
2 5.0 6.0
To combine it with the rest of the dataframe, you can concat the other columns with the above result:
In [7]: pd.concat([df.drop(['b'], axis=1), df['b'].apply(pd.Series)], axis=1)
Out[7]:
a c d
0 1 1.0 NaN
1 2 NaN 3.0
2 3 5.0 6.0
Using your code, this also works if I leave out the iloc part:
In [15]: pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)
Out[15]:
a c d
0 1 1.0 NaN
1 2 NaN 3.0
2 3 5.0 6.0

I know the question is quite old, but I got here searching for answers. There is actually a better (and faster) way now of doing this using json_normalize:
import pandas as pd
df2 = pd.json_normalize(df['Pollutant Levels'])
This avoids costly apply functions...
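As a minimal sketch of the full path (assuming the column still holds stringified dicts from the database and the index is the default 0..n-1 range), the parse, normalize, and join sequence could look like this:
from ast import literal_eval
import pandas as pd
# parse the strings, flatten the dicts into columns, then join back onto df
parsed = df['Pollutant Levels'].map(literal_eval)
df2 = df.drop(columns=['Pollutant Levels']).join(pd.json_normalize(parsed))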

The fastest method to normalize a column of flat, one-level dicts, as per the timing analysis performed by Shijith in this answer:
df.join(pd.DataFrame(df.pop('Pollutants').values.tolist()))
It will not resolve other issues with columns of lists or dicts that are addressed below, such as rows with NaN or nested dicts.
pd.json_normalize(df.Pollutants) is significantly faster than df.Pollutants.apply(pd.Series)
See the %%timeit below. For 1M rows, .json_normalize is 47 times faster than .apply.
Whether reading data from a file, or from an object returned by a database, or API, it may not be clear if the dict column has dict or str type.
If the dictionaries in the column are str type, they must be converted back to a dict type, using ast.literal_eval, or json.loads(…).
Use pd.json_normalize to convert the dicts, with keys as headers and values for rows.
There are additional parameters (e.g. record_path & meta) for dealing with nested dicts.
Use pandas.DataFrame.join to combine the original DataFrame, df, with the columns created using pd.json_normalize
If the index isn't integers (as in the example), first use df.reset_index() to get an index of integers, before doing the normalize and join.
pandas.DataFrame.pop is used to remove the specified column from the existing dataframe. This removes the need to drop the column later, using pandas.DataFrame.drop.
As a note, if the column has any NaN, they must be filled with an empty dict
df.Pollutants = df.Pollutants.fillna({i: {} for i in df.index})
If the 'Pollutants' column is strings, use '{}'.
Also see How to json_normalize a column with NaNs.
import pandas as pd
from ast import literal_eval
import numpy as np
data = {'Station ID': [8809, 8810, 8811, 8812, 8813, 8814],
'Pollutants': ['{"a": "46", "b": "3", "c": "12"}', '{"a": "36", "b": "5", "c": "8"}', '{"b": "2", "c": "7"}', '{"c": "11"}', '{"a": "82", "c": "15"}', np.nan]}
df = pd.DataFrame(data)
# display(df)
Station ID Pollutants
0 8809 {"a": "46", "b": "3", "c": "12"}
1 8810 {"a": "36", "b": "5", "c": "8"}
2 8811 {"b": "2", "c": "7"}
3 8812 {"c": "11"}
4 8813 {"a": "82", "c": "15"}
5 8814 NaN
# check the type of the first value in Pollutants
>>> print(type(df.iloc[0, 1]))
<class 'str'>
# replace NaN with '{}' if the column is strings, otherwise replace with {}
df.Pollutants = df.Pollutants.fillna('{}') # if the NaN is in a column of strings
# df.Pollutants = df.Pollutants.fillna({i: {} for i in df.index}) # if the column is not strings
# Convert the column of stringified dicts to dicts
# skip this line, if the column contains dicts
df.Pollutants = df.Pollutants.apply(literal_eval)
# reset the index if the index is not unique integers from 0 to n-1
# df.reset_index(inplace=True) # uncomment if needed
# remove and normalize the column of dictionaries, and join the result to df
df = df.join(pd.json_normalize(df.pop('Pollutants')))
# display(df)
Station ID a b c
0 8809 46 3 12
1 8810 36 5 8
2 8811 NaN 2 7
3 8812 NaN NaN 11
4 8813 82 NaN 15
5 8814 NaN NaN NaN
%%timeit
# dataframe with 1M rows
dfb = pd.concat([df]*20000).reset_index(drop=True)
%%timeit
dfb.join(pd.json_normalize(dfb.Pollutants))
[out]:
46.9 ms ± 201 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
pd.concat([dfb.drop(columns=['Pollutants']), dfb.Pollutants.apply(pd.Series)], axis=1)
[out]:
7.75 s ± 52.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Try this: the data returned from SQL has to be converted into a dict. Or could it be that "Pollutant Levels" is now 'Pollutants'?
StationID Pollutants
0 8809 {"a":"46","b":"3","c":"12"}
1 8810 {"a":"36","b":"5","c":"8"}
2 8811 {"b":"2","c":"7"}
3 8812 {"c":"11"}
4 8813 {"a":"82","c":"15"}
df2["Pollutants"] = df2["Pollutants"].apply(lambda x : dict(eval(x)) )
df3 = df2["Pollutants"].apply(pd.Series )
a b c
0 46 3 12
1 36 5 8
2 NaN 2 7
3 NaN NaN 11
4 82 NaN 15
result = pd.concat([df, df3], axis=1).drop('Pollutants', axis=1)
result
StationID a b c
0 8809 46 3 12
1 8810 36 5 8
2 8811 NaN 2 7
3 8812 NaN NaN 11
4 8813 82 NaN 15

I strongly recommend this method to extract the column 'Pollutants':
df_pollutants = pd.DataFrame(df['Pollutants'].values.tolist(), index=df.index)
it's much faster than
df_pollutants = df['Pollutants'].apply(pd.Series)
when the size of df is giant.
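A short follow-up sketch (assuming the 'Pollutants' column already holds real dicts): passing index=df.index keeps the new columns aligned with the original rows, so they can simply be joined back:
df_pollutants = pd.DataFrame(df['Pollutants'].values.tolist(), index=df.index)
df2 = df.drop(columns=['Pollutants']).join(df_pollutants)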

Merlin's answer is better and super easy, but we don't need a lambda function. The evaluation of the dictionary can be safely skipped in either of the following two ways, as illustrated below:
Way 1: Two steps
# step 1: convert the `Pollutants` column to Pandas dataframe series
df_pol_ps = data_df['Pollutants'].apply(pd.Series)
df_pol_ps:
a b c
0 46 3 12
1 36 5 8
2 NaN 2 7
3 NaN NaN 11
4 82 NaN 15
# step 2: concat columns `a, b, c` and drop/remove the `Pollutants`
df_final = pd.concat([df, df_pol_ps], axis = 1).drop('Pollutants', axis = 1)
df_final:
StationID a b c
0 8809 46 3 12
1 8810 36 5 8
2 8811 NaN 2 7
3 8812 NaN NaN 11
4 8813 82 NaN 15
Way 2: The above two steps can be combined in one go:
df_final = pd.concat([df, df['Pollutants'].apply(pd.Series)], axis = 1).drop('Pollutants', axis = 1)
df_final:
StationID a b c
0 8809 46 3 12
1 8810 36 5 8
2 8811 NaN 2 7
3 8812 NaN NaN 11
4 8813 82 NaN 15

Note: for dictionaries with depth=1 (one level)
>>> df
Station ID Pollutants
0 8809 {"a": "46", "b": "3", "c": "12"}
1 8810 {"a": "36", "b": "5", "c": "8"}
2 8811 {"b": "2", "c": "7"}
3 8812 {"c": "11"}
4 8813 {"a": "82", "c": "15"}
speed comparison for a large dataset of 10 million rows
>>> df = pd.concat([df]*2000000).reset_index(drop=True)
>>> print(df.shape)
(10000000, 2)
def apply_drop(df):
return df.join(df['Pollutants'].apply(pd.Series)).drop('Pollutants', axis=1)
def json_normalise_drop(df):
return df.join(pd.json_normalize(df.Pollutants)).drop('Pollutants', axis=1)
def tolist_drop(df):
return df.join(pd.DataFrame(df['Pollutants'].tolist())).drop('Pollutants', axis=1)
def values_tolist_drop(df):
return df.join(pd.DataFrame(df['Pollutants'].values.tolist())).drop('Pollutants', axis=1)
def pop_tolist(df):
return df.join(pd.DataFrame(df.pop('Pollutants').tolist()))
def pop_values_tolist(df):
return df.join(pd.DataFrame(df.pop('Pollutants').values.tolist()))
>>> %timeit apply_drop(df.copy())
1 loop, best of 3: 53min 20s per loop
>>> %timeit json_normalise_drop(df.copy())
1 loop, best of 3: 54.9 s per loop
>>> %timeit tolist_drop(df.copy())
1 loop, best of 3: 6.62 s per loop
>>> %timeit values_tolist_drop(df.copy())
1 loop, best of 3: 6.63 s per loop
>>> %timeit pop_tolist(df.copy())
1 loop, best of 3: 5.99 s per loop
>>> %timeit pop_values_tolist(df.copy())
1 loop, best of 3: 5.94 s per loop
+---------------------+-----------+
| apply_drop | 53min 20s |
| json_normalise_drop | 54.9 s |
| tolist_drop | 6.62 s |
| values_tolist_drop | 6.63 s |
| pop_tolist | 5.99 s |
| pop_values_tolist | 5.94 s |
+---------------------+-----------+
df.join(pd.DataFrame(df.pop('Pollutants').values.tolist())) is the fastest

How do I split a column of dictionaries into separate columns with pandas?
pd.DataFrame(df['val'].tolist()) is the canonical method for exploding a column of dictionaries
Here's your proof using a colorful graph.
Benchmarking code for reference.
Note that I am only timing the explosion since that's the most interesting part of answering this question - other aspects of result construction (such as whether to use pop or drop) are tangential to the discussion and can be ignored (it should be noted however that using pop avoids the followup drop call, so the final solution is a bit more performant, but we are still listifying the column and passing it to pd.DataFrame either way).
Additionally, pop destructively mutates the input DataFrame, making it harder to run in benchmarking code which assumes the input is not changed across test runs.
Critique of other solutions
df['val'].apply(pd.Series) is extremely slow for large N as pandas constructs Series objects for each row, then proceeds to construct a DataFrame from them. For larger N the performance dips to the order of minutes or hours.
pd.json_normalize(df['val']) is slower simply because json_normalize is meant to work with much more complex input data, particularly deeply nested JSON with multiple record paths and metadata. We have simple flat dicts for which pd.DataFrame suffices, so use that if your dicts are flat.
Some answers suggest df.pop('val').values.tolist() or df.pop('val').to_numpy().tolist(). I don't think it makes much of a difference whether you listify the series or the numpy array. It's one operation less to listify the series directly and really isn't slower so I'd recommend avoiding generating the numpy array in the intermediate step.
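Putting that together, a minimal sketch of the recommended approach for this question's data (assuming the 'Pollutants' column already contains dicts, not strings):
import pandas as pd
# listify the dict column once and let the DataFrame constructor do the exploding
exploded = pd.DataFrame(df['Pollutants'].tolist(), index=df.index)
df2 = df.drop(columns='Pollutants').join(exploded)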

You can use join with pop + tolist. Performance is comparable to concat with drop + tolist, but some may find this syntax cleaner:
res = df.join(pd.DataFrame(df.pop('b').tolist()))
Benchmarking with other methods:
df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})
def joris1(df):
return pd.concat([df.drop('b', axis=1), df['b'].apply(pd.Series)], axis=1)
def joris2(df):
return pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)
def jpp(df):
return df.join(pd.DataFrame(df.pop('b').tolist()))
df = pd.concat([df]*1000, ignore_index=True)
%timeit joris1(df.copy()) # 1.33 s per loop
%timeit joris2(df.copy()) # 7.42 ms per loop
%timeit jpp(df.copy()) # 7.68 ms per loop

A one-line solution is the following:
>>> df = pd.concat([df['Station ID'], df['Pollutants'].apply(pd.Series)], axis=1)
>>> print(df)
Station ID a b c
0 8809 46 3 12
1 8810 36 5 8
2 8811 NaN 2 7
3 8812 NaN NaN 11
4 8813 82 NaN 15

df = pd.concat([df['a'], df.b.apply(pd.Series)], axis=1)

I've combined those steps in a function; you only have to pass the dataframe and the column that contains the dict to expand:
def expand_dataframe(dw: pd.DataFrame, column_to_expand: str) -> pd.DataFrame:
"""
dw: DataFrame with some column which contain a dict to expand
in columns
column_to_expand: String with column name of dw
"""
import pandas as pd
def convert_to_dict(sequence: str) -> dict:
import json
s = sequence
json_acceptable_string = s.replace("'", "\"")
d = json.loads(json_acceptable_string)
return d
expanded_dataframe = pd.concat([dw.drop([column_to_expand], axis=1),
dw[column_to_expand]
.apply(convert_to_dict)
.apply(pd.Series)],
axis=1)
return expanded_dataframe
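A hypothetical usage sketch, assuming df has a 'Pollutants' column of dict-like strings:
df2 = expand_dataframe(df, 'Pollutants')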

my_df = pd.DataFrame.from_dict(my_dict, orient='index', columns=['my_col'])
.. would have parsed the dict properly (putting each dict key into a separate df column, and key values into df rows), so the dicts would not get squashed into a single column in the first place.

Related

Loop over columns with df.shift in Python

Lets say you have a dataframe like this:
df = pd.DataFrame({'A': [3, 1, 2, 3],
'B': [5, 6, 7, 8]})
df
A B
0 3 5
1 1 6
2 2 7
3 3 8
Now I want to shift and calculate on each column. I put the shift values I want in the index:
range_span = range(4)
result = pd.DataFrame(index=range_span)
Then I try to populate result with the following:
for c in df.columns:
for i in range_span:
result.iloc[i][c] = df[c].shift(i).max()
result
This only returns the index. I expected something like this:
You've got 3 critical issues:
issue #1
At this line
result.iloc[i][c] = df[c].shift(i).max()
A warning is raised that helps explain why result stays empty.
...\pandas\core\indexing.py:670: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
According to the documentation:
dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)
Since result.iloc[i] returns a slice (i.e. a copy) of that row, you aren't setting anything on the original dataframe result. This is also why iloc didn't raise an issue when it got a str index; that is explained in issue #2.
Instead, use iloc (or loc with str labels) like this:
>>> df
A B C
0 1 10 100
1 2 20 200
2 3 30 300
>>> df.iloc[1, 2]
200
>>> df.iloc[[1, 2], [1, 2]]
B C
1 20 200
2 30 300
>>> df.iloc[1:3, 1:3]
B C
1 20 200
2 30 300
>>> df.iloc[:, 1:3]
B C
0 10 100
1 20 200
2 30 300
# ..and so on
issue #2
If you fix issue #1, you'll see the following error:
result.iloc[[i][c]] = df[c].shift(i).max()
TypeError: list indices must be integers or slices, not str
Also from the documentation:
property DataFrame.iloc: Purely integer-location based indexing for selection by position.
At for c in df.columns: you're passing the column names A and B, which are str, not int. Use loc instead for str column labels.
This didn't raise a TypeError earlier because of issue #1, since c was passed as an argument of __setitem__().
issue #3
Normally, a dataframe cannot be enlarged without special functions like combine.
# using same df from #1
>>> df.iloc[1, 3] = 300
Traceback (most recent call last):
File "~\pandas\core\indexing.py", line 1394, in _has_valid_setitem_indexer
raise IndexError("iloc cannot enlarge its target object")
IndexError: iloc cannot enlarge its target object
An easier fix would be to use a dict and convert it to a DataFrame when the manipulation is complete, or just to create a DataFrame that matches or exceeds the required size beforehand:
>>> df2 = pd.DataFrame(index=range(4), columns=range(3))
>>> df2
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
Combining all of this, the correct fix would be:
import pandas as pd
df = pd.DataFrame({'A': [3, 1, 2, 3],
'B': [5, 6, 7, 8]})
result = pd.DataFrame(index=df.index, columns=df.columns)
for col in df.columns:
for index in df.index:
result.loc[index, col] = df[col].shift(index).max()
print(result)
Output:
A B
0 3 8
1 3 7
2 3 6
3 3 5

Pairwise operations in Scikit-Learn and different filtering conditions on each pair

I have the following 2 data frames, say df1
a b c d
0 0 1 2 3
1 4 0 0 7
2 8 9 10 11
3 0 0 0 15
and df2
a b c d
0 5 1 2 3
What I am interested in doing is a pairwise operation on each row in df1 with the single row in df2. However, if a column in a row of df1 is 0, then that column is used in neither the df1 row nor the df2 row to perform the pairwise operation. So each pairwise operation will work on pairs of rows of different length. Let me break down how the 4 comparisons should work.
Comparison 1
0 1 2 3 vs 5 1 2 3
The pairwise operation is done on 1 2 3 vs 1 2 3 as column a has a 0
Comparison 2
4 0 0 7 vs 5 1 2 3 is done on 4 7 vs 5 3 as we have 2 columns that need to be dropped
Comparison 3
8 9 10 11 vs 5 1 2 3 is done on 8 9 10 11 vs 5 1 2 3 as no columns are dropped
Comparison 4
0 0 0 15 vs 5 1 2 3 is done on 15 vs 3 as all but one column is dropped
The result of each pairwise operation is a scalar, so the overall result is some sort of structure, whether it be a list, array, or data frame, with 4 values (or the number of rows in df1). Also, I should note that values in df2 are irrelevant and no filtering is done based upon the value of any column in df2.
For simplicity, you could try looping over each row in the dataframe and do something like this:
import pandas as pd
import numpy as np
a = pd.DataFrame(data=[[0,1,2,3],[4,0,0,7],[8,9,10,11],[0,0,0,15]], columns=['a', 'b', 'c', 'd'])
b = pd.DataFrame(data=[[5, 1, 2, 3]], columns=['a', 'b', 'c', 'd'])
# loop over each row in 'a'
for i in range(len(a)):
# find indices of non-zero elements of the row
non_zero = np.nonzero(a.iloc[i].to_numpy())[0]
# perform pair-wise addition between non-zero elements in 'a' and the same elements in 'b'
print(np.array(a.iloc[i])[(non_zero)] + np.array(b.iloc[0])[(non_zero)])
Here I used pair-wise addition but you could replace the addition with an operation of your choosing.
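For example, swapping the addition for element-wise multiplication (illustrative only) would just change the final line:
for i in range(len(a)):
    non_zero = np.nonzero(a.iloc[i].to_numpy())[0]
    print(np.multiply(a.iloc[i].to_numpy()[non_zero], b.iloc[0].to_numpy()[non_zero]))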
Edit:
We may want to vectorize this to avoid the loop if the dataframes are large. Here is an idea for that, where we convert zero values to nan so they are ignored in the row-wise operation:
import pandas as pd
import numpy as np
a = pd.DataFrame(data=[[0,1,2,3],[4,0,0,7],[8,9,10,11],[0,0,0,15]], columns=['a', 'b', 'c', 'd'])
b = pd.DataFrame(data=[[5, 1, 2, 3]], columns=['a', 'b', 'c', 'd'])
# find indices of zeros
zeros = (a==0).values
# set zeros to nan
a[zeros] = np.nan
# tile and reshape 'b' so its the same shape as 'a'
b = pd.DataFrame(np.tile(b, len(a)).reshape(np.shape(a)), columns=b.columns)
# set the zero indices to nan
b[zeros] = np.nan
print('a:')
print(a)
print('b:')
print(b)
# now do some row-wise operation. For example take the sum of each row
print(np.sum(a+b, axis=1))
Output:
a:
a b c d
0 NaN 1.0 2.0 3
1 4.0 NaN NaN 7
2 8.0 9.0 10.0 11
3 NaN NaN NaN 15
b:
a b c d
0 NaN 1.0 2.0 3
1 5.0 NaN NaN 3
2 5.0 1.0 2.0 3
3 NaN NaN NaN 3
sum:
0 12.0
1 19.0
2 49.0
3 18.0
dtype: float64

Pandas: Group dataframe by condition "last value in column defines the group"

I've got a sorted dataframe (sorted by "customer_id" and "point_in_time") which looks like this:
import pandas as pd
import numpy as np
testing = pd.DataFrame({"customer_id": (1,1,1,2,2,2,2,2,3,3,3,3,4,4),
"point_in_time": (4,5,6,1,2,3,7,9,5,6,8,10,2,5),
"x": ("d", "a", "c", "ba", "cd", "d", "o", "a", "g", "f", "h", "d", "df", "b"),
"revenue": (np.nan, np.nan, 40, np.nan, np.nan, 23, np.nan, 10, np.nan, np.nan, np.nan, 40, np.nan, 100)})
testing
Now I want to group the dataframe by "customer_id" and the "revenue". But with regard to "revenue", a group should start after the last existing revenue and end with the next occurring revenue.
So the groups should look like this:
If I had those groups I could easily do a
testing.groupby(["customer_id", "groups"])
I first tried to create those groups by first grouping by "customer_id" and applying a function to it in which I fill the missing values of "revenue":
def my_func(sub_df):
sub_df["groups"] = sub_df["revenue"].fillna(method="bfill")
sub_df.groupby("groups").apply(next_function)
testing.groupby(["customer_id"]).apply(my_func)
Unfortunately, this does not work if one customer has two revenues which are exactly the same. In this case after using fillna the group column of this customer will consist of only one value which does not allow additional grouping.
So how can this be done and what is the most efficient way to accomplish this task?
Thank you in advance!
Use Series.shift with Series.notna and Series.cumsum; lastly, if necessary, add 1:
testing["groups"] = testing['revenue'].shift().notna().cumsum() + 1
print (testing)
customer_id point_in_time x revenue groups
0 1 4 d NaN 1
1 1 5 a NaN 1
2 1 6 c 40.0 1
3 2 1 ba NaN 2
4 2 2 cd NaN 2
5 2 3 d 23.0 2
6 2 7 o NaN 3
7 2 9 a 10.0 3
8 3 5 g NaN 4
9 3 6 f NaN 4
10 3 8 h NaN 4
11 3 10 d 40.0 4
12 4 2 df NaN 5
13 4 5 b 100.0 5
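With the groups column in place, the grouping the question asks for is a one-liner; for example (the aggregation here is purely illustrative):
agg = (testing.groupby(["customer_id", "groups"])
              .agg(revenue=("revenue", "last"), n_rows=("x", "size")))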

Creating multiple subsets from an existing pandas dataframe [duplicate]

I have a very large dataframe (around 1 million rows) with data from an experiment (60 respondents).
I would like to split the dataframe into 60 dataframes (a dataframe for each participant).
In the dataframe, data, there is a variable called 'name', which is the unique code for each participant.
I have tried the following, but nothing happens (or execution does not stop within an hour). What I intend to do is to split the data into smaller dataframes, and append these to a list (datalist):
import pandas as pd
def splitframe(data, name='name'):
n = data[name][0]
df = pd.DataFrame(columns=data.columns)
datalist = []
for i in range(len(data)):
if data[name][i] == n:
df = df.append(data.iloc[i])
else:
datalist.append(df)
df = pd.DataFrame(columns=data.columns)
n = data[name][i]
df = df.append(data.iloc[i])
return datalist
I do not get an error message, the script just seems to run forever!
Is there a smart way to do it?
Can I ask why not just do it by slicing the data frame? Something like:
#create some data with Names column
data = pd.DataFrame({'Names': ['Joe', 'John', 'Jasper', 'Jez'] *4, 'Ob1' : np.random.rand(16), 'Ob2' : np.random.rand(16)})
#create unique list of names
UniqueNames = data.Names.unique()
#create a data frame dictionary to store your data frames
DataFrameDict = {elem : pd.DataFrame() for elem in UniqueNames}
for key in DataFrameDict.keys():
DataFrameDict[key] = data[:][data.Names == key]
Hey presto you have a dictionary of data frames just as (I think) you want them. Need to access one? Just enter
DataFrameDict['Joe']
Firstly, your approach is inefficient because appending to the list on a row-by-row basis will be slow: it has to periodically grow the list when there is insufficient space for the new entry, whereas list comprehensions are better in this respect as the size is determined up front and allocated once.
However, I think your approach is also fundamentally a little wasteful, as you already have a dataframe, so why create a new one for each of these users?
I would sort the dataframe by the column 'name', set the index to be this, and if required not drop the column.
Then generate a list of all the unique entries and perform a lookup using these entries; crucially, if you're only querying the data, use the selection criteria to return a view on the dataframe without incurring a costly data copy.
Use pandas.DataFrame.sort_values and pandas.DataFrame.set_index:
# sort the dataframe
df.sort_values(by='name', inplace=True)
# set the index to be this and don't drop
df.set_index(keys=['name'], drop=False,inplace=True)
# get a list of names
names=df['name'].unique().tolist()
# now we can perform a lookup on a 'view' of the dataframe
joe = df.loc[df.name=='joe']
# now you can query all 'joes'
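If you do want one frame per participant after this setup, a small sketch building on the same lookup idea is a dict comprehension over the unique names:
# one DataFrame (view/copy) per unique name
frames = {name: df.loc[df['name'] == name] for name in names}
frames['joe']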
You can convert groupby object to tuples and then to dict:
df = pd.DataFrame({'Name':list('aabbef'),
'A':[4,5,4,5,5,4],
'B':[7,8,9,4,2,3],
'C':[1,3,5,7,1,0]}, columns = ['Name','A','B','C'])
print (df)
Name A B C
0 a 4 7 1
1 a 5 8 3
2 b 4 9 5
3 b 5 4 7
4 e 5 2 1
5 f 4 3 0
d = dict(tuple(df.groupby('Name')))
print (d)
{'b': Name A B C
2 b 4 9 5
3 b 5 4 7, 'e': Name A B C
4 e 5 2 1, 'a': Name A B C
0 a 4 7 1
1 a 5 8 3, 'f': Name A B C
5 f 4 3 0}
print (d['a'])
Name A B C
0 a 4 7 1
1 a 5 8 3
It is not recommended, but it is possible to create DataFrames by group:
for i, g in df.groupby('Name'):
globals()['df_' + str(i)] = g
print (df_a)
Name A B C
0 a 4 7 1
1 a 5 8 3
Easy:
[v for k, v in df.groupby('name')]
Groupby can help you:
grouped = data.groupby(['name'])
Then you can work with each group like with a dataframe for each participant. And DataFrameGroupBy object methods such as (apply, transform, aggregate, head, first, last) return a DataFrame object.
Or you can make a list from grouped and get all the DataFrames by index:
l_grouped = list(grouped)
l_grouped[0][1] is the DataFrame for the first group, with the first name.
In addition to Gusev Slava's answer, you might want to use groupby's groups:
{key: df.loc[value] for key, value in df.groupby("name").groups.items()}
This will yield a dictionary with the keys you have grouped by, pointing to the corresponding partitions. The advantage is that the keys are maintained and don't vanish in the list index.
The method in the OP works, but isn't efficient. It may have seemed to run forever, because the dataset was long.
Use .groupby on the 'method' column, and create a dict of DataFrames with unique 'method' values as the keys, with a dict-comprehension.
.groupby returns a groupby object, that contains information about the groups, where g is the unique value in 'method' for each group, and d is the DataFrame for that group.
The value of each key in df_dict, will be a DataFrame, which can be accessed in the standard way, df_dict['key'].
The original question wanted a list of DataFrames, which can be done with a list-comprehension
df_list = [d for _, d in df.groupby('method')]
import pandas as pd
import seaborn as sns # for test dataset
# load data for example
df = sns.load_dataset('planets')
# display(df.head())
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009
# Using a dict-comprehension, the unique 'method' value will be the key
df_dict = {g: d for g, d in df.groupby('method')}
print(df_dict.keys())
[out]:
dict_keys(['Astrometry', 'Eclipse Timing Variations', 'Imaging', 'Microlensing', 'Orbital Brightness Modulation', 'Pulsar Timing', 'Pulsation Timing Variations', 'Radial Velocity', 'Transit', 'Transit Timing Variations'])
# or a specific name for the key, using enumerate (e.g. df1, df2, etc.)
df_dict = {f'df{i}': d for i, (g, d) in enumerate(df.groupby('method'))}
print(df_dict.keys())
[out]:
dict_keys(['df0', 'df1', 'df2', 'df3', 'df4', 'df5', 'df6', 'df7', 'df8', 'df9'])
df_dict['df1'].head(3) or df_dict['Astrometry'].head(3)
There are only 2 in this group
method number orbital_period mass distance year
113 Astrometry 1 246.36 NaN 20.77 2013
537 Astrometry 1 1016.00 NaN 14.98 2010
df_dict['df2'].head(3) or df_dict['Eclipse Timing Variations'].head(3)
method number orbital_period mass distance year
32 Eclipse Timing Variations 1 10220.0 6.05 NaN 2009
37 Eclipse Timing Variations 2 5767.0 NaN 130.72 2008
38 Eclipse Timing Variations 2 3321.0 NaN 130.72 2008
df_dict['df3'].head(3) or df_dict['Imaging'].head(3)
method number orbital_period mass distance year
29 Imaging 1 NaN NaN 45.52 2005
30 Imaging 1 NaN NaN 165.00 2007
31 Imaging 1 NaN NaN 140.00 2004
For more information about the seaborn datasets
NASA Exoplanets
Alternatively
This is a manual method to create separate DataFrames using pandas: Boolean Indexing
This is similar to the accepted answer, but .loc is not required.
This is an acceptable method for creating a couple extra DataFrames.
The pythonic way to create multiple objects is to place them in a container (e.g. dict, list, generator, etc.), as shown above.
df1 = df[df.method == 'Astrometry']
df2 = df[df.method == 'Eclipse Timing Variations']
In [28]: df = DataFrame(np.random.randn(1000000,10))
In [29]: df
Out[29]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 10 columns):
0 1000000 non-null values
1 1000000 non-null values
2 1000000 non-null values
3 1000000 non-null values
4 1000000 non-null values
5 1000000 non-null values
6 1000000 non-null values
7 1000000 non-null values
8 1000000 non-null values
9 1000000 non-null values
dtypes: float64(10)
In [30]: frames = [ df.iloc[i*60:min((i+1)*60,len(df))] for i in range(int(len(df)/60.) + 1) ]
In [31]: %timeit [ df.iloc[i*60:min((i+1)*60,len(df))] for i in range(int(len(df)/60.) + 1) ]
1 loops, best of 3: 849 ms per loop
In [32]: len(frames)
Out[32]: 16667
Here's a groupby way (and you could do an arbitrary apply rather than sum)
In [9]: g = df.groupby(lambda x: x // 60)
In [8]: g.sum()
Out[8]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16667 entries, 0 to 16666
Data columns (total 10 columns):
0 16667 non-null values
1 16667 non-null values
2 16667 non-null values
3 16667 non-null values
4 16667 non-null values
5 16667 non-null values
6 16667 non-null values
7 16667 non-null values
8 16667 non-null values
9 16667 non-null values
dtypes: float64(10)
Sum is cythonized that's why this is so fast
In [10]: %timeit g.sum()
10 loops, best of 3: 27.5 ms per loop
In [11]: %timeit df.groupby(lambda x: x // 60)
1 loops, best of 3: 231 ms per loop
This method is based on a list comprehension and groupby, which stores all the split dataframes in a list variable that can be accessed using the index.
Example
ans = [pd.DataFrame(y) for x, y in DF.groupby('column_name', as_index=False)]
ans[0]
ans[0].column_name
You can use the groupby command, if you already have some labels for your data.
out_list = [group[1] for group in in_series.groupby(label_series.values)]
Here's a detailed example:
Let's say we want to partition a pd series using some labels into a list of chunks
For example, in_series is:
2019-07-01 08:00:00 -0.10
2019-07-01 08:02:00 1.16
2019-07-01 08:04:00 0.69
2019-07-01 08:06:00 -0.81
2019-07-01 08:08:00 -0.64
Length: 5, dtype: float64
And its corresponding label_series is:
2019-07-01 08:00:00 1
2019-07-01 08:02:00 1
2019-07-01 08:04:00 2
2019-07-01 08:06:00 2
2019-07-01 08:08:00 2
Length: 5, dtype: float64
Run
out_list = [group[1] for group in in_series.groupby(label_series.values)]
which returns out_list a list of two pd.Series:
[2019-07-01 08:00:00 -0.10
2019-07-01 08:02:00 1.16
Length: 2, dtype: float64,
2019-07-01 08:04:00 0.69
2019-07-01 08:06:00 -0.81
2019-07-01 08:08:00 -0.64
Length: 3, dtype: float64]
Note that you can use some parameters from in_series itself to group the series, e.g., in_series.index.day
Here's a small function which might help some (efficiency probably not perfect, but compact and more or less easy to understand):
def get_splited_df_dict(df: 'pd.DataFrame', split_column: 'str'):
"""
splits a pandas.DataFrame on split_column and returns it as a dict
"""
df_dict = {value: df[df[split_column] == value].drop(split_column, axis=1) for value in df[split_column].unique()}
return df_dict
it converts a DataFrame to multiple DataFrames, by selecting each unique value in the given column and putting all those entries into a separate DataFrame.
the .drop(split_column, axis=1) is just for removing the column which was used to split the DataFrame. the removal is not necessary, but can help a little to cut down on memory usage after the operation.
the result of get_splited_df_dict is a dict, meaning one can access each DataFrame like this:
splitted = get_splited_df_dict(some_df, some_column)
# accessing the DataFrame with 'some_column_value'
splitted[some_column_value]
The existing answers cover all the good cases and explain fairly well how the groupby object is like a dictionary with keys and values that can be accessed via .groups. Yet more methods to do the same job as the existing answers are:
Create a list by unpacking the groupby object and casting it to a dictionary:
dict([*df.groupby('Name')]) # same as dict(list(df.groupby('Name')))
Create a tuple + dict (this is the same as #jezrael's answer):
dict((*df.groupby('Name'),))
If we only want the DataFrames, we could get the values of the dictionary (created above):
[*dict([*df.groupby('Name')]).values()]
I had a similar problem. I had a time series of daily sales for 10 different stores and 50 different items. I needed to split the original dataframe into 500 dataframes (10 stores * 50 items) to apply machine learning models to each of them, and I couldn't do it manually.
This is the head of the dataframe:
I have created two lists:
one for the names of the dataframes
and one for the pairs [item_number, store_number].
df_names = []
for i in range(1, len(items)*len(stores) + 1):
    df_names.append('df' + str(i))

list_couple_s_i = []
for item in items:
    for store in stores:
        list_couple_s_i.append([item, store])
And once the two lists are ready you can loop on them to create the dataframes you want:
for name, it_st in zip(df_names, list_couple_s_i):
    globals()[name] = df.where((df['item'] == it_st[0]) &
                               (df['store'] == it_st[1]))
    globals()[name].dropna(inplace=True)
In this way I have created 500 dataframes.
Hope this will be helpful!
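An alternative sketch for the same task that avoids globals(): a single groupby on the two key columns (assuming they are named 'item' and 'store' as in the filtering above) yields all 500 frames at once:
# dict keyed by (item, store) tuples, one DataFrame per pair
frames = {key: group for key, group in df.groupby(['item', 'store'])}
frames[(items[0], stores[0])].head()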

Replace empty list values in Pandas DataFrame with NaN

I know that similar questions have been asked before, but I literally tried every possible solution listed here and none of them worked.
I have a dataframe which consists of dates, strings, empty values, and empty list values. It is huge: 8 million rows.
I want to replace all of the empty list values (cells that contain only [] and nothing else) with NaN. Nothing seems to work.
I tried this:
df = df.apply(lambda y: np.nan if (type(y) == list and len(y) == 0) else y)
as advised similarly in this question replace empty list with NaN in pandas dataframe but it doesn't change anything in my dataframe.
Any help would be appreciated.
Assuming the OP wants to convert both the actual empty list [] and the string '[]' to NaN, below is a solution.
Setup
#borrowed from piRSquared's answer.
df = pd.DataFrame([
[1, 'hello', np.nan, None, 3.14],
['2017-06-30', 2, 'a', 'b', []],
[pd.to_datetime('2016-08-14'), 'x', '[]', 'z', 'w']
])
df
Out[1062]:
0 1 2 3 4
0 1 hello NaN None 3.14
1 2017-06-30 2 a b []
2 2016-08-14 00:00:00 x [] z w
Solution:
#convert all elements to string first, and then compare with '[]'. Finally use mask function to mark '[]' as na
df.mask(df.applymap(str).eq('[]'))
Out[1063]:
0 1 2 3 4
0 1 hello NaN None 3.14
1 2017-06-30 2 a b NaN
2 2016-08-14 00:00:00 x NaN z w
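Note that on recent pandas versions applymap is deprecated in favour of DataFrame.map; if your install has it (an assumption about your pandas version), the same idea reads:
# same masking, using DataFrame.map instead of the deprecated applymap
df.mask(df.map(str).eq('[]'))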
I'm going to make the assumption that you want to mask actual empty lists.
pd.DataFrame.mask will turn cells that have corresponding True values to np.nan
I want to find actual list values. So I'll use df.applymap(type) to get the type in every cell and see if it is equal to list
I know that [] evaluates to False in a boolean context, so I'll use df.astype(bool) to see.
I'll end up masking those cells that are both list type and evaluate to False
Consider the dataframe df
df = pd.DataFrame([
[1, 'hello', np.nan, None, 3.14],
['2017-06-30', 2, 'a', 'b', []],
[pd.to_datetime('2016-08-14'), 'x', '[]', 'z', 'w']
])
df
0 1 2 3 4
0 1 hello NaN None 3.14
1 2017-06-30 2 a b []
2 2016-08-14 00:00:00 x [] z w
Solution
df.mask(df.applymap(type).eq(list) & ~df.astype(bool))
0 1 2 3 4
0 1 hello NaN None 3.14
1 2017-06-30 2 a b NaN
2 2016-08-14 00:00:00 x [] z w
