Expanding dict entries into rows with Pandas

I'm trying to expand dict keys and values into their own columns using Python3 and Pandas. Below is an example. Not all dicts have the same number of items and there's no guarantee that key names match for each metric type.
I want to convert this dataframe:
   id  metric         dicts
   1   some_metric_1  {'a': 161, 'b': 121}
   2   some_metric_1  {'a': 152, 'c': 4}
   2   some_metric_2  {'b': 162, 'a': 83}
   3   some_metric_2  {'b': 103, 'z': 69}
Created by this:
data = {'id': [1, 2, 2, 3],
        'metric': ['some_metric_1', 'some_metric_1', 'some_metric_2', 'some_metric_2'],
        'dicts': [{'a': 161, 'b': 121}, {'a': 152, 'c': 4}, {'b': 162, 'a': 83}, {'b': 103, 'z': 69}]}
df = pd.DataFrame.from_dict(data)
into this:
   id  metric         key  value
   1   some_metric_1  a    161
   1   some_metric_1  b    121
   2   some_metric_1  a    152
   2   some_metric_1  c    4
   2   some_metric_2  b    162
   2   some_metric_2  a    83
   3   some_metric_2  b    103
   3   some_metric_2  z    69

You can simply iterate through the rows of your DataFrame and extract the values you need, as shown below.
Keep in mind that the code below assumes each key has a single value (i.e. no list of values is passed to a dict key). It will, however, work regardless of the number of keys.
final_df = pd.DataFrame()
for row in df.iterrows():
    i = row[1][0]       # get the id value
    metric = row[1][1]  # get the value in the metric column
    for key, value in row[1][2].items():  # iterate over the dict's entries
        tmp_df = pd.DataFrame({
            'id': i,
            'metric': metric,
            'key': key,
            'value': value
        }, index=[0])
        final_df = final_df.append(tmp_df)  # append the tmp_df to our final df
# Reset the final df's index, since we assigned index 0 to each tmp_df
final_df = final_df.reset_index(drop=True)
Output
   id  metric         key  value
0   1  some_metric_1    a    161
1   1  some_metric_1    b    121
2   2  some_metric_1    a    152
3   2  some_metric_1    c      4
4   2  some_metric_2    b    162
5   2  some_metric_2    a     83
6   3  some_metric_2    b    103
7   3  some_metric_2    z     69
See the pandas documentation for more information regarding df.append(). Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat is the modern replacement.
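Along those lines, a minimal sketch (using only the sample df from the question) that builds a flat list of records first and constructs the DataFrame once, avoiding the cost of repeated per-row appends:
records = [
    {'id': i, 'metric': m, 'key': k, 'value': v}
    for i, m, d in zip(df['id'], df['metric'], df['dicts'])
    for k, v in d.items()
]
final_df = pd.DataFrame(records)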

I find this type of problem easier to solve in plain Python rather than Pandas - once you are storing dictionaries in your DataFrame, it's going to be difficult to perform the kind of fast vectorized operations which make Pandas so useful for simple numeric/string data.
Here's my solution, which involves a couple of comprehensions and zip.
metrics = df['metric']
dicts = df['dicts']
ids = df['id']
# repeat each (metric, id) pair once per entry in its dict
metrics, ids = zip(*((m, i) for m, d, i in zip(metrics, dicts, ids) for j in range(len(d))))
# flatten every dict into parallel tuples of keys and values
keys, values = zip(*((k, v) for d in dicts for k, v in d.items()))
new_data = {'id': ids, 'metric': metrics, 'keys': keys, 'values': values}
new_df = pd.DataFrame.from_dict(new_data)
Results:
   id keys         metric  values
0   1    a  some_metric_1     161
1   1    b  some_metric_1     121
2   2    a  some_metric_1     152
3   2    c  some_metric_1       4
4   2    b  some_metric_2     162
5   2    a  some_metric_2      83
6   3    b  some_metric_2     103
7   3    z  some_metric_2      69
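For what it's worth, the same repeat-and-flatten idea can also be expressed with Series.repeat and itertools.chain. A sketch using only df from the question (.str.len() works on any object with a length, dicts included):
from itertools import chain

lengths = df['dicts'].str.len()  # number of entries in each dict
new_df = pd.DataFrame({
    'id': df['id'].repeat(lengths).to_numpy(),      # repeat id once per dict entry
    'metric': df['metric'].repeat(lengths).to_numpy(),
    'keys': list(chain.from_iterable(d.keys() for d in df['dicts'])),
    'values': list(chain.from_iterable(d.values() for d in df['dicts'])),
})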

Related

Removing rows from a DataFrame based on different conditions applied to subsets of the data

Here is the DataFrame I am working with. You can create it using this snippet:
my_dict = {'id': [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 3, 1, 3, 3],
           'category': ['a', 'a', 'b', 'b', 'b', 'b', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a'],
           'value': [1, 12, 34, 12, 12, 34, 12, 35, 34, 45, 65, 55, 34, 25]}
x = pd.DataFrame(my_dict)
x
I want to filter IDs based on this condition: for category a, the count of values should be 2, and for category b, the count of values should be 3. Therefore, I would remove id 1 from category a and id 3 from category b from my original dataset x.
I can write the code for individual categories and remove ids manually using:
x.query('category == "a"').groupby('id').value.count().loc[lambda x: x != 2]
x.query('category == "b"').groupby('id').value.count().loc[lambda x: x != 3]
But I don't want to do it manually, since there are multiple categories. Is there a better way to handle all the categories at once and remove ids based on conditions listed in a list/dictionary?
If you need to filter the MultiIndex Series s by the dictionary, use Index.get_level_values with Series.map and keep the groups whose counts equal the mapped values via boolean indexing:
s = x.groupby(['category','id']).value.count()
d = {'a': 2, 'b': 3}
print(s[s.eq(s.index.get_level_values(0).map(d))])
category  id
a         2     2
          3     2
b         1     3
          2     3
Name: value, dtype: int64
If you need to filter the original DataFrame:
s = x.groupby(['category','id'])['value'].transform('count')
print(s)
0     3
1     2
2     3
3     3
4     3
5     3
6     3
7     2
8     3
9     3
10    1
11    3
12    2
13    2
Name: value, dtype: int64
d = {'a': 2, 'b': 3}
print(x[s.eq(x['category'].map(d))])
    id category  value
1    2        a     12
2    1        b     34
3    2        b     12
4    1        b     12
5    2        b     34
7    2        a     35
8    1        b     34
9    2        b     45
12   3        a     34
13   3        a     25
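As a hedged alternative sketch, the same filtering can be written with GroupBy.filter, where each group's key tuple is available as g.name. It is slower on large frames but arguably more readable (assumes every category appears in the dictionary):
d = {'a': 2, 'b': 3}
# keep only the (category, id) groups whose row count equals the
# required count for that group's category
out = x.groupby(['category', 'id']).filter(lambda g: len(g) == d[g.name[0]])
print(out)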

Pandas DataFrame: Same operation on multiple sets of columns

I want to do the same operation on multiple sets of columns of a DataFrame.
Since "for-loops" are frowned upon I'm searching for a decent alternative.
An example:
df = pd.DataFrame({
    'a': [1, 11, 111],
    'b': [222, 22, 2],
    'a_x': [10, 80, 30],
    'b_x': [20, 20, 60],
})
This is a simple for-loop approach. It's short and quite readable.
cols = ['a', 'b']
for col in cols:
    df[f'{col}_res'] = df[[col, f'{col}_x']].min(axis=1)
     a    b  a_x  b_x  a_res  b_res
0    1  222   10   20      1     20
1   11   22   80   20     11     20
2  111    2   30   60     30      2
This is an alternative (without a for-loop), but I feel the additional complexity is not really for the better.
cols = ['a', 'b']

def res_df(df, col, name):
    res = pd.Series(
        df[[col, f'{col}_x']].min(axis=1), index=df.index, name=name)
    return res

res = [res_df(df, col, f'{col}_res') for col in cols]
df = pd.concat([df, pd.concat(res, axis=1)], axis=1)
Does anyone have a better/more pythonic solution?
Thanks!
UPDATE 1
Inspired by the proposal from mozway, I find the following solution quite appealing.
Imho it's short, readable and generic, since the particular operation can be swapped into a function, and the list comprehension applies that function to the given sets of columns.
def operation(s1, s2):
    # fill in any operation on pandas Series
    # e.g. res = s1 * s2 / (s1 + s2)
    res = np.minimum(s1, s2)
    return res

df = df.join(
    [operation(df[f'{col}'], df[f'{col}_x']).rename(f'{col}_res') for col in cols]
)
You can use numpy.minimum after setting the arrays to identical column names:
cols = ['a', 'b']
cols2 = [f'{x}_x' for x in cols]
df = df.join(np.minimum(df[cols], df[cols2].set_axis(cols, axis=1))
               .add_suffix('_res'))
output:
a b a_x b_x a_res b_res
0 1 222 10 20 1 20
1 11 22 80 20 11 20
2 111 2 30 60 30 2
or, using rename as suggested in the other answer:
cols = ['a', 'b']
cols2 = {f'{x}_x': x for x in cols}
df = df.join(np.minimum(df[cols], df[list(cols2)].rename(columns=cols2))
               .add_suffix('_res'))
One idea is to rename the column names via a dictionary, select the columns by the list cols, and then group by column names and aggregate with min, sum, max, or a custom function:
cols = ['a', 'b']
suffix = '_x'
d = {f'{x}{suffix}':x for x in cols}
print(d)
{'a_x': 'a', 'b_x': 'b'}

print(df.rename(columns=d)[cols])
     a   a    b   b
0    1  10  222  20
1   11  80   22  20
2  111  30    2  60
df1 = df.rename(columns=d)[cols].groupby(axis=1, level=0).min().add_suffix('_res')
print(df1)
   a_res  b_res
0      1     20
1     11     20
2     30      2
Last, join back to the original DataFrame:
df = df.join(df1)
print(df)
     a    b  a_x  b_x  a_res  b_res
0    1  222   10   20      1     20
1   11   22   80   20     11     20
2  111    2   30   60     30      2
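One caveat worth flagging: DataFrame.groupby(axis=1) was deprecated in pandas 2.1, so on newer versions the same column-wise minimum can be taken by transposing first. A sketch, reusing d and cols from above:
df1 = (df.rename(columns=d)[cols]
         .T                        # columns become rows
         .groupby(level=0).min()   # collapse the duplicated labels ('a', 'b')
         .T                        # back to one row per original row
         .add_suffix('_res'))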

How to sort a pandas dataframe by the standard deviations of its columns?

Given the following example:
from statistics import stdev
d = pd.DataFrame({"a": [45, 55], "b": [5, 95], "c": [30, 70]})
stds = [stdev(d[c]) for c in d.columns]
With output:
In [87]: d
Out[87]:
    a   b   c
0  45   5  30
1  55  95  70

In [91]: stds
Out[91]: [7.0710678118654755, 63.63961030678928, 28.284271247461902]
I would like to be able to sort the columns of the dataframe by their standard deviations, resulting in the following:
    b   c   a
0   5  30  45
1  95  70  55
You are looking for:
d.iloc[:, (-d.std()).argsort()]
Out[8]:
    b   c   a
0   5  30  45
1  95  70  55
You can get the column order like this:
column_order = d.std().sort_values(ascending=False).index
# >>> column_order
# Index(['b', 'c', 'a'], dtype='object')
And then sort the columns like this:
d[column_order]
    b   c   a
0   5  30  45
1  95  70  55
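Combining the two steps above into a single chained expression gives:
d_sorted = d[d.std().sort_values(ascending=False).index]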

Splitting dictionary/list into Separate Columns

I have a movie dataset saved for revenue prediction. However, the genres column of this dataset holds dictionaries, and a single row can contain a list of two or more of them. The DataFrame looks similar to this (it is not the actual DataFrame):
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [{'c': 1},
                         [{'c': 4}, {'d': 3}],
                         [{'c': 5, 'd': 6}, {'c': 7, 'd': 8}]]})
This is the output:
   a                                      b
0  1                               {'c': 1}
1  2                   [{'c': 4}, {'d': 3}]
2  3  [{'c': 5, 'd': 6}, {'c': 7, 'd': 8}]
I need to split this column into separate columns.
How can I do that? I used the apply(pd.Series) method, and this is what I'm getting as output:
                  0                 1    c
0               NaN               NaN  1.0
1          {'c': 4}          {'d': 3}  NaN
2  {'c': 5, 'd': 6}  {'c': 7, 'd': 8}  NaN
but I would like this, if possible:
   a    c    d
0  1    1  NaN
1  2    4    3
2  3  5,7  6,8
I do not know if it is possible to achieve what you want using apply(pd.Series), because you have mixed types in your 'b' column: dictionaries and lists of dictionaries. Maybe it is; I'm not sure.
However, this is how I would do it.
First, loop over the column to build a set of all the new column names, that is, the keys of the dictionaries.
Then use apply with a custom function to extract the value for each column.
Notice that the values in the new columns are strings; this is needed because you want to comma-concatenate cases like your row #2.
import numpy as np

newcols = set()
for el in df['b']:
    if isinstance(el, dict):
        newcols.update(el.keys())
    elif isinstance(el, list):
        for i in el:
            newcols.update(i.keys())

def extractvalues(x, col):
    if isinstance(x['b'], dict):
        return x['b'].get(col, np.nan)
    elif isinstance(x['b'], list):
        return ','.join(str(i.get(col, '')) for i in x['b']).strip(',')

for nc in newcols:
    df[nc] = df.apply(lambda r: extractvalues(r, nc), axis=1)

df.drop('b', axis=1, inplace=True)
Your dataframe is now:
   a    c    d
0  1    1  NaN
1  2    4    3
2  3  5,7  6,8
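For comparison, a sketch of the same result using explode and json_normalize (assumes the mixed column as in the sample df above; explode passes plain dicts through untouched):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [{'c': 1},
                         [{'c': 4}, {'d': 3}],
                         [{'c': 5, 'd': 6}, {'c': 7, 'd': 8}]]})

exploded = df.explode('b')  # one dict per row; the original index repeats
wide = pd.json_normalize(exploded['b'].tolist()).set_index(exploded.index)
# re-join rows that came from the same original row, comma-separating values
joined = wide.groupby(level=0).agg(
    lambda s: ','.join(str(int(v)) for v in s.dropna()) or np.nan)
result = df[['a']].join(joined)
print(result)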

Add two more columns to csv file based on matching values of other csv

I have two csv files, csv1 and csv2; sample data for both is constructed in the second answer below.
What I need to do is:
Take each value in column c of csv1 and match it against the number column of csv2.
If any row of csv2 matches that number, add a new column c_text to csv1 that contains the value of the text column for the matching row of csv2.
Repeat the above process for column d of csv1, adding a new column d_text to csv1.
I am new to pandas. How can I do this using pandas?
You can use apply() (note this assumes every value in c and d has exactly one match in csv2; otherwise .values[0] raises an IndexError):
csv1['c_text'] = csv1['c'].apply(lambda x: csv2[csv2['number'] == x]['text'].values[0])
csv1['d_text'] = csv1['d'].apply(lambda x: csv2[csv2['number'] == x]['text'].values[0])
Yields:
   a  b    c    d c_text d_text
0  1  4  101  201   val1   val4
1  2  5  105  202   val2   val5
2  3  6  107  203   val3   val6
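Side note: an equivalent (and typically faster) formulation of the two apply() lookups above is a Series.map against an indexed lookup table, assuming the number values in csv2 are unique:
lookup = csv2.set_index('number')['text']  # number -> text
csv1['c_text'] = csv1['c'].map(lookup)
csv1['d_text'] = csv1['d'].map(lookup)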
In terms of an option using merge(), this will yield the same output:
csv1 = csv1.merge(csv2, left_on='c', right_on='number', how='left')
csv1 = csv1.merge(csv2, left_on='d', right_on='number', how='left')
csv1 = csv1.rename(columns={'text_x': 'c_text', 'text_y': 'd_text'})[['a','b','c','d','c_text','d_text']]
Here's something that will do the trick:
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6],
                    'c': [101, 105, 107], 'd': [201, 202, 203]})
df2 = pd.DataFrame({'number': [101, 105, 107, 201, 202, 203, 205, 2010, 310],
                    'text': ["val_{x}".format(x=y + 1) for y in range(9)]})
df1
   a  b    c    d
0  1  4  101  201
1  2  5  105  202
2  3  6  107  203
df2
   number   text
0     101  val_1
1     105  val_2
2     107  val_3
3     201  val_4
4     202  val_5
5     203  val_6
6     205  val_7
7    2010  val_8
8     310  val_9
merged = df1.merge(df2, left_on='c', right_on='number', how='left')
merged
   a  b    c    d  number   text
0  1  4  101  201     101  val_1
1  2  5  105  202     105  val_2
2  3  6  107  203     107  val_3
output = merged.merge(df2, left_on='d', right_on='number', how='left')[['a', 'b', 'c', 'd', 'text_x', 'text_y']]
output
   a  b    c    d text_x text_y
0  1  4  101  201  val_1  val_4
1  2  5  105  202  val_2  val_5
2  3  6  107  203  val_3  val_6
What you want is the merge functionality of pandas. Assuming you have imported the pandas module with the usual shorthand, import pandas as pd, then:
csv1_with_text_col = pd.merge(csv1, csv2, left_on='c', right_on='number', how='left')
This will give you a new dataframe, csv1_with_text_col, with the columns from csv2 merged into csv1 where csv1['c'] == csv2['number']. Additionally, by specifying how='left', only rows from the left dataframe, csv1, will be kept.
You can then merge this new dataframe, csv1_with_text_col, with csv2 again but with left_on='d'.
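Concretely, a sketch of that second merge plus cleanup (the text_x/text_y and number_x/number_y names assume pandas' default merge suffixes for overlapping columns):
final = (csv1_with_text_col
         .merge(csv2, left_on='d', right_on='number', how='left')
         .rename(columns={'text_x': 'c_text', 'text_y': 'd_text'})
         .drop(columns=['number_x', 'number_y']))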
