How to create a multi-value dictionary from a pandas data frame - python-3.x

Let's say I have a pandas data frame with two columns, 'A' and 'B'.
For each value in column 'A' there are multiple values in column 'B'.
I want to create a dictionary that maps each key to its multiple values, and those values should be unique as well. Please suggest a way to do this.

One way is to groupby column A:
In [1]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
In [2]: df
Out[2]:
   A  B
0  1  2
1  1  4
2  5  6
In [3]: g = df.groupby('A')
Apply list on each group's column B (older pandas also accepted g['B'].tolist(), shorthand for .apply(lambda s: s.tolist()) via "automatic delegation"; recent versions spell this .agg(list)):
In [4]: g['B'].agg(list)
Out[4]:
A
1 [2, 4]
5 [6]
dtype: object
And then call to_dict on this Series:
In [5]: g['B'].agg(list).to_dict()
Out[5]: {1: [2, 4], 5: [6]}
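Putting the steps together, the whole thing is a one-liner (a minimal sketch using the same df):
In [6]: df.groupby('A')['B'].agg(list).to_dict()
Out[6]: {1: [2, 4], 5: [6]}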
If you want these to be unique, use unique (Note: this will create a numpy array rather than a list):
In [11]: df = pd.DataFrame([[1, 2], [1, 2], [5, 6]], columns=['A', 'B'])
In [12]: g = df.groupby('A')
In [13]: g['B'].unique()
Out[13]:
A
1 [2]
5 [6]
dtype: object
In [14]: g['B'].unique().to_dict()
Out[14]: {1: array([2]), 5: array([6])}
Other alternatives are to use .apply(lambda s: set(s)), .apply(lambda s: list(set(s))), .apply(lambda s: list(s.unique()))...
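For example, to get plain Python lists of the unique values rather than numpy arrays (a small sketch using the df and g from In [11] and In [12]):
In [15]: g['B'].unique().apply(list).to_dict()
Out[15]: {1: [2], 5: [6]}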

You can also loop over the df.groupby object directly and collect the values as lists.
In[1]:
df = pd.DataFrame([[1, 2], [1, 2], [5, 6]], columns=['A', 'B'])
{k: list(v) for k,v in df.groupby("A")["B"]}
Out[1]:
{1: [2, 2], 5: [6]}
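If the values should be unique here as well, a set inside the comprehension does the deduplication (a sketch; note that set does not preserve the original order):
In[2]:
{k: list(set(v)) for k, v in df.groupby("A")["B"]}
Out[2]:
{1: [2], 5: [6]}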

Related

Pandas Dataframe to dictionary of dictionaries of lists

I have a dataframe like this:
col1  col2  col3
US    1     1
US    1     2
US    2     1
NL    1     1
US    2     2
DK    1     1
and I would like to get a dictionary of dictionaries of lists, grouped by col1, like this:
d = {'US': {1: [1, 2], 2: [1, 2]}, 'NL': {1: [1]}, 'DK': {1: [1]}}
Basically, each unique element of col1 should correspond to a nested dictionary that has the unique element of col2 as key and all the elements of col3 as values.
I tried
dct = df.groupby("col1").apply(lambda x: x.set_index("col2")['col2'].to_dict()).to_dict()
but I do not get the expected outcome.
Any suggestions?
Here's one option using a nested groupby:
out = df.groupby('col1').apply(lambda g: g.groupby('col2')['col3'].agg(list).to_dict()).to_dict()
Output:
{'DK': {1: [1]}, 'NL': {1: [1]}, 'US': {1: [1, 2], 2: [1, 2]}}
itertuples
d = {}
for a, b, c in df.itertuples(index=False, name=None):
    d.setdefault(a, {}).setdefault(b, []).append(c)
d
{'US': {1: [1, 2], 2: [1, 2]}, 'NL': {1: [1]}, 'DK': {1: [1]}}
Same thing but using map and zip
d = {}
for a, b, c in zip(*map(df.get, ['col1', 'col2', 'col3'])):
    d.setdefault(a, {}).setdefault(b, []).append(c)
d
{'US': {1: [1, 2], 2: [1, 2]}, 'NL': {1: [1]}, 'DK': {1: [1]}}
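The setdefault chain can also be written with collections.defaultdict, which some find more readable (a sketch of the same idea, not from the original answer):
from collections import defaultdict

d = defaultdict(lambda: defaultdict(list))  # inner dicts and lists are created on first access
for a, b, c in df.itertuples(index=False, name=None):
    d[a][b].append(c)
d = {k: dict(v) for k, v in d.items()}  # convert back to plain nested dicts
d
{'US': {1: [1, 2], 2: [1, 2]}, 'NL': {1: [1]}, 'DK': {1: [1]}}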
Pandas variants
I don't think these are as good as the methods above.
1
d = df.groupby(['col1', 'col2'])['col3'].agg(list)
{a: d.xs(a).to_dict() for a in d.index.levels[0]}
{'DK': {1: [1]}, 'NL': {1: [1]}, 'US': {1: [1, 2], 2: [1, 2]}}
2
{
    a: b.xs(a).to_dict()
    for a, b in df.groupby(['col1', 'col2'])['col3'].agg(list).groupby('col1')
}
{'DK': {1: [1]}, 'NL': {1: [1]}, 'US': {1: [1, 2], 2: [1, 2]}}
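3 (a sketch along the same lines): unstack the aggregated Series into a wide frame, then drop the missing (col1, col2) combinations per row:
wide = df.groupby(['col1', 'col2'])['col3'].agg(list).unstack()
{a: row.dropna().to_dict() for a, row in wide.iterrows()}
{'DK': {1: [1]}, 'NL': {1: [1]}, 'US': {1: [1, 2], 2: [1, 2]}}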

Dictionary copy() - is shallow deep sometimes?

According to the official docs the dictionary copy is shallow, i.e. it returns a new dictionary that contains the same key-value pairs:
dict1 = {1: "a", 2: "b", 3: "c"}
dict1_alias = dict1
dict1_shallow_copy = dict1.copy()
My understanding is that if we del an element of dict1 both dict1_alias & dict1_shallow_copy should be affected; however, a deepcopy would not.
del dict1[2]
print(dict1)
>>> {1: 'a', 3: 'c'}
print(dict1_alias)
>>> {1: 'a', 3: 'c'}
But the second element of dict1_shallow_copy is still there!
print(dict1_shallow_copy)
>>> {1: 'a', 2: 'b', 3: 'c'}
What am I missing?
A shallow copy means that the elements themselves are the same, just not the dictionary itself.
>>> a = {'a': [1, 2, 3],  # create a list instance at a['a']
...      'b': 4,
...      'c': 'efd'}
>>> b = a.copy() #shallow copy a
>>> b['a'].append(2) #change b['a']
>>> b['a']
[1, 2, 3, 2]
>>> a['a'] #a['a'] changes too, it refers to the same list
[1, 2, 3, 2]
>>> del b['b'] #here we do not change b['b'], we change b
>>> b
{'a': [1, 2, 3, 2], 'c': 'efd'}
>>> a #so a remains unchanged
{'a': [1, 2, 3, 2], 'b': 4, 'c': 'efd'}
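For contrast, copy.deepcopy copies the nested objects too, so mutating the copy's list leaves the original alone (a short sketch):
>>> import copy
>>> a = {'a': [1, 2, 3], 'b': 4}
>>> c = copy.deepcopy(a)  # the nested list is copied as well
>>> c['a'].append(9)
>>> c['a']
[1, 2, 3, 9]
>>> a['a']  # the original list is untouched
[1, 2, 3]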

Python Pandas - Update row with dictionary based on index, column

I have a dataframe with empty columns and a corresponding dictionary, and I would like to fill the empty columns based on (index, column):
import pandas as pd
import numpy as np
dataframe = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [4, 6, 2], [3, 4, 1]])
dataframe.columns = ['x', 'y', 'z']
additional_cols = ['a', 'b', 'c']
for col in additional_cols:
    dataframe[col] = np.nan
   x  y  z   a   b   c
0  1  2  3 NaN NaN NaN
1  4  5  6 NaN NaN NaN
2  7  8  9 NaN NaN NaN
3  4  6  2 NaN NaN NaN
4  3  4  1 NaN NaN NaN
for row, column in dataframe.iterrows():
    # calculations using columns x, y, z return dictionary y
    y = {"a": 5, "b": 6, "c": 7}
    dataframe.loc[row, :].map(y)
Basically after performing the calculations using columns x, y, z I would like to update columns a, b, c for that same row :)
I could use a function like the one below, but I am not sure whether pandas offers a DataFrame method for this...
def update_row_with_dict(dictionary, dataframe, index):
    for key in dictionary.keys():
        dataframe.loc[index, key] = dictionary.get(key)
The same function, with shorter argument names:
def update_row_with_dict(df, d, idx):
    for key in d.keys():
        df.loc[idx, key] = d.get(key)
A shorter version would be:
def update_row_with_dict(df, d, idx):
    # list() turns the dict views into list-likes that .loc accepts
    df.loc[idx, list(d.keys())] = list(d.values())
For your code snippet, the usage would be:
import pandas as pd
import numpy as np
dataframe = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [4, 6, 2], [3, 4, 1]])
dataframe.columns = ['x', 'y', 'z']
additional_cols = ['a', 'b', 'c']
for col in additional_cols:
    dataframe[col] = np.nan
for idx in dataframe.index:
    y = {'a': 1, 'b': 2, 'c': 3}
    update_row_with_dict(dataframe, y, idx)
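If you compute all the row dictionaries up front, pandas can also apply them in one pass with DataFrame.update, which aligns on index and column labels (a sketch, assuming every row produces the same keys):
rows = {idx: {'a': 1, 'b': 2, 'c': 3} for idx in dataframe.index}  # per-row results
updates = pd.DataFrame.from_dict(rows, orient='index')
dataframe.update(updates)  # in-place update, aligned on index and columns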

lost attribute after adding column to pandas DataFrame

I like to add attributes to pandas DataFrame columns, for example to manage labels or units.
df = pd.DataFrame([[1, 2], [5, 6]], columns=['A', 'B'])
df['A'].units = 'm/s'
Calling the units of the column (with df['A'].units) returns m/s.
However, the attribute gets lost after any DataFrame to Series operation, such as adding a new column:
df['C'] = [3, 8]
df['A'].units
AttributeError: 'Series' object has no attribute 'units'
Is there an approach to keep the attributes or an alternative to add columns?
_metadata is not part of the public API, so this is not a stable way of doing it; still, for now:
In [8]: df = pd.DataFrame([[1, 2], [5, 6]], columns=['A', 'B'])
In [9]: df['A']._metadata
Out[9]: ['name']
In [10]: df['A']._metadata.append({'units': 'm/s'})
In [11]: df['C'] = [3, 8]
In [12]: df['A']._metadata
Out[12]: ['name', {'units': 'm/s'}]
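A newer alternative is DataFrame.attrs (available since pandas 1.0 and still documented as experimental): store the units on the DataFrame keyed by column name, so adding a column does not touch them (a sketch):
In [13]: df.attrs['units'] = {'A': 'm/s'}
In [14]: df['D'] = [0, 0]
In [15]: df.attrs['units']['A']
Out[15]: 'm/s'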

Different behavior in list comprehension

In my mind these two pieces of code do the same thing:
l = [[1,2], [3,4],[3,2], [5,4], [4,4],[5,7]]
1)
In [4]: [list(g) for k, g in groupby(sorted(l, key=lambda x: x[1]),
        key=lambda x: x[1]) if len(list(g)) == 2]
Out[4]: [[]]
2)
In [5]: groups = [list(g) for k, g in groupby(sorted(l, key=lambda x: x[1]),
        key=lambda x: x[1])]
In [6]: [g for g in groups if len(g) == 2]
Out[6]: [[[1, 2], [3, 2]]]
But as you can see, the first one gives a list containing an empty list, while the second one gives what I need. Where am I mistaken?
The group is an iterator; you cannot consume it (e.g. by calling list on it) twice. For example:
>>> from operator import itemgetter
>>> from itertools import groupby
>>> l = [[1,2], [3,4],[3,2], [5,4], [4,4],[5,7]]
>>> for _, group in groupby(sorted(l, key=itemgetter(1)), key=itemgetter(1)):
... print('first', list(group))
... print('second', list(group))
...
first [[1, 2], [3, 2]]
second []
first [[3, 4], [5, 4], [4, 4]]
second []
first [[5, 7]]
second []
Instead, you need to call list once per group and filter on the results of that, e.g. by using map:
>>> [lst for lst in map(list, (group for _, group in groupby(sorted(l, key=itemgetter(1)), key=itemgetter(1)))) if len(lst) == 2]
[[[1, 2], [3, 2]]]
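Equivalently, a plain generator expression also materializes each group exactly once, without map (a sketch of the same idea):
>>> groups = (list(group) for _, group in groupby(sorted(l, key=itemgetter(1)), key=itemgetter(1)))
>>> [g for g in groups if len(g) == 2]
[[[1, 2], [3, 2]]]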
