Pandas Dataframe to dictionary of dictionaries of lists - python-3.x

I have a dataframe like this:
col1  col2  col3
US    1     1
US    1     2
US    2     1
NL    1     1
US    2     2
DK    1     1
and I would like to get a dictionary of dictionaries of lists, grouped by col1, like this:
d = {'US': {1: [1, 2], 2: [1, 2]}, 'NL': {1: [1]}, 'DK': {1: [1]}}
Basically, each unique element of col1 should correspond to a nested dictionary that has the unique element of col2 as key and all the elements of col3 as values.
I tried
dct = df.groupby("col1").apply(lambda x: x.set_index("col2")['col2'].to_dict()).to_dict()
but I do not get the expected outcome.
Any suggestions?

Here's one option using a nested groupby:
out = df.groupby('col1').apply(lambda g: g.groupby('col2')['col3'].agg(list).to_dict()).to_dict()
Output:
{'DK': {1: [1]}, 'NL': {1: [1]}, 'US': {1: [1, 2], 2: [1, 2]}}

itertuples
d = {}
for a, b, c in df.itertuples(index=False, name=None):
    d.setdefault(a, {}).setdefault(b, []).append(c)
d
{'US': {1: [1, 2], 2: [1, 2]}, 'NL': {1: [1]}, 'DK': {1: [1]}}
Same thing but using map and zip
d = {}
for a, b, c in zip(*map(df.get, ['col1', 'col2', 'col3'])):
    d.setdefault(a, {}).setdefault(b, []).append(c)
d
{'US': {1: [1, 2], 2: [1, 2]}, 'NL': {1: [1]}, 'DK': {1: [1]}}
Pandas variants
I don't think these are as good as the method above
1
d = df.groupby(['col1', 'col2'])['col3'].agg(list)
{a: d.xs(a).to_dict() for a in d.index.levels[0]}
{'DK': {1: [1]}, 'NL': {1: [1]}, 'US': {1: [1, 2], 2: [1, 2]}}
2
{
    a: b.xs(a).to_dict()
    for a, b in df.groupby(['col1', 'col2'])['col3'].agg(list).groupby('col1')
}
{'DK': {1: [1]}, 'NL': {1: [1]}, 'US': {1: [1, 2], 2: [1, 2]}}


A better way to create a dictionary out of two lists with duplicated values in one

I have two lists:
a = ["A", "B", "B", "C", "D", "A"]
b = [1, 2, 3, 4, 5, 6]
I want to have a dictionary like the one below:
d = {"A":[1, 6], "B":[2, 3], "C":[4], "D":[5]}
Right now I am doing something like this:
d = {i:[] for i in set(a)}
for c in zip(a, b):
    d[c[0]].append(c[1])
Is there a better way to do this?
You can use the dict.setdefault method to initialize each key as a list and then append the current value to it while you iterate:
d = {}
for k, v in zip(a, b):
    d.setdefault(k, []).append(v)
With the sample input, d would become:
{'A': [1, 6], 'B': [2, 3], 'C': [4], 'D': [5]}
Running your code didn't produce the output you wanted. The version below is a bit more verbose, but it makes it easy to see what the code is doing.
a = ["A", "B", "B", "C", "D", "A"]
b = [1, 2, 3, 4, 5, 6]
c = {}
for k, v in zip(a, b):
    if k in c:
        if isinstance(c[k], list):
            c[k].append(v)
        else:
            c[k] = [c[k], v]
    else:
        c[k] = v
print(c)
OUTPUT
{'A': [1, 6], 'B': [2, 3], 'C': 4, 'D': 5}
A less verbose alternative would be to use a defaultdict with list as the default factory; you can then append all the items to it. This means even single items will be in a list, which to me is much cleaner since you know the data type of each value. However, if you really do want single items not to be in a list, you can clean it up with a dict comprehension:
from collections import defaultdict
a = ["A", "B", "B", "C", "D", "A"]
b = [1, 2, 3, 4, 5, 6]
c = defaultdict(list)
for k, v in zip(a, b):
    c[k].append(v)
d = {k: v if len(v) > 1 else v[0] for k, v in c.items()}
print(c)
print(d)
OUTPUT
defaultdict(<class 'list'>, {'A': [1, 6], 'B': [2, 3], 'C': [4], 'D': [5]})
{'A': [1, 6], 'B': [2, 3], 'C': 4, 'D': 5}

Making a special sort of dictionary

I have a python program in which I have a list which resembles the list below:
a = [[1,2,3], [4,2,7], [5,2,3], [7,8,5]]
Here I want to create a dictionary using the middle value of each sublist as keys which should look something like this:
b = {2:[[1,2,3], [4,2,7], [5,2,3]], 8: [[7,8,5]]}
How can I achieve this?
You can do it simply like this:
a = [[1,2,3], [4,2,7], [5,2,3], [7,8,5]]
b = {}
for l in a:
    m = l[len(l) // 2]  # get the middle element
    if m in b:
        b[m].append(l)
    else:
        b[m] = [l]
print(b)
Output:
{2: [[1, 2, 3], [4, 2, 7], [5, 2, 3]], 8: [[7, 8, 5]]}
You could also use a defaultdict to avoid the if in the loop:
from collections import defaultdict
b = defaultdict(list)
for l in a:
    m = l[len(l) // 2]
    b[m].append(l)
print(b)
Output:
defaultdict(<class 'list'>, {2: [[1, 2, 3], [4, 2, 7], [5, 2, 3]], 8: [[7, 8, 5]]})
Here is a solution that uses a dictionary comprehension with itertools.groupby (the input must be sorted by the grouping key first, since groupby only groups consecutive elements):
from itertools import groupby
a = [[1,2,3], [4,2,7], [5,2,3], [7,8,5]]
def get_mid(x):
    return x[len(x) // 2]
b = {key: list(val) for key, val in groupby(sorted(a, key=get_mid), get_mid)}
print(b)
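A caveat worth noting with this approach: itertools.groupby only groups consecutive equal keys, which is why the input is sorted by the same key function first. A minimal sketch of the difference, using a toy list where equal middle values are not adjacent:

```python
from itertools import groupby

a = [[1, 2, 3], [7, 8, 5], [4, 2, 7]]  # middle values: 2, 8, 2 (not adjacent)

def get_mid(x):
    return x[len(x) // 2]

# Without sorting, groupby only merges *consecutive* equal keys,
# so key 2 shows up in two separate groups here.
unsorted_groups = [(k, list(v)) for k, v in groupby(a, get_mid)]
print(len(unsorted_groups))  # 3 groups; key 2 appears twice

# Sorting by the same key first makes equal keys adjacent,
# so each key maps to one complete group.
b = {k: list(v) for k, v in groupby(sorted(a, key=get_mid), get_mid)}
print(b)  # {2: [[1, 2, 3], [4, 2, 7]], 8: [[7, 8, 5]]}
```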

Dictionary copy() - is shallow deep sometimes?

According to the official docs the dictionary copy is shallow, i.e. it returns a new dictionary that contains the same key-value pairs:
dict1 = {1: "a", 2: "b", 3: "c"}
dict1_alias = dict1
dict1_shallow_copy = dict1.copy()
My understanding is that if we del an element of dict1 both dict1_alias & dict1_shallow_copy should be affected; however, a deepcopy would not.
del dict1[2]
print(dict1)
>>> {1: 'a', 3: 'c'}
print(dict1_alias)
>>> {1: 'a', 3: 'c'}
But dict1_shallow_copy's 2nd element is still there!
print(dict1_shallow_copy)
>>> {1: 'a', 2: 'b', 3: 'c'}
What am I missing?
A shallow copy means that the elements themselves are the same, just not the dictionary itself.
>>> a = {'a': [1, 2, 3],  # create a list instance at a['a']
...      'b': 4,
...      'c': 'efd'}
>>> b = a.copy() #shallow copy a
>>> b['a'].append(2) #change b['a']
>>> b['a']
[1, 2, 3, 2]
>>> a['a'] #a['a'] changes too, it refers to the same list
[1, 2, 3, 2]
>>> del b['b'] #here we do not change b['b'], we change b
>>> b
{'a': [1, 2, 3, 2], 'c': 'efd'}
>>> a #so a remains unchanged
{'a': [1, 2, 3, 2], 'b': 4, 'c': 'efd'}
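For contrast, a deep copy recursively copies the nested objects as well, so mutating a nested list in the copy leaves the original alone. A minimal sketch using copy.deepcopy on a toy dict:

```python
import copy

a = {'a': [1, 2, 3], 'b': 4}
b = copy.deepcopy(a)  # copies the dict *and* the nested list

b['a'].append(2)      # mutate the list inside the deep copy
print(a['a'])         # [1, 2, 3] -- the original list is untouched
print(b['a'])         # [1, 2, 3, 2]
```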

Dataframe to Dictionary [duplicate]

I have a DataFrame with four columns. I want to convert this DataFrame to a python dictionary. I want the elements of first column be keys and the elements of other columns in same row be values.
DataFrame:
ID A B C
0 p 1 3 2
1 q 4 3 2
2 r 4 0 9
Output should be like this:
Dictionary:
{'p': [1,3,2], 'q': [4,3,2], 'r': [4,0,9]}
The to_dict() method sets the column names as dictionary keys so you'll need to reshape your DataFrame slightly. Setting the 'ID' column as the index and then transposing the DataFrame is one way to achieve this.
to_dict() also accepts an 'orient' argument which you'll need in order to output a list of values for each column. Otherwise, a dictionary of the form {index: value} will be returned for each column.
These steps can be done with the following line:
>>> df.set_index('ID').T.to_dict('list')
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
In case a different dictionary format is needed, here are examples of the possible orient arguments. Consider the following simple DataFrame:
>>> df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
>>> df
a b
0 red 0.500
1 yellow 0.250
2 blue 0.125
Then the options are as follows.
dict - the default: column names are keys, values are dictionaries of index:data pairs
>>> df.to_dict('dict')
{'a': {0: 'red', 1: 'yellow', 2: 'blue'},
'b': {0: 0.5, 1: 0.25, 2: 0.125}}
list - keys are column names, values are lists of column data
>>> df.to_dict('list')
{'a': ['red', 'yellow', 'blue'],
'b': [0.5, 0.25, 0.125]}
series - like 'list', but values are Series
>>> df.to_dict('series')
{'a': 0 red
1 yellow
2 blue
Name: a, dtype: object,
'b': 0 0.500
1 0.250
2 0.125
Name: b, dtype: float64}
split - splits columns/data/index as keys with values being column names, data values by row and index labels respectively
>>> df.to_dict('split')
{'columns': ['a', 'b'],
'data': [['red', 0.5], ['yellow', 0.25], ['blue', 0.125]],
'index': [0, 1, 2]}
records - each row becomes a dictionary where key is column name and value is the data in the cell
>>> df.to_dict('records')
[{'a': 'red', 'b': 0.5},
{'a': 'yellow', 'b': 0.25},
{'a': 'blue', 'b': 0.125}]
index - like 'records', but a dictionary of dictionaries with keys as index labels (rather than a list)
>>> df.to_dict('index')
{0: {'a': 'red', 'b': 0.5},
1: {'a': 'yellow', 'b': 0.25},
2: {'a': 'blue', 'b': 0.125}}
Should a dictionary like:
{'red': '0.500', 'yellow': '0.250', 'blue': '0.125'}
be required out of a dataframe like:
a b
0 red 0.500
1 yellow 0.250
2 blue 0.125
the simplest way would be:
dict(df.values)
working snippet below:
import pandas as pd
df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
dict(df.values)
Follow these steps:
Suppose your dataframe is as follows:
>>> df
A B C ID
0 1 3 2 p
1 4 3 2 q
2 4 0 9 r
1. Use set_index to set the ID column as the dataframe index.
df.set_index("ID", drop=True, inplace=True)
2. Use the orient="index" argument to have the index as dictionary keys.
dictionary = df.to_dict(orient="index")
The results will be as follows:
>>> dictionary
{'q': {'A': 4, 'B': 3, 'C': 2}, 'p': {'A': 1, 'B': 3, 'C': 2}, 'r': {'A': 4, 'B': 0, 'C': 9}}
3. If you need each sample as a list, run the following code, fixing the column order first:
column_order = ["A", "B", "C"]  # determine your preferred order of columns
d = {}  # initialize the new dictionary as an empty dictionary
for k in dictionary:
    d[k] = [dictionary[k][column_name] for column_name in column_order]
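The three steps above can be combined into one runnable sketch, using data assumed from the question's example:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['p', 'q', 'r'],
                   'A': [1, 4, 4],
                   'B': [3, 3, 0],
                   'C': [2, 2, 9]})

# Steps 1 and 2: index by ID, then build a dict of row-dicts
dictionary = df.set_index('ID').to_dict(orient='index')

# Step 3: flatten each row-dict into a list in a fixed column order
column_order = ['A', 'B', 'C']
d = {k: [dictionary[k][c] for c in column_order] for k in dictionary}
print(d)  # {'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
```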
Try using zip:
df = pd.read_csv("file")
d = dict([(i, [a, b, c]) for i, a, b, c in zip(df.ID, df.A, df.B, df.C)])
print(d)
Output:
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
If you don't mind the dictionary values being tuples, you can use itertuples:
>>> {x[0]: x[1:] for x in df.itertuples(index=False)}
{'p': (1, 3, 2), 'q': (4, 3, 2), 'r': (4, 0, 9)}
For my use (node names with xy positions) I found user4179775's answer the most helpful / intuitive:
import pandas as pd
df = pd.read_csv('glycolysis_nodes_xy.tsv', sep='\t')
df.head()
nodes x y
0 c00033 146 958
1 c00031 601 195
...
xy_dict_list=dict([(i,[a,b]) for i, a,b in zip(df.nodes, df.x,df.y)])
xy_dict_list
{'c00022': [483, 868],
'c00024': [146, 868],
... }
xy_dict_tuples=dict([(i,(a,b)) for i, a,b in zip(df.nodes, df.x,df.y)])
xy_dict_tuples
{'c00022': (483, 868),
'c00024': (146, 868),
... }
Addendum
I later returned to this issue, for other, but related, work. Here is an approach that more closely mirrors the [excellent] accepted answer.
node_df = pd.read_csv('node_prop-glycolysis_tca-from_pg.tsv', sep='\t')
node_df.head()
node kegg_id kegg_cid name wt vis
0 22 22 c00022 pyruvate 1 1
1 24 24 c00024 acetyl-CoA 1 1
...
Convert Pandas dataframe to a [list], {dict}, {dict of {dict}}, ...
Per accepted answer:
node_df.set_index('kegg_cid').T.to_dict('list')
{'c00022': [22, 22, 'pyruvate', 1, 1],
'c00024': [24, 24, 'acetyl-CoA', 1, 1],
... }
node_df.set_index('kegg_cid').T.to_dict('dict')
{'c00022': {'kegg_id': 22, 'name': 'pyruvate', 'node': 22, 'vis': 1, 'wt': 1},
'c00024': {'kegg_id': 24, 'name': 'acetyl-CoA', 'node': 24, 'vis': 1, 'wt': 1},
... }
In my case, I wanted to do the same thing but with selected columns from the Pandas dataframe, so I needed to slice the columns. There are two approaches.
Directly:
(see: Convert pandas to dictionary defining the columns used for the key values)
node_df.set_index('kegg_cid')[['name', 'wt', 'vis']].T.to_dict('dict')
{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
... }
"Indirectly:" first, slice the desired columns/data from the Pandas dataframe (again, two approaches),
node_df_sliced = node_df[['kegg_cid', 'name', 'wt', 'vis']]
or
node_df_sliced2 = node_df.loc[:, ['kegg_cid', 'name', 'wt', 'vis']]
that can then be used to create a dictionary of dictionaries
node_df_sliced.set_index('kegg_cid').T.to_dict('dict')
{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
... }
Most of the answers do not deal with the situation where ID can exist multiple times in the dataframe. In case ID can be duplicated in the DataFrame df, you want to use a list to store the values (i.e. a list of lists), grouped by ID:
{k: [g['A'].tolist(), g['B'].tolist(), g['C'].tolist()] for k,g in df.groupby('ID')}
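A runnable sketch of this comprehension with a duplicated ID (toy data assumed):

```python
import pandas as pd

# 'p' appears twice, so its values are collected into per-column lists
df = pd.DataFrame({'ID': ['p', 'p', 'q'],
                   'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})

out = {k: [g['A'].tolist(), g['B'].tolist(), g['C'].tolist()]
       for k, g in df.groupby('ID')}
print(out)  # {'p': [[1, 2], [4, 5], [7, 8]], 'q': [[3], [6], [9]]}
```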
A dictionary comprehension with the iterrows() method could also be used to get the desired output:
result = {row.ID: [row.A, row.B, row.C] for (index, row) in df.iterrows()}
df = pd.DataFrame([['p',1,3,2], ['q',4,3,2], ['r',4,0,9]], columns=['ID','A','B','C'])
my_dict = {k:list(v) for k,v in zip(df['ID'], df.drop(columns='ID').values)}
print(my_dict)
with output
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
With this method, the columns of the dataframe will be the keys and the series of the dataframe will be the values.
data_dict = dict()
for col in dataframe.columns:
    data_dict[col] = dataframe[col].values.tolist()
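This loop produces the same shape as to_dict('list'); a quick check, assuming a small toy frame:

```python
import pandas as pd

dataframe = pd.DataFrame({'ID': ['p', 'q'], 'A': [1, 4], 'B': [3, 3]})

# Each column name becomes a key; the column's values become a list
data_dict = dict()
for col in dataframe.columns:
    data_dict[col] = dataframe[col].values.tolist()

print(data_dict)  # {'ID': ['p', 'q'], 'A': [1, 4], 'B': [3, 3]}
assert data_dict == dataframe.to_dict('list')  # built-in equivalent
```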
DataFrame.to_dict() converts DataFrame to dictionary.
Example
>>> df = pd.DataFrame(
...     {'col1': [1, 2], 'col2': [0.5, 0.75]}, index=['a', 'b'])
>>> df
   col1  col2
a     1  0.50
b     2  0.75
>>> df.to_dict()
{'col1': {'a': 1, 'b': 2}, 'col2': {'a': 0.5, 'b': 0.75}}
See the to_dict() documentation for details.

How to create multiple value dictionary from pandas data frame

Lets say I have a pandas data frame with 2 columns(column A and Column B):
For values in column 'A' there are multiple values in column 'B'.
I want to create a dictionary with multiple values for each key, and those values should be unique as well. Please suggest a way to do this.
One way is to groupby columns A:
In [1]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
In [2]: df
Out[2]:
A B
0 1 2
1 1 4
2 5 6
In [3]: g = df.groupby('A')
Apply tolist on each of the group's column B:
In [4]: g['B'].tolist()  # shorthand for .apply(lambda s: s.tolist()) via automatic delegation
Out[4]:
A
1 [2, 4]
5 [6]
dtype: object
And then call to_dict on this Series:
In [5]: g['B'].tolist().to_dict()
Out[5]: {1: [2, 4], 5: [6]}
If you want these to be unique, use unique (Note: this will create a numpy array rather than a list):
In [11]: df = pd.DataFrame([[1, 2], [1, 2], [5, 6]], columns=['A', 'B'])
In [12]: g = df.groupby('A')
In [13]: g['B'].unique()
Out[13]:
A
1 [2]
5 [6]
dtype: object
In [14]: g['B'].unique().to_dict()
Out[14]: {1: array([2]), 5: array([6])}
Other alternatives are to use .apply(lambda s: set(s)), .apply(lambda s: list(set(s))), .apply(lambda s: list(s.unique()))...
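On recent pandas versions the .tolist() delegation shown above may not be available on a groupby; agg(list) is a safer spelling of the same idea (toy frame assumed):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 2], [5, 6]], columns=['A', 'B'])

# Collect each group's B values into a list, then into a dict
d = df.groupby('A')['B'].agg(list).to_dict()
print(d)  # {1: [2, 2], 5: [6]}

# Unique values per key, kept as plain Python lists (not numpy arrays)
d_unique = df.groupby('A')['B'].agg(lambda s: list(s.unique())).to_dict()
print(d_unique)  # {1: [2], 5: [6]}
```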
You can actually loop over df.groupby object and collect the value as list.
In[1]:
df = pd.DataFrame([[1, 2], [1, 2], [5, 6]], columns=['A', 'B'])
{k: list(v) for k,v in df.groupby("A")["B"]}
Out[1]:
{1: [2, 2], 5: [6]}
