Fill df with empty rows based on index of other df - python-3.x

I am trying to use df.update(), but my DataFrames have different sizes. I want to pad the smaller DataFrame with dummy rows so its shape matches the bigger one. Here's a minimal example:
import pandas as pd
import numpy as np
data = {
    "Feat_A": ["INVALID", "INVALID", "INVALID"],
    "Feat_B": ["INVALID", "INVALID", "INVALID"],
    "Key": [12, 25, 99],
}
df = pd.DataFrame(data=data)
data = {"Feat_A": [1, np.nan], "Feat_B": [np.nan, 2], "Key": [12, 99]}
result = pd.DataFrame(data=data)
# df.update(result) not working because of different sizes/shapes
# result should be
#    Feat_A  Feat_B   Key
# 0     1.0     NaN  12.0
# 1     NaN     NaN   NaN
# 2     NaN     2.0  99.0
# df.update(result) should work now

This did it:
df.update(result.set_index('Key').reindex(df.set_index('Key').index).reset_index())
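Unpacked, that one-liner aligns result to df's keys before updating. A step-by-step sketch of the same chain (the intermediate names are mine; the behavior is unchanged):
keyed = result.set_index('Key')        # Key becomes the alignment index
target = df.set_index('Key').index     # every Key present in df
padded = keyed.reindex(target)         # keys missing from result become all-NaN rows
df.update(padded.reset_index())        # shapes now match; update skips the NaN cells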

Does this meet your needs? I modified your example to use unique DataFrame values to confirm proper alignment:
# Modified example
data = {
    "Feat_A": ["INVALID_A12", "INVALID_A25", "INVALID_A99"],
    "Feat_B": ["INVALID_B12", "INVALID_B25", "INVALID_B99"],
    "Key": [12, 25, 99],
}
df = pd.DataFrame(data=data)
data = {"Feat_A": [1, np.nan], "Feat_B": [np.nan, 2], "Key": [12, 99]}
result = pd.DataFrame(data=data)
# Use the Key column as the DataFrame index on both sides
df = df.set_index('Key')
result = result.set_index('Key')
# Add all-NaN rows for keys that exist in df but not in result
result = result.reindex_like(df)
# Update (the direction is reversed here on purpose: every cell of result
# is overwritten from df, so any misalignment would be visible below)
result.update(df)
print(result)
          Feat_A       Feat_B
Key
12   INVALID_A12  INVALID_B12
25   INVALID_A25  INVALID_B25
99   INVALID_A99  INVALID_B99


Calculating weighted average using grouped .agg in pandas

I would like to calculate, by group, the mean of one column and the weighted mean of another column in a dataset using the .agg() function within pandas. I am aware of a few solutions, but they aren't very concise.
One solution has been posted here (pandas and groupby: how to calculate weighted averages within an agg), but it still doesn't seem very flexible because the weights column is hard-coded in the lambda function definition. I'm looking to create a syntax closer to this:
(
    df
    .groupby(['group'])
    .agg(avg_x=('x', 'mean'),
         wt_avg_y=('y', 'weighted_mean', weights='weight'))
)
Here is a fully worked example with code that seems needlessly complicated:
import pandas as pd
import numpy as np

# sample dataset
df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
    'weights': [0.75, 0.25, 0.75, 0.25]
})
df
#>>>   group  x  y  weights
#>>> 0     a  1  5     0.75
#>>> 1     a  2  6     0.25
#>>> 2     b  3  7     0.75
#>>> 3     b  4  8     0.25

# aggregation logic
summary = pd.concat(
    [
        df.groupby(['group']).x.mean(),
        df.groupby(['group']).apply(lambda x: np.average(x['y'], weights=x['weights']))
    ], axis=1
)
# manipulation to format the output of the aggregation
summary = summary.reset_index().rename(columns={'x': 'avg_x', 0: 'wt_avg_y'})
# final output
summary
#>>>   group  avg_x  wt_avg_y
#>>> 0     a   1.50      5.25
#>>> 1     b   3.50      7.25
Using the .apply() method on the entire DataFrame was the simplest solution I could arrive at that does not hardcode the column name inside the function definition.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
    'weights': [0.75, 0.25, 0.75, 0.25]
})
summary = (
    df
    .groupby(['group'])
    .apply(
        lambda x: pd.Series([
            np.mean(x['x']),
            np.average(x['y'], weights=x['weights'])
        ], index=['avg_x', 'wt_avg_y'])
    )
    .reset_index()
)
# final output
summary
#>>>   group  avg_x  wt_avg_y
#>>> 0     a   1.50      5.25
#>>> 1     b   3.50      7.25
How about this:
grouped = df.groupby('group')

def wavg(group):
    group['mean_x'] = group['x'].mean()
    group['wavg_y'] = np.average(group['y'], weights=group.loc[:, "weights"])
    return group

grouped.apply(wavg)
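Since wavg returns every row of each group with the aggregates broadcast onto it, one extra step collapses the result to one row per group. A small follow-up sketch (the drop_duplicates step is my addition, not part of the answer above):
(grouped.apply(wavg)
        .drop_duplicates(subset='group')[['group', 'mean_x', 'wavg_y']])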
Try:
# normalize the weights so they sum to 1 within each group
# (note: this overwrites the weights and y columns of df in place)
df["weights"] = df["weights"].div(
    df.join(df.groupby("group")["weights"].sum(), on="group", rsuffix="_2").iloc[:, -1]
)
# pre-multiply y by its normalized weight, so a plain sum yields the weighted mean
df["y"] = df["y"].mul(df["weights"])
res = df.groupby("group", as_index=False).agg({"x": "mean", "y": "sum"})
Outputs:
  group    x     y
0     a  1.5  5.25
1     b  3.5  7.25
Since your weights sum to 1 within groups, you can assign a new column and groupby as usual:
(df.assign(wt_avg_y=df['y'] * df['weights'])
   .groupby('group')
   .agg({'x': 'mean', 'wt_avg_y': 'sum', 'weights': 'sum'})
   .assign(wt_avg_y=lambda x: x['wt_avg_y'] / x['weights'])
)
Output:
         x  wt_avg_y  weights
group
a      1.5      5.25      1.0
b      3.5      7.25      1.0
Steven M. Mortimer's solution is clean and easy to read. Alternatively, one could use dict notation inside pd.Series() such that the index= argument is not needed. This provides slightly better readability in my opinion.
summary = (
    df
    .groupby(['group'])
    .apply(
        lambda x: pd.Series({
            'avg_x': np.mean(x['x']),
            'wt_avg_y': np.average(x['y'], weights=x['weights'])
        }))
    .reset_index()
)
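For something closer to the named-aggregation syntax the question asks for, a small factory can close over the weights column: the aggregating function receives each group's 'y' values with their original row labels, so the matching weights can be looked up in the parent frame. A sketch, assuming df is in scope and its index is not reordered before the groupby:
import numpy as np

def weighted_mean(frame, weights_col):
    # returns an aggregator usable inside .agg(); the incoming Series `s`
    # keeps its original index, so weights can be aligned from `frame`
    def agg(s):
        return np.average(s, weights=frame.loc[s.index, weights_col])
    return agg

summary = (
    df
    .groupby('group')
    .agg(avg_x=('x', 'mean'),
         wt_avg_y=('y', weighted_mean(df, 'weights')))
    .reset_index()
)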

Error when checking target: expected dense_2 to have shape (45, 20) but got array with shape (45, 1)

I have a dataframe of shape (2000,45) and I need to convert it to (2000,45,20).
Each cell across the 45 columns contains a list of 20 elements.
Please help me out with this.
The dataset was shown as a screenshot in the original post.
Before I begin, a correction to your question: each cell in your dataframe is a list of 20 elements.
import pandas as pd
import numpy as np

df = pd.DataFrame({0: [[2, 3, 4], [1, 2, 3], [4, 5, 6]],
                   1: [[2, 3, 4], [1, 2, 3], [4, 5, 6]]})

def itemExploder(df):
    arr = df.to_numpy()
    # dimensions: number of rows, number of columns, elements per cell
    dim1, dim2, dim3 = len(arr), len(arr[0]), len(arr[0][0])
    out = np.zeros((dim1, dim2, dim3), dtype=int)
    for r in range(dim1):
        for c in range(dim2):
            for i, item in enumerate(arr[r][c]):
                out[r][c][i] = item
    print(out.shape)
    return out

newDf = itemExploder(df)
This function assumes you have the same number of elements (20) in each cell.
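Under the same equal-length assumption, the explicit loops aren't strictly needed: NumPy can build the 3-D array straight from the nested lists. A shorter sketch:
# df.values is a 2-D object array of lists; .tolist() turns it into nested
# Python lists, which np.array stacks into shape (rows, columns, 20)
arr = np.array(df.values.tolist())
print(arr.shape)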

How can I iterate over pandas dataframes and concatenate on another dataframe [duplicate]

I have 3 CSV files. Each has the first column as the (string) names of people, while all the other columns in each dataframe are attributes of that person.
How can I "join" together all three CSV documents to create a single CSV with each row having all the attributes for each unique value of the person's string name?
The join() function in pandas specifies that I need a multiindex, but I'm confused about what a hierarchical indexing scheme has to do with making a join based on a single index.
Zero's answer is basically a reduce operation. If I had more than a handful of dataframes, I'd put them in a list like this (generated via list comprehensions or loops or whatnot):
dfs = [df0, df1, df2, ..., dfN]
Assuming they have a common column, like name in your example, I'd do the following:
import pandas as pd
import functools as ft

df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)
That way, your code should work with whatever number of dataframes you want to merge.
You could try this if you have 3 dataframes:
# Merge multiple dataframes
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

pd.merge(pd.merge(df1, df2, on='name'), df3, on='name')
Alternatively, as mentioned by cwharland:
df1.merge(df2, on='name').merge(df3, on='name')
This is an ideal situation for the join method
The join method is built exactly for these types of situations. You can join any number of DataFrames together with it. The calling DataFrame joins with the index of the collection of passed DataFrames. To work with multiple DataFrames, you must put the joining columns in the index.
The code would look something like this:
filenames = ['fn1', 'fn2', 'fn3', 'fn4', ....]
dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames]
dfs[0].join(dfs[1:])
With #zero's data, you could do this:
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:])

     attr11 attr12 attr21 attr22 attr31 attr32
name
a         5      9      5     19     15     49
b         4     61     14     16      4     36
c        24      9      4      9     14      9
In Python 3.6.3 with pandas 0.22.0 you can also use concat, as long as you set the columns you want to use for the joining as the index:
pd.concat(
    objs=(iDF.set_index('name') for iDF in (df1, df2, df3)),
    axis=1,
    join='inner'
).reset_index()
where df1, df2, and df3 are defined as in John Galt's answer:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12']
)
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22']
)
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32']
)
This can also be done as follows for a list of dataframes df_list:
df = df_list[0]
for df_ in df_list[1:]:
    df = df.merge(df_, on='join_col_name')
or if the dataframes are in a generator object (e.g. to reduce memory consumption):
df = next(df_list)
for df_ in df_list:
    df = df.merge(df_, on='join_col_name')
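For instance, the generator variant pairs naturally with lazily read CSV files (the filenames here are hypothetical), so only the running result and one freshly read frame are in memory at a time:
import pandas as pd

filenames = ['part1.csv', 'part2.csv', 'part3.csv']  # hypothetical files
df_list = (pd.read_csv(f) for f in filenames)        # generator: frames are read lazily
df = next(df_list)
for df_ in df_list:
    df = df.merge(df_, on='join_col_name')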
Simple Solution:
If the column names are similar:
df1.merge(df2, on='col_name').merge(df3, on='col_name')
If the column names are different:
(df1.merge(df2, left_on='col_name1', right_on='col_name2')
    .merge(df3, left_on='col_name1', right_on='col_name3')
    .drop(columns=['col_name2', 'col_name3'])
    .rename(columns={'col_name1': 'col_name'}))
Here is a method to merge a dictionary of data frames while keeping the column names in sync with the dictionary. It also fills in missing values if needed.
This is the function to merge a dict of data frames:
def MergeDfDict(dfDict, onCols, how='outer', naFill=None):
    keys = list(dfDict.keys())
    for i in range(len(keys)):
        key = keys[i]
        df0 = dfDict[key]
        cols = list(df0.columns)
        valueCols = list(filter(lambda x: x not in (onCols), cols))
        df0 = df0[onCols + valueCols]
        # suffix the value columns with the dict key so the names stay in sync
        df0.columns = onCols + [(s + '_' + key) for s in valueCols]
        if i == 0:
            outDf = df0
        else:
            outDf = pd.merge(outDf, df0, how=how, on=onCols)
    if naFill is not None:
        outDf = outDf.fillna(naFill)
    return outDf
OK, let's generate data and test this:
def GenDf(size):
    df = pd.DataFrame({
        'categ1': np.random.choice(a=['a', 'b', 'c', 'd', 'e'], size=size, replace=True),
        'categ2': np.random.choice(a=['A', 'B'], size=size, replace=True),
        'col1': np.random.uniform(low=0.0, high=100.0, size=size),
        'col2': np.random.uniform(low=0.0, high=100.0, size=size)
    })
    df = df.sort_values(['categ2', 'categ1', 'col1', 'col2'])
    return df

size = 5
dfDict = {'US': GenDf(size), 'IN': GenDf(size), 'GER': GenDf(size)}
MergeDfDict(dfDict=dfDict, onCols=['categ1', 'categ2'], how='outer', naFill=0)
One does not need a MultiIndex to perform join operations.
One just needs to set the index column on which to perform the join operations correctly (with the command df.set_index('Name'), for example).
The join operation is performed on the index by default.
In your case, you just have to specify that the Name column corresponds to your index.
Below is an example
A tutorial may be useful.
# Simple example where dataframes index are the name on which to perform
# the join operations
import pandas as pd
import numpy as np
name = ['Sophia' ,'Emma' ,'Isabella' ,'Olivia' ,'Ava' ,'Emily' ,'Abigail' ,'Mia']
df1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=name)
df2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=name)
df3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=name)
df = df1.join(df2)
df = df.join(df3)
# If you have a 'Name' column that is not the index of your dataframe,
# one can set this column to be the index
# 1) Create a column 'Name' based on the previous index
df1['Name'] = df1.index
# 2) Set the index from column 'Name'
df1 = df1.set_index('Name')
# If indexes are different, one may have to play with parameter how
gf1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=range(8))
gf2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=range(2,10))
gf3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=range(4,12))
gf = gf1.join(gf2, how='outer')
gf = gf.join(gf3, how='outer')
There is another solution from the pandas documentation (that I don't see here), using .append:
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
>>> df
   A  B
0  1  2
1  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
>>> df2
   A  B
0  5  6
1  7  8
>>> df.append(df2, ignore_index=True)
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
ignore_index=True is used to ignore the index of the appended dataframe, replacing it with the next index available in the source one.
If there are different column names, NaN will be introduced. Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat is the replacement.
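A minimal equivalent with pd.concat, for pandas versions where .append is gone:
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
# same row-stacking behavior as df.append(df2, ignore_index=True)
out = pd.concat([df, df2], ignore_index=True)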
I tweaked the accepted answer to perform the operation for multiple dataframes with different suffixes parameters using reduce, and I guess it can be extended to different on parameters as well.
from functools import reduce

dfs_with_suffixes = [(df2, suffix2), (df3, suffix3), (df4, suffix4)]

merge_one = lambda x, y, sfx: pd.merge(x, y, on=['col1', 'col2', ...], suffixes=sfx)

merged = reduce(lambda left, right: merge_one(left, *right), dfs_with_suffixes, df1)
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12']
)
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['d', 14, 16]]),
    columns=['name', 'attr21', 'attr22']
)
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['c', 4, 36],
    ['d', 14, 9]]),
    columns=['name', 'attr31', 'attr32']
)
df4 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['c', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr41', 'attr42']
)
Three ways to join a list of dataframes
pandas.concat
dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
# cannot be used if the index is not unique
dfs = pd.concat(dfs, join='outer', axis=1)
functools.reduce
dfs = [df1, df2, df3, df4]
# still runs even if the index is not unique
import functools as ft
df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='name', how='outer'), dfs)
join
# cannot be used if the index is not unique
dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:], how='outer')
Joining together all three can be done using the .join() function.
Say you have three DataFrames: df1, df2, df3.
To join these into one DataFrame you can:
df = df1.join(df2).join(df3)
This is the simplest way I found to do this task.

split a list of dictionaries into multiple columns

I have a dataframe with 30000 rows and 5 columns. One of these columns is a list of dictionaries plus a few NaNs. I want to split this column into 3 fields (Legroom through In-flight Entertainment) and extract the ratings.
Below is a sample for reference
d = {'col1': [[{'rating': 5, 'ratingLabel': 'Legroom'},
               {'rating': 5, 'ratingLabel': 'Seat comfort'},
               {'rating': 5, 'ratingLabel': 'In-flight Entertainment'}],
              'Nan']}
df = pd.DataFrame(data=d)
df
IIUC this should do the trick (the guard matters: non-list cells, such as the 'Nan' string, produce an empty Series and hence a row of NaN):
df = df["col1"].apply(
    lambda x: pd.Series({el["ratingLabel"]: el["rating"] for el in x}
                        if isinstance(x, list) else {})
)
Output:
   Legroom  Seat comfort  In-flight Entertainment
0      5.0           5.0                      5.0
1      NaN           NaN                      NaN
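An alternative sketch using explode and unstack, starting again from the original df defined in the question; it also copes with cells whose label sets differ (the variable names are mine, and the final reindex restores all-NaN rows for the non-list cells):
lists = df['col1'].apply(lambda x: x if isinstance(x, list) else [])
long = lists.explode().dropna()                      # one dict per row, original row label kept
ratings = pd.DataFrame(long.tolist(), index=long.index)
wide = ratings.set_index('ratingLabel', append=True)['rating'].unstack()
result = wide.reindex(df.index)                      # all-NaN rows where the cell wasn't a list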
Here is a possible solution using DataFrame.apply() and pd.Series, with a strategy from Splitting dictionary/list inside a Pandas Column into Separate Columns:
import pandas as pd

d = {'col1': [[{'rating': 5, 'ratingLabel': 'Legroom'},
               {'rating': 5, 'ratingLabel': 'Seat comfort'},
               {'rating': 5, 'ratingLabel': 'In-flight Entertainment'}],
              [{'rating': 5, 'ratingLabel': 'Legroom'},
               {'rating': 5, 'ratingLabel': 'Seat comfort'},
               {'rating': 5, 'ratingLabel': 'In-flight Entertainment'}],
              'Nan']}
df = pd.DataFrame(data=d)

df_split = df['col1'].apply(pd.Series)
pd.concat([df,
           df_split[0].apply(pd.Series).rename(columns={'rating': 'legroom_rating',
                                                         'ratingLabel': '1'}),
           df_split[1].apply(pd.Series).rename(columns={'rating': 'seat_comfort_rating',
                                                        'ratingLabel': '2'}),
           df_split[2].apply(pd.Series).rename(columns={'rating': 'in_flight_entertainment_rating',
                                                        'ratingLabel': '3'})],
          axis=1).drop(['col1', '1', '2', '3', 0],
                       axis=1)
Producing the following DataFrame (shown as a screenshot in the original post).

Dataframe to Dictionary [duplicate]

I have a DataFrame with four columns. I want to convert this DataFrame to a Python dictionary. I want the elements of the first column to be the keys and the elements of the other columns in the same row to be the values.
DataFrame:
  ID  A  B  C
0  p  1  3  2
1  q  4  3  2
2  r  4  0  9
Output should be like this:
Dictionary:
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
The to_dict() method sets the column names as dictionary keys so you'll need to reshape your DataFrame slightly. Setting the 'ID' column as the index and then transposing the DataFrame is one way to achieve this.
to_dict() also accepts an 'orient' argument which you'll need in order to output a list of values for each column. Otherwise, a dictionary of the form {index: value} will be returned for each column.
These steps can be done with the following line:
>>> df.set_index('ID').T.to_dict('list')
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
In case a different dictionary format is needed, here are examples of the possible orient arguments. Consider the following simple DataFrame:
>>> df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
>>> df
        a      b
0     red  0.500
1  yellow  0.250
2    blue  0.125
Then the options are as follows.
dict - the default: column names are keys, values are dictionaries of index:data pairs
>>> df.to_dict('dict')
{'a': {0: 'red', 1: 'yellow', 2: 'blue'},
'b': {0: 0.5, 1: 0.25, 2: 0.125}}
list - keys are column names, values are lists of column data
>>> df.to_dict('list')
{'a': ['red', 'yellow', 'blue'],
'b': [0.5, 0.25, 0.125]}
series - like 'list', but values are Series
>>> df.to_dict('series')
{'a': 0 red
1 yellow
2 blue
Name: a, dtype: object,
'b': 0 0.500
1 0.250
2 0.125
Name: b, dtype: float64}
split - a dictionary with the keys 'columns', 'data', and 'index', whose values are the column names, the data values row by row, and the index labels respectively
>>> df.to_dict('split')
{'columns': ['a', 'b'],
'data': [['red', 0.5], ['yellow', 0.25], ['blue', 0.125]],
'index': [0, 1, 2]}
records - each row becomes a dictionary where key is column name and value is the data in the cell
>>> df.to_dict('records')
[{'a': 'red', 'b': 0.5},
{'a': 'yellow', 'b': 0.25},
{'a': 'blue', 'b': 0.125}]
index - like 'records', but a dictionary of dictionaries with keys as index labels (rather than a list)
>>> df.to_dict('index')
{0: {'a': 'red', 'b': 0.5},
1: {'a': 'yellow', 'b': 0.25},
2: {'a': 'blue', 'b': 0.125}}
Should a dictionary like:
{'red': 0.5, 'yellow': 0.25, 'blue': 0.125}
be required out of a dataframe like:
        a      b
0     red  0.500
1  yellow  0.250
2    blue  0.125
the simplest way would be to do (note this works only for a two-column frame: the first column becomes the keys, the second the values):
dict(df.values)
working snippet below:
import pandas as pd
df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
dict(df.values)
Follow these steps:
Suppose your dataframe is as follows:
>>> df
   A  B  C ID
0  1  3  2  p
1  4  3  2  q
2  4  0  9  r
1. Use set_index to set the ID column as the dataframe index.
df.set_index("ID", drop=True, inplace=True)
2. Use the orient="index" parameter to have the index as dictionary keys.
dictionary = df.to_dict(orient="index")
The results will be as follows:
>>> dictionary
{'p': {'A': 1, 'B': 3, 'C': 2}, 'q': {'A': 4, 'B': 3, 'C': 2}, 'r': {'A': 4, 'B': 0, 'C': 9}}
3. If you need to have each sample as a list, run the following code, fixing the column order first:
column_order = ["A", "B", "C"]  # Determine your preferred order of columns
d = {}  # Initialize the new dictionary as an empty dictionary
for k in dictionary:
    d[k] = [dictionary[k][column_name] for column_name in column_order]
Try using zip:
df = pd.read_csv("file")
d = dict([(i, [a, b, c]) for i, a, b, c in zip(df.ID, df.A, df.B, df.C)])
print(d)
Output:
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
If you don't mind the dictionary values being tuples, you can use itertuples:
>>> {x[0]: x[1:] for x in df.itertuples(index=False)}
{'p': (1, 3, 2), 'q': (4, 3, 2), 'r': (4, 0, 9)}
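If lists are preferred over tuples, a small variation of the same comprehension converts each slice (still assuming the ID column comes first):
>>> {x[0]: list(x[1:]) for x in df.itertuples(index=False)}
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}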
For my use case (node names with xy positions) I found @user4179775's answer the most helpful / intuitive:
import pandas as pd
df = pd.read_csv('glycolysis_nodes_xy.tsv', sep='\t')
df.head()
nodes x y
0 c00033 146 958
1 c00031 601 195
...
xy_dict_list=dict([(i,[a,b]) for i, a,b in zip(df.nodes, df.x,df.y)])
xy_dict_list
{'c00022': [483, 868],
'c00024': [146, 868],
... }
xy_dict_tuples=dict([(i,(a,b)) for i, a,b in zip(df.nodes, df.x,df.y)])
xy_dict_tuples
{'c00022': (483, 868),
'c00024': (146, 868),
... }
Addendum
I later returned to this issue, for other, but related, work. Here is an approach that more closely mirrors the [excellent] accepted answer.
node_df = pd.read_csv('node_prop-glycolysis_tca-from_pg.tsv', sep='\t')
node_df.head()
node kegg_id kegg_cid name wt vis
0 22 22 c00022 pyruvate 1 1
1 24 24 c00024 acetyl-CoA 1 1
...
Convert Pandas dataframe to a [list], {dict}, {dict of {dict}}, ...
Per accepted answer:
node_df.set_index('kegg_cid').T.to_dict('list')
{'c00022': [22, 22, 'pyruvate', 1, 1],
'c00024': [24, 24, 'acetyl-CoA', 1, 1],
... }
node_df.set_index('kegg_cid').T.to_dict('dict')
{'c00022': {'kegg_id': 22, 'name': 'pyruvate', 'node': 22, 'vis': 1, 'wt': 1},
'c00024': {'kegg_id': 24, 'name': 'acetyl-CoA', 'node': 24, 'vis': 1, 'wt': 1},
... }
In my case, I wanted to do the same thing but with selected columns from the Pandas dataframe, so I needed to slice the columns. There are two approaches.
Directly:
(see: Convert pandas to dictionary defining the columns used for the key values)
node_df.set_index('kegg_cid')[['name', 'wt', 'vis']].T.to_dict('dict')
{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
... }
"Indirectly:" first, slice the desired columns/data from the Pandas dataframe (again, two approaches),
node_df_sliced = node_df[['kegg_cid', 'name', 'wt', 'vis']]
or
node_df_sliced2 = node_df.loc[:, ['kegg_cid', 'name', 'wt', 'vis']]
that can then be used to create a dictionary of dictionaries
node_df_sliced.set_index('kegg_cid').T.to_dict('dict')
{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
... }
Most of the answers do not deal with the situation where ID can exist multiple times in the dataframe. In case ID can be duplicated in the DataFrame df, you want to use a list to store the values (a.k.a. a list of lists), grouped by ID:
{k: [g['A'].tolist(), g['B'].tolist(), g['C'].tolist()] for k,g in df.groupby('ID')}
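A quick worked example with a duplicated ID, to show the shape of the result (the data here is invented for illustration):
>>> df = pd.DataFrame([['p', 1, 3, 2], ['p', 7, 8, 9], ['q', 4, 3, 2]],
...                   columns=['ID', 'A', 'B', 'C'])
>>> {k: [g['A'].tolist(), g['B'].tolist(), g['C'].tolist()] for k, g in df.groupby('ID')}
{'p': [[1, 7], [3, 8], [2, 9]], 'q': [[4], [3], [2]]}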
Dictionary comprehension & iterrows() method could also be used to get the desired output.
result = {row.ID: [row.A, row.B, row.C] for (index, row) in df.iterrows()}
df = pd.DataFrame([['p',1,3,2], ['q',4,3,2], ['r',4,0,9]], columns=['ID','A','B','C'])
my_dict = {k:list(v) for k,v in zip(df['ID'], df.drop(columns='ID').values)}
print(my_dict)
with output
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
With this method, the columns of the dataframe will be the keys and the series of the dataframe will be the values.
data_dict = dict()
for col in dataframe.columns:
    data_dict[col] = dataframe[col].values.tolist()
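Assuming the values are plain scalars, this should match the built-in 'list' orient, so the loop can be replaced with a one-liner:
data_dict = dataframe.to_dict('list')  # same {column: list of values} mapping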
DataFrame.to_dict() converts a DataFrame to a dictionary.
Example
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [0.5, 0.75]}, index=['a', 'b'])
>>> df
   col1  col2
a     1  0.50
b     2  0.75
>>> df.to_dict()
{'col1': {'a': 1, 'b': 2}, 'col2': {'a': 0.5, 'b': 0.75}}
See the to_dict documentation for details.
