Python Pandas - Update row with dictionary based on index, column

I have a dataframe with empty columns and a corresponding dictionary whose values I would like to write into those empty columns, based on index and column:
import pandas as pd
import numpy as np
dataframe = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [4, 6, 2], [3, 4, 1]])
dataframe.columns = ['x', 'y', 'z']
additional_cols = ['a', 'b', 'c']
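for col in additional_cols:
    dataframe[col] = np.nan
   x  y  z   a   b   c
0  1  2  3 NaN NaN NaN
1  4  5  6 NaN NaN NaN
2  7  8  9 NaN NaN NaN
3  4  6  2 NaN NaN NaN
4  3  4  1 NaN NaN NaN
for idx, row in dataframe.iterrows():
    # calculations to return dictionary y
    y = {"a": 5, "b": 6, "c": 7}
    dataframe.loc[idx, :].map(y)  # my attempt; this does not update the row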
Basically, after performing the calculations using columns x, y, z, I would like to update columns a, b, c for that same row :)

I could use a function like the one below, but I am not sure whether the pandas library already offers a DataFrame method for this...
def update_row_with_dict(dictionary, dataframe, index):
    for key in dictionary.keys():
        dataframe.loc[index, key] = dictionary.get(key)

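The above answer with correct indentation:
def update_row_with_dict(df, d, idx):
    for key in d.keys():
        df.loc[idx, key] = d.get(key)
A shorter version would be:
def update_row_with_dict(df, d, idx):
    # list() turns the keys and values into plain lists, which .loc accepts as column labels and row values
    df.loc[idx, list(d.keys())] = list(d.values())
For your code snippet, the usage would be: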
import pandas as pd
import numpy as np
dataframe = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [4, 6, 2], [3, 4, 1]])
dataframe.columns = ['x', 'y', 'z']
additional_cols = ['a', 'b', 'c']
for col in additional_cols:
    dataframe[col] = np.nan
for idx in dataframe.index:
    y = {'a': 1, 'b': 2, 'c': 3}
    update_row_with_dict(dataframe, y, idx)
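For completeness, here is a minimal sketch of the same idea where the dictionary really is computed from x, y, z of each row. The arithmetic inside the loop is a made-up placeholder (an assumption, not the asker's calculation), and the frame is shortened to three rows.

```python
import numpy as np
import pandas as pd


def update_row_with_dict(df, d, idx):
    """Write the values of d into row idx, using the keys of d as column labels."""
    df.loc[idx, list(d.keys())] = list(d.values())


dataframe = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["x", "y", "z"])
for col in ["a", "b", "c"]:
    dataframe[col] = np.nan

for idx in dataframe.index:
    x, y, z = dataframe.loc[idx, ["x", "y", "z"]]
    # Hypothetical per-row calculation standing in for the real one.
    result = {"a": x + y, "b": y * z, "c": x - z}
    update_row_with_dict(dataframe, result, idx)

print(dataframe)
```

If the calculation can be expressed on whole columns, assigning the columns directly (for example dataframe["a"] = dataframe["x"] + dataframe["y"]) is usually simpler than updating row by row.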

Related

add_scatter and custom data

How can I use fig.add_scatter so that the custom data appears in the hover label?
The minimal code below is not working.
I need to add another set of data with the same hover template as the first one.
Many thanks
import numpy as np
import pandas as pd
import plotly.express as px
a, b, c = [1, 2], [1, 5], [5, 6]
d, e, f = [5, 5], [4, 4], [5, 5]
s1 = ['A', 'F']
s2 = ['V', 'T']
d = {'a': a, 'b': b, 'c': c, 's1':s1}
df = pd.DataFrame(data=d)
d2 = {'d': d, 'e': e, 'f': f, 's2':s2}
df2 = pd.DataFrame(data=d2)
fig = px.scatter(df, x='a', y='b', hover_data=['c', 's1'], color='s1', color_discrete_sequence=["green", "navy"])
fig.add_scatter(x=df2['d'], y=df2['e'], customdata=['f', 's2'], mode="markers", marker=dict(size=10,color='Purple'), name = 'A') # ------> these custom data are not in label, there is just %{customdata[1]}
fig.update_traces(
hovertemplate="<br>".join([
"<b>G:</b> %{x:.3f}",
"<b>R:</b> %{y:.6f}<extra></extra>",
"<b>D:</b> %{customdata[1]}",
"<b>E:</b> %{customdata[0]}",
])
)
fig.update_xaxes(title_font_family="Trebuchet")
fig.update_traces(marker=dict(size=9),
selector=dict(mode='markers'))
fig.show()
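There are errors in creating df2: the name d is rebound to a dictionary before d2 is built, so d2["d"] is no longer the list [5, 5]. I have assumed what you are trying to achieve. The code below makes the hover text work.
import numpy as np
import pandas as pd
import plotly.express as px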
a, b, c = [1, 2], [1, 5], [5, 6]
d, e, f = [5, 5], [4, 4], [5, 5]
s1 = ["A", "F"]
s2 = ["V", "T"]
d = {"a": a, "b": b, "c": c, "s1": s1}
df = pd.DataFrame(data=d)
d2 = {"d": d, "e": e, "f": f, "s2": s2}
# SO question invalid !!!
# df2 = pd.DataFrame(data=d2)
# try this
df2 = pd.DataFrame(d).join(pd.DataFrame({k:v for k,v in d2.items() if k!="d"}))
fig = px.scatter(
df,
x="a",
y="b",
hover_data=["c", "s1"],
color="s1",
color_discrete_sequence=["green", "navy"],
)
fig.add_scatter(
x=df2["a"],
y=df2["e"],
customdata=df2.loc[:,["f", "s2"]].values.reshape([len(df2),2]),
mode="markers",
marker=dict(size=10, color="Purple"),
name="A",
)
fig.update_traces(
hovertemplate="<br>".join(
[
"<b>G:</b> %{x:.3f}",
"<b>R:</b> %{y:.6f}<extra></extra>",
"<b>D:</b> %{customdata[1]}",
"<b>E:</b> %{customdata[0]}",
]
)
)
fig.update_xaxes(title_font_family="Trebuchet")
fig.update_traces(marker=dict(size=9), selector=dict(mode="markers"))
fig.show()
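The key difference from the question is the shape of customdata. A small sketch with pandas only (df2 here is a hypothetical, cleaned-up version of the second data set, with column names assumed from the question) shows what %{customdata[0]} and %{customdata[1]} end up indexing into:

```python
import pandas as pd

# Hypothetical clean version of the second data set from the question.
df2 = pd.DataFrame({"d": [5, 5], "e": [4, 4], "f": [5, 5], "s2": ["V", "T"]})

# What the question passed: a flat list of column *names*, not the column values.
names_only = ["f", "s2"]

# What the hover template needs: one row per point, one entry per custom field.
customdata = df2[["f", "s2"]].to_numpy()

print(customdata.shape)  # (2, 2)
print(customdata[0][1])  # V
```

With this layout, customdata[0] in the template refers to the "f" value of the hovered point and customdata[1] to its "s2" value.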

Compare multiple dataframes' columns names with original one's in Pandas

Let's say I have a dataframe df with headers a, b, c.
I want to compare the column names of other dataframes (df1, df2, df3, ...) with it. The column names of every dataframe should be exactly identical to those of df (please note that a different order of column names should not count as a difference).
For example:
Original dataframe:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
col = ['a', 'b', 'c']
dfs:
df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'c', 'b'])
Returns identical columns name;
df2 = pd.DataFrame(np.array([[1, 2, 3, 10], [4, 5, 6, 11], [7, 8, 9, 12]]),
columns=['a', 'c', 'e', 'b'])
Returns extra columns in dataframe;
df3 = pd.DataFrame(np.array([[1, 2], [4, 5], [7, 8]]),
columns=['a', 'c'])
Returns missing columns in dataframe;
df4 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', '*c', 'b'])
Returns errors in dataframe's column names;
df5 = pd.DataFrame(np.array([[1, 2, 3, 9], [4, 5, 6, 9], [7, 8, 9, 10]]),
columns=['a', 'b', 'b', 'c'])
returns extra columns in dataframe.
If that is too complicated, it is also OK to return "columns names are incorrect" for all kinds of errors.
How could I do that in Pandas? Thanks.
I think a set is a good choice here, because order is not important:
def compare(df, df1):
    orig = set(df.columns)
    c = set(df1.columns)
    # a set drops duplicates, so a shorter set means df1 has duplicated column names
    if len(c) != len(df1.columns):
        return 'extra columns in dataframe'
    # same sets
    elif c == orig:
        return 'identical columns name'
    # df1 columns are a subset of the original columns
    elif c.issubset(orig):
        return 'missing columns in dataframe'
    # original columns are a subset of the df1 columns
    elif orig.issubset(c):
        return 'extra columns in dataframe'
    else:
        return 'columns names are incorrect'
print(compare(df, df1))
print(compare(df, df2))
print(compare(df, df3))
print(compare(df, df4))
print(compare(df, df5))
identical columns name
extra columns in dataframe
missing columns in dataframe
columns names are incorrect
extra columns in dataframe
To also return the affected column names:
def compare(df, df1):
    orig = set(df.columns)
    c = set(df1.columns)
    # a set drops duplicates, so a shorter set means df1 has duplicated column names
    if len(c) != len(df1.columns):
        col = df1.columns.tolist()
        a = set([str(x) for x in col if col.count(x) > 1])
        return f'duplicated columns: {", ".join(a)}'
    # same sets
    elif c == orig:
        return 'identical columns name'
    # df1 columns are a subset of the original columns
    elif c.issubset(orig):
        a = (str(x) for x in orig - c)
        return f'missing columns: {", ".join(a)}'
    # original columns are a subset of the df1 columns
    elif orig.issubset(c):
        a = (str(x) for x in c - orig)
        return f'extra columns: {", ".join(a)}'
    else:
        a = (str(x) for x in c - orig)
        return f'incorrect: {", ".join(a)}'
print(compare(df, df1))
print(compare(df, df2))
print(compare(df, df3))
print(compare(df, df4))
print(compare(df, df5))
identical columns name
extra columns: e
missing columns: b
incorrect: *c
duplicated columns: b
I wrote a plain Python function that reads the columns from pandas and compares them; please see if this helps:
def check_errors(original_df, df1):
    original_columns = original_df.columns
    columns1 = df1.columns
    if len(original_columns) > len(columns1):
        print("Columns missing!!")
    elif len(original_columns) < len(columns1):
        print("Extra Columns")
    else:
        for i in columns1:
            if i not in original_columns:
                print("Column names are incorrect")
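As a variation on the answers above, here is a hedged sketch that reports missing, extra and duplicated columns all at once instead of stopping at the first category. The function name column_report and the shape of its return value are assumptions made for this example.

```python
import pandas as pd


def column_report(reference, other):
    """Compare column names of other against reference, ignoring column order."""
    ref_cols = list(reference.columns)
    other_cols = list(other.columns)
    return {
        "missing": sorted(set(ref_cols) - set(other_cols)),
        "extra": sorted(set(other_cols) - set(ref_cols)),
        "duplicated": sorted({c for c in other_cols if other_cols.count(c) > 1}),
    }


df = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "c"])
df4 = pd.DataFrame([[1, 2, 3]], columns=["a", "*c", "b"])
df5 = pd.DataFrame([[1, 2, 3, 9]], columns=["a", "b", "b", "c"])

print(column_report(df, df4))
print(column_report(df, df5))
```

A misspelled name such as '*c' then shows up as one missing column plus one extra column, which makes the cause easier to spot.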

Python 3 ~ How to take rows from a csv file and put them into a list

I would like to know how to take this file:
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
and put it in a list like the following:
[['Alice', 'Bob', 'Charlie'], [2, 8, 3], [4, 1, 5], [3, 2, 5]]
I'm fairly new to Python, so please excuse me.
My current code looks like this:
import csv
from sys import argv
import pandas as pd

file = open(argv[1], "r")
file1 = open(argv[2], "r")
text = file1.read()
strl = []
with file:
    reader = csv.reader(file, delimiter=",")  # renamed so the csv module is not shadowed
    for row in reader:
        strl = row[1:9]
        break
df = pd.read_csv(argv[1], header=0)
df = [df[col].tolist() for col in df.columns]
Ignore the strl part; it is for something else unrelated.
But it outputs this:
[['Alice', 'Bob', 'Charlie'], [2, 4, 3], [8, 1, 2], [3, 5, 5]]
I want it to output this instead:
[['Alice', 'Bob', 'Charlie'], [2, 8, 3], [4, 1, 5], [3, 2, 5]]
Using pandas
In [13]: import pandas as pd
In [14]: df = pd.read_csv("a.csv",header=None)
In [15]: df
Out[15]:
         0  1  2  3
0    Alice  2  8  3
1      Bob  4  1  5
2  Charlie  3  2  5
In [16]: [df[col].tolist() for col in df.columns]
Out[16]: [['Alice', 'Bob', 'Charlie'], [2, 4, 3], [8, 1, 2], [3, 5, 5]]
Update:
In [51]: import pandas as pd
In [52]: df = pd.read_csv("a.csv",header=None)
In [53]: data = df[df.columns[1:]].to_numpy().tolist()
In [57]: data.insert(0,df[0].tolist())
In [58]: data
Out[58]: [['Alice', 'Bob', 'Charlie'], [2, 8, 3], [4, 1, 5], [3, 2, 5]]
Update (for a file with a header row):
In [51]: import pandas as pd
In [52]: df = pd.read_csv("a.csv")
In [94]: df
Out[94]:
      name  AGATC  AATG  TATC
0    Alice      2     8     3
1      Bob      4     1     5
2  Charlie      3     2     5
In [97]: data = df.loc[:, df.columns != 'name'].to_numpy().tolist()
In [98]: data.insert(0, df["name"].tolist())
In [99]: data
Out[99]: [['Alice', 'Bob', 'Charlie'], [2, 8, 3], [4, 1, 5], [3, 2, 5]]
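Here is a self-contained sketch of the header-row variant. It reads the sample data from an in-memory string via io.StringIO (an assumption, so that no file is needed) and builds the same nested list: names first, then one list per data row.

```python
import io

import pandas as pd

# The sample file from the question, held in memory instead of on disk.
csv_text = "name,AGATC,AATG,TATC\nAlice,2,8,3\nBob,4,1,5\nCharlie,3,2,5\n"

df = pd.read_csv(io.StringIO(csv_text))

# First element: the list of names; remaining elements: one list per row of counts.
data = [df["name"].tolist()] + df.drop(columns="name").to_numpy().tolist()

print(data)
# [['Alice', 'Bob', 'Charlie'], [2, 8, 3], [4, 1, 5], [3, 2, 5]]
```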

Dataframe to Dictionary [duplicate]

I have a DataFrame with four columns. I want to convert this DataFrame to a Python dictionary, with the elements of the first column as keys and the elements of the other columns in the same row as values.
DataFrame:
  ID  A  B  C
0  p  1  3  2
1  q  4  3  2
2  r  4  0  9
Output should be like this:
Dictionary:
{'p': [1,3,2], 'q': [4,3,2], 'r': [4,0,9]}
The to_dict() method sets the column names as dictionary keys so you'll need to reshape your DataFrame slightly. Setting the 'ID' column as the index and then transposing the DataFrame is one way to achieve this.
to_dict() also accepts an 'orient' argument which you'll need in order to output a list of values for each column. Otherwise, a dictionary of the form {index: value} will be returned for each column.
These steps can be done with the following line:
>>> df.set_index('ID').T.to_dict('list')
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
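A runnable version of that one-liner, with the DataFrame from the question rebuilt by hand (the construction of df is the only part not taken from the question):

```python
import pandas as pd

df = pd.DataFrame(
    {"ID": ["p", "q", "r"], "A": [1, 4, 4], "B": [3, 3, 0], "C": [2, 2, 9]}
)

# ID becomes the index, the transpose turns each ID into a column,
# and orient "list" collects each column into a plain list.
result = df.set_index("ID").T.to_dict("list")

print(result)
# {'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
```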
In case a different dictionary format is needed, here are examples of the possible orient arguments. Consider the following simple DataFrame:
>>> df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
>>> df
a b
0 red 0.500
1 yellow 0.250
2 blue 0.125
Then the options are as follows.
dict - the default: column names are keys, values are dictionaries of index:data pairs
>>> df.to_dict('dict')
{'a': {0: 'red', 1: 'yellow', 2: 'blue'},
'b': {0: 0.5, 1: 0.25, 2: 0.125}}
list - keys are column names, values are lists of column data
>>> df.to_dict('list')
{'a': ['red', 'yellow', 'blue'],
'b': [0.5, 0.25, 0.125]}
series - like 'list', but values are Series
>>> df.to_dict('series')
{'a': 0 red
1 yellow
2 blue
Name: a, dtype: object,
'b': 0 0.500
1 0.250
2 0.125
Name: b, dtype: float64}
split - splits columns/data/index as keys with values being column names, data values by row and index labels respectively
>>> df.to_dict('split')
{'columns': ['a', 'b'],
'data': [['red', 0.5], ['yellow', 0.25], ['blue', 0.125]],
'index': [0, 1, 2]}
records - each row becomes a dictionary where key is column name and value is the data in the cell
>>> df.to_dict('records')
[{'a': 'red', 'b': 0.5},
{'a': 'yellow', 'b': 0.25},
{'a': 'blue', 'b': 0.125}]
index - like 'records', but a dictionary of dictionaries with keys as index labels (rather than a list)
>>> df.to_dict('index')
{0: {'a': 'red', 'b': 0.5},
1: {'a': 'yellow', 'b': 0.25},
2: {'a': 'blue', 'b': 0.125}}
Should a dictionary like:
{'red': 0.5, 'yellow': 0.25, 'blue': 0.125}
be required out of a dataframe like:
a b
0 red 0.500
1 yellow 0.250
2 blue 0.125
the simplest way would be:
dict(df.values)
A working snippet is below:
import pandas as pd
df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
dict(df.values)
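An equivalent sketch that names the two columns explicitly, which may be easier to read when the frame has more than two columns (the variable name mapping is just for this example):

```python
import pandas as pd

df = pd.DataFrame({"a": ["red", "yellow", "blue"], "b": [0.5, 0.25, 0.125]})

# Pair each value of column a with the value of column b in the same row.
mapping = dict(zip(df["a"], df["b"]))

print(mapping)
# {'red': 0.5, 'yellow': 0.25, 'blue': 0.125}
```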
Follow these steps:
Suppose your dataframe is as follows:
>>> df
   A  B  C ID
0  1  3  2  p
1  4  3  2  q
2  4  0  9  r
1. Use set_index to set the ID column as the dataframe index.
df.set_index("ID", drop=True, inplace=True)
2. Use the orient="index" parameter to have the index as dictionary keys.
dictionary = df.to_dict(orient="index")
The results will be as follows:
>>> dictionary
{'p': {'A': 1, 'B': 3, 'C': 2}, 'q': {'A': 4, 'B': 3, 'C': 2}, 'r': {'A': 4, 'B': 0, 'C': 9}}
3. If you need each sample as a list, run the following code, choosing the column order first.
column_order = ["A", "B", "C"]  # Determine your preferred order of columns
d = {}  # Initialize the new dictionary as an empty dictionary
for k in dictionary:
    d[k] = [dictionary[k][column_name] for column_name in column_order]
Try using zip:
df = pd.read_csv("file")
d = dict([(i, [a, b, c]) for i, a, b, c in zip(df.ID, df.A, df.B, df.C)])
print(d)
Output:
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
If you don't mind the dictionary values being tuples, you can use itertuples:
>>> {x[0]: x[1:] for x in df.itertuples(index=False)}
{'p': (1, 3, 2), 'q': (4, 3, 2), 'r': (4, 0, 9)}
For my use (node names with xy positions) I found @user4179775's answer to be the most helpful / intuitive:
import pandas as pd
df = pd.read_csv('glycolysis_nodes_xy.tsv', sep='\t')
df.head()
nodes x y
0 c00033 146 958
1 c00031 601 195
...
xy_dict_list=dict([(i,[a,b]) for i, a,b in zip(df.nodes, df.x,df.y)])
xy_dict_list
{'c00022': [483, 868],
'c00024': [146, 868],
... }
xy_dict_tuples=dict([(i,(a,b)) for i, a,b in zip(df.nodes, df.x,df.y)])
xy_dict_tuples
{'c00022': (483, 868),
'c00024': (146, 868),
... }
Addendum
I later returned to this issue, for other, but related, work. Here is an approach that more closely mirrors the [excellent] accepted answer.
node_df = pd.read_csv('node_prop-glycolysis_tca-from_pg.tsv', sep='\t')
node_df.head()
node kegg_id kegg_cid name wt vis
0 22 22 c00022 pyruvate 1 1
1 24 24 c00024 acetyl-CoA 1 1
...
Convert Pandas dataframe to a [list], {dict}, {dict of {dict}}, ...
Per accepted answer:
node_df.set_index('kegg_cid').T.to_dict('list')
{'c00022': [22, 22, 'pyruvate', 1, 1],
'c00024': [24, 24, 'acetyl-CoA', 1, 1],
... }
node_df.set_index('kegg_cid').T.to_dict('dict')
{'c00022': {'kegg_id': 22, 'name': 'pyruvate', 'node': 22, 'vis': 1, 'wt': 1},
'c00024': {'kegg_id': 24, 'name': 'acetyl-CoA', 'node': 24, 'vis': 1, 'wt': 1},
... }
In my case, I wanted to do the same thing but with selected columns from the Pandas dataframe, so I needed to slice the columns. There are two approaches.
Directly:
(see: Convert pandas to dictionary defining the columns used fo the key values)
node_df.set_index('kegg_cid')[['name', 'wt', 'vis']].T.to_dict('dict')
{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
... }
Indirectly: first, slice the desired columns/data from the Pandas dataframe (again, two approaches),
node_df_sliced = node_df[['kegg_cid', 'name', 'wt', 'vis']]
or
node_df_sliced2 = node_df.loc[:, ['kegg_cid', 'name', 'wt', 'vis']]
which can then be used to create a dictionary of dictionaries
node_df_sliced.set_index('kegg_cid').T.to_dict('dict')
{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
... }
Most of the answers do not deal with the situation where ID can exist multiple times in the dataframe. In case ID can be duplicated in the Dataframe df you want to use a list to store the values (a.k.a a list of lists), grouped by ID:
{k: [g['A'].tolist(), g['B'].tolist(), g['C'].tolist()] for k,g in df.groupby('ID')}
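A small sketch of the duplicated-ID case follows. Note that this is a row-wise variant (one inner list per original row) rather than the column-wise lists of the line above, and the sample data with a repeated 'p' is made up for illustration.

```python
import pandas as pd

# Hypothetical data in which the ID 'p' appears twice.
df = pd.DataFrame(
    {"ID": ["p", "p", "q"], "A": [1, 4, 4], "B": [3, 3, 0], "C": [2, 2, 9]}
)

# One entry per ID; each entry holds one [A, B, C] list per row with that ID.
rows_by_id = {
    key: group[["A", "B", "C"]].to_numpy().tolist()
    for key, group in df.groupby("ID")
}

print(rows_by_id)
# {'p': [[1, 3, 2], [4, 3, 2]], 'q': [[4, 0, 9]]}
```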
Dictionary comprehension & iterrows() method could also be used to get the desired output.
result = {row.ID: [row.A, row.B, row.C] for (index, row) in df.iterrows()}
df = pd.DataFrame([['p',1,3,2], ['q',4,3,2], ['r',4,0,9]], columns=['ID','A','B','C'])
my_dict = {k:list(v) for k,v in zip(df['ID'], df.drop(columns='ID').values)}
print(my_dict)
with output
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
With this method, the columns of the dataframe will be the keys and the values of each column, as lists, will be the values.
data_dict = dict()
for col in dataframe.columns:
    data_dict[col] = dataframe[col].values.tolist()
DataFrame.to_dict() converts a DataFrame to a dictionary.
Example
>>> df = pd.DataFrame(
...     {'col1': [1, 2], 'col2': [0.5, 0.75]}, index=['a', 'b'])
>>> df
   col1  col2
a     1  0.50
b     2  0.75
>>> df.to_dict()
{'col1': {'a': 1, 'b': 2}, 'col2': {'a': 0.5, 'b': 0.75}}
See the DataFrame.to_dict documentation for details.

How to create multiple value dictionary from pandas data frame

Let's say I have a pandas data frame with two columns (column A and column B).
Each value in column 'A' can have multiple values in column 'B'.
I want to create a dictionary that maps each key to multiple values, and those values should be unique as well. Please suggest a way to do this.
One way is to group by column A:
In [1]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
In [2]: df
Out[2]:
A B
0 1 2
1 1 4
2 5 6
In [3]: g = df.groupby('A')
Convert each group's column B to a list:
In [4]: g['B'].apply(list)  # the original shorthand g['B'].tolist() relied on automatic delegation, which may not be available in recent pandas versions
Out[4]:
A
1    [2, 4]
5       [6]
Name: B, dtype: object
And then call to_dict on this Series:
In [5]: g['B'].apply(list).to_dict()
Out[5]: {1: [2, 4], 5: [6]}
If you want these to be unique, use unique (Note: this will create a numpy array rather than a list):
In [11]: df = pd.DataFrame([[1, 2], [1, 2], [5, 6]], columns=['A', 'B'])
In [12]: g = df.groupby('A')
In [13]: g['B'].unique()
Out[13]:
A
1 [2]
5 [6]
dtype: object
In [14]: g['B'].unique().to_dict()
Out[14]: {1: array([2]), 5: array([6])}
Other alternatives are to use .apply(lambda s: set(s)), .apply(lambda s: list(set(s))), .apply(lambda s: list(s.unique()))...
You can also loop over the df.groupby object and collect the values as lists.
In[1]:
df = pd.DataFrame([[1, 2], [1, 2], [5, 6]], columns=['A', 'B'])
{k: list(v) for k,v in df.groupby("A")["B"]}
Out[1]:
{1: [2, 2], 5: [6]}
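Combining the two answers, a minimal sketch that keeps only unique values per key and still returns plain lists (the sample data with a repeated pair is made up for illustration):

```python
import pandas as pd

# Hypothetical data: the pair (1, 2) appears twice.
df = pd.DataFrame([[1, 2], [1, 4], [1, 2], [5, 6]], columns=["A", "B"])

# unique() keeps the first occurrence of each value; tolist() gives a plain list.
unique_by_key = {key: values.unique().tolist() for key, values in df.groupby("A")["B"]}

print(unique_by_key)
# {1: [2, 4], 5: [6]}
```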

Resources