Convert Pandas dataframe with multiindex to a list of dictionaries - python-3.x

I have a pandas dataframe which looks like:
df =
Index1 Index2 Index3  column1  column2
i11    i12    i13           2        5
i11    i12    i23           3        8
i21    i22    i23           4        5
How can I convert this into a list of dictionaries, with keys Index3, column1, column2 and values taken from the respective cells?
So, expected output:
[[{Index3: i13, column1: 2, column2: 5}, {Index3: i23, column1: 3, column2: 8}], [{Index3: i23, column1: 4, column2: 5}]]
Please note that rows sharing the same values of Index1 and Index2 form one inner list, and those values are not repeated.

d = {'Index1': ["i11", "i12", "i13"],
     'Index2': ["i21", "i22", "i23"],
     'Index3': ["i31", "i32", "i33"],
     'column1': [2, 3, 4],
     'column2': [5, 8, 5]}
df = pd.DataFrame(data=d)
This should do it:
a = []
for i in range(df.shape[0]):
    a.append({"Index3": df.iloc[i, 2], "column1": df.iloc[i, 3], "column2": df.iloc[i, 4]})
Res:
[{'Index3': 'i31', 'column1': 2, 'column2': 5},
 {'Index3': 'i32', 'column1': 3, 'column2': 8},
 {'Index3': 'i33', 'column1': 4, 'column2': 5}]
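Note that this loop yields a flat list; for the nested grouping the question asks for (one inner list per (Index1, Index2) pair), here is a minimal sketch, assuming the three index levels form a MultiIndex on the original df:
# Group on the first two index levels; 'records' orientation turns each
# group's rows into a list of {Index3, column1, column2} dicts.
records = (df.reset_index()
             .groupby(['Index1', 'Index2'])
             .apply(lambda g: g[['Index3', 'column1', 'column2']].to_dict('records'))
             .tolist())
# [[{'Index3': 'i13', 'column1': 2, 'column2': 5},
#   {'Index3': 'i23', 'column1': 3, 'column2': 8}],
#  [{'Index3': 'i23', 'column1': 4, 'column2': 5}]]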

Python - How to extract data from pandas column that contains dictionary [duplicate]

I've got a CSV that I'm reading into a pandas dataframe. However, one of the columns is in the form of a dictionary. Here is an example:
ColA, ColB, ColC, ColD
20, 30, {"ab":"1", "we":"2", "as":"3"}, "String"
How can I turn this into a dataframe that looks like this:
ColA, ColB, AB, WE, AS, ColD
20, 30, "1", "2", "3", "String"
Edit: I fixed up the question; the column looks like a dict but is actually a string that needs to be parsed, not a dict object.
As per https://stackoverflow.com/a/38231651/454773, you can use .apply(pd.Series) to map the dict-containing column onto new columns, and then concatenate these new columns back into the original dataframe minus the original dict-containing column:
dw = pd.DataFrame([[20, 30, {"ab": "1", "we": "2", "as": "3"}, "String"]],
                  columns=['ColA', 'ColB', 'ColC', 'ColD'])
pd.concat([dw.drop(['ColC'], axis=1), dw['ColC'].apply(pd.Series)], axis=1)
Returns:
   ColA  ColB    ColD ab as we
0    20    30  String  1  3  2
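pd.json_normalize is another way to expand the column; this is a sketch of mine rather than part of the answer, and it assumes pandas >= 1.0 and that ColC already holds dict objects (parse them first if they are strings):
# json_normalize flattens a list of dicts into columns; the default
# RangeIndex lines up with dw's index here, so concat aligns row-wise.
flat = pd.json_normalize(dw['ColC'].tolist())
pd.concat([dw.drop(columns='ColC'), flat], axis=1)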
So, starting with your one-row df:
   Col A  Col B                           Col C    Col D
0     20     30  {u'we': 2, u'ab': 1, u'as': 3}  String1
EDIT: based on the comment by the OP, I'm assuming we need to convert the string first
import ast
df["Col C"] = df["Col C"].map(ast.literal_eval)
then we convert Col C to a dict, transpose it and then join it to the original df
dfNew = df.join(pd.DataFrame(df["Col C"].to_dict()).T)
dfNew
which gives you this
   Col A  Col B                           Col C    Col D ab as we
0     20     30  {u'we': 2, u'ab': 1, u'as': 3}  String1  1  3  2
Then we just select the columns we want in dfNew
dfNew[["Col A", "Col B", "ab", "we", "as", "Col D"]]
Col A Col B ab we as Col D
0 20 30 1 2 3 String1
What about something like:
import pandas as pd

# Create mock dataframe
df = pd.DataFrame([
    [20, 30, {'ab': 1, 'we': 2, 'as': 3}, 'String1'],
    [21, 31, {'ab': 4, 'we': 5, 'as': 6}, 'String2'],
    [22, 32, {'ab': 7, 'we': 8, 'as': 9}, 'String2'],
], columns=['Col A', 'Col B', 'Col C', 'Col D'])

# Create dataframe where you'll store the dictionary values
ddf = pd.DataFrame(columns=['AB', 'WE', 'AS'])

# Populate the ddf dataframe
for (i, r) in df.iterrows():
    e = r['Col C']
    ddf.loc[i] = [e['ab'], e['we'], e['as']]

# Replace df with the output of concat(df, ddf)
df = pd.concat([df, ddf], axis=1)

# New column order; this also drops the old Col C column
df = df[['Col A', 'Col B', 'AB', 'WE', 'AS', 'Col D']]
print(df)
Output:
   Col A  Col B AB WE AS    Col D
0     20     30  1  2  3  String1
1     21     31  4  5  6  String2
2     22     32  7  8  9  String2
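The iterrows loop can also be avoided entirely; a minimal sketch, assuming the mock df as first created above (note this keeps the lowercase ab/we/as keys instead of renaming them to AB/WE/AS):
# Building a DataFrame straight from the list of dicts takes one call;
# reusing df's index keeps the rows aligned for the concat.
expanded = pd.DataFrame(df['Col C'].tolist(), index=df.index)
df = pd.concat([df.drop(columns='Col C'), expanded], axis=1)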

How to keep the row with the max value in a groupby table?

I made a summarized table like the one below using the pandas groupby function:
      I      II
A  apple      3
   banana     4
B  dog        1
   cat        2
C  seoul      9
   tokyo      5
I want to keep a row only if it has the max value of column II within its category.
For example, in category A I want to keep only the banana row, because it has the max value in column II.
The result table I want to get is below:
      I      II
A  banana     4
B  cat        2
C  seoul      9
Thanks.
Dataframe used by me:
df = pd.DataFrame({'II': {('A', 'apple'): 3,
                          ('A', 'banana'): 4,
                          ('B', 'dog'): 1,
                          ('B', 'cat'): 2,
                          ('C', 'seoul'): 9,
                          ('C', 'tokyo'): 5}})
Try via sort_values(), reset_index() and drop_duplicates():
out = (df.sort_values('II', ascending=False)
         .reset_index()
         .drop_duplicates('level_0')
         .set_index('level_0')
         .rename_axis(index=None)
         .rename(columns={'level_1': 'I'}))
OR
out = (df.reset_index()
         .sort_values('II', ascending=False)
         .groupby('level_0')
         .first()
         .rename(columns={'level_1': 'I'})
         .rename_axis(index=None))
output of out:
        I  II
C   seoul   9
A  banana   4
B     cat   2
Not sure if this is the most elegant solution, but this should work if you want to use a groupby object.
# Create the dummy DataFrame
d = {
    'Letter': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Word': ['apple', 'banana', 'dog', 'cat', 'seoul', 'tokyo'],
    'II': [3, 4, 1, 2, 9, 5]
}
df = pd.DataFrame(data=d)
df_max = df.groupby('Letter')[['II']].agg('max')
df_max = df_max.merge(df, how='left', on='II')  # merge the "Word" column back into df_max
You could then reorder the columns if you need them to be in a specific order.
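A more direct idiom for "keep the row holding each group's max" is idxmax; a minimal sketch against the same dummy df, assuming you want a single row per group (ties keep the first match):
# idxmax returns, per Letter, the index label of the row with the max II;
# .loc then selects exactly those rows.
df.loc[df.groupby('Letter')['II'].idxmax()]
#   Letter    Word  II
# 1      A  banana   4
# 3      B     cat   2
# 4      C   seoul   9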

Function on column from dictionary

I have a df like this:
df = pd.DataFrame({'A': [3, 1, 2, 3],
                   'B': [5, 6, 7, 8]})
   A  B
0  3  5
1  1  6
2  2  7
3  3  8
And I have a dictionary like this:
{'A': 1, 'B': 2}
Is there a simple way to perform a function (e.g. divide) on the df values based on the values from the dictionary?
For example, all values in column A are divided by 1, and all values in column B are divided by 2.
Division by the dictionary works here, because the keys of the dict match the column names:
d = {'A': 1, 'B': 2}
df1 = df.div(d)
Or:
df1 = df / d
print(df1)
     A    B
0  3.0  2.5
1  1.0  3.0
2  2.0  3.5
3  3.0  4.0
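If dividing by a plain dict raises on your pandas version, wrapping it in a Series behaves the same way, since the Series index aligns with the column names:
df1 = df.div(pd.Series(d))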
If you want to do it using a for loop, you can try this:
df = pd.DataFrame({'A': [3, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9]})
d = {'A': 1, 'B': 2}
final_dict = {}
for col in df.columns:
    # divide each column by its matching dict value, if it has one
    if col in d:
        final_dict[col] = [i / d[col] for i in df[col]]
df = pd.DataFrame(final_dict)

How to select rows and columns that meet criteria from a list

Let's say I've got a pandas dataframe that looks like:
df1 = pd.DataFrame({"Item ID": ["A", "B", "C", "D", "E"],
                    "Value1": [1, 2, 3, 4, 0],
                    "Value2": [4, 5, 1, 8, 7],
                    "Value3": [3, 8, 1, 2, 0],
                    "Value4": [4, 5, 7, 9, 4]})
print(df1)
  Item ID  Value1  Value2  Value3  Value4
0       A       1       4       3       4
1       B       2       5       8       5
2       C       3       1       1       7
3       D       4       8       2       9
4       E       0       7       0       4
Now I've got a second dataframe that looks like:
df2 = pd.DataFrame({"Item ID": ["A", "C", "D"], "Value5": [4, 5, 7]})
print(df2)
  Item ID  Value5
0       A       4
1       C       5
2       D       7
What I want to do is find where the Item IDs match between my two data frames, and then add the Value5 column values to the intersection of those rows AND ONLY columns Value1 and Value2 of df1 (these columns could change every iteration, so they need to be contained in a variable).
My output should show:
4 added to Row A, columns "Value1" and "Value2"
5 added to Row C, columns "Value1" and "Value2"
7 added to Row D, columns "Value1" and "Value2"
  Item ID  Value1  Value2  Value3  Value4
0       A       5       8       3       4
1       B       2       5       8       5
2       C       8       6       1       7
3       D      11      15       2       9
4       E       0       7       0       4
Of course my data is many thousands of rows long. I can do it using a for loop, but it takes way too long. I want to vectorize this in some way. Any ideas?
This is what I ended up doing, based on @sammywemmy's suggestions:
# The column names to update, kept in a list
# (these could change every iteration)
names = ['Value1', 'Value2']
# Merge df1 and df2 based on 'Item ID'
merged = df1.merge(df2, on='Item ID', how='outer')
# Rows with no match in df2 get NaN for Value5, so treat those as 0
merged['Value5'] = merged['Value5'].fillna(0)
for i in range(len(names)):
    # Using assign and **, we can bring in variable column names with assign,
    # then add our Value5 column
    merged = merged.assign(**{names[i]: lambda x: x[names[i]] + x.Value5})
# Only keep all the columns before and including 'Value4'
df1 = merged.loc[:, :'Value4']
Try this:
#set 'Item ID' as the index
df1 = df1.set_index('Item ID')
df2 = df2.set_index('Item ID')
#create list of columns that you are interested in
list_of_cols = ['Value1','Value2']
#create two separate dataframes
#unselected will not contain the columns you want to add
unselected = df1.drop(list_of_cols,axis=1)
#this will contain the columns you wish to add
selected = df1.filter(list_of_cols)
#reindex df2 so it has the same indices as df1
#then convert to a series
#fill the null values with 0
A = df2.reindex(index=selected.index,fill_value=0).loc[:,'Value5']
#add the series A to selected
selected = selected.add(A,axis='index')
#combine selected and unselected into one dataframe
result = pd.concat([unselected,selected],axis=1)
#this part is extra, to get your dataframe back to the way it was
#the assumption here is that the columns are Value1, Value2, and so on,
#so they sort as 1 > 2 > 3
#if your columns are not actually Value1, Value2, etc.,
#then a different sorting has to be used
#alternatively, before the calculations you could create a mapping
#of the columns to numbers; that gives you a sorting mechanism to
#restore your dataframe after the calculations are complete
columns = sorted(result.columns, key=lambda x: x[-1])
#reindex back to the way it was
result = result.reindex(columns, axis='columns')
print(result)
Value1 Value2 Value3 Value4
Item ID
A 5 8 3 4
B 2 5 8 5
C 8 6 1 7
D 11 15 2 9
E 0 7 0 4
Alternative solution, using python's built-in dictionaries:
#create dictionaries
#(a temporary column is created and set as the index)
dict1 = (df1
         .assign(temp=df1['Item ID'])
         .set_index('temp')
         .to_dict('index')
         )
dict2 = (df2
         .assign(temp=df2['Item ID'])
         .set_index('temp')
         .to_dict('index')
         )
list_of_cols = ['Value1', 'Value2']
#check for keys that are in both dict1 and dict2
intersected_keys = dict1.keys() & dict2.keys()
key_value_pair = [(key, col) for key in intersected_keys
                  for col in list_of_cols]
#loop through dict1 and add the values from dict2
#this can be optimized with a dict comprehension;
#leaving it as is for better clarity IMHO
for key, val in key_value_pair:
    dict1[key][val] = dict1[key][val] + dict2[key]['Value5']
#print(dict1)
{'A': {'Item ID': 'A', 'Value1': 5, 'Value2': 8, 'Value3': 3, 'Value4': 4},
'B': {'Item ID': 'B', 'Value1': 2, 'Value2': 5, 'Value3': 8, 'Value4': 5},
'C': {'Item ID': 'C', 'Value1': 8, 'Value2': 6, 'Value3': 1, 'Value4': 7},
'D': {'Item ID': 'D', 'Value1': 11, 'Value2': 15, 'Value3': 2, 'Value4': 9},
'E': {'Item ID': 'E', 'Value1': 0, 'Value2': 7, 'Value3': 0, 'Value4': 4}}
#create dataframe
pd.DataFrame.from_dict(dict1,orient='index').reset_index(drop=True)
  Item ID  Value1  Value2  Value3  Value4
0       A       5       8       3       4
1       B       2       5       8       5
2       C       8       6       1       7
3       D      11      15       2       9
4       E       0       7       0       4
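For completeness, a more compact vectorized sketch (not from the answers above), assuming df1 and df2 are as defined in the question and 'Item ID' is unique in df2:
# Map each row's Item ID to its Value5 (NaN where there is no match),
# treat misses as 0, and add the result row-wise to just the chosen columns.
s = df2.set_index('Item ID')['Value5']
cols = ['Value1', 'Value2']
df1[cols] = df1[cols].add(df1['Item ID'].map(s).fillna(0), axis=0)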

pandas Dataframe, assign value based on selection of other rows

I have a pandas DataFrame in python 3.
In this DataFrame there are rows which have identical values in two columns (this can be whole sections), I'll call this a group.
Each row also has a True/False value in a column.
Now for each row I want to know whether any of the rows in its group has a False value; if so, I want to assign False to every row in that group in another column. I've managed to do this in a for loop, but it's quite slow:
import pandas as pd
import numpy as np
df = pd.DataFrame({'E': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
                   'D': [0, 1, 2, 3, 4, 5, 6],
                   'C': [True, True, False, False, True, True, True],
                   'B': ['aa', 'aa', 'aa', 'bb', 'cc', 'dd', 'dd'],
                   'A': [0, 0, 0, 0, 1, 1, 1]})
Which gives:
df:
   A   B      C  D   E
0  0  aa   True  0 NaN
1  0  aa   True  1 NaN
2  0  aa  False  2 NaN
3  0  bb  False  3 NaN
4  1  cc   True  4 NaN
5  1  dd   True  5 NaN
6  1  dd   True  6 NaN
Now I run the for loop:
for i in df.index:
    df.loc[i, 'E'] = df[(df['A'] == df.iloc[i]['A']) & (df['B'] == df.iloc[i]['B'])]['C'].all()
which then gives the desired result:
df:
   A   B      C  D      E
0  0  aa   True  0  False
1  0  aa   True  1  False
2  0  aa  False  2  False
3  0  bb  False  3  False
4  1  cc   True  4   True
5  1  dd   True  5   True
6  1  dd   True  6   True
When running this over my entire DataFrame of ~1 million rows it takes ages. So, looking at using .apply() to avoid a for loop, I stumbled across the following question: apply a function to a pandas DataFrame whose returned value is based on other rows
however:
def f(x): return False not in x
df.groupby(['A','B']).C.apply(f)
returns:
A  B
0  aa    False
   bb     True
1  cc     True
   dd     True
Does anyone know a better way or how to fix the last case?
You could try doing a SQL-style join using pd.merge.
Perform the same groupby that you're doing, but apply min() to it: the minimum of a group's C values is True only when every row in the group is True. Then convert that to a DataFrame, rename the column to "E", and merge it back into df.
df = pd.DataFrame({'D': [0, 1, 2, 3, 4, 5, 6],
                   'C': [True, True, False, False, True, True, True],
                   'B': ['aa', 'aa', 'aa', 'bb', 'cc', 'dd', 'dd'],
                   'A': [0, 0, 0, 0, 1, 1, 1]})
falses = pd.DataFrame(df.groupby(['A', 'B']).C.min() == True)
falses = falses.rename(columns={'C': 'E'})
df = df.merge(falses, left_on=['A', 'B'], right_index=True)
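Alternatively, a one-line sketch using transform, assuming a reasonably recent pandas: groupby().transform('all') evaluates C.all() per (A, B) group and broadcasts the result back onto the original rows, which gives column E directly:
df['E'] = df.groupby(['A', 'B'])['C'].transform('all')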
