split a list of dictionaries into multiple columns - python-3.x

I have a dataframe with 30000 rows and 5 columns. One of these columns is a list of dictionaries plus a few NaNs. I want to split this column into 3 fields (Legroom to In-flight Entertainment) and extract the ratings.
Below is a sample for reference:
d = {'col1': [[{'rating': 5, 'ratingLabel': 'Legroom'},
               {'rating': 5, 'ratingLabel': 'Seat comfort'},
               {'rating': 5, 'ratingLabel': 'In-flight Entertainment'}],
              'Nan']}
df = pd.DataFrame(data=d)
df

IIUC, this should do the trick:
df = df["col1"].apply(
    lambda x: pd.Series({el["ratingLabel"]: el["rating"] for el in x} if isinstance(x, list) else {})
)
Output:
   Legroom  Seat comfort  In-flight Entertainment
0      5.0           5.0                      5.0
1      NaN           NaN                      NaN
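If the other columns of the 30000-row frame need to be kept, the extracted ratings can be joined back onto the original (a minimal sketch, assuming the frame is still called df and the raw column is col1):
ratings = df["col1"].apply(
    lambda x: pd.Series({el["ratingLabel"]: el["rating"] for el in x} if isinstance(x, list) else {})
)
# drop the raw column and attach one column per rating label
df = pd.concat([df.drop(columns="col1"), ratings], axis=1)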

Here is a possible solution using DataFrame.apply() and pd.Series, following a strategy from Splitting dictionary/list inside a Pandas Column into Separate Columns:
import pandas as pd

d = {'col1': [[{'rating': 5, 'ratingLabel': 'Legroom'},
               {'rating': 5, 'ratingLabel': 'Seat comfort'},
               {'rating': 5, 'ratingLabel': 'In-flight Entertainment'}],
              [{'rating': 5, 'ratingLabel': 'Legroom'},
               {'rating': 5, 'ratingLabel': 'Seat comfort'},
               {'rating': 5, 'ratingLabel': 'In-flight Entertainment'}],
              'Nan']}
df = pd.DataFrame(data=d)
df

df_split = df['col1'].apply(pd.Series)
pd.concat([df,
           df_split[0].apply(pd.Series).rename(columns={'rating': 'legroom_rating',
                                                         'ratingLabel': '1'}),
           df_split[1].apply(pd.Series).rename(columns={'rating': 'seat_comfort_rating',
                                                         'ratingLabel': '2'}),
           df_split[2].apply(pd.Series).rename(columns={'rating': 'in_flight_entertainment_rating',
                                                         'ratingLabel': '3'})],
          axis=1).drop(['col1', '1', '2', '3', 0],
                       axis=1)
This produces a DataFrame with one rating column per label (legroom_rating, seat_comfort_rating, in_flight_entertainment_rating), with NaN rows where col1 was not a list.
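A more compact alternative (a sketch, not part of the answer above) explodes the list column and pivots the rating labels into columns; it assumes every valid cell is a list of dicts with 'rating' and 'ratingLabel' keys:
exploded = df['col1'].explode()
exploded = exploded[exploded.apply(lambda v: isinstance(v, dict))]  # keep only dict cells
flat = pd.DataFrame(exploded.tolist(), index=exploded.index)        # one row per dict
wide = flat.pivot(columns='ratingLabel', values='rating')           # labels become columns
out = df.join(wide)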

Related

Fill df with empty rows based on index of other df

I am trying to use df.update(), but my dfs have different sizes. Now I want to fill up the smaller df with dummy rows to match the shape of the bigger df. Here's a minimal example:
import pandas as pd
import numpy as np

data = {
    "Feat_A": ["INVALID", "INVALID", "INVALID"],
    "Feat_B": ["INVALID", "INVALID", "INVALID"],
    "Key": [12, 25, 99],
}
df = pd.DataFrame(data=data)

data = {"Feat_A": [1, np.nan], "Feat_B": [np.nan, 2], "Key": [12, 99]}
result = pd.DataFrame(data=data)

# df.update(result) not working because of different sizes/shapes
# result should be
#      Feat_A  Feat_B   Key
# 0       1.0     NaN    12
# NaN     NaN     NaN   NaN
# 2       NaN     2.0    99
# df.update(result) should work now
This did it:
df.update(result.set_index('Key').reindex(df.set_index('Key').index).reset_index())
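Broken into steps, the same chain reads (a sketch using the names from the question):
aligned = (result.set_index('Key')                      # use Key as the index
                 .reindex(df.set_index('Key').index)    # add all-NaN rows for keys present only in df
                 .reset_index())                        # back to a 0..n-1 index that lines up with df
df.update(aligned)                                      # update() aligns on that index and skips NaNs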
Does this meet your needs? I modified your example to include unique DataFrame values to confirm proper alignment:
# Modified example
data = {
    "Feat_A": ["INVALID_A12", "INVALID_A25", "INVALID_A99"],
    "Feat_B": ["INVALID_B12", "INVALID_B25", "INVALID_B99"],
    "Key": [12, 25, 99],
}
df = pd.DataFrame(data=data)

data = {"Feat_A": [1, np.nan], "Feat_B": [np.nan, 2], "Key": [12, 99]}
result = pd.DataFrame(data=data)

# Use the Key column as the index of both DataFrames
df = df.set_index('Key')
result = result.set_index('Key')

# Add all-NaN rows with keys that exist in df but not in result
result = result.reindex_like(df)

# Update
result.update(df)
print(result)
          Feat_A       Feat_B
Key
12   INVALID_A12  INVALID_B12
25   INVALID_A25  INVALID_B25
99   INVALID_A99  INVALID_B99
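If, as in the original question, df is the frame that should receive the values, the same Key alignment works in the other direction; a minimal sketch with fresh copies of the two frames (pandas and numpy imported as above):
df = pd.DataFrame({"Feat_A": ["INVALID_A12", "INVALID_A25", "INVALID_A99"],
                   "Feat_B": ["INVALID_B12", "INVALID_B25", "INVALID_B99"],
                   "Key": [12, 25, 99]}).set_index("Key")
result = pd.DataFrame({"Feat_A": [1, np.nan], "Feat_B": [np.nan, 2],
                       "Key": [12, 99]}).set_index("Key")
df.update(result)   # aligns on the Key index and skips NaNs, so only keys 12 and 99 change
print(df)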

Filter pandas dataframe in python3 depending on the value of a list

So I have a dataframe like this:
df = {'c': ['A','B','C','D'],
      'x': [[1,2,3],[2],[1,3],[1,2,5]]}
And I want to create another dataframe that contains only the rows that have a certain value contained in the lists of x. For example, if I only want the rows that contain a 3, I'd get something like:
df2 = {'c': ['A','C'],
       'x': [[1,2,3],[1,3]]}
I am trying to do something like this:
df2 = df[(3 in df.x.tolist())]
But I am getting a
KeyError: False
exception. Any suggestion/idea? Many thanks!!!
df = df[df.x.apply(lambda x: 3 in x)]
print(df)
Prints:
   c          x
0  A  [1, 2, 3]
2  C     [1, 3]
The code below should help you.
To create the correct DataFrame:
df = pd.DataFrame({'c': ['A','B','C','D'],
                   'x': [[1,2,3],[2],[1,3],[1,2,5]]})
To filter the rows whose lists contain 3:
df[df.x.apply(lambda x: 3 in x)]
Output:
   c          x
0  A  [1, 2, 3]
2  C     [1, 3]
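If you need the rows whose lists contain any of several values, the same apply pattern works with a set intersection (a sketch, not part of either answer; the wanted set is just an example):
wanted = {3, 5}
df[df.x.apply(lambda lst: bool(wanted & set(lst)))]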

python pandas get column which has first element containing a string

So I've got this DataFrame:
df = pd.DataFrame({'A': ['ex1|context1', 1, 'ex3|context3', 3], 'B': [5, 'ex2|context2', 6, 'data']})
I want to get the column that has '|' in its first element, which in my example would be A, because ex1|context1 is its first element and contains '|'.
If at least one | value always exists in the data:
s = df.stack().reset_index(level=0, drop=True)
out = s.str.contains('|', na=False, regex=False).idxmax()
print (out)
A
A general solution that also works if no data match:
df = pd.DataFrame({'A': ['ex1context1', 1, 'ex3ontext3', 3],
                   'B': [5, 'ex2ontext2', 6, 'data']})
print (df)
             A           B
0  ex1context1           5
1            1  ex2ontext2
2   ex3ontext3           6
3            3        data
s = df.stack().reset_index(level=0, drop=True)
out = next(iter(s.index[s.str.contains('|', na=False, regex=False)]), 'no match')
print (out)
no match
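Note that '|' is a regex metacharacter (alternation), so with the default regex=True the pattern matches every non-NaN string; regex=False treats it as a literal character. A quick illustration (not part of the answer above):
s = pd.Series(['plain text', 'with|pipe'])
print(s.str.contains('|'))                # True, True: '|' is parsed as an empty alternation
print(s.str.contains('|', regex=False))   # False, True: literal match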

pandas: aggregate a column of list into one list

I have the following data frame my_df:
name numbers
----------------------
A [4,6]
B [3,7,1,3]
C [2,5]
D [1,2,3]
I want to combine all numbers into a new list, so the output should be:
new_numbers
---------------
[4,6,3,7,1,3,2,5,1,2,3]
And here is my code:
def combine_list(my_lists):
    new_list = []
    for x in my_lists:
        new_list.append(x)
    return new_list

new_df = my_df.agg({'numbers': combine_list})
but new_df still looks the same as the original:
numbers
----------------------
0 [4,6]
1 [3,7,1,3]
2 [2,5]
3 [1,2,3]
What did I do wrong? How do I make new_df like:
new_numbers
---------------
[4,6,3,7,1,3,2,5,1,2,3]
Thanks!
You need to flatten the values and then create a new DataFrame with the constructor:
flatten = [item for sublist in df['numbers'] for item in sublist]
Or:
flatten = np.concatenate(df['numbers'].values).tolist()
Or:
from itertools import chain
flatten = list(chain.from_iterable(df['numbers'].values.tolist()))
df1 = pd.DataFrame({'numbers':[flatten]})
print (df1)
numbers
0 [4, 6, 3, 7, 1, 3, 2, 5, 1, 2, 3]
You can use df['numbers'].sum(), which returns a combined list, to create the new dataframe:
new_df = pd.DataFrame({'new_numbers': [df['numbers'].sum()]})
new_numbers
0 [4, 6, 3, 7, 1, 3, 2, 5, 1, 2, 3]
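Note that .sum() on a column of lists concatenates them pairwise, which can get slow for many rows; the comprehension and itertools.chain approaches above scale better. An explode-based sketch (not from either answer) yields the same flat list:
flatten = df['numbers'].explode().tolist()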
This should do:
newdf = pd.DataFrame({'numbers':[[x for i in mydf['numbers'] for x in i]]})
Check this: pandas groupby and join lists
What you are looking for is:
my_df = my_df.groupby(['name']).agg(sum)

Python Pandas - Update row with dictionary based on index, column

I have a dataframe with empty columns and a corresponding dictionary with which I would like to update the empty columns, based on index and column:
import pandas as pd
import numpy as np

dataframe = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [4, 6, 2], [3, 4, 1]])
dataframe.columns = ['x', 'y', 'z']
additional_cols = ['a', 'b', 'c']
for col in additional_cols:
    dataframe[col] = np.nan

   x  y  z  a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
3  4  6  2
4  3  4  1
for row, column in x.iterrows():
    # calculations to return dictionary y
    y = {"a": 5, "b": 6, "c": 7}
    df.loc[row, :].map(y)
Basically after performing the calculations using columns x, y, z I would like to update columns a, b, c for that same row :)
I could use a function like the one below, but I am not sure whether the pandas library offers a DataFrame method for this...
def update_row_with_dict(dictionary, dataframe, index):
    for key in dictionary.keys():
        dataframe.loc[index, key] = dictionary.get(key)
The function above with correct indentation:
def update_row_with_dict(df, d, idx):
    for key in d.keys():
        df.loc[idx, key] = d.get(key)
A shorter version would be:
def update_row_with_dict(df, d, idx):
    df.loc[idx, list(d.keys())] = list(d.values())
For your code snippet the syntax would be:
import pandas as pd
import numpy as np

dataframe = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [4, 6, 2], [3, 4, 1]])
dataframe.columns = ['x', 'y', 'z']
additional_cols = ['a', 'b', 'c']
for col in additional_cols:
    dataframe[col] = np.nan

for idx in dataframe.index:
    y = {'a': 1, 'b': 2, 'c': 3}
    update_row_with_dict(dataframe, y, idx)
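If the per-row dictionaries can be computed up front, a loop-free alternative (a sketch, not part of the answer above) collects them into one DataFrame and lets update() align on index and columns:
# hypothetical stand-in for the real per-row calculations
rows = {idx: {'a': 1, 'b': 2, 'c': 3} for idx in dataframe.index}
dataframe.update(pd.DataFrame.from_dict(rows, orient='index'))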
