Extract value from list of dictionaries in dataframe using pandas - python-3.x

I have this dataframe with 4 columns. I want to extract resourceName (i.e. the IDs) into a separate column. I tried various methods and loops but was unable to separate it.
Dataset:

Username                   Event name      Resources
XYZ-DEV_ENV_POST_function  StopInstances   [{"resourceType":"AWS::EC2::Instance","resourceName":"i-05fbb7a"}]
XYZ-DEV_ENV_POST_function  StartInstances  [{"resourceType":"AWS::EC2::Instance","resourceName":"i-08bd2475"},{"resourceType":"AWS::EC2::Instance","resourceName":"i-0fd69dc1"},{"resourceType":"AWS::EC2::Instance","resourceName":"i-0174dd38aea"}]
I want one more column, IDS, which will hold the IDs from the Resources column and will look like this:

Username                   Event name      Resources                                                            IDS
XYZ-DEV_ENV_POST_function  StopInstances   [{"resourceType":"AWS::EC2::Instance","resourceName":"i-05fbb7a"}]  i-05fbb7a
XYZ-DEV_ENV_POST_function  StartInstances  [{"resourceType":"AWS::EC2::Instance","resourceName":"i-08bd2475"},{"resourceType":"AWS::EC2::Instance","resourceName":"i-0fd69dc1"},{"resourceType":"AWS::EC2::Instance","resourceName":"i-0174dd38aea"}]  i-08bd2475, i-0fd69dc1, i-0174dd38aea
Here is the output of data.head(2).to_dict():

{'Date': {0: '28-02-2022', 1: '28-02-2022'},
 'Event name': {0: 'StopInstances', 1: 'StartInstances'},
 'Resources': {0: '[{"resourceType":"AWS::EC2::Instance","resourceName":"i-05fbb7a"}]',
               1: '[{"resourceType":"AWS::EC2::Instance","resourceName":"i-08bd2475"},{"resourceType":"AWS::EC2::Instance","resourceName":"i-0fd69dc1"},{"resourceType":"AWS::EC2::Instance","resourceName":"i-0174dd38aea"}]'},
 'User name': {0: 'XYZ-DEV_ENV_POST_function', 1: 'XYZ-DEV_ENV_POST_function'}}
Thanks and Regards

df['ID'] = df['Resources'].apply(lambda x: ','.join([i['resourceName'] for i in eval(x)]))
Date ... ID
0 28-02-2022 ... i-05fbb7a
1 28-02-2022 ... i-08bd2475,i-0fd69dc1,i-0174dd38aea
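The Resources strings are valid JSON, so a safer variant (a minimal sketch, not the original answer) is to parse them with json.loads instead of eval, which avoids executing arbitrary code from the data:

import json

# parse each JSON string and join the resourceName values with commas
df['ID'] = df['Resources'].apply(
    lambda x: ','.join(d['resourceName'] for d in json.loads(x))
)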

Related

How to divide a column into groups and loop across the groups

I have a column with IDs and I divide the column into groups using pd.cut like this:
ID       Group
3645390  1
3678122  1
3615370  2
3371122  2
3645590  2
3778682  3
3125140  3
3578772  3
After that, how do I loop across the 'Group' column, pick out the IDs, and assign them to arrays?
array_1 = [3645390, 3678122]
array_2 = [3615370, 3371122, 3645590]
and so on...
It is often bad practice to generate variables dynamically; the better approach is to use a container (dictionaries are ideal).
You could use groupby and transform the output into dictionary of lists:
out = df.groupby('Group')['ID'].apply(list).to_dict()
Then access your lists by group key:
>>> out
{1: [3645390, 3678122],
2: [3615370, 3371122, 3645590],
3: [3778682, 3125140, 3578772]}
>>> out[1] ## group #1
[3645390, 3678122]
If you really want array_x as keys:
(df.assign(Group='array_' + df['Group'].astype(str))
   .groupby('Group')['ID'].apply(list).to_dict()
)
output:
{'array_1': [3645390, 3678122],
'array_2': [3615370, 3371122, 3645590],
'array_3': [3778682, 3125140, 3578772]}
Rather than create a new list for every array, you can much more easily store them in a dictionary and access them using array[1], array[2] and so on. This is what an implementation would look like:
array = {}
for group in df.Group.drop_duplicates().tolist():
    array[group] = df.loc[df.Group == group, 'ID'].tolist()
Or:
>>> {k: list(v) for k, v in df.groupby('Group')['ID']}
{1: [3645390, 3678122], 2: [3615370, 3371122, 3645590], 3: [3778682, 3125140, 3578772]}
>>>
But if you are fine with Series objects, just use:
>>> dict(tuple(df.groupby('Group')['ID']))
{1: 0 3645390
1 3678122
Name: ID, dtype: int64, 2: 2 3615370
3 3371122
4 3645590
Name: ID, dtype: int64, 3: 5 3778682
6 3125140
7 3578772
Name: ID, dtype: int64}
>>>
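If you want actual NumPy arrays rather than lists (closer to the array_x naming in the question), a small variant of the same groupby idea is:
>>> {k: v.to_numpy() for k, v in df.groupby('Group')['ID']}
{1: array([3645390, 3678122]), 2: array([3615370, 3371122, 3645590]), 3: array([3778682, 3125140, 3578772])}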

How to make a function with dynamic variables to add rows in pandas?

I'm trying to make a table from a list of data using pandas.
Originally I wanted to make a function where I can pass dynamic variables so I could continuously add new rows from data list.
It works up until the point where the row-adding part begins. The column headers are added, but the data is not: it either keeps the value of only the last column or adds nothing.
My scratch attempt was:
for title in titles:
    for x in data:
        table = {
            title: data[x]
        }
df.DataFrame(table, columns=titles, index=[0])
columns list:
titles = ['timestamp', 'source', 'tracepoint']
data list:
data = ['first', 'second', 'third',
        'first', 'second', 'third',
        'first', 'second', 'third']
How can I make something like this?
timestamp, source, tracepoint
first, second, third
first, second, third
first, second, third
If you just want to initialize a pandas DataFrame, you can use the DataFrame constructor.
You can also append a row using a dict.
Pandas provides other useful functions, such as concatenating DataFrames and inserting/deleting columns; check the pandas documentation if you need them.
import pandas as pd

# initialization by DataFrame constructor
titles = ['timestamp', 'source', 'tracepoint']
data = [['first', 'second', 'third'],
        ['first', 'second', 'third'],
        ['first', 'second', 'third']]
df = pd.DataFrame(data, columns=titles)
print('---initialization---')
print(df)

# append row
new_row = {
    'timestamp': '2020/11/01',
    'source': 'xxx',
    'tracepoint': 'yyy'
}
df = df.append(new_row, ignore_index=True)
print('---append result---')
print(df)
output
---initialization---
timestamp source tracepoint
0 first second third
1 first second third
2 first second third
---append result---
timestamp source tracepoint
0 first second third
1 first second third
2 first second third
3 2020/11/01 xxx yyy
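Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current pandas the same row addition can be written with pd.concat (a minimal sketch of the equivalent call):

import pandas as pd

new_row = {'timestamp': '2020/11/01', 'source': 'xxx', 'tracepoint': 'yyy'}
# wrap the dict in a one-row DataFrame and concatenate it onto the existing df
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)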

How to drop duplicates in Vaex?

I have some entries from users and the number of interactions each user had on my website.
I have 340k rows and 70+ columns, and I want to use Vaex, but I'm having problems doing simple things like dropping duplicates.
Could someone help me with how to do it?
import pandas as pd
df = pd.DataFrame({'user': ['Bob', 'Bob', 'Alice', 'Alice', 'Alice', 'Ralph', 'Ralph'],
                   'date': ['2013-12-05', '2014-02-05', '2013-11-07', '2014-04-22', '2014-04-30', '2014-04-20', '2014-05-29'],
                   'interaction_num': ['1', '2', '1', '2', '3', '1', '2']})
I want the same result as the pandas drop_duplicates(keep="last") call:
df.drop_duplicates('user', keep='last', inplace=True)
the expected result using Vaex should be:
    user        date interaction_num
1    Bob  2014-02-05               2
4  Alice  2014-04-30               3
6  Ralph  2014-05-29               2
It seems Vaex does not have this functionality built in yet, but we should expect it at some point.
In the meantime, there is an attempt from the creator of vaex.
The code adapted from https://github.com/vaexio/vaex/pull/1623/files works for me:
import vaex

def drop_duplicates(df, columns=None):
    """Return a :class:`DataFrame` object with no duplicates in the given columns.

    .. warning:: The resulting dataframe will be in memory, use with caution.

    :param columns: Column or list of columns to remove duplicates by, defaults to all columns.
    :return: :class:`DataFrame` object with duplicates filtered away.
    """
    if columns is None:
        columns = df.get_column_names()
    if type(columns) is str:
        columns = [columns]
    return df.groupby(columns, agg={'__hidden_count': vaex.agg.count()}).drop('__hidden_count')
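A minimal usage sketch, assuming the pandas example frame above is first converted with vaex.from_pandas; note that this groupby-based helper only returns the columns you deduplicate on, so it is not an exact replacement for pandas' keep='last':

import vaex

vdf = vaex.from_pandas(df)                      # df is the pandas example frame above
deduped = drop_duplicates(vdf, columns='user')  # one row per unique user, 'user' column only
print(deduped)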

From list of dict to pandas DataFrame

I'm retrieving real-time financial data.
Every 1 second, I pull the following list:
[{'symbol': 'ETHBTC', 'price': '0.03381600'}, {'symbol': 'LTCBTC', 'price': '0.01848300'}, ...]
The goal is to put this list into an already-existing pandas DataFrame.
What I've done so far is convert this list of dictionaries to a pandas DataFrame. My problem is that the symbols and prices end up in two columns. I would like to have the symbols as the DataFrame header and add a new row every second containing the price values.
marketInformation = [{'symbol': 'ETHBTC', 'price': '0.03381600'}, {'symbol': 'LTCBTC', 'price': '0.01848300'}, ...]
data = pd.DataFrame(marketInformation)
header = data['symbol'].values
newData = pd.DataFrame(columns=header)
while True:
    realTimeData = ...  # get a new marketInformation list of dicts
    newData.append(pd.DataFrame(realTimeData)['price'])
    print(newData)
Unfortunately, the printed DataFrame is always empty. I would like to have a new row added every second with new prices for each symbol with the current time.
I printed the below part:
pd.DataFrame(realTimeData)['price']
and it gives me a pandas.core.series.Series object with a length equal to the number of symbols.
What's wrong?
After you create newData, just do:
newData.loc[len(newData), :] = [item['price'] for item in realTimeData]
You just need to set_index() and then transpose the df:

newData = pd.DataFrame(marketInformation).set_index('symbol').T
#In [245]: newData
#Out[245]:
#symbol      ETHBTC      LTCBTC
#price   0.03381600  0.01848300

# then change the index to whatever makes sense for your data
newdata_time = pd.Timestamp.now()
newData = newData.rename(index={'price': newdata_time})
#Out[246]:
#symbol                       ETHBTC      LTCBTC
#2019-04-03 17:08:51.389359  0.03381600  0.01848300
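Putting the two answers together, a hedged sketch of the polling loop that appends one timestamped row per fetch (get_market_information is a hypothetical stand-in for the real-time call):

import time
import pandas as pd

newData = pd.DataFrame()
while True:
    realTimeData = get_market_information()   # hypothetical fetch returning the list of dicts
    snapshot = (pd.DataFrame(realTimeData)
                  .set_index('symbol')['price']
                  .astype(float))
    snapshot.name = pd.Timestamp.now()        # row label = current time
    newData = pd.concat([newData, snapshot.to_frame().T])
    print(newData)
    time.sleep(1)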

Issue faced in converting dataframe to dict

I have a dataframe with 2 columns (column names Orig_Nm and Mapping):
Orig_Nm  Mapping
Name     FinalName
Id_No    Identification
Group    Zone
Now I wish to convert it to a dictionary, so I use
name_dict = df.set_index('Orig_Nm').to_dict()
print (name_dict)
The output which I get is:
{'Mapping': {'Group': 'Zone', 'ID_No': 'Identification', 'Name': 'Final_Name'}}
So it is a dictionary within dictionary {{}}.
What am I doing wrong that I am not getting a single dictionary, i.e. {}?
You need Series.to_dict instead of DataFrame.to_dict, by selecting the Mapping column to get a Series:
name_dict = df.set_index('Orig_Nm')['Mapping'].to_dict()
Selecting by key also works:
name_dict = df.set_index('Orig_Nm').to_dict()['Mapping']
EDIT:
In your solution, after set_index you still have a one-column DataFrame, so to_dict creates a nested dictionary whose first key is the column name:
d = {'Orig_Nm': ['Group', 'ID_No', 'Name'],
     'Mapping': ['Zone', 'Identification', 'Final_Name']}
df = pd.DataFrame(d)
print (df)
Orig_Nm Mapping
0 Group Zone
1 ID_No Identification
2 Name Final_Name
print (df.set_index('Orig_Nm'))
Mapping
Orig_Nm
Group Zone
ID_No Identification
Name Final_Name
print (type(df.set_index('Orig_Nm')))
<class 'pandas.core.frame.DataFrame'>
So to avoid it, it is necessary to select this column as a Series:
print (df.set_index('Orig_Nm')['Mapping'])
Orig_Nm
Group Zone
ID_No Identification
Name Final_Name
Name: Mapping, dtype: object
print (type(df.set_index('Orig_Nm')['Mapping']))
<class 'pandas.core.series.Series'>
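For completeness, a hedged one-liner that skips set_index entirely by zipping the two columns (same result, assuming the Orig_Nm values are unique):

name_dict = dict(zip(df['Orig_Nm'], df['Mapping']))
print(name_dict)
# {'Group': 'Zone', 'ID_No': 'Identification', 'Name': 'Final_Name'}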
