How to divide a column into groups and loop across the groups - python-3.x

I have a column with IDs and I divide the column into groups using pd.cut like this:
ID       Group
3645390  1
3678122  1
3615370  2
3371122  2
3645590  2
3778682  3
3125140  3
3578772  3
After that, how do I loop across the 'Group' column, pick out the IDs for each group, and assign them to an array?
array_1 = [3645390,3678122]
array_2 = [3615370,3371122,3645590]
.
so on..

Generating variables dynamically is usually bad practice; the idiomatic approach is to use a container (a dictionary is ideal here).
You could use groupby and transform the output into dictionary of lists:
out = df.groupby('Group')['ID'].apply(list).to_dict()
Then access your lists by group key:
>>> out
{1: [3645390, 3678122],
2: [3615370, 3371122, 3645590],
3: [3778682, 3125140, 3578772]}
>>> out[1] ## group #1
[3645390, 3678122]
If you really want array_x as keys:
(df.assign(Group='array_' + df['Group'].astype(str))
   .groupby('Group')['ID'].apply(list).to_dict()
)
output:
{'array_1': [3645390, 3678122],
'array_2': [3615370, 3371122, 3645590],
'array_3': [3778682, 3125140, 3578772]}

Rather than create a new variable for every array, you can much more easily store them in a dictionary and access them as array[1], array[2], and so on. This is what an implementation would look like:
array = {}
for group in df['Group'].drop_duplicates().tolist():
    array[group] = df.loc[df['Group'] == group, 'ID'].tolist()

Or:
>>> {k: list(v) for k, v in df.groupby('Group')['ID']}
{1: [3645390, 3678122], 2: [3615370, 3371122, 3645590], 3: [3778682, 3125140, 3578772]}
But if you are fine with Series, just use:
>>> dict(tuple(df.groupby('Group')['ID']))
{1: 0 3645390
1 3678122
Name: ID, dtype: int64, 2: 2 3615370
3 3371122
4 3645590
Name: ID, dtype: int64, 3: 5 3778682
6 3125140
7 3578772
Name: ID, dtype: int64}
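Putting the groupby approach together with the sample IDs above, a minimal end-to-end run looks like this:

```python
import pandas as pd

# Sample data matching the table in the question
df = pd.DataFrame({
    "ID": [3645390, 3678122, 3615370, 3371122, 3645590,
           3778682, 3125140, 3578772],
    "Group": [1, 1, 2, 2, 2, 3, 3, 3],
})

# One dictionary entry per group; each value is a plain list of IDs
out = df.groupby("Group")["ID"].apply(list).to_dict()
print(out[1])  # [3645390, 3678122]
```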

Related

Append strings to pandas dataframe column with conditional

I'm trying to achieve the following: check whether each key in the dictionary occurs in the string from the Layers column. If it does, append the corresponding dictionary value to a new column in the pandas dataframe.
For example, if BR and EWKS is contained within the layer, then in the new column there will be BRIDGE-EARTHWORKS.
Dataframe
mapping = {'IDs': [1244, 35673, 37863, 76373, 234298],
           'Layers': ['D-BR-PILECAPS-OUTLINE 2',
                      'D-BR-STEEL-OUTLINE 2D-TERR-BOUNDARY',
                      'D-SUBG-OTHER',
                      'D-COMP-PAVE-CONC2',
                      'D-EWKS-HINGE']}
df = pd.DataFrame(mapping)
Dictionary
d1 = {"BR": "Bridge", "EWKS": "Earthworks", "KERB": "Kerb", "TERR": "Terrain"}
My code thus far is:
for i in df.Layers:
    for x in d1.keys():
        first_key = list(d1)[0]
        first_val = list(d1.values())[0]
        print(first_key, first_val)
        if first_key in i:
            df1 = df1.append(first_val, ignore_index=True)
            # df.apply(first_val)
Note I'm thinking it may be easier to do the comprehension at the mapping step prior to creating the dataframe.. I'm rather new to python still so any tips are appreciated.
Thanks!
Use Series.str.extractall to get all matched keys, map them through the dictionary with Series.map, and finally aggregate with join:
pat = r'({})'.format('|'.join(d1.keys()))
df['new'] = df['Layers'].str.extractall(pat)[0].map(d1).groupby(level=0).agg('-'.join)
print (df)
      IDs                               Layers             new
0    1244              D-BR-PILECAPS-OUTLINE 2          Bridge
1   35673  D-BR-STEEL-OUTLINE 2D-TERR-BOUNDARY  Bridge-Terrain
2   37863                         D-SUBG-OTHER             NaN
3   76373                    D-COMP-PAVE-CONC2             NaN
4  234298                         D-EWKS-HINGE      Earthworks

Extract value from list of dictionaries in dataframe using pandas

I have this dataframe with 4 columns. I want to extract resourceName (i.e. the IDs) into one separate column. I tried various methods and loops but was unable to separate it.
Dataset:
Username                   Event name      Resources
XYZ-DEV_ENV_POST_function  StopInstances   [{"resourceType":"AWS::EC2::Instance","resourceName":"i-05fbb7a"}]
XYZ-DEV_ENV_POST_function  StartInstances  [{"resourceType":"AWS::EC2::Instance","resourceName":"i-08bd2475"},{"resourceType":"AWS::EC2::Instance","resourceName":"i-0fd69dc1"},{"resourceType":"AWS::EC2::Instance","resourceName":"i-0174dd38aea"}]
I want one more column, IDS, which will have the IDs from the Resources column and will look like this:
Username                   Event name      Resources                                                            IDS
XYZ-DEV_ENV_POST_function  StopInstances   [{"resourceType":"AWS::EC2::Instance","resourceName":"i-05fbb7a"}]   i-05fbb7a
XYZ-DEV_ENV_POST_function  StartInstances  [{"resourceType":"AWS::EC2::Instance","resourceName":"i-08bd2475"},{"resourceType":"AWS::EC2::Instance","resourceName":"i-0fd69dc1"},{"resourceType":"AWS::EC2::Instance","resourceName":"i-0174dd38aea"}]   i-08bd2475, i-0fd69dc1, i-0174dd38aea
Here is output of data.head(2).to_dict():
{'Date':
{0: '28-02-2022', 1: '28-02-2022'},
'Event name':
{0: 'StopInstances', 1: 'StartInstances'},
'Resources':
{
0: '[{"resourceType":"AWS::EC2::Instance","resourceName":"i-05fbb7a"}]',
1: '[{"resourceType":"AWS::EC2::Instance","resourceName":"i-08bd2475"},{"resourceType":"AWS::EC2::Instance","resourceName":"i-0fd69dc1"},{"resourceType":"AWS::EC2::Instance","resourceName":"i-0174dd38aea"}]'},
'User name': {0: 'XYZ-DEV_ENV_POST_function', 1:
'XYZ-DEV_ENV_POST_function'}}
Thanks and Regards
df['ID'] = df['Resources'].apply(lambda x: ','.join([i['resourceName'] for i in eval(x)]))
Date ... ID
0 28-02-2022 ... i-05fbb7a
1 28-02-2022 ... i-08bd2475,i-0fd69dc1,i-0174dd38aea
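Since the strings in Resources are valid JSON, json.loads is a safer drop-in for eval (which would execute arbitrary code if the column were ever malformed or untrusted). A sketch on a trimmed version of the data above:

```python
import json
import pandas as pd

df = pd.DataFrame({
    "Resources": [
        '[{"resourceType":"AWS::EC2::Instance","resourceName":"i-05fbb7a"}]',
        '[{"resourceType":"AWS::EC2::Instance","resourceName":"i-08bd2475"},'
        '{"resourceType":"AWS::EC2::Instance","resourceName":"i-0fd69dc1"}]',
    ],
})

# json.loads parses the list of dicts without executing code the way eval would
df["ID"] = df["Resources"].apply(
    lambda x: ",".join(d["resourceName"] for d in json.loads(x))
)
```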

Find if any value from a list exists anywhere in a dataframe

I have a list of specific company identifications numbers.
ex. companyID = ['1','2','3']
and I have a dataframe of different attributes relating to company business.
ex. company_df
There are multiple columns where values from my list could be.
ex. 'company_number', 'company_value', 'job_referred_by', etc.
How can I check if any value from my companyID list exists anywhere in my company_df, regardless of datatype, and return only the columns where a companyID is found?
This is what I have tried, to no luck:
def find_any(company_df, companyID):
    found = company_df.isin(companyID).any()
    foundCols = found.index[found].tolist()
    print(foundCols)
Create a df from your list of companyIDs and then merge the two dfs on company ID. Then filter the df to show only the rows that match.
For datatypes, you can convert int to string no problem, but the other way around would crash if you have a string that can't be converted to int (e.g., 'a'), so I'd use string.
Here's a toy example:
company_df = pd.DataFrame({'co_id': [1, 2, 4, 9]})
company_df['co_id'] = company_df['co_id'].astype(str)
companyID = ['1','2','3']
df_companyID = pd.DataFrame(companyID, columns=['co_id'])
company_df = company_df.merge(df_companyID, on='co_id', how='left', indicator=True)
print(company_df)
# co_id _merge
# 0 1 both
# 1 2 both
# 2 4 left_only
# 3 9 left_only
company_df_hits_only = company_df[company_df['_merge'] == 'both']
del company_df['_merge']
del company_df_hits_only['_merge']
print(company_df_hits_only)
# co_id
# 0 1
# 1 2
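The original isin attempt can also be made to work by casting the whole frame to string first, which sidesteps the int-vs-string mismatch and directly returns the columns where a companyID appears. Column names below are made up for illustration:

```python
import pandas as pd

companyID = ['1', '2', '3']

# Hypothetical frame mixing int and string columns
company_df = pd.DataFrame({
    'company_number': [1, 5, 9],
    'job_referred_by': ['2', 'x', 'y'],
    'notes': ['a', 'b', 'c'],
})

# Compare everything as strings so int and str columns both match
mask = company_df.astype(str).isin(companyID)
col_has_hit = mask.any()
found_cols = col_has_hit[col_has_hit].index.tolist()
```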

create MultiIndex columns based on "lookup"

I'd like to take an existing DataFrame with a single level of columns and modify it to use a MultiIndex based on a reference list of tuples and have the proper ordering/alignment. To illustrate by example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10,5), columns = ['nyc','london','canada','chile','earth'])
coltuples = [('cities','nyc'),('countries','canada'),('countries','usa'),('countries','chile'),('planets','earth'),('planets','mars'),('cities','sf'),('cities','london')]
I'd like to create a new DataFrame which has a top level consisting of 'cities', 'countries', and 'planets' with the corresponding original columns underneath. I am not concerned about order but definitely proper alignment.
It can be assumed that 'coltuples' will not be missing any of the columns from 'df', but may have extraneous pairs, and the ordering of the pairs can be random.
I am trying something along the lines of:
coltuplesuse = [x for x in coltuples if x[1] in df.columns]
cols = pd.MultiIndex.from_tuples(coltuplesuse, names=['level1','level2'])
df.reindex(columns=cols)
which seems to be on the right track but the underlying data in the DataFrame is 'nan'
thanks in advance!
Two things to notice: you want set_axis rather than reindex, and sorting by the original column order ensures each label is assigned to the correct column (this is done in the sorted(..., key=...) bit).
use_cols = [tup for tup in coltuples if tup[1] in df.columns]
use_cols = sorted(use_cols, key=lambda x: list(df.columns).index(x[1]))
multi_index = pd.MultiIndex.from_tuples(use_cols, names=['level1', 'level2'])
df.set_axis(multi_index, axis=1)
output:
level1 cities countries planets
level2 nyc london canada chile earth
0 0.028033 0.540977 -0.056096 1.675698 -0.328630
1 1.170465 -1.003825 0.882126 0.453294 -1.127752
2 -0.187466 -0.192546 0.269802 -1.225172 -0.548491
3 2.272900 -0.085427 0.029242 -2.258696 1.034485
4 -1.243871 -1.660432 -0.051674 2.098602 -2.098941
5 -0.820820 -0.289754 0.019348 0.176778 0.395959
6 1.346459 -0.260583 0.212008 -1.071501 0.945545
7 0.673351 1.133616 1.117379 -0.531403 1.467604
8 0.332187 -3.541103 -0.222365 1.035739 -0.485742
9 -0.605965 -1.442371 -1.628210 -0.711887 -2.104755
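The NaNs from reindex come from label matching: reindex looks the new tuple labels up among the existing flat column names, finds no matches, and fills with NaN, whereas set_axis relabels positionally and keeps the data. A small sketch of the difference:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3, 2), columns=['nyc', 'earth'])
cols = pd.MultiIndex.from_tuples([('cities', 'nyc'), ('planets', 'earth')])

# reindex matches the tuple labels against the flat names 'nyc'/'earth',
# finds nothing, and fills every cell with NaN
nan_df = df.reindex(columns=cols)

# set_axis relabels positionally, leaving the values intact
ok_df = df.set_axis(cols, axis=1)
```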

Issue faced in converting dataframe to dict

I have a 2 columns (Column names Orig_Nm and Mapping) dataframe:
Orig_Nm  Mapping
Name     FinalName
Id_No    Identification
Group    Zone
Now I wish to convert it to a dictionary, so I use
name_dict = df.set_index('Orig_Nm').to_dict()
print (name_dict)
The output which I get is:
{'Mapping': {'Group': 'Zone', 'ID_No': 'Identification', 'Name': 'Final_Name'}}
So it is a dictionary within dictionary {{}}.
What am I doing wrong, that I am not getting a single dictionary i.e. {}
You need Series.to_dict instead of DataFrame.to_dict, by selecting the Mapping column as a Series:
name_dict = df.set_index('Orig_Nm')['Mapping'].to_dict()
Also working selecting by key:
name_dict = df.set_index('Orig_Nm').to_dict()['Mapping']
EDIT:
In your solution, set_index leaves a one-column DataFrame, so to_dict creates a nested dictionary whose first key is the column name:
d = {'Orig_Nm': ['Group', 'ID_No', 'Name'],
'Mapping': ['Zone', 'Identification', 'Final_Name']}
df = pd.DataFrame(d)
print (df)
Orig_Nm Mapping
0 Group Zone
1 ID_No Identification
2 Name Final_Name
print (df.set_index('Orig_Nm'))
Mapping
Orig_Nm
Group Zone
ID_No Identification
Name Final_Name
print (type(df.set_index('Orig_Nm')))
<class 'pandas.core.frame.DataFrame'>
So to avoid this, select the column as a Series:
print (df.set_index('Orig_Nm')['Mapping'])
Orig_Nm
Group Zone
ID_No Identification
Name Final_Name
Name: Mapping, dtype: object
print (type(df.set_index('Orig_Nm')['Mapping']))
<class 'pandas.core.series.Series'>
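Both spellings (selecting the column before or after to_dict) yield the same flat mapping; a quick check on the sample data:

```python
import pandas as pd

df = pd.DataFrame({'Orig_Nm': ['Group', 'ID_No', 'Name'],
                   'Mapping': ['Zone', 'Identification', 'Final_Name']})

# Series.to_dict gives the flat mapping directly
flat = df.set_index('Orig_Nm')['Mapping'].to_dict()

# DataFrame.to_dict nests it one level deeper, keyed by column name
nested = df.set_index('Orig_Nm').to_dict()

assert flat == nested['Mapping']
```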
