Loop over columns and calculate relative change within groups


I have a dataset (df) and want to produce df_goal: for each of VALUE1 and VALUE2, a new column that captures the change relative to the first row within each group. My real dataset has many columns, so I am looking for a solution that loops over the columns and adds the new ones along the way.
I have tried versions of the snippet below, but it doesn't work. Any suggestions?
for col in df.columns:
    df[col + 'REL_CGH'] = df.groupby(['GROUP']).apply((df.col / dfcol[0]) * 100)
import pandas as pd
df = pd.DataFrame({'GROUP': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'VALUE1': [5, 6, 7, 3, 5, 8],
                   'VALUE2': [11, 16, 21, 321, 401, 423]})
df_goal = pd.DataFrame({'GROUP': ['A', 'A', 'A', 'B', 'B', 'B'],
                        'VALUE1': [5, 6, 7, 3, 5, 8],
                        'VALUE2': [11, 16, 21, 321, 401, 423],
                        'VALUE1_REL_CHG': [100, 120, 140, 100, 167, 267],
                        'VALUE2_REL_CHG': [100, 145, 191, 100, 174, 183]})

You can use GroupBy.transform with GroupBy.first to get the first value per group for all columns listed in cols, divide with DataFrame.div, multiply by 100, round and convert to integers, add a suffix with DataFrame.add_suffix, and finally join the result to the original:
cols = ['VALUE1', 'VALUE2']
df = (df.join(df[cols].div(df.groupby(['GROUP'])[cols].transform('first'))
                      .mul(100)
                      .round()
                      .astype(int)
                      .add_suffix('_REL_CGH')))
print(df)
  GROUP  VALUE1  VALUE2  VALUE1_REL_CGH  VALUE2_REL_CGH
0     A       5      11             100             100
1     A       6      16             120             145
2     A       7      21             140             191
3     B       3     321             100             100
4     B       5     401             167             125
5     B       8     423             267             132
Your own loop can be made to work with a lambda function, but it is slower on a large DataFrame:
for col in cols:
    # transform keeps the original row index, so the result aligns with df on assignment
    df[col + '_REL_CGH'] = df.groupby(['GROUP'])[col].transform(lambda x: (x / x.iloc[0]) * 100)
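
Since the real dataset has many columns, the list of columns does not have to be typed by hand. The following is a minimal sketch, assuming every column other than GROUP is numeric and should get a relative-change column; the names value_cols, baseline, rel and result are illustrative and not part of the answer above, and the suffix follows the spelling used in df_goal.

```python
import pandas as pd

df = pd.DataFrame({'GROUP': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'VALUE1': [5, 6, 7, 3, 5, 8],
                   'VALUE2': [11, 16, 21, 321, 401, 423]})

# Assumption: every column except the grouping key is a numeric value column.
value_cols = [c for c in df.columns if c != 'GROUP']

# First value of each group, broadcast back onto the original rows.
baseline = df.groupby('GROUP')[value_cols].transform('first')

# Percentage relative to the group's first row, rounded to whole numbers.
rel = (df[value_cols].div(baseline)
                     .mul(100)
                     .round()
                     .astype(int)
                     .add_suffix('_REL_CHG'))

result = df.join(rel)
print(result)
```

Building the new columns in a separate frame and joining once avoids modifying df while iterating over df.columns.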

Related

How to write a function to work with a dictionary-type Series and a column in a DataFrame?

I am trying to write a function that works with a Series and a DataFrame.
dct = {10: 0.5, 20: 2, 30: 3, 40: 4}
# Defining the function
def funtion_dict(row, dict1):
    total_area = row['total_area']
    if total_area.round(-1) in dict1:
        return dict1.get(total_area.round(-1)) * total_area
# Checking the function in a test situation
row = pd.DataFrame(
    {
        'total_area': [53, 14.8, 94, 77, 12],
        'b': [5, 4, 3, 2, 1],
        'c': ['X', 'Y', 'Y', 'Y', 'Z'],
    }
)
print(funtion_dict(row, dct))
I keep getting the error "'Series' objects are mutable, thus they cannot be hashed". Please help.

This is the expected behavior: total_area.round(-1) is a whole Series, and a Series cannot be used as a dictionary key because it is not hashable.
From your code,
dct = {10: 0.5, 20: 2, 30: 3, 40: 4}
df = pd.DataFrame({
    'total_area': [53, 14.8, 94, 77, 12],
    'b': [5, 4, 3, 2, 1],
    'c': ['X', 'Y', 'Y', 'Y', 'Z'],
})
If you want to add another column to your data frame with multipliers matched from a dictionary, you can do it like so:
df['new_column'] = df['total_area'].round(-1).map(dct) * df['total_area']
which will then give you
   total_area  b  c  new_column
0        53.0  5  X         NaN
1        14.8  4  Y         7.4
2        94.0  3  Y         NaN
3        77.0  2  Y         NaN
4        12.0  1  Z         6.0
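
If a plain function is still wanted, it has to receive one number at a time rather than the whole column. A minimal sketch, assuming the same dct and total_area values as above; multiplier_for is a hypothetical helper name, and returning NaN for unmatched keys is a choice made here to mirror the map result.

```python
import math

import pandas as pd

dct = {10: 0.5, 20: 2, 30: 3, 40: 4}

def multiplier_for(total_area, lookup):
    """Return lookup[area rounded to the nearest 10] * area, or NaN if there is no such key."""
    key = round(total_area, -1)   # a single number is hashable, unlike a Series
    factor = lookup.get(key)
    return math.nan if factor is None else factor * total_area

df = pd.DataFrame({'total_area': [53, 14.8, 94, 77, 12]})

# Series.apply calls the function once per value, so the dictionary only sees scalars.
df['new_column'] = df['total_area'].apply(lambda area: multiplier_for(area, dct))
print(df)
```

The vectorized round/map version above is usually the faster option; the function form is mainly useful when the per-value logic becomes more involved.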

How to get key from the value of dictionary

I have the following set of rules for a grading system:
if 25 < score <= 30, grade = A.
if 20 < score <= 25, grade = B.
if 15 < score <= 20, grade = C.
if 10 < score <= 15, grade = D.
if 5 < score <= 10, grade = E.
if 0 <= score <= 5, grade = F.
So I have to write a function that takes a score as a parameter and returns the letter grade. I could do this with selection statements (if/else), but I want to do it in a different manner.
For instance, I want to declare a dictionary like the one below:
gradeDict = {
    'A': [26, 27, 28, 29, 30],
    'B': [21, 22, 23, 24, 25],
    'C': [16, 17, 18, 19, 20],
    'D': [11, 12, 13, 14, 15],
    'E': [6, 7, 8, 9, 10],
    'F': [0, 1, 2, 3, 4, 5]
}
so that, when checking a score against the values, I get back the key.
In Python I've learned about dict.get(term, 'otherwise'), but that returns the value. Is there any mechanism that does the opposite, i.e. one where passing a value returns the key?

The bisect module in the standard library offers an elegant solution to problems like this one. In fact, grading is one of the examples shown in the docs. Here is an adaptation of that example modeled on the OP's grading curve:
Example:
from bisect import bisect_left
def grade(score, breakpoints=[5, 10, 15, 20, 25], grades='FEDCBA'):
    i = bisect_left(breakpoints, score)
    return grades[i]
[grade(score) for score in [1, 5, 8, 10, 11, 15, 17, 20, 22, 25, 26]]
Output:
['F', 'F', 'E', 'E', 'D', 'D', 'C', 'C', 'B', 'B', 'A']
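
If the dictionary from the question is to be kept as it is, the reverse lookup can be sketched by inverting it once, so that each score points back at its grade. This assumes integer scores only; score_to_grade and grade_from_dict are illustrative names, and 'otherwise' is the fallback taken from the question's dict.get example.

```python
gradeDict = {
    'A': [26, 27, 28, 29, 30],
    'B': [21, 22, 23, 24, 25],
    'C': [16, 17, 18, 19, 20],
    'D': [11, 12, 13, 14, 15],
    'E': [6, 7, 8, 9, 10],
    'F': [0, 1, 2, 3, 4, 5]
}

# Invert the mapping: one entry per score, with the letter grade as its value.
score_to_grade = {score: letter
                  for letter, scores in gradeDict.items()
                  for score in scores}

def grade_from_dict(score):
    # dict.get now works in the direction the question asks for.
    return score_to_grade.get(score, 'otherwise')

print([grade_from_dict(s) for s in [0, 5, 6, 25, 26, 31]])
```

The inversion costs one pass over the dictionary up front, after which every lookup is a single dictionary access.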

The funny thing is that you don't even need a dictionary for this, just a list. Of course, you can do it dictionary-style by declaring the following dict:
gradeDict = {
    1: 'F',
    2: 'E',
    3: 'D',
    4: 'C',
    5: 'B',
    6: 'A'
}
This dict is of little use, since its keys are just the ordered indexes 1, 2, 3, ...
You can turn it into a list: grades_arr = ['F', 'E', 'D', 'C', 'B', 'A']
"But how can I get the letter that I need?" you may ask. Simple: integer-divide the score by 5. 21 // 5 is 4, and grades_arr[21 // 5] is 'B'.
Two special cases:
when the score is divisible by 5, you have to subtract 1 from the index, because for example 25 // 5 is 5 but grades_arr[5] is 'A', not 'B'.
when the score is 0, do not subtract.
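
The rule and its two special cases can be sketched as a small function. This assumes integer scores in the range 0 to 30; the function name grade is illustrative.

```python
grades_arr = ['F', 'E', 'D', 'C', 'B', 'A']

def grade(score):
    # Assumption: score is an integer between 0 and 30 inclusive.
    index = score // 5
    # A multiple of 5 belongs to the lower band, except for 0 itself.
    if score % 5 == 0 and score != 0:
        index -= 1
    return grades_arr[index]

print([grade(s) for s in [0, 1, 5, 6, 10, 21, 25, 26, 30]])
```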

Cannot map an item in a list to a list in list of lists

I have two lists like so,
list1 = ['a','b','c','d']
list2 = [[20,30,15], [23,32,62,234, 234], [34,345,5345], [12]]
How can I map them so it outputs:
a 20
a 30
a 15
b 23
b 32
b 62
.
.
.
d 12
I tried this
list1 = ['a', 'b', 'c', 'd']
list2 = [[20, 30, 15], [23, 32, 62, 234, 234], [34, 345, 5345], [12]]
for item in list2:
    for al, it, in zip(list1, item):
        print(al, it)
which gives
a 20
b 30
c 15
a 23
b 32
c 62
d 234
a 34
b 345
c 5345
a 12

Using enumerate():
list1 = ['a', 'b', 'c', 'd']
list2 = [[20, 30, 15], [23, 32, 62, 234, 234], [34, 345, 5345], [12]]
for index, alpha in enumerate(list1):
    for number in list2[index]:
        print(alpha, number)
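
The same pairing can be sketched with zip, as long as it is applied to the two outer lists rather than to each inner list. The name pairs is illustrative; the sketch assumes list1 and list2 have the same length, since zip stops at the shorter one.

```python
list1 = ['a', 'b', 'c', 'd']
list2 = [[20, 30, 15], [23, 32, 62, 234, 234], [34, 345, 5345], [12]]

# zip pairs each letter with its own sublist; the inner loop walks that sublist.
pairs = [(alpha, number)
         for alpha, numbers in zip(list1, list2)
         for number in numbers]

for alpha, number in pairs:
    print(alpha, number)
```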

delete duplicated rows based on conditions pandas

I want to delete rows from a DataFrame when (x1, x2, x3) are the same as in another row, and save the ids of all deleted rows in a variable.
For example, with this data, I want to delete the second row:
d = {'id': ["i1", "i2", "i3", "i4"], 'x1': [13, 13, 61, 61], 'x2': [10, 10, 13, 13], 'x3': [12, 12, 2, 22], 'x4': [24, 24,9, 12]}
df = pd.DataFrame(data=d)

import pandas as pd
# Input data
d = {'id': ["i1", "i2", "i3", "i4"], 'x1': [13, 13, 61, 61], 'x2': [10, 10, 13, 13], 'x3': [12, 12, 2, 22], 'x4': [24, 24, 9, 12]}
df = pd.DataFrame(data=d)
# Create a new column in which the contents of the x1, x2 and x3 columns are merged
df['MergedColumn'] = df[df.columns[1:4]].apply(lambda x: ','.join(x.dropna().astype(str)), axis=1)
# Remove duplicates based on the created column, then drop the created column
df1 = pd.DataFrame(df.drop_duplicates("MergedColumn", keep='first').drop(columns="MergedColumn"))
# Print the output dataframe
print(df1)
# Merge the two dataframes
df2 = pd.merge(df, df1, how='left', on='id')
# Find rows with null values in the right table (the rows that were removed)
df2 = df2[df2['x1_y'].isnull()]
# Print the ids of the rows that were removed
print(df2['id'])
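
A shorter route to the same result can be sketched with DataFrame.duplicated, which flags repeated rows directly. This is an alternative to the merged-column approach above, not a restatement of it; key_cols, is_dup, removed_ids and deduped are illustrative names.

```python
import pandas as pd

d = {'id': ["i1", "i2", "i3", "i4"], 'x1': [13, 13, 61, 61], 'x2': [10, 10, 13, 13],
     'x3': [12, 12, 2, 22], 'x4': [24, 24, 9, 12]}
df = pd.DataFrame(data=d)

key_cols = ['x1', 'x2', 'x3']

# True for every row whose (x1, x2, x3) already appeared in an earlier row.
is_dup = df.duplicated(subset=key_cols, keep='first')

removed_ids = df.loc[is_dup, 'id'].tolist()   # ids of the rows being dropped
deduped = df.loc[~is_dup]                     # rows that are kept

print(removed_ids)
print(deduped)
```

Because the comparison works on the column values themselves, no string concatenation or second merge is needed.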

How can I iterate over pandas dataframes and concatenate on another dataframe [duplicate]

I have 3 CSV files. Each has the first column as the (string) names of people, while all the other columns in each dataframe are attributes of that person.
How can I "join" together all three CSV documents to create a single CSV with each row having all the attributes for each unique value of the person's string name?
The join() function in pandas specifies that I need a multiindex, but I'm confused about what a hierarchical indexing scheme has to do with making a join based on a single index.

Zero's answer is basically a reduce operation. If I had more than a handful of dataframes, I'd put them in a list like this (generated via list comprehensions or loops or whatnot):
dfs = [df0, df1, df2, ..., dfN]
Assuming they have a common column, like name in your example, I'd do the following:
import functools as ft
df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)
That way, your code should work with whatever number of dataframes you want to merge.
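
A runnable sketch of the reduce approach, using three small hypothetical frames (the data is made up for illustration). It also shows that pd.merge defaults to an inner join, so names missing from any frame are dropped unless how='outer' is passed.

```python
import functools as ft

import pandas as pd

# Hypothetical sample data: 'c' is missing from df3 and 'd' appears only there.
df1 = pd.DataFrame({'name': ['a', 'b', 'c'], 'attr11': [5, 4, 24]})
df2 = pd.DataFrame({'name': ['a', 'b', 'c'], 'attr21': [5, 14, 4]})
df3 = pd.DataFrame({'name': ['a', 'b', 'd'], 'attr31': [15, 4, 14]})
dfs = [df1, df2, df3]

# Default (inner) merge: only names present in every frame survive.
df_inner = ft.reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)

# Outer merge: every name is kept and the gaps are filled with NaN.
df_outer = ft.reduce(lambda left, right: pd.merge(left, right, on='name', how='outer'), dfs)

print(df_inner)
print(df_outer)
```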

You could try this if you have 3 dataframes:
# Merge multiple dataframes
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])
pd.merge(pd.merge(df1, df2, on='name'), df3, on='name')
Alternatively, as mentioned by cwharland:
df1.merge(df2, on='name').merge(df3, on='name')

This is an ideal situation for the join method
The join method is built exactly for these types of situations. You can join any number of DataFrames together with it. The calling DataFrame joins with the index of the collection of passed DataFrames. To work with multiple DataFrames, you must put the joining columns in the index.
The code would look something like this:
filenames = ['fn1', 'fn2', 'fn3', 'fn4', ...]
dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames]
dfs[0].join(dfs[1:])
With @zero's data, you could do this:
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])
dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:])
     attr11 attr12 attr21 attr22 attr31 attr32
name
a         5      9      5     19     15     49
b         4     61     14     16      4     36
c        24      9      4      9     14      9

In Python 3.6.3 with pandas 0.22.0 you can also use concat, as long as you set the columns you want to join on as the index:
pd.concat(
    objs=(iDF.set_index('name') for iDF in (df1, df2, df3)),
    axis=1,
    join='inner'
).reset_index()
where df1, df2, and df3 are defined as in John Galt's answer:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12']
)
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22']
)
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32']
)

This can also be done as follows for a list of dataframes df_list:
df = df_list[0]
for df_ in df_list[1:]:
    df = df.merge(df_, on='join_col_name')
or, if the dataframes are in a generator object (e.g. to reduce memory consumption):
df = next(df_list)
for df_ in df_list:
    df = df.merge(df_, on='join_col_name')

Simple Solution:
If the column names are the same:
df1.merge(df2,on='col_name').merge(df3,on='col_name')
If the column names are different:
df1.merge(df2,left_on='col_name1', right_on='col_name2').merge(df3,left_on='col_name1', right_on='col_name3').drop(columns=['col_name2', 'col_name3']).rename(columns={'col_name1':'col_name'})

Here is a method to merge a dictionary of data frames while keeping the column names in sync with the dictionary keys. It also fills in missing values if needed.
This is the function that merges a dict of data frames:
import numpy as np
import pandas as pd
def MergeDfDict(dfDict, onCols, how='outer', naFill=None):
    # dict views are not subscriptable in Python 3, so materialize the keys first
    keys = list(dfDict.keys())
    for i in range(len(keys)):
        key = keys[i]
        df0 = dfDict[key]
        cols = list(df0.columns)
        valueCols = list(filter(lambda x: x not in (onCols), cols))
        df0 = df0[onCols + valueCols]
        # tag every value column with the dictionary key it came from
        df0.columns = onCols + [(s + '_' + key) for s in valueCols]
        if (i == 0):
            outDf = df0
        else:
            outDf = pd.merge(outDf, df0, how=how, on=onCols)
    if (naFill is not None):
        outDf = outDf.fillna(naFill)
    return(outDf)
OK, let's generate some data and test this:
def GenDf(size):
    df = pd.DataFrame({'categ1': np.random.choice(a=['a', 'b', 'c', 'd', 'e'], size=size, replace=True),
                       'categ2': np.random.choice(a=['A', 'B'], size=size, replace=True),
                       'col1': np.random.uniform(low=0.0, high=100.0, size=size),
                       'col2': np.random.uniform(low=0.0, high=100.0, size=size)
                       })
    df = df.sort_values(['categ2', 'categ1', 'col1', 'col2'])
    return(df)
size = 5
dfDict = {'US': GenDf(size), 'IN': GenDf(size), 'GER': GenDf(size)}
MergeDfDict(dfDict=dfDict, onCols=['categ1', 'categ2'], how='outer', naFill=0)

You do not need a multiindex to perform join operations.
You just need to set the index to the column on which the join should be performed (for example with df.set_index('Name')).
By default, the join operation is performed on the index.
In your case, you just have to specify that the Name column is your index.
Below is an example.
A tutorial may be useful.
# Simple example where dataframes index are the name on which to perform
# the join operations
import pandas as pd
import numpy as np
name = ['Sophia' ,'Emma' ,'Isabella' ,'Olivia' ,'Ava' ,'Emily' ,'Abigail' ,'Mia']
df1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=name)
df2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=name)
df3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=name)
df = df1.join(df2)
df = df.join(df3)
# If you have a 'Name' column that is not the index of your dataframe,
# you can set this column to be the index
# 1) Create a column 'Name' based on the previous index
df1['Name'] = df1.index
# 2) Set the index from column 'Name'
df1 = df1.set_index('Name')
# If the indexes differ, you may have to adjust the how parameter
gf1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=range(8))
gf2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=range(2,10))
gf3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=range(4,12))
gf = gf1.join(gf2, how='outer')
gf = gf.join(gf3, how='outer')

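There is another solution from the pandas documentation (that I don't see here), using .append, which stacks rows rather than joining on a key (DataFrame.append is only available in pandas versions before 2.0, where it was removed):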
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
   A  B
0  1  2
1  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
   A  B
0  5  6
1  7  8
>>> df.append(df2, ignore_index=True)
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
ignore_index=True discards the index of the appended dataframe and continues with the next available index of the source one.
If the column names differ, NaN values are introduced.
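
On current pandas releases the same row-wise stacking can be sketched with pd.concat, which accepts the same ignore_index flag. The name stacked is illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

# Stack the rows of df2 under df and renumber the index from 0.
stacked = pd.concat([df, df2], ignore_index=True)
print(stacked)
```

Note that this adds rows; it does not combine attributes of the same person across files, which is what the merge and join answers do.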

I tweaked the accepted answer to perform the operation for multiple dataframes with different suffixes parameters using reduce, and I guess it can be extended to different on parameters as well.
from functools import reduce
dfs_with_suffixes = [(df2,suffix2), (df3,suffix3),
                     (df4,suffix4)]
merge_one = lambda x,y,sfx:pd.merge(x,y,on=['col1','col2'..], suffixes=sfx)
merged = reduce(lambda left,right:merge_one(left,*right), dfs_with_suffixes, df1)
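
A runnable sketch of the same idea with hypothetical data: three frames share a key column and a clashing val column, and each merge step gets its own suffix pair. The frame contents, the column names key and val, and the suffix values are assumptions made for illustration.

```python
from functools import reduce

import pandas as pd

# Hypothetical frames that all carry a column named 'val'.
df1 = pd.DataFrame({'key': [1, 2], 'val': [10, 20]})
df2 = pd.DataFrame({'key': [1, 2], 'val': [30, 40]})
df3 = pd.DataFrame({'key': [1, 2], 'val': [50, 60]})

# Each frame is paired with the (left, right) suffixes used when it is merged in.
dfs_with_suffixes = [(df2, ('', '_2')), (df3, ('', '_3'))]

def merge_one(left, right, sfx):
    # suffixes only affect column names that exist on both sides
    return pd.merge(left, right, on='key', suffixes=sfx)

# df1 is the starting value; every step merges in the next (frame, suffixes) pair.
merged = reduce(lambda left, pair: merge_one(left, *pair), dfs_with_suffixes, df1)
print(merged)
```

Using an empty left suffix keeps the names accumulated so far unchanged, so only the newly merged frame's clashing columns are renamed.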

df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12']
)
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['d', 14, 16]]
    ),
    columns=['name', 'attr21', 'attr22']
)
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['c', 4, 36],
    ['d', 14, 9]]),
    columns=['name', 'attr31', 'attr32']
)
df4 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['c', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr41', 'attr42']
)
Three ways to join a list of dataframes
pandas.concat
dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
# does not run if the index is not unique
dfs = pd.concat(dfs, join='outer', axis=1)
functools.reduce
dfs = [df1, df2, df3, df4]
# still runs when the index is not unique
import functools as ft
df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='name', how='outer'), dfs)
join
# does not run if the index is not unique
dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:], how='outer')

All three can be joined together using the .join() function.
Say you have three DataFrames: df1, df2 and df3.
To join these into one DataFrame you can do:
df = df1.join(df2).join(df3)
This is the simplest way I found to do this task.
