I'm doing a bunch of operations on pandas dataframes, for example finding the max, min and average inside columns and returning the column names in a new column. Now I'm trying to wrap these operations in a function and pass max() and/or min() as arguments to that function.
Below is a snippet that describes what I'm trying to do in a very simplified way. In its current state it also produces the desired output. The snippet does not have the desired functionality and flexibility, though.
The setup:
import pandas as pd
# Sample dataframe
df = pd.DataFrame({'col_A':[1,20,6,1,3]})
def findValue(function, df, colname):
    print(function) # just a placeholder
    df[colname] = df.max()[0]
    return df
df2 = findValue(function='max', df=df, colname='col_B')
print(df)
Output 1:
   col_A  col_B
0      1     20
1     20     20
2      6     20
3      1     20
4      3     20
A naive attempt:
# Sample dataframe
df = pd.DataFrame({'col_A':[1,20,6,1,3]})
# The function I would like to use in another function is max()
# My function
def findValue(function, df, colname):
    df[colname] = df.function()[0]
    return df
df2 = findValue(function=max(), df=df , colname='col_B')
print(df)
Output 2:
Traceback (most recent call last):
  File "<ipython-input-7-85964ff29e69>", line 1, in <module>
    df2 = findValue(function=max(), df=df , colname='col_B')
TypeError: max expected 1 arguments, got 0
How can I change the above snippet so that I can change function = max() to function = min() or any other function in the arguments of findValue()? Or even define a list of functions to be used in a similar manner?
Thank you for any suggestions!
You are very, very close. You pretty much just need to remove the parentheses when passing in the function, so that you pass the function object itself instead of calling it. Here's a simplified example that loops over a tuple of functions and appears to do what you want:
def findValue(func, x, y):
    return func(x, y)
for calc in (max, min):
    result = findValue(func=calc, x=1, y=10)
    print(result)
Output:
10
1
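Applied back to the DataFrame from the question, the same idea might look like the sketch below. It assumes the aggregation is passed as a string name ('max', 'min', 'mean') and dispatched through Series.agg. The names find_value, source_col and new_col are illustrative and do not come from the original post.

```python
import pandas as pd

def find_value(func, df, source_col, new_col):
    """Return a copy of df with new_col holding func applied to source_col."""
    out = df.copy()  # avoid mutating the caller's frame
    out[new_col] = out[source_col].agg(func)  # the scalar is broadcast to every row
    return out

df = pd.DataFrame({'col_A': [1, 20, 6, 1, 3]})

# Loop over several aggregations, adding one new column per aggregation
result = df
for name in ('max', 'min', 'mean'):
    result = find_value(name, result, 'col_A', 'col_A_' + name)

print(result)
```

Passing names rather than the built-in max/min lets pandas pick its own implementation of each aggregation; the original frame df is left untouched because each call works on a copy.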
I have hourly time series data (say df with date/time and value columns) where I want to:
Step 1: Remove the top 5 percentile of each day
Step 2: Get the max(Step 1) for each day
Step 3: Get the mean(Step 2) for each month
Here is what I have tried to implement the above logic:
step_1 = df.resample('D').apply(lambda x: x<x.quantile(0.95))
step_2 = step_1.resample('D').max()
step_3 = step_2.resample('M').mean()
Even though I do not get any error, the output differs from what I expect from the three steps above (I always get a constant value).
Any help will be appreciated.
You are almost there. Your step_1 is a series of booleans with the same index as the original data, so you can use it to filter your DataFrame:
step_1 = df.resample('D').apply(lambda x: x<x.quantile(0.95))
step_2 = df[step_1].resample('D').max()
step_3 = step_2.resample('M').mean()
Your first step is a boolean mask, so you need to add an additional step that applies it to the data:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(1000), index=pd.date_range(start='1/1/2019', periods=1000, freq='H'), columns=['my_data'])
mask = df.resample('D').apply(lambda x: x < x.quantile(.95))
step_1 = df[mask]  # keep only the values below each day's 95th percentile
step_2 = step_1.resample('D').max()  # daily max of the filtered data
step_3 = step_2.resample('M').mean()  # monthly mean of the daily maxima
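The three steps can also be sketched with groupby and transform instead of resample. This is a minimal sketch, not the answerers' code: the function name monthly_mean_of_trimmed_daily_max and the toy data (three days of hourly readings running 0..23) are made up for illustration, and the input is assumed to be a Series with a DatetimeIndex.

```python
import pandas as pd

def monthly_mean_of_trimmed_daily_max(s, q=0.95):
    """Drop values at or above each day's q-quantile, then daily max, then monthly mean."""
    day = s.index.normalize()  # midnight timestamp of each observation's day
    # Step 1: per-day threshold, broadcast back to the hourly index
    threshold = s.groupby(day).transform(lambda x: x.quantile(q))
    kept = s[s < threshold]
    # Step 2: max of the remaining values for each day
    daily_max = kept.groupby(kept.index.normalize()).max()
    # Step 3: mean of the daily maxima for each month
    return daily_max.groupby(daily_max.index.to_period('M')).mean()

# Hypothetical data: three days of hourly readings, each day running 0..23
index = pd.date_range('2019-01-01', periods=72, freq=pd.Timedelta(hours=1))
values = pd.Series([float(i % 24) for i in range(72)], index=index)

monthly = monthly_mean_of_trimmed_daily_max(values)
print(monthly)
```

For this toy data the 95th percentile of each day is 21.85, so 22 and 23 are dropped, each daily max is 21, and the single monthly mean is 21.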
Goal
Apply deid_notes function to df
Background
I have a df that resembles this sample df
import pandas as pd
df = pd.DataFrame({'Text' : ['there are many different types of crayons',
                             'i like a lot of sports cares',
                             'the middle east has many camels '],
                   'P_ID': [1,2,3],
                   'Word' : ['crayons', 'cars', 'camels'],
                   'P_Name' : ['John', 'Mary', 'Jacob'],
                   'N_ID' : ['A1', 'A2', 'A3']
                   })
#rearrange columns
df = df[['Text','N_ID', 'P_ID', 'P_Name', 'Word']]
df
                    Text  N_ID  P_ID  P_Name     Word
0  many types of crayons    A1     1    John  crayons
1     i like sports cars    A2     2    Mary     cars
2        has many camels    A3     3   Jacob   camels
I use the following function to deidentify certain words within the Text column using NeuroNER http://neuroner.com/
def deid_notes(text):
    #use predict function from neuroNER to tag words to be deidentified
    ner_list = n1.predict(text)
    #n1.predict wont work in this toy example because neuroNER package needs to be installed (and installation is difficult)
    #but the output resembles this: [{'start': 1, 'end': 11, 'id': 1, 'tagged word': 'crayon'}]
    #use start and end position of tagged words to deidentify and replace with **BLOCK**
    if len(ner_list) > 0:
        parts_to_take = [(0, ner_list[0]['start'])] + [(first["end"]+1, second["start"]) for first, second in zip(ner_list, ner_list[1:])] + [(ner_list[-1]['end'], len(text)-1)]
        parts = [text[start:end] for start, end in parts_to_take]
        deid = '**BLOCK**'.join(parts)
    #if n1.predict does not identify any words to be deidentified, place NaN
    else:
        deid='NaN'
    #index must be list-like, so wrap the label in a list
    return pd.Series(deid, index=['Deid'])
Problem
I apply the deid_notes function to my df using the following code
fx = lambda x: deid_notes(x.Text,axis=1)
df.join(df.apply(fx))
But I get the following error
AttributeError: ("'Series' object has no attribute 'Text'", 'occurred at index Text')
Question
How do I get the deid_notes function to work on my df?
Assuming deid_notes takes text as its only input argument and returns a pandas Series, pass the axis=1 argument to apply instead of to deid_notes. For example:
# Dummy function
def deid_notes(text):
    deid = 'prediction to: ' + text
    return pd.Series(deid, index = ['Deid'])
fx = lambda x: deid_notes(x.Text)
df.join(df.apply(fx, axis =1))
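Because only the Text column is needed, applying the function to that single column is another option. The sketch below is runnable without NeuroNER: fake_predict is a hypothetical stand-in for n1.predict that simply looks up a fixed word list, and it assumes 'end' is an exclusive offset, which may differ from the real library's output.

```python
import pandas as pd

def fake_predict(text, words=("crayons", "cars", "camels")):
    """Hypothetical stand-in for n1.predict: return the spans of the listed words."""
    spans = []
    for word in words:
        start = text.find(word)
        if start != -1:
            spans.append({"start": start, "end": start + len(word)})  # end is exclusive here
    return sorted(spans, key=lambda span: span["start"])

def deid_text(text):
    """Replace every tagged span with **BLOCK**; return None when nothing is tagged."""
    spans = fake_predict(text)
    if not spans:
        return None
    pieces, cursor = [], 0
    for span in spans:
        pieces.append(text[cursor:span["start"]])  # untouched text before the span
        cursor = span["end"]
    pieces.append(text[cursor:])  # untouched text after the last span
    return "**BLOCK**".join(pieces)

df = pd.DataFrame({"Text": ["there are many different types of crayons",
                            "i like a lot of sports cars",
                            "nothing to hide here"]})

# Series.apply passes each string directly, so no axis argument is involved
df["Deid"] = df["Text"].apply(deid_text)
print(df)
```

Working on df["Text"] avoids the row-versus-column confusion behind the original AttributeError, since the function receives plain strings rather than whole rows or columns.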
I am trying to calculate a discount that I would like to apply to each row of two columns of my dataframe and add the result to a new column.
I have already tried many ways, following existing examples, but every time an error occurs.
I define the function as:
def delta_perc(x,y):
    if y == 0:
        return 0
    else:
        return (x-y)/x*100
and then try to apply the function to my dataframe
ordini["discount"] = ordini.apply(delta_perc(ordini["revenue1"],ordini["revenue2"]), axis=1)
I expected a new column where each row was the result of the function applied to ordini["revenue1"] and ordini["revenue2"].
But I get the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I also tried to apply all the suggestions from here, but every time an error occurred.
You are getting a few concepts mixed up. When you use pandas.DataFrame.apply (with axis=1) you are iterating through each row and passing that row (as a pandas.Series object) to the function you used when you called apply.
First Point of Failure
Instead, you are calling your function inside the apply and passing two columns to the function. This will pass the return value of the function to the apply. Since your function does not pass back a callable object, this should fail.
Second Point of Failure
Also, your function is designed to look at scalar values (hence if y == 0:). When you pass a column like ordini["revenue1"] (which is a pandas.Series object), it tries to evaluate if pandas.Series == 0:, and that is what generates the error you see:
ValueError: The truth value of a Series is ambiguous.
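The second failure can be reproduced in isolation. The sketch below uses a small made-up ordini frame (the numbers are not from the question) to show that comparing a Series with 0 yields a boolean Series, which an if statement cannot reduce to a single True or False.

```python
import pandas as pd

# Hypothetical data standing in for the asker's frame
ordini = pd.DataFrame({"revenue1": [100.0, 50.0], "revenue2": [80.0, 0.0]})

# Element-wise comparison: one boolean per row, not a single bool
comparison = ordini["revenue2"] == 0
print(comparison.tolist())

try:
    if comparison:  # Python asks the Series for one truth value here
        pass
    raised = False
except ValueError as err:
    raised = True
    print(err)
```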
Approach #1
Fix your function and don't use apply
def delta_perc(x, y):
    return x.sub(y).div(x).mask(x == 0, 0).mul(100)
ordini["discount"] = delta_perc(ordini["revenue1"], ordini["revenue2"])
Approach #2
Fix your function and use map. This would be similar to using a comprehension.
def delta_perc(x, y):
    if x == 0:
        return 0
    else:
        return (x - y) / x * 100
ordini["discount"] = [*map(delta_perc, ordini["revenue1"], ordini["revenue2"])]
Approach #3
Actually using apply
def delta_perc(x, y):
    if x == 0:
        return 0
    else:
        return (x - y) / x * 100
# Because remember `apply` takes a function that gets a row (or column) passed to it
ordini["discount"] = ordini.apply(
    lambda row: delta_perc(row['revenue1'], row['revenue2']),
    axis=1
)
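As a quick check that the three approaches agree, the sketch below runs them side by side on a small made-up frame (the values are illustrative, including one row with revenue1 equal to 0). The function names delta_perc_scalar and delta_perc_series are introduced here only to keep the two variants apart.

```python
import pandas as pd

# Hypothetical data, including a zero in revenue1 to exercise the guard
ordini = pd.DataFrame({"revenue1": [100.0, 50.0, 0.0],
                       "revenue2": [80.0, 0.0, 10.0]})

def delta_perc_scalar(x, y):
    """Scalar version, used by map and by apply with axis=1."""
    if x == 0:
        return 0
    return (x - y) / x * 100

def delta_perc_series(x, y):
    """Whole-column version: no Python-level loop."""
    return x.sub(y).div(x).mask(x == 0, 0).mul(100)

vectorized = delta_perc_series(ordini["revenue1"], ordini["revenue2"])
mapped = pd.Series(list(map(delta_perc_scalar, ordini["revenue1"], ordini["revenue2"])),
                   index=ordini.index)
applied = ordini.apply(lambda row: delta_perc_scalar(row["revenue1"], row["revenue2"]),
                       axis=1)

print(pd.DataFrame({"vectorized": vectorized, "mapped": mapped, "applied": applied}))
```

All three columns should hold the same discounts; the whole-column version is usually the one to prefer on large frames because it avoids calling a Python function once per row.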
You can also try:
ordini["discount"] = [delta_perc(a,b) for a,b in zip(ordini["revenue1"],ordini["revenue2"])]
You should apply this calculation to entire Series with np.where:
import pandas as pd
import numpy as np
def delta_perc(x, y):
    return np.where(y != 0, (x-y)/x*100, 0)
    # I think you may want when x != 0, since you divide by x:
    #return np.where(x != 0, (x-y)/x*100, 0)
Example:
np.random.seed(12)
df = pd.DataFrame(np.random.randint(0,10,(10,2)))
df['new_col'] = delta_perc(df[0], df[1])
# 0 1 new_col
#0 6 1 83.333333
#1 2 3 -50.000000
#2 3 0 0.000000
#3 6 1 83.333333
#4 4 5 -25.000000
#5 9 2 77.777778
#6 6 0 0.000000
#7 5 8 -60.000000
#8 2 9 -350.000000
#9 3 4 -33.333333
Have you tried adding a lambda inside apply, like this?
ordini["discount"] = ordini.apply(
    lambda x: delta_perc(x["revenue1"], x["revenue2"]), axis=1
)
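Try this if you want to keep the scalar function as written. Note that np.vectorize is a convenience wrapper that is essentially a Python-level loop, so expect it to be slower than a fully vectorized expression such as np.where.
import numpy as np
delta_perc_vec = np.vectorize(delta_perc)
ordini["discount"] = delta_perc_vec(ordini["revenue1"].values, ordini["revenue2"].values)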
I have a requirement that the result value should be a string. But when I calculate the maximum value of dataframe it gives the result as a list.
import pandas as pd
def answer_one():
    df_copy = [df['# Summer'].idxmax()]
    return (df_copy)
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
for col in df.columns:
    if col[:2]=='01':
        df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
    if col[:2]=='02':
        df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
    if col[:2]=='03':
        df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
    if col[:1]=='№':
        df.rename(columns={col:'#'+col[1:]}, inplace=True)
names_ids = df.index.str.split(r'\s\(') # raw string so the regex escapes are not treated as string escapes
df.index = names_ids.str[0] # the [0] element is the country name (new index)
df['ID'] = names_ids.str[1].str[:3] # the [1] element is the abbreviation or ID (take first 3 characters from that)
df = df.drop('Totals')
df.head()
answer_one()
But here answer_one() gives me a list as output and not a string. Can someone help me understand how this can be converted to a string, or how I can get the answer directly from the dataframe as a string? I don't want to convert the list to a string using str(df_copy).
Your first solution would be, as @juanpa.arrivillaga put it, to not wrap it in a list. Your function becomes:
def answer_one():
    df_copy = df['# Summer'].idxmax()
    return (df_copy)
>>> 1
Another thing you might not be expecting: idxmax() returns the index label of the maximum, not the maximum itself. Perhaps you want to do:
def answer_one():
    df_copy = df['# Summer'].max()
    return (df_copy)
>>> 30
Since you don't want to do str(df_copy), you can do df_copy.astype(str) instead, which works because max() returns a NumPy scalar for a numeric column.
Here is how I would write your function:
def get_max_as_string(data, column_name):
    """ Return Max Value from a column as a string."""
    return data[column_name].max().astype(str)
get_max_as_string(df, '# Summer')
>>> '30'
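The difference between idxmax() and max() is easiest to see when the index holds names, as it does after the country names are moved into the index in the question. The sketch below uses a tiny made-up medals frame (the counts are illustrative, not taken from olympics.csv).

```python
import pandas as pd

# Hypothetical stand-in for the cleaned olympics frame: country names as the index
medals = pd.DataFrame({"# Summer": [13, 27, 26]},
                      index=["Afghanistan", "United States", "Australia"])

# idxmax() returns the index label of the largest value, here already a plain string
top_country = medals["# Summer"].idxmax()

# max() returns the value itself; str() turns the NumPy integer into text
top_count = str(medals["# Summer"].max())

print(top_country, top_count)
```

With a string index, no conversion is needed for the idxmax() result; wrapping it in a list is the only thing that made the original output a list.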
This is likely a very simple question but I would appreciate help!
As part of a larger script, I have a dataframe (imported from a csv file) with two columns, 'file_name' and 'value'. I have a short example below:
            file_name  value
0  201623800811s.fits   True
1  201623802491s.fits   True
2  201623802451s.fits  False
I would like to define a function that reads the values within column 'value', and returns 0 for 'False' and 1 for 'True'. I would then like to append the results to a third column in the dataframe, and finally export the updated dataframe to the csv.
I have defined a function that appears to me to work. However, when I run the script it does not execute and I receive the message:
<function convert_string at 0x000000000DE35588>
in the console.
My function is below. Any help or advice will be welcomed.
def convert_string(explosions):
    for i in range(0,len(explosions)):
        if i == 'True' :
            return 1
        elif i == 'False' :
            return 0
        else:
            return 2
print convert_string
If you are using an explicit for loop when working with a dataframe, you are most probably "doing it wrong". Also, what is the point of having a for loop if you return on the very first iteration?
Consider these:
import numpy as np
df['third_column'] = np.where(df['value'], 1, 0)
If you insist on defining a function:
def foo(x):
    return int(x)
df['third_column'] = df['value'].apply(foo)
or simply
df['third_column'] = df['value'].apply(lambda x: int(x))
Full example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'value': [True, False]})
print(df)
# value
# 0 True
# 1 False
df['third_column'] = np.where(df['value'], 1, 0)
print(df)
# value third_column
# 0 True 1
# 1 False 0
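If the column arrives from the CSV as the strings 'True' and 'False' rather than as booleans (an assumption, since the question does not show the dtype), a dictionary lookup with Series.map reproduces the asker's 1 / 0 / 2 scheme. The column name flag and the extra 'maybe' row below are made up for illustration.

```python
import pandas as pd

# Hypothetical data: values stored as text, plus one unexpected entry
df = pd.DataFrame({
    "file_name": ["201623800811s.fits", "201623802491s.fits",
                  "201623802451s.fits", "unknown.fits"],
    "value": ["True", "True", "False", "maybe"],
})

mapping = {"True": 1, "False": 0}

# astype(str) also covers real booleans; unmatched entries become NaN, then 2
df["flag"] = df["value"].astype(str).map(mapping).fillna(2).astype(int)
print(df)

# To write the result back out, df.to_csv(path, index=False) could be used,
# where path is whatever file the data came from.
```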
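You're not calling the function. Your print statement should be: print convert_string(<value>), where <value> is the argument you want to convert. As written, the function calls len() on its argument, so it expects a sequence such as the 'value' column rather than a single value.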