Define function to convert string to integer in Python

This is likely a very simple question but I would appreciate help!
As part of a larger script, I have a dataframe (imported from a csv file) with two columns, 'file_name' and 'value'. I have a short example below:
file_name value
0 201623800811s.fits True
1 201623802491s.fits True
2 201623802451s.fits False
I would like to define a function that reads the values within column 'value', and returns 0 for 'False' and 1 for 'True'. I would then like to append the results to a third column in the dataframe, and finally export the updated dataframe to the csv.
I have defined a function that appears to me to work. However, when I run the script it does not execute and I receive the message:
<function convert_string at 0x000000000DE35588>
In the console.
My function is below. Any help or advice will be welcomed.
def convert_string(explosions):
    for i in range(0,len(explosions)):
        if i == 'True' :
            return 1
        elif i == 'False' :
            return 0
        else:
            return 2
print convert_string

If you are using an explicit for loop when working with a dataframe, you are most probably "doing it wrong". Also, what is the point of having a for loop if you return on the very first iteration?
Consider these:
import numpy as np
df['third_column'] = np.where(df['value'], 1, 0)
If you insist on defining a function:
def foo(x):
    return int(x)
df['third_column'] = df['value'].apply(foo)
or simply
df['third_column'] = df['value'].apply(lambda x: int(x))
Full example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'value': [True, False]})
print(df)
# value
# 0 True
# 1 False
df['third_column'] = np.where(df['value'], 1, 0)
print(df)
# value third_column
# 0 True 1
# 1 False 0
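Note that np.where(df['value'], 1, 0) relies on the column holding real booleans. If 'value' were read in as the strings 'True' and 'False' instead, non-empty strings would generally both count as truthy. A minimal sketch for that case, assuming a string column and using a hypothetical output column name value_int:

```python
import pandas as pd

# Assumption: the 'value' column holds strings, not booleans.
df = pd.DataFrame({
    "file_name": ["201623800811s.fits", "201623802491s.fits", "201623802451s.fits"],
    "value": ["True", "True", "False"],
})

# Map each known string to an integer. Unmapped strings become NaN,
# which is then replaced by 2 (the fallback used in the question).
mapping = {"True": 1, "False": 0}
df["value_int"] = df["value"].map(mapping).fillna(2).astype(int)

print(df["value_int"].tolist())
```

The result could then be written back out with df.to_csv, as the question intends.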

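You're not calling the function, so print shows the function object itself. Your print statement should be: print convert_string(<value>), where <value> is the sequence of values you want to convert (for example, the 'value' column). Note also that inside the loop i is an integer index, so it never equals 'True' or 'False', and the function returns on the first iteration.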
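As a rough illustration of the difference between printing a function and calling it, here is one possible rewrite in Python 3 syntax. It assumes the values are strings and that the function should handle one value at a time; the list explosions is a hypothetical stand-in for the 'value' column.

```python
def convert_string(value):
    """Return 1 for 'True', 0 for 'False', and 2 for anything else."""
    if value == 'True':
        return 1
    elif value == 'False':
        return 0
    return 2

# Hypothetical stand-in for the 'value' column, as strings.
explosions = ['True', 'True', 'False', 'maybe']

# Call the function (parentheses plus an argument) once per element.
converted = [convert_string(v) for v in explosions]
print(converted)  # [1, 1, 0, 2]
```

With a DataFrame, the same function could be passed to df['value'].apply(convert_string).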

Related

Count element in list if it is present in each row of a column. Add to a new column (pandas)

I have a pandas df like this:
MEMBERSHIP
[2022_K_, EWREW_NK]
[333_NFK_,2022_K_, EWREW_NK, 000]
And I have a list of keys:
list_k = ["_K_","_NK_","_NKF_","_KF_"]
I want to add and create a column that count if any of that element is in the column. The desired output is:
MEMBERSHIP | COUNT
[2022_K_, EWREW_NK] | 2
[333_NFK_,2022_K_, EWREW_NK, 000] | 3
Can you help me?
IIUC, you can use the pandas .str accessor methods with a regex:
import pandas as pd
df = pd.DataFrame({'MEMBERSHIP':[['2022_K_', 'EWREW_NK'],
                                 ['333_NFK_','2022_K_', 'EWREW_NK', '000']]})
list_k = ["_K_","_NK","_NFK_","_KF_"] #I changed this list a little
reg = '|'.join(list_k)
df['count'] = df['MEMBERSHIP'].explode().str.contains(reg).groupby(level=0).sum()
print(df)
Output:
MEMBERSHIP count
0 [2022_K_, EWREW_NK] 2
1 [333_NFK_, 2022_K_, EWREW_NK, 000] 3
You can also define a helper function and apply it to each row:
def check(x):
    total=0
    for i in x:
        if type(i) != str: #if value is not string pass.
            pass
        else:
            for j in list_k:
                if j in i:
                    total+=1
    return total
df['count']=df['MEMBERSHIP'].apply(lambda x: check(x))
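The regex approach and the loop approach are not quite equivalent: str.contains counts each list element at most once, while the nested loop adds one for every key found in an element. A small pure-Python sketch of the two counting rules, using the adjusted key list from the first answer; the helper names and the tricky element are hypothetical:

```python
list_k = ["_K_", "_NK", "_NFK_", "_KF_"]  # adjusted key list from the first answer

def count_matching_elements(items):
    """Count elements that contain at least one key (each element counts once)."""
    return sum(any(k in s for k in list_k) for s in items if isinstance(s, str))

def count_key_hits(items):
    """Count every (element, key) match, so one element can count more than once."""
    return sum(k in s for s in items if isinstance(s, str) for k in list_k)

row = ["333_NFK_", "2022_K_", "EWREW_NK", "000"]
print(count_matching_elements(row), count_key_hits(row))  # 3 3

tricky = ["A_K_B_NK"]  # hypothetical element that contains two different keys
print(count_matching_elements(tricky), count_key_hits(tricky))  # 1 2
```

For the sample data both rules give the same counts; they only diverge when a single element contains more than one key.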
I came up with this naive code:
count_row=0
df['Count']= None
for i in df['MEMBERSHIP_SPLIT']:
    count_element=0
    for sub in i:
        for e in list_k:
            if e in sub:
                count_element+=1
    df.loc[count_row, 'Count']=count_element # .loc avoids chained assignment; assumes the default 0..n-1 index
    count_row += 1

Getting `A value is trying to be set on a copy of a slice from a DataFrame.` when setting a column

I know a value should not be set on a view of a pandas dataframe and I'm not doing that but I'm getting this error. I have a function like this:
def do_something(df):
    # id(df) is xxx240
    idx = get_skip_idx(df) # another function that returns a boolean series
    if any(idx):
        df = df[~idx]
        # id(df) is xxx744, df is now a local variable which is a copy of the input argument
        assert not df._is_view # This doesn't fail, I'm not having a view
    df['date_fixed'] = pd.to_datetime(df['old_date'].str[:10], format='%Y-%m-%d')
    # I'm getting the warning here which doesn't make any sense to me
I'm using pandas 1.4.1. This sounds like a bug to me, wanted to confirm I'm not missing anything before filing a ticket.
My understanding is that _is_view can return false negatives and that you are actually working on a view of the original dataframe.
One workaround is to replace df[~idx] with df[~idx].copy():
import pandas as pd
df = pd.DataFrame(
    {
        "value": [1, 2, 3],
        "old_date": ["2022-04-20 abcd", "2022-04-21 efgh", "2022-04-22 ijkl"],
    }
)
def do_something(df, idx):
    if any(idx):
        df = df[~idx].copy()
    df["date_fixed"] = pd.to_datetime(df["old_date"].str[:10], format="%Y-%m-%d")
    return df
print(do_something(df, pd.Series({0: True, 1: False, 2: False})))
# No warning
value old_date date_fixed
1 2 2022-04-21 efgh 2022-04-21
2 3 2022-04-22 ijkl 2022-04-22

Use if statement within .str.find()

I would like to know If I have an if statement that looks something like this:
if int(i) > 10:
    return 0
else:
    return -1
where i is equivalent to a row entry in df["price"] (df is a pandas dataframe) defined as follows:
import pandas as pd
df = pd.DataFrame(columns=["price", "Number"], data=[["10", "07367"], ["20", "08356"], ["9", "07745"]])
how can I use df["price"].str.find(...) together with the above if statement to filter the data by the true condition?
I would like output that looks like the following:
0 -1
1 0
2 -1
I have been struggling with how to implement it, please assist.
Generally it's easiest to first convert to the appropriate dtypes. That way all operations will be quicker, although whether this matters depends on your application. But if things are numbers, let them be numbers (explicit > implicit).
import pandas as pd
df = pd.DataFrame(columns=["price", "Number"], data=[["10", "07367"], ["20", "08356"], ["9", "07745"]])
df['price'] = df['price'].astype(int) # or float
df['Number'] = df['Number'].astype(int) # the column is named 'Number'; note that this drops leading zeros
You can then add your criteria as a column (or just use the output). apply and map are not so quick, so it's better to use the np.where suggested by others, or any other comparison that uses NumPy under the hood. For example:
df['criteria'] = -1 * (df.price <= 10).astype(int) # quicker to not use map or apply
df.criteria
You could use gt + map:
import pandas as pd
df = pd.DataFrame(columns=["price", "Number"], data=[["10", "07367"], ["20", "08356"], ["9", "07745"]])
result = df.price.astype(int).gt(10).map({False: -1, True: 0})
print(result)
Output
0 -1
1 0
2 -1
Name: price, dtype: int64
Or if you prefer, you could use np.where, as mentioned by @coldspeed in the comments.
import numpy as np
import pandas as pd
df = pd.DataFrame(columns=["price", "Number"], data=[["10", "07367"], ["20", "08356"], ["9", "07745"]])
result = np.where(df.price.astype(int) > 10, 0, -1)
print(result)
Output
[-1 0 -1]
You can use np.where:
df['price'] =df['price'].astype(int)
df['output'] = np.where(df['price']>10, 0, -1)
df
price Number output
0 10 07367 -1
1 20 08356 0
2 9 07745 -1
The syntax is: np.where(condition, valueIfTrue, valueIfFalse)
Or you can simply use a lambda function:
df.price.apply(lambda x : 0 if int(x)>10 else -1)
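None of the approaches above use .str.find, and that is probably deliberate: find looks for a substring and returns its position (or -1), which is a different operation from a numeric comparison. A short pure-Python sketch of the contrast, with a plain list standing in for the 'price' column:

```python
prices = ["10", "20", "9"]  # the 'price' values from the question, as plain strings

# str.find searches for a substring and returns its position, or -1 if absent.
positions = [p.find("0") for p in prices]
print(positions)  # [1, 1, -1]

# The rule in the question is a numeric comparison instead.
flags = [0 if int(p) > 10 else -1 for p in prices]
print(flags)  # [-1, 0, -1]
```

The second list matches the desired output in the question; the first only happens to contain a -1 because "9" has no "0" in it.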

pandas dataframe output need to be a string instead of a list

I have a requirement that the result value should be a string. But when I calculate the maximum value of dataframe it gives the result as a list.
import pandas as pd
def answer_one():
    df_copy = [df['# Summer'].idxmax()]
    return (df_copy)
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
for col in df.columns:
    if col[:2]=='01':
        df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
    if col[:2]=='02':
        df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
    if col[:2]=='03':
        df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
    if col[:1]=='№':
        df.rename(columns={col:'#'+col[1:]}, inplace=True)
names_ids = df.index.str.split(r'\s\(') # raw string so the regex escapes are passed through unchanged
df.index = names_ids.str[0] # the [0] element is the country name (new index)
df['ID'] = names_ids.str[1].str[:3] # the [1] element is the abbreviation or ID (take first 3 characters from that)
df = df.drop('Totals')
df.head()
answer_one()
But here answer_one() gives me a list as output and not a string. Can someone help me understand how this can be converted to a string, or how I can get the answer directly from the dataframe as a string? I don't want to convert the list to a string using str(df_copy).
Your first solution would be, as @juanpa.arrivillaga put it, to not wrap it in a list. Your function becomes:
def answer_one():
    df_copy = df['# Summer'].idxmax()
    return (df_copy)
>>> 1
Another thing you might not be expecting: idxmax() returns the index label of the maximum, not the maximum itself. Perhaps you want:
def answer_one():
    df_copy = df['# Summer'].max()
    return (df_copy)
>>> 30
Since you don't want to do str(df_copy) you can do df_copy.astype(str) instead.
Here is how I would write your function:
def get_max_as_string(data, column_name):
    """ Return Max Value from a column as a string."""
    return data[column_name].max().astype(str)
get_max_as_string(df, '# Summer')
>>> '30'
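Whether idxmax() already gives you a string depends on the index: it returns the index label, so with an index of country names the result is a str without any conversion. A small sketch with made-up numbers (the series below is hypothetical, not the real olympics.csv data):

```python
import pandas as pd

# Hypothetical stand-in for the '# Summer' column, indexed by country name.
summer = pd.Series(
    [13, 26, 12],
    index=["Afghanistan", "United States", "Algeria"],
    name="# Summer",
)

label = summer.idxmax()        # index label of the row with the largest value
largest = str(summer.max())    # the largest value itself, converted to text

print(label, largest)
```

If the index were numeric instead, idxmax() would return a number and would need an explicit conversion.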

How to pass a function as a parameter to another function?

I'm doing a bunch of operations on pandas dataframes. For example finding max, min and average inside columns and return the column names in a new column. Now I'm trying to wrap these things into a function, and use max() and/or min() as arguments in this function.
Below is a snippet that describes what I'm trying to do in a very simplified way. In its current state it also returns a description of the desired output. The snippet does not have the desired functionality and flexibility though.
The setup:
import pandas as pd
# Sample dataframe
df = pd.DataFrame({'col_A':[1,20,6,1,3]})
def findValue(function, df, colname):
    print(function) # just a placeholder
    df[colname] = df.max()[0]
    return df
df2 = findValue(function='max', df=df, colname='col_B')
print(df)
Output 1:
col_A col_B
0 1 20
1 20 20
2 6 20
3 1 20
4 3 20
A naive attempt:
# Sample dataframe
df = pd.DataFrame({'col_A':[1,20,6,1,3]})
# The function I would like to use in another function is max()
# My function
def findValue(function, df, colname):
    df[colname] = df.function()[0]
    return df
df2 = findValue(function=max(), df=df , colname='col_B')
print(df)
Output 2:
Traceback (most recent call last):
File "<ipython-input-7-85964ff29e69>", line 1, in <module>
df2 = findValue(function=max(), df=df , colname='col_B')
TypeError: max expected 1 arguments, got 0
How can I change the above snippet so that I can change function = max() to function = min() or any other function in the arguments of findValue()? Or even define a list of functions to be used in a similar manner?
Thank you for any suggestions!
You are very, very close. You pretty much just need to remove the parentheses when passing in the function, so that you pass the function object itself rather than the result of calling it. Here's a simplified example that loops over a tuple of functions, and appears to do what you want:
def findValue(func, x, y):
    return func(x, y)
for calc in (max, min):
    result = findValue(func=calc, x=1, y=10)
    print(result)
Output:
10
1
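The same idea carries over to functions that take a single collection, which is closer to the column case in the question. A minimal sketch, assuming a plain list stands in for col_A and using a hypothetical helper find_value:

```python
from statistics import mean

def find_value(func, values):
    """Apply the callable `func` to `values` and return the result."""
    return func(values)

col_a = [1, 20, 6, 1, 3]  # same numbers as col_A in the question

# Pass the function objects themselves (no parentheses after the name).
results = {f.__name__: find_value(f, col_a) for f in (max, min, mean)}
print(results)
```

With a DataFrame, Series.agg accepts a callable or a name such as 'max' in a similar way, for example df['col_A'].agg('max').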
