Use if statement within .str.find() - python-3.x

I have an if statement that looks something like this:
if int(i) > 10:
    return 0
else:
    return -1
where i is a row entry in df["price"] (df is a pandas DataFrame) defined as follows:
import pandas as pd
df = pd.DataFrame(columns=["price", "Number"], data=[["10", "07367"], ["20", "08356"], ["9", "07745"]])
How can I use df["price"].str.find(...) together with the above if statement to filter the data by the true condition?
I would like output that looks like the following:
0 -1
1 0
2 -1
I have been struggling with how to implement it, please assist.

Generally it's easiest to first convert the columns to the most suitable dtypes. That way all operations will be quicker, although whether this matters depends on your application. But if things are numbers, let them be numbers (explicit > implicit).
import pandas as pd
df = pd.DataFrame(columns=["price", "Number"], data=[["10", "07367"], ["20", "08356"], ["9", "07745"]])
df['price'] = df['price'].astype(int) # or float
df['Number'] = df['Number'].astype(int) # the column is named "Number"; note that this drops the leading zeros
You can then add your criteria as a column (or just use the output). apply and map are not so quick, so it's better to use the np.where suggested by others, or any other comparison that uses numpy under the hood. For example:
df['criteria'] = -1 * (df.price <= 10).astype(int) # quicker to not use map or apply
df.criteria
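As a minimal, self-contained sketch of the approach above (assuming the same three-row frame from the question), the boolean comparison can be turned into -1/0 with plain arithmetic. The sketch leaves Number as a string, on the assumption that its leading zeros matter:

```python
import pandas as pd

df = pd.DataFrame(
    columns=["price", "Number"],
    data=[["10", "07367"], ["20", "08356"], ["9", "07745"]],
)

# Convert the price strings to integers once, up front.
df["price"] = df["price"].astype(int)

# (price <= 10) is a boolean Series; as int it is 1/0, so -1 * it is -1/0.
df["criteria"] = -1 * (df["price"] <= 10).astype(int)

print(df["criteria"].tolist())  # [-1, 0, -1]

# Converting "Number" to int would silently drop the leading zero.
print(int(df["Number"].iloc[0]))  # 7367
```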

You could use gt + map:
import pandas as pd
df = pd.DataFrame(columns=["price", "Number"], data=[["10", "07367"], ["20", "08356"], ["9", "07745"]])
result = df.price.astype(int).gt(10).map({False: -1, True: 0})
print(result)
Output
0 -1
1 0
2 -1
Name: price, dtype: int64
Or, if you prefer, you could use np.where, as mentioned by @coldspeed in the comments.
import numpy as np
import pandas as pd
df = pd.DataFrame(columns=["price", "Number"], data=[["10", "07367"], ["20", "08356"], ["9", "07745"]])
result = np.where(df.price.astype(int) > 10, 0, -1)
print(result)
Output
[-1 0 -1]

You can use np.where:
df['price'] = df['price'].astype(int)
df['output'] = np.where(df['price'] > 10, 0, -1)
df
   price Number  output
0     10  07367      -1
1     20  08356       0
2      9  07745      -1
The syntax is: np.where(condition, valueIfTrue, valueIfFalse)

You can simply use a lambda function with apply:
df.price.apply(lambda x : 0 if int(x)>10 else -1)
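For comparison, here is a small sketch that runs the three approaches from these answers (apply with a lambda, np.where, and gt + map) on the question's data and checks that they agree. The variable names via_apply, via_where and via_map are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    columns=["price", "Number"],
    data=[["10", "07367"], ["20", "08356"], ["9", "07745"]],
)
price = df["price"].astype(int)

# Row-by-row Python function: simple, but the slowest on large frames.
via_apply = df["price"].apply(lambda x: 0 if int(x) > 10 else -1)

# Vectorised: np.where(condition, value_if_true, value_if_false).
via_where = pd.Series(np.where(price > 10, 0, -1), index=df.index)

# Vectorised comparison followed by a dictionary lookup.
via_map = price.gt(10).map({False: -1, True: 0})

print(via_apply.tolist(), via_where.tolist(), via_map.tolist())
```

All three give -1, 0, -1 for the prices 10, 20 and 9.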

Related

Replacing a string value in Python

I have a column named "status" full of string values either "legitimate" or "phishing". I'm trying to convert them into a 0 for "legitimate" or 1 for "phishing". Currently my approach is to replace "legitimate" with a string value of "0", and "phishing" with a string value of "1", then convert the strings "0" and "1" to the int values 0 and 1. I'm getting the error:
TypeError: '(0, status legitimate
Name: 0, dtype: object)' is an invalid key
with the following code, what am I doing wrong?
df2 = pd.read_csv('dataset_phishing.csv', usecols=[87], dtype=str)
leg = 'legitimate'
phi = 'phishing'
for i in df2.iterrows():
    if df2[i] == leg:
        df2[i].replace('legitimate', '0')
    else if df2[i] == phi:
        df2[i].replace('phishing', '1')
Here, iterrows() yields (index, row) tuples, and a tuple can't be used as a key to index the DataFrame; that's why you get that error. Here is a simple solution:
import pandas as pd
df2=pd.DataFrame([["legitimate"],["phishing"]],columns=["status"])
leg = 'legitimate'
phi = 'phishing'
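for i in range(len(df2)):
    # Assign through a single .loc call; chained df2.iloc[i]["status"] = ... may write to a temporary copy.
    # This relies on the default integer index, where the label i equals the position i.
    df2.loc[i, "status"] = '1' if df2.loc[i, "status"] == phi else '0'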
print(df2)
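To see why the original loop raised the TypeError, this short sketch (using a two-row stand-in for the CSV data) shows what iterrows() actually yields:

```python
import pandas as pd

df2 = pd.DataFrame({"status": ["legitimate", "phishing"]})

# Each item from iterrows() is a tuple: (index label, row as a Series).
first = next(df2.iterrows())
label, row = first

print(type(first).__name__)  # tuple
print(label)                 # 0
print(row["status"])         # legitimate
```

Passing that whole tuple to df2[...] is what produces the "is an invalid key" message; the row's value is reached through the second element instead.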
Here is a more Pythonic way to do this:
import pandas as pd
import numpy as np
df2=pd.DataFrame([["legitimate"],["phishing"]],columns=["status"])
leg = 'legitimate'
phi = 'phishing'
df2["status"]=np.where(df2["status"]==phi,'1','0')
print(df2)
Hope this helps you
Here is another way to do this
import pandas as pd
import numpy as np
data = {'status': ["legitimate", "phishing"]}
df = pd.DataFrame(data)
leg = 'legitimate'
phi = 'phishing'
df.loc[df["status"] == leg, "status"] = 0
df.loc[df["status"] == phi, "status"] = 1
print(df)
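Since the goal in the question is integer 0/1 values, another option is a dictionary lookup with Series.map. This is only a sketch on a small stand-in frame; it assumes the column contains nothing but the two labels, because any other value would become NaN:

```python
import pandas as pd

df = pd.DataFrame({"status": ["legitimate", "phishing", "legitimate"]})

# Map each label straight to its integer code in one vectorised step.
codes = {"legitimate": 0, "phishing": 1}
df["status"] = df["status"].map(codes)

print(df["status"].tolist())  # [0, 1, 0]
```

This skips the intermediate "0"/"1" strings entirely, so no second conversion step is needed.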

Apply a function to multiple rows of a pandas DataFrame

I'm trying to apply a function to different readings of each measure. Is it possible to do it without transforming the dataframe?
import random
import pandas as pd
df = pd.DataFrame({
'index': sorted(['A', 'B']*3),
'measure': [i for i in range(0,3)]*2,
'reading': [random.random() for i in range(0,6)]
})
  index  measure   reading
0     A        0  0.260492
1     A        1  0.805028
2     A        2  0.548699
3     B        0  0.014042
4     B        1  0.719705
5     B        2  0.398824
How can I apply a function, such as a basic difference, to different readings for each index?
In this example the function is applied to the readings for measures 0 and 1. The pair of measures should be a parameter of the call, as I need to calculate it for different values of measure.
Desired output looks like this:
index applied
0 A 0.5445359999999999
1 B 0.705663
Try this
import random
import pandas as pd
import numpy as np
df = pd.DataFrame({
'index': sorted(['A', 'B']*3),
'measure': [i for i in range(0,3)]*2,
'reading': [random.random() for i in range(0,6)]
})
print(df)
# index measure reading
# 0 A 0 0.869707
# 1 A 1 0.120680
# 2 A 2 0.772035
# 3 B 0 0.565548
# 4 B 1 0.577074
# 5 B 2 0.290668
start = 0
stop = 1
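# start and stop are specified separately. np.diff() gives the consecutive differences
# within each group, and np.sum() over the slice [start:stop] adds them up (a signed
# difference, not an absolute one). If stop - start is always 1, the slice holds a single
# value, so you can index np.diff(x)[start] directly instead of calling np.sum().
df = df.groupby('index').agg(applied=('reading', lambda x: np.sum(np.diff(x)[start:stop])))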
print(df)
# applied
# index
# A -0.749027
# B 0.011526
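As an alternative sketch, the two measures can be selected by value rather than by position, which does not depend on the row order within each group. The helper name reading_diff and its start/stop parameters are hypothetical, and the readings are the fixed values from the question so that the result is reproducible:

```python
import pandas as pd

df = pd.DataFrame({
    "index": ["A", "A", "A", "B", "B", "B"],
    "measure": [0, 1, 2, 0, 1, 2],
    "reading": [0.260492, 0.805028, 0.548699, 0.014042, 0.719705, 0.398824],
})

def reading_diff(frame, start, stop):
    """Return reading(stop) - reading(start) for each value of 'index'."""
    first = frame.loc[frame["measure"] == start].set_index("index")["reading"]
    second = frame.loc[frame["measure"] == stop].set_index("index")["reading"]
    # Subtraction aligns on the 'index' labels, so A pairs with A and B with B.
    return (second - first).rename("applied").reset_index()

result = reading_diff(df, start=0, stop=1)
print(result)
```

With these readings the result is about 0.544536 for A and 0.705663 for B, matching the desired output in the question.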

Resample time series after removing top x percentile data

I have hourly time series data (say df with date/time and value columns) where I want to:
Step 1: Remove the top 5 percentile of each day
Step 2: Get the max(Step 1) for each day
Step 3: Get the mean(Step 2) for each month
Here is what I have tried to implement the above logic:
step_1 = df.resample('D').apply(lambda x: x<x.quantile(0.95))
step_2 = step_1.resample('D').max()
step_3 = step_2.resample('M').mean()
Even though I do not get any code error, the generated output is different to the expected result based on the above 3 steps (I always get a constant value)
Any help will be appreciated.
You are almost there. Your step_1 is a series of booleans with the same index as the original data, you can use it to filter your DataFrame, thus:
step_1 = df.resample('D').apply(lambda x: x<x.quantile(0.95))
step_2 = df[step_1].resample('D').max()
step_3 = step_2.resample('M').mean()
Your first step is a boolean mask, so you need to add an additional step:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(1000), index=pd.date_range(start='1/1/2019', periods=1000, freq='H'), columns=['my_data'])
mask = df.resample('D').apply(lambda x: x < x.quantile(.95))
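step_1 = df[mask]  # keep only values below each day's 95th percentile; the rest become NaN
step_2 = step_1.resample('D').max()  # chain from step_1, not from the unfiltered df
step_3 = step_2.resample('M').mean()  # chain from step_2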
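The three steps can also be sketched with groupby/transform in place of resample().apply(), which may be easier to reason about because the per-day threshold is computed explicitly. This is an assumption-laden sketch, not the answerers' method: the data is random, the Series name "value" is made up, and "MS" labels each month by its first day:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2019-01-01", periods=24 * 60, freq=pd.Timedelta(hours=1))
s = pd.Series(rng.standard_normal(len(idx)), index=idx, name="value")

# Step 1: per-day 95th percentile, broadcast back to every hourly row,
# then keep only the values strictly below that day's threshold.
day = s.index.normalize()
threshold = s.groupby(day).transform(lambda x: x.quantile(0.95))
kept = s[s < threshold]

# Step 2: daily maximum of the values that survived step 1.
daily_max = kept.resample("D").max()

# Step 3: monthly mean of those daily maxima.
monthly_mean = daily_max.resample("MS").mean()

print(monthly_mean)
```

Each stage consumes the output of the previous one, which is the chaining that the original attempt was missing.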

Define function to convert string to integer in Python

This is likely a very simple question but I would appreciate help!
As part of a larger script, I have a dataframe (imported from a csv file) with two columns, 'file_name' and 'value'. I have a short example below:
file_name value
0 201623800811s.fits True
1 201623802491s.fits True
2 201623802451s.fits False
I would like to define a function that reads the values within column 'value', and returns 0 for 'False' and 1 for 'True'. I would then like to append the results to a third column in the dataframe, and finally export the updated dataframe to the csv.
I have defined a function that appears to me to work. However, when I run the script it does not execute, and I receive the following message in the console:
<function convert_string at 0x000000000DE35588>
My function is below. Any help or advice will be welcomed.
def convert_string(explosions):
    for i in range(0,len(explosions)):
        if i == 'True' :
            return 1
        elif i == 'False' :
            return 0
        else:
            return 2
print convert_string
If you are using an explicit for loop when working with a dataframe, you are most probably "doing it wrong". Also, what is the point of having a for loop if you return on the very first iteration?
Consider these:
import numpy as np
df['third_column'] = np.where(df['value'], 1, 0)
If you insist on defining a function:
def foo(x):
    return int(x)
df['third_column'] = df['value'].apply(foo)
or simply
df['third_column'] = df['value'].apply(lambda x: int(x))
Full example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'value': [True, False]})
print(df)
# value
# 0 True
# 1 False
df['third_column'] = np.where(df['value'], 1, 0)
print(df)
# value third_column
# 0 True 1
# 1 False 0
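You're not calling the function. Your print statement should be: print convert_string(<value>), where <value> is the object you want to convert. Note that the function calls len() on its argument, so it expects a sequence (such as the 'value' column) rather than an integer.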
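Putting the answers together, here is a minimal sketch of a per-value function applied to the column, with the result added as a third column and written out as CSV text. The column name "converted" is an assumption, the frame stands in for the imported CSV, and the function keeps the question's fallback of 2 for anything that is neither True nor False:

```python
import pandas as pd

def convert_string(value):
    """Return 1 for True, 0 for False, and 2 for anything else."""
    text = str(value)  # works for both booleans and the strings 'True'/'False'
    if text == "True":
        return 1
    if text == "False":
        return 0
    return 2

df = pd.DataFrame({
    "file_name": ["201623800811s.fits", "201623802491s.fits", "201623802451s.fits"],
    "value": [True, True, False],
})

# Call the function once per row entry; printing the bare name would only
# show the function object, as in the question.
df["converted"] = df["value"].apply(convert_string)

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```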

Python Pandas: bootstrap confidence limits by row rather than entire dataframe

What I am trying to do is get bootstrap confidence limits by row, regardless of the number of rows, and make a new dataframe from the output. I can currently do this for the entire dataframe, but not by row. The data I have in my actual program looks similar to what I have below:
0 1 2
0 1 2 3
1 4 1 4
2 1 2 3
3 4 1 4
I want the new dataframe to look something like this with the lower and upper confidence limits:
0 1
0 1 2
1 1 5.5
2 1 4.5
3 1 4.2
The current generated output looks like this:
0 1
0 2.0 2.75
The Python 3 code below generates a mock dataframe and computes the bootstrap confidence limits for the entire dataframe. The result is a new dataframe with just 2 values, an upper and a lower confidence limit, rather than 4 sets of 2 (one for each row).
import pandas as pd
import numpy as np
import scikits.bootstrap as sci
zz = pd.DataFrame([[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]],
[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]]])
print(zz)
x= zz.dtypes
print(x)
a = pd.DataFrame(np.array(zz.values.tolist())[:, :, 0],zz.index, zz.columns)
print(a)
b = sci.ci(a)
b = pd.DataFrame(b)
b = b.T
print(b)
Thank you for any help.
scikits.bootstrap operates by assuming that data samples are arranged by row, not by column. If you want the opposite behavior, just use the transpose, and a statfunction that doesn't combine columns.
import pandas as pd
import numpy as np
import scikits.bootstrap as sci
zz = pd.DataFrame([[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]],
[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]]])
print(zz)
x= zz.dtypes
print(x)
a = pd.DataFrame(np.array(zz.values.tolist())[:, :, 0],zz.index, zz.columns)
print(a)
b = sci.ci(a.T, statfunction=lambda x: np.average(x, axis=0))
print(b.T)
Below is the answer I ended up figuring out to create bootstrap ci by row.
import pandas as pd
import numpy as np
import numpy.random as npr
zz = pd.DataFrame([[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]],
[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]]])
x= zz.dtypes
a = pd.DataFrame(np.array(zz.values.tolist())[:, :, 0],zz.index, zz.columns)
print(a)
def bootstrap(data, num_samples, statistic, alpha):
    n = len(data)
    idx = npr.randint(0, n, (num_samples, n))
    samples = data[idx]
    stat = np.sort(statistic(samples, 1))
    return (stat[int((alpha/2.0)*num_samples)],
            stat[int((1-alpha/2.0)*num_samples)])
cc = list(a.index.values) # informs generator of the number of rows
def bootbyrow(cc):
    for xx in range(1):
        xx = list(a.index.values)
        for xx in range(len(cc)):
            k = a.apply(lambda y: y[xx])
            k = k.values
            for xx in range(1):
                kk = list(bootstrap(k,10000,np.mean,0.05))
                yield list(kk)
abc = pd.DataFrame(list(bootbyrow(cc))) #bootstrap ci by row
# the next 4 lines just show that it's working correctly
a0 = bootstrap((a.loc[0,].values),10000,np.mean,0.05)
a1 = bootstrap((a.loc[1,].values),10000,np.mean,0.05)
a2 = bootstrap((a.loc[2,].values),10000,np.mean,0.05)
a3 = bootstrap((a.loc[3,].values),10000,np.mean,0.05)
print(abc)
print(a0)
print(a1)
print(a2)
print(a3)
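The same row-by-row idea can be sketched more compactly by looping over the rows directly. This is a simplified variant of the percentile bootstrap above, not a replacement for scikits.bootstrap: the function name bootstrap_ci, the rng parameter, and the "lower"/"upper" column names are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

def bootstrap_ci(data, num_samples=10000, statistic=np.mean, alpha=0.05, rng=None):
    """Percentile bootstrap interval for one 1-D sample."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data)
    # Resample with replacement: one row of indices per bootstrap replicate.
    idx = rng.integers(0, len(data), size=(num_samples, len(data)))
    stats = np.sort(statistic(data[idx], axis=1))
    lower = stats[int((alpha / 2.0) * num_samples)]
    upper = stats[int((1 - alpha / 2.0) * num_samples)]
    return lower, upper

# Same values as the frame `a` built in the question.
a = pd.DataFrame([[1, 2, 3], [4, 1, 4], [1, 2, 3], [4, 1, 4]])

rng = np.random.default_rng(0)
rows = [bootstrap_ci(row.to_numpy(), rng=rng) for _, row in a.iterrows()]
ci = pd.DataFrame(rows, columns=["lower", "upper"], index=a.index)
print(ci)
```

The result has one (lower, upper) pair per row of a, which is the shape the question asks for. With only three observations per row, the intervals are necessarily coarse.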
