I have a column named "status" full of string values either "legitimate" or "phishing". I'm trying to convert them into a 0 for "legitimate" or 1 for "phishing". Currently my approach is to replace "legitimate" with a string value of "0", and "phishing" with a string value of "1", then convert the strings "0" and "1" to the int values 0 and 1. I'm getting the error:
TypeError: '(0, status legitimate
Name: 0, dtype: object)' is an invalid key
with the following code, what am I doing wrong?
df2 = pd.read_csv('dataset_phishing.csv', usecols=[87], dtype=str)
leg = 'legitimate'
phi = 'phishing'
for i in df2.iterrows():
    if df2[i] == leg:
        df2[i].replace('legitimate', '0')
    elif df2[i] == phi:
        df2[i].replace('phishing', '1')
Here iterrows() gives you a tuple, which can't be used as an index; that's why you get that error. Here is a simple solution:
import pandas as pd
df2=pd.DataFrame([["legitimate"],["phishing"]],columns=["status"])
leg = 'legitimate'
phi = 'phishing'
for i in range(len(df2)):
    # assign with .loc rather than chained indexing, which would not update df2;
    # this works here because the example frame has a default integer index
    df2.loc[i, "status"] = '1' if df2.loc[i, "status"] == phi else '0'
print(df2)
Here is a more pythonic way to do this:
import pandas as pd
import numpy as np
df2=pd.DataFrame([["legitimate"],["phishing"]],columns=["status"])
leg = 'legitimate'
phi = 'phishing'
df2["status"]=np.where(df2["status"]==phi,'1','0')
print(df2)
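Since the stated goal was the integers 0 and 1 rather than the strings '0' and '1', you can cast the column afterwards (or simply pass 1 and 0 instead of '1' and '0' to np.where above). Continuing from the snippet above:
df2["status"] = df2["status"].astype(int)
print(df2.dtypes)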
Hope this helps you
Here is another way to do this:
import pandas as pd
import numpy as np
data = {'status': ["legitimate", "phishing"]}
df = pd.DataFrame(data)
leg = 'legitimate'
phi = 'phishing'
df.loc[df["status"] == leg, "status"] = 0
df.loc[df["status"] == phi, "status"] = 1
print(df)
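As an alternative to the two .loc assignments, Series.map with a dictionary does the same mapping in one step (a sketch applied to the original string column, not run after the assignments above):
df["status"] = df["status"].map({leg: 0, phi: 1})
print(df)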
I'm trying to apply a function to different readings of each measure. Is it possible to do it without transforming the dataframe?
import random
import pandas as pd
df = pd.DataFrame({
    'index': sorted(['A', 'B']*3),
    'measure': [i for i in range(0,3)]*2,
    'reading': [random.random() for i in range(0,6)]
})
index measure reading
0 A 0 0.260492
1 A 1 0.805028
2 A 2 0.548699
3 B 0 0.014042
4 B 1 0.719705
5 B 2 0.398824
How can I apply a function like basic difference to different readings for each index?
Here I assumed the function is applied to readings 0 and 1. The measure values should be part of the call, since I need to calculate it for different values of measure.
Desired output looks like this:
index applied
0 A 0.5445359999999999
1 B 0.705663
Try this:
import random
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'index': sorted(['A', 'B']*3),
    'measure': [i for i in range(0,3)]*2,
    'reading': [random.random() for i in range(0,6)]
})
print(df)
# index measure reading
# 0 A 0 0.869707
# 1 A 1 0.120680
# 2 A 2 0.772035
# 3 B 0 0.565548
# 4 B 1 0.577074
# 5 B 2 0.290668
start = 0
stop = 1
# I decided to specify the start and stop values separately; the overall change between
# reading `start` and reading `stop` is obtained by summing the consecutive differences
# with np.sum(). If stop - start is always 1, you can omit the np.sum() call.
df = df.groupby('index').agg(applied=('reading', lambda x: np.sum(np.diff(x)[start:stop])))
print(df)
# applied
# index
# A -0.749027
# B 0.011526
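If you want the result in the same shape as the desired output above, with index as an ordinary column instead of the group index, reset the index afterwards:
df = df.reset_index()
print(df)
#   index   applied
# 0     A -0.749027
# 1     B  0.011526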
I have an hourly time series data (say df with date/time and value columns) where I want to:
Step 1: Remove the top 5 percentile of each day
Step 2: Get the max(Step 1) for each day
Step 3: Get the mean(Step 2) for each month
Here is what I have tried to implement the above logic:
step_1 = df.resample('D').apply(lambda x: x<x.quantile(0.95))
step_2 = step_1.resample('D').max()
step_3 = step_2.resample('M').mean()
Even though I do not get any code error, the generated output is different from the expected result based on the above 3 steps (I always get a constant value).
Any help will be appreciated.
You are almost there. Your step_1 is a series of booleans with the same index as the original data; you can use it to filter your DataFrame, thus:
step_1 = df.resample('D').apply(lambda x: x<x.quantile(0.95))
step_2 = df[step_1].resample('D').max()
step_3 = step_2.resample('M').mean()
Your first step is a boolean mask, so you need to add an additional step:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(1000), index=pd.date_range(start='1/1/2019', periods=1000, freq='H'), columns=['my_data'])
mask = df.resample('D').apply(lambda x: x < x.quantile(.95))
step_1 = df[mask]
step_2 = step_1.resample('D').max()
step_3 = step_2.resample('M').mean()
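For reference, here is a sketch of the same three steps using a per-day groupby/transform for the quantile filter (assuming, as in the example above, a DatetimeIndex and a single column called my_data):
daily_q95 = df.groupby(df.index.floor('D'))['my_data'].transform(lambda x: x.quantile(0.95))
step_1 = df[df['my_data'] < daily_q95]   # drop the top 5% of each day
step_2 = step_1.resample('D').max()      # daily max of what remains
step_3 = step_2.resample('M').mean()     # monthly mean of the daily maxima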
This is likely a very simple question but I would appreciate help!
As part of a larger script, I have a dataframe (imported from a csv file) with two columns, 'file_name' and 'value'. I have a short example below:
file_name value
0 201623800811s.fits True
1 201623802491s.fits True
2 201623802451s.fits False
I would like to define a function that reads the values within column 'value', and returns 0 for 'False' and 1 for 'True'. I would then like to append the results to a third column in the dataframe, and finally export the updated dataframe to the csv.
I have defined a function that appears to me to work. However, when I run the script it does not execute and I receive the message:
<function convert_string at 0x000000000DE35588>
In the console.
My function is below. Any help or advice will be welcomed.
def convert_string(explosions):
    for i in range(0,len(explosions)):
        if i == 'True' :
            return 1
        elif i == 'False' :
            return 0
        else:
            return 2
print convert_string
If you are using an explicit for loop when working with a dataframe, you are most probably "doing it wrong". Also, what is the point of having a for loop if you return on the very first iteration?
Consider these:
import numpy as np
df['third_column'] = np.where(df['value'], 1, 0)
If you insist on defining a function:
def foo(x):
    return int(x)
df['third_column'] = df['value'].apply(foo)
or simply
df['third_column'] = df['value'].apply(lambda x: int(x))
Full example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'value': [True, False]})
print(df)
# value
# 0 True
# 1 False
df['third_column'] = np.where(df['value'], 1, 0)
print(df)
# value third_column
# 0 True 1
# 1 False 0
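The question also asks about exporting the updated dataframe back to csv; to_csv covers that (the file name here is just a placeholder):
df.to_csv('output.csv', index=False)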
You're not calling the function. Your print statement should be: print convert_string(<value>), where <value> is the column (or other sequence of values) you want to convert.
What I am trying to do is to get bootstrap confidence limits by row, regardless of the number of rows, and make a new dataframe from the output. I currently can do this for the entire dataframe, but not by row. The data I have in my actual program looks similar to what I have below:
0 1 2
0 1 2 3
1 4 1 4
2 1 2 3
3 4 1 4
I want the new dataframe to look something like this with the lower and upper confidence limits:
0 1
0 1 2
1 1 5.5
2 1 4.5
3 1 4.2
The current generated output looks like this:
0 1
0 2.0 2.75
The Python 3 code below generates a mock dataframe and generates the bootstrap confidence limits for the entire dataframe. The result is a new dataframe with just 2 values, an upper and a lower confidence limit, rather than 4 sets of 2 (one for each row).
import pandas as pd
import numpy as np
import scikits.bootstrap as sci
zz = pd.DataFrame([[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]],
[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]]])
print(zz)
x= zz.dtypes
print(x)
a = pd.DataFrame(np.array(zz.values.tolist())[:, :, 0],zz.index, zz.columns)
print(a)
b = sci.ci(a)
b = pd.DataFrame(b)
b = b.T
print(b)
Thank you for any help.
scikits.bootstrap operates by assuming that data samples are arranged by row, not by column. If you want the opposite behavior, just use the transpose, and a statfunction that doesn't combine columns.
import pandas as pd
import numpy as np
import scikits.bootstrap as sci
zz = pd.DataFrame([[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]],
[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]]])
print(zz)
x= zz.dtypes
print(x)
a = pd.DataFrame(np.array(zz.values.tolist())[:, :, 0],zz.index, zz.columns)
print(a)
b = sci.ci(a.T, statfunction=lambda x: np.average(x, axis=0))
print(b.T)
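Here b.T has one row per original row, with the lower and upper limits as its two values, so if you want a labelled DataFrame like the desired output you can wrap it much as the question's own code does, e.g.:
ci_df = pd.DataFrame(b.T, columns=['lower', 'upper'])
print(ci_df)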
Below is the answer I ended up figuring out to create bootstrap CIs by row.
import pandas as pd
import numpy as np
import numpy.random as npr
zz = pd.DataFrame([[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]],
[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]]])
x= zz.dtypes
a = pd.DataFrame(np.array(zz.values.tolist())[:, :, 0],zz.index, zz.columns)
print(a)
def bootstrap(data, num_samples, statistic, alpha):
    n = len(data)
    idx = npr.randint(0, n, (num_samples, n))
    samples = data[idx]
    stat = np.sort(statistic(samples, 1))
    return (stat[int((alpha/2.0)*num_samples)],
            stat[int((1-alpha/2.0)*num_samples)])
cc = list(a.index.values) # informs generator of the number of rows
def bootbyrow(cc):
    # yield the bootstrap confidence interval for each row of a
    for xx in range(len(cc)):
        k = a.apply(lambda y: y[xx])  # row xx of a, as a Series
        k = k.values
        yield list(bootstrap(k, 10000, np.mean, 0.05))
abc = pd.DataFrame(list(bootbyrow(cc))) #bootstrap ci by row
# the next 4 lines just show that it's working correctly
a0 = bootstrap((a.loc[0,].values),10000,np.mean,0.05)
a1 = bootstrap((a.loc[1,].values),10000,np.mean,0.05)
a2 = bootstrap((a.loc[2,].values),10000,np.mean,0.05)
a3 = bootstrap((a.loc[3,].values),10000,np.mean,0.05)
print(abc)
print(a0)
print(a1)
print(a2)
print(a3)