I have a dataframe for which I want to create a new column called "result". It should take the value "refuse" if the value of the column "mean" is less than 10; otherwise it should take the value "Admitted".
Here you go:
import numpy as np
# Option 1: np.where picks a label per row
data['result'] = np.where(data['mean'] < 10, 'refuse', 'Admitted')
# Option 2: assign through boolean masks with .loc
data.loc[(data['mean'] < 10), 'result'] = 'refuse'
data.loc[(data['mean'] >= 10), 'result'] = 'Admitted'
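To check that both approaches give the same labels, here is a minimal sketch on a made-up four-row frame. The sample values and the helper column name `result_loc` are assumptions for illustration only, not part of the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column name "mean" comes from the question.
data = pd.DataFrame({"mean": [4.5, 10.0, 15.2, 9.9]})

# Vectorised choice: "refuse" where mean < 10, "Admitted" everywhere else.
data["result"] = np.where(data["mean"] < 10, "refuse", "Admitted")

# Same outcome with .loc: start from the default label, then overwrite the masked rows.
data["result_loc"] = "Admitted"
data.loc[data["mean"] < 10, "result_loc"] = "refuse"

print(data)
```

One difference worth keeping in mind: a missing value in "mean" compares as False, so np.where labels it "Admitted", whereas the two-mask .loc version in the answer would leave that row unset.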
I'm trying to select the values greater than or equal to 50 in a column and, where that is true, set another column to "Yes". I understand how to do it without an if condition:
df3.loc[df3['Text_Count'] >= 50, 'big'] = "Yes"
However, I need to do it with an if condition.
I tried this, but nothing changes after running the code:
for index, row in df3.iterrows():
    if [row['Text_Count'] >= 50] is True:
        row['big'] = 'Yes'
Using indexing is clearly the best practice, but if you need a loop, you can use:
for index, row in df3.iterrows():
    if row['Text_Count'] >= 50:
        df3.loc[index, 'big'] = 'Yes'
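The reason the original loop changed nothing can be sketched as follows: `iterrows()` yields each row as a separate Series, so assigning to `row` does not write back to the DataFrame, while `df3.loc[index, ...]` does. The two-row frame and the `'No'` default below are assumptions chosen to keep the example small.

```python
import pandas as pd

# Hypothetical data; the column names come from the question.
df3 = pd.DataFrame({"Text_Count": [52, 12]})

# Assigning to `row` only modifies the temporary Series, not df3.
for index, row in df3.iterrows():
    if row["Text_Count"] >= 50:
        row["big"] = "Yes"
changed_by_row_assignment = "big" in df3.columns  # stays False

# Writing through df3.loc targets the DataFrame itself.
df3["big"] = "No"  # assumed default so every row has a value
for index, row in df3.iterrows():
    if row["Text_Count"] >= 50:
        df3.loc[index, "big"] = "Yes"

print(changed_by_row_assignment)
print(df3)
```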
Try with np.where:
import numpy as np
df3['big'] = np.where(df3['Text_Count'] >= 50, 'Yes', 'No')
print(df3)
# Output
   Text_Count  big
0          52  Yes
1          12   No
In my Python code, using pandas, I have to resample a datetime series and calculate the sum of the differences between consecutive values of a column. I wrote this piece of code:
import pandas as pd
import datetime
from .models import Results, VarsResults
start_date = datetime.date(2021, 6, 21)
end_date = datetime.date(2021, 6, 24)
def calc_q(start_d, end_d):
    start_d = start_date
    end_d = end_date
    var_results = VarsResults.objects.filter(
        id_res__read_date__range=(start_d, end_d)
    ).select_related(
        "id_res"
    ).values(
        "id_res__read_date",
        "id_res__unit_id",
        "id_res__device_id",
        "id_res__proj_code",
        "var_val",
    )
    df = pd.DataFrame(list(var_results))
    df['id_res__read_date'] = pd.to_datetime(df['id_res__read_date'])
    df = df.set_index('id_res__read_date')
    df_15 = df.resample('15min').sum()
    return df_15
but I get the sum of the values themselves.
Example:
... | 5
... | 3
... | 1
I get 9.
I would like the sum of the differences between consecutive values, not the sum of the values:
in this case 4 ((5 - 3) + (3 - 1) = 2 + 2).
Is there a way in pandas to do this kind of calculation with resample?
Many thanks in advance,
Manuel
The sum of all the differences is equal to the difference between the first element and the last one: if you work it out, all the other elements cancel out. In your data for example the 3 cancels out:
(5-3) + (3-1)
= 5 - 3 + 3 - 1 # - 3 and + 3 cancel out
= 5 - 1
I don't know how Pandas works, but you can simply do the equivalent of first_value - last_value.
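In pandas, the first-minus-last idea can be sketched with the `first()` and `last()` aggregations of a resampler. The timestamps and values below are made up for illustration; only the column name `var_val` and the `'15min'` bin size come from the question.

```python
import pandas as pd

# Hypothetical readings: three fall in the first 15-minute bin, two in the second.
idx = pd.to_datetime([
    "2021-06-21 00:00", "2021-06-21 00:05", "2021-06-21 00:10",
    "2021-06-21 00:15", "2021-06-21 00:20",
])
df = pd.DataFrame({"var_val": [5, 3, 1, 10, 4]}, index=idx)

binned = df["var_val"].resample("15min")

# Sum of (previous - next) differences inside each bin telescopes to first - last.
diff_sum = binned.first() - binned.last()
print(diff_sum)
```

Bins that contain no rows would come out as NaN with this approach, so they may need to be dropped or filled depending on what the report expects.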
The program I have here is simulating the velocity of a falling object.
The velocity is calculated by subtracting the y positions at two consecutive times and dividing by the time step.
The problem is that the dimensions of array v and array t don't match. Instead of shortening array t, I would like to add a 0 at the beginning of the v array, so that the graph shows v = 0 at t = 0. Yes, I know it is a small interval and that it does not really matter, but I want to know for educational purposes.
I'm wondering if I can write the line v = (y[1:] - y[:-1])/0.1 in a form where I keep the dimension.
Ideally, the array y[:-1] would be subtracted from y aligned at the end of y, so that the result is an array of dimension 101 with 0 as its first value.
I would like to know your thoughts about this.
import numpy as np
import matplotlib.pyplot as plt
t = np.linspace(0,10,101)
g = 9.80665
y = 0.5*g*t*t
v = (y[1:] - y[:-1])/0.1
plt.plot(t,v)
plt.show()
Is there a function that adds a given value to the beginning of an array? np.append adds it to the end.
Maybe you could just pre-define the length of the result at the beginning and then fill up the values:
import numpy as np
dt = .1
g = 9.80665
t_end = 10
t = np.arange(0,t_end+dt,dt)
y = 0.5*g*t*t
v = np.zeros(t.shape[0])
v[1:] = (y[1:] - y[:-1])/dt
If you are simply looking for a function that inserts at a given index, it is this one:
np.insert([1,2,3,4,5,6], 2, 100)
>> array([ 1, 2, 100, 3, 4, 5, 6])
Another possible solution is to use np.append with the argument order reversed:
import numpy as np
v = np.random.rand(10)
value = 42 # value to append at the beginning of v
value_arr = np.array([value]) # dimensions should be adjusted for a multidimensional array
v = np.append(arr = value_arr, values = v, axis=0)
Variants of the same idea are possible with np.concatenate or np.hstack.
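Applied to the falling-object question, the np.concatenate variant could look like the sketch below. It also shows `np.diff` with its `prepend` argument as an alternative; using `np.diff` here is a suggestion of this example rather than something from the answers above.

```python
import numpy as np

dt = 0.1
g = 9.80665
t = np.linspace(0, 10, 101)
y = 0.5 * g * t * t

# Put a single 0.0 in front of the 100 finite differences so v has 101 entries.
v_concat = np.concatenate(([0.0], np.diff(y) / dt))

# Same result: prepending y[0] makes the first difference y[0] - y[0] = 0.
v_prepend = np.diff(y, prepend=y[0]) / dt

print(v_concat.shape, t.shape)
```

Either array can then be passed to plt.plot(t, v) without shortening t.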
Regarding your second question in the comments, one solution may be:
t = np.arange(6)
condlist = [t <= 2, t >= 4]
choicelist = [1, 1]
t = np.select(condlist, choicelist, default=t)
I have an hourly time series data (say df with date/time and value columns) where I want to:
Step 1: Remove the top 5 percentile of each day
Step 2: Get the max(Step 1) for each day
Step 3: Get the mean(Step 2) for each month
Here is what I have tried to implement the above logic:
step_1 = df.resample('D').apply(lambda x: x<x.quantile(0.95))
step_2 = step_1.resample('D').max()
step_3 = step_2.resample('M').mean()
Even though I do not get any error, the generated output differs from the result expected from the 3 steps above (I always get a constant value).
Any help will be appreciated.
You are almost there. Your step_1 is a series of booleans with the same index as the original data, you can use it to filter your DataFrame, thus:
step_1 = df.resample('D').apply(lambda x: x<x.quantile(0.95))
step_2 = df[step_1].resample('D').max()
step_3 = step_2.resample('M').mean()
Your first step is a boolean mask, so you need an additional step that applies it before taking the daily max:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(1000), index=pd.date_range(start='1/1/2019', periods=1000, freq='H'), columns=['my_data'])
mask = df.resample('D').apply(lambda x: x < x.quantile(.95))
step_1 = df[mask]
step_2 = step_1.resample('D').max()
step_3 = step_2.resample('M').mean()
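A variant that does not rely on how `resample(...).apply` shapes its output is to compute each day's 95th percentile with `groupby(...).transform`, which always returns a result aligned with the hourly index. This is a sketch under assumptions: the data is random, the series name `my_data` is borrowed from the answer above, and `'MS'` (month start) is used for the monthly bins, which only changes the labels compared with `'M'`.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly data covering January and February 2019 (59 days).
rng = np.random.default_rng(0)
idx = pd.date_range("2019-01-01", periods=24 * 59, freq=pd.Timedelta(hours=1))
s = pd.Series(rng.standard_normal(len(idx)), index=idx, name="my_data")

# Each hour gets the 95th percentile of its own calendar day.
daily_q95 = s.groupby(s.index.normalize()).transform(lambda x: x.quantile(0.95))

step_1 = s[s < daily_q95]               # Step 1: drop the top 5 percent of each day
step_2 = step_1.resample("D").max()     # Step 2: daily max of what is left
step_3 = step_2.resample("MS").mean()   # Step 3: monthly mean of the daily maxima

print(step_3)
```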
I have a pandas DataFrame, and I want to select certain rows according to some rules.
The following code generates the dataframe:
import datetime
import pandas as pd
import numpy as np
today = datetime.date.today()
dates = list()
for k in range(10):
    a_day = today - datetime.timedelta(days=k)
    dates.append(np.datetime64(a_day))
np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(10, 3)),
                  columns=('other1', 'actual', 'other2'),
                  index=['{}'.format(i) for i in range(10)])
df.insert(0, 'dates', dates)
df['err_m'] = np.random.rand(10, 1)*0.1
df['std'] = np.random.rand(10, 1)*0.05
df['gain'] = np.random.rand(10, 1)
Now, I want to select by the following rules:
1. Compute the sum of 'err_m' and 'std', then sort the df so that the sum is descending.
2. From the result of step 1, select the part where 'actual' is > 50.
Thanks
Create a new column and then sort by it:
df['errsum'] = df['err_m'] + df['std']
# Return a sorted dataframe (sort_values replaces the removed DataFrame.sort)
df_sorted = df.sort_values('errsum', ascending=False)
Select the lines you want:
# Create a boolean Series that is True where the condition on 'actual' is met
selector = df_sorted['actual'] > 50
# Return a new dataframe with only the rows you want
df_sorted[selector]
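If adding a helper column to the dataframe is undesirable, the same two rules can be sketched by sorting on a temporary Series and reusing its index. The frame below is a smaller stand-in with random values (an assumption for illustration); only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's dataframe.
rng = np.random.default_rng(5)
df = pd.DataFrame({
    "actual": rng.integers(0, 100, size=10),
    "err_m": rng.random(10) * 0.1,
    "std": rng.random(10) * 0.05,
})

# Rule 1: order the rows by err_m + std, largest sum first, without storing the sum.
order = (df["err_m"] + df["std"]).sort_values(ascending=False).index
result = df.loc[order]

# Rule 2: keep only the rows where 'actual' is greater than 50.
result = result[result["actual"] > 50]

print(result)
```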