Python Hypothesis mixing strategies behavior for DataFrames

The following works as expected:
from datetime import datetime

from hypothesis.extra.pandas import columns, data_frames, indexes
import hypothesis.strategies as st

def boundarize(d: datetime):
    return d.replace(minute=15 * (d.minute // 15), second=0, microsecond=0)

min_date = datetime(2022, 4, 1, 22, 22, 22)
max_date = datetime(2022, 5, 1, 22, 22, 22)

dfs = data_frames(
    index=indexes(
        elements=st.datetimes(min_value=min_date, max_value=max_date).map(boundarize),
        min_size=3,
        max_size=5,
    ).map(lambda idx: idx.sort_values()),
    columns=columns("A B C".split(), dtype=int),
)
dfs.example()
with an output similar to
A B C
2022-04-06 12:45:00 -11482 1588438979 -1994987295
2022-04-08 15:45:00 -833447611 3 -51
2022-04-24 06:15:00 -465371373 990274387 -14969
2022-05-01 01:15:00 1750446827 1214440777 116
2022-05-01 06:15:00 -44089 30508 58737
Now, when I try to generate a similar DataFrame with evenly spaced DatetimeIndex values via
from datetime import datetime

from hypothesis.extra.pandas import columns, data_frames, indexes
import hypothesis.strategies as st
import pandas as pd  # needed for pd.date_range below

def boundarize(d: datetime):
    return d.replace(minute=15 * (d.minute // 15), second=0, microsecond=0)

min_date_start = datetime(2022, 4, 1, 11, 11, 11)
max_date_start = datetime(2022, 4, 2, 11, 11, 11)
min_date_end = datetime(2022, 5, 1, 22, 22, 22)
max_date_end = datetime(2022, 5, 2, 22, 22, 22)

dfs = data_frames(
    index=st.builds(
        pd.date_range,
        start=st.datetimes(min_value=min_date_start, max_value=max_date_start).map(boundarize),
        end=st.datetimes(min_value=min_date_end, max_value=max_date_end).map(boundarize),
        freq=st.just("15T"),
    ),
    columns=columns("A B C".split(), dtype=int),
)
dfs.example()
The output is the following; note that the integer columns are now always zero, whereas they were not in the first example:
A B C
2022-04-01 15:45:00 0 0 0
2022-04-01 16:00:00 0 0 0
2022-04-01 16:15:00 0 0 0
2022-04-01 16:30:00 0 0 0
2022-04-01 16:45:00 0 0 0
... .. .. ..
2022-05-01 21:15:00 0 0 0
2022-05-01 21:30:00 0 0 0
2022-05-01 21:45:00 0 0 0
2022-05-01 22:00:00 0 0 0
2022-05-01 22:15:00 0 0 0
[2907 rows x 3 columns]
Is this expected behavior, or am I missing something?
Edit:
Sidestepping the approach of "random consecutive subsets" (see my comments below), I also tried with a pre-defined index
from datetime import datetime

from hypothesis.extra.pandas import columns, data_frames
import hypothesis.strategies as st
import pandas as pd  # needed for pd.date_range below

min_date_start = datetime(2022, 4, 1, 8, 0, 0)

dfs = data_frames(
    index=st.just(pd.date_range(start=min_date_start, periods=10, freq="15T")),
    columns=columns("A B C".split(), dtype=int),
)
dfs.example()
which gives all-zero columns as well:
A B C
2022-04-01 08:00:00 0 0 0
2022-04-01 08:15:00 0 0 0
2022-04-01 08:30:00 0 0 0
2022-04-01 08:45:00 0 0 0
2022-04-01 09:00:00 0 0 0
2022-04-01 09:15:00 0 0 0
2022-04-01 09:30:00 0 0 0
2022-04-01 09:45:00 0 0 0
2022-04-01 10:00:00 0 0 0
2022-04-01 10:15:00 0 0 0
Edit 2:
I tried to come up with a handmade version of consecutive subsets, which should reduce the space of index values and leave enough entropy for the column values as per @Zac Hatfield-Dodds's answer below, but empirically it still generates mostly all-zero column values:
from datetime import datetime
import math

import hypothesis.strategies as st
from hypothesis.extra.pandas import columns, data_frames
import pandas as pd

time_start = datetime(2022, 4, 1, 8, 0, 0)
time_stop = datetime(2022, 4, 2, 8, 0, 0)
r = pd.date_range(start=time_start, end=time_stop, freq="15T")

def build_indices(sequence):
    first = 0
    if len(sequence) % 2 == 0:
        mid_ceiling = len(sequence) // 2
        mid_floor = mid_ceiling - 1
    else:
        mid_floor = math.floor(len(sequence) / 2)
        mid_ceiling = mid_floor + 1
    second = len(sequence) - 1
    return first, mid_floor, mid_ceiling, second

first, mid_floor, mid_ceiling, second = build_indices(r)
a = st.integers(min_value=first, max_value=mid_floor)
b = st.integers(min_value=mid_ceiling, max_value=second)

def indexer(sequence, lower, upper):
    return sequence[lower:upper]

dfs = data_frames(
    index=st.builds(lambda lower, upper: indexer(r, lower, upper), lower=a, upper=b),
    columns=columns("A B C".split(), dtype=int),
)
dfs.example()

Your problem is that the latter indices are way, way larger, and Hypothesis is running out of entropy to generate the column contents. If you limit the index to at most a few dozen entries, everything should work fine.
We have this soft cap in order to limit otherwise unbounded recursive structures, so the overall design is working as intended, though I acknowledge that in this case it's neither necessary nor desirable.
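Following that advice, here is a minimal sketch that keeps the consecutive-window idea but caps the index at a few dozen entries; the grid bounds, the 36-entry cap, and the small_windows name are my own choices, not from the answer:
from datetime import datetime

import hypothesis.strategies as st
from hypothesis.extra.pandas import columns, data_frames
import pandas as pd

# A fixed, evenly spaced 15-minute grid to slice windows out of.
grid = pd.date_range(start=datetime(2022, 4, 1), end=datetime(2022, 5, 1), freq="15T")

# Draw a start offset, then a small window length, so the index is a
# consecutive slice of between 3 and 36 entries; the small size leaves
# Hypothesis enough entropy for the column values.
small_windows = st.integers(min_value=0, max_value=len(grid) - 3).flatmap(
    lambda start: st.integers(min_value=3, max_value=36).map(
        lambda length: grid[start:start + length]
    )
)

dfs = data_frames(
    index=small_windows,
    columns=columns("A B C".split(), dtype=int),
)
dfs.example()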

Related

Using Pandas to assign specific values

I have the following dataframe:
data = {'id': [1, 2, 3, 4, 5, 6, 7, 8],
        'stat': ['ordered', 'unconfirmed', 'ordered', 'unknwon', 'ordered', 'unconfirmed', 'ordered', 'back'],
        'date': ['2021', '2022', '2023', '2024', '2025', '2026', '2027', '1990']}
df = pd.DataFrame(data)
df
I am trying to get the following data frame:
Unfortunately I have not been successful so far. I used the following commands (for loops) for only stat == 'ordered':
y0 = np.zeros((len(df), 8), dtype=int)
y1 = [1990]
if stat == 'ordered':
    for i in df['id']:
        for j in y1:
            if df.loc[i].at['date'] in y1:
                y0[i][y1.index(j)] = 1
            else:
                y0[i][y1.index(j)] = 0
But unfortunately it did not return the expected solution, and besides that it takes a very long time to do the calculation. I tried to use groupby since it is faster than using for loops, but I could not figure out how to use it properly either. Any idea would be very much appreciated.
IIUC:
df.join(
    pd.get_dummies(df.date).cumsum(axis=1).mul(
        [1, 2, 1, 3, 1, 2, 1, 0], axis=0
    ).astype(int)
)
id stat date 1990 2021 2022 2023 2024 2025 2026 2027
0 1 ordered 2021 0 1 1 1 1 1 1 1
1 2 unconfirmed 2022 0 0 2 2 2 2 2 2
2 3 ordered 2023 0 0 0 1 1 1 1 1
3 4 unknwon 2024 0 0 0 0 3 3 3 3
4 5 ordered 2025 0 0 0 0 0 1 1 1
5 6 unconfirmed 2026 0 0 0 0 0 0 2 2
6 7 ordered 2027 0 0 0 0 0 0 0 1
7 8 back 1990 0 0 0 0 0 0 0 0
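The hard-coded multiplier list [1, 2, 1, 3, 1, 2, 1, 0] encodes one value per row (ordered -> 1, unconfirmed -> 2, unknwon -> 3, back -> 0, as inferred from the output above). Here is a sketch that derives it from the stat column instead, so it survives reordering; the mapping itself is my reading of the expected output, not given explicitly:
# Map each stat to its code, then scale the cumulative dummy matrix row-wise.
codes = df['stat'].map({'ordered': 1, 'unconfirmed': 2, 'unknwon': 3, 'back': 0})
df.join(
    pd.get_dummies(df.date).cumsum(axis=1).mul(codes, axis=0).astype(int)
)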

Python Pandas Conditional Sum and subtract previous row

I am new here and I need some help with Python pandas.
I need help creating a new column where I get the sum of other columns plus the previous row of this calculated column.
This is my example:
df = pd.DataFrame({
    'column0': ['x', 'x', 'y', 'x', 'y', 'y', 'x'],
    'column1': [50, 100, 30, 0, 30, 80, 0],
    'column2': [0, 0, 0, 10, 0, 0, 30],
})
print(df)
print(df)
column0 column1 column2
0 x 50 0
1 x 100 0
2 y 30 0
3 x 0 10
4 y 30 0
5 y 80 0
6 x 0 30
I have used loc to filter this DataFrame like this:
df = df.loc[df['column0'] == 'x']
df = df.reset_index(drop=True)
Now, when I try to get the output, I don't get the correct result:
df['Result'] = df['column1'] + df['column2']
df['Result'] = df['column1'] + df['column2'] + df['Result'].shift(1)
print(df)
column0 column1 column2 Result
0 x 50 0 NaN
1 x 100 0 100.0
2 x 0 10 10.0
3 x 0 30 30.0
I just want this output:
column0 column1 column2 Result
0 x 50 0 50
1 x 100 0 150.0
2 x 0 10 160.0
3 x 0 30 190.0
Thank you very much!
You can use .cumsum() to calculate a cumulative sum of the column:
df = pd.DataFrame({
    'column1': [50, 100, 30, 0, 30, 80, 0],
    'column2': [0, 0, 0, 10, 0, 0, 30],
})
df['column3'] = df['column1'].cumsum() - df['column2'].cumsum()
This results in:
column1 column2 column3
0 50 0 50
1 100 0 150
2 30 0 180
3 0 10 170
4 30 0 200
5 80 0 280
6 0 30 250
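Note that this answer keeps all rows and subtracts column2, while the desired output in the question adds both columns over only the 'x' rows. A sketch of that reading, using the original df from the question (my interpretation of the example):
# Keep only the 'x' rows, then take a running total of both columns;
# this reproduces the Result column 50, 150, 160, 190 from the question.
dfx = df.loc[df['column0'] == 'x'].reset_index(drop=True)
dfx['Result'] = (dfx['column1'] + dfx['column2']).cumsum()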

Add extra columns with default values from a list in a dataframe

I have a dataframe like
df = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}
Now I want the DataFrame to have additional columns from a list ['a', 'b', 'c'] with default values of 0.
so the output will be
Name   Age  a  b  c
Tom    20   0  0  0
nick   21   0  0  0
krish  19   0  0  0
jack   18   0  0  0
Don't use the variable name list, because it shadows a Python builtin.
For the new columns, it is possible to create a dictionary from the list and pass it to DataFrame.assign:
d = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}
df = pd.DataFrame(d)
L = ['a','b','c']
df1 = df.assign(**dict.fromkeys(L, 0))
Or create new DataFrame and use DataFrame.join:
df1 = df.join(pd.DataFrame(0, columns=L, index=df.index))
print (df1)
Name Age a b c
0 Tom 20 0 0 0
1 nick 21 0 0 0
2 krish 19 0 0 0
3 jack 18 0 0 0
>>> df.join(df.reindex(columns=list('abc'), fill_value=0))
Name Age a b c
0 Tom 20 0 0 0
1 nick 21 0 0 0
2 krish 19 0 0 0
3 jack 18 0 0 0
You can also use reindex to create a new DataFrame with fill_value=0, and then combine the columns using join.
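For completeness, a plain-loop sketch that avoids both assign and join (my own addition, not from either answer):
df1 = df.copy()
for col in L:
    df1[col] = 0  # a scalar assignment broadcasts to every row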

Calculating based on date pandas

I have this dataframe:
a = [1, 2, 3, 4, 5]
b = ['2019-08-01', '2019-09-01', '2019-10-23', '2019-11-12', '2019-11-30']
c = [12, 0, 0, 0, 0]
d = [0, 23, 0, 0, 0]
e = [12, 24, 35, 0, 0]
f = [0, 0, 44, 56, 82]
g = [21, 22, 17, 75, 63]
df = pd.DataFrame({'ID': a, 'Date': b, 'Unit_sold_8': c,
                   'Unit_sold_9': d, 'Unit_sold_10': e, 'Unit_sold_11': f,
                   'Unit_sold_12': g})
df['Date'] = pd.to_datetime(df['Date'])
I want to calculate the average sales of each ID based on Date. For example, if an ID's open date was in September, then the average sales of this ID would start in September. I tried np.select, but I realized that this method would make my code super long.
col = df.columns
mask1 = (df['Date'] >= "08/01/2019") & (df['Date'] < "09/01/2019")
mask2 = (df['Date'] >= "09/01/2019") & (df['Date'] < "10/01/2019")
mask3 = (df['Date'] >= "10/01/2019") & (df['Date'] < "11/01/2019")
mask4 = (df['Date'] >= "11/01/2019") & (df['Date'] < "12/01/2019")
mask5 = (df['Date'] >= "12/01/2019")
condition2 = [mask1, mask2, mask3, mask4, mask5]
result2 = [df[col[2:]].mean(skipna=True, axis=1),
           df[col[3:]].mean(skipna=True, axis=1),
           df[col[4:]].mean(skipna=True, axis=1),
           df[col[5:]].mean(skipna=True, axis=1),
           df[col[6:]].mean(skipna=True, axis=1)]
df.loc[:, 'Mean'] = np.select(condition2, result2, default = np.nan)
Is there any faster way to solve this problem, especially when the time range is expanded (12 months, 24 months, etc.)?
Does this help you?
from datetime import datetime

import numpy as np
from dateutil import relativedelta

check_date = datetime.today()
df['n_months'] = df['Date'].apply(lambda x: relativedelta.relativedelta(check_date, x).months)
df['total'] = df.iloc[:, range(2, df.shape[1] - 1)].sum(axis=1)
df['avg'] = df['total'] / df['n_months']
print(df)
print(df)
ID Date Unit_sold_8 ... n_months total avg
0 1 2019-08-01 12 ... 5 45 9.00
1 2 2019-09-01 0 ... 4 69 17.25
2 3 2019-10-23 0 ... 3 96 32.00
3 4 2019-11-12 0 ... 2 131 65.50
4 5 2019-11-30 0 ... 2 145 72.50
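One caveat with this approach: relativedelta(...).months only carries the month component (0-11), so spans longer than a year would undercount. A hedged fix that combines years and months; months_between is my own helper name, not from the answer:
from dateutil import relativedelta

def months_between(later, earlier):
    # Total whole months between two dates, not just the month component.
    delta = relativedelta.relativedelta(later, earlier)
    return delta.years * 12 + delta.months

df['n_months'] = df['Date'].apply(lambda x: months_between(check_date, x))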
M = (df
     # melt data to pull the unit columns out as variables
     .melt(id_vars=['ID', 'Date'])
     # create temp variables pulling the month out of Date and the unit suffix
     .assign(Mth=lambda x: x['Date'].dt.month,
             oda_detail=lambda x: x.variable.str.split('_').str[-1])
     .sort_values(['ID', 'Mth'])
     # keep only rows where the Mth is less than or equal to the unit month
     .loc[lambda x: x['Mth'].astype(int).le(x['oda_detail'].astype(int))]
     # groupby and get the mean
     .groupby(['ID', 'Date'])['value'].mean()
     .reset_index()
     .drop(['ID', 'Date'], axis=1)
     .rename({'value': 'Mean'}, axis=1)
     )
Join back to the original dataframe:
pd.concat([df,M],axis=1)
   ID       Date  Unit_sold_8  Unit_sold_9  Unit_sold_10  Unit_sold_11  Unit_sold_12   Mean
0   1 2019-08-01           12            0            12             0            21   9.00
1   2 2019-09-01            0           23            24             0            22  17.25
2   3 2019-10-23            0            0            35            44            17  32.00
3   4 2019-11-12            0            0             0            56            75  65.50
4   5 2019-11-30            0            0             0            82            63  72.50

What am I doing wrong with series.replace()?

I am trying to replace integer values in pd.Series with other integer values as follows. I am using dict-like replace:
ser_list = [pd.Series([65, 1, 0, 0, 1]), pd.Series([0, 62, 1, 1, 0])]
for ser in ser_list:
    ser.replace({65: 10, 62: 20})
I am expecting the result:
[10, 1, 0, 0, 1] # first series in the list
[0, 20, 1, 1, 0] # second series in the list
where 65 should be replaced with 10 in the first series, and 62 should be replaced with 20 in the second.
However, with this code it returns the original series without any replacement. Any clue why?
It is possible with inplace=True:
for ser in ser_list:
    ser.replace({65: 10, 62: 20}, inplace=True)
print (ser_list)
[0 10
1 1
2 0
3 0
4 1
dtype: int64, 0 0
1 20
2 1
3 1
4 0
dtype: int64]
But this is not recommended, as mentioned by @Dan in the comments (link):
The pandas core team discourages the use of the inplace parameter, and eventually it will be deprecated (which means "scheduled for removal from the library"). Here's why:
inplace won't work within a method chain.
The use of inplace often doesn't prevent copies from being created, contrary to what the name implies.
Removing the inplace option would reduce the complexity of the pandas codebase.
Or assign to the same variable in a list comprehension:
ser_list = [ser.replace({65: 10, 62: 20}) for ser in ser_list]
A loop solution is possible by appending to a new list and assigning back:
out = []
for ser in ser_list:
    ser = ser.replace({65: 10, 62: 20})
    out.append(ser)
print(out)
print (out)
[0 10
1 1
2 0
3 0
4 1
dtype: int64, 0 0
1 20
2 1
3 1
4 0
dtype: int64]
We can also use Series.map with fillna and a list comprehension:
new = [ser.map({65: 10, 62: 20}).fillna(ser) for ser in ser_list]
print(new)
[0 10.0
1 1.0
2 0.0
3 0.0
4 1.0
dtype: float64, 0 0.0
1 20.0
2 1.0
3 1.0
4 0.0
dtype: float64]
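Note that map leaves unmatched values as NaN before fillna, which upcasts the result to float64, as seen above. If integer dtype matters, convert back afterwards (a small sketch, my addition):
new = [ser.map({65: 10, 62: 20}).fillna(ser).astype(int) for ser in ser_list]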
