cuDF - UDF failed in cuda pipeline - python-3.x

I want to select rows based on DateTime. If a row falls inside one of the date ranges, its Time is then checked, and if Time is also in range, outcol is set to 1.
import cudf
import numpy as np
df = cudf.DataFrame({ 'DateTime' : ['20080501 00:00:00.000','20080502 00:01:00.000', '20080503 00:02:00.000', '20080504 00:03:00.000', '20080505 00:04:00.000','20080506 00:05:00.000','20080507 00:06:00.000','20080508 00:07:00.000','20080509 00:08:00.000','20080510 00:09:00.000'],
'Time' : [0,1,2,3,4,5,6,7,8,9]})
df['DateTime'] = cudf.to_datetime(df['DateTime'], errors='coerce')
date1 = cudf.to_datetime("2008-05-01")
date2 = cudf.to_datetime("2008-05-03")
date3 = cudf.to_datetime("2008-05-06")
date4 = cudf.to_datetime("2008-05-08")
date1 = cudf.Series(date1)
date2 = cudf.Series(date2)
date3 = cudf.Series(date3)
date4 = cudf.Series(date4)
So I created a function:
def timex(incol1, incol2, outcol, date1,date2,date3,date4,time1,time2,time3,time4):
    for i, (a,b) in enumerate(zip(incol1,incol2)):
        if date1 <= a < date2:
            if time1 <= b <= time2:
                outcol[i] = 1
        elif date3 <= a < date4:
            if time3 <= b <= time4:
                outcol[i] = 1
        else:
            outcol[i] = 0
df.apply_rows(timex,
              incols={'DateTime':'incol1', 'Time':'incol2'},
              outcols=dict(outcol=np.int64),
              kwargs=dict(date1 = date1,date2 = date2,date3=date3,date4=date4,time1 = 0,time2 = 2, time3 = 6, time4 = 8))
but it raises
TypingError: Failed in cuda mode pipeline (step: nopython frontend)
Can someone tell me how to do it, or where the mistake is?
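A possible way around the error (a sketch, not confirmed by the original thread): the `cudf.Series` objects passed through `kwargs` are a likely culprit, since the function is compiled for the GPU and probably cannot type them. The row-wise loop can be avoided altogether with boolean masks. The example below uses pandas so that it runs without a GPU; it assumes cuDF's pandas-like comparison operators behave the same way, which should be checked against your cuDF version.

```python
import pandas as pd

# Same sample data as in the question, built with pandas instead of cuDF
# (assumption: the masking logic carries over to cudf.DataFrame unchanged).
df = pd.DataFrame({
    "DateTime": pd.to_datetime(
        ["20080501 00:00:00.000", "20080502 00:01:00.000",
         "20080503 00:02:00.000", "20080504 00:03:00.000",
         "20080505 00:04:00.000", "20080506 00:05:00.000",
         "20080507 00:06:00.000", "20080508 00:07:00.000",
         "20080509 00:08:00.000", "20080510 00:09:00.000"],
        format="%Y%m%d %H:%M:%S.%f",
    ),
    "Time": list(range(10)),
})

# Scalar timestamps rather than one-element Series.
date1, date2 = pd.Timestamp("2008-05-01"), pd.Timestamp("2008-05-03")
date3, date4 = pd.Timestamp("2008-05-06"), pd.Timestamp("2008-05-08")

# First window: date in [date1, date2) and Time in [0, 2].
in_first = (
    (df["DateTime"] >= date1) & (df["DateTime"] < date2)
    & (df["Time"] >= 0) & (df["Time"] <= 2)
)
# Second window: date in [date3, date4) and Time in [6, 8].
in_second = (
    (df["DateTime"] >= date3) & (df["DateTime"] < date4)
    & (df["Time"] >= 6) & (df["Time"] <= 8)
)

# True becomes 1, False becomes 0.
df["outcol"] = (in_first | in_second).astype("int64")
print(df)
```

Unlike the loop in the question, every row gets an explicit 0 or 1 here, including rows whose date matches but whose Time does not.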

Related

Restricting last item value in python list to a given date

I am trying to create a list from a given start and end date range in a specific format. The elements in the list should be incremented by 30 days, and the last item in the list should not exceed the end date that I have set.
With the logic that I have below, the last item exceeds the end date that I have set.
from datetime import timedelta, date
start_date = date(2021,1,1)
end_date = date(2021,6,30)
n = 30
next_date = start_date
res = []
while start_date < end_date:
    next_date = (start_date + timedelta(n))
    next_date_str = next_date.strftime('%Y-%m-%d')
    var = start_date.strftime('%Y-%m-%d')+'#'+next_date_str
    res.append(var)
    start_date = next_date + timedelta(1)
print(res)
Result of above code:
['2021-01-01#2021-01-31', '2021-02-01#2021-03-03', '2021-03-04#2021-04-03', '2021-04-04#2021-05-04', '2021-05-05#2021-06-04', '2021-06-05#2021-07-05']
Expected output:
['2021-01-01#2021-01-31', '2021-02-01#2021-03-03', '2021-03-04#2021-04-03', '2021-04-04#2021-05-04', '2021-05-05#2021-06-04', '2021-06-05#2021-06-30']
Please guide me how to restrict the end date for the list
One way to fix this problem is to check, inside the while loop, whether the next date is greater than the end date. If it is, set the next date to the end date. There is no need to break out of the loop: the following start date then falls after the end date, so the loop ends on its own after the clamped interval has been appended.
Here's how you could update your code to do this:
from datetime import timedelta, date
start_date = date(2021,1,1)
end_date = date(2021,6,30)
n = 30
next_date = start_date
res = []
while start_date < end_date:
    next_date = (start_date + timedelta(n))
    if next_date > end_date:
        next_date = end_date  # clamp the final interval to the end date
    next_date_str = next_date.strftime('%Y-%m-%d')
    var = start_date.strftime('%Y-%m-%d')+'#'+next_date_str
    res.append(var)
    start_date = next_date + timedelta(1)
print(res)
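The same clamping idea can be wrapped in a reusable helper. This is only a sketch: the function name `date_windows` and its parameters are made up for illustration, and `min()` stands in for the explicit `if` check.

```python
from datetime import date, timedelta


def date_windows(start_date, end_date, n=30):
    """Return 'start#end' strings covering start_date..end_date in n-day steps.

    Hypothetical helper: the last window is clamped so it never passes end_date.
    """
    res = []
    while start_date < end_date:
        # min() picks the earlier date, which clamps the final window.
        next_date = min(start_date + timedelta(days=n), end_date)
        res.append(f"{start_date:%Y-%m-%d}#{next_date:%Y-%m-%d}")
        # The next window starts the day after this one ends.
        start_date = next_date + timedelta(days=1)
    return res


windows = date_windows(date(2021, 1, 1), date(2021, 6, 30))
print(windows)
```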

How to get last wednesday for current month in python?

My code is as follows:
today = todayte
print('today1 =', today)
offset = (today.weekday() - 2) % 7
print('offset1=', offset)
last_wednesday = today - timedelta(days=offset)
print('last_wednesday1 = ', last_wednesday)
My current output is as follows:
today1 = 2018-03-05
offset1 = 5
last_wednesday1 = 2018-02-28
In the above case I am getting the last Wednesday of the previous month,
but I need the last Wednesday of the current month.
My expected output is as follows:
last_wednesday = 2018-03-28
Here is a way, scanning every day of the current month and keeping the last Wednesday seen:
from datetime import datetime , timedelta
todayDT = datetime.today().replace(day=1) # start from the first day of the month
currentMonth = todayDT.month
nWed = None
while todayDT.month == currentMonth:
    if todayDT.weekday()==2: #this is Wednesday
        nWed = todayDT
    todayDT += timedelta(days=1)
print (nWed)
You can use a combination of datetime and calendar modules:
from datetime import datetime, timedelta
import calendar
today = datetime.now()
# find first day of the month and then, find first wednesday of the month and replace it
# weekday of wednesday == 2
first_day = datetime.today().replace(day=1)
while first_day.weekday() != 2:
    first_day += timedelta(days=1)
number_of_days_in_month = calendar.monthrange(today.year, today.month)[1]
last_wend = first_day + timedelta(days=(((number_of_days_in_month - first_day.day) // 7) * 7))
print(last_wend)
Or, as @Mark Ransom suggested, start from the first day of the next month and step backwards:
from datetime import datetime, timedelta
day_ = (datetime.now().replace(day=1) + timedelta(days=32)).replace(day=1)
while True:
    day_ -= timedelta(days=1)
    if day_.weekday() == 2:
        break
print(day_)
How about this: we first jump to the last day of the current month and then re-use your existing code:
import datetime as dt
todayte = dt.date(2018, 3, 5)
today = todayte
d = dt.date(today.year, today.month, 28) # the 28th day of a month must exist
d = d + dt.timedelta(days=4) # 28 + 4 = 32, so we are sure d is in the next month
d = d.replace(day=1) - dt.timedelta(days=1) # step back to the last day of the current month
# then we apply your original logic
offset = (d.weekday() - 2) % 7
last_wednesday = d - dt.timedelta(days=offset)
print(last_wednesday)
Result:
2018-03-28
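The approaches above can also be generalized to any weekday. Below is a minimal sketch using `calendar.monthrange` to find the last day of the month; the helper name `last_weekday_of_month` is invented for this example.

```python
import calendar
from datetime import date, timedelta


def last_weekday_of_month(year, month, weekday):
    """Return the last date in the month falling on `weekday` (Monday=0 ... Sunday=6).

    Hypothetical helper, not taken from the answers above.
    """
    # monthrange returns (weekday of the 1st, number of days in the month).
    days_in_month = calendar.monthrange(year, month)[1]
    last_day = date(year, month, days_in_month)
    # Number of days to step back from the last day to reach the target weekday.
    offset = (last_day.weekday() - weekday) % 7
    return last_day - timedelta(days=offset)


print(last_weekday_of_month(2018, 3, 2))  # last Wednesday of March 2018
```

Because the offset is taken from the last day of the month and is at most 6, the result can never fall in a neighbouring month.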

pyspark extracting specific value to variable

I have the below script.
I am a bit stuck with this specific piece:
datex = datetime.datetime.strptime(df1.start_time,'%Y-%m-%d %H:%M:%S')
I can't figure out how to extract the actual value from the start_time field & store it in the datex variable.
Can anyone help me please?
while iters <10:
    time_to_add = iters * 900
    time_to_checkx = time_to_check + datetime.timedelta(seconds=time_to_add)
    iters = iters + 1
    session = 0
    for row in df1.rdd.collect():
        datex = datetime.datetime.strptime(df1.start_time,'%Y-%m-%d %H:%M:%S')
        print(datex)
        filterx = df1.filter(datex < time_to_checkx)
        session = session + filterx.count()
    print('current session value' + str(session))
print(session)
Check this out. I have converted your for loop into a single column-based filter. If you can give me more info on the iters variable, or explain how you want it to work, this can be adapted:
import datetime
import pyspark.sql.functions as F
spark_date_format = "yyyy-MM-dd HH:mm:ss"
session = 0
time_to_checkx = time_to_check + datetime.timedelta(seconds=time_to_add)
df1 = df1.withColumn('start_time', F.to_timestamp(F.col('start_time'), spark_date_format))
filterx = df1.filter(df1.start_time < time_to_checkx)
session = session + filterx.count()
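For context, the `strptime` call in the question fails because `df1.start_time` is a column expression, not a string; inside the loop the value has to come from `row`. The pure-Python sketch below illustrates that per-row parsing. It uses a list of dicts as a made-up stand-in for the rows returned by `df1.rdd.collect()`, with invented sample timestamps, so it runs without Spark.

```python
import datetime

# Hypothetical stand-in for df1.rdd.collect(); the values are invented.
rows = [
    {"start_time": "2021-01-01 00:05:00"},
    {"start_time": "2021-01-01 00:20:00"},
    {"start_time": "2021-01-01 00:50:00"},
]
time_to_check = datetime.datetime(2021, 1, 1, 0, 0, 0)


def sessions_before(rows, checkpoint):
    """Count rows whose start_time is earlier than checkpoint."""
    count = 0
    for row in rows:
        # Parse the value held by this row, not the column object.
        datex = datetime.datetime.strptime(row["start_time"], "%Y-%m-%d %H:%M:%S")
        if datex < checkpoint:
            count += 1
    return count


counts = []
for iters in range(4):
    # Each iteration moves the checkpoint forward by 900 seconds (15 minutes).
    time_to_checkx = time_to_check + datetime.timedelta(seconds=iters * 900)
    counts.append(sessions_before(rows, time_to_checkx))
print(counts)
```

On a real Spark DataFrame the column-based filter shown in the answer is preferable, since collecting all rows to the driver does not scale.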

Looping over pandas DataFrame

I have a weird issue that the result doesn't change for each iteration. The code is the following:
import pandas as pd
import numpy as np
X = np.arange(10,100)
Y = X[::-1]
Z = np.array([X,Y]).T
df = pd.DataFrame(Z ,columns = ['col1','col2'])
dif = df['col1'] - df['col2']
for gap in range(100):
    Up = dif > gap
    Down = dif < -gap
    df.loc[Up,'predict'] = 'Up'
    df.loc[Down,'predict'] = 'Down'
    df_result = df.dropna()
    Total = df.shape[0]
    count = df_result.shape[0]
    ratio = count/Total
    print(f'Total: {Total}; count: {count}; ratio: {ratio}')
The result is always
Total: 90; count: 90; ratio: 1.0
when it shouldn't be.
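Found the root of the problem 5 mins after posting this question. The 'predict' labels written in earlier iterations persist in df, and the first pass (gap = 0) already labels every row, so dropna() never removes anything afterwards. I just needed to reset the DataFrame to the original at the start of each iteration to fix the problem.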
import pandas as pd
import numpy as np
X = np.arange(10,100)
Y = X[::-1]
Z = np.array([X,Y]).T
df = pd.DataFrame(Z ,columns = ['col1','col2'])
df2 = df.copy()#added this line to preserve the original df
dif = df['col1'] - df['col2']
for gap in range(100):
    df = df2.copy()#reset the altered df back to the original
    Up = dif > gap
    Down = dif < -gap
    df.loc[Up,'predict'] = 'Up'
    df.loc[Down,'predict'] = 'Down'
    df_result = df.dropna()
    Total = df.shape[0]
    count = df_result.shape[0]
    ratio = count/Total
    print(f'Total: {Total}; count: {count}; ratio: {ratio}')
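If only the counts are needed, the copy can be avoided by counting the boolean masks directly instead of writing labels into the DataFrame. A minimal sketch, where the helper name `predicted_count` is made up for illustration:

```python
import numpy as np
import pandas as pd

X = np.arange(10, 100)
Y = X[::-1]
df = pd.DataFrame({"col1": X, "col2": Y})
dif = df["col1"] - df["col2"]


def predicted_count(dif, gap):
    """Number of rows that would be labelled 'Up' or 'Down' for this gap."""
    # A row gets a label when the difference is outside [-gap, gap].
    return int(((dif > gap) | (dif < -gap)).sum())


total = len(df)
counts = [predicted_count(dif, gap) for gap in range(100)]
for gap in (0, 1, 50, 89):
    print(f"gap: {gap}; Total: {total}; count: {counts[gap]}; ratio: {counts[gap] / total}")
```

Nothing is mutated between iterations, so there is no state to reset.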

TypeError: ("Cannot compare type 'Timestamp' with type 'str'", 'occurred at index 262224')

I am trying to create a date flag from a datetime column, but I get an error after applying the function below.
def f(r):
    if r['balance_dt'] <= '2016-11-30':
        return 0
    else:
        return 1
df_obctohdfc['balance_dt_flag'] = df_obctohdfc.apply(f,axis=1)
The error you are getting occurs because you are comparing a string object to a datetime object. You can convert the string to a datetime first.
Ex:
import datetime
def f(r):
    if r['balance_dt'] <= datetime.datetime.strptime('2016-11-30', '%Y-%m-%d'):
        return 0
    else:
        return 1
df_obctohdfc['balance_dt_flag'] = df_obctohdfc.apply(f,axis=1)
Note: It is better to do it the way jezrael has mentioned. That is the right way to do it.
In pandas it is best to avoid loops, and apply is a loop under the hood.
I think you need to convert the string to a datetime, then compare with > instead of <= and cast the boolean mask to integer, so True becomes 1 and False becomes 0:
timestamp = pd.to_datetime('2016-11-30')
df_obctohdfc['balance_dt_flag'] = (df_obctohdfc['balance_dt'] > timestamp).astype(int)
Sample:
rng = pd.date_range('2016-11-27', periods=10)
df_obctohdfc = pd.DataFrame({'balance_dt': rng})
#print (df_obctohdfc)
timestamp = pd.to_datetime('2016-11-30')
df_obctohdfc['balance_dt_flag'] = (df_obctohdfc['balance_dt'] > timestamp).astype(int)
print (df_obctohdfc)
  balance_dt  balance_dt_flag
0 2016-11-27                0
1 2016-11-28                0
2 2016-11-29                0
3 2016-11-30                0
4 2016-12-01                1
5 2016-12-02                1
6 2016-12-03                1
7 2016-12-04                1
8 2016-12-05                1
9 2016-12-06                1
Comparison on a 1000-row DataFrame:
In [140]: %timeit df_obctohdfc['balance_dt_flag1'] = (df_obctohdfc['balance_dt'] > timestamp).astype(int)
1000 loops, best of 3: 368 µs per loop
In [141]: %timeit df_obctohdfc['balance_dt_flag2'] = df_obctohdfc.apply(f,axis=1)
10 loops, best of 3: 91.2 ms per loop
Setup:
rng = pd.date_range('2015-11-01', periods=1000)
df_obctohdfc = pd.DataFrame({'balance_dt': rng})
#print (df_obctohdfc)
timestamp = pd.to_datetime('2016-11-30')
import datetime
def f(r):
    if r['balance_dt'] <= datetime.datetime.strptime('2016-11-30', '%Y-%m-%d'):
        return 0
    else:
        return 1
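As a quick sanity check, the row-wise function and the vectorized mask can be run side by side on the small sample to confirm they produce the same flags. This is only a sketch; the column names `flag_apply` and `flag_mask` and the variable `cutoff` are chosen here for illustration.

```python
import pandas as pd

rng = pd.date_range("2016-11-27", periods=10)
df = pd.DataFrame({"balance_dt": rng})
cutoff = pd.Timestamp("2016-11-30")


def f(r):
    # Compare Timestamp to Timestamp, never Timestamp to str.
    return 0 if r["balance_dt"] <= cutoff else 1


# Row-wise version (slow on large frames).
flag_apply = df.apply(f, axis=1)
# Vectorized version: the boolean mask cast to int gives the same flags.
flag_mask = (df["balance_dt"] > cutoff).astype(int)

df["flag_apply"] = flag_apply
df["flag_mask"] = flag_mask
print(df)
```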
