I have a dataframe and I'm trying to filter based on end_date if it's >= or < a certain date.
However, I'm getting a "not callable" error.
line 148, in <module>
df_s1 = df_x.filter(df_x["end_date"].ge(lit("2022-08-17")))
TypeError: 'Column' object is not callable
Here is my code:
df_x = df_x.join(df_di_meet, trim(df_x.application_id) == trim(df_di_meet.application_id), "left")\
.select (df_x["*"], df_di_meet["end_date"])
# Cast end_date to timestamp; end_date format looks like 2013-12-20 23:59:00.0000000
df_x = df_x.withColumn("end_date",(col("end_date").cast("timestamp")))
# ... Here df_s1 >= 2022-08-17
df_s1 = df_x.filter(df_x["end_date"].ge(lit("2022-08-17")))
#... Here df_s2 < 2022-08-17
df_s2 = df_x.filter(df_x["end_date"].lt(lit("2022-08-17")))
What I'm actually trying to do is apply additional logic like the code below, but since it wasn't working inside a when clause I decided to break the dataframe down and check each case separately. Is there an easier way, or how could I get the code below to work?
df_x = df_x.withColumn("REV_STAT_TYP_DES", when((df_x.review_statmnt_type_desc == lit("")) & (df_x("end_date").ge(lit("2022-08-17"))), "Not Released")
when((df_x.review_statmnt_type_desc == lit("")) & ((df_x("end_date").lt(lit("2022-08-17"))) | (df_x.end_date == lit(""))), "Not Available")
.otherwise(None))
Complex conditions are usually easier to read and maintain if they are split out into separate variables. Look at how I've added isnull to some of the variables - it would have been a lot more difficult if the conditions weren't refactored into separate variables.
from pyspark.sql import functions as F
no_review = (F.col("review_statmnt_type_desc") == "") | F.isnull("review_statmnt_type_desc")
no_end_date = (F.col("end_date") == "") | F.isnull("end_date")
not_released = no_review & (F.col("end_date") >= F.lit("2022-08-17"))
not_available = no_review & ((F.col("end_date") < F.lit("2022-08-17")) | no_end_date)
Also, you don't need the otherwise clause if it returns null (it's the default behaviour).
df_x = df_x.withColumn(
"REV_STAT_TYP_DES",
F.when(not_released, "Not Released")
.when(not_available, "Not Available")
)
df_x("end_date") --> This is wrong way of accessing a spark dataframe column. That's why python is assuming it as a callable and you are getting that error.
df_x["end_date"] --> This is how you should access the column (or df_x.end_date)
UPDATE:
Only now noticed that .ge(), .le() and similar methods won't work on Spark dataframe Column objects. You can use any of the below ways of filtering:
from pyspark.sql.functions import col
df_s1 = df_x.filter(df_x["end_date"] >='2022-08-17')
# OR
df_s1 = df_x.filter(df_x.end_date>='2022-08-17')
# OR
df_s1 = df_x.filter(col('end_date')>='2022-08-17')
# OR
df_s1 = df_x.filter("end_date>='2022-08-17'")
# OR
# you can use df_x.where() instead of df_x.filter
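For the when() logic from the question, the same fix applies. A sketch only, using bracket/col access and plain comparison operators (column names and cutoff date taken from the question; the isNull check assumes end_date has already been cast to timestamp, so empty strings became null):
from pyspark.sql.functions import col, when

df_x = df_x.withColumn(
    "REV_STAT_TYP_DES",
    when((col("review_statmnt_type_desc") == "") & (col("end_date") >= "2022-08-17"), "Not Released")
    .when((col("review_statmnt_type_desc") == "") & ((col("end_date") < "2022-08-17") | col("end_date").isNull()), "Not Available")
)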
You probably got confused between pandas and PySpark. Anyway, this is how you do it in each:
DataFrame
df = spark.createDataFrame([("2022-08-16", 1), ("2019-06-24", 2), ("2022-08-19", 3)]).toDF("date", "increment")
PySpark
from pyspark.sql.functions import col, lit, to_date

df_x = df.withColumn('date', to_date('date'))
df_x.filter(col('date') > to_date(lit("2022-08-17"))).show()
Pandas
import pandas as pd

df_x = df.toPandas()
df_s1 = df_x.assign(date=pd.to_datetime(df_x['date'])).query("date.gt('2022-08-17')", engine='python')
or
df_x[df_x['date']>'2022-08-17']
Use SQL-style free-form case/when syntax inside the expr() function. That way it is also portable.
from pyspark.sql.functions import expr

df_x = df_x.withColumn("REV_STAT_TYP_DES",
          expr(""" case
                   when review_statmnt_type_desc = '' and end_date >= '2022-08-17' then 'Not Released'
                   when review_statmnt_type_desc = '' and (end_date < '2022-08-17' or end_date is null) then 'Not Available'
                   else null
                   end
               """))
Related
I have a data frame in pandas like this:
ID Date Element Data_Value
0 USW00094889 2014-11-12 TMAX 22
1 USC00208972 2009-04-29 TMIN 56
2 USC00200032 2008-05-26 TMAX 278
3 USC00205563 2005-11-11 TMAX 139
4 USC00200230 2014-02-27 TMAX -106
I want to remove all leap days and my code is
df = df[~((df.Date.month == 2) & (df.Date.day == 29))]
but this raises an AttributeError:
'Series' object has no attribute 'month'
What's wrong with my code?
Use the dt accessor:
df = df[~((df.Date.dt.month == 2) & (df.Date.dt.day == 29))]
Add the dt accessor, because you are working with a Series, not a DatetimeIndex:
df = df[~((df.Date.dt.month == 2) & (df.Date.dt.day == 29))]
Or invert condition with chaining | for bitwise OR and != for not equal:
df = df[(df.Date.dt.month != 2) | (df.Date.dt.day != 29)]
Or use strftime to convert to MM-DD format:
df = df[df.Date.dt.strftime('%m-%d') != '02-29']
Another way you can try below, in case your Date column is not a proper datetime but a str:
df[~df.Date.str.endswith('02-29')]
Or, if it's in datetime format, you can convert it to str:
df[~df.Date.astype(str).str.endswith('02-29')]
Or even use contains:
df[~df.Date.str.contains('02-29')]
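A minimal runnable sketch tying it together (the sample dates here are hypothetical, just to show the dt-accessor filter removing the one leap-day row):
import pandas as pd

df = pd.DataFrame({
    "Date": ["2014-11-12", "2016-02-29", "2008-05-26"],
    "Data_Value": [22, 56, 278],
})
df["Date"] = pd.to_datetime(df["Date"])

# Drop every leap day (Feb 29); only the 2016-02-29 row is removed
df = df[~((df.Date.dt.month == 2) & (df.Date.dt.day == 29))]
print(df)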
Assuming the data.xlsx looks like this:
Column_Name | Table_Name
CUST_ID_1 | Table_1
CUST_ID_2 | Table_2
Here are the SQL statements I'm trying to generate using bind_param for DB2 in Python:
SELECT CUST_ID_1 FROM TABLE_1 WHERE CUST_ID_1 = 12345
SELECT CUST_ID_2 FROM TABLE_2 WHERE CUST_ID_2 = 12345
And this is how I'm trying to generate the query:
import ibm_db
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
validate_sql = "SELECT ? FROM ? WHERE ?=12345"
validate_stmt = ibm_db.prepare(conn, validate_sql)
df = pd.read_excel("data.xlsx", sheet_name='Sheet1')
for i in df.index:
    ibm_db.bind_param(validate_stmt, 1, df['Column_Name'][i])
    ibm_db.bind_param(validate_stmt, 2, df['Table_Name'][i])
    ibm_db.bind_param(validate_stmt, 3, df['Column_Name'][i])
    ibm_db.execute(validate_stmt)
    validation_result = ibm_db.fetch_both(validate_stmt)
    while validation_result != False:
        print(validation_result[0])
        validation_result = ibm_db.fetch_both(validate_stmt)
When I try to execute this code, I'm hitting a SQLCODE=-104 error.
Any idea what the syntax should be for parameter binding?
Thanks,
Ganesh
Two major errors:
1. You can't use a parameter marker for a table or column name (the 2nd and 3rd parameters).
2. You must specify the data type of a parameter marker if it can't be inferred from the query (the 1st parameter); you would need something like cast(? as desired-data-type). But that's just for your information, since here you are trying to use it as a column name, which is not possible as described in 1).
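Since identifiers cannot be parameter markers, one common workaround is to substitute the (trusted, validated) table and column names into the SQL text and bind only the value. A rough sketch, assuming conn is the DB2 connection you already opened; it is not tested against your setup:
import ibm_db
import pandas as pd

df = pd.read_excel("data.xlsx", sheet_name='Sheet1')

for i in df.index:
    column_name = df['Column_Name'][i]
    table_name = df['Table_Name'][i]

    # Identifiers are formatted into the statement text (only do this with
    # trusted values); the comparison value stays a parameter marker.
    validate_sql = f"SELECT {column_name} FROM {table_name} WHERE {column_name} = ?"
    validate_stmt = ibm_db.prepare(conn, validate_sql)

    ibm_db.bind_param(validate_stmt, 1, 12345)
    ibm_db.execute(validate_stmt)

    validation_result = ibm_db.fetch_both(validate_stmt)
    while validation_result != False:
        print(validation_result[0])
        validation_result = ibm_db.fetch_both(validate_stmt)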
I need to detect threshold values on timeseries with Pyspark.
On the example graph below I want to detect (by storing the associated timestamp) each occurrence of the parameter ALT_STD being larger than 5000 and then lower than 5000.
For this simple case I can run simple queries such as
t_start = df.select('timestamp')\
.filter(df.ALT_STD > 5000)\
.sort('timestamp')\
.first()
t_stop = df.select('timestamp')\
.filter((df.ALT_STD < 5000)\
& (df.timestamp > t_start.timestamp))\
.sort('timestamp')\
.first()
However, in some cases the event can be cyclic and I may have several curves (i.e. ALT_STD will rise above and then fall below 5000 several times). Of course, with the queries above I will only detect the first occurrence.
I guess I should use a window function with a udf, but I can't find a working solution.
My guess is that the algorithm should be something like:
windowSpec = Window.partitionBy('flight_hash')\
.orderBy('timestamp')\
.rowsBetween(Window.currentRow, 1)
def detect_thresholds(x):
    if (x['ALT_STD'][current_row] < 5000) and (x['ALT_STD'][next_row] > 5000):
        return x['timestamp']  # or maybe simply 1
    elif (x['ALT_STD'][current_row] > 5000) and (x['ALT_STD'][next_row] < 5000):
        return x['timestamp']  # or maybe simply 2
    else:
        return 0
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

detect_udf = F.udf(detect_thresholds, IntegerType())
df.withColumn('Result', detect_udf(F.struct('ALT_STD')).over(windowSpec)).show()
Is such an algorithm feasible in Pyspark ? How ?
Post-scriptum:
As a side note, I have understood how to use a udf, and how to use built-in SQL window functions, but not how to combine a udf AND a window.
e.g. :
# This will compute the mean (built-in function)
df.withColumn("Result", F.mean(df['ALT_STD']).over(windowSpec)).show()
# This will also work
divide_udf = F.udf(lambda x: x[0]/1000., DoubleType())
df.withColumn('result', divide_udf(F.struct('timestamp')))
No need for a udf here (and Python udfs cannot be used as window functions). Just use lead/lag with when:
from pyspark.sql.functions import col, lag, lead, when
result = (when((col('ALT_STD') < 5000) & (lead(col('ALT_STD'), 1) > 5000), 1)
          .when((col('ALT_STD') > 5000) & (lead(col('ALT_STD'), 1) < 5000), 1)
          .otherwise(0))
df.withColumn("result", result)
Thanks to user9569772's answer I figured it out. That solution did not work as posted because .lag() and .lead() are window functions and must be applied over a window:
from pyspark.sql.functions import when
from pyspark.sql import functions as F
# Define conditions
det_start = (F.lag(F.col('ALT_STD')).over(windowSpec) < 100)\
& (F.lead(F.col('ALT_STD'), 0).over(windowSpec) >= 100)
det_end = (F.lag(F.col('ALT_STD'), 0).over(windowSpec) > 100)\
& (F.lead(F.col('ALT_STD')).over(windowSpec) < 100)
# Combine conditions with .when() and .otherwise()
result = (when(det_start, 1)\
.when(det_end, 2)\
.otherwise(0))
df.withColumn("phases", result).show()
I'm trying to apply a complex function to a pandas DataFrame, and I'm wondering if there's a faster way to do it. A simplified version of my data looks like this:
UID,UID2,Time,EventType
1,1,18:00,A
1,1,18:05,B
1,2,19:00,A
1,2,19:03,B
2,6,20:00,A
3,4,14:00,A
What I want to do is for each combination of UID and UID2 check if there is both a row with EventType = A and EventType = B, and then calculate the time difference, and then add it back as a new column. So the new dataset would be:
UID,UID2,Time,EventType,TimeDiff
1,1,18:00,A,5
1,1,18:05,B,5
1,2,19:00,A,3
1,2,19:03,B,3
2,6,20:00,A,nan
3,4,14:00,A,nan
This is the current implementation: I group the records by UID and UID2 so there is only a small subset of rows to search when checking whether both EventTypes exist. I can't figure out a faster approach, and profiling in PyCharm hasn't helped uncover where the bottleneck is.
for (uid, uid2), group in df.groupby(["UID", "UID2"]):
    # if there is a row for both A and B for a UID, UID2 combo
    if len(group[group["EventType"] == "A"]) > 0 and len(group[group["EventType"] == "B"]) > 0:
        time_a = group.loc[group["EventType"] == "A", "Time"].iloc[0]
        time_b = group.loc[group["EventType"] == "B", "Time"].iloc[0]
        timediff = time_b - time_a
        timediff_min = timediff.components.minutes
        df.loc[(df["UID"] == uid) & (df["UID2"] == uid2), "TimeDiff"] = timediff_min
First, I need to make sure the Time column is a timedelta:
df.Time = pd.to_datetime(df.Time)
df.Time = df.Time - pd.to_datetime(df.Time.dt.date)
After that, I create a helper dataframe:
df1 = df.set_index(['UID', 'UID2', 'EventType']).unstack().Time
df1
Finally, I take the difference and merge it back to df:
df.merge((df1.B - df1.A).rename('TimeDiff').reset_index())
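Putting it together on the sample data, with one extra (assumed) step that converts the timedelta to whole minutes so it matches the expected output:
import pandas as pd

df = pd.DataFrame({
    "UID":       [1, 1, 1, 1, 2, 3],
    "UID2":      [1, 1, 2, 2, 6, 4],
    "Time":      ["18:00", "18:05", "19:00", "19:03", "20:00", "14:00"],
    "EventType": ["A", "B", "A", "B", "A", "A"],
})

# Make Time a timedelta
df.Time = pd.to_datetime(df.Time)
df.Time = df.Time - pd.to_datetime(df.Time.dt.date)

# Pivot so each (UID, UID2) pair has its A and B times side by side
df1 = df.set_index(['UID', 'UID2', 'EventType']).unstack().Time

# B - A as whole minutes, merged back onto the original rows
timediff = ((df1.B - df1.A).dt.total_seconds() / 60).rename('TimeDiff').reset_index()
df = df.merge(timediff, on=['UID', 'UID2'], how='left')
print(df)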
Fairly new to coding. I have looked at a couple of other similar questions about appending DataFrames in python but could not solve the problem.
I have the data below (CSV-style) in an Excel xlsx file:
Venue Name,Cost ,Restriction,Capacity
Cinema,5,over 13,50
Bar,10,over 18,50
Restaurant,15,no restriction,25
Hotel,7,no restriction,100
I am using the code below to try to filter out rows which have "no restriction" under the Restriction column. The code seems to work right through to the last line, i.e. both print statements give me what I would expect.
import pandas as pd
import numpy as np
my_file = pd.ExcelFile("venue data.xlsx")
mydata = my_file.parse(0, index_col = None, na_values = ["NA"])
my_new_file = pd.DataFrame()
for index in mydata.index:
    if "no restriction" in mydata.Restriction[index]:
        print(mydata.Restriction[index])
        print(mydata.loc[index:index])
        my_new_file.append(mydata.loc[index:index], ignore_index=True)
Don't loop through dataframes. It's almost never necessary.
Use:
df2 = df[df['Restriction'] != 'no restriction']
Or
df2 = df.query("Restriction != 'no restriction'")
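Applied to the question's workbook, that looks roughly like this (file and sheet names taken from the question; flip != to == if you actually want to keep only the "no restriction" rows):
import pandas as pd

mydata = pd.read_excel("venue data.xlsx", sheet_name=0, na_values=["NA"])

# Keep every row whose Restriction is not "no restriction"
my_new_file = mydata[mydata["Restriction"] != "no restriction"]
print(my_new_file)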