I need to detect threshold values on timeseries with Pyspark.
On the example graph below I want to detect (by storing the associated timestamp) each occurrence of the parameter ALT_STD being larger than 5000 and then lower than 5000.
For this simple case I can run simple queries such as
t_start = df.filter(df.ALT_STD > 5000)\
    .orderBy('timestamp')\
    .select('timestamp')\
    .first()
t_stop = df.filter((df.ALT_STD < 5000)
                   & (df.timestamp > t_start.timestamp))\
    .orderBy('timestamp')\
    .select('timestamp')\
    .first()
However, in some cases the event can be cyclic and I may have several curves (i.e. ALT_STD will rise above or fall below 5000 several times). Of course, if I use the queries above I will only be able to detect the first occurrences.
I guess I should use a window function with a UDF, but I can't find a working solution.
My guess is that the algorithm should be something like:
windowSpec = Window.partitionBy('flight_hash')\
    .rowsBetween(Window.currentRow, 1)

# Pseudocode: current_row and next_row stand for positions inside the window
def detect_thresholds(x):
    if (x['ALT_STD'][current_row] < 5000) and (x['ALT_STD'][next_row] > 5000):
        return x['timestamp']  # Or maybe simply 1
    if (x['ALT_STD'][current_row] > 5000) and (x['ALT_STD'][next_row] < 5000):
        return x['timestamp']  # Or maybe simply 2
    return 0

import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType
detect_udf = F.udf(detect_thresholds, IntegerType())
df.withColumn('Result', detect_udf(F.struct('ALT_STD')).over(windowSpec)).show()
Is such an algorithm feasible in PySpark? How?
As a side note, I have understood how to use a UDF on its own and how to use built-in SQL window functions, but not how to combine a UDF AND a window.
e.g. :
# This will compute the mean (built-in function)
df.withColumn("Result", F.mean(df['ALT_STD']).over(windowSpec)).show()
# This will also work
divide_udf = F.udf(lambda x: x[0]/1000., DoubleType())
df.withColumn('result', divide_udf(F.struct('timestamp')))
No need for udf here (and python udfs cannot be used as window functions). Just use lead / lag with when:
from pyspark.sql.functions import col, lag, lead, when
result = (when((col('ALT_STD') < 5000) & (lead(col('ALT_STD'), 1) > 5000), 1)
          .when((col('ALT_STD') > 5000) & (lead(col('ALT_STD'), 1) < 5000), 1))
df.withColumn("result", result)
Thanks to user9569772's answer I figured it out. That solution did not work as written because lag() and lead() are window functions, so each call needs .over(windowSpec).
from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import when

# lag/lead need an ordered window (assumed here: the flight_hash and timestamp columns from the question)
windowSpec = Window.partitionBy('flight_hash').orderBy('timestamp')

# Define conditions
det_start = (F.lag(F.col('ALT_STD')).over(windowSpec) < 100)\
            & (F.lead(F.col('ALT_STD'), 0).over(windowSpec) >= 100)
det_end = (F.lag(F.col('ALT_STD'), 0).over(windowSpec) > 100)\
          & (F.lead(F.col('ALT_STD')).over(windowSpec) < 100)

# Combine conditions with .when() and .otherwise()
result = (when(det_start, 1)
          .when(det_end, 2)
          .otherwise(0))
df.withColumn("phases", result).show()
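To make the lag/lead idea easier to check by hand, here is a minimal plain-Python sketch of the same comparison on a single ordered partition. It is not PySpark code: the function name detect_crossings and the sample values are made up for illustration, and it flags the first row on the new side of the threshold in both directions, which is a slight simplification of the conditions above.

```python
# Plain-Python model of the lag/lead comparison on one ordered partition.
# The sample values and the function name are made up for illustration.

def detect_crossings(rows, threshold):
    """Return (timestamp, phase) pairs: 1 = crossed upward, 2 = crossed downward.

    `rows` is a list of (timestamp, value) tuples sorted by timestamp,
    i.e. what one window partition looks like after orderBy('timestamp').
    """
    events = []
    for i in range(1, len(rows)):
        prev_value = rows[i - 1][1]   # what lag(value, 1) would return
        timestamp, value = rows[i]    # the current row
        if prev_value < threshold <= value:
            events.append((timestamp, 1))
        elif prev_value >= threshold > value:
            events.append((timestamp, 2))
    return events

samples = [(0, 1000), (1, 4000), (2, 6000), (3, 7000), (4, 3000),
           (5, 2000), (6, 5500), (7, 4900)]
crossings = detect_crossings(samples, 5000)
print(crossings)
```

Because every row is compared with its predecessor, repeated cycles are all reported, not just the first one.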
I have a dataframe and I'm trying to filter based on end_date if it's >= or < a certain date.
However, I'm getting a "not callable" error.
line 148, in <module>
df_s1 = df_x.filter(df_x["end_date"].ge(lit("2022-08-17")))
TypeError: 'Column' object is not callable
Here is my code:
df_x = df_x.join(df_di_meet, trim(df_x.application_id) == trim(df_di_meet.application_id), "left")\
.select (df_x["*"], df_di_meet["end_date"])
# ... Cast end_date to timestamp ...end_date format looks like 2013-12-20 23:59:00.0000000
df_x = df_x.withColumn("end_date",(col("end_date").cast("timestamp")))
# ... Here df_s1 >= 2022-08-17
df_s1 = df_x.filter(df_x["end_date"].ge(lit("2022-08-17")))
#... Here df_s2 < 2022-08-17
df_s2 = df_x.filter(df_x["end_date"].lt(lit("2022-08-17")))
What I'm trying to do is check additional logic as well like the code below, but since it's not working with a when clause I decided to break down the dataframes and check each one separately. Is there an easier way, or how could I get the below code to work?
df_x = df_x.withColumn("REV_STAT_TYP_DES",
    when((df_x.review_statmnt_type_desc == lit("")) & (df_x("end_date").ge(lit("2022-08-17"))), "Not Released")
    .when((df_x.review_statmnt_type_desc == lit("")) & ((df_x("end_date").lt(lit("2022-08-17"))) | (df_x.end_date == lit(""))), "Not Available"))
Difficult conditional code is easier to read and maintain when each condition is pulled out into its own named variable. Look at how I've added isnull to some of the variables; that would have been a lot more difficult if the conditions had not been refactored into separate variables.
from pyspark.sql import functions as F
no_review = (F.col("review_statmnt_type_desc") == "") | F.isnull("review_statmnt_type_desc")
no_end_date = (F.col("end_date") == "") | F.isnull("end_date")
not_released = no_review & (F.col("end_date") >= F.lit("2022-08-17"))
not_available = no_review & ((F.col("end_date") < F.lit("2022-08-17")) | no_end_date)
Also, you don't need the otherwise clause if it should return null (it's the default behaviour).
df_x = df_x.withColumn(
    "REV_STAT_TYP_DES",
    F.when(not_released, "Not Released")
     .when(not_available, "Not Available")
)
df_x("end_date") --> This is the wrong way to access a Spark DataFrame column: the parentheses make Python try to call the DataFrame object, which is not callable.
df_x["end_date"] --> This is how you should access the column (or df_x.end_date).
I only just noticed that methods like .ge() or .le() don't exist on Spark Column objects either; calling them is what raises the 'Column' object is not callable error in your traceback. You can use any of the following ways of filtering instead:
from pyspark.sql.functions import col
df_s1 = df_x.filter(df_x["end_date"] >='2022-08-17')
# OR
df_s1 = df_x.filter(df_x.end_date>='2022-08-17')
# OR
df_s1 = df_x.filter(col('end_date')>='2022-08-17')
# OR
df_s1 = df_x.filter("end_date>='2022-08-17'")
# OR
# you can use df_x.where() instead of df_x.filter
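The "not callable" message can look odd for a method that simply doesn't exist. As far as I understand it, a Column treats an unknown attribute name as a nested-field lookup and returns another Column, which then gets called. The toy class below is only a stand-in that mimics that behaviour (ToyColumn is hypothetical, not the real pyspark.sql.Column):

```python
# Toy stand-in for a column object; NOT the real pyspark.sql.Column class.
class ToyColumn:
    def __init__(self, name):
        self.name = name

    def __getattr__(self, item):
        # Called only when normal attribute lookup fails: treat the unknown
        # name as a nested field and hand back another column object.
        if item.startswith("__"):
            raise AttributeError(item)
        return ToyColumn(self.name + "." + item)

    def __ge__(self, other):
        # The comparison operator is what actually builds a condition.
        return "({} >= {!r})".format(self.name, other)

end_date = ToyColumn("end_date")

try:
    end_date.ge("2022-08-17")      # .ge resolves to a ToyColumn, then gets called
    error_message = None
except TypeError as exc:
    error_message = str(exc)

condition = end_date >= "2022-08-17"
print(error_message)
print(condition)
```

The operator form works because it goes through the comparison method instead of an attribute lookup.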
You probably got confused between pandas and PySpark. Anyway, this is how you would do it in pandas:
import pandas as pd
from pyspark.sql.functions import to_date
df_x = df.withColumn('date', to_date('date')).toPandas()
df_s1 = df_x.assign(date=pd.to_datetime(df_x['date'])).query("date.gt('2022-08-17')", engine='python')
Use SQL-style free-form case/when syntax in the expr() function. That way it is also portable.
from pyspark.sql.functions import expr
df_x = df_x.withColumn("REV_STAT_TYP_DES",
    expr(""" case
        when review_statmnt_type_desc='' and end_date >='2022-08-17' then 'Not Released'
        when review_statmnt_type_desc='' and ( end_date <'2022-08-17' or end_date is null ) then 'Not Available'
        else null
        end """))
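As a rough illustration of the portability point, the same CASE expression can be tried outside Spark. The sketch below runs it in SQLite through Python's standard sqlite3 module; the table name reviews and the sample rows are hypothetical, and SQLite compares the dates as ISO-formatted text here, whereas Spark would compare timestamps.

```python
import sqlite3

# Hypothetical sample rows: (review_statmnt_type_desc, end_date as ISO text)
rows = [
    ("", "2022-09-01 23:59:00"),
    ("", "2021-01-15 23:59:00"),
    ("", None),
    ("Released", "2022-09-01 23:59:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table reviews (review_statmnt_type_desc text, end_date text)")
conn.executemany("insert into reviews values (?, ?)", rows)

query = """
select case
         when review_statmnt_type_desc = '' and end_date >= '2022-08-17' then 'Not Released'
         when review_statmnt_type_desc = '' and (end_date < '2022-08-17' or end_date is null) then 'Not Available'
         else null
       end as REV_STAT_TYP_DES
from reviews
order by rowid
"""
labels = [row[0] for row in conn.execute(query)]
conn.close()
print(labels)
```

Note how the row with a null end_date only matches through the explicit "is null" branch, since a comparison against null is never true.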
How do I create a new column in a data frame that says "Cheap" if the price is below 50000, "Fair" if the price is between 50000 and 100000, and "Expensive" if the price is over 100000?
Although I think @mozway's solution is the cleanest one, here is another way using numpy.select:
import numpy as np
df['new_column'] = np.select([df['selling_price'] < 50_000,
df['selling_price'] <= 100_000],
['Cheap', 'Fair'], 'Expensive')
There are many options. A nice one is pandas.cut:
df['new'] = pd.cut(df['selling_price'],
bins=[0,50000,100000, float('inf')],
labels=['cheap', 'fair', 'expensive'])
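One detail worth checking is what happens exactly on a boundary. pandas.cut uses right-closed bins by default (right=True), so with the bins above a price of exactly 50000 is still labelled cheap; passing right=False would make the bins left-closed instead. The following plain-Python sketch models both conventions with the standard bisect module (the helper names are made up for illustration):

```python
from bisect import bisect_left, bisect_right

EDGES = [50_000, 100_000]                 # boundaries between the three labels
LABELS = ["cheap", "fair", "expensive"]

def label_right_closed(price):
    """Bins (.., 50000], (50000, 100000], (100000, ..): 50000 is still 'cheap'."""
    return LABELS[bisect_left(EDGES, price)]

def label_left_closed(price):
    """Bins [.., 50000), [50000, 100000), [100000, ..): 50000 is already 'fair'."""
    return LABELS[bisect_right(EDGES, price)]

for price in (49_999, 50_000, 100_000, 100_001):
    print(price, label_right_closed(price), label_left_closed(price))
```

Which convention is right depends on how "between 50000 and 100000" in the question is meant to treat the endpoints.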
You could use numpy.where() for this kind of data processing:
import numpy as np
df['Cheap'] = np.where(df['selling_price'] <= 50000, 'Cheap',  # when selling_price <= 50k: 'Cheap', otherwise...
    np.where((df['selling_price'] > 50000) & (df['selling_price'] < 100000), 'Fair',  # when > 50k and < 100k: 'Fair', otherwise...
    np.where(df['selling_price'] >= 100000, 'Expensive',  # when >= 100k: 'Expensive'
    'N/A')))  # otherwise 'N/A', in case you have some string or other data type in your data
Another way with apply() and lambda function :
df["new"] = df.selling_price.apply(
    lambda x: "cheap" if x < 50000 else
              "fair" if x < 100000 else
              "expensive"
)
Or in a general way which allows you to include multiple columns in the condition :
df["new"] = df.apply(
    lambda x: "cheap"
    if x["selling_price"] < 50000
    else "fair"
    if x["selling_price"] < 100000
    else "expensive",
    axis=1,
)
I have a problem inside a pyspark udf function and I want to print the number of the row generating the problem.
I tried to count the rows using the equivalent of "static variable" in Python so that when the udf is called with a new row, a counter is incremented. However, it is not working:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def myF(input):
    myF.lineNumber += 1
    if (somethingBad):
        print(myF.lineNumber)
    return res

myF.lineNumber = 0
myF_udf = F.udf(myF, StringType())
How can I count the number of times a udf is called in order to find the number of the row generating the problem in pyspark?
UDFs are executed on the workers, so print statements inside them won't show up in the output (which comes from the driver). The best way to handle issues with UDFs is to change the return type of the UDF to a struct or a list and pass the error information along with the returned output. In the code below I am just adding the error info to the string res that you were returning originally.
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def myF(input):
    myF.lineNumber += 1
    if (somethingBad):
        res += 'Error in line {}'.format(myF.lineNumber)
    return res

myF.lineNumber = 0
myF_udf = F.udf(myF, StringType())
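One caveat, as far as I understand it: the counter lives in each worker's Python process, so it counts calls within that process rather than giving a global row number. A more robust variant of the "return the error with the output" idea is to pass a row identifier into the function and return a (result, error) pair. The plain-Python sketch below shows the pattern only; parse_amount and the sample rows are hypothetical, and in Spark the pair would be declared as a struct return type.

```python
# Plain-Python sketch; parse_amount and the sample rows are hypothetical.

def parse_amount(row_id, raw):
    """Return (result, error): exactly one of the two is None."""
    try:
        return float(raw), None
    except (TypeError, ValueError) as exc:
        # Carry the row identifier with the error instead of relying on a call counter.
        return None, "row {}: {}".format(row_id, exc)

rows = [(101, "3.5"), (102, "oops"), (103, "7")]
outputs = [parse_amount(row_id, raw) for row_id, raw in rows]
failures = [error for _, error in outputs if error is not None]
print(failures)
```

Filtering on the error field afterwards then points directly at the offending rows.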
I am trying to apply a levenshtein function for each string in dfs against each string in dfc and write the resulting dataframe to csv. The issue is that I'm creating so many rows by using the cross join and then applying the function, that my machine is struggling to write anything (taking forever to execute).
Trying to improve write performance:
I'm filtering the result of the cross join, i.e. keeping only rows where the LevenshteinDistance is less than 15% of the target word's length.
Using bucketing on the first letter of each target word (a, b, c, etc.). Still no luck: the job runs for hours and doesn't generate any results.
from datetime import datetime
from config import config
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql import Window
def fuzzy_match(dfs, dfc, path_summary):
    """
    Implements the Levenshtein and Soundex algorithms and returns a fuzzy matched DataFrame.
    Keeps only those where resulting LS distance is less than 15% of SF name length.
    """
    # Apply Levenshtein and Soundex functions
    dfs = dfs.withColumn("OrganisationNameKeyLen", F.length("OrganisationNameKey"))
    df = dfc\
        .crossJoin(dfs)\
        .withColumn( "LevenshteinDistance", F.levenshtein( F.lower("OrganisationNameKey") , F.lower("CompanyNameKey") ) )\
        .withColumn( "HasSameSoundex", F.soundex("OrganisationNameKey") == F.soundex("CompanyNameKey") )\
        .where("LevenshteinDistance < OrganisationNameKeyLen * 0.15")\
        .orderBy("OrganisationName", "CompanyName")
    return df

def fuzzy_match_approve(df, path_fuzzy_match_approved, path_fuzzy_match_rejected, path_summary):
    """
    Filters fuzzy matching DataFrame results on approved/rejected based on set of conditions:
    - If there is only 1 match against the SF name
    - If more than 1 match then take that with LS distance of 1
    - If more than 1 match and multiple LS distances of 1, then take the one where Soundex codes are the same
    Writes results and summary to CSV.
    """
    def write_with_bucket(df, bucket_col, path):
        df.write\
            .bucketBy(26, bucket_col)\
            .option("path", path)\
            .option("header", True)\
            .saveAsTable("bucket", format="csv")

    # Add window function columns:
    # OrganisationNameMatchCount: Count AccountID per OrganisationName
    # LevenshteinDistance1Count: Count AccountID per OrganisationName where LevenshteinDistance = 1
    windowSpec = Window.partitionBy("OrganisationName")
    df = df\
        .select("AccountID", "OrganisationName", "OrganisationNameKey", "CompanyNumber", "CompanyName", "LevenshteinDistance", "HasSameSoundex")\
        .withColumn("OrganisationNameMatchCount", F.count("AccountID").over(windowSpec))\
        .withColumn("LevenshteinDistance1Count", F.count(F.when(F.col("LevenshteinDistance")==1, F.col("AccountID"))).over(windowSpec))
    # Add bucket key column
    df = df.withColumn( "OrganisationNameBucketKey", F.substring( col("OrganisationNameKey"),0,1) )
    # Define fuzzy match approved condition
    is_approved_1 = ( F.col("OrganisationNameMatchCount") == 1 )
    is_approved_2 = ( (F.col("OrganisationNameMatchCount") > 1) & (F.col("LevenshteinDistance1Count") == 1) & (F.col("LevenshteinDistance") == 1) )
    is_approved_3 = ( (F.col("OrganisationNameMatchCount") > 1) & (F.col("LevenshteinDistance1Count") > 1) & (F.col("HasSameSoundex") == 'true') )
    is_approved = is_approved_1 | is_approved_2 | is_approved_3
    # Split fuzzy match results into approved and rejected
    df_approved = df.filter(is_approved)
    df_rejected = df.filter(~is_approved)
    # Export results
    # df_approved.write.csv(path_fuzzy_match_approved, mode="overwrite", header=True, quoteAll=True)
    # df_rejected.write.csv(path_fuzzy_match_rejected, mode="overwrite", header=True, quoteAll=True)
    write_with_bucket(df_approved, "OrganisationNameBucketKey", path_fuzzy_match_approved)
    write_with_bucket(df_rejected, "OrganisationNameBucketKey", path_fuzzy_match_rejected)

def main():
    spark = SparkSession...
    # Apply fuzzy match
    dfs = spark.read...
    dfc = spark.read...
    path_summary = ...
    df_fuzzy_match = fuzzy_match(dfs, dfc, path_summary)
    # Export results
    path_fuzzy_match_approved = ...
    path_fuzzy_match_rejected = ...
    fuzzy_match_approve(df_fuzzy_match, path_fuzzy_match_approved, path_fuzzy_match_rejected, path_summary)
Other info:
df.rdd.getNumPartitions() is 2
dfs.count() is 12,515
dfc.count() is 5,110,430
How can I improve performance here and get the results into a CSV successfully?
There are a couple of things you can do to improve your computation:
Improve parallelism
As Nithish mentioned in the comments, you don't have enough partitions in your input data frames to make use of all your CPU cores. You're not using all your CPU capability and this will slow you down.
To increase your parallelism, repartition dfc to at least your number of cores:
dfc = dfc.repartition(dfc.sql_ctx.sparkContext.defaultParallelism)
You need to do this because your crossJoin is run as a BroadcastNestedLoopJoin which doesn't reshuffle your large input dataframe.
Separate your computation stages
A Spark dataframe/RDD is conceptually just a directed acyclic graph (DAG) of operations to run on your input data; it does not hold data itself. One consequence of this behavior is that, by default, you'll rerun your computations as many times as you reuse your dataframe.
In your fuzzy_match_approve function, you run 2 separate filters on your df, which means you run the whole cross-join operation twice. You really don't want this!
One easy way to avoid this is to use cache() on your fuzzy_match result which should be fairly small given your inputs and matching criteria.
def fuzzy_match_running(dfs, dfc, path_summary):
    """
    Implements the Levenshtein and Soundex algorithms and returns a fuzzy matched DataFrame.
    Keeps only those where resulting LS distance is less than 15% of SF name length.
    """
    # Apply Levenshtein and Soundex functions
    dfs = dfs.withColumn("OrganisationNameKeyLen", F.length("OrganisationNameKey")).cache()
    dfc = dfc.repartition(dfc.sql_ctx.sparkContext.defaultParallelism).cache()
    df = dfc.crossJoin(dfs) \
        .withColumn( "LevenshteinDistance", F.levenshtein( F.lower("OrganisationNameKey") , F.lower("CompanyNameKey") ) ) \
        .withColumn( "HasSameSoundex", F.soundex("OrganisationNameKey") == F.soundex("CompanyNameKey") ) \
        .where("LevenshteinDistance < OrganisationNameKeyLen * 0.15") \
        .orderBy("OrganisationName", "CompanyName") \
        .cache()
    return df
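The recomputation issue described above can be pictured with a small toy model. The class below is hypothetical and not a Spark API: it only counts how often an "expensive" step runs when its result is reused with and without caching. For simplicity its cache() materialises eagerly, whereas Spark's cache() only takes effect when the first action runs.

```python
# Toy model of lazy evaluation; LazyResult is hypothetical, not a Spark class.

class LazyResult:
    def __init__(self, compute):
        self._compute = compute
        self._cached = None
        self.runs = 0            # how many times the expensive step executed

    def cache(self):
        self._cached = self.collect()
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached
        self.runs += 1
        return self._compute()

def expensive_join():
    # Stand-in for the cross join plus filter.
    return [(a, b) for a in range(100) for b in range(100) if (a + b) % 97 == 0]

uncached = LazyResult(expensive_join)
approved = [p for p in uncached.collect() if p[0] < 50]
rejected = [p for p in uncached.collect() if p[0] >= 50]

cached = LazyResult(expensive_join).cache()
approved_c = [p for p in cached.collect() if p[0] < 50]
rejected_c = [p for p in cached.collect() if p[0] >= 50]
print(uncached.runs, cached.runs)
```

The two filters give the same rows either way; only the number of times the join is evaluated changes.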
If I run my fuzzy_match_running on some example data frames on my 8 core/16 threads I9-9980HK laptop (spark in local[*] mode with 8GB driver memory):
dfc rowcount : 572494
dfs rowcount : 17728
fuzzy_match rowcount: 7228499
Duration: 679.5572581291199 seconds
Matches/core/sec: 933436.210726889
The job takes about 12 min to do 572494 * 17728 ~ 10 billion row comparisons
at 933k comparisons/second/core. Since your job does about 64 billion row comparisons, I would expect it to take roughly 70 to 80 min on my laptop.
You should run a similar experiment on your computer with a smaller sample to get an idea of your actual computing speed.
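For anyone repeating that estimate with their own numbers, the arithmetic is short enough to script. The figures below are copied from the run above and from the row counts in the question; dividing by 16 assumes the per-core figure was computed per hardware thread, which is what reproduces the reported 933k.

```python
# Figures copied from the run above; 16 is the thread count of the laptop.
dfc_rows, dfs_rows = 572_494, 17_728
duration_s = 679.5572581291199
threads = 16

comparisons = dfc_rows * dfs_rows
rate_per_thread = comparisons / duration_s / threads

# Row counts from the question.
target_comparisons = 12_515 * 5_110_430
estimated_minutes = target_comparisons / (rate_per_thread * threads) / 60

print(comparisons)
print(round(rate_per_thread))
print(round(estimated_minutes))
```

Straight linear scaling lands a little over 70 minutes, which is the same ballpark as the estimate above; real runs will differ with hardware and string lengths.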
Going further: maximizing matches/sec
To go faster, we need to adjust the computation and increase the number of comparisons that can be done per second.
A few things stand out in the function:
You filter your output by comparing the Levenshtein distance, an integer, to a decimal calculation. This means Spark will cast your integer to a decimal and operate on decimals. Comparing decimals is much slower than comparing integers and it's unnecessary here: you can cast the bound to an int beforehand.
Your levenshtein call operates on the lowercased versions of your keys. This means that, for each row comparison, Spark converts the column values to lower case again and again, wasting CPU cycles on redundant work. You can preprocess this before your join.
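The first point relies on a small fact: for an integer distance d and any bound x, d < x holds exactly when d < ceil(x), so rounding the bound up once does not change which rows pass. A quick plain-Python check of that equivalence (the helper names are made up for illustration):

```python
import math

def keep_float(distance, name_length):
    # Original filter: integer distance compared with a fractional bound.
    return distance < name_length * 0.15

def keep_int(distance, name_length):
    # Proposed filter: the bound is rounded up once, then compared as an integer.
    return distance < math.ceil(name_length * 0.15)

mismatches = [
    (d, n)
    for n in range(0, 200)
    for d in range(0, 40)
    if keep_float(d, n) != keep_int(d, n)
]
print(mismatches)
```

An empty list means the two filters agree on every tested combination of distance and name length.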
I update the function like this:
from pyspark.sql import DataFrame

def fuzzy_match(dfs: DataFrame, dfc: DataFrame, path_summary: str) -> DataFrame:
    dfs = dfs.withColumn("OrganisationNameKeyLower", F.lower("OrganisationNameKey"))\
        .withColumn("MatchingTolerance", F.ceil(F.length("OrganisationNameKey") * 0.15).cast("int"))\
        .cache()
    dfc = dfc.repartition(dfc.sql_ctx.sparkContext.defaultParallelism)\
        .withColumn("CompanyNameKeyLower", F.lower("CompanyNameKey"))\
        .cache()
    df = dfc.crossJoin(dfs)\
        .withColumn("LevenshteinDistance", F.levenshtein(F.col("OrganisationNameKeyLower"), F.col("CompanyNameKeyLower")).cast("int")) \
        .where("LevenshteinDistance < MatchingTolerance")\
        .cache()
    # clean unnecessary caches before returning
    dfs.unpersist()
    dfc.unpersist()
    return df
When running the updated version on the same inputs as before and on the same computer, I get nearly twice the performance of the first implementation:
dfc rowcount : 572494
dfs rowcount : 17728
fuzzy_match rowcount: 7228499
Duration: 356.23311281204224 seconds
Matches/core/sec: 1780641.1846241967
If that is still too slow for your needs, you'll need to find conditions on your data that you can use as a join condition but that's highly data and use case specific.
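One condition that does not depend on the data is the length gap: the Levenshtein distance between two strings is never smaller than the difference of their lengths, so pairs whose lengths differ by at least the tolerance cannot match. The plain-Python sketch below shows that pre-filter giving the same matches as the brute-force comparison; the sample names and helper functions are hypothetical, and in Spark the idea would translate to a join condition on precomputed length columns.

```python
import math

def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def candidate_pairs(org_names, company_names, ratio=0.15):
    """Yield only pairs whose length gap is small enough to possibly match."""
    for org in org_names:
        tolerance = math.ceil(len(org) * ratio)
        for company in company_names:
            if abs(len(org) - len(company)) < tolerance:
                yield org, company, tolerance

orgs = ["acme widgets ltd", "globex corp"]
companies = ["acme widgets ltd.", "acme widget ltd", "globex corporation",
             "initech", "globex corp"]

candidates = list(candidate_pairs(orgs, companies))
prefiltered = [(o, c) for o, c, tol in candidates if levenshtein(o, c) < tol]
brute_force = [(o, c) for o in orgs for c in companies
               if levenshtein(o, c) < math.ceil(len(o) * 0.15)]
print(len(candidates), "of", len(orgs) * len(companies), "pairs compared")
print(prefiltered)
```

How much this saves depends entirely on how spread out the name lengths are in the real data.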
I have an operation that I want to perform within PySpark 2.0 that would be easy to perform as a df.rdd.map, but since I would prefer to stay inside the Dataframe execution engine for performance reasons, I want to find a way to do this using Dataframe operations only.
The operation, in RDD-style, is something like this:
def precision_formatter(row):
    formatter = "%.{}f".format(row.precision)
    # Parenthesise the division so it is applied before the % formatting
    return tuple(row) + (formatter % (row.amount_raw / 10 ** row.precision),)
df = df.rdd.map(precision_formatter)
Basically, I have a column that tells me, for each row, what the precision for my string formatting operation should be, and I want to selectively format the 'amount_raw' column as a string depending on that precision.
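To pin down the intended per-row behaviour, here is a minimal plain-Python sketch of the formatting step on its own. It assumes amount_raw is an integer amount that has to be scaled down by 10 to the power of precision; the function name and sample values are made up for illustration.

```python
def format_amount(amount_raw, precision):
    """Scale an integer amount by 10**precision and render that many decimals."""
    formatter = "%.{}f".format(precision)
    # Parentheses matter: % and / have equal precedence and associate left to right.
    return formatter % (amount_raw / 10 ** precision)

examples = [(12345, 0), (12345, 2), (12345, 3)]
formatted = [format_amount(raw, p) for raw, p in examples]
print(formatted)
```

Without the parentheses, the % formatting would run first and the result (a string) would then be divided, which raises a TypeError.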
I don't know of a way to use the contents of one or more columns as input to another Column operation. The closest I can come is suggesting the use of Column.when with an externally-defined set of boolean operations that correspond to the set of possible boolean conditions/cases within the column or columns.
In this specific case, for instance, if you can obtain (or better yet, already have) all possible values of row.precision, then you can iterate over that set and apply a Column.when operation for each value in the set. I believe this set can be obtained with df.select('precision').distinct().collect().
Because the pyspark.sql.functions.when and Column.when operations themselves return a Column object, you can iterate over the items in the set (however it was obtained) and keep 'appending' when operations to each other programmatically until you have exhausted the set:
import pyspark.sql.functions as PSF
from pyspark.sql.types import StringType

def format_amounts_with_precision(df, all_precisions_set):
    amt_col = PSF.when(df['precision'] == 0, df['amount_raw'].cast(StringType()))
    for precision in all_precisions_set:
        if precision != 0:  # this is a messy way of having a base case above
            fmt_str = '%.{}f'.format(precision)
            amt_col = amt_col.when(df['precision'] == precision,
                                   PSF.format_string(fmt_str, df['amount_raw'] / 10 ** precision))
    return df.withColumn('amount', amt_col)
You can do it with a Python UDF. A UDF can take as many input values (values from columns of a Row) as you need and returns a single output value. It would look something like this:
from pyspark.sql import types as T, functions as F
from pyspark.sql.functions import udf, col

# Create example data frame
schema = T.StructType([
    T.StructField('precision', T.IntegerType(), False),
    T.StructField('value', T.FloatType(), False)
])
data = [
    (1, 0.123456),
    (2, 0.123456),
    (3, 0.123456)
]
rdd = sc.parallelize(data)
df = sqlContext.createDataFrame(rdd, schema)

# Define UDF and apply it
def format_func(precision, value):
    format_str = "{:." + str(precision) + "f}"
    return format_str.format(value)

format_udf = F.udf(format_func, T.StringType())
new_df = df.withColumn('formatted', format_udf('precision', 'value'))
Also, if instead of the column precision value you wanted to use a global one, you could use the lit(..) function when you call it like this:
new_df = df.withColumn('formatted', format_udf(F.lit(2), 'value'))