Related
I am trying to apply a levenshtein function for each string in dfs against each string in dfc and write the resulting dataframe to csv. The issue is that I'm creating so many rows by using the cross join and then applying the function, that my machine is struggling to write anything (taking forever to execute).
Trying to improve write performance:
I'm filtering out a few things on the result of the cross join i.e. rows where the LevenshteinDistance is less than 15% of the target word's.
Using bucketing on the first letter of each target word i.e. a, b, c, etc. still no luck (i.e. job runs for hours and doesn't generate any results).
from datetime import datetime
from config import config
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql import Window
def fuzzy_match(dfs, dfc, path_summary):
"""
Implements the Levenshtein and Soundex algorithms and returns a fuzzy matched DataFrame.
Filters out those where resulting LS distance is less than 15% of SF name length.
"""
# Apply Levenshtein and Soundex functions
dfs = dfs.withColumn("OrganisationNameKeyLen", F.length("OrganisationNameKey"))
df = dfc\
.crossJoin(dfs)\
.withColumn( "LevenshteinDistance", F.levenshtein( F.lower("OrganisationNameKey") , F.lower("CompanyNameKey") ) )\
.withColumn( "HasSameSoundex", F.soundex("OrganisationNameKey") == F.soundex("CompanyNameKey") )\
.where("LevenshteinDistance < OrganisationNameKeyLen * 0.15")\
.orderBy("OrganisationName", "CompanyName")
def fuzzy_match_approve(df, path_fuzzy_match_approved, path_fuzzy_match_rejected, path_summary):
"""
Filters fuzzy matching DataFrame results on approved/rejected based on set of conditions:
- If there is only 1 match against the SF name
- If more than 1 match then take that with LS distance of 1
- If more than 1 match and more multiple LS distances of 1, then take the one where Soundex codes are the same
Writes results and summary to CSV.
"""
def write_with_bucket(df, bucket_col, path):
df.write\
.mode("overwrite")\
.bucketBy(26, bucket_col)\
.option("path", path)\
.option("header", True)\
.saveAsTable("bucket", format="csv")
# Add window function columns:
# OrganisationNameMatchCount: Count AccountID per OrganisationName
# LevenshteinDistance1Count: Count AccountID per OrganisationName where LevenshteinDistance = 1
windowSpec = Window.partitionBy("OrganisationName")
df = df\
.select("AccountID", "OrganisationName", "OrganisationNameKey", "CompanyNumber", "CompanyName", "LevenshteinDistance", "HasSameSoundex")\
.withColumn("OrganisationNameMatchCount", F.count("AccountID").over(windowSpec))\
.withColumn("LevenshteinDistance1Count", F.count(F.when(F.col("LevenshteinDistance")==1, F.col("AccountID"))).over(windowSpec))
# Add bucket key column
df = df.withColumn( "OrganisationNameBucketKey", F.substring( col("OrganisationNameKey"),0,1) )
# Define fuzzy match approved condition
is_approved_1 = ( F.col("OrganisationNameMatchCount") == 1 )
is_approved_2 = ( (F.col("OrganisationNameMatchCount") > 1) & (F.col("LevenshteinDistance1Count") == 1) & (F.col("LevenshteinDistance") == 1) )
is_approved_3 = ( (F.col("OrganisationNameMatchCount") > 1) & (F.col("LevenshteinDistance1Count") > 1) & (F.col("HasSameSoundex") == 'true') )
is_approved = is_approved_1 | is_approved_2 | is_approved_3
# Split fuzzy match results into approved and rejected
df_approved = df.filter(is_approved)
df_rejected = df.filter(~is_approved)
# Export results
# df_approved.write.csv(path_fuzzy_match_approved, mode="overwrite", header=True, quoteAll=True)
# df_rejected.write.csv(path_fuzzy_match_rejected, mode="overwrite", header=True, quoteAll=True)
write_with_bucket(df_approved, "OrganisationNameBucketKey", path_fuzzy_match_approved)
write_with_bucket(df_rejected, "OrganisationNameBucketKey", path_fuzzy_match_rejected)
def main():
spark = SparkSession...
# Apply fuzzy match
dfs = spark.read...
dfc = spark.read...
path_summary = ...
df_fuzzy_match = fuzzy_match(dfs, dfc, path_summary)
# Export results
path_fuzzy_match_approved = ...
path_fuzzy_match_rejected = ...
fuzzy_match_approve(df_fuzzy_match, path_fuzzy_match_approved, path_fuzzy_match_rejected, path_summary)
main()
Other info:
df.rdd.getNumPartitions() is 2
dfs.count() is 12,515
dfc.count() is 5,110,430
Jobs:
How can I improve performance here and get the results into a CSV successfully?
There are a couple of things you can do to improve your computation:
Improve parallelism
As Nithish mentioned in the comments, you don't have enough partitions in your input data frames to make use of all your CPU cores. You're not using all your CPU capability and this will slow you down.
To increase your parallelism, repartition dfc to at least your number of cores:
dfc = dfc.repartition(dfc.sql_ctx.sparkContext.defaultParallelism)
You need to do this because your crossJoin is run as a BroadcastNestedLoopJoin which doesn't reshuffle your large input dataframe.
Separate your computation stages
A Spark dataframe/RDD is conceptually just a directed action graph (DAG) of operations to run on your input data but it does not hold data. One consequence of this behavior is that, by default, you'll rerun your computations as many times as you reuse your dataframe.
In your fuzzy_match_approve function, you run 2 separate filters on your df, this means you rerun the whole cross-join operations twice. You really don't want this !
One easy way to avoid this is to use cache() on your fuzzy_match result which should be fairly small given your inputs and matching criteria.
def fuzzy_match_running(dfs, dfc, path_summary):
"""
Implements the Levenshtein and Soundex algorithms and returns a fuzzy matched DataFrame.
Filters out those where resulting LS distance is less than 15% of SF name length.
"""
# Apply Levenshtein and Soundex functions
dfs = dfs.withColumn("OrganisationNameKeyLen", F.length("OrganisationNameKey")).cache()
dfc = dfc.repartition(dfc.sql_ctx.sparkContext.defaultParallelism).cache()
df = dfc.crossJoin(dfs) \
.withColumn( "LevenshteinDistance", F.levenshtein( F.lower("OrganisationNameKey") , F.lower("CompanyNameKey") ) ) \
.withColumn( "HasSameSoundex", F.soundex("OrganisationNameKey") == F.soundex("CompanyNameKey") ) \
.where("LevenshteinDistance < OrganisationNameKeyLen * 0.15") \
.orderBy("OrganisationName", "CompanyName") \
.cache()
return df
If I run my fuzzy_match_running on some example data frames on my 8 core/16 threads I9-9980HK laptop (spark in local[*] mode with 8GB driver memory):
dfc rowcount : 572494
dfs rowcount : 17728
fuzzy_match rowcount: 7228499
Duration: 679.5572581291199 seconds
Matches/core/sec: 933436.210726889
The job takes about 12 min doing 572494*17728 ~ 10 billion row comparisons
at 933k comparisons/seconds/core. Since your job does 64 billions row comparisons I would expect it to take about 80 min on my laptop.
You should run a similar experiment on your computer with a smaller sample to get an idea of your actual computing speed.
Going further: maximizing matches/sec
To go faster, we need to adjust the computation and increase the number of comparisons that can be done per seconds.
A few things stand out in the function:
you filter your output by comparing the levenshtein distance, an integer, to a decimal calculation. This means spark will cast your integer to a decimal and operate on decimal. Comparing decimals is much slower than integers and it's unnecessary here, you can cast the bound to an int beforehand.
your levenshtein operates on the lower versions of your keys, this means, for each row comparison, Spark will convert the column values to lower again and again, wasting CPU cycles for redundant stuff. You can preprocess this before your join.
I update the function like this:
def fuzzy_match(dfs: DataFrame, dfc: DataFrame, path_summary: str) -> DataFrame:
dfs = dfs.withColumn("OrganisationNameKeyLower", F.lower("OrganisationNameKey"))\
.withColumn("MatchingTolerance", F.ceil(F.length("OrganisationNameKey") * 0.15).cast("int"))\
.cache()
dfc = dfc.repartition(dfc.sql_ctx.sparkContext.defaultParallelism)\
.withColumn("CompanyNameKeyLower", F.lower("CompanyNameKey"))\
.cache()
df = dfc.crossJoin(dfs)\
.withColumn("LevenshteinDistance", F.levenshtein(F.col("OrganisationNameKeyLower"), F.col("CompanyNameKeyLower")).cast("int")) \
.where("LevenshteinDistance < MatchingTolerance")\
.drop("MatchingTolerance")\
.cache()
# clean unnecessary caches before returning
dfs.unpersist()
dfc.unpersist()
return df
When running the updated version on the same inputs as before and on the same computer I get nearly twice the performance as the first implementation
dfc rowcount : 572494
dfs rowcount : 17728
fuzzy_match rowcount: 7228499
Duration: 356.23311281204224 seconds
Matches/core/sec: 1780641.1846241967
If that is still too slow for your needs, you'll need to find conditions on your data that you can use as a join condition but that's highly data and use case specific.
I have a huge PySpark dataframe and I'm doing a series of Window functions over partitions defined by my key.
The issue with the key is, my partitions gets skewed by this and results in Event Timeline that looks something like this,
I know that I can use salting technique to solve this issue when I'm doing a join. But how can I solve this issue when I'm using Window functions?
I'm using functions like lag, lead etc in the Window functions. I can't do the process with salted key, because I'll get wrong results.
How to solve skewness in this case?
I'm looking for a dynamic way of repartitioning my dataframe without skewness.
Updates based on answer from #jxc
I tried creating a sample df and tried running code over that,
df = pd.DataFrame()
df['id'] = np.random.randint(1, 1000, size=150000)
df['id'] = df['id'].map(lambda x: 100 if x % 2 == 0 else x)
df['timestamp'] = pd.date_range(start=pd.Timestamp('2020-01-01'), periods=len(df), freq='60s')
sdf = sc.createDataFrame(df)
sdf = sdf.withColumn("amt", F.rand()*100)
w = Window.partitionBy("id").orderBy("timestamp")
sdf = sdf.withColumn("new_col", F.lag("amt").over(w) + F.lead("amt").over(w))
x = sdf.toPandas()
This gave me a event timeline like this,
I tried the code from #jxc's answer,
sdf = sc.createDataFrame(df)
sdf = sdf.withColumn("amt", F.rand()*100)
N = 24*3600*365*2
sdf_1 = sdf.withColumn('pid', F.ceil(F.unix_timestamp('timestamp')/N))
w1 = Window.partitionBy('id', 'pid').orderBy('timestamp')
w2 = Window.partitionBy('id', 'pid')
sdf_2 = sdf_1.select(
'*',
F.count('*').over(w2).alias('cnt'),
F.row_number().over(w1).alias('rn'),
(F.lag('amt',1).over(w1) + F.lead('amt',1).over(w1)).alias('new_val')
)
sdf_3 = sdf_2.filter('rn in (1, 2, cnt-1, cnt)') \
.withColumn('new_val', F.lag('amt',1).over(w) + F.lead('amt',1).over(w)) \
.filter('rn in (1,cnt)')
df_new = sdf_2.filter('rn not in (1,cnt)').union(sdf_3)
x = df_new.toPandas()
I ended up one additional stage and the event timeline looked more skewed,
Also the run time is increased by a bit with new code
To process a large partition, you can try split it based on the orderBy column(most likely a numeric column or date/timestamp column which can be converted into numeric) so that all new sub-partitions maintain the correct order of rows. process rows with the new partitioner and for calculation using lag and lead functions, only rows around the boundary between sub-partitions need to be post-processed. (Below also discussed how to merge small partitions in task-2)
Use your example sdf and assume we have the following WinSpec and a simple aggregate function:
w = Window.partitionBy('id').orderBy('timestamp')
df.withColumn('new_amt', F.lag('amt',1).over(w) + F.lead('amt',1).over(w))
Task-1: split large partitions:
Try the following:
select a N to split timestamp and set up an additional partitionBy column pid (using ceil, int, floor etc.):
# N to cover 35-days' intervals
N = 24*3600*35
df1 = sdf.withColumn('pid', F.ceil(F.unix_timestamp('timestamp')/N))
add pid into partitionBy(see w1), then calaulte row_number(), lag() and lead() over w1. find also number of rows (cnt) in each new partition to help identify the end of partitions (rn == cnt). the resulting new_val will be fine for majority of rows except those on the boundaries of each partition.
w1 = Window.partitionBy('id', 'pid').orderBy('timestamp')
w2 = Window.partitionBy('id', 'pid')
df2 = df1.select(
'*',
F.count('*').over(w2).alias('cnt'),
F.row_number().over(w1).alias('rn'),
(F.lag('amt',1).over(w1) + F.lead('amt',1).over(w1)).alias('new_amt')
)
Below is an example df2 showing the boundary rows.
process the boundary: select rows which are on the boundaries rn in (1, cnt) plus those which have values used in the calculation rn in (2, cnt-1), do the same calculation of new_val over w and save result for boundary rows only.
df3 = df2.filter('rn in (1, 2, cnt-1, cnt)') \
.withColumn('new_amt', F.lag('amt',1).over(w) + F.lead('amt',1).over(w)) \
.filter('rn in (1,cnt)')
Below shows the resulting df3 from the above df2
merge df3 back to df2 to update boundary rows rn in (1,cnt)
df_new = df2.filter('rn not in (1,cnt)').union(df3)
Below screenshot shows the final df_new around the boundary rows:
# drop columns which are used to implement logic only
df_new = df_new.drop('cnt', 'rn')
Some Notes:
the following 3 WindowSpec are defined:
w = Window.partitionBy('id').orderBy('timestamp') <-- fix boundary rows
w1 = Window.partitionBy('id', 'pid').orderBy('timestamp') <-- calculate internal rows
w2 = Window.partitionBy('id', 'pid') <-- find #rows in a partition
note: strictly, we'd better use the following w to fix boundary rows to avoid issues with tied timestamp around the boundaries.
w = Window.partitionBy('id').orderBy('pid', 'rn') <-- fix boundary rows
if you know which partitions are skewed, just divide them and skip others. the existing method might split a small partition into 2 or even more if they are sparsely distributed
df1 = df.withColumn('pid', F.when(F.col('id').isin('a','b'), F.ceil(F.unix_timestamp('timestamp')/N)).otherwise(1))
If for each partition, you can retrieve count(number of rows) and min_ts=min(timestamp), then try something more dynamically for pid(below M is the threshold number of rows to split):
F.expr(f"IF(count>{M}, ceil((unix_timestamp(timestamp)-unix_timestamp(min_ts))/{N}), 1)")
note: for skewness inside a partition, will requires more complex functions to generate pid.
if only lag(1) function is used, just post-process left boundaries, filter by rn in (1, cnt) and update only rn == 1
df3 = df1.filter('rn in (1, cnt)') \
.withColumn('new_amt', F.lag('amt',1).over(w)) \
.filter('rn = 1')
similar to lead function when we need only to fix right boundaries and update rn == cnt
if only lag(2) is used, then filter and update more rows with df3:
df3 = df1.filter('rn in (1, 2, cnt-1, cnt)') \
.withColumn('new_amt', F.lag('amt',2).over(w)) \
.filter('rn in (1,2)')
You can extend the same method to mixed cases with both lag and lead having different offset.
Task-2: merge small partitions:
Based on the number of records in a partition count, we can set up an threshold M so that if count>M, the id holds its own partition, otherwise we merge partitions so that #of total records is less than M (below method has a edging case of 2*M-2).
M = 20000
# create pandas df with columns `id`, `count` and `f`, sort rows so that rows with count>=M are located on top
d2 = pd.DataFrame([ e.asDict() for e in sdf.groupby('id').count().collect() ]) \
.assign(f=lambda x: x['count'].lt(M)) \
.sort_values('f')
# add pid column to merge smaller partitions but the total row-count in partition should be less than or around M
# potentially there could be at most `2*M-2` records for the same pid, to make sure strictly count<M, use a for-loop to iterate d1 and set pid:
d2['pid'] = (d2.mask(d2['count'].gt(M),M)['count'].shift(fill_value=0).cumsum()/M).astype(int)
# add pid to sdf. In case join is too heavy, try using Map
sdf_1 = sdf.join(spark.createDataFrame(d2).alias('d2'), ["id"]) \
.select(sdf["*"], F.col("d2.pid"))
# check pid: # of records and # of distinct ids
sdf_1.groupby('pid').agg(F.count('*').alias('count'), F.countDistinct('id').alias('cnt_ids')).orderBy('pid').show()
+---+-----+-------+
|pid|count|cnt_ids|
+---+-----+-------+
| 0|74837| 1|
| 1|20036| 133|
| 2|20052| 134|
| 3|20010| 133|
| 4|15065| 100|
+---+-----+-------+
Now, the new Window should be partitioned by pid alone and move id to orderBy, see below:
w3 = Window.partitionBy('pid').orderBy('id','timestamp')
customize lag/lead functions based on the above w3 WinSpec, and then calculate new_val:
lag_w3 = lambda col,n=1: F.when(F.lag('id',n).over(w3) == F.col('id'), F.lag(col,n).over(w3))
lead_w3 = lambda col,n=1: F.when(F.lead('id',n).over(w3) == F.col('id'), F.lead(col,n).over(w3))
sdf_new = sdf_1.withColumn('new_val', lag_w3('amt',1) + lead_w3('amt',1))
To handle such skewed data, there are a couple of things you can try out.
If you are using Databricks to run your jobs and you know which column will have the skew then you can try out an option called skew hint
I recommend moving to Spark 3.0 since you will have the option to use Adaptive Query Execution (AQE) which can handle most of the issues improving your job health and potentially running them faster.
Usually, I suggest making your data more even-sized partitions before any wide operation, and Increasing the cluster size does help but I am not sure if this will work for you.
I have a dataframe with time-series data and I am trying to add a lot of moving average columns to it with different windows of various ranges. When I do this column by column, results are pretty slow.
I have tried to just pile the withColumn calls until I have all of them.
Pseudo code:
import pyspark.sql.functions as pysparkSqlFunctions
## working from a data frame with 12 colums:
## - key as a String
## - time as a DateTime
## - col_{1:10} as numeric values
window_1h = Window.partitionBy("key") \
.orderBy(col("time").cast("long")) \
.rangeBetween(-3600, 0)
window_2h = Window.partitionBy("key") \
.orderBy(col("time").cast("long")) \
.rangeBetween(-7200, 0)
df = df.withColumn("col1_1h", pysparkSqlFunctions.avg("col_1").over(window_1h))
df = df.withColumn("col1_2h", pysparkSqlFunctions.avg("col_1").over(window_2h))
df = df.withColumn("col2_1h", pysparkSqlFunctions.avg("col_2").over(window_1h))
df = df.withColumn("col2_2h", pysparkSqlFunctions.avg("col_2").over(window_2h))
What I would like is the ability to add all 4 columns (or many more) in one call, hopefully traversing the data only once for better performance.
I prefer to import the functions library as F as it looks neater and it is the standard alias used in the official Spark documentation.
The star string, '*', should capture all the current columns within the dataframe. Alternatively, you could replace the star string with *df.columns. Here the star explodes the list into separate parameters for the select method.
from pyspark.sql import functions as F
df = df.select(
"*",
F.avg("col_1").over(window_1h).alias("col1_1h"),
F.avg("col_1").over(window_2h).alias("col1_2h"),
F.avg("col_2").over(window_1h).alias("col2_1h"),
F.avg("col_2").over(window_1h).alias("col2_1h"),
)
I have the following code:
val df_in = sqlcontext.read.json(jsonFile) // the file resides in hdfs
//some operations in here to create df as df_in with two more columns "terms1" and "terms2"
val intersectUDF = udf( (seq1:Seq[String], seq2:Seq[String] ) => { seq1 intersect seq2 } ) //intersects two sequences
val symmDiffUDF = udf( (seq1:Seq[String], seq2:Seq[String] ) => { (seq1 diff seq2) ++ (seq2 diff seq1) } ) //compute the difference of two sequences
val df1 = (df.withColumn("termsInt", intersectUDF(df("terms1"), df1("terms2") ) )
.withColumn("termsDiff", symmDiffUDF(df("terms1"), df1("terms2") ) )
.where( size(col("termsInt")) >0 && size(col("termsDiff")) > 0 && size(col("termsDiff")) <= 2 )
.cache()
) // add the intersection and difference columns and filter the resulting DF
df1.show()
df1.count()
The app is working properly and fast until the show() but in the count() step, it creates 40000 tasks.
My understanding is that df1.show() should be triggering the full df1 creation and df1.count() should be very fast. What am I missing here? why is count() that slow?
Thank you very much in advance,
Roxana
show is indeed an action, but it is smart enough to know when it doesn't have to run everything. If you had an orderBy it would take very long too, but in this case all your operations are map operations and so there's no need to calculate the whole final table. However, count needs to physically go through the whole table in order to count it and that's why it's taking so long. You could test what I'm saying by adding an orderBy to df1's definition - then it should take long.
EDIT: Also, the 40k tasks are likely due to the amount of partitions your DF is partitioned into. Try using df1.repartition(<a sensible number here, depending on cluster and DF size>) and trying out count again.
show() by default shows only 20 rows. If the 1st partition returned more than 20 rows, then the rest partitions will not be executed.
Note show has a lot of variations. If you run show(false) which means show all results, all partitions will be executed and may take more time. So, show() equals show(20) which is a partial action.
I have an operation that I want to perform within PySpark 2.0 that would be easy to perform as a df.rdd.map, but since I would prefer to stay inside the Dataframe execution engine for performance reasons, I want to find a way to do this using Dataframe operations only.
The operation, in RDD-style, is something like this:
def precision_formatter(row):
formatter = "%.{}f".format(row.precision)
return row + [formatter % row.amount_raw / 10 ** row.precision]
df = df.rdd.map(precision_formatter)
Basically, I have a column that tells me, for each row, what the precision for my string formatting operation should be, and I want to selectively format the 'amount_raw' column as a string depending on that precision.
I don't know of a way to use the contents of one or more columns as input to another Column operation. The closest I can come is suggesting the use of Column.when with an externally-defined set of boolean operations that correspond to the set of possible boolean conditions/cases within the column or columns.
In this specific case, for instance, if you can obtain (or better yet, already have) all possible values of row.precision, then you can iterate over that set and apply a Column.when operation for each value in the set. I believe this set can be obtained with df.select('precision').distinct().collect().
Because the pyspark.sql.functions.when and Column.when operations themselves return a Column object, you can iterate over the items in the set (however it was obtained) and keep 'appending' when operations to each other programmatically until you have exhausted the set:
import pyspark.sql.functions as PSF
def format_amounts_with_precision(df, all_precisions_set):
amt_col = PSF.when(df['precision'] == 0, df['amount_raw'].cast(StringType()))
for precision in all_precisions_set:
if precision != 0: # this is a messy way of having a base case above
fmt_str = '%.{}f'.format(precision)
amt_col = amt_col.when(df['precision'] == precision,
PSF.format_string(fmt_str, df['amount_raw'] / 10 ** precision)
return df.withColumn('amount', amt_col)
You can do it with a python UDF. They can take as many input values (values from columns of a Row) and spit out a single output value. It would look something like this:
from pyspark.sql import types as T, functions as F
from pyspark.sql.function import udf, col
# Create example data frame
schema = T.StructType([
T.StructField('precision', T.IntegerType(), False),
T.StructField('value', T.FloatType(), False)
])
data = [
(1, 0.123456),
(2, 0.123456),
(3, 0.123456)
]
rdd = sc.parallelize(data)
df = sqlContext.createDataFrame(rdd, schema)
# Define UDF and apply it
def format_func(precision, value):
format_str = "{:." + str(precision) + "f}"
return format_str.format(value)
format_udf = F.udf(format_func, T.StringType())
new_df = df.withColumn('formatted', format_udf('precision', 'value'))
new_df.show()
Also, if instead of the column precision value you wanted to use a global one, you could use the lit(..) function when you call it like this:
new_df = df.withColumn('formatted', format_udf(F.lit(2), 'value'))