Caching and Loops in (Py)Spark - apache-spark

I understand that 'for' and 'while' loops are generally to be avoided when using Spark. My question is about optimizing a 'while' loop, though if I'm missing a solution that makes it unnecessary, I am all ears.
I'm not sure I can demonstrate the issue (very long processing times, compounding as the loop goes on) with toy data, but here is some pseudocode:
# I have a function - called 'enumerator' - which involves several joins and window functions.
# I run this function on my base dataset, df0, and return df1
df1 = enumerator(df0, param1=apple, param2=banana)

# Check for some condition in df1, then count number of rows in the result
counter = df1 \
    .filter(col('X') == some_condition) \
    .count()

# If there are rows meeting this condition, start a while loop
while counter > 0:
    print('Starting with counter: ', str(counter))

    # Run the enumerator function on df1 again
    df2 = enumerator(df1, param1=apple, param2=banana)

    # Check for the condition again, then continue the while loop if necessary
    counter = df2 \
        .filter(col('X') == some_condition) \
        .count()
    df1 = df2

# After the while loop finishes, I take the last resulting dataframe
# and do several more operations and analyses downstream
final_df = df2
An essential aspect of the enumerator function is to 'look back' on a sequence in a window, and so it may take several runs before all the necessary corrections are made.
In my heart, I know this is ugly, but the windowing/ranking/sequential analysis within the function is critical. My understanding is that the underlying Spark query plan gets more and more convoluted as the loop continues. Are there any best practices I should adopt in this situation? Should I be caching at any point - either before the while loop starts, or within the loop itself?

You definitely should cache/persist the dataframes, otherwise every iteration in the while loop will start from scratch from df0. Also, you may want to unpersist dataframes you no longer need to free up disk/memory space.
Another point to optimize is not to do a full count, but to use a cheaper operation such as df.take(1). If that returns an empty list, then counter == 0.
df1 = enumerator(df0, param1=apple, param2=banana)
df1.cache()

# Check for some condition in df1 (take(1) is cheaper than a full count)
counter = len(df1.filter(col('X') == some_condition).take(1))

while counter > 0:
    print('Starting with counter: ', str(counter))
    df2 = enumerator(df1, param1=apple, param2=banana)
    df2.cache()
    counter = len(df2.filter(col('X') == some_condition).take(1))
    df1.unpersist()  # unpersist df1 as it will be overwritten
    df1 = df2

final_df = df2
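Since the query plan (lineage) still grows with every iteration even when caching, another option worth considering is truncating the lineage inside the loop with localCheckpoint() (or checkpoint(), which writes to a configured checkpoint directory). A minimal sketch, assuming the same enumerator function and condition as above:

df1 = enumerator(df0, param1=apple, param2=banana)
counter = len(df1.filter(col('X') == some_condition).take(1))

while counter > 0:
    # localCheckpoint() materializes the dataframe and cuts its lineage,
    # so the plan does not keep growing with every iteration
    df2 = enumerator(df1, param1=apple, param2=banana).localCheckpoint()
    counter = len(df2.filter(col('X') == some_condition).take(1))
    df1 = df2

final_df = df1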

Related

PySpark apply function on 2 dataframes and write to csv for billions of rows on small hardware

I am trying to apply a levenshtein function for each string in dfs against each string in dfc and write the resulting dataframe to CSV. The issue is that I'm creating so many rows by using the cross join and then applying the function that my machine is struggling to write anything (it takes forever to execute).
Trying to improve write performance:
I'm filtering out a few things on the result of the cross join, i.e. rows where the LevenshteinDistance is less than 15% of the target word's length.
I'm using bucketing on the first letter of each target word (a, b, c, etc.), still with no luck (the job runs for hours and doesn't generate any results).
from datetime import datetime
from config import config
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql import Window


def fuzzy_match(dfs, dfc, path_summary):
    """
    Implements the Levenshtein and Soundex algorithms and returns a fuzzy matched DataFrame.
    Filters out those where the resulting LS distance is less than 15% of the SF name length.
    """
    # Apply Levenshtein and Soundex functions
    dfs = dfs.withColumn("OrganisationNameKeyLen", F.length("OrganisationNameKey"))
    df = dfc\
        .crossJoin(dfs)\
        .withColumn("LevenshteinDistance", F.levenshtein(F.lower("OrganisationNameKey"), F.lower("CompanyNameKey")))\
        .withColumn("HasSameSoundex", F.soundex("OrganisationNameKey") == F.soundex("CompanyNameKey"))\
        .where("LevenshteinDistance < OrganisationNameKeyLen * 0.15")\
        .orderBy("OrganisationName", "CompanyName")
    return df
def fuzzy_match_approve(df, path_fuzzy_match_approved, path_fuzzy_match_rejected, path_summary):
    """
    Filters fuzzy matching DataFrame results into approved/rejected based on a set of conditions:
    - If there is only 1 match against the SF name
    - If more than 1 match, then take the one with an LS distance of 1
    - If more than 1 match and multiple LS distances of 1, then take the one where the Soundex codes are the same
    Writes results and summary to CSV.
    """
    def write_with_bucket(df, bucket_col, path):
        df.write\
            .mode("overwrite")\
            .bucketBy(26, bucket_col)\
            .option("path", path)\
            .option("header", True)\
            .saveAsTable("bucket", format="csv")

    # Add window function columns:
    # OrganisationNameMatchCount: count of AccountID per OrganisationName
    # LevenshteinDistance1Count: count of AccountID per OrganisationName where LevenshteinDistance = 1
    windowSpec = Window.partitionBy("OrganisationName")
    df = df\
        .select("AccountID", "OrganisationName", "OrganisationNameKey", "CompanyNumber", "CompanyName", "LevenshteinDistance", "HasSameSoundex")\
        .withColumn("OrganisationNameMatchCount", F.count("AccountID").over(windowSpec))\
        .withColumn("LevenshteinDistance1Count", F.count(F.when(F.col("LevenshteinDistance") == 1, F.col("AccountID"))).over(windowSpec))

    # Add bucket key column
    df = df.withColumn("OrganisationNameBucketKey", F.substring(col("OrganisationNameKey"), 0, 1))

    # Define fuzzy match approved conditions
    is_approved_1 = (F.col("OrganisationNameMatchCount") == 1)
    is_approved_2 = ((F.col("OrganisationNameMatchCount") > 1) & (F.col("LevenshteinDistance1Count") == 1) & (F.col("LevenshteinDistance") == 1))
    is_approved_3 = ((F.col("OrganisationNameMatchCount") > 1) & (F.col("LevenshteinDistance1Count") > 1) & (F.col("HasSameSoundex") == 'true'))
    is_approved = is_approved_1 | is_approved_2 | is_approved_3

    # Split fuzzy match results into approved and rejected
    df_approved = df.filter(is_approved)
    df_rejected = df.filter(~is_approved)

    # Export results
    # df_approved.write.csv(path_fuzzy_match_approved, mode="overwrite", header=True, quoteAll=True)
    # df_rejected.write.csv(path_fuzzy_match_rejected, mode="overwrite", header=True, quoteAll=True)
    write_with_bucket(df_approved, "OrganisationNameBucketKey", path_fuzzy_match_approved)
    write_with_bucket(df_rejected, "OrganisationNameBucketKey", path_fuzzy_match_rejected)
def main():
    spark = SparkSession...

    # Apply fuzzy match
    dfs = spark.read...
    dfc = spark.read...
    path_summary = ...
    df_fuzzy_match = fuzzy_match(dfs, dfc, path_summary)

    # Export results
    path_fuzzy_match_approved = ...
    path_fuzzy_match_rejected = ...
    fuzzy_match_approve(df_fuzzy_match, path_fuzzy_match_approved, path_fuzzy_match_rejected, path_summary)


main()
Other info:
df.rdd.getNumPartitions() is 2
dfs.count() is 12,515
dfc.count() is 5,110,430
How can I improve performance here and get the results into a CSV successfully?
There are a couple of things you can do to improve your computation:
Improve parallelism
As Nithish mentioned in the comments, you don't have enough partitions in your input dataframes to make use of all your CPU cores, which slows you down.
To increase your parallelism, repartition dfc to at least your number of cores:
dfc = dfc.repartition(dfc.sql_ctx.sparkContext.defaultParallelism)
You need to do this because your crossJoin is run as a BroadcastNestedLoopJoin which doesn't reshuffle your large input dataframe.
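A quick way to verify this (a sketch, assuming spark is your SparkSession) is to compare the partition counts of the inputs against the parallelism available:

# Sketch: compare input partitioning with the parallelism available to the cluster
print("default parallelism:", spark.sparkContext.defaultParallelism)
print("dfc partitions     :", dfc.rdd.getNumPartitions())
print("dfs partitions     :", dfs.rdd.getNumPartitions())

# repartition the large side so the cross join uses every core
dfc = dfc.repartition(spark.sparkContext.defaultParallelism)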
Separate your computation stages
A Spark dataframe/RDD is conceptually just a directed acyclic graph (DAG) of operations to run on your input data; it does not hold the data itself. One consequence of this behavior is that, by default, you'll rerun your computations as many times as you reuse your dataframe.
In your fuzzy_match_approve function, you run 2 separate filters on your df, which means you rerun the whole cross-join operation twice. You really don't want this!
One easy way to avoid this is to use cache() on your fuzzy_match result, which should be fairly small given your inputs and matching criteria.
def fuzzy_match_running(dfs, dfc, path_summary):
    """
    Implements the Levenshtein and Soundex algorithms and returns a fuzzy matched DataFrame.
    Filters out those where the resulting LS distance is less than 15% of the SF name length.
    """
    # Apply Levenshtein and Soundex functions
    dfs = dfs.withColumn("OrganisationNameKeyLen", F.length("OrganisationNameKey")).cache()
    dfc = dfc.repartition(dfc.sql_ctx.sparkContext.defaultParallelism).cache()

    df = dfc.crossJoin(dfs) \
        .withColumn("LevenshteinDistance", F.levenshtein(F.lower("OrganisationNameKey"), F.lower("CompanyNameKey"))) \
        .withColumn("HasSameSoundex", F.soundex("OrganisationNameKey") == F.soundex("CompanyNameKey")) \
        .where("LevenshteinDistance < OrganisationNameKeyLen * 0.15") \
        .orderBy("OrganisationName", "CompanyName") \
        .cache()
    return df
If I run fuzzy_match_running on some example dataframes on my 8-core/16-thread i9-9980HK laptop (Spark in local[*] mode with 8 GB driver memory):
dfc rowcount : 572494
dfs rowcount : 17728
fuzzy_match rowcount: 7228499
Duration: 679.5572581291199 seconds
Matches/core/sec: 933436.210726889
The job takes about 12 min, doing 572494 * 17728 ≈ 10 billion row comparisons at about 933k comparisons/second/core. Since your job does roughly 64 billion row comparisons, I would expect it to take about 80 min on my laptop.
You should run a similar experiment on your computer with a smaller sample to get an idea of your actual computing speed.
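For instance, a rough benchmark could look like the sketch below (the sampling fraction is arbitrary, and fuzzy_match_running is the function defined above):

import time

# Sketch: estimate throughput on small samples before running the full job
dfs_sample = dfs.sample(fraction=0.1, seed=42)
dfc_sample = dfc.sample(fraction=0.1, seed=42)

start = time.time()
matches = fuzzy_match_running(dfs_sample, dfc_sample, path_summary).count()
duration = time.time() - start

comparisons = dfs_sample.count() * dfc_sample.count()
print(f"{matches} matches, {comparisons} comparisons in {duration:.1f}s "
      f"({comparisons / duration:.0f} comparisons/sec)")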
Going further: maximizing matches/sec
To go faster, we need to adjust the computation and increase the number of comparisons that can be done per second.
A few things stand out in the function:
You filter your output by comparing the Levenshtein distance, an integer, to a decimal calculation. This means Spark will cast your integer to a decimal and operate on decimals. Comparing decimals is much slower than comparing integers, and it's unnecessary here: you can cast the bound to an int beforehand.
Your levenshtein call operates on the lowercased versions of your keys; this means that for each row comparison, Spark lowercases the column values again and again, wasting CPU cycles on redundant work. You can preprocess this before the join.
I updated the function like this:
from pyspark.sql import DataFrame


def fuzzy_match(dfs: DataFrame, dfc: DataFrame, path_summary: str) -> DataFrame:
    dfs = dfs.withColumn("OrganisationNameKeyLower", F.lower("OrganisationNameKey"))\
        .withColumn("MatchingTolerance", F.ceil(F.length("OrganisationNameKey") * 0.15).cast("int"))\
        .cache()
    dfc = dfc.repartition(dfc.sql_ctx.sparkContext.defaultParallelism)\
        .withColumn("CompanyNameKeyLower", F.lower("CompanyNameKey"))\
        .cache()

    df = dfc.crossJoin(dfs)\
        .withColumn("LevenshteinDistance", F.levenshtein(F.col("OrganisationNameKeyLower"), F.col("CompanyNameKeyLower")).cast("int")) \
        .where("LevenshteinDistance < MatchingTolerance")\
        .drop("MatchingTolerance")\
        .cache()

    # clean up caches that are no longer needed before returning
    dfs.unpersist()
    dfc.unpersist()
    return df
When running the updated version on the same inputs as before and on the same computer, I get nearly twice the performance of the first implementation:
dfc rowcount : 572494
dfs rowcount : 17728
fuzzy_match rowcount: 7228499
Duration: 356.23311281204224 seconds
Matches/core/sec: 1780641.1846241967
If that is still too slow for your needs, you'll need to find conditions on your data that you can use as a join condition, but that's highly data- and use-case-specific.
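For example, if a valid match in your data is guaranteed to share a Soundex code (or a first character), you could use that as a blocking key and replace the cross join with an equi-join, which shrinks the comparison space dramatically. A hedged sketch, reusing the column names from above:

# Sketch: block on Soundex instead of cross-joining everything.
# Only valid if every acceptable match is guaranteed to share the blocking key.
dfs_b = dfs.withColumn("BlockKey", F.soundex("OrganisationNameKey"))
dfc_b = dfc.withColumn("BlockKey", F.soundex("CompanyNameKey"))

df = (dfc_b.join(dfs_b, "BlockKey")
      .withColumn("LevenshteinDistance",
                  F.levenshtein(F.lower("OrganisationNameKey"),
                                F.lower("CompanyNameKey")).cast("int"))
      .where(F.col("LevenshteinDistance") < F.length("OrganisationNameKey") * 0.15))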

PySpark data skewness with Window Functions

I have a huge PySpark dataframe and I'm doing a series of Window functions over partitions defined by my key.
The issue is that my partitions get skewed by this key, and the resulting Event Timeline looks something like this:
I know that I can use the salting technique to solve this issue when I'm doing a join. But how can I solve it when I'm using Window functions?
I'm using functions like lag, lead, etc. in the Window functions. I can't do the process with a salted key, because I'll get wrong results.
How do I solve skewness in this case?
I'm looking for a dynamic way of repartitioning my dataframe without skewness.
Updates based on the answer from @jxc
I tried creating a sample df and running the code over it:
import numpy as np
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import Window

df = pd.DataFrame()
df['id'] = np.random.randint(1, 1000, size=150000)
df['id'] = df['id'].map(lambda x: 100 if x % 2 == 0 else x)
df['timestamp'] = pd.date_range(start=pd.Timestamp('2020-01-01'), periods=len(df), freq='60s')

sdf = sc.createDataFrame(df)  # sc is assumed to be a SparkSession here
sdf = sdf.withColumn("amt", F.rand() * 100)

w = Window.partitionBy("id").orderBy("timestamp")
sdf = sdf.withColumn("new_col", F.lag("amt").over(w) + F.lead("amt").over(w))
x = sdf.toPandas()
This gave me an event timeline like this:
I then tried the code from @jxc's answer:
sdf = sc.createDataFrame(df)
sdf = sdf.withColumn("amt", F.rand() * 100)

N = 24*3600*365*2
sdf_1 = sdf.withColumn('pid', F.ceil(F.unix_timestamp('timestamp')/N))

w1 = Window.partitionBy('id', 'pid').orderBy('timestamp')
w2 = Window.partitionBy('id', 'pid')

sdf_2 = sdf_1.select(
    '*',
    F.count('*').over(w2).alias('cnt'),
    F.row_number().over(w1).alias('rn'),
    (F.lag('amt',1).over(w1) + F.lead('amt',1).over(w1)).alias('new_val')
)
sdf_3 = sdf_2.filter('rn in (1, 2, cnt-1, cnt)') \
    .withColumn('new_val', F.lag('amt',1).over(w) + F.lead('amt',1).over(w)) \
    .filter('rn in (1,cnt)')

df_new = sdf_2.filter('rn not in (1,cnt)').union(sdf_3)
x = df_new.toPandas()
I ended up with one additional stage, and the event timeline looked even more skewed.
The run time also increased a bit with the new code.
To process a large partition, you can try to split it based on the orderBy column (most likely a numeric column, or a date/timestamp column that can be converted into a numeric one) so that all new sub-partitions maintain the correct order of rows. Process the rows with the new partitioner, and for calculations using the lag and lead functions, only rows around the boundaries between sub-partitions need to be post-processed. (Task-2 below also discusses how to merge small partitions.)
Using your example sdf, assume we have the following WindowSpec and a simple calculation over it:
w = Window.partitionBy('id').orderBy('timestamp')
df.withColumn('new_amt', F.lag('amt',1).over(w) + F.lead('amt',1).over(w))
Task-1: split large partitions:
Try the following:
Select an N to split the timestamp and set up an additional partitionBy column pid (using ceil, int, floor, etc.):
# N to cover 35-day intervals
N = 24*3600*35
df1 = sdf.withColumn('pid', F.ceil(F.unix_timestamp('timestamp')/N))
Add pid into partitionBy (see w1), then calculate row_number(), lag() and lead() over w1. Also find the number of rows (cnt) in each new partition to help identify the end of each partition (rn == cnt). The resulting new_amt will be fine for the majority of rows, except for those on the boundaries of each partition.
w1 = Window.partitionBy('id', 'pid').orderBy('timestamp')
w2 = Window.partitionBy('id', 'pid')
df2 = df1.select(
    '*',
    F.count('*').over(w2).alias('cnt'),
    F.row_number().over(w1).alias('rn'),
    (F.lag('amt',1).over(w1) + F.lead('amt',1).over(w1)).alias('new_amt')
)
Below is an example df2 showing the boundary rows.
Process the boundaries: select rows which are on the boundaries (rn in (1, cnt)) plus those whose values are used in the calculation (rn in (2, cnt-1)), do the same calculation of new_amt over w, and keep the result for boundary rows only.
df3 = df2.filter('rn in (1, 2, cnt-1, cnt)') \
    .withColumn('new_amt', F.lag('amt',1).over(w) + F.lead('amt',1).over(w)) \
    .filter('rn in (1,cnt)')
Below is the resulting df3 from the above df2.
Merge df3 back into df2 to update the boundary rows (rn in (1, cnt)):
df_new = df2.filter('rn not in (1,cnt)').union(df3)
The screenshot below shows the final df_new around the boundary rows:
# drop columns which are used to implement logic only
df_new = df_new.drop('cnt', 'rn')
Some Notes:
The following 3 WindowSpecs are defined:
w = Window.partitionBy('id').orderBy('timestamp') <-- fix boundary rows
w1 = Window.partitionBy('id', 'pid').orderBy('timestamp') <-- calculate internal rows
w2 = Window.partitionBy('id', 'pid') <-- find #rows in a partition
Note: strictly speaking, it is better to use the following w to fix boundary rows, to avoid issues with tied timestamps around the boundaries.
w = Window.partitionBy('id').orderBy('pid', 'rn') <-- fix boundary rows
If you know which partitions are skewed, just divide them and skip the others; the existing method might split a small partition into 2 or more pieces if its rows are sparsely distributed:
df1 = df.withColumn('pid', F.when(F.col('id').isin('a','b'), F.ceil(F.unix_timestamp('timestamp')/N)).otherwise(1))
If, for each partition, you can retrieve count (the number of rows) and min_ts = min(timestamp), then try something more dynamic for pid (below, M is the threshold number of rows for splitting; a fuller sketch follows after these notes):
F.expr(f"IF(count>{M}, ceil((unix_timestamp(timestamp)-unix_timestamp(min_ts))/{N}), 1)")
Note: skewness inside a partition will require more complex functions to generate pid.
If only the lag(1) function is used, just post-process the left boundaries: filter by rn in (1, cnt) and update only rn == 1.
df3 = df2.filter('rn in (1, cnt)') \
    .withColumn('new_amt', F.lag('amt',1).over(w)) \
    .filter('rn = 1')
Similarly for the lead function, where we only need to fix the right boundaries and update rn == cnt.
If only lag(2) is used, then filter and update more rows in df3:
df3 = df2.filter('rn in (1, 2, cnt-1, cnt)') \
    .withColumn('new_amt', F.lag('amt',2).over(w)) \
    .filter('rn in (1,2)')
You can extend the same method to mixed cases with both lag and lead having different offsets.
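To make the dynamic pid idea from the notes above concrete, here is a minimal sketch (assuming the same sdf with id, timestamp and amt columns; M and N are assumed thresholds):

M = 50000            # row-count threshold above which a partition is split (assumed)
N = 24 * 3600 * 35   # sub-partition width in seconds (assumed)

# per-id statistics: row count and earliest timestamp
stats = sdf.groupBy('id').agg(
    F.count('*').alias('count'),
    F.min('timestamp').alias('min_ts'))

# only ids exceeding M rows get split into time-based sub-partitions
df1 = sdf.join(F.broadcast(stats), 'id') \
    .withColumn('pid', F.expr(
        f"IF(count > {M}, ceil((unix_timestamp(timestamp) - unix_timestamp(min_ts)) / {N}), 1)")) \
    .drop('count', 'min_ts')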
Task-2: merge small partitions:
Based on the number of records per partition (count), we can set up a threshold M so that if count > M, the id holds its own partition; otherwise we merge partitions so that the total number of records is less than M (the method below has an edge case of 2*M-2).
M = 20000

# create a pandas df with columns `id`, `count` and `f`; sort rows so that rows with count >= M are on top
d2 = pd.DataFrame([e.asDict() for e in sdf.groupby('id').count().collect()]) \
    .assign(f=lambda x: x['count'].lt(M)) \
    .sort_values('f')

# add a pid column to merge smaller partitions, keeping the total row count per partition less than or around M.
# potentially there could be up to `2*M-2` records for the same pid; to strictly ensure count < M,
# use a for-loop to iterate over d2 and set pid:
d2['pid'] = (d2.mask(d2['count'].gt(M),M)['count'].shift(fill_value=0).cumsum()/M).astype(int)

# add pid to sdf. In case the join is too heavy, try using a Map
sdf_1 = sdf.join(spark.createDataFrame(d2).alias('d2'), ["id"]) \
    .select(sdf["*"], F.col("d2.pid"))

# check pid: # of records and # of distinct ids
sdf_1.groupby('pid').agg(F.count('*').alias('count'), F.countDistinct('id').alias('cnt_ids')).orderBy('pid').show()
+---+-----+-------+
|pid|count|cnt_ids|
+---+-----+-------+
| 0|74837| 1|
| 1|20036| 133|
| 2|20052| 134|
| 3|20010| 133|
| 4|15065| 100|
+---+-----+-------+
Now the new Window should be partitioned by pid alone, with id moved into orderBy; see below:
w3 = Window.partitionBy('pid').orderBy('id','timestamp')
Customize the lag/lead functions based on the above w3 WindowSpec, and then calculate new_val:
lag_w3 = lambda col,n=1: F.when(F.lag('id',n).over(w3) == F.col('id'), F.lag(col,n).over(w3))
lead_w3 = lambda col,n=1: F.when(F.lead('id',n).over(w3) == F.col('id'), F.lead(col,n).over(w3))
sdf_new = sdf_1.withColumn('new_val', lag_w3('amt',1) + lead_w3('amt',1))
To handle such skewed data, there are a couple of things you can try.
If you are using Databricks to run your jobs and you know which column has the skew, you can try an option called a skew hint.
I also recommend moving to Spark 3.0, since you will have the option to use Adaptive Query Execution (AQE), which can handle most of these issues, improving your job health and potentially running your jobs faster.
Usually I suggest repartitioning your data into more evenly sized partitions before any wide operation. Increasing the cluster size also helps, but I am not sure whether that will work for you.
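For reference, a minimal sketch of enabling AQE and its skew-join handling in Spark 3.x (the keys are the standard Spark configuration names; the Databricks skew hint is shown as a comment and applies to joins):

# Sketch: turn on Adaptive Query Execution and skew-join handling (Spark 3.x)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# On Databricks, a skew hint can also be declared on the known-skewed column, e.g.:
# df.hint("skew", "id")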

How to optimize searching RDD contents in another RDD

I have a problem where I need to search the contents of an RDD in another RDD.
This question is different from Efficient string matching in Apache Spark, as I am searching for an exact match and I don't need the overhead of using the ML stack.
I am new to Spark and I want to know which of these methods is more efficient, or if there is another way.
I have a keyword file like the sample below (in production it might reach up to 200 lines):
Sample keywords file
0.47uF 25V X7R 10% -TDK C2012X7R1E474K125AA
20pF-50V NPO/COG - AVX- 08055A200JAT2A
and I have another file (tab-separated) in which I need to find matches (in production, up to 80 million lines):
C2012X7R1E474K125AA Conn M12 Circular PIN 5 POS Screw ST Cable Mount 5 Terminal 1 Port
First method
I defined a UDF and looped through the keywords for each line:
import re
from pyspark.sql import Row
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

keywords = sc.textFile("keys")
part_description = sc.textFile("part_description")


def build_regex(keywords):
    res = '('
    for key in keywords:
        res += '(?<!\\\s)%s(?!\\\s)|' % re.escape(key)
    res = res[0:len(res) - 1] + ')'
    return r'%s' % res


def get_matching_string(line, regex):
    matches = re.findall(regex, line, re.IGNORECASE)
    matches = list(set(matches))
    return list(set(matches)) if matches else None


def find_matching_regex(line):
    result = list()
    for keyGroup in keys:
        matches = get_matching_string(line, keyGroup)
        if matches:
            result.append(str(keyGroup) + '~~' + str(matches) + '~~' + str(len(matches)))
    if len(result) > 0:
        return result


def split_row(list):
    try:
        return Row(list[0], list[1])
    except:
        return None


keys_rdd = keywords.map(lambda keywords: build_regex(keywords.replace(',', ' ').replace('-', ' ').split(' ')))
keys = keys_rdd.collect()
sc.broadcast(keys)

part_description = part_description.map(lambda item: item.split('\t'))
df = part_description.map(lambda list: split_row(list)).filter(lambda x: x).toDF(
    ["part_number", "description"])

find_regex = udf(lambda line: find_matching_regex(line), ArrayType(StringType()))
df = df.withColumn('matched', find_regex(df['part_number']))
df = df.filter(df.matched.isNotNull())
df.write.save(path=job_id, format='csv', mode='append', sep='\t')
Second method
I thought I could get more parallel processing (instead of looping through the keys like above), so I did a cartesian product between the keys and the lines, split and exploded the keys, then compared each key to the part-number column:
df = part_description.cartesian(keywords)
df = df.map(lambda tuple: (tuple[0].split('\t'), tuple[1])).map(
lambda tuple: (tuple[0][0], tuple[0][1], tuple[1]))
df = df.toDF(['part_number', 'description', 'keywords'])
df = df.withColumn('single_keyword', explode(split(F.col("keywords"), "\s+"))).where('keywords != ""')
df = df.withColumn('matched_part_number', (df['part_number'] == df['single_keyword']))
df = df.filter(df['matched_part_number'] == F.lit(True))
df.write.save(path='part_number_search', format='csv', mode='append', sep='\t')
Are these the correct ways to do this? Is there anything I can do to process these data faster?
These are both valid solutions, and I have used both in different circumstances.
Your broadcast approach communicates less data, sending only ~200 extra lines to each executor, as opposed to replicating each line of your >80M-line file 200 times, so it is likely that this one will end up being faster for you.
I have used the cartesian approach when the lookup is too large to feasibly broadcast (much, much larger than 200 lines).
In your situation, I would use broadcast.
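As an illustration of the broadcast route with the DataFrame API (a hedged sketch; the file paths and keyword-splitting rule mirror the question, and spark is assumed to be the SparkSession):

from pyspark.sql import functions as F

# Sketch: explode the small keyword file into one keyword per row and broadcast it
keywords_df = (spark.read.text("keys")
               .select(F.explode(F.split(
                   F.regexp_replace("value", "[,-]", " "), r"\s+")).alias("keyword"))
               .where("keyword != ''"))

# part file: tab-separated, first column is the part number
parts_df = (spark.read.text("part_description")
            .select(F.split("value", "\t").alias("cols"))
            .select(F.col("cols")[0].alias("part_number"),
                    F.col("cols")[1].alias("description")))

# exact equi-join against the broadcast keyword table
matched = parts_df.join(F.broadcast(keywords_df),
                        parts_df.part_number == keywords_df.keyword)
matched.write.csv("part_number_search", mode="append", sep="\t")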

Is .show() a Spark action? [duplicate]

I have the following code:
val df_in = sqlcontext.read.json(jsonFile) // the file resides in hdfs
//some operations in here to create df as df_in with two more columns "terms1" and "terms2"
val intersectUDF = udf( (seq1:Seq[String], seq2:Seq[String] ) => { seq1 intersect seq2 } ) //intersects two sequences
val symmDiffUDF = udf( (seq1:Seq[String], seq2:Seq[String] ) => { (seq1 diff seq2) ++ (seq2 diff seq1) } ) //compute the difference of two sequences
val df1 = (df.withColumn("termsInt", intersectUDF(df("terms1"), df("terms2")))
             .withColumn("termsDiff", symmDiffUDF(df("terms1"), df("terms2")))
             .where(size(col("termsInt")) > 0 && size(col("termsDiff")) > 0 && size(col("termsDiff")) <= 2)
             .cache()
          ) // add the intersection and difference columns and filter the resulting DF
df1.show()
df1.count()
The app works properly and is fast up to the show(), but the count() step creates 40,000 tasks.
My understanding is that df1.show() should trigger the full creation of df1, so df1.count() should then be very fast. What am I missing here? Why is count() so slow?
Thank you very much in advance,
Roxana
show is indeed an action, but it is smart enough to know when it doesn't have to run everything. If you had an orderBy it would take very long too, but in this case all your operations are map operations and so there's no need to calculate the whole final table. However, count needs to physically go through the whole table in order to count it and that's why it's taking so long. You could test what I'm saying by adding an orderBy to df1's definition - then it should take long.
EDIT: Also, the 40k tasks are likely due to the number of partitions your DF is split into. Try df1.repartition(<a sensible number here, depending on cluster and DF size>) and then try count again.
show() by default shows only 20 rows. If the first partition returned more than 20 rows, the remaining partitions will not be executed.
Note that show has several variants. If you run show(n, false) with a large n to display many more rows without truncation, more (possibly all) partitions will be executed and it may take more time. So show() is equivalent to show(20), which is a partial action.

