Why is pyspark write() so slow compared to show()? - apache-spark

I'm using Pyspark 3.1.1 locally for a simple calculation, without changing any configs apart from setMaster("local").
I have a very large file which I read into Pyspark: a 17.7 GB text file with over 87 million lines, which I read in as a CSV with a single column.
Here's some code showing what I'm doing. B is a long string that is cut into pieces of 3 characters, which are then exploded into individual rows and used for joining.
from pyspark.sql import functions as f
from pyspark.sql.types import ArrayType, StringType
df1 = spark.read.csv(path_to_file1, schema=schema1)
df1 = df1.withColumn('A', f.expr("substring(data, 1, 10)"))
df1 = df1.withColumn('B', f.expr("substring(data, 11, length(data)-13)")).drop("data")
split_data = f.udf(lambda x: [x[i:i+3] for i in range(0, len(x), 3)], ArrayType(StringType()))
df1 = df1.withColumn('B', split_data(f.col('B')))
df1 = df1.withColumn('B', f.explode(f.col('B')))
df2 = spark.read.csv(path_to_file2, schema=schema2)
df2 = f.broadcast(df2)
df = df1.join(df2, on='B', how='inner')
When I do df.coalesce(1).show() in the end everything just takes a couple of seconds.
However, when I do df.write.csv() instead I get 142 stages, each taking around 22 seconds to complete, adding up to over 50 minutes! Each stage writes between 11 and 19 MB of results. When I add a coalesce(1) before the write it's a single stage that of course complains about insufficient memory, but it also seems to take a horribly long time. I did not wait for it to finish because of the memory warnings.
=======================================
Why is the write call so much slower? Both calls return the final result, so shouldn't both be executing the entire DAG? How can just writing results to disk take so much longer than everything else?
On the other side, when just using show, how does Pyspark read a 17.7 GB csv file and do all the operations, including the explode and the join, within just a few seconds? Does the DAG even compute the entire dataset or just a chunk? I do coalesce after all.
I did experiment with setting spark.sql.shuffle.partitions to 4 or 8; this has helped me before with smaller data sets in this situation, but now it doesn't seem to matter, I always get 142 tasks.
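For reference, this is roughly how I set it (a minimal sketch; the session setup is simplified and the app name is made up):
from pyspark.sql import SparkSession
# Simplified local session; the shuffle-partition setting is the only point here.
spark = (SparkSession.builder
    .master("local")
    .appName("explode-join")
    .config("spark.sql.shuffle.partitions", 8)  # I tried 4 and 8
    .getOrCreate())
# Or on an already running session:
spark.conf.set("spark.sql.shuffle.partitions", 8)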
=======================================
Where is the actual bottleneck in this scenario?
How can I speed up the write-operation?
Would this be a problem on a cluster too or is it a local problem?

Related

Why is UDF not running in parallel on available executors?

I have a tiny Spark DataFrame that essentially pushes a string into a UDF. Because of .repartition(3), which matches the length of targets, I expect the processing inside run_sequential to be spread over the available executors - i.e. applied on 3 different executors.
The issue is that only 1 executor is used. How can I parallelise this processing to force my pyspark script to assign each element of targets to a different executor?
import pandas as pd
import pyspark.sql.functions as F

def run_parallel(config):
    def run_sequential(target):
        # process with target variable
        pass
    return F.udf(run_sequential)

targets = ["target_1", "target_2", "target_3"]
config = {}
pdf = spark.createDataFrame(pd.DataFrame({"targets": targets})).repartition(3)
pdf.withColumn(
    "apply_udf", run_parallel(config)("targets")
).collect()
The issue here is that repartitioning a DataFrame does not guarantee that all the created partitions will be of the same size. With such a small number of records there is a pretty high chance that some of them will map into the same partition. Spark is not meant to process such small datasets and its algorithms are tailored to work efficiently with large amounts of data - if your dataset has 3 million records and you split it in 3 partitions of approximately 1 million records each, a difference of several records per partition will be insignificant in most cases. This is obviously not the case when repartitioning 3 records.
You can use df.rdd.glom().map(len).collect() to examine the size of the partitions before and after repartitioning to see how the distribution changes.
$ pyspark --master "local[3]"
...
>>> pdf = spark.createDataFrame([("target_1",), ("target_2",), ("target_3",)]).toDF("targets")
>>> pdf.rdd.glom().map(len).collect()
[1, 1, 1]
>>> pdf.repartition(3).rdd.glom().map(len).collect()
[0, 2, 1]
As you can see, the resulting partitioning is uneven and the first partition in my case is actually empty. The irony here is that the original dataframe has the desired property and that one is getting destroyed by repartition().
While your particular case is not what Spark typically targets, it is still possible to forcefully distribute three records in three partitions. All you need to do is to provide an explicit partition key. RDDs have the zipWithIndex() method that extends each record with its ID. The ID is the perfect partition key since its value starts with 0 and increases by 1.
>>> new_df = (pdf
.coalesce(1) # not part of the solution - see below
.rdd # Convert to RDD
.zipWithIndex() # Append ID to each record
.map(lambda x: (x[1], x[0])) # Make record ID come first
.partitionBy(3) # Repartition
.map(lambda x: x[1]) # Remove record ID
.toDF()) # Turn back into a dataframe
>>> new_df.rdd.glom().map(len).collect()
[1, 1, 1]
In the above code, coalesce(1) is added only to demonstrate that the final partitioning is not influenced by the fact that pdf initially has one record in each partition.
A DataFrame-only solution is to first coalesce pdf to a single partition and then use repartition(3). With no partitioning column(s) provided, DataFrame.repartition() uses the round-robin partitioner and hence the desired partitioning will be achieved. You cannot simply do pdf.coalesce(1).repartition(3) since Catalyst (the Spark query optimisation engine) optimises out the coalesce operation, so a partitioning-dependent operation must be inserted in between. Adding a column containing F.monotonically_increasing_id() is a good candidate for such an operation.
>>> new_df = (pdf
.coalesce(1)
.withColumn("id", F.monotonically_increasing_id())
.repartition(3))
>>> new_df.rdd.glom().map(len).collect()
[1, 1, 1]
Note that, unlike in the RDD-based solution, coalesce(1) is required as part of the solution.
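To see why, a quick sanity check (a sketch in the same session; the exact sizes may vary) is to run the pipeline without coalesce(1) and inspect the partition sizes again:
>>> no_coalesce = (pdf
.withColumn("id", F.monotonically_increasing_id())
.repartition(3))
>>> no_coalesce.rdd.glom().map(len).collect()  # may be uneven, unlike the [1, 1, 1] above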

How to perform parallel computation on Spark Dataframe by row?

I have a collection of 300 000 points and I would like to compute the distance between them.
id    x    y
0     1    0
1    28   76
…
Thus I do a Cartesian product between those points and filter so that I keep only one combination of each pair of points. Indeed, for my purpose the distance between points (0, 1) is the same as between (1, 0).
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
import math

@udf(returnType=IntegerType())
def compute_distance(x1, y1, x2, y2):
    # Euclidean distance, cast to int to match the declared return type
    return int(math.sqrt(math.pow(x1 - x2, 2) + math.pow(y1 - y2, 2)))

columns = ['id', 'x', 'y']
data = [(0, 1, 0), (1, 28, 76), (2, 33, 42)]

spark = SparkSession \
    .builder \
    .appName('distance computation') \
    .config('spark.sql.execution.arrow.pyspark.enabled', 'true') \
    .config('spark.executor.memory', '2g') \
    .master('local[20]') \
    .getOrCreate()

rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)

result = df.alias('a') \
    .join(df.alias('b'),
          F.array(*['a.id']) < F.array(*['b.id'])) \
    .withColumn('distance', compute_distance(F.col('a.x'), F.col('a.y'), F.col('b.x'), F.col('b.y')))

result.write.parquet('distance-between-points')
While that seems to work, the CPU usage for my latest task (parquet at NativeMethodAccessorImpl.java:0) did not go above 100%. Also, it took about a day to complete.
I would like to know if the withColumn operation is performed on multiple executors in order to achieve parallelism?
Is there a way to split the data in order to compute distance by batch and to store the result in one or multiple Parquet files?
Thanks for your insight.
I would like to know if the withColumn operation is performed on multiple executors in order to achieve parallelism?
Yes, assuming a correctly configured cluster, the dataframe will be partitioned across your cluster and the executors will work through the partitions in parallel running your UDF.
Is there a way to split the data in order to compute distance by batch in parallel and to store the result in one or multiple Parquet files?
By default, the resulting dataframe will be partitioned across the cluster and written out as one Parquet file per partition. You can change that by repartitioning if required, but that will result in a shuffle and take longer.
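For example, a sketch of controlling the number of output files by repartitioning before the write (the target of 8 files is arbitrary):
# One Parquet file per partition, so 8 partitions -> 8 files.
result.repartition(8).write.parquet('distance-between-points')
# Or, when only reducing the number of files, coalesce avoids a full shuffle:
result.coalesce(8).write.parquet('distance-between-points')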
I recommend the 'Level of Parallelism' section in the Learning Spark book for further reading.

pyspark df.count() taking a very long time (or not working at all)

I have the following code that is simply doing some joins and then outputting the data:
from pyspark.sql.functions import udf, struct
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark import SparkConf
from pyspark.sql.functions import broadcast
conf = SparkConf()
conf.set('spark.logConf', 'true')
spark = SparkSession \
.builder \
.config(conf=conf) \
.appName("Generate Parameters") \
.getOrCreate()
spark.sparkContext.setLogLevel("OFF")
df1 = spark.read.parquet("/location/mydata")
df1 = df1.select([c for c in df1.columns if c in ['sender','receiver','ccc','cc','pr']])
df2 = spark.read.csv("/location/mydata2")
cond1 = [(df1.sender == df2._c1) | (df1.receiver == df2._c1)]
df3 = df1.join(broadcast(df2), cond1)
df3 = df3.select([c for c in df3.columns if c in['sender','receiver','ccc','cc','pr']])
df1 is 1,862,412,799 rows and df2 is 8679 rows
when I then call:
df3.count()
It just seems to sit there with the following
[Stage 33:> (0 + 200) / 200]
Assumptions for this answer:
df1 is the dataframe containing 1,862,412,799 rows.
df2 is the dataframe containing 8679 rows.
df1.count() returns a value quickly (as per your comment)
There may be three areas where the slowdown is occurring:
The imbalance of data sizes (1,862,412,799 vs 8679):
Although Spark is amazing at handling large quantities of data, it doesn't deal well with very small sets. If not specifically set, Spark attempts to partition your data into multiple parts, and on small files the number of partitions can be excessively high compared to the actual amount of data each part holds. I recommend trying the following and seeing if it improves speed.
df2 = spark.read.csv("/location/mydata2")
df2 = df2.repartition(2)
Note: The number 2 here is just an estimate, based on how many partitions would suit the number of rows in that set.
Broadcast Cost:
The delay in the count may be due to the actual broadcast step. Your data is being saved and copied to every node within your cluster before the join, and this all happens once count() is called. Depending on your infrastructure, this could take some time. If the repartition above doesn't help, try removing the broadcast call. If that turns out to be the cause of the delay, it is worth confirming that there are no bottlenecks within your cluster, or whether the broadcast is necessary at all.
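For example, based on your snippet:
# Current version, with an explicit broadcast hint:
df3 = df1.join(broadcast(df2), cond1)
# Variant without the hint, letting Spark choose the join strategy itself:
df3 = df1.join(df2, cond1)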
Unexpected Merge Explosion
I do not imply that this is an issue, but it is always good to check that the join condition you have set is not creating unexpected duplicates. It is possible that this is happening and causing the slowdown you are experiencing when the processing of df3 is actioned.
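A quick way to check is to look for duplicated join keys on the small side before joining (a sketch using the _c1 column from your snippet):
# Every key that appears more than once in df2 multiplies the matching df1 rows.
dup_keys = df2.groupBy("_c1").count()
dup_keys.filter(dup_keys["count"] > 1).show()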

Spark-Cassandra write takes longer than expected

I have a Spark job that reads data from one Cassandra table and dumps the result back into two tables with slight modifications. My problem is that the job takes much longer than expected.
The code is as follows:
val range = sc.parallelize(0 to 100)
val rdd1 = range.map(x => (some_value, x)).joinWithCassandraTable[Event](keyspace_name, table2).select("col1", "col2", "col3", "col4", "col5", "col6", "col7").map(x => x._2)
val rdd2: RDD[((Int, String, String, String), Iterable[Event])] = rdd1.keyBy(r => (r.col1, r.col2, r.col3, r.col4 )).groupByKey
val rdd3 = rdd2.mapValues(iter => someFunction(iter.toList.sorted))
//STORE 1
rdd3.map(r => (r._1._1, r._1._2, r._1._3, r._1._4, r._2.split('|')(1).toDouble )).saveToCassandra(keyspace_name, table1, SomeColumns("col1","col2", "col3","col4", "col5"))
//STORE 2
rdd3.map(r => (to, r._1%100, to, "MANUAL_"+r._1+"_"+r._2+"_"+r._3+"_"+r._4+"_"+java.util.UUID.randomUUID(), "M", to, r._4, r._3, r._1, r._5, r._2) ).saveToCassandra(keyspace_name, table2, SomeColumns("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9", "col10", "col11"))
For around a million records, STORE 1 takes close to 40 seconds and STORE 2 (a slight modification of rdd3) takes more than a minute. I am not sure where I am going wrong or why it is taking so much time. My Spark environment is as follows:
DSE 4.8.9 with 6 nodes
70 GB RAM
12 cores each
Any help is appreciated.
Let me take a guess; logs, performance monitoring output and the C* data model would be needed for a more precise answer.
But some math. You have:
joinWithCassandraTable: random C* reads
saveToCassandra: sequential C* writes
a Spark repartition / reduce (the groupByKey)
(I expect saveToCassandra to take about half of the total time.)
And if you do not run any queries beforehand, you also need to subtract 12-20 seconds for Spark to start executors and do other setup work.
So for 1M entries on 6 nodes in 40 seconds you get:
1000000 / 6 / 40 = 4166 records/sec/node. That's not bad; 10K/s per node with a mixed workload is a good result.
The second write is about twice as big (11 columns compared to 5) and it runs after the first one, so I expect Cassandra to start spilling the previous data to disk at that moment, which can cause further performance degradation.
Do I understand correctly that when you add an rdd3.cache() call, nothing changes for the second run? That is strange.
And yes, you can get better results by tuning the C* data model and the Spark/C* parameters.

My spark app is too slow, how can I increase the speed significantly?

This is part of my Spark code, which is very slow. By slow I mean that for 70 million data rows it takes almost 7 minutes to run, but I need it to run in under 5 seconds if possible. I have a cluster with 5 Spark nodes, 80 cores and 177 GB of memory, of which 33 GB are currently used.
range_expr = col("created_at").between(
datetime.now()-timedelta(hours=timespan),
datetime.now()-timedelta(hours=time_delta(timespan))
)
article_ids = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="table", keyspace=source).load().where(range_expr).select('article','created_at').repartition(64*2)
axes = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="table", keyspace=source).load()
#article_ids.join(axes,article_ids.article==axes.article)
speed_df = article_ids.join(axes,article_ids.article==axes.article).select(axes.article,axes.at,axes.comments,axes.likes,axes.reads,axes.shares) \
.map(lambda x:(x.article,[x])).reduceByKey(lambda x,y:x+y) \
.map(lambda x:(x[0],sorted(x[1],key=lambda y:y.at,reverse = False))) \
.filter(lambda x:len(x[1])>=2) \
.map(lambda x:x[1][-1]) \
.map(lambda x:(x.article,(x,(x.comments if x.comments else 0)+(x.likes if x.likes else 0)+(x.reads if x.reads else 0)+(x.shares if x.shares else 0))))
I believe especially this part of the code is particularly slow:
sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="table", keyspace=source).load()
When run in Spark it transforms into the following, which I think causes it to be slow:
javaToPython at NativeMethodAccessorImpl.java:-2
Any help would really be appreciated. Thanks
EDIT
The biggest speed problem seems to be JavatoPython. The attached picture is only for part of my data and is already very slow.
EDIT (2)
About len(x[1])>=2:
Sorry for the long elaboration, but I really hope I can solve this problem, so making people understand a quite complex problem in detail is crucial.
This is my RDD example:
rdd1 = [(1,3),(1,5),(1,6),(1,9),(2,10),(2,76),(3,8),(4,87),(4,96),(4,109),(5,10),(6,19),(6,18),(6,65),(6,43),(6,81),(7,12),(7,96),(7,452),(8,59)]
After the spark transformation rdd1 has this form:
rdd_result = [(1,9),(2,76),(4,109),(6,81),(7,452)]
the result does not contain (3,8), (5,10) or (8,59) because the keys 3, 5 and 8 only occur once; I don't want those keys to appear.
Below is my program:
First: rdd1 reduceByKey, then the result is:
rdd_reduceByKey = [(1,[3,5,6,9]), (2,[10,76]), (3,[8]), (4,[87,96,109]), (5,[10]), (6,[19,18,65,43,81]), (7,[12,96,452]), (8,[59])]
Second: rdd_reduceByKey filtered by len(x[1])>=2, then the result is:
rdd_filter = [(1,[3,5,6,9]), (2,[10,76]), (4,[87,96,109]), (6,[19,18,65,43,81]), (7,[12,96,452])]
So the len(x[1])>=2 is necessary but slow.
Any recommendation improvements would be hugely appreciated.
A few things I would do if I ran into a performance issue:
Check the Spark web UI and find the slowest part.
The lambda functions are really suspicious.
Check the executor configuration.
Store some of the data in an intermediate table (see the sketch after this list).
Compare whether storing the data in Parquet helps.
Compare whether using Scala helps.
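For the intermediate-storage idea, a minimal sketch (the Parquet path is made up; adapt it to your environment):
# Materialise the joined data once, then iterate on the slow Python logic
# against the Parquet copy instead of re-reading everything from Cassandra.
joined = article_ids.join(axes, article_ids.article == axes.article) \
    .select(axes.article, axes.at, axes.comments, axes.likes, axes.reads, axes.shares)
joined.write.mode("overwrite").parquet("/tmp/articles_joined")  # illustrative path
speed_input = sqlContext.read.parquet("/tmp/articles_joined")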
EDIT:
Using Scala instead of Python could do the trick if JavaToPython is the slowest part.
Here is the code for finding the latest/largest. It should be O(N log N), and most likely close to O(N), since the sorting is on small per-key data sets.
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
val data = Seq((1,3),(1,5),(1,6),(1,9),(2,10),
(2,76),(3,8),(4,87),(4,96),(4,109),
(5,10),(6,19),(6,18),(6,65),(6,43),
(6,81),(7,12),(7,96),(7,452),(8,59))
val df = sqlContext.createDataFrame(data)
val dfAgg = df.groupBy("_1").agg(collect_set("_2").alias("_2"))
val udfFirst= udf[Int, WrappedArray[Int]](_.head)
val dfLatest = dfAgg.filter(size($"_2") > 1).
select($"_1", udfFirst(sort_array($"_2", asc=false)).alias("latest"))
dfLatest.show()
