Does repartition not shuffle data to all nodes? - apache-spark

Consider the following simple example, run in the Spark shell connected to a cluster with 4 executors:
scala> val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5, 6), 4).cache.setName("rdd")
rdd: org.apache.spark.rdd.RDD[Int] = rdd ParallelCollectionRDD[0] at parallelize at <console>:27
scala> rdd.count()
res0: Long = 6
scala> val singlePartition = rdd.repartition(1).cache.setName("singlePartition")
singlePartition: org.apache.spark.rdd.RDD[Int] = singlePartition MapPartitionsRDD[4] at repartition at <console>:29
scala> singlePartition.count()
res1: Long = 6
scala> val multiplePartitions = singlePartition.repartition(6).cache.setName("multiplePartitions")
multiplePartitions: org.apache.spark.rdd.RDD[Int] = multiplePartitions MapPartitionsRDD[8] at repartition at <console>:31
scala> multiplePartitions.count()
res2: Long = 6
The original rdd has 4 partitions which, when I check in the UI, are distributed across the 4 executors. The singlePartition RDD is obviously contained on only one executor. When the multiplePartitions RDD is created by repartitioning the singlePartition RDD, I would expect that to shuffle the data across the 4 executors. What I see instead is that multiplePartitions does have 6 partitions, but they are all on one executor, the same one where singlePartition has its partition.
Shouldn't the data be shuffled across the 4 Executors by the repartition?
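For what it's worth, here is a rough diagnostic sketch (not an explanation of the scheduling) that records which executor host actually computes each partition of multiplePartitions, using only the standard RDD API:

// Tag each partition with the hostname of the executor that processes it
// and the number of elements it holds, then bring the result to the driver.
val placement = multiplePartitions.mapPartitionsWithIndex { (idx, iter) =>
  val host = java.net.InetAddress.getLocalHost.getHostName
  Iterator((idx, host, iter.size))
}.collect()
placement.foreach(println) // if the UI is right, every tuple should report the same host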

Related

Jobs are not shown on the Spark WebUI

I am a new user of Spark. I installed Spark, used Anaconda to install PySpark, and then ran the basic code in the Jupyter notebook given below. When I open the Spark Web UI, however, I am unable to see any jobs, either running or completed. Any comments are appreciated.
from pyspark.sql import SparkSession
spark = SparkSession.builder\
.master("local")\
.appName("NQlabtop")\
.config('spark.ui.port', '4050')\
.getOrCreate()
sc = spark.sparkContext
input_file=sc.textFile("C:/Users/nqazi/NQ/anscombe.json")
map = input_file.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))
counts = map.reduceByKey(lambda a, b: a + b)
print("counts",counts)
sc = spark.sparkContext
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
Please see the image of the Spark Web UI below. I am not sure why I cannot see any jobs, as I think it should display the completed jobs.
There are two types of functions in PySpark (Spark): transformations and actions. Transformations are lazily evaluated, and PySpark doesn't run any job until you call an action such as show, count, or collect. The code above only applies transformations (the print call just prints the RDD object), so no job is submitted and nothing appears in the Web UI; calling an action, for example counts.collect(), would trigger a job that then shows up.

Question about joining dataframes in Spark

Suppose I have two partitioned dataframes:
df1 = spark.createDataFrame(
[(x,x,x) for x in range(5)], ['key1', 'key2', 'time']
).repartition(3, 'key1', 'key2')
df2 = spark.createDataFrame(
[(x,x,x) for x in range(7)], ['key1', 'key2', 'time']
).repartition(3, 'key1', 'key2')
(Scenario 1) If I join them by [key1, key2], the join operation is performed within each partition without a shuffle (the number of partitions in the result dataframe is the same):
x = df1.join(df2, on=['key1', 'key2'], how='left')
assert x.rdd.getNumPartitions() == 3
(Scenario 2) But if I join them by [key1, key2, time], a shuffle operation takes place (the number of partitions in the result dataframe is 200, which is driven by the spark.sql.shuffle.partitions option):
x = df1.join(df2, on=['key1', 'key2', 'time'], how='left')
assert x.rdd.getNumPartitions() == 200
At the same time, groupBy and window operations by [key1, key2, time] preserve the number of partitions and are done without a shuffle:
x = df1.groupBy('key1', 'key2', 'time').agg(F.count('*'))
assert x.rdd.getNumPartitions() == 3
I can't understand: is this a bug, or are there reasons for performing a shuffle operation in the second scenario? And how can I avoid the shuffle, if that's possible?
I guess I was able to figure out the reason for the different results in Python and Scala.
The reason is the broadcast join optimisation. If spark-shell is started with broadcast joins disabled, both Python and Scala work identically.
./spark-shell --conf spark.sql.autoBroadcastJoinThreshold=-1
val df1 = Seq(
(1, 1, 1)
).toDF("key1", "key2", "time").repartition(3, col("key1"), col("key2"))
val df2 = Seq(
(1, 1, 1),
(2, 2, 2)
).toDF("key1", "key2", "time").repartition(3, col("key1"), col("key2"))
val x = df1.join(df2, usingColumns = Seq("key1", "key2", "time"))
x.rdd.getNumPartitions == 200
So it looks like Spark 2.4.0 isn't able to optimise the described case out of the box, and a Catalyst optimizer extension is needed, as suggested by #user10938362.
BTW, here is some info about writing Catalyst optimizer extensions: https://developer.ibm.com/code/2017/11/30/learn-extension-points-apache-spark-extend-spark-catalyst-optimizer/
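For completeness, here is a bare skeleton of how such an extension is wired in (Spark 2.2+). NoOpRule is just a placeholder name of mine; a real rule that recognises the compatible partitioning would put its rewrite logic inside apply:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Placeholder optimizer rule: returns the plan unchanged.
case class NoOpRule(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

val sessionWithExtensions = SparkSession.builder()
  .appName("catalyst-extension-sketch")
  .withExtensions(ext => ext.injectOptimizerRule(s => NoOpRule(s)))
  .getOrCreate()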
The behaviour of Catalyst Optimizer differs between pyspark and Scala (using Spark 2.4 at least).
I ran both and got two different plans.
Indeed you get 200 partitions in pyspark, unless you explicitly set:
spark.conf.set("spark.sql.shuffle.partitions", 3)
Then 3 partitions are processed, and thus 3 retained under pyspark.
I am a little surprised, as I would have thought that under the hood it would be common to both APIs. So people keep telling me. It just goes to show.
Physical Plan for pyspark with param set via conf:
== Physical Plan ==
*(5) Project [key1#344L, key2#345L, time#346L]
+- SortMergeJoin [key1#344L, key2#345L, time#346L], [key1#350L, key2#351L, time#352L], LeftOuter
   :- *(2) Sort [key1#344L ASC NULLS FIRST, key2#345L ASC NULLS FIRST, time#346L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(key1#344L, key2#345L, time#346L, 3)
   :     +- *(1) Scan ExistingRDD[key1#344L,key2#345L,time#346L]
   +- *(4) Sort [key1#350L ASC NULLS FIRST, key2#351L ASC NULLS FIRST, time#352L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(key1#350L, key2#351L, time#352L, 3)
         +- *(3) Filter ((isnotnull(key1#350L) && isnotnull(key2#351L)) && isnotnull(time#352L))
            +- *(3) Scan ExistingRDD[key1#350L,key2#351L,time#352L]

Parquet filter pushdown is not working with Spark Dataset API [duplicate]

This question already has answers here:
Why is predicate pushdown not used in typed Dataset API (vs untyped DataFrame API)?
(1 answer)
Spark 2.0 Dataset vs DataFrame
(3 answers)
Closed 4 years ago.
Here is the sample code which I am running.
Creating a test Parquet dataset with the mod column as the partition column:
scala> val test = spark.range(0 , 100000000).withColumn("mod", $"id".mod(40))
test: org.apache.spark.sql.DataFrame = [id: bigint, mod: bigint]
scala> test.write.partitionBy("mod").mode("overwrite").parquet("test_pushdown_filter")
After that, I am reading this data as a dataframe and applying a filter on the partition column, i.e. mod:
scala> val df = spark.read.parquet("test_pushdown_filter").filter("mod = 5")
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint, mod: int]
scala> df.queryExecution.executedPlan
res1: org.apache.spark.sql.execution.SparkPlan =
*FileScan parquet [id#16L,mod#17] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/C:/Users/kprajapa/WorkSpace/places/test_pushdown_filter], PartitionCount: 1, PartitionFilters: [isnotnull(mod#17), (mod#17 = 5)], PushedFilters: [], ReadSchema: struct<id:bigint>
You can see in the execution plan that it is only reading 1 partition.
But if you apply the same filter with the Dataset API, it reads all 40 partitions and then applies the filter.
scala> case class Test(id: Long, mod: Long)
defined class Test
scala> val ds = spark.read.parquet("test_pushdown_filter").as[Test].filter(_.mod==5)
ds: org.apache.spark.sql.Dataset[Test] = [id: bigint, mod: int]
scala> ds.queryExecution.executedPlan
res2: org.apache.spark.sql.execution.SparkPlan =
*Filter <function1>.apply
+- *FileScan parquet [id#22L,mod#23] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/C:/Users/kprajapa/WorkSpace/places/test_pushdown_filter], PartitionCount: 40, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
Is this how the Dataset API works, or am I missing something?
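For reference, the pruning comes back if the predicate is written as a Column expression instead of a Scala lambda, even on the typed Dataset: the lambda is a black box to Catalyst, while a Column predicate stays analysable. A quick sketch using the same table and case class as above:

// A Column predicate keeps the filter visible to the optimizer, so the scan
// should again report PartitionCount: 1 instead of reading all 40 partitions.
val dsPruned = spark.read.parquet("test_pushdown_filter").as[Test].filter($"mod" === 5)
dsPruned.queryExecution.executedPlan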

Convert csv files to parquet on s3 using Spark structured streaming

I'm trying to create a Spark application that will read my CSV files from S3, convert them to Parquet files and write the results back to S3.
I have 8 new CSV files every minute, compressed with gzip (~60 MB per gzip file); each row has ~200 columns, and ~99% of the rows have the same date (my partition column).
The cluster has 3 workers, each with 10 cores and 20 GB of memory.
Here is my code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

val spark = SparkSession
.builder()
.appName("Csv2Parquet")
.config("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
.config("fs.s3a.access.key", "MY ACESS KEY")
.config("fs.s3a.secret.key", "MY SECRET")
.config("spark.executor.memory", "15G")
.config("spark.driver.memory", "5G")
.getOrCreate()
import spark.implicits._
val schema= StructType(Array(
StructField("myDate", DateType, nullable=false),
StructField("myTimestamp", TimestampType, nullable=true),
...
...
...
StructField("myColumn200", StringType, nullable=true)
))
val df = spark.readStream
.format("com.databricks.spark.csv")
.schema(schema)
.option("header", "false")
.option("mode", "DROPMALFORMED")
.option("delimiter","\t")
.load("s3a://my-bucket/raw-data/*.gz")
.withColumn("myPartitionDate", $"myDate")
val query = df.repartition($"myPartitionDate").writeStream
.option("checkpointLocation", "/shared/checkpoints/csv2parquet")
.trigger(Trigger.ProcessingTime(60000))
.format("parquet")
.option("path", "s3a://my-bucket/parquet-data")
.partitionBy("myPartitionDate")
.start("s3a://my-bucket/parquet-data")
query.awaitTermination()
The problem is that only one task is responsible for writing the "main" partition (the one that includes 99% of the events) to S3, and it takes ~4 minutes to handle this task. How can I improve it?
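One common way to attack this kind of write skew is salting, sketched below on the assumption that more (and smaller) files per date are acceptable downstream. The column name "salt" and the fan-out of 8 are arbitrary choices of mine:

import org.apache.spark.sql.functions.{floor, rand}
import org.apache.spark.sql.streaming.Trigger

// Spread rows that share the dominant date over up to 8 shuffle partitions,
// so several tasks write that date's directory instead of a single one.
val salted = df.withColumn("salt", floor(rand() * 8))

val saltedQuery = salted
  .repartition($"myPartitionDate", $"salt")
  .drop("salt")
  .writeStream
  .option("checkpointLocation", "/shared/checkpoints/csv2parquet")
  .trigger(Trigger.ProcessingTime(60000))
  .format("parquet")
  .partitionBy("myPartitionDate")
  .start("s3a://my-bucket/parquet-data")

saltedQuery.awaitTermination()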

Getting the hivecontext from a dataframe

I am creating a HiveContext instead of a SQLContext to create a dataframe:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setMaster("yarn-cluster")
val context = new SparkContext(conf)
//val sqlContext = new SQLContext(context)
val hiveContext = new HiveContext(context)
import hiveContext.implicits._ // needed for toDF below
val data = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).map(x => (x.toLong, x + 1, x + 2.toDouble)).toDF("ts", "value", "label")
//data is a dataframe
data.registerTempTable("df")
//val hiveTest=hiveContext.sql("SELECT * from df where ts < percentile(BIGINT ts, 0.5)")
val ratio1=hiveContext.sql("SELECT percentile_approx(ts, array (0.5,0.7)) from df")
I need to get the exact HiveContext from ratio1, not create another HiveContext from the SQLContext provided by the dataframe. I don't know why Spark doesn't give me a HiveContext from the dataframe and only gives a SQLContext.
If you use HiveContext, then the runtime type of df.sqlContext is HiveContext (HiveContext is a subclass of SQLContext), therefore you can do:
val hiveContext = df.sqlContext.asInstanceOf[HiveContext]
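For example, the recovered context can then be reused for further Hive-only functions against the temp table registered above (recoveredHiveContext and ratio2 are just illustrative names):

// Cast once, then use any HiveQL feature, e.g. percentile_approx again.
val recoveredHiveContext = data.sqlContext.asInstanceOf[HiveContext]
val ratio2 = recoveredHiveContext.sql("SELECT percentile_approx(ts, 0.9) FROM df")
ratio2.show()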
