Why spark repartition increase size (data volume) - apache-spark

I just wanted to understand why the spark repartition increase data volume ?. When the same operation I did with Coalesce , It showed me correct size. When I did repartition with 100GB of data It became around 400GB (more than that).
Here is my code which do repartition
SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("spark compatation");
JavaSparkContext sc = new JavaSparkContext(conf);
SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
String partition = "file='hit_data'";
spark.read()
.format("delta")
.load("delta-table/clickstream/")
// .where(partition)
.repartition(10)
.write()
.format("delta")
.mode("overwrite")
//.option("replaceWhere", partition)
.save("delta-table/clickstream/");
spark.stop();
sc.close();

Related

Spark always bradcasts tables greater than spark.sql.autoBroadcastJoinThreshold when performing streaming merge on DeltaTable sink

I am trying to do a streaming merge between delta tables using this guide - https://docs.delta.io/latest/delta-update.html#upsert-from-streaming-queries-using-foreachbatch
Our Code Sample (Java):
Dataset<Row> sourceDf = sparkSession
.readStream()
.format("delta")
.option("inferSchema", "true")
.load(sourcePath);
DeltaTable deltaTable = DeltaTable.forPath(sparkSession, targetPath);
sourceDf.createOrReplaceTempView("vTempView");
StreamingQuery sq = sparkSession.sql("select * from vTempView").writeStream()
.format("delta")
.foreachBatch((microDf, id) -> {
deltaTable.alias("e").merge(microDf.alias("d"), "e.SALE_ID = d.SALE_ID")
.whenMatched().updateAll()
.whenNotMatched().insertAll()
.execute();
})
.outputMode("update")
.option("checkpointLocation", util.getFullS3Path(target)+"/_checkpoint")
.trigger(Trigger.Once())
.start();
Problem:
Here Source path and Target path is already in sync using the checkpoint folder. Which has around 8 million rows of data amounting to around 450mb of parquet files.
When new data comes in Source Path (let's say 987 rows), then above code will pick that up and perform a merge with target table. During this operation spark is trying to perform a BroadCastHashJoin, and broadcasts the target table which has 8M rows.
Here's a DAG snippet for merge operation (with table with 1M rows),
Expectation:
I am expecting smaller dataset (i.e: 987 rows) to be broadcasted. If not then atleast spark should not broadcast target table, as it is larger than provided spark.sql.autoBroadcastJoinThreshold setting and neither are we providing any broadcast hint anywhere.
Things I have tried:
I searched around and got this article - https://learn.microsoft.com/en-us/azure/databricks/kb/sql/bchashjoin-exceeds-bcjointhreshold-oom.
It provides 2 solutions,
Run "ANALYZE TABLE ..." (but since we are reading target table from path and not from a table this is not possible)
Cache the table you are broadcasting, DeltaTable does not have any provision to cache table, so can't do this.
I thought this was because we are using DeltaTable.forPath() method for reading target table and spark is unable to calculate target table metrics. So I also tried a different approach,
Dataset<Row> sourceDf = sparkSession
.readStream()
.format("delta")
.option("inferSchema", "true")
.load(sourcePath);
Dataset<Row> targetDf = sparkSession
.read()
.format("delta")
.option("inferSchema", "true")
.load(targetPath);
sourceDf.createOrReplaceTempView("vtempview");
targetDf.createOrReplaceTempView("vtemptarget");
targetDf.cache();
StreamingQuery sq = sparkSession.sql("select * from vtempview").writeStream()
.format("delta")
.foreachBatch((microDf, id) -> {
microDf.createOrReplaceTempView("vtempmicrodf");
microDf.sparkSession().sql(
"MERGE INTO vtemptarget as t USING vtempmicrodf as s ON t.SALE_ID = s.SALE_ID WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT * "
);
})
.outputMode("update")
.option("checkpointLocation", util.getFullS3Path(target)+"/_checkpoint")
.trigger(Trigger.Once())
.start();
In above snippet I am also caching the targetDf so that Spark can calculate metrics and not broadcast target table. But it didn't help and spark still broadcasts it.
Now I am out of options. Can anyone give me some guidance on this?

DSE Spark Streaming: Long active batches queue

I have the following code:
val conf = new SparkConf()
.setAppName("KafkaReceiver")
.set("spark.cassandra.connection.host", "192.168.0.78")
.set("spark.cassandra.connection.keep_alive_ms", "20000")
.set("spark.executor.memory", "2g")
.set("spark.driver.memory", "4g")
.set("spark.submit.deployMode", "cluster")
.set("spark.executor.instances", "3")
.set("spark.executor.cores", "3")
.set("spark.shuffle.service.enabled", "false")
.set("spark.dynamicAllocation.enabled", "false")
.set("spark.io.compression.codec", "snappy")
.set("spark.rdd.compress", "true")
.set("spark.streaming.backpressure.enabled", "true")
.set("spark.streaming.backpressure.initialRate", "200")
.set("spark.streaming.receiver.maxRate", "500")
val sc = SparkContext.getOrCreate(conf)
val ssc = new StreamingContext(sc, Seconds(10))
val sqlContext = new SQLContext(sc)
val kafkaParams = Map[String, String](
"bootstrap.servers" -> "192.168.0.113:9092",
"group.id" -> "test-group-aditya",
"auto.offset.reset" -> "largest")
val topics = Set("random")
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
I'm running the code through spark-submit with the following command:
dse> bin/dse spark-submit --class test.kafkatesting /home/aditya/test.jar
I have a three-node Cassandra DSE cluster installed on different machines. Whenever I run the application, it takes so much data and starts creating a queue of active batches, which in turn creates a backlog and a long scheduling delay. How can I increase the performance and control the queue such that it receives a new batch only after it finishes executing the current batch?
I found the solution, did some optimisation in code. Instead of saving RDD try to create Dataframe, saving DF to Cassandra in much faster as compared to RDD. Also, increase the no of core and and executor memory in order to achieve good results.
Thanks,

why output write data is way more then input data in apache spark

I am inserting 310GB of csv file data from HDFS to cassandra, I have 3 node spark cluster with each node of 32GB RAM and 24 cores. The output data size is more than input data, what could be the reason? Is it because that spark is running multiple task for same block of data and writing it back to cassandra? If yes how to check the calculation, is it possible from spark web ui ? like for 310GB of data , what amount of extra data spark might write ?
spark job status
val spark = SparkSession
.builder()
.appName(job.name)
.config("spark.cassandra.connection.host", cassandraHost)
.config("spark.cassandra.connection.port", cassandraPort)
.getOrCreate()
val csv: DataFrame = spark.read
.format(fileFormat)
.option("header", job.header)
.option("inferSchema", job.inferSchema)
.option("delimiter", job.delimiter)
.load(sourceFile)
.filter(job.filter.getOrElse("1==1"))
.distinct()
.na
.fill("unknown")
val columns: Array[Column] = csv.columns
.map(c =>
job.schema.get(c) match {
case Some(newName) => csv.col(c).as(newName.toLowerCase)
case None => null
})
.filter(c => c != null)
val dataToStoreInit = csv.select(columns: _*)
val partitionColumn = job.partitionKeyColumns.get(0)
val dataToStore = dataToStoreInit.repartition(dataToStoreInit(partitionColumn))
dataToStore.write
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> job.outTableName, "keyspace" -> keyspace))
.mode(SaveMode.Append)
.save()
spark.stop()

Spark: Poor performance on distributed system. How to improve>

I wrote a simple Spark program, and want to deploy it to the distributed servers. It is pretty simple:
obtain data-> arrange data->train data->reapply to see training result.
The input data is just 10K rows, with 3 features.
I first run at my local machine, using "local[*]". It runs just about 3 mins.
Now when I deploy to a cluster, it runs extremely slow: half an hour without finished. It becomes very slow at the training stage.
I am curious, if I did something wrong. Please help me to check. I use Spark 1.6.1.
I submit:
spark-submit --packages com.databricks:spark-csv_2.11:1.5.0 orderprediction_2.11-1.0.jar --driver-cores 1 --driver-memory 4g --executor-cores 8 --executor-memory 4g
The code is here:
def main(args: Array[String]) {
// Set the log level to only print errors
Logger.getLogger("org").setLevel(Level.ERROR)
val conf = new SparkConf()
.setAppName("My Prediction")
//.setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val data = sqlContext.read
.option("header","true")
.option("delimiter", "\t")
.format("com.databricks.spark.csv")
.option("inferSchema","true")
.load("mydata.txt")
data.printSchema()
data.show()
val dataDF = data.toDF().filter("clicks >=10")
dataDF.show()
val assembler = new VectorAssembler()
.setInputCols(Array("feature1", "feature2", "feature3"))
.setOutputCol("features")
val trainset = assembler.transform(dataDF).select("target", "features")
trainset.printSchema()
val trainset2 = trainset.withColumnRenamed("target", "label")
trainset2.printSchema()
val trainset3 = trainset2.withColumn("label", trainset2.col("label").cast(DataTypes.DoubleType))
trainset3.cache() // cache data into memory
trainset3.printSchema()
trainset3.show()
// Train a RandomForest model.
println("training Random Forest")
val rf = new RandomForestRegressor()
.setLabelCol("label")
.setFeaturesCol("features")
.setNumTrees(1000)
val rfmodel = rf.fit(trainset3)
println("prediction")
val result = rfmodel.transform(trainset3)
result.show()
}
Update: After investigation, I found it jammed at
collectAsMap at RandomForest.scala:525
It spent already 1.1 hours on this line, still unfinished yet. The data, I believe is only several Megabyte.
You are building a RandomForest made out of 1000 RandomTrees which will train 1000 instances.
In the code collectAsMap is the first action, while all the rest are transformations (are lazy evaluated). So while you see it hanging at that line it is because now all the maps, flatMaps, filters, groupBy, etc are evaluated.

Joining Kafka and Cassandra DataFrames in Spark Streaming ignores C* predicate pushdown

Intent
I'm receiving data from Kafka via direct stream and would like to enrich the messages with data from Cassandra. The Kafka messages (Protobufs) are decoded into DataFrames and then joined with a (supposedly pre-filtered) DF from Cassandra. The relation of (Kafka) streaming batch size to raw C* data is [several streaming messages to millions of C* rows], BUT the join always yields exactly ONE result [1:1] per message. After the join the resulting DF is eventually stored to another C* table.
Problem
Even though I'm joining the two DFs on the full Cassandra primary key and pushing the corresponding filter to C*, it seems that Spark is loading the whole C* data-set into memory before actually joining (which I'd like to prevent by using the filter/predicate pushdown). This leads to a lot of shuffling and tasks being spawned, hence the "simple" join takes forever...
def main(args: Array[String]) {
val conf = new SparkConf()
.setAppName("test")
.set("spark.cassandra.connection.host", "xxx")
.set("spark.cassandra.connection.keep_alive_ms", "30000")
.setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.sparkContext.setLogLevel("INFO")
// Initialise Kafka
val kafkaTopics = Set[String]("xxx")
val kafkaParams = Map[String, String](
"metadata.broker.list" -> "xxx:32000,xxx:32000,xxx:32000,xxx:32000",
"auto.offset.reset" -> "smallest")
// Kafka stream
val messages = KafkaUtils.createDirectStream[String, MyMsg, StringDecoder, MyMsgDecoder](ssc, kafkaParams, kafkaTopics)
// Executed on the driver
messages.foreachRDD { rdd =>
// Create an instance of SQLContext
val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
import sqlContext.implicits._
// Map MyMsg RDD
val MyMsgRdd = rdd.map{case (key, MyMsg) => (MyMsg)}
// Convert RDD[MyMsg] to DataFrame
val MyMsgDf = MyMsgRdd.toDF()
.select(
$"prim1Id" as 'prim1_id,
$"prim2Id" as 'prim2_id,
$...
)
// Load DataFrame from C* data-source
val base_data = base_data_df.getInstance(sqlContext)
// Left join on prim1Id and prim2Id
val joinedDf = MyMsgDf.join(base_data,
MyMsgDf("prim1_id") === base_data("prim1_id") &&
MyMsgDf("prim2_id") === base_data("prim2_id"), "left")
.filter(base_data("prim1_id").isin(MyMsgDf("prim1_id"))
&& base_data("prim2_id").isin(MyMsgDf("prim2_id")))
joinedDf.show()
joinedDf.printSchema()
// Select relevant fields
// Persist
}
// Start the computation
ssc.start()
ssc.awaitTermination()
}
Environment
Spark 1.6
Cassandra 2.1.12
Cassandra-Spark-Connector 1.5-RC1
Kafka 0.8.2.2
SOLUTION
From discussions on the DataStax Spark Connector for Apache Cassandra ML
Joining Kafka and Cassandra DataFrames in Spark Streaming ignores C* predicate pushdown
How to create a DF from CassandraJoinRDD
I've learned the following:
Quoting Russell Spitzer
This wouldn't be a case of predicate pushdown. This is a join on a partition key column. Currently only joinWithCassandraTable supports this direct kind of join although we are working on some methods to try to have this automatically done within Spark.
Dataframes can be created from any RDD which can have a schema applied to it. The easiest thing to do is probably to map your joinedRDD[x,y] to Rdd[JoinedCaseClass] and then call toDF (which will require importing your sqlContext implicits.) See the DataFrames documentation here for more info.
So the actual implementation now resembles something like
// Join myMsg RDD with myCassandraTable
val joinedMsgRdd = myMsgRdd.joinWithCassandraTable(
"keyspace",
"myCassandraTable",
AllColumns,
SomeColumns(
"prim1_id",
"prim2_id"
)
).map{case (myMsg, cassandraRow) =>
JoinedMsg(
foo = myMsg.foo
bar = cassandraRow.bar
)
}
// Convert RDD[JoinedMsg] to DataFrame
val myJoinedDf = joinedMsgRdd.toDF()
Have you tried joinWithCassandraTable ? It should pushdown to C* all keys you are looking for...

Resources