How does persist(StorageLevel.MEMORY_AND_DISK) work in Spark 3.1 with Java? - apache-spark

I am using Apache Spark 3.1 with Java on a GCP Dataproc cluster. My code is structured like this:
Dataset<Row> dataset1 = readSpannerData(session, session.sessionState().newHadoopConf());
Dataset<Row> dataset2 = /* reads some data from table1 in Bigtable */;
Dataset<Row> result1 = dataset1.join(dataset2);
dataset1.persist(StorageLevel.MEMORY_AND_DISK());
dataset2.persist(StorageLevel.MEMORY_AND_DISK()); // I persist both datasets once their usage is done
System.out.println(result1.count()); // the error is thrown on this line
According to the YARN UI, the failure is in the select query on the Spanner table that runs at the start of the job, not in anything reading from Bigtable. I persisted dataset1 only after I had finished using it.
My cluster has autoscaling enabled with a maximum of 250 worker nodes, each with 8 cores and 1024 GB of memory.
It is configured to run 2 executors on each node (4 cores per executor).
The job works fine with a low volume of data, but it throws the error when run with a huge volume.
Why does it throw an error in this situation? When result1 is used, does Spark go back to the parent dataset that is already persisted in memory? If we have to keep that dataset around, what is the use of in-memory storage?
Why does it work in low-data environments? On how many nodes, and for how long, is an in-memory dataset kept during a Spark job? Does the volume of data affect the in-memory dataset?
Can anyone clarify this?
Thanks in advance :)

Related

Write to databricks table from spark worker node

Can someone let me know whether I can write to a Databricks table directly from a worker node in Spark? Please provide code snippets. I am partitioning a large dataset of around 100 million records, and the job fails with memory issues when I issue a collect statement to bring the data back to the driver node.
In general, you are always writing to a Databricks table from the worker nodes. As you have seen, collect should be avoided at all costs: it leads to a driver OOM.
To avoid OOM issues, do what most people do: repartition your records so that they fit within the allowable partition size limit on your worker nodes (2GB, or now 4GB with newer Spark releases), and all will be fine. E.g.:
val repartitionedWikiDF = wikiDF.repartition(16)
val targetPath = s"$workingDir/wiki.parquet" // workingDir is assumed to be defined earlier in the notebook
repartitionedWikiDF.write.mode("overwrite").parquet(targetPath)
display(dbutils.fs.ls(targetPath))
You can also call df.repartition(N, col) to repartition by a column. There is also range partitioning (repartitionByRange).
The best approach, in my opinion, is this:
import org.apache.spark.sql.functions._
df.repartition(col("country"))
  .write.partitionBy("country")
  .parquet("repartitionedPartitionedBy.parquet")
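A rough way to choose the repartition count is to divide the total data size by a target partition size. The sketch below is plain Python, not a Spark API. The function name estimate_partitions, the 128 MB target, and the 200 bytes per record are all assumptions for illustration; the real size of your data has to come from your own measurements.

```python
import math

def estimate_partitions(total_bytes: int,
                        target_partition_bytes: int = 128 * 1024**2) -> int:
    """Return how many partitions keep each one at or below the target size."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_partition_bytes))

# Hypothetical input: 100 million records at an assumed 200 bytes each.
total_bytes = 100_000_000 * 200
num_partitions = estimate_partitions(total_bytes)
print(num_partitions)
```

The resulting number would then be passed to repartition in place of a hard-coded value such as 16.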

How does spark copy data between cassandra tables?

Can anyone please explain how Spark works internally when it reads data from one Cassandra table and writes it to another?
Here is my use case:
I am ingesting data from an IoT platform into Cassandra through a Kafka topic. I have a small Python script that parses each Kafka message to get the name of the table it belongs to, prepares a query, and writes it to Cassandra using DataStax's cassandra-driver for Python. With that script I can ingest around 300000 records per minute into Cassandra. However, my incoming data rate is 510000 records per minute, so the Kafka consumer lag keeps increasing.
The Python script already makes concurrent calls to Cassandra. If I increase the number of Python executors, cassandra-driver starts failing because the Cassandra nodes become unavailable to it. I assume I am hitting a limit on Cassandra calls per second. Here is the error message that I get:
ERROR Operation failed: ('Unable to complete the operation against any hosts', {<Host: 10.128.1.3 datacenter1>: ConnectionException('Pool is shutdown',), <Host: 10.128.1.1 datacenter1>: ConnectionException('Pool is shutdown',)})"
Recently, I ran a pyspark job to copy data from a couple of columns in one table to another. The table had around 168 million records in it. Pyspark job completed in around 5 hours. So it processed over 550000 records per min.
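As a sanity check on the rates quoted above, the arithmetic can be sketched in a few lines of Python. The helper name records_per_minute is made up for illustration; the figures are the ones from this post.

```python
def records_per_minute(records: int, hours: float) -> float:
    """Average throughput in records per minute over the given duration."""
    return records / (hours * 60)

copy_rate = records_per_minute(168_000_000, 5)  # PySpark copy job: ~168M records in ~5 hours
script_rate = 300_000                           # Python script, records per minute
incoming_rate = 510_000                         # incoming data, records per minute

print(copy_rate)                    # average copy throughput per minute
print(incoming_rate - script_rate)  # records per minute the script falls behind
print(script_rate / 60)             # script throughput per second
```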
Here is the pyspark code I am using:
df = spark.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table=sourcetable, keyspace=sourcekeyspace)\
    .load().cache()
df.createOrReplaceTempView("data")
query = ("select dev_id,datetime,DATE_FORMAT(datetime,'yyyy-MM-dd') as day, " + field + " as value from data " )
vgDF = spark.sql(query)
vgDF.show(50)
vgDF.write\
    .format("org.apache.spark.sql.cassandra")\
    .mode('append')\
    .options(table=newtable, keyspace=newkeyspace)\
    .save()
Versions:
Cassandra 3.9.
Spark 2.1.0.
Datastax's spark-cassandra-connector 2.0.1
Scala version 2.11
Cluster:
Spark setup with 3 workers and 1 master node.
3 worker nodes also have a cassandra cluster installed. (each cassandra node with one spark worker node)
Each worker was allowed 10 GB ram and 3 cores.
So I am wondering:
Does Spark read all the data from Cassandra first and then write it to the new table, or is there some optimization in the Spark Cassandra connector that lets it move data between Cassandra tables without reading all the records?
If I replace my Python script with a Spark Streaming job that parses each packet to get the Cassandra table name, will that help me ingest data into Cassandra more quickly?
The Spark connector is optimized in that it parallelizes processing and reads/inserts data on the nodes that own the data. You may get better throughput with the Spark Cassandra Connector, but it will require more resources.
As for your task: 300000 inserts/minute is 5000/second, which frankly is not a very big number. You can increase throughput with several optimizations:
Use asynchronous calls to submit requests. You only need to make sure that you do not submit more requests than one connection can handle (you can also increase this number; I'm not sure how to do it in Python, but check the Java driver docs to get an idea).
Use the correct consistency level (LOCAL_ONE should give you very good performance).
Use the correct load balancing policy.
Run several copies of your script in parallel, making sure that they are all in the same Kafka consumer group.
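The first suggestion, asynchronous submission with a cap on in-flight requests, can be sketched with Python's standard asyncio module. This is only an illustration of the pattern: fake_insert is a hypothetical stand-in for whatever asynchronous call your driver provides, and the limit of 64 is an arbitrary assumption, not a recommended setting.

```python
import asyncio

async def fake_insert(record: int) -> int:
    """Hypothetical stand-in for an asynchronous driver call; only simulates latency."""
    await asyncio.sleep(0.001)
    return record

async def submit_all(records, max_in_flight: int = 64):
    """Submit every record concurrently, never exceeding max_in_flight at once."""
    semaphore = asyncio.Semaphore(max_in_flight)
    in_flight = 0
    peak = 0

    async def guarded(record):
        nonlocal in_flight, peak
        async with semaphore:          # blocks while max_in_flight requests are pending
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_insert(record)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(r) for r in records))
    return results, peak

results, peak = asyncio.run(submit_all(range(1000), max_in_flight=64))
print(len(results), peak <= 64)
```

The semaphore is what keeps the client from flooding a connection: requests beyond the limit wait for a slot instead of being sent immediately.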

Spark Cassandra Performance Issue

I am new to Spark and Cassandra, and I am facing a major performance issue. I stream data from Kafka into Spark every 5 seconds, perform analytics on the data in R using JRI, and finally save the data to the respective Cassandra column family. The time (in milliseconds) taken to save the data to Cassandra increases very rapidly with the number of input requests (each request is 200KB).
Spark code:
sessionData.foreachRDD(new Function<JavaRDD<NormalizedData>, Void>() {
    public Void call(JavaRDD<NormalizedData> rdd) {
        System.out.println("step-3 " + System.currentTimeMillis());
        javaFunctions(rdd).writerBuilder("keyspace", "normalized_data", mapToRow(NormalizedData.class)).saveToCassandra();
        System.out.println("step-4 " + System.currentTimeMillis());
        return null; // a Function returning Void must return null
    }
});
I was able to improve performance by running Spark and Cassandra on the same server. The delay occurred because Spark and Cassandra were on different servers, although in the same AWS region. Network delay was the main cause, as it impacted data locality. Thanks.
You can refer to this blog for Spark-Cassandra connector tuning; it will give you an idea of the performance numbers you can expect. You can also try another open source product, SnappyData, a Spark database, which will give you very high performance in your use case.
I am also using the Cassandra and Spark combination for real-time analytics. Here are a few best practices:
Data locality: run the Cassandra daemon alongside the Worker node in the case of Spark standalone, the Node Manager in the case of YARN, or the Mesos worker in the case of Mesos.
Increase the parallelism, i.e., create more partitions/tasks.
Use Cassandra connection pooling to improve throughput.
In your case, you are using JRI to call R inside Java. This is somewhat slow and adds performance overhead, so use SparkR to integrate R with Spark instead of using JRI directly.

What should be the optimal value for spark.sql.shuffle.partitions or how do we increase partitions when using Spark SQL?

I am using Spark SQL (actually hiveContext.sql()) with group by queries, and I am running into OOM issues. I tried increasing spark.sql.shuffle.partitions from the default of 200 to 1000, but it is not helping.
I believe the partitions share the shuffle load, so the more partitions there are, the less data each one has to hold. I am new to Spark. I am using Spark 1.4.0 and have around 1TB of uncompressed data to process with hiveContext.sql() group by queries.
If you're running out of memory on the shuffle, try setting spark.sql.shuffle.partitions to 2001.
Spark uses a different data structure for shuffle book-keeping when the number of partitions is greater than 2000:
private[spark] object MapStatus {
  def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
    if (uncompressedSizes.length > 2000) {
      HighlyCompressedMapStatus(loc, uncompressedSizes)
    } else {
      new CompressedMapStatus(loc, uncompressedSizes)
    }
  }
  ...
I really wish they would let you configure this independently.
By the way, I found this information in a Cloudera slide deck.
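To make the threshold concrete, here is a small Python restatement of the branch in the Scala snippet above. It is only a model of that one condition, not Spark code; the function name map_status_kind is made up for illustration.

```python
# Threshold copied from the Scala snippet; everything else here is illustrative.
HIGHLY_COMPRESSED_THRESHOLD = 2000

def map_status_kind(num_partitions: int) -> str:
    """Name the MapStatus variant the quoted branch would pick."""
    if num_partitions > HIGHLY_COMPRESSED_THRESHOLD:
        return "HighlyCompressedMapStatus"
    return "CompressedMapStatus"

for n in (200, 1000, 2000, 2001):
    print(n, map_status_kind(n))
```

Under this condition, going from 200 to 1000 partitions stays on the same branch, while 2001 is the smallest value that crosses it.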
OK so I think your issue is more general. It's not specific to Spark SQL, it's a general problem with Spark where it ignores the number of partitions you tell it when the files are few. Spark seems to have the same number of partitions as the number of files on HDFS, unless you call repartition. So calling repartition ought to work, but has the caveat of causing a shuffle somewhat unnecessarily.
I raised this question a while ago and have still yet to get a good answer :(
Spark: increase number of partitions without causing a shuffle?
It actually depends on your data and your query. If Spark must load 1 TB, there is something wrong with your design.
Use the superb web UI to see the DAG, i.e. how Spark translates your SQL query into jobs, stages and tasks.
Useful metrics are "Input" and "Shuffle".
Partition your data (Hive / directory layout like /year=X/month=X).
Use the Spark CLUSTER BY feature to work per data partition.
Use the ORC / Parquet file formats, because they provide filter push-down, so useless data is not loaded into Spark.
Analyze the Spark History to see how Spark is reading data.
Also, could the OOM be happening on your driver?
-> This is another issue: at the end, the driver collects the data you asked for. If you ask for too much data, the driver will OOM. Try limiting your query, or write to another table (Spark syntax CREATE TABLE ...AS).
I came across this post from Cloudera about Hive partitioning. Check out the "Pointers" section, which discusses how the number of partitions and the number of files in each partition can overload the name node, which might cause OOM.

Spark Streaming not distributing task to nodes on cluster

I have a two-node standalone cluster for Spark stream processing. Below is sample code that demonstrates the process I am executing.
sparkConf.setMaster("spark://rsplws224:7077")
val ssc=new StreamingContext()
println(ssc.sparkContext.master)
val inDStream = ssc.receiverStream // batch of 500 ms, as I would like to have 1 sec latency
val filteredDStream = inDStream.filter // filtering unwanted tuples
val keyDStream = filteredDStream.map // converting to pair DStream
val stateStream = keyDStream.updateStateByKey // updating state for history
stateStream.checkpoint(Milliseconds(2500)) // to cut the long lineage and materialize the state stream
stateStream.count()
val withHistory = keyDStream.join(stateStream) // joining state with the input stream for further processing
val alertStream = withHistory.filter // decision taken by comparing history state with current tuple data
alertStream.foreach // notification to other system
My problem is that Spark is not distributing this state RDD to multiple nodes, or not distributing tasks to the other node, which causes high response latency. My input load is around 100,000 tuples per second.
I have tried the things below, but nothing works:
1) Setting spark.locality.wait to 1 sec.
2) Reducing the memory allocated to the executor process to check whether Spark distributes the RDD or tasks, but it does not, even when usage goes beyond the memory limit of the first node (m1), where the driver is also running.
3) Increasing spark.streaming.concurrentJobs from 1 (default) to 3.
4) I have checked in the streaming UI storage tab that there are around 20 partitions for the state DStream RDD, all located on local node m1.
If I run SparkPi 100000, Spark is able to use the other node after a few seconds (30-40), so I am sure that my cluster configuration is fine.
Edit
One thing I have noticed: even if I set the storage level of my RDD to MEMORY_AND_DISK_SER_2, the app UI storage tab still shows Memory Serialized 1x Replicated.
Spark will not distribute stream data across the cluster automatically, because it tries to make full use of data locality (launching a task where its data lies is better; this is the default configuration). But you can use repartition to distribute the stream data and improve parallelism. See http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#performance-tuning for more information.
If you're not hitting the cluster and your jobs only run locally, it most likely means the Spark master in your SparkConf is set to the local URI, not the master URI.
The default value of the spark.default.parallelism property depends on the deployment mode (in local mode it is the number of cores on the local machine), so the tasks may all end up being executed on the node that is receiving the data.
Change this property in the spark-defaults.conf file to increase the level of parallelism.
