Repartitioning of a DataFrame in Spark does not work - apache-spark

I have a Cassandra database with a large number of records (~4 million). I have 3 slave machines and one driver. I want to load this data into Spark memory and process it. When I do the following, all the data is read onto one slave machine (300 MB out of 6 GB) while the memory of the other slave machines stays unused. I repartitioned the DataFrame into 3 partitions, but the data still sits on one machine. Because of this, processing takes a lot of time since every job runs on a single machine. This is what I am doing:
val tabledf = _sqlContext.read.format("org.apache.spark.sql.cassandra").options(Map( "table" -> "events", "keyspace" -> "sams")).load
tabledf.registerTempTable("tempdf");
_sqlContext.cacheTable("tempdf");
val rdd = _sqlContext.sql(query);
val partitionedRdd = rdd.repartition(3)
val count = partitionedRdd.count.toInt
When I do any operation on partitionedRdd, it still executes on only one machine, since all the data is present on that machine.
Update
I am passing --conf spark.cassandra.input.split.size_in_mb=32 in the configuration, but all my data is still loaded into one executor.
Update
I am using Spark version 1.4 and the released Spark Cassandra Connector version 1.4.

If "Query" only accesses a single C* partition key you will only get a single task because we don't have a way (yet) of automatically getting a single cassandra partition in parallel. If you are accessing multiple C* partitions then try futher shrinking the input split_size in mb.

Related

From where does an RDD load data in Spark?

From where does Spark load data for an RDD? Is the data already present on the executor nodes, or does Spark shuffle data from the driver node first?
From the name itself - RDD (Resilient Distributed Dataset) - it indicates that the data resides across the executors whenever you create one.
Let's say you run parallelize() on 100 entries: Spark will distribute those 100 entries across your executors so that each executor has its own chunk of data for distributed processing.
Shuffling happens when you do operations like repartition() or coalesce().
Also, if you run functions like collect(), Spark will try to pull all the data from the executors and bring it to the driver (and you lose the ability to do distributed processing).
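A small Scala sketch illustrating these points (the numbers and partition counts are arbitrary):
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-distribution-sketch"))

// parallelize() spreads the 100 entries across the executors.
val rdd = sc.parallelize(1 to 100, numSlices = 4)
println(rdd.partitions.length)   // 4 partitions, distributed over the cluster

// repartition() / coalesce() trigger a shuffle that changes the distribution.
val wider = rdd.repartition(8)

// collect() pulls every element back to the driver - no longer distributed.
val onDriver: Array[Int] = wider.collect()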
This reference has more details on the internals of Spark: Apache Spark architecture

How persist(StorageLevel.MEMORY_AND_DISK) works in Spark 3.1 with Java implementation

I am using Apache Spark 3.1 with Java on a GCP Dataproc cluster, and my code structure is like this:
Dataset<Row> dataset1 = readSpannerData(session, session.sessionState().newHadoopConf());
Dataset<Row> dataset2 = ...; // reading some data from table1 in Bigtable
Dataset<Row> result1 = dataset1.join(dataset2);
dataset1.persist(StorageLevel.MEMORY_AND_DISK());
dataset2.persist(StorageLevel.MEMORY_AND_DISK()); // once the usage is done I persist both datasets
System.out.println(result1.count()); // it throws the error on this line
The exact error shown in the YARN UI refers to the select query on the Spanner table that I use at the start of the job, not to any Bigtable read. I persisted dataset1 only after its usage was done.
My cluster has autoscaling enabled, with a maximum of 250 worker nodes, each with 8 cores and 1024 GB of memory.
It is configured to use 2 executors on each node (4 cores per executor).
It was working fine with a low volume of data, but it throws the error when running with huge data.
Why does it throw an error in this situation? Will it look into the parent in-memory dataset while using the result calculated from that parent dataset, which is already persisted? If we want to maintain that dataset, then what is the use of in-memory storage?
How does it work in low-data environments? On how many nodes, and for how long, will an in-memory dataset be maintained in a Spark job? Will the volume of the data affect the in-memory dataset?
Can anyone clarify this doubt?
Thanks in advance :)
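For reference, persist() is lazy and only takes effect when an action materializes the plan. A minimal Scala sketch of that ordering (the Java API behaves the same way; the dataset names mirror the question, the data itself is made up):
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("persist-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-ins for the Spanner and Bigtable reads in the question.
val dataset1 = Seq((1, "a"), (2, "b")).toDF("id", "v1")
val dataset2 = Seq((1, "x"), (2, "y")).toDF("id", "v2")

// persist() only marks the datasets for caching; nothing is computed yet.
dataset1.persist(StorageLevel.MEMORY_AND_DISK)
dataset2.persist(StorageLevel.MEMORY_AND_DISK)

val result1 = dataset1.join(dataset2, "id")

// The first action (count) triggers computation; the inputs are cached while
// being read, so later actions that reuse them do not recompute from source.
println(result1.count())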

How does Spark copy data between Cassandra tables?

Can anyone please explain how Spark works internally when reading data from one table and writing it to another in Cassandra?
Here is my use case:
I am ingesting data coming in from an IoT platform into Cassandra through a Kafka topic. I have a small Python script that parses each message from Kafka to get the table name it belongs to, prepares a query, and writes it to Cassandra using DataStax's cassandra-driver for Python. With that script I am able to ingest around 300,000 records per minute into Cassandra. However, my incoming data rate is 510,000 records per minute, so the Kafka consumer lag keeps increasing.
The Python script is already making concurrent calls to Cassandra. If I increase the number of Python executors, cassandra-driver starts failing because the Cassandra nodes become unavailable to it. I am assuming there is a limit on Cassandra calls per second that I am hitting there. Here is the error message that I get:
ERROR Operation failed: ('Unable to complete the operation against any hosts', {<Host: 10.128.1.3 datacenter1>: ConnectionException('Pool is shutdown',), <Host: 10.128.1.1 datacenter1>: ConnectionException('Pool is shutdown',)})"
Recently, I ran a PySpark job to copy data from a couple of columns in one table to another. The table had around 168 million records in it. The PySpark job completed in around 5 hours, so it processed over 550,000 records per minute.
Here is the PySpark code I am using:
df = spark.read\
.format("org.apache.spark.sql.cassandra")\
.options(table=sourcetable, keyspace=sourcekeyspace)\
.load().cache()
df.createOrReplaceTempView("data")
query = ("select dev_id,datetime,DATE_FORMAT(datetime,'yyyy-MM-dd') as day, " + field + " as value from data " )
vgDF = spark.sql(query)
vgDF.show(50)
vgDF.write\
.format("org.apache.spark.sql.cassandra")\
.mode('append')\
.options(table=newtable, keyspace=newkeyspace)\
.save()
Versions:
Cassandra 3.9.
Spark 2.1.0.
Datastax's spark-cassandra-connector 2.0.1
Scala version 2.11
Cluster:
Spark set up with 3 workers and 1 master node.
The 3 worker nodes also have a Cassandra cluster installed (each Cassandra node is co-located with one Spark worker node).
Each worker was allowed 10 GB of RAM and 3 cores.
So I am wondering:
Does Spark read all the data from Cassandra first and then write it to the new table, or is there some kind of optimization in the Spark Cassandra Connector that lets it move data between Cassandra tables without reading all the records?
If I replace my Python script with a Spark Streaming job in which I parse the packet to get the table name for Cassandra, will that help me ingest data into Cassandra more quickly?
The Spark connector is optimized because it parallelizes processing and reads/inserts data on the nodes that own the data. You may get better throughput by using the Cassandra Spark Connector, but this will require more resources.
Regarding your task: 300,000 inserts/minute is 5,000/second, which frankly is not a very big number. You can increase throughput by applying several optimizations:
Use asynchronous calls to submit requests; you only need to make sure that you don't submit more requests than one connection can handle (but you can also increase this limit - I'm not sure how to do it in Python, but check the Java driver docs to get an idea). See the sketch after this list.
Use the correct consistency level (LOCAL_ONE should give you very good performance).
Use the correct load balancing policy.
You can run several copies of your script in parallel, making sure that they are all in the same Kafka consumer group.
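A rough Scala sketch of the asynchronous-submission idea using the DataStax Java driver (the question uses the Python driver, but the pattern is the same; the contact point, keyspace, table, columns, and the in-flight limit of 256 are assumptions for illustration):
import com.datastax.driver.core.{Cluster, ConsistencyLevel}
import com.google.common.util.concurrent.MoreExecutors
import java.util.concurrent.Semaphore

object AsyncIngestSketch {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("10.128.1.1").build()   // hypothetical node
    val session = cluster.connect("sensordata")                             // hypothetical keyspace

    // Prepared statement with LOCAL_ONE, as suggested above.
    val insert = session
      .prepare("INSERT INTO readings (dev_id, ts, value) VALUES (?, ?, ?)") // hypothetical table
      .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE)

    // Cap the number of in-flight asynchronous requests.
    val inFlight = new Semaphore(256)

    def writeAsync(devId: String, ts: java.util.Date, value: java.lang.Double): Unit = {
      inFlight.acquire()                                // back-pressure when too many requests are pending
      val future = session.executeAsync(insert.bind(devId, ts, value))
      future.addListener(new Runnable {
        override def run(): Unit = inFlight.release()   // free a slot when the write completes
      }, MoreExecutors.directExecutor())
    }

    // ... call writeAsync(...) from the Kafka consumer loop, then drain and close ...
    session.close()
    cluster.close()
  }
}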

Does Spark from DSE load all data into an RDD before running a SQL query?

Running DSE 4.7
So say I have a 4-node DSE Cassandra/Spark cluster...
I have a Cassandra table with, say, 4,000,000 records in it.
In Spark I run the following Spark SQL: "select * from table where email = ? or mobile = ?"
Will Spark load all the data into an RDD and then filter based on the WHERE clause? Will each Spark node have 1,000,000 records loaded into memory?
Will Spark load all the data into an RDD and then filter based on the WHERE clause?
It depends on your database schema. If your query explicitly restricts the scan to a single C* partition (and yours, where email = ? or mobile = ?, definitely does not), Spark will load only part of the data.
In your case it will have to scan all the data.
Will each Spark node have 1,000,000 records loaded into memory?
Again, it depends on your dataset size and the amount of RAM on the worker nodes. Spark RDDs are not always fully loaded into RAM; in your case the data can be split into smaller parts (e.g. 100k rows), loaded into RAM, filtered according to your query, and saved after that, one part at a time.
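A small Scala sketch of how to check whether a predicate is pushed down to Cassandra or forces a full scan (the keyspace, table, and the user_id partition-key column are hypothetical; email and mobile are the columns from the question):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.col

val sc = new SparkContext(new SparkConf().setAppName("pushdown-check"))
val sqlContext = new SQLContext(sc)

val users = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "users", "keyspace" -> "ks"))   // hypothetical names
  .load()

// A predicate on non-key columns cannot restrict the scan: every node reads
// its token ranges and Spark filters the rows afterwards.
users.filter(col("email") === "a@b.com" || col("mobile") === "123").explain()

// An equality predicate on the partition key can be pushed down by the
// connector, so only the matching Cassandra partition is read.
users.filter(col("user_id") === "42").explain()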

Spark Streaming not distributing tasks to nodes in the cluster

I have a two-node standalone cluster for Spark stream processing. Below is my sample code, which demonstrates the process I am executing:
sparkConf.setMaster("spark://rsplws224:7077")
val ssc = new StreamingContext()
println(ssc.sparkContext.master)
val inDStream = ssc.receiverStream // batch of 500 ms as I would like to have 1 sec latency
val filteredDStream = inDStream.filter // filtering unwanted tuples
val keyDStream = filteredDStream.map // converting to pair dstream
val stateStream = keyDStream.updateStateByKey // updating state for history
stateStream.checkpoint(Milliseconds(2500)) // to remove long lineage and materialize the state stream
stateStream.count()
val withHistory = keyDStream.join(stateStream) // joining state with input stream for further processing
val alertStream = withHistory.filter // decision taken by comparing history state and current tuple data
alertStream.foreach // notification to other system
My problem is that Spark is not distributing this state RDD to multiple nodes and is not distributing tasks to the other node, which causes high latency in response. My input load is around 100,000 tuples per second.
I have tried the things below, but nothing is working:
1) Set spark.locality.wait to 1 sec
2) Reduced the memory allocated to the executor process to check whether Spark distributes the RDD or tasks, but it does not, even if it goes beyond the memory limit of the first node (m1) where the driver is also running.
3) Increased spark.streaming.concurrentJobs from 1 (default) to 3
4) Checked in the Streaming UI storage tab that there are around 20 partitions for the state DStream RDD, all located on the local node m1.
If I run SparkPi 100000, Spark is able to utilize another node after a few seconds (30-40), so I am sure that my cluster configuration is fine.
Edit
One thing I have noticed is that even when I set the storage level of my RDD to MEMORY_AND_DISK_SER_2, the app UI storage tab still shows "Memory Serialized 1x Replicated".
Spark will not distribute stream data across the cluster automatically, because it tends to make full use of data locality (launching a task where its data lies is better; this is the default configuration). But you can use repartition to distribute the stream data and improve the parallelism. You can refer to http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#performance-tuning for more information.
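A minimal sketch of that suggestion applied to the code above (the partition count of 6 is only an illustration):
// Repartition the received stream right after ingestion so that the downstream
// filter/map/updateStateByKey work is spread across both worker nodes.
val distributedDStream = inDStream.repartition(6)  // 6 is illustrative; roughly the total executor cores
// ... continue the original pipeline (filter, map, updateStateByKey) from distributedDStream ...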
If you're not hitting the cluster and your jobs only run locally, it most likely means that the Spark master in your SparkConf is set to the local URI, not the master URI.
By default the value of the spark.default.parallelism property corresponds to "local mode", so all the tasks will be executed on the node that is receiving the data.
Change this property in the spark-defaults.conf file in order to increase the parallelism level.
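For example, a sketch of raising it programmatically on the SparkConf (equivalently, add a spark.default.parallelism line to spark-defaults.conf; the value 12 is arbitrary):
import org.apache.spark.SparkConf

// Raise the default parallelism so shuffles and stateful operations create
// enough partitions to keep both worker nodes busy (12 is only an illustration).
val sparkConf = new SparkConf()
  .setMaster("spark://rsplws224:7077")   // master URI from the question above
  .setAppName("streaming-app")
  .set("spark.default.parallelism", "12")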
