Spark JDBC data fetch optimization from a relational database - apache-spark

a) Is there a way in which Spark can optimize the data fetch from a relational database compared to a traditional Java JDBC call?
b) How can we reduce the load on the database while running Spark queries, given that we will be hitting the production database directly for all queries? Assume 30 million order records and 150 million order-line records in production for the Spark reporting case.

Re a)
You can of course .cache() the data frame in your Spark app to avoid repeated executions of the JDBC query for that data frame during the lifetime of your Spark app.
You can read the data frame in via range-partitioned parallel JDBC calls using the partitionColumn, lowerBound, upperBound and numPartitions properties (a minimal sketch follows below). This makes sense for distributed (partitioned) database backends.
You can use an integrated Spark cluster with a distributed database engine such as IBM dashDB, which runs Spark executors co-located with the database partitions and exercises local IPC data exchange mechanisms between Spark and the database: https://ibmdatawarehousing.wordpress.com/category/theme-ibm-data-warehouse/
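For the range-partitioned read, a sketch along these lines; the connection URL, table name, credentials and bounds are placeholders and assume a numeric ORDER_ID key:

    import java.util.Properties

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("orders-report").getOrCreate()

    val props = new Properties()
    props.setProperty("user", "report_user")
    props.setProperty("password", "secret")

    // Spark issues numPartitions parallel queries, each covering one ORDER_ID range,
    // instead of pulling a single result set through one JDBC connection.
    val orders = spark.read
      .jdbc(
        url = "jdbc:db2://prod-db:50000/SALES",   // placeholder connection string
        table = "ORDERS",
        columnName = "ORDER_ID",                  // the partition column
        lowerBound = 1L,
        upperBound = 30000000L,                   // ~30M order ids
        numPartitions = 16,
        connectionProperties = props)
      .cache()                                    // avoid re-running the JDBC reads on every action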
Re b)
The above-mentioned Spark-side caching can help where applicable. In addition, the JDBC data source in Spark does try to push down projections and filter predicates from your Spark SQL / data frame operations to the underlying SQL database. Check the resulting SQL statements that actually hit the database.
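One way to check is the physical plan; a quick sketch against the frame read above (the column names are again hypothetical):

    import org.apache.spark.sql.functions.col

    // Only the selected columns and the filter end up in the generated SQL.
    // Look for "PushedFilters" and the column list on the JDBCRelation scan node,
    // or watch the database's own statement monitor.
    orders
      .select("ORDER_ID", "STATUS", "ORDER_TOTAL")
      .filter(col("STATUS") === "OPEN")
      .explain()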

Related

Processing of queries using SparkSQL on different databases

I want to use Spark SQL (installed on Machine 1) with connectors for different data stores like HBase, Hive, Cassandra, and MySQL (installed on Machine 2) to perform simple analytics like min/max, averaging, etc.
My question: is the processing of these queries done on Machine 1, or does Spark SQL act just as an interface, with the analytics performed on the data store side (i.e. Machine 2)?
Yes and no. It depends on your Spark job.
Spark SQL is a separate implementation and is datastore-agnostic. When you implement a Spark SQL job, Spark transforms it into something called a DAG.
It is a technique similar to a database query plan, but it runs entirely on the Spark cluster.
In the case of a simple min/max, the job might be translated into a direct query against the underlying store. But it might also be translated into something that preselects a bunch of records and then does its own data processing. This also makes it possible to join and aggregate data from different data sources.
You can analyze the Spark SQL plan with the usual explain statement or via the Spark web UI.
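As a hedged illustration of reading such a plan, here is a sketch against the MySQL store from the question (connection details and column names are made up):

    import java.util.Properties

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{max, min}

    val spark = SparkSession.builder.appName("plan-check").getOrCreate()

    val readings = spark.read
      .jdbc("jdbc:mysql://machine2:3306/sensors", "readings", new Properties())

    // The scan node shows what is delegated to MySQL (column pruning, pushed filters);
    // the min/max aggregation above it is executed by Spark itself, i.e. on the Machine 1 side.
    readings.agg(min("value"), max("value")).explain(true)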

Spark local RDD write to local Cassandra DB

I have a DSE cluster where every node has both Spark and Cassandra running.
When I load data from Cassandra into a Spark RDD and perform some action on it, I know the data will be distributed across multiple nodes. In my case, I want every node to write its part of the RDD directly to its local Cassandra table. Is there any way to do that?
If I do a normal RDD collect, all data from the Spark nodes would be merged and sent back to the driver node.
I do not want this to happen, as the data flow from the nodes back to the driver may take a long time; I want the data saved on the local node directly to avoid data movement across the Spark nodes.
When a Spark executor reads data from Cassandra, it sends the request to the "best node", which is selected based on several factors:
When Spark is colocated with Cassandra, Spark tries to pull data from the same node.
When Spark is on a different node, it uses token-aware routing and reads data from multiple nodes in parallel, as defined by the partition ranges.
When it comes to writing, and you have multiple executors, each executor opens multiple connections to each node and writes the data using token-aware routing, meaning that data is sent directly to one of the replicas. Also, Spark tries to batch multiple rows belonging to the same partition into an UNLOGGED BATCH, as that is more performant. Even if the Spark partition is colocated with the Cassandra partition, writing can involve additional network overhead because the SCC writes using consistency level TWO.
You can get colocated writes if you repartition the data to match Cassandra's partitioning (sketched below), but such a repartition induces a Spark shuffle that can be much more heavyweight than writing the data from an executor to another node.
P.S. You can find a lot of additional information about the Spark Cassandra Connector in Russell Spitzer's blog.
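A minimal sketch of the repartition-then-write approach, assuming the Spark Cassandra Connector is on the classpath, an existing SparkContext `sc`, and a hypothetical keyspace, table, and row type:

    import com.datastax.spark.connector._   // Spark Cassandra Connector (SCC)

    // Hypothetical row type; order_id is assumed to be the Cassandra partition key.
    case class OrderLine(orderId: Int, lineNo: Int, amount: Double)

    val rows = sc.parallelize(Seq(OrderLine(1, 1, 9.99), OrderLine(1, 2, 4.50)))

    // Move each row onto an executor co-located with one of its replicas, then write.
    // Note: the repartition itself is a Spark shuffle, which may cost more than
    // simply letting token-aware writes cross the network.
    rows
      .repartitionByCassandraReplica("shop", "order_lines", partitionsPerHost = 10)
      .saveToCassandra("shop", "order_lines", SomeColumns("order_id", "line_no", "amount"))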
A word of warning: I only use Cassandra and Spark as separate open-source projects; I do not have expertise with DSE.
I am afraid the data needs to hit the network to replicate, even when every Spark node talks to its local Cassandra node.
Without replication, and with a Spark job that makes sure all data is hashed and pre-shuffled to the corresponding Cassandra node, it should be possible to connect to 127.0.0.1:9042 and avoid the network.

Spark Streaming: in-memory aggregation - correct usage

I have a Spark 2.2 Structured Streaming flow from an on-premises system into a containerized cloud Spark cluster, where Kafka receives the data and Structured Streaming maintains a number of queries that flush to disk every ten seconds. A query's console sink is not accessible to external sessions outside the streaming context (hence the CSV flush); the monitoring dashboard runs Spark SQL from another context to get metrics.
Right now I am only aggregating the data that has come in since streaming was last started. Now I need to aggregate all data ever received together with the incoming stream to provide (near) real-time views. This will mean running a bunch of GROUP BYs over billions of records and maintaining several million aggregate rows in memory.
My question is about how Spark streaming queries scale in this scenario: how efficient is the memory usage (I'll probably use 32 worker containers), and is this the correct way to maintain a (near) real-time view of incoming data using Kafka and Structured Streaming?
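For reference, a minimal sketch of this kind of running aggregation under assumed broker, topic and column names, with a comma-separated Kafka payload (Spark 2.2 APIs):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder.appName("near-realtime-agg").getOrCreate()
    import spark.implicits._

    // Assumed topic "orders" with lines like "customer,amount".
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "orders")
      .load()
      .selectExpr("CAST(value AS STRING) AS line")
      .select(
        split($"line", ",")(0).as("customer"),
        split($"line", ",")(1).cast("double").as("amount"))

    // Running aggregation over everything seen since the query started;
    // the aggregate state is kept in the executors' state store.
    val totals = events.groupBy($"customer").agg(sum($"amount").as("total"))

    // Re-emit the full aggregate every ten seconds. Complete output mode is needed
    // for a global aggregate like this, and file sinks do not support it, hence a
    // memory/console sink (or a custom sink) rather than a direct CSV flush.
    val query = totals.writeStream
      .outputMode("complete")
      .format("memory")
      .queryName("totals")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()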

Redshift with Spark Streaming

I have a Kafka - Spark Streaming application that ingests and processes 60K events per minute. I need a database to store my transformed data frames so they can be accessed by the visualization layer. Can Redshift be used for this with Spark Streaming, or should Cassandra be used? I will be processing and storing the data frames in every Spark window of 30 seconds, and I also need to read from the datastore in every window. I guess Redshift is primarily a data-warehousing database, not meant for OLTP-style processing. Any ideas?
You should check out SnappyData. SnappyData deeply integrates an in-memory database with Spark, allowing hybrid OLTP/OLAP applications. You can write Spark Streaming applications on top of SnappyData that update or delete data in the database. Further, because it does not go through a connector, it performs better than the myriad datastores that have Spark connectors, and even than the native Spark cache. There may be other datastores that offer hybrid OLTP/OLAP applications on Spark as well.
Disclaimer: I am a SnappyData employee.
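Whichever store ends up behind it, the per-window write/read pattern from the question looks roughly like this sketch; the DStream `events`, the JDBC URL and the table name are placeholders, not a recommendation for any particular product:

    import java.util.Properties

    import org.apache.spark.sql.{SaveMode, SparkSession}
    import org.apache.spark.streaming.Seconds

    // `events` is assumed to be an existing DStream[String] coming from Kafka.
    // Note: Redshift is normally bulk-loaded via COPY from S3; frequent small JDBC
    // writes like this are exactly where an operational store tends to fit better.
    val url   = "jdbc:postgresql://reporting-db:5432/metrics"
    val props = new Properties()

    events.window(Seconds(30), Seconds(30)).foreachRDD { rdd =>
      val spark = SparkSession.builder.getOrCreate()
      import spark.implicits._

      // store this window's transformed data
      rdd.toDF("payload").write.mode(SaveMode.Append).jdbc(url, "window_events", props)

      // read back from the datastore within the same window, e.g. for the dashboard query
      val history = spark.read.jdbc(url, "window_events", props)
      // ... join/aggregate `history` with the current window as needed
    }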

Spark for realtime OLAP queries

I have IoT data streaming in via Kafka and would like to use Spark SQL to analyze it. I was planning on persisting the data to S3 using Secor, but there will be a delay of a few minutes while each batch of data gets collected before being written to S3.
How can I make Spark query both the streaming data and the historical data on S3? Do I run two queries, one with Spark Streaming and one with Spark SQL, and try to combine the results?
Or do I need to use an OLTP database for this type of functionality? I wanted to independently scale compute and storage, which is why I went with Spark + S3.
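One hedged sketch of the "two queries, then combine" idea: keep the recent Kafka data in an in-memory streaming aggregate and union it with a batch aggregate over the historical S3 data. Topic, bucket and column names are made up, and the streaming leg would need to be bounded (e.g. by timestamp) to avoid double counting rows that have already landed on S3:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("streaming-plus-history").getOrCreate()
    import spark.implicits._

    // Streaming leg: counts per device for data that has not reached S3 yet.
    val recent = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "iot")
      .load()
      .selectExpr("CAST(key AS STRING) AS device")
      .groupBy($"device").count()

    // Complete-mode memory sink: fine for a modest number of aggregate rows,
    // since the in-memory table lives on the driver.
    recent.writeStream.outputMode("complete").format("memory").queryName("recent_counts").start()

    // Batch leg: the historical data already landed on S3.
    val historical = spark.read.parquet("s3a://my-bucket/iot/")
      .groupBy($"device").count()

    // Serve the combined view: both sides are now ordinary batch DataFrames.
    val combined = spark.table("recent_counts")
      .union(historical)
      .groupBy($"device").agg(sum($"count").as("total"))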
