Spark Structured Streaming state management with RocksDB

For a particular use case we are using Spark Structured Streaming, but the process is neither efficient nor stable. The stateful aggregation is the most time-consuming and memory-intensive stage of the whole job. Spark Structured Streaming provides a RocksDB implementation for managing state. It helped us gain some stability but added a time overhead, so we are looking to optimise the RocksDB implementation. While exploring the logs we noticed that the memtable hit count is always zero and the block cache hit count is very low. It would be very helpful if someone could shed light on this.
RocksDB itself provides various tuning parameters such as write_buffer_size and min_buffer_to_merge. We tried exposing these parameters to Spark and setting them to high values in order to increase our chances of hitting the memtable, but that didn't help.
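For reference, recent open-source Spark releases (3.2+) ship a built-in RocksDB state store provider that is enabled and tuned purely through SQL configs. Below is a minimal sketch; the tuning key name (blockCacheSizeMB) is an assumption based on recent docs and may differ in your Spark version or in a vendor implementation, so verify it before relying on it:

    import org.apache.spark.sql.SparkSession

    // Minimal sketch: switch the streaming state store to RocksDB and give it a
    // larger block cache, which should raise the block cache hit rate on state lookups.
    val spark = SparkSession.builder()
      .appName("rocksdb-state-store")
      .config("spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
      .config("spark.sql.streaming.stateStore.rocksdb.blockCacheSizeMB", "256") // assumed key name, check your docs
      .getOrCreate()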

RocksDB is mostly a backup for state (the other option is HDFS), or it is used during a shuffle when the local cache (memory) for a partition key is not on the same executor.
You can check the stateful operator metrics provided in the Spark UI to see how memory (cache) is being used before it hits RocksDB.
Maybe the article below can help with getting more info:
https://medium.com/#vndhya/stateful-processing-in-spark-structured-streaming-memory-aspects-964bc6414346 (disclosure: it's written by me)
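If it helps, the same state metrics can also be read programmatically from a running query's progress. A small sketch (field names come from StreamingQueryProgress / StateOperatorProgress):

    import org.apache.spark.sql.streaming.StreamingQuery

    // Print per-operator state metrics for an already-started streaming query.
    def printStateMetrics(query: StreamingQuery): Unit = {
      val progress = query.lastProgress            // null until the first batch completes
      if (progress != null) {
        progress.stateOperators.foreach { op =>
          println(s"rows in state: ${op.numRowsTotal}, state memory used: ${op.memoryUsedBytes} bytes")
        }
      }
    }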

Related

Will Spark Structured Streaming benefit from dynamic allocation if the number of cores is greater than the number of Kafka partitions?

Suppose we have an application that reads from a topic with X partitions, does some filtering on the data, then saves it into storage (no complex shuffling logic, just some simple transformations) using a Structured Streaming query. Will this application benefit from the dynamic allocation feature that adds more than X single-core executors in case of a data spike?
I am asking this because I've mostly worked with DStreams, where there is a well-known recommendation to have a single core per partition, so that every executor core is busy processing data from one partition and adding more executors usually doesn't give much scaling benefit. My intuition says no, because the data will still end up on the same workers, but I might be missing something.
Are you talking about dynamic allocation by YARN?
You can, however, use the minPartitions setting in Spark Structured Streaming.
Refer to https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
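A minimal sketch of the minPartitions option (topic name and broker address are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kafka-minpartitions").getOrCreate()

    // With minPartitions greater than the number of topic partitions, Spark splits
    // large Kafka offset ranges into smaller slices so more cores can read in parallel.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder
      .option("subscribe", "events")                       // placeholder
      .option("minPartitions", "24")
      .load()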

Impala vs Spark performance for ad hoc queries

I'm interested only in the query performance reasons and the architectural differences behind them. All answers I've seen before were outdated or didn't provide me with enough context on WHY Impala is better for ad hoc queries.
Of the 3 considerations below, only the 2nd point explains why Impala is faster on bigger datasets. Could you please comment on the following statements?
Impala doesn't lose time on query pre-initialization; impalad daemons are always running and ready. On the other hand, Spark Job Server provides a persistent context for the same purposes.
Impala is in-memory and can spill data to disk, with a performance penalty, when data doesn't fit in RAM. The same is true for Spark. The main difference is that Spark is written in Scala and has JVM limitations, so workers bigger than 32 GB aren't recommended (because of GC). In turn, [wrong, see UPD] Impala is implemented in C++ and has high hardware requirements: 128-256+ GB of RAM recommended. This is very significant, but should benefit Impala only on datasets that require 32-64+ GB of RAM.
Impala is integrated with the Hadoop infrastructure. AFAIK the main reason to use Impala over other in-memory DWHs is the ability to run over Hadoop data formats without exporting data from Hadoop. This means Impala usually uses the same storage/data/partitioning/bucketing as Spark can use, and does not gain any extra benefit from the data structure compared to Spark. Am I right?
P.S. Is Impala faster than Spark in 2019? Have you seen any performance benchmarks?
UPD:
Questions update:
I. Why does Impala recommend 128+ GB of RAM? What is the implementation language of each of Impala's components? The docs say that "Impala daemons run on every node in the cluster, and each daemon is capable of acting as the query planner, the query coordinator, and a query execution engine." If impalad is Java, then what parts are written in C++? Is there something between impalad and the columnar data? Are 256 GB of RAM required for impalad or for some other component?
II. Impala loses all of its in-memory performance benefits when it comes to cluster shuffles (JOINs), right? Does Impala have any mechanics to boost JOIN performance compared to Spark?
III. Impala uses a Multi-Level Service Tree (something like the Dremel engine, see "Execution model") vs. Spark's Directed Acyclic Graph. What does MLST vs. DAG actually mean in terms of ad hoc query performance? Or is it a better fit for a multi-user environment?
First off, I don't think a comparison of a general-purpose distributed computing framework and a distributed DBMS (SQL engine) has much meaning. But if we still want to compare single-query execution in single-user mode (?!), then the biggest difference IMO would be what you've already mentioned -- Impala query coordinators have everything (table metadata from the Hive MetaStore + block locations from the NameNode) cached in memory, while Spark needs time to extract this data in order to perform query planning.
The second biggie would probably be the shuffle implementation, with Spark writing temp files to disk at stage boundaries versus Impala trying to keep everything in memory. This leads to a radical difference in resilience: while Spark can recover from losing an executor and move on by recomputing missing blocks, Impala will fail the entire query after a single impalad daemon crash.
Less significant performance-wise (since it typically takes much less time compared to everything else) but architecturally important is the work distribution mechanism -- compiled whole-stage codegen sent to the workers in Spark vs. declarative query fragments communicated to daemons in Impala.
As for specific query optimization techniques (query vectorization, dynamic partition pruning, cost-based optimization) -- they could be on par today, or will be in the near future.

Optimized caching via spark

I am working on a solution to provide low-latency results using Spark. For this, I was planning to cache beforehand the data a user wants to query.
I am able to achieve good performance on the queries. One thing I noticed is that the data on the cluster (Parquet format) explodes when cached. I understand this is due to deserializing and decoding the data. I am just wondering if there are any other options to reduce the memory footprint.
I tried using
sqlContext.cacheTable("table_name") and also
tbl.persist(StorageLevel.MEMORY_AND_DISK_SER)
but neither helped reduce the memory footprint.
Perhaps you want to try ORC? There have been improvements in ORC support recently (more here: https://www.slideshare.net/Hadoop_Summit/orc-improvement-in-apache-spark-23-95295487). I am not an expert, but I have heard that ORC uses an in-memory columnar format... This format gives opportunities for compression via techniques like run-length encoding of repeated values, which tends to lower the memory footprint.
It also explodes when not caching.
Caching has nothing to do with reducing the memory footprint. You don't state whether you use an RDD or a DataFrame, but I presume the latter. This overview of the RDD memory footprint in Spark gives an idea for RDDs and of the improvements for DFs / DSs: https://spoddutur.github.io/spark-notes/deep_dive_into_storage_formats.html
You cannot reuse the data for different users. What you could consider is Apache Ignite; see https://ignite.apache.org/use-cases/spark/shared-memory-layer.html
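As a rough illustration of the knobs mentioned above, here is a sketch that caches a DataFrame with the in-memory columnar cache compression enabled (config names are from the Spark SQL docs; the input path is a placeholder):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("cache-tuning").getOrCreate()

    // Columnar cache compression (run-length / dictionary encoding) shrinks cached data.
    spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
    spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "10000")

    val df = spark.read.parquet("/data/table_name")   // placeholder path
    df.persist(StorageLevel.MEMORY_AND_DISK_SER)
    df.count()                                        // materialize the cache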

Monitor Spark actual work time vs. communication time

On a Spark cluster, if the jobs are very small, I assume the cluster will be used inefficiently, since most of the time will be spent on communication between nodes rather than on utilizing the processors on the nodes.
Is there a way to monitor how much time out of a job submitted with spark-submit is wasted on communication, and how much on actual computation?
I could then monitor this ratio to check how efficient my file aggregation scheme or processing algorithm is in terms of distribution efficiency.
I looked through the Spark docs, and couldn't find anything relevant, though I'm sure I'm missing something. Ideas anyone?
You can see this information in the Spark UI, assuming you are running Spark 1.4.1 or higher (sorry, but I don't know how to do this for earlier versions of Spark).
A brief summary: you can view a timeline of all the events happening in your Spark job within the Spark UI. From there, you can zoom in on each individual job and each individual task. Each task is divided into scheduler delay, serialization / deserialization, computation, shuffle, etc.
Now, this is obviously a very pretty UI, but you might want something more robust so that you can check this info programmatically. You can also use the REST API to export the logging info as JSON.
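A small sketch of pulling those metrics over the REST API (driver host, port and application id are placeholders); the per-stage JSON includes fields such as executorRunTime and shuffle read/write metrics that you can aggregate into a compute-vs-communication ratio:

    import scala.io.Source

    // The monitoring REST API is served by the driver UI under /api/v1.
    val appId = "app-20190101000000-0000"                        // placeholder application id
    val base  = s"http://driver-host:4040/api/v1/applications/$appId"

    // Per-stage metrics (executor run time, shuffle bytes, etc.) as JSON.
    val stagesJson = Source.fromURL(s"$base/stages").mkString
    println(stagesJson.take(500))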

Is Tachyon by default implemented by the RDDs in Apache Spark?

I'm trying to understand Spark's in-memory feature. In the process I came across Tachyon,
which is basically an in-memory data layer that provides fault tolerance without replication by using lineage and reduces re-computation
by checkpointing the datasets. Where I got confused is that all these features are also achievable with Spark's standard RDD system. So I wonder: do RDDs implement Tachyon behind the curtains to provide these features? If not, then what is the use of Tachyon, when all of its job can be done by standard RDDs? Or am I making some mistake in relating these two? A detailed explanation or a link to one would be a great help. Thank you.
What is in the paper you linked does not reflect the reality of what is in Tachyon as a released open source project; parts of that paper have only ever existed as research prototypes and have never been fully integrated into Spark/Tachyon.
When you persist data to the OFF_HEAP storage level via rdd.persist(StorageLevel.OFF_HEAP), Spark uses Tachyon to write that data into Tachyon's memory space as a file. This removes it from the Java heap, giving Spark more heap memory to work with.
It does not currently write the lineage information, so if your data is too large to fit into your configured Tachyon cluster's memory, portions of the RDD will be lost and your Spark jobs can fail.
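For completeness, a minimal sketch of the Spark 1.x-era API described above, assuming a Tachyon cluster is already configured as Spark's external block store (the config key and paths are assumptions from that era's docs):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("offheap-tachyon")
      .set("spark.externalBlockStore.url", "tachyon://tachyon-master:19998") // assumed key (Spark 1.5/1.6 era)
    val sc = new SparkContext(conf)

    val rdd = sc.textFile("hdfs:///data/events")   // placeholder input
    rdd.persist(StorageLevel.OFF_HEAP)             // blocks go to Tachyon instead of the JVM heap
    rdd.count()                                    // materializes the off-heap blocks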
