Why spark-sql cpu utilization is higher than hive? - apache-spark

I am performing same query in both Hive and Spark SQL. We know that Spark is faster than hive, so i got the expected response time.
But when we consider about the CPU Utilization,
Spark process takes above >300%
while Hive takes near 150% for the same.
Is it the real nature of Spark and Hive?
What other metrics needs to be considered?
How to evaluate both in right way?

A big picture
Spark has no superpowers. The source of it advantage over MapReduce, is preference towards fast in-memory access, over slower out-of-core processing depending on distributed storage. So what it does it at its core is cutting off IO wait time.
Conclusion
Higher average CPU utilization is expected. Let's say you want to compute sum of N number. Independent of implementation asymptotic number of operations will be the same. However, if data is in-memory, you can expect lower total time and higher average CPU usage, while if data is on disk, you can expect higher total time and lower average CPU usage (higher IO wait).
Some remarks:
Spark and Hive are not designed with the same goals in mind. Spark is more ETL / streaming ETL tool, Hive database / data warehouse. This implies different optimization under the hood and performance can differ highly, depending on the workload.
Comparing resource usage without the context doesn't make much sense.
In general Spark is less conservative and more resource hungry. It reflects both the design goals, as well as hardware evolution. Spark is a few years younger, and it is enough to see significant drop in the hardware cost.

Related

Will Spark structured streaming benefit from dynamic allocation if number of cores more than number of Kafka partitions?

Supposed we have an application that reads from X partition topic, does some filtering on the data then saves it into storage (no complex shuffling logic, just some simple transformations) using Structured Streaming query. Will this application benefit from dynamic allocation feature that adds more than X single-core executors in case of data spike?
I am asking this, because I've mostly worked with DStreams, where there is quite well known recommendation to have single core per partition so that every executor core will be busy processing data from one partition and adding more executors usually will not give much scaling benefits. My intuition says that no, because the data will still end up on the same workers, but I might be missing something.
are you talking about dynamic allocation by yarn ?
But you can use minPartitions setting in spark structured streaming.
Refer https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html

How to determine the number of executors to read a delta table?

I have a delta table which is partitioned by multiple keys, one of which includes date excluding minute details(only upto hour, example - Fri, 15 Jul 2022 07)
Now, with the data keep ingesting via batch and streaming ingestion workflow, what would be the best strategy to evaluate number of executors to read all the data from delta table?
One of the very naive way could be to just let spark autoscale but we may still need to play with shuffle partitions etc. Looking for hints or best practices around the same. Thanks!
If you want to "read all the data from delta table" it does not really matter whether this table is partitioned or not since the query reads all the data and hence loads the whole table.
This is the worst possible query - the dreaded full scan. If it's inevitable, just know that that is the kind of queries where Spark SQL shines so bright utilising the full power of a Spark cluster. You've been warned :)
Executors are simply machines with CPU cores and memory. You're probably more interested in the number of CPU cores for all the tasks to load the delta table.
I'd start this calculation with the number of files for a given version of the delta table. Files are of different size and (I might be wrong here) they are usually chunked (I don't want to use the overloaded term partitioned here, but that's what springs to my mind) to 512MB splits.
The number of splits (512MB blocks) for all the files of a given version of the delta table would be the number of tasks. That would give you the number of CPU cores and hence their "containers", i.e. Spark executors (to evenly saturate available physical resources for the best performance).

Impala vs Spark performance for ad hoc queries

I'm interested only in query performance reasons and architectural differences behind them. All answers I've seen before were outdated or hadn't provide me with enough context of WHY Impala is better for ad hoc queries.
From 3 considerations below only the 2nd point explain why Impala is faster on bigger datasets. Could you please contribute to the following statements?
Impala doesn't miss time for query pre-initialization, means impalad daemons are always running & ready. In other hand, Spark Job Server provide persistent context for the same purposes.
Impala is in-memory and can spill data on disk, with performance penalty, when data doesn't have enough RAM. The same is true for Spark. The main difference is that Spark is written on Scala and have JVM limitations, so workers bigger than 32 GB aren't recommended (because of GC). In turn, [wrong, see UPD] Impala is implemented on C++, and has high hardware requirements: 128-256+ GBs of RAM recommended. This is very significant, but should benefit Impala only on datasets that requires 32-64+ GBs of RAM.
Impala is integrated with Hadoop infrastructure. AFAIK the main reason to use Impala over another in-memory DWHs is the ability to run over Hadoop data formats without exporting data from Hadoop. Means Impala usually use the same storage/data/partitioning/bucketing as Spark can use, and do not achieve any extra benefit from data structure comparing to Spark. Am I right?
P.S. Is Impala faster than Spark in 2019? Have you seen any performance benchmarks?
UPD:
Questions update:
I. Why Impala recommends 128+ GBs RAM? What is an implementation language of each Impala's component? Docs say that "Impala daemons run on every node in the cluster, and each daemon is capable of acting as the query planner, the query coordinator, and a query execution engine.". If impalad is Java, than what parts are written on C++? Is there smth between impalad & columnar data? Are 256 GBs RAM required for impalad or some other component?
II. Impala loose all in-memory performance benefits when it comes to cluster shuffles (JOINs), right? Does Impala have any mechanics to boost JOIN performance compared to Spark?
III. Impala use Multi-Level Service Tree (smth like Dremel Engine see "Execution model" here) vs Spark's Directed Acyclic Graph. What does actually MLST vs DAG mean in terms of ad hoc query performance? Or it's a better fit for multi-user environment?
First off, I don't think comparison of a general purpose distributed computing framework and distributed DBMS (SQL engine) has much meaning. But if we would still like to compare a single query execution in single-user mode (?!), then the biggest difference IMO would be what you've already mentioned -- Impala query coordinators have everything (table metadata from Hive MetaStore + block locations from NameNode) cached in memory, while Spark will need time to extract this data in order to perform query planning.
Second biggie would probably be shuffle implementation, with Spark writing temp files to disk at stage boundaries against Impala trying to keep everything in-memory. Leading to a radical difference in resilience - while Spark can recover from losing an executor and move on by recomputing missing blocks, Impala will fail the entire query after a single impalad daemon crash.
Less significant performance-wise (since it typically takes much less time compared to everything else) but architecturally important is work distribution mechanism -- compiled whole stage codegens sent to the workers in Spark vs. declarative query fragments communicated to daemons in Impala.
As far as specific query optimization techniques (query vectorization, dynamic partition pruning, cost-based optimization) -- they could be on par today or will be in the near future.

Spark Multi-Core experiment [duplicate]

I am doing a simple scaling test on Spark using sort benchmark -- from 1 core, up to 8 cores. I notice that 8 cores is slower than 1 core.
//run spark using 1 core
spark-submit --master local[1] --class john.sort sort.jar data_800MB.txt data_800MB_output
//run spark using 8 cores
spark-submit --master local[8] --class john.sort sort.jar data_800MB.txt data_800MB_output
The input and output directories in each case, are in HDFS.
1 core: 80 secs
8 cores: 160 secs
I would expect 8 cores performance to have x amount of speedup.
Theoretical limitations
I assume you are familiar Amdahl's law but here is a quick reminder. Theoretical speedup is defined as followed :
where :
s - is the speedup of the parallel part.
p - is fraction of the program that can be parallelized.
In practice theoretical speedup is always limited by the part that cannot be parallelized and even if p is relatively high (0.95) the theoretical limit is quite low:
(This file is licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license.
Attribution: Daniels220 at English Wikipedia)
Effectively this sets theoretical bound how fast you can get. You can expect that p will be relatively high in case embarrassingly parallel jobs but I wouldn't dream about anything close to 0.95 or higher. This is because
Spark is a high cost abstraction
Spark is designed to work on commodity hardware at the datacenter scale. It's core design is focused on making a whole system robust and immune to hardware failures. It is a great feature when you work with hundreds of nodes
and execute long running jobs but it is doesn't scale down very well.
Spark is not focused on parallel computing
In practice Spark and similar systems are focused on two problems:
Reducing overall IO latency by distributing IO operations between multiple nodes.
Increasing amount of available memory without increasing the cost per unit.
which are fundamental problems for large scale, data intensive systems.
Parallel processing is more a side effect of the particular solution than the main goal. Spark is distributed first, parallel second. The main point is to keep processing time constant with increasing amount of data by scaling out, not speeding up existing computations.
With modern coprocessors and GPGPUs you can achieve much higher parallelism on a single machine than a typical Spark cluster but it doesn't necessarily help in data intensive jobs due to IO and memory limitations. The problem is how to load data fast enough not how to process it.
Practical implications
Spark is not a replacement for multiprocessing or mulithreading on a single machine.
Increasing parallelism on a single machine is unlikely to bring any improvements and typically will decrease performance due to overhead of the components.
In this context:
Assuming that the class and jar are meaningful and it is indeed a sort it is just cheaper to read data (single partition in, single partition out) and sort in memory on a single partition than executing a whole Spark sorting machinery with shuffle files and data exchange.

How to reduce spark batch job creation overhead

We have a requirement where a calculation must be done in near real time (with in 100ms at most) and involves moderately complex computation which can be parallelized easily. One of the options we are considering is to use spark in batch mode apart from Apache Hadoop YARN. I've read that submitting batch jobs to spark has huge overhead however. Is these a way we can reduce/eliminate this overhead?
Spark best utilizes available resources i.e. memory and cores. Spark uses the concept of Data Locality.
If data and the code that operates on it are together than computation tends to be fast. But if code and data are separated, one must move to the other. Typically it is faster to ship serialized code from place to place than a chunk of data because code size is much smaller than data.
If you are low on resources surely scheduling and processing time will shoot. Spark builds its scheduling around this general principle of data locality.
Spark prefers to schedule all tasks at the best locality level, but this is not always possible.
Check https://spark.apache.org/docs/1.2.0/tuning.html#data-locality

Resources