Tez vs Spark - huge performance differences

I'm using HDP 2.6.4 and am seeing huge differences between Spark SQL and Hive on Tez. Here's a simple query on a table of ~95 million rows:
SELECT DT, Sum(1) from mydata GROUP BY DT
DT is the partition column, a string that represents a date.
In the Spark shell, with 15 executors, 10G of driver memory and 15G per executor, the query runs in 10-15 seconds.
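Roughly, the Spark shell side of that comparison looks like the following sketch (spark here is a Hive-enabled SparkSession, and the resources are the ones given on the command line):
// spark-shell --num-executors 15 --executor-memory 15G --driver-memory 10G
spark.sql("SELECT DT, SUM(1) AS cnt FROM mydata GROUP BY DT").show()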
When run on Hive (from Beeline), the query has been running for 500+ seconds and is still going (!!!).
To make things worse, this application takes significantly more resources than the Spark shell session I ran the job in.
UPDATE: It finished: 1 row selected (672.152 seconds)
More information about the environment:
Only one queue is used, with the Capacity Scheduler
The job runs under my own user; we use Kerberos with LDAP
AM Resource: 4096 MB
using tez.runtime.compress with Snappy
data is in Parquet format, no compression applied
tez.task.resource.memory 6134 MB
tez.counters.max 10000
tez.counters.max.groups 3000
tez.runtime.io.sort.mb 8110 MB
tez.runtime.pipelined.sorter.sort.threads 2
tez.runtime.shuffle.fetch.buffer.percent 0.6
tez.runtime.shuffle.memory.limit.percent 0.25
tez.runtime.unordered.output.buffer.size-mb 460 MB
Enable Vectorization and Map Vectorization true
Enable Reduce Vectorization false
hive.vectorized.groupby.checkinterval 4096
hive.vectorized.groupby.flush.percent 0.1
hive.tez.container.size 682
More Updates:
When checking this link about vectorization, I noticed that I don't see Vectorized execution: true anywhere in the EXPLAIN output. Another thing that caught my attention is the following: table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
However, the table itself is defined with STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' and OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'.
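One quick way to cross-check what the metastore actually reports for the table is from the Spark shell (a sketch; the table name is taken from the question):
// prints the detailed table information, including the InputFormat / OutputFormat / SerDe rows
spark.sql("DESCRIBE FORMATTED mydata").show(100, truncate = false)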
Most comparisons between Spark and Tez come out roughly even, but I'm seeing dramatic differences.
What should be the first thing to check?
Thanks

In the end, we gave up and installed LLAP. I'm going to accept this as the answer; I'm a bit obsessive, and this unanswered question has been poking my eyes for long enough.

Related

Spark Performance Issue vs Hive

I am working on a pipeline that will run daily. It joins two tables, say x and y (approximately 18 MB and 1.5 GB respectively), and loads the output of the join into a final table.
Here are the facts about the environment:
For table x:
Data size: 18 MB
Number of files per partition: ~191
File type: Parquet
For table y:
Data size: 1.5 GB
Number of files per partition: ~3200
File type: Parquet
Now the problem is:
Hive and Spark are giving the same performance (the time taken is the same).
I tried different combinations of resources for the Spark job.
e.g.:
executors:50 memory:20GB cores:5
executors:70 memory:20GB cores:5
executors:1 memory:20GB cores:5
All three combinations give the same performance. I am not sure what I am missing here.
I also tried broadcasting the small table 'x' to avoid a shuffle during the join, but it did not improve performance much.
One key observation is:
70% of the execution time is spent reading the big table 'y', and I guess this is due to the large number of files per partition.
I am not sure how Hive is giving the same performance.
Kindly suggest.
I assume you are comparing Hive on MR vs Spark; please let me know if that is not the case, because Hive (on Tez or Spark) vs Spark SQL will not differ vastly in terms of performance.
I think the main issue is that there are too many small files.
A lot of CPU time goes into the I/O itself, so you never get to see Spark's processing power.
My advice is to coalesce the Spark DataFrames immediately after reading the Parquet files: coalesce the 'x' DataFrame into a single partition and the 'y' DataFrame into 6-7 partitions.
After doing the above, perform the join (a broadcast hash join), as sketched below.
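A minimal sketch of that suggestion; the paths and the join key "id" are hypothetical placeholders:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("xy-join").getOrCreate()

// coalesce right after the read so the many small files collapse into a few partitions
val x = spark.read.parquet("/data/x").coalesce(1)   // ~18 MB, one partition
val y = spark.read.parquet("/data/y").coalesce(7)   // ~1.5 GB, a handful of partitions

// broadcast() ships the small side to every executor, so y is never shuffled
val joined = y.join(broadcast(x), Seq("id"))
joined.write.mode("overwrite").parquet("/data/final_table")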

Compression rate in Spark Application

I am doing some benchmarking on a cluster using Spark. Among other things, I want a good approximation of the average size reduction achieved by serialization and compression. I am running in client deploy mode with the local master, and I tried the shells of both Spark 1.6 and 2.2.
I want to do that by calculating the in-memory size and then the size on disk, so the ratio should be my answer. I obviously have no problem getting the on-disk size, but I am really struggling with the in-memory one.
Since my RDD is made of doubles, which occupy 8 bytes each in memory, I tried counting the number of elements in the RDD and multiplying by 8, but that leaves out a lot of things.
The second approach was using "SizeEstimator" (https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.util.SizeEstimator$), but this is giving me crazy results! In Spark 1.6 it is either 30, 130 or 230, seemingly at random (47 MB on disk); in Spark 2.2 it starts at 30 and every time I execute it, it increases by 0 or by 1. I know it says it's not super accurate, but I can't even find a bit of consistency! I even tried setting the persistence level to memory only:
rdd.persist(StorageLevel.MEMORY_ONLY)
but still, nothing changed.
Is there any other way to get the in-memory size of the RDD? Or should I try another approach? I am writing to disk with rdd.saveAsTextFile, and generating the RDD via RandomRDDs.uniformRDD.
EDIT
sample code:
Write:
val rdd = RandomRDDs.uniformRDD(sc, nBlocks, nThreads)
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
println("RDD count: " + rdd.count)
rdd.saveAsObjectFile("file:///path/to/folder")
Read:
val rdd = sc.wholeTextFiles(name, nThreads)
rdd.count() // action, so I'm sure the file is actually read
Web UI
Try caching the RDD as you mentioned and check the Storage tab of the Spark UI.
By default an RDD is stored in memory deserialised. If you want it serialised, explicitly use persist with the MEMORY_ONLY_SER option; memory consumption will be lower. On disk, data is always stored in serialised form.
Check the Spark UI.
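If you want the number programmatically rather than from the UI, sc.getRDDStorageInfo reports the same figures the Storage tab shows. A minimal sketch, where nBlocks and nThreads are placeholder values standing in for the ones used in the question:
import org.apache.spark.storage.StorageLevel
import org.apache.spark.mllib.random.RandomRDDs

val nBlocks = 1000000L   // placeholder element count
val nThreads = 4         // placeholder partition count
val rdd = RandomRDDs.uniformRDD(sc, nBlocks, nThreads)
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
rdd.count()              // materialise the cached blocks first

// same memSize/diskSize values that the Storage tab of the web UI displays
sc.getRDDStorageInfo
  .find(_.id == rdd.id)
  .foreach(info => println(s"in memory: ${info.memSize} B, on disk: ${info.diskSize} B"))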

Data Processing in Parallel using Apache Spark with Pyspark

I have a daily-level transaction dataset covering three months, around 17 GB combined. I have a server with 16 cores, 64 GB of RAM and 1 TB of hard disk space. The transaction data is broken into 90 files, each with the same format, and there is a set of queries to run over the entire dataset; the query for each daily-level file is the same for all 90 files. The results of the queries are appended, and then we get the resulting summary back. Before I start on this endeavour, I was wondering whether Apache Spark with PySpark can be used to solve this. I tried R, but it was very slow and I ultimately ran out of memory.
So my question has two parts:
1. How should I create my RDD? Should I pass my entire dataset as one RDD, or is there a way to tell Spark to work in parallel on these 90 datasets? (See the sketch below.)
2. Can I expect a significant speed improvement if I am not working with Hadoop?
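For what it's worth, a sketch of option 1, shown in Scala (the PySpark calls are analogous); the directory, file format and column names are hypothetical:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("daily-summary").getOrCreate()

// a glob path reads all 90 daily files into one distributed DataFrame;
// Spark splits the work across the available cores on its own
val txns = spark.read.option("header", "true").csv("/data/transactions/day_*.csv")
txns.createOrReplaceTempView("txns")

// run the (identical) daily query once over the whole dataset and write the summary out
val summary = spark.sql("SELECT day, COUNT(*) AS n_txns FROM txns GROUP BY day")
summary.write.mode("overwrite").csv("/data/summary")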

Increase the Query Parallelism Capacity on Cached RDD (DataFrame) with Spark-Job-Server on a Standalone Spark Cluster

First of all, our standalone Spark cluster consists of 20 nodes (including the 2 masters), each with 40 cores and 128G of memory.
1.
We use Spark-Job-Server to reuse the SparkContext (at the core, we want to reuse a cached RDD for querying). When we set the Spark executor memory to 33G per node and execute SQL on the DataFrame, such as "select * from tablename limit 10", the result comes back as malformed UTF-8 that the application cannot parse.
But if we set the executor memory below 32G, the result is well formed. We kept the rest of the settings untouched while changing the memory.
Can anyone who knows Spark and Spark-Job-Server well tell us the cause of the garbled output?
Is too much memory the reason why our results come back garbled?
2.
The second question is more specific to our use case.
We load 60G of data into memory and persist it with the memory-only storage level; the data is a structured table that we will run queries against.
Then we tried Spark SQL on our cached 60G RDD (registered as a DataFrame); specifically, we executed several queries like "select column from tableName where condition clause" in parallel, which led to OOM exceptions.
We really want to increase our query parallelism with the current cluster.
Can anyone give us some hints or information that would help us meet our parallelism requirement?
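For concreteness, a sketch of the setup described above (the path, table name and columns are hypothetical); each query is submitted from its own thread, which is what querying the cached DataFrame "in parallel" amounts to:
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("cached-table-queries").getOrCreate()

val table = spark.read.parquet("/data/big_table")   // ~60G in the question
table.persist(StorageLevel.MEMORY_ONLY)
table.createOrReplaceTempView("tablename")
table.count()                                        // materialise the cache up front

implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))

val queries = Seq(
  "SELECT col_a FROM tablename WHERE col_b > 10",
  "SELECT col_a FROM tablename WHERE col_c = 'x'"
)
// each Future runs one SQL statement against the cached view and pulls its rows back to the driver
val results = queries.map(q => Future(spark.sql(q).collect()))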

Does Spark incur the same amount of overhead as Hadoop for vnodes?

I just read https://stackoverflow.com/a/19974621/260805. Does Spark (specifically Datastax's Cassandra Spark connector) incur the same amount of overhead as Hadoop when reading from a Cassandra cluster? I know Spark uses threads more heavily than Hadoop does.
Performance with vnodes and without should be basically the same in the connector. With Hadoop, each vnode split generated its own task, which created a large amount of overhead.
With Spark, the token ranges from multiple vnodes are merged into a single task, so the overall task overhead is lower. There is a slight locality issue: with smaller data sizes it becomes difficult to get a balanced number of tasks across all the nodes in the C* cluster. This issue is being worked on in SPARKC-43.
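For context, a rough sketch of reading through the connector, assuming a recent spark-cassandra-connector on the classpath; the connection host, keyspace and table name are hypothetical. spark.cassandra.input.split.size_in_mb is the knob that controls how much data, and therefore how many token ranges, each Spark partition covers:
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cassandra-read")
  .set("spark.cassandra.connection.host", "10.0.0.1")
  // approximate data per Spark partition; token ranges from several vnodes
  // are grouped together until this target is reached
  .set("spark.cassandra.input.split.size_in_mb", "64")

val sc = new SparkContext(conf)
val rows = sc.cassandraTable("my_keyspace", "emp")
println(rows.count())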
I'll give three separate answers. I apologize for the rather unstructured answer, but it's been building up over time:
A previous answer:
Here's one potential answer: Why not enable virtual node in a Hadoop node?. I quote:
Does this also apply to Spark?
No, if you're using the official DataStax spark-cassandra-connector. It can process multiple token ranges in a single Spark task. There is still some minor performance hit, but not as huge as with Hadoop.
A production benchmark
We ran a Spark job against a vnode-enabled Cassandra (DataStax Enterprise) datacenter with 3 nodes. The job took 9.7 hours. Running the same job on slightly less data, using 5 non-vnode nodes, a couple of weeks back took 8.8 hours.
A controlled benchmark
To further test the overhead, we ran a controlled benchmark on a DataStax Enterprise node in a single-node cluster. With vnodes both enabled and disabled, the node was 1) reset, 2) X rows were written, and then 3) SELECT COUNT(*) FROM emp was executed in Shark a couple of times to get cold vs. hot cache times. The values of X tested were 10^0 through 10^8.
Assuming that Shark does not deal with vnodes in any way, the average (quite stable) overhead for vnodes was ~28 seconds for cold Shark query executions and ~17 seconds for hot executions. The latency difference generally did not vary with data size.
All the numbers for the benchmark can be found here. All scripts used to run the benchmark (see output.txt for usage) can be found here.
My only guess why there was a difference between "Cold diff" and "Hot diff" (see spreadsheet) is that it took Shark some time to create metadata, but this is simply speculation.
Conclusion
Our conclusion is that the overhead of vnodes is a constant time between 13 and 30 seconds, independent of data size.
