What is the equivalent configuration between Spark and Drill? - apache-spark

I want to compare the query performance between Spark and Drill, so the configuration of the two systems has to be equivalent. What are the parameters I have to consider, like driver memory and executor memory for Spark, or max direct memory and planner.memory.max_query_memory_per_node for Drill? Can someone give me an example configuration?

It is possible to get a close comparison between Spark and Drill for their specific overlapping use cases. I will first describe how Spark and Drill are different, what the overlapping use cases are, and finally how you could tune Spark's memory settings to match Drill as closely as possible for those use cases.
Comparison of Functionality
Both Spark and Drill can function as a SQL compute engine. My definition of a SQL compute engine is a system that can do the following:
Ingest data from files, databases, or message queues.
Execute SQL statements provided by a user on the ingested data.
Write the results of a user's SQL statement to a terminal, file, database table, or message queue.
Drill is only a SQL compute engine, while Spark can do more than that. The extra things that Spark can do are the following:
Spark has APIs to manipulate data with functional programming operations, not just SQL.
Spark can save results of operations to DataSets. DataSets can be efficiently reused in other operations and are efficiently cached both on disk and in memory.
Spark has stream-processing APIs.
So to accurately compare Drill and Spark you can only consider their overlapping functionality, which is executing a SQL statement.
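To make the overlapping use case concrete, here is a minimal Spark sketch of that kind of workload, a plain SQL statement over files; the path, view name and columns are made up, and the same statement could equally be submitted to Drill against the same files.
import org.apache.spark.sql.SparkSession

// Minimal sketch: run one SQL statement over files and write the result out.
val spark = SparkSession.builder().appName("sql-only-comparison").getOrCreate()
spark.read.parquet("hdfs:///data/events").createOrReplaceTempView("events")
spark.sql("SELECT user_id, COUNT(*) AS cnt FROM events GROUP BY user_id")
  .write.parquet("hdfs:///data/event_counts")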
Comparison of Nodes
A running Spark job is made up of two types of nodes: an Executor and a Driver. The Executor is like a worker node that is given simple compute tasks and executes them. The Driver orchestrates a Spark job. For example, if you have a SQL query or a Spark job written in Python, the Driver is responsible for planning how the work for the SQL query or Python script will be distributed to the Executors. The Driver then monitors the work being done by the Executors. The Driver can run in a variety of modes: as a client on your laptop, or on a separate dedicated node or container.
Drill is slightly different. The two participants in a SQL query are the Client and the Drillbits. The Client is essentially a dummy command-line terminal for sending SQL commands and receiving results. The Drillbits are responsible for doing the compute work for a query. When the Client sends a SQL command to Drill, one Drillbit is picked to be the Foreman. There is no restriction on which Drillbit can be the Foreman, and a different Foreman can be selected for each query. The Foreman performs two functions during the query:
It plans the query and orchestrates the rest of the Drillbits to divide up the work.
It also participates in the execution of the query and does some of the data processing itself.
The functions of Spark's Driver and Executors are very similar to those of Drill's Foreman and Drillbits, but not quite the same. The main difference is that a Driver cannot function as an Executor simultaneously, while a Foreman also functions as a Drillbit.
When constructing clusters for comparing Spark and Drill, I would do the following:
Drill: Create a cluster with N nodes.
Spark: Create a cluster with N Executors and make sure the Driver has the same amount of memory as the Executors.
Comparison of Memory Models
Spark and Drill both use the JVM. Applications running on the JVM have access to two kinds of memory: on-heap memory and off-heap memory. On-heap memory is normal garbage-collected memory; for example, if you do new Object(), the object will be allocated on the heap. Off-heap memory is not garbage collected and must be explicitly allocated and freed. When applications consume large amounts of heap memory (16 GB or more), they can tax the JVM garbage collector. In such cases garbage collection can incur a significant compute overhead and, depending on the GC algorithm, computation can pause for several seconds while garbage collection is done. In contrast, off-heap memory is not subject to garbage collection and does not incur these performance penalties.
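To make the distinction concrete, here is a toy JVM-level illustration (plain Java NIO, not specific to Spark or Drill) of the two kinds of memory:
import java.nio.ByteBuffer

// On-heap: backed by a byte[] that the garbage collector manages.
val onHeapBuffer = ByteBuffer.allocate(1024 * 1024)
// Off-heap (direct): allocated outside the GC-managed heap.
val offHeapBuffer = ByteBuffer.allocateDirect(1024 * 1024)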
Spark stores everything on heap by default. It can be configured to store some data in off-heap memory, but it is not clear to me when it will actually store data off heap.
Drill stores all its data in off-heap memory, and only uses on-heap memory for the general engine itself.
Another difference is that Spark reserves some of its memory to cache DataSets, while Drill does not cache data in memory after a query is executed.
In order to compare Spark and Drill apples to apples, we have to configure them to use the same amount of on-heap and off-heap memory for executing a SQL query. In the following example we will walk through how to configure Drill and Spark to use 8 GB of on-heap memory and 8 GB of off-heap memory.
Drill Memory Config Example
Set the following in your drill-env.sh file on each Drillbit
export DRILL_HEAP="8G"
export DRILL_MAX_DIRECT_MEMORY="8G"
Once these are configured, restart your Drillbits and try your query. Your query may still run out of memory because Drill's memory management is under active development. To give yourself an out, you can manually control Drill's memory usage for a query using the planner.width.max_per_node and planner.memory.max_query_memory_per_node options. These options are set in your drill-override.conf. Note that you must change these options on all your nodes and restart your Drillbits for them to take effect. A more detailed explanation of these options can be found here.
Spark Memory Config Example
Create a properties file myspark.conf and pass it to spark-submit (for example with --properties-file myspark.conf). The properties file should include the following configuration.
# 8gb of heap memory for executor
spark.executor.memory 8g
# 8gb of heap memory for driver
spark.driver.memory 8g
# Enable off heap memory and use 8gb of it
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 8000000000
# Do not set aside memory for caching data frames
# Haven't tested if 0.0 works. If it doesn't make this
# as small as possible
spark.memory.storageFraction 0.0
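If you prefer, most of these settings can also be applied programmatically; the sketch below is a rough equivalent (spark.driver.memory is deliberately left out, since it only takes effect when set before the driver JVM starts, i.e. on the spark-submit command line or in the properties file).
import org.apache.spark.sql.SparkSession

// Sketch of the same memory settings applied via the builder API.
val spark = SparkSession.builder()
  .appName("drill-comparison")
  .config("spark.executor.memory", "8g")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "8000000000")
  .config("spark.memory.storageFraction", "0.0")
  .getOrCreate()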
Summary
Create a Drill cluster with N nodes, create a Spark cluster with N Executors plus a dedicated Driver, try the memory configurations provided above, and run the same or a similar SQL query on both clusters. Hope this helps.

Related

How can Spark process data that is way larger than Spark storage?

Currently taking a course in Spark and came across the definition of an executor:
Each executor will hold a chunk of the data to be processed. This chunk is called a Spark partition. It is a collection of rows that sits on one physical machine in the cluster. Executors are responsible for carrying out the work assigned by the driver. Each executor is responsible for two things: (1) execute code assigned by the driver, (2) report the state of the computation back to the driver.
I am wondering what will happen if the storage of the Spark cluster is smaller than the data that needs to be processed. How will executors fetch the data to sit on the physical machines in the cluster?
The same question goes for streaming data, which is unbounded. Does Spark save all the incoming data on disk?
The Apache Spark FAQ briefly mentions the two strategies Spark may adopt:
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
Although Spark uses all available memory by default, it can be configured to run jobs using only the disk.
Section 2.6.4, Behavior with Insufficient Memory, of Matei Zaharia's PhD dissertation on Spark (An Architecture for Fast and General Data Processing on Large Clusters) benchmarks the performance impact of reducing the amount of available memory.
In practice, you don't usually persist the source dataframe of 100TB, but only the aggregations or intermediate computations that are reused.
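As a rough sketch of that pattern (paths and column names are made up), you would persist only the small reused aggregate, with a storage level that is allowed to spill:
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("spill-example").getOrCreate()
import spark.implicits._

val events = spark.read.parquet("hdfs:///warehouse/events")  // may be far larger than cluster RAM
val daily = events.groupBy("day").count()                    // much smaller intermediate result
daily.persist(StorageLevel.MEMORY_AND_DISK)                  // partitions that don't fit in memory spill to disk
daily.filter($"count" > 1000).show()
daily.write.parquet("hdfs:///warehouse/daily_counts")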

Why caching small Spark RDDs takes big memory allocation in Yarn?

The RDDs that are cached (8 in total) are not big, only around 30 GB; however, the Hadoop UI shows that the Spark application is taking a lot of memory (1.4 TB) even though no active jobs are running. Why so much?
Why does it show around 100 executors (here, i.e. vCores) even when there are no active jobs running?
Also, if cached RDDs are stored across 100 executors, are those executors preserved so that no other Spark apps can use them for running tasks any more? To rephrase the question: does reserving a little memory (.cache) in executors prevent other Spark apps from leveraging their idle compute resources?
Is there any Spark config / Zeppelin config that can cause this phenomenon?
UPDATE 1
After checking the Spark conf (Zeppelin), it seems there is a default setting (configured by the administrator) of spark.executor.memory=10G, which is probably the reason.
However, here's a new question: is it possible to keep only the memory needed for the cached RDDs in each executor and release the rest, instead of always holding the initially set memory spark.executor.memory=10G?
Perhaps you can try to repartition(n) your RDD to fewer partitions (n < 100) before caching. A ~30 GB RDD would probably fit into the storage memory of ten 10 GB executors. A good overview of Spark memory management can be found here. This way, only those executors that hold cached blocks will be "pinned" to your application, while the rest can be reclaimed by YARN via Spark dynamic allocation after spark.dynamicAllocation.executorIdleTimeout (default 60s).
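A minimal sketch of that suggestion, assuming an existing RDD called rdd of roughly 30 GB:
import org.apache.spark.storage.StorageLevel

// Shrink to ~10 partitions so the cached blocks land on about 10 executors
// instead of being scattered across 100, then cache and materialize.
val compact = rdd.repartition(10).persist(StorageLevel.MEMORY_ONLY)
compact.count()  // forces the cache to be populated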
Q: Is it possible to keep only the memory needed for the cached RDDs in each executor and release the rest, instead of always holding the initially set memory spark.executor.memory=10G?
When Spark uses YARN as its execution engine, YARN allocates the containers of a specified (by application) size -- at least spark.executor.memory+spark.executor.memoryOverhead, but may be even bigger in case of pyspark -- for all the executors. How much memory Spark actually uses inside a container becomes irrelevant, since the resources allocated to a container will be considered off-limits to other YARN applications.
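As a back-of-envelope illustration (assuming the usual defaults of a 10% overhead factor with a 384 MiB floor, and no extra PySpark memory), the container requested from YARN per executor works out roughly as follows:
// spark.executor.memory = 10g
val executorMemoryMiB = 10 * 1024
// spark.executor.memoryOverhead default: max(384 MiB, 10% of executor memory)
val overheadMiB = math.max(384, (0.10 * executorMemoryMiB).toInt)
// ~11 GiB requested from YARN, before rounding up to yarn.scheduler.minimum-allocation-mb
val containerMiB = executorMemoryMiB + overheadMiB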
Spark assumes that your data is evenly distributed across all the executors and tasks. That's the reason why you set memory per task. So to make Spark consume less memory, your data has to be evenly distributed:
If you are reading from Parquet files or CSVs, make sure that they have similar sizes. Running repartition() causes shuffling, which, if the data is heavily skewed, may cause other problems if executors don't have enough resources.
Caching won't help to release memory on the executors because it doesn't redistribute the data.
Check the "Event Timeline" on the Stages page of the Spark UI and see how big the green bars are. Normally that is tied to the data distribution, so it is a way to see how much data is loaded (proportionally) on every task and how much work each task is doing. As all tasks have the same memory assigned, you can see graphically whether resources are wasted (mostly tiny bars with a few big bars).
There are different ways to create evenly distributed files for your process. I mention some possibilities, but for sure there are more:
Using Hive and the DISTRIBUTE BY clause: you need to use a field that is evenly balanced in order to create as many files (of the proper size) as expected.
If the process creating those files is a Spark process reading from a DB, try to create as many connections as the files you need and use a proper field to populate the Spark partitions. That is achieved, as explained here and here, with the partitionColumn, lowerBound, upperBound and numPartitions properties (see the sketch after this list).
Repartition may work, but see if coalesce also makes sense in your process, or in the previous one that generates the files you are reading from.
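A sketch of such a partitioned JDBC read (the URL, table, credentials and bounds are placeholders, and spark is assumed to be an existing SparkSession): Spark opens numPartitions connections and splits the partitionColumn range evenly across them, which yields evenly sized partitions if that column's values are roughly uniform.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "public.events")
  .option("user", "spark_reader")
  .option("password", "***")
  .option("partitionColumn", "id")   // pick a column with evenly spread values
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "100")    // one Spark partition per id range
  .load()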

setting tuning parameters of a spark job

I'm relatively new to spark and I have a few questions related to the tuning optimizations with respect to the spark submit command.
I have followed: How to tune spark executor number, cores and executor memory?
and I understand how to utilise maximum resources out of my spark cluster.
However, I was recently asked how to define the number of cores, memory and executors when I have a relatively small operation to do, since if I give it maximum resources they are going to be underutilised.
For instance,
if I have to just do a merge job (read files from HDFS and write one single huge file back to HDFS using coalesce) on about 60-70 GB of data in Avro format without compression (assume each file is 128 MB in size, which is the HDFS block size), what would be the ideal memory, number of executors and cores required for this?
Assume I have the same node configuration as the one mentioned in the link above.
I can't understand how much memory will be used up by the entire job, given that there are no joins, aggregations, etc.
The amount of memory you will need depends on what you run before the write operation. If all you're doing is reading data, combining it and writing it out, then you will need very little memory per CPU because the dataset is never fully materialized before writing it out. If you're doing joins/group-by/other aggregate operations, all of those will require much more memory. The exception to this rule is that Spark isn't really tuned for large files and is generally much more performant when dealing with sets of reasonably sized files. Ultimately the best way to get your answers is to run your job with the default parameters and see what blows up.
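For illustration, here is a sketch of the merge job described in the question (paths are placeholders, and the "avro" format assumes the spark-avro package is on the classpath). Note that coalesce(1) means a single task writes the whole output file, which is exactly the "large files" case Spark is not really tuned for.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("merge-avro").getOrCreate()
spark.read.format("avro").load("hdfs:///input/avro/")
  .coalesce(1)                           // one partition, so one task writes the single output file
  .write.format("avro").save("hdfs:///output/merged")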

Does Apache Spark cache RDD in node-level or cluster-level?

I know that Apache Spark's persist method saves RDDs in memory and that, if there is not enough memory space, it stores the remaining partitions of the RDD in the filesystem (disk). What I can't seem to understand is the following:
Imagine we have a cluster and we want to persist an RDD. Suppose node A does not have a lot of memory space and that node B does. Let's suppose now that after running the persist command, node A runs out of memory. The question now is:
Does Apache Spark search for more memory space in node B and try to store everything in memory?
Or, given that there is not enough space in node A, does Spark store the remaining partitions of the RDD on the disk of node A even if there is some memory space available in node B?
Thanks for your answers.
Normally Spark doesn't search for free space. Data is cached locally on the executor responsible for a particular partition.
The only exception is when you use a replicated persistence mode - in that case an additional copy will be placed on another node.
The closest thing I could find is this: To cache or not to cache. I had plenty of situations when data was mildly skewed and I was getting memory-related exceptions/failures when trying to cache/persist into RAM; one way around it was to use StorageLevels like MEMORY_AND_DISK, but obviously it took longer to cache and then read those partitions.
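A small sketch of the storage-level choices (paths are placeholders, sc is an existing SparkContext, and an RDD can only be assigned one storage level, hence the three separate RDDs):
import org.apache.spark.storage.StorageLevel

val a = sc.textFile("hdfs:///data/a").persist(StorageLevel.MEMORY_ONLY)      // partitions that don't fit are recomputed when needed
val b = sc.textFile("hdfs:///data/b").persist(StorageLevel.MEMORY_AND_DISK)  // partitions that don't fit spill to that node's local disk
val c = sc.textFile("hdfs:///data/c").persist(StorageLevel.MEMORY_ONLY_2)    // replicated mode: a second copy on another node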
Also, in the Spark UI you can find information about the executors and how much of their memory is used for caching; you can experiment and monitor how it behaves.

Spark job out of RAM (java.lang.OutOfMemoryError), even though there's plenty. xmx too low?

I'm getting java.lang.OutOfMemoryError with my Spark job, even though only 20% of the total memory is in use.
I've tried several configurations:
1x n1-highmem-16 + 2x n1-highmem-8
3x n1-highmem-8
My dataset consists of 1.8M records, read from a local JSON file on the master node. The entire dataset in JSON format is 7 GB. The job I'm trying to execute involves a simple computation followed by a reduceByKey. Nothing extraordinary. The job runs fine on my single home computer with only 32 GB RAM (xmx28g), although it requires some caching to disk.
The job is submitted through spark-submit, locally on the server (SSH).
Stack trace and Spark config can be viewed here: https://pastee.org/sgda
The code
val rdd = sc.parallelize(Json.load()) // load everything
.map(fooTransform) // apply some trivial transformation
.flatMap(_.bar.toSeq) // flatten results
.map(c => (c, 1)) // count
.reduceByKey(_ + _)
.sortBy(_._2)
log.v(rdd.collect.map(_.toString).mkString("\n"))
The root of the problem is that you should try to offload more I/O to the distributed tasks instead of shipping it back and forth between the driver program and the worker tasks. While it may not be obvious at times which calls are driver-local and which ones describe a distributed action, rules of thumb include avoiding parallelize and collect unless you absolutely need all of the data in one place. The amount of data you can Json.load() and then parallelize will max out at whatever the largest available machine type allows, whereas using calls like sc.textFile theoretically scales to hundreds of TBs or even PBs without problem.
The short-term fix in your case would be to try passing spark-submit --conf spark.driver.memory=40g ... or something in that range. Dataproc defaults allocate less than a quarter of the machine to driver memory because commonly the cluster must support running multiple concurrent jobs, and also needs to leave enough memory on the master node for the HDFS namenode and the YARN resource manager.
Longer term you might want to experiment with how you can load the JSON data as an RDD directly, instead of loading it in a single driver and using parallelize to distribute it, since this way you can dramatically speed up the input reading time by having tasks load the data in parallel (and also getting rid of the warning Stage 0 contains a task of very large size which is likely related to the shipping of large data from your driver to worker tasks).
Similarly, instead of collect and then finishing things up on the driver program, you can do things like rdd.saveAsTextFile to save in a distributed manner, without ever bottlenecking through a single place.
Reading the input with sc.textFile assumes line-separated JSON, which you can parse inside a map task, or you can try using sqlContext.read.json. For debugging purposes, instead of using collect() it's often enough to just call take(10) to take a peek at some records without shipping all of them to the driver.
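Putting those suggestions together, here is a sketch of the distributed version of the original job (the path is a placeholder and parseFoo is a hypothetical stand-in for the parsing done by Json.load()):
// Executors read the line-delimited JSON in parallel instead of the driver loading it all.
val counts = sc.textFile("hdfs:///data/records.jsonl")  // or spark.read.json(...) for a DataFrame
  .map(parseFoo)              // parse each JSON line into your record type
  .flatMap(_.bar.toSeq)
  .map(c => (c, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2)
counts.saveAsTextFile("hdfs:///data/bar_counts")  // write in a distributed manner instead of collect()
counts.take(10).foreach(println)                  // peek at a few records for debugging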
