Distributing partitions across cluster - apache-spark

Apache Spark can load datasets from many different sources. As I understand it, the computing nodes of a Spark cluster can be different from the nodes Hadoop uses to store data (am I right?). What is more, we can even load a local file into a Spark job. Here is the main question: even if we use the same machines for both HDFS and Spark, is it always true that Spark will shuffle all the data when it creates an RDD? Or will Spark try to load the data in a way that takes advantage of the existing data locality?

You can use HDFS as the common underlying storage for both MapReduce (Hadoop) and Spark engines, and use a cluster manager like YARN to perform resource management. Spark will try to take advantage of data locality, and execute tasks as close as possible to the data.
This is how it works: if the data is available on a node but no CPU is free there, Spark will wait for a certain amount of time (set by the configuration parameter spark.locality.wait, 3 seconds by default) for a CPU to become available.
If the CPU is still not free after the configured time has passed, Spark will switch the task to a lower locality level. It will then wait for spark.locality.wait again, and if that timeout also expires, it will switch to a yet lower locality level.
The locality levels are defined as below, in order from closest to data, to farthest from data (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.scheduler.TaskLocality$):
PROCESS_LOCAL (data is in the same JVM as the running code)
NODE_LOCAL (data is on the same node)
NO_PREF (data is accessed equally quickly from anywhere and has no locality preference)
RACK_LOCAL (data is on the same rack of servers)
ANY (data is elsewhere on the network and not in the same rack)
Waiting time for locality levels can also be individually configured. For longer jobs, the wait time can be increased to a larger value than default of 3 seconds, since the CPU might be tied up longer.
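The fallback described above can be sketched in plain Python. This is only an illustration, not Spark's scheduler code: the function name allowed_locality is hypothetical, every level is assumed to use the same wait, and the sketch ignores that Spark only steps through the levels that are actually valid for a given set of tasks.

```python
# Locality levels from closest to farthest, in the order listed above.
LOCALITY_LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "NO_PREF", "RACK_LOCAL", "ANY"]


def allowed_locality(seconds_waiting, wait_per_level=3.0):
    """Return the least-local level a task may be launched at after waiting.

    Assumption: one level is given up for each full wait interval that
    has elapsed, and all levels share the same wait (spark.locality.wait).
    """
    if wait_per_level <= 0:
        # A wait of zero means locality is not waited for at all.
        return LOCALITY_LEVELS[-1]
    steps = int(seconds_waiting // wait_per_level)
    index = min(steps, len(LOCALITY_LEVELS) - 1)
    return LOCALITY_LEVELS[index]


for waited in (0, 2.9, 3, 7, 20):
    print(waited, allowed_locality(waited))
```

Raising wait_per_level in this sketch keeps tasks at the closer levels for longer, which is the trade-off the paragraph above describes for long-running jobs.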

Related

Why does caching small Spark RDDs take a big memory allocation in YARN?

The cached RDDs (8 in total) are not big, only around 30G; however, the Hadoop UI shows that the Spark application is taking lots of memory, i.e. 1.4T, even though no active jobs are running. Why so much?
Why does it show around 100 executors (here, i.e. vCores) even when there are no active jobs running?
Also, if the cached RDDs are stored across 100 executors, are those executors held so that no other Spark app can use them for running tasks? To rephrase the question: will holding a little memory (.cache) in executors prevent other Spark apps from using their idle computing resources?
Is there any Spark config / Zeppelin config that can cause this phenomenon?
UPDATE 1
After checking the Spark conf (Zeppelin), it seems there is a default setting (configured by the administrator) of spark.executor.memory=10G, which is probably the reason.
However, here's a new question: is it possible to keep only the memory needed for the cached RDDs in each executor and release the rest, instead of always holding the initially set spark.executor.memory=10G?
Spark configuration
Perhaps you can try to repartition(n) your RDD into fewer partitions (n < 100) before caching. A ~30GB RDD would probably fit into the storage memory of ten 10GB executors. A good overview of Spark memory management can be found here. This way, only those executors that hold cached blocks will be "pinned" to your application, while the rest can be reclaimed by YARN via Spark dynamic allocation (if it is enabled) after spark.dynamicAllocation.executorIdleTimeout (default 60s).
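A rough estimate of how many executors the cache needs can be sketched as follows. The sketch assumes the unified memory model defaults (300 MiB reserved, spark.memory.fraction of 0.6) and assumes no running tasks are competing for that memory, so treat the result as a lower bound. The function name executors_needed_for_cache is hypothetical.

```python
import math

RESERVED_MB = 300        # memory Spark reserves per executor heap
MEMORY_FRACTION = 0.6    # default spark.memory.fraction


def executors_needed_for_cache(cache_mb, executor_memory_mb):
    """Estimate how many executors are needed to hold cache_mb of data.

    Assumption: all unified memory is available for storage, which only
    holds while no task is using execution memory.
    """
    unified_mb = (executor_memory_mb - RESERVED_MB) * MEMORY_FRACTION
    return math.ceil(cache_mb / unified_mb)


# ~30 GB of cached data on executors with spark.executor.memory=10G
print(executors_needed_for_cache(30 * 1024, 10 * 1024))
```

Under these assumptions the estimate stays below ten executors, which is consistent with the suggestion above; the in-memory size of an RDD also depends on how it is serialized, so the real number may differ.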
Q: Is it possible to keep only the memory needed for the cached RDDs in each executor and release the rest, instead of always holding the initially set spark.executor.memory=10G?
When Spark uses YARN as its cluster manager, YARN allocates containers of the size the application requests for all the executors: at least spark.executor.memory + spark.executor.memoryOverhead, and possibly more in the case of PySpark. How much memory Spark actually uses inside a container is irrelevant, since the resources allocated to a container are off-limits to other YARN applications.
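The container arithmetic can be sketched like this. It assumes the documented default for spark.executor.memoryOverhead (10% of executor memory, with a 384 MiB minimum) and ignores any rounding YARN applies to requests as well as extra PySpark memory, so real allocations can be larger. The function name executor_container_mb is hypothetical.

```python
def executor_container_mb(executor_memory_mb, overhead_mb=None):
    """Memory requested from YARN for one executor container, in MiB."""
    if overhead_mb is None:
        # Default overhead: 10% of executor memory, at least 384 MiB.
        overhead_mb = max(384, executor_memory_mb // 10)
    return executor_memory_mb + overhead_mb


per_executor_mb = executor_container_mb(10 * 1024)   # spark.executor.memory=10G
total_gb = 100 * per_executor_mb / 1024              # 100 executors
print(per_executor_mb, total_gb)
```

With 100 executors of 10G each, the requested total is already above one terabyte before any data is cached, regardless of how small the cached RDDs are.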
Spark assumes that your data is evenly distributed across all the executors and tasks. That is why memory is set as a single value for every executor. So to make Spark consume less memory, your data has to be evenly distributed:
If you are reading from Parquet files or CSVs, make sure that they have similar sizes. Running repartition() causes a shuffle, which, if the data is heavily skewed, may cause other problems when executors don't have enough resources
Caching won't help to release memory on the executors, because it doesn't redistribute the data
Can you check, under "Event Timeline" on the Stages page, how big the green bars are? Normally their size is tied to the data distribution, so it is a way to see how much data is loaded (proportionally) by every task and how much work each task is doing. As all tasks have the same memory assigned, you can see graphically whether resources are wasted (when there are mostly tiny bars and a few big bars). A sample of wasted resources can be seen on the image below
There are different ways to create evenly distributed files for your process. I mention some possibilities, but there are certainly more:
Using Hive and the DISTRIBUTE BY clause: you need to use a field that is evenly balanced in order to create as many files (of a proper size) as expected
If the process creating those files is a Spark process reading from a DB, try to create as many connections as the number of files you need, and use a suitable field to populate the Spark partitions. That is achieved, as explained here and here, with the partitionColumn, lowerBound, upperBound and numPartitions properties
Repartition may work, but check whether coalesce also makes sense in your process or in the previous one that generates the files you are reading
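To make the effect of partitionColumn, lowerBound, upperBound and numPartitions more concrete, here is a simplified sketch of how a numeric range can be turned into one WHERE predicate per partition. It approximates the idea behind Spark's JDBC source but is not its exact arithmetic, and the function name jdbc_partition_predicates is hypothetical.

```python
def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Split [lower_bound, upper_bound) into one predicate per partition.

    The bounds only decide the stride; rows outside the range still land
    in the first or the last partition, so nothing is filtered out.
    """
    if num_partitions <= 1:
        return []  # a single, unfiltered read
    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    for i in range(num_partitions):
        low = lower_bound + i * stride
        high = low + stride
        if i == 0:
            # The first partition also takes NULLs and values below the range.
            predicates.append(f"{column} < {high} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # The last partition takes everything above its lower edge.
            predicates.append(f"{column} >= {low}")
        else:
            predicates.append(f"{column} >= {low} AND {column} < {high}")
    return predicates


for predicate in jdbc_partition_predicates("RowId", 1, 10000000, 4):
    print(predicate)
```

If the chosen column is skewed, equal-width ranges like these produce partitions of very different sizes, which is why an evenly balanced field matters.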

Spark DataFrame load(): when I execute a load command, where does my data actually get stored?

If I load a table from Cassandra using Spark's dataframe.load(), where does my data get loaded? Is it in Spark memory, or in DataNode blocks if I am using the YARN resource manager?
Spark will try to hold the data in memory, partition by partition, on the Worker Nodes, which in this context is a slightly better term than Data Nodes.
It will spill to disk if there is not enough memory on the Worker Nodes.
Processing occurs according to the number of cores and executors. E.g. if you have, say, 20 executors with 1 core each, your processing concurrency is 20, and spilling will occur via eviction. If you run out of disk, an error will result.
Worker Node is a better term here than Data Node, unless you have HDFS and process the data locally, in which case the Worker Node is the same machine as the Data Node. Although you could argue: what's in a name?
Of course, none of this happens until an Action has been initiated.
Later repartition, join or union steps in the data pipeline also affect this, but that goes without saying.

Improving SQL Query using Spark Multi Clusters

I was experimenting to see whether Spark with multiple workers can improve a slow SQL query. I created two workers for one master, all running on a local Spark Standalone cluster. Yes, I did halve the memory and the number of cores to create the workers on the local machine. I specified partitions for sqlContext using partitionColumn, lowerBound, upperBound and numPartitions, so that tasks (or partitions) can be distributed over the workers. I set them as below (partitionColumn is unique):
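# `sqlContext` is an existing SQLContext; `query` must be a table name or
# a parenthesized subquery with an alias, e.g. "(SELECT ...) AS t".
df = sqlContext.read.format("jdbc").options(
    url="jdbc:sqlserver://localhost;databasename=AdventureWorks2014;integratedSecurity=true;",
    driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
    dbtable=query,
    partitionColumn="RowId",  # numeric column used to split the read
    lowerBound=1,
    upperBound=10000000,
    numPartitions=4,  # number of partitions, i.e. parallel JDBC reads
).load()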
I ran my script against the master after specifying these options, but I couldn't get any performance improvement compared with running on Spark without a cluster. I know I should not have halved the memory, for the integrity of the experiment. But I would like to know whether that might be the cause, or what the reason is if it is not. Any thoughts are welcome. Many thanks.
There are multiple factors which play a role here, though the weights of each of these can differ on a case by case basis.
As nicely pointed out by mtoto, increasing the number of workers on a single machine is unlikely to bring any performance gains.
Multiple workers on a single machine have access to the same fixed pool of resources. Since a worker doesn't participate in the processing itself, you just use a higher fraction of this pool for management.
There are legitimate cases where we prefer a higher number of executor JVMs, but that is not the same as increasing the number of workers (the former is an application resource, the latter is a cluster resource).
It is not clear whether you use the same number of cores for the baseline and the multi-worker configuration; nevertheless, cores are not the only resource you have to consider when working with Spark. Typical Spark jobs are IO (mostly network and disk) bound. Increasing the number of threads on a single node, without making sure that there is sufficient disk and network capacity, will just make them wait for the data.
Increasing cores alone is useful only for jobs which are CPU bound (and these will typically scale better on a single machine).
Fiddling with Spark resources won't help you if an external resource cannot keep up with the requests. A high number of concurrent batch reads from a single non-replicated database will just throttle the server.
In this particular case you make it even worse by running a database server on the same node as Spark. It has some advantages (all traffic can go through loopback), but unless database and Spark use different sets of disks, they'll be competing over disk IO (and other resources as well).
Note:
It is not clear what the query is, but if it is slow when executed directly against the database, fetching it from Spark will make it even slower. You should probably take a closer look at the query and/or the database structure and configuration first.

What factors affect how many Spark jobs can run concurrently?

We recently set up the Spark Job Server, to which our Spark jobs are submitted. But we found out that our 20-node Spark cluster (8 cores / 128G memory per node) can only run 10 Spark jobs concurrently.
Can someone share some detailed info about what factors would actually affect how many spark jobs can be run concurrently? How can we tune the conf so that we can take full advantage of the cluster?
The question is missing some context, but first: it seems that Spark Job Server limits the number of concurrent jobs (unlike Spark itself, which puts a limit on the number of tasks, not jobs):
From application.conf
# Number of jobs that can be run simultaneously per context
# If not set, defaults to number of cores on machine where jobserver is running
max-jobs-per-context = 8
If that's not the issue (you set the limit higher, or are using more than one context), then the total number of cores in the cluster (8*20 = 160) is the maximum number of concurrent tasks. If each of your jobs creates 16 tasks, Spark would queue the next incoming job waiting for CPUs to be available.
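The arithmetic above can be written down as a small sketch. It assumes that every core in the cluster is handed to executors, that each task needs spark.task.cpus cores (1 by default), and that each job keeps a fixed number of tasks running at the same time; the function name max_concurrent_jobs is hypothetical.

```python
def max_concurrent_jobs(nodes, cores_per_node, tasks_per_job, cpus_per_task=1):
    """Upper bound on jobs that can run all of their tasks at the same time."""
    task_slots = (nodes * cores_per_node) // cpus_per_task
    return task_slots // tasks_per_job


# 20 nodes with 8 cores each, and jobs that each run 16 tasks at once
print(max_concurrent_jobs(20, 8, 16))  # 160 task slots / 16 tasks per job = 10
```

With 16 tasks per job, the bound is exactly the 10 concurrent jobs observed in the question; whether that is the actual cause depends on how many tasks the real jobs create.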
Spark creates one task per partition of the input data. The number of partitions is decided by the partitioning of the input on disk, or by calling repartition or coalesce on the RDD/DataFrame to change the partitioning manually. Some other operations that work on more than one RDD (e.g. union) may also change the number of partitions.
Some things that could limit the parallelism that you're seeing:
If your job consists of only map operations (or other shuffle-less operations), it will be limited by the number of partitions of data you have. So even if you have 20 executors, if you have 10 partitions of data, it will only spawn 10 tasks (unless the data is splittable, in something like Parquet, LZO-indexed text, etc.).
If you're performing a take() operation (without a shuffle), it performs an exponential take: it starts with only one task and then grows the number of tasks until it has collected enough data to satisfy the take operation. (Another question similar to this)
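The exponential take can be illustrated with a small simulation. This is a simplified model, not Spark's implementation: it assumes the first round scans one partition and each later round scans scale_up_factor times as many partitions as have been scanned so far (4 is the default of spark.rdd.limit.scaleUpFactor), while the real heuristic also takes into account how many rows were found. The function name take_rounds is hypothetical.

```python
def take_rounds(rows_per_partition, n, scale_up_factor=4):
    """Return how many partitions (tasks) each round of a take(n) scans."""
    rounds = []
    scanned = 0      # partitions scanned so far
    collected = 0    # rows collected so far
    total = len(rows_per_partition)
    while collected < n and scanned < total:
        to_try = 1 if scanned == 0 else scanned * scale_up_factor
        batch = rows_per_partition[scanned:scanned + to_try]
        collected += sum(batch)
        scanned += len(batch)
        rounds.append(len(batch))
    return rounds


# 20 partitions where the first 10 are empty: take(5) needs three rounds.
print(take_rounds([0] * 10 + [100] * 10, 5))
# If the first partition already has enough rows, a single task is used.
print(take_rounds([10] * 8, 5))
```

In this model a take() on well-filled partitions uses a single task, which looks like missing parallelism in the UI even though the cluster has free cores.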
Can you share more about your workflow? That would help us diagnose it.

Spark and RDD partitioning

In Spark we can load data directly from HDFS, and the number of partitions of the RDD will be equal to the number of partitions (blocks) of the file. HDFS is known for keeping duplicate chunks of files, so the question is how Spark deals with this and how RDD partitioning is governed.
Correct me if I have gone wrong in asking the question.
You want to bring the computation to the data. Each HDFS block has several replicas, so the task that reads a block can be scheduled on any node that holds one of them; if none of those nodes is available, the task reads from the closest available replica (same rack, etc.). The number of partitions follows the number of blocks, not the number of replicas, so replication does not duplicate data in the RDD. The YARN scheduler decides where the executors run, and Spark's task scheduler then places tasks on them according to these locality preferences.
As you can see in the Spark user guide, there are some configuration settings for data locality that you can set (extracted from the Spark 1.6 user guide, http://spark.apache.org/docs/latest/configuration.html):
spark.locality.wait (default: 3s)
How long to wait to launch a data-local task before giving up and launching it on a less-local node. The same wait will be used to step through multiple locality levels (process-local, node-local, rack-local and then any). It is also possible to customize the waiting time for each level by setting spark.locality.wait.node, etc. You should increase this setting if your tasks are long and see poor locality, but the default usually works well.
spark.locality.wait.node (default: spark.locality.wait)
Customize the locality wait for node locality. For example, you can set this to 0 to skip node locality and search immediately for rack locality (if your cluster has rack information).
spark.locality.wait.process (default: spark.locality.wait)
Customize the locality wait for process locality. This affects tasks that attempt to access cached data in a particular executor process.
spark.locality.wait.rack (default: spark.locality.wait)
Customize the locality wait for rack locality.
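As a usage sketch, the settings above can be passed to spark-submit as --conf key=value pairs. The helper below only builds the argument list; its name to_spark_submit_conf is hypothetical, and the values are illustrative placeholders rather than recommendations.

```python
def to_spark_submit_conf(settings):
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Illustrative values only: wait longer overall, less for node locality.
locality_settings = {
    "spark.locality.wait": "10s",
    "spark.locality.wait.node": "5s",
}

print(" ".join(to_spark_submit_conf(locality_settings)))
```

The same keys can also be set in the application's Spark configuration before the context is created; per-level settings that are left out fall back to spark.locality.wait.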
