How do partitions map to tasks in Spark?

If I partition an RDD into, say, 60 partitions and I have a total of 20 cores spread across 20 machines (i.e. 20 instances of single-core machines), then the number of tasks per stage is 60 (equal to the number of partitions). Why is this beneficial over having a single partition per core and just 20 tasks?
Additionally, I have run an experiment where I set the number of partitions to 2; the UI shows 2 tasks running at any one time. However, what has surprised me is that it switches instances as tasks complete, e.g. node1 and node2 do the first 2 tasks, then node6 and node8 do the next set of 2 tasks, etc. I thought that by setting the number of partitions lower than the number of cores (and instances) in the cluster, the program would just use the minimum number of instances required. Can anyone explain this behaviour?

For the first question: you might want more granular tasks than strictly necessary so that less has to be loaded into memory at the same time. It can also help with fault tolerance, as less work needs to be redone in case of failure. It is nevertheless a parameter you have to tune. In general the answer depends on the kind of workload (IO-bound, memory-bound, CPU-bound).
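As a minimal sketch of the trade-off (input path hypothetical): with 60 partitions on 20 single-core executors, each core works through roughly 3 tasks in sequence, and a failed task only redoes 1/60 of the stage.

    val data = sc.textFile("hdfs:///data/input") // hypothetical input
    val parts = data.repartition(60)             // the next stage runs as 60 tasks
    println(parts.getNumPartitions)              // 60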
As for the second one: I believe version 1.3 has some code to dynamically request resources (I am not sure exactly which version the change landed in), but older versions just request the exact resources you configure the driver with. As for how a partition moves from one node to another: AFAIK Spark will pick the data for a task from the node that holds a local copy of that data on HDFS. Since HDFS keeps multiple copies (3 by default) of each block, there are multiple candidate nodes on which to run any given task.
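On versions that support it, dynamic allocation is enabled via configuration; a hedged sketch (on YARN it also requires the external shuffle service):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true") // grow/shrink executors with load
      .set("spark.shuffle.service.enabled", "true")   // external shuffle service, required on YARN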

Related

Cassandra CPU imbalance in Azure

We have a 30+ node Cassandra cluster (3.11.2) in 4 data centers. One of the centers consists of 8 nodes in Azure running on Standard DS12 v2 (4cpu, 28gb) nodes with a 500GB premium SSD drive. All in the same data center (central US).
We are seeing a dramatic CPU imbalance in the node activity when pushed to the max. We have a keyspace with about 200 million records, and we're running a process to check and refresh the records if necessary from another data stream.
What's happening is that we have 4 nodes running at 70-90% CPU, compared to 15-25% on the other 4. The CPU is measured on the nodes themselves, because Azure's own metrics are broken and never represent what is actually happening.
Digging into a pair of nodes (one low CPU and one high), the difference is the iowait% of the two. The data in the keyspace is balanced (within reason - the nodes are all within 5% of one another in record count and size). The number of reads looks balanced, and even the read latency as reported by Cassandra is similar.
When I compare iostat across the nodes, the high-CPU node reports much higher (by 50 to 100%) rKB/s numbers... which is likely the cause of the difference in iowait% time.
These nodes are 100% identically configured, running the same version of everything (OS, libraries, everything I can think to check). I cannot figure out why some nodes are deciding to do more disk reads than the others, which slows down the cluster as a whole.
Anybody have any suggestions on where I can look for differences?
The only pattern is that the slower nodes are the 4 nodes that were added later in our expansion. We started with 4 nodes for a while and added 4 more when we needed space. All the appropriate repairs and other tasks required with node additions were done - the fact that the record counts and the physical size of the data files on disk are equal should attest to that.
When we shut down our refresh process, all the nodes settle down to an even 5% or less CPU across the board. No compaction or any other maintenance is happening that would indicate something different.
plz help... :)
Our final solution for this - to fix ONLY the imbalance problem - was to run cleanup, a full repair, and a compaction. At that point the nodes were relatively equally used. We suspect expanding the cluster (adding nodes) may have left data on the older nodes that regular compaction events never compacted out.
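A sketch of that sequence with nodetool, run node by node (exact repair flags vary by Cassandra version):

    nodetool cleanup        # drop data this node no longer owns after range changes
    nodetool repair -full   # full (non-incremental) repair
    nodetool compact        # major compaction to merge old SSTables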
We are still working on the load issue, but now at least all the nodes are feeling the same CPU crunch.

Unfair split of workload among spark executors

I am currently using Spark to process documents. I have two servers at my disposal (innov1 and innov2) and I am using YARN as the resource manager.
The first step is to gather the paths of the files from a database, filter them, repartition them and persist them in an RDD[String]. However, I can't manage to get a fair distribution of the persisted blocks among the executors:
[Screenshot: persisted RDD memory taken per executor]
and this leads to the executors not doing the same amount of work afterwards:
[Screenshot: work done by each executor (ignore the 'dead' ones here, that's another problem)]
And this happens randomly: sometimes it's innov1 that takes all the persisted blocks, and then only the executors on innov1 work (but in general it tends to be innov2). Right now, each time both executors end up on innov1, I just kill the job and relaunch, praying for them to be on innov2 (which is utterly stupid, and defeats the purpose of using Spark).
What I have tried so far (and that didn't work):
make the driver sleep 60 seconds before loading from the database (maybe innov1 takes more time to wake up?)
add spark.scheduler.minRegisteredResourcesRatio=1.0 when I submit the job (same idea as above)
persist with replication x2 (idea from this link), hoping that some of the blocks would be replicated on innov1 - see the sketch after this list
Note for point 3: sometimes it persisted a replica on the same executor (which is a bit counterintuitive), or, even weirder, did not replicate it at all (is innov2 unable to communicate with innov1?).
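For reference, a minimal sketch of the replication-x2 persist from point 3 (using the RDD name from the simplified code below):

    import org.apache.spark.storage.StorageLevel

    // MEMORY_ONLY_2 keeps each cached block on two different executors
    pathsRDD.persist(StorageLevel.MEMORY_ONLY_2)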
I am open to any suggestion, or link to similar problems I would have missed.
Edit:
I can't really put code here, as it's part of my company's product. I can give a simplified version however:
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel.MEMORY_ONLY

val rawHBaseRDD: RDD[(ImmutableBytesWritable, Result)] = sc
  .newAPIHadoopRDD(...)                 // HBase configuration elided
  .map(x => (x._1, x._2))               // from the doc of newAPIHadoopRDD
  .repartition(200)
  .persist(MEMORY_ONLY)

val pathsRDD: RDD[(String, String)] = rawHBaseRDD
  .mapPartitions { iter =>
    // extract the key and the path from ImmutableBytesWritable
    // and Result.rawCells(), elided
    ...
  }
  .filter(...)                          // some condition, elided
  .repartition(200)
  .persist(MEMORY_ONLY)
For both persists, everything ends up on innov2. Is it possible that this is because the data is only on innov2? Even if that's the case, I would assume that repartition would help spread the rows between innov1 and innov2, but that doesn't happen here.
Your persisted data set is not very big - only ~100MB according to your screenshot. You have allocated 10 cores with 20GB of memory, so the 100MB fits easily into the memory of a single executor, and that is basically what is happening.
In other words, you have allocated many more resources than are actually needed, so Spark just randomly picks the subset of resources it needs to complete the job. Sometimes those resources happen to be on one worker, sometimes on another, and sometimes it uses resources from both workers.
You have to remember that to Spark it makes no difference whether all resources are placed on a single machine or on 100 different machines - as long as you are not trying to use more resources than are available (in which case you would get an OOM).
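One way to avoid the lopsided placement, then, is simply to request no more than the job needs, so the scheduler cannot park everything on one worker; a hedged sketch using standard Spark-on-YARN settings (values illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.instances", "2") // e.g. one executor per worker
      .set("spark.executor.cores", "2")
      .set("spark.executor.memory", "2g")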
Unfortunately (fortunately?) the problem resolved itself today. I assume it is not Spark-related, as I hadn't modified the code before the resolution.
It's probably due to the complete reboot of all services with Ambari (even if I am not 100% sure, because I had already tried this before), as it's the only "major" change that happened today.

Cassandra concurrent read and write

I am trying to understand Cassandra concurrent reads and writes. I came across the property
concurrent_reads (defaults to 8)
A good rule of thumb is 4 concurrent_reads per processor core. You may increase the value for systems with fast I/O storage.
So as per that definition (correct me if I am wrong), 4 threads per core can access the database concurrently. So let's say I am trying to run the following query:
SELECT max(column1) FROM testtable WHERE duration = 'month';
I am just trying to execute this query. What role does concurrent_reads play in executing it?
That's how many active reads can run at a single time per host. This is viewable if you type nodetool tpstats, under the read stage. If the active count is pegged at the number of concurrent readers and you have a pending queue, it may be worth trying to increase this. It's pretty normal for people to have this at ~128 when using decent-sized heaps and SSDs. This is very hardware-dependent, so the defaults are conservative.
Keep in mind that the activity on this thread is very fast, usually measured in sub-ms. But even assuming reads take 1ms, with only 4 concurrent reads Little's law gives you a maximum of 4000 (local) reads per second per node (1000/1 * 4). With RF=3 and quorum consistency you are doing a minimum of 2 replica reads per request, so you can divide by 2 to get a theoretical max throughput (real life is messier).
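A worked version of that arithmetic, under the answer's assumptions (1ms local reads, 4 concurrent readers, RF=3 with QUORUM):

    // Little's law: throughput = concurrency / latency
    val concurrentReaders = 4
    val readLatencySec = 0.001                                 // assume 1ms per local read
    val localReadsPerSec = concurrentReaders / readLatencySec  // 4000.0 per node
    val replicaReadsPerRequest = 2                             // QUORUM with RF=3
    println(localReadsPerSec / replicaReadsPerRequest)         // 2000.0 requests/sec, theoretical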
The aggregation functions (i.e. max) are processed on the coordinator after fetching the data from the replicas (each of which does a local read and sends a response), and are not directly impacted by concurrent_reads, since they are handled in the native transport and request-response stages.
From Cassandra 2.2 onward, the standard aggregate functions min, max, avg, sum and count are built in. So I don't think concurrent_reads will have any effect on your query.

Spark Direct Stream Concurrent Job Limit

I am running a Spark direct stream from Kafka, where I need to run many concurrent jobs in order to process all the data in time. In Spark you can set spark.streaming.concurrentJobs to the number of concurrent jobs you want to run.
What I want to know is a logical way to determine how many concurrent jobs I can run within my given environment. For privacy reasons at my company, I cannot tell you the specs that I have, but I would like to know which specs are relevant in determining the limit, and why.
Of course the alternative is that I could keep increasing it, testing, and adjusting based on results, but I would like a more logical approach, and I want to actually understand what determines the limit and why.
Testing different numbers of concurrent jobs and observing the overall execution time is the most reliable method. However, I suppose the best number is roughly equal to the value of Runtime.getRuntime().availableProcessors().
So my advice is to start with that number of available processors, then increase and decrease it by 1, 2, and 3. Then chart execution time against the number of jobs, and you'll see the optimal number.
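A minimal sketch of that starting point (app name hypothetical; spark.streaming.concurrentJobs is an undocumented setting, so treat values above 1 with care):

    import org.apache.spark.SparkConf

    val cores = Runtime.getRuntime.availableProcessors()
    val conf = new SparkConf()
      .setAppName("direct-stream")
      .set("spark.streaming.concurrentJobs", cores.toString) // initial guess; tune +/- 1, 2, 3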

Spark performs poorly when generating non-associative features

I have been using Spark as a tool for my own feature-generation project. For this specific project, I have two data-sources which I load into RDDs as follows:
Datasource1: RDD1 = [(key, (time, quantity, user-id, ...))_j], plus a bunch of other attributes such as transaction-id, etc.
Datasource2: RDD2 = [(key, (t1, t2))_j]
In RDD1, time denotes the timestamp at which the event happened and, in RDD2, (t1, t2) denotes the acceptable time interval for each feature. The feature key is "key". I have two types of features, as follows:
associative features: e.g. the number of items
non-associative features: e.g. the number of unique users
For each feature key, I need to see which events fall in the interval (t1, t2) and then aggregate them. So I have a join followed by a reduce operation, as follows:
`RDD1.join(RDD2).map { case (key, (v1, v2)) => (key, featureObj) }.reduceByKey(...)`
The initial value for my feature would be featureObj = (0, Set()), where the first element keeps the number of items and the second stores the set of unique user IDs. I also partition the input data to make sure that RDD1 and RDD2 use the same partitioner.
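A minimal runnable sketch of that join-then-reduce (field names and sample data hypothetical); note that the set of user IDs, unlike the plain counter, grows with cardinality and must be shuffled and merged in full:

    // RDD1: (key, (time, quantity, userId)); RDD2: (key, (t1, t2))
    val rdd1 = sc.parallelize(Seq(("k1", (10L, 2L, "userA")), ("k1", (20L, 1L, "userB"))))
    val rdd2 = sc.parallelize(Seq(("k1", (0L, 100L))))

    val features = rdd1.join(rdd2)
      .collect { case (key, ((time, qty, userId), (t1, t2))) if t1 <= time && time <= t2 =>
        (key, (qty, Set(userId)))                // featureObj = (item count, unique users)
      }
      .reduceByKey { case ((n1, u1), (n2, u2)) =>
        (n1 + n2, u1 ++ u2)                      // the set union is the expensive part
      }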
Now, when I run the job to calculate just the associative feature, it runs very fast on a cluster of 16 m2.xlarge instances, in only 3 minutes. The minute I add the second one, the computation time jumps to 5 minutes. I tried adding a couple of other non-associative features and, every time, the run time increases quickly. Right now, my job runs in 15 minutes for 15 features, 10 of which are non-associative. I also tried using the KryoSerializer and persisting RDDs in serialized form, but nothing special happened. Since I will be moving on to implement more features, this issue seems likely to become a bottleneck.
PS. I tried to do the same task on a single big host (128GB of RAM and 16 cores). With 145 features, the whole job was done in 10 minutes. I am under the impression that the main Spark bottleneck is the JOIN. I checked my RDDs and noticed that both are co-partitioned in the same way. As a single job is calling these two RDDs, I presume they are co-located too? However, the Spark web console still shows "2.6GB" shuffle read and "15.6GB" shuffle write.
Could someone please advise me if I am doing something really crazy here? Am I using Spark for the wrong kind of application? Thanks in advance for the comments.
With best regards,
Ali
I noticed poor performance with shuffle operations, too. It turned out that the shuffle ran very fast when data was shuffled from one core to another within the same executor (locality PROCESS_LOCAL), but much slower than expected in all other situations; even NODE_LOCAL was very slow. This can be seen in the Spark UI.
Further investigation with CPU and garbage collection monitoring found that at some point garbage collection made one of the nodes in my cluster unresponsive, and this would block the other nodes shuffling data from or to this node, too.
There are a lot of options you can tweak to improve garbage collection performance. One important thing is to enable early reclamation of humongous objects for the G1 garbage collector, which requires Java 8u45 or higher.
In my case the biggest problem was memory allocation in netty. When I turned direct buffer memory off by setting spark.shuffle.io.preferDirectBufs = false, my jobs ran much more stably.
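A hedged sketch combining both fixes (the eager-reclaim flag is experimental on Java 8, and availability depends on the exact JVM build):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.io.preferDirectBufs", "false") // keep netty off direct buffers
      .set("spark.executor.extraJavaOptions",
        "-XX:+UseG1GC -XX:+UnlockExperimentalVMOptions -XX:G1EagerReclaimHumongousObjects")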
