Wordcount in a large file using Spark

I have a question about how I can work on large files using Spark. Let's say I have a really large file (1 TB) while I only have access to 500GB RAM in my cluster. A simple wordcount application would look like the following:
sc.textFile(path_to_file).flatMap(split_line_to_words).map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)
When I do not have access to enough memory, will the above application fail due to OOM? If so, what are some ways I can fix this?

Well, this is not an issue.
Spark will split the input into N partitions, each roughly the size of a block in HDFS (or a similar file system), and at some stage these are physically created on the worker nodes. That results in N small tasks to execute, each fitting easily inside the 500GB, over the life of the Spark app.
Partitions and their equivalent tasks run concurrently, based on how many executors you have allocated. If you have, say, M executors with 1 core each, then at most M tasks run concurrently. This also depends on the scheduling and resource allocation mode.
Spark, much like an OS, handles situations of size and resources: depending on the resources available, more or less can be done at once. The DAG Scheduler plays a role in all this, but I am keeping it simple here.
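To make the reasoning above concrete, here is a minimal pure-Python sketch (not Spark code). It assumes a 128 MB block size, which is a common HDFS setting but an assumption here, and the helper names estimate_partitions and count_words_by_partition are hypothetical. It shows roughly how many partitions a 1 TB file would produce, and how counting one partition at a time only needs the current partition plus the running totals in memory.

```python
import math
from collections import Counter

def estimate_partitions(file_size_bytes, block_size_bytes=128 * 1024**2):
    """Rough number of input partitions: one per block (128 MB assumed)."""
    return math.ceil(file_size_bytes / block_size_bytes)

def count_words_by_partition(partitions):
    """Count words one partition at a time, merging partial counts.

    Only the current partition and the running totals are held in memory,
    which mirrors why the whole input never has to fit in RAM at once.
    """
    totals = Counter()
    for lines in partitions:
        partial = Counter(word for line in lines for word in line.split())
        totals.update(partial)  # loosely analogous to the reduceByKey merge
    return totals

one_tb = 1024**4
print(estimate_partitions(one_tb))  # 8192 partitions of 128 MB each
```

Note that the running totals still grow with the number of distinct words, so a very large vocabulary is the part that could put pressure on memory, not the raw file size.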

Related

How do you determine shuffle partitions for Spark application?

I am new to spark so am following this amazing tutorial from sparkbyexamples.com and while reading I found this section:
Shuffle partition size & Performance
Based on your dataset size, a number of cores and memory PySpark
shuffling can benefit or harm your jobs. When you dealing with less
amount of data, you should typically reduce the shuffle partitions
otherwise you will end up with many partitioned files with less number
of records in each partition. which results in running many tasks with
lesser data to process.
On other hand, when you have too much of data and having less number
of partitions results in fewer longer running tasks and some times you
may also get out of memory error.
Getting the right size of the shuffle partition is always tricky and
takes many runs with different values to achieve the optimized number.
This is one of the key properties to look for when you have
performance issues on PySpark jobs.
Can someone help me understand how you determine how many shuffle partitions you will need for your job?

As you quoted, it’s tricky, but this is my strategy:
If you’re using “static allocation”, meaning you tell Spark how many executors you want to allocate for the job, then it’s easy: the number of partitions could be executors * cores per executor * factor. factor = 1 means each core will handle 1 task, factor = 2 means each core will handle 2 tasks, and so on.
If you’re using “dynamic allocation”, then it’s trickier. You can read the long description here https://databricks.com/blog/2021/03/17/advertising-fraud-detection-at-scale-at-t-mobile.html. The general idea is that you need to answer many questions, such as what the size of your data is (how big in terms of gigabytes), what its structure looks like (how many files, how many folders, how many rows, etc.), how you would read it (from HDFS, from Hive or over JDBC), how many resources you have (cores, executors, memory), … Then you run and benchmark over and over to find the sweet spot that is “just right” for your circumstances.
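For the static-allocation rule of thumb above, the arithmetic can be sketched as follows. This is only an illustration: the function name shuffle_partitions and the cluster numbers (10 executors, 4 cores each) are made up, and the result is a starting point to benchmark from rather than a definitive value.

```python
def shuffle_partitions(num_executors, cores_per_executor, factor=1):
    """executors * cores per executor * factor, as in the rule of thumb."""
    if min(num_executors, cores_per_executor, factor) < 1:
        raise ValueError("all arguments must be at least 1")
    return num_executors * cores_per_executor * factor

# Hypothetical cluster: 10 executors with 4 cores each, 2 tasks per core.
n = shuffle_partitions(10, 4, factor=2)
print(n)  # 80
```

The resulting number would typically be what you set spark.sql.shuffle.partitions to before re-running and measuring.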
Update #1:
So what is the general industry practice? Will a company simply use the first tactic and allocate more hardware, or will they use dynamic allocation?
Usually, if you have an on-premise Hadoop environment, you can choose between static (default mode) and dynamic allocation (advanced mode). I often start with dynamic because I have no idea how big the data and its transformations are, so sticking with dynamic gives me the flexibility to expand my work without thinking too much about Spark configuration. But you can also start with static if you want to; nothing prevents you from doing so.
Then eventually, when it comes to productionizing the process, you can also choose between static (very stable but consumes more resources) and dynamic (less stable, i.e. it fails sometimes due to resource allocation, but saves resources).
Finally, most Hadoop cloud solutions (like Databricks) come with dynamic allocation by default, which is less costly.

setting tuning parameters of a spark job

I'm relatively new to spark and I have a few questions related to the tuning optimizations with respect to the spark submit command.
I have followed : How to tune spark executor number, cores and executor memory?
and I understand how to utilise maximum resources out of my spark cluster.
However, I was recently asked how to define the number of executors, memory and cores when I have a relatively small operation to do, because if I give it maximum resources, they are going to be underutilised.
For instance,
if I just have to do a merge job (read files from HDFS and write one single huge file back to HDFS using coalesce) for about 60-70 GB of data (in Avro format without compression; assume each file is 128 MB in size, which is the block size of HDFS), what would be the ideal memory, number of executors and cores required for this?
Assume I have the configurations of my nodes same as the one mentioned in the link above.
I can't understand the concept of how much memory will be used up by the entire job provided there are no joins, aggregations etc.

The amount of memory you will need depends on what you run before the write operation. If all you're doing is reading data, combining it and writing it out, then you will need very little memory per CPU because the dataset is never fully materialized before it is written out. If you're doing joins, group-bys or other aggregate operations, all of those will require much more memory. The exception to this rule is that Spark isn't really tuned for large files and is generally much more performant when dealing with sets of reasonably sized files. Ultimately, the best way to get your answers is to run your job with the default parameters and see what blows up.
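As a rough analogy for why a pure read-and-write job needs little memory, here is a small pure-Python sketch (not Spark, and not how Avro files should actually be merged, since byte concatenation ignores Avro headers). The helper merge_files and the temporary file names are hypothetical. The point is only that data streams through in chunks, so memory use is bounded by the chunk size rather than by the total input size.

```python
import os
import shutil
import tempfile

def merge_files(input_paths, output_path, chunk_size=1024 * 1024):
    """Append each input to the output, holding at most one chunk in memory."""
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out, chunk_size)

# Tiny demonstration with temporary files standing in for input blocks.
workdir = tempfile.mkdtemp()
parts = []
for i in range(3):
    part = os.path.join(workdir, f"part-{i}")
    with open(part, "wb") as f:
        f.write(f"record {i}\n".encode())
    parts.append(part)

merged = os.path.join(workdir, "merged")
merge_files(parts, merged)
with open(merged, "rb") as f:
    merged_bytes = f.read()
shutil.rmtree(workdir)  # clean up the temporary directory
```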

Number of Executor Cores and benefits or otherwise - Spark

Some run-time clarifications are requested.
In a thread elsewhere I read, it was stated that a Spark Executor should only have a single Core allocated. However, I wonder if this is really always true. Reading the various SO-questions and the likes of, as well as Karau, Wendell et al, it is clear that there are equal and opposite experts who state one should in some cases specify more Cores per Executor, but the discussion tends to be more technical than functional. That is to say, functional examples are lacking.
My understanding is that a Partition of an RDD or DF, DS, is serviced by a single Executor. Fine, no issue, makes perfect sense. So, how can the Partition benefit from multiple Cores?
If I have a map followed by, say, a filter, these are not two Tasks that can be interleaved (as in what Informatica does); my understanding is that they are fused together. This being so, then what is an example of benefit from an assigned Executor running more Cores?
From JL: In other (more technical) words, a Task is a computation on the records in a RDD partition in a Stage of a RDD in a Spark Job. What does it mean functionally speaking, in practice?
Moreover, can an Executor be allocated if not all Cores can be acquired? I presume there is a wait period and that after a while it may be allocated in a more limited capacity. True?
From a highly rated answer on SO, What is a task in Spark? How does the Spark worker execute the jar file?, the following is stated: When you create the SparkContext, each worker starts an executor. From another SO question: When a SparkContext is created, each worker node starts an executor.
Not sure I follow these assertions. If Spark does not know the number of partitions etc. in advance, why allocate Executors so early?
I ask this, as even this excellent post How are stages split into tasks in Spark? does not give a practical example of multiple Cores per Executor. I can follow the post clearly and it fits in with my understanding of 1 Core per Executor.

My understanding is that a Partition (...) serviced by a single Executor.
That's correct, however the opposite is not true: a single executor can handle multiple partitions / tasks across multiple stages or even multiple RDDs.
then what is an example of benefit from an assigned Executor running more Cores?
First and foremost, processing multiple tasks at the same time. Since each executor is a separate JVM, which is a relatively heavy process, it might be preferable to keep only one instance for a number of threads. Additionally, it can provide further advantages, like exposing shared memory that can be used across multiple tasks (for example to store broadcast variables).
A secondary application is applying multiple threads to a single partition when the user invokes multi-threaded code. That's, however, not something that is done by default (see Number of CPUs per Task in Spark).
See also What are the benefits of running multiple Spark tasks in the same JVM?
If Spark does not know the number of partitions etc. in advance, why allocate Executors so early?
Pretty much by extension of the points made above: executors are not created to handle a specific task / partition. They are long-running processes, and as long as dynamic allocation is not enabled, they are intended to last for the full lifetime of the corresponding application / driver (preemption or failures, as well as the already mentioned dynamic allocation, can affect that, but that's the basic model).
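The relationship between executor cores and concurrent tasks described above can be sketched with simple arithmetic. The function names are hypothetical; the two inputs correspond to the spark.executor.cores and spark.task.cpus settings (the latter defaults to 1), and the cluster sizes in the example are made up.

```python
def task_slots(executor_cores, task_cpus=1):
    """Tasks one executor can run at once: cores // CPUs per task."""
    if task_cpus < 1 or executor_cores < task_cpus:
        raise ValueError("need executor_cores >= task_cpus >= 1")
    return executor_cores // task_cpus

def max_concurrent_tasks(num_executors, executor_cores, task_cpus=1):
    """Upper bound on tasks running at the same time across all executors."""
    return num_executors * task_slots(executor_cores, task_cpus)

print(task_slots(4))               # 4: four single-threaded tasks share one JVM
print(task_slots(4, task_cpus=2))  # 2: each task may use two threads
print(max_concurrent_tasks(5, 4))  # 20
```

So an executor with 4 cores is either four partitions processed side by side in one JVM or, with more CPUs per task, fewer partitions whose code is itself multi-threaded.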

Spark Executor OOM issue

I have a typical batch job that reads CSV from cloud storage and then does a bunch of joins and aggregations; the whole file does not exceed 3G. But I keep getting an OOM exception when writing the result back to storage. I have two executors, each with 80G of RAM, so it just doesn't make sense. Here is the screenshot of my Spark UI and the exception. Any suggestion is appreciated. If my code is super sub-optimal in terms of memory, why doesn't it show up on the Spark UI?
update: the source code is too convoluted to show here, but I figured out that the essential cause is multiple joins.
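// ret starts out as some existing DataFrame (somethingDataframe is a placeholder)
Dataset<Row> ret = somethingDataframe;
for (String cmd : cmds) {
    // each iteration joins the accumulated result with a dataset derived from it
    ret = ret.join(processDataset(ret, cmd), "primary_key");
}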
So each processDataset(ret, cmd) is very fast if you run it on its own, but if you run this kind of for-loop join many times, say 10 or 20 times, it gets much, much slower and hits these OOM issues.

When I have problems with memory I check these things:
Have more executors (more than 2, defined by --total-executor-cores in spark-submit and spark.executor.cores in SparkSession)
Have fewer cores per executor (3-5). You have 14, which is much more than recommended (spark.executor.cores)
Add memory to executors (spark.executor.memory)
Add memory to driver (--driver-memory in spark-submit script)
Make more partitions (make partitions smaller in size) (.config("spark.sql.shuffle.partitions", numPartitionsShuffle) in SparkSession)
Look at the Peak Execution Memory of tasks in the Stages tab (one of the additional metrics to turn on) to see whether it is too big
If you use Mesos, in the Agents tab you can see the real usage of memory per driver and executor (see this answer: How to get Mesos Agents Framework Executor Memory)
Look at explain in your code to analyze the execution plan
See whether one of your joins explodes your memory by making multiple duplicates of lines
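On the last point in the list above, here is a small pure-Python sketch of how duplicate keys inflate a join. It is not Spark code, and join_output_rows is a hypothetical helper; it just counts how many rows an inner equi-join on a single key would produce, which is the per-key product of the row counts on each side.

```python
from collections import Counter

def join_output_rows(left_keys, right_keys):
    """Rows an inner equi-join produces: sum of left_count * right_count per key."""
    left, right = Counter(left_keys), Counter(right_keys)
    return sum(count * right[key] for key, count in left.items())

unique = join_output_rows([1, 2, 3], [1, 2, 3])          # 3 rows: keys are unique
duplicated = join_output_rows([1, 1, 2], [1, 1, 1, 2])   # 2*3 + 1*1 = 7 rows
print(unique, duplicated)  # 3 7
```

If a key is not actually unique, repeating such a join in a loop multiplies the row count on every iteration, so comparing row counts before and after each join is a cheap check.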

Spark job out of RAM (java.lang.OutOfMemoryError), even though there's plenty. xmx too low?

I'm getting java.lang.OutOfMemoryError with my Spark job, even though only 20% of the total memory is in use.
I've tried several configurations:
1x n1-highmem-16 + 2x n1-highmem-8
3x n1-highmem-8
My dataset consists of 1.8M records, read from a local json file on the master node. The entire dataset in json format is 7GB. The job I'm trying to execute involves a simple computation followed by a reduceByKey. Nothing extraordinary. The job runs fine on my single home computer with only 32GB ram (xmx28g), although it requires some caching to disk.
The job is submitted through spark-submit, locally on the server (SSH).
Stack trace and Spark config can be viewed here: https://pastee.org/sgda
The code
val rdd = sc.parallelize(Json.load()) // load everything
.map(fooTransform) // apply some trivial transformation
.flatMap(_.bar.toSeq) // flatten results
.map(c => (c, 1)) // count
.reduceByKey(_ + _)
.sortBy(_._2)
log.v(rdd.collect.map(toString).mkString("\n"))

The root of the problem is that you should try to offload more I/O to the distributed tasks instead of shipping it back and forth between the driver program and the worker tasks. While it may not be obvious at times which calls are driver-local and which ones describe a distributed action, rules of thumb include avoiding parallelize and collect unless you absolutely need all of the data in one place. The amount of data you can Json.load() and then parallelize will max out at whatever the largest available machine type can hold, whereas calls like sc.textFile theoretically scale to hundreds of TBs or even PBs without problem.
The short-term fix in your case would be to try passing spark-submit --conf spark.driver.memory=40g ... or something in that range. Dataproc defaults allocate less than a quarter of the machine to driver memory because commonly the cluster must support running multiple concurrent jobs, and also needs to leave enough memory on the master node for the HDFS namenode and the YARN resource manager.
Longer term you might want to experiment with how you can load the JSON data as an RDD directly, instead of loading it in a single driver and using parallelize to distribute it, since this way you can dramatically speed up the input reading time by having tasks load the data in parallel (and also getting rid of the warning Stage 0 contains a task of very large size which is likely related to the shipping of large data from your driver to worker tasks).
Similarly, instead of calling collect and then finishing things up on the driver program, you can call something like saveAsTextFile on the RDD to save in a distributed manner, without ever bottlenecking through a single place.
Reading the input with sc.textFile would assume line-separated JSON, which you can parse inside some map task, or you can try using sqlContext.read.json. For debugging purposes, instead of using collect() it's often enough to just call take(10) to peek at some records without shipping all of them to the driver.
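The idea of parsing line-separated JSON record by record and only peeking at a few results can be sketched in plain Python, without Spark. The names parse_json_lines and take are hypothetical helpers, and the in-memory source stands in for a real file; in Spark the per-line parsing would run inside a map over sc.textFile rather than in a local generator.

```python
import io
import json
from itertools import islice

def parse_json_lines(lines):
    """Lazily parse one JSON record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

def take(records, n):
    """Return only the first n records, like peeking with take(10)."""
    return list(islice(records, n))

# In-memory stand-in for a line-separated JSON file.
source = io.StringIO('{"id": 1}\n{"id": 2}\n\n{"id": 3}\n')
sample = take(parse_json_lines(source), 2)
print(sample)  # [{'id': 1}, {'id': 2}]
```

Because the generator is lazy, the third record is never parsed here, which is the same reason take(10) is so much cheaper than collect() for debugging.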
