Hadoop on Azure - file processing on larger number of nodes takes the same amount of time

I ran a wordcount program in Python on HDInsight clusters of different sizes, and every run took about the same amount of time. The file is 600 MB and I ran it on 2, 4 and 8 nodes; the times were not identical to the second, but very close.
I expected the time to change, since the file is processed by a larger number of nodes as the cluster grows. Is this what happens with a file that is relatively small? Or is there a way to define the number of nodes on which the job should run? I personally don't think so, since the cluster size is set in advance.
Or is it the nature of the wordcount application, and the fact that the reducer does the same amount of work?
Or is it because it's Python? I read somewhere that it is said to be slower than Java (or Scala on Spark).
The same thing happens on Spark clusters: although the number of nodes goes up, the time does not go down.

In my experience, 600 MB is a small data size for Hadoop. Not all of the elapsed time is spent processing the file, because Hadoop needs some time to start up the M/R job and prepare the data on HDFS.
For a small dataset, it is not necessary to use many compute nodes. A single computer may even outperform a Hadoop cluster, for example on the Hadoop sample wordcount over several small text files.
As far as I know, a dataset on Hadoop generally needs to be at the level of hundreds of GB or more before the cluster shows a performance advantage and before performance increases with the number of nodes.
For reference, see the SO thread "Why submitting job to mapreduce takes so much time in General?".
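The effect of a fixed startup cost can be illustrated with a toy model. This is only a sketch under assumed numbers, not a measurement: the 60 s overhead and the 50 MB/s per-node throughput are hypothetical values picked for illustration, and the model assumes the work parallelizes perfectly.

```python
# Toy model: job time = fixed startup overhead + data size / aggregate throughput.
# ASSUMPTIONS (hypothetical, not measured): 60 s of job startup and scheduling
# overhead, and 50 MB/s of processing throughput per node.
STARTUP_OVERHEAD_S = 60.0
THROUGHPUT_MB_PER_S_PER_NODE = 50.0


def estimated_job_seconds(data_mb: float, nodes: int) -> float:
    """Estimate wall-clock time for a perfectly parallel job with fixed overhead."""
    processing = data_mb / (THROUGHPUT_MB_PER_S_PER_NODE * nodes)
    return STARTUP_OVERHEAD_S + processing


for data_mb, label in [(600, "600 MB"), (600_000, "600 GB")]:
    for nodes in (2, 4, 8):
        seconds = estimated_job_seconds(data_mb, nodes)
        print(f"{label:>7} on {nodes} nodes: {seconds:8.1f} s")
```

With these assumed numbers the 600 MB job barely changes between 2 and 8 nodes, because the overhead dominates, while the 600 GB job gets close to four times faster.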


Very small batch processing with Spark

We are working on a project where we need to process some datasets that are very small, in fact less than 100 rows in CSV format. There are around 20-30 such jobs that process these kinds of datasets. But the load can grow in the future, and it could reach the big data category. Is it fine to start with Spark for these extra-small loads, so that the system remains scalable tomorrow? Or should we write a normal program for now in Java/C# that runs on a schedule, and switch to Spark in the future if the load of some of these tasks becomes really high?
That is absolutely fine. One thing to remember before running a job is to check the available memory and allocate it based on the size of the data.
Say you have 10 cores and 50 GB of RAM, and initially your CSV files are 3 KB or 1 MB in size. Giving 50 GB of RAM and 10 cores to 1 MB files is the wrong approach.
Before you trigger the job, be careful when allocating memory and the number of executors.
For CSV files like the above, with around 3 MB of data, you can give at most 2 cores and 5 GB of RAM to get the job done. As the data grows, you can increase the number of cores and the memory.
Set this before you open the Spark shell (here I am using PySpark with YARN as the resource manager), choosing an executor memory such as 512M or 2G to match the data. For example:
pyspark --master yarn --num-executors 2 --executor-memory 2G

Spark cluster does not scale to small data

I am currently evaluating Spark 2.1.0 on a small cluster (3 nodes with 32 CPUs and 128 GB RAM) with a linear regression benchmark (Spark ML). I measured only the time for the parameter calculation (not including startup, data loading, ...) and noticed the following behavior. For small datasets of 0.1 to 3 million data points, the measured time does not really increase and stays at about 40 seconds. Only with larger datasets, such as 300 million data points, did the processing time go up to 200 seconds. So it seems that the cluster does not scale at all to small datasets.
I also compared the small dataset on my local PC with the cluster using only 10 workers and 16 GB RAM. The processing time of the cluster is larger by a factor of 3. So is this considered normal behavior for Spark, explainable by communication overhead, or am I doing something wrong (or is linear regression not really representative)?
The cluster is a standalone cluster (without YARN or Mesos) and the benchmarks were submitted with 90 workers, each with 1 core and 4 GB RAM.
Spark submit:
./spark-submit --master spark://server:7077 --class Benchmark --deploy-mode client --total-executor-cores 90 --executor-memory 4g --num-executors 90 .../Benchmark.jar pathToData
The optimum cluster size and configuration vary with the data and the nature of the job. In this case, I think your intuition is correct: the job seems to take disproportionately long on the smaller datasets because of the excess overhead given the size of the cluster (cores and executors).
Notice that increasing the amount of data by two orders of magnitude increases the processing time only 5-fold. You are increasing the data toward an optimum size for your cluster setup.
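The two timings quoted in the question are enough for a rough back-of-the-envelope fit. The sketch below assumes a linear cost model (time = fixed overhead + per-row cost * rows), which is a simplification introduced here and not something the benchmark establishes. It takes 3 million rows at 40 s and 300 million rows at 200 s from the question.

```python
# Fit t = overhead + cost_per_row * rows through two measurements quoted in
# the question. ASSUMPTION: a linear cost model, which is only a rough guide.
small_rows, small_seconds = 3_000_000, 40.0
large_rows, large_seconds = 300_000_000, 200.0

cost_per_row = (large_seconds - small_seconds) / (large_rows - small_rows)
overhead = small_seconds - cost_per_row * small_rows

print(f"data grew {large_rows / small_rows:.0f}x, "
      f"time grew {large_seconds / small_seconds:.0f}x")
print(f"estimated fixed overhead: {overhead:.1f} s")
print(f"estimated per-row cost:   {cost_per_row * 1e6:.3f} s per million rows")
```

Under this model roughly 38 of the 40 seconds on the small dataset would be fixed overhead, which is consistent with the time staying flat between 0.1 and 3 million data points.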
Spark is a great tool for processing lots of data, but it isn't going to be competitive with running a single process on a single machine if the data will fit. However it can be much faster than other distributed processing tools that are disk-based, where the data does not fit on a single machine.
I was at a talk a couple of years back where the speaker gave the analogy that Spark is like a locomotive racing a bicycle: the bike will win if the load is light, since it is quicker to accelerate and more agile, but with a heavy load the locomotive, although it might take a while to get up to speed, is going to be faster in the end. (I'm afraid I forget the speaker's name, but it was at a Cassandra meetup in London, and the speaker was from a company in the energy sector.)
I agree with @ImDarrenG's assessment and generally also with the locomotive/bicycle analogy.
With such a small amount of data, I would strongly recommend
A) caching the entire dataset and
B) broadcasting the dataset to each node (especially if you need to do something like your 300M row table join to the small datasets)
Another thing to consider is the number of files (if the data is not already cached), because if you are reading a single unsplittable file, only 1 core will be able to read it. However, once you cache the dataset (coalescing or repartitioning as appropriate), performance will no longer be bound by disk I/O or row serialization.

Spark+yarn - scale memory with input size

I am running a Spark-on-YARN cluster with PySpark. I have a dataset which requires loading several binary files per key, and then running some calculation that is difficult to decompose into parts, so it generally has to operate across all the data for a single key.
Currently, I set spark.executor.memory and spark.yarn.executor.memoryOverhead to "sane" values that work most of the time. However, certain keys end up having a much larger amount of data than the average, and in these cases the memory is insufficient and the executor ends up getting killed.
I currently do one of the following:
1) Run jobs with the default memory setting and just rerun when certain keys fail with more memory
2) If I know one of my keys has much more data, I can scale up the memory for the job as a whole, however this has the downside of drastically reducing the number of running containers I get / number of jobs running in parallel.
Ideally I would have a system where I could send off a job and have the memory in an executor scale with input size, however I know that's not spark's model. Are there any extra settings that can help me here or any tricks for dealing with this problem? Anything obvious I'm missing as a fix?
You can test the following approach: set the executor memory and the YARN executor memory overhead to your maximum values, and set spark.executor.cores to a number greater than 1 (start with 2). Additionally, set spark.task.maxFailures to some big number (let's say 10).
Then, for normal-sized keys, Spark will probably finish the tasks as usual, but some partitions with larger keys will fail. They will be retried, and since the number of partitions to retry will be much lower than the initial number of partitions, Spark will distribute them evenly across the executors. If the number of partitions to retry is lower than or equal to the number of executors, every partition will have twice the memory it had in the initial run and may succeed.
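The arithmetic behind "twice the memory" can be sketched as follows. This is a deliberately simplified model: it assumes an executor's memory is shared evenly by the tasks running in it at the same time, and ignores how Spark actually divides memory into regions. The 8 GB figure is a hypothetical value, not a recommendation.

```python
def memory_per_task_gb(executor_memory_gb: float, concurrent_tasks: int) -> float:
    """Simplified model: memory is shared evenly by concurrently running tasks."""
    if concurrent_tasks < 1:
        raise ValueError("concurrent_tasks must be at least 1")
    return executor_memory_gb / concurrent_tasks


EXECUTOR_MEMORY_GB = 8.0  # hypothetical value for spark.executor.memory
EXECUTOR_CORES = 2        # spark.executor.cores, as suggested above

# First attempt: every core is busy, so two tasks share one executor.
first_attempt = memory_per_task_gb(EXECUTOR_MEMORY_GB, EXECUTOR_CORES)

# Retry: only a few oversized partitions remain, at most one per executor.
retry_attempt = memory_per_task_gb(EXECUTOR_MEMORY_GB, 1)

print(f"first attempt: {first_attempt:.1f} GB per task")
print(f"retry:         {retry_attempt:.1f} GB per task")
```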
Let me know if it works for you.

Processing time for my Spark program does not decrease on increasing the number of nodes in the cluster

I have a Cloudera cluster with 3 nodes on which Apache Spark is installed. I am running a Spark program which reads data from HBase tables, transforms the data and stores it in a different HBase table. With 3 nodes the time taken is approximately 1 minute 10 seconds for 5 million rows of HBase data. On decreasing or increasing the number of nodes, the time taken stayed about the same, whereas I expected it to decrease with more nodes and increase with fewer nodes. Below are the times taken:
1) With 3 nodes: Approximately 1 minute 10 seconds for 5 million rows.
2) With 1 node: Approximately 1 minute 10 seconds for 5 million rows.
3) With 6 nodes: Approximately 1 minute 10 seconds for 5 million rows.
What can be the reason for same time taken despite increasing or decreasing the number of nodes?
By default, HBase will probably read the 5 million rows from a single region or maybe 2 regions (this is the degree of parallelism). The write will also go to a single region or maybe 2, based on the scale of the data.
Is Spark your bottleneck? Allocating more or fewer resources (cores or memory) will only change the overall time of the job if the computation in the job is the bottleneck.
If your computation (the transform) is relatively simple, the bottleneck might be reading from HBase or writing to HBase. In that case, irrespective of how many nodes or cores you give it, the run time will be constant.
From the run times you have mentioned, it seems that's the issue.
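A toy model shows why the run time can stay flat. It is a sketch under stated assumptions: one Spark task per HBase region, a hypothetical per-task rate of 36,000 rows per second (chosen only so that the two-region case lands near the reported 70 seconds), and a hypothetical 4 cores per node.

```python
def effective_parallelism(num_regions: int, nodes: int, cores_per_node: int) -> int:
    """Tasks that can run at once: bounded by the regions and by the cores."""
    return min(num_regions, nodes * cores_per_node)


def estimated_seconds(rows: int, rows_per_second_per_task: float,
                      num_regions: int, nodes: int, cores_per_node: int) -> float:
    """Estimate run time when each task handles one region at a fixed rate."""
    parallelism = effective_parallelism(num_regions, nodes, cores_per_node)
    return rows / (rows_per_second_per_task * parallelism)


ROWS = 5_000_000
RATE = 36_000.0      # hypothetical rows per second handled by one task
CORES_PER_NODE = 4   # hypothetical

for nodes in (1, 3, 6):
    two_regions = estimated_seconds(ROWS, RATE, 2, nodes, CORES_PER_NODE)
    many_regions = estimated_seconds(ROWS, RATE, 24, nodes, CORES_PER_NODE)
    print(f"{nodes} node(s): {two_regions:5.1f} s with 2 regions, "
          f"{many_regions:5.1f} s with 24 regions")
```

In this model the two-region table takes the same time on 1, 3 and 6 nodes, while a table spread over 24 regions speeds up as nodes are added.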
The bottleneck may be on the HBase side, the Spark side, or both. On the HBase side, check how many region servers host your tables: that number bounds the read and write parallelism of the data. More is usually better. You should also watch out for hotspotting.
On the Spark side, parallelism can be checked through the number of RDD partitions for your data. Maybe you should repartition your data. In addition, cluster resource utilization may be your problem. To check this, monitor the Spark master web interface: the number of nodes, the number of workers per node, the number of jobs and tasks per worker, and so on. You should also check the number of CPUs and the amount of RAM used per worker in this interface.

Apache Spark running out of memory with smaller amount of partitions

I have a Spark application that keeps running out of memory. The cluster has two nodes with around 30G of RAM, and the input data size is a few hundred GB.
The application is a Spark SQL job: it reads data from HDFS, creates a table and caches it, then runs some Spark SQL queries and writes the result back to HDFS.
Initially I split the data into 64 partitions and I got OOM, then I was able to fix the memory issue by using 1024 partitions. But why did using more partitions help me solve the OOM issue?
The solution to big data is partitioning (divide and conquer), since not all of the data can fit into memory, nor can it be processed on a single machine.
Each partition can fit into memory and be processed (map) in a relatively short time. After each partition has been processed, the results need to be merged (reduce). This is traditional MapReduce.
Splitting the data into more partitions means that each partition gets smaller.
Spark uses a revolutionary concept called the Resilient Distributed Dataset (RDD).
There are two types of operations: transformations and actions.
Transformations map one RDD to another and are lazily evaluated. The resulting RDDs can be treated as intermediate results that we do not need to retrieve.
Actions are used when you really want to get the data. Their results can be treated as what we actually want, for example taking the top failing entries.
Spark analyses all the operations and creates a DAG (Directed Acyclic Graph) before execution.
Spark starts computing from the source RDD when an action is fired, and does not keep the intermediate results afterwards (unless they are cached).
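The difference between lazy transformations and actions can be illustrated with an analogy in plain Python. This sketch uses generators, not Spark, so it shows only the evaluation order; it says nothing about partitions, the DAG or distribution.

```python
log = []


def source():
    """Stand-in for a source dataset: records each element as it is read."""
    for n in range(1, 6):
        log.append(f"read {n}")
        yield n


# "Transformations": building the pipeline does not read anything yet.
squared = (n * n for n in source())
evens = (n for n in squared if n % 2 == 0)
assert log == []

# "Action": consuming the pipeline triggers the whole chain at once.
result = list(evens)
print(result)    # [4, 16]
print(len(log))  # 5
```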
I made a small screencast for a presentation, available on YouTube: "Spark Makes Big Data Sparking".
"Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data". The issue here is with large partitions generating OOM.
Partitions determine the degree of parallelism. The Apache Spark documentation says that the number of partitions should be at least equal to the number of cores in the cluster.
Fewer partitions result in:
1) Less concurrency
2) Increased memory pressure for transformations that involve a shuffle
3) More susceptibility to data skew
Too many partitions can also have a negative impact:
1) Too much time spent scheduling many tasks
When your data is stored on HDFS, it is already partitioned into 64 MB or 128 MB blocks, as per your HDFS configuration. When reading HDFS files with Spark, the number of DataFrame partitions (df.rdd.getNumPartitions) depends on the following properties:
spark.default.parallelism (Cores available for the application)
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
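How these three properties interact can be sketched in plain Python. The formula below follows the split-size calculation as I understand it from Spark 2.x's file source planning; treat it as an approximation, since other versions may differ. The helper names are hypothetical, and the partition estimate only covers a single splittable file.

```python
import math

MIB = 1024 * 1024


def max_split_bytes(total_file_bytes: int, num_files: int, default_parallelism: int,
                    max_partition_bytes: int = 128 * MIB,
                    open_cost_in_bytes: int = 4 * MIB) -> int:
    """Approximate the split size Spark picks when planning a file scan."""
    # Each file is padded with the "open cost" before dividing among the cores.
    padded_total = total_file_bytes + num_files * open_cost_in_bytes
    bytes_per_core = padded_total // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))


def approx_partitions_single_file(file_bytes: int, default_parallelism: int) -> int:
    """Rough partition count for one splittable file (ignores split packing)."""
    split = max_split_bytes(file_bytes, 1, default_parallelism)
    return math.ceil(file_bytes / split)


# A large file on a small cluster is cut into 128 MiB splits.
print(approx_partitions_single_file(10 * 1024 * MIB, default_parallelism=8))  # 80

# A small file on a wide cluster is cut no finer than the 4 MiB open cost.
print(approx_partitions_single_file(100 * MIB, default_parallelism=200))      # 25
```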
During Spark Summit, Aaron Davidson gave some tips on partition tuning. He also defined a reasonable number of partitions, summarized in the three points below:
1) Commonly between 100 and 10000 partitions (note: the two points below are more reliable, because "commonly" here depends on the sizes of the dataset and the cluster)
2) lower bound = at least 2 * the number of cores in the cluster
3) upper bound = enough partitions that each task still takes at least 100 ms
Rockie's answer is right, but it doesn't get to the point of your question.
When you cache an RDD, all of its partitions are persisted (according to the storage level), respecting the spark.memory.fraction and spark.memory.storageFraction properties.
Besides that, at a certain moment Spark can automatically drop some partitions from memory (or you can do this manually for an entire RDD with RDD.unpersist()), according to the documentation.
Thus, when you have more (and therefore smaller) partitions, Spark can keep only part of them in the LRU cache so that they do not cause OOM (this may have a negative impact too, such as the need to re-cache partitions).
Another important point is that when you write the result back to HDFS using X partitions, you have X tasks for all your data. Take the total data size and divide it by X: this is the memory needed for each task, and each task is executed on one (virtual) core. So it is not difficult to see that X = 64 leads to OOM, but X = 1024 does not.
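The division described above is easy to work through. The question only says "a few hundred GB", so the 300 GiB total below is an assumed stand-in, and the sketch assumes the data is spread evenly over the partitions.

```python
GIB = 1024 ** 3
TOTAL_DATA_BYTES = 300 * GIB  # ASSUMPTION: stand-in for "a few hundred GB"


def bytes_per_partition(total_bytes: int, num_partitions: int) -> float:
    """Average amount of data one task handles, assuming an even spread."""
    return total_bytes / num_partitions


for partitions in (64, 1024):
    gib = bytes_per_partition(TOTAL_DATA_BYTES, partitions) / GIB
    print(f"{partitions:>5} partitions: about {gib:.2f} GiB per task")
```

With 64 partitions each task would handle several GiB at once, while with 1024 partitions each task handles a few hundred MiB, which fits far more comfortably on nodes with around 30G of RAM.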
