Understand how Spark distributes the input file to worker nodes - apache-spark

I have a Spark cluster with 3 worker nodes. Take the simplified word count as sample:
val textFile = sc.textFile("hdfs://input/words")
textFile.count
This application creates an RDD and counts how many lines it has. Since the input file is huge, when the count action is actually performed, does Spark split the input into 3 parts and move them separately to the 3 worker nodes? If so, how does Spark partition the input file (i.e., how does Spark determine which lines are sent to which worker node)?

You are trying to process the file "hdfs://input/words". This file is already split as soon as you store it on HDFS (since you have taken the example of an HDFS file above). If the file has 3 blocks, Spark will see it as 3 partitions of the file.
Spark does not need to move the file to the worker nodes: since the file is on HDFS, it is already on the machines that Spark will use as worker nodes.
I hope this is clear.
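As a minimal sketch of that point (reusing the path from the question, and assuming the file really has 3 HDFS blocks), you can verify the partitioning from the shell:

val textFile = sc.textFile("hdfs://input/words")
println(textFile.getNumPartitions)   // typically one partition per HDFS block, e.g. 3

// You may ask for more partitions than blocks, but never fewer splits than HDFS provides:
val moreParts = sc.textFile("hdfs://input/words", 12)
println(moreParts.getNumPartitions)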

Related

Spark Streaming and Kafka problem with partitions

I created an application, using Spark Streaming, that receives the paths of some files from Kafka and opens them to analyze their content. I would like to read these files in parallel inside Spark with a flatMap() function which returns the elements inside each file. I am sending the file paths using a Kafka topic with 8 partitions, sending 8 paths per batch interval. By default I have 8 partitions inside Spark, but the paths are not equally distributed, so some tasks read more files than others. How can I balance the 8 partitions so that there is exactly one path per partition?
Thank You.

Does Spark distribute dataframes across nodes internally?

I am trying to use Spark to process a CSV file on a cluster. I want to understand whether I need to explicitly read the file on each of the worker nodes to do the processing in parallel, or whether the driver node reads the file and distributes the data across the cluster for processing internally. (I am working with Spark 2.3.2 and Python.)
I know RDDs can be parallelized using SparkContext.parallelize(), but what about Spark DataFrames?
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName('myApp').getOrCreate()
    df = spark.read.csv('dataFile.csv', header=True)
    df = df.filter("date>'2010-12-01' AND date<='2010-12-02' AND town=='Madrid'")
So if I run the above code on a cluster, will the entire operation be done by the driver node, or will df be distributed across the cluster with each worker processing its own data partition?
To be strict, if you run the above code it will not read or process any data. DataFrames are basically an abstraction implemented on top of RDDs. As with RDDs, you have to distinguish transformations and actions. As your code only consists of one filter(...) transformation, nothing will happen in terms of reading or processing of data. Spark will only create the DataFrame, which is an execution plan. You have to perform an action like count() or write.csv(...) to actually trigger processing of the CSV file.
If you do so, the data will then be read and processed by 1..n worker nodes. It is never read or processed by the driver node. How many of your worker nodes are actually involved depends -- in your code -- on the number of partitions of your source file. Each partition of the source file can be processed in parallel by one worker node. In your example it is probably a single CSV file, so when you call df.rdd.getNumPartitions() after you read the file, it should return 1. Hence, only one worker node will read the data. The same is true if you check the number of partitions after your filter(...) operation.
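A hedged sketch of that behaviour, written in Scala for consistency with the rest of this page (the PySpark API is analogous; the file name and filter come from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("myApp").getOrCreate()

// Transformations only: Spark builds an execution plan, no full scan of the CSV yet.
val df = spark.read.option("header", "true").csv("dataFile.csv")
val filtered = df.filter("date > '2010-12-01' AND date <= '2010-12-02' AND town == 'Madrid'")

println(filtered.rdd.getNumPartitions)  // likely 1 for a single small CSV file
filtered.count()                        // an action: only now is the file read and filtered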
Here are two ways of how the processing of your single CSV file can be parallelized:
You can manually repartition your source DataFrame by calling df.repartition(n), with n the number of partitions you want to have. But -- and this is a significant but -- this means that all data is potentially sent over the network (aka shuffled)!
You perform aggregations or joins on the DataFrame. These operations have to trigger a shuffle. Spark then uses the number of partitions specified in spark.sql.shuffle.partitions (default: 200) to partition the resulting DataFrame.
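A short sketch of both options (the column name and partition counts are illustrative only, and df is the DataFrame from the question):

// Option 1: explicit repartition -- redistributes all rows over the network (a shuffle).
val df8 = df.repartition(8)
println(df8.rdd.getNumPartitions)      // 8

// Option 2: a wide operation (aggregation/join) shuffles anyway; the result has
// spark.sql.shuffle.partitions partitions (default 200) unless you lower it.
spark.conf.set("spark.sql.shuffle.partitions", "8")
val grouped = df.groupBy("town").count()
println(grouped.rdd.getNumPartitions)  // 8 after the shuffle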

pyspark split load uniformly across all executors

I have a 5-node cluster. I am loading a CSV file with 100k records into a DataFrame using PySpark, performing some ETL operations, and writing the output to a Parquet file.
When I load the DataFrame, how can I divide the dataset uniformly across all executors so that each executor processes 20k records?
If possible, make sure that the input data is split into smaller files.
That way each executor will read and process a single file.
In case you can't modify the input files, you can call df.repartition(5), but keep in mind that it will cause an expensive shuffle operation.
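A minimal sketch of the repartition approach (in Scala; PySpark is analogous, and the paths plus the count of 5 are illustrative assumptions matching the 5 executors described above):

val df = spark.read.option("header", "true").csv("hdfs:///data/input.csv")

// Redistribute the ~100k records roughly evenly into 5 partitions (one per executor).
// This is a full shuffle, so it only pays off if the downstream ETL is heavy enough.
val balanced = df.repartition(5)

// ... ETL transformations here ...
balanced.write.parquet("hdfs:///data/output")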

What will happen in a cluster environment when I do spark.textFile("hdfs://...log.txt")?

Guys, I am new to Spark and have learned some of its basic concepts. Although I now have some understanding of concepts such as partitions, stages, tasks, and transformations, I find it a bit difficult to connect these dots.
Assume the file has 4 lines (each line takes 64 MB, so it is the same as the default partition size) and I have one master node and 4 slave nodes.
val input = spark.textFile("hdfs://...log.txt")
// whatever transformation here
val splitedLines = input.map(line => line.split(" "))
  .map(words => (words(0), 1))
  .reduceByKey{ (a, b) => a + b }
I am wondering what will happen on the master node and on the slave nodes.
Here is my understanding please correct me if I am wrong.
When I start the SparkContext, each worker starts an executor, according to this post: What is a task in Spark? How does the Spark worker execute the jar file?
Then the application code will get pushed to the slave nodes.
Will each of the 4 slave nodes read one line from the file? If so, does that mean an RDD will be generated on each slave node? Then the DAG will be generated based on the RDDs, stages will be built, and tasks will be identified as well. In this case, each slave node has one RDD and one partition to hold it.
Or will the master node read the entire file and build an RDD, then the DAG, then the stages, and only push the tasks to the slave nodes, so that the slave nodes only process tasks such as map, filter, or reduceByKey? But if this is the case, how will the slave nodes read the file? How is the file or RDD distributed among the slaves?
What I am looking for is to understand the flow step by step, and to understand where each step happens: on the master node or on the slave nodes.
thank you for your time.
cheers
Will each of the 4 slave nodes read one line from the file?
Yes, since the file is split, it will be read in parallel. (The split size is a tunable property.)
How is the file or RDD distributed among the slaves?
HDFS takes care of the splitting, and the Spark workers are responsible for reading.
Source: https://github.com/jaceklaskowski/mastering-apache-spark-book
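As a small sketch of how to see those pieces yourself (reusing the code from the question, with sc as the SparkContext and the placeholder path left as-is):

val input = sc.textFile("hdfs://...log.txt")
val counts = input.map(line => line.split(" "))
  .map(words => (words(0), 1))
  .reduceByKey((a, b) => a + b)

println(input.getNumPartitions)  // one partition per HDFS block/split of the file
println(counts.toDebugString)    // lineage: the reduceByKey introduces a shuffle, i.e. a new stage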

How does Spark parallelize the processing of a 1TB file?

Imaginary problem
A gigantic CSV log file, let's say 1 TB in size; the file is located on a USB drive.
The log contains activity logs of users around the world; let's assume that each line contains 50 columns, among them Country.
We want a line count per country, in descending order.
Let's assume the Spark cluster has enough nodes with RAM to process the entire 1 TB in memory (20 nodes, 4-core CPUs, 64 GB RAM per node).
My Poorman's conceptual solution
Using SparkSQL & Databricks spark-csv
$ ./spark-shell --packages com.databricks:spark-csv_2.10:1.4.0
val dfBigLog = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("/media/username/myUSBdrive/bogusBigLog1TB.log")
dfBigLog.select("Country")
.groupBy("Country")
.agg(count($"Country") as "CountryCount")
.orderBy($"CountryCount".desc).show
Question 1: How does Spark parallelize the processing?
I suppose the majority of the execution time (99%?) of the above solution is spent reading the 1 TB file from the USB drive into the Spark cluster. Reading the file from the USB drive is not parallelizable. But after reading the entire file, what does Spark do under the hood to parallelize the processing?
How many nodes are used for creating the DataFrame? (Maybe only one?)
How many nodes are used for the groupBy & count? Let's assume there are 100+ countries (but Spark doesn't know that yet). How would Spark partition the data to distribute the 100+ country values over the 20 nodes?
Question 2: How to make the Spark application the fastest possible?
I suppose the area of improvement would be to parallelize the reading of the 1TB file.
Convert the CSV file into Parquet format, using Snappy compression. Let's assume this can be done in advance.
Copy the Parquet file onto HDFS. Let's assume the Spark cluster is within the same Hadoop cluster and the datanodes are independent from the 20-node Spark cluster.
Change the Spark application to read from HDFS. I suppose Spark would now use several nodes to read the file, as Parquet is splittable.
Let's assume the Parquet file compressed with Snappy is 10x smaller, size = 100 GB, HDFS block = 128 MB in size. 782 HDFS blocks in total.
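A hedged sketch of that conversion step, using the same Spark 1.x / spark-csv API as above (the output HDFS path is an illustrative assumption; the conversion would have to run on the machine where the USB drive is mounted, e.g. in local mode):

// Write the 1 TB CSV out as Snappy-compressed Parquet ahead of time.
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/media/username/myUSBdrive/bogusBigLog1TB.log")
  .write
  .parquet("hdfs:///logs/bogusBigLog1TB.parquet")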
But then how does Spark manage to use all 20 nodes both for creating the DataFrame and for the processing (groupBy and count)? Does Spark use all the nodes each time?
Question 1: How does Spark parallelize the processing (of reading a file from a USB drive)?
This scenario is not possible.
Spark relies on a Hadoop-compliant filesystem to read a file. When you mount the USB drive, you can only access it from the local host. Attempting to execute
.load("/media/username/myUSBdrive/bogusBigLog1TB.log")
will fail in a cluster configuration, as executors in the cluster will not have access to that local path.
It would be possible to read the file with Spark in local mode (master=local[*]), in which case you would only have 1 host, and hence the rest of the questions would not apply.
Question 2: How to make the Spark application the fastest possible?
Divide and conquer.
The strategy outlined in the question is good. Using Parquet will allow Spark to do a projection on the data and read only the .select("Country") column, further reducing the amount of data that needs to be ingested and hence speeding things up.
The cornerstone of parallelism in Spark is partitions. Again, as we are reading from a file, Spark relies on the Hadoop filesystem. When reading from HDFS, the partitioning will be dictated by the splits of the file on HDFS. Those splits will be evenly distributed among the executors. That is how Spark will initially distribute the work across all available executors for the job.
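A hedged sketch of what the Parquet-on-HDFS variant could look like (the HDFS path is illustrative; in spark-shell the $ implicits are already available):

import org.apache.spark.sql.functions.count

// Parquet is columnar, so only the Country column is physically read.
val dfBigLog = sqlContext.read.parquet("hdfs:///logs/bogusBigLog1TB.parquet")

// The initial number of partitions roughly follows the HDFS splits of the file,
// and those partitions are spread over the executors.
println(dfBigLog.rdd.partitions.length)

dfBigLog.select("Country")
  .groupBy("Country")
  .agg(count($"Country") as "CountryCount")
  .orderBy($"CountryCount".desc)
  .show()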
I'm not deeply familiar with the Catalyst optimizations, but I think I can assume that .groupBy("Country").agg(count($"Country")) will become something similar to: rdd.map(country => (country, 1)).reduceByKey(_ + _)
The map operation will not affect partitioning, so it can be applied on site.
The reduceByKey will first be applied locally on each partition, and the partial results will then be merged by key across the executors in a shuffle; only the small, already-aggregated result reaches the driver (via show). So most of the counting happens distributed in the cluster, and the final step is cheap.
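A rough RDD-level sketch of that equivalent plan (purely illustrative, not what Catalyst actually generates):

// Each partition first builds partial per-country counts locally (map-side combine),
// then a shuffle merges the partials by key across the executors.
val countsByCountry = dfBigLog.select("Country").rdd
  .map(row => (row.getString(0), 1L))
  .reduceByKey(_ + _)

// Only this small, already-aggregated result ever reaches the driver:
countsByCountry.collect().sortBy(-_._2).foreach(println)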
Reading the file from the USB drive is not parallelizable.
For a USB drive or any other data source, the same rules apply: either the source is accessible from the driver and all worker machines, and the data is accessed in parallel (up to the source's limits), or the data is not accessed at all and you get an exception.
How many nodes are used for creating the DataFrame? (Maybe only one?)
Assuming that the file is accessible from all machines, it depends on the configuration. For starters, you should take a look at the split size.
How many nodes are used for the groupBy & count?
Once again, it depends on the configuration.
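As a rough, hedged pointer -- assuming a more recent Spark version with a SparkSession named spark, rather than the Spark 1.x shell used in the question -- the file-split size that drives the initial partition count can be tuned:

// Maximum number of bytes packed into a single file-based partition (~128 MB by default).
// Lowering it yields more, smaller input partitions and hence more read parallelism.
spark.conf.set("spark.sql.files.maxPartitionBytes", 32 * 1024 * 1024)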
