It looks like addFile and broadcast do similar things. How are they different? When should you use one vs. the other?
Broadcast is used for variables that you need in your code, such as a static list that every task has to refer to. From the documentation on Broadcast Variables:
Broadcast variables allow the programmer to keep a read-only variable
cached on each machine rather than shipping a copy of it with tasks.
They can be used, for example, to give every node a copy of a large
input dataset in an efficient manner. Spark also attempts to
distribute broadcast variables using efficient broadcast algorithms to
reduce communication cost.
addFile is used to make a file available on every node; it could be a jar file or a library that the program refers to.
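A minimal sketch of both, assuming an existing SparkContext sc; the paths and the list below are hypothetical:

import org.apache.spark.SparkFiles

// Broadcast: ship a read-only value once per executor and reference it inside tasks.
val countryCodes = sc.broadcast(Seq("US", "DE", "IN"))
val matching = sc.textFile("hdfs:///events.csv")
  .filter(line => countryCodes.value.exists(code => line.contains(code)))

// addFile: physically distribute a file to every node and resolve the node-local copy.
sc.addFile("/local/path/lookup.gz")
val localCopy = SparkFiles.get("lookup.gz")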
Hoping this clarifies.
Cheers!
addFile adds a file to the Spark job; it is generally used to make local files available to Spark.
Broadcast, on the other hand, is commonly associated with joining two datasets in Spark: when one of the RDDs/DataFrames is small, it can be broadcast to all the executors so that Spark can do a map-side join.
addFile needs to know where to load the file from, whereas with broadcast the underlying files could themselves be distributed, yet the DataFrame built on top of them can still be small; this can be achieved through filtering / transformation.
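For example, a map-side join can be requested explicitly with the broadcast hint; the column names and data below are hypothetical:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder.appName("broadcastJoinSketch").getOrCreate()
import spark.implicits._

// Hypothetical data: a large fact-like dataset and a small dimension-like one.
val large = (1 to 1000000).map(i => (i % 100, s"event-$i")).toDF("customer_id", "event")
val small = (0 until 100).map(i => (i, s"customer-$i")).toDF("customer_id", "name")

// The hint ships `small` to every executor so the join is done map-side,
// without shuffling `large`.
val joined = large.join(broadcast(small), "customer_id")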
Typical use cases:
Use Broadcast for variables / data.
Use addFile for libraries / custom code etc.
There are exceptions to both, though. In one case with fairly large data (~900MB in my case), I used the addFile mechanism to push the file to all nodes and then loaded the data there. This turned out to be a bit faster, since my file was already in gzipped format.
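A sketch of that pattern, with a hypothetical gzipped lookup file: addFile ships it once per node, and SparkFiles.get resolves the node-local copy inside each task (sc is assumed to exist):

import java.io.FileInputStream
import java.util.zip.GZIPInputStream
import scala.io.Source
import org.apache.spark.SparkFiles

sc.addFile("/data/lookup.csv.gz")  // hypothetical local path on the driver

val kept = sc.textFile("hdfs:///events").mapPartitions { events =>
  // Runs on the executor: load the shipped file once per partition.
  val localPath = SparkFiles.get("lookup.csv.gz")
  val lookup = Source.fromInputStream(
    new GZIPInputStream(new FileInputStream(localPath))).getLines().toSet
  events.filter(lookup.contains)
}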
I have some (200ish) large zip files (some >1GB) that need to be unzipped and processed using Python geo- and image-processing libraries. The results will be written as new files in FileStore and later used for ML tasks in Databricks.
What would the general approach be, if I want to exploit the Spark cluster processing power? I'm thinking of adding the filenames to a DataFrame, and using user defined functions to process them via Select or similar. I believe I should be able to make this run in parallel on the cluster, where the workers will get just the filename, and then load the files locally.
Is this reasonable, or is there some completely different direction I should go?
Update - Or maybe like this:
from pyspark.sql import SparkSession

zipfiles = ...

def f(x):
    print("Processing " + x)

spark = SparkSession.builder.appName('myApp').getOrCreate()
rdd = spark.sparkContext.parallelize(zipfiles)
rdd.foreach(f)
Update 2:
For anyone doing this: since Spark by default reserves almost all available memory, you may have to reduce that with the setting spark.executor.memory 1g, or you might quickly run out of memory on the worker.
Yes, you can use Spark as a generic parallel-processing engine, give or take some serialization issues. For example, in one project I used Spark to scan many bloom filters in parallel and to random-access indexed files where the bloom filters returned a positive. Most likely you will need to use the RDD API for such tailor-made solutions.
Is there a way to set the preferred locations of RDD partitions manually?
I want to make sure a certain partition is computed on a certain machine.
I'm using an array and the parallelize method to create an RDD from that.
Also, I'm not using HDFS; the files are on local disk. That's why I want to control the execution node.
Is there a way to set the preferredLocations of RDD partitions manually?
Yes, there is, but it's RDD-specific and so different kinds of RDDs have different ways to do it.
Spark uses RDD.preferredLocations to get a list of preferred locations to compute each partition/split on (e.g. block locations for an HDFS file).
final def preferredLocations(split: Partition): Seq[String]
Get the preferred locations of a partition, taking into account whether the RDD is checkpointed.
As you can see, the method is final, which means that no one can ever override it.
When you look at the source code of RDD.preferredLocations you will see how an RDD knows its preferred locations: it uses the protected RDD.getPreferredLocations method that a custom RDD may (but does not have to) override to specify placement preferences.
protected def getPreferredLocations(split: Partition): Seq[String] = Nil
So, now the question has "morphed" into another about what are the RDDs that allow for setting their preferred locations. Find yours and see the source code.
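As a purely hypothetical sketch of what such an override can look like in a hand-written RDD (none of this is taken from Spark's own sources):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Each input element pairs a preferred hostname with the data to compute there.
case class PinnedPartition(index: Int) extends Partition

class PinnedRDD(sc: SparkContext, data: Seq[(String, Seq[Int])])
    extends RDD[Int](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    data.indices.map(i => PinnedPartition(i): Partition).toArray

  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    data(split.index)._2.iterator

  // The hook the scheduler consults when deciding where to run each partition's task.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(data(split.index)._1)
}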
I'm using an array and the parallelize method to create an RDD from that.
If you parallelize your local dataset, it does get distributed and can be processed as such, but... why would you want to use Spark for something you can process locally on a single computer/node?
If however you insist and do really want to use Spark for local datasets, the RDD behind SparkContext.parallelize is...let's have a look at the source code... ParallelCollectionRDD which does allow for location preferences.
Let's then rephrase your question to the following (hoping I won't lose any important fact):
What are the operators that allow for creating a ParallelCollectionRDD and specifying the location preferences explicitly?
To my great surprise (as I didn't know about the feature), there is such an operator, i.e. SparkContext.makeRDD, that...accepts one or more location preferences (hostnames of Spark nodes) for each object.
makeRDD[T](seq: Seq[(T, Seq[String])]): RDD[T] Distribute a local Scala collection to form an RDD, with one or more location preferences (hostnames of Spark nodes) for each object. Create a new partition for each collection item.
In other words, rather than using parallelize you have to use makeRDD (which is available in the Spark Core API for Scala; whether Python has it I'm leaving as a home exercise for you :))
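A minimal Scala sketch, assuming an existing SparkContext sc; the paths and hostnames are placeholders for your actual files and worker nodes:

// One partition per element; the scheduler will try to run each partition's task
// on (one of) the hosts listed alongside it.
val pinned = sc.makeRDD(Seq(
  ("/local/disk/part-0.csv", Seq("worker-1.example.com")),
  ("/local/disk/part-1.csv", Seq("worker-2.example.com"))
))

pinned.foreach { path =>
  // Runs on the preferred host, so the file can be read from its local disk.
  println(s"processing $path on this node")
}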
The same reasoning applies to any other RDD operator / transformation that creates some sort of RDD.
When Spark ingests data, are there specific situations where it has to go through the driver and then from the driver to the workers? The same question applies for a direct read by the workers.
I guess I am simply trying to map out the conditions or situations that lead to one way or the other, and how partitioning happens in each case.
If you limit yourself to built-in methods, then unless you create a distributed data structure from a local one with a method like:
SparkSession.createDataset
SparkContext.parallelize
data is always accessed directly by the workers, but the details of the data distribution will vary from source to source.
RDDs typically depend on Hadoop input formats, but the Spark SQL and Data Source APIs are at least partially independent, at least when it comes to configuration.
It doesn't mean data is always properly distributed, though. In some cases (JDBC, streaming receivers) data may still be piped through a single node.
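A small sketch of the two paths, assuming an existing SparkSession spark; the path is a placeholder:

// Driver-first: the collection lives on the driver and is shipped to the executors
// when the RDD's partitions are materialized.
val fromDriver = spark.sparkContext.parallelize(1 to 1000000, numSlices = 8)

// Worker-direct: each executor reads its own split(s) of the file straight from
// the storage layer; the rows never pass through the driver.
val fromStorage = spark.read.textFile("hdfs:///data/events.txt")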
I am looking for something equivalent to Hadoop's InputFormat, but I do not have a .java class from Hadoop. My question is how this is done in Spark, without using Hadoop's way of identifying inputs.
Sorry if this is a dumb question, but I am extremely new to Hadoop/Spark.
Thanks
I am presuming that in the MapReduce case the InputFormat data will be small, since it is mostly used to define coherent data groups (to be processed in a single map or MR job). So it is unlikely that the file defining a coherent group is too big to fit in memory, which means it is possible to read that data and cache it in memory in Spark. Later you can read the content of this file, create an iterator over it (where each entry identifies a data part, say a Hive partition), and then generate a dynamic path for each data part using this iterator.
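A sketch of that idea, assuming sc and spark are already available and a hypothetical small manifest file that lists one data group (e.g. a Hive partition value) per line:

// The manifest is small, so it can be pulled to the driver and iterated there.
val groups = sc.textFile("hdfs:///manifests/groups.txt").collect()

// Generate a dynamic path per group and load each part as its own dataset.
val perGroup = groups.iterator.map { g =>
  g -> spark.read.parquet(s"hdfs:///warehouse/events/dt=$g")  // hypothetical layout
}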
I am writing a Spark application (single client) and dealing with lots of small files on which I want to run an algorithm, the same algorithm for each of them. But the files cannot simply be loaded into one RDD for the algorithm to work, because it has to sort data within the boundary of a single file.
Today I work on one file at a time; as a result I have poor resource utilization (a small amount of data per action, lots of overhead).
Is there any way to perform the same action/transformation on multiple RDDs simultaneously (using only one driver program), or should I look for another platform, given that such a mode of operation isn't classic for Spark?
If you use SparkContext.wholeTextFiles, you can read the files into one RDD where each element holds the path and the full content of a single file. Then you can apply your per-file logic to each element, for example with mapValues(sort_file), where sort_file is the sorting function you want to run on each file's content. This uses concurrency better than your current solution, as long as your files are small enough that each one can be processed within a single task.
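A sketch of that approach, assuming an existing SparkContext sc; the input path and the per-file sort below stand in for your own algorithm:

// Each element is a (path, fullContents) pair, so a whole file stays together.
val files = sc.wholeTextFiles("hdfs:///data/small-files/*")

// Hypothetical per-file algorithm: sort the lines within a single file's boundary.
def sortFile(contents: String): String =
  contents.split("\n").sorted.mkString("\n")

val sorted = files.mapValues(sortFile)
sorted.saveAsTextFile("hdfs:///data/sorted-output")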