How to control preferred locations of RDD partitions? - apache-spark

Is there a way to set the preferred locations of RDD partitions manually?
I want to make sure certain partition be computed in a certain machine.
I'm using an array and the 'Parallelize' method to create a RDD from that.
Also I'm not using HDFS, The files are on the local disk. That's why I want to modify the execution node.

Is there a way to set the preferredLocations of RDD partitions manually?
Yes, there is, but it's RDD-specific and so different kinds of RDDs have different ways to do it.
Spark uses RDD.preferredLocations to get a list of preferred locations to compute each partition/split on (e.g. block locations for an HDFS file).
final def preferredLocations(split: Partition): Seq[String]
Get the preferred locations of a partition, taking into account whether the RDD is checkpointed.
As you see the method is final which means that no one can ever override it.
When you look at the source code of RDD.preferredLocations you will see how a RDD knows its preferred locations. It is using the protected RDD.getPreferredLocations method that a custom RDD may (but don't have to) override to specify placement preferences.
protected def getPreferredLocations(split: Partition): Seq[String] = Nil
So, now the question has "morphed" into another about what are the RDDs that allow for setting their preferred locations. Find yours and see the source code.
I'm using an array and the 'Parallelize' method to create a RDD from that.
If you parallelize your local dataset it's no longer distributed and can be such, but...why would you want to use Spark for something you can process locally on a single computer/node?
If however you insist and do really want to use Spark for local datasets, the RDD behind SparkContext.parallelize is...let's have a look at the source code... ParallelCollectionRDD which does allow for location preferences.
Let's then rephrase your question to the following (hoping I won't lose any important fact):
What are the operators that allow for creating a ParallelCollectionRDD and specifying the location preferences explicitly?
To my great surprise (as I didn't know about the feature), there is such an operator, i.e. SparkContext.makeRDD, that...accepts one or more location preferences (hostnames of Spark nodes) for each object.
makeRDD[T](seq: Seq[(T, Seq[String])]): RDD[T] Distribute a local Scala collection to form an RDD, with one or more location preferences (hostnames of Spark nodes) for each object. Create a new partition for each collection item.
In other words, rather than using parallelise you have to use makeRDD (which is available in Spark Core API for Scala, but am not sure about Python that I'm leaving as a home exercise for you :))
The same reasoning I'm applying to any other RDD operator / transformation that creates some sort of RDD.

Related

Creating Spark RDD or Dataframe from an External Source

I am building a substantial application in Java that uses Spark and Json. I anticipate that the application will process large tables, and I want to use Spark SQL to execute queries against those tables. I am trying to use a streaming architecture so that data flows directly from an external source into Spark RDDs and dataframes. I'm having two difficulties in building my application.
First, I want to use either JavaSparkContext or SparkSession to parallelize the data. Both have a method that accepts a Java List as input. But, for streaming, I don't want to create a list in memory. I'd rather supply either a Java Stream or an Iterator. I figured out how to wrap those two objects so that they look like a List, but it cannot compute the size of the list until after the data has been read. Sometimes this works, but sometimes Spark calls the size method before the entire input data has been read, which causes an unsupported operation exception.
Is there a way to create an RDD or a dataframe directly from a Java Stream or Iterator?
For my second issue, Spark can create a dataframe directly from JSON, which would be my preferred method. But, the DataFrameReader class has methods for this operation that require a string to specify a path. The nature of the path is not documented, but I assume that it represents a path in the file system or possibly a URL or URI (the documentation doesn't say how Spark resolves the path). For testing, I'd prefer to supply the JSON as a string, and in the production, I'd like the user to specify where the data resides. As a result of this limitation, I'm having to roll my own JSON deserialization, and it's not working because of issues related to parallelization of Spark tasks.
Can Spark read JSON from an InputStream or some similar object?
These two issues seem to really limit the adaptability of Spark. I sometimes feel that I'm trying to fill an oil tanker with a garden hose.
Any advice would be welcome.
Thanks for the suggestion. After a lot of work, I was able to adapt the example at github.com/spirom/spark-data-sources. It is not straightforward, and because the DataSourceV2 API is still evolving, my solution may break in a future iteration. The details are too intricate to post here, so if you are interested, please contact me directly.

When to use broadcast vs. addfile in Spark?

It looks like addfile and broadcast do similar things. How are they different? When should you use one vs. the other?
Broadcast is used for variables that you need in your code, it could be a static list that is required to be referred by each task, from the documentation of Broadcast Variables
Broadcast variables allow the programmer to keep a read-only variable
cached on each machine rather than shipping a copy of it with tasks.
They can be used, for example, to give every node a copy of a large
input dataset in an efficient manner. Spark also attempts to
distribute broadcast variables using efficient broadcast algorithms to
reduce communication cost.
Add file is used to make a file available on every node, it could be a jar file or a library that program refers.
Hoping this clarifies.
Cheers !
Addfile adds a file to the spark. It generally used to load local files to spark.
Where as broadcast is the concept of joining 2 datasets in spark. In case one of the RDD/DataFrame is small it can be broadcasted to all the executors so that it can do map join.
Addfile knows from where to load the file but in case of broadcast the underlying file could be distributed but the dataframe created on top of the distributed files could be small. This can be achieved through filtering / transformation.
Typical usecases:
Use Broadcast for variables / data.
Use addFile for libraries / custom code etc.
There are exceptions to both though, in cases there is huge data ( in my case it was ~900MB ) I used the addFile mechanism to pass on the file to all nodes and then load the data there. This proved to be working a bit faster as my file was already in gzipped format.

How does Apache Spark store lineages?

Apache spark claims it will store the lineages instead of the RDD's itself so that it can recompute in case of a failure. I am wondering how it stores the lineages? For example an RDD can be made of bunch of user provided transformation functions so does it store the "source code of those user provided functions" ?
Simplifying things a little bit RDDs are recursive data structures which describe lineages. Each RDD has a set of dependencies and it is computed in a specific context. Functions which are passed to Spark actions and transformations are first-class objects, can be stored, assigned, passed around and captured as a part of the closure and there is no reason (no to mention means) to store source code.
RDDs belong to the Driver and are not equivalent to the data. When data is accessed on the workers, RDDs are long gone and the only thing that matters is a given task.

How to combine small parquet files with Spark?

I have a Hive table that has a lot of small parquet files and I am creating a Spark data frame out of it to do some processing using SparkSQL. Since I have a large number of splits/files my Spark job creates a lot of tasks, which I don't want. Basically what I want is the same functionality that Hive provides, that is, to combine these small input splits into larger ones by specifying a max split size setting. How can I achieve this with Spark? I tried using the coalesce function, but I can only specify the number of partitions with it (I can only control the number of output files with it). Instead I really want some control over the (combined) input split size that a task processes.
Edit: I am using Spark itself, not Hive on Spark.
Edit 2: Here is the current code I have:
//create a data frame from a test table
val df = sqlContext.table("schema.test_table").filter($"my_partition_column" === "12345")
//coalesce it to a fixed number of partitions. But as I said in my question
//with coalesce I cannot control the file sizes, I can only specify
//the number of partitions
df.coalesce(8).write.mode(org.apache.spark.sql.SaveMode.Overwrite)
.insertInto("schema.test_table")
I have not tried but read it in getting started guide that setting this property should work "hive.merge.sparkfiles=true"
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
In case using Spark on Hive, than Spark's abstraction doesn't provide explicit split of data. However we can control the parallelism in several ways.
You can leverage DataFrame.repartition(numPartitions: Int) to explicitly control the number of partitions.
In case you are using Hive Context than ensure hive-site.xml contains the CombinedInputFormat. That may help.
For more info, take a look at following documentation about Spark data parallelism - http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism.

Actions/Transformations on multiple RDD's simultaneously in Spark

I am writing Spark application (Single client) and dealing with lots of small files upon whom I want to run an algorithm. The same algorithm for each one of them. But the files cannot be loaded into the same RDD for the algorithm to work, Because it should sort data within one file boundary.
Today I work on a file at a time, As a result I have poor resource utilization (Small amount of data each action, lots of overhead)
Is there any way to perform the same action/transformation on multiple RDD's simultaneously (And only using one driver program)? Or should I look for another platform? Because such mode of operation isn't classic for Spark.
If you use SparkContext.wholeTextFiles, then you could read the files into one RDD and each partition of the RDD would have the content of a single file. Then, you could work on each partition separately using SparkContext.mapPartitions(sort_file), where sort_file is the sorting function that you want to apply on each file. This would use concurrency better than your current solution, as long as your files are small enough that they can be processed in a single partition.

Resources