How does Apache Spark store lineages?

Apache Spark claims it stores lineages rather than the RDDs themselves so that it can recompute data in case of a failure. I am wondering how it stores a lineage. For example, an RDD can be built from a bunch of user-provided transformation functions, so does it store the "source code of those user-provided functions"?

Simplifying things a little bit, RDDs are recursive data structures which describe lineages. Each RDD has a set of dependencies and is computed in a specific context. Functions which are passed to Spark actions and transformations are first-class objects: they can be stored, assigned, passed around and captured as part of a closure, so there is no reason (not to mention no means) to store source code.
RDDs belong to the driver and are not equivalent to the data. When data is accessed on the workers, the RDDs are long gone and the only thing that matters is a given task.
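To illustrate (a minimal sketch, assuming a spark-shell-style session where sc is available; the variable names are mine): each RDD object keeps references to its parent RDDs and to the function objects it was built with, and that chain of references is the lineage.

val double: Int => Int = _ * 2        // an ordinary function object, not source code
val base   = sc.parallelize(1 to 10)
val mapped = base.map(double)         // this RDD holds a (cleaned) reference to `double`
val kept   = mapped.filter(_ > 5)     // and records `mapped` as its parent
// The parent RDDs are reachable through the dependencies of each RDD.
println(kept.dependencies.map(_.rdd).mkString(", "))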

Related

Creating Spark RDD or Dataframe from an External Source

I am building a substantial application in Java that uses Spark and JSON. I anticipate that the application will process large tables, and I want to use Spark SQL to execute queries against those tables. I am trying to use a streaming architecture so that data flows directly from an external source into Spark RDDs and dataframes. I'm having two difficulties in building my application.
First, I want to use either JavaSparkContext or SparkSession to parallelize the data. Both have a method that accepts a Java List as input. But, for streaming, I don't want to create a list in memory. I'd rather supply either a Java Stream or an Iterator. I figured out how to wrap those two objects so that they look like a List, but the wrapper cannot compute the size of the list until after the data has been read. Sometimes this works, but sometimes Spark calls the size method before the entire input has been read, which causes an unsupported operation exception.
Is there a way to create an RDD or a dataframe directly from a Java Stream or Iterator?
For my second issue, Spark can create a dataframe directly from JSON, which would be my preferred method. But the DataFrameReader class methods for this operation require a string that specifies a path. The nature of the path is not documented, but I assume that it represents a path in the file system or possibly a URL or URI (the documentation doesn't say how Spark resolves the path). For testing, I'd prefer to supply the JSON as a string, and in production, I'd like the user to specify where the data resides. As a result of this limitation, I'm having to roll my own JSON deserialization, and it's not working because of issues related to the parallelization of Spark tasks.
Can Spark read JSON from an InputStream or some similar object?
These two issues seem to really limit the adaptability of Spark. I sometimes feel that I'm trying to fill an oil tanker with a garden hose.
Any advice would be welcome.
Thanks for the suggestion. After a lot of work, I was able to adapt the example at github.com/spirom/spark-data-sources. It is not straightforward, and because the DataSourceV2 API is still evolving, my solution may break in a future iteration. The details are too intricate to post here, so if you are interested, please contact me directly.
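For the testing case mentioned above (supplying the JSON as a string), it may also be worth noting that since Spark 2.2 the DataFrameReader can read JSON from an in-memory Dataset of strings rather than from a path. A minimal Scala sketch (the Java API is analogous; the sample documents here are purely illustrative):

import spark.implicits._
// One JSON document per element; in practice these could come from any in-memory source.
val json = Seq("""{"name":"alice","age":30}""", """{"name":"bob","age":25}""")
val df = spark.read.json(json.toDS())   // accepts a Dataset[String] of JSON documents
df.show()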

Lazy loading of partitioned parquet in Apache Spark

As I understand it, Apache Spark uses lazy evaluation. So, for example, code like the following, which consists only of transformations, will do no actual processing:
val transformed_df = df.filter("some_field = 10").select("some_other_field", "yet_another_field")
Only when we do an "action" will any processing actually occur:
transformed_df.show()
I had been under the impression that load operations are also lazy in Spark. (See How spark loads the data into memory.)
However, my experience with Spark has not borne this out. When I do something like the following,
val df = spark.read.parquet("/path/to/parquet/")
execution seems to depend greatly on the size of the data in the path. In other words, it's not strictly lazy. This is inconvenient if the data is partitioned and I only need to look at a fraction of the partitions.
For example:
df.filter("partitioned_field = 10").show()
If the data is partitioned in storage on "partitioned_field", I would have expected Spark to wait until show() is called, and then read only the data under "/path/to/parquet/partitioned_field=10/". But again, this doesn't seem to be the case. Spark appears to perform at least some operations on all of the data as soon as read or load is called.
I could get around this by only loading /path/to/parquet/partitioned_field=10/ in the first place, but this is much less elegant than just calling "read" and filtering on the partitioned field, and it's harder to generalize.
Is there a more elegant preferred way to lazily load partitions of parquet data?
(To clarify, I am using Spark 2.4.3)
I think I've stumbled on an answer to my question while learning about a key distinction that is often overlooked when talking about lazy evaluation in Spark.
Data is lazily evaluated, but schemas are not. So if we are reading parquet, which is a structured data type, Spark does have to at least determine the schema of any files it's reading as soon as read() or load() is called. Calling read() on a large number of files will therefore take longer than on a small number of files.
Given that partitions are part of the schema, it's less surprising to me now that Spark has to look at all of the files in the path to determine the schema before filtering on a partition field.
It would be convenient for my purposes if Spark were to wait until schema evaluation was strictly necessary and could filter on partition fields prior to determining the rest of the schema, but it sounds like this is not the case. I believe Dataset objects must always have a schema, so I'm not sure there's a way around this problem without significant changes to Spark.
In conclusion, it seems like my only option currently is to pass in a list of paths for the partitions that I need rather than the base path if I want to avoid evaluating the schema over the entire data repository.
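For reference, a sketch of that workaround (the paths are placeholders): read the partition directories directly and, if the partition column should still appear in the resulting dataframe, point the reader at the base path via the basePath option.

// Read only the needed partition directories; Spark does not have to
// inspect the rest of the repository at read() time.
val df = spark.read
  .option("basePath", "/path/to/parquet/")            // keeps partitioned_field as a column
  .parquet("/path/to/parquet/partitioned_field=10/")  // one or more partition paths
df.filter("some_other_field > 0").show()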

Spark - do transformations also involve driver operations

My course notes have the following sentence: "RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset." But I think this is misleading because the transformation reduceByKey is performed locally on the workers and then on the driver as well (although the change does not take place until there's an action to be performed). Could you please correct me if I am wrong.
Here are the concepts:
In Spark, a transformation is an operation in which one RDD generates one or more new RDDs. RDDs are immutable, so any transformation on an RDD produces a new RDD, which is added to the DAG.
An action in Spark is a function that does not generate a new RDD; it produces a value of another type, such as a String or an Int, and the result is returned to the driver or written to another storage system.
Transformations are lazy in nature and nothing happens until an action is triggered.
reduceByKey is a transformation, since it generates an RDD from the input RDD; it is a wide transformation. Nothing happens in reduceByKey until an action is triggered.
reduce is an action, since it produces a non-RDD value.
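A small sketch of the distinction, assuming an existing SparkContext named sc:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
// Transformation: returns a new RDD; no job runs and nothing reaches the driver yet.
val summedByKey = pairs.reduceByKey(_ + _)
// Action: triggers a job on the workers and returns an Int to the driver.
val total = pairs.values.reduce(_ + _)
// Action: only now does the reduceByKey computation actually execute.
summedByKey.collect().foreach(println)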
As a matter of fact, the driver's first responsibility is managing the job. Moreover, RDD objects do not hold the data on the driver, so there is nothing there to run an action on; all intermediate results stay on the workers until an action's turn comes. This is what lazy execution means in Spark: at the start of execution the plan is reviewed up to the first action, and if no action is found the program produces nothing. Otherwise the whole program is executed against the input data, which lives as RDD partitions on the worker nodes, until the action is reached. During that period all of the data stays on the workers, and only the result, whose form depends on the type of the action, is sent to (or at least managed by) the driver.

How to control preferred locations of RDD partitions?

Is there a way to set the preferred locations of RDD partitions manually?
I want to make sure that a certain partition is computed on a certain machine.
I'm using an array and the parallelize method to create an RDD from it.
Also, I'm not using HDFS; the files are on the local disk. That's why I want to control which node does the execution.
Is there a way to set the preferredLocations of RDD partitions manually?
Yes, there is, but it's RDD-specific and so different kinds of RDDs have different ways to do it.
Spark uses RDD.preferredLocations to get a list of preferred locations to compute each partition/split on (e.g. block locations for an HDFS file).
final def preferredLocations(split: Partition): Seq[String]
Get the preferred locations of a partition, taking into account whether the RDD is checkpointed.
As you can see, the method is final, which means that no one can ever override it.
When you look at the source code of RDD.preferredLocations you will see how an RDD knows its preferred locations. It uses the protected RDD.getPreferredLocations method, which a custom RDD may (but does not have to) override to specify placement preferences.
protected def getPreferredLocations(split: Partition): Seq[String] = Nil
So, now the question has "morphed" into another about what are the RDDs that allow for setting their preferred locations. Find yours and see the source code.
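For illustration, a minimal custom RDD that overrides getPreferredLocations could look roughly like this (just a sketch; the class name, data layout and hostnames are all hypothetical):

import scala.reflect.ClassTag
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

case class PinnedPartition(index: Int) extends Partition

// Each inner Seq becomes one partition, pinned to the host at the same index (modulo).
class PinnedRDD[T: ClassTag](sc: SparkContext, data: Seq[Seq[T]], hosts: Seq[String])
  extends RDD[T](sc, Nil) {
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](data.length)(i => PinnedPartition(i))
  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    data(split.index).iterator
  // The placement preference the scheduler consults for each partition.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(hosts(split.index % hosts.length))
}

As with parallelize, the data here still originates on the driver and is shipped with the tasks; the sketch is only meant to show where the placement-preference hook lives.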
I'm using an array and the parallelize method to create an RDD from it.
If you parallelize your local dataset it is not distributed to begin with (though it can become so), but... why would you want to use Spark for something you can process locally on a single computer/node?
If, however, you insist and do really want to use Spark for local datasets, the RDD behind SparkContext.parallelize is... let's have a look at the source code... ParallelCollectionRDD, which does allow for location preferences.
Let's then rephrase your question to the following (hoping I won't lose any important fact):
What are the operators that allow for creating a ParallelCollectionRDD and specifying the location preferences explicitly?
To my great surprise (as I didn't know about the feature), there is such an operator, i.e. SparkContext.makeRDD, that...accepts one or more location preferences (hostnames of Spark nodes) for each object.
makeRDD[T](seq: Seq[(T, Seq[String])]): RDD[T] Distribute a local Scala collection to form an RDD, with one or more location preferences (hostnames of Spark nodes) for each object. Create a new partition for each collection item.
In other words, rather than using parallelize you have to use makeRDD (which is available in the Spark Core API for Scala; I am not sure about Python, which I'm leaving as a home exercise for you :))
The same reasoning applies to any other RDD operator / transformation that creates some sort of RDD.
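A short sketch of makeRDD with per-element location preferences (the hostnames are placeholders; assumes an existing SparkContext named sc):

val dataWithPreferences = Seq(
  (1, Seq("host-a.example.com")),                       // prefer computing this element on host-a
  (2, Seq("host-b.example.com")),
  (3, Seq("host-a.example.com", "host-b.example.com"))  // either host is fine
)
val rdd = sc.makeRDD(dataWithPreferences)
// One partition per element; each reports its preferred hosts to the scheduler.
rdd.partitions.foreach(p => println(rdd.preferredLocations(p)))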

How do Spark RDDs achieve fault tolerance?

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. But I could not find a description of the internal mechanism by which an RDD achieves fault tolerance. Could somebody describe this mechanism? Thanks.
Let me explain in very simple terms, as I understand it.
Faults in a cluster can happen when one of the nodes processing data crashes. In Spark terms, an RDD is split into partitions, and each node (called an executor) operates on a partition at any point in time. (In practice, each executor can be assigned multiple tasks, depending on the number of cores assigned to the job versus the number of partitions in the RDD.)
What is really happening during such an operation is a series of Scala functions (called transformations and actions in Spark terms, depending on whether the function is pure or side-effecting) executing on a partition of the RDD. These operations are composed together, and the Spark execution engine views them as a Directed Acyclic Graph (DAG) of operations.
Now suppose a particular node crashes in the middle of operation Z, which depended on operation Y, which in turn depended on operation X. The cluster manager (YARN/Mesos) finds out that the node is dead and assigns another node to continue processing. This node is told which partition of the RDD to operate on and which series of operations X -> Y -> Z (called the lineage) it has to execute, by passing in the Scala closures created from the application code. The new node can then happily continue processing and there is effectively no data loss.
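A tiny sketch of such a lineage (X, Y and Z here are arbitrary stand-ins, the path is a placeholder, and sc is an existing SparkContext): the chain of transformations below is exactly what would be replayed for a lost partition, and toDebugString lets you inspect it.

val x = sc.textFile("/path/to/input")           // operation X: read the source data
val y = x.map(_.toUpperCase)                    // operation Y: depends on X
val z = y.filter(_.startsWith("ERROR"))         // operation Z: depends on Y
// If a partition of z is lost, Spark recomputes just that partition by
// replaying X -> Y -> Z from the source. The recorded lineage:
println(z.toDebugString)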
Spark also uses this mechanism to guarantee exactly-once processing, with the caveat that any side-effecting operation you perform, such as calling a database inside an action, can be invoked multiple times. But if you view your transformations as pure functional mappings from one RDD to another, then you can rest assured that the resulting RDD will have the elements from the source RDD processed only once.
The domain of fault tolerance in Spark is vast and needs a much longer explanation. I am hoping to see others come up with technical details on how this is implemented. Thanks for the great topic, though.
