I am quite new to PySpark. In my PySpark application, I want to achieve the following:
Create an RDD from a Python list and split it into several partitions.
Now use rdd.mapPartitions(func)
Here, the function func performs an iterative operation: it reads the content of a saved file into a local variable (e.g. a NumPy array), performs some updates using the RDD partition's data, and then saves the variable's content back to some common file system.
I am not able to figure out how to read and write, from inside a worker process, a variable that is accessible to all processes.
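Below is a minimal PySpark sketch of the setup described above. The shared path, the initialization, and the update rule are hypothetical placeholders; /mnt/shared stands for a file system that the driver and every worker can reach.

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="mapPartitions-shared-state")
SHARED_PATH = "/mnt/shared/state.npy"  # placeholder for the common file system

# Initialize the shared state once, from the driver.
np.save(SHARED_PATH, np.zeros(1))

def update_state(partition):
    # Read the saved file into a local variable.
    state = np.load(SHARED_PATH)
    # Update it using this partition's data (here simply its sum).
    state = state + sum(partition)
    # Save the variable's content back to the common file system.
    np.save(SHARED_PATH, state)
    yield float(state.sum())

rdd = sc.parallelize(range(100), numSlices=4)
print(rdd.mapPartitions(update_state).collect())

Note that partitions run concurrently, so several workers writing the same file will race with each other; in practice you would need a lock around the file, per-partition output files merged afterwards, or an accumulator/driver-side reduce instead of a single shared file.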
Related
If I create a DataFrame like so:
val usersDF = spark.read.csv("examples/src/main/resources/users.csv")
Does Spark actually load (/copy) the data from the CSV file into memory, or onto the underlying filesystem as a distributed dataset?
I ask because, after loading the DataFrame, any change in the underlying file's data is not reflected in queries against the DataFrame (unless, of course, the DataFrame is freshly reloaded by invoking the above line of code).
I am using interactive queries on Databricks notebooks.
Until you perform an action on that DataFrame, the file does not get loaded into memory; you will keep seeing the file's current contents right up to the point where an action in the execution plan loads it into memory.
And if an action has already been run and the DataFrame has been cached, then you will keep seeing the cached result of the first execution, as long as it fits in memory, regardless of later changes to the file.
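A minimal PySpark sketch of that behaviour (the path is the one from the question; spark is the SparkSession a Databricks notebook already provides):

df = spark.read.csv("examples/src/main/resources/users.csv")  # lazy: nothing is read yet

df.cache()   # ask Spark to keep the data in memory once it has been computed
df.count()   # first action: the file is actually read here (and the result cached)

# Edits made to users.csv after this point are not reflected in queries on df,
# because df now serves results from the cached data.
df.show()

# To pick up changes in the underlying file, read it again:
df = spark.read.csv("examples/src/main/resources/users.csv")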
Is there a way to set the preferred locations of RDD partitions manually?
I want to make sure a certain partition is computed on a certain machine.
I'm using an array and the parallelize method to create an RDD from it.
Also, I'm not using HDFS; the files are on local disk. That's why I want to control the execution node.
Is there a way to set the preferredLocations of RDD partitions manually?
Yes, there is, but it's RDD-specific and so different kinds of RDDs have different ways to do it.
Spark uses RDD.preferredLocations to get a list of preferred locations to compute each partition/split on (e.g. block locations for an HDFS file).
final def preferredLocations(split: Partition): Seq[String]
Get the preferred locations of a partition, taking into account whether the RDD is checkpointed.
As you can see, the method is final, which means that no one can ever override it.
When you look at the source code of RDD.preferredLocations, you will see how an RDD knows its preferred locations. It uses the protected RDD.getPreferredLocations method, which a custom RDD may (but does not have to) override to specify placement preferences.
protected def getPreferredLocations(split: Partition): Seq[String] = Nil
So the question has now "morphed" into another one: which RDDs allow setting their preferred locations? Find yours and check its source code.
I'm using an array and the parallelize method to create an RDD from it.
If you parallelize your local dataset it does become distributed, but...why would you want to use Spark for something you can process locally on a single computer/node?
If, however, you insist and really do want to use Spark for local datasets, the RDD behind SparkContext.parallelize is...let's have a look at the source code...ParallelCollectionRDD, which does allow for location preferences.
Let's then rephrase your question to the following (hoping I won't lose any important fact):
What are the operators that allow for creating a ParallelCollectionRDD and specifying the location preferences explicitly?
To my great surprise (as I didn't know about the feature), there is such an operator, i.e. SparkContext.makeRDD, that...accepts one or more location preferences (hostnames of Spark nodes) for each object.
makeRDD[T](seq: Seq[(T, Seq[String])]): RDD[T] Distribute a local Scala collection to form an RDD, with one or more location preferences (hostnames of Spark nodes) for each object. Create a new partition for each collection item.
In other words, rather than using parallelize you have to use makeRDD (which is available in the Spark Core API for Scala; I'm not sure about Python, so I'm leaving that as a home exercise for you :))
The same reasoning applies to any other RDD operator/transformation that creates some sort of RDD.
I need to create a Spark RDD (or DataFrame, either is fine) by repeatedly calling a custom function that generates records one by one. Is that possible?
There is no file I can read from because I am interfacing with another system that manages a complex pipeline to produce the records, AND the file generated would be too big anyway (hundreds of TB) for us to consider persisting to disk.
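One possible pattern, sketched below with hypothetical names: parallelize a collection of "work units" and let each task call the record-generating function itself, so records are produced on the executors and never materialized in one place.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("generated-records").getOrCreate()
sc = spark.sparkContext

def generate_records(work_unit):
    # Placeholder for the call into the other system that yields records one by one.
    for i in range(1000):
        yield (work_unit, i)

work_units = sc.parallelize(range(10000), numSlices=200)
records = work_units.flatMap(generate_records)               # an RDD of generated records
df = spark.createDataFrame(records, ["work_unit", "value"])  # or keep it as an RDD

Whether this fits depends on whether the generating function can be called from inside an executor; if the external system can only be driven from one place, the records have to be streamed or piped in instead.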
I need to call an external process from my EMR Spark job. I see that rdd.pipe would allow me to pipe an RDD to a process. (As an aside, is that one process per RDD, or one per element?).
However, my external process requires a filename as input and generates a file as output.
How can I invoke this external process and subsequently load the output file as an RDD?
is that one process per RDD, or one per element?
Neither. It is a process per partition.
process requires a filename as input and generates a file as output. How can
The simplest solution is to write a small wrapper which writes the input to a randomly generated path, invokes your program, reads the resulting file and writes it to stdout; that is pretty much all pipe is about. Unless you write to a distributed file system, you wouldn't be able to retrieve the output otherwise.
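A minimal sketch of such a wrapper, assuming the external binary (called my_external_tool here, a placeholder) takes an input filename and an output filename:

# wrapper.py -- rdd.pipe("python3 wrapper.py") starts one copy per partition,
# feeds the partition's elements on stdin and collects whatever is printed to stdout.
import os
import subprocess
import sys
import tempfile
import uuid

# Write this partition's records to a randomly generated input path.
in_path = os.path.join(tempfile.gettempdir(), "pipe-in-" + uuid.uuid4().hex)
out_path = os.path.join(tempfile.gettempdir(), "pipe-out-" + uuid.uuid4().hex)
with open(in_path, "w") as f:
    for line in sys.stdin:
        f.write(line)

# Invoke the external program on the input file.
subprocess.run(["my_external_tool", in_path, out_path], check=True)

# Stream the output file back to Spark via stdout, then clean up.
with open(out_path) as f:
    for line in f:
        sys.stdout.write(line)
os.remove(in_path)
os.remove(out_path)

On the driver side you would then call something like result_rdd = rdd.pipe("python3 wrapper.py"), after making sure the wrapper script is available on every executor (e.g. shipped with --files).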
I am writing a Spark application (single client) dealing with lots of small files on which I want to run an algorithm, the same algorithm for each of them. But the files cannot be loaded into the same RDD for the algorithm to work, because it has to sort data within one file's boundary.
Today I work on one file at a time; as a result I have poor resource utilization (a small amount of data per action, lots of overhead).
Is there any way to perform the same action/transformation on multiple RDDs simultaneously (using only one driver program)? Or should I look for another platform, since this mode of operation isn't classic for Spark?
If you use SparkContext.wholeTextFiles, you can read all the files into one RDD, where each record holds the full content of a single file as a (filename, content) pair. Then you can work on each file separately using rdd.map(sort_file), where sort_file is the sorting function you want to apply to each file's content. This uses concurrency better than your current solution, as long as each file is small enough to be processed in a single task.
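A minimal PySpark sketch of that approach; the input path is a placeholder and sort_file stands in for your per-file algorithm:

from pyspark import SparkContext

sc = SparkContext(appName="per-file-sort")

def sort_file(record):
    path, content = record
    # The whole file arrives as a single string, so the sort never crosses a file boundary.
    return path, "\n".join(sorted(content.splitlines()))

# wholeTextFiles yields one (filename, content) record per file.
files = sc.wholeTextFiles("file:///data/small_files")
sorted_files = files.map(sort_file)
sorted_files.saveAsTextFile("file:///data/sorted_output")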