Share a single date amongst Spark nodes - apache-spark

I'd like to run a Spark job that outputs to a directory containing the day on which the job started. Is there a way to share a single date object (joda.time, for example) across all Spark nodes, so that no matter which node outputs which pipe, they all output to the same directory structure?

If the question is "Is there a way to share a single date object (joda.time for example) in all spark nodes", then naturally the answer is "broadcast the object".
If the real question is how to specify the output path, then you do not really need to broadcast the path. You can just call rdd.saveAsTextFile(path) and the function will automatically dump each partition into its own file (named part-00000 and so on). Of course, all worker nodes must have access to the location specified by the path variable, so in a real cluster it has to be HDFS, S3, NFS or the like.
From documentation:
saveAsTextFile(path)
Write the elements of the dataset as a text file (or set of text
files) in a given directory in the local filesystem, HDFS or any other
Hadoop-supported file system. Spark will call toString on each element
to convert it to a line of text in the file.
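For the original goal of a per-run output directory, here is a minimal PySpark sketch (assuming an existing RDD named rdd and a hypothetical HDFS base path): the date string is built once on the driver, so every partition lands under the same directory.
from datetime import date

run_day = date.today().isoformat()  # e.g. "2024-05-01", computed once on the driver
rdd.saveAsTextFile("hdfs:///jobs/output/" + run_day)  # writes part-00000, part-00001, ... under one dir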

Simply create the object in your driver program (as a val) and close over it where you need it. It should be copied over to the worker nodes to be used as you need.
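A brief illustration of this "close over it" approach (assuming a SparkContext named sc; the names are placeholders, and sc.broadcast(start_date) would work equally well if you prefer the explicit broadcast from the answer above):
from datetime import date

start_date = date.today().isoformat()  # created once in the driver program

# The lambda closes over start_date, so the same value is shipped to every worker.
tagged = sc.parallelize(["a", "b", "c"]).map(lambda x: (start_date, x))
print(tagged.collect())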

Related

Loop through RDD elements, read its content for further processing

I have a folder that contains n number of files.
I am creating an RDD that contains all the filenames of above folder with the code below:
fnameRDD = spark.read.text(filepath).select(input_file_name()).distinct().rdd
I want to iterate through these RDD elements and perform the following steps:
Read the content of each element (each element is a file path, so the content needs to be read through the SparkContext)
The content read above should become another RDD, which I want to pass as an argument to a function
Perform certain steps on the RDD passed as an argument inside the called function
I already have a function written with steps that I've tested for a single file, and it works fine.
But I've tried various things syntactically to do the first 2 steps, and I just get invalid syntax every time.
I know I am not supposed to use map(), since I want to read a file in each iteration, which requires sc, but map() is executed inside the worker nodes, where sc can't be referenced.
Also, I know I can use wholeTextFiles() as an alternative, but that means I'll be holding the text of all the files in memory throughout the process, which doesn't seem efficient to me.
I am open to suggestions for different approaches as well.
There are possibly other, more efficient ways to do it but assuming you already have a function SomeFunction(df: DataFrame[value: string]), the easiest would be to use toLocalIterator() on your fnameRDD to process one file at a time. For example:
for x in fnameRDD.toLocalIterator():
    fileContent = spark.read.text(x[0])
    # fileContent.show(truncate=False)
    SomeFunction(fileContent)
A couple of thoughts regarding efficiency:
Unlike .collect(), .toLocalIterator() brings data to driver memory one partition at a time. But in your case, after you call .distinct(), all the data will reside in a single partition, and so will be moved to the driver all at once. Hence, you may want to add .repartition(N) after .distinct(), to break that single partition into N smaller ones and avoid the need for a large heap on the driver. (Of course, this is only relevant if your list of input files is REALLY long.)
The method used to list the file names itself seems less than efficient. Perhaps you'd want to consider something more direct, for example using the FileSystem API as in this article (see the sketch below).
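A rough sketch of that more direct listing (PySpark, using the Hadoop FileSystem API through the JVM gateway; filepath is assumed to be the same directory as above, and SomeFunction is the existing function from the question):
hadoop_path = spark._jvm.org.apache.hadoop.fs.Path(filepath)
fs = hadoop_path.getFileSystem(spark._jsc.hadoopConfiguration())
files = fs.listFiles(hadoop_path, True)  # True = recurse into subdirectories

file_names = []
while files.hasNext():
    file_names.append(files.next().getPath().toString())

for name in file_names:
    SomeFunction(spark.read.text(name))  # one file at a time, no RDD of names needed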
I believe you're looking for recursive file lookup,
spark.read.option("recursiveFileLookup", "true").text(filepathroot)
If you point this at the root directory of your files, Spark will traverse the directory and pick up all the files that sit under the root and its child folders. This reads the files into a single dataframe.

Spark tagging file names for purpose of possible later deletion/rollback?

I am using Spark 2.4 in AWS EMR.
I am using Pyspark and SparkSQL for my ELT/ETL and using DataFrames with Parquet input and output on AWS S3.
As of Spark 2.4, as far as I know, there is no way to tag or customize the file names of output (Parquet) files. Please correct me if I'm wrong.
When I store Parquet output files on S3, I end up with file names that look like this:
part-43130-4fb6c57e-d43b-42bd-afe5-3970b3ae941c.c000.snappy.parquet
The middle part of the file name looks like it has an embedded GUID/UUID:
part-43130-4fb6c57e-d43b-42bd-afe5-3970b3ae941c.c000.snappy.parquet
I would like to know if I can obtain this GUID/UUID value from PySpark or SparkSQL at run time, so I can log/save/display it in a text file.
I need to log this GUID/UUID value because I may need to remove the files carrying it in their names later, for manual rollback purposes (for example, I may discover a day or a week later that this data is somehow corrupt and needs to be deleted, so all files tagged with this GUID/UUID can be identified and removed).
I know that I can partition the table manually on a GUID column, but then I end up with too many partitions, which hurts performance. What I need is to somehow tag the files for each data load job, so I can identify and delete them easily from S3; hence a GUID/UUID value seems like one possible solution.
Open for any other suggestions.
Thank you
Is this with the new "S3A-specific committer"? If so, it means they're using Netflix's code/trick of putting a GUID in each file written so as to avoid eventual-consistency problems. That doesn't help much here, though.
Consider offering a patch to Spark which lets you add a specific prefix to a file name.
Or, for Apache Hadoop & Spark (i.e. not EMR), an option for the S3A committers to put that prefix in when they generate temporary filenames.
Short term: well, you can always list the before-and-after state of the directory tree (tip: use FileSystem.listFiles(path, recursive) for speed), and either remember the new files or rename them (which will be slow; remembering the new filenames is better).
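A rough PySpark sketch of that "before and after" listing (out_dir and df are placeholders for your output path and DataFrame):
def list_paths(spark, dir_path):
    jpath = spark._jvm.org.apache.hadoop.fs.Path(dir_path)
    fs = jpath.getFileSystem(spark._jsc.hadoopConfiguration())
    it = fs.listFiles(jpath, True)  # recursive listing, as suggested above
    paths = set()
    while it.hasNext():
        paths.add(it.next().getPath().toString())
    return paths

before = list_paths(spark, out_dir)
df.write.mode("append").parquet(out_dir)      # the actual job output
added = list_paths(spark, out_dir) - before   # log these for a possible rollback later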
Spark already writes files with a UUID in their names. Instead of creating too many partitions you can set up custom file naming (e.g. add some id). Maybe this is a solution for you: https://stackoverflow.com/a/43377574/1251549
Not tried yet (but planning to): https://github.com/awslabs/amazon-s3-tagging-spark-util
In theory, you can tag with the job id (or whatever) and then run something against those tags.
Both solutions lead to performing multiple S3 list-objects API requests, checking tags/file names, and deleting the files one by one.
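A hedged sketch of that cleanup step with boto3 (the bucket, prefix, and logged GUID/UUID are placeholders): list the objects under the table prefix, match the job's UUID in the file name, and delete the matches one by one.
import boto3

s3 = boto3.client("s3")
bucket, prefix = "my-bucket", "warehouse/my_table/"
job_uuid = "4fb6c57e-d43b-42bd-afe5-3970b3ae941c"  # the value logged at load time

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if job_uuid in obj["Key"]:
            s3.delete_object(Bucket=bucket, Key=obj["Key"])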

What are the differences between sc.parallelize and sc.textFile?

I am new to Spark. Can someone please clear up my doubt?
Let's assume below is my code:
a = sc.textFile(filename)
b = a.filter(lambda x: len(x)>0 and x.split("\t").count("111"))
c = b.collect()
I hope the below is what happens internally (please correct me if my understanding is wrong):
(1) variable a will be saved as an RDD variable containing the expected txt file content
(2) The driver node breaks up the work into tasks and each task contains information about the split of the data it will operate on.
Now these Tasks are assigned to worker nodes.
(3) when the collect action (i.e. collect() in our case) is invoked, the results will be returned to the master from the different nodes, and saved as a local variable c.
Now I want to understand what difference below code makes:
a = sc.textFile(filename).collect()
b = sc.parallelize(a).filter(lambda x: len(x)>0 and x.split("\t").count("111"))
c = b.collect()
Could someone please clarify?
(1) variable a will be saved as an RDD variable containing the expected txt file content
(Highlighting mine) Not really. The line just describes what will happen after you execute an action, i.e. the RDD variable does not contain the expected txt file content.
The RDD describes the partitions that, when an action is called, become tasks that will read their parts of the input file.
(2) The driver node breaks up the work into tasks and each task contains information about the split of the data it will operate on. Now these Tasks are assigned to worker nodes.
Yes, but only when an action is called, which is c = b.collect() in your case.
(3) when the collect action (i.e. collect() in our case) is invoked, the results will be returned to the master from the different nodes, and saved as a local variable c.
YES! That's the most dangerous operation memory-wise since all the Spark executors running somewhere in the cluster start sending data back to the driver.
Now I want to understand what difference below code makes
Quoting the documentation of sc.textFile:
textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.
Quoting the documentation of sc.parallelize:
parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T]
Distribute a local Scala collection to form an RDD.
The difference is in the datasets: files (for textFile) versus a local collection (for parallelize). Otherwise both do the same thing under the covers, i.e. they both build a description of how to access the data that is going to be processed using transformations and an action.
The main difference is therefore the source of the data.
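A small side-by-side sketch (hypothetical file data.txt, SparkContext sc): the transformations and action are identical, only the data source differs.
rdd_from_file = sc.textFile("data.txt")               # source: a file, read lazily when an action runs
rdd_from_list = sc.parallelize(["111\t1", "222\t2"])  # source: an in-memory Python list, distributed

print(rdd_from_file.filter(lambda x: len(x) > 0 and x.split("\t").count("111")).collect())
print(rdd_from_list.filter(lambda x: len(x) > 0 and x.split("\t").count("111")).collect())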

How to create an RDD with the whole content of files as values?

I have a directory with many files, and I want to create an RDD whose values are the contents of each file. How can I do that?
You can use the SparkContext.wholeTextFiles method, which reads:
a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
Just keep in mind that individual files have to fit into worker memory, and generally speaking it is less efficient than using textFile.
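A short illustration (assuming a SparkContext sc and a hypothetical directory hdfs:///data/docs of text files): each element is a (path, whole file content) pair.
files_rdd = sc.wholeTextFiles("hdfs:///data/docs")
contents_rdd = files_rdd.values()   # keep only the whole-file contents
print(files_rdd.keys().take(5))     # a few of the file paths, for reference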

How to refer to the local filesystem where spark-submit is executed on?

Is it possible to write the output of a Spark program's result on the driver node when it is processed in a cluster?
df = sqlContext.read.load("hdfs://....")
result = df.groupby('abc','cde').count()
result.write.save("hdfs:...resultfile.parquet", format="parquet") # this works fine
result = result.collect()
with open("<my drivernode local directory>/textfile", "w") as myfile:
    myfile.write(result) # I'll convert to a python object before writing
Could someone give me some idea of how to refer to the local filesystem where I ran spark-submit?
tl;dr Use . (the dot); the current working directory is resolved by the API.
From what I understand from your question, you are asking about saving local files in driver or workers while running spark.
This is possible and is quite straightforward.
The point is that, in the end, the driver and workers are running Python, so you can use Python's open, with, write, and so on.
To do this on the workers you'll need to run foreach or map on your RDD and then save locally (this can be tricky, as you may have more than one partition on each executor).
Saving from the driver is even easier: after you have collected the data you have a regular Python object, and you can save it in any standard pythonic way.
BUT
When you save any local file, be it on a worker or the driver, that file is created inside the container that the worker or driver is running in. Once the execution is over those containers are deleted, and you will not be able to access any local data stored in them.
The way to solve this is to move those local files somewhere else while the container is still alive. You can do this with a shell command, by inserting into a database, and so on.
For example, I use this technique to insert the results of calculations into MySQL without the need to do a collect. I save results locally on the workers as part of a map operation and then upload them using MySQL's LOAD DATA LOCAL INFILE.
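A hedged sketch of that worker-side pattern (assuming an RDD of tuples named rdd, a MySQL table results with matching columns, mysql-connector-python available on the executors, and local_infile enabled on the server; host and credentials are placeholders):
import csv
import tempfile
import mysql.connector

def save_partition(rows):
    rows = list(rows)
    if not rows:
        return
    # Write this partition to a local temp file inside the executor's container.
    with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as f:
        csv.writer(f).writerows(rows)
        local_path = f.name
    # Push the file into MySQL before the container (and its local disk) disappears.
    conn = mysql.connector.connect(host="db-host", user="user", password="pw",
                                   database="mydb", allow_local_infile=True)
    cur = conn.cursor()
    cur.execute("LOAD DATA LOCAL INFILE '{}' INTO TABLE results "
                "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'".format(local_path))
    conn.commit()
    conn.close()

rdd.foreachPartition(save_partition)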
