What are the differences between sc.parallelize and sc.textFile? - apache-spark

I am new to Spark. Can someone please clear up my doubt?
Let's assume the below is my code:
a = sc.textFile(filename)
b = a.filter(lambda x: len(x)>0 and x.split("\t").count("111"))
c = b.collect()
I hope the below is what happens internally (please correct me if my understanding is wrong):
(1) variable a will be saved as a RDD variable containing the expected txt file content
(2) The driver node breaks up the work into tasks and each task contains information about the split of the data it will operate on.
Now these Tasks are assigned to worker nodes.
(3) When the collect action (i.e. collect() in our case) is invoked, the results will be returned to the master from different nodes and saved as a local variable c.
Now I want to understand what difference the code below makes:
a = sc.textFile(filename).collect()
b = sc.parallelize(a).filter(lambda x: len(x)>0 and x.split("\t").count("111"))
c = b.collect()
Could someone please clarify?

(1) variable a will be saved as a RDD variable containing the expected txt file content
(Highlighting mine) Not really. The line just describes what will happen after you execute an action, i.e. the RDD variable does not contain the expected txt file content.
The RDD describes the partitions that, when an action is called, become tasks that will read their parts of the input file.
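For example (a minimal PySpark sketch; the file name here is made up), simply defining the RDD does not read anything, only an action does:
rdd = sc.textFile("some_nonexistent_file.txt")  # returns immediately, nothing is read yet
# rdd.count()  # only this action would trigger reading the file
#              # (and would fail here, since the file does not exist)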
(2) The driver node breaks up the work into tasks and each task contains information about the split of the data it will operate on. Now these Tasks are assigned to worker nodes.
Yes, but only when an action is called, which is c = b.collect() in your case.
(3) When the collect action (i.e. collect() in our case) is invoked, the results will be returned to the master from different nodes and saved as a local variable c.
YES! That's the most dangerous operation memory-wise since all the Spark executors running somewhere in the cluster start sending data back to the driver.
Now I want to understand what difference the code below makes
Quoting the documentation of sc.textFile:
textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.
Quoting the documentation of sc.parallelize:
parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T] Distribute a local Scala collection to form an RDD.
The difference is in the datasets: files (for textFile) versus a local collection (for parallelize). Both do the same thing under the covers, i.e. they build a description of how to access the data that is going to be processed using transformations and an action.
The main difference is therefore the source of the data.
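As a quick illustration (a minimal PySpark sketch; the file name and sample lines are assumptions), both calls produce an RDD of strings and are processed the same way afterwards; only the source differs:
# RDD backed by a file: partitions correspond to splits of the file, read on the executors
rdd_from_file = sc.textFile("data.txt")
# RDD backed by a local Python collection that already sits in driver memory
rdd_from_list = sc.parallelize(["line one", "line two", ""])
print(rdd_from_file.filter(lambda x: len(x) > 0).count())
print(rdd_from_list.filter(lambda x: len(x) > 0).count())
Note that the second snippet in the question (collect() followed by parallelize()) first pulls the entire file into driver memory and then ships it back out to the executors, which is usually wasteful compared to letting textFile read the splits directly on the executors.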

Related

Loop through RDD elements, read its content for further processing

I have a folder that contains n files.
I am creating an RDD that contains all the filenames of above folder with the code below:
from pyspark.sql.functions import input_file_name
fnameRDD = spark.read.text(filepath).select(input_file_name()).distinct().rdd
I want to iterate through these RDD elements and perform the following steps:
Read the content of each element (each element is a filepath, so the content needs to be read through the SparkContext)
The above content should be another RDD, which I want to pass as an argument to a function
Perform certain steps on the RDD passed as an argument inside the called function
I already have a function written with steps that I've tested for a single file, and it works fine.
But I've tried various things syntactically to do the first 2 steps, and I just get invalid syntax every time.
I know I am not supposed to use map(), since reading a file in each iteration requires sc, but map is executed on the worker nodes, where sc can't be referenced.
Also, I know I can use wholeTextFiles() as an alternative, but that means I'll be holding the text of all the files in memory throughout the process, which doesn't seem efficient to me.
I am open to suggestions for different approaches as well.
There are possibly other, more efficient ways to do it but assuming you already have a function SomeFunction(df: DataFrame[value: string]), the easiest would be to use toLocalIterator() on your fnameRDD to process one file at a time. For example:
for x in fnameRDD.toLocalIterator():
    fileContent = spark.read.text(x[0])
    # fileContent.show(truncate=False)
    SomeFunction(fileContent)
A couple of thoughts regarding efficiency:
Unlike .collect(), .toLocalIterator() brings data to driver memory one partition at a time. But in your case, after you call .distinct(), all the data will reside in a single partition, and so will be moved to the driver all at once. Hence, you may want to add .repartition(N) after .distinct(), to break that single partition into N smaller ones, and avoid the need to have large heap on the driver. (Of course, this is only relevant if your list of input files is REALLY long.)
The method to list file names itself seems to be less than efficient. Perhaps you'd want to consider something more direct, using FileSystem API for example like in this article.
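For the second point, here is a hedged sketch of going through the Hadoop FileSystem API from PySpark (this uses PySpark's internal JVM gateway, and the folder path is an assumption):
# List the file paths without running a Spark job, via the Hadoop FileSystem API
hadoop_conf = spark._jsc.hadoopConfiguration()
Path = spark._jvm.org.apache.hadoop.fs.Path
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
file_paths = [status.getPath().toString()
              for status in fs.listStatus(Path("/path/to/folder"))
              if status.isFile()]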
I believe you're looking for recursive file lookup,
spark.read.option("recursiveFileLookup", "true").text(filepathroot)
If you point this to the root directory of your files, Spark will traverse the directory and pick up all the files that sit under the root and child folders. This will read the files into a single DataFrame.
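If you still need to know which file each row came from after such a single-DataFrame read, one option (a small sketch, reusing the filepathroot from above) is to attach input_file_name() as a column:
from pyspark.sql.functions import input_file_name

df = (spark.read
      .option("recursiveFileLookup", "true")
      .text(filepathroot)
      .withColumn("source_file", input_file_name()))
# you can then group or filter per source_file instead of looping file by file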

How can I append to same file in HDFS(spark 2.11)

I am trying to store streaming data into HDFS using Spark Streaming, but it keeps creating new files instead of appending to one single file or a few files.
If it keeps creating n files, I feel it won't be very efficient.
HDFS file system:
Code
lines.foreachRDD(f => {
  if (!f.isEmpty()) {
    val df = f.toDF().coalesce(1)
    df.write.mode(SaveMode.Append).json("hdfs://localhost:9000/MT9")
  }
})
In my pom I am using the following dependencies:
spark-core_2.11
spark-sql_2.11
spark-streaming_2.11
spark-streaming-kafka-0-10_2.11
As you already realized, Append in Spark means write-to-existing-directory, not append-to-file.
This is intentional and desired behavior (think about what would happen if a process failed in the middle of "appending", even if the format and file system allowed it).
Operations like merging files should be applied by a separate process, if necessary at all, which ensures correctness and fault tolerance. Unfortunately this requires a full copy, which, for obvious reasons, is not desirable on a batch-to-batch basis.
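A hedged sketch of such a separate compaction step (shown in PySpark for brevity; the output path and the single-file choice are assumptions), run as its own batch job after or periodically alongside the streaming job:
# Read the many small JSON part files written by the streaming job
small_files = spark.read.json("hdfs://localhost:9000/MT9")
# Rewrite them as a small number of larger files in a separate directory
small_files.coalesce(1).write.mode("overwrite").json("hdfs://localhost:9000/MT9_compacted")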
It's creating a file for each RDD because the DataFrame variable is reinitialised every time. I would suggest declaring a DataFrame variable outside the loop (initially null/empty), unioning it with each RDD's DataFrame inside the loop, and writing out the accumulated DataFrame once after the loop.

Spark data manipulation with wholeTextFiles

I have 20k compressed files of ~2MB to manipulate in Spark. My initial idea was to use wholeTextFiles() so that I get filename -> content tuples. This is useful because I need to maintain this kind of pairing (the processing is done on a per-file basis, with each file representing a minute of gathered data). However, whenever I need to map/filter/etc. the data and maintain this filename -> content association, the code gets ugly (and perhaps not efficient?), i.e.
data.map(lambda pair: (pair[0], change_somehow(pair[1])))
The data itself, i.e. the content of each file, would be nice to read as a separate RDD because it contains tens of thousands of lines of data; however, one cannot have an RDD of RDDs (as far as I know).
Is there any way to ease the process? Any workaround that would basically allow me to use the content of each file as an RDD, hence letting me do rdd.map(lambda x: change(x)) without the ugly bookkeeping of the filename (and the use of list comprehensions instead of transformations)?
The goal of course is to also maintain the distributed approach and to not inhibit it in any way.
The last step of the processing will be to gather together everything through a reduce.
More background: trying to identify (near) ship collisions on a per-minute basis, then plot their paths
If you have a normal map function (o1 -> o2), you can use the mapValues function. For flatMap-style functions (o1 -> Collection), there is flatMapValues.
It will keep the key (in your case, the file name) and change only the values.
For example:
rdd = sc.wholeTextFiles(...)
# RDD of e.g. one pair: /test/file.txt -> "Apache Spark"
rddMapped = rdd.mapValues(lambda x: veryImportantDataOf(x))
# result: one pair: /test/file.txt -> "Spark"
Using reduceByKey, you can then reduce the results.
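For example (a small sketch; the input path is an assumption), flatMapValues can split each file's content into lines while keeping the file name as the key, and reduceByKey can then combine the per-file results:
rdd = sc.wholeTextFiles("/test/")                      # (filename, content) pairs
lines = rdd.flatMapValues(lambda content: content.splitlines())
# count lines per file while keeping the filename as the key
line_counts = lines.mapValues(lambda _: 1).reduceByKey(lambda a, b: a + b)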

Spark SQL - READ and WRITE in sequence or pipeline?

I am working on a cost function for Spark SQL.
While modelling the TABLE SCAN behaviour I cannot understand if READ and WRITE are carried out in pipeline or in sequence.
Let us consider the following SQL query:
SELECT * FROM table1 WHERE columnA = 'xyz';
Each task:
Reads a data block (either locally or from a remote node)
Filters out the tuples that do not satisfy the predicate
Writes the remaining tuples to disk
Are (1), (2) and (3) carried out in sequence or in a pipeline? In other words, is the data block completely read first (all the disk pages composing it), then filtered, and then rewritten to disk? Or are these activities carried out in a pipeline (i.e. while the (n+1)-th tuple is being read, the n-th tuple can be processed and written)?
Thanks in advance.
Whenever you submit a job, the first thing Spark does is create a DAG (directed acyclic graph) for your job.
After creating the DAG, Spark knows which tasks it can run in parallel, which tasks depend on the output of a previous step, and so on.
So, in your case,
Spark will read your data in parallel (one task per partition) and filter it within each partition.
Now, since saving requires filtering, it will wait for the filtering to finish for at least one partition, then start to save it.
After some more digging I found out that Spark SQL uses a so-called "volcano-style pull model".
According to this model, a simple scan-filter-write query would be executed as a pipeline and is fully distributed.
In other words, while reading the partition (HDFS block), filtering can be executed on the rows already read. There is no need to read the whole block to kick off the filtering. Writing is performed accordingly.
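One way to convince yourself is to look at the physical plan (a minimal sketch; the table and column names are taken from the query in the question). With whole-stage code generation the scan and the filter typically end up fused into a single generated stage, i.e. rows are filtered as they are read:
spark.sql("SELECT * FROM table1 WHERE columnA = 'xyz'").explain()
# the FileScan and the Filter usually appear inside the same WholeStageCodegen stage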

Share a single date amongst Spark nodes

I'd like to run a spark job that outputs to some directory that contains the day at which the job started. Is there a way to share a single date object (joda.time for example) in all spark nodes, so no matter what node outputs what pipe, they all output to the same dir structure?
If the question is
Is there a way to share a single date object (joda.time for example)
in all spark nodes
then naturally the answer is "broadcast the object"
If the real question is how to specify the output path, then you really do not need to broadcast the path. You can just say rdd.saveAsTextFile("/path") and the function will automatically dump each partition into its own file (like part-00000 and so on). Of course, all worker nodes must have access to the location specified by the "path" variable, so in a real cluster it has to be HDFS or S3 or NFS or the like.
From documentation:
saveAsTextFile(path)
Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
Simply create the object in your driver program (as a val) and close over it where you need it. It should be copied over to the worker nodes to be used as you need.
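A minimal sketch of that approach (using Python's datetime here, though the same idea applies to a joda.time object; the base output path is an assumption): resolve the date once on the driver, build the path from it, and every partition ends up under the same dated directory:
import datetime

run_date = datetime.date.today().isoformat()   # evaluated once, on the driver
output_dir = "/data/output/" + run_date        # assumed base path
# output_dir is resolved on the driver and passed to the action, so all
# part files (part-00000, ...) land under the same dated directory
rdd.saveAsTextFile(output_dir)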
