How to get hold of intermediate results? - apache-spark

I'm using Apache Spark as a MapReduce implementation and was wondering whether there is a way to get hold of intermediate results. The simple API lets the triggering application collect the results once all the map steps have completed, in its simplest form e.g.
val results = mapResult.collect()
I'm interested in collecting intermediate map results as they complete. Is there a way to accomplish this?

You can use the DataFrame's cache() method to cache the computation result, so that when you call an action later it uses the cached result instead of re-computing the DAG. Something like:
// caches the result so the action called after this will use this cached
// result instead of re-computing the DAG
results.cache()
results.show(1)
Later, you may want to free up the memory used for caching the result with:
results.unpersist()
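A minimal end-to-end sketch of the same pattern in Scala (assuming mapResult is the DataFrame holding the intermediate result you want to reuse; the variable names are illustrative):
// mark for caching (lazy, nothing runs yet)
val intermediate = mapResult.cache()
// the first action materializes the DataFrame and populates the cache
intermediate.count()
// later actions reuse the cached data instead of re-running the whole DAG
val sample = intermediate.take(10)
val results = intermediate.collect()
// release the cached blocks when you are done with them
intermediate.unpersist()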

Related

How does pyspark perform union?

If I have code like this:
def my_func(df):
    df1 = trans1(df)
    df2 = trans2(df1)
    df3 = trans3(df1)
    df4 = df2.unionAll(df3)
    return df4
And I run df.collect() on the result of the function without having persisted anything, how many times will the operations in trans1 be run? Once or twice? Thanks!
In your code as written there is no action, so nothing will actually be executed.
But hypothetically, suppose you had cached the DataFrame:
df.cache().storageLevel
and afterwards performed some count action.
Caching/persistence is lazy when used with the Dataset API, so you have to trigger the caching using the count operator or similar, which in turn submits a Spark job.
In your case, even after the union there is still no action unless you use one, e.g. an action that writes to disk.
Yes, only actions (like saving to external storage) can trigger the persistence for future reuse.
You can check the Storage tab in the web UI to confirm it.
Twice
All transformations in Spark are lazily evaluated, which basically creates a lineage of instructions; once an action is performed, it traverses the lineage from bottom to top until it finds materialised data.
Since df1 is not materialised, it will perform the same operations twice, once for each of the two different lineages. If df1 is persisted, the first branch will perform the full transformation whereas the second will only do the further computation by reusing it. You can see this in the DAG and the SQL plan tab.
Please note that it is not necessary to always cache the data. Sometimes caching can cost more than re-computation, therefore you should cache if and only if the re-computation cost is higher.
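For the question above, a minimal Scala sketch of how persisting df1 avoids running trans1 twice (trans1/trans2/trans3 are the question's own placeholder functions, passed in here just to keep the sketch self-contained):
import org.apache.spark.sql.DataFrame

def myFunc(df: DataFrame,
           trans1: DataFrame => DataFrame,
           trans2: DataFrame => DataFrame,
           trans3: DataFrame => DataFrame): DataFrame = {
  val df1 = trans1(df).cache()  // without cache(), trans1 runs once per branch
  val df2 = trans2(df1)
  val df3 = trans3(df1)
  df2.union(df3)                // unionAll is the deprecated alias in Spark 2+
}
Note that cache() is itself lazy: df1's partitions are materialized the first time they are computed and are reused from then on.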

Spark 1.6 Dataframe cache not working correctly

My understanding is that if I have a dataframe and I cache() it and trigger an action like df.take(1) or df.count(), it should compute the dataframe and save it in memory, and whenever that cached dataframe is called in the program it uses the already computed dataframe from the cache.
But that is not how my program is working.
I have a dataframe like the one below, which I am caching, and then immediately I run a df.count action.
val df = inputDataFrame.select().where().withColumn("newcol" , "").cache()
df.count
When I run the program, in the Spark UI I see that the first line runs for 4 min, and
when it comes to the second line it again runs 4 min; basically the first line is computed twice?
Shouldn't the first line be computed and cached when the second line triggers?
How do I resolve this behavior? I am stuck, please advise.
My understanding is that if I have a dataframe and I cache() it and trigger an action like df.take(1) or df.count(), it should compute the dataframe and save it in memory,
This is not correct. A simple cache and count (take wouldn't work on an RDD either) is a valid method for RDDs, but it is not the case with Datasets, which use much more advanced optimizations. With the query:
df.select(...).where(...).withColumn("newcol" , "").count()
any column which is not used in the where clause can be ignored.
There is an important discussion on the developer list; quoting Sean Owen:
I think the right answer is "don't do that" but if you really had to you could trigger a Dataset operation that does nothing per partition. I presume that would be more reliable because the whole partition has to be computed to make it available in practice. Or, go so far as to loop over every element.
Translated to code:
df.foreach(_ => ())
There is also
df.registerTempTable("df")
sqlContext.sql("CACHE TABLE df")
which is eager, but it is no longer documented (Spark 2 and forward) and should be avoided.
No, if you call cache on a DataFrame it is not cached at that moment; it is only "marked" for potential future caching. The actual caching only happens when an action follows later. You can also see your cached DataFrame in the Spark UI under "Storage".
Another problem in your code is that count on a DataFrame does not compute the entire DataFrame, because not all columns need to be computed for that. You can use df.rdd.count() to force the entire evaluation (see How to force DataFrame evaluation in Spark).
The remaining question is why your first operation takes so long even though no action is called. I think this is related to the caching logic (e.g. size estimations etc.) being computed when calling cache (see e.g. Why is rdd.map(identity).cache slow when rdd items are big?).
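A minimal sketch of the workaround described above, assuming df is the cached DataFrame from the question (either of the two forcing calls should be enough):
df.cache()           // marks the DataFrame for caching (lazy)
df.rdd.count()       // going through the RDD computes every column, filling the cache
// or, per Sean Owen's suggestion quoted above:
df.foreach(_ => ())
df.count()           // now answered from the cached data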

How to Identify the list of available RDDs?

I am using the below command to get the list of available registered Temp tables
sqlContext.sql("show tables").collect().foreach(println)
Is there any similar command to get list of available RDDs?
Here is my requirement (using Scala):
1. Need to create some RDD on the fly
2. Identify list of available RDDs
3. remove/delete/clear the unwanted RDDs and move forward
An additional note: I went through this link (How to delete an RDD in PySpark for the purpose of releasing resources?), but it doesn't answer all my questions. Also, I tried the below but I don't find any difference before and after unpersist, so I am not sure how to confirm that my RDD has released its memory.
val tempRDD1 = RDD1.reduceByKey((acc,value)=> acc+value)
tempRDD1.collect.foreach(println)
tempRDD1.unpersist()
tempRDD1.collect.foreach(println)
The RDD data is not saved until it is (1) persisted (cached) and (2) an action occurs to force the preceding transformations to run. If either of these does not happen, no data will be stored. Any RDD that appears to be "created" just carries an execution plan to produce the data if it is needed later. This model is called lazy evaluation.
In your example, no RDD is ever cached, so no data will ever be stored in memory, and the unpersist call will have no effect.
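For points 2 and 3 of the requirement, a short Scala sketch (assuming sc is your SparkContext and RDD1 is the pair RDD from the question; getPersistentRDDs only lists RDDs that have actually been marked as persistent):
// persist an RDD and materialize it with an action
val tempRDD1 = RDD1.reduceByKey((acc, value) => acc + value).persist()
tempRDD1.count()

// list every RDD currently marked as persistent: a Map of rddId -> RDD
sc.getPersistentRDDs.foreach { case (id, rdd) =>
  println(s"id=$id name=${rdd.name} storage=${rdd.getStorageLevel}")
}

// release the ones you no longer need, then re-check the map or the Storage tab
tempRDD1.unpersist()
println(sc.getPersistentRDDs.size)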

Best procedure to modify immutable Spark RDDs

In the past, I worked with low-level parallelization (OpenMPI, OpenMP, ...).
I am currently working on a Spark project and I don't know the best procedure to work with RDDs because they are immutable.
I will explain my problem with a simple example, imagine that in my RDD I have an object and I need to update one attribute.
The most practical and memory-efficient way to solve this would be implementing a method called setAttribute(new_value).
Spark RDDs are immutable, so I need to create a function (for example myModifiedCopy(new_value)) that returns a copy of the object but with new_value in its attribute, and update the RDD like this:
myRDD = myRDD.map(x->x.myModifiedCopy(new_value)).cache()
My objects are very complex and they use a lot of RAM (they are really huge). This procedure is slow; you have to create a complete copy of every element of the RDD just to modify a small value.
Is there a better procedure to deal with this kind of problems?
Do you recommend a different technology?
I would kill for a mutable RDD.
Thank you very much in advance.
I believe you have some misconceptions about Apache Spark. When you do a transformation, you are not actually creating a whole copy of that RDD in memory; you are just "designing" the series of tiny conversions to execute on each record when you run an action.
For instance, map, filter and flatMap are transformations, thus lazy, so when you call them you just design the plan but don't execute it. On the other hand, collect and count behave differently: they trigger all the previous transformations (doing everything that was defined in the intermediate stages) until they get the result.
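A small Scala sketch of what that means in practice (Record, myRDD and newValue are hypothetical names standing in for the question's huge objects):
import org.apache.spark.rdd.RDD

// hypothetical record type; copy() builds the "modified copy" from the question
case class Record(id: Long, attribute: Double, payload: Array[Byte])

def updateAttribute(myRDD: RDD[Record], newValue: Double): RDD[Record] = {
  // lazy: nothing is copied or computed here, Spark only records the lineage
  // "apply this function to every Record"
  myRDD.map(r => r.copy(attribute = newValue))
}
Records are transformed one at a time when an action eventually runs; Spark never holds a second full copy of the dataset in memory unless you cache the result.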

reducer concept in Spark

I'm coming from a Hadoop background and have limited knowledge about Spark. Based on what I have learned so far, Spark doesn't have mapper/reducer nodes; instead it has driver/worker nodes. The workers are similar to the mappers and the driver is (somehow) similar to the reducer. As there is only one driver program, there will be one reducer. If so, how can simple programs like word count for very big data sets get done in Spark? Because the driver can simply run out of memory.
The driver is more of a controller of the work, only pulling data back if the operator calls for it. If the operator you're working on returns an RDD/DataFrame/Unit, then the data remains distributed. If it returns a native type then it will indeed pull all of the data back.
Otherwise, the concepts of map and reduce are a bit obsolete here (from a type-of-work perspective). The only thing that really matters is whether the operation requires a data shuffle or not. You can see the points of shuffle by the stage splits either in the UI or via a toDebugString (where each indentation level is a shuffle).
All that being said, for a vague understanding, you can equate anything that requires a shuffle to a reducer. Otherwise it's a mapper.
Last, to equate to your word count example:
sc.textFile(path)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
In the above, this will be done in one stage as the data loading (textFile), splitting (flatMap), and mapping (map) can all be done independently of the rest of the data. No shuffle is needed until reduceByKey is called, as it will need to combine all of the data to perform the operation...HOWEVER, this operation has to be associative for a reason. Each node will perform the operation defined in reduceByKey locally, only merging the final data set after. This reduces both memory and network overhead.
NOTE that reduceByKey returns an RDD and is thus a transformation, so the data is shuffled via a HashPartitioner. All of the data does NOT pull back to the driver, it merely moves to nodes that have the same keys so that it can have its final value merged.
Now, if you use an action such as reduce or, worse yet, collect, then you will NOT get an RDD back, which means the data is pulled back to the driver and you will need room for it.
Here is my fuller explanation of reduceByKey if you want more. Or how this breaks down in something like combineByKey
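A small Scala sketch contrasting where the results end up (path and outputPath are placeholders):
val counts = sc.textFile(path)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)          // transformation: the result stays distributed

// distributed action: each partition is written by the executor that holds it,
// so the driver never has to hold the full result
counts.saveAsTextFile(outputPath)

// driver-side action: every (word, count) pair is shipped to the driver,
// which therefore needs enough memory for the whole result
val local: Array[(String, Int)] = counts.collect()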
