Global variable value update at task level in Spark

I have a requirement to update some integer variables based on computations in my transformations. For example, if I find a discrepancy during record matching, I want to increment a value and use it then and there.
I have explored accumulators, but their values can only be read on the driver, which will be very tedious for me since I am dealing with billions of rows.
Please suggest a possible solution for global variable updates in Spark, similar to COUNTERS in the MapReduce framework.

Accumulators are the best alternative to counters in Spark, but they should be updated inside actions rather than transformations, because computations inside transformations are evaluated lazily. You can find an example at the link below.
https://github.com/prithvirajbose/spark-dev/blob/master/src/main/scala/examples/PurchaseLogAnalysis.scala
I faced the same problem. Using accumulators is a good practice for this use case; it won't affect Spark performance and is safe to use.
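Below is a minimal PySpark sketch of counting discrepancies with an accumulator; the record values and the mismatch condition are purely illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-counter").getOrCreate()
sc = spark.sparkContext

# An accumulator acting like a MapReduce counter.
mismatches = sc.accumulator(0)

def check(record):
    # Hypothetical mismatch condition, for illustration only.
    if record % 3 == 0:
        mismatches.add(1)

rdd = sc.parallelize(range(100))
rdd.foreach(check)  # updating inside an action, as the answer recommends

print("mismatched records:", mismatches.value)  # the value is only readable on the driver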

Related

A question about Spark distributed aggregation

I am reading up on Spark from here.
At one point the blog says:
consider an app that wants to count the occurrences of each word in a corpus and pull the results into the driver as a map. One approach, which can be accomplished with the aggregate action, is to compute a local map at each partition and then merge the maps at the driver. The alternative approach, which can be accomplished with aggregateByKey, is to perform the count in a fully distributed way, and then simply collectAsMap the results to the driver.
So, as I understand this, the two approaches described are:
Approach 1:
Create a hash map within each executor
Collect key 1 from all the executors on the driver and aggregate
Collect key 2 from all the executors on the driver and aggregate
and so on and so forth
This is where the problem is. I do not think approach 1 ever happens in Spark unless the user is hell-bent on doing it and starts using collect along with filter to get the data key by key on the driver, then writes code on the driver to merge the results.
Approach 2 (I think this is what usually happens in Spark unless you use groupBy, where the combiner is not run; this is the typical reduceByKey mechanism):
Compute first level of aggregation on map side
Shuffle
Compute second level of aggregation from all the partially aggregated results from the step 1
Which leads me to believe that I am misunderstanding approach 1 and what the author is trying to say. Can you please help me understand what approach 1 in the quoted text is?
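For what it's worth, here is a small PySpark sketch of how I read the two approaches; the tiny word list is only illustrative.

from pyspark.sql import SparkSession
from collections import Counter

spark = SparkSession.builder.appName("wordcount-two-ways").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["a", "b", "a", "c", "b", "a"])

# Approach 1: build a local map per partition, then merge the maps on the driver.
# aggregate is an action; the final combine of partition maps runs driver-side.
counts_1 = words.aggregate(
    Counter(),
    lambda acc, w: acc + Counter({w: 1}),  # fold each word into the partition-local map
    lambda a, b: a + b)                    # merge partition maps on the driver

# Approach 2: count in a fully distributed way, then collect the (small) result.
counts_2 = (words.map(lambda w: (w, 1))
                 .aggregateByKey(0, lambda acc, v: acc + v, lambda a, b: a + b)
                 .collectAsMap())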

PySpark: Best practice to add more columns to a DataFrame

Spark DataFrames have a withColumn method to add one new column at a time. To add multiple columns, a chain of withColumn calls is required. Is this the best practice?
I feel that using mapPartitions has more advantages. Let's say I have a chain of three withColumn calls and then one filter to remove rows based on certain conditions. These are four different operations (I am not sure if any of these are wide transformations, though). But I can do it all in one go with mapPartitions. It also helps when I have a database connection that I would prefer to open once per RDD partition.
My question has two parts.
The first part, this is my implementation of mapPartitions. Are there any unforeseen issues with this approach? And is there a more elegant way to do this?
from pyspark.sql import Row

def add_new_cols(rows):
    # open one connection per partition rather than per row
    db = open_db_connection()
    new_rows = []
    new_row_1 = Row("existing_col_1", "existing_col_2", "new_col_1", "new_col_2")
    i = 0
    for each_row in rows:
        i += 1
        # conditionally omit rows
        if i % 3 == 0:
            continue
        db_result = db.get_some_result(each_row.existing_col_2)
        new_col_1 = ''.join([db_result, "_NEW"])
        new_col_2 = db_result
        new_f_row = new_row_1(each_row.existing_col_1, each_row.existing_col_2, new_col_1, new_col_2)
        new_rows.append(new_f_row)
    db.close()
    return iter(new_rows)

df2 = df.rdd.mapPartitions(add_new_cols).toDF()
The second part, what are the tradeoffs in using mapPartitions over a chain of withColumn and filter?
I read somewhere that using the available methods on Spark DataFrames is always better than rolling out your own implementation. Please let me know if my argument is wrong. Thank you! All thoughts are welcome.
Are there any unforeseen issues with this approach?
Multiple. The most severe implications are:
A few times higher memory footprint compared to plain DataFrame code, and significant garbage collection overhead.
High cost of the serialization and deserialization required to move data between execution contexts.
Introducing a breaking point in the query planner.
As written, the cost of schema inference on the toDF call (which can be avoided if a proper schema is provided) and possible re-execution of all preceding steps.
And so on...
Some of these can be avoided with udf and select / withColumn, others cannot.
let's say I have a chain of three withColumns and then one filter to remove Rows based on certain conditions. These are four different operations (I am not sure if any of these are wide transformations, though). But I can do it all in one go if I do a mapPartitions
Your mapPartitions doesn't remove any operations and doesn't provide any optimizations that the Spark planner couldn't make on its own. Its only advantage is that it provides a nice scope for expensive connection objects.
I read somewhere that using the available methods with Spark DFs are always better than rolling out your own implementation
When you start using executor-side Python logic you already diverge from Spark SQL. It doesn't matter whether you use udf, RDD, or the newly added vectorized udf. At the end of the day you should make the decision based on the overall structure of your code: if it is predominantly Python logic executed directly on the data, it might be better to stick with RDDs or skip Spark completely.
If it is just a fraction of the logic and doesn't cause a severe performance issue, don't sweat it.
Using df.withColumn() is the best way to add columns; they are all added lazily.
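For reference, a minimal sketch of what the plain DataFrame version of the question's pipeline could look like; df is the question's DataFrame, lookup_udf is a hypothetical stand-in for the per-row database lookup, and the filter condition is invented.

from pyspark.sql import functions as F

# Hypothetical stand-in for the per-row lookup; a plain Python udf keeps the logic
# inside the DataFrame API, although it still runs Python code on the executors.
lookup_udf = F.udf(lambda v: v + "_NEW")

df2 = (df
       .withColumn("new_col_1", lookup_udf(F.col("existing_col_2")))
       .withColumn("new_col_2", F.col("existing_col_2"))
       .filter(F.col("existing_col_1").isNotNull()))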

Best procedure to modify immutable Spark RDDs

In the past, I worked with low-level parallelization (OpenMPI, OpenMP, ...).
I am currently working on a Spark project and I don't know the best procedure to work with RDDs, because they are immutable.
I will explain my problem with a simple example: imagine that in my RDD I have an object and I need to update one attribute.
The most practical and memory efficient way to solve this is implementing a method called setAttribute(new_value).
Spark RDDs are inmutable, so I need to create a function (for example myModifiedCopy(new_value)) that returns a copy of this object but with the new_value in its attribute and updating the RDD like this:
myRDD = myRDD.map(x->x.myModifiedCopy(new_value)).cache()
My objects are very complex and they use a lot of RAM (they are really huge). This procedure is slow: you have to create a complete copy of every element of the RDD just to modify a small value.
Is there a better procedure to deal with this kind of problems?
Do you recommend a different technology?
I would kill for a mutable RDD.
Thank you very much in advance.
I believe you have some misconceptions about Apache Spark. When you do a transformation, you aren't actually creating a whole copy of that RDD in memory; you are just "designing" the series of tiny conversions to execute on each record when you run an action.
For instance, map, filter and flatMap are transformations, thus lazy, so when you call them you just build the plan but don't execute it. On the other hand, collect or count behave differently: they trigger all previous transformations (doing everything that was defined in the intermediate stages) until they produce the result.
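A minimal PySpark sketch of that laziness; the record structure and the attribute update are purely illustrative.

from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("lazy-demo").getOrCreate().sparkContext

rdd = sc.parallelize([{"id": 1, "score": 10}, {"id": 2, "score": 20}])

# A transformation: no copy of the data is materialized here, only the plan is recorded.
updated = rdd.map(lambda rec: {**rec, "score": rec["score"] + 1})

# An action: only now is the map above actually executed, record by record.
print(updated.count())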

Is it possible to get and use a JavaSparkContext from within a task?

I've come across a situation where I'd like to do a "lookup" within a Spark and/or Spark Streaming pipeline (in Java). The lookup is somewhat complex, but fortunately, I have some existing Spark pipelines (potentially DataFrames) that I could reuse.
For every incoming record, I'd like to potentially launch a spark job from the task to get the necessary information to decorate it with.
Considering the performance implications, would this ever be a good idea?
Not considering the performance implications, is this even possible?
Is it possible to get and use a JavaSparkContext from within a task?
No. The Spark context is only valid on the driver, and Spark will prevent it from being serialized. Therefore it's not possible to use the Spark context from within a task.
For every incoming record, I'd like to potentially launch a spark job from the task to get the necessary information to decorate it with.
Considering the performance implications, would this ever be a good idea?
Without more details, my umbrella answer would be: Probably not a good idea.
Not considering the performance implications, is this even possible?
Yes, probably by bringing the base collection to the driver (collect) and iterating over it. If that collection doesn't fit in the driver's memory, please see the previous point.
If you need to process every record, consider performing some form of join with the 'decorating' dataset; that will be one large job instead of tons of small ones.
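A small PySpark sketch of the join idea (the same applies with the Java API); the datasets and column names are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("decorate-by-join").getOrCreate()

# Hypothetical incoming records and 'decorating' lookup dataset.
incoming_df = spark.createDataFrame([(1, "k1"), (2, "k2")], ["record_id", "lookup_key"])
lookup_df = spark.createDataFrame([("k1", "extra-1"), ("k2", "extra-2")], ["lookup_key", "decoration"])

# One large join instead of launching a Spark job per record.
decorated = incoming_df.join(lookup_df, on="lookup_key", how="left")
decorated.show()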

How to find the number of keys created in map part?

I am trying to write a Spark application that would find the number of keys created in the map function. I could find no function that would allow me to do that.
One way I've thought of is using an accumulator, where I'd add 1 to the accumulator variable in the reduce function. My idea is based on the assumption that accumulator variables are shared across nodes as counters.
Please guide.
If you are looking for something like Hadoop counters in Spark, the closest approximation is an accumulator that you can increment in every task; however, it does not tell you anything about the amount of data Spark has processed so far.
If you only want to know how many distinct keys you have in your RDD, you could count the distinct mapped keys, e.g. rdd.map(t => t._1).distinct.count.
Hope this is useful for you.
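A short PySpark sketch of both options; the pair RDD is purely illustrative.

from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("key-counts").getOrCreate().sparkContext

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 2), ("c", 1)])

# Option 1: an accumulator incremented once per key emitted by the map.
emitted = sc.accumulator(0)

def tag(kv):
    emitted.add(1)
    return kv

pairs.map(tag).count()  # an action, so the lazy map runs and the accumulator updates
print("keys emitted:", emitted.value)

# Option 2: the number of distinct keys.
print("distinct keys:", pairs.map(lambda kv: kv[0]).distinct().count())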

Resources