Best procedure to modify Immutable Spark RDDs - apache-spark

In the past, I worked with low-level parallelization (OpenMPI, OpenMP, ...).
I am currently working on a Spark project and I don't know the best way to work with RDDs because they are immutable.
I will explain my problem with a simple example: imagine that in my RDD I have an object and I need to update one attribute.
The most practical and memory-efficient way to solve this would be to implement a method called setAttribute(new_value).
But Spark RDDs are immutable, so I need to create a function (for example myModifiedCopy(new_value)) that returns a copy of the object with new_value in that attribute, and update the RDD like this:
myRDD = myRDD.map(x->x.myModifiedCopy(new_value)).cache()
My objects are very complex and they use a lot of RAM (they are really huge). This procedure is slow: you have to create a complete copy of every element of the RDD just to modify a small value.
Is there a better way to deal with this kind of problem?
Do you recommend a different technology?
I would kill for a mutable RDD.
Thank you very much in advance.

I believe you have some misconceptions about Apache Spark. When you apply a transformation you are not creating a whole copy of that RDD in memory; you are just "designing" the series of tiny conversions to execute on each record when you run an action.
For instance, map, filter and flatMap are transformations, and thus lazy, so calling them only builds the plan without executing it. On the other hand, actions like collect or count behave differently: they trigger all the previous transformations (doing everything that was defined in the intermediate stages) until they produce the result.
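To illustrate, here is a minimal PySpark sketch (names and values are made up) showing that map and filter only extend the lineage, while count is the action that actually runs the whole pipeline:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1000000))

# Transformations: nothing executes here, Spark only records the plan.
doubled = rdd.map(lambda x: x * 2)
evens = doubled.filter(lambda x: x % 4 == 0)

# Action: the whole chain runs now, record by record, without
# materializing a full intermediate copy of the RDD for each step.
print(evens.count())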

Related

Spark Aggregator Is Slow At ObjectHashAggregate

We have a Spark app that does some customized aggregation on our data. We've created a Spark Aggregator to do it, which basically extends org.apache.spark.sql.expressions.Aggregator and is called through a UDAF. These aggregators define a pre-aggregation step, so executors can process the data locally first and then go through the exchange phase, which is pretty sweet. More on them can be found here: https://spark.apache.org/docs/latest/sql-ref-functions-udf-aggregate.html
Now my question is, I've noticed that for these aggregators, the second ObjectHashAggregate step is always much slower than the first, partial ObjectHashAggregate. For example, in the screenshot I've attached, the second ObjectHashAggregate is significantly slower than the first one.
However, in our code both of them should be running roughly the same aggregation methods on the data or on the intermediate results. So why is the second one so much slower than the first? Is it because of the hashing and exchange after the first ObjectHashAggregate? I'm not really sure what could be causing Spark to behave this way.

PySpark: Best practice to add more columns to a DataFrame

Spark DataFrames have a withColumn method for adding one new column at a time. To add multiple columns, a chain of withColumn calls is required. Is this the best practice for doing this?
I feel that using mapPartitions has more advantages. Let's say I have a chain of three withColumn calls and then one filter to remove rows based on certain conditions. These are four different operations (I am not sure if any of them are wide transformations, though). But I can do it all in one go with mapPartitions. It also helps if I have a database connection that I would prefer to open once per RDD partition.
My question has two parts.
The first part, this is my implementation of mapPartitions. Are there any unforeseen issues with this approach? And is there a more elegant way to do this?
from pyspark.sql import Row

def add_new_cols(rows):
    db = open_db_connection()  # opened once per partition
    new_rows = []
    new_row_1 = Row("existing_col_1", "existing_col_2", "new_col_1", "new_col_2")
    i = 0
    for each_row in rows:
        i += 1
        # conditionally omit rows
        if i % 3 == 0:
            continue
        db_result = db.get_some_result(each_row.existing_col_2)
        new_col_1 = ''.join([db_result, "_NEW"])
        new_col_2 = db_result
        new_f_row = new_row_1(each_row.existing_col_1, each_row.existing_col_2, new_col_1, new_col_2)
        new_rows.append(new_f_row)
    db.close()
    return iter(new_rows)

df2 = df.rdd.mapPartitions(add_new_cols).toDF()
The second part, what are the tradeoffs in using mapPartitions over a chain of withColumn and filter?
I read somewhere that using the available methods of Spark DataFrames is always better than rolling out your own implementation. Please let me know if my argument is wrong. Thank you! All thoughts are welcome.
Are there any unforeseen issues with this approach?
Multiple. The most severe implications are:
A memory footprint a few times higher compared to plain DataFrame code, plus significant garbage-collection overhead.
The high cost of serialization and deserialization required to move data between execution contexts.
Introducing a breaking point in the query planner.
As written, the cost of schema inference on the toDF call (which can be avoided if a proper schema is provided) and the possible re-execution of all preceding steps.
And so on...
Some of these can be avoided with udf and select / withColumn, others cannot.
let's say I have a chain of three withColumns and then one filter to remove Rows based on certain conditions. These are four different operations (I am not sure if any of these are wide transformations, though). But I can do it all in one go if I do a mapPartitions
Your mapPartitions doesn't remove any operations, and it doesn't provide any optimization that the Spark planner couldn't apply on its own. Its only advantage is that it provides a nice scope for expensive connection objects.
I read somewhere that using the available methods with Spark DFs are always better than rolling out your own implementation
When you start using executor-side Python logic you have already diverged from Spark SQL. It doesn't matter whether you use udf, RDD or the newly added vectorized udf. At the end of the day you should make the decision based on the overall structure of your code - if it is predominantly Python logic executed directly on the data, it might be better to stick with RDDs or skip Spark completely.
If it is just a fraction of the logic, and doesn't cause a severe performance issue, don't sweat it.
Using df.withColumn() is the best way to add columns; they're all added lazily.
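For comparison, here is a hedged sketch of the DataFrame-native version of the question's logic, using a udf with withColumn and filter. The lookup function and the filtering condition are placeholders standing in for the question's database call and row-skipping rule:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("withcolumn-demo").getOrCreate()

# Placeholder for the database lookup in the question; a broadcast
# dictionary or a join would usually be preferable to a per-row call.
def lookup(value):
    return value + "_RESULT"

lookup_udf = F.udf(lookup, StringType())

df = spark.createDataFrame(
    [("a", "x"), ("b", "y"), ("c", "z")],
    ["existing_col_1", "existing_col_2"],
)

df2 = (
    df.withColumn("new_col_2", lookup_udf(F.col("existing_col_2")))
      .withColumn("new_col_1", F.concat(F.col("new_col_2"), F.lit("_NEW")))
      .filter(F.col("existing_col_1") != "b")  # placeholder for the row-skipping rule
)
df2.show()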

reducer concept in Spark

I'm coming from a Hadoop background and have limited knowledge about Spark. Based on what I have learned so far, Spark doesn't have mapper/reducer nodes; instead it has driver/worker nodes. The workers are similar to mappers and the driver is (somehow) similar to a reducer. As there is only one driver program, there will be one reducer. If so, how can simple programs like word count for very big data sets get done in Spark? The driver could simply run out of memory.
The driver is more of a controller of the work, only pulling data back if the operator calls for it. If the operator you're working on returns an RDD/DataFrame/Unit, then the data remains distributed. If it returns a native type then it will indeed pull all of the data back.
Otherwise, the concepts of map and reduce are a bit obsolete here (from a type-of-work perspective). The only thing that really matters is whether the operation requires a data shuffle or not. You can see the shuffle points by the stage splits, either in the UI or via toDebugString (where each indentation level is a shuffle).
All that being said, for a vague understanding, you can equate anything that requires a shuffle to a reducer. Otherwise it's a mapper.
Last, to equate to your word count example:
sc.textFile(path)
.flatMap(_.split(" "))
.map((_, 1))
.reduceByKey(_+_)
In the above, this will be done in one stage, as the data loading (textFile), splitting (flatMap), and mapping (map) can all be done independently of the rest of the data. No shuffle is needed until reduceByKey is called, as it needs to combine all of the data to perform the operation... HOWEVER, this operation has to be associative for a reason. Each node will perform the operation defined in reduceByKey locally, only merging the final data sets afterwards. This reduces both memory and network overhead.
NOTE that reduceByKey returns an RDD and is thus a transformation, so the data is shuffled via a HashPartitioner. All of the data is NOT pulled back to the driver; it merely moves to the nodes that hold the same keys so the final value can be merged.
Now, if you use an action such as reduce or, worse yet, collect, then you will NOT get an RDD back, which means the data is pulled back to the driver and you will need room for it.
Here is my fuller explanation of reduceByKey if you want more, or how this breaks down in something like combineByKey.
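If it helps, here is the same word count written as a PySpark sketch (the input path is made up), using toDebugString to inspect where the shuffle boundary falls; everything up to reduceByKey sits in a single stage:

from pyspark import SparkContext

sc = SparkContext(appName="wordcount-stages")

counts = (
    sc.textFile("hdfs:///path/to/input")  # made-up path
      .flatMap(lambda line: line.split(" "))
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)
)

# Each extra level of indentation in the lineage marks a shuffle,
# so reduceByKey shows up as the start of a new stage.
debug = counts.toDebugString()
print(debug.decode("utf-8") if isinstance(debug, bytes) else debug)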

Spark: Importing Data

I currently have a Spark app that reads a couple of files, forms DataFrames out of them, and applies some logic to those DataFrames.
I can see the number and size of these files growing a lot in the future, and I wanted to understand what goes on behind the scenes so I can keep up with this growth.
Firstly, I just wanted to double-check that, since all machines on the cluster can access the files (which is a requirement of Spark), the task of reading data from these files is distributed and no single machine is burdened by it?
I was looking at the Spark UI for this app but since it only shows what actions were performed by which machines and since "sc.textFile(filePath)" is not an action I couldn't be sure what machines are performing this read.
Secondly, what advantages/disadvantages would I face if I were to read this data from a database like Cassandra instead of just reading in files?
Thirdly, in my app I have some code where I perform a collect (val treeArr = treeDF.collect()) on the DataFrame to get an array, and then I have some logic implemented on those arrays. But since these are not RDDs, how does Spark distribute this work? Or does it distribute it at all?
In other words, should I be doing the maximum amount of my work by transforming and performing actions on RDDs, rather than converting them into arrays or some other data structure and then implementing the logic as I would in any programming language?
I am only about two weeks into Spark so I apologize if these are stupid questions!
Yes, sc.textFile is distributed. It even has an optional minPartitions argument.
This question is too broad. But the short answer is that you should benchmark it for yourself.
collect fetches all the data to the driver. After that it's just a plain array. Indeed, the idea is that you should not use collect if you want to perform distributed computations.
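A small PySpark sketch of those points (the path is made up): the read is distributed and accepts a minPartitions hint, and keeping the work on the RDD means only the final result comes back to the driver, unlike collect:

from pyspark import SparkContext

sc = SparkContext(appName="read-demo")

# Distributed read; minPartitions is only a hint for the minimum split count.
lines = sc.textFile("hdfs:///data/input.csv", minPartitions=64)  # made-up path
print(lines.getNumPartitions())

# Stays distributed: only the final count is returned to the driver,
# whereas collect() would pull back every record.
non_empty = lines.filter(lambda line: line.strip() != "").count()
print(non_empty)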

How to access Map modified in RDD, in the driver program of Apache Spark?

Need help.
I am working on Apache Spark 1.2.0. I have a requirement, or rather I should say I am stuck on an issue.
It's like this:
I am running a map function on an RDD in which I am creating some object instances and storing those instances in a ConcurrentMap against some key. Now, after the map function has finished, I need the data that was stored in the ConcurrentMap in the driver program, which as of now is blank outside the map function.
Is it at all possible ? Can I access it by any means ?
Thanks
Nitin
I think you are misusing Spark or misunderstanding the concept. The thing you want to do can be achieved with the mapPartitions function. This function provides an iterator over all the rows in the input RDD partition; this way you would know when the processing has finished and would be able to either save the ConcurrentMap you've generated to persistent storage or return its iterator as the function result.
If you would elaborate on your use case or attach the code, I would be able to recommend the right solution for you.
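To make the suggestion concrete, here is a minimal PySpark sketch of that pattern (keys and values are made up): build an ordinary local map inside mapPartitions on the executors, return its entries, and merge them in the driver instead of expecting a driver-side ConcurrentMap to be filled in:

from pyspark import SparkContext

sc = SparkContext(appName="per-partition-map")

records = sc.parallelize([("k1", 1), ("k2", 2), ("k1", 3)], 2)

def build_partition_map(rows):
    # Runs on the executor: accumulate into a plain local dict
    # instead of mutating a map defined in the driver program.
    local = {}
    for key, value in rows:
        local[key] = local.get(key, 0) + value
    # Return the entries so they travel back through the RDD.
    return iter(local.items())

merged = dict(
    records.mapPartitions(build_partition_map)
           .reduceByKey(lambda a, b: a + b)
           .collect()
)
print(merged)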
