Parallelism of window-functions in Spark 2 - apache-spark

I have a rather simple (but essential) question about Spark window functions (such as e.g. lead, lag, count, sum, row_number etc):
If a specify my window as Window.partitionBy(lit(0)) (i.e. I need to run the window fucntion accross the entire dataframe), is the window aggregate function running in parallel, or are all records moved in one single tasks?
EDIT:
Especially for "rolling" operations (e.g. rolling average using something like avg(...).Window.partitionBy(lit(0)).orderBy(...).rowsBetween(-10,10)), this operation could very well be split up into different tasks even tough all data is in the same partition of the window because only 20 row are needed at once to compute the average

If you define as Window.partitionBy(lit(0)) or if you don't define partitionBy at all, then all of the partitions of a dataframe will be collected as one in one executor and that executor is going to perform the aggregating function on whole dataframe. So parallelism will not be preserved.
The collection is different than the collect() function as collect() function will collect all the partitions into driver node but partitionBy function will collect data to executor where the partitions are easy to be collected.

Related

Splitting spark data into partitions and writing those partitions to disk in parallel

Problem outline: Say I have 300+ GB of data being processed with spark on an EMR cluster in AWS. This data has three attributes used to partition on the filesystem for use in Hive: date, hour, and (let's say) anotherAttr. I want to write this data to a fs in such a way that minimizes the number of files written.
What I'm doing right now is getting the distinct combinations of date, hour, anotherAttr, and a count of how many rows make up combination. I collect them into a List on the driver, and iterate over the list, building a new DataFrame for each combination, repartitioning that DataFrame using the number of rows to guestimate file size, and writing the files to disk with DataFrameWriter, .orc finishing it off.
We aren't using Parquet for organizational reasons.
This method works reasonably well, and solves the problem that downstream teams using Hive instead of Spark don't see performance issues resulting from a high number of files. For example, if I take the whole 300 GB DataFrame, do a repartition with 1000 partitions (in spark) and the relevant columns, and dumped it to disk, it all dumps in parallel, and finishes in ~9 min with the whole thing. But that gets up to 1000 files for the larger partitions, and that destroys Hive performance. Or it destroys some kind of performance, honestly not 100% sure what. I've just been asked to keep the file count as low as possible. With the method I'm using, I can keep the files to whatever size I want (relatively close anyway), but there is no parallelism and it takes ~45 min to run, mostly waiting on file writes.
It seems to me that since there's a 1-to-1 relationship between some source row and some destination row, and that since I can organize the data into non-overlapping "folders" (partitions for Hive), I should be able to organize my code/DataFrames in such a way that I can ask spark to write all the destination files in parallel. Does anyone have suggestions for how to attack this?
Things I've tested that did not work:
Using a scala parallel collection to kick off the writes. Whatever spark was doing with the DataFrames, it didn't separate out the tasks very well and some machines were getting massive garbage collection problems.
DataFrame.map - I tried to map across a DataFrame of the unique combinations, and kickoff writes from inside there, but there's no access to the DataFrame of the data that I actually need from within that map - the DataFrame reference is null on the executor.
DataFrame.mapPartitions - a non-starter, couldn't come up with any ideas for doing what I want from inside mapPartitions
The word 'partition' is also not especially helpful here because it refers both to the concept of spark splitting up the data by some criteria, and to the way that the data will be organized on disk for Hive. I think I was pretty clear in the usages above. So if I'm imagining a perfect solution to this problem, it's that I can create one DataFrame that has 1000 partitions based on the three attributes for fast querying, then from that create another collection of DataFrames, each one having exactly one unique combination of those attributes, repartitioned (in spark, but for Hive) with the number of partitions appropriate to the size of the data it contains. Most of the DataFrames will have 1 partition, a few will have up to 10. The files should be ~3 GB, and our EMR cluster has more RAM than that for each executor, so we shouldn't see a performance hit from these "large" partitions.
Once that list of DataFrames is created and each one is repartitioned, I could ask spark to write them all to disk in parallel.
Is something like this possible in spark?
One thing I'm conceptually unclear on: say I have
val x = spark.sql("select * from source")
and
val y = x.where(s"date=$date and hour=$hour and anotherAttr=$anotherAttr")
and
val z = x.where(s"date=$date and hour=$hour and anotherAttr=$anotherAttr2")
To what extent is y is a different DataFrame than z? If I repartition y, what effect does the shuffle have on z, and on x for that matter?
We had the same problem (almost) and we ended up by working directly with RDD (instead of DataFrames) and implementing our own partitioning mechanism (by extending org.apache.spark.Partitioner)
Details: we are reading JSON messages from Kafka. The JSON should be grouped by customerid/date/more fields and written in Hadoop using Parquet format, without creating too many small files.
The steps are (simplified version):
a)Read the messages from Kafka and transform them to a structure of RDD[(GroupBy, Message)]. GroupBy is a case class containing all the fields that are used for grouping.
b)Use a reduceByKeyLocally transformation and obtain a map of metrics (no of messages/messages size/etc) for each group - eg Map[GroupBy, GroupByMetrics]
c)Create a GroupPartitioner that's using the previously collected metrics (and some input parameters like the desired Parquet size etc) to compute how many partitions should be created for each GroupBy object. Basically we are extending org.apache.spark.Partitioner and overriding numPartitions and getPartition(key: Any)
d)we partition the RDD from a) using the previously defined partitioner: newPartitionedRdd = rdd.partitionBy(ourCustomGroupByPartitioner)
e)Invoke spark.sparkContext.runJob with two parameters: the first one is the RDD partitioned at d), the second one is a custom function (func: (TaskContext, Iterator[T]) that will write the messages taken from Iterator[T] into Hadoop/Parquet
Let's say that we have 100 mil messages, grouped like that
Group1 - 2 mil
Group2 - 80 mil
Group3 - 18 mil
and we decided that we have to use 1.5 mil messages per partition to obtain Parquet files greater than 500MB. We'll end up with 2 partitions for Group1, 54 for Group2, 12 for Group3.
This statement:
I collect them into a List on the driver, and iterate over the list,
building a new DataFrame for each combination, repartitioning that
DataFrame using the number of rows to guestimate file size, and
writing the files to disk with DataFrameWriter, .orc finishing it off.
is completely off-beam where Spark is concerned. Collecting to driver is never a good approach, volumes and OOM issues and latency in your approach is high.
Use so the below so as to simplify and get parallelism of Spark benefits saving time and money for your boss:
df.repartition(cols...)...write.partitionBy(cols...)...
shuffle occurs via repartition, no shuffling ever with partitionBy.
That simple, with Spark's default parallelism utilized.

Spark containers killed by YARN during group by

I have a data set extracted from Hbase, which is a long form of wide table, i.e has rowKey, columnQualifier and value columns. To get a form of pivot, I need to group by rowKey, which is a string UUID, into a collection and make an object out of the collection. The problem is that only group-by I manage to perform is count the number of elements in groups; other group-bys fail due to container being kill due to memory overflow beyond YARN container limits. I did experiment a lot with the memory sizes, including overhead, partitioning with and without sorting etc. I went even into a high number of partitions i.e. about 10 000 but the job dies the same. I tried both DataFrame groupBy and collect_list, as well as Dataset grouByKey and mapGroups.
The code works on a small data set but not on the larger one. The data set is about 500 GB in Parquet files. The data is not skewed as the largest group in group by have only 50 elements. Thus, by all known to me means the partitions should easily fit in memory as the aggregated data per one rowKey is not really large. The data keys and values are mostly strings and there are not long.
I am using Spark 2.0.2; the above computations were all done is Scala.
You're probably running into the dreaded groupByKey shuffle. Please read this Databricks article on avoiding groupByKey, which details the underlying differences between the two functions.
If you don't want the read the article, the short story is this: Though groupByKey and reduceByKey produce the same results, groupByKey instantiates a shuffle of ALL data, while reduceByKey tries to minimize data shuffle by reducing first. A bit like MapReduce Combiners, if you're familiar with that concept.

Managing Spark partitions after DataFrame unions

I have a Spark application that will need to make heavy use of unions whereby I'll be unioning lots of DataFrames together at different times, under different circumstances. I'm trying to make this run as efficiently as I can. I'm still pretty much brand-spanking-new to Spark, and something occurred to me:
If I have DataFrame 'A' (dfA) that has X number of partitions (numAPartitions), and I union that to DataFrame 'B' (dfB) which has Y number of partitions (numBPartitions), then what will the resultant unioned DataFrame (unionedDF) look like, with result to partitions?
// How many partitions will unionedDF have?
// X * Y ?
// Something else?
val unionedDF : DataFrame = dfA.unionAll(dfB)
To me, this seems like its very important to understand, seeing that Spark performance seems to rely heavily on the partitioning strategy employed by DataFrames. So if I'm unioning DataFrames left and right, I need to make sure I'm constantly managing the partitions of the resultant unioned DataFrames.
The only thing I can think of (so as to properly manage partitions of unioned DataFrames) would be to repartition them and then subsequently persist the DataFrames to memory/disk as soon as I union them:
val unionedDF : DataFrame = dfA.unionAll(dfB)
unionedDF.repartition(optimalNumberOfPartitions).persist(StorageLevel.MEMORY_AND_DISK)
This way, as soon as they are unioned, we repartition them so as to spread them over the available workers/executors properly, and then the persist(...) call tells to Spark to not evict the DataFrame from memory, so we can continue working on it.
The problem is, repartitioning sounds expensive, but it may not be as expensive as the alternative (not managing partitions at all). Are there generally-accepted guidelines about how to efficiently manage unions in Spark-land?
Yes, Partitions are important for spark.
I am wondering if you could find that out yourself by calling:
yourResultedRDD.getNumPartitions()
Do I have to persist, post union?
In general, you have to persist/cache an RDD (no matter if it is the result of a union, or a potato :) ), if you are going to use it multiple times. Doing so will prevent spark from fetching it again in memory and can increase the performance of your application by 15%, in some cases!
For example if you are going to use the resulted RDD just once, it would be safe not to do persist it.
Do I have to repartition?
Since you don't care about finding the number of partitions, you can read in my memoryOverhead issue in Spark
about how the number of partitions affects your application.
In general, the more partitions you have, the smaller the chunk of data every executor will process.
Recall that a worker can host multiple executors, you can think of it like the worker to be the machine/node of your cluster and the executor to be a process (executing in a core) that runs on that worker.
Isn't the Dataframe always in memory?
Not really. And that's something really lovely with spark, since when you handle bigdata you don't want unnecessary things to lie in the memory, since this will threaten the safety of your application.
A DataFrame can be stored in temporary files that spark creates for you, and is loaded in the memory of your application only when needed.
For more read: Should I always cache my RDD's and DataFrames?
Union just add up the number of partitions in dataframe 1 and dataframe 2. Both dataframe have same number of columns and same order to perform union operation. So no worries, if partition columns different in both the dataframes, there will be max m + n partitions.
You doesn't need to repartition your dataframe after join, my suggestion is to use coalesce in place of repartition, coalesce combine common partitions or merge some small partitions and avoid/reduce shuffling data within partitions.
If you cache/persist dataframe after each union, you will reduce performance and lineage is not break by cache/persist, in that case, garbage collection will clean cache/memory in case of some heavy memory intensive operation and recomputing will increase computation time for the same, may be this time partial computation is required for clear/removed data.
As spark transformation are lazy, i.e; unionAll is lazy operation and coalesce/repartition is also lazy operation and come in action at the time of first action, so try to coalesce unionall result after an interval like counter of 8 and reduce partition in resulting dataframe. Use checkpoints to break lineage and store data, if there is lots of memory intensive operation in your solution.

Spark map is only one task while it should be parallel (PySpark)

I have a RDD with around 7M entries with 10 normalized coordinates in each. I also have a number of centers and I'm trying to map every entry to the closest (Euclidean distance) center. The problem is that this only generates one task which means it is not parallelizing. This is the form:
def doSomething(point,centers):
for center in centers.value:
if(distance(point,center)<1):
return(center)
return(None)
preppedData.map(lambda x:doSomething(x,centers)).take(5)
The preppedData RDD is cached and already evaluated, the doSomething function is represented a lot easier than it actually is but it's the same principle. The centers is a list that has been broadcast. Why is this map only in one task?
Similar pieces of code in other projects just map to +- 100 tasks and get run on all the executors, this one is 1 task on 1 executor. My job has 8 executors with 8 GB and 2 cores per executor available.
This could be due to the conservative nature of the take() method.
See the code in RDD.scala.
What it does is first take the first partition of your RDD (if your RDD doesn't require a shuffle, this will require only one task) and if there are enough results in that one partition, it will return that. If there is not enough data in your partition, it will then grow the number of partitions it tries to take until it gets the required number of elements.
Since your RDD is already cached, and your operation is only a map function, as long as any of your RDDs have >5 rows, this will only ever require one task. More tasks would be unnecessary.
This code exists to avoid overloading the driver with too much data by fetching from all partitions at once for a small take.

collect RDD with buffer in pyspark

I would like a way to return rows from my RDD one at a time (or in small batches) so that I can collect the rows locally as I need them. My RDD is large enough that it cannot fit into memory on the name node, so running collect() would cause an error.
Is there a way to recreate the collect() operation but with a generator, so that rows from the RDD are passed into a buffer? Another option would be to take() 100000 rows at a time from a cached RDD, but I don't think take() allows you to specify a start position?
The best available option is to use RDD.toLocalIterator which collects only a single partition at the time. It creates a standard Python generator:
rdd = sc.parallelize(range(100000))
iterator = rdd.toLocalIterator()
type(iterator)
## generator
even = (x for x in iterator if not x % 2)
You can adjust amount of data collected in a single batch using a specific partitioner and adjusting a number of partitions.
Unfortunately it comes with a price. To collect small batches you have to start multiple Spark jobs and it is quite expensive. So generally speaking collecting an element at the time is not an option.

Resources