Spark writing data back to HDFS


I have a question about how Spark writes results after a computation. I know that each executor writes its result back to HDFS or the local filesystem (depending on the cluster manager used) after it finishes working on its partitions.
This makes sense, because there is no need to wait for all executors to complete before writing if the results don't need to be aggregated.
But how does the write operation work when the data needs to be sorted on a particular column (e.g. ID) in ascending or descending order?
Will Spark's logical plan sort the partitions by ID at each executor before the computation even begins? In that case, any executor could finish first and start writing its result to HDFS, so how does the framework make sure that the final result is sorted?
Thanks in advance

From what I understood from this answer: https://stackoverflow.com/a/32888236/1206998, sorting is a process that shuffles all dataset items into "sorted" partitions using a RangePartitioner: the "boundaries" between partitions are items selected as percentiles of a sample of the dataset.
So something like:
collect a sample set
sort the sampled items
select the items at positions k*i (k = 1, 2, ...), where i is the sample size divided by the number of output partitions
broadcast those boundaries
on every input partition, for every item, find which output partition the item should go to by comparing it with the broadcast boundaries
send/shuffle the data into those output partitions
sort the items inside each partition
If we have the dataset [1,5,6,8,10,20,100] (distributed and in any order) and sort it into 3 partitions, that would give:
partition 1 = [1,5,6] (sorted within partition)
partition 2 = [8,10] ( " )
partition 3 = [20,100] ( " )
And thus, any later operations can be done on each partition independently, including writing.
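The steps above can be sketched in plain Python (no Spark involved). This is only an illustration of the idea, not Spark's actual RangePartitioner: the function names are hypothetical, and the whole dataset is used as the "sample" so that the result matches the example.

```python
from bisect import bisect_left

def range_boundaries(sample, num_partitions):
    """Pick num_partitions - 1 boundary items from a sorted sample."""
    ordered = sorted(sample)
    return [ordered[(k * len(ordered)) // num_partitions]
            for k in range(1, num_partitions)]

def range_partition(items, boundaries):
    """Route each item to the partition whose range contains it, then sort locally."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for x in items:
        # Items less than or equal to a boundary stay on its left side.
        partitions[bisect_left(boundaries, x)].append(x)
    return [sorted(p) for p in partitions]

data = [100, 8, 1, 20, 6, 10, 5]          # same items as above, in arbitrary order
boundaries = range_boundaries(data, 3)     # the "broadcast" boundaries
partitions = range_partition(data, boundaries)
print(boundaries)   # [6, 10]
print(partitions)   # [[1, 5, 6], [8, 10], [20, 100]]
```

Because every item in partition 1 is smaller than every item in partition 2, and so on, each partition can be written as soon as it is ready and the files are still globally ordered when read in partition order.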
Keep in mind that:
Spark manages data in memory and, depending on the configuration, writes partition data locally.
The write is done per partition, but the output files (in distributed filesystems like HDFS) are hidden until all the data is written. At least that is the case for the Parquet writer; I am not sure about other writers.
As you can expect, sorting is an expensive operation.

Related

How does Spark decide the partitions number of the next stage when shuffle in SparkSQL?

Of course I know about the spark.sql.shuffle.partitions config,
but for example, when I set this config to 300 on a small dataset that has just 200 rows, the config does not take effect and the actual partition number is just 2.
Another example: I set this config to 3000 on a dataset that has 30 billion rows, and the config does not take effect either; the actual partition number is just 600.
We see that when we set a large partitions value on a small dataset, the config does not take effect.
So I just want to know: how does Spark decide the number of partitions of the next stage when shuffling in Spark SQL? Or how can I force this config to take effect?
My Spark SQL is just like below:
set spark.sql.shuffle.partitions=3000;
with base_data as (
select
device_id
from
table_name
where
dt = '20210621'
distribute by
rand()
)
select count(1) from base_data

In general, a narrow transformation does not change the number of partitions.
A wide transformation can change the number of partitions.
Narrow transformation: all the elements required to compute the records in a single partition live in a single partition of the parent RDD. Only a limited subset of partitions is used to calculate the result. Narrow transformations are the result of operations such as map() and filter().
Wide transformation: the elements required to compute the records in a single partition may live in many partitions of the parent RDD. Wide transformations are the result of operations such as groupByKey() and reduceByKey().
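To make the difference concrete, here is a small plain-Python sketch (not Spark code) that models an RDD as a list of partitions. The function names are made up for the illustration, and summing the key's bytes is only a deterministic stand-in for a real hash function.

```python
from collections import defaultdict

def narrow_map(partitions, fn):
    """Narrow: each output partition is computed from exactly one parent partition."""
    return [[fn(rec) for rec in part] for part in partitions]

def wide_group_by_key(partitions, num_out):
    """Wide: records are routed by key, so an output partition may read many parents."""
    out = [defaultdict(list) for _ in range(num_out)]
    parents = [set() for _ in range(num_out)]
    for parent_idx, part in enumerate(partitions):
        for key, value in part:
            target = sum(key.encode()) % num_out   # stand-in for hash(key) % num_out
            out[target][key].append(value)
            parents[target].add(parent_idx)
    return [dict(d) for d in out], parents

parent_rdd = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)], [("b", 5)]]

mapped = narrow_map(parent_rdd, lambda kv: (kv[0], kv[1] * 10))
grouped, parents = wide_group_by_key(parent_rdd, 2)

print(len(mapped))   # 3: same number of partitions as the parent
print(grouped)       # [{'b': [2, 5]}, {'a': [1, 3], 'c': [4]}]
print(parents)       # [{0, 2}, {0, 1}]: each output partition read from two parents
```

In this toy model the number of output partitions of the wide step is simply the num_out argument, which is the role a setting such as spark.sql.shuffle.partitions plays for Spark SQL shuffles.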
Update after question change:
You can think of "spark.sql.shuffle.partitions" as a query hint that tells the executors how many partitions to create for joins or aggregations. In my view, we should not play with this value unless we are very sure what the number of grouping keys will be.
Otherwise it will cause unnecessary shuffling of data over the network.

Splitting spark data into partitions and writing those partitions to disk in parallel

Problem outline: Say I have 300+ GB of data being processed with spark on an EMR cluster in AWS. This data has three attributes used to partition on the filesystem for use in Hive: date, hour, and (let's say) anotherAttr. I want to write this data to a fs in such a way that minimizes the number of files written.
What I'm doing right now is getting the distinct combinations of date, hour, anotherAttr, and a count of how many rows make up each combination. I collect them into a List on the driver, and iterate over the list, building a new DataFrame for each combination, repartitioning that DataFrame using the number of rows to guestimate file size, and writing the files to disk with DataFrameWriter, .orc finishing it off.
We aren't using Parquet for organizational reasons.
This method works reasonably well, and solves the problem that downstream teams using Hive instead of Spark don't see performance issues resulting from a high number of files. For example, if I take the whole 300 GB DataFrame, do a repartition with 1000 partitions (in spark) and the relevant columns, and dump it to disk, it all dumps in parallel, and the whole thing finishes in ~9 min. But that gets up to 1000 files for the larger partitions, and that destroys Hive performance. Or it destroys some kind of performance, honestly not 100% sure what. I've just been asked to keep the file count as low as possible. With the method I'm using, I can keep the files to whatever size I want (relatively close anyway), but there is no parallelism and it takes ~45 min to run, mostly waiting on file writes.
It seems to me that since there's a 1-to-1 relationship between some source row and some destination row, and that since I can organize the data into non-overlapping "folders" (partitions for Hive), I should be able to organize my code/DataFrames in such a way that I can ask spark to write all the destination files in parallel. Does anyone have suggestions for how to attack this?
Things I've tested that did not work:
Using a scala parallel collection to kick off the writes. Whatever spark was doing with the DataFrames, it didn't separate out the tasks very well and some machines were getting massive garbage collection problems.
DataFrame.map - I tried to map across a DataFrame of the unique combinations, and kickoff writes from inside there, but there's no access to the DataFrame of the data that I actually need from within that map - the DataFrame reference is null on the executor.
DataFrame.mapPartitions - a non-starter, couldn't come up with any ideas for doing what I want from inside mapPartitions
The word 'partition' is also not especially helpful here because it refers both to the concept of spark splitting up the data by some criteria, and to the way that the data will be organized on disk for Hive. I think I was pretty clear in the usages above. So if I'm imagining a perfect solution to this problem, it's that I can create one DataFrame that has 1000 partitions based on the three attributes for fast querying, then from that create another collection of DataFrames, each one having exactly one unique combination of those attributes, repartitioned (in spark, but for Hive) with the number of partitions appropriate to the size of the data it contains. Most of the DataFrames will have 1 partition, a few will have up to 10. The files should be ~3 GB, and our EMR cluster has more RAM than that for each executor, so we shouldn't see a performance hit from these "large" partitions.
Once that list of DataFrames is created and each one is repartitioned, I could ask spark to write them all to disk in parallel.
Is something like this possible in spark?
One thing I'm conceptually unclear on: say I have
val x = spark.sql("select * from source")
and
val y = x.where(s"date=$date and hour=$hour and anotherAttr=$anotherAttr")
and
val z = x.where(s"date=$date and hour=$hour and anotherAttr=$anotherAttr2")
To what extent is y a different DataFrame than z? If I repartition y, what effect does the shuffle have on z, and on x for that matter?

We had (almost) the same problem, and we ended up working directly with RDDs (instead of DataFrames) and implementing our own partitioning mechanism (by extending org.apache.spark.Partitioner).
Details: we are reading JSON messages from Kafka. The JSON should be grouped by customerid/date/more fields and written in Hadoop using Parquet format, without creating too many small files.
The steps are (simplified version):
a) Read the messages from Kafka and transform them into an RDD[(GroupBy, Message)]. GroupBy is a case class containing all the fields that are used for grouping.
b) Use reduceByKeyLocally to obtain a map of metrics (number of messages, message sizes, etc.) for each group, e.g. Map[GroupBy, GroupByMetrics].
c) Create a GroupPartitioner that uses the previously collected metrics (and some input parameters, like the desired Parquet size) to compute how many partitions should be created for each GroupBy object. Basically we are extending org.apache.spark.Partitioner and overriding numPartitions and getPartition(key: Any).
d) Partition the RDD from a) using the previously defined partitioner: newPartitionedRdd = rdd.partitionBy(ourCustomGroupByPartitioner)
e) Invoke spark.sparkContext.runJob with two parameters: the first is the RDD partitioned in d), the second is a custom function (func: (TaskContext, Iterator[T]) => U) that writes the messages taken from the Iterator[T] to Hadoop/Parquet.
Let's say that we have 100 mil messages, grouped like this:
Group1 - 2 mil
Group2 - 80 mil
Group3 - 18 mil
and we decided that we have to use 1.5 mil messages per partition to obtain Parquet files greater than 500MB. We'll end up with 2 partitions for Group1, 54 for Group2, 12 for Group3.
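The arithmetic behind those numbers can be sketched in plain Python. This is not the poster's GroupPartitioner: the function names are hypothetical, and spreading a group's messages over its slice with a per-message integer id is an assumption made only for this illustration (the answer does not say how messages are distributed within a group).

```python
def partitions_per_group(group_counts, max_per_partition):
    """Ceiling division: enough partitions so none exceeds max_per_partition."""
    return {g: (n + max_per_partition - 1) // max_per_partition
            for g, n in group_counts.items()}

def build_offsets(per_group):
    """Give each group a contiguous slice of partition indices."""
    offsets, start = {}, 0
    for group, n in per_group.items():
        offsets[group] = start
        start += n
    return offsets, start          # start is now the total number of partitions

def get_partition(group, message_id, per_group, offsets):
    """Assumed scheme: spread a group's messages over its own slice by message id."""
    return offsets[group] + message_id % per_group[group]

counts = {"Group1": 2_000_000, "Group2": 80_000_000, "Group3": 18_000_000}
per_group = partitions_per_group(counts, 1_500_000)
offsets, num_partitions = build_offsets(per_group)

print(per_group)        # {'Group1': 2, 'Group2': 54, 'Group3': 12}
print(num_partitions)   # 68
print(get_partition("Group2", 7, per_group, offsets))   # 9
```

With this layout every partition holds messages from exactly one group, so the write function in step e) can produce one file per partition in the right folder.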

This statement:
I collect them into a List on the driver, and iterate over the list,
building a new DataFrame for each combination, repartitioning that
DataFrame using the number of rows to guestimate file size, and
writing the files to disk with DataFrameWriter, .orc finishing it off.
is completely off-beam where Spark is concerned. Collecting to the driver is never a good approach: it brings volume and OOM issues, and the latency of your approach is high.
Use the below to simplify and get the benefits of Spark's parallelism, saving time and money for your boss:
df.repartition(cols...)...write.partitionBy(cols...)...
The shuffle occurs via repartition; there is never any shuffling with partitionBy.
It is that simple, with Spark's default parallelism utilized.
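Why the repartition step matters for the file count can be shown with a plain-Python model (not Spark code). It assumes that each write task produces one file per output folder it holds rows for; the helper names, the sample rows, and the use of zlib.crc32 as the hash are all choices made for this sketch.

```python
import zlib
from collections import defaultdict

def task_for(row, cols, num_tasks):
    """Model of repartition(cols): rows with equal column values share a task."""
    key = "|".join(str(row[c]) for c in cols)
    return zlib.crc32(key.encode()) % num_tasks

def files_written(rows_by_task, cols):
    """Model of write.partitionBy(cols): one file per (folder, task) pair."""
    files = defaultdict(set)
    for task, rows in rows_by_task.items():
        for row in rows:
            folder = "/".join(f"{c}={row[c]}" for c in cols)
            files[folder].add(task)
    return files

cols = ["date", "hour"]
rows = [{"date": "2021-06-21", "hour": i % 3, "id": i} for i in range(12)]

# Without repartition(cols): rows are spread over 4 tasks regardless of folder.
spread = defaultdict(list)
for i, row in enumerate(rows):
    spread[i % 4].append(row)

# With repartition(cols): every row of a folder lands in the same task.
clustered = defaultdict(list)
for row in rows:
    clustered[task_for(row, cols, 4)].append(row)

print({f: len(t) for f, t in files_written(spread, cols).items()})     # 4 files per folder
print({f: len(t) for f, t in files_written(clustered, cols).items()})  # 1 file per folder
```

Note that in this model repartitioning by the same columns yields exactly one file per folder, which for the largest combinations may be bigger than the ~3 GB target mentioned in the question.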

Spark coalescing on the number of objects in each partition

We are starting to experiment with spark on our team.
After we do a reduce job in Spark, we would like to write the result to S3; however, we would like to avoid collecting the Spark result.
For now, we are writing the files in foreachPartition of the RDD, but this results in a lot of small files. We would like to be able to aggregate the data into a couple of files partitioned by the number of objects written to each file.
So for example, our total data is 1M objects (this is constant), we would like to produce files of 400K objects, and our current partitions produce files of around 20K objects (this varies a lot for each job). Ideally we want to produce 3 files containing 400K, 400K and 200K objects, instead of 50 files of 20K objects.
Does anyone have a good suggestion?
My thought process is to let each partition work out which file it should write to, by assuming that each partition will produce roughly the same number of objects.
So for example, partition 0 will write to the first file, while partition 21 will write to the second file, since it will assume that its starting object index is 20000 * 21 = 420000, which is bigger than the file size.
Partition 41 will write to the third file, since its starting index is bigger than 2 * the file size limit.
This will not always result in a perfect 400K file size, though; it is more of an approximation.
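The index arithmetic described above can be written down as a short plain-Python sketch. The function name and default values are taken from the numbers in this example only; it assumes every partition holds exactly 20K objects, which the question itself says is only approximately true.

```python
from collections import Counter

def target_file(partition_index, objects_per_partition=20_000, objects_per_file=400_000):
    """File index a partition writes to, from its assumed starting object index."""
    start_index = partition_index * objects_per_partition
    return start_index // objects_per_file

print(target_file(0))    # 0 -> first file
print(target_file(21))   # 1 -> second file (start index 420000)
print(target_file(41))   # 2 -> third file (start index 820000)

# 50 partitions of 20K objects each end up in three files.
per_file = Counter(target_file(p) for p in range(50))
print({f: n * 20_000 for f, n in sorted(per_file.items())})
# {0: 400000, 1: 400000, 2: 200000}
```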
I understand that there is coalesce, but as I understand it, coalesce reduces the number of partitions to a requested number of partitions. What I want is to coalesce the data based on the number of objects in each partition. Is there a good way to do it?

What you want to do is repartition the data into three partitions; the data will be split into approximately 333k records per partition. The split is approximate; it will not be exactly 333,333 per partition. I do not know of a way to get the 400k/400k/200k split you want.
If you have a DataFrame df, you can repartition it into n partitions as
df.repartition(n)
Since you want a maximum number of records per partition, I would recommend this (you don't specify Scala or pyspark, so I'm going with Scala; you can do the same in pyspark):
val maxRecordsPerPartition = ???
val numPartitions = (df.count() / maxRecordsPerPartition).toInt + 1
df
.repartition(numPartitions)
.write
.format("json")
.save("/path/file_name.json")
This should keep your partitions at or below roughly maxRecordsPerPartition, since repartition spreads the rows approximately evenly.
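The partition-count arithmetic can be checked with a few lines of plain Python. The helper names are hypothetical, and the perfectly even split is an idealization of what repartition does. The sketch uses ceiling division, which differs from the count / max + 1 formula above only when the count is an exact multiple of the maximum (that formula then yields one extra partition).

```python
def num_partitions(total_records, max_records_per_partition):
    """Smallest partition count that keeps every partition at or under the maximum."""
    return -(-total_records // max_records_per_partition)   # ceiling division

def even_split(total_records, n):
    """Idealized sizes if the records were spread perfectly evenly over n partitions."""
    base, extra = divmod(total_records, n)
    return [base + (1 if i < extra else 0) for i in range(n)]

n = num_partitions(1_000_000, 400_000)
sizes = even_split(1_000_000, n)
print(n)       # 3
print(sizes)   # [333334, 333333, 333333]
print(num_partitions(800_000, 400_000))   # 2, whereas 800000 / 400000 + 1 gives 3
```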

We have decided to just go with the number of files being generated, and just make sure that each file contains less than 1 million line items.

Spark containers killed by YARN during group by

I have a data set extracted from HBase, which is a long form of a wide table, i.e. it has rowKey, columnQualifier and value columns. To get a form of pivot, I need to group by rowKey, which is a string UUID, into a collection and make an object out of the collection. The problem is that the only group-by I manage to perform is counting the number of elements in the groups; other group-bys fail because the container is killed for exceeding the YARN container memory limits. I experimented a lot with the memory sizes, including overhead, partitioning with and without sorting, etc. I even went to a high number of partitions, i.e. about 10,000, but the job dies all the same. I tried both DataFrame groupBy with collect_list, and Dataset groupByKey with mapGroups.
The code works on a small data set but not on the larger one. The data set is about 500 GB in Parquet files. The data is not skewed, as the largest group in the group-by has only 50 elements. Thus, as far as I know, the partitions should easily fit in memory, since the aggregated data per rowKey is not really large. The data keys and values are mostly strings and they are not long.
I am using Spark 2.0.2; the above computations were all done in Scala.

You're probably running into the dreaded groupByKey shuffle. Please read this Databricks article on avoiding groupByKey, which details the underlying differences between groupByKey and reduceByKey.
If you don't want to read the article, the short story is this: though groupByKey and reduceByKey can produce the same results, groupByKey triggers a shuffle of ALL the data, while reduceByKey tries to minimize the data shuffled by reducing within each partition first. It is a bit like MapReduce Combiners, if you're familiar with that concept.
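The difference in shuffle volume can be illustrated with a plain-Python model (not Spark code) that counts how many records would cross the network in each case. The function names are made up for the sketch, and a per-key sum is used as the reduction.

```python
from collections import Counter, defaultdict

def sum_by_key(records):
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)

def group_by_key_style(partitions):
    """Every (key, value) record is shuffled; values are combined only afterwards."""
    shuffled = [rec for part in partitions for rec in part]
    return sum_by_key(shuffled), len(shuffled)

def reduce_by_key_style(partitions):
    """Values are combined inside each partition first, then one record per key is shuffled."""
    shuffled = []
    for part in partitions:
        local = Counter()
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())
    return sum_by_key(shuffled), len(shuffled)

partitions = [[("a", 1), ("a", 1), ("b", 1)],
              [("a", 1), ("b", 1), ("b", 1)]]

print(group_by_key_style(partitions))    # ({'a': 3, 'b': 3}, 6)
print(reduce_by_key_style(partitions))   # ({'a': 3, 'b': 3}, 4)
```

The saving only exists when the values for a key can be merged into something smaller, as with a sum or a count. If the goal is to collect all values of a key into a list, the same amount of data has to move either way.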

Understanding Shuffle and rePartitioning in spark

I would greatly appreciate it if someone could answer these few Spark shuffle related questions in simplified terms.
In Spark, when loading a data set, we specify the number of partitions, which tells how many blocks the input data (RDD) should be divided into, and based on the number of partitions, an equal number of tasks is launched (correct me if the assumption is wrong). For X cores in a worker node, a corresponding X tasks run at one time.
Along similar lines, here are the few questions.
Given that all byKey operations, along with coalesce, repartition, join and cogroup, cause a data shuffle:
Is data shuffle another name for the repartitioning operation?
What happens to the initial partitions (the number of partitions declared) when repartitioning happens?
Can someone give an example (explain) of how data movement across the cluster happens? I have seen a couple of examples where random arrow movement of keys is shown (but I don't know how the movement is being driven). For example, if we already have data in 10 partitions, does the repartitioning operation combine all the data first, and then send each key to a particular partition based on hash-code % numberofpartitions?

First of all, the data in HDFS blocks is divided into a number of partitions, not into blocks. These partitions reside in the worker memory.
Q- Is data shuffle another name for the repartitioning operation?
A- No. Repartitioning means changing the number of partitions that the data is divided into, and a shuffle is the mechanism that moves records between partitions. repartition(n) always performs a shuffle to redistribute the data across the n new partitions set in code. Shuffles also happen without an explicit repartition, for example when a byKey operation moves all the data for a particular key into one partition.
Q- What happens to the initial partitions (the number of partitions declared) when repartitioning happens?
A- Covered above
One more underlying point: rdd.repartition(n) does not change the number of partitions of rdd itself. It is a transformation, so it only takes effect through the new RDD it returns, like
rdd1 = rdd.repartition(n)
This creates a new rdd1 that has n partitions. rdd.coalesce(n) works the same way: it is also a transformation that returns a new RDD rather than changing rdd itself, and when reducing the number of partitions it can avoid a full shuffle.
Q- Can someone give an example (explain) of how data movement across the cluster happens? I have seen a couple of examples where random arrow movement of keys is shown (but I don't know how the movement is being driven). For example, if we already have data in 10 partitions, does the repartitioning operation combine all the data first, and then send each key to a particular partition based on hash-code % numberofpartitions?
Ans- Partition and partitioning are two different concepts. A partition is a chunk of the data, which is divided roughly evenly into the number of partitions set by the user. Partitioning is the scheme by which data is shuffled among those partitions, according to an algorithm chosen by the user such as HashPartitioner or RangePartitioner.
Like:
rdd = sc.textFile("../path", 5)
rdd.partitions.size (or rdd.partitions.length)
O/p: Int: 5 (no. of partitions)
rdd.partitioner.isDefined
O/p: Boolean = false
rdd.partitioner
O/p: None (partitioning scheme)
But (partitionBy needs key-value pairs, so the lines are mapped to pairs first):
rdd = sc.textFile("../path", 5).map(line => (line, 1)).partitionBy(new org.apache.spark.HashPartitioner(10)).cache()
rdd.partitions.size
O/p: Int: 10
rdd.partitioner.isDefined
O/p: Boolean = true
rdd.partitioner
O/p: Some(HashPartitioner)
Hope this will help!!!
