Optimizing reading from partitioned Parquet files in an S3 bucket - apache-spark

I have a large dataset in Parquet format (~1 TB in size) that is partitioned into a 2-level hierarchy: CLASS and DATE.
There are only 7 classes, but DATE keeps growing from 2020-01-01 onwards.
My data is partitioned by CLASS first and then by DATE, so the layout looks like:
CLASS1 --- DATE 1
       --- DATE 2
       --- ...
       --- DATE N
CLASS2 --- DATE 1
       --- DATE 2
       --- ...
       --- DATE N
I load my data by CLASS in a for loop. If I load the entire Parquet dataset at once, YARN kills the job because it overloads the executors' memory. But I have to load all the days, since I am doing a percentile calculation in my modeling. This method takes about 23 hours to complete.
However, if I repartition such that I only have the CLASS partition, the job takes about 10hrs.
Does having too many sub-partitions slow down the spark executor jobs?
I keep the partition hierarchy as CLASS -> DATE only because I need to append new data by DATE every day.
If having only 1 partition is more efficient, then I would have to repartition to just the CLASS partition every day after loading new data.
Could someone explain why having a single partition works faster? And if so, what would be the best method to partition the data on a daily basis by appending and without repartitioning the entire dataset?
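(For concreteness, the daily append under a CLASS-only layout would just be a partitioned write in append mode; a rough Scala sketch, where newDayDf is a placeholder for the DataFrame holding the new day's data and the path matches the layout above:)

// newDayDf: hypothetical DataFrame with the new day's rows; path is a placeholder
newDayDf.write
  .mode("append")
  .partitionBy("CLASS")
  .parquet("s3://bucket/file.parquet/")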
Thank You
EDIT:
I use a for loop over the file structure to iterate over the CLASS partitions, like so:
import s3fs

fs = s3fs.S3FileSystem(anon=False)
inpath = "s3://bucket/file.parquet/"
dirs = fs.ls(inpath)
for path in dirs:
    custom_path = 's3://' + path + '/'
    class_value = path.split('=')[1]  # 'class' is a reserved word in Python
    df = spark.read.parquet(custom_path)
    outpath = "s3://bucket/Output_" + class_value + ".parquet"
    # Perform calculations
    df.write.mode('overwrite').parquet(outpath)
The loaded df has all the dates for a given CLASS (e.g. CLASS=1). I then write the results as a separate Parquet file per CLASS, so that I end up with 7 Parquet files:
Output_1.parquet
Output_2.parquet
Output_3.parquet
Output_4.parquet
Output_5.parquet
Output_6.parquet
Output_7.parquet
Merging the 7 Parquet files into a single one afterwards is not a problem, since the resulting files are much smaller.

As an example, I have data partitioned by three columns: year, month, and id. The folder hierarchy is
year=2020/month=08/id=1/*.parquet
year=2020/month=08/id=2/*.parquet
year=2020/month=08/id=3/*.parquet
...
year=2020/month=09/id=1/*.parquet
year=2020/month=09/id=2/*.parquet
year=2020/month=09/id=3/*.parquet
and I can read the DataFrame by loading just the root path:
val df = spark.read.parquet("s3://mybucket/")
The partition columns are then automatically added to the DataFrame. You can now filter on the partition columns, for example
val df_filtered = df.filter("year = '2020' and month = '09'")
and whatever you do with df_filtered, Spark will read only the matching partitions (partition pruning).
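To confirm that the pruning happens, you can inspect the physical plan; the exact output varies by Spark version, but the partition predicates should appear under PartitionFilters in the scan node:

// The FileSourceScan node should list the year/month predicates as PartitionFilters,
// meaning only files under year=2020/month=09 are read.
df_filtered.explain(true)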
For your repeated processing, you can use Spark's FAIR scheduler. Add a fair.xml file under src/main/resources of your project with the following contents,
<?xml version="1.0"?>
<allocations>
  <pool name="fair">
    <schedulingMode>FAIR</schedulingMode>
    <weight>10</weight>
    <minShare>0</minShare>
  </pool>
</allocations>
and set the scheduler configuration when you build the Spark session; the pool itself is then selected per thread via a local property:
val spark = SparkSession.builder()
  .config("spark.scheduler.mode", "FAIR")
  .config("spark.scheduler.allocation.file", getClass.getResource("/fair.xml").getPath)
  .getOrCreate()

spark.sparkContext.setLocalProperty("spark.scheduler.pool", "fair")
Then you can run your jobs in parallel. Since you want to parallelize by CLASS, something like this works:
val classes = (1 to 7).par
val date = "2020-09-25"

classes.foreach { i =>
  val df_filtered = df.filter(s"CLASS == '$i' and DATE = '$date'")
  // Do your job
}
This runs the job concurrently for the different CLASS values.

Related

Spark: repartitionByRange creating multiple files

I am writing data as Parquet files as below -
df.repartitionByRange($"key", rand)
  .write
  .option("maxRecordsPerFile", 5000)
  .partitionBy("key")
  .parquet("somelocation")
I have used a string column (key) for partitioning, which is a city, as most of my filters are based on it.
Even after specifying maxRecordsPerFile, multiple small files (tens or hundreds of records each) are getting created within a single partition folder.
AFAIK, the use cases below may help solve your problem.
Terminology:
1. maxRecordsPerFile - limits the maximum number of records written per file.
2. repartitionByRange(10, $"id")
Signature: repartitionByRange(numPartitions: Int, partitionExprs: Column*)
This creates numPartitions memory partitions by range-splitting the rows on partitionExprs into roughly equal-sized groups of records. It is memory-level partitioning: it can improve compression when writing data out to disk, and in order to write data to disk properly you'll almost always need to repartition the data in memory first.
3. partitionBy("column to create directories for")
A DataFrameWriter method that specifies whether the data should be written to disk in folders. By default, Spark does not write data to disk in nested folders. This is disk-level partitioning.
case 1: input rows = 1000, repartition = 10, maxRecordsPerFile = inputRows/repartitionCount = 1000/10 = 100. This leads to 10 part-xxxxx files with an equal number of records (100 records in each file) within the disk-level partition directory (partitioncol=1).
import org.apache.spark.sql.functions.{col, lit, when}
val df=spark.range(1000)
val df1=df.withColumn("partitioncol",lit("1"))
df1.repartitionByRange(10, $"id").write.option("maxRecordsPerFile", 100).partitionBy("partitioncol").parquet("/FileStore/import-stage/all4")
case 2: input rows = 1000, repartition = 10, maxRecordsPerFile > inputRows/repartitionCount (here 1000). This again leads to 10 part-xxxxx files with an equal number of records (100 records in each file) within the disk-level partition directory (partitioncol=1).
import org.apache.spark.sql.functions.{col, lit, when}
val df=spark.range(1000)
val df1=df.withColumn("partitioncol",lit("1"))
df1.repartitionByRange(10, $"id").write.option("maxRecordsPerFile", 1000).partitionBy("partitioncol").parquet("/FileStore/import-stage/all4")
case 3: input rows = 1000, repartition = 10, maxRecordsPerFile < inputRows/repartitionCount (here 10). This leads to 100 part-xxxxx files with an equal number of records (10 records in each file) within the disk-level partition directory (partitioncol=1).
import org.apache.spark.sql.functions.{col, lit, when}
val df=spark.range(1000)
val df1=df.withColumn("partitioncol",lit("1"))
df1.repartitionByRange(10, $"id").write.option("maxRecordsPerFile", 10).partitionBy("partitioncol").parquet("/FileStore/import-stage/all4")
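Coming back to the question's write: a common way to reduce the number of small files per city folder is to make the in-memory partitioning line up with the on-disk partitioning, i.e. repartition by the same column you pass to partitionBy. This is a sketch, not a guaranteed single-file-per-folder recipe; a very skewed key can still hit the maxRecordsPerFile cap and produce several files.

import org.apache.spark.sql.functions.col

df.repartition(col("key"))            // one memory partition per city
  .write
  .option("maxRecordsPerFile", 5000)  // still caps any single file
  .partitionBy("key")
  .parquet("somelocation")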

How to handle small file problem in spark structured streaming?

I have a scenario in my project where I am reading Kafka topic messages using spark-sql-2.4.1. I am able to process the data using Structured Streaming. Once the data is received and processed, I need to save it into the respective Parquet files in an HDFS store.
I am able to store and read the Parquet files; I kept a trigger time of 15 seconds to 1 minute. These files are very small in size, which results in many files.
These Parquet files need to be read later by Hive queries.
So:
1) Does this strategy work in a production environment, or does it lead to a small file problem later?
2) What are the best practices to handle/design this kind of scenario, i.e. the industry standard?
3) How are these kinds of things generally handled in production?
Thank you.
I know this question is old. I had a similar problem and used a Spark Structured Streaming query listener to solve it.
My use case is fetching data from Kafka and storing it in HDFS with year, month, day and hour partitions.
The code below takes the previous hour's partition data, repartitions it, and overwrites the data in the existing partition.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.joda.time.{DateTime, DateTimeZone}

// Note: `Config`, `config.kafka(...)`, and the String/DateTime helpers used below
// (e.g. `currentTs.datetime` and `toDateTime("yyyy-MM-dd'T'HH:00:00")`) are the author's
// own classes/implicits and are not shown here; the same applies to iterating
// listLocatedStatus's RemoteIterator as if it were a Scala collection.

val session = SparkSession.builder().master("local[2]").enableHiveSupport().getOrCreate()
session.streams.addListener(AppListener(config, session))

class AppListener(config: Config, spark: SparkSession) extends StreamingQueryListener {

  override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = {}

  override def onQueryProgress(event: StreamingQueryListener.QueryProgressEvent): Unit = {
    // Runs after every micro-batch; merge the previous hour's small files.
    this.synchronized { AppListener.mergeFiles(event.progress.timestamp, spark, config) }
  }

  override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = {}
}

object AppListener {

  def mergeFiles(currentTs: String, spark: SparkSession, config: Config): Unit = {
    val configs = config.kafka(config.key.get)
    if (currentTs.datetime.isAfter(Processed.ts.plusMinutes(5))) {
      println(
        s"""
           |Current Timestamp : ${currentTs}
           |Merge Files       : ${Processed.ts.minusHours(1)}
           |""".stripMargin)

      val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
      val ts = Processed.ts.minusHours(1)
      val hdfsPath =
        s"${configs.hdfsLocation}/year=${ts.getYear}/month=${ts.getMonthOfYear}/day=${ts.getDayOfMonth}/hour=${ts.getHourOfDay}"
      val path = new Path(hdfsPath)

      if (fs.exists(path)) {
        // List the previous hour's data files, ignoring _SUCCESS markers.
        val hdfsFiles = fs.listLocatedStatus(path)
          .filter(lfs => lfs.isFile && !lfs.getPath.getName.contains("_SUCCESS"))
          .map(_.getPath).toList
        println(
          s"""
             |Total files in HDFS location : ${hdfsFiles.length}
             | ${hdfsFiles.length > 1}
             |""".stripMargin)

        if (hdfsFiles.length > 1) {
          println(
            s"""
               |Merge Small Files
               |==============================================
               |HDFS Path             : ${hdfsPath}
               |Total Available files : ${hdfsFiles.length}
               |Status                : Running
               |""".stripMargin)

          // Rewrite the partition as a single file: write a compacted copy to a temp
          // location, then overwrite the original partition from that copy.
          val df = spark.read.format(configs.writeFormat).load(hdfsPath).cache()
          df.repartition(1)
            .write
            .format(configs.writeFormat)
            .mode("overwrite")
            .save(s"/tmp${hdfsPath}")
          df.cache().unpersist()

          spark
            .read
            .format(configs.writeFormat)
            .load(s"/tmp${hdfsPath}")
            .write
            .format(configs.writeFormat)
            .mode("overwrite")
            .save(hdfsPath)

          // Advance the merge watermark to the next hour.
          Processed.ts = Processed.ts.plusHours(1).toDateTime("yyyy-MM-dd'T'HH:00:00")
          println(
            s"""
               |Merge Small Files
               |==============================================
               |HDFS Path   : ${hdfsPath}
               |Total files : ${hdfsFiles.length}
               |Status      : Completed
               |""".stripMargin)
        }
      }
    }
  }

  def apply(config: Config, spark: SparkSession): AppListener = new AppListener(config, spark)
}

object Processed {
  // Hour watermark of the last merged partition (UTC, truncated to the hour).
  var ts: DateTime = DateTime.now(DateTimeZone.forID("UTC")).toDateTime("yyyy-MM-dd'T'HH:00:00")
}
Sometimes the data is huge, and I have divided it into multiple files using the logic below. The file size will be around ~160 MB.
val bytes = spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes
val dataSizeMB = bytes.toLong / 1024.0 / 1024.0
// target roughly 160 MB per output file
val numPartitions = (dataSizeMB / 160).ceil.toInt
df.repartition(if (numPartitions == 0) 1 else numPartitions)
  .[...]
Edit-1
Using spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes we can get the size of the actual DataFrame once it is loaded into memory. For example, you can check the session below.
scala> val df = spark.read.format("orc").load("/tmp/srinivas/")
df: org.apache.spark.sql.DataFrame = [channelGrouping: string, clientId: string ... 75 more fields]
scala> import org.apache.commons.io.FileUtils
import org.apache.commons.io.FileUtils
scala> val bytes = spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes
bytes: BigInt = 763275709
scala> FileUtils.byteCountToDisplaySize(bytes.toLong)
res5: String = 727 MB
scala> import sys.process._
import sys.process._
scala> "hdfs dfs -ls -h /tmp/srinivas/".!
Found 2 items
-rw-r----- 3 svcmxns hdfs 0 2020-04-20 01:46 /tmp/srinivas/_SUCCESS
-rw-r----- 3 svcmxns hdfs 727.4 M 2020-04-20 01:46 /tmp/srinivas/part-00000-9d0b72ea-f617-4092-ae27-d36400c17917-c000.snappy.orc
res6: Int = 0
We had a similar problem, too. After a lot of Googling, it seemed the generally accepted way was to write another job that every so often aggregates the many small files and writes them elsewhere in larger, consolidated files. This is what we now do.
As an aside: there is a limit to what you can do here anyway as the more parallelism you have, the greater the number of files because each executor thread writes to its own file. They never write to a shared file. This appears to be the nature of the beast that is parallel processing.
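A minimal sketch of such a compaction job, assuming Parquet output and illustrative paths (the real partition layout and target file count depend on your data):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compact-small-files").getOrCreate()

// Hypothetical locations: a partition full of small files and a compacted target.
val src = "hdfs:///data/events/date=2020-09-25"
val dst = "hdfs:///data/events_compacted/date=2020-09-25"

val df = spark.read.parquet(src)
val targetFiles = 8   // pick based on total partition size / desired file size
df.coalesce(targetFiles).write.mode("overwrite").parquet(dst)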
This is a common burning question in Spark streaming, with no single fixed answer.
I took an unconventional approach based on the idea of append.
As you are using Spark 2.4.1, this solution will be helpful.
If append were supported by columnar file formats like Parquet or ORC, this would be much easier: new data could simply be appended to the same file, and the file would keep growing after every micro-batch.
However, as it is not supported, I took a versioning approach. After every micro-batch, the data is written under a version partition.
e.g.
/prod/mobility/cdr_data/date=01–01–2010/version=12345/file1.parquet
/prod/mobility/cdr_data/date=01–01–2010/version=23456/file1.parquet
What we can do is: in every micro-batch, read the old version's data, union it with the new streaming data, and write it again to the same path under a new version; then delete the old versions. This way, after every micro-batch there is a single version and a single file in every partition, and the file in each partition keeps growing.
Since a union of a streaming Dataset and a static Dataset isn't allowed, we can use the foreachBatch sink (available in Spark >= 2.4.0), which hands each micro-batch to us as a static Dataset.
I have described how to achieve this optimally in the link below; you might want to have a look.
https://medium.com/@kumar.rahul.nitk/solving-small-file-problem-in-spark-structured-streaming-a-versioning-approach-73a0153a0a
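A minimal sketch of that versioning idea with foreachBatch (the paths, helper name and single-date assumption are illustrative, not the article's exact code):

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SparkSession}

// dateDir would be something like /prod/mobility/cdr_data/date=01-01-2010
def writeNewVersion(spark: SparkSession, dateDir: String)(batch: DataFrame, batchId: Long): Unit = {
  val dir = new Path(dateDir)
  val fs  = dir.getFileSystem(spark.sparkContext.hadoopConfiguration)
  // Existing version directories, if any.
  val oldVersions =
    if (fs.exists(dir)) fs.listStatus(dir).filter(_.isDirectory).map(_.getPath).toSeq
    else Seq.empty[Path]
  // Union the previous version with this micro-batch (foreachBatch hands us a static Dataset).
  val merged = oldVersions.headOption
    .map(p => spark.read.parquet(p.toString).unionByName(batch))
    .getOrElse(batch)
  // Write one consolidated file under a fresh version, then drop the old versions.
  merged.coalesce(1).write.parquet(s"$dateDir/version=$batchId")
  oldVersions.foreach(p => fs.delete(p, true))
}

// Usage (illustrative):
// streamingDf.writeStream.foreachBatch(writeNewVersion(spark, "/prod/mobility/cdr_data/date=01-01-2010") _).start()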
You can set a trigger.
import org.apache.spark.sql.streaming.Trigger

df.writeStream
  .format("parquet")
  .option("checkpointLocation", "path/to/checkpoint/dir")
  .option("path", "path/to/destination/dir")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()
The larger the trigger interval, the larger the files.
Alternatively, you could run the job with a scheduler (e.g. Airflow) and a Trigger.Once() trigger, or better Trigger.AvailableNow(). That way the job runs only once per period and processes all the accumulated data, producing appropriately sized files.
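For the scheduled variant, the only change is the trigger (Trigger.AvailableNow() requires Spark 3.3+; on older versions use Trigger.Once()):

import org.apache.spark.sql.streaming.Trigger

df.writeStream
  .format("parquet")
  .option("checkpointLocation", "path/to/checkpoint/dir")
  .option("path", "path/to/destination/dir")
  .trigger(Trigger.AvailableNow())   // drain everything available, then stop
  .start()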

Filter JSON records to different datasets Spark-Java

I'm using Java-Spark.
I have the following Java records in rdd from Kafka (As string):
{"code":"123", "date":"14/07/2018",....}
{"code":"124", "date":"15/07/2018",....}
{"code":"123", "date":"15/07/2018",....}
{"code":"125", "date":"14/07/2018",....}
which I read into a Dataset as follows:
Dataset<Row> df = sparkSession.read().json(jsonSet);
Dataset<Row> dfSelect = df.select(cols);//Where cols is Column[]
I want to write the JSON records to different Hive tables and different partitions by mapping them to different Datasets.
Meaning that:
{"code":"123", "date":"14/07/2018",....} Write to HDFS dir -> /../table123/partition=14_07_2018
{"code":"124", "date":"15/07/2018",....} Write to HDFS dir -> /../table124/partition=15_07_2018
{"code":"123", "date":"15/07/2018",....} Write to HDFS dir -> /../table123/partition=15_07_2018
{"code":"125", "date":"14/07/2018",....} Write to HDFS dir -> /../table125/partition=14_07_2018
How can I map the JSONs by code and by date and then write them with:
dfSelectByTableAndDate123.write().format("parquet").mode("append").save(pathByTableAndDate);
dfSelectByTableAndDate124.write().format("parquet").mode("append").save(pathByTableAndDate);
dfSelectByTableAndDate125.write().format("parquet").mode("append").save(pathByTableAndDate);
Thanks
You can convert your JSON to typed objects and then group by date, which gives you the rows grouped by the same date. You can then write each group as you wish. Below is pseudocode in Scala:
case class MyType(code: String, date: String)

val newDs = df.as[MyType]            // requires import spark.implicits._
// Datasets have no reduceByKey; group by the fields you want to split on instead
val grouped = newDs.groupByKey(r => (r.code, r.date))
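A hedged sketch of one way to then write each code to its own table location, partitioned by date (the base path is an assumption; the real locations would come from your Hive metastore):

import org.apache.spark.sql.functions.col

val basePath = "/path/to/tables"   // hypothetical root; e.g. table123 lives under it

// One append per distinct code; partitionBy("date") creates the per-date subfolders.
val codes = df.select("code").distinct().collect().map(_.getString(0))
codes.foreach { code =>
  df.filter(col("code") === code)
    .write
    .format("parquet")
    .mode("append")
    .partitionBy("date")
    .save(s"$basePath/table$code")
}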

Spark bucketing read performance

Spark version - 2.2.1.
I've created a bucketed table with 64 buckets, and I'm executing an aggregation query: select t1.ifa, count(*) from $tblName t1 where t1.date_ = '2018-01-01' group by ifa. I can see 64 tasks in the Spark UI, which use just 4 executors (each executor has 16 cores) out of 20. Is there a way to scale out the number of tasks, or is that how bucketed queries are supposed to run (the number of running cores equal to the number of buckets)?
Here's the create table:
sql("""CREATE TABLE level_1 (
bundle string,
date_ date,
hour SMALLINT)
USING ORC
PARTITIONED BY (date_ , hour )
CLUSTERED BY (ifa)
SORTED BY (ifa)
INTO 64 BUCKETS
LOCATION 'XXX'""")
Here's the query:
sql(s"select t1.ifa,count(*) from $tblName t1 where t1.date_ = '2018-01-01' group by ifa").show
With bucketing, the number of tasks == the number of buckets, so you should be aware of how many cores/tasks you need or want to use, and then set that as the number of buckets.
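If you control how the table is produced from a DataFrame, the bucket count is fixed at write time, so you can size it to the read parallelism you want (a sketch using the question's columns; 256 is an arbitrary example, and bucketBy requires saveAsTable):

// Roughly equivalent to the CREATE TABLE above, but with a larger bucket count.
df.write
  .format("orc")
  .partitionBy("date_", "hour")
  .bucketBy(256, "ifa")
  .sortBy("ifa")
  .saveAsTable("level_1")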
num of tasks = num of buckets is probably the most important and under-discussed aspect of bucketing in Spark. Buckets (by default) are historically solely useful for creating "pre-shuffled" DataFrames which can optimize large joins. When you read a bucketed table, all of the files for each bucket are read by a single Spark task (30 buckets = 30 Spark tasks when reading the data), which allows the table to be joined to another table bucketed on the same columns into the same number of buckets. I find this behavior annoying and, as the user above mentioned, problematic for tables that may grow.
You might be asking yourself now, why and when in the world would I ever want to bucket, and will my real-world data really grow in exactly the same way over time? (You probably partitioned your big data by date, be honest.) In my experience you probably don't have a great use case for bucketing tables in the default Spark way. BUT ALL IS NOT LOST FOR BUCKETING!
Enter "bucket pruning". Bucket pruning only works when you bucket ONE column, but it is potentially your greatest friend in Spark since the advent of SparkSQL and DataFrames. It allows Spark to determine which files in your table contain specific values based on some filter in your query, which can MASSIVELY reduce the number of files Spark physically reads, resulting in hugely efficient and fast queries. (I've taken 2+ hour queries down to 2 minutes and 1/100th of the Spark workers.) But you probably don't care, because the buckets-to-tasks issue means your table will never "scale up" if you have too many files per bucket, per partition.
Enter Spark 3.2.0. There is a new feature that allows bucket pruning to stay active when you disable bucket-based reading, letting you distribute the Spark reads while still benefiting from the bucket pruning/scan. I also have a trick for doing this with Spark < 3.2, as follows.
(Note: the leaf scan for files with a vanilla spark.read on S3 adds overhead, but if your table is big it doesn't matter, because your bucket-optimized table will now be a distributed read across all your available Spark workers and will be scalable.)
val table = "ex_db.ex_tbl"
val target_partition = "2021-01-01"
val bucket_target = "valuex"
val bucket_col = "bucket_col"
val partition_col = "date"
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.execution.FileSourceScanExec
import org.apache.spark.sql.execution.datasources.{FileScanRDD,FilePartition}
val df = spark.table(tablename).where((col(partition_col)===lit(target_partition)) && (col(bucket_col)===lit(bucket_target)))
val sparkplan = df.queryExecution.executedPlan
val scan = sparkplan.collectFirst { case exec: FileSourceScanExec => exec }.get
val rdd = scan.inputRDDs.head.asInstanceOf[FileScanRDD]
val bucket_files = for
{ FilePartition(bucketId, files) <- rdd.filePartitions f <- files }
yield s"$f".replaceAll("path: ", "").split(",")(0)
val format = bucket_files(0).split("
.").last
val result_df = spark.read.option("mergeSchema", "False").format(format).load(bucket_files:_*).where(col(bucket_col) === lit(bucket_target))

Spark streaming data sharing between batches

Spark Streaming processes the data in micro-batches.
Each interval's data is processed in parallel using RDDs, without any data sharing between intervals.
But my use case needs to share data between intervals.
Consider the Network WordCount example, which produces the count of all words received in that interval.
How would I produce the following word count?
Relative count for the words "hadoop" and "spark", relative to the previous interval's count.
Normal word count for all other words.
Note: updateStateByKey does stateful processing, but it applies the function to every record rather than only to particular records.
So updateStateByKey doesn't fit this requirement.
Update:
consider the following example
Interval-1
Input:
Sample Input with Hadoop and Spark on Hadoop
output:
hadoop 2
sample 1
input 1
with 1
and 1
spark 1
on 1
Interval-2
Input:
Another Sample Input with Hadoop and Spark on Hadoop and another hadoop another spark spark
output:
another 3
hadoop 1
spark 2
and 2
sample 1
input 1
with 1
on 1
Explanation:
1st interval gives the normal word count of all words.
In the 2nd interval hadoop occurred 3 times but the output should be 1 (3-2)
Spark occurred 3 times but the output should be 2 (3-1)
For all other words it should give the normal word count.
So, while processing 2nd Interval data, it should have the 1st interval's word count of hadoop and spark
This is a simple example for illustration.
In the actual use case, the fields that need data sharing are part of the RDD elements, and a huge number of values needs to be tracked.
That is, instead of just the hadoop and spark keywords in this example, nearly 100k keywords have to be tracked.
Similar use cases in Apache Storm:
Distributed caching in storm
Storm TransactionalWords
This is possible by "remembering" the last RDD received and using a left join to merge that data with the next streaming batch. We make use of streamingContext.remember to enable RDDs produced by the streaming process to be kept for the time we need them.
We make use of the fact that dstream.transform is an operation that executes on the driver and therefore we have access to all local object definitions. In particular we want to update the mutable reference to the last RDD with the required value on each batch.
A piece of code probably makes the idea clearer:
// configure the streaming context to remember the RDDs produced
// choose at least 2x the time of the streaming interval
ssc.remember(Seconds(xx))  // xx is a placeholder for your chosen duration

// Initialize "currentData" with an empty RDD of the expected type
var currentData: RDD[(String, Int)] = sparkContext.emptyRDD

// classic word count
val w1dstream = dstream.map(elem => (elem, 1))
val count = w1dstream.reduceByKey(_ + _)

// Here's the key to make this work: look how we update the value of the last RDD after using it.
val diffCount = count.transform { rdd =>
  val interestingKeys = Set("hadoop", "spark")
  val interesting = rdd.filter { case (k, v) => interestingKeys(k) }
  // left join against the previous batch's counts; words never seen before keep their plain count
  val countDiff = rdd.leftOuterJoin(currentData).map { case (k, (v1, v2)) => (k, v1 - v2.getOrElse(0)) }
  currentData = interesting
  countDiff
}

diffCount.print()
diffCount.print()
