Spark creating a huge number of tasks when reading from parquet files - apache-spark

I'm seeing a very high task count on Spark queries that read from small, partitioned parquet data.
I'm trying to query a table that is stored in an S3 bucket in parquet snappy file format. The table is partitioned by date/hour (one partition example: '2021/01/01 10:00:00'). Each partition contains 15-18 files with a size between 30 and 70 kB.
A simple count by partition on 1 year of data is computed using almost 20,000 tasks. My concern is why Spark creates so many tasks to read such a small amount of data. Is there any mechanism to make each task read all the content of a single partition? I think it would be more efficient than 15 tasks reading 30 kB of data each.
spark.sql.("select count(1), date_hour from forecast.hourly_data where date_hour between '2021_01_01-00' and '2022_01_01-00' group by date_hour")
[Stage 0:> (214 + 20) / 19123]
My Spark version is 2.4.7 and the configuration is left at its defaults.

The number of tasks is based on the number of files that you are reading in. You can repartition after reading in the data.
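For example, a minimal sketch (Scala, Spark 2.x) of that idea, not the asker's exact job: collapse the many small file splits into fewer partitions right after the read. The table name and filter come from the question; the target of 200 partitions is purely illustrative.

// coalesce merges the planned small file splits into ~200 read tasks without a shuffle,
// so each task reads many of the tiny files instead of one split each
val df = spark.table("forecast.hourly_data")
  .where("date_hour between '2021_01_01-00' and '2022_01_01-00'")
  .coalesce(200)  // illustrative target; tune to the cluster

df.groupBy("date_hour").count().show()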

Related

How does Spark (2.3 or a newer version) determine the number of tasks to read Hive table files in a gs bucket or hdfs?

Input Data:
a hive table (T) with 35 files (~1.5GB each, SequenceFile)
files are in a gs bucket
default fs.gs.block.size=~128MB
all other parameters are default
Experiment 1:
create a dataproc with 2 workers (4 core per worker)
run select count(*) from T;
Experiment 1 Result:
~650 tasks created to read the hive table files
each task read ~85MB data
Experiment 2:
create a dataproc with 64 workers (4 core per worker)
run select count(*) from T;
Experiment 2 Result:
~24,480 tasks created to read the hive table files
each task read ~2.5MB data
(It seems to me that having one task read 2.5 MB of data is not a good idea, as the time to open the file would probably be longer than the time to read 2.5 MB.)
Q1: Any idea how spark determines the number of tasks to read hive table data files?
I repeated the same experiments by putting the same data in hdfs and I got similar results.
My understanding is that the number of tasks to read hive table files should be the same as the number of blocks in hdfs. Q2: Is that correct? Q3: Is that also correct when data is in gs bucket (instead of hdfs)?
Thanks in advance!
The number of tasks in one stage is equal to the number of partitions of the input data, which is in turn determined by the data size and the related configs (dfs.blocksize (HDFS), fs.gs.block.size (GCS), mapreduce.input.fileinputformat.split.minsize, mapreduce.input.fileinputformat.split.maxsize). For a complex query that involves multiple stages, the total task count is the sum of the number of tasks of all stages.
There is no difference between HDFS and GCS, except they use different configs for block size, dfs.blocksize vs fs.gs.block.size.
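As a rough sketch (Scala; hedged, since behaviour can vary with the table format and the Hive integration): the Hadoop FileInputFormat split size is roughly max(minSize, min(maxSize, blockSize)), so raising the minimum split size is one way to force fewer, larger read tasks for a table like T above.

// Sketch: force each input split to be at least 256 MB, so far fewer tasks are
// created for the SequenceFile-backed table T from the experiments above.
spark.sparkContext.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.minsize", 256L * 1024 * 1024)

spark.sql("select count(*) from T").show()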
See the following related questions:
How are stages split into tasks in Spark?
How does Spark SQL decide the number of partitions it will use when loading data from a Hive table?

How Spark SQL reads Parquet partitioned files

I have a parquet file of around 1 GB. Each data record is a reading from an IOT device which captures the energy consumed by the device in the last one minute.
Schema: houseId, deviceId, energy
The parquet file is partitioned on houseId and deviceId. A file contains the data for the last 24 hours only.
I want to execute some queries on the data residing in this parquet file using Spark SQL. An example query finds the average energy consumed per device for a given house in the last 24 hours.
Dataset<Row> df4 = ss.read().parquet("/readings.parquet");
df4.as(encoder).registerTempTable("deviceReadings");
ss.sql("Select avg(energy) from deviceReadings where houseId=3123).show();
The above code works well. I want to understand how Spark executes this query.
Does Spark read the whole Parquet file in memory from HDFS without looking at the query? (I don't believe this to be the case)
Does Spark load only the required partitions from HDFS as per the query?
What if there are multiple queries that need to be executed? Will Spark look at multiple queries while preparing an execution plan? One query may work with just one partition whereas the second query may need all the partitions, so a consolidated plan would load the whole file from disk into memory (if memory limits allow this).
Will it make a difference in execution time if I cache df4 dataframe above?
Does Spark read the whole Parquet file in memory from HDFS without looking at the query?
It shouldn't scan all the data files, but in general it might access the metadata of all files.
Does Spark load only the required partitions from HDFS as per the query?
Yes, it does.
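A quick way to confirm the pruning (a Scala sketch using the path and column from the question) is to look at the physical plan, where the scan node should list the pushed-down partition filters:

// Sketch: the scan's PartitionFilters entry should contain houseId = 3123,
// i.e. only the matching partition directories are read.
val readings = spark.read.parquet("/readings.parquet")
readings.where("houseId = 3123").explain()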
Will Spark look at multiple queries while preparing an execution plan?
It does not. Each query has its own execution plan.
Will it make a difference in execution time if I cache df4 dataframe above?
Yes, at least for now, it will make a difference - Caching dataframes while keeping partitions
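For the caching question, a minimal sketch (Scala; whether it actually helps depends on how often the same data is re-read, and the second query here is only illustrative):

// Sketch: caching pays off when the same DataFrame is scanned repeatedly; the
// first action materializes the cache, later queries read from memory/disk.
val df4 = spark.read.parquet("/readings.parquet")
df4.createOrReplaceTempView("deviceReadings")
df4.cache()

spark.sql("select avg(energy) from deviceReadings where houseId = 3123").show()
spark.sql("select max(energy) from deviceReadings where houseId = 3123").show()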

spark behavior on hive partitioned table

I use Spark 2.
Actually I am not the one executing the queries so I cannot include query plans. I have been asked this question by the data science team.
We have a Hive table partitioned into 2000 partitions and stored in parquet format. When this table is used in Spark, exactly 2000 tasks are executed among the executors. But we have a block size of 256 MB and we were expecting (total size / 256 MB) partitions, which would surely be much less than 2000. Is there some internal logic by which Spark uses the physical structure of the data to create partitions? Any reference/help would be greatly appreciated.
UPDATE: It is the other way around. Actually our table is very large, around 3 TB, with 2000 partitions. 3 TB / 256 MB would actually come to about 11,720, but we get exactly the same number of Spark partitions as the table has physical partitions. I just want to understand how tasks are generated relative to data volume.
In general Hive partitions are not mapped 1:1 to Spark partitions. 1 Hive partition can be split into multiple Spark partitions, and one Spark partition can hold multiple hive-partitions.
The number of Spark partitions when you load a hive-table depends on the parameters:
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
You can check the partitions e.g. using
spark.table(yourtable).rdd.partitions
This will give you an Array of FilePartitions which contain the physical path of your files.
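For example, a small sketch (Scala; "yourtable" is a placeholder and the byte values are illustrative) showing how these settings change the scan parallelism:

// Sketch: pack up to 256 MB of file data into each Spark partition, then inspect
// the planned FilePartitions for the table scan.
spark.conf.set("spark.sql.files.maxPartitionBytes", 256L * 1024 * 1024)
spark.conf.set("spark.sql.files.openCostInBytes", 4L * 1024 * 1024) // per-file overhead estimate

val df = spark.table("yourtable")
println(df.rdd.getNumPartitions)           // number of read tasks for the scan
df.rdd.partitions.take(3).foreach(println) // FilePartitions listing the underlying files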
That you got exactly 2000 Spark partitions from your 2000 Hive partitions seems like a coincidence to me; in my experience this is very unlikely to happen. Note that the situation in Spark 1.6 was different: there the number of Spark partitions resembled the number of files on the filesystem (1 Spark partition per file, unless the file was very large).
I just want to understand how the tasks are generated on data volume.
Tasks are a runtime artifact and their number is exactly the number of partitions.
The number of tasks does not correlate to data volume in any way. It's a Spark developer's responsibility to have enough partitions to hold the data.

Spark partition by files

I have several thousand compressed CSV files on an S3 bucket, each of size approximately 30MB (around 120-160MB after decompression), which I want to process using Spark.
In my spark job, I am doing simple filter select queries on each row.
While partitioning, Spark divides each file into two or more parts and then creates a task for each partition. Each task takes around 1 minute to complete just to process 125K records. I want to avoid this partitioning of a single file across many tasks.
Is there a way to fetch the files and partition the data such that each task works on one complete file, that is, number of tasks = number of input files?
As well as playing with the Spark options, you can tell the s3a filesystem client to report to Spark that the "block size" of a file in S3 is 128 MB. The default is 32 MB, which is close enough to your "approximately 30MB" number that Spark could be splitting the files in two.
spark.hadoop.fs.s3a.block.size 134217728
Using the wholeTextFiles() operation is safer, though.
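A sketch of the wholeTextFiles() route (Scala; the bucket path and the filter value are hypothetical): it never splits a file across tasks, so every (path, content) record is one complete file.

// Sketch: each record of the RDD is a whole file, so no file is split across tasks.
val files = spark.sparkContext.wholeTextFiles("s3a://my-bucket/csv-input/")
val matching = files.flatMap { case (_, content) =>
  content.split("\n").filter(_.contains("some-value")) // simple per-row filter, as in the question
}
println(matching.count())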

Spark DataFrames with Parquet and Partitioning

I have not been able to find much information on this topic, but let's say we use a dataframe to read in a parquet file that is 10 blocks; Spark will naturally create 10 partitions. But when the dataframe reads in the file to process it, won't it be processing a large data-to-partition ratio? If it were processing the file uncompressed, the data would span many more blocks, meaning there would be more partitions as well.
So let me clarify with compressed parquet (these numbers are not fully accurate):
1 GB parquet = 5 blocks = 5 partitions, which might decompress to 5 GB, making it 25 blocks / 25 partitions. But unless you repartition the 1 GB parquet file, you will be stuck with just 5 partitions when optimally it would be 25 partitions? Or is my logic wrong?
Would it make sense to repartition to increase speed? Or am I thinking about this wrong? Can anyone shed some light on this?
Assumptions:
1 block = 1 partition for Spark
1 core operates on 1 partition
A Spark DataFrame doesn't load parquet files into memory. It uses the Hadoop/HDFS API to read them during each operation. So the optimal number of partitions depends on the HDFS block size (which is different from the Parquet block size!).
Spark 1.5 DataFrames partition a parquet file as follows:
1 partition per HDFS block
If the HDFS block size is less than the Parquet block size configured in Spark, a partition will be created for multiple HDFS blocks such that the total size of the partition is no less than the Parquet block size
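To see this concretely, a small sketch (Scala; the path is hypothetical) that checks how many read partitions a parquet file produces:

// Sketch: the scan's partition count tracks the block layout of the compressed
// parquet file on HDFS, not its decompressed size.
val df = spark.read.parquet("hdfs:///data/example.parquet")
println(df.rdd.getNumPartitions)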
I saw the other answer, but I thought I could clarify this a bit more. If you are reading Parquet from a POSIX filesystem, then you can increase the number of parallel reads just by having more workers in Spark.
But in order to control the balance of data that goes to the workers, you can use the hierarchical directory structure of the Parquet files, and later have the workers point to different partitions or parts of the Parquet data. This gives you control over how much data goes to each worker according to the domain of your dataset (if by balancing data across workers you mean that an equal batch of data per worker is not efficient).
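As a sketch of that idea (Scala; all paths and partition column names are hypothetical), a job can target a single branch of the partition hierarchy directly:

// Sketch: with a date=/hour= directory hierarchy, pointing the reader at one branch
// means only that slice of the files is read; basePath keeps the partition columns.
val oneDay = spark.read
  .option("basePath", "/data/events")
  .parquet("/data/events/date=2021-01-01")
println(oneDay.rdd.getNumPartitions)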
