Spark SQL - Identifying What Partitions Were Output by an INSERT Statement - apache-spark

I am trying to find out if there is a way for a Spark SQL job to identify what partition values were output in both static and dynamic partition scenarios.
I am already familiar with ApplicationListeners and QueryListeners and have found that, combined, they let you see exactly which static partition values were written. When a dynamic partition write is performed, however, you only get the partition keys and not the corresponding values.
For these jobs, there is a constraint that they run off of only SQL so I cannot use some sort of accumulator or broadcast variable.
e.g.
spark.sql("INSERT OVERWRITE .....")
As an example, if I have a job write out to the base path
s3://bucket/table
With partition columns (part_a String, part_b String)
I want to be able to, as a post process, see the partition values and/or paths I've written to.
If I wrote to part_a=a and part_b=b (via SQL), and then part_a=1 and part_b=2 resulting in the below paths:
s3://bucket/table/part_a=a/part_b=b
s3://bucket/table/part_a=1/part_b=2
How can I either obtain the above paths post run or at a bare minimum the dynamic partition values? I have used AspectJ to this effect to catch the actual filesystem paths being passed on write but this is very brittle to the Spark version itself and essentially breaks on every upgrade.
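For illustration only, here is how paths like the two above map back to partition values once they have been obtained by some means (for example by listing the base path after the run). This is a minimal sketch, not a way to capture the paths from Spark itself; the helper name partition_values is hypothetical, and it assumes Hive-style column=value directory names.

```python
from urllib.parse import unquote


def partition_values(path, base_path):
    """Extract {column: value} from a Hive-style partition path under base_path."""
    base = base_path.rstrip("/")
    if not path.startswith(base + "/"):
        raise ValueError(f"{path!r} is not under {base_path!r}")
    relative = path[len(base) + 1:]
    values = {}
    for segment in relative.split("/"):
        if "=" not in segment:
            continue  # skip data file names and other non-partition segments
        column, _, value = segment.partition("=")
        # Special characters in partition values are percent-escaped in paths.
        values[column] = unquote(value)
    return values


paths = [
    "s3://bucket/table/part_a=a/part_b=b",
    "s3://bucket/table/part_a=1/part_b=2",
]
written = [partition_values(p, "s3://bucket/table") for p in paths]
```

Values come back as strings, which matches the String partition columns in the example.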

Related

Get PySpark to output one file per column value (repartition / partitionBy not working)

I've seen many answers and blog posts suggesting that:
df.repartition('category').write().partitionBy('category')
Will output one file per category, but this doesn't appear to be true if the number of unique 'category' values in df is less than the number of default partitions (usually 200).
When I use the above code on a file with 100 categories, I end up with 100 folders each containing between 1 and 3 "part" files, rather than having all rows with a given "category" value in the same "part". The answer at https://stackoverflow.com/a/42780452/529618 seems to explain this.
What is the fastest way to get exactly one file per partition value?
Things I've tried
I've seen many claims that
df.repartition(1, 'category').write().partitionBy('category')
df.repartition(2, 'category').write().partitionBy('category')
Will create "exactly one file per category" and "exactly two files per category" respectively, but this doesn't appear to be how this parameter works. The documentation makes it clear that the numPartitions argument is the total number of partitions to create, not the number of partitions per column value. Based on that documentation, specifying this argument as 1 should (accidentally) output a single file per partition when the file is written, but presumably only because it removes all parallelism and forces your entire RDD to be shuffled / recalculated on a single node.
required_partitions = df.select('category').distinct().count()
df.repartition(required_partitions, 'category').write().partitionBy('category')
The above seems like a workaround based on the documented behaviour, but one that would be costly for several reasons. For one, the separate count is expensive if df is not cached (and df may be so big that caching it just for this purpose would be wasteful). In addition, any repartitioning of a dataframe can cause unnecessary shuffling in a multi-stage workflow that has various dataframe outputs along the way.
The "fastest" way probably depends on the actual hardware set-up and the actual data (for example, whether it is skewed). I agree that, to my knowledge, df.repartition('category').write().partitionBy('category') will not solve your problem.
We faced a similar problem in our application, but instead of first doing a count and then the repartition, we separated the writing of the data and the requirement to have only a single file per partition into two different Spark jobs. The first job is optimized to write the data. The second job just iterates over the partitioned folder structure, reads the data per folder/partition, coalesces it to one partition and overwrites it. Again, I cannot tell whether that is also the fastest way in your environment, but for us it did the trick.
Some research on this topic led me to the Auto Optimize Writes feature on Databricks for writing to a Delta table. It uses a similar approach: first write the data, then run a separate OPTIMIZE job to compact the files. In the mentioned link you will find this explanation:
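The second job described above could be sketched roughly as follows. This is an assumption-laden illustration rather than the answerer's actual code: the function names are hypothetical, the directory walk only works on a local or mounted file system (S3 would need a different listing call), Parquet is assumed as the format, and the compacted output goes to a separate staging directory because Spark refuses to overwrite a path it is currently reading from.

```python
import os


def leaf_partition_dirs(base_path):
    """Return the directories under base_path that directly contain data files."""
    leaves = []
    for dirpath, _dirnames, filenames in os.walk(base_path):
        # Ignore marker and hidden files such as _SUCCESS or .crc files.
        data_files = [f for f in filenames if not f.startswith(("_", "."))]
        if data_files:
            leaves.append(dirpath)
    return sorted(leaves)


def compact_partition(spark, partition_dir, staging_dir):
    """Rewrite one partition folder as a single Parquet file in staging_dir.

    `spark` is an existing SparkSession. Swapping staging_dir into place
    afterwards is left to the caller.
    """
    (spark.read.parquet(partition_dir)
          .coalesce(1)              # one output partition -> one part file
          .write.mode("overwrite")
          .parquet(staging_dir))
```

Looping compact_partition over leaf_partition_dirs(base_path) gives the per-folder rewrite; whether that is faster than repartitioning up front depends on the data volume per folder.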
"After an individual write, Azure Databricks checks if files can further be compacted, and runs an OPTIMIZE job [...] to further compact files for partitions that have the most number of small files."
As a side note: make sure the configuration spark.sql.files.maxRecordsPerFile is set to 0 (the default value) or to a negative number. Otherwise, this configuration alone could lead to multiple files for data with the same value in the column "category".
You can try coalesce(n). coalesce is used to decrease the number of partitions and, unlike repartition, avoids a full shuffle when doing so.
n = The number of partitions you want to be output.

Incremental batch processing in pyspark

In our Spark application, we run multiple batch processes every day. The sources for these batch processes differ: Oracle, MongoDB, files. We store a different value for incremental processing depending on the source (the latest timestamp for some Oracle tables, an ID for other Oracle tables, a list for some file systems) and use those values for the next incremental run.
Currently the calculation of these offset values depends on the source type, so we need to customize the code that stores the value every time we add a new source type.
Is there a generic way to resolve this issue, like checkpoints in streaming?
I always like to look into the destination for the last written partition, or get some max(primary_key) and then based on that value select data from the source database to write during the current run.
There would be no need to store anything, you would just need to supply to your batch processing algorithm the table name, source type, and primary key/timestamp column. The algorithm would then find the latest value you already have.
It really depends on your load philosophy and how your storage is divided; if you have raw/source/prepared layers. It is a good idea to load data in a raw format which can be easily compared to the original source in order to do what I described above.
Alternatives include:
Writing a file which contains that primary column and the latest value, your batch job would read this file to determine what to read next.
Updating the job execution configuration with an argument corresponding to the latest value, so on the next run the latest value is passed to your algorithm.
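The first alternative (a file holding the column and its latest value) might look something like the sketch below. The file layout, the function names and the idea of turning the stored value into a SQL filter are all assumptions made for illustration; nothing here is specific to Spark.

```python
import json
import os


def load_watermarks(state_path):
    """Read the state file; an absent file means no source has run yet."""
    if not os.path.exists(state_path):
        return {}
    with open(state_path) as f:
        return json.load(f)


def save_watermark(state_path, source, column, value):
    """Record the latest value seen for one source, replacing the file atomically."""
    state = load_watermarks(state_path)
    state[source] = {"column": column, "value": value}
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, state_path)


def incremental_filter(state, source):
    """Build a WHERE-clause fragment selecting only rows newer than the watermark."""
    entry = state.get(source)
    if entry is None:
        return "1 = 1"  # first run: read everything
    value = entry["value"]
    if isinstance(value, (int, float)):
        literal = str(value)
    else:
        literal = "'" + str(value).replace("'", "''") + "'"
    return f"{entry['column']} > {literal}"
```

Because the file stores the column name next to the value, the same code serves timestamp-based and ID-based sources; only list-style offsets (such as processed file names) would need a different representation.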

Incremental and parallelism read from RDBMS in Spark using JDBC

I'm working on a project that involves reading data from an RDBMS using JDBC, and I have succeeded in reading the data. This is something I will be doing regularly, on a weekly basis. So I've been trying to come up with a way to ensure that, after the initial read, subsequent ones only pull updated records instead of pulling the entire table.
I can do this with a Sqoop incremental import by specifying the three parameters (--check-column, --incremental last-modified/append and --last-value). However, I don't want to use Sqoop for this. Is there a way I can replicate the same in Spark with Scala?
Secondly, some of the tables do not have a unique column that can be used as partitionColumn, so I thought of using a row-number function to add a unique column to these tables and then taking the MIN and MAX of that column as lowerBound and upperBound respectively. My challenge now is how to dynamically pass these values into the read statement, like below:
val queryNum = "select a1.*, row_number() over (order by sales) as row_nums from (select * from schema.table) a1"
val df = spark.read.format("jdbc").
option("driver", driver).
option("url",url ).
option("partitionColumn",row_nums).
option("lowerBound", min(row_nums)).
option("upperBound", max(row_nums)).
option("numPartitions", some value).
option("fetchsize",some value).
option("dbtable", queryNum).
option("user", user).
option("password",password).
load()
I know the above code is not right and might be missing a whole lot of processes but I guess it'll give a general overview of what I'm trying to achieve here.
It's surprisingly complicated to handle incremental JDBC reads in Spark. IMHO, it severely limits the ease of building many applications and may not be worth your trouble if Sqoop is doing the job.
However, it is doable. See this thread for an example using the dbtable option:
Apache Spark selects all rows
To keep this job idempotent, you'll need to read in the max row of your prior output, either directly by loading all the data files or via a log file that you write out on each run. If your data files are massive you may need to use the log file; if they are smaller you could potentially load them directly.
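As a rough sketch of the dbtable approach, the helpers below build the strings and options involved. They are written in Python rather than Scala, the function names are hypothetical, and last_value is assumed to be numeric. The option keys (url, dbtable, partitionColumn, lowerBound, upperBound, numPartitions, fetchsize) are the ones used in the question.

```python
def incremental_subquery(table, check_column, last_value):
    """Subquery usable as the dbtable option: only rows past the last value."""
    # dbtable accepts a parenthesised subquery as long as it has an alias.
    return f"(SELECT * FROM {table} WHERE {check_column} > {last_value}) incr"


def bounds_query(subquery, partition_column):
    """Subquery returning the MIN and MAX of the partition column."""
    return (f"(SELECT MIN({partition_column}) AS lo, MAX({partition_column}) AS hi "
            f"FROM {subquery}) bounds")


def partitioned_read_options(url, dbtable, partition_column, lower, upper,
                             num_partitions, fetchsize=1000):
    """Options for a parallel JDBC read; every value is passed as a string."""
    return {
        "url": url,
        "dbtable": dbtable,
        "partitionColumn": partition_column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
        "fetchsize": str(fetchsize),
    }

# Intended use with an existing SparkSession `spark` (not executed here):
#   lo, hi = spark.read.format("jdbc").options(
#       url=url, dbtable=bounds_query(sub, "id"), user=user, password=password
#   ).load().first()
#   df = spark.read.format("jdbc").options(
#       user=user, password=password,
#       **partitioned_read_options(url, sub, "id", lo, hi, 8)
#   ).load()
```

The bounds are fetched with one small query first and then fed into the partitioned read, which is one way to pass MIN and MAX in dynamically.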

Write to a datepartitioned Bigquery table using the beam.io.gcp.bigquery.WriteToBigQuery module in apache beam

I'm trying to write a dataflow job that needs to process logs located on storage and write them in different BigQuery tables. Which output tables are going to be used depends on the records in the logs. So I do some processing on the logs and yield them with a key based on a value in the log. After which I group the logs on the keys. I need to write all the logs grouped on the same key to a table.
I'm trying to use the beam.io.gcp.bigquery.WriteToBigQuery module with a callable as the table argument as described in the documentation here
I would like to use a date-partitioned table as this will easily allow me to write_truncate on the different partitions.
Now I encounter 2 main problems:
The CREATE_IF_NEEDED gives an error because it has to create a partitioned table. I can circumvent this by making sure the tables exist in a previous step and if not create them.
If I load older data I get the following error:
"The destination table's partition table_name_x$20190322 is outside the allowed bounds. You can only stream to partitions within 31 days in the past and 16 days in the future relative to the current date."
This seems like a limitation of streaming inserts. Is there any way to do batch inserts?
Maybe I'm approaching this wrong, and should use another method.
Any guidance on how to tackle these issues is appreciated.
I'm using Python 3.5 and apache-beam=2.13.0
That error message can be logged when one mixes the use of an ingestion-time partitioned table and a column-partitioned table (see this similar issue). Summarizing from the link, it is not possible to use column-based partitioning (as opposed to ingestion-time partitioning) and write to tables with partition suffixes.
In your case, since you want to write to different tables based on a value in the log and have partitions within each table, forgo the use of the partition decorator when selecting which table (use "[prefix]_YYYYMMDD") and then have each individual table be column-based partitioned.
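A callable for the table argument that follows this naming scheme might look like the sketch below. The element fields "key" and "date", the dataset name and the function name are all hypothetical. The only point illustrated is that the returned name uses a "_YYYYMMDD" suffix and no "$" partition decorator.

```python
from datetime import datetime


def table_for(element, dataset="my_dataset"):
    """Return the destination table for one record, as "[prefix]_YYYYMMDD"."""
    # Assumes each element carries the grouping key and an ISO date string.
    day = datetime.strptime(element["date"], "%Y-%m-%d").strftime("%Y%m%d")
    # No "$" partition decorator: the date is part of the table name itself.
    return f"{dataset}.{element['key']}_{day}"

# Intended use (not executed here): pass table_for as the `table` argument
# of beam.io.gcp.bigquery.WriteToBigQuery.
```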

Why is execution time of spark sql query different between first time and second time of execution?

I am using spark sql to run some aggregated query on the parquet data source.
My parquet data source includes a table with columns: id int, time timestamp, location int, counter_1 long, counter_2 long, ..., counter_48. The total data size is about 887 MB.
My spark version is 2.4.0. I run one master and one slave on a single machine (4 cores, 16G memory).
Using spark-shell, I ran the spark command:
spark.time(spark.sql("SELECT location, sum(counter_1)+sum(counter_5)+sum(counter_10)+sum(counter_15)+sum(counter_20)+sum(counter_25)+sum(counter_30)+sum(counter_35)+sum(counter_40)+sum(counter_45) from parquet.`/home/hungphan227/spark_data/counters` group by location").show())
The execution time is 17s.
The second time I ran a similar command (only change columns):
spark.time(spark.sql("SELECT location, sum(counter_2)+sum(counter_6)+sum(counter_11)+sum(counter_16)+sum(counter_21)+sum(counter_26)+sum(counter_31)+sum(counter_36)+sum(counter_41)+sum(counter_46) from parquet.`/home/hungphan227/spark_data/counters` group by location").show())
The execution time is about 3s.
My first question is: Why are they different? I know it is not data caching because of the parquet format. Is it about reusing something like query planning?
I did another test: The first command is
spark.time(spark.sql("SELECT location, sum(counter_1)+sum(counter_5)+sum(counter_10)+sum(counter_15)+sum(counter_20)+sum(counter_25)+sum(counter_30)+sum(counter_35)+sum(counter_40)+sum(counter_45) from parquet.`/home/hungphan227/spark_data/counters` group by location").show())
The execution time is 17s.
In the second command, I change the aggregate function:
spark.time(spark.sql("SELECT location, avg(counter_1)+avg(counter_5)+avg(counter_10)+avg(counter_15)+avg(counter_20)+avg(counter_25)+avg(counter_30)+avg(counter_35)+avg(counter_40)+avg(counter_45) from parquet.`/home/hungphan227/spark_data/counters` group by location").show())
The execution time is about 5s.
My second question is: Why is the second command faster than the first command, but with a slightly smaller difference in execution time than in the first scenario?
Finally, I have a problem related to the above scenarios: There are about 200 formulas like:
formula1 = sum(counter_1)+sum(counter_5)+sum(counter_10)+sum(counter_15)+sum(counter_20)+sum(counter_25)+sum(counter_30)+sum(counter_35)+sum(counter_40)+sum(counter_45)
formula2 = avg(counter_2)+avg(counter_5)+avg(counter_11)+avg(counter_15)+avg(counter_21)+avg(counter_25)+avg(counter_31)+avg(counter_35)+avg(counter_41)+avg(counter_45)
I have to run the following format frequently:
select formulaX, formulaY, ..., formulaZ from table where time > value1 and time < value2 and location in (value1, value2, ...) group by location
My third question is: Is there any way to optimize the performance (so that a query used once is faster if it is used again in the future)? Does Spark optimize itself, or do I have to write some code or change the config?
It's called Exchange Reuse. When Spark runs a shuffle (e.g. for an aggregation or a join), it stores a copy of the shuffle data on the local worker nodes for potential reuse. This behavior is controlled internally and cannot be directly influenced by the end user. If you find that you keep reusing a particular portion of the data (or a query result), you could consider caching it explicitly with cache(). Bear in mind, however, that although this lets Spark reuse the cached result for potentially faster queries (if, and only if, the analyzed plan of your cached query matches your new query), overusing cache can cause a whole range of other performance problems.
One bad case is a very large dataset, which may cause disk spill: the dataset doesn't fit into your cluster's available memory and has to be written to slower hard disks.
Another bad case is a query that only needs a subset of the cached data. With the entire dataset cached in memory, Spark is forced to perform a full in-memory table scan. That is not only a waste of resources but also results in slower query performance than not using the cache at all.
The best thing to do is trial and error with a few of your own example queries: look at the Spark UI and check whether there are signs of disk spill or of a large amount of input data being scanned.
Every query/data combination is unique, so you'll need to experiment a bit to find the best performance tuning method for your own workload.
When doing an aggregation, Spark creates what are called shuffle files. If you run the same query twice, it will reuse the shuffle files, which are stored locally on the workers' file system. Unfortunately you can't rely on them always being there, because eventually the file handle gets garbage-collected. If you're going to run 10 queries on the same dataset, cache it or use Databricks.
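To make the "cache it" suggestion concrete for the roughly 200 formulas in the question, one possible arrangement is sketched below. It is in Python rather than the spark-shell Scala used in the question, and the helper names and the view name are assumptions. The idea is to cache the Parquet data once under a temp view and generate each query in the question's format from formula definitions.

```python
def formula(func, counters):
    """Build e.g. sum(counter_1)+sum(counter_5) from a function name and indices."""
    return "+".join(f"{func}(counter_{i})" for i in counters)


def build_query(formulas, table, time_from, time_to, locations):
    """Assemble the SELECT ... GROUP BY location statement for named formulas."""
    select_list = ", ".join(f"{expr} AS {name}" for name, expr in formulas.items())
    location_list = ", ".join(str(int(loc)) for loc in locations)
    return (f"SELECT location, {select_list} FROM {table} "
            f"WHERE time > '{time_from}' AND time < '{time_to}' "
            f"AND location IN ({location_list}) GROUP BY location")


def cache_as_view(spark, path, view_name):
    """Cache a Parquet dataset and expose it to SQL; `spark` is a SparkSession."""
    df = spark.read.parquet(path)
    df.cache()  # materialised lazily, on the first action that touches it
    df.createOrReplaceTempView(view_name)
    return df
```

Whether caching the full 887 MB pays off on a 16G machine is something to check in the Spark UI, as the answer above suggests.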
