Load parquet folders to spark dataframe based on condition - apache-spark

I have a directory that contains folders named by date, with the run date as part of the folder name. I have a daily Spark job in which I need to load the last 7 days' folders on any given day.
Unfortunately the directory contains other files as well, which rules out trying partition discovery.
The folders are named in the following format:
prefix-yyyyMMdd/
How can I load the folders for the last 7 days in one shot?
Since the names are based on the run date, I cannot use a predefined regex to load the data, because I have to account for month and year changes.
I have a couple of brute-force solutions:
Load the data into 7 DataFrames and unionAll them to get one DataFrame out of the 7. This looks inefficient, but not entirely bad.
Load the entire folder and apply a where condition on the column that holds the date.
This looks storage-heavy, as the folder contains years' worth of data.
Neither looks efficient, and since each file's data is itself huge, I would like to know if there are better solutions.
Is there a better way to do it?

DataFrameReader methods can take multiple paths, e.g.
spark.read.parquet("prefix-20190704", "prefix-20190703", ...)

Related

Databricks query performance when filtering on a column correlated to the partition-column

Setting: Delta Lake, Databricks SQL compute used by Power BI.
I am wondering about the following scenario: we have a column timestamp and a derived column date (the date of timestamp), and we choose to partition by date. When we query, we use timestamp in the filter, not date.
My understanding is that Databricks a priori won't connect the timestamp and the date, so it seemingly won't get any advantage from the partitioning. But since the files are in fact partitioned by timestamp (implicitly), when Databricks looks at the min/max timestamps of all the files, it will find that it can skip most files after all. So it seems like we can get quite a benefit from partitioning even if it's on a column we don't explicitly use in the query.
Is this correct?
What is the performance cost (roughly) of having to filter away files in this way vs. using the partitioning directly?
Will databricks have all the min/max information in memory, or does it have to go out and look at the files for each query?
Yes, Databricks will take implicit advantage of this partitioning through data skipping, because there are min/max statistics associated with the individual data files. The min/max information is loaded into memory from the transaction log, but Databricks still needs to decide which files to hit on every query. Because everything is in memory, the performance overhead shouldn't be very big until you have hundreds of thousands of files.
One thing that you may consider is using a generated column instead of an explicit date column. Declare it as date GENERATED ALWAYS AS (CAST(timestampColumn AS DATE)) and partition by it. The advantage is that when you run a query on timestampColumn, partition filtering on the date column happens automatically.
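For reference, a minimal sketch of such a generated column (the table name and column list are assumptions; GENERATED ALWAYS AS requires a Delta table):

spark.sql("""
    CREATE TABLE events (                 -- hypothetical table name
        timestampColumn TIMESTAMP,
        date DATE GENERATED ALWAYS AS (CAST(timestampColumn AS DATE))
    )
    USING DELTA
    PARTITIONED BY (date)
""")

With this in place, a filter such as WHERE timestampColumn > '2021-06-01' should also prune on the date partitions.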

Databricks create view partitioning fields on path

I think it's a noob question, but I couldn't find an answer at all. Let's assume I already have my data organized like /mnt/raw/mydata/YYYY/MM/DD:
Example: /mnt/raw/mydata/2020/10/20
where YYYY is the year, MM is the month, and DD is the day. I would like to create a view that maps fields to the folder names. I've only seen examples that create views when the folders are named like 'YEAR=2020'. Is that possible?
It's related to partition discovery described here
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html but my folders don't have the field name. I would like to know if I can specify that the first level is the field YEAR, the second is the month, and the third is the day.
CREATE TEMPORARY VIEW parquetTable
USING org.apache.spark.sql.parquet
OPTIONS (
path "examples/src/main/resources/people.parquet"
)
Yes, you can have folders broken down by the distinct values of a field. This can be achieved by partitioning your table/DataFrame. It is recommended that you store your data in Parquet format to optimize for space, and that each partition in your folder structure (i.e. each chosen partition) contains approximately 1 GB of data. See the links below for more details:
https://spark.apache.org/docs/3.0.1/sql-data-sources-parquet.html#partition-discovery
https://docs.databricks.com/delta/best-practices.html#choose-the-right-partition-column
https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-create-table-datasource.html
CREATE OR REPLACE TABLE MyTable USING parquet OPTIONS (path "/mnt/raw/mydata") PARTITIONED BY (year, month, day)
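As a rough sketch of what that recommended layout looks like end to end (the output path is an assumption, and the DataFrame is assumed to already contain year, month and day columns): writing with partitionBy produces year=.../month=.../day=... folders, and partition discovery then exposes them as columns on read:

df.write.partitionBy("year", "month", "day").parquet("/mnt/raw/mydata_partitioned")
spark.read.parquet("/mnt/raw/mydata_partitioned") \
    .where("year = 2020 AND month = 10 AND day = 20") \
    .show()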

How can I make my Athena SQL query faster

I am running this on AWS Athena based on PrestoDB. My original plan was to query data 3 months in the past to analyze that data. However, even the query times for 2 hours in the past takes more than 30 minutes, at which point the Query times out. Is there any more efficient way for the query to be carried out?
SELECT column1, dt, column2
FROM database1
WHERE date_parse(dt, '%Y%m%d%H%i%s') > CAST(now() - interval '1' hour AS timestamp)
The date column is recorded in the form of a string YYYYmmddhhmmss
Likely, the problem is that the query applies a function to the column being filtered. This is inefficient, because the database needs to convert the entire column before it is able to filter on it. One says that this predicate is non-SARGable.
Your primary effort should go into fixing your data model and store dates as dates rather than strings.
That said, the string format that you are using to represent dates still makes it possible to use direct filtering. The idea is to convert the filter value to the target string format (rather than converting the column value to a date):
where dt > date_format(now() - interval '1' hour, '%Y%m%d%H%i%s')
There are a lot of different factors that influence the time it takes Athena to execute a query. The amount of data usually dominates, but other important factors are the data format (there's a huge difference between CSV and Parquet, for example) and the number of files. In contrast to many other databases, query complexity is less often an important factor, and your query is very straightforward and is not the problem (applying a function on both sides of the WHERE condition doesn't help, but it's not a big deal in Athena: the filtering is brute force, and applying a function to each row is cheap compared to the IO in an engine like Athena).
If you provide more information about the number of files, the data format, and so on we can probably help you better, because without that kind of information it could be just about anything. My suspicion is that you have something like a single prefix with tens or hundreds of millions of files – this is the worst possible case for Athena.
When Athena plans a query it lists the table's location on S3. S3's list operation has a page size of 1000, so if there are more files than that Athena will have to list sequentially until it gets the full listing. This cannot be parallelised, and it's also not very fast.
You need to avoid, almost at all costs, having more than 1000 files in the same prefix. If you have more files than that you can add prefixes (directories), because Athena will list S3 as if it were a file system and parallelise the listing of prefixes. 1000 files each in table-data/a/, table-data/b/, and table-data/c/ is much better than 3000 files in table-data/.
The reason why I suspect it's lots of small files rather than a lot of data is that if it was a lot of data you would probably have said so, and lots of data is actually something Athena is really good at. Ripping through terabytes of data is no problem, unless it's a billion tiny files.

PySpark: how to read in partitioning columns when reading parquet

I have data stored in Parquet files and a Hive table partitioned by year, month, and day. Thus, each Parquet file is stored in a /table_name/year/month/day/ folder.
I want to read in data for only some of the partitions. I have list of paths to individual partitions as follows:
paths_to_files = ['hdfs://data/table_name/2018/10/29',
'hdfs://data/table_name/2018/10/30']
And then try to do something like:
df = sqlContext.read.format("parquet").load(paths_to_files)
However, then my data does not include the information about year, month and day, as this is not part of the data per se, rather the information is stored in the path to the file.
I could use the SQL context and send a Hive query with a select statement and a where clause on the year, month, and day columns to select only data from the partitions I am interested in. However, I'd rather avoid constructing SQL queries in Python, as I am very lazy and don't like reading SQL.
I have two questions:
What is the optimal way (performance-wise) to read in the data stored as Parquet, where information about year, month, and day is not present in the Parquet file but is only included in the path to the file? (Either send a Hive query using sqlContext.sql('...'), or use read.parquet... anything really.)
Can I somehow extract the partitioning columns when using the approach I outlined above?
Reading the direct file paths to the parent directory of the year partitions should be enough for a DataFrame reader to determine there are partitions under it. However, it wouldn't know what to name the partition columns without a directory structure like /year=2018/month=10, for example.
Therefore, if you have Hive, then going via the metastore would be better because the partitions are named there, Hive stores extra useful information about your table, and then you're not reliant on knowing the direct path to the files on disk from the Spark code.
Not sure why you think you need to read/write SQL, though.
Use the Dataframe API instead, e.g
df = spark.table("table_name")
df_2018 = df.filter(df['year'] == 2018)
df_2018.show()
Your data isn't stored in a way that's optimal for Parquet, so you'd have to load the files one by one and add the dates yourself (see the sketch below).
Alternatively, you can move the files to a directory structure that fits Parquet's partition discovery (e.g. .../table/year=2018/month=10/day=29/file.parquet). Then you can read the parent directory (table) and filter on year, month, and day; Spark will only read the relevant directories, and you'd also get these as columns in your DataFrame.
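A minimal sketch of the load-one-by-one fallback mentioned above (paths_to_files is the list from the question; the path splitting assumes the .../table_name/<year>/<month>/<day> layout, and an active spark session is assumed):

from functools import reduce
from pyspark.sql import functions as F

def load_partition(path):
    # assumes the path ends in <year>/<month>/<day>
    year, month, day = path.rstrip("/").split("/")[-3:]
    return (spark.read.parquet(path)
                 .withColumn("year", F.lit(int(year)))
                 .withColumn("month", F.lit(int(month)))
                 .withColumn("day", F.lit(int(day))))

df = reduce(lambda a, b: a.unionByName(b),
            [load_partition(p) for p in paths_to_files])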

pyspark: Efficiently have partitionBy write to same number of total partitions as original table

I had a question related to pyspark's partitionBy() function, which I originally posted in a comment on this question. I was asked to post it as a separate question, so here it is:
I understand that df.partitionBy(COL) will write all the rows with each value of COL to their own folder, and that each folder will (assuming the rows were previously distributed across all the partitions by some other key) have roughly the same number of files as were previously in the entire table. I find this behavior annoying. If I have a large table with 500 partitions, and I use partitionBy(COL) on some attribute columns, I now have for example 100 folders which each contain 500 (now very small) files.
What I would like is the partitionBy(COL) behavior, but with roughly the same file size and number of files as I had originally.
As demonstration, the previous question shares a toy example where you have a table with 10 partitions and do partitionBy(dayOfWeek) and now you have 70 files because there are 10 in each folder. I would want ~10 files, one for each day, and maybe 2 or 3 for days that have more data.
Can this be easily accomplished? Something like df.write().repartition(COL).partitionBy(COL) seems like it might work, but I worry that (in the case of a very large table which is about to be partitioned into many folders) having to first combine it to some small number of partitions before doing the partitionBy(COL) seems like a bad idea.
Any suggestions are greatly appreciated!
You've got several options. In my code below I'll assume you want to write in parquet, but of course you can change that.
(1) df.repartition(numPartitions, *cols).write.partitionBy(*cols).parquet(writePath)
This will first use hash-based partitioning to ensure that a limited number of values from COL make their way into each partition. Depending on the value you choose for numPartitions, some partitions may be empty while others may be crowded with values -- for anyone not sure why, read this. Then, when you call partitionBy on the DataFrameWriter, each unique value in each partition will be placed in its own individual file.
Warning: this approach can lead to lopsided partition sizes and lopsided task execution times. This happens when values in your column are associated with many rows (e.g., a city column -- the file for New York City might have lots of rows), whereas other values are less numerous (e.g., values for small towns).
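Applied to the dayOfWeek toy example from the question (the number of partitions here is an illustrative assumption), option (1) would look like:

# every dayOfWeek value hashes into exactly one of the 10 partitions,
# so each day's folder ends up with a single file
df.repartition(10, "dayOfWeek").write.partitionBy("dayOfWeek").parquet(writePath)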
(2) df.sort(sortCols).write.parquet(writePath)
This option works great when you want (1) the files you write to be of nearly equal size and (2) exact control over the number of files written. This approach first globally sorts your data and then finds splits that break it up into k evenly-sized partitions, where k is specified in the Spark config spark.sql.shuffle.partitions. This means that all rows with the same sort-key values are adjacent to each other, but sometimes they will span a split and end up in different files. Thus, if your use case requires all rows with the same key to be in the same partition, then don't use this approach.
There are two extra bonuses: (1) by sorting your data, its size on disk can often be reduced (for example, sorting all events by user_id and then by time will lead to lots of repetition in column values, which aids compression), and (2) if you write to a file format that supports it (like Parquet), then subsequent readers can read the data optimally using predicate push-down, because the Parquet writer records the MAX and MIN values of each column in the metadata, allowing the reader to skip data if the query specifies values outside of that (min, max) range.
Note that sorting in Spark is more expensive than just repartitioning and requires an extra stage. Behind the scenes Spark will first determine the splits in one stage, and then shuffle the data into those splits in another stage.
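To make option (2) concrete (the partition count and sort columns are illustrative assumptions), the number of output files is driven by spark.sql.shuffle.partitions:

# ask for 10 roughly equal-sized output files
spark.conf.set("spark.sql.shuffle.partitions", "10")
df.sort("user_id", "event_time").write.parquet(writePath)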
(3) df.rdd.partitionBy(customPartitioner).toDF().write.parquet(writePath)
If you're using Spark in Scala, then you can write a custom partitioner, which can get around the annoying gotchas of the hash-based partitioner. This is not an option in PySpark, unfortunately. If you really want to write a custom partitioner in PySpark, I've found this is possible, albeit a bit awkward, by using rdd.repartitionAndSortWithinPartitions:
(df.rdd
    .keyBy(sort_key_function)                       # convert rows to (key, Row) pairs
    .repartitionAndSortWithinPartitions(numPartitions=N_WRITE_PARTITIONS,
                                        partitionFunc=part_func)
    .values()                                       # get rid of the keys again
    .toDF()
    .write.parquet(writePath))
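For completeness, sort_key_function and part_func in the snippet above are placeholders; hypothetical examples of what they could be (partitionFunc is applied to the key, and Spark takes its result modulo numPartitions):

from pyspark.rdd import portable_hash

sort_key_function = lambda row: row["user_id"]   # hypothetical: key each Row by user_id
part_func = lambda key: portable_hash(key)       # hypothetical: same hash Spark uses by default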
Maybe someone else knows an easier way to use a custom partitioner on a dataframe in pyspark?
df.repartition(COL).write.partitionBy(COL)
will write out one file per partition. This will not work well if one of your partitions contains a lot of data. For example, if one partition contains 100 GB of data, Spark will try to write out a single 100 GB file and your job will probably blow up.
df.repartition(2, COL).write.partitionBy(COL)
will write out a maximum of two files per partition, as described in this answer. This approach works well for datasets that are not very skewed (because the optimal number of files per partition is roughly the same for all partitions).
This answer explains how to write out more files for the partitions that have a lot of data and fewer files for the small partitions.
