Databricks create view partitioning fields on path

I think it's a noob question but I couldn't find an answer at all. Let's assume I already have my data organized like this: /mnt/raw/mydata/YYYY/MM/DD
Example: /mnt/raw/mydata/2020/10/20
where YYYY is the year, MM is the month and DD is the day. I would like to create a view that can map fields to the folder names. I've only seen examples that create views with 'YEAR=2020'. Is that possible?
It's related to the partition discovery described here:
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html, but my folders don't have the field name. I would like to know if I can specify that the first level is the field YEAR, the second is the month and the third is the day.
CREATE TEMPORARY VIEW parquetTable
USING org.apache.spark.sql.parquet
OPTIONS (
path "examples/src/main/resources/people.parquet"
)

Yes, you can have folders broken down by the distinct values of a field. This can be achieved by partitioning your table/dataframe. It is recommended that you store your data in Parquet format to optimize for space, and that your folder structure (i.e. your chosen partitions) contains data that is approximately 1 GB in size. See the links below for more details:
https://spark.apache.org/docs/3.0.1/sql-data-sources-parquet.html#partition-discovery
https://docs.databricks.com/delta/best-practices.html#choose-the-right-partition-column
https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-create-table-datasource.html
CREATE OR REPLACE TABLE MyTable
USING parquet
OPTIONS (path "/mnt/raw/mydata")
PARTITIONED BY (year, month, day)
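If the folders really are plain /YYYY/MM/DD names with no field names in the path, one workaround is to derive the columns from the file path yourself and register the result as a temp view. A minimal PySpark sketch, assuming the /mnt/raw/mydata layout and the view name from the question:
from pyspark.sql.functions import input_file_name, regexp_extract

df = spark.read.parquet("/mnt/raw/mydata/*/*/*")
df = (df
      .withColumn("path",  input_file_name())   # full path of the source file
      .withColumn("year",  regexp_extract("path", r"/mydata/(\d{4})/", 1))
      .withColumn("month", regexp_extract("path", r"/mydata/\d{4}/(\d{2})/", 1))
      .withColumn("day",   regexp_extract("path", r"/mydata/\d{4}/\d{2}/(\d{2})/", 1)))
df.createOrReplaceTempView("parquetTable")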

Related

Spark find max of date partitioned column

I have a parquet partitioned in the following way:
data
/batch_date=2020-01-20
/batch_date=2020-01-21
/batch_date=2020-01-22
/batch_date=2020-01-23
/batch_date=2020-01-24
Here batch_date which is the partition column is of date type.
I want to read only the data from the latest date partition, but as a consumer I don't know what the latest value is.
I could use a simple group by, something like:
df.groupby().agg(max(col('batch_date'))).first()
While this would work, it's a very inefficient way since it involves a groupBy.
I want to know if we can query the latest partition in a more efficient way.
Thanks.
The method suggested by @pasha701 would involve loading the entire Spark dataframe with all the batch_date partitions and then finding the max of that. I think the author is asking for a way to directly find the max partition date and load only that.
One way is to use hdfs or s3fs to load the contents of the s3 path as a list, find the max partition, and then load only that. That would be more efficient.
Assuming you are using AWS S3, something like this:
import s3fs

datelist = []
inpath = "s3://bucket_path/data/"

fs = s3fs.S3FileSystem(anon=False)
dirs = fs.ls(inpath)                  # list the partition folders under the path
for path in dirs:
    date = path.split('=')[1]         # take the value after 'batch_date='
    datelist.append(date)
maxpart = max(datelist)
df = spark.read.parquet("s3://bucket_path/data/batch_date=" + maxpart)
This does all the work with plain lists and doesn't load anything into Spark until it finds the partition you want to load.
Function "max" can be used without "groupBy":
df.select(max("batch_date"))
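A short sketch of pulling the scalar out (note that max here must be the Spark SQL function, not Python's builtin; this still aggregates over the data, it just avoids the explicit groupBy):
from pyspark.sql.functions import col, max as spark_max

latest = df.select(spark_max(col("batch_date"))).first()[0]   # scalar max of the column
df_latest = df.where(col("batch_date") == latest)             # filter down to the latest partition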
Use SHOW PARTITIONS to get all partitions of the table:
show partitions TABLENAME
Output will be like
pt=2012.07.28.08/is_complete=1
pt=2012.07.28.09/is_complete=1
We can get data from a specific partition using the query below:
select * from TABLENAME where pt='2012.07.28.10' and is_complete='1' limit 1;
Additional filters or a group by can also be applied on top of it.
This worked for me in PySpark v2.4.3. First extract the partitions (this is for a table with a single partition on a date column; I haven't tried it when a table has more than one partition column):
df_partitions = spark.sql("show partitions database.dataframe")
"show partitions" returns dataframe with single column called 'partition' with values like partitioned_col=2022-10-31. Now we create a 'value' column extracting just the date part as string. This is then converted to date and the max is taken:
from pyspark.sql.functions import split, to_date
date_filter = df_partitions.withColumn('value', to_date(split('partition', '=')[1], 'yyyy-MM-dd')).agg({"value": "max"}).first()[0]
date_filter contains the maximum date from the partitions and can be used in a where clause pulling from the same table.

Load parquet folders to spark dataframe based on condition

I have a directory which has folders named by date, and the run date is part of the folder name. I have a daily Spark job in which I need to load the last 7 days' files on any given day.
Unfortunately the folder contains other files as well, so I can't rely on partition discovery.
I have files in the below format:
prefix-yyyyMMdd/
How can I load the folders from the last 7 days in one shot?
Since it is the run date, I cannot have a predefined regex to load the data, as I have to account for month and year changes.
I have a couple of brute-force solutions:
Load the data into 7 dataframes and unionAll all 7 to get one dataframe. This looks performance-inefficient, but not entirely bad.
Load the entire folder and apply a where condition on the column that has the date.
This looks storage-heavy, as the folder contains years' worth of data.
Neither looks performance-efficient, and considering each file's data is itself huge, I would like to know if there are any better solutions.
Is there a better way to do it?
DataFrameReader methods can take multiple paths, e.g.
spark.read.parquet("prefix-20190704", "prefix-20190703", ...)
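For the "last 7 days" case the path list can be generated with datetime, which handles month and year rollovers automatically. A sketch following the prefix-yyyyMMdd naming from the question:
from datetime import date, timedelta

today = date.today()
paths = ["prefix-" + (today - timedelta(days=i)).strftime("%Y%m%d") for i in range(7)]
df = spark.read.parquet(*paths)     # unpack the list into multiple path arguments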

PySpark: how to read in partitioning columns when reading parquet

I have data stored in parquet files and a Hive table partitioned by year, month, day. Thus, each parquet file is stored in a /table_name/year/month/day/ folder.
I want to read in data for only some of the partitions. I have list of paths to individual partitions as follows:
paths_to_files = ['hdfs://data/table_name/2018/10/29',
'hdfs://data/table_name/2018/10/30']
And then try to do something like:
df = sqlContext.read.format("parquet").load(paths_to_files)
However, then my data does not include the information about year, month and day, as this is not part of the data per se, rather the information is stored in the path to the file.
I could use the SQL context and send a Hive query with a select statement with a where clause on the year, month and day columns to select only the data from the partitions I am interested in. However, I'd rather avoid constructing SQL queries in Python as I am very lazy and don't like reading SQL.
I have two questions:
What is the optimal way (performance-wise) to read in the data stored as parquet, where information about year, month and day is not present in the parquet file but is only included in the path to the file? (Either send a Hive query using sqlContext.sql('...'), or use read.parquet, ... anything really.)
Can I somehow extract the partitioning columns when using the approach I outlined above?
Reading the direct file paths to the parent directory of the year partitions should be enough for a dataframe to determine there are partitions under it. However, it wouldn't know what to name the partitions without a directory structure like /year=2018/month=10, for example.
Therefore, if you have Hive, then going via the metastore would be better because the partitions are named there, Hive stores extra useful information about your table, and then you're not reliant on knowing the direct path to the files on disk from the Spark code.
Not sure why you think you need to read/write SQL, though.
Use the Dataframe API instead, e.g
df = spark.table("table_name")
df_2018 = df.filter(df['year'] == 2018)
df_2018.show()
Your data isn't stored in a way that's optimal for Parquet, so you'd have to load the files one by one and add the dates yourself.
Alternatively, you can move the files to a directory structure fit for Parquet partition discovery
(e.g. .../table/year=2018/month=10/day=29/file.parquet).
Then you can read the parent directory (table) and filter on year, month, and day; Spark will only read the relevant directories, and you'd also get these as columns in your dataframe.
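A rough sketch of the first option (load each path, add the date columns, then union), assuming the paths_to_files list from the question, where the last three path segments are year/month/day:
from functools import reduce
from pyspark.sql.functions import lit

def load_one(path):
    year, month, day = path.rstrip("/").split("/")[-3:]   # last three path segments
    return (spark.read.parquet(path)
            .withColumn("year",  lit(int(year)))
            .withColumn("month", lit(int(month)))
            .withColumn("day",   lit(int(day))))

df = reduce(lambda a, b: a.union(b), [load_one(p) for p in paths_to_files])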

Can you add more than one partition in one "ALTER TABLE" command?

I'm using Amazon Athena to query through some log files stored in an S3 bucket, and am using partitions to section off days of the year for the files I need to query. I was wondering -- since I have a large batch of days to add to my table, could I do it all in one ALTER TABLE command, or do I need to have as many ALTER TABLE commands as the number of partitions I would like to create?
This is an example of the command I am using at the moment:
ALTER TABLE
logfiles
ADD PARTITION
(day='20170525')
location 's3://log-bucket/20170525/';
If I do have to use one ALTER TABLE command per partition, is there a way to create a range of days, and then have Athena loop through it to create the partitions, instead of me manually copy/pasting out this command 100+ times?
It appears that you can add many partitions in one ALTER TABLE command, per the Athena documentation at https://docs.aws.amazon.com/athena/latest/ug/alter-table-add-partition.html (or go to the Athena docs root and search for "add partition").
ALTER TABLE orders ADD
PARTITION (dt = '2016-05-14', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_14_May_2016'
PARTITION (dt = '2016-05-15', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_15_May_2016';
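If the partitions still have to be scripted, the statement can be generated for a range of days rather than copy/pasted. A small Python sketch; the table and bucket names are taken from the question's example, and the start/end dates are placeholders:
from datetime import date, timedelta

start, end = date(2017, 5, 1), date(2017, 5, 31)          # placeholder date range
days = [(start + timedelta(days=i)).strftime("%Y%m%d")
        for i in range((end - start).days + 1)]

clauses = "\n".join("PARTITION (day='{d}') LOCATION 's3://log-bucket/{d}/'".format(d=d)
                    for d in days)
print("ALTER TABLE logfiles ADD\n" + clauses + ";")       # run the printed statement in Athena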

Azure table storage - customer data per day

What do you recommend for the following scenario:
If I have 100,000,000 items in a table (a lot of items, to be more exact), how can I get those items per day?
Once the items are added to the table, they are not modified or deleted anymore. Basically it's just inserting and reading them.
My question is about retrieving them without having to loop through all 100,000,000 items.
Should I make the PartitionKey a datetime or just a date, and then retrieve by PartitionKey where it equals, for example, 22.10.2013?
What do you recommend?
If you are reading the items per day, then using the date (just the Date part, not the full DateTime) as the PartitionKey is the best solution.
When using a Date as the Key, I prefer converting it to a String in the YYYYMMDD (or YYYY-MM-DD) format.
The use of a datetime as a PartitionKey is an anti-pattern since all writes go in the same partition - which limits scalability. The Azure Storage scalability targets indicate that you can do 2,000 operations a second against a partition but 20,000 operations a second against a storage account. You can get round this by sharding inserts across a set of buckets for the day - and prepending the date with the bucket name.
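A minimal sketch of the bucketed-key idea above; the bucket count and the exact key format (bucket prefix prepended to the YYYYMMDD date) are illustrative assumptions, not anything prescribed by Azure Table Storage:
import zlib
from datetime import date

NUM_BUCKETS = 10

def partition_key(item_date: date, row_key: str) -> str:
    # stable hash so the same row always lands in the same bucket
    bucket = zlib.crc32(row_key.encode("utf-8")) % NUM_BUCKETS
    # bucket name prepended to the date, as suggested above
    return "{:02d}-{:%Y%m%d}".format(bucket, item_date)

# querying one day then means querying the NUM_BUCKETS keys "00-20131022" .. "09-20131022"
print(partition_key(date(2013, 10, 22), "order-12345"))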
