Write dataframe without column names as part of the file path - apache-spark

I have to write a Spark dataframe to a path of the format: base_path/{year}/{month}/{day}/{hour}/
If I do something like below:
pc = ["year", "month", "day", "hour"]
df.write.partitionBy(*pc).parquet("base_path/", mode = 'append')
It creates the location as: base_path/year=2022/month=04/day=25/hour=10/.
I do not want the column names year, month, day and hour to be part of the path; I want something like base_path/2022/04/25/10/ instead. Is there any solution for this?

The column names are written as part of the path because the partition values are not stored in the data files themselves, so the column name has to appear in the path in order to read the data back (following the Hive-style partitioning convention).
For more information about this see here.
If you still want to write the data with the above path layout, you can issue multiple write commands, each with an explicit path and a filter on the corresponding partition values.
The current logic for determining the partition path is located here and there doesn't seem to be a way to replace it in a pluggable way (you could technically load a different implementation in the JVM or write your own writer implementation, but I would not recommend that).
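A minimal sketch of that multiple-write approach, assuming the partition columns are year, month, day and hour as in the question (this is an illustration, not a built-in Spark feature):
from pyspark.sql import functions as F

# one explicit write per distinct partition combination
partitions = df.select("year", "month", "day", "hour").distinct().collect()

for p in partitions:
    path = "base_path/{}/{}/{}/{}/".format(p["year"], p["month"], p["day"], p["hour"])
    (df.filter((F.col("year") == p["year"]) &
               (F.col("month") == p["month"]) &
               (F.col("day") == p["day"]) &
               (F.col("hour") == p["hour"]))
       .drop("year", "month", "day", "hour")
       .write.mode("append")
       .parquet(path))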

Related

Parquet Format - split columns in different files

The Parquet documentation explicitly mentions that the design supports splitting the metadata and data into different files, including the possibility that different column groups can be stored in different files.
However, I could not find any instructions on how to achieve that. In my use case I would like to store the metadata in one file, the data for columns 1-100 in one file, and columns 101-200 in a second file.
Any idea how to achieve this?
If you are using PySpark, it's as easy as this:
df = spark.createDataFrame(...)
df.write.parquet('file_name.parquet')
and it will create a folder called file_name.parquet in the default location in HDFS. You can just create two dataframes, one with columns 1-100 and the other with columns 101-200, and save them separately. The metadata, if you mean the data frame schema, is saved automatically.
You can select a range of columns like this:
df_first_hundred = df.select(df.columns[:100])
df_second_hundred = df.select(df.columns[100:])
Save them as separate files:
df_first_hundred.write.parquet('df_first_hundred')
df_second_hundred.write.parquet('df_second_hundred')

Spark find max of date partitioned column

I have a parquet partitioned in the following way:
data
/batch_date=2020-01-20
/batch_date=2020-01-21
/batch_date=2020-01-22
/batch_date=2020-01-23
/batch_date=2020-01-24
Here batch_date, which is the partition column, is of date type.
I want to read only the data from the latest date partition, but as a consumer I don't know what the latest value is.
I could use a simple group by something like
df.groupby().agg(max(col('batch_date'))).first()
While this would work, it's very inefficient since it involves a groupBy over the whole dataset.
I want to know if we can query the latest partition in a more efficient way.
Thanks.
Doing the method suggested by @pasha701 would involve loading the entire Spark data frame with all the batch_date partitions and then finding the max of that. I think the author is asking for a way to directly find the max partition date and load only that.
One way is to use hdfs or s3fs to load the contents of the S3 path as a list, find the max partition, and then load only that. That would be more efficient.
Assuming your data is on AWS S3, something like this:
import s3fs

datelist = []
inpath = "s3://bucket_path/data/"

fs = s3fs.S3FileSystem(anon=False)
dirs = fs.ls(inpath)                  # list the partition directories under the path
for path in dirs:
    date = path.split('=')[1]         # extract the value from .../batch_date=<value>
    datelist.append(date)
maxpart = max(datelist)

df = spark.read.parquet("s3://bucket_path/data/batch_date=" + maxpart)
This does all the work with plain Python lists, without loading any data into Spark until it finds the partition you want to load.
Function "max" can be used without "groupBy":
df.select(max("batch_date"))
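To actually use the result you still need to pull the value back to the driver; a small sketch (not part of the original answer, names are illustrative):
from pyspark.sql.functions import col, max as spark_max

# collect the maximum partition value, then keep only that partition's rows
latest = df.select(spark_max("batch_date").alias("latest")).first()["latest"]
df_latest = df.filter(col("batch_date") == latest)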
Using SHOW PARTITIONS to get all partitions of the table:
show partitions TABLENAME
Output will be like
pt=2012.07.28.08/is_complete=1
pt=2012.07.28.09/is_complete=1
We can get data from a specific partition using the query below:
select * from TABLENAME where pt='2012.07.28.10' and is_complete='1' limit 1;
Or additional filter or group by can be applied on it.
This worked for me in PySpark v2.4.3. First extract the partitions (this is for a dataframe partitioned on a single date column; I haven't tried it when a table has more than one partition column):
df_partitions = spark.sql("show partitions database.dataframe")
"show partitions" returns dataframe with single column called 'partition' with values like partitioned_col=2022-10-31. Now we create a 'value' column extracting just the date part as string. This is then converted to date and the max is taken:
date_filter = df_partitions.withColumn('value', to_date(split('partition', '=')[1], 'yyyy-MM-dd')).agg({"value":"max"}).first()[0]
date_filter now contains the maximum date among the partitions and can be used in a where clause pulling from the same table.
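For example (a small illustrative sketch, reusing the table and column names from above):
from pyspark.sql.functions import col

df_latest = (spark.table("database.dataframe")
             .where(col("partitioned_col") == date_filter))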

PySpark: how to read in partitioning columns when reading parquet

I have data stored in parquet files and a Hive table partitioned by year, month and day. Thus, each parquet file is stored in a /table_name/year/month/day/ folder.
I want to read in data for only some of the partitions. I have a list of paths to individual partitions, as follows:
paths_to_files = ['hdfs://data/table_name/2018/10/29',
'hdfs://data/table_name/2018/10/30']
And then try to do something like:
df = sqlContext.read.format("parquet").load(paths_to_files)
However, then my data does not include the information about year, month and day, as this is not part of the data per se; rather, the information is stored in the path to the file.
I could use the SQL context and send a Hive query with a select statement and a where clause on the year, month and day columns to select only data from the partitions I am interested in. However, I'd rather avoid constructing a SQL query in Python as I am very lazy and don't like reading SQL.
I have two questions:
What is the optimal way (performance-wise) to read in data stored as parquet, where the information about year, month and day is not present in the parquet files but only in the path to the files? (Either send a Hive query using sqlContext.sql('...'), use read.parquet, ... anything really.)
Can I somehow extract the partitioning columns when using the approach I outlined above?
Reading the direct file paths to the parent directory of the year partitions should be enough for a dataframe to determine there are partitions under it. However, it wouldn't know what to name the partitions without a directory structure like /year=2018/month=10, for example.
Therefore, if you have Hive, then going via the metastore would be better because the partitions are named there, Hive stores extra useful information about your table, and then you're not reliant on knowing the direct path to the files on disk from the Spark code.
Not sure why you think you need to read/write SQL, though.
Use the Dataframe API instead, e.g
df = spark.table("table_name")
df_2018 = df.filter(df['year'] == 2018)
df_2018.show()
Your data isn't stored in a way that Parquet partition discovery understands, so you'd have to load the files one by one and add the date columns yourself, for example:
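A rough sketch of that per-path approach (the path is one of those from the question; the column names and values are illustrative assumptions):
from pyspark.sql import functions as F

# read a single partition path and attach the date columns by hand
df_29 = (spark.read.parquet("hdfs://data/table_name/2018/10/29")
         .withColumn("year", F.lit(2018))
         .withColumn("month", F.lit(10))
         .withColumn("day", F.lit(29)))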
Alternatively, you can move the files to a directory structure that partition discovery understands
(e.g. .../table/year=2018/month=10/day=29/file.parquet).
Then you can read the parent directory (table) and filter on year, month and day; Spark will only read the relevant directories, and you also get these values as columns in your dataframe.
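A small sketch of that read-and-filter pattern, assuming the files have already been moved into the Hive-style layout (the base path is taken from the question):
# Spark discovers year/month/day from the Hive-style sub-directories and
# prunes them when you filter on those columns
df = spark.read.parquet("hdfs://data/table_name")
df_subset = df.filter((df.year == 2018) & (df.month == 10) & (df.day == 29))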

Azure databricks spark - write to blob storage

I have a data frame with two columns: filepath (a wasbs file path for blobs) and a string, and I want to write each string to a separate blob with that file name. How can I do this?
You can only write to one wasb container at a time - not sure if this is part of your question, but I want to clarify either way. In addition, Spark writes files to directories, not single files. If you want to accomplish exactly what you're asking for, you'll have to repartition to 1 partition and partition by filepath.
After that step you'll need to use the azure sdk to rename the files and move them up to the parent directory.
UPDATED ANSWER:
I found a much simpler way of accomplishing this using dbutils.fs.put. You would need to loop through each row of your DataFrame, calling dbutils.fs.put() for each row.
Assuming your input file (assumed CSV) with two columns looks something like:
filepath, stringValue
wasbs://container@myaccount.blob.core.windows.net/demo1.txt,"demo string 1"
wasbs://container@myaccount.blob.core.windows.net/demo2.txt,"demo string 2"
wasbs://container@myaccount.blob.core.windows.net/demo3.txt,"demo string 3"
wasbs://container@myaccount.blob.core.windows.net/demo4.txt,"demo string 4"
wasbs://container@myaccount.blob.core.windows.net/demo5.txt,"demo string 5"
You can use the following to loop through each row in your input DataFrame:
df = spark.read.option("header", True).csv("wasbs://container@myaccount.blob.core.windows.net/demo-data.csv")
rowList = df.rdd.collect()
for row in rowList:
    dbutils.fs.put(str(row[0]), str(row[1]), True)
The put method writes a given String out to a file, encoded in UTF-8, so using this you can loop through each record in your DataFrame, passing the first column in as the file path and the second as the string contents to write to the file.
This also has the benefit of writing the string to a single file, so you don't need to go through the process of renaming and moving files.
OLD ANSWER:
Due to the distributed nature of Spark, writing a DataFrame to files results in a directory being created which will contain multiple files. You can use coalesce to force the processing to a single worker and file, whose name will start with part-0000.
DISCLAIMER: This is recommended only for small files, as larger data files can lead to out of memory exceptions.
To accomplish what you are attempting, you would need to loop through each row of your DataFrame, creating a new DataFrame for each row which contains only the string value you want written to the file.
Assuming your input file (assumed CSV) with two columns looks something like:
filepath, stringValue
wasbs://container@myaccount.blob.core.windows.net/demo1,"demo string 1"
wasbs://container@myaccount.blob.core.windows.net/demo2,"demo string 2"
wasbs://container@myaccount.blob.core.windows.net/demo3,"demo string 3"
wasbs://container@myaccount.blob.core.windows.net/demo4,"demo string 4"
wasbs://container@myaccount.blob.core.windows.net/demo5,"demo string 5"
You can use the following to loop through each row in your input DataFrame:
from pyspark.sql import *
from pyspark.sql.types import StringType
df = spark.read.option("header", True).csv("wasbs://container@myaccount.blob.core.windows.net/demo-data.csv")
rowList = df.rdd.collect()
for row in rowList:
    dfRow = spark.createDataFrame([str(row[1])], StringType())
    dfRow.coalesce(1).write.mode("overwrite").text(row[0])
This will result in directories being created in your Blob Storage account container named demo1, demo2, demo3, demo4, and demo5. Each of those will contain multiple files. The file within each directory whose name begins with part-0000 is the file that will contain your string value.
If you need those files to have different names, and be in a different location, you can then use dbutils.fs methods to handle moving the files and doing the renames. You can also use this to do any cleanup of the directories that were created, if desired.
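For example, a rough sketch of that move-and-rename step with dbutils.fs (not part of the original answer; container and file names are illustrative):
src_dir = "wasbs://container@myaccount.blob.core.windows.net/demo1/"
dst_file = "wasbs://container@myaccount.blob.core.windows.net/demo1.txt"

# find the part-0000* file Spark wrote, copy it to the desired blob name,
# then remove the directory Spark created
part_file = [f.path for f in dbutils.fs.ls(src_dir) if f.name.startswith("part-")][0]
dbutils.fs.cp(part_file, dst_file)
dbutils.fs.rm(src_dir, True)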

Spark filter dataframe returns empty result

I'm working on a project with Scala and Spark, processing files that are stored in HDFS. Those files land in HDFS every day in the morning. I have a job that reads the file from HDFS each day, processes it and then writes the result to HDFS.
After I convert the file into a Dataframe, this job applies a filter to keep only the rows whose timestamp is higher than the highest timestamp processed in the last file.
This filter behaves unexpectedly only on some days. Some days it works as expected, and on other days, even though the new file contains rows that match the filter, the filter result is empty. This happens every time for the same file when it's executed in the TEST environment, but locally it works as expected using the same file and the same HDFS connection.
I've tried filtering in different ways, but none of them work in that environment for some specific files, while all of them work fine in my LOCAL:
1) Spark sql
val diff = fp.spark.sql("select * from curr " +
  s"where TO_DATE(CAST(UNIX_TIMESTAMP(substring(${updtDtCol},${substrStart},${substrEnd}),'${dateFormat}') as TIMESTAMP))" +
  s" > TO_DATE(CAST(UNIX_TIMESTAMP('${prevDate.substring(0,10)}','${dateFormat}') as TIMESTAMP))")
2) Spark filter functions
val diff = df.filter(date_format(unix_timestamp(substring(col(updtDtCol),0,10),dateFormat).cast("timestamp"),dateFormat).gt(date_format(unix_timestamp(substring(col("PrevDate"),0,10),dateFormat).cast("timestamp"),dateFormat)))
3) Adding an extra column with the result of the filter and then filtering on this new column
val test2 = df.withColumn("PrevDate", lit(prevDate.substring(0,10)))
.withColumn("DatePre", date_format(unix_timestamp(substring(col("PrevDate"),0,10),dateFormat).cast("timestamp"),dateFormat))
.withColumn("Result", date_format(unix_timestamp(substring(col(updtDtCol),0,10),dateFormat).cast("timestamp"),dateFormat).gt(date_format(unix_timestamp(substring(col("PrevDate"),0,10),dateFormat).cast("timestamp"),dateFormat)))
.withColumn("x", when(date_format(unix_timestamp(substring(col(updtDtCol),0,10),dateFormat).cast("timestamp"),dateFormat).gt(date_format(unix_timestamp(substring(col("PrevDate"),0,10),dateFormat).cast("timestamp"),dateFormat)), lit(1)).otherwise(lit(0)))
val diff = test2.filter("x == 1")
I think that the issue is not caused by the filter itself, and probably not by the file either, but I would like feedback about what I should check, or whether anybody has faced this before.
Please let me know what information would be useful to post here in order to get some feedback.
A sample of the file looks like the following:
|TIMESTAMP |Result|x|
|2017-11-30-06.46.41.288395|true |1|
|2017-11-28-08.29.36.188395|false |0|
The TIMESTAMP values are compared with the previousDate (for instance: 2017-11-29), and I create a column called 'Result' with the result of that comparison, which always works in both environments, and also another column called 'x' with the same result.
As I mentioned before, if I use the comparison between both dates, or the result in column 'Result' or 'x', to filter the dataframe, sometimes the result is an empty dataframe, while locally, using the same HDFS and file, the result contains data.
I suspect it to be a data/date format issue. Did you get a chance to verify if the dates converted are as expected?
If the date string for both the columns has timezone included, the behavior is predictable.
If only one of them has the timezone included, the results will differ between local and remote execution. It depends entirely on the timezone of the cluster.
For debugging the issue, I would suggest adding extra columns to capture the unix_timestamp(..)/millis of the respective date strings, and an additional column to capture the difference of the two. The diff column should help to find out where and why the conversions went wrong. Hope this helps.
In case anybody wants to know what happened with this issue and how I finally found the cause of the error, here is the explanation. Basically it was caused by the different timezones of the machines where the job was executed (the LOCAL machine and the TEST server). The unix_timestamp function returned the correct value for the timezone of each server. Basically, in the end I didn't have to use the unix_timestamp function and I didn't need to use the full content of the date field. Next time I will double check this before.
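As an aside (a suggestion that is not part of the original answer, assuming Spark 2.2+): pinning the session timezone makes timestamp functions behave consistently regardless of each machine's local timezone, for example:
spark.conf.set("spark.sql.session.timeZone", "UTC")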
