How to partition S3 output files by a combination of column values? - apache-spark

I have data which I am crawling into AWS Glue. There I am using PySpark and converting it to Parquet format. My original data is CSV looks something like this:
id, date, data 1, 202003, x 2, 202003, y 1, 202004, z
etc...
I am able to convert the data successfully, but I am unsure the best way to to get the desired output. The output should be split by id and date in S3. So it should have something like:
s3://bucket/outputdata/{id}_{date}/{data}.parquet
Where id and date are the actual id and date values in the data. The name of the files within obviously does not matter, I just want to be able to create "folders" in the S3 object prefix and split the data within them.
I am very new to AWS Glue and I have a feeling I am missing something very obvious.
Thanks in advance.

You can create a partition column by concatenating your two existing columns and then partition by the new column on write e.g.
from pyspark.sql.functions import concat, col, lit
df1 = df.withColumn('p', concat(col('id'), lit('_'), col('date')))
df1.write.partitionBy('p').parquet('s3://bucket/outputdata')

Related

Write each row of a dataframe to a separate json file in s3 with pyspark

in one of my projects, I need to write each row of a dataframe into a separate S3 file in json format. In the actual implementation, map/foreach's input is a Row, though I don't seem to find any member function on Row that could transform a row into json format.
I'm using spark df and don't want to convert it to pandas (as it involves sending everything to the driver?), hence cannot use the to_json function. Is there any other way to do it? I can definitely write my own json converter based on my specific df schema, but just wondering if there is a readily available module.

Parquet Format - split columns in different files

On the parquet documentation is explicitly mentioned that the design supports splitting the metadata and data into different files , including also the possibility that different column groups can be stored in different files.
However , I could not find any instructions on how to achieve that. In my use case I would like to store the metadata in one file , store columns 1-100 data in one file and 101-200 in a second file .
Any idea how to achieve this ?
If you are using PySpark, it's as easy as this:
df = spark.createDataFrameFrom(...)
df.write.parquet('file_name.parquet')
and it will create a folder called file_name.parquet in the default location in HDFS. You can just create two dataframes, one with columns 1-100, and the other dataframe with columns 101-200 and save them separately. It automatically will save the metadata, if you mean the data frame schema.
You can select a range of columns like this:
df_first_hundred = df.select(df.columns[:100])
df_second_hundred = df.select(df.columns[100:])
Save them as separate files:
df_first_hundred.write.parquet('df_first_hundred')
df_second_hundred.write.parquet('df_second_hundred')

Spark find max of date partitioned column

I have a parquet partitioned in the following way:
data
/batch_date=2020-01-20
/batch_date=2020-01-21
/batch_date=2020-01-22
/batch_date=2020-01-23
/batch_date=2020-01-24
Here batch_date which is the partition column is of date type.
I want only read the data from the latest date partition but as a consumer I don't know what is the latest value.
I could use a simple group by something like
df.groupby().agg(max(col('batch_date'))).first()
While this would work it's a very inefficient way since it involves a groupby.
I want to know if we can query the latest partition in a more efficient way.
Thanks.
Doing the method suggested by #pasha701 would involve loading the entire spark data frame with all the batch_date partitions and then finding max of that. I think the author is asking for a way to directly find the max partition date and load only that.
One way is to use hdfs or s3fs, and load the contents of the s3 path as a list and then finding the max partition and then loading only that. That would be more efficient.
Assuming you are using AWS s3 format, something like this:
import sys
import s3fs
datelist=[]
inpath="s3:bucket_path/data/"
fs = s3fs.S3FileSystem(anon=False)
Dirs = fs.ls(inpath)
for paths in Dirs:
date=paths.split('=')[1]
datelist.append(date)
maxpart=max(datelist)
df=spark.read.parquet("s3://bucket_path/data/batch_date=" + maxpart)
This would do all the work in lists without loading anything into memory until it finds the one you want to load.
Function "max" can be used without "groupBy":
df.select(max("batch_date"))
Using Show partitions to get all partition of table
show partitions TABLENAME
Output will be like
pt=2012.07.28.08/is_complete=1
pt=2012.07.28.09/is_complete=1
we can get data form specific partition using below query
select * from TABLENAME where pt='2012.07.28.10' and is_complete='1' limit 1;
Or additional filter or group by can be applied on it.
This worked for me in Pyspark v2.4.3. First extract partitions (this is for a dataframe with a single partition on a date column, haven't tried it when a table has >1 partitions):
df_partitions = spark.sql("show partitions database.dataframe")
"show partitions" returns dataframe with single column called 'partition' with values like partitioned_col=2022-10-31. Now we create a 'value' column extracting just the date part as string. This is then converted to date and the max is taken:
date_filter = df_partitions.withColumn('value', to_date(split('partition', '=')[1], 'yyyy-MM-dd')).agg({"value":"max"}).first()[0]
date_filter contains the maximum date from the partition and can be used in a where clause pulling from the same table.

Can I retrieve limited dataset from a CSV file in Azure Data Lake U-SQL?

Can I filter data loaded from a CSV File using `U-SQL's EXTRACT? I know we can limit the data using the where condition in the select after the EXTRACT. But, I want to filter it during the use of the 'EXTRACT'
I have huge CSV file. I don't want to load all of it into the first dataset itself.
e.g.
I have lot of auto claims in the dataset. I want to filter it while I 'EXTRACT' it based on a date in the dataset.
The answer is yes. However only column pruning can be pushed into the extractors. Since there is no semantics assigned to the data before you extract it with the Csv extractor, filters will be applied on the generated rowset. There are many examples out there that show you how to do so. Here is an example from one of the hands-on-labs.
Yes you can filter data loaded from csv file. You can do something like this:
#log =
EXTRACT UserId int,
StartDate DateTime,
Location string,
....
....
Url string
FROM "/Samples/Data/Log.csv"
USING Extractors.csv();
#result =
SELECT Location, Url, StartDate
FROM #log
WHERE StartDate >= DateTime.Parse("2017/01/6") AND StartDate <= DateTime.Parse("2018/06/08");
OUTPUT #result
TO "/output/cleanlog.csv"
USING Outputters.Csv();

Spark Python: Converting multiple lines from inside a loop into a dataframe

I have a loop that is going to create multiple rows of data which I want to convert into a dataframe.
Currently I am creating a CSV format string and inside the loop keep appending to it along separated by a newline. I am creating a CSV file so that I can also save it as a text file for other processing.
File Header:
output_str="Col1,Col2,Col3,Col4\n"
Inside for loop:
output_str += "Val1,Val2,Val3,Val4\n"
I then create an RDD by splitting it with the newline and then convert in into the dataframe as follows.
output_rdd = sc.parallelize(output_str.split("\n"))
output_df = output_rdd.map(lambda x: (x, )).toDF()
It creates a dataframe but only has 1 column. I know that is because of the map function where I am making it into a list with only 1 item in the set. What I need is a list with multiple items. So perhaps I should be calling split() function on every line to get a list. But I am getting a feeling that there should be a much more straight-forward way. Appreciate any help. Thanks.
Edit: To give more information, using Spark SQL I have filtered my dataset to those rows that contain the problem. However the rows contain information in following format (separated by '|'). And I need to extract those values from column 3 which has corresponding flag set to 1 in column 4 (Here it is 0xcd)
Field1|Field2|0xab,0xcd,0xef|0x00,0x01,0x00
So I am collecting the output at the driver and then parsing the last 2 columns after which I am left with regular strings that I want to put back in a dataframe. I am not sure if I can achieve the same using Spark SQL to parse the output in the manner I want.
Yes, indeed your current approach seems a little too complicated... Creating large string in Spark Driver and then parallelizing it with Spark is not really performant.
First of all question from where you are getting your input data? In my opinion you should use one of existing Spark readers to read it. For example you can use:
CSV -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv
jdbc -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.jdbc
json -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json
parquet -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.parquet
not structured text file -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.html#pyspark.SparkContext.textFile
In next step you can preprocess it using Spark DataFrame or RDD API depending on your use case.
A bit late, but currently you're applying a map to create a tuple for each row containing the string as the first element. Instead of this, you probably want to split the string, which can easily be done inside the map step. Assuming all of your rows have the same number of elements you can replace:
output_df = output_rdd.map(lambda x: (x, )).toDF()
with
output_df = output_rdd.map(lambda x: x.split()).toDF()

Resources