How to Include the Value of Partitioned Column in a Spark data frame or Spark SQL Temp Table in AWS Glue?

I am using Python 3 and Glue 1.0 for this code.
I have partitioned data in S3. The data is partitioned on the year, month, day and extra_field_name columns.
When I load the data into a data frame, its schema contains every column except the partitioned ones.
Here is the code:
glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options = {"paths": path_list, "recurse": True, "groupFiles": "inPartition"},
    format = "parquet"
).toDF().registerTempTable(final_arguement_list["read_table_" + str(i+1)])
The path_list variable contains the list of paths that need to be loaded into the data frame.
I am printing the schema using the command below:
glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options = {"paths": path_list, "recurse": True},
    format = "parquet"
).toDF().printSchema()
The schema that I am getting in the CloudWatch logs does not contain any of the partitioned columns.
Please note that I have already tried loading the data by providing paths down to the year, month, day and extra_field_name level separately, but I still get only the columns that are present in the parquet files themselves.

I was able to do this with an additional step: have a crawler crawl the directory on S3, and then use the resulting table in the Glue Data Catalog as the source for the Glue ETL job.
Once you have a crawler over the location s3://path/to/source/data/, year, month and day will automatically be treated as partition columns. Then you could try the following in your Glue ETL script:
data_dyf = glueContext.create_dynamic_frame.from_catalog(
    database = db_name,
    table_name = tbl_name,
    push_down_predicate = "(year=='2018' and month=='05')"
)
You can find more details here

As a workaround, I created duplicate columns in the data frame itself, named year_2, month_2, day_2 and extra_field_name_2, as copies of year, month, day and extra_field_name.
During the data ingestion phase, I partitioned the data frame on year, month, day and extra_field_name and stored it in S3, which keeps the values of year_2, month_2, day_2 and extra_field_name_2 inside the parquet files themselves.
While performing data manipulation, I load the data into a dynamic frame by providing the list of paths in the following manner:
['s3://path/to/source/data/year=2018/month=1/day=4/', 's3://path/to/source/data/year=2018/month=1/day=5/', 's3://path/to/source/data/year=2018/month=1/day=6/']
This gives me year_2, month_2, day_2 and extra_field_name_2 in the dynamic frame, which I can then use for further data manipulation.
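The write side of this workaround looks roughly like the following sketch, assuming the data frame is called df and is written with plain Spark (column and path names are placeholders):
from pyspark.sql import functions as F

# Duplicate the partition columns so their values survive inside the parquet files
df = (df
      .withColumn("year_2", F.col("year"))
      .withColumn("month_2", F.col("month"))
      .withColumn("day_2", F.col("day"))
      .withColumn("extra_field_name_2", F.col("extra_field_name")))

# Partition on the original columns; the *_2 copies remain as regular columns in the files
(df.write
   .partitionBy("year", "month", "day", "extra_field_name")
   .mode("append")
   .parquet("s3://path/to/source/data/"))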

Try passing the basePath to the connection_options argument:
glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options = {
        "paths": path_list,
        "recurse": True,
        "basePath": "s3://path/to/source/data/"
    },
    format = "parquet"
).toDF().printSchema()
This way, partition discovery will pick up the partitions that sit above your paths. According to the documentation, these options are passed through to the Spark SQL data source.
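For comparison, a minimal sketch of the same idea in plain Spark (assuming a SparkSession named spark, e.g. glueContext.spark_session; the paths are placeholders). The basePath option tells the parquet reader where the partition tree starts, so the partition columns reappear in the schema:
df = (spark.read
      .option("basePath", "s3://path/to/source/data/")
      .parquet("s3://path/to/source/data/year=2018/month=1/day=4/",
               "s3://path/to/source/data/year=2018/month=1/day=5/"))

# year, month and day should now show up as columns
df.printSchema()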
Edit: given that your experiment shows it doesn't work, have you considered passing the top-level directory and filtering from there for the dates of interest? The reader will only read the relevant Hive partitions, as the filter gets "pushed down" to the file system.
from pyspark.sql.functions import col

(glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options = {
        "paths": ["s3://path/to/source/data/"],
        "recurse": True,
    },
    format = "parquet")
    .toDF()
    .filter(
        (col("year") == 2018)
        & (col("month") == 1)
        & (col("day").between(4, 6))
    )
    .printSchema())

Related

Delta Live Tables and ingesting AVRO

So, I'm trying to load avro files into DLT and create pipelines and so forth.
As a simple data frame in Databricks, I can read and unpack the avro files using json / rdd.map / lambda functions, where I can create a temp view, run a SQL query and then select the fields I want.
# example command
in_path = '/mnt/file_location/*/*/*/*/*.avro'
avroDf = spark.read.format("com.databricks.spark.avro").load(in_path)
jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
data = spark.read.json(jsonRdd)
data.createOrReplaceTempView("eventhub")

# selecting the data
sql_query1 = sqlContext.sql("""
    select distinct
        data.field.test1 as col1
        ,data.field.test2 as col2
        ,data.field.fieldgrp.city as city
    from
        eventhub
""")
However, I am trying to replicate the process using Delta Live Tables and pipelines.
I have used Auto Loader to load the files into a table and kept the format as is, so bronze is just avro in its rawest form.
I then planned to create a view that unpacks the avro file, much like I did above with "eventhub", which would then allow me to create queries.
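For context, the bronze ingestion step is roughly the following Auto Loader table (the path and table name here are placeholders for my actual locations):
import dlt

@dlt.table(name="bronze_file_list")
def bronze_file_list():
    # Incrementally ingest the raw avro files as-is with Auto Loader
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "avro")
            .load("/mnt/file_location/"))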
The trouble is, I can't get it to work in DLT. I fail at the second step, after I have imported the files into the bronze layer. It just does not seem to apply the functions to make the data readable/selectable.
This is the sort of code I have been trying. However, it does not seem to pick up the schema, so it is as if the functions are not working; when I try to select a column, it does not recognise it.
# unpacked data
@dlt.view(name="eventdata_v")
def eventdata_v():
    avroDf = spark.read.format("delta").table("live.bronze_file_list")
    jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
    data = spark.read.json(jsonRdd)
    return data

# trying to query the data, but it does not recognise field names, even when I select "data" only
@dlt.view(name="eventdata2_v")
def eventdata2_v():
    df = (
        dlt.read("eventdata_v")
        .select("data.field.test1")
    )
    return df
I have been working on this for weeks, trying different approaches, but still no luck.
Any help will be much appreciated. Thank you.

AWS Glue performance when writing

After performing joins and aggregations I want the output to be in one file, partitioned on some column.
When I use repartition(1), the job takes 1 hour; if I remove repartition(1), there are multiple part files per partition and it takes 30 minutes (refer to the example below).
So is there a way to write the data into one file?
...
...
df = df.repartition(1)
glueContext.write_dynamic_frame.from_options(
    frame = df,
    connection_type = "s3",
    connection_options = {
        "path": "s3://s3path",
        "partitionKeys": ["choice"]
    },
    format = "csv",
    transformation_ctx = "datasink2")
Is there any other way to increase the write performance? Does changing the format help? And how can I achieve parallelism while still ending up with one output file per partition?
S3 storage example
**if repartition(1)** // what I want but takes more time
choice=0/part-00-001
..
..
choice=500/part-00-001
**if removed** // takes less time but multiple files are present
choice=0/part-00-001
....
choice=0/part-00-0032
..
..
choice=500/part-00-001
....
choice=500/part-00-0032
Instead of using df.repartition(1), use df.repartition("choice"):
df = df.repartition("choice")
glueContext.write_dynamic_frame.from_options(
    frame = df,
    connection_type = "s3",
    connection_options = {
        "path": "s3://s3path",
        "partitionKeys": ["choice"]
    },
    format = "csv",
    transformation_ctx = "datasink2")
If the goal is to have one single file, use coalesce instead of repartition; it avoids a full data shuffle.
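A minimal sketch of that suggestion, assuming the result still goes through the Glue writer (the DynamicFrame conversion and the name "dyf" are illustrative):
from awsglue.dynamicframe import DynamicFrame

# Collapse to a single Spark partition without a full shuffle
df = df.coalesce(1)

# Convert back to a DynamicFrame for the Glue writer
dyf = DynamicFrame.fromDF(df, glueContext, "dyf")

glueContext.write_dynamic_frame.from_options(
    frame = dyf,
    connection_type = "s3",
    connection_options = {
        "path": "s3://s3path",
        "partitionKeys": ["choice"]
    },
    format = "csv",
    transformation_ctx = "datasink2")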

Stream writes having multiple identical keys to delta lake

I am writing streams to Delta Lake through Spark Structured Streaming. Each streaming batch contains key-value records (and a timestamp column). Delta Lake doesn't support an update when the source (the streaming batch) has multiple records with the same key, so I want to update Delta Lake with only the record having the latest timestamp for each key. How can I do this?
This is the code snippet I am trying:
def upsertToDelta(microBatchOutputDF: DataFrame, batchId: Long) {
  println(s"Executing batch $batchId ...")
  microBatchOutputDF.show()

  deltaTable.as("t")
    .merge(
      microBatchOutputDF.as("s"),
      "s.key = t.key")
    .whenMatched().updateAll()
    .whenNotMatched().insertAll()
    .execute()
}
Thanks in advance.
You can eliminate the records with older timestamps from your microBatchOutputDF data frame and keep only the record with the latest timestamp for a given key.
You can use Spark's reduceByKey operation and implement a custom reduce function as below:
def getLatestEvents(input: DataFrame): RDD[Row] = {
  input.rdd.map(x => (x.getAs[String]("key"), x)).reduceByKey(reduceFun).map(_._2)
}

def reduceFun(x: Row, y: Row): Row = {
  if (x.getAs[Timestamp]("timestamp").getTime > y.getAs[Timestamp]("timestamp").getTime) x else y
}
This assumes key is of type string and timestamp is of type timestamp. Call getLatestEvents on your streaming batch microBatchOutputDF; it ignores the events with older timestamps and keeps only the latest one per key.
val latestRecordsDF = spark.createDataFrame(getLatestEvents(microBatchOutputDF), <schema of DF>)
Then call the Delta Lake merge operation on top of latestRecordsDF.
In streaming, a micro-batch might contain more than one record for a given key. In order to update the target table, you have to figure out the latest record for each key in the micro-batch. In your case you can use the max of the timestamp column together with the value column to find the latest record and use that one for the merge operation.
You can refer to this link for more details on finding the latest record for a given key.
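As an illustration of that idea, here is a PySpark sketch (not the original Scala) that keeps only the newest record per key in a micro-batch using a window function; the column names key and timestamp are assumptions:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def latest_per_key(micro_batch_df):
    # Rank the records within each key by timestamp, newest first, and keep rank 1
    w = Window.partitionBy("key").orderBy(F.col("timestamp").desc())
    return (micro_batch_df
            .withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") == 1)
            .drop("rn"))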

Spark DataFrame ORC Hive table reading issue

I am trying to read a Hive table in Spark. Below is the Hive Table format:
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \u0001
serialization.format \u0001
When I try to read it using Spark SQL with the below command:
val c = hiveContext.sql("""select
    a
    from c_db.c cs
    where dt >= '2016-05-12' """)

c.show
I am getting the below warning:
18/07/02 18:02:02 WARN ReaderImpl: Cannot find field for: a in _col0,
_col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27, _col28, _col29, _col30, _col31, _col32, _col33, _col34, _col35, _col36, _col37, _col38, _col39, _col40, _col41, _col42, _col43, _col44, _col45, _col46, _col47, _col48, _col49, _col50, _col51, _col52, _col53, _col54, _col55, _col56, _col57, _col58, _col59, _col60, _col61, _col62, _col63, _col64, _col65, _col66, _col67,
The read starts but it is very slow and eventually hits a network timeout.
When I try to read the Hive table directory directly I get the below error:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
val c = hiveContext.read.format("orc").load("/a/warehouse/c_db.db/c")
c.select("a").show()
org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input
columns: [_col18, _col3, _col8, _col66, _col45, _col42, _col31,
_col17, _col52, _col58, _col50, _col26, _col63, _col12, _col27, _col23, _col6, _col28, _col54, _col48, _col33, _col56, _col22, _col35, _col44, _col67, _col15, _col32, _col9, _col11, _col41, _col20, _col2, _col25, _col24, _col64, _col40, _col34, _col61, _col49, _col14, _col13, _col19, _col43, _col65, _col29, _col10, _col7, _col21, _col39, _col46, _col4, _col5, _col62, _col0, _col30, _col47, trans_dt, _col57, _col16, _col36, _col38, _col59, _col1, _col37, _col55, _col51, _col60, _col53];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
I can convert the Hive table to TextInputFormat, but that should be my last option, as I would like to keep the benefit of OrcInputFormat compressing the table size.
I really appreciate your suggestions.
I found a workaround by reading the table this way:
val schema = spark.table("db.name").schema
spark.read.schema(schema).orc("/path/to/table")
The issue generally occurs with large tables, as the read fails at the maximum field length. I added the metastore read setting (set spark.sql.hive.convertMetastoreOrc=true;) and it worked for me.
I think the table doesn't have named columns or, if it has, Spark isn't able to read the names.
You can use the default column names that Spark has given, as mentioned in the error, or set the column names in the Spark code.
Use printSchema and the toDF method to rename the columns. But yes, you will need the mappings. This might require selecting and showing columns individually.
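A minimal PySpark sketch of that renaming idea (an illustration, not the answerer's exact code); one way to get the real names is to take them from the metastore, as in the workaround above, and apply them positionally with toDF:
# Read the ORC files directly; Spark sees positional names _col0, _col1, ...
df = spark.read.orc("/a/warehouse/c_db.db/c")

# Take the column names from the metastore schema and apply them by position
# (assumes the declared column order matches the file's column order)
names = spark.table("c_db.c").schema.fieldNames()
renamed = df.toDF(*names)
renamed.select("a").show()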
Setting the (set spark.sql.hive.convertMetastoreOrc=true;) conf works, but it tries to modify the metadata of the Hive table. Can you please explain to me what is going to be modified, and does it affect the table?
Thanks

SparkSQL: Am I doing it right?

Here is how I use Spark SQL in a little application I am working on.
I have two HBase tables, say t1 and t2.
My input is a csv file. I parse each and every line and query (via Spark SQL) the table t1. I write the output to another file.
Then I parse the second file, query the second table, apply certain functions over the result and output the data.
The table t1 has the purchase details and t2 has the list of items that were added to the cart, along with the time frame, by each user.
Input -> CustomerID (a list of them in a csv file)
Output -> A csv file in the particular format mentioned below:
CustomerID, details of the item he bought, first item he added to cart, all the items he added to cart until purchase.
For an input of 1100 records, it takes two hours to complete the whole process!
I was wondering if I could speed up the process, but I am stuck.
Any help?
How about this DataFrame approach...
1) Create a dataframe from the CSV.
how-to-read-csv-file-as-dataframe
or something like this, as in the example:
val csv = sqlContext.sparkContext.textFile(csvPath).map {
  case (txt) =>
    try {
      val reader = new CSVReader(new StringReader(txt), delimiter, quote, escape, headerLines)
      val parsedRow = reader.readNext()
      Row(mapSchema(parsedRow, schema): _*)
    } catch {
      case e: IllegalArgumentException => throw new UnsupportedOperationException("converted from Arg to Op except")
    }
}
2) Create another DataFrame from the HBase data (if you are using Hortonworks) or via Phoenix.
3) Do the join and apply functions (maybe a udf, or when/otherwise, etc.); the result can be a dataframe again.
4) Join the result dataframe with the second table and output the data as CSV, as in the pseudo-code example below.
It should be possible to prepare a dataframe with custom columns and corresponding values and save it as a CSV file.
You can try this kind of thing in the spark shell as well:
val df = sqlContext.read.format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "true").
  load("cars93.csv")
val df2 = df.filter("quantity <= 4.0")
val col = df2.col("cost") * 0.453592
val df3 = df2.withColumn("finalcost", col)
df3.write.format("com.databricks.spark.csv").
  option("header", "true").
  save("output-csv")
Hope this helps. Good luck.
