I am using Spark SQL in an AWS Glue script to transform some data in S3.
Here is the script logic:
Data format: CSV
Programming language: Python
1) Pull the data from S3 using the Glue Catalog into a Glue DynamicFrame
2) Extract the Spark DataFrame from the DynamicFrame using toDF()
3) Register the Spark DataFrame as a Spark SQL table with createOrReplaceTempView()
4) Use a SQL query to transform (here is where I am having issues)
5) Convert the final DataFrame back to a Glue DynamicFrame
6) Store the final DynamicFrame in S3 using glueContext.write_dynamic_frame.from_options() (a sketch of this skeleton follows the list)
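For reference, a minimal sketch of that skeleton looks roughly like this (an illustration only; the database name "my_db", table name "sales_csv" and the output S3 path are placeholders, not the real ones):

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# 1) + 2) Glue Catalog -> DynamicFrame -> Spark DataFrame ("my_db"/"sales_csv" are placeholders)
dyf = glueContext.create_dynamic_frame.from_catalog(database="my_db", table_name="sales_csv")
sdf_sales = dyf.toDF()

# 3) + 4) register a temp view and transform with Spark SQL
sdf_sales.createOrReplaceTempView("sales")
result = spark.sql("SELECT customer_name, demand_amt FROM sales")

# 5) + 6) convert back to a DynamicFrame and write to S3 as CSV (output path is a placeholder)
out_dyf = DynamicFrame.fromDF(result, glueContext, "out_dyf")
glueContext.write_dynamic_frame.from_options(
    frame=out_dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="csv",
)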
Problem
When I am using comparison in SQL such as WHERE >
or
(case when <some_columns> > <some int> then 1 else 0 end) as <some_newcol>
I am getting the following error
pyspark.sql.utils.AnalysisException: u"cannot resolve '(sales.`cxvalue` >
100000)' due to data type mismatch: differing types in '(sales.`cxvalue` >
100000)' (struct<int:int,string:string> and int).; line 1 pos 35;\n'Project
['demand_amt]\n+- 'Filter (cxvalue#4 > 100000)\n +- SubqueryAlias sales\n +-
LogicalRDD [sales_id#0, customer_name#1, customer_loc#2, demand_amt#3L,
cxvalue#4]\n"
pyspark.sql.utils.AnalysisException: u"cannot resolve '(sales.`cxvalue` =
100000)' due to data type mismatch: differing types in '(sales.`cxvalue` =
100000)' (struct<int:int,string:string> and int).; line 1 pos 33;\n'Project
[customer_name#1, CASE WHEN (cxvalue#4 = 100000) THEN demand_amt#3 ELSE 0 END AS
small#12, CASE WHEN cxvalue#4 IN (200000,300000,400000) THEN demand_amt#3 ELSE 0
END AS medium#13]\n+- SubqueryAlias sales\n +- LogicalRDD [sales_id#0,
customer_name#1, customer_loc#2, demand_amt#3, cxvalue#4]\n"
This tells me the column is being treated as both numeric and string, and that this is specific to Spark rather than AWS. SUM() and GROUP BY work fine; only comparisons fail.
I have tried the following steps:
1) Tried to change the column type using the Spark method - Failed
df = df.withColumn(<column>, df[<column>].cast(DoubleType()))  # df is a Spark DataFrame
Glue did not let me change the data type of the Spark DataFrame column this way
2) Used Glue's resolveChoice method as explained in https://github.com/aws-samples/aws-glue-samples/blob/master/examples/resolve_choice.md . The resolveChoice method worked - but the SQL failed with the same error
3) Used cast(<column> as <data_type>) in the SQL query - Failed
4) Spun up a Spark cluster on Google Cloud (just to ensure nothing AWS related) and used plain Spark with the same logic as above - Failed with the same error
5) On the same Spark cluster and the same data set, used the same logic but enforced the schema using StructType and StructField while creating a new Spark DataFrame - Passed (see the sketch after this list)
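For completeness, the schema enforcement in step 5 looked roughly like this (a sketch, not the exact code; the input path is a placeholder and the column names and types are taken from the sample data below):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, LongType

# Enforce the schema up front instead of relying on inference
schema = StructType([
    StructField("sales_id", IntegerType(), True),
    StructField("customer_name", StringType(), True),
    StructField("customer_loc", StringType(), True),
    StructField("demand_amt", LongType(), True),
    StructField("cxvalue", LongType(), True),
])
sdf_sales = spark.read.csv("s3://my-bucket/sales/", header=True, schema=schema)  # placeholder path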
Here is the Sample Data
+--------+-------------+------------+----------+-------+
|sales_id|customer_name|customer_loc|demand_amt|cxvalue|
+--------+-------------+------------+----------+-------+
| 1| ABC| Denver CO| 1200| 300000|
| 2| BCD| Boston MA| 212| 120000|
| 3| CDE| Phoenix AZ| 332| 100000|
| 4| BCD| Boston MA| 211| 120000|
| 5| DEF| Portland OR| 2121|1000000|
| 6| CDE| Phoenix AZ| 32| 100000|
| 7| ABC| Denver CO| 3227| 300000|
| 8| DEF| Portland OR| 2121|1000000|
| 9| BCD| Boston MA| 21| 120000|
| 10| ABC| Denver CO| 1200|300000 |
+--------+-------------+------------+----------+-------+
Here is the sample code with the queries that fail.
sdf_sales.createOrReplaceTempView("sales")
tbl1="sales"
sql2="""select customer_name, (case when cxvalue < 100000 then 1 else 0) as small,
(case when cxvalue in (200000, 300000, 400000 ) then demand_amt else 0 end) as medium
from {0}
""".format(tbl1)
sql4="select demand_amt from {0} where cxvalue>100000".format(tbl1)
However, queries like the following work great and the Glue job succeeds:
sql3="""select customer_name, sum(demand_amt) as total_spent from {0} GROUP BY customer_name""".format(tbl1)
Challenge:
I wish Glue somehow allowed me to change the Spark DataFrame schema. Any suggestion will be appreciated.
Update: AWS Glue's resolveChoice did fix the issue.
My programming logic error: I treated the frame as mutable, but resolveChoice returns a new DynamicFrame, so its result has to be captured (a sketch of the working pattern follows).
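For anyone hitting the same thing, the working pattern looked roughly like this (a sketch; casting cxvalue to long is my assumption from the sample data, and the key point is that resolveChoice returns a new DynamicFrame rather than mutating the existing one):

# resolveChoice returns a NEW DynamicFrame; capture it instead of assuming the
# original frame was changed in place (the cast target "long" is an assumption)
resolved_dyf = dyf.resolveChoice(specs=[("cxvalue", "cast:long")])
sdf_sales = resolved_dyf.toDF()
sdf_sales.createOrReplaceTempView("sales")
spark.sql("select demand_amt from sales where cxvalue > 100000").show()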
Related
I am consuming Kafka data that has an "eventtime" (datetime) field in the packet. I want to create HDFS directories in a "year/month/day" structure in streaming, based on the date part of the eventtime field.
I am using delta-core_2.11:0.6.1 and Spark 2.4.
Example :
/temp/deltalake/data/project_1/2022/12/1
/temp/deltalake/data/project_1/2022/12/2
.
.
and so on.
The nearest thing I found to my requirement was partitionBy(Keys) in the Delta Lake documentation.
That creates the data in this format: /temp/deltalake/data/project_1/year=2022/month=12/day=1
data.show() :
+----+-------+-----+-------+---+-------------------+----------+
|S_No|section| Name| City|Age| eventtime| date|
+----+-------+-----+-------+---+-------------------+----------+
| 1| a|Name1| Indore| 25|2022-02-10 23:30:14|2022-02-10|
| 2| b|Name2| Delhi| 25|2021-08-12 10:50:10|2021-08-12|
| 3| c|Name3| Ranchi| 30|2022-12-10 15:00:00|2022-12-10|
| 4| d|Name4|Kolkata| 30|2022-05-10 00:30:00|2022-05-10|
| 5| e|Name5| Mumbai| 30|2022-07-01 10:32:12|2022-07-01|
+----+-------+-----+-------+---+-------------------+----------+
data
.write
.format("delta")
.mode("overwrite")
.option("mergeSchema", "true")
.partitionBy(Keys)
.save("/temp/deltalake/data/project_1/")
But this too didn't work. I referred to the Medium article below:
https://medium.com/@aravinthR/partitioned-delta-lake-part-3-5cc52b64ebda
It would be great if anyone could help me figure out a possible solution.
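For context, this is roughly how the partitionBy route above is wired up in PySpark (a sketch; the derived column names year/month/day are my own). Note that Spark/Delta partitioning always produces the key=value directory layout (year=2022/month=12/day=1) rather than the bare 2022/12/1 layout I am after:

from pyspark.sql.functions import year, month, dayofmonth

# derive the partition columns from eventtime; this still yields
# .../year=2022/month=12/day=1 directories, not .../2022/12/1
partitioned = (data
    .withColumn("year", year("eventtime"))
    .withColumn("month", month("eventtime"))
    .withColumn("day", dayofmonth("eventtime")))

(partitioned.write
    .format("delta")
    .mode("overwrite")
    .option("mergeSchema", "true")
    .partitionBy("year", "month", "day")
    .save("/temp/deltalake/data/project_1/"))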
I have a stream which I read in pyspark using spark.readStream.format('delta'). The data consists of multiple columns including a type, date and value column.
Example DataFrame;
+----+----------+-----+
|type|      date|value|
+----+----------+-----+
|   1|2020-01-21|    6|
|   1|2020-01-16|    5|
|   2|2020-01-20|    8|
|   2|2020-01-15|    4|
+----+----------+-----+
I would like to create a DataFrame that keeps track of the latest state per type. One of the easiest methods when working on static (batch) data is to use window functions, but using windows on non-timestamp columns is not supported. Another option would look like
stream.groupby('type').agg(last('date'), last('value')).writeStream
but I think Spark cannot guarantee the ordering here, and using orderBy before the aggregations is also not supported in Structured Streaming.
Do you have any suggestions on how to approach this challenge?
Simply use the to_timestamp() function, which can be imported via from pyspark.sql.functions import *,
on the date column so that you can use the window function.
E.g.:
from pyspark.sql.functions import *

df = spark.createDataFrame(
    data=[("1", "2020-01-21")],
    schema=["id", "input_timestamp"])
df = df.withColumn("timestamp", to_timestamp("input_timestamp"))  # parse the string date
df.show(truncate=False)
+---+---------------+-------------------+
|id |input_timestamp|timestamp |
+---+---------------+-------------------+
|1 |2020-01-21 |2020-01-21 00:00:00|
+---+---------------+-------------------+
"but using windows on non-timestamp columns is not supported"
are you saying this from stream point of view, because same i am able to do.
Here is the solution to your problem.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("type").orderBy("date")
df1 = df.withColumn("rank", F.rank().over(windowSpec))
df1.show()
+----+----------+-----+----+
|type| date|value|rank|
+----+----------+-----+----+
| 1|2020-01-16| 5| 1|
| 1|2020-01-21| 6| 2|
| 2|2020-01-15| 4| 1|
| 2|2020-01-20| 8| 2|
+----+----------+-----+----+
w = Window.partitionBy('type')
df1.withColumn('maxB', F.max('rank').over(w)).where(F.col('rank') == F.col('maxB')).drop('maxB').show()
+----+----------+-----+----+
|type| date|value|rank|
+----+----------+-----+----+
| 1|2020-01-21| 6| 2|
| 2|2020-01-20| 8| 2|
+----+----------+-----+----+
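As a side note, the same "latest row per type" result can be collapsed into a single pass with row_number over a descending ordering (a sketch, not part of the original answer):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# order each type's rows by date descending and keep only the first one
w_desc = Window.partitionBy("type").orderBy(F.col("date").desc())
latest = (df.withColumn("rn", F.row_number().over(w_desc))
            .where(F.col("rn") == 1)
            .drop("rn"))
latest.show()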
I need to compare two DataFrames. One of them is static and other is streaming.
Sample static DataFrame looks like the following:
id, value
2786, 5
7252, 3
2525, 4
8038, 1
Sample streaming DataFrame looks like the following:
id, value
2786, 9
7252, 8
2525, 7
The result DataFrame should look like this:
id, value
8038, 1
Value is not important at all. I just need to find that, for this mini-batch, I don't have a value with id 8038. I tried to use joins and the subtract() function for this, but the problem is that stream-static joins don't support the kinds of joins that I need, and subtract doesn't work when the static DataFrame is on the left. For example, these expressions will return an error:
staticDF.subtract(streamingDF)
staticDF.join(streamingDF, staticDF.id == streamingDF.id, "left_anti")
Is there any way to get the ids that are in staticDF but not in streamingDF in Spark Structured Streaming?
You can use the foreachBatch sink and then do a left anti join between the static DataFrame and each micro-batch.
streamingDf.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
println("------------------------")
println("Batch "+batchId+ " data")
println("Total Records " + batchDF.count())
println("------------------------")
staticDf.join(batchDF, staticDf("id") === batchDF("id"),"left_anti")
.select(staticDf("*")).show()
//You can also write your output using any writer
//e.g. df.write.format("csv").save("src/test/resources")
}.start()
Inputs:
static df
+----+-----+
| id|value|
+----+-----+
|2786| 5|
|7252| 3|
|2525| 4|
|8038| 1|
+----+-----+
streaming batch 0
2786,9
7252,8
2525,7
streaming batch 1
2786,9
7252,8
Output:
------------------------
Batch 0 data
Total Records 3
------------------------
+----+-----+
| id|value|
+----+-----+
|8038| 1|
+----+-----+
------------------------
Batch 1 data
Total Records 2
------------------------
+----+-----+
| id|value|
+----+-----+
|2525| 4|
|8038| 1|
+----+-----+
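Since the question is in pyspark, here is a rough PySpark equivalent of the same idea (a sketch; static_df and streaming_df stand in for the static and streaming DataFrames):

# for each micro-batch, keep the static rows whose id does not appear in the batch
def compare_with_static(batch_df, batch_id):
    missing = static_df.join(batch_df, on="id", how="left_anti")
    missing.show()
    # missing could also be written out here with any batch writer

query = (streaming_df.writeStream
         .foreachBatch(compare_with_static)
         .start())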
Below is my Hive table definition:
CREATE EXTERNAL TABLE IF NOT EXISTS default.test2(
id integer,
count integer
)
PARTITIONED BY (
fac STRING,
fiscaldate_str DATE )
STORED AS PARQUET
LOCATION 's3://<bucket name>/backup/test2';
I have the data in the Hive table as below (I just inserted sample data):
select * from default.test2
+---+-----+----+--------------+
| id|count| fac|fiscaldate_str|
+---+-----+----+--------------+
| 2| 3| NRM| 2019-01-01|
| 1| 2| NRM| 2019-01-01|
| 2| 3| NRM| 2019-01-02|
| 1| 2| NRM| 2019-01-02|
| 2| 3| NRM| 2019-01-03|
| 1| 2| NRM| 2019-01-03|
| 2| 3|STST| 2019-01-01|
| 1| 2|STST| 2019-01-01|
| 2| 3|STST| 2019-01-02|
| 1| 2|STST| 2019-01-02|
| 2| 3|STST| 2019-01-03|
| 1| 2|STST| 2019-01-03|
+---+-----+----+--------------+
This table is partitioned on two columns (fac, fiscaldate_str), and we are trying to dynamically execute an insert overwrite at the partition level by using Spark DataFrames (the DataFrame writer).
However, when trying this, we either end up with duplicate data or all other partitions get deleted.
Below are the code snippets for this using Spark DataFrames.
First I am creating the DataFrame:
df = spark.createDataFrame([(99,99,'NRM','2019-01-01'),(999,999,'NRM','2019-01-01')], ['id','count','fac','fiscaldate_str'])
df.show(2,False)
+---+-----+---+--------------+
|id |count|fac|fiscaldate_str|
+---+-----+---+--------------+
|99 |99 |NRM|2019-01-01 |
|999|999 |NRM|2019-01-01 |
+---+-----+---+--------------+
I get duplicate data with the snippet below:
df.coalesce(1).write.mode("overwrite").insertInto("default.test2")
With either of the snippets below, all other data gets deleted and only the new data is available:
df.coalesce(1).write.mode("overwrite").saveAsTable("default.test2")
OR
df.createOrReplaceTempView("tempview")
tbl_ald_kpiv_hist_insert = spark.sql("""
INSERT OVERWRITE TABLE default.test2
partition(fac,fiscaldate_str)
select * from tempview
""")
I am using AWS EMR with Spark 2.4.0 and Hive 2.3.4-amzn-1 along with S3.
Does anyone have any idea why I am not able to dynamically overwrite the data into partitions?
Your question is not easy to follow, but I think you mean you want a partition overwritten. If so, then this is what you need, and all you need, is the second line:
df = spark.createDataFrame([(99,99,'AAA','2019-01-02'),(999,999,'BBB','2019-01-01')], ['id','count','fac','fiscaldate_str'])
df.coalesce(1).write.mode("overwrite").insertInto("test2",overwrite=True)
Note the overwrite=True. The comment made is neither here nor there, as the DF.writer is being used. I am not addressing the coalesce(1).
Comment to Asker
I ran this as I standardly do - when prototyping and answering here - on a Databricks Notebook and expressly set the following and it worked fine:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","static")
spark.conf.set("hive.exec.dynamic.partition.mode", "strict")
You ask to update the answer with:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic").
Can do, as I have just done; maybe in your environment this is needed, but I certainly did not need to do so.
UPDATE 19/3/20
This worked on prior Spark releases; now the following applies, as far as I can see:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
// In Databricks did not matter the below settings
//spark.conf.set("hive.exec.dynamic.partition", "true")
//spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
Seq(("CompanyA1", "A"), ("CompanyA2", "A"),
("CompanyB1", "B"))
.toDF("company", "id")
.write
.mode(SaveMode.Overwrite)
.partitionBy("id")
.saveAsTable("KQCAMS9")
spark.sql(s"SELECT * FROM KQCAMS9").show(false)
val df = Seq(("CompanyA3", "A"))
.toDF("company", "id")
// disregard the coalesce
df.coalesce(1).write.mode("overwrite").insertInto("KQCAMS9")
spark.sql(s"SELECT * FROM KQCAMS9").show(false)
spark.sql(s"show partitions KQCAMS9").show(false)
All OK this way now from 2.4.x onwards.
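For reference, the same dynamic-overwrite pattern in PySpark looks roughly like this (a sketch, assuming Spark 2.4+ and the default.test2 table from the question):

# only the partitions present in df are replaced; all other partitions are left alone
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df = spark.createDataFrame(
    [(99, 99, 'NRM', '2019-01-01')],
    ['id', 'count', 'fac', 'fiscaldate_str'])

df.write.mode("overwrite").insertInto("default.test2", overwrite=True)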
I ran into a surprising behavior when using .select():
>>> my_df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 3| 5|
| 2| 4| 6|
+---+---+---+
>>> a_c = s_df.select(col("a"), col("c")) # removing column b
>>> a_c.show()
+---+---+
| a| c|
+---+---+
| 1| 5|
| 2| 6|
+---+---+
>>> a_c.filter(col("b") == 3).show() # I can still filter on "b"!
+---+---+
| a| c|
+---+---+
| 1| 5|
+---+---+
This behavior got me wondering... Are the following points correct?
DataFrames are just views, a simple DataFrame is a view of itself. In my case a_c is just a view into my_df.
When I created a_c no new data was created, a_c is just pointing at the same data my_df is pointing.
If there is additional information that is relevant, please add!
This is happening because of the lazy nature of Spark. It is "smart" enough to push the filter down so that it happens at a lower level, before the select*. Since this all happens within the same stage of execution, the column can still be resolved. In fact, you can see this in explain:
== Physical Plan ==
*Project [a#0, c#2]
+- *Filter (b#1 = 3) <---Filter before Project
+- LocalTableScan [A#0, B#1, C#2]
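(For reference, the plan above can be reproduced with something like a_c.filter(col("b") == 3).explain(), using the DataFrames from the question.)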
You can force a shuffle and a new stage, though, and then see your filter fail, even catching it at compile time. Here's an example:
a_c.groupBy("a","c").count.filter(col("b") === 3)
*There is also projection pruning, which pushes the selection down to the database layer if Spark realizes it doesn't need the column at any point. However, I believe the filter would cause it to "need" the column and not prune it... but I didn't test that.
Let us start with some basics about Spark's internals. This will make your understanding easier.
RDD: Underlying Spark Core is the data structure called RDD, which is lazily evaluated. By lazy evaluation we mean that RDD computation happens only when an action is called (like count on an RDD or show on a Dataset).
A Dataset or DataFrame (which is a Dataset[Row]) also uses RDDs at the core.
This means every transformation (like filter) will be realized only when an action is triggered (like show).
So, regarding your point
"When I created a_c no new data was created, a_c is just pointing at the same data my_df is pointing."
No data has been realized yet; we have to realize it to bring it into memory, and your filter works against the initial DataFrame.
The only way to make your a_c.filter(col("b") == 3).show() throw a runtime exception is to cache your intermediate DataFrame using dataframe.cache.
E.g.:
a_c = my_df.select(col("a"), col("c")).cache()
a_c.filter(col("b") == 3).show()
Spark will then throw an org.apache.spark.sql.AnalysisException: Cannot resolve column name.