How to CREATE TABLE USING delta with Spark 2.4.4? - apache-spark

This is Spark 2.4.4 and Delta Lake 0.5.0.
I'm trying to create a table using the delta data source and it seems I'm missing something. Although the CREATE TABLE USING delta command ran without error, the table directory was not created and insertInto fails.
The following CREATE TABLE USING delta worked fine, but insertInto failed.
scala> sql("""
create table t5
USING delta
LOCATION '/tmp/delta'
""").show
scala> spark.catalog.listTables.where('name === "t5").show
+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
| t5| default| null| EXTERNAL| false|
+----+--------+-----------+---------+-----------+
scala> spark.range(5).write.option("mergeSchema", true).insertInto("t5")
org.apache.spark.sql.AnalysisException: `default`.`t5` requires that the data to be inserted have the same number of columns as the target table: target table has 0 column(s) but the inserted data has 1 column(s), including 0 partition column(s) having constant value(s).;
at org.apache.spark.sql.execution.datasources.PreprocessTableInsertion.org$apache$spark$sql$execution$datasources$PreprocessTableInsertion$$preprocess(rules.scala:341)
...
I thought I'd create with columns defined, but that didn't work either.
scala> sql("""
create table t6
(id LONG, name STRING)
USING delta
LOCATION '/tmp/delta'
""").show
org.apache.spark.sql.AnalysisException: delta does not allow user-specified schemas.;
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
at org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:78)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:194)
at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3370)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3370)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:194)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
... 54 elided

The OSS version of Delta Lake does not support the SQL CREATE TABLE syntax yet. It will be implemented in future versions that use Spark 3.0.
To create a Delta table, you must write out a DataFrame in Delta format. An example in Python:
df.write.format("delta").save("/some/data/path")
Here's a link to the create table documentation for Python, Scala, and Java.
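As a minimal sketch of the approach above for Spark 2.4.4 and Delta Lake 0.5.0, the table can be created by writing a DataFrame to a path and then reading it back by path rather than by table name. The function name create_delta_table and its parameters are hypothetical, and the sketch assumes an existing SparkSession that already has the delta-core package available.

```python
def create_delta_table(spark, path, rows=5):
    """Create a Delta table at `path` by writing a DataFrame, then load it back.

    `spark` is an existing SparkSession with delta-core on the classpath
    (an assumption; this sketch does not configure the session itself).
    """
    # spark.range(n) yields a single LONG column named "id", as in the question.
    df = spark.range(rows)
    # Writing in Delta format is what creates the table directory.
    # The default save mode raises an error if the path already holds data.
    df.write.format("delta").save(path)
    # Read by path; no metastore entry or CREATE TABLE statement is involved.
    return spark.read.format("delta").load(path)
```

A call such as create_delta_table(spark, "/tmp/delta").show() would then display the rows; later writes to the same path can use mode("append") on the writer.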

An example with pyspark 3.0.0 & delta 0.7.0
print(f"LOCATION '{location}'")
spark.sql(f"""
CREATE OR REPLACE TABLE {TABLE_NAME} (
CD_DEVICE INT,
FC_LOCAL_TIME TIMESTAMP,
CD_TYPE_DEVICE STRING,
CONSUMTION DOUBLE,
YEAR INT,
MONTH INT,
DAY INT )
USING DELTA
PARTITIONED BY (YEAR , MONTH , DAY, FC_LOCAL_TIME)
LOCATION '{location}'
""")
Here, "location" is an HDFS directory where Spark, running in cluster mode, saves the Delta table.

tl;dr CREATE TABLE USING delta is not supported by Spark before 3.0.0 and Delta Lake before 0.7.0.
Delta Lake 0.7.0 with Spark 3.0.0 (both just released) do support CREATE TABLE SQL command.
Be sure to "install" Delta SQL using spark.sql.catalog.spark_catalog configuration property with org.apache.spark.sql.delta.catalog.DeltaCatalog.
$ ./bin/spark-submit \
--packages io.delta:delta-core_2.12:0.7.0 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
scala> spark.version
res0: String = 3.0.0
scala> sql("CREATE TABLE delta_101 (id LONG) USING delta").show
++
||
++
++
scala> spark.table("delta_101").show
+---+
| id|
+---+
+---+
scala> sql("DESCRIBE EXTENDED delta_101").show(truncate = false)
+----------------------------+---------------------------------------------------------+-------+
|col_name |data_type |comment|
+----------------------------+---------------------------------------------------------+-------+
|id |bigint | |
| | | |
|# Partitioning | | |
|Not partitioned | | |
| | | |
|# Detailed Table Information| | |
|Name |default.delta_101 | |
|Location |file:/Users/jacek/dev/oss/spark/spark-warehouse/delta_101| |
|Provider |delta | |
|Table Properties |[] | |
+----------------------------+---------------------------------------------------------+-------+
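The same settings can also be applied when building the session from PySpark instead of on the spark-submit command line. This is a sketch under that assumption: the names DELTA_CONF and build_delta_session are hypothetical, spark.jars.packages is used as the configuration counterpart of --packages, and the values are copied from the command above.

```python
# Settings taken from the spark-submit command above.
DELTA_CONF = {
    # Configuration counterpart of --packages.
    "spark.jars.packages": "io.delta:delta-core_2.12:0.7.0",
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def build_delta_session(app_name="delta-sql"):
    """Build a SparkSession with the Delta SQL extension and catalog enabled."""
    # Imported here so DELTA_CONF can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in DELTA_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With such a session, build_delta_session().sql("CREATE TABLE delta_101 (id LONG) USING delta") corresponds to the Scala call shown above. The settings need to be in place before the session is first created.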

Related

Create folder wise structure in Delta Format on HDFS

I am consuming Kafka data that has an "eventtime" (datetime) field in each packet. I want to create HDFS directories in a "year/month/day" structure while streaming, based on the date part of the eventtime field.
I am using delta-core_2.11:0.6.1 with Spark 2.4.
Example :
/temp/deltalake/data/project_1/2022/12/1
/temp/deltalake/data/project_1/2022/12/2
.
.
and so on.
The closest thing I found to my requirement was partitionBy(Keys) in the Delta Lake documentation.
That creates the data in this format: /temp/deltalake/data/project_1/year=2022/month=12/day=1
data.show() :
+----+-------+-----+-------+---+-------------------+----------+
|S_No|section| Name| City|Age| eventtime| date|
+----+-------+-----+-------+---+-------------------+----------+
| 1| a|Name1| Indore| 25|2022-02-10 23:30:14|2022-02-10|
| 2| b|Name2| Delhi| 25|2021-08-12 10:50:10|2021-08-12|
| 3| c|Name3| Ranchi| 30|2022-12-10 15:00:00|2022-12-10|
| 4| d|Name4|Kolkata| 30|2022-05-10 00:30:00|2022-05-10|
| 5| e|Name5| Mumbai| 30|2022-07-01 10:32:12|2022-07-01|
+----+-------+-----+-------+---+-------------------+----------+
data
.write
.format("delta")
.mode("overwrite")
.option("mergeSchema", "true")
.partitionBy(Keys)
.save("/temp/deltalake/data/project_1/")
But this did not work either. I referred to the following Medium article:
https://medium.com/#aravinthR/partitioned-delta-lake-part-3-5cc52b64ebda
Would be great if anyone can help me out in figuring out a possible solution.
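For reference, here is a sketch of how the year/month/day partition columns could be derived from eventtime before calling partitionBy. It produces the key=value layout mentioned above (year=2022/month=12/day=1), not bare 2022/12/1 directories. The helper names partition_dir and write_partitioned are hypothetical, and the column name eventtime is taken from the data.show() output.

```python
from datetime import datetime


def partition_dir(base, event_time):
    """Directory that partitionBy("year", "month", "day") uses for one event time."""
    return (
        f"{base.rstrip('/')}/year={event_time.year}"
        f"/month={event_time.month}/day={event_time.day}"
    )


def write_partitioned(df, path):
    """Derive year/month/day from `eventtime` and write a partitioned Delta table."""
    # Imported here so partition_dir can be used without Spark installed.
    from pyspark.sql import functions as F

    (
        df.withColumn("year", F.year("eventtime"))
        .withColumn("month", F.month("eventtime"))
        .withColumn("day", F.dayofmonth("eventtime"))
        .write.format("delta")
        .mode("overwrite")
        .partitionBy("year", "month", "day")  # one directory level per column
        .save(path)
    )
```

The pure helper only illustrates the resulting layout: partition_dir("/temp/deltalake/data/project_1/", datetime(2022, 12, 1)) gives /temp/deltalake/data/project_1/year=2022/month=12/day=1.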

Spark Dataframe issue in overwriting the partition data of Hive table

Below is my Hive table definition:
CREATE EXTERNAL TABLE IF NOT EXISTS default.test2(
id integer,
count integer
)
PARTITIONED BY (
fac STRING,
fiscaldate_str DATE )
STORED AS PARQUET
LOCATION 's3://<bucket name>/backup/test2';
I have the data in hive table as below, (I just inserted sample data)
select * from default.test2
+---+-----+----+--------------+
| id|count| fac|fiscaldate_str|
+---+-----+----+--------------+
| 2| 3| NRM| 2019-01-01|
| 1| 2| NRM| 2019-01-01|
| 2| 3| NRM| 2019-01-02|
| 1| 2| NRM| 2019-01-02|
| 2| 3| NRM| 2019-01-03|
| 1| 2| NRM| 2019-01-03|
| 2| 3|STST| 2019-01-01|
| 1| 2|STST| 2019-01-01|
| 2| 3|STST| 2019-01-02|
| 1| 2|STST| 2019-01-02|
| 2| 3|STST| 2019-01-03|
| 1| 2|STST| 2019-01-03|
+---+-----+----+--------------+
This table is partitioned on two columns (fac, fiscaldate_str), and we are trying to dynamically run an insert overwrite at the partition level using the Spark DataFrame writer.
However, when we try this, we end up either with duplicate data or with all other partitions deleted.
Below are the code snippets for this using a Spark DataFrame.
First I create the DataFrame:
df = spark.createDataFrame([(99,99,'NRM','2019-01-01'),(999,999,'NRM','2019-01-01')], ['id','count','fac','fiscaldate_str'])
df.show(2,False)
+---+-----+---+--------------+
|id |count|fac|fiscaldate_str|
+---+-----+---+--------------+
|99 |99 |NRM|2019-01-01 |
|999|999 |NRM|2019-01-01 |
+---+-----+---+--------------+
I get duplicates with the snippet below:
df.coalesce(1).write.mode("overwrite").insertInto("default.test2")
With either of the following, all other data gets deleted and only the new data remains:
df.coalesce(1).write.mode("overwrite").saveAsTable("default.test2")
OR
df.createOrReplaceTempView("tempview")
tbl_ald_kpiv_hist_insert = spark.sql("""
INSERT OVERWRITE TABLE default.test2
partition(fac,fiscaldate_str)
select * from tempview
""")
I am using AWS EMR with Spark 2.4.0 and Hive 2.3.4-amzn-1 along with S3.
Does anyone have any idea why I am not able to dynamically overwrite the data in the partitions?

Your question is not easy to follow, but I think you mean you want a partition overwritten. If so, all you need is the second line below:
df = spark.createDataFrame([(99,99,'AAA','2019-01-02'),(999,999,'BBB','2019-01-01')], ['id','count','fac','fiscaldate_str'])
df.coalesce(1).write.mode("overwrite").insertInto("test2",overwrite=True)
Note the overwrite=True. The comment made is neither here nor there, as the DF.writer is being used. I am not addressing the coalesce(1).
Comment to Asker
I ran this as I usually do when prototyping and answering here, on a Databricks notebook, and explicitly set the following, and it worked fine:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","static")
spark.conf.set("hive.exec.dynamic.partition.mode", "strict")
You ask to update the answer with:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic").
I can do that, as I have just done; maybe this is needed in your environment, but I certainly did not need to do so.
UPDATE 19/3/20
This worked on prior Spark releases; now the following applies, as far as I can see:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
// In Databricks did not matter the below settings
//spark.conf.set("hive.exec.dynamic.partition", "true")
//spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
Seq(("CompanyA1", "A"), ("CompanyA2", "A"),
("CompanyB1", "B"))
.toDF("company", "id")
.write
.mode(SaveMode.Overwrite)
.partitionBy("id")
.saveAsTable("KQCAMS9")
spark.sql(s"SELECT * FROM KQCAMS9").show(false)
val df = Seq(("CompanyA3", "A"))
.toDF("company", "id")
// disregard coalesce
df.coalesce(1).write.mode("overwrite").insertInto("KQCAMS9")
spark.sql(s"SELECT * FROM KQCAMS9").show(false)
spark.sql(s"show partitions KQCAMS9").show(false)
This works from 2.4.x onwards.
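A PySpark version of the update above might look like the sketch below. The function name overwrite_matching_partitions is hypothetical. Because insertInto matches columns by position rather than by name, the sketch first reorders the DataFrame to the table's column order, which puts the partition columns last.

```python
def overwrite_matching_partitions(spark, df, table):
    """Overwrite only the partitions of `table` that are present in `df`."""
    # In dynamic mode, partitions that receive no new rows are left untouched.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    # insertInto is positional, so align the columns with the target table.
    target_columns = spark.table(table).columns
    aligned = df.select(*target_columns)
    aligned.write.insertInto(table, overwrite=True)
```

For the table in the question, the call would be overwrite_matching_partitions(spark, df, "default.test2").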

Spark Structured Streaming watermark error

Followup to this question
I have json streaming data in the format same as below
| A | B |
|-------|------------------------------------------|
| ABC | [{C:1, D:1}, {C:2, D:4}] |
| XYZ | [{C:3, D :6}, {C:9, D:11}, {C:5, D:12}] |
I need to transform it to the format below
| A | C | D |
|-------|-----|------|
| ABC | 1 | 1 |
| ABC | 2 | 4 |
| XYZ | 3 | 6 |
| XYZ | 9 | 11 |
| XYZ | 5 | 12 |
To achieve this, I performed the transformations suggested in the answer to the previous question.
val df1 = df0.select($"A", explode($"B")).toDF("A", "Bn")
val df2 = df1.withColumn("SeqNum", monotonically_increasing_id()).toDF("A", "Bn", "SeqNum")
val df3 = df2.select($"A", explode($"Bn"), $"SeqNum").toDF("A", "B", "C", "SeqNum")
val df4 = df3.withColumn("dummy", concat( $"SeqNum", lit("||"), $"A"))
val df5 = df4.select($"dummy", $"B", $"C").groupBy("dummy").pivot("B").agg(first($"C"))
val df6 = df5.withColumn("A", substring_index(col("dummy"), "||", -1)).drop("dummy")
Now I am trying to save the result to a file in HDFS:
df6.withWatermark("event_time", "0 seconds")
.writeStream
.trigger(Trigger.ProcessingTime("0 seconds"))
.queryName("query_db")
.format("parquet")
.option("checkpointLocation", "/path/to/checkpoint")
.option("path", "/path/to/output")
// .outputMode("complete")
.start()
Now I get the below error.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
EventTimeWatermark event_time#223: timestamp, interval
My doubt is that I am not performing any aggregation that requires storing the aggregated value beyond the processing time for that row. Why do I get this error? Can I keep the watermark at 0 seconds?
Any help on this will be deeply appreciated.

As I understand it, watermarking is required only when you perform a window operation on event time. Spark uses watermarking to handle late data, and for that purpose it needs to keep older aggregation state.
The following link explains this very well with example:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
I don't see any window operations in your transformation; if that is the case, I think you can try running the streaming query without watermarking.

When grouping in Spark Structured Streaming, the DataFrame must already have the watermark, and you have to take it into account while grouping by including a window over the watermarked column in your aggregation.
df.groupBy(col("dummy"), window(col("event_time"), "1 day")).
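The groupBy/pivot step in the question is itself a streaming aggregation, which is what the error message refers to. As a sketch of an alternative that avoids aggregation altogether, the target layout can be produced with explode plus field access. This assumes B arrives as an array of structs or maps keyed by C and D; the question does not show the schema, so that is an assumption, and the function name flatten_b is hypothetical.

```python
def flatten_b(df):
    """Turn one row per A with an array B into one row per (A, C, D)."""
    # explode emits one output row per array element; it is not an aggregation,
    # so no state has to be kept between micro-batches.
    exploded = df.selectExpr("A", "explode(B) AS b")
    # Bracket access works for both struct fields and map keys.
    return exploded.selectExpr("A", "b['C'] AS C", "b['D'] AS D")
```

The result of flatten_b(df0) could then be passed to writeStream in append mode, without the withWatermark call.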

Insert into TempView using Spark.sql

How can I make a simple insert in Spark SQL?
Spark 2.1
I am able to run simple SQL code inside Spark with spark.sql, but I cannot get a plain insert to work.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Basics').getOrCreate()
df = spark.read.json('/path/people.json')
df.show()
+-----+---------+
|age | name |
+-----+---------+
|null | Michael |
| 30 | Andy |
+-----+---------+
df.createOrReplaceTempView('people') # create temp view
spark.sql("SELECT * FROM people where age == 30").show()
+-----+---------+
|age | name |
+-----+---------+
| 30 | Andy |
+-----+---------+
So I understand SQL, but I don't know how to do an insert.
I tried every way I could think of.

You don't insert into DataFrames; they are immutable and lazy.
You need to create a new dataframe which is the union between the original dataframe and the new data you want to add to it.
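A minimal sketch of that union approach follows. The function name append_rows and the sample row are hypothetical. New rows are built with the original DataFrame's schema so that the positional union lines up, and the temp view is re-registered to point at the combined result.

```python
def append_rows(spark, df, rows, view_name):
    """Return `df` plus `rows` and re-register the result as a temp view."""
    # Reuse the existing schema so column order and types match for union.
    new_rows = spark.createDataFrame(rows, schema=df.schema)
    # union matches columns by position and returns a new DataFrame;
    # the original df is left unchanged.
    combined = df.union(new_rows)
    # Replace the view so later SQL queries see the added rows.
    combined.createOrReplaceTempView(view_name)
    return combined
```

For the people example, a call could look like append_rows(spark, df, [(25, "Justin")], "people"), after which spark.sql("SELECT * FROM people") includes the new row.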

Spark sql count changes on changing case

I have a table that has data distribution like :
sqlContext.sql( """ SELECT
count(to_Date(PERIOD_DT)), to_date(PERIOD_DT)
from dbname.tablename group by to_date(PERIOD_DT) """).show
+-------+----------+
| _c0| _c1|
+-------+----------+
|1067177|2016-09-30|
|1042566|2017-07-07|
|1034333|2017-07-31|
+-------+----------+
However, when I run a query like the following :
sqlContext.sql(""" SELECT COUNT(*)
from dbname.tablename
where PERIOD_DT = '2017-07-07' """).show
Surprisingly, it returns :
+-------+
| _c0|
+-------+
|3144076|
+-------+
But if I changed PERIOD_DT to lowercase, i.e., period_dt , it returns the correct result
sqlContext.sql("""
SELECT COUNT(*)
from dbname.tablename
where period_dt='2017-07-07' """).show
+-------+
| _c0|
+-------+
|1042566|
+-------+
period_dt is the column on which the table is partitioned, and its type is char(10).
The table data is stored as Parquet :
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
What might be causing this inconsistency?

This is a case-sensitivity issue. Because of limitations of the Hive metastore schema, the table schema is always stored in lowercase. Parquet should resolve the issue.
