Hive - Create table LIKE doesn't work in spark-sql - apache-spark

In my PySpark job I'm trying to create a temp table using the LIKE clause, as below.
CREATE EXTERNAL TABLE IF NOT EXISTS stg.new_table_name LIKE stg.exiting_table_name LOCATION s3://s3-bucket/warehouse/stg/existing_table_name
My job fails as below -
mismatched input 'LIKE' expecting (line 1, pos 56)
== SQL ==
CREATE EXTERNAL TABLE IF NOT EXISTS stg.new_table_name LIKE stg.exiting_table_name LOCATION s3://s3-bucket/warehouse/stg/existing_table_name
Doesn't Spark support the LIKE clause for creating a new table from the metadata of an existing table?
My SparkSession config:
self.session = SparkSession \
.builder \
.appName(self.app_name) \
.config("spark.dynamicAllocation.enabled", "false") \
.config("hive.exec.dynamic.partition.mode", "nonstrict") \
.config("mapreduce.fileoutputcommitter.algorithm.version", "2") \
.config("hive.load.dynamic.partitions.thread", "10") \
.config("hive.mv.files.thread", "30") \
.config("fs.trash.interval", "0") \
.enableHiveSupport() \
.getOrCreate()
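One thing worth checking, sketched below: as far as I can tell from the documented Spark SQL grammar, CREATE TABLE ... LIKE does not take the EXTERNAL keyword, and LOCATION expects a quoted string literal. Spark generally treats a table created with an explicit LOCATION as external anyway. The helper name build_create_like is hypothetical; the table names and S3 path are the ones from the question. Treat this as an assumption to test against your Spark version, not a confirmed fix.

```python
def build_create_like(new_table: str, source_table: str, location: str) -> str:
    """Build a CREATE TABLE ... LIKE statement (hypothetical helper).

    Assumption: the CREATE TABLE LIKE grammar has no EXTERNAL keyword
    and needs LOCATION as a quoted string literal.
    """
    return (
        f"CREATE TABLE IF NOT EXISTS {new_table} "
        f"LIKE {source_table} "
        f"LOCATION '{location}'"
    )


statement = build_create_like(
    "stg.new_table_name",
    "stg.exiting_table_name",
    "s3://s3-bucket/warehouse/stg/existing_table_name",
)
print(statement)
# In the job, this string would be passed to self.session.sql(statement).
```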

Related

Can I exclude the column used for partitioning when writing to Parquet?

I need to create Parquet files from data read over JDBC. The table is quite big and all columns are varchars, so I created a new column holding a random int to partition on.
My JDBC read looks something like this:
data_df = sparkSession.read.format('jdbc') \
.option('url', 'jdbc:netezza://host:port/db') \
.option('dbtable', """(SELECT * FROM schema.table) A""") \
.option('user', 'user') \
.option('password', 'password') \
.option('partitionColumn','random_number') \
.option('lowerBound','1') \
.option('upperBound','200') \
.option('numPartitions','200') \
.load()
and my write to parquet looks something like this:
data_df.write.mode("overwrite").parquet("parquetfile.parquet").partitionBy('random_number')
The generated Parquet output also contains the 'random_number' column, but I only created that column for partitioning. Is there a way to exclude it when writing the Parquet files?
Thanks for any help, I'm new to Spark :)
I'd like to exclude the random_number column, but I don't know whether that is possible when I need the column for partitioning.
If you want to repartition in memory by a column without writing it out, call .repartition(col("random_number")) and then drop the column before writing your data:
from pyspark.sql.functions import col
data_df = sparkSession.read.format('jdbc') \
.option('url', 'jdbc:netezza://host:port/db') \
.option('dbtable', """(SELECT * FROM schema.table) A""") \
.option('user', 'user') \
.option('password', 'password') \
.option('partitionColumn','random_number') \
.option('lowerBound','1') \
.option('upperBound','200') \
.option('numPartitions','200') \
.load() \
.repartition(col("random_number")) \
.drop("random_number")
then:
data_df.write.mode("overwrite").parquet("parquetfile.parquet")
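For comparison, here is a plain-Python model (not Spark code) of what DataFrameWriter.partitionBy does with a partition column: the value moves into the directory name (random_number=N) and is left out of the rows stored in the data files. Note that partitionBy has to be called on the writer before .parquet(...), and that reading the base path back restores the column from the directory names. The function name and sample rows are made up for illustration.

```python
from collections import defaultdict


def hive_style_layout(rows, partition_col):
    """Group rows the way a Hive-style partitioned write lays them out.

    Toy model only: returns {directory_name: [rows without partition_col]}.
    """
    layout = defaultdict(list)
    for row in rows:
        directory = f"{partition_col}={row[partition_col]}"
        # The partition value lives in the directory name, not in the file.
        stored = {k: v for k, v in row.items() if k != partition_col}
        layout[directory].append(stored)
    return dict(layout)


rows = [
    {"name": "a", "city": "x", "random_number": 1},
    {"name": "b", "city": "y", "random_number": 2},
    {"name": "c", "city": "z", "random_number": 1},
]
layout = hive_style_layout(rows, "random_number")
for directory, stored_rows in sorted(layout.items()):
    print(directory, stored_rows)
```

If the column should not exist anywhere, not even in directory names, dropping it before the write (as in the answer above) is the simpler route.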

upsert (merge) delta with spark structured streaming

I need to upsert data in real time with Spark Structured Streaming in Python.
The data is read in real time as CSV and then written to a Delta table. Because existing rows must be updated, I use Delta's MERGE INTO.
I am using Delta Engine on Databricks.
I coded this:
from delta.tables import *
spark = SparkSession.builder \
.config("spark.sql.streaming.schemaInference", "true")\
.appName("SparkTest") \
.getOrCreate()
sourcedf= spark.readStream.format("csv") \
.option("header", True) \
.load("/mnt/user/raw/test_input") #csv data that we read in real time
spark.conf.set("spark.sql.shuffle.partitions", "1")
spark.createDataFrame([], sourcedf.schema) \
.write.format("delta") \
.mode("overwrite") \
.saveAsTable("deltaTable")
def upsertToDelta(microBatchOutputDF, batchId):
    microBatchOutputDF.createOrReplaceTempView("updates")
    microBatchOutputDF._jdf.sparkSession().sql("""
        MERGE INTO deltaTable t
        USING updates s
        ON s.Id = t.Id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)
sourcedf.writeStream \
.format("delta") \
.foreachBatch(upsertToDelta) \
.outputMode("update") \
.option("checkpointLocation", "/mnt/user/raw/checkpoints/output")\
.option("path", "/mnt/user/raw/PARQUET/output") \
.start() \
.awaitTermination()
But nothing gets written to the output path as expected. The checkpoint path is filled in as expected, and displaying the Delta table shows results too:
display(table("deltaTable"))
In the Spark UI I see the writeStream step:
sourcedf.writeStream \ .format("delta") \ ....
first at Snapshot.scala:156+details
RDD: Delta Table State #1 - dbfs:/user/hive/warehouse/deltatable/_delta_log
Any idea how to fix this so I can upsert CSV data into Delta tables in S3 in real time with Spark?
Best regards
Apologies for the late reply, but just in case anyone else has the same problem: I wonder whether it is because you didn't use "cloudFiles" on your readStream to make use of Auto Loader. The following worked for me:
%python
sourcedf= spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "csv") \
.option("cloudFiles.includeExistingFiles","true") \
.schema(csvSchema) \
.load("/mnt/user/raw/test_input")
%sql
CREATE TABLE IF NOT EXISTS deltaTable(
col1 int NOT NULL,
col2 string NOT NULL,
col3 bigint,
col4 int
)
USING DELTA
LOCATION '/mnt/user/raw/PARQUET/output'
%python
def upsertToDelta(microBatchOutputDF, batchId):
    microBatchOutputDF.createOrReplaceTempView("updates")
    microBatchOutputDF._jdf.sparkSession().sql("""
        MERGE INTO deltaTable t
        USING updates s
        ON s.Id = t.Id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)
%python
sourcedf.writeStream \
.format("delta") \
.foreachBatch(upsertToDelta) \
.outputMode("update") \
.option("checkpointLocation", "/mnt/user/raw/checkpoints/output") \
.start("/mnt/user/raw/PARQUET/output")
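A plain-Python sketch of what the MERGE in upsertToDelta does per micro-batch may help when reasoning about the expected table contents. It is a model under assumptions, not Delta code: rows are dicts keyed by the Id column from the MERGE condition, and both helper names are hypothetical. It deduplicates the batch first, because Delta's MERGE can fail when several source rows match the same target row. Separately, as far as I know the sink used with foreachBatch is the function itself, so format("delta") and the path on writeStream do not decide where data lands; the location of deltaTable does, which would explain why the second version declares the table with LOCATION.

```python
def dedupe_latest(batch, key="Id"):
    """Keep only the last row seen for each key within one micro-batch."""
    latest = {}
    for row in batch:
        latest[row[key]] = row  # later rows replace earlier ones
    return list(latest.values())


def merge_upsert(target, batch, key="Id"):
    """Model of MERGE ... WHEN MATCHED UPDATE * / WHEN NOT MATCHED INSERT *.

    `target` maps key -> row and is updated in place.
    """
    for row in dedupe_latest(batch, key):
        target[row[key]] = dict(row)  # update if present, insert otherwise
    return target


table = {1: {"Id": 1, "value": "old"}}
batch = [
    {"Id": 1, "value": "new"},
    {"Id": 2, "value": "first"},
    {"Id": 2, "value": "second"},
]
merge_upsert(table, batch)
print(table)
```

In Spark the rows of a batch have no inherent order, so a real deduplication step would need an explicit ordering column (a timestamp, for instance) to decide which duplicate wins.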

Spark dynamic partitioning: SchemaColumnConvertNotSupportedException on read

Question
Is there any way to store data with different (not compatible) schemas in different partitions?
The issue
I use PySpark v2.4.5, the Parquet format and dynamic partitioning with the following hierarchy: BASE_PATH/COUNTRY=US/TYPE=sms/YEAR=2020/MONTH=04/DAY=10/. Unfortunately it can't be changed.
I get SchemaColumnConvertNotSupportedException on read. That happens because the schema differs between types (i.e. between sms and mms). It looks like Spark tries to merge the schemas on read under the hood.
To be more precise, I can read data for F.col('TYPE') == 'sms', because the mms schema can be converted to sms. But when I filter by F.col('TYPE') == 'mms', Spark fails.
Code
# Works, because Spark doesn't try to merge schemas
spark_session \
.read \
.option('mergeSchema', False) \
.parquet(BASE_PATH + '/COUNTRY_CODE=US/TYPE=mms/YEAR=2020/MONTH=04/DAY=07/HOUR=00') \
.show()
# Doesn't work, because Spark tries to merge the schemas for TYPE=sms and TYPE=mms. The mms data can't be converted to the merged schema.
# Types are correct; according to explain, Spark treats the date partitions as integers
# Predicate pushdown isn't used for some reason; there is no PushedFilters entry in the explained plan
spark_session \
.read \
.option('mergeSchema', False) \
.parquet(BASE_PATH) \
.filter(F.col('COUNTRY') == 'US') \
.filter(F.col('TYPE') == 'mms') \
.filter(F.col('YEAR') == 2020) \
.filter(F.col('MONTH') == 4) \
.filter(F.col('DAY') == 10) \
.show()
Posting this in case it is useful for someone. It is possible to have different data in different partitions. To stop Spark from inferring the schema for Parquet, specify the schema explicitly:
spark_session \
.read \
.schema(some_schema) \
.option('mergeSchema', False) \
.parquet(BASE_PATH) \
.filter(F.col('COUNTRY') == 'US') \
.filter(F.col('TYPE') == 'mms') \
.filter(F.col('YEAR') == 2020) \
.filter(F.col('MONTH') == 4) \
.filter(F.col('DAY') == 10) \
.show()
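Another option follows from the first snippet in the question, where reading a single partition directory works: build the partition path explicitly so that Spark never lists the incompatible TYPE=sms files. A minimal sketch; the helper name is hypothetical and the base path is a placeholder, while the directory layout and zero padding are taken from the hierarchy quoted in the question.

```python
def partition_path(base_path, country, type_, year, month, day):
    """Build BASE_PATH/COUNTRY=../TYPE=../YEAR=../MONTH=../DAY=.. (hypothetical helper)."""
    return (
        f"{base_path.rstrip('/')}"
        f"/COUNTRY={country}"
        f"/TYPE={type_}"
        f"/YEAR={year:04d}"
        f"/MONTH={month:02d}"
        f"/DAY={day:02d}"
    )


# "s3://bucket/base" is a placeholder for BASE_PATH.
path = partition_path("s3://bucket/base", "US", "mms", 2020, 4, 10)
print(path)
# With a SparkSession this could be read as, for example:
# spark_session.read.option('basePath', 's3://bucket/base').parquet(path)
```

Setting the basePath option keeps the partition columns (COUNTRY, TYPE and so on) in the resulting DataFrame even though only one leaf directory is read.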

Upsert data in postgresql using spark structured streaming

I am trying to run a structured streaming application using (py)spark. My data is read from a Kafka topic and then I am running windowed aggregation on event time.
# I have been able to create data frame pn_data_df after reading data from Kafka
Schema of pn_data_df
|
- id StringType
- source StringType
- source_id StringType
- delivered_time TimeStamp
windowed_report_df = pn_data_df.filter(pn_data_df.source == 'campaign') \
.withWatermark("delivered_time", "24 hours") \
.groupBy('source_id', window('delivered_time', '15 minute')) \
.count()
windowed_report_df = windowed_report_df \
.withColumn('start_ts', unix_timestamp(windowed_report_df.window.start)) \
.withColumn('end_ts', unix_timestamp(windowed_report_df.window.end)) \
.selectExpr('CAST(source_id as LONG)', 'start_ts', 'end_ts', 'count')
I am writing this windowed aggregation to my postgresql database which I have already created.
CREATE TABLE pn_delivery_report(
source_id bigint not null,
start_ts bigint not null,
end_ts bigint not null,
count integer not null,
unique(source_id, start_ts)
);
Writing to PostgreSQL with Spark JDBC only lets me Append or Overwrite. Append mode fails if the composite key already exists in the database, and Overwrite just replaces the entire table with the current batch output.
def write_pn_report_to_postgres(df, epoch_id):
    df.write \
        .mode('append') \
        .format('jdbc') \
        .option("url", "jdbc:postgresql://db_endpoint/db") \
        .option("driver", "org.postgresql.Driver") \
        .option("dbtable", "pn_delivery_report") \
        .option("user", "postgres") \
        .option("password", "PASSWORD") \
        .save()
windowed_report_df.writeStream \
.foreachBatch(write_pn_report_to_postgres) \
.option("checkpointLocation", '/home/hadoop/campaign_report_df_windowed_checkpoint') \
.outputMode('update') \
.start()
How can I execute a query like
INSERT INTO pn_delivery_report (source_id, start_ts, end_ts, COUNT)
VALUES (1001, 125000000001, 125000050000, 128),
(1002, 125000000001, 125000050000, 127) ON conflict (source_id, start_ts) DO
UPDATE
SET COUNT = excluded.count;
inside foreachBatch?
Spark has an open JIRA feature ticket for this, but it does not seem to have been prioritised so far:
https://issues.apache.org/jira/browse/SPARK-19335
This worked for me:
# foreachBatch calls this function with (micro-batch DataFrame, epoch id)
def _write_streaming(df, epoch_id) -> None:
    df.write \
        .mode('append') \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://localhost:5432/postgres") \
        .option("driver", "org.postgresql.Driver") \
        .option("dbtable", 'table_test') \
        .option("user", 'user') \
        .option("password", 'password') \
        .save()
df_stream.writeStream \
.foreachBatch(_write_streaming) \
.start() \
.awaitTermination()
You need to add ".awaitTermination()" at the end.
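The snippet above appends rows; it does not perform the INSERT ... ON CONFLICT upsert asked about. One common workaround is to run that statement yourself inside foreachBatch through a PostgreSQL driver. The sketch below only builds the statement. The helper name is hypothetical, and the %s placeholder style assumes a DB-API driver such as psycopg2.

```python
def build_upsert_sql(table, columns, conflict_columns):
    """Build a PostgreSQL INSERT ... ON CONFLICT DO UPDATE statement.

    Hypothetical helper: uses %s placeholders (DB-API "format" paramstyle)
    and updates every column that is not part of the conflict key.
    """
    update_columns = [c for c in columns if c not in conflict_columns]
    column_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    conflict_list = ", ".join(conflict_columns)
    assignments = ", ".join(f"{c} = EXCLUDED.{c}" for c in update_columns)
    return (
        f"INSERT INTO {table} ({column_list}) "
        f"VALUES ({placeholders}) "
        f"ON CONFLICT ({conflict_list}) DO UPDATE SET {assignments}"
    )


sql = build_upsert_sql(
    "pn_delivery_report",
    ["source_id", "start_ts", "end_ts", "count"],
    ["source_id", "start_ts"],
)
print(sql)
```

Inside foreachBatch, the rows of a small aggregated micro-batch (for instance from df.collect()) could then be passed to cursor.executemany(sql, rows) on a connection opened in that function. Unlike the query in the question, this version also refreshes end_ts on conflict; restrict the column list if only count should change.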

Spark 2.3.1 AWS EMR not returning data for some columns yet works in Athena/Presto and Spectrum

I am using PySpark on Spark 2.3.1 on AWS EMR (Python 2.7.14)
spark = SparkSession \
.builder \
.appName("Python Spark SQL data source example") \
.config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
.config("hive.exec.dynamic.partition", "true") \
.config("hive.exec.dynamic.partition.mode", "nonstrict") \
.config("spark.debug.maxToStringFields", 100) \
.enableHiveSupport() \
.getOrCreate()
spark.sql('select `message.country` from datalake.leads_notification where `message.country` is not null').show(10)
This returns no data: 0 rows found.
Every value in that column is returned as null.
The data is stored as Parquet.
When I run the same SQL query on AWS Athena/Presto or on AWS Redshift Spectrum, all column data is returned correctly (most column values are not null).
This is the Athena SQL and Redshift SQL query that returns correct data:
select "message.country" from datalake.leads_notification where "message.country" is not null limit 10;
I use AWS Glue catalog in all cases.
The column above is NOT a partition column, but the table is partitioned on other columns. I tried to repair the table, but it did not help:
MSCK REPAIR TABLE datalake.leads_notification
I tried enabling schema merging like so:
spark = SparkSession \
.builder \
.appName("Python Spark SQL data source example") \
.config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
.config("hive.exec.dynamic.partition", "true") \
.config("spark.sql.parquet.mergeSchema", "true") \
.config("hive.exec.dynamic.partition.mode", "nonstrict") \
.config("spark.debug.maxToStringFields", 200) \
.enableHiveSupport() \
.getOrCreate()
No difference: every value of that column still comes back as null, even though some are not null.
This column was added as the last column of the table, so most of its data is indeed null, but some rows are not. It is listed last in the column list in the catalog, just above the partition columns.
Nevertheless, Athena/Presto retrieves all non-null values fine, and so does Redshift Spectrum, but PySpark on EMR Spark 2.3.1 shows every value of this column as "null". All other columns are retrieved correctly in Spark.
Can anyone help me debug this problem, please?
The Hive schema is hard to cut and paste here because of the output format.
CREATE TABLE datalake.leads_notification(
message.environment.siteorigin string,
dcpheader.dcploaddateutc string,
message.id int,
message.country string,
message.financepackage.id string,
message.financepackage.version string)
PARTITIONED BY (
partition_year_utc string,
partition_month_utc string,
partition_day_utc string,
job_run_guid string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://blahblah/leads_notification/leads_notification/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='weekly_datalake_crawler',
'averageRecordSize'='3136',
'classification'='parquet',
'compressionType'='none',
'objectCount'='2',
'recordCount'='897025',
'sizeKey'='1573529662',
'spark.sql.create.version'='2.2 or prior',
'spark.sql.sources.schema.numPartCols'='4',
'spark.sql.sources.schema.numParts'='3',
'spark.sql.sources.schema.partCol.0'='partition_year_utc',
'spark.sql.sources.schema.partCol.1'='partition_month_utc',
'spark.sql.sources.schema.partCol.2'='partition_day_utc',
'spark.sql.sources.schema.partCol.3'='job_run_guid',
'typeOfData'='file')
Last 3 columns all have the same problems in Spark:
message.country string,
message.financepackage.id string,
message.financepackage.version string
All return OK in Athena/Presto and Redshift Spectrum using same catalog.
I apologize for my editing.
thank you
Do the schema inspection from step 5 of this guide:
http://www.openkb.info/2015/02/how-to-build-and-use-parquet-tools-to.html
My bet is that the new column names in the Parquet definition differ in case from the others: either they are upper case while the other column names are lower case, or they are lower case while the others are upper case.
See also "Spark issues reading parquet files" and this article:
https://medium.com/@an_chee/why-using-mixed-case-field-names-in-hive-spark-sql-is-a-bad-idea-95da8b6ec1e0
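Once the Parquet footer schema has been dumped, the comparison suggested above can be done mechanically. A small sketch, assuming you have the two lists of column names at hand; the function name and the sample names with upper-case variants are invented for illustration and are not taken from the actual files.

```python
def case_mismatches(metastore_columns, parquet_columns):
    """Return (metastore_name, parquet_name) pairs that differ only by case."""
    by_lower = {name.lower(): name for name in parquet_columns}
    mismatches = []
    for name in metastore_columns:
        parquet_name = by_lower.get(name.lower())
        if parquet_name is not None and parquet_name != name:
            mismatches.append((name, parquet_name))
    return mismatches


# Invented sample data: the metastore side is lower case, the file side is not.
metastore = ["message.id", "message.country", "message.financepackage.id"]
parquet = ["message.id", "message.Country", "message.financePackage.id"]
print(case_mismatches(metastore, parquet))
```

An empty result would suggest that case is not the culprit and that something else, such as the dots in the column names, deserves a closer look.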
spark = SparkSession \
.builder \
.appName("Python Spark SQL data source example") \
.config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
.config("hive.exec.dynamic.partition", "true") \
.config("spark.sql.parquet.mergeSchema", "true") \
.config("spark.sql.hive.convertMetastoreParquet", "false") \
.config("hive.exec.dynamic.partition.mode", "nonstrict") \
.config("spark.debug.maxToStringFields", 200) \
.enableHiveSupport() \
.getOrCreate()
This is the solution; note the line
.config("spark.sql.hive.convertMetastoreParquet", "false")
The schema columns are all lower case, and the schema was created by AWS Glue rather than by my own code, so I don't really know what caused the problem. Using the setting above is probably a safe default when schema creation is not directly under your control. This is a major trap, IMHO, so I hope this helps someone else in the future.
Thanks to tooptoop4, who pointed out the article:
https://medium.com/@an_chee/why-using-mixed-case-field-names-in-hive-spark-sql-is-a-bad-idea-95da8b6ec1e0

Resources