why is my glue table creating with the wrong path? - apache-spark

I'm creating a table in AWS Glue using a spark job orchestrated by Airflow, it reads from a json and writes a table, the command I use within the job is the following:
spark.sql(s"CREATE TABLE IF NOT EXISTS $database.$table using PARQUET LOCATION '$path'")
The odd thing here is that I have other tables created using the same job (with different names) but they are created without problems, e.g. they have the location
s3://bucket_name/databases/my_db/my_perfectly_created_table
there is exactly one table that creates itself with this location:
s3://bucket_name/databases/my_db/my_problematic_table-__PLACEHOLDER__
I don't know where that -__PLACEHOLDER__ is coming from. I already tried deleting the table and recreating it but it always does the same thing on this exact table. The data is in parquet format in the path:
s3://bucket_name/databases/my_db/my_problematic_table
so I know the problem is just creating the table correctly because all I get is a col (array<string>) when trying to query it in Athena (as there is no data in /my_problematic_table-__PLACEHOLDER__).
Have any of you guys dealt with this before?

Upon closer inspection in AWS glue, this specific problematic_table had the following config, specific for CSV files and custom-delimiters:
Input Format org.apache.hadoop.mapred.SequenceFileInputFormat
Output Format org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Serde serialization library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
while my other tables had the config specific for parquet:
Input Format org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
Output Format org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Serde serialization library org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
I tried to create the table forcing the config for parquet with the following command:
val path = "s3://bucket_name/databases/my_db/my_problematic_table/"
val my_table = spark.read.format("parquet").load(path)
val ddlSchema = my_table.toDF.schema.toDDL
spark.sql(s"""
|CREATE TABLE IF NOT EXISTS my_db.manual_myproblematic_table($ddlSchema)
|ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
|STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
|OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
|LOCATION '$path'
|""".stripMargin
)
but it threw the following error:
org.apache.spark.SparkException: Cannot recognize hive type string: struct<1:string,2:string,3:string>, column: problematic_column
so the problem was the naming of those columns, "1", "2" & "3" within that struct.
Given that this struct did not contain valuable info I ended up dropping it and creating the table again. now it works like a charm and it has the correct (parquet) config in glue.
Hope this helps anyone

Related

Databricks Delta Live Tables - Apply Changes from delta table

I am working with Databricks Delta Live Tables, but have some problems with upserting some tables upstream. I know it is quite a long text below, but I tried to describe my problem as clear as possible. Let me know if some parts are not clear.
I have the following tables and flow:
Landing_zone -> This is a folder in which JSON files are added that contain data of inserted or updated records.
Raw_table -> This is the data in the JSON files but in table format. This table is in delta format. No transformations are done, except from transforming the JSON structure into a tabular structure (I did an explode and then creating columns from the JSON keys).
Intermediate_table -> This is the raw_table, but with some extra columns (depending on other column values).
To go from my landing zone to the raw table I have the following Pyspark code:
cloudfile = {"cloudFiles.format":"JSON",
"cloudFiles.schemaLocation": sourceschemalocation,
"cloudFiles.inferColumnTypes": True}
#dlt.view('landing_view')
def inc_view():
df = (spark
.readStream
.format('cloudFiles')
.options(**cloudFilesOptions)
.load(filpath_to_landing)
<Some transformations to go from JSON to tabular (explode, ...)>
return df
dlt.create_target_table('raw_table',
table_properties = {'delta.enableChangeDataFeed': 'true'})
dlt.apply_changes(target='raw_table',
source='landing_view',
keys=['id'],
sequence_by='updated_at')
This code works as expected. I run it, add a changes.JSON file to the landing zone, rerun the pipeline and the upserts are correctly applied to the 'raw_table'
(However, each time a new parquet file with all the data is created in the delta folder, I would expect that only a parquet file with the inserted and updated rows was added? And that some information about the current version was kept in the delta logs? Not sure if this is relevant for my problem. I already changed the table_properties of the 'raw_table' to enableChangeDataFeed = true. The readStream for 'intermediate_table' then has option(readChangeFeed, 'true')).
Then I have the following code to go from my 'raw_table' to my 'intermediate_table':
#dlt.table(name='V_raw_table', table_properties={delta.enableChangeDataFeed': 'True'})
def raw_table():
df = (spark.readStream
.format('delta')
.option('readChangeFeed', 'true')
.table('LIVE.raw_table'))
df = df.withColumn('ExtraCol', <Transformation>)
return df
ezeg
dlt.create_target_table('intermediate_table')
dlt.apply_changes(target='intermediate_table',
source='V_raw_table',
keys=['id'],
sequence_by='updated_at')
Unfortunately, when I run this, I get the error:
'Detected a data update (for example part-00000-7127bd29-6820-406c-a5a1-e76fc7126150-c000.snappy.parquet) in the source table at version 2. This is currently not supported. If you'd like to ignore updates, set the option 'ignoreChanges' to 'true'. If you would like the data update to be reflected, please restart this query with a fresh checkpoint directory.'
I checked in the 'ignoreChanges', but don't think this is what I want. I would expect that the autoloader would be able to detect the changes in the delta table and pass them through the flow.
I am aware that readStream only works with append, but that is why I would expect that after the 'raw_table' is updated, a new parquet file would be added to the delta folder with only the inserts and updates. This added parquet file is then detected by autoloader and could be used to apply the changes to the 'intermediate_table'.
Am I doing this the wrong way? Or am I overlooking something? Thanks in advance!
As readStream only works with appends, any change in the the source file will create issues downstream. The assumption that an update on "raw_table" will only insert a new parquet file is incorrect. Based on the settings like "optimized writes" or even without it, apply_changes can add or remove files. You can find this information in your "raw_table/_delta_log/xxx.json" under "numTargetFilesAdded" and "numTargetFilesRemoved".
Basically, "Databricks recommends you use Auto Loader to ingest only immutable files".
When you changed the settings to include the option '.option('readChangeFeed', 'true')', you should start with a full refresh(there is dropdown near start). Doing this will resolve the error 'Detected data update xxx', and your code should work for the incremental update.

Load spark bucketed table from disk previously written via saveAsTable

Version: DBR 8.4 | Spark 3.1.2
Spark allows me to create a bucketed hive table and save it to a location of my choosing.
df_data_bucketed = (df_data.write.mode('overwrite').bucketBy(9600, 'id').sortBy('id')
.saveAsTable('data_bucketed', format='parquet', path=bucketed_path)
)
I have verified that this saves the table data to my specified path (in my case, blob storage).
In the future, the table 'data_bucketed' might wiped from my spark catalog, or mapped to something else, and I'll want to "recreate it" using the data that's been previously written to blob, but I can find no way to load a pre-existing, already bucketed spark table.
The only thing that appears to work is
df_data_bucketed = (spark.read.format("parquet").load(bucketed_path)
.write.mode('overwrite').bucketBy(9600, 'id').sortBy('id')
.saveAsTable('data_bucketed', format='parquet', path=bucketed_path)
)
Which seems non-sensical, because it's essentially loading the data from disk and unnecessarily overwriting it with the exact same data just to take advantage of the buckets. (It's also very slow due to the size of this data)
You can use spark SQL to create that table in your catalog
spark.sql("""CREATE TABLE IF NOT EXISTS tbl...""") following this you can tell spark to rediscover data by running spark.sql("MSCK REPAIR TABLE tbl")
I found the answer at https://www.programmerall.com/article/3196638561/
Read from the saved Parquet file If you want to use historically saved data, you can't use the above method, nor can you use
spark.read.parquet() like reading regular files. The data read in this
way does not carry bucket information. The correct way is to use the
CREATE TABLE statement. For details, refer
to https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-table.html
CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name1 col_type1 [COMMENT col_comment1], ...)]
USING data_source
[OPTIONS (key1=val1, key2=val2, ...)]
[PARTITIONED BY (col_name1, col_name2, ...)]
[CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
[LOCATION path]
[COMMENT table_comment]
[TBLPROPERTIES (key1=val1, key2=val2, ...)]
[AS select_statement]
Examples are as follows:
spark.sql(
"""
|CREATE TABLE bucketed
| (name string)
| USING PARQUET
| CLUSTERED BY (name) INTO 10 BUCKETS
| LOCATION '/path/to'
|""".stripMargin)

Write spark Dataframe to an exisitng Delta Table by providing TABLE NAME instead of TABLE PATH

I am trying to write spark dataframe into an existing delta table.
I do have multiple scenarios where I could save data into different tables as shown below.
SCENARIO-01:
I have an existing delta table and I have to write dataframe into that table with option mergeSchema since the schema may change for each load.
I am doing the same with below command by providing delta table path
finalDF01.write.format("delta").option("mergeSchema", "true").mode("append") \
.partitionBy("part01","part02").save(finalDF01DestFolderPath)
Just want to know whether this can be done by providing exisiting delta TABLE NAME instead of delta PATH.
This has been resolved by updating data write command as below.
finalDF01.write.format("delta").option("mergeSchema", "true").mode("append") \
.partitionBy("part01","part02").saveAsTable(finalDF01DestTableName)
Is this the correct way ?
SCENARIO 02:
I have to update the existing table if the record already exists and if not insert a new record.
For this I am currently doing as shown below.
spark.sql("SET spark.databricks.delta.schema.autoMerge.enabled = true")
DeltaTable.forPath(DestFolderPath)
.as("t")
.merge(
finalDataFrame.as("s"),
"t.id = s.id AND t.name= s.name")
.whenMatched().updateAll()
.whenNotMatched().insertAll()
.execute()
I tried with below script.
destMasterTable.as("t")
.merge(
vehMasterDf.as("s"),
"t.id = s.id")
.whenNotMatched().insertAll()
.execute()
but getting below error(even with alias instead of as).
error: value as is not a member of String
destMasterTable.as("t")
Here also I am using delta table path as destination, Is there any way so that we could provide delta TABLE NAME instead of TABLE PATH?
It will be good to provide TABLE NAME instead of TABLE PATH, In case if we chage the table path later will not affect the code.
I have not seen anywhere in databricks documentation providing table name along with mergeSchema and autoMerge.
Is it possible to do so?
To use existing data as a table instead of path you either were need to use saveAsTable from the beginning, or just register existing data in the Hive metastore using the SQL command CREATE TABLE USING, like this (syntax could be slightly different depending on if you're running on Databricks, or OSS Spark, and depending on the version of Spark):
CREATE TABLE IF NOT EXISTS my_table
USING delta
LOCATION 'path_to_existing_data'
after that, you can use saveAsTable.
For the second question - it looks like destMasterTable is just a String. To refer to existing table, you need to use function forName from the DeltaTable object (doc):
DeltaTable.forName(destMasterTable)
.as("t")
...

PySpark cannot insertInto Hive table because "Can only write data to relations with a single path"

I have a Hive Orc table with a definition similar to the following definition
CREATE EXTERNAL TABLE `example.example_table`(
...
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'path'='s3a://path/to/table')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3a://path/to/table'
TBLPROPERTIES (
...
)
I am attempting to use PySpark to append a dataframe to this table using "df.write.insertInto("example.example_table")". When running this, I get the following error:
org.apache.spark.sql.AnalysisException: Can only write data to relations with a single path.;
at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:188)
at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:134)
...
When looking at the underlying Scala code, the condition that throws this error is checking to see if the table location has multiple "rootPaths". Obviously, my table is defined with a single location. What else could cause this?
It is that path that you are defining that causes the error. I just ran into this same problem myself. Hive generates a location path based on the hive.metastore.warehouse.dir property, so you have that default location plus the path you specified, which is causing that linked code to fail.
If you want to pick a specific path other than the default, then try using LOCATION.
Try running a describe extended example.example_table query to see more detailed information on the table. One of the output rows will be a Detailed Table Information which contains a bunch of useful information:
Table(
tableName:
dbName:
owner:
createTime:1548335003
lastAccessTime:0
retention:0
sd:StorageDescriptor(cols:
location:[*path_to_table*]
inputFormat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
outputFormat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
compressed:false
numBuckets:-1
serdeInfo:SerDeInfo(
name:null
serializationLib:org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
parameters:{
serialization.format=1
path=[*path_to_table*]
}
)
bucketCols:[]
sortCols:[]
parameters:{}
skewedInfo:SkewedInfo(skewedColNames:[]
skewedColValues:[]
skewedColValueLocationMaps:{})
storedAsSubDirectories:false
)
partitionKeys:[]
parameters:{transient_lastDdlTime=1548335003}
viewOriginalText:null
viewExpandedText:null
tableType:MANAGED_TABLE
rewriteEnabled:false
)
We had the same problem in a project when migrating from Spark 1.x and HDFS to Spark 3.x and S3. We solve this issue setting the next Spark property to false:
spark.sql.hive.convertMetastoreParquet
You can just run
spark.sql("SET spark.sql.hive.convertMetastoreParquet=false")
Or maybe
spark.conf("spark.sql.hive.convertMetastoreParquet", False)
Being spark the SparkSession object. The explanaition of this is currently in Spark documentation.

How to create an EXTERNAL Spark table from data in HDFS

I have loaded a parquet table from HDFS into a DataFrame:
val df = spark.read.parquet("hdfs://user/zeppelin/my_table")
I now want to expose this table to Spark SQL but this must be a persitent table because I want to access it from a JDBC connection or other Spark Sessions.
Quick way could be to call df.write.saveAsTable method, but in this case it will materialize the contents of the DataFrame and create a pointer to the data in the Hive metastore, creating another copy of the data in HDFS.
I don't want to have two copies of the same data, so I would want create like an external table to point to existing data.
To create a Spark External table you must specify the "path" option of the DataFrameWriter. Something like this:
df.write.
option("path","hdfs://user/zeppelin/my_mytable").
saveAsTable("my_table")
The problem though, is that it will empty your hdfs path hdfs://user/zeppelin/my_mytable eliminating your existing files and will cause an org.apache.spark.SparkException: Job aborted.. This looks like a bug in Spark API...
Anyway, the workaround to this (tested in Spark 2.3) is to create an external table but from a Spark DDL. If your table have many columns creating the DDL could be a hassle. Fortunately, starting from Spark 2.0, you could call the DDL SHOW CREATE TABLE to let spark do the hard work. The problem is that you can actually run the SHOW CREATE TABLE in a persistent table.
If the table is pretty big, I recommend to get a sample of the table, persist it to another location, and then get the DDL. Something like this:
// Create a sample of the table
val df = spark.read.parquet("hdfs://user/zeppelin/my_table")
df.limit(1).write.
option("path", "/user/zeppelin/my_table_tmp").
saveAsTable("my_table_tmp")
// Now get the DDL, do not truncate output
spark.sql("SHOW CREATE TABLE my_table_tmp").show(1, false)
You are going to get a DDL like:
CREATE TABLE `my_table_tmp` (`ID` INT, `Descr` STRING)
USING parquet
OPTIONS (
`serialization.format` '1',
path 'hdfs:///user/zeppelin/my_table_tmp')
Which you would want to change to have the original name of the table and the path to the original data. You can now run the following to create the Spark External table pointing to your existing HDFS data:
spark.sql("""
CREATE TABLE `my_table` (`ID` INT, `Descr` STRING)
USING parquet
OPTIONS (
`serialization.format` '1',
path 'hdfs:///user/zeppelin/my_table')""")

Resources