Databricks schema enforcement issues - apache-spark

As suggested in the article about schema enforcement, a declared schema helps detect issues early. However, the two issues described below are preventing me from creating a descriptive schema.
Comments on a table column are seen as a difference in the schema
# Get data
test_df = spark.createDataFrame([('100000146710',)], ['code'])
# ... save
test_df.write.format("delta").mode("append").save('/my_table_location')
# Create table: ... BOOM
spark.sql("""
CREATE TABLE IF NOT EXISTS my_table (
code STRING COMMENT 'Unique identifier'
) USING DELTA LOCATION '/my_table_location'
""")
This will fail with AnalysisException: The specified schema does not match the existing schema at /my_table_location. The only solution I found is to drop the column comments.
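One workaround, assuming a reasonably recent Delta Lake / Databricks runtime (my suggestion, not something from the original post), may be to register the table with the bare schema so it matches the existing data, and attach the comment afterwards with ALTER TABLE:
# Sketch: create the table without comments, then add the comment separately.
# ALTER TABLE ... ALTER COLUMN ... COMMENT is documented for Delta tables on
# recent runtimes; older runtimes may need the CHANGE COLUMN form instead.
spark.sql("""
CREATE TABLE IF NOT EXISTS my_table (
code STRING
) USING DELTA LOCATION '/my_table_location'
""")
spark.sql("ALTER TABLE my_table ALTER COLUMN code COMMENT 'Unique identifier'")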
Not null struct field shows as nullable
from pyspark.sql.types import StructType, StructField, StringType

json_schema = StructType([
StructField("code", StringType(), False)
])
json_df = (spark.read
.schema(json_schema)
.json('/my_input.json')
)
json_df.printSchema()
will show
root
|-- code: string (nullable = true)
So despite the schema declaration stating that the field is not null, it shows as nullable in the DataFrame. Because of this, adding a NOT NULL constraint on the table column triggers an AnalysisException.
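One way to re-assert the declared nullability (my workaround, not part of the original question) is to rebuild the DataFrame from its RDD with the explicit schema; the resulting schema then reports nullable = false:
# Sketch: force the declared schema, including nullable=False, back onto the data.
# This goes through the RDD API, and rows that actually contain nulls in 'code'
# will fail verification, so make sure the data really is non-null.
json_df_strict = spark.createDataFrame(json_df.rdd, json_schema)
json_df_strict.printSchema()  # code: string (nullable = false)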
Any comments or suggestions are welcome.

With the execution of
test_df.write.format("delta").mode("append").save('/my_table_location')
you have already created a new Delta table with the schema defined by test_df. This new table delta.`/my_table_location` already has the schema code STRING.
If you would like a comment in the schema, perhaps first create the table as you would like it defined, e.g.
spark.sql("""
CREATE TABLE my_table (
code STRING COMMENT 'unique identifier'
) USING DELTA LOCATION '/my_table_location'
""")
And then insert your data from your test_df into it, e.g.
test_df.createOrReplaceTempView("test_df_view")
spark.sql("""
INSERT INTO my_table (code) SELECT code FROM test_df_view
""")

Related

Change the datatype of a column in delta table

Is there a SQL command that I can easily use to change the datatype of an existing column in a Delta table? I need to change the column datatype from BIGINT to STRING. Below is the SQL command I'm trying to use, but no luck.
%sql ALTER TABLE [TABLE_NAME] ALTER COLUMN [COLUMN_NAME] STRING
Error I'm getting:
org.apache.spark.sql.AnalysisException
ALTER TABLE CHANGE COLUMN is not supported for changing column 'bam_user' with type
'IntegerType' to 'bam_user' with type 'StringType'
SQL doesn't support this, but it can be done in Python:
from pyspark.sql.functions import col
# set dataset location and columns with new types
table_path = '/mnt/dataset_location...'
types_to_change = {
    'column_1': 'int',
    'column_2': 'string',
    'column_3': 'double'
}
# load to dataframe, change types
df = spark.read.format('delta').load(table_path)
for column in types_to_change:
    df = df.withColumn(column, col(column).cast(types_to_change[column]))
# save df with new types overwriting the schema
df.write.format("delta").mode("overwrite").option("overwriteSchema",True).save("dbfs:" + table_path)
There is no option to change the data type of a column or drop a column in place. You can read the data into a DataFrame, change the data types with withColumn(), drop columns with drop(), and overwrite the table.
There is no real way to do this using SQL, unless you copy everything to a different table altogether: insert the data into a new table, drop the old one, and re-create it with the new structure, which makes it risky (a rough sketch of that route is below).
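For reference, the copy-and-recreate route could look roughly like the sketch below; the table and column names are placeholders, and things like table history and permissions do not carry over, which is part of the risk:
# Sketch of the SQL-only route: copy into a new table with the new type,
# drop the old table, then rename.
spark.sql("""
CREATE TABLE my_table_new AS
SELECT CAST(bam_user AS STRING) AS bam_user  -- plus the other columns unchanged
FROM my_table
""")
spark.sql("DROP TABLE my_table")
spark.sql("ALTER TABLE my_table_new RENAME TO my_table")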
The way to do this in Python is as follows:
Let's say this is your table :
CREATE TABLE person (id INT, name STRING, age INT, class INT, address STRING);
INSERT INTO person VALUES
(100, 'John', 30, 1, 'Street 1'),
(200, 'Mary', NULL, 1, 'Street 2'),
(300, 'Mike', 80, 3, 'Street 3'),
(400, 'Dan', 50, 4, 'Street 4');
You can check the table structure using the following:
DESCRIBE TABLE person
If you need to change id to STRING, this is the code:
%py
from pyspark.sql.functions import col

df = spark.read.table("person")
df1 = df.withColumn("id", col("id").cast("string"))
df1.write \
    .format("parquet") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("person")
A couple of pointers: the format is parquet in this table, which is the default for Databricks, so you can omit the format line (note that Python is very sensitive to whitespace and line continuations, hence the trailing backslashes).
Regarding Databricks:
If the format is "delta" you must specify this explicitly.
Also, if the table is partitioned, it's important to mention that in the code, for example:
df1.write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy("col_to_partition1", "col_to_partition2") \
    .option("overwriteSchema", "true") \
    .save(table_location)
where table_location is the location where the Delta table is saved.
(some of this answer is based on this)
Suppose you want to change the data type of column "column_name" to "int" in table "delta_table_name":
from pyspark.sql.functions import col

spark.read.table("delta_table_name") \
    .withColumn("column_name", col("column_name").cast("int")) \
    .write.format("delta").mode("overwrite").option("overwriteSchema", True).saveAsTable("delta_table_name")
Read the table using Spark.
Use the withColumn method to transform the column you want.
Write the table back with mode overwrite and overwriteSchema set to True.
Reference: https://docs.databricks.com/delta/update-schema.html#explicitly-update-schema-to-change-column-type-or-name
from pyspark.sql import functions as F
spark.read.table("<TABLE NAME>") \
.withColumn("<COLUMN NAME> ",F.col("<COLUMN NAME>").cast("<DATA TYPE>")) \
.write.format("delta").mode("overwrite").option("overwriteSchema",True).saveAsTable("<TABLE NAME>")

PySpark Pushing down timestamp filter

I'm using PySpark version 2.4 to read some tables using jdbc with a Postgres driver.
df = spark.read.jdbc(url=data_base_url, table="tablename", properties=properties)
One column is a timestamp column and I want to filter it like this:
df_new_data = df.where(df.ts > last_datetime )
This way the filter is pushed down as a SQL query, but the datetime format is not right. So I tried this approach
df_new_data = df.where(df.ts > F.date_format(F.lit(last_datetime), "y-MM-dd'T'hh:mm:ss.SSS"))
but then the filter is not pushed down anymore.
Can someone clarify why this is the case?
When loading data from a database table, if you want to push the query down to the database and get back only a few result rows, you can provide a query instead of the table name and return just the result as a DataFrame. This way, you leverage the database engine to process the query and return only the results to Spark.
The table parameter identifies the JDBC table to read. You can use anything that is valid in the FROM clause of a SQL query. Note that an alias is mandatory in the query.
pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
df.show()
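Applied to the timestamp filter from the question, that would look roughly like the sketch below; the table name, column name, and cutoff value are taken from or assumed for the question, so adjust as needed:
# Sketch: embed the timestamp predicate in the subquery so Postgres evaluates it.
last_datetime = '2019-01-01 00:00:00'  # placeholder cutoff
pushdown_query = "(select * from tablename where ts > '{}'::timestamp) t_alias".format(last_datetime)
df_new_data = spark.read.jdbc(url=data_base_url, table=pushdown_query, properties=properties)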

Cassandra: List all tables in keyspace based on restriction such as LIKE or CONTAINS?

I have many tables per keyspace, so I would like to filter the tables based on a restriction. I tried this query but it is not giving the result that I want:
SELECT table_name FROM system_schema.tables
WHERE keyspace_name = 'test'
and table_name >= 'test_001_%';
The output shown is:
'table_name'
---------------------
'test_001_metadata'
'test_001_time1'
'test_001_time2'
'test_001_time3'
'test_001_time4'
'test_002_metadata'
'test_002_time1'
'test_002_time2'
'test_002_time3'
What I really want is:
'table_name'
---------------------
'test_001_metadata'
'test_001_time1'
'test_001_time2'
'test_001_time3'
'test_001_time4'
The other way out is to use the LIKE keyword by creating a secondary index on table_name, but I am a bit skeptical that it might cause problems since it is a system table. Another concern: does a clustering column actually support a secondary index?
Create a SASI index with mode CONTAINS on the table_name column after removing the previous index, and try the query as:
SELECT table_name FROM system_schema.tables
WHERE keyspace_name = 'test'
and table_name LIKE '%test_001_%';
The command to create a SASI index with mode contains is as follows:
CREATE CUSTOM INDEX ON system_schema.tables(table_name)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
'case_sensitive': 'false', 'tokenization_normalize_uppercase': 'true', 'mode': 'CONTAINS'}
And for your second question, you cannot create a secondary index on anything that is part of the PRIMARY KEY.

HIVE Parquet error

I'm trying to insert the content of a DataFrame into a partitioned, Parquet-formatted Hive table using
df.write.mode(SaveMode.Append).insertInto(myTable)
with hive.exec.dynamic.partition = 'true' and hive.exec.dynamic.partition.mode = 'nonstrict'.
I keep getting a parquet.io.ParquetEncodingException saying that
empty fields are illegal, the field should be ommited completely
instead.
The schema includes arrays (array<struct<int, string>>), and the DataFrame does contain some empty entries for these fields.
However, when I insert the DataFrame content into a non-partitioned table, I do not get an error.
How do I fix this issue? I have attached the error.
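One workaround often suggested for this Parquet limitation (not something from this thread, so treat it as an assumption) is to replace empty arrays with NULL before the insert. A minimal PySpark sketch, assuming the array column is called items:
from pyspark.sql import functions as F

# Sketch: Parquet cannot encode an empty repeated group, so turn empty arrays into NULLs.
# 'items' is a hypothetical name for the array<struct<...>> column from the question.
df_fixed = df.withColumn("items", F.when(F.size("items") > 0, F.col("items")))
df_fixed.write.mode("append").insertInto(myTable)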

Getting ValidationFailureSemanticException on 'INSERT OVERWRITE'

I am creating a DataFrame and registering it as a temp view using df.createOrReplaceTempView('mytable'). After that I try to write the content of 'mytable' into a Hive table (which is partitioned) using the following query:
insert overwrite table
myhivedb.myhivetable
partition(testdate) -- (1) Note: I have a partition named 'testdate'
select
Field1,
Field2,
...
TestDate -- (2) Note: I have a field named 'TestDate'; both (1) and (2) have the same name
from
mytable
When I execute this query, I get the following error:
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.Table$ValidationFailureSemanticException: Partition spec
{testdate=, TestDate=2013-01-01}
It looks like I am getting this error because of the identical field names, i.e. testdate (the partition in Hive) and TestDate (the field in the temp view 'mytable').
Whereas if my partition name testdate is different from the field name (i.e. TestDate), the query executes successfully. Example:
insert overwrite table
myhivedb.myhivetable
partition(my_partition) -- Note: here the partition name is not 'testdate'
select
Field1,
Field2,
...
TestDate
from
mytable
My guess is that this is a bug in Spark, but I would like a second opinion. Am I missing something here?
#DuduMarkovitz #dhee; apologies for being so late with the response. I was finally able to resolve the issue. Earlier I was creating the table using camelCase field names (in the CREATE statement), which seems to have been the reason for the exception. I have now created the table using DDL where the field names are in lower case, and this has resolved my issue.
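For anyone hitting the same error, the fix described above amounts to declaring the column and partition names in lower case in the DDL; a rough sketch (column names and types are illustrative, not from the thread):
# Sketch: lower-case column and partition names so the dynamic-partition spec
# matches 'testdate' rather than 'TestDate'. Columns and types are placeholders.
spark.sql("""
CREATE TABLE myhivedb.myhivetable (
field1 STRING,
field2 STRING
)
PARTITIONED BY (testdate STRING)
""")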
