I want to add some columns to a Delta table using Spark SQL, but it shows me an error like:
ALTER ADD COLUMNS does not support datasource table with type org.apache.spark.sql.delta.sources.DeltaDataSource.
You must drop and re-create the table for adding the new columns.
Is there any way to alter my table in Delta Lake?
Thanks a lot for this question! I learnt quite a lot while hunting down a solution 👍
This is Apache Spark 3.2.1 and Delta Lake 1.1.0 (all open source).
The reason for the error is that Spark SQL (3.2.1) supports the ALTER ADD COLUMNS statement for the csv, json, parquet and orc data sources only; for anything else it throws this exception.
I assume you ran ALTER ADD COLUMNS using SQL (the root cause would've surfaced earlier if you'd used the Scala API or PySpark).
That leads us to org.apache.spark.sql.delta.catalog.DeltaCatalog, which has to be "installed" into Spark SQL for it to recognize Delta Lake as a supported data source. This is described in the official Quickstart.
For PySpark (on command line) it'd be as follows:
./bin/pyspark \
--packages io.delta:delta-core_2.12:1.1.0 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
In order to extend Spark SQL with Delta Lake's features (incl. ALTER ADD COLUMNS support), you have to set the following configuration properties to DeltaSparkSessionExtension and DeltaCatalog respectively:
spark.sql.extensions
spark.sql.catalog.spark_catalog
They are mandatory in open-source deployments (managed environments like Azure Databricks, which were mentioned as working fine, set them for you, which is why the error doesn't show up there).
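If you'd rather configure this from code than on the command line, a minimal PySpark sketch could look like the following; the table name my_delta_table and the added column middle_name are hypothetical, and the properties must be set before the session is first created:

from pyspark.sql import SparkSession

# Register the Delta extension and catalog on the session itself, then
# ALTER TABLE ... ADD COLUMNS works against a Delta table.
spark = (SparkSession.builder
    .appName("delta-alter-add-columns")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.1.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate())

spark.sql("ALTER TABLE my_delta_table ADD COLUMNS (middle_name STRING)")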
ALTER ADD COLUMNS does not support datasource table with type
org.apache.spark.sql.delta.sources.DeltaDataSource. You must drop and
re-create the table for adding the new columns.
You are facing this issue because of mismatched versions in your configuration. Make sure you are using compatible versions of Spark and Scala.
As mentioned in the comments, you can alter a table in Delta Lake; follow the link mentioned by @David דודו Markovitz.
Also refer to this similar issue.
I am saving my Spark DataFrame on Azure Databricks and creating a Delta Lake table.
It works fine; however, I am getting this warning message during execution.
Question: Why am I still getting this message, even though my table is a Delta table? What is wrong with my approach? Any input is greatly appreciated.
Warning Message
This query contains a highly selective filter. To improve the performance of queries, convert the table to Delta and run the OPTIMIZE ZORDER BY command on the table
Code
dfMerged.write\
.partitionBy("Date")\
.mode("append")\
.format("delta")\
.option("overwriteSchema", "true")\
.save("/mnt/path..")
spark.sql("CREATE TABLE DeltaUDTable USING DELTA LOCATION '/mnt/path..'")
Some more details
I've mounted Azure Data Lake Storage Gen2 at the above mount location.
databricks runtime - 6.4 (includes Apache Spark 2.4.5, Scala 2.11)
The warning message is clearly misleading, as you are already writing in Delta format. You can ignore it.
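That said, if you ever do want to act on the warning's suggestion, OPTIMIZE ... ZORDER BY is available on Databricks (and in newer open-source Delta releases). A minimal sketch, where eventId is a hypothetical high-cardinality column you often filter on (don't Z-order by the partition column Date):

# Compact the table created above and co-locate data by a frequently filtered column.
spark.sql("OPTIMIZE DeltaUDTable ZORDER BY (eventId)")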
df.write.mode("overwrite").saveAsTable("table_loc")
Using the spark-elasticsearch connector, it is possible to load only the required columns directly from ES into Spark. However, there doesn't seem to be such a straightforward option to do the same using the spark-cassandra connector.
Reading data from ES into Spark
-- here only the required columns are brought from ES to Spark:
spark.conf.set('es.nodes', ",".join(ES_CLUSTER))
es_epf_df = spark.read.format("org.elasticsearch.spark.sql") \
.option("es.read.field.include", "id_,employee_name") \
.load("employee_0001") \
Reading data from Cassandra into Spark
-- here all the columns' data is brought to Spark and then a select is applied to pull the columns of interest:
spark.conf.set('spark.cassandra.connection.host', ','.join(CASSANDRA_CLUSTER))
cass_epf_df = spark.read.format('org.apache.spark.sql.cassandra') \
.options(keyspace="db_0001", table="employee") \
.load() \
.select("id_", "employee_name")
Is it possible to do the same for Cassandra? If yes, then how? If not, then why not?
Actually, the connector should do that by itself, without the need to set anything explicitly. It's called "predicate pushdown" (and, for columns, projection pruning), and the cassandra-connector does it, according to the documentation:
The connector will automatically pushdown all valid predicates to
Cassandra. The Datasource will also automatically only select columns
from Cassandra which are required to complete the query. This can be
monitored with the explain command.
source: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md
The code you have written is already doing that. Because you wrote select after load, you may think that all the columns are pulled first and the selected columns are filtered afterwards, but that is not the case.
Assumption: select * from db_0001.employee;
Actual: select id_, employee_name from db_0001.employee;
Spark works out which columns you need and queries only those from the Cassandra database. This is projection (column) pruning, the column-selection counterpart of predicate pushdown. It is not limited to Cassandra; many sources support it (it is a feature of Spark's data source API, not of Cassandra itself).
For more info: https://docs.datastax.com/en/dse/6.7/dse-dev/datastax_enterprise/spark/sparkPredicatePushdown.html
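To see this for yourself, inspect the physical plan, as the connector documentation quoted above suggests. A minimal sketch, reusing cass_epf_df from the question:

# The plan should show only the requested columns (id_, employee_name) in the
# Cassandra scan, and any filters marked as pushed down to the source.
cass_epf_df.explain(True)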
The default
spark-shell --conf spark.hadoop.metastore.catalog.default=hive
val df: DataFrame = ...
df.write.saveAsTable("db.table")
fails as it tries to write an internal / managed / transactional table (see How to write a table to Hive from Spark without using the warehouse connector in HDP 3.1).
How can I tell Spark not to create a managed table, but rather an external one?
For now, disabling transactional tables by default looks like the best option to me.
Inside Ambari, simply disabling the option of creating transactional tables by default solves my problem.
Set the following to false twice (once for the Tez config and once for the LLAP config):
hive.strict.managed.tables = false
and enable it manually per table via its table properties if desired (i.e., when you actually want a transactional table).
As a workaround, a manual CTAS could be an option as well.
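A minimal sketch (in PySpark) of what the setting-based approach then looks like, assuming hive.strict.managed.tables has already been set to false for both the Tez and LLAP Hive configs; the table name is hypothetical:

# With the strict-managed default disabled, saveAsTable creates a plain
# (non-transactional) Hive table that Spark can write without the warehouse connector.
df.write.mode("overwrite").saveAsTable("db.table")

# If a particular table should still be transactional, opt it in explicitly on the
# Hive side, e.g. an ORC table created with TBLPROPERTIES ('transactional' = 'true').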
As part of a data integration process I am working on, I have a need to persist a Spark SQL DataFrame as an external Hive table.
My constraints at the moment:
Currently limited to Spark 1.6 (v1.6.0)
Need to persist the data in a specific location, retaining the data even if the table definition is dropped (hence external table)
I have found what appears to be a satisfactory solution to write the dataframe, df, as follows:
df.write.saveAsTable('schema.table_name',
format='parquet',
mode='overwrite',
path='/path/to/external/table/files/')
Doing a describe extended schema.table_name against the resulting table confirms that it is indeed external. I can also confirm that the data is retained (as desired) even if the table itself is dropped.
My main concern is that I can't really find a documented example of this anywhere, nor much mention of it in the official docs, particularly the use of a path to enforce the creation of an external table.
(https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter).
Is there a better/safer/more standard way to persist the dataframe?
I would rather create the Hive tables myself (e.g. CREATE EXTERNAL TABLE IF NOT EXISTS) exactly as I need them, and then in Spark just do: df.write.saveAsTable('schema.table_name', mode='overwrite').
This way you have control over the table creation and don't depend on the HiveContext doing what you need. In the past there were issues with Hive tables created this way, and the behaviour can change in the future since that API is generic and cannot guarantee the underlying implementation used by the HiveContext.
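A minimal sketch of that two-step approach, reusing the (hypothetical) schema, path and DataFrame names from the question and assuming sqlContext is a HiveContext; the column list is made up for illustration:

# 1. Create the external table definition yourself, exactly as needed.
sqlContext.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS schema.table_name (
        id BIGINT,
        name STRING
    )
    STORED AS PARQUET
    LOCATION '/path/to/external/table/files/'
""")

# 2. Then let Spark (over)write the data into that definition.
df.write.saveAsTable('schema.table_name', mode='overwrite')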
I wonder whether I can use an UPDATE query in Spark SQL, like this:
sqlContext.sql("update users set name = '*' where name is null")
I got the error:
org.apache.spark.sql.AnalysisException:
Unsupported language features in query:update users set name = '*' where name is null
Does Spark SQL not support UPDATE queries, or am I writing the code incorrectly?
Spark SQL doesn't support UPDATE statements yet.
Hive has supported UPDATE since version 0.14, but even then only on tables that support transactions, as mentioned in the Hive documentation.
See the answers in the Databricks forums confirming that UPDATE/DELETE are not supported in Spark SQL, as it doesn't support transactions. If you think about it, supporting random updates is very complex with most big data storage formats: it requires scanning huge files, updating specific records and rewriting potentially TBs of data. It is not like normal SQL.
Now it's possible, with Databricks Delta Lake
Spark SQL now supports UPDATE, DELETE and similar data modification operations if the underlying table is in Delta format.
Check this out:
https://docs.delta.io/0.4.0/delta-update.html#update-a-table
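A minimal sketch, assuming a Delta-enabled Spark session; the SQL form requires a newer Spark 3 / Delta setup with the Delta extensions configured, while with Delta 0.4.0 (as in the linked docs) you would use the programmatic API with a (hypothetical) table path:

# SQL form, matching the original query, for a Delta table registered in the catalog:
spark.sql("UPDATE users SET name = '*' WHERE name IS NULL")

# Programmatic form via the Delta Lake Python API:
from delta.tables import DeltaTable

users = DeltaTable.forPath(spark, "/path/to/users")
users.update(condition="name IS NULL", set={"name": "'*'"})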