What is the difference between SparkSessionCatalog and SparkCatalog in Iceberg? - apache-spark

As the title says.
The question comes from the following:
I connect to spark-sql with an Iceberg catalog like this:
bin/spark-sql \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.ice_test2=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.ice_test2.type=hive \
--conf spark.sql.catalog.ice_test2.uri=thrift://xxxxxxx:9083
but when I execute use ice_test2.default;, I get an error:
java.lang.NullPointerException: Delegated SessionCatalog is missing. Please make sure your are replacing Spark's default catalog, named 'spark_catalog'.
However, when I run spark-sql with SparkCatalog instead, it works fine.

Edit (following the edit to the original question):
The way org.apache.iceberg.spark.SparkSessionCatalog works is by first trying to load an Iceberg table with the given identifier and then falling back to the default behaviour of the session catalog.
Since you are using ice_test2 as your catalog name, it doesn't know which SessionCatalog to fall back to.
As the error indicates, if you use spark_catalog instead of ice_test2, it should work.
Quoting the Iceberg documentation on the difference between Iceberg's SparkCatalog and SparkSessionCatalog:
org.apache.iceberg.spark.SparkCatalog - supports a Hive Metastore or a Hadoop warehouse as a catalog
org.apache.iceberg.spark.SparkSessionCatalog - adds support for Iceberg tables to Spark’s built-in catalog, and delegates to the built-in catalog for non-Iceberg tables
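To make that concrete, here is a minimal PySpark sketch of registering SparkSessionCatalog under the reserved name spark_catalog so it has a built-in session catalog to delegate to. It assumes the Iceberg runtime jar is already on the classpath; the metastore URI below is a placeholder.
from pyspark.sql import SparkSession

# SparkSessionCatalog must replace Spark's built-in catalog, whose name is
# fixed as 'spark_catalog'; non-Iceberg tables are then delegated to it.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.iceberg.spark.SparkSessionCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    .config("spark.sql.catalog.spark_catalog.uri", "thrift://xxxxxxx:9083")  # placeholder URI
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("USE default")  # resolves through the Iceberg-aware session catalog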

Change spark.sql.catalog.ice_test2=org.apache.iceberg.spark.SparkSessionCatalog to spark.sql.catalog.ice_test2=org.apache.iceberg.spark.SparkCatalog
and delete spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions.
Referenced: https://iceberg.apache.org/docs/latest/spark-configuration/#catalogs
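A hedged PySpark sketch of that approach: keep the custom catalog name ice_test2, but back it with SparkCatalog, which is a standalone catalog and does not need to wrap Spark's built-in spark_catalog. The metastore URI is again a placeholder.
from pyspark.sql import SparkSession

# SparkCatalog can live under any catalog name, since it never delegates
# to the built-in session catalog.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.ice_test2", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.ice_test2.type", "hive")
    .config("spark.sql.catalog.ice_test2.uri", "thrift://xxxxxxx:9083")  # placeholder URI
    .getOrCreate()
)

spark.sql("USE ice_test2.default")  # resolves against the standalone Iceberg catalog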

Related

Cannot modify the value of a Spark config: spark.executor.instances

I am using Spark 3.0 and I am setting the following parameters:
spark.conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.conf.set("fs.s3a.fast.upload.buffer", "bytebuffer")
spark.conf.set("spark.sql.files.maxPartitionBytes",134217728)
spark.conf.set("spark.executor.instances", 4)
spark.conf.set("spark.executor.memory", 3)
Error:
pyspark.sql.utils.AnalysisException: Cannot modify the value of a Spark config: spark.executor.instances
I DON'T want to pass it through spark-submit as this is a pytest case that I am writing.
How do I get through this?
According to the official Spark documentation, the spark.executor.instances property may not take effect when set programmatically through SparkConf at runtime, so it is suggested to set it through a configuration file or spark-submit command-line options.
Spark properties mainly can be divided into two kinds: one is related to deploy, like “spark.driver.memory”, “spark.executor.instances”, this kind of properties may not be affected when setting programmatically through SparkConf in runtime, or the behavior is depending on which cluster manager and deploy mode you choose, so it would be suggested to set through configuration file or spark-submit command line options; another is mainly related to Spark runtime control, like “spark.task.maxFailures”, this kind of properties can be set in either way.
You can try to add those options to PYSPARK_SUBMIT_ARGS before initializing the SparkContext. Its syntax is similar to spark-submit's.
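A minimal sketch of that approach, assuming a pytest-style setup where the environment variable is set before the first SparkSession is created. The values (including the 3g memory setting) are illustrative; the trailing pyspark-shell token is required by PySpark's launcher.
import os
from pyspark.sql import SparkSession

# Deploy-time properties such as spark.executor.instances must be passed to
# the launcher, not set on an already-running session.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--conf spark.executor.instances=4 "
    "--conf spark.executor.memory=3g "
    "pyspark-shell"
)

spark = SparkSession.builder.getOrCreate()  # picks up the submit args above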

Azure Databricks - Can not create the managed table The associated location already exists

I have the following problem in Azure Databricks. Sometimes when I try to save a DataFrame as a managed table:
SomeData_df.write.mode('overwrite').saveAsTable("SomeData")
I get the following error:
"Can not create the managed table('SomeData'). The associated
location('dbfs:/user/hive/warehouse/somedata') already exists.;"
I used to fix this problem by running a %fs rm command to remove that location but now I'm using a cluster that is managed by a different user and I can no longer run rm on that location.
For now the only fix I can think of is using a different table name.
What makes things even more peculiar is the fact that the table does not exist. When I run:
%sql
SELECT * FROM SomeData
I get the error:
Error in SQL statement: AnalysisException: Table or view not found:
SomeData;
How can I fix it?
Seems there are a few others with the same issue.
A temporary workaround is to use
dbutils.fs.rm("dbfs:/user/hive/warehouse/SomeData/", true)
to remove the table before re-creating it.
This generally happens when a cluster is shut down while writing a table. The recommended solution from the Databricks documentation:
This flag deletes the _STARTED directory and returns the process to the original state. For example, you can set it in the notebook
%py
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")
All of the other recommended solutions here are either workarounds or do not work. The mode is specified as overwrite, meaning you should not need to delete or remove the db or use legacy options.
Instead, try specifying the fully qualified path in the options when writing the table:
df.write \
.option("path", "hdfs://cluster_name/path/to/my_db") \
.mode("overwrite") \
.saveAsTable("my_db.my_table")
For a more context-free answer, run this in your notebook:
dbutils.fs.rm("dbfs:/user/hive/warehouse/SomeData", recurse=True)
Per Databricks's documentation, this will work in a Python or Scala notebook, but you'll have to use the magic command %python at the beginning of the cell if you're using an R or SQL notebook.
I have the same issue. I am using
create table if not exists USING delta
If I first delete the files as suggested, it creates the table once, but the second time the problem repeats. It seems the CREATE TABLE IF NOT EXISTS does not recognize the table and tries to create it anyway.
I don't want to delete the table every time; I'm actually trying to use MERGE and keep the table.
Well, this happens because you're trying to write data to the default location (without specifying the 'path' option) with the mode 'overwrite'.
As Mike said, you can set "spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation" to "true", but this option was removed in Spark 3.0.0.
If you try to set this option in Spark 3.0.0 you will get the following exception:
Caused by: org.apache.spark.sql.AnalysisException: The SQL config 'spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation' was removed in the version 3.0.0. It was removed to prevent loosing of users data for non-default value.;
To avoid this problem, you can explicitly specify the path you are going to save to when using the 'overwrite' mode.
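A short sketch of that workaround applied to the original example; the dbfs location is hypothetical and should be adjusted to wherever the table data should live.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
SomeData_df = spark.range(10)  # stand-in for the real DataFrame

(SomeData_df.write
    .option("path", "dbfs:/some/explicit/location/somedata")  # hypothetical explicit path
    .mode("overwrite")
    .saveAsTable("SomeData"))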

Cannot create a Dataproc cluster when setting the fs.defaultFS property?

This was already the subject of discussion in a previous post; however, I'm not convinced by the answers, as the Google docs state that it is possible to create a cluster while setting the fs.defaultFS property. Moreover, even if it is possible to set this property programmatically, it is sometimes more convenient to set it from the command line.
So I wanted to know why the following option, when passed to my cluster creation command, does not work: --properties core:fs.defaultFS=gs://my-bucket? Please note I haven't included all parameters, as I ran the command without the previous flag and it successfully created the cluster. However, when passing this, I get: "failed: Cannot start master: Insufficient number of DataNodes reporting."
If anyone has managed to create a Dataproc cluster by setting fs.defaultFS, that would be great to know. Thanks.
It's true there are still known issues due to certain dependencies on actual HDFS; the docs were not intended to imply that setting fs.defaultFS to a GCS path at cluster-creation time would work, but to simply provide a convenient example of a property that appears in core-site.xml; in theory it would work to set fs.defaultFS to a different preexisting HDFS cluster, for example. I've filed a ticket to change the example in the documentation to avoid confusion.
Two options:
Just override fs.defaultFS at job-submission time using per-job properties
Workaround some of the known issues by setting fs.defaultFS explicitly using an initialization action instead of cluster properties.
Option 1 is better understood to work because cluster-level HDFS dependencies won't change. Option 2 works because most of the incompatibilities occur only during initial startup, and initialization actions run after the relevant daemons have already started up. To override the setting in an init action, you'd use bdconfig:
bdconfig set_property \
--name 'fs.defaultFS' \
--value 'gs://my-bucket' \
--configuration_file /etc/hadoop/conf/core-site.xml \
--clobber
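For option 1, the override is normally supplied as a per-job property at job-submission time. As a hedged sketch only, the same override can also be expressed from inside a PySpark job via the spark.hadoop. prefix, which forwards the value into the job's Hadoop configuration; the bucket name is a placeholder.
from pyspark.sql import SparkSession

# Equivalent in spirit to passing fs.defaultFS as a per-job property:
# 'spark.hadoop.*' settings are copied into the Hadoop Configuration.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.defaultFS", "gs://my-bucket")  # placeholder bucket
    .getOrCreate()
)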

Spark doesn't show all Hive databases

If I list all the databases in Hive, I get the following result (I have 2 databases, default and sbm):
But if I try to do the same thing in Spark, I get this:
It doesn't show the database SBM.
Are you connected to that Hive metastore? Did you specify the metastore details somewhere (i.e. hive-site.xml in the Spark conf directory)? It seems like you are connected to the local metastore.
I think that you need to copy your hive-site.xml to the spark/conf directory.
If you use Ubuntu and have defined the environment variables, use the following command:
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf
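To verify which metastore Spark is now talking to, here is a quick check from PySpark, assuming the session is built with Hive support enabled:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .enableHiveSupport()  # use the Hive metastore described by hive-site.xml
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()  # sbm should now appear alongside default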

tables available in Spark-SQL CLI are not available over thriftserver

I'm trying to expose my spark-sql tables over JDBC via the thriftserver, but even though it looks like I've successfully connected, it's not working. Here's what I've tried so far.
database setup:
in pyspark I loaded a parquet file, created a temp view as tableX
performed a .saveAsTable as hive_tableX
then I queried that table: spark.sql("SELECT * FROM hive_tableX LIMIT 1").show() which returned some data
at this point, my code is saving the table information to the Hive metastore, right?
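A minimal PySpark sketch of the setup described above, with a hypothetical parquet path (enableHiveSupport is an assumption about how the session was built):
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.read.parquet("/path/to/data.parquet")  # hypothetical path
df.createOrReplaceTempView("tableX")              # session-scoped temp view
df.write.saveAsTable("hive_tableX")               # persisted via the metastore

spark.sql("SELECT * FROM hive_tableX LIMIT 1").show()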
querying from spark-sql:
I then ran spark-sql and the spark sql shell started up
USE default
show tables; --> i see my table in there, hive_tableX
SELECT * FROM hive_tableX LIMIT 1 and I see some successful results.
thus, I believe it is now verified that my table has been saved in the Hive metastore, right?
then I turn on thriftserver
./sbin/start-thriftserver.sh
next, I turn on beeline so I can test the thriftserver connection
!connect jdbc:hive2://localhost:10000 (and enter username and password)
then I select the default db: use default;
and show tables; --> there's nothing there.
So, where are my tables? is beeline or thrift pointing to a different warehouse or something?
Edit: I think my thriftserver isn't using the right warehouse directory, so I'm trying to start it with a config option:
[still nothing] sbin/start-thriftserver.sh --hiveconf spark.sql.warehouse.dir=/code/spark/thrift/spark-warehouse
[still nothing] sbin/start-thriftserver.sh --conf spark.sql.warehouse.dir=/code/spark/thrift/spark-warehouse
Edit: starting it in the same physical directory as where the warehouse was created seems to do the trick, although I don't know how to programmatically set the path to something else and start it elsewhere.
The solution to this particular problem was that I was starting the thriftserver from a different directory than where the spark-warehouse and metastore_db were located.
Once I started it from the correct directory, it worked as expected and my tables were now available.
