Failed to execute 'table' on org.apache.spark.sql.SparkSession - apache-spark

I have a Spark + Hive application.
It works fine. But at some point I had to create another Hive environment.
So I ran show create table ... and recreated the same view (with underlying tables). And added some data.
I can query the data from hive cli, etc.
but whenever I run my application it fails with
ERROR Failed to execute 'table' on 'org.apache.spark.sql.SparkSession' with args=([Type=java.lang.String, Value: <view name>])
I believe it refers to the line code when I can sparkSession.table(<view-name>)
What steps can be executed to troubleshoot a such issue?
UPD
Session declaration (definitely tried to create a session without this configuration)
.Config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
.Config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "some.file")
.Config("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
.Config("spark.sql.debug.maxToStringFields", int64 2048)
.Config("spark.debug.maxToStringFields", int64 2048)

Maybe a bit trivial, but when it comes to troubleshooting this kind of an issue, really try and get to the root of the problem with a minimal set up:
I generally start off by starting the spark-shell.
Check whether it is possible to run spark.sql("SHOW DATABASES").show(20, false). If this fails, it's probably something with your Hive configuration, indeed.
Try and see whether you can run spark.table("your_table"). If not, it'll probably give you a clearer error (such as Table or view not found: ...).
If all of the above works, try to strip your application such that it only does that spark.table, which did work in your spark-shell at that point in time. If that suddenly doesn't work, it might have to do with how the SparkSession is created in your application.
If that works, try and uncomment the code piece by piece, until you're back to your original code to better pinpoint where it fails.

Related

How can I install flashtext on every executor?

I am using the flashtext library in a couple of UDFs. It works when I run it locally in Client mode, but once I try to run it in the Cloudera Workbench with several executors, I get an ModuleNotFoundError.
After some research I found that it is possible to add archives (and packages?) to a SparkSession when creating it, so I tried:
SparkSession.builder.config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz')
but it didn't help, the same error remains.
According to Spark Configuration doc, there are other configs I could try, e.g. spark.submit.pyFiles, but I don't understand how these py-files to be added would have to look like.
Would it be enough to just create a pyton script with this content?
from flashtext import KeywordProcessor
Could you tell me the easiest way how I can install flashtext on every node?
Edit:
In the meantime, I figured that not only Flashtext was causing issues, but also every relative import from other scripts that I intended to use in a UDF. In order to fix it, I followed this article. I also took the source code from Flashtext and imported it to the main file without installing the actual library.
I think in order to point Spark executors to python modules extracted from your archive, you will need to add another config setting, that adds their location to PYTHONPATH. Something like this:
SparkSession.builder \
.config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz#myUDFs') \
.config('spark.executorEnv.PYTHONPATH', './myUDFs')
Citing from the same link you have in the question:
spark.executorEnv.[EnvironmentVariableName]...Add the environment
variable specified by EnvironmentVariableName to the Executor process.
The user can specify multiple of these to set multiple environment
variables.
There are no environment details in your question (or I'm simply not familiar with Cloudera Workbench) but if you're trying to run Spark on YARN, you may need to use slightly different setting spark.yarn.dist.archives.
Also, please make sure that your driver log contains message confirming that an archive was actually uploaded, as in:
:
22/11/08 INFO yarn.Client: Uploading resource file:/absolute/path/to/your/archive.zip -> hdfs://nameservice/user/<your-user-id>/.sparkStaging/<application-id>/archive.zip
:

Azure Databricks - Can not create the managed table The associated location already exists

I have the following problem in Azure Databricks. Sometimes when I try to save a DataFrame as a managed table:
SomeData_df.write.mode('overwrite').saveAsTable("SomeData")
I get the following error:
"Can not create the managed table('SomeData'). The associated
location('dbfs:/user/hive/warehouse/somedata') already exists.;"
I used to fix this problem by running a %fs rm command to remove that location but now I'm using a cluster that is managed by a different user and I can no longer run rm on that location.
For now the only fix I can think of is using a different table name.
What makes things even more peculiar is the fact that the table does not exist. When I run:
%sql
SELECT * FROM SomeData
I get the error:
Error in SQL statement: AnalysisException: Table or view not found:
SomeData;
How can I fix it?
Seems there are a few others with the same issue.
A temporary workaround is to use
dbutils.fs.rm("dbfs:/user/hive/warehouse/SomeData/", true)
to remove the table before re-creating it.
This generally happens when a cluster is shutdown while writing a table. The recomended solution from Databricks documentation:
This flag deletes the _STARTED directory and returns the process to the original state. For example, you can set it in the notebook
%py
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")
All of the other recommended solutions here are either workarounds or do not work. The mode is specified as overwrite, meaning you should not need to delete or remove the db or use legacy options.
Instead, try specifying the fully qualified path in the options when writing the table:
df.write \
.option("path", "hdfs://cluster_name/path/to/my_db") \
.mode("overwrite") \
.saveAsTable("my_db.my_table")
For a more context-free answer, run this in your notebook:
dbutils.fs.rm("dbfs:/user/hive/warehouse/SomeData", recurse=True)
Per Databricks's documentation, this will work in a Python or Scala notebook, but you'll have to use the magic command %python at the beginning of the cell if you're using an R or SQL notebook.
I have the same issue, I am using
create table if not exists USING delta
If I first delete the files lie suggested, it creates it once, but second time the problem repeats, It seems the create table not exists does not recognize the table and tries to create it anyway
I don't want to delete the table every time, I'm actually trying to use MERGE on keep the table.
Well, this happens because you're trying to write data to the default location (without specifying the 'path' option) with the mode 'overwrite'.
Like said Mike you can set "spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation" to "true", but this option was removed in Spark 3.0.0.
If you try to set this option in Spark 3.0.0 you will get the following exception:
Caused by: org.apache.spark.sql.AnalysisException: The SQL config 'spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation' was removed in the version 3.0.0. It was removed to prevent loosing of users data for non-default value.;
To avoid this problem you can explicitly specify the path where you're going to save with the 'overwrite' mode.

Error While inserting rows into Kudu using Spark Shell

I am new to Apache Kudu, I installed it on my Ubuntu system and later created a table in it using Apache Spark shell. Now I am trying to insert data into that table using insertRows() for that I am using the but below given command,
kuduContext.insertRows(customersDF, "spark_kudu_tbl")
Where customersDF is a Data Frame and spark_kudu_tbl is a table in the Kudu data base. I am getting below error,
java.lang.NoSuchMethodError: org.apache.kudu.spark.kudu.KuduContext.insertRows(Lorg/apache/spark/sql/Dataset;Ljava/lang/String;)V
... 70 elided
I have tried different options but no one is giving results to me. Can any one give any solution for my question.
From the error message it appears as though you are using wrong kudu-spark artifact, you should use kudu-spark2_2. please start your spark-shell as below (replace the last bit with your kudu version)
spark-shell --packages org.apache.kudu:kudu-spark2_2.11:1.3.0

Path for persistent, managed hive tables in Spark 1.4

I am new to Spark and working on JavaSqlNetworkWordCount example to append the word count in a persistent table. I understand that I can only do it via HiveContext. HiveContext, however, keeps trying to save the table in /user/hive/warehouse/. I have tried changing the path by adding
hiveContext.setConf("hive.metastore.warehouse.dir", "/home/user_name");
and by adding the property
<property><name>hive.metastore.warehouse.dir</name>
<value>/home/user_name</value></property>
$SPARK_HOME/conf/hive-site.xml, but nothing seems to work. If anyone else has faced this problem, please let me know if/how you resolved it. I am using Spark1.4 on my local RHEL5 machine.
I think I solved the problem. It looks like spark-submit was creating a metastore_db directory in root directory of the jar file. If metastore_db exists, then hive-stie.xml values are ignored. As soon as I removed that directory, code picked up values from hive-site.xml. I still cannot set the value of the hive.metastore.warehouse.dir property from the code, though.

InvalidRequestException Keyspace keyspace1 does not exist

I'm trying to connect to a Datastax Community Edition server 2.1.2 via JDBC but I keep getting the following error no matter what I try to do, even when issuing a very basic command like select * from system_traces.events;
InvalidRequestException(why:Keyspace 'keyspace1' does not exist)
Issuing that same command via cqlsh works properly, so it seems to be a JDBC issue.
InvalidRequestException(why:Keyspace 'keyspace1' does not exist)
at org.apache.cassandra.cql.jdbc.CassandraConnection.<init>(CassandraConnection.java:229):229
at org.apache.cassandra.cql.jdbc.CassandraDriver.connect(CassandraDriver.java:92):92
at java.sql.DriverManager.getConnection(DriverManager.java:664):664
at java.sql.DriverManager.getConnection(DriverManager.java:270):270
at railo.commons.db.DBUtil.getConnection(DBUtil.java:109):109
at railo.runtime.db.DatasourceConnectionPool.loadDatasourceConnection(DatasourceConnectionPool.java:89):89
at railo.runtime.db.DatasourceConnectionPool.getDatasourceConnection(DatasourceConnectionPool.java:81):81
at railo.runtime.db.DatasourceManagerImpl.getConnection(DatasourceManagerImpl.java:65):65
at railo.runtime.tag.Query.executeDatasoure(Query.java:696):696 ...
Any ideas? TIA!
InvalidRequestException(why:Keyspace 'keyspace1' does not exist)
This exception means you are trying to query for a keyspace (in this case "Keyspace1") that hasn't yet been added to Cassandra. Try creating the keyspace before querying it.
You're probably doing a select (SELECT * FROM "Keyspace1"."Standard1") that you're not seeing or passing initialisation parameters to JDBC telling it to connect to Keyspace1. Verify that your code isn't looking for the non-existent keyspace by searching through the queries you have, specifically looking for Keyspace1 (or "Keyspace1" since in this case the keyspace name is case-sensitive).
On a side-note, "Keyspace1"."Standard1" tend to be the standard ks.cf pair used for cassandra examples so it would be good to scan your code for them to make sure that they are created before they are queried.

Resources