tables available in Spark-SQL CLI are not available over thriftserver - apache-spark

I'm trying to expose my spark-sql tables over JDBC via the thriftserver, but even though it looks like I've successfully connected, it's not working. Here's what I've tried so far.
Database setup:
In pyspark, I loaded a parquet file and created a temp view as tableX,
then performed a .saveAsTable as hive_tableX,
then queried that table: spark.sql("SELECT * FROM hive_tableX LIMIT 1").show(), which returned some data.
At this point, my code should be saving the table information to the Hive metastore, right?
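For reference, the setup roughly looked like this (a sketch; the file path and view/table names are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# load a parquet file and register a temporary view
df = spark.read.parquet("/path/to/data.parquet")
df.createOrReplaceTempView("tableX")

# persist it as a managed table in the Hive metastore
df.write.saveAsTable("hive_tableX")

# sanity check
spark.sql("SELECT * FROM hive_tableX LIMIT 1").show()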
Querying from spark-sql:
I then ran spark-sql and the Spark SQL shell started up.
USE default;
show tables; --> I see my table in there, hive_tableX
SELECT * FROM hive_tableX LIMIT 1; and I see some successful results.
Thus, I believe it is now verified that my table has been saved in the Hive metastore, right?
Then I start the thriftserver:
./sbin/start-thriftserver.sh
Next, I start beeline so I can test the thriftserver connection:
!connect jdbc:hive2://localhost:10000 (and enter username and password)
Then I select the default db: use default;
and show tables; --> there's nothing there.
So, where are my tables? Is beeline or thrift pointing to a different warehouse or something?
Edit: I think my thriftserver isn't using the right warehouse directory, so I'm trying to start it with a config option:
[still nothing] sbin/start-thriftserver.sh --hiveconf spark.sql.warehouse.dir=/code/spark/thrift/spark-warehouse
[still nothing] sbin/start-thriftserver.sh --conf spark.sql.warehouse.dir=/code/spark/thrift/spark-warehouse
Edit: starting it in the same physical directory as where the warehouse was created seems to do the trick. Although, I don't know how to programmatically set the path to something else and start it elsewhere.

The solution to this particular problem was that I was starting thrift from a different directory than where the spark-warehouse and metastore_db were located.
Once I started it from the correct directory, it worked as expected and my tables were now available.
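If you need to avoid the launch-directory dependence, one approach (a sketch, not verified against this exact setup) is to pin both the warehouse directory and the embedded Derby metastore to absolute paths; the same two settings can also be passed to start-thriftserver.sh via --conf so beeline sees the same tables:
from pyspark.sql import SparkSession

# Paths below are placeholders; the key point is that both settings use
# absolute paths, so they no longer depend on the current working directory.
spark = (
    SparkSession.builder
    .config("spark.sql.warehouse.dir", "/code/spark/thrift/spark-warehouse")
    .config("spark.hadoop.javax.jdo.option.ConnectionURL",
            "jdbc:derby:;databaseName=/code/spark/thrift/metastore_db;create=true")
    .enableHiveSupport()
    .getOrCreate()
)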

Related

What is the difference between SparkSessionCatalog and SparkCatalog in Iceberg?

As the title says.
The question comes from the following:
I connect to spark-sql with an Iceberg catalog like this:
bin/spark-sql \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.ice_test2=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.ice_test2.type=hive \
--conf spark.sql.catalog.ice_test2.uri=thrift://xxxxxxx:9083
But when I execute use ice_test2.default;, I get an error:
java.lang.NullPointerException: Delegated SessionCatalog is missing. Please make sure your are replacing Spark's default catalog, named 'spark_catalog'.
Running spark-sql with SparkCatalog instead works fine.
Edit following original question edit:
The way org.apache.iceberg.spark.SparkSessionCatalog works is by first trying to load an Iceberg table with the given identifier and then falling back to the default catalog behaviour for this session catalog.
Since you are using ice_test2 as your catalog name, it doesn't know which SessionCatalog to fall back to.
As the error indicates, if you use spark_catalog instead of ice_test2, it should work.
Quoting the Iceberg documentation for more information about the difference between Iceberg's SparkCatalog and SparkSessionCatalog:
org.apache.iceberg.spark.SparkCatalog - supports a Hive Metastore or a Hadoop warehouse as a catalog
org.apache.iceberg.spark.SparkSessionCatalog - adds support for Iceberg tables to Spark’s built-in catalog, and delegates to the built-in catalog for non-Iceberg tables
Change spark.sql.catalog.ice_test2=org.apache.iceberg.spark.SparkSessionCatalog to spark.sql.catalog.ice_test2=org.apache.iceberg.spark.SparkCatalog
Delete spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
Reference: https://iceberg.apache.org/docs/latest/spark-configuration/#catalogs
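Expressed as a PySpark session instead of spark-sql flags, the suggested configuration would look roughly like this (a sketch; the catalog name and metastore URI are taken from the question, and it assumes the Iceberg runtime jar is already on the classpath):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.ice_test2", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.ice_test2.type", "hive")
    .config("spark.sql.catalog.ice_test2.uri", "thrift://xxxxxxx:9083")
    .getOrCreate()
)

spark.sql("USE ice_test2.default")  # no longer needs a delegated session catalog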

Spark doesn't show all Hive databases

If I list all the databases in Hive, I get the following result (I have 2 databases, default and sbm):
But if I try to do the same thing in Spark, I get this:
It doesn't show the database sbm.
Are you connected to that Hive metastore? Did you specify the metastore details somewhere (i.e. hive-site.xml in the Spark conf directory)? It seems like you are connected to the local metastore.
I think you need to copy your hive-site.xml to the spark/conf directory.
If you use Ubuntu and have defined the environment variables, use the following command:
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf
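After copying the file, a quick way to check that Spark is talking to the right metastore is something like this (a sketch):
from pyspark.sql import SparkSession

# With hive-site.xml in $SPARK_HOME/conf, a Hive-enabled session should see
# the same databases as the Hive CLI, including sbm.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("SHOW DATABASES").show()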

Error While inserting rows into Kudu using Spark Shell

I am new to Apache Kudu. I installed it on my Ubuntu system and later created a table in it using the Apache Spark shell. Now I am trying to insert data into that table using insertRows(), for which I am using the command below:
kuduContext.insertRows(customersDF, "spark_kudu_tbl")
Where customersDF is a DataFrame and spark_kudu_tbl is a table in the Kudu database. I am getting the error below:
java.lang.NoSuchMethodError: org.apache.kudu.spark.kudu.KuduContext.insertRows(Lorg/apache/spark/sql/Dataset;Ljava/lang/String;)V
... 70 elided
I have tried different options but none of them worked. Can anyone suggest a solution?
From the error message it appears as though you are using the wrong kudu-spark artifact; you should use kudu-spark2_2.11. Please start your spark-shell as below (replace the last bit with your Kudu version):
spark-shell --packages org.apache.kudu:kudu-spark2_2.11:1.3.0
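Once the matching artifact is on the classpath, the same insert can also be expressed through the Kudu data source with the DataFrame writer; a rough sketch in PySpark syntax (the master address is a placeholder, and customersDF/spark_kudu_tbl are the names from the question):
# Append-mode save through the Kudu data source performs inserts.
customersDF.write \
    .format("org.apache.kudu.spark.kudu") \
    .option("kudu.master", "localhost:7051") \
    .option("kudu.table", "spark_kudu_tbl") \
    .mode("append") \
    .save()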

How to get Cassandra database dump with data

I need to get a dump (with data) from a remote Cassandra database. I was able to get the database schema via the following command. How can I get all the data in the keyspace?
I'm using Cassandra 1.1.9
echo -e "connect localhost/9260;\r\n use PWC_Keyspace;\r\n show schema;\n" | bin/cassandra-cli -h localhost -port 9260 > dilshan.cdl
With Cassandra 1.1.9, I don't believe you have access to cqlsh with the COPY TO command, so you're stuck with 2 options:
1) Export the data from the data files (sstables) on disk using sstable2json, or
2) Write a program to iterate over every row and copy/serialize it to a format you find easier to work with.
You MAY be able to use a more recent cqlsh (say, from 2.0, which still used Thrift instead of the native interface), point it at your 1.1.9 server, and use COPY TO to export each table to a CSV. However, the COPY command in cqlsh for 2.0 doesn't use paging, and Cassandra 1.1.9 doesn't support paging, so there's a very good chance it will simply time out and fail.
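For option 2, a minimal sketch of such a program using the Thrift-era pycassa client (assuming pycassa is installed; the column family name and output path are placeholders):
import json
import pycassa

# Connect over Thrift, the only interface Cassandra 1.1.9 exposes.
pool = pycassa.ConnectionPool('PWC_Keyspace', server_list=['localhost:9260'])
cf = pycassa.ColumnFamily(pool, 'SomeColumnFamily')  # repeat per column family

# Iterate over every row and serialize it to one JSON document per line.
with open('SomeColumnFamily.json', 'w') as out:
    for key, columns in cf.get_range():
        row = {'key': str(key),
               'columns': dict((str(k), str(v)) for k, v in columns.items())}
        out.write(json.dumps(row) + '\n')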

InvalidRequestException Keyspace keyspace1 does not exist

I'm trying to connect to a DataStax Community Edition server 2.1.2 via JDBC, but I keep getting the following error no matter what I try to do, even when issuing a very basic command like select * from system_traces.events;
InvalidRequestException(why:Keyspace 'keyspace1' does not exist)
Issuing that same command via cqlsh works properly, so it seems to be a JDBC issue.
InvalidRequestException(why:Keyspace 'keyspace1' does not exist)
at org.apache.cassandra.cql.jdbc.CassandraConnection.<init>(CassandraConnection.java:229)
at org.apache.cassandra.cql.jdbc.CassandraDriver.connect(CassandraDriver.java:92)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:270)
at railo.commons.db.DBUtil.getConnection(DBUtil.java:109)
at railo.runtime.db.DatasourceConnectionPool.loadDatasourceConnection(DatasourceConnectionPool.java:89)
at railo.runtime.db.DatasourceConnectionPool.getDatasourceConnection(DatasourceConnectionPool.java:81)
at railo.runtime.db.DatasourceManagerImpl.getConnection(DatasourceManagerImpl.java:65)
at railo.runtime.tag.Query.executeDatasoure(Query.java:696) ...
Any ideas? TIA!
InvalidRequestException(why:Keyspace 'keyspace1' does not exist)
This exception means you are trying to query for a keyspace (in this case "Keyspace1") that hasn't yet been added to Cassandra. Try creating the keyspace before querying it.
You're probably doing a select (SELECT * FROM "Keyspace1"."Standard1") that you're not seeing, or passing initialisation parameters to JDBC telling it to connect to Keyspace1. Verify that your code isn't looking for the non-existent keyspace by searching through the queries you have, specifically looking for Keyspace1 (or "Keyspace1", since in this case the keyspace name is case-sensitive).
On a side note, "Keyspace1"."Standard1" tends to be the standard ks.cf pair used in Cassandra examples, so it would be good to scan your code for them to make sure they are created before they are queried.
