Cassandra and Spark Thrift Server Integration - apache-spark

I'm trying to integrate Cassandra and Spark Thrift server. I followed the steps from here
I get the following error while registering the cassandara tables in beeline console.
Error: Error while compiling statement: FAILED: ParseException line 1:23 cannot recognize input near 'USING' 'org' '.' in create table statement (state=42000,code=40000)
Below is the query I run
CREATE TABLE test_data USING org.apache.spark.sql.cassandra OPTIONS (keyspace 'abc', table 'def');
Am I missing something?

add the cassandra connector as a aux jar in hive-site.xml

Related

Azure data bricks external hive metastore creation

I am creating a metastore in azure databricks for azure sql.I have given below commands to cluster config using 7.3 runtime. As mentioned in the documentation
https://learn.microsoft.com/en-us/azure/databricks/data/metastores/external-hive-metastore#spark-options
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://xxx.database.windows.net:1433;database=hivemetastore
spark.hadoop.javax.jdo.option.ConnectionUserName xxxx
datanucleus.fixedDatastore false
spark.hadoop.javax.jdo.option.ConnectionPassword xxxx
datanucleus.autoCreateSchema true
spark.sql.hive.metastore.jars builtin
spark.sql.hive.metastore.version 1.2.1
hive.metastore.schema.verification.record.version false
hive.metastore.schema.verification false
--
After this when I tried to create database metastore I will get cancelled automatically.
Error I am getting in Data section in databricks which I am not able to copy also.
Cluster setting
Command
--Update
According to the error message updated in the comments
The maximum length allowed is 8000, when the the length specified in declaring a VARCHAR column.
WorkAround: Use either VARCHAR(8000) or VARCHAR(MAX) for column 'PARAM_VALUE'. I would prefer using nvarchar(max), cause an nvarchar (MAX) can store up to 2GB of characters.
Apparently found an official record of the know issue!
See Error in CREATE TABLE with external Hive metastore
This is a known issue with MySQL 8.0 when the default charset is
utfmb4.
Try running this to confirm
SELECT default_character_set_name FROM information_schema.SCHEMATA S WHERE schema_name = "<database-name>"
If yes, Refer Solution
You need to update or recreate the database and set the charset to
latin1.
You have 2 options:
Manually run create statements in the Hive database with DEFAULT CHARSET=latin1 at the end of each CREATE TABLE statement.
Setup the database and user accounts. And create the database and run alter database hive character set latin1; before you launch the metastore. (This command sets the default CHARSET for the database. It is applied when the metastore creates tables.)

Setting up Azure SQL External Metastore for Azure Databricks — Invalid column name ‘IS_REWRITE_ENABLED’

I’m attempting to set up an external Hive metastore for Azure Databricks. The Metastore is in Azure SQL and the Hive version is 1.2.1 (included with azure HdInsight 3.6).
I have followed the setup instructions on the “External Apache Hive metastore” page in Azure documentation.
I can see all of the databases and tables in the metastore but if I look at a specific table I get the following.
Caused by: javax.jdo.JDOException: Exception thrown when executing query : SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MTable' AS NUCLEUS_TYPE,A0.CREATE_TIME,A0.LAST_ACCESS_TIME,A0.OWNER,A0.RETENTION,A0.IS_REWRITE_ENABLED,A0.TBL_NAME,A0.TBL_TYPE,A0.TBL_ID FROM TBLS A0 LEFT OUTER JOIN DBS B0 ON A0.DB_ID = B0.DB_ID WHERE A0.TBL_NAME = ? AND B0."NAME" = ?
NestedThrowables:
com.microsoft.sqlserver.jdbc.SQLServerException: Invalid column name 'IS_REWRITE_ENABLED'.
I was expecting to see errors relating to the underlying storage but this appears to be a problem with the metastore.
Anybody have any idea what’s wrong?
The error message seems to suggest that the column IS_REWRITE_ENABLED doesn't exist on table TBLS with alias A0.
Looking at the script for the hive-schema for the derby db, it can help guide you in seeing that the column in question does exist here.
metastore script definition
If you've got admin access to the Azure SQL db, you can alter the table and add the column:
ALTER TABLE TBLS
ADD IS_REWRITE_ENABLED char(1) NOT NULL DEFAULT 'N';
I don't believe this is the actual fix but it does work around the error.

SparkSQL Hive Error : "HikariCP" not found in the CLASSPATH

I configured Hive with mySQL as my metastore. I can enter hive shell and create tables successfully.
Spark version: 2.4.0
Hive version: 3.1.1
When I try to run a SparkSQL program using spark submit, I'm getting the below error.
2019-03-02 15:43:41 WARN HiveMetaStore:622 - Retrying creating default database after error: Error creating transactional connection factory
javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
......
......
Exception in thread "main" org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
......
......
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the "HikariCP" plugin to create a ConnectionPool gave an error : The connection pool plugin of type "HikariCP" was not found in the CLASSPATH!
Please let me know if anyone can help me in this regard.
I don't know if you have already solved this problem. There is my advice.
the default database connection is HikariCP in the hive-site.xml. You can search for this in the hive-site.xml: datanucleus.connectionPoolingType. The value is HikariCP. So you need to change it to dbcp since you use Mysql as your metastore.
And at last, don't forget about adding the mysql-connector-java-5.x.x.jar to the path like
/home/hadoop/spark-2.3.0-bin-hadoop2.7/jars

Connecting to Cassandra with Spark

First, I have bought the new O'Reilly Spark book and tried those Cassandra setup instructions. I've also found other stackoverflow posts and various posts and guides over the web. None of them work as-is. Below is as far as I could get.
This is a test with only a handful of records of dummy test data. I am running the most recent Cassandra 2.0.7 Virtual Box VM provided by plasetcassandra.org linked from the main Cassandra project page.
I downloaded Spark 1.2.1 source and got the latest Cassandra Connector code from github and built both against Scala 2.11. I have JDK 1.8.0_40 and Scala 2.11.6 setup on Mac OS 10.10.2.
I run the spark shell with the cassandra connector loaded:
bin/spark-shell --driver-class-path ../spark-cassandra-connector/spark-cassandra-connector/target/scala-2.11/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar
Then I do what should be a simple row count type test on a test table of four records:
import com.datastax.spark.connector._
sc.stop
val conf = new org.apache.spark.SparkConf(true).set("spark.cassandra.connection.host", "192.168.56.101")
val sc = new org.apache.spark.SparkContext(conf)
val table = sc.cassandraTable("mykeyspace", "playlists")
table.count
I get the following error. What is confusing is that it is getting errors trying to find Cassandra at 127.0.0.1, but it also recognizes the host name that I configured which is 192.168.56.101.
15/03/16 15:56:54 INFO Cluster: New Cassandra host /192.168.56.101:9042 added
15/03/16 15:56:54 INFO CassandraConnector: Connected to Cassandra cluster: Cluster on a Stick
15/03/16 15:56:54 ERROR ServerSideTokenRangeSplitter: Failure while fetching splits from Cassandra
java.io.IOException: Failed to open thrift connection to Cassandra at 127.0.0.1:9160
<snip>
java.io.IOException: Failed to fetch splits of TokenRange(0,0,Set(CassandraNode(/127.0.0.1,/127.0.0.1)),None) from all endpoints: CassandraNode(/127.0.0.1,/127.0.0.1)
BTW, I can also use a configuration file at conf/spark-defaults.conf to do the above without having to close/recreate a spark context or pass in the --driver-clas-path argument. I ultimately hit the same error though, and the above steps seem easier to communicate in this post.
Any ideas?
Check the rpc_address config in your cassandra.yaml file on your cassandra node. It's likely that the spark connector is using that value from the system.local/system.peers tables and it may be set to 127.0.0.1 in your cassandra.yaml.
The spark connector uses thrift to get token range splits from cassandra. Eventually I'm betting this will be replaced as C* 2.1.4 has a new table called system.size_estimates (CASSANDRA-7688). It looks like it's getting the host metadata to find the nearest host and then making the query using thrift on port 9160.

how to use presto to query hive data

I just installed presto and when I use the presto-cli to query hive data, I get the following error:
$ ./presto --server node6:8080 --catalog hive --schema default
presto:default> show tables;
Query 20131113_150006_00002_u8uyp failed: Table hive.information_schema.tables does not exist
The config.properties is:
coordinator=true
datasources=jmx,hive
http-server.http.port=8080
presto-metastore.db.type=h2
presto-metastore.db.filename=/root/h2
task.max-memory=1GB
discovery-server.enabled=true
discovery.uri=`http://node6:8080`
And the hive.properties is:
connector.name=hive-cdh4
hive.metastore.uri=thrift://node6:9083
The hadoop distribution I used is CDH 4.4. I believe it's properly installed and hive can process queries successfully on its own.
Can anyone help me work it out? Any ideas will be appreciated.
As recommended by the Getting Started, I created a controller (jmx only) and a separate worker (jmx,hive), each on separate machines.
What finally solved this for me was to specify the worker's hostname and http-server.http.port as the --server argument to presto. When specifying the controller, it didn't work.
This all makes sense, but I am still wondering what will happen when I have two Presto-Hive workers...
Add more line to etc/catalog/hive.properties
"hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml"
ofcourse check values of path before do it.
presto-metastore.db.filename= <- is this the value for Hive Warehouse
Directory ?
=> this presto's metastore,not hive.
I just figured out what was wrong in my case:
you also have to add following line to $HIVE_HOME/conf/hive-env.sh for informing hive to open thrift port(same as mentioned under hive.metastore.uris property in hive-site.xml file). This port is used by hive client to connect to Metastore through RPC.
export METASTORE_PORT=9084
in the hive-env.sh file in the conf folder.
This should sync your hive with presto.

Resources