When I try to list all Hive databases through Spark (1.6):
scala> val tdf = sqlContext.sql("SHOW DATABASES");
tdf: org.apache.spark.sql.DataFrame = [result: string]
scala> tdf.show
+-------+
| result|
+-------+
|default|
+-------+
When I try to list all Hive databases through the Hive shell:
hive> show databases;
OK
default
Time taken: 0.621 seconds, Fetched: 1 row(s)
But my Hive actually already has a lot of databases. Am I missing some configuration on my Cloudera cluster? Or is there a problem with my Hive metastore?
Use HiveContext to fetch data from Hive, and set hive.metastore.uris in one of the following ways.
In Spark code:
System.setProperty("hive.metastore.uris","thrift://hostserver:9083")
val hivecontext = new HiveContext(sparkContext)
val tdf = hivecontext.sql("SHOW DATABASES");
Or when launching spark-shell:
spark-shell --driver-java-options "-Dhive.metastore.uris=thrift://hostserver:9083"
Since the Hive shell also shows only the default database, the Hive metastore configuration should be checked first.
To start with, you can log into the database that holds the metastore and run a query that lists the Hive databases. An example query for a MySQL-backed metastore is:
mysql> SELECT NAME, DB_LOCATION_URI FROM hive.DBS;
Then, verify and update hive-site.xml as shown below. On CDH this file is generally at /usr/lib/hive/conf/hive-site.xml, and on HDP it is generally at /usr/hdp/current/hive-client/conf/hive-site.xml.
Documentation references for configuring the metastore:
a) https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin#AdminManualMetastoreAdmin-RemoteMetastoreDatabase
b) (CDH) https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_ig_hive_metastore_configure.html (Refer to the section: 4. Configure the metastore service to communicate with the MySQL database)
Example configuration:
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
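Once hive-site.xml points at the shared metastore and is on Spark's classpath (for example in $SPARK_HOME/conf), a quick way to verify from Spark 1.6 is to run the same query again through a HiveContext. A minimal sketch, assuming you are in spark-shell where sc already exists:

import org.apache.spark.sql.hive.HiveContext

// sc is the SparkContext provided by spark-shell
val hiveContext = new HiveContext(sc)
// should now list every database in the shared metastore, not just "default"
hiveContext.sql("SHOW DATABASES").show()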
I want to set up my local Spark to allow multiple connections (i.e. notebook, BI tool, application, etc.), so I have to get away from Derby.
My hive-site.xml is as follows:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>spark#localhost</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>spark</value>
</property>
<property>
<name>datanucleus.schema.autoCreateTables</name>
<value>true</value>
</property>
I set "datanucleus.schema.autoCreateTables" to true as suggested by Spark. "createDatabaseIfNotExist=true" does not seem to do anything.
But that still fails with
21/12/26 04:34:20 WARN Datastore: SQL Warning : 'BINARY as attribute of a type' is deprecated and will be removed in a future release. Please use a CHARACTER SET clause with _bin collation instead
21/12/26 04:34:20 ERROR Datastore: Error thrown executing CREATE TABLE `TBLS`
(
`TBL_ID` BIGINT NOT NULL,
`CREATE_TIME` INTEGER NOT NULL,
`DB_ID` BIGINT NULL,
`LAST_ACCESS_TIME` INTEGER NOT NULL,
`OWNER` VARCHAR(767) BINARY NULL,
`RETENTION` INTEGER NOT NULL,
`IS_REWRITE_ENABLED` BIT NOT NULL,
`SD_ID` BIGINT NULL,
`TBL_NAME` VARCHAR(256) BINARY NULL,
`TBL_TYPE` VARCHAR(128) BINARY NULL,
`VIEW_EXPANDED_TEXT` TEXT [CHARACTER SET charset_name] [COLLATE collation_name] NULL,
`VIEW_ORIGINAL_TEXT` TEXT [CHARACTER SET charset_name] [COLLATE collation_name] NULL,
CONSTRAINT `TBLS_PK` PRIMARY KEY (`TBL_ID`)
) ENGINE=INNODB : You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '[CHARACTER SET charset_name] [COLLATE collation_name] NULL,
and so on.
Please advise.
OK, I did it.
So basically I can't rely on Spark to set up the metastore schema automatically, even though it was able to initialize the Derby version.
I had to download both Hadoop and Hive, and use the schematool bundled with Hive to set up the metastore.
Then Spark is able to use it directly.
Alternatively, you could run the Hive-provided schema scripts directly on the DB. Scripts for the different backends are at GitHub / Apache Hive: Metastore Scripts.
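For reference, here is a minimal sketch of pointing Spark at the MySQL-backed metastore once schematool has initialized it. It assumes the hive-site.xml shown above is on Spark's classpath (e.g. in $SPARK_HOME/conf); the app name and warehouse path are placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("mysql-metastore-check")
  // hive-site.xml on the classpath supplies the javax.jdo.option.* connection settings
  .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse") // placeholder location
  .enableHiveSupport()
  .getOrCreate()

// every connection (notebook, BI tool, application, ...) built this way
// now shares the same MySQL-backed metastore instead of a local Derby one
spark.sql("SHOW DATABASES").show()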
I use Spark 2.3.0 and Ignite 2.7.0. I have created custom schemas in Ignite using:
<property name="sqlSchemas">
<list>
<value>dbsch</value>
<value>dbschma</value>
<value>db_ui</value>
</list>
</property>
I can create tables in each of the schemas, but my problem is that I cannot load data into a particular schema.
df.write
.format(FORMAT_IGNITE)
.option(OPTION_CONFIG_FILE, CONFIG)
.option(OPTION_TABLE, "db_ui."+tableName)
.option(OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS, primaryKey)
.option(OPTION_CREATE_TABLE_PARAMETERS, "template=replicated")
.mode(SaveMode.Append)
.save()
It tries to create the table from the DataFrame and eventually fails saying null in Spark config. It works if I load the table into the public (default) schema. I get the same error if I try to read from Ignite. Any input, please?
Ash
I am using HDP-2.6.0.3 but I need Zeppelin 0.8, so I have installed it as an independent service. When I run:
%sql
show tables
I get nothing back and I get 'table not found' when I run Spark2 SQL commands. Tables can be seen in the 0.7 Zeppelin that is part of HDP.
Can anyone tell me what I am missing, for Zeppelin/Spark to see Hive?
The steps I performed to set up Zeppelin 0.8 are as follows:
mvn clean package -DskipTests -Pspark-2.1 -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn -Ppyspark -Psparkr -Pr -Pscala-2.11
Copied zeppelin-site.xml and shiro.ini from /usr/hdp/2.6.0.3-8/zeppelin/conf to /home/ed/zeppelin/conf.
Created /home/ed/zeppelin/conf/zeppelin-env.sh, in which I put the following:
export JAVA_HOME=/usr/jdk64/jdk1.8.0_112
export HADOOP_CONF_DIR=/etc/hadoop/conf
export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.6.0.3-8"
Copied /etc/hive/conf/hive-site.xml to /home/ed/zeppelin/conf
EDIT:
I have also tried:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.appName("interfacing spark sql to hive metastore without configuration file")
.config("hive.metastore.uris", "thrift://s2.royble.co.uk:9083") // replace with your hivemetastore service's thrift url
.config("url", "jdbc:hive2://s2.royble.co.uk:10000/default")
.config("UID", "admin")
.config("PWD", "admin")
.config("driver", "org.apache.hive.jdbc.HiveDriver")
.enableHiveSupport() // don't forget to enable hive support
.getOrCreate()
same result, and:
import java.sql.{DriverManager, Connection, Statement, ResultSet}
val url = "jdbc:hive2://"
val driver = "org.apache.hive.jdbc.HiveDriver"
val user = "admin"
val password = "admin"
Class.forName(driver).newInstance
val conn: Connection = DriverManager.getConnection(url, user, password)
which gives:
java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
ERROR XSDB6: Another instance of Derby may have already booted the database /home/ed/metastore_db
I fixed that error with:
val url = "jdbc:hive2://s2.royble.co.uk:10000"
but still no tables :(
This works:
import java.sql.{DriverManager, Connection, Statement, ResultSet}
val url = "jdbc:hive2://s2.royble.co.uk:10000"
val driver = "org.apache.hive.jdbc.HiveDriver"
val user = "admin"
val password = "admin"
Class.forName(driver).newInstance
val conn: Connection = DriverManager.getConnection(url, user, password)
val r: ResultSet = conn.createStatement.executeQuery("SELECT * FROM tweetsorc0")
but then I have the pain of converting the ResultSet to a DataFrame. I'd rather SparkSession worked and gave me a DataFrame, so I will add a bounty later today.
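For what it's worth, the manual conversion is doable, if tedious. A minimal sketch follows, where the column names and types are hypothetical and would need to match the actual schema of tweetsorc0:

import java.sql.{DriverManager, ResultSet}
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.SparkSession

case class Tweet(id: Long, text: String) // hypothetical schema, adjust to the real table

val spark = SparkSession.builder().appName("jdbc-to-dataframe").getOrCreate()
import spark.implicits._

val conn = DriverManager.getConnection("jdbc:hive2://s2.royble.co.uk:10000", "admin", "admin")
val rs: ResultSet = conn.createStatement.executeQuery("SELECT id, text FROM tweetsorc0")

// walk the ResultSet on the driver and collect the rows locally
val rows = ListBuffer[Tweet]()
while (rs.next()) {
  rows += Tweet(rs.getLong(1), rs.getString(2))
}
conn.close()

// build a DataFrame from the collected rows (fine for small results,
// but note that everything is pulled through the driver)
val df = rows.toList.toDF()
df.show()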
I had a similar problem on Cloudera Hadoop. In my case the problem was that Spark SQL did not see my Hive metastore, so when I used my SparkSession object for Spark SQL I could not see my previously created tables. I managed to solve it by adding the following to zeppelin-env.sh:
export SPARK_HOME=/opt/cloudera/parcels/SPARK2/lib/spark2
export HADOOP_HOME=/opt/cloudera/parcels/CDH
export SPARK_CONF_DIR=/etc/spark/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
(I assume these paths are different on Hortonworks.) I also changed spark.master from local[*] to yarn-client in the Interpreter UI. Most importantly, I manually copied hive-site.xml into /etc/spark/conf/ because I thought it was strange that it was not in that directory, and that solved my problem.
So my advice is to see whether hive-site.xml exists in your SPARK_CONF_DIR, and if not, add it manually. I also found a guide for Hortonworks and Zeppelin in case this does not work.
I'm trying to run a basic Java program using Spark SQL and JDBC, and I'm running into the following error. I'm not sure what's wrong here; most of the material I have read does not discuss what needs to be done to fix this problem.
It would also be great if someone could point me to some good material on Spark SQL (Spark 2.1.1). I'm planning to use Spark to implement ETL, connecting to MySQL and other data sources.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: myschema.mytable; line 1 pos 21;
String MYSQL_CONNECTION_URL = "jdbc:mysql://localhost:3306/myschema";
String MYSQL_USERNAME = "root";
String MYSQL_PWD = "root";
Properties connectionProperties = new Properties();
connectionProperties.put("user", MYSQL_USERNAME);
connectionProperties.put("password", MYSQL_PWD);
Dataset<Row> jdbcDF2 = spark.read()
.jdbc(MYSQL_CONNECTION_URL, "myschema.mytable", connectionProperties);
spark.sql("SELECT COUNT(*) FROM myschema.mytable").show();
That's because by default Spark does not register any tables from any schemas of a JDBC connection in the Spark SQL context. You must register the table yourself:
jdbcDF2.createOrReplaceTempView("mytable");
spark.sql("select count(*) from mytable");
Your jdbcDF2 has its source in myschema.mytable in MySQL and will load data from that table on some action.
Remember that a MySQL table is not the same as a Spark table or view. You are telling Spark to read data from MySQL, but you must register this DataFrame or Dataset as a table or view in the current Spark SQL context or Spark session.
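Put together, a minimal sketch of the whole flow (written in Scala here for brevity; the connection details mirror the question):

import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mysql-count").getOrCreate()

val props = new Properties()
props.put("user", "root")
props.put("password", "root")

// read the MySQL table into a DataFrame...
val jdbcDF = spark.read.jdbc("jdbc:mysql://localhost:3306/myschema", "myschema.mytable", props)
// ...register it as a temp view so Spark SQL can resolve the name...
jdbcDF.createOrReplaceTempView("mytable")
// ...and only then query it through spark.sql
spark.sql("SELECT COUNT(*) FROM mytable").show()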
I just got a single-node cluster up and running with the new DataStax Enterprise 4.0.
It works great. We use Hive to build and query our data.
On the server itself, I can start Hive with
$>dse hive
and query tables just fine.
When I try to use the newest Hive ODBC driver to run the same query, I see this error.
It connects just fine; I can query the keyspace and see the tables. But when I try to run the query, it looks like the map/reduce job gets into the queue, but then errors out with the following:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
java.lang.IllegalArgumentException: Does not contain a valid host:port authority: ${dse.job.tracker}
at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:147)
at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:128)
at org.apache.hadoop.mapred.JobTracker.getAddress(JobTracker.java:2584)
at org.apache.hadoop.mapred.JobClient.init(JobClient.java:474)
at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:457)
at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:402)
at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:144)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
at org.apache.hadoop.hive.service.HiveServer$HiveServerHandler.execute(HiveServer.java:198)
at org.apache.hadoop.hive.service.ThriftHive$Processor$execute.getResult(ThriftHive.java:646)
at org.apache.hadoop.hive.service.ThriftHive$Processor$execute.getResult(ThriftHive.java:630)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:225)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Job Submission failed with exception 'java.lang.IllegalArgumentException(Does not contain a valid host:port authority: ${dse.job.tracker})'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
Any thoughts on what I should try?
Thanks ahead of time for any thoughts, suggestions, or assistance you all can provide.
Cheers,
Eric
I solved the issue by manually configuring the host:port in the mapred-site.xml configuration file.
Just add the lines:
<property>
<name>mapred.job.tracker</name>
<value>host:port</value>
</property>
using the IP address of your Hive server and the port in use (usually 8012).
This will override the default placeholder ${dse.job.tracker} present in the dse-mapred-default.xml configuration file.
The dse.job.tracker property needs to be set in the system properties of the JVM that starts Hadoop jobs. Hadoop will substitute the placeholder with the corresponding system property value if it is defined; otherwise it is left as is, hence the error you see.
For Hive, Pig, and Mahout, the mapred.job.tracker property is set in the bin/dse script as follows:
if [ -z "$HADOOP_JT" ]; then
HADOOP_JT=`$BIN/dsetool jobtracker --use-hadoop-config`
fi
if [ -z "$HADOOP_JT" ]; then
echo "Unable to run $HADOOP_CMD: jobtracker not found"
exit 2
fi
#set the JT param as a JVM arg
export HADOOP_OPTS="$HADOOP_OPTS -Ddse.job.tracker=$HADOOP_JT"
So you should do the same for your program that uses the Hive ODBC driver, and I guess it should be fine.
By hardcoding the Hadoop JT location you make it harder to move the JT to another node, because then you'd have to update the config file manually. Moreover, DSE's automatic JT failover won't work properly if your primary JT goes down, because your program would still try to connect to the old one.
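For illustration, if the JVM that ends up submitting the Hadoop job is one you control (as with the Spark examples earlier in this thread), the property can be supplied the same way, either as a -Ddse.job.tracker=... JVM option or programmatically. A sketch only; the host and port are placeholders:

// Sketch only: dse.job.tracker must be visible to the JVM that actually submits
// the Hadoop job; "hostserver" and 8012 are placeholders for your jobtracker address.
System.setProperty("dse.job.tracker", "hostserver:8012")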
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>localhost:10020</value>
</property>
<property>
<name>mapreduce.jobtracker.address</name>
<value>local</value>
</property>
<property>
<name>mapreduce.jobtracker.http.address</name>
<value>localhost:50030</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>localhost:19888</value>
</property>
</configuration>