I'm trying to modify an existing Azure HDInsight cluster to point at an existing Hive Metastore (hosted on an MSSQL instance). I've changed the following parameters in hive-site.xml to point to the existing Metastore:
"javax.jdo.option.ConnectionDriverName" : "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"javax.jdo.option.ConnectionUserName" : "<<user>>",
"javax.jdo.option.ConnectionPassword" : "<<password>>",
"javax.jdo.option.ConnectionURL" : "jdbc:sqlserver://<<server>>.database.windows.net:1433;database=HiveMetaStoreEast;user=<<user>>;password=<<password>>;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
This seems to have somewhat worked, as I am able to access both the Hive CLI and Hiveserver2 via Beeline. The strange thing is that show databases; outputs different results depending on the client being used. I read that starting with Hive 0.14 (which I am running), more granular configuration is available for Hive/Hiveserver2 using hiveserver2-site.xml, etc. I've tried setting the hive.metastore.uris parameter in hiveserver2-site.xml to match what it shows in hive-site.xml, but I still get the same strange results.
In summary, how can I know for sure the Hiveserver2 and Hive CLI processes are pointed at the same (and correct) Metastore URIs?
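One way to compare the two clients directly (a sketch; the Beeline JDBC URL is an assumption, adjust it for your cluster) is to ask each one for its effective value of hive.metastore.uris and see whether they match:

# From the Hive CLI:
hive -e "set hive.metastore.uris;"

# From Beeline, connected through Hiveserver2:
beeline -u "jdbc:hive2://localhost:10001/" -e "set hive.metastore.uris;"

If Hiveserver2 reports an empty value while the CLI reports the remote URI, the two processes are not talking to the same metastore.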
Just after posting this I found a similar thread on the Hortonworks website: http://hortonworks.com/community/forums/topic/configuration-of-hiveserver2-to-use-a-remote-metastore-server/#post-81960
It appears the startHiveserver2.sh.j2 start script, residing (on my Hive nodes) at /var/lib/ambari-agent/cache/common-services/HIVE/0.12.0.2.0/package/templates/, contains an empty-string CLI override of the hive.metastore.uris parameter. I believe this forces Hiveserver2 to start in local metastore mode, creating inconsistent views between Hive CLI (using the remote URIs) and Beeline (using the local metastore).
See below for the patch that resolved the inconsistency:
--- startHiveserver2.sh.j2 2015-11-25 04:06:15.357996439 +0000
+++ /var/lib/ambari-server/resources/common-services/HIVE/0.12.0.2.0/package/templates/startHiveserver2.sh.j2 2015-11-25 03:43:29.837452851 +0000
@@ -20,5 +20,6 @@
#
HIVE_SERVER2_OPTS=" -hiveconf hive.log.file=hiveserver2.log -hiveconf hive.log.dir=$5"
-HIVE_CONF_DIR=$4 {{hive_bin}}/hiveserver2 -hiveconf hive.metastore.uris=" " ${HIVE_SERVER2_OPTS} > $1 2> $2 &
+#HIVE_CONF_DIR=$4 {{hive_bin}}/hiveserver2 -hiveconf hive.metastore.uris=" " ${HIVE_SERVER2_OPTS} > $1 2> $2 &
+HIVE_CONF_DIR=$4 {{hive_bin}}/hiveserver2 ${HIVE_SERVER2_OPTS} > $1 2> $2 &
echo $!|cat>$3
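After applying the patch and restarting Hiveserver2, one way to confirm the override is gone (a sketch) is to inspect the arguments of the running process:

ps -ef | grep -i [h]iveserver2

With the patched script there should no longer be a -hiveconf hive.metastore.uris=" " argument, so Hiveserver2 picks up the value from hive-site.xml and both clients should return the same show databases; output.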
I need to query a bunch of ticket numbers that come from ServiceNow (I'm using a Cassandra DB).
I query Cassandra using the Presto --execute command, but I can only pass one ticket at a time. I tried using --file but it didn't work:
./prestocli --server https://10.x.x.x:8081 --catalog cassandra --keystore-path=etc/catalog/presto_keystore.jks --keystore-password=xxxxxxx --execute --file /tmp/input.txt --output-format CSV > /tmp/output_1.csv
It failed to process (input.txt is where I have the select statement shown below; I also tried saving the file as input.sql and running that, but no luck).
For now I run a single lookup command per ticket:
./prestocli --server https://10.x.x.x:8081 --catalog cassandra --keystore-path=etc/catalog/presto_keystore.jks --keystore-password=xxxxxxx --execute "select inc_number, inc_state from incident_table where inc_number = 'INCxxxxxx';" --output-format CSV > /tmp/output.csv
Can anyone suggest the best way of doing this with Presto?
Cassandra v3.11.3
Presto v0.215
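As far as I can tell, the Presto CLI expects either --execute or --file, not both, which is likely why the combined command failed. Two options: put the complete, semicolon-terminated statement in a file and pass only --file, or build a single IN list from the ticket numbers and pass it via --execute. A sketch of the second approach (tickets.txt and the awk one-liner are my assumptions; everything else follows the commands above):

# tickets.txt holds one ticket number per line, e.g. INCxxxxxx
IN_LIST=$(awk -v q="'" '{ printf "%s%s%s%s", sep, q, $0, q; sep="," }' tickets.txt)
./prestocli --server https://10.x.x.x:8081 --catalog cassandra \
  --keystore-path=etc/catalog/presto_keystore.jks --keystore-password=xxxxxxx \
  --execute "select inc_number, inc_state from incident_table where inc_number in (${IN_LIST})" \
  --output-format CSV > /tmp/output.csv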
A Spark job executed by a Dataproc cluster on Google Cloud gets stuck on a task PythonRDD.scala:446.
The error log says Could not find valid SPARK_HOME while searching ... paths under /hadoop/yarn/nm-local-dir/usercache/root/.
The thing is, SPARK_HOME should be set by default on a Dataproc cluster.
Other Spark jobs that don't use RDDs work just fine.
During the initialization of the cluster I do not reinstall Spark (though I have tried to, which I previously thought caused the issue).
I also found out that all my executors were removed after a minute of running the task.
And yes, I have tried to run the following initialization action and it didn't help:
#!/bin/bash
cat << EOF | tee -a /etc/profile.d/custom_env.sh /etc/*bashrc >/dev/null
export SPARK_HOME=/usr/lib/spark/
EOF
Any help?
I was using a custom mapping function. When I moved the function into a separate file, the problem disappeared.
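For anyone hitting the same thing, this is roughly the pattern that fixed it (a minimal sketch; mappers.py, parse_record and the bucket path are made-up names):

# mappers.py -- the mapping function lives in its own module
def parse_record(line):
    # whatever per-record transformation the job needs
    return line.strip().split(",")

# job.py -- the main script submitted to Dataproc
from pyspark import SparkContext
from mappers import parse_record

sc = SparkContext()
sc.addPyFile("mappers.py")  # or submit with --py-files mappers.py
rdd = sc.textFile("gs://my-bucket/input/*.csv").map(parse_record)
print(rdd.take(5))

Functions defined at the top level of the submitted script get pickled together with whatever they reference; moving them into a small module that is shipped to the executors sidesteps that.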
From Spark I am saving a table using:
DataFrame.write().mode(SaveMode.Ignore).format("orc").saveAsTable("myTableName")
The table is getting saved; I can see it using the command hadoop fs -ls /apps/hive/warehouse/test.db, where test is my database name:
drwxr-xr-x - psudhir hdfs 0 2016-01-04 05:02
/apps/hive/warehouse/test.db/myTableName
but when I try to check the tables in Hive I cannot view them, not even with SHOW TABLES from hiveContext.
This worked for me in a Cloudera quick start Virtual Box.
You have to copy the hive-site.xml file (mine is located at /etc/hive/conf.dist/hive-site.xml) to the Spark conf folder (mine is located at /etc/spark/conf/):
sudo cp /etc/hive/conf.dist/hive-site.xml /etc/spark/conf/
Restart Spark and it should work.
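To confirm Spark can now see the metastore, a quick check (a sketch, using the Spark 1.x HiveContext API from pyspark; the question's own code is Java, but the idea is the same):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
sqlContext = HiveContext(sc)
sqlContext.sql("SHOW TABLES IN test").show()  # should now list myTableName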
I think you need to run INVALIDATE METADATA; to refresh the databases and view your new table (note that this is an Impala statement, run from the Impala shell rather than the Hive console).
A Spark cluster has been launched using the ec2/spark-ec2 script from within the branch-1.4 codebase. I can log in to it, and it reports 1 master, 2 slaves:
11:35:10/sparkup2 $ec2/spark-ec2 -i ~/.ssh/hwspark14.pem login hwspark14
Searching for existing cluster hwspark14 in region us-east-1...
Found 1 master, 2 slaves.
Logging into master ec2-54-83-81-165.compute-1.amazonaws.com...
Warning: Permanently added 'ec2-54-83-81-165.compute-1.amazonaws.com,54.83.81.165' (RSA) to the list of known hosts.
Last login: Tue Jun 23 20:44:05 2015 from c-73-222-32-165.hsd1.ca.comcast.net
       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|
https://aws.amazon.com/amazon-linux-ami/2013.03-release-notes/
Amazon Linux version 2015.03 is available.
But... where are they? The only Java processes running are:
Hadoop: NameNode and SecondaryNameNode
Tachyon: Master and Worker
It is a surprise to me that the Spark Master and Workers are not started. When looking for ways to start them manually, it is not at all obvious where the scripts are located.
Hints on why Spark did not start automatically, and where the launch scripts live, would be appreciated. In the meantime I will do an exhaustive:
find / -name start-all.sh
And .. survey says:
[root@ip-10-151-25-94 etc]$ find / -name start-all.sh
/root/persistent-hdfs/bin/start-all.sh
/root/ephemeral-hdfs/bin/start-all.sh
Which suggests to me that Spark was not even installed?
Update: I wonder, is this a bug in 1.4.0? I ran the same set of commands against 1.3.1 and the Spark cluster came up.
There was a bug in the Spark 1.4.0 provisioning script (which spark-ec2 clones from the GitHub repository https://github.com/mesos/spark-ec2/) with similar symptoms: Apache Spark hadn't started. The reason was that the provisioning script failed to download the Spark archive.
Check whether Spark was downloaded and uncompressed on the master host with ls -altr /root/spark; there should be several directories there. From your description it looks like the /root/spark/sbin/start-all.sh script is missing.
Also check the contents of the log file with cat /tmp/spark-ec2_spark.log; it should have information about the uncompressing step.
Another thing to try is to run spark-ec2 with a different provisioning-script branch by adding --spark-ec2-git-branch branch-1.4 to the spark-ec2 command line.
Also, when you run spark-ec2, save all the output and check it for anything suspicious:
spark-ec2 <...args...> 2>&1 | tee start.log
I have followed the steps below to configure Hive 0.8.1 in Cygwin. Hive starts properly, as I get the Hive CLI when I type hive. But when I run any command in Hive it does not return any response; the command runs in an infinite loop.
Please help if I have missed anything.
Steps to configure Hive
Change the ownership (chown) of the hive folder to your user
Change the permissions of the hive folder to 755
Add this to hive-site.xml:
<property>
<name>hive.exec.scratchdir</name>
<value>/home/yourusername/mydir</value>
<description>Scratch space for Hive jobs</description>
</property>
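It is also worth making sure the scratch directory actually exists and is writable before starting Hive (a sketch, reusing the path from the property above):

mkdir -p /home/yourusername/mydir
chmod 755 /home/yourusername/mydir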
Put the following in the hive lib folder:
hadoop-0.20-core.jar
hive/lib/hive-exec-0.7.1.jar
hive/lib/hive-jdbc-0.7.1.jar
hive/lib/hive-metastore-0.7.1.jar
hive/lib/hive-service-0.7.1.jar
hive/lib/libfb303.jar
lib/commons-logging-1.0.4.jar
slf4j-api-1.6.1.jar
slf4j-log4j12-1.6.1.jar
In hive-env.sh change the following:
# Set HADOOP_HOME to point to a specific hadoop install directory
#here, instead of the path I have given, put your own path to where hadoop is
export HADOOP_HOME=/home/user/Hadoop/hadoop-0.20.205
# Hive Configuration Directory can be controlled by:
#here you specify the conf directory path of hive
export HIVE_CONF_DIR=/home/user/Hadoop/hive-0.8.1/conf
#Folder containing extra libraries required for hive compilation/execution
#can be controlled by:
#here you specify the lib file directory
export HIVE_AUX_JARS_PATH=/home/user/Hadoop/hive-0.8.1/lib
I had this issue; I could successfully run Hive after starting all the Hadoop daemons (NameNode, DataNode, JobTracker and TaskTracker), and by running queries from files using hive -f instead of writing queries directly at the Hive command prompt. You may also use bin/hive -e 'SHOW TABLES'.
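Concretely, something like this (a sketch; the paths reuse the hive-env.sh example above, and /tmp/queries.hql is a made-up file name):

# start all Hadoop daemons (NameNode, DataNode, JobTracker, TaskTracker)
cd /home/user/Hadoop/hadoop-0.20.205
bin/start-all.sh
jps   # verify the daemons are actually up

# run queries from a file instead of the interactive prompt
echo "SHOW TABLES;" > /tmp/queries.hql
hive -f /tmp/queries.hql

# or run a single statement inline
bin/hive -e 'SHOW TABLES'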