How to execute Hive queries on Hive 2.1.1 on Spark 2.2.0? - apache-spark

Simple queries, e.g. select, work fine, but when I use aggregate functions, e.g. count, I face errors.
I use beeline to connect to Hive 2.1.1 with Spark 2.2.0 and Hadoop 2.8.
hive-site.xml is as follows:
<property>
<name>hive.execution.engine</name>
<value>spark</value>
<description>
Expects one of [mr, tez, spark].
Chooses execution engine. Options are: mr (Map reduce, default), tez, spark. While MR
remains the default engine for historical reasons, it is itself a historical engine
and is deprecated in Hive 2 line. It may be removed without further warning.
</description>
</property>
<property>
<name>spark.master</name>
<value>spark://master:7077</value>
<description>Spark Master URL</description>
</property>
<property>
<name>spark.eventLog.enabled</name>
<value>true</value>
<description>Spark Event Log</description>
</property>
<property>
<name>spark.eventLog.dir</name>
<value>hdfs://master:8020/user/spark/eventLogging</value>
<description>Spark event log folder</description>
</property>
<property>
<name>spark.executor.memory</name>
<value>512m</value>
<description>Spark executor memory</description>
</property>
<property>
<name>spark.serializer</name>
<value>org.apache.spark.serializer.KryoSerializer</value>
<description>Spark serializer</description>
</property>
<property>
<name>spark.yarn.jars</name>
<value>hdfs://master:9000:/user/spark/spark-jars/*</value>
</property>
<property>
<name>spark.master</name>
<value>spark://master:7077</value>
<description>Spark Master URL</description>
</property>
<property>
<name>spark.eventLog.enabled</name>
<value>true</value>
<description>Spark Event Log</description>
</property>
<property>
<name>spark.eventLog.dir</name>
<value>hdfs://master:8020/user/spark/eventLogging</value>
<description>Spark event log folder</description>
</property>
<property>
<name>spark.executor.memory</name>
<value>512m</value>
<description>Spark executor memory</description>
</property>
<property>
<name>spark.serializer</name>
<value>org.apache.spark.serializer.KryoSerializer</value>
<description>Spark serializer</description>
</property>
<property>
<name>spark.yarn.jars</name>
<value>hdfs://master:9000:/user/spark/spark-jars/*</value>
</property>
When executing select count(*) from table in hive I get the below error:
WARN thrift.ThriftCLIService: Error executing statement:
org.apache.hive.service.cli.HiveSQLException: Error running query: java.lang.NoClassDefFoundError: scala/collection/Iterable
at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:225) ~[hive-service-2.1.1.jar:2.1.1]
at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:276) ~[hive-service-2.1.1.jar:2.1.1]
at org.apache.hive.service.cli.operation.Operation.run(Operation.java:324) ~[hive-service-2.1.1.jar:2.1.1]
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:499) ~[hive-service-2.1.1.jar:2.1.1]
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:486) ~[hive-service-2.1.1.jar:2.1.1]
at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:295) ~[hive-service-2.1.1.jar:2.1.1]
at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:506) [hive-service-2.1.1.jar:2.1.1]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_121]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_121]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_121]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_121]
at org.apache.hive.jdbc.HiveConnection$SynchronizedHandler.invoke(HiveConnection.java:1412) [hive-jdbc-2.1.1.jar:2.1.1]
at com.sun.proxy.$Proxy35.ExecuteStatement(Unknown Source) [?:?]
at org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStatement.java:308) [hive-jdbc-2.1.1.jar:2.1.1]
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:250) [hive-jdbc-2.1.1.jar:2.1.1]

Related

failed to connect hadoop spark

I am a beginner in spark job and spark configuration
I try to submit a spark job, after a few minutes (job is accepted and running during few minutes) the job fails with a connection refused.
User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 2
Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Failed to connect to my.domain.com/myIp:portNumber
I also have this error with jobs success
ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
On my computer, with intellij Idea my job turn, this is not a code mistake
I try several time to change configuration in yarn-site.xml and mapred-site.xml
This is a hadoop hdfs cluster, 3 nodes, 2 cores on each node, 8GB RAM on each node, I try to submit with this command line :
spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.3 --class MyClass --master yarn --deploy-mode cluster myJar.jar
mapred-site.xml :
<property>
<value>yarn</value>
<name>mapreduce.framework.name</name>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>1000</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>1000</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>2000</value>
</property>
yarn-site.xml
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>ipadress</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4000</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>500</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2000</value>
</property>
spark-default.conf
spark.master yarn
spark.driver.memory 1g
spark.history.fs.update.interval 30s
spark.history.ui.port port
spark.core.connection.ack.wait.timeout 600s
spark.default.parallelism 2
spark.executor.memory 2g
spark.cores.max 2
spark.executor.cores 2

How to configure spark thrift user and password

i am trying to connect spark thrift server by beeline, and i started spark thrift as below:
start-thriftserver.sh --master yarn-client --num-executors 2 --conf spark.driver.memory=2g --executor-memory 3g
and the spark conf/hive-site.xml as below:
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:node001:3306/hive?useSSL=false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>hive.server2.authentication</name>
<value>NONE</value>
</property>
<property>
<name>hive.server2.thrift.client.user</name>
<value>root</value>
</property>
<property>
<name>hive.server2.thrift.client.password</name>
<value>123456</value>
</property>
<property>
<name>hive.server2.thrift.port</name>
<value>10001</value>
</property>
<property>
<name>hive.security.authorization.enabled</name>
<value>true</value>
<description>enableor disable the hive clientauthorization</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
</configuration>
when i use beeline cli to access spark thrift server, it hint me input username and passwd, but i didn't input anything, just click enter(no input the username and password configured in hive-site.xml), i also can access spark. How can make the configuration work well.
Thanks :)
beeline> !connect jdbc:hive2://node001:10001
Connecting to jdbc:hive2://node001:10001
Enter username for jdbc:hive2://node001:10001:
Enter password for jdbc:hive2://node001:10001:
18/12/06 16:13:44 INFO Utils: Supplied authorities: node001:10001
18/12/06 16:13:44 INFO Utils: Resolved authority: node001:10001
18/12/06 16:13:45 INFO HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://node001:10001
Connected to: Spark SQL (version 2.4.0)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://node001:10001> show databases;
+-------------------------+--+
| databaseName |
+-------------------------+-- |
| default |
| test |
+-------------------------+--+
7 rows selected (0.142 seconds)
0: jdbc:hive2://node001:10001>

How to configure the Hive cli when using the Spark execution engine?

I have set the hive.execution.engine to spark and also am using a spark-enabled queue. Spark sql is able to access the hive tables - and so is beeline from a directly connected cluster machine.
But the hive cli seems to need additional steps. So far the following have been done:
** Copy the scala libraries to the $HIVE_HOME/libs dir (or we get ClassNotFoundException)
** Run the following at the start of the hive script (or in .hiverc)
set hive.execution.engine=spark;
set mapred.job.queue.name=root.spark.sbg.hos;
However the following error now happens Failed to create spark client.:
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/usr/local/Cellar/hive/2.1.1/libexec/lib/hive-common-2.1.1.jar!/hive-log4j2.properties Async: true
hive (default)> insert into sb.test2 values (1,'ab');
Query ID = sboesch_20171030175629_dc310c9a-519e-4f84-a632-f3a44f1df8c3
Total jobs = 3
Launching Job 1 out of 3
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark client.)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
Has anyone managed to connect to spark backend for hive ? I am connecting via vanilla hive (not Cloudera or Hortonworks or MapR).
you have to start Hive metastore Server separately for accessing hive tables through spark.
Try hive --service metastore in a new Terminal you will get a response like Starting Hive Metastore Server
hive-site.xml
`<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>**mysql metastore username**</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>**mysql metastore DB password**</value>
</property>
<property>
<name>hive.querylog.location</name>
<value>/tmp/hivequerylogs/${user.name}</value>
</property>
<property>
<name>hive.aux.jars.path</name>
<value>file:///usr/local/hive/apache-hive-2.1.1-bin/lib/hive-hbase-handler-2.1.1.jar,file:///usr/local/hive/apache-hive-2.1.1-bin/lib/zookeeper-3.4.6.jar</value>
<description>A comma separated list (with no spaces) of the jar files required for Hive-HBase integration</description>
</property>
<property>
<name>hive.support.concurrency</name>
<value>false</value>
</property>
<property>
<name>hive.server2.enable.doAs</name>
<value>true</value>
</property>
<property>
<name>hive.server2.authentication</name>
<value>PAM</value>
</property>
<property>
<name>hive.server2.custom.authentication.class</name>
<value>org.apache.hive.service.auth.PamAuthenticationProvider</value>
</property>
<property>
<name>hive.server2.authentication.pam.services</name>
<value>sshd,sudo</value>
</property>
<property>
<name>hive.stats.dbclass</name>
<value>jdbc:mysql</value>
</property>
<property>
<name>hive.stats.jdbcdriver</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>hive.session.history.enabled</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<property>
<name>hive.optimize.sort.dynamic.partition</name>
<value>false</value>
</property>
<property>
<name>hive.optimize.insert.dest.volume</name>
<value>false</value>
</property>
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/hive/${user.name}</value>
<description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/<username> is created, with ${hive.scratch.dir.permission}.</description>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
<description/>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>false</value>
<description>creates necessary schema on a startup if one doesn't exist. set this to false, after creating it once</description>
</property>
<property>
<name>datanucleus.schema.autoCreateAll</name>
<value>true</value>
</property>
<property>
<name>datanucleus.schema.validateConstraints</name>
<value>true</value>
</property>
<property>
<name>datanucleus.schema.validateColumns</name>
<value>true</value>
</property>
<property>
<name>datanucleus.schema.validateTables</name>
<value>true</value>
</property>
</configuration>`

AuthorizationException: User not allowed to impersonate User

I wrote a spark job which registers a temp table
and when I expose it via beeline (JDBC client)
$ ./bin/beeline
beeline> !connect jdbc:hive2://IP:10003 -n ram -p xxxx
0: jdbc:hive2://IP> show tables;
+---------------------------------------------+--------------+---------------------+
| tableName | isTemporary |
+---------------------------------------------+--------------+---------------------+
| f238 | true |
+---------------------------------------------+--------------+---------------------+
2 rows selected (0.309 seconds)
0: jdbc:hive2://IP>
I can view the table. When querying I get this error message
0: jdbc:hive2://IP> select * from f238;
Error: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: ram is not allowed to impersonate ram (state=,code=0)
0: jdbc:hive2://IP>
I have this in hive-site.xml,
<property>
<name>hive.metastore.sasl.enabled</name>
<value>false</value>
<description>If true, the metastore Thrift interface will be secured with SASL. Clients must authenticate with Kerberos.</description>
</property>
<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
</property>
<property>
<name>hive.server2.authentication</name>
<value>NONE</value>
</property>
I have this in core-site.xml,
<property>
<name>hadoop.proxyuser.hive.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hive.hosts</name>
<value>*</value>
</property>
full log
ERROR [pool-19-thread-2] thriftserver.SparkExecuteStatementOperation: Error running hive query:
org.apache.hive.service.cli.HiveSQLException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: ram is not allowed to impersonate ram
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Any idea what configuration I am missing?
<property>
<name>hive.server2.enable.doAs</name>
<value>true</value>
</property>
Also if you want user ABC to impersonate all(*), add below properties to your core-site.xml
<property>
<name>hadoop.proxyuser.ABC.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.ABC.hosts</name>
<value>*</value>
</property>
check that the owner of hiveserver2 process is 'hive'

Hadoop: each namenode and datanode only last for a momentary time

Using CentOs 5.4
Three virtual machines(using vmware workstation):master, slave1, slave2. master is used for the namenode, and slave1 slave2 are used for the datanodes.
Hadoop version is hadoop-0.20.1.tar.gz, I have configured all the relative files, and closed the firewall with root user using the command:/sbin/service iptables stop. Then I tried to format namenode and start hadoop in the master(namenode) virtual machine with the following commands, no error was reported.
bin/hadoop namenode -format
bin/start-all.sh
Then I typed the command "jps" in the master machine right now, and found the right result:
5144 JobTracker
4953 NameNode
5079 SecondaryNameNode
5216 Jps
But after about several seconds, when I tried to type the "jps" command, all the virtual machines only have one process: JPS. The following is the result displayed in the namenode(master)
5236 Jps
What's the matter? Or how can I find what caused the matter? Dose it mean that it cannot find any namenode or datanode? Thank you.
Attachment: all the places I have modified:
hadoop-env.sh:
# set java environment
export JAVA_HOME=/usr/jdk1.6.0_13/
core-site.xml:
<configuration>
<property>
<name>master.node</name>
<value>namenode_master</value>
<description>master</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
<description>local dir</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://${master.node}:9000</value>
<description> </description>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>${hadoop.tmp.dir}/hdfs/name</value>
<description>local dir</description>
</property>
<property>
<name>dfs.data.dir</name>
<value>${hadoop.tmp.dir}/hdfs/data</value>
<description> </description>
</property>
</configuration>
mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>${master.node}:9001</value>
<description> </description>
</property>
<property>
<name>mapred.local.dir</name>
<value>${hadoop.tmp.dir}/mapred/local</value>
<description> </description>
</property>
<property>
<name>mapred.system.dir</name>
<value>/tmp/mapred/system</value>
<description>hdfs dir</description>
</property>
</configuration>
master:
master
slaves:
slave1
slave2
/etc/hosts:
192.168.190.133 master
192.168.190.134 slave1
192.168.190.135 slave2
From the log files, I found that I should change the namenode_master to master in the file core-site.xml. Now it works.

Resources