AuthorizationException: User not allowed to impersonate User - apache-spark

I wrote a Spark job that registers a temp table,
and I expose it via beeline (a JDBC client):
$ ./bin/beeline
beeline> !connect jdbc:hive2://IP:10003 -n ram -p xxxx
0: jdbc:hive2://IP> show tables;
+------------+--------------+
| tableName  | isTemporary  |
+------------+--------------+
| f238       | true         |
+------------+--------------+
2 rows selected (0.309 seconds)
0: jdbc:hive2://IP>
I can view the table, but when I query it I get this error message:
0: jdbc:hive2://IP> select * from f238;
Error: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: ram is not allowed to impersonate ram (state=,code=0)
0: jdbc:hive2://IP>
I have this in hive-site.xml,
<property>
<name>hive.metastore.sasl.enabled</name>
<value>false</value>
<description>If true, the metastore Thrift interface will be secured with SASL. Clients must authenticate with Kerberos.</description>
</property>
<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
</property>
<property>
<name>hive.server2.authentication</name>
<value>NONE</value>
</property>
I have this in core-site.xml,
<property>
<name>hadoop.proxyuser.hive.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hive.hosts</name>
<value>*</value>
</property>
Full log:
ERROR [pool-19-thread-2] thriftserver.SparkExecuteStatementOperation: Error running hive query:
org.apache.hive.service.cli.HiveSQLException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: ram is not allowed to impersonate ram
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Any idea what configuration I am missing?

In hive-site.xml, set:
<property>
<name>hive.server2.enable.doAs</name>
<value>true</value>
</property>
Also, if you want user ABC to be able to impersonate all users (*), add the properties below to your core-site.xml:
<property>
<name>hadoop.proxyuser.ABC.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.ABC.hosts</name>
<value>*</value>
</property>

Check that the owner of the HiveServer2 process is 'hive', since the hadoop.proxyuser.hive.* properties only apply to a service running as that user.
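The error message names ram, not hive, as the user attempting the impersonation, which suggests the Thrift server process is running as ram. Hadoop looks up proxy-user rules under the name of the user that runs the service, so the hadoop.proxyuser.hive.* entries from the question would not apply in that case. A minimal sketch of the matching core-site.xml entries, assuming the server really does run as ram (the wildcard values are the permissive choice and can be narrowed to specific hosts and groups):

```xml
<!-- core-site.xml: assumes the Thrift server process runs as user "ram" -->
<property>
  <name>hadoop.proxyuser.ram.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.ram.groups</name>
  <value>*</value>
</property>
```

The NameNode has to pick up the change, either through a restart or by running hdfs dfsadmin -refreshSuperUserGroupsConfiguration.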

Related

failed to connect hadoop spark

I am new to Spark jobs and Spark configuration.
I submit a Spark job; it is accepted and runs for a few minutes, then fails with a connection refused error:
User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 2
Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Failed to connect to my.domain.com/myIp:portNumber
I also see this error in jobs that succeed:
ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
On my computer, the job runs fine in IntelliJ IDEA, so this is not a code mistake.
I have tried changing the configuration in yarn-site.xml and mapred-site.xml several times.
This is a Hadoop HDFS cluster with 3 nodes, each with 2 cores and 8 GB of RAM. I submit the job with this command line:
spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.3 --class MyClass --master yarn --deploy-mode cluster myJar.jar
mapred-site.xml:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>1000</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>1000</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>2000</value>
</property>
yarn-site.xml
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>ipadress</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4000</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>500</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2000</value>
</property>
spark-defaults.conf
spark.master yarn
spark.driver.memory 1g
spark.history.fs.update.interval 30s
spark.history.ui.port port
spark.core.connection.ack.wait.timeout 600s
spark.default.parallelism 2
spark.executor.memory 2g
spark.cores.max 2
spark.executor.cores 2

How to configure spark thrift user and password

I am trying to connect to the Spark Thrift Server with beeline. I started the Thrift Server as follows:
start-thriftserver.sh --master yarn-client --num-executors 2 --conf spark.driver.memory=2g --executor-memory 3g
and the Spark conf/hive-site.xml is as follows:
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://node001:3306/hive?useSSL=false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>hive.server2.authentication</name>
<value>NONE</value>
</property>
<property>
<name>hive.server2.thrift.client.user</name>
<value>root</value>
</property>
<property>
<name>hive.server2.thrift.client.password</name>
<value>123456</value>
</property>
<property>
<name>hive.server2.thrift.port</name>
<value>10001</value>
</property>
<property>
<name>hive.security.authorization.enabled</name>
<value>true</value>
<description>enable or disable the Hive client authorization</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
</configuration>
When I use the beeline CLI to access the Spark Thrift Server, it prompts me for a username and password. If I enter nothing and just press Enter (without the username and password configured in hive-site.xml), I can still access Spark. How can I make the configured username and password take effect?
Thanks :)
beeline> !connect jdbc:hive2://node001:10001
Connecting to jdbc:hive2://node001:10001
Enter username for jdbc:hive2://node001:10001:
Enter password for jdbc:hive2://node001:10001:
18/12/06 16:13:44 INFO Utils: Supplied authorities: node001:10001
18/12/06 16:13:44 INFO Utils: Resolved authority: node001:10001
18/12/06 16:13:45 INFO HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://node001:10001
Connected to: Spark SQL (version 2.4.0)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://node001:10001> show databases;
+-------------------------+--+
| databaseName            |
+-------------------------+--+
| default                 |
| test                    |
+-------------------------+--+
7 rows selected (0.142 seconds)
0: jdbc:hive2://node001:10001>
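With hive.server2.authentication set to NONE, the server does not validate credentials, which is consistent with the blank username and password being accepted above. In HiveServer2, one way to enforce a check is the CUSTOM mode, which delegates to a class implementing org.apache.hive.service.auth.PasswdAuthenticationProvider. The sketch below rests on an assumption: com.example.auth.SimplePasswdAuthenticator is a hypothetical class name that you would have to write yourself and place on the Thrift server's classpath.

```xml
<!-- hive-site.xml: replaces the NONE setting shown in the question -->
<property>
  <name>hive.server2.authentication</name>
  <value>CUSTOM</value>
</property>
<property>
  <!-- hypothetical class; it must implement PasswdAuthenticationProvider -->
  <name>hive.server2.custom.authentication.class</name>
  <value>com.example.auth.SimplePasswdAuthenticator</value>
</property>
```

LDAP and PAM are the other HiveServer2 modes that check passwords, and may be simpler if a user directory already exists.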

Connecting Azure Data lake store from Presto DB: No FileSystem for scheme: adl

I am trying to run Presto DB on my local machine and read data from Azure Data Lake Store. Even after adding the JAR files for Azure Data Lake Store, I am not able to fetch the data.
I get the below error:
Query 20181005_191247_00000_wcgur failed: No FileSystem for scheme: adl
com.facebook.presto.spi.PrestoException: No FileSystem for scheme: adl
at com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:189)
at com.facebook.presto.hive.util.ResumableTasks.safeProcessTask(ResumableTasks.java:47)
at com.facebook.presto.hive.util.ResumableTasks.access$000(ResumableTasks.java:20)
at com.facebook.presto.hive.util.ResumableTasks$1.run(ResumableTasks.java:35)
at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: No FileSystem for scheme: adl
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.PrestoFileSystemCache.createFileSystem(PrestoFileSystemCache.java:114)
at org.apache.hadoop.fs.PrestoFileSystemCache.getInternal(PrestoFileSystemCache.java:89)
at org.apache.hadoop.fs.PrestoFileSystemCache.get(PrestoFileSystemCache.java:62)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at com.facebook.presto.hive.HdfsEnvironment.lambda$getFileSystem$0(HdfsEnvironment.java:71)
at com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
at com.facebook.presto.hive.HdfsEnvironment.getFileSystem(HdfsEnvironment.java:70)
at com.facebook.presto.hive.HdfsEnvironment.getFileSystem(HdfsEnvironment.java:64)
at com.facebook.presto.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:282)
at com.facebook.presto.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:256)
at com.facebook.presto.hive.BackgroundHiveSplitLoader.access$300(BackgroundHiveSplitLoader.java:91)
at com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:185)
... 7 more
Reference:
Presto server: presto-server-0.212.tar.gz
Presto CLI: presto-cli-0.212-executable.jar
JARs added to plugin/hive-hadoop2:
hadoop-azure-datalake-3.1.1.jar
azure-data-lake-store-sdk-2.3.2.jar
hadoop-azure-3.1.1.jar
hive.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://hive:9083
hive.config.resources=presto/server/etc/catalog/adls-site.xml
adls-site.xml
<configuration>
<property>
<name>fs.adl.impl</name>
<value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
</property>
<property>
<name>fs.AbstractFileSystem.adl.impl</name>
<value>org.apache.hadoop.fs.adl.Adl</value>
</property>
<property>
<name>fs.adl.oauth2.access.token.provider.type</name>
<value>ClientCredential</value>
</property>
<property>
<name>fs.adl.oauth2.refresh.url</name>
<value>my_url</value>
</property>
<property>
<name>fs.adl.oauth2.client.id</name>
<value>my_id</value>
</property>
<property>
<name>fs.adl.oauth2.credential</name>
<value>my_cred</value>
</property>
</configuration>
Any comments on this would be very helpful. Thanks in advance!

How to execute Hive queries on Hive 2.1.1 on Spark 2.2.0?

Simple queries, e.g. select, work fine, but when I use aggregate functions, e.g. count, I face errors.
I use beeline to connect to Hive 2.1.1 with Spark 2.2.0 and Hadoop 2.8.
hive-site.xml is as follows:
<property>
<name>hive.execution.engine</name>
<value>spark</value>
<description>
Expects one of [mr, tez, spark].
Chooses execution engine. Options are: mr (Map reduce, default), tez, spark. While MR
remains the default engine for historical reasons, it is itself a historical engine
and is deprecated in Hive 2 line. It may be removed without further warning.
</description>
</property>
<property>
<name>spark.master</name>
<value>spark://master:7077</value>
<description>Spark Master URL</description>
</property>
<property>
<name>spark.eventLog.enabled</name>
<value>true</value>
<description>Spark Event Log</description>
</property>
<property>
<name>spark.eventLog.dir</name>
<value>hdfs://master:8020/user/spark/eventLogging</value>
<description>Spark event log folder</description>
</property>
<property>
<name>spark.executor.memory</name>
<value>512m</value>
<description>Spark executor memory</description>
</property>
<property>
<name>spark.serializer</name>
<value>org.apache.spark.serializer.KryoSerializer</value>
<description>Spark serializer</description>
</property>
<property>
<name>spark.yarn.jars</name>
<value>hdfs://master:9000:/user/spark/spark-jars/*</value>
</property>
When executing select count(*) from table in Hive, I get the following error:
WARN thrift.ThriftCLIService: Error executing statement:
org.apache.hive.service.cli.HiveSQLException: Error running query: java.lang.NoClassDefFoundError: scala/collection/Iterable
at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:225) ~[hive-service-2.1.1.jar:2.1.1]
at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:276) ~[hive-service-2.1.1.jar:2.1.1]
at org.apache.hive.service.cli.operation.Operation.run(Operation.java:324) ~[hive-service-2.1.1.jar:2.1.1]
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:499) ~[hive-service-2.1.1.jar:2.1.1]
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:486) ~[hive-service-2.1.1.jar:2.1.1]
at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:295) ~[hive-service-2.1.1.jar:2.1.1]
at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:506) [hive-service-2.1.1.jar:2.1.1]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_121]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_121]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_121]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_121]
at org.apache.hive.jdbc.HiveConnection$SynchronizedHandler.invoke(HiveConnection.java:1412) [hive-jdbc-2.1.1.jar:2.1.1]
at com.sun.proxy.$Proxy35.ExecuteStatement(Unknown Source) [?:?]
at org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStatement.java:308) [hive-jdbc-2.1.1.jar:2.1.1]
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:250) [hive-jdbc-2.1.1.jar:2.1.1]

How to configure the Hive cli when using the Spark execution engine?

I have set the hive.execution.engine to spark and also am using a spark-enabled queue. Spark sql is able to access the hive tables - and so is beeline from a directly connected cluster machine.
But the hive cli seems to need additional steps. So far the following have been done:
** Copy the scala libraries to the $HIVE_HOME/libs dir (or we get ClassNotFoundException)
** Run the following at the start of the hive script (or in .hiverc)
set hive.execution.engine=spark;
set mapred.job.queue.name=root.spark.sbg.hos;
However, the following error now occurs: Failed to create spark client.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/usr/local/Cellar/hive/2.1.1/libexec/lib/hive-common-2.1.1.jar!/hive-log4j2.properties Async: true
hive (default)> insert into sb.test2 values (1,'ab');
Query ID = sboesch_20171030175629_dc310c9a-519e-4f84-a632-f3a44f1df8c3
Total jobs = 3
Launching Job 1 out of 3
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark client.)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
Has anyone managed to connect to the Spark backend for Hive? I am connecting via vanilla Hive (not Cloudera, Hortonworks, or MapR).
You have to start the Hive metastore server separately to access Hive tables through Spark.
Try hive --service metastore in a new terminal; you should get a response like Starting Hive Metastore Server.
hive-site.xml:
<configuration>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>**mysql metastore username**</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>**mysql metastore DB password**</value>
</property>
<property>
<name>hive.querylog.location</name>
<value>/tmp/hivequerylogs/${user.name}</value>
</property>
<property>
<name>hive.aux.jars.path</name>
<value>file:///usr/local/hive/apache-hive-2.1.1-bin/lib/hive-hbase-handler-2.1.1.jar,file:///usr/local/hive/apache-hive-2.1.1-bin/lib/zookeeper-3.4.6.jar</value>
<description>A comma separated list (with no spaces) of the jar files required for Hive-HBase integration</description>
</property>
<property>
<name>hive.support.concurrency</name>
<value>false</value>
</property>
<property>
<name>hive.server2.enable.doAs</name>
<value>true</value>
</property>
<property>
<name>hive.server2.authentication</name>
<value>PAM</value>
</property>
<property>
<name>hive.server2.custom.authentication.class</name>
<value>org.apache.hive.service.auth.PamAuthenticationProvider</value>
</property>
<property>
<name>hive.server2.authentication.pam.services</name>
<value>sshd,sudo</value>
</property>
<property>
<name>hive.stats.dbclass</name>
<value>jdbc:mysql</value>
</property>
<property>
<name>hive.stats.jdbcdriver</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>hive.session.history.enabled</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<property>
<name>hive.optimize.sort.dynamic.partition</name>
<value>false</value>
</property>
<property>
<name>hive.optimize.insert.dest.volume</name>
<value>false</value>
</property>
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/hive/${user.name}</value>
<description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/&lt;username&gt; is created, with ${hive.scratch.dir.permission}.</description>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
<description/>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>false</value>
<description>creates necessary schema on a startup if one doesn't exist. set this to false, after creating it once</description>
</property>
<property>
<name>datanucleus.schema.autoCreateAll</name>
<value>true</value>
</property>
<property>
<name>datanucleus.schema.validateConstraints</name>
<value>true</value>
</property>
<property>
<name>datanucleus.schema.validateColumns</name>
<value>true</value>
</property>
<property>
<name>datanucleus.schema.validateTables</name>
<value>true</value>
</property>
</configuration>
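For clients such as Spark to reach the separately started metastore, they typically need its Thrift address in their own hive-site.xml. A minimal sketch, assuming the metastore runs on the same machine (localhost is a placeholder) and listens on the default port 9083:

```xml
<!-- hive-site.xml on the client side; host name is an assumption -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://localhost:9083</value>
</property>
```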
