Spark ETL and Spark Thrift Server

Some details:
Spark SQL (version 3.2.1)
Driver: Hive JDBC (version 2.3.9)
ThriftCLIService: Starting ThriftBinaryCLIService on port 10000 with 5...500 worker threads
The BI tool connects via the ODBC driver.
After starting the Spark Thrift Server I'm unable to run a PySpark script using spark-submit, as they both use the same metastore_db.
error:
Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@3acaa384, see the next exception for details.
at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown Source)
... 140 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /tmp/metastore_db.
I need to be able to run PySpark (Spark ETL) jobs while the Spark Thrift Server is up for BI tool queries.
Is there a workaround for this?
Thanks!

In my case the solution was to move the metastore_db to a database server such as MySQL (in my case) or PostgreSQL.
You will have to configure $SPARK_HOME/conf/hive-site.xml and place your JDBC driver in the $SPARK_HOME/jars path.
hive-site.xml example for a MySQL connection:
<configuration>
<!-- Hive Execution Parameters -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://XXX.XXX.XXX.XXX:3306/metastore?createDatabaseIfNotExist=true&amp;useSSL=FALSE&amp;autoReconnect=true&amp;nullCatalogMeansCurrent=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>YOUR_USER</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>YOUR_PASSWORD</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
<name>hive.server2.transport.mode</name>
<value>http</value>
</property>
<property>
<name>hive.server2.thrift.http.port</name>
<value>10000</value>
</property>
<property>
<name>hive.server2.http.endpoint</name>
<value>cliservice</value>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
<description/>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>true</value>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>
<property>
<name>datanucleus.autoCreateTables</name>
<value>true</value>
</property>
<property>
<name>datanucleus.schema.autoCreateTables</name>
<value>true</value>
</property>
</configuration>
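With the metastore on MySQL, the Thrift Server and ad-hoc spark-submit jobs no longer compete for an embedded Derby lock. A minimal sketch of the resulting workflow (the connector jar name and my_etl.py are placeholders, not taken from the question):
cp mysql-connector-java-8.0.xx.jar $SPARK_HOME/jars/
$SPARK_HOME/sbin/start-thriftserver.sh
$SPARK_HOME/bin/spark-submit my_etl.py    # can now run while the Thrift Server is serving BI queries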

Related

RangerAdminRESTClient: Error getting Roles

I'm trying to connect my Apache Spark to Apache Ranger using the Spark AuthZ plugin.
Kerberos is enabled for both Spark and Apache Ranger.
I'm getting this error: "WARN RangerAdminRESTClient: Error getting Roles. secureMode=true, user=hive/host_fqdn@MYCOMPANY.COM (auth:KERBEROS), response={"httpStatusCode":401,"statusCode":0}, serviceName=hive_policy"
Apache Ranger conf files (ll /etc/ranger/admin/conf/):
conf -> /usr/local/ranger-admin/ews/webapp/WEB-INF/classes/conf
core-site.xml -> /etc/hadoop/conf/core-site.xml
java_home.sh
logback.xml
ranger-admin-default-site.xml
ranger-admin-env-hadoopconfdir.sh
ranger-admin-env-logback-conf-file.sh
ranger-admin-env-logdir.sh
ranger-admin-env-piddir.sh
ranger-admin-site.xml
security-applicationContext.xml
ranger-admin-site.xml content:
<configuration>
<property>
<name>ranger.jpa.jdbc.driver</name>
<value>org.postgresql.Driver</value>
<description />
</property>
<property>
<name>ranger.jpa.jdbc.url</name>
<value>jdbc:postgresql://ip#:5432/ranger</value>
<description />
</property>
<property>
<name>ranger.jpa.jdbc.user</name>
<value>rangeradmin</value>
<description />
</property>
<property>
<name>ranger.jpa.jdbc.password</name>
<value>_</value>
<description />
</property>
<property>
<name>ranger.externalurl</name>
<value>http://machine_fqdn:6080</value>
<description />
</property>
<property>
<name>ranger.scheduler.enabled</name>
<value>true</value>
<description />
</property>
<property>
<name>ranger.audit.elasticsearch.urls</name>
<value>127.0.0.1</value>
<description />
</property>
<property>
<name>ranger.audit.elasticsearch.port</name>
<value>9200</value>
<description />
</property>
<property>
<name>ranger.audit.elasticsearch.user</name>
<value />
<description />
</property>
<property>
<name>ranger.audit.elasticsearch.password</name>
<value />
<description />
</property>
<property>
<name>ranger.audit.elasticsearch.index</name>
<value />
<description />
</property>
<property>
<name>ranger.audit.elasticsearch.bootstrap.enabled</name>
<value>true</value>
</property>
<property>
<name>ranger.audit.amazon_cloudwatch.region</name>
<value>us-east-2</value>
</property>
<property>
<name>ranger.audit.amazon_cloudwatch.log_group</name>
<value>ranger_audits</value>
</property>
<property>
<name>ranger.audit.amazon_cloudwatch.log_stream_prefix</name>
<value />
</property>
<property>
<name>ranger.audit.solr.urls</name>
<value>http://##solr_host##:6083/solr/ranger_audits</value>
<description />
</property>
<property>
<name>ranger.audit.source.type</name>
<value>db</value>
<description />
</property>
<property>
<name>ranger.service.http.enabled</name>
<value>true</value>
<description />
</property>
<property>
<name>ranger.authentication.method</name>
<value>NONE</value>
<description />
</property>
<property>
<name>ranger.ldap.url</name>
<value>ldap://</value>
<description />
</property>
<property>
<name>ranger.ldap.user.dnpattern</name>
<value>uid={0},ou=users,dc=xasecure,dc=net</value>
<description />
</property>
<property>
<name>ranger.ldap.group.searchbase</name>
<value>ou=groups,dc=xasecure,dc=net</value>
<description />
</property>
<property>
<name>ranger.ldap.group.searchfilter</name>
<value>(member=uid={0},ou=users,dc=xasecure,dc=net)</value>
<description />
</property>
<property>
<name>ranger.ldap.group.roleattribute</name>
<value>cn</value>
<description />
</property>
<property>
<name>ranger.ldap.base.dn</name>
<value />
<description>LDAP base dn or search base</description>
</property>
<property>
<name>ranger.ldap.bind.dn</name>
<value />
<description>LDAP bind dn or manager dn</description>
</property>
<property>
<name>ranger.ldap.bind.password</name>
<value />
<description>LDAP bind password</description>
</property>
<property>
<name>ranger.ldap.default.role</name>
<value>ROLE_USER</value>
</property>
<property>
<name>ranger.ldap.referral</name>
<value />
<description>follow or ignore</description>
</property>
<property>
<name>ranger.ldap.ad.domain</name>
<value>example.com</value>
<description />
</property>
<property>
<name>ranger.ldap.ad.url</name>
<value />
<description>ldap://</description>
</property>
<property>
<name>ranger.ldap.ad.base.dn</name>
<value>dc=example,dc=com</value>
<description>AD base dn or search base</description>
</property>
<property>
<name>ranger.ldap.ad.bind.dn</name>
<value>cn=administrator,ou=users,dc=example,dc=com</value>
<description>AD bind dn or manager dn</description>
</property>
<property>
<name>ranger.ldap.ad.bind.password</name>
<value />
<description>AD bind password</description>
</property>
<property>
<name>ranger.ldap.ad.referral</name>
<value />
<description>follow or ignore</description>
</property>
<property>
<name>ranger.service.https.attrib.ssl.enabled</name>
<value>false</value>
</property>
<property>
<name>ranger.service.https.attrib.keystore.keyalias</name>
<value>myKey</value>
</property>
<property>
<name>ranger.service.https.attrib.keystore.pass</name>
<value>_</value>
</property>
<property>
<name>ranger.service.host</name>
<value>machine_fqdn</value>
</property>
<property>
<name>ranger.service.http.port</name>
<value>6080</value>
</property>
<property>
<name>ranger.service.https.port</name>
<value>6182</value>
</property>
<property>
<name>ranger.service.https.attrib.keystore.file</name>
<value>/etc/ranger/admin/keys/server.jks</value>
</property>
<property>
<name>ranger.solr.audit.user</name>
<value />
<description />
</property>
<property>
<name>ranger.solr.audit.user.password</name>
<value />
<description />
</property>
<property>
<name>ranger.audit.solr.zookeepers</name>
<value />
<description />
</property>
<property>
<name>ranger.ldap.user.searchfilter</name>
<value>(uid={0})</value>
<description />
</property>
<property>
<name>ranger.ldap.ad.user.searchfilter</name>
<value>(sAMAccountName={0})</value>
<description />
</property>
<property>
<name>ranger.sso.providerurl</name>
<value>https://127.0.0.1:8443/gateway/knoxsso/api/v1/websso</value>
</property>
<property>
<name>ranger.sso.publicKey</name>
<value />
</property>
<property>
<name>ranger.sso.enabled</name>
<value>false</value>
</property>
<property>
<name>ranger.sso.browser.useragent</name>
<value>Mozilla,chrome</value>
</property>
<property>
<name>ranger.admin.kerberos.token.valid.seconds</name>
<value>30</value>
</property>
<property>
<name>ranger.admin.kerberos.cookie.domain</name>
<value>machine_fqdn</value>
</property>
<property>
<name>ranger.admin.kerberos.cookie.path</name>
<value>/</value>
</property>
<property>
<name>ranger.admin.kerberos.principal</name>
<value>hive/_HOST@MYCOMPANY.COM</value>
</property>
<property>
<name>ranger.admin.kerberos.keytab</name>
<value>/etc/security/hdfs.keytab</value>
</property>
<property>
<name>ranger.spnego.kerberos.principal</name>
<value>HTTP/_HOST@MYCOMPANY.COM</value>
</property>
<property>
<name>ranger.spnego.kerberos.keytab</name>
<value>/etc/security/hdfs.keytab</value>
</property>
<property>
<name>ranger.lookup.kerberos.principal</name>
<value>hive/_HOST@MYCOMPANY.COM</value>
</property>
<property>
<name>ranger.lookup.kerberos.keytab</name>
<value>/etc/security/hdfs.keytab</value>
</property>
<property>
<name>ranger.kerberos.principal</name>
<value>hive/_HOST@MYCOMPANY.COM</value>
</property>
<property>
<name>ranger.kerberos.keytab</name>
<value>/etc/security/hdfs.keytab</value>
</property>
<property>
<name>ranger.supportedcomponents</name>
<value />
</property>
<property>
<name>ranger.downloadpolicy.session.log.enabled</name>
<value>false</value>
</property>
<property>
<name>ranger.kms.service.user.hdfs</name>
<value>hdfs</value>
</property>
<property>
<name>ranger.kms.service.user.hive</name>
<value>hive</value>
</property>
<property>
<name>ranger.kms.service.user.hbase</name>
<value>hbase</value>
</property>
<property>
<name>ranger.kms.service.user.om</name>
<value>om</value>
</property>
<property>
<name>ranger.audit.hive.query.visibility</name>
<value>true</value>
<description />
</property>
<property>
<name>ranger.service.https.attrib.keystore.credential.alias</name>
<value>keyStoreCredentialAlias</value>
</property>
<property>
<name>ranger.tomcat.ciphers</name>
<value />
</property>
<property>
<name>ranger.audit.solr.collection.name</name>
<value>ranger_audits</value>
</property>
<property>
<name>ranger.audit.solr.config.name</name>
<value>ranger_audits</value>
</property>
<property>
<name>ranger.audit.solr.configset.location</name>
<value />
</property>
<property>
<name>ranger.audit.solr.no.shards</name>
<value>1</value>
</property>
<property>
<name>ranger.audit.solr.max.shards.per.node</name>
<value>1</value>
</property>
<property>
<name>ranger.audit.solr.no.replica</name>
<value>1</value>
</property>
<property>
<name>ranger.audit.solr.acl.user.list.sasl</name>
<value>solr,infra-solr</value>
</property>
<property>
<name>ranger.audit.solr.bootstrap.enabled</name>
<value />
</property>
<property>
<name>ranger.audit.solr.max.retry</name>
<value />
<description>Maximum no. of retry to setup solr</description>
</property>
<property>
<name>ranger.admin.cookie.name</name>
<value>RANGERADMINSESSIONID</value>
</property>
</configuration>
core-site.xml content:
<configuration>
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>
</configuration>
Spark conf files (ll /opt/spark/conf):
core-site.xml: to enable Kerberos (same as the file above)
hive-site.xml
ranger-admin-default-site.xml (same as above file)
ranger-admin-site.xml (same as above file)
ranger-spark-audit.xml
ranger-spark-security.xml
spark-defaults.conf
hive-site.xml content:
<configuration>
<property>
<name>hive.metastore.local</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>hdfs://ip#:8020/hive/warehouse</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.postgresql.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:postgresql://ip#:5432/hivemetastoredb</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>kube</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>kubeadmin</value>
</property>
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
</property>
<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
</property>
<property>
<name>hive.execution.engine</name>
<value>mr</value>
</property>
<property>
<name>hive.metastore.port</name>
<value>9083</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://hive-metastore-service.my-hdfs.svc.cluster.local:9083</value>
</property>
<property>
<name>mapreduce.input.fileinputformat.input.dir.recursive</name>
<value>true</value>
</property>
<property>
<name>hive.server2.authentication</name>
<value>KERBEROS</value>
<description>authenticationtype</description>
</property>
<property>
<name>hive.server2.authentication.kerberos.principal</name>
<value>hive/_HOST@MYCOMPANY.COM</value>
<description>HiveServer2 principal. If _HOST is used as the FQDN portion, it will be replaced with the actual hostname of the running instance.</description>
</property>
<property>
<name>hive.server2.authentication.kerberos.keytab</name>
<value>/etc/security/hdfs.keytab</value>
<description>Keytab file for HiveServer2 principal</description>
</property>
</configuration>
ranger-spark-security.xml content:
<configuration>
<property>
<name>ranger.plugin.spark.policy.rest.url</name>
<value>http://ranger_machine_ip#:6080</value>
</property>
<property>
<name>ranger.plugin.spark.service.name</name>
<value>hive_policy</value>
</property>
<property>
<name>ranger.plugin.spark.policy.cache.dir</name>
<value>/</value>
</property>
<property>
<name>ranger.plugin.spark.policy.pollIntervalMs</name>
<value>5000</value>
</property>
<property>
<name>ranger.plugin.spark.policy.source.impl</name>
<value>org.apache.ranger.admin.client.RangerAdminRESTClient</value>
</property>
<property>
<name>ranger.plugin.spark.enable.implicit.userstore.enricher</name>
<value>true</value>
<description>Enable UserStoreEnricher for fetching user and group attributes if using macros or scripts in row-filters since Ranger 2.3</description>
</property>
<property>
<name>ranger.plugin.hive.policy.cache.dir</name>
<value>/</value>
<description>As Authz plugin reuses hive service def, a policy cache path is required for caching UserStore and Tags for "hive" service def, while "ranger.plugin.spark.policy.cache.dir config" is the path for caching policies in service. </description>
</property>
</configuration>
spark-defaults.conf content:
spark.kubernetes.driver.master k8s://master_ip#:6443
spark.kubernetes.authenticate.serviceAccountName spark
spark.kubernetes.namespace my-hdfs
spark.executor.memory 1g
spark.driver.memory 2g
spark.kubernetes.container.image spark-3.2.2
spark.storage.memoryFraction 0
spark.executor.cores 1
spark.executor.instances 3
spark.sql.extensions org.apache.kyuubi.plugin.spark.authz.ranger.RangerSparkExtension
spark.kerberos.keytab=/etc/security/hdfs.keytab
spark.kerberos.principal=hive/host_fqdn@MYCOMPANY.COM
spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf
Any help will be appreciated. Thanks!

Make Spark application use all available YARN resources

I am currently using a cluster of 5 Raspberry Pi 4 (4 GB) and I installed Hadoop to manage the resources. Unfortunately I am not able to configure the settings correctly to use the full resources (4 worker nodes, 1 master node) for the Apache Spark application, which I submit on top of the Hadoop framework.
Does somebody know how I have to configure the settings to use the full resources (16 cores, 14 GB RAM) for just one application?
My current settings are:
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.memory-mb</name>
<value>3584</value> <!--512-->
</property>
<property>
<name>mapreduce.map.resource.memory-mb</name>
<value>3584</value> <!--256-->
</property>
<property>
<name>mapreduce.reduce.resource.memory-mb</name>
<value>3584</value> <!--256-->
</property>
</configuration>
yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>pi1</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>3584</value> <!--1536-->
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>8</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>3584</value> <!--1536-->
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>64</value> <!--128-->
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value> <!--128-->
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>8</value> <!--128-->
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>true</value>
</property>
</configuration>
spark-defaults.config
# Example:
# spark.master spark://master:7077
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.master yarn
spark.driver.memory 2048m
spark.yarn.am.memory 512m
spark.executor.memory 1024m
spark.executor.cores 4
#spark.driver.memory 512m
#spark.yarn.am.memory 512m
#spark.executor.memory 512m
spark.eventLog.enabled true
spark.eventLog.dir hdfs://pi1:9000/spark-logs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory hdfs://pi1:9000/spark-logs
spark.history.fs.update.interval 10s
spark.history.ui.port 18080
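As a rough sketch of the arithmetic behind these settings (assuming the defaults of spark.executor.memoryOverhead = max(384 MB, 10% of executor memory) and spark.executor.instances = 2 when dynamic allocation is off): a 1024m executor becomes a ~1408 MB YARN container, so each 3584 MB NodeManager fits two executors, and their 2 x 4 requested vcores match the 8 advertised per node; four workers therefore give at most 8 executors (7 once the 512m + 384m ApplicationMaster takes its slot on one node). To actually fill the cluster you would typically set spark.executor.instances explicitly (e.g. 7 with the sizes above) or enable spark.dynamicAllocation.enabled, rather than relying on the default of 2.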
If somebody has a suggestion, I would be really thankful. :)
P.S.: If more information is required, just tell me.

YARN nodemanager error while running basic sparkpi example

I am running a basic Spark program to test my YARN setup. The job is similar to the example on the website:
spark-submit --master yarn --deploy-mode cluster --num-executors 75 \
  --executor-cores 2 --executor-memory 6g \
  --class org.apache.spark.examples.JavaSparkPi \
  /home/spark/examples/jars/spark_examples.jar 1000
However, the job never terminates and the nodemanagers on the different nodes show this error:
2020-03-16 14:27:42,917 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: couldn't find container container_1584386586744_0001_01_000319 while processing FINISH_CONTAINERS event
I am not sure what's causing this. Any advice is appreciated.
Here is the yarn-site.xml file for the standalone cluster (which causes the error):
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>172.16.1.1</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>10.66.4.100:8088</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>172.16.1.1</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>262144</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>262144</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>56</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>56</value>
</property>
</configuration>
Here is the yarn-site.xml file for the EMR cluster (which works):
<configuration>
<property>
<name>yarn.timeline-service.hostname</name>
<value>ip-172-31-63-120.ec2.internal</value>
</property>
<property>
<name>yarn.web-proxy.address</name>
<value>ip-172-31-63-120.ec2.internal:20888</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>ip-172-31-63-120.ec2.internal:8025</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>ip-172-31-63-120.ec2.internal:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>ip-172-31-63-120.ec2.internal:8030</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log.server.url</name>
<value>http://ip-172-31-63-120.ec2.internal:19888/jobhistory/logs</value>
</property>
<property>
<name>yarn.dispatcher.exit-on-error</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/mnt/yarn,/mnt1/yarn</value>
<final>true</final>
</property>
<property>
<description>Where to store container logs.</description>
<name>yarn.nodemanager.log-dirs</name>
<value>/var/log/hadoop-yarn/containers</value>
</property>
<property>
<description>Where to aggregate logs to.</description>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/var/log/hadoop-yarn/apps</value>
</property>
<property>
<description>Classpath for typical applications.</description>
<name>yarn.application.classpath</name>
<value>
$HADOOP_CONF_DIR,
$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*,
/usr/lib/hadoop-lzo/lib/*,
/usr/share/aws/emr/emrfs/conf,
/usr/share/aws/emr/emrfs/lib/*,
/usr/share/aws/emr/emrfs/auxlib/*,
/usr/share/aws/emr/lib/*,
/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar,
/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar,
/usr/lib/spark/yarn/lib/datanucleus-api-jdo.jar,
/usr/lib/spark/yarn/lib/datanucleus-core.jar,
/usr/lib/spark/yarn/lib/datanucleus-rdbms.jar,
/usr/share/aws/emr/cloudwatch-sink/lib/*,
/usr/share/aws/aws-java-sdk/*
</value>
</property>
<!-- The default setting (2.1) is silly. The virtual memory is not
a limiting factor on 64Bit systems, at least not a limiting
resource, so make it large, very large. -->
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>5</value>
</property>
<property>
<name>yarn.node-labels.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.node-labels.am.default-node-label-expression</name>
<value>CORE</value>
</property>
<property>
<name>yarn.node-labels.fs-store.root-dir</name>
<value>file:///mnt/var/lib/hadoop-yarn/nodelabels</value>
</property>
<property>
<name>yarn.node-labels.configuration-type</name>
<value>distributed</value>
</property>
<property>
<name>yarn.log-aggregation.enable-local-cleanup</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.address</name>
<value>${yarn.nodemanager.hostname}:8041</value>
</property>
<property>
<name>yarn.nodemanager.container-metrics.enable</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.recovery.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.recovery.supervised</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.nodes.exclude-path</name>
<value>/emr/instance-controller/lib/yarn.nodes.exclude.xml</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.cross-origin.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.scheduler.increment-allocation-mb</name>
<value>32</value>
</property>
<property>
<name>yarn.resourcemanager.nodemanagers.heartbeat-interval-ms</name>
<value>250</value>
</property>
<property>
<name>yarn.nodemanager.node-labels.provider</name>
<value>config</value>
</property>
<property>
<name>yarn.nodemanager.node-labels.provider.configured-node-partition</name>
<value>CORE</value>
</property>
<property>
<name>yarn.resourcemanager.system-metrics-publisher.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.timeline-service.http-cross-origin.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.client.thread-count</name>
<value>64</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>4</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.client.thread-count</name>
<value>64</value>
</property>
<property>
<name>yarn.nodemanager.container-manager.thread-count</name>
<value>64</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.client.thread-count</name>
<value>64</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>12288</value>
</property>
<property>
<name>yarn.nodemanager.localizer.client.thread-count</name>
<value>20</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>172800</value>
</property>
<property>
<name>yarn.nodemanager.localizer.fetch.thread-count</name>
<value>20</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>12288</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>128</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>172.31.63.120</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>32</value>
</property>
<property>
<name>yarn.timeline-service.enabled</name>
<value>true</value>
</property>
</configuration>
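As a rough cross-check of the request against the standalone cluster's settings (assuming the default spark.executor.memoryOverhead of max(384 MB, 10% of executor memory)): each 6g executor becomes a roughly 6758 MB container, so 75 executors ask for about 495 GB of memory and 150 vcores in total, while each 262144 MB / 56-vcore NodeManager can hold at most ~38 such containers by memory (28 if vcores are enforced); whether the request fits therefore depends on how many NodeManagers of that size the standalone cluster has, which the question does not state.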

Query hive database using hive context created on spark 2.3.0

I am able to create a Hive context programmatically on Spark 1.6.0 using:
val conf = new SparkConf().setAppName("SparkTest").setMaster("local")
val sc=new SparkContext(conf)
val hc = new HiveContext(sc)
val actualRecordCountHC = hc.sql("select count(*) from hiveorc_replica.appointment")
This is working fine for me.
In the same way, I want to create a Hive context on Spark 2.3.0, but when I run the program it throws the following error:
org.apache.spark.sql.AnalysisException:
Table or view not found: `hiveorc_replica`.`appointment`; line 1 pos 21;
'Aggregate [unresolvedalias(count(1), None)]
'UnresolvedRelation `hiveorc_replica`.`appointment`
I know that HiveContext(sc) has been deprecated in 2.3.0, but when I run these as commands in spark-shell they also give results. Also, I want to make the program generic for both versions of Spark. Can someone please suggest a way of querying Hive tables directly without using the Hive database file names?
Following is the hive-site.xml I am using to connect remotely:
<?xml version="1.0" encoding="UTF-8"?>
<!--Autogenerated by Cloudera Manager-->
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://fqdn:9083</value>
</property>
<property>
<name>hive.metastore.client.socket.timeout</name>
<value>300</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
<property>
<name>hive.warehouse.subdir.inherit.perms</name>
<value>true</value>
</property>
<property>
<name>hive.auto.convert.join</name>
<value>true</value>
</property>
<property>
<name>hive.auto.convert.join.noconditionaltask.size</name>
<value>20971520</value>
</property>
<property>
<name>hive.optimize.bucketmapjoin.sortedmerge</name>
<value>false</value>
</property>
<property>
<name>hive.smbjoin.cache.rows</name>
<value>10000</value>
</property>
<property>
<name>hive.server2.logging.operation.enabled</name>
<value>true</value>
</property>
<property>
<name>hive.server2.logging.operation.log.location</name>
<value>/var/log/hive/operation_logs</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>-1</value>
</property>
<property>
<name>hive.exec.reducers.bytes.per.reducer</name>
<value>67108864</value>
</property>
<property>
<name>hive.exec.copyfile.maxsize</name>
<value>33554432</value>
</property>
<property>
<name>hive.exec.reducers.max</name>
<value>1099</value>
</property>
<property>
<name>hive.vectorized.groupby.checkinterval</name>
<value>4096</value>
</property>
<property>
<name>hive.vectorized.groupby.flush.percent</name>
<value>0.1</value>
</property>
<property>
<name>hive.compute.query.using.stats</name>
<value>false</value>
</property>
<property>
<name>hive.vectorized.execution.enabled</name>
<value>false</value>
</property>
<property>
<name>hive.vectorized.execution.reduce.enabled</name>
<value>false</value>
</property>
<property>
<name>hive.merge.mapfiles</name>
<value>true</value>
</property>
<property>
<name>hive.merge.mapredfiles</name>
<value>false</value>
</property>
<property>
<name>hive.cbo.enable</name>
<value>false</value>
</property>
<property>
<name>hive.fetch.task.conversion</name>
<value>minimal</value>
</property>
<property>
<name>hive.fetch.task.conversion.threshold</name>
<value>268435456</value>
</property>
<property>
<name>hive.limit.pushdown.memory.usage</name>
<value>0.1</value>
</property>
<property>
<name>hive.merge.sparkfiles</name>
<value>true</value>
</property>
<property>
<name>hive.merge.smallfiles.avgsize</name>
<value>16777216</value>
</property>
<property>
<name>hive.merge.size.per.task</name>
<value>268435456</value>
</property>
<property>
<name>hive.optimize.reducededuplication</name>
<value>true</value>
</property>
<property>
<name>hive.optimize.reducededuplication.min.reducer</name>
<value>4</value>
</property>
<property>
<name>hive.map.aggr</name>
<value>true</value>
</property>
<property>
<name>hive.map.aggr.hash.percentmemory</name>
<value>0.5</value>
</property>
<property>
<name>hive.optimize.sort.dynamic.partition</name>
<value>false</value>
</property>
<property>
<name>hive.execution.engine</name>
<value>mr</value>
</property>
<property>
<name>spark.executor.memory</name>
<value>268435456</value>
</property>
<property>
<name>spark.driver.memory</name>
<value>268435456</value>
</property>
<property>
<name>spark.executor.cores</name>
<value>1</value>
</property>
<property>
<name>spark.yarn.driver.memoryOverhead</name>
<value>26</value>
</property>
<property>
<name>spark.yarn.executor.memoryOverhead</name>
<value>26</value>
</property>
<property>
<name>spark.dynamicAllocation.enabled</name>
<value>true</value>
</property>
<property>
<name>spark.dynamicAllocation.initialExecutors</name>
<value>1</value>
</property>
<property>
<name>spark.dynamicAllocation.minExecutors</name>
<value>1</value>
</property>
<property>
<name>spark.dynamicAllocation.maxExecutors</name>
<value>2147483647</value>
</property>
<property>
<name>hive.metastore.execute.setugi</name>
<value>true</value>
</property>
<property>
<name>hive.support.concurrency</name>
<value>true</value>
</property>
<property>
<name>hive.zookeeper.quorum</name>
<value>fqdn</value>
</property>
<property>
<name>hive.zookeeper.client.port</name>
<value>2181</value>
</property>
<property>
<name>hive.zookeeper.namespace</name>
<value>hive_zookeeper_namespace_CD-HIVE-WAyDdBlP</value>
</property>
<property>
<name>hive.cluster.delegation.token.store.class</name>
<value>org.apache.hadoop.hive.thrift.MemoryTokenStore</value>
</property>
<property>
<name>hive.server2.enable.doAs</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.sasl.enabled</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.kerberos.principal</name>
<value>hive/_HOST@EXAMPLE.COM</value>
</property>
<property>
<name>hive.server2.authentication.kerberos.principal</name>
<value>hive/_HOST@EXAMPLE.COM</value>
</property>
<property>
<name>spark.shuffle.service.enabled</name>
<value>true</value>
</property>
<property>
<name>hive.server2.authentication</name>
<value>LDAP</value>
</property>
</configuration>
Here, fqdn is replaced by the HDFS host FQDN at run time, and this works perfectly with Spark 1.6.0.
In Spark 2.x you need to use enableHiveSupport() when creating the SparkSession:
val spark = SparkSession.builder()
.appName("Example")
.master("local")
.config("hive.metastore.uris","thrift://B:PortNumber")
.enableHiveSupport() // <---- This line here
.getOrCreate()
And if you want it to be generic, I think you just need to check the Spark version and create the SparkContext and HiveContext (or SparkSession) accordingly:
if (sparkVersion < 2) {
// create the old way: SparkContext + HiveContext
}
else
{
// create a SparkSession with Hive support and get the SparkContext / SQLContext from it
}
You can get the Spark version programmatically via sc.version (or spark.version).
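To make that branching concrete, here is a minimal sketch (assuming the table from the question, a build against Spark 2.x where HiveContext still exists as a deprecated class, and spark-hive on the classpath):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("SparkTest").setMaster("local")
val sc = new SparkContext(conf)
// sc.version is e.g. "1.6.0" or "2.3.0"; branch on the major version
val sqlRunner = if (sc.version.split("\\.")(0).toInt >= 2) {
  // Spark 2.x: build a SparkSession on the existing SparkContext, with Hive support
  SparkSession.builder().enableHiveSupport().getOrCreate().sqlContext
} else {
  // Spark 1.x: the old (now deprecated) HiveContext
  new HiveContext(sc)
}
val actualRecordCountHC = sqlRunner.sql("select count(*) from hiveorc_replica.appointment")
Of course, a jar built against 2.x will not actually run on a 1.6 cluster, so in practice this branch only helps when the same source is compiled against each version.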

How to encrypt and decrypt Azure Storage Access Key

I have created an Azure VM, installed my Java application on it, and then connected it to WASB storage.
I have added the following JARs and core-site.xml to access WASB storage from the Java application.
azure-storage
hadoop-azure
core-site.xml
<configuration>
<property>
<name>fs.AbstractFileSystem.wasb.impl</name>
<value>org.apache.hadoop.fs.azure.Wasb</value>
</property>
<property>
<name>fs.azure.account.key.STORAGE_ACCOUNT_NAME.blob.core.windows.net</name>
<value>STORAGE ACCESS KEY</value>
</property>
<property>
<name>fs.azure.io.copyblob.retry.max.retries</name>
<value>60</value>
</property>
<property>
<name>fs.azure.io.read.tolerate.concurrent.append</name>
<value>true</value>
</property>
<property>
<name>fs.azure.page.blob.dir</name>
<value>/mapreducestaging,/atshistory,/tezstaging,/ams/hbase/WALs,/ams/hbase/oldWALs,/ams/hbase/MasterProcWALs</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>wasb://STORAGE_CONTAINER_NAME@STORAGE_ACCOUNT_NAME.blob.core.windows.net</value>
<final>true</final>
</property>
<property>
<name>fs.trash.interval</name>
<value>360</value>
</property>
</configuration>
I have used the Storage Access Key directly in core-site.xml, but I want the access key to be encrypted.
When I searched for this, I found the following configuration:
<property>
<name>fs.azure.account.keyprovider.youraccount</name>
<value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
</property>
<property>
<name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
<value>YOUR ENCRYPTED ACCESS KEY</value>
</property>
<property>
<name>fs.azure.shellkeyprovider.script</name>
<value>PATH TO DECRYPTION PROGRAM</value>
</property>
How do I access WASB storage with an encrypted key? Is there a sample available for the above configuration?
Note: I am connecting the Azure VM directly to WASB storage without using an HDInsight cluster.
Key Vault may be the best choice for you to store the secret keys.
https://kamranicus.com/blog/2016/02/24/azure-key-vault-config-encryption-azure/
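For what it's worth, regarding the ShellDecryptionKeyProvider approach from the question: my understanding is that it runs the command configured in fs.azure.shellkeyprovider.script with the encrypted key appended as the last argument, and treats whatever that program prints to stdout as the decrypted key, so the decryption program can be any executable you supply (how the key was encrypted in the first place is up to you). A hypothetical example value (the path is made up):
<property>
<name>fs.azure.shellkeyprovider.script</name>
<value>/usr/local/bin/decrypt-storage-key</value>
<!-- invoked as: /usr/local/bin/decrypt-storage-key ENCRYPTED_KEY; must print the plain key to stdout -->
</property>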
