Missing hive-site when using spark-submit YARN cluster mode

Missing hive-site when using spark-submit YARN cluster mode - apache-spark

Using HDP 2.5.3 and I've been trying to debug some YARN container classpath issues.
Since HDP includes both Spark 1.6 and 2.0.0, there have been some conflicting versions
Users I support are successfully able to use Spark2 with Hive queries in YARN client mode, but not from cluster mode they get errors about tables not found, or something like that because the Metastore connection isn't established.
I am guessing that setting either --driver-class-path /etc/spark2/conf:/etc/hive/conf or passing --files /etc/spark2/conf/hive-site.xml after spark-submit would work, but why isn't hive-site.xml loaded already from the conf folder?
Accoringing to Hortonworks docs, says hive-site should be placed in $SPARK_HOME/conf, and it is...
I see hdfs-site.xml and core-site.xml, and other files that are part of HADOOP_CONF_DIR, for example, and this is the from the YARN UI container info.
2232355 4 drwx------ 2 yarn hadoop 4096 Aug 2 21:59 ./__spark_conf__
2232379 4 -r-x------ 1 yarn hadoop 2358 Aug 2 21:59 ./__spark_conf__/topology_script.py
2232381 8 -r-x------ 1 yarn hadoop 4676 Aug 2 21:59 ./__spark_conf__/yarn-env.sh
2232392 4 -r-x------ 1 yarn hadoop 569 Aug 2 21:59 ./__spark_conf__/topology_mappings.data
2232398 4 -r-x------ 1 yarn hadoop 945 Aug 2 21:59 ./__spark_conf__/taskcontroller.cfg
2232356 4 -r-x------ 1 yarn hadoop 620 Aug 2 21:59 ./__spark_conf__/log4j.properties
2232382 12 -r-x------ 1 yarn hadoop 8960 Aug 2 21:59 ./__spark_conf__/hdfs-site.xml
2232371 4 -r-x------ 1 yarn hadoop 2090 Aug 2 21:59 ./__spark_conf__/hadoop-metrics2.properties
2232387 4 -r-x------ 1 yarn hadoop 662 Aug 2 21:59 ./__spark_conf__/mapred-env.sh
2232390 4 -r-x------ 1 yarn hadoop 1308 Aug 2 21:59 ./__spark_conf__/hadoop-policy.xml
2232399 4 -r-x------ 1 yarn hadoop 1480 Aug 2 21:59 ./__spark_conf__/__spark_conf__.properties
2232389 4 -r-x------ 1 yarn hadoop 1602 Aug 2 21:59 ./__spark_conf__/health_check
2232385 4 -r-x------ 1 yarn hadoop 913 Aug 2 21:59 ./__spark_conf__/rack_topology.data
2232377 4 -r-x------ 1 yarn hadoop 1484 Aug 2 21:59 ./__spark_conf__/ranger-hdfs-audit.xml
2232383 4 -r-x------ 1 yarn hadoop 1020 Aug 2 21:59 ./__spark_conf__/commons-logging.properties
2232357 8 -r-x------ 1 yarn hadoop 5721 Aug 2 21:59 ./__spark_conf__/hadoop-env.sh
2232391 4 -r-x------ 1 yarn hadoop 281 Aug 2 21:59 ./__spark_conf__/slaves
2232373 8 -r-x------ 1 yarn hadoop 6407 Aug 2 21:59 ./__spark_conf__/core-site.xml
2232393 4 -r-x------ 1 yarn hadoop 812 Aug 2 21:59 ./__spark_conf__/rack-topology.sh
2232394 4 -r-x------ 1 yarn hadoop 1044 Aug 2 21:59 ./__spark_conf__/ranger-hdfs-security.xml
2232395 8 -r-x------ 1 yarn hadoop 4956 Aug 2 21:59 ./__spark_conf__/metrics.properties
2232386 8 -r-x------ 1 yarn hadoop 4221 Aug 2 21:59 ./__spark_conf__/task-log4j.properties
2232380 4 -r-x------ 1 yarn hadoop 64 Aug 2 21:59 ./__spark_conf__/ranger-security.xml
2232372 20 -r-x------ 1 yarn hadoop 19975 Aug 2 21:59 ./__spark_conf__/yarn-site.xml
2232397 4 -r-x------ 1 yarn hadoop 1006 Aug 2 21:59 ./__spark_conf__/ranger-policymgr-ssl.xml
2232374 4 -r-x------ 1 yarn hadoop 29 Aug 2 21:59 ./__spark_conf__/yarn.exclude
2232384 4 -r-x------ 1 yarn hadoop 1606 Aug 2 21:59 ./__spark_conf__/container-executor.cfg
2232396 4 -r-x------ 1 yarn hadoop 1000 Aug 2 21:59 ./__spark_conf__/ssl-server.xml
2232375 4 -r-x------ 1 yarn hadoop 1 Aug 2 21:59 ./__spark_conf__/dfs.exclude
2232359 8 -r-x------ 1 yarn hadoop 7660 Aug 2 21:59 ./__spark_conf__/mapred-site.xml
2232378 16 -r-x------ 1 yarn hadoop 14474 Aug 2 21:59 ./__spark_conf__/capacity-scheduler.xml
2232376 4 -r-x------ 1 yarn hadoop 884 Aug 2 21:59 ./__spark_conf__/ssl-client.xml
As you might see, hive-site is not there, even though I definitely have conf/hive-site.xml for spark-submit to take
[spark#asthad006 conf]$ pwd && ls -l
/usr/hdp/2.5.3.0-37/spark2/conf
total 32
-rw-r--r-- 1 spark spark 742 Mar 6 15:20 hive-site.xml
-rw-r--r-- 1 spark spark 620 Mar 6 15:20 log4j.properties
-rw-r--r-- 1 spark spark 4956 Mar 6 15:20 metrics.properties
-rw-r--r-- 1 spark spark 824 Aug 2 22:24 spark-defaults.conf
-rw-r--r-- 1 spark spark 1820 Aug 2 22:24 spark-env.sh
-rwxr-xr-x 1 spark spark 244 Mar 6 15:20 spark-thrift-fairscheduler.xml
-rw-r--r-- 1 hive hadoop 918 Aug 2 22:24 spark-thrift-sparkconf.conf
So, I don't think I am supposed to place hive-site in HADOOP_CONF_DIR as HIVE_CONF_DIR is separated, but my question is that how do we get Spark2 to pick up the hive-site.xml without needing to manually pass it as a parameter at runtime?
EDIT Naturally, since I'm on HDP I am using Ambari. The previous cluster admin has installed Spark2 clients on all of the machines, so all of the YARN NodeManagers that could be potential Spark drivers should have the same config files

You can use spark property - spark.yarn.dist.files and specify path to hive-site.xml there.

The way I understand it, in local or yarn-client modes...
the Launcher checks whether it needs Kerberos tokens for HDFS, YARN, Hive, HBase
> hive-site.xml is searched in the CLASSPATH by the Hive/Hadoop client libs (including in driver.extraClassPath because the Driver runs inside the Launcher and the merged CLASSPATH is already built at this point)
the Driver checks which kind of metastore to use for internal purposes: a standalone metastore backed by a volatile Derby instance, or a regular Hive metastore
> that's $SPARK_CONF_DIR/hive-site.xml
when using the Hive interface, a Metastore connection is used to read/write Hive metadata in the Driver
> hive-site.xml is searched in the CLASSPATH by the Hive/Hadoop client libs (and the Kerberos token is used, if any)
So you can have one hive-site.xml stating that Spark should use an embedded, in-memory Derby instance to use as a sandbox (in-memory implying "stop leaving all these temp files behind you") while another hive-site.xml gives the actual Hive Metastore URI. And all is well.
Now, in yarn-cluster mode, all that mechanism pretty much explodes in a nasty, undocumented mess.
The Launcher needs its own CLASSPATH settings to create the Kerberos tokens, otherwise it fails silently. Better go to the source code to find out which undocumented Env variable you shoud use.
It may also need an override in some properties because the hard-coded defaults suddenly are not the defaults any more (silently).
The Driver cannot tap the original $SPARK_CONF_DIR, it has to rely on what the Launcher has made available for upload. Does that include a copy of $SPARK_CONF_DIR/hive-site.xml? Looks like it's not the case.
So you are probably using a Derby thing as a stub.
And the Driver has to to do with whatever YARN has forced on the container CLASSPATH, in whatever order.
Besides, the driver.extraClassPath additions do NOT take precedence by default; for that you have to force spark.yarn.user.classpath.first=true (which is translated to the standard Hadoop property whose exact name I can't remember right now, especially since there are multiple props with similar names that may be deprecated and/or not working in Hadoop 2.x)
Think that's bad? Try out connecting to a Kerberized HBase in yarn-cluster mode. The connection is done in the Executors, that's another layer of nastyness. But I disgress.
Bottom line: start your diagnostic again.
A. Are you really, really sure that the mysterious "Metastore connection errors" are caused by missing properties, and specifically the Metastore URI?
B. By the way, are your users explicitly using a HiveContext???
C. What is exactly the CLASSPATH that YARN presents to the Driver JVM, and what is exactly the CLASSPATH that the Driver presents to the Hadoop libs when opening the Metastore connection?
D. If the CLASSPATH built by YARN is messed up for some reason, what would be the minimal fix -- change in precedence rules? addition? both?

In the cluster mode configuration is read from the conf directory of the machine, which runs the driver container, not the one use for spark-submit.

Found an issue with this
You create a org.apache.spark.sql.SQLContext before creating hive context the hive-site.xml is not picked properly when you create hive context.
Solution : Create the hive context before creating another SQL context.

Related

zoo + snapshot files are created very frequently

under folder - /var/hadoop/zookeeper/version-2/
we can see that Zookeeper transaction logs and snapshot files are created very frequently (multiple files in every minute) and that fills up the Filesystem in a very short time.
ROOT CAUSE
One or more application are creating or modifying the znodes too frequently, causing too many transactions in a short duration. This leads to the creation of too many transactional log files and snapshot files since they get rolled over after 100,000 entries by default (as defined by zookeeper property 'snapCount')
-rw-r--r-- 1 zookeeper hadoop 67108880 Jul 28 17:24 log.570021fa92
-rw-r--r-- 1 zookeeper hadoop 490656299 Jul 28 17:24 snapshot.5700232ffa
-rw-r--r-- 1 zookeeper hadoop 67108880 Jul 28 17:29 log.5700232ffc
-rw-r--r-- 1 zookeeper hadoop 490656389 Jul 28 17:29 snapshot.5700249d7f
-rw-r--r-- 1 zookeeper hadoop 67108880 Jul 28 17:33 log.5700249d78
-rw-r--r-- 1 zookeeper hadoop 490656275 Jul 28 17:33 snapshot.570025fdaf
-rw-r--r-- 1 zookeeper hadoop 67108880 Jul 28 17:36 log.570025fdae
-rw-r--r-- 1 zookeeper hadoop 490656275 Jul 28 17:36 snapshot.570026c447
-rw-r--r-- 1 zookeeper hadoop 67108880 Jul 28 17:40 log.570026c449
-rw-r--r-- 1 zookeeper hadoop 490658969 Jul 28 17:40 snapshot.570027caed
-rw-r--r-- 1 zookeeper hadoop 67108880 Jul 28 17:43 log.570027caef
-rw-r--r-- 1 zookeeper hadoop 490658981 Jul 28 17:43 snapshot.570028a0d0
-rw-r--r-- 1 zookeeper hadoop 67108880 Jul 28 17:48 log.570028a0d2
-rw-r--r-- 1 zookeeper hadoop 165081088 Jul 28 17:48 snapshot.57002a0268
-rw-r--r-- 1 zookeeper hadoop 67108880 Jul 28 17:48 log.57002a026b
.
.
.
.
when we opened one of the log as - log.57002a026b we saw encrypted log
any suggestion how to unencrypted the logs above ?
or how to know which is the application thatcreating or modifying the znodes too frequently ?

PROBLEM
Zookeeper transaction logs and snapshot files are created very frequently (multiple files in every minute) and that fills up the FileSystem in a very short time.
ROOT CAUSE
One or more application are creating or modifying the znodes too frequently, causing too many transactions in a short duration. This leads to the creation of too many transactional log files and snapshot files since they get rolled over after 100,000 entries by default (as defined by zookeeper property 'snapCount')
RESOLUTION
The resolution for such cases involves reviewing the zookeeper transaction logs to find the znodes that are updated/created most frequently using the following command on one of the zookeeper servers:
# cd /usr/hdp/current/zookeeper-server
# java -cp zookeeper.jar:lib/* org.apache.zookeeper.server.LogFormatter /hadoop/zookeeper/version-2/logxxx
(where 'dataDir' is set to '/hadoop/zookeeper' within zookeeper configuration)
Once the frequently updating znodes are identified using the above command, one should continue with fixing the related application that is creating such a large number of updates on zookeeper.
An example of such an application that can cause this problem is Hbase, when there are very large number of regions stuck in transition and they repeatedly fail to become online.

FileNotFoundException while deploying pyspark job on YARN cluster

 
Trying to submit the below test.py Spark app on a YARN cluster with the below command
PYSPARK_PYTHON=./venv/venv/bin/python spark-submit --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./venv/venv/bin/python --master yarn --deploy-mode cluster --archives venv#venv test.py
Note: I am not using local mode, but trying to use the python3.7 site-packages under the virtualenv used for building the code in PyCharm. The virtualenv provides the custom app packages that are not provided as cluster services 
This is how the Python project structure looks along with the contents of venv directory
-rw-r--r-- 1 schakrabarti nobody 225908565 Feb 26 13:07 venv.tar.gz
-rw-r--r-- 1 schakrabarti nobody      1313 Feb 26 13:07 test.py
drwxr-xr-x 6 schakrabarti nobody      4096 Feb 26 13:07 venv
drwxr-xr-x 3 schakrabarti nobody 4096 Feb 26 13:07 venv/bin
drwxr-xr-x 3 schakrabarti nobody 4096 Feb 26 13:07 venv/share
-rw-r--r-- 1 schakrabarti nobody   75 Feb 26 13:07 venv/pyvenv.cfg
drwxr-xr-x 2 schakrabarti nobody 4096 Feb 26 13:07 venv/include
drwxr-xr-x 3 schakrabarti nobody 4096 Feb 26 13:07 venv/lib
 Getting the same error of File does not exist - pyspark.zip (as shown below)
java.io.FileNotFoundException: File does not exist: hdfs://hostname-nn1.cluster.domain.com:8020/user/schakrabarti/.sparkStaging/application_1571868585150_999337/pyspark.zip
 Please refer to my comments added on Spark-10795: https://issues.apache.org/jira/browse/SPARK-10795

I apologize if I misunderstood the problem, but according to
PYSPARK_PYTHON=./venv/venv/bin/python spark-submit --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./venv/venv/bin/python --master yarn --deploy-mode cluster --archives venv#venv test.py
you use Yarn cluster, but in your test.py
#test.py
import json
from pyspark.sql import SparkSession
if __name__ == "__main__":
spark = SparkSession.builder \
.appName("Test_App") \
.master("spark://gwrd352n36.red.ygrid.yahoo.com:41767") \
.config("spark.ui.port", "4057") \
.config("spark.executor.memory", "4g") \
.getOrCreate()
print(json.dumps(spark.sparkContext.getConf().getAll(), indent=4))
spark.stop()
you try to connect to Spark standalone cluster
spark://gwrd352n36.red.ygrid.yahoo.com:41767
So, it can be a problem

Cache accumulation of long running spark application

We have long-running spark streaming application in our hadoop cluster. The problem is cache directory size is growing continuously until stop spark application.
Directory : yarn/local/usercache
Now, we restart application periodically. Not smart way...
Can limit the size of directory?
File list example
-r-x------ 1 yarn hadoop 169M Sep 20 14:53 /data/hadoop/yarn/local/usercache/username/filecache/81/appname-SNAPSHOT.jar
-r-x------ 1 yarn hadoop 169M Sep 20 15:55 /data/hadoop/yarn/local/usercache/username/filecache/84/appname-SNAPSHOT.jar
-r-x------ 1 yarn hadoop 169M Sep 20 16:02 /data/hadoop/yarn/local/usercache/username/filecache/87/appname-SNAPSHOT.jar
-r-x------ 1 yarn hadoop 169M Sep 20 17:30 /data/hadoop/yarn/local/usercache/username/filecache/90/appname-SNAPSHOT.jar
-r-x------ 1 yarn hadoop 169M Sep 21 10:55 /data/hadoop/yarn/local/usercache/username/filecache/93/appname-SNAPSHOT.jar
-r-x------ 1 yarn hadoop 169M Sep 21 11:01 /data/hadoop/yarn/local/usercache/username/filecache/96/appname-SNAPSHOT.jar
-r-x------ 1 yarn hadoop 169M Sep 21 12:14 /data/hadoop/yarn/local/usercache/username/filecache/99/appname-SNAPSHOT.jar

SPARK 2.2 is not picking up /etc/spark2/conf configuration while using in YARN CLUSTER mode [duplicate]

Using HDP 2.5.3 and I've been trying to debug some YARN container classpath issues.
Since HDP includes both Spark 1.6 and 2.0.0, there have been some conflicting versions
Users I support are successfully able to use Spark2 with Hive queries in YARN client mode, but not from cluster mode they get errors about tables not found, or something like that because the Metastore connection isn't established.
I am guessing that setting either --driver-class-path /etc/spark2/conf:/etc/hive/conf or passing --files /etc/spark2/conf/hive-site.xml after spark-submit would work, but why isn't hive-site.xml loaded already from the conf folder?
Accoringing to Hortonworks docs, says hive-site should be placed in $SPARK_HOME/conf, and it is...
I see hdfs-site.xml and core-site.xml, and other files that are part of HADOOP_CONF_DIR, for example, and this is the from the YARN UI container info.
2232355 4 drwx------ 2 yarn hadoop 4096 Aug 2 21:59 ./__spark_conf__
2232379 4 -r-x------ 1 yarn hadoop 2358 Aug 2 21:59 ./__spark_conf__/topology_script.py
2232381 8 -r-x------ 1 yarn hadoop 4676 Aug 2 21:59 ./__spark_conf__/yarn-env.sh
2232392 4 -r-x------ 1 yarn hadoop 569 Aug 2 21:59 ./__spark_conf__/topology_mappings.data
2232398 4 -r-x------ 1 yarn hadoop 945 Aug 2 21:59 ./__spark_conf__/taskcontroller.cfg
2232356 4 -r-x------ 1 yarn hadoop 620 Aug 2 21:59 ./__spark_conf__/log4j.properties
2232382 12 -r-x------ 1 yarn hadoop 8960 Aug 2 21:59 ./__spark_conf__/hdfs-site.xml
2232371 4 -r-x------ 1 yarn hadoop 2090 Aug 2 21:59 ./__spark_conf__/hadoop-metrics2.properties
2232387 4 -r-x------ 1 yarn hadoop 662 Aug 2 21:59 ./__spark_conf__/mapred-env.sh
2232390 4 -r-x------ 1 yarn hadoop 1308 Aug 2 21:59 ./__spark_conf__/hadoop-policy.xml
2232399 4 -r-x------ 1 yarn hadoop 1480 Aug 2 21:59 ./__spark_conf__/__spark_conf__.properties
2232389 4 -r-x------ 1 yarn hadoop 1602 Aug 2 21:59 ./__spark_conf__/health_check
2232385 4 -r-x------ 1 yarn hadoop 913 Aug 2 21:59 ./__spark_conf__/rack_topology.data
2232377 4 -r-x------ 1 yarn hadoop 1484 Aug 2 21:59 ./__spark_conf__/ranger-hdfs-audit.xml
2232383 4 -r-x------ 1 yarn hadoop 1020 Aug 2 21:59 ./__spark_conf__/commons-logging.properties
2232357 8 -r-x------ 1 yarn hadoop 5721 Aug 2 21:59 ./__spark_conf__/hadoop-env.sh
2232391 4 -r-x------ 1 yarn hadoop 281 Aug 2 21:59 ./__spark_conf__/slaves
2232373 8 -r-x------ 1 yarn hadoop 6407 Aug 2 21:59 ./__spark_conf__/core-site.xml
2232393 4 -r-x------ 1 yarn hadoop 812 Aug 2 21:59 ./__spark_conf__/rack-topology.sh
2232394 4 -r-x------ 1 yarn hadoop 1044 Aug 2 21:59 ./__spark_conf__/ranger-hdfs-security.xml
2232395 8 -r-x------ 1 yarn hadoop 4956 Aug 2 21:59 ./__spark_conf__/metrics.properties
2232386 8 -r-x------ 1 yarn hadoop 4221 Aug 2 21:59 ./__spark_conf__/task-log4j.properties
2232380 4 -r-x------ 1 yarn hadoop 64 Aug 2 21:59 ./__spark_conf__/ranger-security.xml
2232372 20 -r-x------ 1 yarn hadoop 19975 Aug 2 21:59 ./__spark_conf__/yarn-site.xml
2232397 4 -r-x------ 1 yarn hadoop 1006 Aug 2 21:59 ./__spark_conf__/ranger-policymgr-ssl.xml
2232374 4 -r-x------ 1 yarn hadoop 29 Aug 2 21:59 ./__spark_conf__/yarn.exclude
2232384 4 -r-x------ 1 yarn hadoop 1606 Aug 2 21:59 ./__spark_conf__/container-executor.cfg
2232396 4 -r-x------ 1 yarn hadoop 1000 Aug 2 21:59 ./__spark_conf__/ssl-server.xml
2232375 4 -r-x------ 1 yarn hadoop 1 Aug 2 21:59 ./__spark_conf__/dfs.exclude
2232359 8 -r-x------ 1 yarn hadoop 7660 Aug 2 21:59 ./__spark_conf__/mapred-site.xml
2232378 16 -r-x------ 1 yarn hadoop 14474 Aug 2 21:59 ./__spark_conf__/capacity-scheduler.xml
2232376 4 -r-x------ 1 yarn hadoop 884 Aug 2 21:59 ./__spark_conf__/ssl-client.xml
As you might see, hive-site is not there, even though I definitely have conf/hive-site.xml for spark-submit to take
[spark#asthad006 conf]$ pwd && ls -l
/usr/hdp/2.5.3.0-37/spark2/conf
total 32
-rw-r--r-- 1 spark spark 742 Mar 6 15:20 hive-site.xml
-rw-r--r-- 1 spark spark 620 Mar 6 15:20 log4j.properties
-rw-r--r-- 1 spark spark 4956 Mar 6 15:20 metrics.properties
-rw-r--r-- 1 spark spark 824 Aug 2 22:24 spark-defaults.conf
-rw-r--r-- 1 spark spark 1820 Aug 2 22:24 spark-env.sh
-rwxr-xr-x 1 spark spark 244 Mar 6 15:20 spark-thrift-fairscheduler.xml
-rw-r--r-- 1 hive hadoop 918 Aug 2 22:24 spark-thrift-sparkconf.conf
So, I don't think I am supposed to place hive-site in HADOOP_CONF_DIR as HIVE_CONF_DIR is separated, but my question is that how do we get Spark2 to pick up the hive-site.xml without needing to manually pass it as a parameter at runtime?
EDIT Naturally, since I'm on HDP I am using Ambari. The previous cluster admin has installed Spark2 clients on all of the machines, so all of the YARN NodeManagers that could be potential Spark drivers should have the same config files

You can use spark property - spark.yarn.dist.files and specify path to hive-site.xml there.

The way I understand it, in local or yarn-client modes...
the Launcher checks whether it needs Kerberos tokens for HDFS, YARN, Hive, HBase
> hive-site.xml is searched in the CLASSPATH by the Hive/Hadoop client libs (including in driver.extraClassPath because the Driver runs inside the Launcher and the merged CLASSPATH is already built at this point)
the Driver checks which kind of metastore to use for internal purposes: a standalone metastore backed by a volatile Derby instance, or a regular Hive metastore
> that's $SPARK_CONF_DIR/hive-site.xml
when using the Hive interface, a Metastore connection is used to read/write Hive metadata in the Driver
> hive-site.xml is searched in the CLASSPATH by the Hive/Hadoop client libs (and the Kerberos token is used, if any)
So you can have one hive-site.xml stating that Spark should use an embedded, in-memory Derby instance to use as a sandbox (in-memory implying "stop leaving all these temp files behind you") while another hive-site.xml gives the actual Hive Metastore URI. And all is well.
Now, in yarn-cluster mode, all that mechanism pretty much explodes in a nasty, undocumented mess.
The Launcher needs its own CLASSPATH settings to create the Kerberos tokens, otherwise it fails silently. Better go to the source code to find out which undocumented Env variable you shoud use.
It may also need an override in some properties because the hard-coded defaults suddenly are not the defaults any more (silently).
The Driver cannot tap the original $SPARK_CONF_DIR, it has to rely on what the Launcher has made available for upload. Does that include a copy of $SPARK_CONF_DIR/hive-site.xml? Looks like it's not the case.
So you are probably using a Derby thing as a stub.
And the Driver has to to do with whatever YARN has forced on the container CLASSPATH, in whatever order.
Besides, the driver.extraClassPath additions do NOT take precedence by default; for that you have to force spark.yarn.user.classpath.first=true (which is translated to the standard Hadoop property whose exact name I can't remember right now, especially since there are multiple props with similar names that may be deprecated and/or not working in Hadoop 2.x)
Think that's bad? Try out connecting to a Kerberized HBase in yarn-cluster mode. The connection is done in the Executors, that's another layer of nastyness. But I disgress.
Bottom line: start your diagnostic again.
A. Are you really, really sure that the mysterious "Metastore connection errors" are caused by missing properties, and specifically the Metastore URI?
B. By the way, are your users explicitly using a HiveContext???
C. What is exactly the CLASSPATH that YARN presents to the Driver JVM, and what is exactly the CLASSPATH that the Driver presents to the Hadoop libs when opening the Metastore connection?
D. If the CLASSPATH built by YARN is messed up for some reason, what would be the minimal fix -- change in precedence rules? addition? both?

In the cluster mode configuration is read from the conf directory of the machine, which runs the driver container, not the one use for spark-submit.

Found an issue with this
You create a org.apache.spark.sql.SQLContext before creating hive context the hive-site.xml is not picked properly when you create hive context.
Solution : Create the hive context before creating another SQL context.

Restoring data after upgrading Cassandra

I'm trying to upgrade from Cassandra to the latest Datastax Enterprise and everything went fine except the fact I can't get my data back.
Basically, I had a clean cassandra after the upgrade, then I recreated the schema and trying to somehow link the files that are left from old db to the new db.
That's what I have right now in /var/lib/cassandra/data/wowch directory for example:
drwxr-x--- 4 cassandra cassandra 4.0K Feb 27 13:05 users-247834809d2011e58d82b7a748b1d9c2/
drwxr-xr-x 2 cassandra cassandra 4.0K Feb 27 18:53 users-f41a5300dd5611e58bc7b7a748b1d9c2/
As I get, the older directory is what was in the db before the upgrade. It contains some db files:
total 144K
drwxr-x--- 4 cassandra cassandra 4.0K Feb 27 13:05 ./
drwxr-x--- 60 cassandra cassandra 20K Feb 27 14:35 ../
drwxr-x--- 2 cassandra cassandra 4.0K Dec 7 21:21 backups/
-rwxr-x--- 2 cassandra cassandra 51 Jan 20 00:05 ma-46-big-CompressionInfo.db*
-rwxr-x--- 2 cassandra cassandra 828 Jan 20 00:05 ma-46-big-Data.db*
-rwxr-x--- 2 cassandra cassandra 10 Jan 20 00:05 ma-46-big-Digest.crc32*
-rwxr-x--- 2 cassandra cassandra 16 Jan 20 00:05 ma-46-big-Filter.db*
-rwxr-x--- 2 cassandra cassandra 83 Jan 20 00:05 ma-46-big-Index.db*
-rwxr-x--- 2 cassandra cassandra 4.9K Jan 20 00:05 ma-46-big-Statistics.db*
-rwxr-x--- 2 cassandra cassandra 92 Jan 20 00:05 ma-46-big-Summary.db*
-rwxr-x--- 2 cassandra cassandra 92 Jan 20 00:05 ma-46-big-TOC.txt*
-rwxr-x--- 2 cassandra cassandra 43 Feb 12 15:05 ma-47-big-CompressionInfo.db*
-rwxr-x--- 2 cassandra cassandra 41 Feb 12 15:05 ma-47-big-Data.db*
-rwxr-x--- 2 cassandra cassandra 10 Feb 12 15:05 ma-47-big-Digest.crc32*
-rwxr-x--- 2 cassandra cassandra 16 Feb 12 15:05 ma-47-big-Filter.db*
-rwxr-x--- 2 cassandra cassandra 20 Feb 12 15:05 ma-47-big-Index.db*
-rwxr-x--- 2 cassandra cassandra 4.5K Feb 12 15:05 ma-47-big-Statistics.db*
-rwxr-x--- 2 cassandra cassandra 92 Feb 12 15:05 ma-47-big-Summary.db*
-rwxr-x--- 2 cassandra cassandra 92 Feb 12 15:05 ma-47-big-TOC.txt*
-rwxr-x--- 2 cassandra cassandra 43 Feb 12 16:05 ma-48-big-CompressionInfo.db*
-rwxr-x--- 2 cassandra cassandra 169 Feb 12 16:05 ma-48-big-Data.db*
-rwxr-x--- 2 cassandra cassandra 10 Feb 12 16:05 ma-48-big-Digest.crc32*
-rwxr-x--- 2 cassandra cassandra 16 Feb 12 16:05 ma-48-big-Filter.db*
-rwxr-x--- 2 cassandra cassandra 20 Feb 12 16:05 ma-48-big-Index.db*
-rwxr-x--- 2 cassandra cassandra 4.9K Feb 12 16:05 ma-48-big-Statistics.db*
-rwxr-x--- 2 cassandra cassandra 92 Feb 12 16:05 ma-48-big-Summary.db*
-rwxr-x--- 2 cassandra cassandra 92 Feb 12 16:05 ma-48-big-TOC.txt*
-rwxr-x--- 1 cassandra cassandra 31 Dec 7 21:26 manifest.json*
drwxr-x--- 3 cassandra cassandra 4.0K Feb 27 13:05 snapshots/
I tried to copy all the stuff from here to the users-f41a5300dd5611e58bc7b7a748b1d9c2/ directory and run nodetool repair or nodetool refresh -- wowch users but had no success — the data is still not loaded.
Did I forget something? What is the right way of doing it and how to get the data back?

Depending on the version of Cassandra/DSE you were previously running you may need to run a nodetool upgradesstables. You can see the documentation here.
https://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsUpgradeSstables.html

It's possible that you've run into this issue but without more info I can't say for sure.
You also haven't provided info on which version you started with and ended. A little more info would be very helpful. Can you also clarify - are you upgrading from community Cassandra to DSE? I couldn't tell from the way your question was worded.
Stuff to check: Do you have the token assignments from the old version? I didn't use vnodes and I found that I had to manually set initial_token in cassandra.yaml after a backup/restore of my cluster. Make sure that cassandra owns all of the dirs and files. After you import the schema, stop DSE and then empty the contents of the commitlog directory. move your data if necessary into the new folders and then restart DSE. Hope this helps.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Missing hive-site when using spark-submit YARN cluster mode - apache-spark

You can use spark property - spark.yarn.dist.files and specify path to hive-site.xml there.

In the cluster mode configuration is read from the conf directory of the machine, which runs the driver container, not the one use for spark-submit.

Found an issue with this You create a org.apache.spark.sql.SQLContext before creating hive context the hive-site.xml is not picked properly when you create hive context. Solution : Create the hive context before creating another SQL context.

Related

zoo + snapshot files are created very frequently

FileNotFoundException while deploying pyspark job on YARN cluster

Cache accumulation of long running spark application

SPARK 2.2 is not picking up /etc/spark2/conf configuration while using in YARN CLUSTER mode [duplicate]

Restoring data after upgrading Cassandra

Categories

Resources