Unable to start bin/dse spark-sql. File not found exception /tmp/hive - cassandra

I'm trying to run the following command on DSE Cassandra:
dse$ bin/dse spark-sql
It gives the following error:
2018-05-24 16:59:41 [main] ERROR o.a.s.d.DseSparkSubmitBootstrapper - Failed to start or submit Spark application - see details in the log file(s): /home/aditya/.spark-sql-shell.log
java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwxrwxr-x
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522) ~[hive-exec-1.2.1.spark2.jar:1.2.1.spark2]
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:114) ~[spark-hive-thriftserver_2.11-2.0.2.16.jar:2.0.2.16]
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) ~[spark-hive-thriftserver_2.11-2.0.2.16.jar:2.0.2.16]
I don't understand whether this is a permission issue or something else, but the directory has all the permissions.
Thanks,

I solved my issue. It was because I was not starting Cassandra in Analytics mode, so if you face this problem, make sure you have started Cassandra in Analytics mode with:
bin/dse cassandra -k
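If you want to double-check that the node actually came up with Spark enabled, something like this should report an Analytics workload (a sketch; dsetool ships alongside dse in the same bin directory on the installs I've seen):
# after restarting with -k, the ring output should list the node with an Analytics workload
bin/dsetool status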
Thanks,

Related

Pyspark not creating SparkContext (YARN). Bad gateway or network traffic blocked?

Here is some context about my installation of the pyspark binary.
In my company, we use a Cloudera Data Science Workbench (CDSW). When we create a session for a new project, I'm guessing it's an image built from a specific Dockerfile, and that Dockerfile installs the CDH binaries and configuration.
Now I wish to use those configurations outside CDSW. I have a Kubernetes cluster where I deploy webapps, and I would like to use Spark in YARN mode to allocate very small resources for the webapps.
What I have done is tar.gz all the binaries and config from /opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072 and /var/lib/cdsw/client-config/.
Then I untar.gz them in a container or in a WSL2 instance.
Instead of unpacking everything under /var/ or /opt/ like I should, I've put them in $HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/* and $USER/etc/client-config/*. Why did I do this? Because I might want to use a mounted volume in my Kubernetes cluster and share the binaries between containers.
I've used sed to modify all the configuration files and adapt the paths:
spark-env.sh
topology.py
Any *.txt, *.sh, *.py
So I managed to run beeline, hadoop, hdfs, and hbase by pointing them at the hadoop-conf folder. I can use pyspark, but in local mode only. What I really want is to use pyspark with YARN.
So I set a bunch of env variables to make this work:
export HADOOP_CONF_DIR=$HOME/etc/client-config/spark-conf/yarn-conf
export SPARK_CONF_DIR=$HOME/etc/client-config/spark-conf/yarn-conf
export JAVA_HOME=/usr/local
export BIN_DIR=$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/bin
export PATH=$BIN_DIR:$JAVA_HOME/bin:$PATH
export PYSPARK_PYTHON=python3.6
export PYSPARK_DRIVER_PYTHON=python3.6
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export SPARK_HOME=/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark
export PYSPARK_ARCHIVES_PATH=$(ZIPS=("$CDH_DIR"/lib/spark/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYSPARK_ARCHIVES_PATH
export SPARK_DIST_CLASSPATH=$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/hadoop/client/accessors-smart-1.2.jar:<ALL OTHER JARS FOR EVERY BINARIES>
Anyway, all of the paths exist and work. And since I've sed-edited all the config files, they resolve to the same paths as the exported ones.
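One detail worth flagging in the exports above: the PYSPARK_ARCHIVES_PATH line globs "$CDH_DIR"/lib/spark/python/lib/*.zip, but $CDH_DIR is never defined in this list, and it also appends :$PYSPARK_ARCHIVES_PATH even when that variable is empty, which leaves a trailing colon. Either can produce malformed archive entries, and Edit 1 below confirms the archive URIs were indeed the problem. A more defensive sketch (assuming $CDH_DIR was meant to be the parcel root unpacked under $HOME, and mirroring the local: URIs from Edit 1):
# define the parcel root explicitly before building the archive list
export CDH_DIR="$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072"
# give every entry an explicit local: scheme; bare names fail Spark's URI parsing
export PYSPARK_ARCHIVES_PATH="local:$CDH_DIR/lib/spark/python/lib/py4j-0.10.7-src.zip,local:$CDH_DIR/lib/spark/python/lib/pyspark.zip"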
I launch my pyspark binary like this:
pyspark --conf "spark.master=yarn" --properties-file $HOME/etc/client-config/spark-conf/spark-defaults.conf --verbose
FYI, it is using pyspark 2.4.0, and I've installed Java(TM) SE Runtime Environment (build 1.8.0_131-b11), the same one that I found on the CDSW instance. I added the keystore with the company's public certificate, and I also generated a keytab for the Kerberos auth. Both of them are working, since I can use hdfs with HADOOP_CONF_DIR=$HOME/etc/client-config/hadoop-conf
In verbose mode I can see all the details and configuration from Spark. When I compare it with the CDSW session, they are nearly identical (apart from the modified paths). For example:
Using properties file: /home/docker4sg/etc/client-config/spark-conf/spark-defaults.conf
Adding default property: spark.lineage.log.dir=/var/log/spark/lineage
Adding default property: spark.port.maxRetries=250
Adding default property: spark.serializer=org.apache.spark.serializer.KryoSerializer
Adding default property: spark.driver.log.persistToDfs.enabled=true
Adding default property: spark.yarn.jars=local:/home/docker4sg/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/jars/*,local:/home/docker4sg/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/hive/*
...
After a few seconds it fails to create a SparkSession:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-02-22 14:44:14 WARN Client:760 - Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
2022-02-22 14:44:14 ERROR SparkContext:94 - Error initializing SparkContext.
java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected scheme-specific part at index 12: pyspark.zip:
...
Caused by: java.net.URISyntaxException: Expected scheme-specific part at index 12: pyspark.zip:
...
2022-02-22 14:44:15 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:69 - Attempted to request executors before the AM has registered!
2022-02-22 14:44:15 WARN MetricsSystem:69 - Stopping a MetricsSystem that is not running
2022-02-22 14:44:15 WARN SparkContext:69 - Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
From what I understand, it fails for a reason I'm not sure about and then tries to fall back into another mode, which fails too.
In the configuration file spark-conf/yarn-conf/yarn-site.xml, it is specified that it is using a zookeeper:
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>corporate.machine.node1.name.net:9999,corporate.machine.node2.name.net:9999,corporate.machine.node3.name.net:9999</value>
</property>
Could it be that the YARN cluster does not accept traffic from a random IP (a Kubernetes IP or a personal IP from my computer)? For me, the IP I'm working from is not on the whitelist, and at the moment I cannot ask for my IP to be added to it. How can I know for sure I'm looking in the right direction?
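One cheap way to test that hypothesis from the machine you're on is to check raw TCP reachability to the endpoints in the client config, e.g. the ZooKeeper quorum from the yarn-site.xml excerpt above (a sketch using nc; any port-probe tool works):
# can this host reach the ZooKeeper quorum at all? timeouts suggest a firewall/whitelist
for h in corporate.machine.node1.name.net \
         corporate.machine.node2.name.net \
         corporate.machine.node3.name.net; do
  nc -vz -w 5 "$h" 9999
done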
Edit 1:
As said in the comments, the URI of pyspark.zip was wrong. I've modified my PYSPARK_ARCHIVES_PATH to point to the real location of pyspark.zip.
PYSPARK_ARCHIVES_PATH=local:$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/python/lib/py4j-0.10.7-src.zip,local:$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/python/lib/pyspark.zip
Now I get an UnknownHostException error:
org.apache.spark.SparkException: Uncaught exception: org.apache.spark.SparkException: Exception thrown in awaitResult
...
Caused by: java.io.IOException: Failed to connect to <HOSTNAME>:13250
...
Caused by: java.net.UnknownHostException: <HOSTNAME>
...

The root scratch dir: /tmp/hive on HDFS should be writable Spark app error

I have created a Spark application which uses the Hive metastore, but on the line that creates the external Hive table I get the following error when I execute the application (Spark driver logs):
Exception in thread "main" org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwxrwxr-x;
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:214)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwxrwxr-x
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
I run the application using the Spark operator for K8s.
So I checked the permissions of the directories on the driver pod of the Spark application:
ls -l /tmp
...
drwxrwxr-x 1 1001 1001 4096 Feb 22 16:47 hive
If I try to change the permissions, it has no effect.
I run Hive metastore and HDFS in K8s as well.
How can this problem be fixed?
This is a common error which can be fixed by creating a directory somewhere else and pointing Spark at the new dir (a combined sketch follows the steps below).
Step 1: Create a new dir called tmpops at /tmp/tmpops
Step 2: Give permissions on the dir: chmod 777 /tmp/tmpops
Note: 777 is for local testing only. If you are working with sensitive data, make sure to restrict access to this path to avoid accidental data leakage and security loopholes.
Step 3: Add the property below to the hive-site.xml that the Spark app refers to:
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/tmpops</value>
</property>
Once you do this, the error will no longer appear unless someone deletes that dir.
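Putting the steps together, a minimal sketch (the paths come from the steps above; the spark.hadoop.* form at the end is an alternative way to hand the same Hive property to the app at submit time, in case you'd rather not edit hive-site.xml; my_app.py is just a placeholder for your application):
# Steps 1-2: create the new scratch dir and open it up (local testing only)
mkdir -p /tmp/tmpops
chmod 777 /tmp/tmpops
# Alternative to Step 3: pass the Hive property through Spark's hadoop conf prefix
spark-submit --conf spark.hadoop.hive.exec.scratchdir=/tmp/tmpops my_app.py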
I faced the same issue on Windows 10; the solution below helped me get it fixed.
The following steps solved my problem (a combined sketch follows):
Open Command Prompt in Admin mode
winutils.exe chmod 777 /tmp/hive
Run spark-shell --master local[2]
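For completeness, the whole sequence from an elevated Command Prompt might look like this (a sketch; it assumes HADOOP_HOME points at your winutils install, and the winutils ls call is just there to verify the change took):
:: run from an Administrator Command Prompt
%HADOOP_HOME%\bin\winutils.exe chmod 777 \tmp\hive
%HADOOP_HOME%\bin\winutils.exe ls \tmp\hive
spark-shell --master local[2]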

Unable to write data to Hive using Spark

I am using Spark 1.6. I am creating a HiveContext using the Spark context. When I save the data into Hive it gives an error. I am using the Cloudera VM. My Hive is inside the Cloudera VM and Spark is on my system. I can access the VM using its IP. I have started the thrift server and hiveserver2 on the VM, and I have used the thrift server URI for hive.metastore.uris.
val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.metastore.uris", "thrift://IP:9083")
............
............
df.write.mode(SaveMode.Append).insertInto("test")
I get the following error:
FAILED: SemanticException java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
Probably hive-site.xml is not available inside the Spark conf folder; I have added the details below.
Add hive-site.xml inside the Spark configuration folder by creating a symlink which points to hive-site.xml in the Hive configuration folder:
sudo ln -s /usr/lib/hive/conf/hive-site.xml /usr/lib/spark/conf/hive-site.xml
After the above steps, restarting spark-shell should help.

New Spark StreamingContext fails with HDFS errors

I'm using DC/OS installed via Azure ACS, and I installed HDFS and Spark via the dcos tool with default options.
Creating a Spark StreamingContext gives:
16/07/22 01:51:04 WARN DFSUtil: Namenode for hdfs remains unresolved for ID nn1. Check your hdfs-site.xml file to ensure namenodes are configured properly.
16/07/22 01:51:04 WARN DFSUtil: Namenode for hdfs remains unresolved for ID nn2. Check your hdfs-site.xml file to ensure namenodes are configured properly.
Exception in thread "main" java.lang.IllegalArgumentException:
java.net.UnknownHostException: namenode1.hdfs.mesos
I expect I have to redeploy the Spark package with dcos package install --options=, but I can't figure out what the hdfs.config-url should be. The https://docs.mesosphere.com/1.7/usage/service-guides/spark/install/#hdfs docs seem out of date.
Yes, it is out of date. We'll fix that.
DC/OS HDFS now serves its config on http://hdfs.marathon.mesos:[port]/v1/connect
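With that endpoint, a redeploy along these lines should do it (a sketch: the JSON key mirrors the hdfs.config-url option named in the question, and [port] stays a placeholder because it depends on your HDFS deployment):
# point the Spark package at the HDFS-served client config, then reinstall
cat > options.json <<'EOF'
{
  "hdfs": {
    "config-url": "http://hdfs.marathon.mesos:[port]/v1/connect"
  }
}
EOF
dcos package install spark --options=options.json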

Error when starting the Spark shell

I just downloaded the latest version of Spark, and when I started the Spark shell I got the following error:
java.net.BindException: Failed to bind to: /192.168.1.254:0: Service 'sparkDriver' failed after 16 retries!
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:393)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:389)
...
...
java.lang.NullPointerException
at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:193)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:71)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
at $iwC$$iwC.<init>(<console>:9)
...
...
<console>:10: error: not found: value sqlContext
import sqlContext.implicits._
^
<console>:10: error: not found: value sqlContext
import sqlContext.sql
^
Is there something that I missed in setting up Spark?
Try setting the Spark env variable SPARK_LOCAL_IP to a local IP address.
In my case, I was running Spark on an Amazon EC2 Linux instance. spark-shell stopped working, with an error message similar to yours. I was able to fix it by adding a setting like the following to the Spark config file conf/spark-env.sh.
export SPARK_LOCAL_IP=172.30.43.105
You could also set it in ~/.profile or ~/.bashrc.
Also check the host settings in /etc/hosts.
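A quick way to tell whether the bind failure is a hostname-resolution problem (a sketch; getent hosts performs the same lookup the JVM ultimately relies on):
# the machine's own hostname should resolve to an address Spark can bind to
hostname
getent hosts "$(hostname)"
grep "$(hostname)" /etc/hosts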
See SPARK-8162.
It looks like it only affects 1.4.1 and 1.5.0; you're probably best off running the latest stable release (1.4.0 at the time of writing).
I was experiencing the same issue. First go to .bashrc and put
export SPARK_LOCAL_IP=172.30.43.105
then go to
cd $HADOOP_HOME/bin
and run the following command:
hdfs dfsadmin -safemode leave
This just switches the namenode's safemode off.
Then delete the metastore_db folder from the Spark home folder or /bin. It will generally be in the folder from which you start a Spark session.
Then I ran my spark-shell using this:
spark-shell --master "spark://localhost:7077"
and voilà, I did not get the sqlContext.implicits._ error.
