We have OpenShift v4.0 deployed and running. We are using the Open Data Hub framework within OpenShift, which gives us JupyterHub along with Spark.
The goal is to read a bunch of CSV files with Spark and load them into MySQL. The error I was getting is described in this thread: How to set up JDBC driver for MySQL in Jupyter notebook for pyspark?
One of the suggested solutions is to copy the jar file onto the Spark master node, but I don't have root access to the pod.
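For context, once the driver jar is in place, the idea is to start the PySpark session pointing at it, along the lines of the following (the jar name and path are placeholders, not our exact setup):
pyspark --jars /path/to/mysql-connector-java.jar --driver-class-path /path/to/mysql-connector-java.jar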
How can I get root access within a pod in OpenShift?
@roar S., your answer is correct; however, it is preferable to create your own SCC identical to the "anyuid" SCC (call it "my-anyuid") and link the new SCC to the service account.
(Also, your link points to OCP v3.2 documentation, while the question is about OCP v4.x.)
We had a bad experience with this in the past: an upgrade from OCP v4.2 to v4.3 failed because we did what you proposed. In fact, "add-scc-to-user" modifies the target SCC, and the upgrade process didn't like that.
To create an SCC similar to anyuid, just extract the anyuid manifest (oc get scc anyuid -o yaml), save it, remove all the linked service accounts from the manifest, change the name, and create the new one.
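As a rough sketch of those steps (the SCC name, service account, and namespace below are only examples):
oc get scc anyuid -o yaml > my-anyuid.yaml
Edit my-anyuid.yaml: remove the users: and groups: entries and the server-managed metadata (uid, resourceVersion, creationTimestamp), and change metadata.name to my-anyuid. Then:
oc create -f my-anyuid.yaml
oc adm policy add-scc-to-user my-anyuid -z <service-account> -n <namespace>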
https://docs.okd.io/latest/authentication/managing-security-context-constraints.html
What I'm doing
I'm working on Microsoft Azure, and here is the thing: I'm trying to create an R cluster on Azure with Hadoop 3.6, but I need some tools like NiFi, Kafka, and Storm that come by default with HDF.
Problem
When I create the cluster, I can't choose the Ambari instance, so I tried creating the cluster with a template which I run every morning to create the cluster, and another one to delete the cluster every night. I was wondering if it's possible to choose the Ambari instance while using the template.
Does anyone have an idea?
AFAIK, you cannot change the version of Ambari, since it comes bundled with each HDInsight version.
You can find more details in this documentation regarding the Hadoop components available with different HDInsight versions.
I have deployed an HDInsight 3.5 Spark (2.0) cluster on Microsoft Azure with the standard configuration (Location = US East, Head Nodes = D12 v2 (x2), Worker Nodes = D4 v2 (x4)). Locally I have installed sparkmagic following the steps in https://github.com/jupyter-incubator/sparkmagic/blob/master/README.md#installation and https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-jupyter-notebook-install-locally and changed the config.json file. When starting Jupyter Notebook I can choose the PySpark kernel. Even though I get the message that the kernel is ready, when I try to execute a simple statement (e.g. t = 4), the kernel runs indefinitely. Could you provide possible solution(s)?
Most probably, this is an issue where the config.json is configured with the wrong endpoint, username, or password. If you are using the base64 password field, make sure the password is base64 encoded.
Without more information regarding errors (log file should be in ~/.sparkmagic/logs), it's hard to say why you couldn't connect.
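If you do use the base64 password field in config.json, the value should be the base64 encoding of the actual cluster password, e.g. generated with something like this (the password shown is a placeholder):
echo -n 'YourClusterPassword' | base64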
I've used the Windows version of HDInsight before, and that has a tab where you can set the number of cores and the RAM per worker node for Zeppelin.
I followed this tutorial to get Zeppelin working:
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-use-zeppelin-notebook/
The Linux version of HDInsight uses Ambari to manage the resources, but I can't seem to find a way to change the settings for Zeppelin.
Zeppelin is not selectable as a separate service in the list of services on the left. It also seems like it isn't available to be added when I choose 'add service' in actions.
I tried editing the general spark configs in Ambari by using override, then adding the worker nodes to my new config group and increasing the number of cores and RAM in custom spark-defaults. (Then clicked save and restarted all affected services.)
I tried editing the spark settings using
vi /etc/spark/conf/spark-defaults.conf
on the headnode, but that wasn't picked up by Ambari.
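The kind of entries being changed in spark-defaults.conf are along these lines (values are examples only):
spark.executor.memory 4g
spark.executor.cores 4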
The performance in Zeppelin seems to stay the same for a query that takes about 1000-1100 seconds every time.
Zeppelin is not a service, so it shouldn't show up in Ambari. If you are committed to managing it that way, you may be able to get this to work:
https://github.com/tzolov/zeppelin-ambari-plugin
To edit via SSH, you'll need to edit the zeppelin-env.sh file. First give yourself write permissions:
sudo chmod u+w /usr/hdp/current/incubator-zeppelin/conf/zeppelin-env.sh
and then edit zeppelin configs using
vi /usr/hdp/current/incubator-zeppelin/conf/zeppelin-env.sh
Here you can configure the ZEPPELIN_JAVA_OPTS variable, adding:
-Dspark.executor.memory=1024m -Dspark.executor.cores=16
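so that the line in zeppelin-env.sh ends up looking something like:
export ZEPPELIN_JAVA_OPTS="-Dspark.executor.memory=1024m -Dspark.executor.cores=16"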
All that being said... any reason you can't just use a Jupyter notebook instead?
I'm using Apache Spark to build an application. To make the RDDs available to other applications, I'm trying two approaches:
Using tachyon
Using a spark-jobserver
I'm new to Tachyon. I completed the tasks given in Running Tachyon on a Cluster.
I'm able to access the UI at the master:19999 URL.
From the Tachyon directory I successfully created a directory:
./bin/tachyon tfs mkdir /Test
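The copy attempt looks roughly like this (the local path is a placeholder):
./bin/tachyon tfs copyFromLocal /path/to/localFile /Test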
But when I try the copyFromLocal command, I get the following error:
FailedToCheckpointException(message:Failed to rename hdfs://master:54310/tmp/tachyon/workers/1421840000001/8/93 to hdfs://master:54310/tmp/tachyon/data/93)
You are most likely running tachyon and spark-jobserver under different users, and have HDFS as your underFS.
Check out https://tachyon.atlassian.net/browse/TACHYON-1339 and the related patch.
The easy way out is to run tachyon and your spark-jobserver as the same user.
The (slightly) harder way is to port the patch, recompile Spark, and then recompile sjs against the patched client.
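A quick way to check for the user mismatch described above (process names may vary with your installation):
ps -eo user,args | grep -Ei 'tachyon|jobserver' | grep -v grep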
Every few days the Azure HDInsight cluster is (randomly?) restarted by Microsoft, and in the process any custom changes to hive-site.xml (such as adding a JsonSerde) are lost without any prior warning. As a result, the Hive queries from Excel/PowerPivot start breaking.
How are you supposed to deal with this scenario? Are we forced to store our data as CSV files?
In order to preserve customizations across OS updates or node re-images, you should consider using a Script Action. Here is the link: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-customize-cluster/
If you specify the Hive config parameter with a custom configuration object at the time of cluster creation, it should persist. The link here http://hadoopsdk.codeplex.com/wikipage?title=PowerShell%20Cmdlets%20for%20Cluster%20Management has some more details on creating a cluster with custom configuration.
This blog post on MSDN has a table showing which customizations are supported via the different methods, as well as examples for using PowerShell or the SDK to create a cluster with custom Hive configuration parameters (lines 62-64 in the PowerShell example): http://blogs.msdn.com/b/bigdatasupport/archive/2014/04/15/customizing-hdinsight-cluster-provisioning-via-powershell-and-net-sdk.aspx
This is the only way to persist these settings because the cluster nodes can be reset for Azure servicing events such as security updates, and the configurations are set back to the initial values when this occurs.