Spark cluster on Kubernetes without spark-submit - apache-spark

I have a spark application and want to deploy this on a Kubernetes cluster.
Following the documentation below, I have managed to create an empty Kubernetes cluster, generated a Docker image using the Dockerfile provided under kubernetes/dockerfiles/spark/Dockerfile, and deployed it on the cluster using spark-submit in a dev environment.
https://spark.apache.org/docs/latest/running-on-kubernetes.html
However, in the 'proper' environment we have a managed Kubernetes cluster (bespoke, unlike EKS etc.) and will have to provide pod configuration files to get deployed.
I believe you can supply a pod template file as an argument to the spark-submit command.
https://spark.apache.org/docs/latest/running-on-kubernetes.html#pod-template
How can I do this without spark-submit? And are there any example yaml files?
PS: we have limited access to this cluster, e.g. we can install Helm charts but not an operator or controller.

You could try the Kubernetes Spark operator CRD (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) and provide the pod configuration through it.
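For example, here is a sketch of a SparkApplication manifest, loosely based on the spark-pi example in that repo. The image, versions and the spark service account are assumptions; note that the operator itself is distributed as a Helm chart, which may fit your constraint of only being able to install Helm charts.

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v3.1.1"   # assumed image
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar"
  sparkVersion: "3.1.1"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: spark   # assumed service account with pod create/list permissions
    labels:
      version: 3.1.1
  executor:
    cores: 1
    instances: 2
    memory: "512m"
    labels:
      version: 3.1.1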

Related

How to Pass Variables into Azure Databricks Cluster Init Script

I'm trying to use workspace environment variables to pass access tokens into my custom cluster init scripts.
It appears that there are only a few supported environment variables that we can access in our custom cluster init scripts as described at https://docs.databricks.com/clusters/init-scripts.html#environment-variables
I've attempted to write to the base cluster configuration using
Microsoft.Azure.Databricks.Client.SparkEnvironmentVariables.Add("WORKSPACE_ID", workspaceId)
My init scripts are still failing to pick up this variable in the following line:
[[ -z "${WORKSPACE_ID}" ]] && LOG_ANALYTICS_WORKSPACE_ID='default' || LOG_ANALYTICS_WORKSPACE_ID="${WORKSPACE_ID}"
With the above lines of code, my init script causes the cluster to fail with the following error:
Spark Error: Spark encountered an error on startup. This issue can be caused by
invalid Spark configurations or malfunctioning init scripts. Please refer to the Spark
driver logs to troubleshoot this issue, and contact Databricks if the problem persists.
Internal error message: Spark error: Driver down
The logs don't say that any part of my bash script is failing, so I'm assuming that it's just failing to pick up the variable from the environment variables.
Has anyone else dealt with this problem? I realize that I could write this information to DBFS and then read it into the init script, but I'd like to avoid doing that since I'll be passing in access tokens. What other approaches can I try?
Thanks for any help!
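For reference, this is roughly the cluster spec I am trying to end up with: an untested sketch that assumes the client call above maps to the spark_env_vars field of the Clusters API (the cluster name, Spark version, node type and init script path are placeholders).

{
  "cluster_name": "log-analytics-test",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 1,
  "spark_env_vars": {
    "WORKSPACE_ID": "<log-analytics-workspace-id>"
  },
  "init_scripts": [
    { "dbfs": { "destination": "dbfs:/databricks/init/monitoring.sh" } }
  ]
}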
This article shows how to send application logs and metrics from Azure Databricks to a Log Analytics workspace. It uses the Azure Databricks Monitoring Library, which is available on GitHub.
Prerequisites: Configure your Azure Databricks cluster to use the monitoring library, as described in the GitHub readme.
Steps to build the Azure monitoring library and configure an Azure Databricks cluster:
Step 1: Build the Azure Databricks monitoring library
Step 2: Create and configure the Azure Databricks cluster
For more details, refer to "Monitoring Azure Databricks".
Hope this helps.

How to Integrate GitLab-Ci w/ Azure Kubernetes + Kubectl + ACR for Deployments?

Our previous GitLab-based CI/CD used an authenticated curl request to a specific REST API endpoint to trigger the redeployment of an updated container to our service. If you use something similar for your Kubernetes-based deployment, this question is for you.
More Background
We run a production site / app (Ghost blog based) on an Azure AKS Cluster. Right now we manually push our updated containers to a private ACR (Azure Container Registry) and then update from the command line with Kubectl.
That being said, we previously used Docker Cloud for our orchestration and fully integrated re-deploying our production / staging services using GitLab-Ci.
That GitLab-Ci integration is the goal, and the 'Why' behind this question.
My Question
Since we previously used Docker Cloud (doh, should have gone K8s from the start), how should we handle the fact that GitLab-Ci was able to make use of secrets created by the Docker Cloud CLI and then authenticate with the Docker Cloud API to trigger actions on our nodes (i.e. re-deploy with new containers etc.)?
While I believe we can build a container (to be used by our GitLab-Ci runner) that contains kubectl and the Azure CLI, I know that Kubernetes also has a REST API similar to Docker Cloud's, documented here (https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster); the section about connecting WITHOUT kubectl appears to be relevant (as does the piece about the HTTP REST API).
My question to anyone who is connecting to Azure (or potentially another managed Kubernetes service):
How does your Ci/CD server authenticate with your Kubernetes service provider's Management Server, and then how do you currently trigger an update / redeployment of an updated container / service?
If you have used the Kubernetes HTTP REST API to re-deploy a service, your thoughts are particularly valuable!
Kubernetes Resources I am Reviewing
How should I manage deployments with kubernetes
Kubernetes Deployments
Will update as I work through the process.
Creating the integration
I had the same problem of how to integrate GitLab CI/CD with my Azure AKS Kubernetes cluster. I created this question because I was getting errors when I tried to add my Kubernetes cluster info into GitLab.
How to integrate them:
Inside GitLab, go to the "Operations" > "Kubernetes" menu.
Click the "Add Kubernetes cluster" button at the top of the page.
You will have to fill in some form fields. To get the content for these fields, connect to your Azure account from the CLI (you need the Azure CLI installed on your PC) using the az login command, and then run this other command to get the Kubernetes cluster credentials: az aks get-credentials --resource-group <resource-group-name> --name <kubernetes-cluster-name>
The previous command will create a ~/.kube/config file. Open it; the content of the fields that you have to fill in the GitLab "Add Kubernetes cluster" form is all inside this .kube/config file.
These are the fields:
Kubernetes cluster name: It's the name of your cluster on Azure; it's in the .kube/config file too.
API URL: It's the URL in the field server of the .kube/config file.
CA Certificate: It's the field certificate-authority-data of the .kube/config file, but you will have to base64 decode it.
After you decode it, it must be something like this:
-----BEGIN CERTIFICATE-----
...
some base64 strings here
...
-----END CERTIFICATE-----
Token: It's the string of hexadecimal chars in the field token of the .kube/config file (it might also need to be base64 decoded?). You need to use a token belonging to an account with cluster-admin privileges, so GitLab can use it to authenticate and install things on the cluster. The easiest way to achieve this is to create a new service account for GitLab: create a YAML file with the service account definition (an example can be seen here under Create a gitlab service account in the default namespace, and a sketch follows these steps) and apply it to your cluster with kubectl apply -f serviceaccount.yml.
Project namespace (optional, unique): I left it empty; I don't know yet what this namespace is used for.
Click "Save" and it's done. Your GitLab project should now be connected to your Kubernetes cluster.
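As promised, a sketch of such a service account manifest. The names are illustrative; it binds the account to the built-in cluster-admin role, which is what GitLab needs in order to authenticate and manage the cluster.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: gitlab          # illustrative name
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gitlab-cluster-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin   # grants full cluster access to the service account
subjects:
- kind: ServiceAccount
  name: gitlab
  namespace: default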
Deploy
In your deploy job (in the pipeline), you'll need some environment variables to access your cluster using the kubectl command. Here is a list of all the available variables:
https://docs.gitlab.com/ee/user/project/clusters/index.html#deployment-variables
To have these variables injected in your deploy job, there are some conditions:
You must have correctly added the Kubernetes cluster to your GitLab project, via the "Operations" > "Kubernetes" menu and the steps described above.
Your job must be a "deployment job". In GitLab CI, to be considered a deployment job, your job definition (in your .gitlab-ci.yml) must have an environment key (take a look at line 31 in this example), and the environment name must match the name you used in the "Operations" > "Environments" menu.
Here is an example of a .gitlab-ci.yml with three stages (a sketch follows the list below):
Build: builds a Docker image and pushes it to the GitLab private registry
Test: it doesn't do anything yet, just an exit 0 placeholder to be changed later
Deploy: downloads a stable version of kubectl, copies the .kube/config file so kubectl commands can be run against the cluster, and executes kubectl cluster-info to make sure it is working. In my project I haven't finished writing the deploy script that actually performs a deployment, but the kubectl cluster-info command runs fine.
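A minimal sketch of such a .gitlab-ci.yml. The image tags, the kubectl version and the docker-in-docker variables are assumptions that depend on your runner configuration; the deploy job relies on the KUBECONFIG deployment variable that GitLab injects once the cluster integration above is in place.

stages:
  - build
  - test
  - deploy

build:
  stage: build
  image: docker:stable
  services:
    - docker:dind
  variables:
    DOCKER_HOST: tcp://docker:2375   # talk to the dind service; depends on your runner setup
    DOCKER_TLS_CERTDIR: ""
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"

test:
  stage: test
  script:
    - exit 0                         # placeholder

deploy:
  stage: deploy
  image: alpine:3.18
  environment:
    name: production                 # must match "Operations" > "Environments"
  script:
    - apk add --no-cache curl
    - curl -LO https://dl.k8s.io/release/v1.27.3/bin/linux/amd64/kubectl   # pin a version matching your cluster
    - chmod +x kubectl && mv kubectl /usr/local/bin/
    - kubectl cluster-info           # uses the KUBECONFIG injected by the GitLab cluster integration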
Tip: to take a look at all the environment variables and their values (Jenkins has a page with this view, GitLab CI doesn't), you can execute the env command in the script of your deploy stage. It helps a lot when debugging a job.
I logged into our GitLab-Ci backend today and saw a 'Kubernetes' button — along with an offer to save $500 at GCP.
GitLab Kubernetes
The URL for your repo's Kubernetes GitLab page is:
https://gitlab.com/^your-repo^/clusters
As I work through the integration process I will update this answer (contributions are also welcome!).
Official GitLab Kubernetes Integration Docs
https://docs.gitlab.com/ee/user/project/clusters/index.html

How to setup spark in aws ecs

I have a Spark multi-master cluster managed by ZooKeeper and I want to move this multi-master architecture to AWS ECS; AWS EMR is not an option.
The question is how ECS load balancing will work, taking into account that every component (spark-master, spark-worker, zookeeper, livy) is deployed in a different ECS service. Thanks.
It cannot be done that way: the Spark components are connected through a config file at compile time, so the solution is to use the task definition's loopback interface, which provides a static IP (127.0.0.1) and hostname (localhost). So, e.g., a worker connecting to the master will use spark://127.0.0.1:7077.
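For illustration, a rough sketch of that idea, assuming the master and worker containers are defined in the same ECS task so they share the task's network namespace and can reach each other over 127.0.0.1; the script names are from the Spark 3.x standalone distribution.

# Run inside the master container of the task:
${SPARK_HOME}/sbin/start-master.sh --host 127.0.0.1 --port 7077

# Run inside the worker container of the same task
# (start-slave.sh on Spark versions before 3.1):
${SPARK_HOME}/sbin/start-worker.sh spark://127.0.0.1:7077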

Change Kubernetes pods logging

I have an AKS (Azure Container Service) cluster configured, up and running, with Kubernetes installed.
I deploy containers using kubectl proxy and the provided Kubernetes dashboard GUI.
I am trying to increase the log level of the pods in order to get more information for better debugging.
I have read a lot about kubectl config set and the --v=[0-10] log level flag, but I have not been able to change the log level, and the documentation did not make it clear how.
Can someone point me in the right direction?
The --v flag is an argument to kubectl and specifies the verbosity of the kubectl output. It has nothing to do with the log levels of the application running inside your Pods.
To get the logs from your Pods, you can run kubectl logs <pod>, or read /var/log/pods/<namespace>_<pod_name>_<pod_id>/<container_name>/ on the Kubernetes node.
To increase the log level of your application, the application itself has to support it. And as @Jose Armesto said above, this is usually configured through an environment variable.
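For instance, here is a minimal sketch of a Deployment that passes a purely illustrative LOG_LEVEL environment variable to the container; the variable name and whether it has any effect depend entirely on the application inside the image.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: myregistry.azurecr.io/my-app:latest   # placeholder image
        env:
        - name: LOG_LEVEL    # illustrative; your application must read this variable
          value: "DEBUG"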

Request insufficient authentication scopes when running Spark-Job on dataproc

I am trying to run a Spark job on the Google Dataproc cluster as follows:
gcloud dataproc jobs submit hadoop --cluster <cluster-name> \
--jar file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
--class org.apache.hadoop.examples.WordCount \
--arg1 \
--arg2
But the job throws this error:
(gcloud.dataproc.jobs.submit.spark) PERMISSION_DENIED: Request had insufficient authentication scopes.
How do I add the auth scopes to run the job?
Usually, if you're running into this error, it's because you're running gcloud from inside a GCE VM that uses VM-metadata-controlled scopes; gcloud installed on a local machine will typically already be using scopes broad enough to include all GCP operations.
For Dataproc access, when creating the VM from which you're running gcloud, you need to specify --scopes cloud-platform on the CLI, or if creating the VM from the Cloud Console UI, you should select "Allow full access to all Cloud APIs".
As another commenter mentioned above, nowadays you can also update scopes on existing GCE instances to add the CLOUD_PLATFORM scope.
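For example (the instance and zone names are placeholders, and an existing instance normally has to be stopped before its scopes can be changed):

# New VM with full access to the Cloud APIs:
gcloud compute instances create my-vm \
    --zone us-central1-a \
    --scopes cloud-platform

# Updating an existing VM:
gcloud compute instances stop my-vm --zone us-central1-a
gcloud compute instances set-service-account my-vm \
    --zone us-central1-a \
    --scopes cloud-platform
gcloud compute instances start my-vm --zone us-central1-a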
You need to check the option that allows API access while creating the Dataproc cluster. Only then can you submit jobs to the cluster using the gcloud dataproc jobs submit command.
