Where to store SparkApplication YAML files on Kubernetes cluster? - apache-spark

I'm using the Helm Chart to deploy Spark Operator to GKE. Then I define a SparkApplication specification in a YAML file. But after reading the User Guide I still don't understand:
Where should I store SparkApplication YAML files: on the Kubernetes cluster or in Google Storage?
Is it OK/possible to deploy them along with the Spark Operator Helm chart to the Spark master container?
Is it a good approach to upload the SparkApplication configurations to Google Storage and then run kubectl apply -f <YAML GS file path>?
What are the best practices for storing SparkApplication configurations on a Kubernetes cluster or in GCS that I may be missing?

To address your questions:
There are many possible places to store your YAML files. You can keep them locally on your PC or laptop, or you can store them in the cloud. Going further, keeping your YAML files in a version control system (for example Git) would be one of the better options, because you will have the full history of changes, with the ability to see exactly what you changed and to roll back if something fails. The main thing is that kubectl needs access to these files.
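For example, a minimal workflow (the repository URL and file name here are hypothetical) from any machine where kubectl is configured against your cluster:
# Clone the version-controlled repo that holds your SparkApplication specs
# (hypothetical repository and file name).
git clone https://github.com/your-org/spark-apps.git
cd spark-apps
# kubectl only needs local access to the file plus credentials for the cluster.
kubectl apply -f spark-pi.yaml
# Check that the Spark Operator picked it up.
kubectl get sparkapplications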
There is no such thing as a master container in Kubernetes. There is a master node. A master node is a machine which controls and manages a set of worker nodes (the workload runtime).
Please check the official documentation about Kubernetes components.
You can put your YAML files in a Google Storage bucket, but you would not be able to run a command like kubectl apply -f FILE directly against it. kubectl cannot properly interpret a file location like gs://NAME_OF_THE_BUCKET/magical-deployment.yaml.
One way to run kubectl apply -f FILE_NAME.yaml would be to have the file stored locally and synced from the bucket.
You can access the data inside a bucket through gsutil. You could try to tinker with gsutil cat gs://NAME_OF_THE_BUCKET/magical-deployment.yaml and pipe it into kubectl, but I would not recommend that approach.
Please refer to the gsutil tool documentation in this case and be aware of:
The gsutil cat command does not compute a checksum of the downloaded data. Therefore, we recommend that users either perform their own validation of the output of gsutil cat or use gsutil cp or rsync (both of which perform integrity checking automatically).
-- https://cloud.google.com/storage/docs/gsutil/commands/cat
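If you do keep the specs in a bucket, a sketch of the "copy locally, then apply" approach (bucket and file names are the same placeholders as above):
# Copy (or rsync) the spec from the bucket first; unlike gsutil cat,
# gsutil cp performs integrity checking.
gsutil cp gs://NAME_OF_THE_BUCKET/magical-deployment.yaml .
# Now kubectl can read the local file.
kubectl apply -f magical-deployment.yaml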
Let me know if you have any questions about this.

Related

Save the temporary file created by a task in a DAG and email it as an attachment in another task

I am using the Kubernetes executor: https://airflow.apache.org/docs/apache-airflow/stable/executor/kubernetes.html
My requirement is as below. There is a DAG that has two tasks:
Bash Task A (BashOperator) creates a file at a temp location using Python code.
Email Task B (EmailOperator) must access the file created above and send it as an email attachment.
Apparently, with the Kubernetes executor, each task instance is run in its own pod on a Kubernetes cluster. The worker pod runs the task, reports the result, and terminates. Therefore, after the worker pod shuts down, everything inside the pod is lost, so any file it downloaded is lost.
Note: no storage is mounted yet. Exploring easy options, if any.
I would not like the Python code to send the email either; instead I want a separate task to do the emailing.
If you are looking for the easiest option, you can use a hostPath volume to mount the files onto the node; if you run your containers on a specific node pool, the pods will be able to get the files. Note: if the node goes down, your files are gone.
If you want to share the file system between pods, you have to use a ReadWriteMany PVC (a minimal sketch follows below).
If you are on a cloud provider you can use a managed file system, for example EFS on AWS.
You can also run GlusterFS or MinIO to create a file system on K8s and use it as the mount for the pods, so they can all access and share it.
You could also leverage the S3 bucket option to upload the artifacts or files; the new pod then downloads them first to a temp location, sends the email, and terminates itself. This way the files are kept in S3 and no cleanup is required at the FS or pod level.
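A minimal sketch of the ReadWriteMany PVC option, assuming a StorageClass named shared-rwx that actually supports ReadWriteMany (for example one backed by NFS or EFS); the claim name is also a placeholder:
# Create a PVC that several pods can mount simultaneously.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: airflow-shared-files
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: shared-rwx
  resources:
    requests:
      storage: 1Gi
EOF
# Both task pods would then mount the claim "airflow-shared-files" at the
# same path, so the email task can read the file the bash task wrote.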

How to modify Cassandra config values when using helm chart in Terraform

I'm using the Bitnami Helm chart for Cassandra in order to deploy it with Terraform. I'm completely new to all of this, and I'm struggling to change one config value, namely commitlog_segment_size_in_mb. I want to do it before I run the Terraform commands, but I failed to find any mention of it in the Helm chart itself.
I know I can change it in the cassandra.yaml file after the Terraform deployment, but I would like to have this value controllable, so that another Terraform update will not overwrite the file.
What would be the best approach to change values of Cassandra config?
Can I modify it in Terraform if it's not in the Helm Chart?
Can I export parts of the configuration to a different file, so that I know my next Terraform installations will not overwrite them?
This isn't a direct answer to your question, but in case you weren't already aware of it, K8ssandra.io is a ready-made platform for running Apache Cassandra in Kubernetes. It uses Helm charts to deploy Cassandra with the DataStax Cassandra Operator (cass-operator) under the hood, with all the tools built in:
Reaper for automated repairs
Medusa for backups and restores
Metrics Collector for monitoring with Prometheus + Grafana
Traefik templates for k8s cluster ingress
Stargate.io - a data gateway for connecting to Cassandra using REST API, GraphQL API and JSON/Doc API
K8ssandra and all components are fully open-source and free to use, improve and enjoy. Cheers!
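As for the original question of finding the setting, one quick check (a sketch, assuming the Bitnami repo is added under the name bitnami) is to dump the chart's default values and search them for commitlog-related keys or a generic configuration override:
helm repo add bitnami https://charts.bitnami.com/bitnami
# Dump the chart's default values.yaml and search it.
helm show values bitnami/cassandra | grep -in -C2 'commitlog\|configuration'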

Modify file from kubernetes pod

I want to modify a particular config file in a running Kubernetes pod at runtime.
How can I get the pod name at runtime, modify the file in the running pod, and restart it to reflect the changes? I am trying this in Python 3.6.
Suppose I have two running pods.
In one pod I have a config.json file that contains:
{
"server_url" : "http://127.0.0.1:8080"
}
So I want to replace 127.0.0.1 with the other Kubernetes service's LoadBalancer IP in it.
Generally you would do this with an initContainer and a templating tool like envsubst or confd or Consul Templates.
Use the downward API to capture the pod name. Develop a startup script that gets the config file you want to update, populates the required values using the sed command, and then runs the container process.
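A rough sketch of such a startup script, where POD_NAME is assumed to be injected via the downward API (fieldRef metadata.name), TARGET_HOST is an assumed env var holding the other service's address, and the config path and final command are placeholders:
#!/bin/sh
# Patch config.json before starting the main container process.
echo "Configuring pod ${POD_NAME}"
# Replace the placeholder server_url with the real service address.
sed -i "s|http://127.0.0.1:8080|http://${TARGET_HOST}:8080|" /app/config.json
# Hand off to the real container process (placeholder command).
exec /app/start-server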

Cassandra Snapshot running on kubernetes

I'm using Kubernetes (via minikube) to deploy my Lagom services and my Cassandra DB.
After a lot of work, I succeeded in deploying my service and my DB on Kubernetes.
Now I'm about to manage my data, and I need to generate a backup each day.
Is there any solution to generate and restore a snapshot (backup) for Cassandra running on Kubernetes?
cassandra statefulset image:
gcr.io/google-samples/cassandra:v12
Cassandra node:
svc/cassandra ClusterIP 10.97.86.33 <none> 9042/TCP 1d
Any help, please?
https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsBackupRestore.html
That link contains all the information you need. Basically, you use the nodetool snapshot command to create hard links of your SSTables. Then it's up to you to decide what to do with the snapshots.
I would define a new disk in the StatefulSet and mount it to a folder, e.g. /var/backup/cassandra. The backup disk should be network storage. Then I would create a simple script that:
Runs nodetool snapshot.
Gets the snapshot id from the output of the command.
Copies all files in the snapshot folder to /var/backup/cassandra.
Deletes the snapshot folder.
Now all I have to do is make sure I store the backups somewhere else on my network drive for the long term.
Disclaimer: I haven't actually done this, so there might be a step missing, but this would be the first thing I would try based on the DataStax documentation.
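With the same disclaimer, a rough sketch of that script (the paths are the Cassandra defaults; instead of parsing the snapshot id from the command output, this version passes an explicit tag with -t so the folder name is known up front):
#!/bin/bash
set -euo pipefail
# Tag the snapshot with the date so each daily backup is distinct.
TAG="backup-$(date +%Y%m%d)"
# 1. Create the snapshot (hard links to the current SSTables).
nodetool snapshot -t "${TAG}"
# 2. Copy every snapshot directory for this tag to the backup disk,
#    preserving the keyspace/table part of the path.
find /var/lib/cassandra/data -type d -path "*/snapshots/${TAG}" | while read -r dir; do
  dest="/var/backup/cassandra/${dir#/var/lib/cassandra/data/}"
  mkdir -p "${dest}"
  cp -r "${dir}/." "${dest}/"
done
# 3. Remove the on-node snapshot now that it has been copied off.
nodetool clearsnapshot -t "${TAG}"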

Where does Google Dataproc store Spark logs on disk?

I'd like to get command-line access to the live logs produced by my Spark app when I'm SSH'd into the master node (the machine hosting the Spark driver program). I'm able to see them using gcloud dataproc jobs wait, in the Dataproc web UI, and in GCS, but I'd like to be able to access the live log via the command line so I can grep, etc., through it.
Where can I find the logs produced by Spark on the driver (and on the executors too!)?
At the moment, Dataproc doesn't actually tee out any duplicate copy of the driver output to local disk versus just placing it in GCS, in part because that doesn't quite fit into standard log-rotation policies or YARN task log cleanup; it would require extra definitions of how to garbage-collect these output files on the local disk, or else risk slowly running out of disk space on a longer-lived cluster.
That said, such deletion policies are certainly surmountable, so I'll go ahead and add this as a feature request to tee the driver output out to both GCS and a local disk file for better ease-of-use.
In the meantime though, you have a couple options:
Enable the cloud-platform scope when creating your cluster (gcloud dataproc clusters create --scopes cloud-platform) and then, even on the cluster, you can run gcloud dataproc jobs wait <jobid> | grep foo.
Alternatively, use gsutil cat: if you run gcloud dataproc jobs describe from another location first to find the driverOutputResourceUri field, it points at the GCS prefix (which you probably already found, since you mentioned finding the logs in GCS). Since the output parts are named with a padded numerical prefix, gsutil cat gs://bucket/google-cloud-dataproc-metainfo/cluster-uuid/jobs/jobid/driveroutput* will print the job output in the correct order, and you can then pipe that into whatever you need.
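Putting the second option together, a sketch (the job id is a placeholder, and depending on your gcloud version you may also need to pass --region):
# Look up where Dataproc wrote the driver output for this job.
OUTPUT_URI=$(gcloud dataproc jobs describe <jobid> --format='value(driverOutputResourceUri)')
# The output parts have a padded numerical prefix, so cat-ing the wildcard
# prints them in order; grep through the result as needed.
gsutil cat "${OUTPUT_URI}*" | grep foo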
