Extracting Spark logs (Spark UI contents) from Databricks - apache-spark

I am trying to save the Apache Spark logs (the contents of the Spark UI), not necessarily the stderr, stdout, and log4j files (although they might be useful too), to a file so that I can send it over to someone else to analyze.
I am following the procedure described in the Apache Spark documentation here:
https://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact
The problem is that I am running the code on Azure Databricks. Databricks saves the logs elsewhere; you can display them from the web UI but cannot export them.
When I ran the Spark job with spark.eventLog.dir set to a location in DBFS, the file was created but it was empty.
Is there a way to export the full Databricks job log so that anyone can open it without giving them the access to the workspace?

The simplest way of doing it is the following:
You create a separate storage account with a container in it (or a separate container in an existing storage account) and give developers access to it
You mount that container to the Databricks workspace
You configure clusters/jobs to write logs into the mount location (you can enforce this for new clusters using cluster policies). This creates a sub-directory per cluster, containing driver and executor logs plus the output of init scripts (see the sketch after this list)
(Optional) you can set up a retention policy on that container to automatically remove old logs.
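For the mount and the log destination from the steps above, here is a minimal sketch (a Python notebook cell) assuming a Blob storage container and an access key kept in a Databricks secret scope; the storage account, container, mount point, and secret names are placeholders:
# Mount the dedicated logs container (run once per workspace).
dbutils.fs.mount(
    source="wasbs://cluster-logs@<storage-account>.blob.core.windows.net",
    mount_point="/mnt/cluster-logs",
    extra_configs={
        "fs.azure.account.key.<storage-account>.blob.core.windows.net":
            dbutils.secrets.get(scope="<secret-scope>", key="<storage-key>")
    },
)
Then, in the cluster configuration (Advanced options -> Logging) or via the Clusters API, set the log destination to dbfs:/mnt/cluster-logs; each cluster gets its own sub-directory there with driver and executor logs.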

Related

Spark history-server stderr and stdout logs location when working on S3

I deployed a Spark history server that is supposed to serve multiple environments:
all Spark clusters will write to one bucket and the history server will read from that bucket.
I got everything set up and working, but when I try to access the stdout/stderr of a certain task, the link points to the private IP of the worker the task was running on (e.g. http://10.192.21.80:8081/logPage/?appId=app-20220510103043-0001&executorId=1&logType=stderr).
I want to access those logs from the UI, but of course there is no access to those internal IPs (private subnets and private IPs). Isn't there a way to also upload those stderr/stdout logs to the bucket and then access them from the history server UI?
I couldn't find anything in the documentation.
I am assuming that you are referring to running Spark jobs on AWS EMR here.
If you have enabled logging to an S3 bucket on your cluster [1], all the application logs are written to the S3 bucket path specified when launching the cluster.
You should find the logs in the following path:
s3://<bucketpath>/<clusterid>/containers/<applicationId>/<containerId>/stdout.gz
s3://<bucketpath>/<clusterid>/containers/<applicationId>/<containerId>/stderr.gz
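If you need to pull those files down for inspection, here is a minimal sketch with boto3; the bucket, cluster, application, and container IDs are placeholders taken from the paths above:
import boto3

# Download a container's stdout/stderr from the EMR log bucket.
s3 = boto3.client("s3")
prefix = "<clusterid>/containers/<applicationId>/<containerId>/"
for name in ("stdout.gz", "stderr.gz"):
    s3.download_file("<bucketpath>", prefix + name, name)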
Hope the above information was helpful!
References:
[1] Configure cluster logging and debugging - Enable the debugging tool - https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-debugging.html#emr-plan-debugging-logs-archive-debug

Azure Flink checkpointing to Azure Storage: No credentials found for account

I have a test Flink app I am trying to run on Azure Kubernetes, connected to Azure Storage. In my Flink app I have the following configuration:
Configuration cfg = new Configuration();
cfg.setString("fs.azure.account.key.<storage-account>.blob.core.windows.net", "<access-key>");
FileSystem.initialize(cfg, null);
I have also enabled checkpointing as follows:
env.enableCheckpointing(10000);
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION);
env.setStateBackend(new EmbeddedRocksDBStateBackend());
env.getCheckpointConfig().setCheckpointStorage("wasbs://<container>@<storage-account>.blob.core.windows.net/checkpoint/");
The storage account has been created on the Azure Portal. I have used the Access Key in the code above.
When I deploy the app to Kubernetes the JobManager runs and creates the checkpoint folder in the Azure Storage container, however, the size of the Block blob data is always 0B. The app also continuously throws this exception.
The fun error I am getting is:
Caused by: org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.fs.azure.AzureException: No credentials found for account <storage-account>.blob.core.windows.net in the configuration, and its container <container> is not accessible using anonymous credentials. Please check if the container exists first. If it is not publicly available, you have to provide account credentials.
org.apache.flink.fs.azure.shaded.com.microsoft.azure.storage.StorageException: Public access is not permitted on this storage account
The part that has been scratching my head (apart from the fleas) is the fact that it does create the checkpoint folders and files and continues to create further checkpoints.
This account is not publicly accessible and company policy has restricted enabling public access.
I also tried using the flink-conf.yaml and this was my example:
state.backend: rocksdb
state.checkpoints.dir: wasbs://<container>@<storage-account>.blob.core.windows.net/checkpoint/
fs.azure.account.key.flinkstorage.blob.core.windows.net: <access-key>
fs.azure.account.key.<storage-account>.blob.core.windows.net: <access-key>
I tried both account.key options above. I tried with the wasb protocol as well. I also tried rotating the access keys on Azure Storage, all resulting in the same errors.
I eventually got this working by moving all of my checkpointing configuration into flink-conf.yaml. All references to checkpointing were removed from my code, i.e. from the StreamExecutionEnvironment.
My flink-conf.yaml looks like this:
execution.checkpointing.interval: 10s
execution.checkpointing.mode: EXACTLY_ONCE
state.backend: rocksdb
state.checkpoints.dir: wasbs://<container>@<storage-account>.blob.core.windows.net/checkpoint/
# azure storage access key
fs.azure.account.key.psbombb.blob.core.windows.net: <access-key>
Checkpoints are now being written to Azure Storage with the size of the metadata files no longer 0B.
I deployed my Flink cluster to Kubernetes as follows with Azure Storage plugins enabled:
./bin/kubernetes-session.sh -Dkubernetes.cluster-id=<cluster-name> -Dkubernetes.namespace=<your-namespace> -Dcontainerized.master.env.ENABLE_BUILT_IN_PLUGINS=flink-azure-fs-hadoop-1.14.0.jar -Dcontainerized.taskmanager.env.ENABLE_BUILT_IN_PLUGINS=flink-azure-fs-hadoop-1.14.0.jar
I then deployed the job to the Flink cluster as follows:
./bin/flink run --target kubernetes-session -Dkubernetes.namespace=<your-namespace> -Dkubernetes.cluster-id=<cluster-name> ~/path/to/project/<your-jar>.jar
The TaskManager view in the WebUI will not show stdout logs. You'll need to run kubectl logs -f <taskmanager-pod-name> -n <your-namespace> to see the job logs.
Remember to port-forward 8081 if you want to see the Flink WebUI:
kubectl port-forward svc/<cluster-name> 8081:8081 -n <namespace>
e.g. http://localhost:8081
If you're using Minikube and you wish to access the cluster through the Flink LoadBalancer external IP you need to run minikube tunnel
e.g. http://<external-ip>:8081

How to ship Airflow logs to Azure Blob Store

I'm having trouble following this guide, section 3.6.5.3, "Writing Logs to Azure Blob Storage".
The documentation states you need an active hook to Azure Blob Storage. I'm not sure how to create this. Some sources say you need to create the hook in the UI, and some say you can use an environment variable. Either way, none of my logs are getting written to blob store and I'm at my wits' end.
The Azure Blob Store hook (or any hook, for that matter) tells Airflow how to write to Azure Blob Store. It is already included in recent versions of Airflow as wasb_hook.
You will need to make sure that the hook is able to write to Azure Blob Store, and that REMOTE_BASE_LOG_FOLDER starts with wasb (e.g. wasb-xxx) so that Airflow picks the WASB log handler. Once you take care of these two things, the instructions work without a hitch.
I achieved writing logs to blob storage using the steps below:
Create a folder named config inside the airflow folder
Create empty __init__.py and log_config.py files inside the config folder
Search for airflow_local_settings.py on your machine:
/home/user/env/lib/python2.7/site-packages/airflow/config_templates/airflow_local_settings.py
/home/user/env/lib/python2.7/site-packages/airflow/config_templates/airflow_local_settings.pyc
run
cp /home/user/env/lib/python2.7/site-packages/airflow/config_templates/airflow_local_settings.py config/log_config.py
Edit airflow.cfg [core] section
remote_logging = True
remote_log_conn_id = log_sync
remote_base_log_folder = wasb://airflow-logs@storage-account.blob.core.windows.net/logs/
logging_config_class = log_config.DEFAULT_LOGGING_CONFIG
Add a log_sync connection (connection type wasb) pointing at your storage account, either in the UI under Admin -> Connections or programmatically, as sketched below
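One way to create that connection programmatically, as a sketch against the Airflow metadata database (the conn_id, storage account, and access key are placeholders, and the exact Connection fields can differ between Airflow versions):
from airflow.models import Connection
from airflow.settings import Session

# Register the wasb connection that the remote logging handler will use.
session = Session()
session.add(Connection(
    conn_id="log_sync",
    conn_type="wasb",
    login="<storage-account>",
    password="<access-key>",
))
session.commit()
session.close()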
install airflow azure dependency
pip install apache-airflow[azure]
Restart webserver and scheduler

azure HDInsight script action

I am trying to copy a file from an accessible data lake to blob storage while spinning up the cluster.
I am using this command from the Azure documentation:
hadoop distcp adl://data_lake_store_account.azuredatalakestore.net:443/myfolder wasb://container_name@storage_account_name.blob.core.windows.net/example/data/gutenberg
Now, if I want to automate this instead of hardcoding it, how do I use this in a script action? To be specific, how can I dynamically get the container name and storage_account_name associated with the cluster while it is spinning up?
First, some background:
A Script Action is simply a Bash script that you provide a URI to, and parameters for. The script runs on nodes in the HDInsight cluster.
So you just need to refer to the official tutorial Script action development with HDInsight to write your script action and learn how to run it, or you can call the REST API Run Script Actions on a running cluster (Linux cluster only) to run it automatically.
To dynamically get the container name and storage account name, a way that works from any language is to call the Get configurations REST API and extract the property you want from the core-site section of the JSON response, or to call the Get configuration REST API with core-site as the {configuration Type} parameter in the URL and extract the property you want from the JSON response.
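As an illustration only, here is a hedged Python sketch of that approach against the cluster's Ambari endpoint; the cluster name, admin password, and the exact JSON shape are assumptions, and it presumes fs.defaultFS is a wasb URI:
import re
import requests

cluster = "<cluster-name>"
base = "https://{0}.azurehdinsight.net/api/v1/clusters/{0}".format(cluster)
auth = ("admin", "<cluster-password>")

# Look up the active core-site tag, then fetch that configuration version.
tag = requests.get(base, auth=auth).json()["Clusters"]["desired_configs"]["core-site"]["tag"]
core_site = requests.get(
    base + "/configurations?type=core-site&tag=" + tag, auth=auth
).json()["items"][0]["properties"]

# fs.defaultFS typically looks like wasb://<container>@<account>.blob.core.windows.net
match = re.match(r"wasbs?://([^@]+)@([^.]+)\.blob\.core\.windows\.net",
                 core_site["fs.defaultFS"])
container_name, storage_account_name = match.group(1), match.group(2)
print(container_name, storage_account_name)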
Hope it helps.

How to run oozie jobs in HDInsight cluster?

I have an Oozie workflow that I'd like to run on an HDInsight cluster. My job has a jar file as well as a workflow.xml file that I store in Azure blob storage. However, the only way I found to store the job.config file is on the local storage of the HDInsight headnode. My concern is what happens when the VM gets re-imaged: does it remove my job.config file?
In general, you can use Script Actions on HDInsight. Script actions perform customization on the HDInsight clusters during provisioning. So every time the cluster is created, the scripts will be run. (You were smart to be concerned about what happens when the cluster is re-created!)
These advanced configuration options show how to customize an HDInsight cluster during the provisioning process using PowerShell. There is an Oozie section:
# oozie-site.xml configuration
$OozieConfigValues = new-object 'Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.DataObjects.AzureHDInsightOozieConfiguration'
$OozieConfigValues.Configuration = @{ "oozie.service.coord.normal.default.timeout"="150" } # default 120
Does that help?
Other resources:
Customizing HDInsight Cluster provisioning
Oozie tutorial on HDInsight
