I'm running Airflow on Kubernetes using this Helm chart: https://github.com/apache/airflow/tree/1.5.0
I've written a very simple DAG just to test some things. It looks like this:
# imports assumed: standard library plus the cncf.kubernetes provider package
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator

default_args = {
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG(
    'my-dag',
    default_args=default_args,
    description='simple dag',
    schedule_interval=timedelta(days=1),
    start_date=datetime(2022, 4, 21),
    catchup=False,
    tags=['example']
) as dag:

    t1 = SparkKubernetesOperator(
        task_id='spark-pi',
        trigger_rule="all_success",
        depends_on_past=False,
        retries=3,
        application_file="spark-pi.yaml",
        namespace="my-ns",
        kubernetes_conn_id="myk8s",
        api_group="sparkoperator.k8s.io",
        api_version="v1beta2",
        do_xcom_push=True,  # the created SparkApplication is pushed to XCom
        dag=dag
    )

    t2 = SparkKubernetesOperator(
        task_id='other-spark-job',
        trigger_rule="all_success",
        depends_on_past=False,
        retries=3,
        application_file=other_spark_job_definition,  # placeholder for the second job's application file
        namespace="my-ns",
        kubernetes_conn_id="myk8s",
        api_group="sparkoperator.k8s.io",
        api_version="v1beta2",
        dag=dag
    )

    t1 >> t2
When I run the DAG from the Airflow UI, the Spark job for the first task (t1, spark-pi) gets created and the task is immediately marked as successful, and then Airflow launches the second task (t2) right after it. This can be seen in the web UI:
What you're seeing is the status of the two tasks in 5 separate DAG runs, as well as their total status (the circles). The middle row of the image shows the status of t1, which is "success".
However, the actual spark-pi pod of t1 launched by the Spark operator fails on every run, as can be seen by querying the SparkApplication resource on Kubernetes:
$ kubectl get sparkapplications/spark-pi-2022-04-28-2 -n my-ns -o json
{
"apiVersion": "sparkoperator.k8s.io/v1beta2",
"kind": "SparkApplication",
"metadata": {
"creationTimestamp": "2022-04-29T13:28:02Z",
"generation": 1,
"name": "spark-pi-2022-04-28-2",
"namespace": "my-ns",
"resourceVersion": "111463226",
"uid": "23f1c8fb-7843-4628-b22f-7808b562f9d8"
},
"spec": {
"driver": {
"coreLimit": "1500m",
"cores": 1,
"labels": {
"version": "2.4.4"
},
"memory": "512m",
"volumeMounts": [
{
"mountPath": "/tmp",
"name": "test-volume"
}
]
},
"executor": {
"coreLimit": "1500m",
"cores": 1,
"instances": 1,
"labels": {
"version": "2.4.4"
},
"memory": "512m",
"volumeMounts": [
{
"mountPath": "/tmp",
"name": "test-volume"
}
]
},
"image": "my.google.artifactory.com/spark-operator/spark:v2.4.4",
"imagePullPolicy": "Always",
"mainApplicationFile": "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar",
"mainClass": "org.apache.spark.examples.SparkPi",
"mode": "cluster",
"restartPolicy": {
"type": "Never"
},
"sparkVersion": "2.4.4",
"type": "Scala",
"volumes": [
{
"hostPath": {
"path": "/tmp",
"type": "Directory"
},
"name": "test-volume"
}
]
},
"status": {
"applicationState": {
"errorMessage": "driver container failed with ExitCode: 1, Reason: Error",
"state": "FAILED"
},
"driverInfo": {
"podName": "spark-pi-2022-04-28-2-driver",
"webUIAddress": "172.20.23.178:4040",
"webUIPort": 4040,
"webUIServiceName": "spark-pi-2022-04-28-2-ui-svc"
},
"executionAttempts": 1,
"lastSubmissionAttemptTime": "2022-04-29T13:28:15Z",
"sparkApplicationId": "spark-3335e141a51148d7af485457212eb389",
"submissionAttempts": 1,
"submissionID": "021e78fc-4754-4ac8-a87d-52c682ddc483",
"terminationTime": "2022-04-29T13:28:25Z"
}
}
As you can see in the status section, we have "state": "FAILED". Still, Airflow marks it as successful and thus runs t2 right after it, which is not what we want when defining t2 as dependent on (downstream of) t1.
Why does Airflow see t1 as successful even though the Spark job itself fails?
That's how the operator is implemented. If you look at its code, it is basically a submit-and-forget job: it creates the SparkApplication resource and does not wait for the result. To monitor the status, you use a SparkKubernetesSensor:
t2 = SparkKubernetesSensor(
    task_id="spark_monitor",
    application_name="{{ task_instance.xcom_pull(task_ids='spark-job-full-refresh.spark_full_refresh')['metadata']['name'] }}",
    attach_log=True,
)
I have tried to create a custom operator that combines both, but it does not work very well via inheritance because the two have slightly different execution patterns, so it would need to be written from scratch. For all intents and purposes, though, the sensor works perfectly; it just adds a few extra lines of code.
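For the DAG in the question, the submit-plus-monitor pattern might look roughly like the sketch below. Treat it as an illustration rather than a drop-in replacement: it assumes the cncf.kubernetes provider import paths, reuses the connection id and namespace from the question, and the monitor task name is made up.

from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator
from airflow.providers.cncf.kubernetes.sensors.spark_kubernetes import SparkKubernetesSensor

t1 = SparkKubernetesOperator(
    task_id='spark-pi',
    application_file="spark-pi.yaml",
    namespace="my-ns",
    kubernetes_conn_id="myk8s",
    api_group="sparkoperator.k8s.io",
    api_version="v1beta2",
    do_xcom_push=True,  # needed so the sensor can read the application name from XCom
    dag=dag,
)

# The sensor polls the SparkApplication until it reaches a terminal state
# and fails the task if the driver failed.
t1_monitor = SparkKubernetesSensor(
    task_id='spark-pi-monitor',
    application_name="{{ task_instance.xcom_pull(task_ids='spark-pi')['metadata']['name'] }}",
    namespace="my-ns",
    kubernetes_conn_id="myk8s",
    attach_log=True,
    dag=dag,
)

t1 >> t1_monitor >> t2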
I am trying to update a batch of jobs to use some instance pools with the Databricks API. When I use the update endpoint, the job just does not update: the call executes without errors, but when I check the job, it was not updated.
What am I doing wrong?
What I did to update the job:
I called the get endpoint with the job_id to fetch my job settings.
I updated the resulting data with the values I needed and called the update endpoint. The fields I changed were:
'custom_tags': {'ResourceClass': 'Serverless'},
'driver_instance_pool_id': 'my-pool-id',
'driver_node_type_id': None,
'instance_pool_id': 'my-other-pool-id',
'node_type_id': None
I followed this documentation: https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsUpdate
Here is my payload:
{
"created_time": 1672165913242,
"creator_user_name": "email#email.com",
"job_id": 123123123123,
"run_as_owner": true,
"run_as_user_name": "email#email.com",
"settings": {
"email_notifications": {
"no_alert_for_skipped_runs": false,
"on_failure": [
"email1#email.com",
"email2#email.com"
]
},
"format": "MULTI_TASK",
"job_clusters": [
{
"job_cluster_key": "the_cluster_key",
"new_cluster": {
"autoscale": {
"max_workers": 4,
"min_workers": 2
},
"aws_attributes": {
"availability": "SPOT_WITH_FALLBACK",
"ebs_volume_count": 0,
"first_on_demand": 1,
"instance_profile_arn": "arn:aws:iam::XXXXXXXXXX:instance-profile/instance-profile",
"spot_bid_price_percent": 100,
"zone_id": "us-east-1a"
},
"cluster_log_conf": {
"s3": {
"canned_acl": "bucket-owner-full-control",
"destination": "s3://some-bucket/log/log_123123123/",
"enable_encryption": true,
"region": "us-east-1"
}
},
"cluster_name": "",
"custom_tags": {
"ResourceClass": "Serverless"
},
"data_security_mode": "SINGLE_USER",
"driver_instance_pool_id": "my-driver-pool-id",
"enable_elastic_disk": true,
"instance_pool_id": "my-worker-pool-id",
"runtime_engine": "PHOTON",
"spark_conf": {...},
"spark_env_vars": {...},
"spark_version": "..."
}
}
],
"max_concurrent_runs": 1,
"name": "my_job",
"schedule": {...},
"tags": {...},
"tasks": [{...},{...},{...}],
"timeout_seconds": 79200,
"webhook_notifications": {}
}
}
I tried the update endpoint and read the docs for more information, but I found nothing related to the issue.
I finally got it. I was using the partial update endpoint and found that it does not work for the whole job payload. So I changed the call to use the full update endpoint (reset), and it worked.
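For illustration, here is a rough sketch of that working flow using Python's requests library. It assumes Jobs API 2.1; the host, token, and pool ids are placeholders.

import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                       # placeholder
JOB_ID = 123123123123

headers = {"Authorization": f"Bearer {TOKEN}"}

# 1. Fetch the current job definition.
job = requests.get(
    f"{HOST}/api/2.1/jobs/get",
    headers=headers,
    params={"job_id": JOB_ID},
).json()

# 2. Modify the settings locally: point the job cluster at the instance pools
#    and drop the explicit node types.
new_cluster = job["settings"]["job_clusters"][0]["new_cluster"]
new_cluster["instance_pool_id"] = "my-other-pool-id"
new_cluster["driver_instance_pool_id"] = "my-pool-id"
new_cluster.pop("node_type_id", None)
new_cluster.pop("driver_node_type_id", None)

# 3. Push the whole settings object back with the reset (full update) endpoint,
#    which replaces the job's settings entirely.
resp = requests.post(
    f"{HOST}/api/2.1/jobs/reset",
    headers=headers,
    json={"job_id": JOB_ID, "new_settings": job["settings"]},
)
resp.raise_for_status()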
I have a pipeline that pulls data from an external source and sinks it into a SQL Server table as staging. The process for getting the raw data already succeeds, using 4 'Copy data' activities; because there are so many columns (250), I split the load across them.
The next requirement is to validate those 4 'Copy data' activities by checking that each reports a succeeded status. The output of a 'Copy data' activity looks like this:
{
"dataRead": 4772214,
"dataWritten": 106918,
"sourcePeakConnections": 1,
"sinkPeakConnections": 1,
"rowsRead": 1366,
"rowsCopied": 1366,
"copyDuration": 8,
"throughput": 582.546,
"errors": [],
"effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (Southeast Asia)",
"usedDataIntegrationUnits": 4,
"billingReference": {
"activityType": "DataMovement",
"billableDuration": [
{
"meterType": "AzureIR",
"duration": 0.016666666666666666,
"unit": "DIUHours"
}
]
},
"usedParallelCopies": 1,
"executionDetails": [
{
"source": {
"type": "RestService"
},
"sink": {
"type": "AzureSqlDatabase",
"region": "Southeast Asia"
},
"status": "Succeeded",
"start": "2022-04-13T07:16:48.5905628Z",
"duration": 8,
"usedDataIntegrationUnits": 4,
"usedParallelCopies": 1,
"profile": {
"queue": {
"status": "Completed",
"duration": 4
},
"transfer": {
"status": "Completed",
"duration": 4,
"details": {
"readingFromSource": {
"type": "RestService",
"workingDuration": 1,
"timeToFirstByte": 1
},
"writingToSink": {
"type": "AzureSqlDatabase",
"workingDuration": 0
}
}
}
},
"detailedDurations": {
"queuingDuration": 4,
"timeToFirstByte": 1,
"transferDuration": 3
}
}
],
"dataConsistencyVerification": {
"VerificationResult": "NotVerified"
},
"durationInQueue": {
"integrationRuntimeQueue": 0
}
}
Now, I want to get "status": "Succeeded" from that JSON output to validate in the 'If Condition'. So I set the variable's value in the dynamic content to @activity('copy_data_Kobo_MBS').output,
but when it runs, I get this error:
The variable 'copy_Kobo_MBS' of type 'Boolean' cannot be initialized
or updated with value of type 'Object'. The variable 'copy_Kobo_MBS'
only supports values of types 'Boolean'.
So the question is: how do I get "status": "Succeeded" from the JSON output into the variable, so that the 'If Condition' can examine its value?
You can use the expression below to pull the run status from the copy data activity. As your variable is of Boolean type, you need to evaluate the status using the @equals() function, which returns true or false.
@equals(activity('Copy data1').output.executionDetails[0].status,'Succeeded')
That said, you don't have to extract the status from the copy data activity at all, because you are connecting the copy activity to the Set Variable activity on its success output.
That means the Set Variable activity runs only when the copy data activity has completed successfully.
Also, note that:
If the copy data activity (or any other activity) fails, the activities connected to its success output will not run.
If you connect the outputs of more than one activity to a single activity, it runs only when all of the connected activities have run successfully.
You can connect activities to the failure or completion outputs to handle further processing.
Example:
In the screenshot below, the Set Variable activity did not run because the copy data activity was not successful, and the Wait2 activity did not run because not all of its input activities ran successfully.
Need your suggestions on developing code in Azure Synapse.
We have a requirement where our jobs run in parallel at the same time and insert data into the same table.
During these inserts there is a chance that duplicate entries will be inserted into the same table.
For example: if Job A and Job B run at the same time, both with the same values, then "not exists" or "not in" checks will fail to catch the overlap, and I get duplicates from both jobs. Primary key and unique constraints are not enforced in Azure Synapse, so they allow duplicates. Is there a good way to lock the table during the insert, so that if Job A is running, Job B cannot insert into the same table? Please share your suggestions, as I am new to this. Note: we load the data through a stored procedure via ADF V2.
Thanks,
Nandini
Duplicates must be handled within each job before inserting data into Azure Synapse. If the duplicates exist between the two jobs, de-duplicate after both jobs have completed. It really depends on how you are loading the data. You can manage this easily by loading into a staging (temp) table instead of loading directly into the final table; just make sure the staging table has the same structure as the final table (distribution, partitioning, constraints, nullability of the columns). You can then use SQL BCP / INSERT INTO / CTAS / CTAS with partition switching to move the data from the staging table into the final table.
If you can share a specific scenario, it will be easier to give suggestions relevant to your use case.
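For illustration only, here is a rough sketch of the "de-duplicate after both jobs complete" option using CTAS, driven from Python with pyodbc. The table name, business_key and load_ts columns, distribution choice, and connection details are all hypothetical placeholders.

import pyodbc

# Hypothetical connection to a Synapse dedicated SQL pool (SQL authentication).
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=<workspace>.sql.azuresynapse.net;Database=<db>;UID=<user>;PWD=<password>;"
)
conn.autocommit = True  # run the DDL outside an explicit transaction
cur = conn.cursor()

# Rebuild the table keeping one row per business key (newest load wins),
# then swap it in with RENAME OBJECT.
cur.execute("""
CREATE TABLE dbo.final_table_dedup
WITH (DISTRIBUTION = HASH(business_key), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT business_key, col1, col2, load_ts
FROM (
    SELECT business_key, col1, col2, load_ts,
           ROW_NUMBER() OVER (PARTITION BY business_key ORDER BY load_ts DESC) AS rn
    FROM dbo.final_table
) AS ranked
WHERE rn = 1;
""")
cur.execute("RENAME OBJECT dbo.final_table TO final_table_old;")
cur.execute("RENAME OBJECT dbo.final_table_dedup TO final_table;")
cur.execute("DROP TABLE dbo.final_table_old;")
conn.close()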
I just hit the same case and solved it with the Pipeline Runs - Query By Factory API.
Use an Until activity before the Data Flow activity that writes the values to the table, with this expression: @equals(activity('pingPL').output.value[0].runId, pipeline().RunId), as follows:
Inside the Until activity, put a Web activity and a Wait activity:
a. Web activity body (following the docs):
{
"lastUpdatedAfter": "#{addminutes(utcnow(), -30)}",
"lastUpdatedBefore": "#{utcnow()}",
"filters": [
{
"operand": "PipelineName",
"operator": "Equals",
"values": [
"pipeline_name_where_writeInSynapse_is_located"
]
},
{
"operand": "Status",
"operator": "Equals",
"values": [
"InProgress"
]
}
]
}
b. Wait activity: 30 seconds, or whatever makes sense.
What is happening is: if you trigger the same pipeline several times in parallel, the Web activity filters for runs of that pipeline with status InProgress. The response will look like this:
{
"value": [
{
"id": "...",
"runId": "52004775-5ef5-493b-8a44-ee3fff6bff7b",
"debugRunId": null,
"runGroupId": "52004775-5ef5-493b-8a44-ee3fff6bff7b",
"pipelineName": "synapse_writting",
"parameters": {
"region": "NW",
"unique_item": "a"
},
"invokedBy": {
"id": "80efce4dbda74636878bc99472978ccf",
"name": "Manual",
"invokedByType": "Manual"
},
"runStart": "2021-10-13T17:24:01.0210945Z",
"runEnd": "2021-10-13T17:25:06.9692394Z",
"durationInMs": 65948,
"status": "InProgress",
"message": "",
"output": null,
"lastUpdated": "2021-10-13T17:25:06.9704432Z",
"annotations": [],
"runDimension": {},
"isLatest": true
},
{
"id": "...",
"runId": "cf3f5038-ba10-44c3-b8f5-df8ad4c85819",
"debugRunId": null,
"runGroupId": "cf3f5038-ba10-44c3-b8f5-df8ad4c85819",
"pipelineName": "synapse_writting",
"parameters": {
"region": "NW",
"unique_item": "a"
},
"invokedBy": {
"id": "08205e0eda0b41f6b5a90a8dda06a7f6",
"name": "Manual",
"invokedByType": "Manual"
},
"runStart": "2021-10-13T17:28:58.219611Z",
"runEnd": null,
"durationInMs": null,
"status": "InProgress",
"message": "",
"output": null,
"lastUpdated": "2021-10-13T17:29:00.9860175Z",
"annotations": [],
"runDimension": {},
"isLatest": true
}
],
"ADFWebActivityResponseHeaders": {
"Pragma": "no-cache",
"Strict-Transport-Security": "max-age=31536000; includeSubDomains",
"X-Content-Type-Options": "nosniff",
"x-ms-ratelimit-remaining-subscription-reads": "11999",
"x-ms-request-id": "188508ef-8897-4c21-8c37-ccdd4adc6d81",
"x-ms-correlation-request-id": "188508ef-8897-4c21-8c37-ccdd4adc6d81",
"x-ms-routing-request-id": "WESTUS2:20211013T172902Z:188508ef-8897-4c21-8c37-ccdd4adc6d81",
"Cache-Control": "no-cache",
"Date": "Wed, 13 Oct 2021 17:29:02 GMT",
"Server": "Microsoft-IIS/10.0",
"X-Powered-By": "ASP.NET",
"Content-Length": "1492",
"Content-Type": "application/json; charset=utf-8",
"Expires": "-1"
},
"effectiveIntegrationRuntime": "NCAP-Simple-DataMovement (West US 2)",
"executionDuration": 0,
"durationInQueue": {
"integrationRuntimeQueue": 0
},
"billingReference": {
"activityType": "ExternalActivity",
"billableDuration": [
{
"meterType": "AzureIR",
"duration": 0.016666666666666666,
"unit": "Hours"
}
]
}
}
Then the Until expression evaluates whether the first element, value[0], has runId == pipeline().RunId; when it does, the Until activity stops and the data flow that writes to Synapse runs. Once that pipeline run ends, its status becomes Succeeded, so the Web activity in the next parallel run will see its own run as value[0] with status InProgress and continue with the next write. This makes the parallel jobs wait their turn before the data flow validates and writes to the table, if needed.
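For reference, the Web activity above is calling the Pipeline Runs - Query By Factory endpoint; the same query can be reproduced outside ADF, for example with Python's requests. The subscription, resource group, factory name, and bearer token below are placeholders.

from datetime import datetime, timedelta, timezone
import requests

SUBSCRIPTION = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY = "<data-factory-name>"
TOKEN = "<azure-ad-bearer-token>"

url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION}"
    f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.DataFactory"
    f"/factories/{FACTORY}/queryPipelineRuns?api-version=2018-06-01"
)

now = datetime.now(timezone.utc)
body = {
    "lastUpdatedAfter": (now - timedelta(minutes=30)).isoformat(),
    "lastUpdatedBefore": now.isoformat(),
    "filters": [
        {"operand": "PipelineName", "operator": "Equals",
         "values": ["pipeline_name_where_writeInSynapse_is_located"]},
        {"operand": "Status", "operator": "Equals", "values": ["InProgress"]},
    ],
}

runs = requests.post(url, headers={"Authorization": f"Bearer {TOKEN}"}, json=body).json()["value"]
# runs[0] plays the role of value[0] in the Until expression: a run proceeds
# only when its own runId is the first entry of the in-progress list.
print([(r["runId"], r["status"]) for r in runs])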
I know I can get the version of Spark (v2.2.1) that's running on the Spark master with this request:
http://<spark-master>:4040/api/v1/version
which will return something like
{
"spark" : "2.2.1"
}
However, I also want to check the version of Spark running on each worker. I know I can get a list of all workers like this:
http://<spark-master>:8080/json/
which will return a response similar to
{
"url": "spark://<spark-master>:7077",
"workers": [{
"id": "worker-20180228071440-<ip-address>-7078",
"host": "<ip-address>",
"port": 7078,
"webuiaddress": "http://<ip-address>:8081",
"cores": 8,
"coresused": 8,
"coresfree": 0,
"memory": 40960,
"memoryused": 35875,
"memoryfree": 5085,
"state": "ALIVE",
"lastheartbeat": 1519932580686
}, ...
],
"cores": 32,
"coresused": 32,
"memory": 163840,
"memoryused": 143500,
"activeapps": [{
"starttime": 1519830260440,
"id": "app-20180228070420-0000",
"name": "<spark-app-name>",
"user": "<spark-app-user>",
"memoryperslave": 35875,
"submitdate": "Wed Feb 28 07:04:20 PST 2018",
"state": "RUNNING",
"duration": 102328434
}
],
"completedapps": [],
"activedrivers": [],
"status": "ALIVE"
}
I'd like to use that information to query each Spark Worker's version. Is this possible?
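For context, this is roughly how I'm combining the two endpoints above today (the master hostname is a placeholder). Note that the worker entries returned by the master contain addresses and resource info but no version field, which is why I'm asking.

import requests

MASTER = "<spark-master>"  # placeholder

# Version reported by the application REST API on port 4040.
version = requests.get(f"http://{MASTER}:4040/api/v1/version").json()
print("master reports Spark", version["spark"])

# Worker list from the master's JSON endpoint on port 8080.
cluster = requests.get(f"http://{MASTER}:8080/json/").json()
for worker in cluster["workers"]:
    # Each entry has the worker's id, state, and web UI address, but no version.
    print(worker["id"], worker["state"], worker["webuiaddress"])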
I am loading a SQL Server table using ADF, and after the insert finishes I need to do a little manipulation. I have tried the approaches below:
Trigger (after insert) - failed: SQL Server does not detect the records inserted via ADF. **Seems to be a bug**.
Stored procedure using a user-defined table type - getting this error:
Error Number '156'. Error message from database execution: Incorrect syntax near the keyword 'select'. Must declare the table variable "@a".
I have created the pipeline below:
{
"name": "CopyPipeline-xxx",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "AzureDataLakeStoreSource",
"recursive": false
},
"sink": {
"type": "SqlSink",
"sqlWriterStoredProcedureName": "sp_xxx",
"storedProcedureParameters": {
"stringProductData": {
"value": "str1"
}
},
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "col1:col1,col2:col2"
}
},
"inputs": [
{
"name": "InputDataset-3jg"
}
],
"outputs": [
{
"name": "OutputDataset-3jg"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Hour",
"interval": 8
},
"name": "Activity-0-xxx_csv->[dbo]_[xxx_staging]"
}
],
"start": "2017-01-09T21:48:53.348Z",
"end": "2099-12-30T18:30:00Z",
"isPaused": false,
"hubName": "hub",
"pipelineMode": "Scheduled"
}
}
and I am using the stored procedure below:
create procedure [dbo].[sp_xxx]
    @xxx1 [dbo].[ut_xxx] READONLY,
    @str1 varchar(100)
AS
MERGE xxx_dummy AS a
USING @xxx1 AS b
    ON (a.col1 = b.col1)
WHEN NOT MATCHED THEN
    INSERT (col1, col2)
    VALUES (b.col1, b.col2)
WHEN MATCHED THEN
    UPDATE SET a.col2 = b.col2;
Please help me to resolve the issue.
I can reproduce your first error. Inserting into a SQL Server table with Azure Data Factory (ADF) appears to use a bulk insert method (similar to BULK INSERT, bcp, SSIS etc.), and by default these methods do not fire triggers:
insert bulk [dbo].[testADF] ([col1] Int, [col2] Int, [col3] Int, [col4] Int)
with (TABLOCK, CHECK_CONSTRAINTS)
With bcp and BULK INSERT there is a flag (FIRE_TRIGGERS) you can set to fire triggers, but there appears to be no way to change this setting for ADF. As a workaround, move the logic from your trigger into the stored proc.
If you believe this flag is important, consider creating a feedback item.