> I am facing this issue intermittently in a Dataproc Spark job on GCP, for both long-running jobs (3-4 hrs) and short-running jobs (20-25 min). Below are the Spark configuration and the DAG's cluster config for the GCP cluster:
spark.cores.max="80"
master="yarn"
spark.executor.memory="8G"
spark.task.maxFailures="50"
spark.driver.maxResultSize="10g"
spark.hadoop.dfs.replication="1"
MASTER_MACHINE_TYPE = "n1-standard-4"
WORKER_MACHINE_TYPE = "n1-standard-16"
# Dataproc cluster definition
CLUSTER_CONFIG = {
    "gce_cluster_config": {
        "subnetwork_uri": SUBNETWORK_URI,
        "internal_ip_only": True,
        "service_account": GCP_SERVICE_ACCOUNT,
    },
    "master_config": {
        "num_instances": 3,
        "machine_type_uri": MASTER_MACHINE_TYPE,
        "disk_config": {"boot_disk_type": "pd-standard", "boot_disk_size_gb": 1024},
        "image_uri": IMAGE_URI,
    },
    "worker_config": {
        "num_instances": 10,
        "machine_type_uri": WORKER_MACHINE_TYPE,
        "disk_config": {"boot_disk_type": "pd-standard", "boot_disk_size_gb": 1024},
        "image_uri": IMAGE_URI,
    },
    "software_config": {
        "properties": {
            "dataproc:dataproc.allow.zero.workers": "true",
            "yarn:yarn.resourcemanager.scheduler.class":
                "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler",
            "yarn:yarn.nodemanager.resource.cpu-vcores": "48",
        }
    },
    "endpoint_config": {
        "enable_http_port_access": True,
    },
    "lifecycle_config": {
        "idle_delete_ttl": {"seconds": 15 * 60},
        "auto_delete_ttl": {"seconds": 480 * 60},
    },
}
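For reference, here is a minimal sketch (not the poster's actual DAG) of how a CLUSTER_CONFIG like this is typically wired into an Airflow DAG via the Google provider's DataprocCreateClusterOperator; the project, region, and cluster-name values are hypothetical placeholders:

# Minimal sketch, assuming the Google provider is installed; PROJECT_ID, REGION
# and CLUSTER_NAME below are hypothetical placeholders, not values from the question.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateClusterOperator

PROJECT_ID = "my-project"             # placeholder
REGION = "us-central1"                # placeholder
CLUSTER_NAME = "my-dataproc-cluster"  # placeholder

with DAG(
    dag_id="dataproc_cluster_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_dataproc_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config=CLUSTER_CONFIG,  # the cluster definition shown above
    )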
I am trying to update a batch of jobs to use some instance pools with the Databricks API, and when I use the Update endpoint the job just does not update. The call reports that it executed without errors, but when I check the job, nothing has changed.
What am I doing wrong?
What I did to update the job (a sketch of the equivalent API calls follows the documentation link below):
I called the Get endpoint with the job_id to retrieve the job's settings.
I updated the resulting data with the values I needed and then called the Update endpoint. The fields I changed were:
'custom_tags': {'ResourceClass': 'Serverless'},
'driver_instance_pool_id': 'my-pool-id',
'driver_node_type_id': None,
'instance_pool_id': 'my-other-pool-id',
'node_type_id': None
I followed this documentation: https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsUpdate
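For context, here is a hedged sketch of that flow in Python using the requests library; the host, token, and exact request bodies are my assumptions, not taken from the question:

# Hypothetical sketch of the Get -> modify -> Update flow; HOST, TOKEN and JOB_ID are placeholders.
import requests

HOST = "https://<workspace>.cloud.databricks.com"  # placeholder
TOKEN = "dapi..."                                  # placeholder personal access token
JOB_ID = 123123123123

headers = {"Authorization": f"Bearer {TOKEN}"}

# 1. Fetch the current job definition.
job = requests.get(f"{HOST}/api/2.1/jobs/get", headers=headers,
                   params={"job_id": JOB_ID}).json()

# 2. Point the job cluster at the instance pools.
new_cluster = job["settings"]["job_clusters"][0]["new_cluster"]
new_cluster.update({
    "custom_tags": {"ResourceClass": "Serverless"},
    "driver_instance_pool_id": "my-pool-id",
    "driver_node_type_id": None,
    "instance_pool_id": "my-other-pool-id",
    "node_type_id": None,
})

# 3. Partial update: only the settings passed in "new_settings" are merged in.
resp = requests.post(f"{HOST}/api/2.1/jobs/update", headers=headers,
                     json={"job_id": JOB_ID, "new_settings": job["settings"]})
resp.raise_for_status()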
Here is my payload:
{
  "created_time": 1672165913242,
  "creator_user_name": "email#email.com",
  "job_id": 123123123123,
  "run_as_owner": true,
  "run_as_user_name": "email#email.com",
  "settings": {
    "email_notifications": {
      "no_alert_for_skipped_runs": false,
      "on_failure": [
        "email1#email.com",
        "email2#email.com"
      ]
    },
    "format": "MULTI_TASK",
    "job_clusters": [
      {
        "job_cluster_key": "the_cluster_key",
        "new_cluster": {
          "autoscale": {
            "max_workers": 4,
            "min_workers": 2
          },
          "aws_attributes": {
            "availability": "SPOT_WITH_FALLBACK",
            "ebs_volume_count": 0,
            "first_on_demand": 1,
            "instance_profile_arn": "arn:aws:iam::XXXXXXXXXX:instance-profile/instance-profile",
            "spot_bid_price_percent": 100,
            "zone_id": "us-east-1a"
          },
          "cluster_log_conf": {
            "s3": {
              "canned_acl": "bucket-owner-full-control",
              "destination": "s3://some-bucket/log/log_123123123/",
              "enable_encryption": true,
              "region": "us-east-1"
            }
          },
          "cluster_name": "",
          "custom_tags": {
            "ResourceClass": "Serverless"
          },
          "data_security_mode": "SINGLE_USER",
          "driver_instance_pool_id": "my-driver-pool-id",
          "enable_elastic_disk": true,
          "instance_pool_id": "my-worker-pool-id",
          "runtime_engine": "PHOTON",
          "spark_conf": {...},
          "spark_env_vars": {...},
          "spark_version": "..."
        }
      }
    ],
    "max_concurrent_runs": 1,
    "name": "my_job",
    "schedule": {...},
    "tags": {...},
    "tasks": [{...},{...},{...}],
    "timeout_seconds": 79200,
    "webhook_notifications": {}
  }
}
I tried the Update endpoint and re-read the docs for information, but I found nothing related to this issue.
I finally got it.
I was using the partial update endpoint and found that it does not work when you send the whole job payload.
So I switched to the full update endpoint (reset) and it worked.
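In other words (a hedged sketch, reusing the placeholder HOST/TOKEN/JOB_ID from the earlier snippet): instead of POSTing the payload to /api/2.1/jobs/update, send it to /api/2.1/jobs/reset, which replaces the job's entire settings object with whatever is passed in new_settings.

# Full update (reset): replaces the entire "settings" object of the job.
resp = requests.post(
    f"{HOST}/api/2.1/jobs/reset",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"job_id": JOB_ID, "new_settings": job["settings"]},
)
resp.raise_for_status()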
I'm running Airflow on Kubernetes using this Helm chart: https://github.com/apache/airflow/tree/1.5.0
I've written a very simple DAG just to test some things. It looks like this:
# imports assumed; adjust to your Airflow / provider version
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator

default_args = {
    'depends_on_past': False,
    'email': ['airflow#example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG(
    'my-dag',
    default_args=default_args,
    description='simple dag',
    schedule_interval=timedelta(days=1),
    start_date=datetime(2022, 4, 21),
    catchup=False,
    tags=['example']
) as dag:
    t1 = SparkKubernetesOperator(
        task_id='spark-pi',
        trigger_rule="all_success",
        depends_on_past=False,
        retries=3,
        application_file="spark-pi.yaml",
        namespace="my-ns",
        kubernetes_conn_id="myk8s",
        api_group="sparkoperator.k8s.io",
        api_version="v1beta2",
        do_xcom_push=True,
        dag=dag
    )
    t2 = SparkKubernetesOperator(
        task_id='other-spark-job',
        trigger_rule="all_success",
        depends_on_past=False,
        retries=3,
        application_file=other_spark_job_definition,  # placeholder for the second job's application file
        namespace="my-ns",
        kubernetes_conn_id="myk8s",
        api_group="sparkoperator.k8s.io",
        api_version="v1beta2",
        dag=dag
    )
    t1 >> t2
When I run the DAG from the Airflow UI, the Spark job of the first task (t1, spark-pi) gets created and is immediately marked as successful, after which Airflow launches the second task (t2) right away. This can be seen in the web UI:
(The screenshot shows the status of the two tasks across 5 separate DAG runs, as well as their overall status in the circles. The middle row of the image shows the status of t1, which is "success".)
However, the actual spark-pi pod of t1 launched by the Spark operator fails on every run, and its status can be seen by querying the SparkApplication resource on Kubernetes:
$ kubectl get sparkapplications/spark-pi-2022-04-28-2 -n my-ns -o json
{
  "apiVersion": "sparkoperator.k8s.io/v1beta2",
  "kind": "SparkApplication",
  "metadata": {
    "creationTimestamp": "2022-04-29T13:28:02Z",
    "generation": 1,
    "name": "spark-pi-2022-04-28-2",
    "namespace": "my-ns",
    "resourceVersion": "111463226",
    "uid": "23f1c8fb-7843-4628-b22f-7808b562f9d8"
  },
  "spec": {
    "driver": {
      "coreLimit": "1500m",
      "cores": 1,
      "labels": {
        "version": "2.4.4"
      },
      "memory": "512m",
      "volumeMounts": [
        {
          "mountPath": "/tmp",
          "name": "test-volume"
        }
      ]
    },
    "executor": {
      "coreLimit": "1500m",
      "cores": 1,
      "instances": 1,
      "labels": {
        "version": "2.4.4"
      },
      "memory": "512m",
      "volumeMounts": [
        {
          "mountPath": "/tmp",
          "name": "test-volume"
        }
      ]
    },
    "image": "my.google.artifactory.com/spark-operator/spark:v2.4.4",
    "imagePullPolicy": "Always",
    "mainApplicationFile": "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar",
    "mainClass": "org.apache.spark.examples.SparkPi",
    "mode": "cluster",
    "restartPolicy": {
      "type": "Never"
    },
    "sparkVersion": "2.4.4",
    "type": "Scala",
    "volumes": [
      {
        "hostPath": {
          "path": "/tmp",
          "type": "Directory"
        },
        "name": "test-volume"
      }
    ]
  },
  "status": {
    "applicationState": {
      "errorMessage": "driver container failed with ExitCode: 1, Reason: Error",
      "state": "FAILED"
    },
    "driverInfo": {
      "podName": "spark-pi-2022-04-28-2-driver",
      "webUIAddress": "172.20.23.178:4040",
      "webUIPort": 4040,
      "webUIServiceName": "spark-pi-2022-04-28-2-ui-svc"
    },
    "executionAttempts": 1,
    "lastSubmissionAttemptTime": "2022-04-29T13:28:15Z",
    "sparkApplicationId": "spark-3335e141a51148d7af485457212eb389",
    "submissionAttempts": 1,
    "submissionID": "021e78fc-4754-4ac8-a87d-52c682ddc483",
    "terminationTime": "2022-04-29T13:28:25Z"
  }
}
As you can see in the status section, we have "state": "FAILED". Still, Airflow marks it as successful and thus runs t2 right after it, which is not what we want when defining t2 as dependent on (downstream of) t1.
Why does Airflow see t1 as successful even though the Spark job itself fails?
That's just how the operator is implemented. If you look at its code, it is basically a submit-and-forget job. To monitor the status you use SparkKubernetesSensor:
t2 = SparkKubernetesSensor(
    task_id="spark_monitor",
    application_name="{{ task_instance.xcom_pull(task_ids='spark-job-full-refresh.spark_full_refresh') ['metadata']['name'] }}",
    attach_log=True,
)
I have tried to create a custom operator that combines both, but it does not work well via inheritance because they follow slightly different execution patterns, so it would need to be written from scratch. For all intents and purposes, though, the Sensor works perfectly; it just adds a few extra lines of code.
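Applied to the DAG from the question, a minimal sketch might look like this (the import path and parameter names follow the cncf.kubernetes provider and are assumptions; t1 already pushes the created SparkApplication to XCom because do_xcom_push=True, so its generated name can be pulled by the sensor):

# Sketch only: insert a sensor between t1 and t2 so the DAG fails when the Spark job fails.
from airflow.providers.cncf.kubernetes.sensors.spark_kubernetes import SparkKubernetesSensor

t1_monitor = SparkKubernetesSensor(
    task_id="spark-pi-monitor",
    application_name="{{ task_instance.xcom_pull(task_ids='spark-pi')['metadata']['name'] }}",
    namespace="my-ns",
    kubernetes_conn_id="myk8s",
    attach_log=True,
    dag=dag,
)

# replaces the original t1 >> t2 dependency
t1 >> t1_monitor >> t2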
I have 2 JSON files: one contains policies and the other contains clusters with custom configurations. If a cluster has a policy_id key, it should be merged/joined with the matching policy to pick up that policy's default configuration; if not, the base cluster should be returned unchanged.
cluster.json
[
  {
    "name": "a",
    "memory": 16
  },
  {
    "name": "b",
    "memory": 16,
    "policy_id": 2
  }
]
policies.json
[
  {
    "policy_id": 1,
    "policy_name": "test",
    "policy_cores": 4
  },
  {
    "policy_id": 2,
    "policy_name": "test2",
    "policy_cores": 8
  }
]
So the expected result should be something like this: the "a" cluster stays the same because it doesn't have a policy_id key, while the "b" cluster ends up with its own values plus the policy's values:
[
  {
    "name": "a",
    "memory": 16
  },
  {
    "name": "b",
    "memory": 16,
    "policy_id": 2,
    "policy_name": "test2",
    "policy_cores": 8
  }
]
I was trying to do it in a locals block, but I don't know how to do nested for loops combined with the conditional. Sorry for the pseudo-code below; I normally code in Python, so Terraform is still unfamiliar to me (a Python sketch of the intended logic follows further down).
locals {
  # get jsons
  policies = jsondecode(file("${path.module}/policies.json"))
  clusters = jsondecode(file("${path.module}/clusters.json"))

  #pseudo-code to express the logic, sorry im still learning Terraform
  aux_clusters = [
    for cluster in local.clusters : {
      if try(cluster.policy_id, null) != null : {
        #if policy_id key exists, then merge with the respective policy
        for k, v in local.policies : {
          k => merge(v, cluster) if v.policy_id == cluster.policy_id
        }
      } else {
        #if policy_id key doesnt exist just return the base cluster
        cluster
      }
    }
  ]
}
Thank you...
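Since the question mentions being more comfortable in Python, here is a short Python sketch of the intended merge semantics only (it is not a Terraform answer; the file names match the JSON files referenced in the locals block):

# Python sketch of the intended semantics: enrich each cluster with its policy, if any.
import json

with open("clusters.json") as f:
    clusters = json.load(f)
with open("policies.json") as f:
    policies = json.load(f)

policies_by_id = {p["policy_id"]: p for p in policies}

aux_clusters = [
    # merge the matching policy's defaults into the cluster when policy_id is present,
    # otherwise keep the base cluster unchanged
    {**policies_by_id[c["policy_id"]], **c} if "policy_id" in c else c
    for c in clusters
]

print(json.dumps(aux_clusters, indent=2))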
I am writing a program to analyze SQL queries, so I am using the Spark logical plan.
Below is the code I am using:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.slf4j.LoggerFactory

object QueryAnalyzer {
  val LOG = LoggerFactory.getLogger(this.getClass)

  // Spark conf
  val conf = new SparkConf().setMaster("local[2]").setAppName("LocalEdlExecutor")

  // Spark context
  val sc = new SparkContext(conf)

  // SQL context
  val sqlContext = new SQLContext(sc)

  // Spark session
  val sparkSession = SparkSession
    .builder()
    .appName("Spark User Data")
    .config("spark.app.name", "LocalEdl")
    .getOrCreate()

  def main(args: Array[String]): Unit = {
    var inputDfColumns = Map[String, List[String]]()

    val dfSession = sparkSession.
      read.
      format("csv").
      option("header", EdlConstants.TRUE).
      option("inferschema", EdlConstants.TRUE).
      option("delimiter", ",").
      option("decoding", EdlConstants.UTF8).
      option("multiline", true)

    var oDF = dfSession.
      load("C:\\Users\\tarun.khaneja\\data\\order.csv")
    println("sample data in oDF====>")
    oDF.show()

    var cusDF = dfSession.
      load("C:\\Users\\tarun.khaneja\\data\\customer.csv")
    println("sample data in cusDF====>")
    cusDF.show()

    oDF.createOrReplaceTempView("orderTempView")
    cusDF.createOrReplaceTempView("customerTempView")

    // get input columns from each dataframe
    inputDfColumns += ("orderTempView" -> oDF.columns.toList)
    inputDfColumns += ("customerTempView" -> cusDF.columns.toList)

    val res = sqlContext.sql("""select OID, max(MID+CID) as MID_new, ROW_NUMBER() OVER (
        ORDER BY CID) as rn from
        (select OID_1 as OID, CID_1 as CID, OID_1+CID_1 as MID from
          (select min(ot.OrderID) as OID_1, ct.CustomerID as CID_1
           from orderTempView as ot inner join customerTempView as ct
           on ot.CustomerID = ct.CustomerID group by CID_1)) group by OID, CID""")

    println(res.show(false))

    val analyzedPlan = res.queryExecution.analyzed
    println(analyzedPlan.prettyJson)
  }
}
Now the problem is this: with Spark 2.2.1 I get the JSON below, in which SubqueryAlias carries the important information about the alias names of the tables used in the query, as shown here.
...
...
...
[ {
  "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
  "num-children" : 0,
  "name" : "OrderDate",
  "dataType" : "string",
  "nullable" : true,
  "metadata" : { },
  "exprId" : {
    "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
    "id" : 2,
    "jvmId" : "acefe6e6-e469-4c9a-8a36-5694f054dc0a"
  },
  "isGenerated" : false
} ] ]
}, {
  "class" : "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias",
  "num-children" : 1,
  "alias" : "ct",
  "child" : 0
}, {
  "class" : "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias",
  "num-children" : 1,
  "alias" : "customertempview",
  "child" : 0
}, {
  "class" : "org.apache.spark.sql.execution.datasources.LogicalRelation",
  "num-children" : 0,
  "relation" : null,
  "output" :
...
...
...
But with Spark 2.4, the SubqueryAlias name comes back as null, as shown in the JSON below.
...
...
{
  "class": "org.apache.spark.sql.catalyst.expressions.AttributeReference",
  "num-children": 0,
  "name": "CustomerID",
  "dataType": "integer",
  "nullable": true,
  "metadata": {},
  "exprId": {
    "product-class": "org.apache.spark.sql.catalyst.expressions.ExprId",
    "id": 19,
    "jvmId": "3b0dde0c-0b8f-4c63-a3ed-4dba526f8331"
  },
  "qualifier": "[ct]"
}]
}, {
  "class": "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias",
  "num-children": 1,
  "name": null,
  "child": 0
}, {
  "class": "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias",
  "num-children": 1,
  "name": null,
  "child": 0
}, {
  "class": "org.apache.spark.sql.execution.datasources.LogicalRelation",
  "num-children": 0,
  "relation": null,
  "output":
...
...
So I am not sure whether this is a bug in Spark 2.4 that causes the name to come back as null in SubqueryAlias.
Or, if it is not a bug, how can I get the mapping between an alias name and the real table name?
Any ideas on this?
I'm using a custom sink in Structured Streaming (Spark 2.2.0) and noticed that Spark reports incorrect metrics for the number of input rows: it's always zero.
My stream construction:
StreamingQuery writeStream = session
    .readStream()
    .schema(RecordSchema.fromClass(TestRecord.class))
    .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
    .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
    .csv(s3Path.toString())
    .as(Encoders.bean(TestRecord.class))
    .flatMap(
        ((FlatMapFunction<TestRecord, TestOutputRecord>) (u) -> {
            List<TestOutputRecord> list = new ArrayList<>();
            try {
                TestOutputRecord result = transformer.convert(u);
                list.add(result);
            } catch (Throwable t) {
                System.err.println("Failed to convert a record");
                t.printStackTrace();
            }
            return list.iterator();
        }),
        Encoders.bean(TestOutputRecord.class))
    .map(new DataReinforcementMapFunction<>(), Encoders.bean(TestOutputRecord.class))
    .writeStream()
    .trigger(Trigger.ProcessingTime(WRITE_FREQUENCY, TimeUnit.SECONDS))
    .format(MY_WRITER_FORMAT)
    .outputMode(OutputMode.Append())
    .queryName("custom-sink-stream")
    .start();

writeStream.processAllAvailable();
writeStream.stop();
Logs:
Streaming query made progress: {
  "id" : "a8a7fbc2-0f06-4197-a99a-114abae24964",
  "runId" : "bebc8a0c-d3b2-4fd6-8710-78223a88edc7",
  "name" : "custom-sink-stream",
  "timestamp" : "2018-01-25T18:39:52.949Z",
  "numInputRows" : 0,
  "inputRowsPerSecond" : 0.0,
  "processedRowsPerSecond" : 0.0,
  "durationMs" : {
    "getOffset" : 781,
    "triggerExecution" : 781
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "FileStreamSource[s3n://test-bucket/test]",
    "startOffset" : {
      "logOffset" : 0
    },
    "endOffset" : {
      "logOffset" : 0
    },
    "numInputRows" : 0,
    "inputRowsPerSecond" : 0.0,
    "processedRowsPerSecond" : 0.0
  } ],
  "sink" : {
    "description" : "com.mycompany.spark.MySink#f82a99"
  }
}
Do I have to populate any metrics in my custom sink to be able to track progress? Or could it be a problem in FileStreamSource when it reads from an S3 bucket?
The problem was caused by using dataset.rdd in my custom sink: it creates a new plan that StreamExecution doesn't know about, so it cannot collect metrics for it.
Replacing data.rdd.mapPartitions with data.queryExecution.toRdd.mapPartitions fixes the issue.