AWS ECS create service waits for user input while showing output on console - linux

I am using a CircleCI job to create an ECS service. Below is the AWS CLI command I use to create the service.
aws ecs create-service --cluster "test-cluster" --service-name testServiceName \
--task-definition testdef:1 \
--desired-count 1 --launch-type EC2
When this command runs, the following output appears and the CircleCI job fails.
{ress RETURN)
"service": {
"serviceArn": "arn:aws:ecs:*********:<account-id>:service/testServiceName",
"serviceName": "testServiceName",
"clusterArn": "arn:aws:ecs:*********:<account-id>:cluster/test-cluster",
"loadBalancers": [],
"serviceRegistries": [],
"status": "ACTIVE",
"desiredCount": 1,
"runningCount": 0,
"pendingCount": 0,
"launchType": "EC2",
"taskDefinition": "arn:aws:ecs:*********:<account-id>:task-definition/testdef*********:1",
"deploymentConfiguration": {
"maximumPercent": 200,
"minimumHealthyPercent": 100
},
"deployments": [
{
"id": "ecs-svc/1585305191116328179",
"status": "PRIMARY",
:
Too long with no output (exceeded 10m0s): context deadline exceeded
Running the command locally in a small terminal window gives the following output:
{
"service": {
"serviceArn": "arn:aws:ecs:<region>:<account-id>:service/testServiceName",
"serviceName": "testServiceName",
"clusterArn": "arn:aws:ecs:<region>:<account-id>:cluster/test-cluster",
"loadBalancers": [],
"serviceRegistries": [],
"status": "ACTIVE",
"desiredCount": 1,
"runningCount": 0,
"pendingCount": 0,
"launchType": "EC2",
"taskDefinition": "arn:aws:ecs:<region>:<account-id>:task-definition/testdef:1",
"deploymentConfiguration": {
"maximumPercent": 200,
"minimumHealthyPercent": 100
},
"deployments": [
{
"id": "ecs-svc/8313453507891259676",
"status": "PRIMARY",
"taskDefinition": "arn:aws:ecs:<region>:<account-id>:task-definition/testdef:1",
"desiredCount": 1,
:
Execution stops until I press a key, which is why the CircleCI job fails once it hits the 10-minute no-output limit. When I run the command locally in a full-screen terminal, it does not wait and prints the whole output.
Is there a way to run the command so that it does not wait for a key press and completes, so that the pipeline does not fail? Note that the ECS service itself is created successfully.
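The trailing ":" prompt and the leftover "(press RETURN)" text suggest the output is being piped through a pager such as less, which AWS CLI version 2 does by default when the output does not fit on the screen. Assuming that is the cause here, one way around it is to disable the pager by setting the AWS_PAGER environment variable to an empty string (version 2 also accepts a --no-cli-pager flag on the command line). The sketch below wraps the same call from a script; the function names are hypothetical and chosen only for illustration.

```python
import os
import subprocess


def build_create_service_call(cluster, service_name, task_definition, desired_count=1):
    """Return (argv, env) for an `aws ecs create-service` call with the pager disabled."""
    argv = [
        "aws", "ecs", "create-service",
        "--cluster", cluster,
        "--service-name", service_name,
        "--task-definition", task_definition,
        "--desired-count", str(desired_count),
        "--launch-type", "EC2",
    ]
    # An empty AWS_PAGER tells AWS CLI v2 not to send output through a pager.
    env = dict(os.environ, AWS_PAGER="")
    return argv, env


def create_service(cluster, service_name, task_definition, desired_count=1):
    """Run the command; raises CalledProcessError if the CLI exits non-zero."""
    argv, env = build_create_service_call(cluster, service_name, task_definition, desired_count)
    return subprocess.run(argv, env=env, check=True)
```

In a CircleCI config the same effect can be had by exporting AWS_PAGER="" in the job's environment before the aws command runs.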

Related

Azure Databricks CLI: update workflow/job definition

I have created a pipeline in Azure DevOps to perform the following three steps:
1. Retrieve the job definition from one Databricks workspace and save it as JSON (Databricks CLI config is omitted):
databricks jobs get --job-id $(job_id) > workflow.json
2. Use this JSON to update the workflow in a second (separate) Databricks workspace (the Databricks CLI is first reconfigured to point to the new workspace):
databricks jobs reset --job-id $(job_id) --json-file workflow.json
3. Run the updated job in the second Databricks workspace:
databricks jobs run-now --job-id $(job_id)
However, my pipeline fails at step 2 with the following error, even though the existing_cluster_id is already defined inside the workflow.json. Any idea?
Error: b'{"error_code":"INVALID_PARAMETER_VALUE","message":"One of job_cluster_key, new_cluster, or existing_cluster_id must be specified."}' 
Here is what my workflow.json looks like (hiding some of the details):
{
"job_id": 123,
"creator_user_name": "user1",
"run_as_user_name": "user1",
"run_as_owner": true,
"settings":
{
"name": "my-workflow",
"existing_cluster_id": "abc-def-123-xyz",
"email_notifications": {
"no_alert_for_skipped_runs": false
},
"webhook_notifications": {},
"timeout_seconds": 0,
"notebook_task": {
"notebook_path": "notebooks/my-notebook",
"base_parameters": {
"environment": "production"
},
"source": "GIT"
},
"max_concurrent_runs": 1,
"git_source": {
"git_url": "https://my-org@dev.azure.com/my-project/_git/my-repo",
"git_provider": "azureDevOpsServices",
"git_branch": "master"
},
"format": "SINGLE_TASK"
},
"created_time": 1676477563075
}
I figured out that the reset command does not want the entire job definition retrieved in step 1, only its "settings" part. In the full file, existing_cluster_id is nested under "settings", so it is not found where the command expects it. Modifying step 1 to this solved my issue:
databricks jobs get --job-id $(job_id) | jq .settings > workflow.json
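For pipelines where jq is not available, the same extraction can be sketched in Python. This is only an illustration of what `jq .settings` does to the file; the function name and the trimmed sample document are assumptions, not part of the Databricks CLI.

```python
import json


def extract_settings(job_definition: dict) -> dict:
    """Return only the "settings" object of a job definition, like `jq .settings`."""
    if "settings" not in job_definition:
        raise ValueError('job definition has no "settings" key')
    return job_definition["settings"]


# Trimmed-down version of the workflow.json shown above.
sample = {
    "job_id": 123,
    "creator_user_name": "user1",
    "settings": {
        "name": "my-workflow",
        "existing_cluster_id": "abc-def-123-xyz",
    },
    "created_time": 1676477563075,
}

# This string is what would be written to workflow.json for the reset step.
settings_json = json.dumps(extract_settings(sample), indent=2)
```

After this step existing_cluster_id sits at the top level of the JSON that is passed to the reset step.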

"Testlog - Get Test Result Logs" in Azure devops services- REST API for Manual Testing

I get the response
{"value":[],"count":0}
when I try to fetch the logs of test results and test runs for manual testing using the REST API.
Sample Requests:
Test Result:
GET https://vstmr.dev.azure.com/{organization}/{project}/_apis/testresults/runs/{runId}/results/{resultId}/testlog?type=generalAttachment&api-version=6.0-preview.1
Test Run:
GET https://vstmr.dev.azure.com/{organization}/{project}/_apis/testresults/runs/{runId}/testlog?type=generalAttachment&api-version=6.0-preview.1
I am looking for guidance on how to get the expected response, shown below:
{
"logReference": {
"scope": 0,
"buildId": 0,
"releaseId": 0,
"releaseEnvId": 0,
"runId": 1,
"resultId": 0,
"subResultId": 0,
"type": 1,
"filePath": "textAsFileAttachment.txt"
},
"modifiedOn": "/Date(123456789)/",
"size": 65826,
"metaData": {}
}
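To rule out a malformed URL when reproducing the empty response, the two sample requests can be generated from one helper. This is a minimal sketch that only rebuilds the URLs quoted above; the function name is hypothetical, and the default log type and API version are simply the values from the sample requests.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://vstmr.dev.azure.com"


def build_test_log_url(organization, project, run_id, result_id=None,
                       log_type="generalAttachment", api_version="6.0-preview.1"):
    """Build the test-run or test-result log URL from the sample requests."""
    path = f"{BASE_URL}/{quote(organization)}/{quote(project)}/_apis/testresults/runs/{run_id}"
    if result_id is not None:
        # Result-level logs add /results/{resultId} before /testlog.
        path += f"/results/{result_id}"
    query = urlencode({"type": log_type, "api-version": api_version})
    return f"{path}/testlog?{query}"
```

Calling it with only a run ID gives the "Test Run" form; passing result_id as well gives the "Test Result" form.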

Airflow doesn't pick up FAILED status of Spark job

I'm running Airflow on Kubernetes using this Helm chart: https://github.com/apache/airflow/tree/1.5.0
I've written a very simple DAG just to test some things. It looks like this:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator
default_args = {
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}
with DAG(
    'my-dag',
    default_args=default_args,
    description='simple dag',
    schedule_interval=timedelta(days=1),
    start_date=datetime(2022, 4, 21),
    catchup=False,
    tags=['example']
) as dag:
    # First task: submits the SparkApplication defined in spark-pi.yaml
    t1 = SparkKubernetesOperator(
        task_id='spark-pi',
        trigger_rule="all_success",
        depends_on_past=False,
        retries=3,
        application_file="spark-pi.yaml",
        namespace="my-ns",
        kubernetes_conn_id="myk8s",
        api_group="sparkoperator.k8s.io",
        api_version="v1beta2",
        do_xcom_push=True,
        dag=dag
    )
    # Second task: other-spark-job-definition is a placeholder, not a literal value
    t2 = SparkKubernetesOperator(
        task_id='other-spark-job',
        trigger_rule="all_success",
        depends_on_past=False,
        retries=3,
        application_file=other-spark-job-definition,
        namespace="my-ns",
        kubernetes_conn_id="myk8s",
        api_group="sparkoperator.k8s.io",
        api_version="v1beta2",
        dag=dag
    )
    # t2 should run only after t1 succeeds
    t1 >> t2
When I run the DAG from the Airflow UI, the first task Spark job (t1, spark-pi) gets created and is immediately marked as successful, and then Airflow launches the second (t2) task right after that. This can be seen in the web UI:
What you're seeing is the status of the two tasks in 5 separate DAG runs, as well as their total status (the circles). The middle row of the image shows the status of t1, which is "success".
However, the actual spark-pi pod of t1 launched by the Spark operator fails on every run. Its status can be seen by querying the SparkApplication resource on Kubernetes:
$ kubectl get sparkapplications/spark-pi-2022-04-28-2 -n my-ns -o json
{
"apiVersion": "sparkoperator.k8s.io/v1beta2",
"kind": "SparkApplication",
"metadata": {
"creationTimestamp": "2022-04-29T13:28:02Z",
"generation": 1,
"name": "spark-pi-2022-04-28-2",
"namespace": "my-ns",
"resourceVersion": "111463226",
"uid": "23f1c8fb-7843-4628-b22f-7808b562f9d8"
},
"spec": {
"driver": {
"coreLimit": "1500m",
"cores": 1,
"labels": {
"version": "2.4.4"
},
"memory": "512m",
"volumeMounts": [
{
"mountPath": "/tmp",
"name": "test-volume"
}
]
},
"executor": {
"coreLimit": "1500m",
"cores": 1,
"instances": 1,
"labels": {
"version": "2.4.4"
},
"memory": "512m",
"volumeMounts": [
{
"mountPath": "/tmp",
"name": "test-volume"
}
]
},
"image": "my.google.artifactory.com/spark-operator/spark:v2.4.4",
"imagePullPolicy": "Always",
"mainApplicationFile": "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar",
"mainClass": "org.apache.spark.examples.SparkPi",
"mode": "cluster",
"restartPolicy": {
"type": "Never"
},
"sparkVersion": "2.4.4",
"type": "Scala",
"volumes": [
{
"hostPath": {
"path": "/tmp",
"type": "Directory"
},
"name": "test-volume"
}
]
},
"status": {
"applicationState": {
"errorMessage": "driver container failed with ExitCode: 1, Reason: Error",
"state": "FAILED"
},
"driverInfo": {
"podName": "spark-pi-2022-04-28-2-driver",
"webUIAddress": "172.20.23.178:4040",
"webUIPort": 4040,
"webUIServiceName": "spark-pi-2022-04-28-2-ui-svc"
},
"executionAttempts": 1,
"lastSubmissionAttemptTime": "2022-04-29T13:28:15Z",
"sparkApplicationId": "spark-3335e141a51148d7af485457212eb389",
"submissionAttempts": 1,
"submissionID": "021e78fc-4754-4ac8-a87d-52c682ddc483",
"terminationTime": "2022-04-29T13:28:25Z"
}
}
As you can see in the status section, we have "state": "FAILED". Still, Airflow marks it as successful and thus runs t2 right after it, which is not what we want when defining t2 as dependent on (downstream of) t1.
Why does Airflow see t1 as successful even though the Spark job itself fails?
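That is how the operator is implemented. If you look at its code, SparkKubernetesOperator is basically a submit-and-forget job: it creates the SparkApplication and returns without waiting for it to finish. To monitor the status, use SparkKubernetesSensor:
t2 = SparkKubernetesSensor(
    task_id="spark_monitor",
    # task_ids here comes from my own DAG; use the ID of the task that submitted the application
    application_name="{{ task_instance.xcom_pull(task_ids='spark-job-full-refresh.spark_full_refresh') ['metadata']['name'] }}",
    attach_log=True,
)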
I have tried to create a custom operator that combines both, but it does not work very well via inheritance because the two have slightly different execution patterns, so it would need to be created from scratch. For all intents and purposes, though, the Sensor works perfectly; it just adds a few extra lines to the code.
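The difference between submitting and monitoring can be sketched without Airflow. The function below performs the kind of check a sensor has to make on each poll, using the status.applicationState.state field from the kubectl output in the question. It is an illustration only, not the provider's implementation: the function name is hypothetical, and treating "COMPLETED" as the success state and "UNKNOWN" as a failure state is an assumption (only "FAILED" appears in the question).

```python
# Assumed terminal states; only "FAILED" is confirmed by the output shown above.
FAILURE_STATES = {"FAILED", "UNKNOWN"}
SUCCESS_STATES = {"COMPLETED"}


def application_finished(spark_application: dict) -> bool:
    """Return True when the application succeeded, False while it is still running.

    Raises RuntimeError when the application reached a failure state, which is
    what lets a scheduler mark the task as failed and skip downstream tasks.
    """
    app_state = spark_application.get("status", {}).get("applicationState", {})
    state = app_state.get("state", "")
    if state in FAILURE_STATES:
        message = app_state.get("errorMessage", "no error message")
        raise RuntimeError(f"Spark application ended in state {state}: {message}")
    return state in SUCCESS_STATES
```

An operator that only submits the resource never makes this check, which is why t1 is marked successful as soon as the SparkApplication is created.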

Attach aws emr cluster to remote jupyter notebook using sparkmagic

I am trying to connect and attach an AWS EMR cluster (emr-5.29.0) to a Jupyter notebook running on my local Windows machine. I have started a cluster with Hive 2.3.6, Pig 0.17.0, Hue 4.4.0, Livy 0.6.0 and Spark 2.4.4, and the subnets are public. I found that this can be done with Azure HDInsight, so I was hoping something similar could be done with EMR. The issue I am having is with passing the correct values in the config.json file. How should I attach an EMR cluster?
I could work in the EMR notebooks native to AWS, but I wanted to develop locally and have hit a roadblock.
{
"kernel_python_credentials" : {
"username": "{IAM ACCESS KEY ID}", # not sure about the username for the cluster
"password": "{IAM SECRET ACCESS KEY}", # I use putty to ssh into the cluster with the pem key, so again not sure about the password for the cluster
"url": "ec2-xx-xxx-x-xxx.us-west-2.compute.amazonaws.com", # as per the AWS blog When Amazon EMR is launched with Livy installed, the EMR master node becomes the endpoint for Livy
"auth": "None"
},
"kernel_scala_credentials" : {
"username": "{IAM ACCESS KEY ID}",
"password": "{IAM SECRET ACCESS KEY}",
"url": "{Master public DNS}",
"auth": "None"
},
"kernel_r_credentials": {
"username": "{}",
"password": "{}",
"url": "{}"
},
Update 1/4/2021
On 1/4, I got sparkmagic to work in my local Jupyter notebook. I used these documents as references (ref-1, ref-2 & ref-3) to set up local port forwarding (avoid using sudo if possible).
sudo ssh -i ~/aws-key/my-pem-file.pem -N -L 8998:ec2-xx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com:8998 hadoop@ec2-xx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com
Configuration details
Release label: emr-5.32.0
Hadoop distribution: Amazon 2.10.1
Applications: Hive 2.3.7, Livy 0.7.0, JupyterHub 1.1.0, Spark 2.4.7, Zeppelin 0.8.2
Updated config file
{
"kernel_python_credentials" : {
"username": "",
"password": "",
"url": "http://localhost:8998"
},
"kernel_scala_credentials" : {
"username": "",
"password": "",
"url": "http://localhost:8998",
"auth": "None"
},
"kernel_r_credentials": {
"username": "",
"password": "",
"url": "http://localhost:8998"
},
"logging_config": {
"version": 1,
"formatters": {
"magicsFormatter": {
"format": "%(asctime)s\t%(levelname)s\t%(message)s",
"datefmt": ""
}
},
"handlers": {
"magicsHandler": {
"class": "hdijupyterutils.filehandler.MagicsFileHandler",
"formatter": "magicsFormatter",
"home_path": "~/.sparkmagic"
}
},
"loggers": {
"magicsLogger": {
"handlers": ["magicsHandler"],
"level": "DEBUG",
"propagate": 0
}
}
},
"authenticators": {
"Kerberos": "sparkmagic.auth.kerberos.Kerberos",
"None": "sparkmagic.auth.customauth.Authenticator",
"Basic_Access": "sparkmagic.auth.basic.Basic"
},
"wait_for_idle_timeout_seconds": 15,
"livy_session_startup_timeout_seconds": 60,
"fatal_error_suggestion": "The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.\nc) Restart the kernel.",
"ignore_ssl_errors": false,
"session_configs": {
"driverMemory": "1000M",
"executorCores": 2
},
"use_auto_viz": true,
"coerce_dataframe": true,
"max_results_sql": 2500,
"pyspark_dataframe_encoding": "utf-8",
"heartbeat_refresh_seconds": 5,
"livy_server_heartbeat_timeout_seconds": 60,
"heartbeat_retry_seconds": 1,
"server_extension_default_kernel_name": "pysparkkernel",
"custom_headers": {},
"retry_policy": "configurable",
"retry_seconds_to_sleep_list": [0.2, 0.5, 1, 3, 5],
"configurable_retry_policy_max_retries": 8
}
Second update 1/9
Back to square one. I keep getting this error and have spent days trying to debug it. I am not sure what I did previously to get things going. I also checked my security group config and it looks fine, with SSH on port 22.
An error was encountered:
Error sending http request and maximum retry encountered.
I created local port forwarding (an SSH tunnel) to the Livy server on port 8998, and it works like magic.
sudo ssh -i ~/aws-key/my-pem-file.pem -N -L 8998:ec2-xx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com:8998 hadoop@ec2-xx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com
I did not change my config.json file from the 1/4 update.
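With the tunnel in place, every kernel entry in config.json points at the same local Livy endpoint, so the credentials section can be generated rather than edited by hand. The sketch below is a hypothetical helper, not part of sparkmagic; giving all three kernels "auth": "None" is an assumption that mirrors the Scala entry in the updated config.

```python
import json


def build_kernel_credentials(livy_url="http://localhost:8998"):
    """Return the three kernel_*_credentials entries for a tunnelled Livy endpoint."""
    # Empty username/password match the updated config above; auth "None" is assumed.
    entry = {"username": "", "password": "", "url": livy_url, "auth": "None"}
    return {
        f"kernel_{kernel}_credentials": dict(entry)
        for kernel in ("python", "scala", "r")
    }


# JSON fragment that could be merged into ~/.sparkmagic/config.json.
config_fragment = json.dumps(build_kernel_credentials(), indent=2)
```

The URL only resolves while the SSH tunnel is running, so the "maximum retry" error is worth re-checking against the tunnel first.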

Meaning of Azure CLI status codes with extension for bot

I'm running the following Azure CLI command to display changes in a bot.
az bot publish --resource-group botxxx --name latest-xxxx --debug -o json
It returns the following output. I have searched but cannot find where the meaning of the status codes is explained. In my case, it is "status": 4.
Event: CommandInvoker.OnFilterResult []
{
"active": true,
"author": "N/A",
"author_email": "N/A",
"complete": true,
"deployer": "Push-Deployer",
"end_time": "2018-09-17T08:11:59.6100459Z",
"id": "3fc9899xxxxxxxxxxa06b0ab4d7e5",
"is_readonly": true,
"is_temp": false,
"last_success_end_time": "2018-09-17T08:11:59.6100459Z",
"log_url": "https://latest-xxxx.scm.azurewebsites.net/api/deployments/latest/log",
"message": "Created via a push deployment",
"progress": "",
"received_time": "2018-09-17T08:11:56.9780509Z",
"site_name": "latest-xxxx",
"start_time": "2018-09-17T08:11:57.1968236Z",
"status": 4,
"status_text": "",
"url": "https://latest-xxxx.scm.azurewebsites.net/api/deployments/latest"
}
Event: Cli.PostExecute []
My Azure CLI version, if needed:
C:\Users\hlorenzo\Desktop\latest-src>az --version
azure-cli (2.0.44)
I have searched the official documentation and found no solution. Thank you very much for the help.
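The log_url and url fields in the output point at the app's scm (Kudu) endpoint, so the numeric status is most likely Kudu's deployment status rather than an Azure CLI code. Assuming it follows the order of Kudu's DeployStatus enum (Pending, Building, Deploying, Failed, Success), 4 would mean success, which agrees with "complete": true and the populated last_success_end_time. The mapping below is a sketch under that assumption; the Python names are illustrative only.

```python
from enum import IntEnum


class DeployStatus(IntEnum):
    """Assumed mapping, following the order of Kudu's DeployStatus enum."""
    PENDING = 0
    BUILDING = 1
    DEPLOYING = 2
    FAILED = 3
    SUCCESS = 4


def describe_status(deployment: dict) -> str:
    """Translate the numeric "status" field of a deployment record into a name."""
    return DeployStatus(deployment["status"]).name
```

Passing the JSON object from the output above to describe_status would give the name for status 4 under this mapping.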
