VMware import Packer job stuck at "Starting rename of AMI" - amazon-import

I'm trying to build an AMI via an OVA import from S3. It was working all along, but now the jobs get stuck at "Starting rename of AMI", even though in reality the import task succeeds. Please help.
aws@Cloud-Automation-Host:~$ aws ec2 describe-import-image-tasks --import-task-ids import-ami-067ff7c968962fffb
{
    "ImportImageTasks": [
        {
            "Status": "completed",
            "LicenseType": "BYOL",
            "ImageId": "ami-0778e29c4926231de",
            "Platform": "Linux",
            "Architecture": "x86_64",
            "SnapshotDetails": [
                {
                    "Status": "completed",
                    "DeviceName": "/dev/sda1",
                    "Format": "VMDK",
                    "DiskImageSize": 2152929792.0,
                    "SnapshotId": "snap-0a9e836f4359eeca2",
                    "UserBucket": {
                        "S3Bucket": "***-jenkins-us-west-1",
                        "S3Key": "**-6cce79e-***.ova"
                    }
                }
            ],
            "ImportTaskId": "import-ami-067ff7c968962fffb"
        }
    ]
}
[packer-host] out: 2019/02/20 06:54:50 packer: 2019/02/20 06:54:50 Allowing 300s to complete (change with AWS_TIMEOUT_SECONDS)
[packer-host] out: 2019/02/20 07:10:33 ui: vmware-iso (amazon-import): Import task import-ami-085c7a13345948192 complete
[packer-host] out: vmware-iso (amazon-import): Import task import-ami-085c7a13345948192 complete
[packer-host] out: 2019/02/20 07:10:33 ui: vmware-iso (amazon-import): Starting rename of AMI (ami-0c843c10c8c2f641c)
[packer-host] out: vmware-iso (amazon-import): Starting rename of AMI (ami-0c843c10c8c2f641c)
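The log line above notes that Packer only allows 300 seconds for this step and that the limit can be changed with AWS_TIMEOUT_SECONDS. A rough sketch of raising it before the build (the 3600-second value and the template file name are placeholders; this only gives the rename step more time to complete and is not a confirmed fix for the hang):

# AWS_TIMEOUT_SECONDS is the variable named in the Packer log above;
# 3600 and the template file name are placeholder values.
export AWS_TIMEOUT_SECONDS=3600
packer build vmware-ova-import.json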

Related

Airflow doesn't pick up FAILED status of Spark job

I'm running Airflow on Kubernetes using this Helm chart: https://github.com/apache/airflow/tree/1.5.0
I've written a very simple DAG just to test some things. It looks like this:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator

default_args = {
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG(
    'my-dag',
    default_args=default_args,
    description='simple dag',
    schedule_interval=timedelta(days=1),
    start_date=datetime(2022, 4, 21),
    catchup=False,
    tags=['example']
) as dag:
    t1 = SparkKubernetesOperator(
        task_id='spark-pi',
        trigger_rule="all_success",
        depends_on_past=False,
        retries=3,
        application_file="spark-pi.yaml",
        namespace="my-ns",
        kubernetes_conn_id="myk8s",
        api_group="sparkoperator.k8s.io",
        api_version="v1beta2",
        do_xcom_push=True,
        dag=dag
    )
    t2 = SparkKubernetesOperator(
        task_id='other-spark-job',
        trigger_rule="all_success",
        depends_on_past=False,
        retries=3,
        application_file=other_spark_job_definition,  # placeholder, defined elsewhere
        namespace="my-ns",
        kubernetes_conn_id="myk8s",
        api_group="sparkoperator.k8s.io",
        api_version="v1beta2",
        dag=dag
    )
    t1 >> t2
When I run the DAG from the Airflow UI, the Spark job for the first task (t1, spark-pi) gets created and is immediately marked as successful, and then Airflow launches the second task (t2) right after it. This can be seen in the web UI:
What you're seeing is the status of the two tasks in 5 separate DAG runs, as well as their overall status (the circles). The middle row of the image shows the status of t1, which is "success".
However, the actual spark-pi pod of t1 launched by the Spark operator fails on every run, and its status can be seen by querying the SparkApplication resource on Kubernetes:
$ kubectl get sparkapplications/spark-pi-2022-04-28-2 -n my-ns -o json
{
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {
        "creationTimestamp": "2022-04-29T13:28:02Z",
        "generation": 1,
        "name": "spark-pi-2022-04-28-2",
        "namespace": "my-ns",
        "resourceVersion": "111463226",
        "uid": "23f1c8fb-7843-4628-b22f-7808b562f9d8"
    },
    "spec": {
        "driver": {
            "coreLimit": "1500m",
            "cores": 1,
            "labels": {
                "version": "2.4.4"
            },
            "memory": "512m",
            "volumeMounts": [
                {
                    "mountPath": "/tmp",
                    "name": "test-volume"
                }
            ]
        },
        "executor": {
            "coreLimit": "1500m",
            "cores": 1,
            "instances": 1,
            "labels": {
                "version": "2.4.4"
            },
            "memory": "512m",
            "volumeMounts": [
                {
                    "mountPath": "/tmp",
                    "name": "test-volume"
                }
            ]
        },
        "image": "my.google.artifactory.com/spark-operator/spark:v2.4.4",
        "imagePullPolicy": "Always",
        "mainApplicationFile": "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar",
        "mainClass": "org.apache.spark.examples.SparkPi",
        "mode": "cluster",
        "restartPolicy": {
            "type": "Never"
        },
        "sparkVersion": "2.4.4",
        "type": "Scala",
        "volumes": [
            {
                "hostPath": {
                    "path": "/tmp",
                    "type": "Directory"
                },
                "name": "test-volume"
            }
        ]
    },
    "status": {
        "applicationState": {
            "errorMessage": "driver container failed with ExitCode: 1, Reason: Error",
            "state": "FAILED"
        },
        "driverInfo": {
            "podName": "spark-pi-2022-04-28-2-driver",
            "webUIAddress": "172.20.23.178:4040",
            "webUIPort": 4040,
            "webUIServiceName": "spark-pi-2022-04-28-2-ui-svc"
        },
        "executionAttempts": 1,
        "lastSubmissionAttemptTime": "2022-04-29T13:28:15Z",
        "sparkApplicationId": "spark-3335e141a51148d7af485457212eb389",
        "submissionAttempts": 1,
        "submissionID": "021e78fc-4754-4ac8-a87d-52c682ddc483",
        "terminationTime": "2022-04-29T13:28:25Z"
    }
}
As you can see in the status section, we have "state": "FAILED". Still, Airflow marks it as successful and thus runs t2 right after it, which is not what we want when defining t2 as dependent on (downstream of) t1.
Why does Airflow see t1 as successful even though the Spark job itself fails?
That's how it is implemented: if you look at the code for the operator, it is basically a submit-and-forget job. To monitor the status, you use SparkKubernetesSensor:
t2 = SparkKubernetesSensor(
    task_id="spark_monitor",
    application_name="{{ task_instance.xcom_pull(task_ids='spark-job-full-refresh.spark_full_refresh')['metadata']['name'] }}",
    attach_log=True,
)
I have tried to create a custom operator that combines both, but it does not work very well via inheritance because they follow slightly different execution patterns, so it would need to be created from scratch. For all intents and purposes, though, the sensor works perfectly; it just adds a few extra lines of code.
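Adapted to the DAG from the question, a minimal sketch of this operator-plus-sensor pattern could look like the following. The import paths are those of the cncf.kubernetes provider package; the monitor task_id is a hypothetical name, and the XCom expression follows the answer's example, relying on do_xcom_push=True being set on t1:

from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator
from airflow.providers.cncf.kubernetes.sensors.spark_kubernetes import SparkKubernetesSensor

# t1 submits the SparkApplication and pushes the created resource to XCom
# (do_xcom_push=True in the question's DAG), so the sensor can look it up by name.
t1_monitor = SparkKubernetesSensor(
    task_id="spark_pi_monitor",   # hypothetical task id
    namespace="my-ns",
    kubernetes_conn_id="myk8s",
    application_name="{{ task_instance.xcom_pull(task_ids='spark-pi')['metadata']['name'] }}",
    attach_log=True,              # stream the driver log into the Airflow task log
)

# The sensor fails when the SparkApplication reports FAILED, which blocks t2.
t1 >> t1_monitor >> t2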

How to provide cluster name in Azure Databricks Notebook Run Now JSON

I am able to use the below JSON through POSTMAN to run my Databricks notebook.
I want to be able to give a name to the cluster that is created through the "new_cluster" options.
Is there any such option available?
{
    "tasks": [
        {
            "task_key": "Job_Run_Api",
            "description": "To see how the run and trigger api works",
            "new_cluster": {
                "spark_version": "9.0.x-scala2.12",
                "node_type_id": "Standard_E8as_v4",
                "num_workers": "1",
                "custom_tags": {
                    "Workload": "Job Run Api"
                }
            },
            "libraries": [
                {
                    "maven": {
                        "coordinates": "net.sourceforge.jtds:jtds:1.3.1"
                    }
                }
            ],
            "notebook_task": {
                "notebook_path": "/Shared/POC/Job_Run_Api_POC",
                "base_parameters": {
                    "name": "Junaid Khan"
                }
            },
            "timeout_seconds": 2100,
            "max_retries": 0
        }
    ],
    "job_clusters": null,
    "run_name": "RUN_API_TEST",
    "timeout_seconds": 2100
}
When the above API call is made, the created cluster gets a name like "job-5975-run-2", which is not very descriptive.
I tried adding a "cluster_name" key inside the "new_cluster" object, but I got an error saying I can't do that:
{
    "error_code": "INVALID_PARAMETER_VALUE",
    "message": "Cluster name should not be provided for jobs."
}
Appreciate any help here
Cluster names for job clusters are generated automatically and can't be changed. If you want to somehow track specific jobs, use tags instead.
P.S. If you want more "advanced" tracking capabilities, look into the Overwatch project.
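To illustrate the tagging approach, here is a minimal sketch that submits the same kind of payload as the question, assuming it targets the Jobs API 2.1 runs/submit endpoint. The DATABRICKS_HOST/DATABRICKS_TOKEN environment variables and the extra "Team" tag are assumptions for the example, not values from the original post:

import os
import requests

# Placeholders: the workspace URL and token come from environment variables here.
host = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-....azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]  # a personal access token

# Same shape as the payload in the question, but leaning on custom_tags
# (not a cluster name) to make the job cluster identifiable later.
payload = {
    "run_name": "RUN_API_TEST",
    "tasks": [
        {
            "task_key": "Job_Run_Api",
            "new_cluster": {
                "spark_version": "9.0.x-scala2.12",
                "node_type_id": "Standard_E8as_v4",
                "num_workers": 1,
                "custom_tags": {
                    "Workload": "Job Run Api",
                    "Team": "data-eng",  # hypothetical extra tag used purely for tracking
                },
            },
            "notebook_task": {"notebook_path": "/Shared/POC/Job_Run_Api_POC"},
        }
    ],
    "timeout_seconds": 2100,
}

resp = requests.post(
    f"{host}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # contains the run_id of the submitted one-time run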

Usage of Command Override in Azure Container Instances within Azure Portal

I'm trying to deploy a Windows-based container from a private repository in an Azure Container Instance using the Azure Portal, and I'm not sure whether I'm using the "Command override" field in the "Advanced" section properly (probably I'm not). The thing is, I have to pass an argument at runtime that sets the value of a license server, so that a specific application, which needs to establish a connection to that license server, can start up.
In general, the run command for the container would look like:
docker run IMAGE:TAG -LicenseServer Port@Host
My entrypoint within the Dockerfile is a PowerShell script, "Start.ps1", which requests the corresponding value of the mentioned license server.
I've read the manual and therefore inserted the following string into the override field to pass the argument:
[ "cmd", "Start.ps1", "-LicenseServer", "<Port>@<Hostname>" ]
After deploying the ACI, the container is in the "Running" state for a few seconds; after that, it is terminated again. According to the logs, it didn't work anyway.
So I wonder, what would be the proper way to deploy the container to get it running?
Thank you a lot in advance!
In addition to my question, to give more context: the ACI was created in the Azure Portal.
I've used the following settings (see the JSON view):
{
    "properties": {
        "sku": "Standard",
        "provisioningState": "Succeeded",
        "containers": [
            {
                "name": "<name>",
                "properties": {
                    "image": "<image name>",
                    "command": [
                        "powershell",
                        "Start.ps1",
                        "-LicenseServer",
                        "<port>@<host>"
                    ],
                    "ports": [
                        {
                            "protocol": "TCP",
                            "port": 80
                        }
                    ],
                    "environmentVariables": [],
                    "instanceView": {
                        "restartCount": 1,
                        "currentState": {
                            "state": "Terminated",
                            "finishTime": "2021-04-28T06:06:22.2263538Z",
                            "detailStatus": "Container stopped per client request"
                        },
                        "previousState": {
                            "state": "Waiting",
                            "detailStatus": "CrashLoopBackOff: Back-off restarting failed"
                        }
                    },
                    "resources": {
                        "requests": {
                            "memoryInGB": 8,
                            "cpu": 1
                        }
                    }
                }
            }
        ],
        "initContainers": [],
        "imageRegistryCredentials": [
            {
                "server": "<login server>",
                "username": "<user>"
            }
        ],
        "restartPolicy": "OnFailure",
        "ipAddress": {
            "ports": [
                {
                    "protocol": "TCP",
                    "port": 80
                }
            ],
            "type": "Public",
            "dnsNameLabel": "mycontainerdns",
            "fqdn": "mycontainerdns.westeurope.azurecontainer.io"
        },
        "osType": "Windows",
        "instanceView": {
            "events": [],
            "state": "Stopped"
        }
    },
    "id": "/subscriptions/<subscription id>",
    "name": "<aci name>",
    "type": "Microsoft.ContainerInstance/containerGroups",
    "location": "westeurope",
    "tags": {}
}
Actually, that "cmd" just refers to what you run when you need to connect to the Windows container instance; for that you use the command:
az container exec -g resource_group_name -n container_group_name --container-name container_name --exec-command "cmd"
But when you want to override the CMD command, you need to pass the arguments like this:
["powershell", "Start.ps1", "-LicenseServer", "<Port>@<Hostname>"]
That is, you need to execute the PowerShell script via powershell, not via the cmd terminal.
I finally found the solution: the command string provided within "Command override" was wrong.
I tried several versions, and it now works with the following:
[ "powershell", "C:/Windows/Scripts/Start.ps1", "-LicenseServer", "<port>@<host>" ]
Now I get logs and the running state of the container within the ACI deployment.
Before that, I had tried (among others) the variant suggested in the first answer:
["powershell", "Start.ps1", "-LicenseServer", "<Port>@<Hostname>"]
But that does not seem to work within ACI, as the "Start.ps1" script couldn't be found, although I had set the working directory in the Dockerfile, and it does work in my Rancher deployment (by just providing "-LicenseServer Port@Host").
So, in conclusion, you have to provide the full path to your script when it serves as the entrypoint within the container.
Thank you a lot anyway for your help!
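For what it's worth, the same override can also be set when creating the instance from the Azure CLI instead of the Portal. A rough sketch, with resource group, container name, registry values and the license server as placeholders, and the quoted command mirroring the working override above:

# Placeholders throughout; --command-line carries the container command override.
az container create \
    --resource-group my-rg \
    --name licensed-app \
    --image <login server>/<image name> \
    --os-type Windows \
    --cpu 1 \
    --memory 8 \
    --ports 80 \
    --registry-login-server <login server> \
    --registry-username <user> \
    --registry-password <password> \
    --command-line "powershell C:/Windows/Scripts/Start.ps1 -LicenseServer <port>@<host>"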

creating my first image for rocket (serviio with java dependency)

I have CoreOS stable (1068.10.0) installed and I want to create a serviio streaming media server image for rocket.
this is my manifest file:
{
    "acVersion": "1.0.0",
    "acKind": "ImageManifest",
    "name": "tux-in.com/serviio",
    "app": {
        "exec": [
            "/opt/serviio/bin/serviio.sh"
        ],
        "user": "serviio",
        "group": "serviio"
    },
    "labels": [
        {
            "name": "version",
            "value": "1.0.0"
        },
        {
            "name": "arch",
            "value": "amd64"
        },
        {
            "name": "os",
            "value": "linux"
        }
    ],
    "ports": [
        {
            "name": "serviio",
            "protocol": "tcp",
            "port": 8895
        }
    ],
    "mountPoints": [
        {
            "name": "serviio-config",
            "path": "/config/serviio",
            "kind": "host",
            "readOnly": false
        }
    ],
    "environment": {
        "JAVA_HOME": "/opt/jre1.8.0_102"
    }
}
I couldn't find on Google how to add a Java package dependency, so I just downloaded the JRE, extracted it to /rootfs/opt and set a JAVA_HOME environment variable. Is that the right way to go?
Welp.. because I configured Serviio to run under a user and group called serviio, I created /etc/group with serviio:x:500:serviio and /etc/passwd with serviio:x:500:500:Serviio:/opt/serviio:/bin/bash. Is this OK? Should I have added and configured the users differently?
Then I created a rocket image with actool build serviio serviio-1.0-linux-amd64.aci, signed it and ran it with rkt run serviio-1.0-linux-amd64.aci. Then with rkt list I see that the container started and exited immediately.
UUID       APP      IMAGE NAME                  STATE    CREATED         STARTED         NETWORKS
bea402d9   serviio  tux-in.com/serviio:1.0.0    exited   11 minutes ago  11 minutes ago
rkt status bea402d9 returns:
state=exited
created=2016-09-03 12:38:03.792 +0000 UTC
started=2016-09-03 12:38:03.909 +0000 UTC
pid=15904
exited=true
app-serviio=203
I have no idea how to debug this issue further. How can I see the output of the sh command that was executed? Is there any other error-related information?
Have I configured things properly? I'm pretty lost, so any information regarding the issue would be greatly appreciated.
Thanks!

CloudFormation without snapshot

CloudFormation created a template for us which specifies both the AMI to start from and the snapshot ID backing that AMI.
We create our base AMI with Packer, which reports the AMI ID it creates but does not report the associated snapshot - we find that in the Amazon UI.
Can the CloudFormation template be modified so it does not specify the snapshot ID? Can you give an example of the stanza?
Sure you can! If a block device mapping for one of the AMI's devices omits the SnapshotId, EC2 creates that volume from the snapshot registered in the AMI, so the template itself only needs a valid ImageId. For example, something like this would work:
"Resources": {
"someEC2": {
"Type": "AWS::EC2::Instance",
"Properties": {
"ImageId": "...valid_ami_id...",
"InstanceType": "m3.medium",
"KeyName": "...",
"Monitoring": "false",
"NetworkInterfaces": [
{
...
}
],
"BlockDeviceMappings": [
{
"DeviceName": "/dev/sda",
"Ebs": {
"VolumeSize": 10
}
}
]
}
}
}
