Is it possible to update only part of a Glue Job using AWS CLI? - aws-cli

As part of my CI/CD process I am trying to update only the script_location of a Glue job, and nothing else. AWS is asking me to include required parameters such as RoleArn. How can I update only the part of the job configuration I want to change?
This is what I am trying to use:
aws glue update-job --job-name <job_name> --job-update Command="{ScriptLocation=s3://<s3_path_to_script>}"
This is what happens:
An error occurred (InvalidInputException) when calling the UpdateJob operation: Command name should not be null or empty.
If I add the default Command Name glueetl, this is what happens:
An error occurred (InvalidInputException) when calling the UpdateJob operation: Role should not be null or empty.

An easy way to update a Glue job or a Glue trigger via the CLI is the --cli-input-json option. To get the correct JSON structure, you can run aws glue update-job --generate-cli-skeleton, which returns the complete structure into which you insert your changes.
Example:
{"JobName":"","JobUpdate":{"Description":"","LogUri":"","Role":"","ExecutionProperty":{"MaxConcurrentRuns":0},"Command":{"Name":"","ScriptLocation":"","PythonVersion":""},"DefaultArguments":{"KeyName":""},"NonOverridableArguments":{"KeyName":""},"Connections":{"Connections":[""]},"MaxRetries":0,"AllocatedCapacity":0,"Timeout":0,"MaxCapacity":null,"WorkerType":"G.1X","NumberOfWorkers":0,"SecurityConfiguration":"","NotificationProperty":{"NotifyDelayAfter":0},"GlueVersion":""}}
Here, just fill in the name of the job and change the options you need.
After this you have to turn the JSON into a one-line JSON and pass it to the command wrapped in single quotes:
aws glue update-job --cli-input-json '<one-line-json>'
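For illustration, a minimal filled-in call might look like this (the job name, role, script path, and versions are placeholders; note that JobUpdate replaces the whole job definition, so include every setting you want to keep):
aws glue update-job --cli-input-json '{"JobName":"my-etl-job","JobUpdate":{"Role":"MyGlueServiceRole","Command":{"Name":"glueetl","ScriptLocation":"s3://my-bucket/scripts/my_script.py","PythonVersion":"3"},"GlueVersion":"2.0","WorkerType":"G.1X","NumberOfWorkers":2,"Timeout":2880}}'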
I hope this helps someone with the same problem.
Ref:
https://docs.aws.amazon.com/cli/latest/reference/glue/update-job.html
https://w3percentagecalculator.com/json-to-one-line-converter/

I don't know whether you've solved this problem already, but I managed to do it using this command:
aws glue update-job --job-name <gluejobname> --job-update Role=myRoleNameBB,Command="{Name=<someupdatename>,ScriptLocation=<local_filename.py>}"
You don't need the ARN of the role, just the role name. The example above assumes that you have a role named myRoleNameBB and that it has access to AWS Glue.
Note: I used a local file on my laptop. Also, the Name in the Command part is compulsory.
When I ran it I got this output:
{
    "JobName": "<gluejobname>"
}

Based on what I have found, there is no way to update just part of the job using the update-job API.
I ran into the same issue and provided the role to get past this error. The command worked, but the update-job API actually resets other parameters to their defaults, such as the application type, job language, class, timeout, max capacity, etc.
So if your pre-existing job is a Spark application written in Scala, it will fail, because update-job defaults to a Python shell job with Python as the job language. While the API provides a way to set the application type back to Spark, it provides no way to set the job language to Scala or to set a main class (required in the case of Scala).
If you do not want to specify the role to the update-job API, one approach is to copy the new script to the same name and location that your pre-existing ETL job already uses, and then trigger the job with the start-job-run API as part of your CI process.
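A rough sketch of this first approach, assuming the job's existing script lives at s3://my-bucket/scripts/etl_job.py (a placeholder path):
aws s3 cp ./etl_job.py s3://my-bucket/scripts/etl_job.py
aws glue start-job-run --job-name <job-name>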
The second approach is to run your ETL directly and force it to use the latest script in the start-job-run call:
aws glue start-job-run --job-name <job-name> --arguments=scriptLocation="<path to your latest script>"
The only caveat with the second approach is that when you look in the console the ETL job will still reference the old script location. The command above just forces this particular run of the job to use the latest script, which you can confirm by looking in the History tab of the Glue ETL console.

Related

How can I set the command override on an ECS scheduled task via Terraform?

I'm using this module - https://registry.terraform.io/modules/cn-terraform/ecs-fargate-scheduled-task/aws/latest
I've managed to build the scheduled task with everything except the command override on the container.
I cannot set the command override at the task definition level because multiple scheduled tasks share the same task definition, so the override needs to happen at the scheduled task level, since it is unique per scheduled task.
I don't see anything that helps in the module's documentation, so I'm wondering whether there is another way I could do this, for example by querying for the scheduled task once it's created and using a different module to set the command override?
If you look at the Terraform documentation for aws_cloudwatch_event_target, there is an example in there for an ECS scheduled task with command override. Notice how they are passing the override via the input parameter to the event target.
Now if you look at the source code for the module you are using, you will see they are passing anything you set in the event_target_input variable to the input parameter of the aws_cloudwatch_event_target resource.
So you need to pass the override as a JSON string (I would copy the example JSON string in the Terraform docs and then modify it to your needs) as event_target_input in your module declaration.
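For an ECS target that input is the task override JSON, so the string you pass (for example built with Terraform's jsonencode()) might look something like this, with the container name and command as placeholders:
{"containerOverrides":[{"name":"<container-name-from-task-definition>","command":["<command>","<arg>"]}]}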

AWS Wrangler Error HIVE_METASTORE_ERROR: Table is missing storage descriptor

Hope you can help me with an error I'm getting from awswrangler.
This is the case: I have two AWS accounts, AccountA and AccountB, both with Lake Formation enabled. I have a set of databases in AccountA and another set in AccountB, and we share AccountB's databases to AccountA through Lake Formation so that we can query their databases/tables with Athena from AccountA.
I am trying to automate a SQL query with Python, so I'm using awswrangler to achieve this, but I'm getting a not very specific error when I run the query in Python.
When I run "select * from DatabaseAccB.Table" I get the error "HIVE_METASTORE_ERROR: Table is missing storage descriptor". What could be the cause? I tried with a boto3 Athena session and got the same result.
This may help: when I query select * from DatabaseAccB.Table with my own user, it runs fine, but when I try it from a Lambda or a Glue job, it fails with the error mentioned above.
PS: AccountA has only SELECT/DESCRIBE permissions on the tables in AccountB. I can share some code if you need it.
PS2: if I run "select * from DatabaseAccA.Table", the query runs fine.
Tried with boto3, same result.
Tried using Lambda, same result.
Tried giving admin access to the Glue role in AccountA, same result.
I think something is happening with Lake Formation.
Thanks!
Make sure your Lambda/Glue Job execution Roles have the following Lake Formation permissions, all granted from AccountA's Console/CLI:
DESCRIBE on Resource Links (AccountA's Glue Catalog);
SELECT, DROP, etc., on the shared DB/table (AccountB's Glue Catalog).
Resource link permissions must be granted in pairs: even though your queries point to a resource link, the principal executing the query in Athena/Redshift Spectrum still needs the "normal" permissions (SELECT, INSERT, etc.) on the underlying shared database/table, granted by AccountA's Lake Formation administrator.
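As a hedged sketch, the two grants might look something like this from AccountA's CLI (role name, account IDs, and database/table names are placeholders):
DESCRIBE on the resource link in AccountA's catalog:
aws lakeformation grant-permissions --principal DataLakePrincipalIdentifier=arn:aws:iam::<AccountA-id>:role/<lambda-or-glue-role> --permissions DESCRIBE --resource '{"Table":{"DatabaseName":"<resource_link_db>","Name":"<resource_link_table>"}}'
SELECT on the underlying shared table in AccountB's catalog:
aws lakeformation grant-permissions --principal DataLakePrincipalIdentifier=arn:aws:iam::<AccountA-id>:role/<lambda-or-glue-role> --permissions SELECT --resource '{"Table":{"CatalogId":"<AccountB-id>","DatabaseName":"<shared_db>","Name":"<shared_table>"}}'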
For the AWS Wrangler part, if the problem still persists, you may need to be explicit about which Glue Catalog ID the query should run against (at the moment I'm not sure whether this parameter exists in AWS Wrangler, though).

Log link of failed Hive job submitted to Dataproc through Airflow

I have submitted a Hive job to a Dataproc cluster using Airflow's DataprocWorkflowTemplateInstantiateInlineOperator. When some of the jobs fail, in Google Cloud -> Dataproc -> Jobs I can see a link to the log with the failure:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found in 'gs://dataproc-abcde12-efghi23-jklmn12-uk/google-cloud-dataproc-metainfo/12354a681fgh161/jobs/job1-abdc12jssa/driveroutput'
Can I fetch this log link (e.g. gs://dataproc-abcde12-efghi23-jklmn12-uk/google-cloud-dataproc-metainfo/12354a681fgh161/jobs/job1-abdc12jssa/driveroutput) through Airflow?
I checked the gcp_dataproc_hook.py operator for anything that points to a log link so that I could retrieve it, but didn't find anything useful.
It looks like there's no auto-created handy link to fetch the output in Airflow's logs yet, but it could certainly be added (if you're feeling bold, it could be worth sending a pull request to Airflow yourself, or otherwise filing a feature request at https://issues.apache.org/jira/browse/AIRFLOW).
In general you can construct a handy URL or a copy/pasteable CLI command given the jobid; if you want to use Dataproc's UI directly, simply construct a URL of the form:
https://cloud.google.com/console/dataproc/jobs/<jobId>/?project=<projectId>&region=<region>
Alternatively, you could type:
gcloud dataproc jobs wait ${JOBID} --project ${PROJECTID} --region ${REGION}
A more direct approach with the URI would be:
gsutil cat ${LOG_LINK}*
with a glob expression at the end of that URI (it's not just a single file, but a set of files).
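For example, with the driver-output URI from the failure message above, that would be something like:
gsutil cat 'gs://dataproc-abcde12-efghi23-jklmn12-uk/google-cloud-dataproc-metainfo/12354a681fgh161/jobs/job1-abdc12jssa/driveroutput*'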

How to solve "DriverClass not found for database:mariadb" with AWS data pipeline?

I'm trying to play with AWS Data Pipeline (and then Glue later) and am following "Copy MySQL Data Using the AWS Data Pipeline Console". However, when I execute the pipeline, I get:
DriverClass not found for database:mariadb
I would expect this to "just work," but why is it not providing its own driver? Or is the driver for MySQL not the same as the driver for MariaDB?
Right, after fighting with this all day, I found the following link which solves it: https://forums.aws.amazon.com/thread.jspa?messageID=834603&tstart=0
Basically:
You are getting the error because you are using the RdsDatabase type; it needs to be JdbcDatabase when using MariaDB.
"type": "JdbcDatabase",
"connectionString": "jdbc:mysql://thing-master.cpbygfysczsq.eu-west-1.rds.amazonaws.com:3306/db_name",
"jdbcDriverClass" : "com.mysql.jdbc.Driver"
FULL credit goes to Webstar34 (https://forums.aws.amazon.com/profile.jspa?userID=452398)
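Put together, a hedged sketch of how the full database object might look in the pipeline definition (the id and credentials are placeholders; the connection string is the one from the snippet above):
{
  "id": "rds_mysql",
  "name": "rds_mysql",
  "type": "JdbcDatabase",
  "connectionString": "jdbc:mysql://thing-master.cpbygfysczsq.eu-west-1.rds.amazonaws.com:3306/db_name",
  "jdbcDriverClass": "com.mysql.jdbc.Driver",
  "username": "<db_user>",
  "*password": "<db_password>"
}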

submit job to remote hazelcast cluster

I'm new to Hazelcast Jet and have a very basic question. I have a 3-node Jet cluster set up, and sample code that reads from Kafka and drains to an IMap. When I run it from the command line (using jet-submit.sh and JetBootstrap.getInstance() to acquire the Jet instance), it works perfectly fine. When I run the same code as a client (using Jet.newJetClient() to acquire the instance, and Run As -> Java Application in Eclipse), I get:
java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field com.hazelcast.jet.core.ProcessorMetaSupplier.
Could you please let me know where I am going wrong?
One of your lambda functions captures an outside variable, probably defined at class level, and that class is either not Serializable or not added to the job config when submitting from a client. This is done automatically when submitting via the script.
Please see http://docs.hazelcast.org/docs/jet/0.6.1/manual/#remember-that-a-jet-job-is-distributed
When you use a client instance to submit the job, you have to add all classes that contain the code called by the job to the JobConfig:
JobConfig config = new JobConfig();
config.addClass(...);
config.addJar(...);
...
client.newJob(pipeline, config);
For example, if you use a lambda for stage.map(), the class containing the lambda has to be added.
The jet-submit.sh script makes this easier by automatically adding the entire submitted .jar file.
