Terraform for providing dependent jar path in AWS Glue

I am trying to deploy an AWS Glue job through Terraform. However, having gone through the documentation linked below, I am unable to find a way to configure the "Dependent jars path" in Terraform, even though my AWS Glue code references a jar file:
https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/glue_job
Is there a way to get around this, please?

Put the --extra-jars path (see the Glue job arguments reference: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html) into the job's default_arguments, which the aws_glue_job resource supports: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/glue_job
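For example, a minimal aws_glue_job sketch; the role ARN, bucket names, and jar paths below are placeholders, not values from the question:

resource "aws_glue_job" "example" {
  name     = "example-job"
  role_arn = "arn:aws:iam::123456789012:role/my-glue-role"   # hypothetical role

  command {
    script_location = "s3://my-glue-scripts/job.py"   # hypothetical script location
  }

  default_arguments = {
    # Equivalent of the "Dependent jars path" field in the Glue console
    "--extra-jars" = "s3://my-glue-jars/dependency-1.jar,s3://my-glue-jars/dependency-2.jar"
  }
}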

Related

Terraform State migration

I started working with Terraform and realized that the state files were created and saved locally. After some searching I found that it is not recommended that terraform state files be committed to git.
So I added a backend configuration using S3 as the backend. Then I ran the following command
terraform init -reconfigure
I realize now that this set the backend to S3 but didn't copy any state files.
Now when I run terraform plan, it plans to recreate the entire infrastructure that already exists.
I don't want to destroy and recreate the existing infrastructure. I just want terraform to recognize the local state files and copy them to S3.
Any suggestions on what I might do now?
State files are basically JSON files containing information about the current setup. You can manually copy the files from the local to the remote (S3) backend and use them without issues. You can read more about state files here: https://learn.hashicorp.com/tutorials/terraform/state-cli
I also maintain a package for handling remote state in S3/Blob/GCS, if you want to try it: https://github.com/tomarv2/tfremote
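A sketch of the manual copy, assuming your backend block already points at the bucket and key used below (both are placeholders); terraform state push is an alternative to the plain file copy:

# Keep a backup of the existing local state first
cp terraform.tfstate terraform.tfstate.manual-backup

# Option A: push the local state into the already-configured backend
terraform state push terraform.tfstate

# Option B: copy the file to the exact bucket/key named in the backend block
aws s3 cp terraform.tfstate s3://my-tf-state-bucket/path/to/terraform.tfstate

# Verify that Terraform now sees the existing infrastructure
terraform plan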

How to use Airflow-API-Plugin?

I want to list and trigger DAGs using this GitHub repo: https://github.com/airflow-plugins/airflow_api_plugin. How and where should I place this plugin in my Airflow folder so that I can call the endpoints?
Is there anything that I need to change in the airflow.cfg file?
The repository you listed has not been updated in a while. Why not just use the experimental REST APIs included in Airflow? You can find them here: https://airflow.apache.org/docs/stable/api.html
Use:
GET /api/experimental/dags/<DAG_ID>/dag_runs
to get a list of DAG runs, and
POST /api/experimental/dags/<DAG_ID>/dag_runs
to trigger a new DAG run.
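For example, with curl against a local webserver; the host, port, and DAG id below are placeholders, and the experimental API has to be enabled and reachable in your Airflow setup:

# List runs for a hypothetical DAG called my_dag
curl -X GET http://localhost:8080/api/experimental/dags/my_dag/dag_runs

# Trigger a new run, optionally passing a conf payload
curl -X POST http://localhost:8080/api/experimental/dags/my_dag/dag_runs \
  -H "Content-Type: application/json" \
  -d '{"conf": {"param": "value"}}'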

Unable to pull jar file from JFrog Artifactory repo when running Spark job on K8's

I am trying to run a Spark job on a Kubernetes cluster, but it fails with a ClassNotFoundException. The reason, I suspect, is that it is not able to pull the jar file from the JFrog Artifactory repository. Any suggestions on what can be done?
Can we include something in the spark-submit parameters, or create a password file?
You didn't mention how you are pulling the jar when you tested your job locally, or perhaps you haven't tested it locally yet. As per Advanced Dependency Management:
Spark uses the following URL scheme to allow different strategies for disseminating jars:
hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected
And:
Users may also include any other dependencies by supplying a comma-delimited list of Maven coordinates with --packages. All transitive dependencies will be handled when using this command. Additional repositories (or resolvers in SBT) can be added in a comma-delimited fashion with the flag --repositories. (Note that credentials for password-protected repositories can be supplied in some cases in the repository URI, such as in https://user:password@host/.... Be careful when supplying credentials this way.)
If your JFrog repo or jar file requires credentials, it looks like you will have to pass the credentials in the URL: https://user:password@host/...
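A rough sketch of what that could look like on spark-submit; the API server address, Artifactory host, repository path, class, and Maven coordinates below are all placeholders:

# Pull a dependency jar over https with credentials embedded in the URI
spark-submit \
  --master k8s://https://my-k8s-apiserver:6443 \
  --deploy-mode cluster \
  --class com.example.Main \
  --jars https://user:password@myorg.jfrog.io/artifactory/libs-release/com/example/dep/1.0/dep-1.0.jar \
  local:///opt/spark/app/app.jar

# Or resolve by Maven coordinates from a password-protected resolver
spark-submit \
  --master k8s://https://my-k8s-apiserver:6443 \
  --deploy-mode cluster \
  --class com.example.Main \
  --packages com.example:dep:1.0 \
  --repositories https://user:password@myorg.jfrog.io/artifactory/libs-release \
  local:///opt/spark/app/app.jar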

AWS Code deploy without appspec.yml

We are successfully using CodeDeploy for deployment. However, we have a request from the client to separate the deployment-script repository from the code repository; right now the code repository contains the appspec.yml and the other scripts that need to be run, and it is available to the coders too.
I tried searching Google and Stack Overflow but found nothing.
Do we need to make use of another tool like Chef, Puppet, etc.? However, the client wants the solution to use AWS only.
Kindly help.
I've accomplished this by adding an extra step to my build process.
During the build, my CI tool checks out a second repository which contains the deployment-related scripts and the appspec.yml file. After that we zip up the code plus the scripts and ship the bundle to CodeDeploy.
Don't forget that appspec.yml has to be in the root directory of the bundle.
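A minimal sketch of that build step; the repository, application, deployment group, and bucket names below are placeholders:

# Run from the root of the checked-out code repository
git clone https://github.com/myorg/deploy-scripts.git /tmp/deploy-scripts   # hypothetical scripts repo
cp /tmp/deploy-scripts/appspec.yml .          # appspec.yml must end up in the bundle root
cp -r /tmp/deploy-scripts/scripts ./scripts

# Bundle the revision, upload it to S3, and deploy it
aws deploy push \
  --application-name my-app \
  --s3-location s3://my-deploy-bucket/my-app.zip \
  --source .
aws deploy create-deployment \
  --application-name my-app \
  --deployment-group-name my-deployment-group \
  --s3-location bucket=my-deploy-bucket,key=my-app.zip,bundleType=zip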
I hope it helps.

How to Automate Pyspark script in Microsoft Azure

Hope you are doing well.
I am new to Spark as well as Microsoft Azure. As per our project requirement, we have developed a PySpark script through the Jupyter notebook installed on our HDInsight cluster. To date we have run the code from Jupyter itself, but now we need to automate the script. I tried to use Azure Data Factory but could not find a way to run the PySpark script from there. I also tried to use Oozie but could not figure out how to use it.
Could you please help me with how I can automate/schedule a PySpark script in Azure?
Thanks,
Shamik.
Azure Data Factory today doesn't have first-class support for Spark. We are working to add that integration in the future. Until then, we have published a sample on GitHub that uses the ADF MapReduce activity to submit a jar that invokes spark-submit.
Please take a look here:
https://github.com/Azure/Azure-DataFactory/tree/master/Samples/Spark
