AWS Data Pipeline EMR cluster with Spark - apache-spark

I have created an AWS Data Pipeline using the EMR template, but it is not installing Spark on the EMR cluster. Do I need to set any special action for that?
I see that a bootstrap action is needed for Spark installation, but that is not working either.

That install-spark bootstrap action is only for 3.x AMI versions. If you are using a releaseLabel (emr-4.x or later), the applications to install are specified in a different way.
When you create a pipeline, click "Edit in Architect" at the bottom, or edit your pipeline from the pipelines home page. You can then click the EmrCluster node and select Applications from the "Add an optional field..." dropdown. That is where you can add Spark.
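If you would rather make the same change programmatically instead of through Architect, here is a minimal boto3 sketch; the pipeline ID, region, release label, and instance settings are placeholder assumptions, and the essential parts are the releaseLabel field plus one applications field per application to install.

    import boto3

    # Placeholder region and pipeline ID. Note that put_pipeline_definition
    # replaces the whole definition, so include your activity/schedule objects too.
    client = boto3.client("datapipeline", region_name="us-east-1")

    emr_cluster = {
        "id": "EmrClusterForSpark",
        "name": "EmrClusterForSpark",
        "fields": [
            {"key": "type", "stringValue": "EmrCluster"},
            # releaseLabel (emr-4.x and later) replaces the old amiVersion +
            # install-spark bootstrap action combination.
            {"key": "releaseLabel", "stringValue": "emr-6.9.0"},
            # One "applications" entry per application to install on the cluster.
            {"key": "applications", "stringValue": "Spark"},
            {"key": "masterInstanceType", "stringValue": "m5.xlarge"},
            {"key": "coreInstanceType", "stringValue": "m5.xlarge"},
            {"key": "coreInstanceCount", "stringValue": "2"},
            {"key": "terminateAfter", "stringValue": "2 Hours"},
        ],
    }

    client.put_pipeline_definition(
        pipelineId="df-0123456789ABCDEF",  # placeholder pipeline ID
        pipelineObjects=[emr_cluster],
    )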

Related

spark-monitoring library not writing to Azure Log Analytics Workspace

I have installed the new version of the spark-monitoring library, which is supposed to support Databricks Runtime 11.0. See here: spark-monitoring-library. I have successfully attached the init script to my cluster. However, when I run jobs on this cluster, I do not see any logs from the Databricks jobs in Log Analytics. Does anyone have the same problem, and have you resolved it?

How to modify the config file of a Spark job in the Airflow UI?

I'm using Airflow to schedule a Spark job that uses a conf.properties file.
I want to change this file in the Airflow UI, not via the server CLI.
How can I do that?
The Airflow webserver doesn't support editing files in its UI, but it does let you add your own plugins and customize the UI by adding flask_appbuilder views (here is the doc).
You can also use an unofficial open-source plugin to do that (e.g. airflow_code_editor).
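To give a sense of the plugin route, here is a minimal read-only sketch that surfaces the properties file in the webserver UI; the file path, view name, and menu category are assumptions for illustration, and a real editor would also need a form, a POST handler, and access control.

    # Minimal Airflow plugin sketch: shows conf.properties in the web UI (read-only).
    # CONF_PATH and the menu entries are hypothetical; adjust them to your deployment.
    from airflow.plugins_manager import AirflowPlugin
    from flask_appbuilder import BaseView, expose
    from markupsafe import escape

    CONF_PATH = "/opt/airflow/jobs/conf.properties"  # hypothetical location

    class SparkConfView(BaseView):
        default_view = "show"

        @expose("/")
        def show(self):
            # Read the properties file from disk on every request.
            with open(CONF_PATH) as f:
                content = f.read()
            # Render the raw contents; an editor would add a form and a POST handler.
            return f"<pre>{escape(content)}</pre>"

    class SparkConfPlugin(AirflowPlugin):
        name = "spark_conf_plugin"
        appbuilder_views = [
            {"name": "Spark conf.properties", "category": "Admin", "view": SparkConfView()}
        ]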

Terraform for providing dependent jar path in AWS Glue

I am trying to deploy an AWS Glue job through Terraform. However, having gone through the documentation linked below, I am unable to find a way to configure the "Dependent jars path" in Terraform, which I need because my AWS Glue code references a jar file.
https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/glue_job
Is there a way to get around this please?
Put the jar path into the job's default_arguments map under the --extra-jars key (see https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html and https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/glue_job).
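As a rough Terraform sketch (the job name, S3 paths, and IAM role reference are placeholders, not values from the question):

    resource "aws_glue_job" "example" {
      name     = "example-job"               # placeholder name
      role_arn = aws_iam_role.glue_role.arn  # assumes an existing IAM role resource

      command {
        name            = "glueetl"
        script_location = "s3://my-bucket/scripts/job.py"  # placeholder path
      }

      default_arguments = {
        # Comma-separated list of S3 jar paths; this is what the console
        # labels "Dependent jars path".
        "--extra-jars" = "s3://my-bucket/jars/my-dependency.jar"
      }
    }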

How to Automate Pyspark script in Microsoft Azure

Hope you are doing well.
I am new to Spark as well as Microsoft Azure. As per our project requirements, we developed a PySpark script in the Jupyter notebook installed on our HDInsight cluster. So far we have run the code from Jupyter itself, but now we need to automate the script. I tried to use Azure Data Factory but could not find a way to run the PySpark script from there. I also tried Oozie but could not figure out how to use it.
Could you please help me automate/schedule a PySpark script in Azure?
Thanks,
Shamik.
Azure Data Factory doesn't have first-class support for Spark today. We are working to add that integration in the future. Until then, we have published a sample on GitHub that uses the ADF MapReduce activity to submit a jar that invokes spark-submit.
Please take a look here:
https://github.com/Azure/Azure-DataFactory/tree/master/Samples/Spark

Google Compute Engine "Click to Deploy" allows only one Cassandra cluster

I am using "Click to Deploy" to create 3-node cassandra cluster in my project.
No I need to create one more cluster for another purpose in the same project.
I am not able to create new one, as its showing the cluster is already installed and only option is to delete the existing cluster.
This is a known limitation of the current version of Click to Deploy. We are working on an update that will allow multiple deployments in a single project. To @chrispomeroy's point, a current workaround is to create another project and deploy your next cluster there.
