Unable to create AWS EMR cluster with --enable-debugging - aws-cli

I am not able to create an AWS EMR cluster when I add the "--enable-debugging" option, although I can create the cluster without it.
I get an error like: aws: error: invalid json argument for option --configurations
My script to create the cluster is:
aws emr create-cluster \
--name test-cluster \
--release-label emr-5.5.0 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=1,InstanceType=m3.xlarge \
--no-auto-terminate \
--termination-protected \
--visible-to-all-users \
--use-default-roles \
--log-uri s3://testlogs/ \
--enable-debugging \
--tags Owner=${OWNER} Environment=Dev Name=${OWNER}-test-cluster \
--ec2-attributes KeyName=$KEY,SubnetId=$SUBNET \
--applications Name=Hadoop Name=Pig Name=Hive \
--security-configuration test-sec-config \
--configurations s3://configurations/mapreduceconfig.json
The mapreduceconfig.json file is:
[
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapred.tasktracker.map.tasks.maximum": "2"
    }
  },
  {
    "Classification": "hadoop-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "HADOOP_DATANODE_HEAPSIZE": "2048",
          "HADOOP_NAMENODE_OPTS": "-XX:GCTimeRatio=19"
        }
      }
    ]
  }
]

Well, the error is self-explanatory: the --configurations option does not support the s3:// scheme. As per the examples and documentation at
http://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html
it only supports file:// paths and a direct public HTTPS link to a file in S3, like https://s3.amazonaws.com/myBucket/mapreduceconfig.json.
So your configuration file has to be publicly accessible.
Not sure how you got it working without the --enable-debugging option.
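For example, the last line of the create-cluster command could be changed to either of the following (a sketch based on the documentation above; the HTTPS URL is simply the path-style form of the bucket used in the question and only works if the object is publicly readable):
--configurations https://s3.amazonaws.com/configurations/mapreduceconfig.json
or, with the file available locally:
--configurations file://mapreduceconfig.json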

Related

Streaming not working in Delta Live Tables pipeline (Databricks)?

I am working on a pipeline in Databricks > Workflows > Delta Live Tables and I am having an issue with the streaming part.
Expectations:
One bronze table reads the JSON files with Auto Loader (cloudFiles) in streaming mode (spark.readStream)
One silver table reads and flattens the bronze table in streaming mode (dlt.read_stream)
Result:
When taking the root location as the source (load /*, several hundred files): the pipeline starts, but the number of rows/files appended is not updated in the graph until the bronze part is completed. Then the silver part starts, the number of files/rows never updates either, and the pipeline terminates with a memory error.
When taking a very small number of files (/specific_folder among hundreds): the pipeline runs well and terminates with no error, but again the number of rows/files appended is not updated in the graph until each part is completed.
This led me to the conclusion that the pipeline does not seem to run in streaming mode.
Maybe I am missing something about the configuration or about how to run a DLT pipeline properly, and I would appreciate your help on this.
Here is the configuration of the pipeline:
{
  "id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "clusters": [
    {
      "label": "default",
      "aws_attributes": {
        "instance_profile_arn": "arn:aws:iam::xxxxxxxxxxxx:instance-profile/iam_role_example"
      },
      "autoscale": {
        "min_workers": 1,
        "max_workers": 10,
        "mode": "LEGACY"
      }
    }
  ],
  "development": true,
  "continuous": false,
  "channel": "CURRENT",
  "edition": "PRO",
  "photon": false,
  "libraries": [
    {
      "notebook": {
        "path": "/Repos/user_example#xxxxxx.xx/dms/bronze_job"
      }
    }
  ],
  "name": "01-landing-task-1",
  "storage": "dbfs:/pipelines/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "configuration": {
    "SCHEMA": "example_schema",
    "RAW_MOUNT_NAME": "xxxx",
    "DELTA_MOUNT_NAME": "xxxx",
    "spark.sql.parquet.enableVectorizedReader": "false"
  },
  "target": "landing"
}
Here is the code of the pipeline (the query in the silver table contains many more columns with a get_json_object, ~30 actually):
import dlt
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql.window import Window

RAW_MOUNT_NAME = spark.conf.get("RAW_MOUNT_NAME")
SCHEMA = spark.conf.get("SCHEMA")
SOURCE = spark.conf.get("SOURCE")
TABLE_NAME = spark.conf.get("TABLE_NAME")
PRIMARY_KEY_PATH = spark.conf.get("PRIMARY_KEY_PATH")

@dlt.table(
    name=f"{SCHEMA}_{TABLE_NAME}_bronze",
    table_properties={
        "quality": "bronze"
    }
)
def bronze_job():
    load_path = f"/mnt/{RAW_MOUNT_NAME}/{SOURCE}/5e*"
    return spark \
        .readStream \
        .format("text") \
        .option("encoding", "UTF-8") \
        .load(load_path) \
        .select("value", "_metadata") \
        .withColumnRenamed("value", "json") \
        .withColumn("id", F.expr(f"get_json_object(json, '$.{PRIMARY_KEY_PATH}')")) \
        .withColumn("_etl_timestamp", F.col("_metadata.file_modification_time")) \
        .withColumn("_metadata", F.col("_metadata").cast(T.StringType())) \
        .withColumn("_etl_operation", F.lit("U")) \
        .withColumn("_etl_to_delete", F.lit(False)) \
        .withColumn("_etl_file_name", F.input_file_name()) \
        .withColumn("_etl_job_processing_timestamp", F.current_timestamp()) \
        .withColumn("_etl_table", F.lit(f"{TABLE_NAME}")) \
        .withColumn("_etl_partition_date", F.to_date(F.col("_etl_timestamp"), "yyyy-MM-dd")) \
        .select("_etl_operation", "_etl_timestamp", "id", "json", "_etl_file_name", "_etl_job_processing_timestamp", "_etl_table", "_etl_partition_date", "_etl_to_delete", "_metadata")

@dlt.table(
    name=f"{SCHEMA}_{TABLE_NAME}_silver",
    table_properties = {
        "quality": "silver",
        "delta.autoOptimize.optimizeWrite": "true",
        "delta.autoOptimize.autoCompact": "true"
    }
)
def silver_job():
    df = dlt.read_stream(f"{SCHEMA}_{TABLE_NAME}_bronze").where("_etl_table == 'extraction'")
    return df.select(
        df.id.alias('medium_id'),
        F.get_json_object(df.json, '$.request').alias('request_id'))
Thank you very much for your help!
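Note that the bronze job above reads with the plain text source rather than Auto Loader. For comparison, a bronze declaration that actually uses cloudFiles, as described in the expectations, might look like the minimal sketch below (the option values are assumptions and most of the derived columns are omitted):
import dlt

@dlt.table(name=f"{SCHEMA}_{TABLE_NAME}_bronze", table_properties={"quality": "bronze"})
def bronze_job():
    # Auto Loader (cloudFiles) discovers new files incrementally instead of
    # re-listing the whole directory on every update.
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "text")             # keep "text" to match the original job
        .option("cloudFiles.maxFilesPerTrigger", "1000")  # assumption: bound each micro-batch
        .load(f"/mnt/{RAW_MOUNT_NAME}/{SOURCE}/5e*")
        .withColumnRenamed("value", "json")
    )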

Add Databricks API to configure init script in existing bash script

I would like to add this Databricks init-script API call to my existing bash script. How can I do this? Here is the API provided by Databricks:
curl -n -X POST -H 'Content-Type: application/json' -d '{
  "cluster_id": "",
  "num_workers": 1,
  "spark_version": "8.4.x-scala2.12",
  "node_type_id": "$node_type",
  "cluster_log_conf": {
    "dbfs": {
      "destination": "dbfs:/cluster-logs"
    }
  },
  "init_scripts": [ {
    "dbfs": {
      "destination": "dbfs:/FileStore/shared_uploads/kafka_keytabs/CopyKrbFiles.sh"
    }
  } ]
}' https://<databricks-instance>/api/2.0/clusters/edit
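One way this call might be embedded in an existing bash script is to parameterise the workspace URL, token, cluster id and node type and feed the JSON body in via a heredoc so the shell expands the variables (a minimal sketch; DATABRICKS_HOST, DATABRICKS_TOKEN, CLUSTER_ID and NODE_TYPE are placeholder names, and a bearer token replaces curl -n):
#!/bin/bash
set -euo pipefail

# Placeholder values (assumptions) -- adjust to your workspace
DATABRICKS_HOST="https://<databricks-instance>"
DATABRICKS_TOKEN="<personal-access-token>"
CLUSTER_ID="<cluster-id>"
NODE_TYPE="<node-type>"

# Build the JSON payload with the shell variables substituted in, then send
# it to the clusters/edit endpoint; -d @- reads the body from stdin.
curl -X POST \
  -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  -H "Content-Type: application/json" \
  -d @- \
  "${DATABRICKS_HOST}/api/2.0/clusters/edit" <<EOF
{
  "cluster_id": "${CLUSTER_ID}",
  "num_workers": 1,
  "spark_version": "8.4.x-scala2.12",
  "node_type_id": "${NODE_TYPE}",
  "cluster_log_conf": {
    "dbfs": { "destination": "dbfs:/cluster-logs" }
  },
  "init_scripts": [
    { "dbfs": { "destination": "dbfs:/FileStore/shared_uploads/kafka_keytabs/CopyKrbFiles.sh" } }
  ]
}
EOF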

Python requests POST equivalent of a working curl call to the Jenkins API

I am trying to call the Jenkins API to build a job using a curl command, which works perfectly fine. Here is my curl command:
curl -X POST https://myjenkinshost/job/myorg/job/10042/job/myproject/job/10042_myansiblejob/build --user admuser:my-token --data-urlencode json='{"parameter": [{"name": "ENVIRONNEMENT", "value": "developpement"},{"name": "REFERENCE_GIT_INVENTAIRE", "value": "develop"},{"name": "REFERENCE_GIT_PLAYBOOK", "value": "develop"},{"name": "VAULT", "value": "mysecretpassword"},{"name": "EXTRA_VARS", "value": "myhosts: TARGET"}]}'
But when I try this with Python, it always returns HTTP 400.
Error:
Reason: HTTP ERROR 400. Problem accessing /job/10042/job/myproject/job/10042_myansiblejob/build. Nothing is submitted
Here is my Python code, which is very simple; it is probably a very small issue.
import requests
import json
basicAuthCredentials = ('admuser', 'my-token')
jenkins_headers={'Content-type':'application/json', 'Accept':'application/json'}
ansible_vault_password="mysecretpassword"
JENKINS_URL="https://myjenkinshost/job/myorg/job/10042/job/myproject/job/10042_myansiblejob/build"
ENVIRONNEMENT="developpement"
GIT_BRANCH="develop"
ANSIBLE_EXTRA_VARS_VARNAME="myhosts: TARGET"
json_payload='{"parameter": [{"name": "ENVIRONNEMENT", "value": "'+ENVIRONNEMENT+'"'+ \
'},{"name": "REFERENCE_GIT_INVENTAIRE", "value": "'+GIT_BRANCH+'"'+ \
'},{"name": "REFERENCE_GIT_PLAYBOOK", "value": "'+GIT_BRANCH+'"'+ \
'},{"name": "VAULT", "value": "'+ansible_vault_password+'"'+ \
'},{"name": "EXTRA_VARS", "value": "'+ANSIBLE_EXTRA_VARS_VARNAME+'"}]}'
json_data=json.dumps(json_payload)
response_jenkins = requests.post(JENKINS_URL, headers=jenkins_headers,
data=json_data, auth=basicAuthCredentials)
print(response_jenkins.text)
Reference: https://www.jenkins.io/doc/book/using/remote-access-api/
Any suggestion is appreciated.
The curl command uses --data-urlencode, which sends Content-Type: application/x-www-form-urlencoded and submits the payload as a form field named json.
Your Python code instead sets Content-Type: application/json and posts the JSON string as the raw request body.
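A minimal sketch of what the Python call might look like if it mirrors the curl behaviour (passing a dict to data= makes requests form-encode it, so the Content-Type header and the json.dumps call are no longer needed):
import requests

# Same payload string as built above; requests URL-encodes it as the value of
# the 'json' form field, just like curl --data-urlencode json='...'.
response_jenkins = requests.post(
    JENKINS_URL,
    data={"json": json_payload},
    auth=basicAuthCredentials,
)
print(response_jenkins.status_code, response_jenkins.text)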

AWS CLI: The role defined for the function cannot be assumed by Lambda

AWS CLI version:
aws --version
aws-cli/1.11.21 Python/2.7.12 Darwin/15.3.0 botocore/1.4.78
Trying to create a Lambda function and getting the error:
An error occurred (InvalidParameterValueException) when calling the CreateFunction operation: The role defined for the function cannot be assumed by Lambda.
Role was created as:
aws iam create-role --role-name microrole --assume-role-policy-document file://./trust.json
trust.json is:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Policy was attached as:
aws iam put-role-policy --policy-document file://./policy.json --role-name microrole --policy-name micropolicy
policy.json is:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "apigateway:*"
      ],
      "Resource": "arn:aws:apigateway:*::/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "execute-api:Invoke"
      ],
      "Resource": "arn:aws:execute-api:*:*:*"
    }
  ]
}
I waited for several minutes, as mentioned at [1] and [2], but the error still does not go away. The policy and trust attached to the role are similar to those of the default role created when a Lambda function is created from the Console.
Complete steps are listed at https://github.com/arun-gupta/serverless/tree/master/aws/microservice.
What's missing?
The Lambda function was created as:
aws lambda create-function \
--function-name MicroserviceGetAll \
--role arn:aws:iam::<act-id>:role/service-role/microRole \
--handler org.sample.serverless.aws.couchbase.BucketGetAll \
--zip-file fileb:///Users/arungupta/workspaces/serverless/aws/microservice/microservice-http-endpoint/target/microservice-http-endpoint-1.0-SNAPSHOT.jar \
--description "Microservice HTTP Endpoint - Get All" \
--runtime java8 \
--region us-west-1 \
--timeout 30 \
--memory-size 1024 \
--environment Variables={COUCHBASE_HOST=ec2-35-165-83-82.us-west-2.compute.amazonaws.com} \
--publish
The correct command is:
aws lambda create-function \
--function-name MicroserviceGetAll \
--role arn:aws:iam::<act-id>:role/microRole \
--handler org.sample.serverless.aws.couchbase.BucketGetAll \
--zip-file fileb:///Users/arungupta/workspaces/serverless/aws/microservice/microservice-http-endpoint/target/microservice-http-endpoint-1.0-SNAPSHOT.jar \
--description "Microservice HTTP Endpoint - Get All" \
--runtime java8 \
--region us-west-1 \
--timeout 30 \
--memory-size 1024 \
--environment Variables={COUCHBASE_HOST=ec2-35-165-83-82.us-west-2.compute.amazonaws.com} \
--publish
The difference is that the role was incorrectly specified as role/service-role/microRole instead of role/microRole.
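As a quick sanity check, the role's exact ARN can be looked up with the CLI and pasted into --role (a sketch; microrole is the role name created above, and --query uses a JMESPath expression):
aws iam get-role --role-name microrole --query 'Role.Arn' --output text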

AWS CLI: get RDS free space

How do I get the free space of an RDS instance using aws-cli?
Tried:
aws rds describe-db-instances | grep -i 'size|space|free|available|used'
but got no result.
I've found the solution:
STARTTIME="$(date -u -d '5 minutes ago' '+%Y-%m-%dT%T')"
ENDTIME="$(date -u '+%Y-%m-%dT%T')"
aws cloudwatch get-metric-statistics --namespace AWS/RDS \
    --metric-name FreeStorageSpace \
    --start-time $STARTTIME --end-time $ENDTIME --period 300 \
    --statistics Average \
    --dimensions "Name=DBInstanceIdentifier,Value=<DB_INSTANCE>"
Output sample:
{
  "Datapoints": [
    {
      "Timestamp": "2016-02-11T08:50:00Z",
      "Average": 45698627515.73333,
      "Unit": "Bytes"
    }
  ],
  "Label": "FreeStorageSpace"
}
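Building on the same call, the average can be extracted directly with the CLI's --query option and converted from bytes to GiB (a sketch; --query takes a JMESPath expression and awk does the unit conversion):
aws cloudwatch get-metric-statistics --namespace AWS/RDS \
    --metric-name FreeStorageSpace \
    --start-time $STARTTIME --end-time $ENDTIME --period 300 \
    --statistics Average \
    --dimensions "Name=DBInstanceIdentifier,Value=<DB_INSTANCE>" \
    --query 'Datapoints[0].Average' --output text \
  | awk '{printf "%.2f GiB free\n", $1 / 1024 / 1024 / 1024}'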
