How to call Cluster API and start cluster from within Databricks Notebook? - apache-spark

Currently we are using a bunch of notebooks to process our data in azure databricks using mainly python/pyspark.
What we want to achieve is make sure that our clusters are started (warmed up) before initiating the data processing. For that reason we are exploring ways to get access to the Cluster API from within databricks notebooks.
So far we tried running the following:
import subprocess
cluster_id = "XXXX-XXXXXX-XXXXXXX"
subprocess.run(
[f'databricks clusters start --cluster-id "{cluster_id}"'], shell=True
)
which however returns below and nothing really happens afterwards. Cluster is not started.
CompletedProcess(args=['databricks clusters start --cluster-id "0824-153237-ovals313"'], returncode=127)
Is there any convenient and smart way to call the ClusterAPI from within databricks notebook or maybe call a curl command and how is this achieved?

Most probably the error is coming from the incorrectly configured credentials.
Instead of using command-line application it's better to use the Start command of Clusters REST API. This could be done with something like this:
import requests
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
host_name = ctx.tags().get("browserHostName").get()
host_token = ctx.apiToken().get()
cluster_id = "some_id" # put your cluster ID here
requests.post(
f'https://{host_name}/api/2.0/clusters/get',
json = {'cluster_id': cluster_id},
headers={'Authorization': f'Bearer {host_token}'}
)
and then you can monitor the status using the Get endpoint until it gets into the RUNNING state:
response = requests.get(
f'https://{host_name}/api/2.0/clusters/get?cluster_id={cluster_id}',
headers={'Authorization': f'Bearer {host_token}'}
).json()
status = response['state']

Related

Timeout when writing custom metric data to CloudWatch with AWS lambda

I'm running a vanilla AWS lambda function to count the number of messages in my RabbitMQ task queue:
import boto3
from botocore.vendored import requests
cloudwatch_client = boto3.client('cloudwatch')
def get_queue_count(user="user", password="password", domain="<my domain>/api/queues"):
url = f"https://{user}:{password}#{domain}"
res = requests.get(url)
message_count = 0
for queue in res.json():
message_count += queue["messages"]
return message_count
def lambda_handler(event, context):
metric_data = [{'MetricName': 'RabbitMQQueueLength', "Unit": "None", 'Value': get_queue_count()}]
print(metric_data)
response = cloudwatch_client.put_metric_data(MetricData=metric_data, Namespace="RabbitMQ")
print(response)
Which returns the following output on a test run:
Response:
{
"errorMessage": "2020-06-30T19:50:50.175Z d3945a14-82e5-42e5-b03d-3fc07d5c5148 Task timed out after 15.02 seconds"
}
Request ID:
"d3945a14-82e5-42e5-b03d-3fc07d5c5148"
Function logs:
START RequestId: d3945a14-82e5-42e5-b03d-3fc07d5c5148 Version: $LATEST
/var/runtime/botocore/vendored/requests/api.py:72: DeprecationWarning: You are using the get() function from 'botocore.vendored.requests'. This dependency was removed from Botocore and will be removed from Lambda after 2021/01/30. https://aws.amazon.com/blogs/developer/removing-the-vendored-version-of-requests-from-botocore/. Install the requests package, 'import requests' directly, and use the requests.get() function instead.
DeprecationWarning
[{'MetricName': 'RabbitMQQueueLength', 'Value': 295}]
END RequestId: d3945a14-82e5-42e5-b03d-3fc07d5c5148
You can see that I'm able to interact with the RabbitMQ API just fine--the function hangs when trying to post the metric.
The lambda function uses the IAM role put-custom-metric, which uses the policies recommended here, as well as CloudWatchFullAccess for good measure.
Resources on my internal load balancer, where my RabbitMQ server lives, are protected by a VPN, so it's necessary for me to associate this function with the proper VPC/security group. Here's how it's setup right now (I know this is working, because otherwise the communication with RabbitMQ would fail):
I read this post where multiple contributors suggest increasing the function memory and timeout settings. I've done both of these, and the timeout persists.
I can run this locally without any issue to create the metric on CloudWatch in less than 5 seconds.
#noxdafox has written a brilliant plugin that got me most of the way there, but at the end of the day I ended going for a pure lambda-based solution. It was surprisingly tricky getting the cloud watch plugin running with docker, and after I had trouble with the container shutting down its services and stopping processing of the message queue. Additionally, I wanted to be able to normalize queue count by the number of worker services in my ECS cluster, so I was going to need to connect to at least one AWS resource from within my VPC anyhow. I figured best to keep everything simple and in the same place.
import os
import boto3
from botocore.vendored import requests
USER = os.getenv("RMQ_USER")
PASSWORD = os.getenv("RMQ_PASSWORD")
cloudwatch_client = boto3.client(service_name='cloudwatch', endpoint_url="https://MYCLOUDWATCHURL.monitoring.us-east-1.vpce.amazonaws.com")
ecs_client = boto3.client(service_name='ecs', endpoint_url="https://vpce-MYECSURL.ecs.us-east-1.vpce.amazonaws.com")
def get_message_count(user=USER, password=PASSWORD, domain="rabbitmq.stockbets.io/api/queues"):
url = f"https://{user}:{password}#{domain}"
res = requests.get(url)
message_count = 0
for queue in res.json():
message_count += queue["messages"]
print(f"message count: {message_count}")
return message_count
def get_worker_count():
worker_data = ecs_client.describe_services(cluster="prod", services=["worker"])
worker_count = worker_data["services"][0]["runningCount"]
print(f"worker_count count: {worker_count}")
return worker_count
def lambda_handler(event, context):
message_count = get_message_count()
worker_count = get_worker_count()
print(f"msgs per worker: {message_count / worker_count}")
metric_data = [
{'MetricName': 'MessagesPerWorker', "Unit": "Count", 'Value': message_count / worker_count},
{'MetricName': 'NTasks', "Unit": "Count", 'Value': worker_count}
]
cloudwatch_client.put_metric_data(MetricData=metric_data, Namespace="RabbitMQ")
Creating the VPC endpoints was easier that I thought it would be. For Cloudwatch, you want to search for the "monitoring" VPC endpoint during the creation step (not "cloudwatch" or "logs". Searching for "ecs" gets you what you need for the ECS connect.
Once your lambda is us you need to configure the metric and accompanying alerts, and then relate those to an auto-scaling policy, but that's probably beyond the scope of this post. Leave a comment if you have questions on how I worked that out.
Only reason you might want to use a Lambda function to achieve your goal is if you do not own the RabbitMQ cluster. The fact your logic is hanging during communication suggests a network issue mostly due to mis-configured security groups.
If you can change the cluster configuration, I'd suggest you to install and configure the CloudWatch metrics exporter plugin which does most of the heavy-lifting work for you.
If your cluster runs on Docker, I believe the custom Docker file to be the best solution. If you run your Docker instances in AWS via ECS/Fargate, the plugin should be able to automatically infer the credentials from the Task Role through ExAws. Otherwise, just follow the README instructions on how to set the credentials yourself.

get list of tables in database using boto3

I’m trying to get a list of the tables from a database in my aws data catalog. I’m trying to use boto3. I’m running the code below on aws, in a sagemaker notebook. It runs forever (like over 30 minutes) and doesn’t return any results. The test_db only has 4 tables in it. My goal is to run similar code as part of an aws glue etl job, that I would run in an edited aws etl job script. Does anyone see what the issue might be or suggest how to do this?
code:
import boto3
from pprint import pprint
glue = boto3.client('glue', region_name='us-east-2')
response = glue.get_tables(
DatabaseName=‘test_db’
)
print(pprint(response['TableList']))
db = session.resource('dynamodb', region_name="us-east-2")
tables = list(db.tables.all())
print(tables)
resource
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/dynamodb.html

What does "avoid multiple Kudu clients per cluster" mean?

I am looking at kudu's documentation.
Below is a partial description of kudu-spark.
https://kudu.apache.org/docs/developing.html#_avoid_multiple_kudu_clients_per_cluster
Avoid multiple Kudu clients per cluster.
One common Kudu-Spark coding error is instantiating extra KuduClient objects. In kudu-spark, a KuduClient is owned by the KuduContext. Spark application code should not create another KuduClient connecting to the same cluster. Instead, application code should use the KuduContext to access a KuduClient using KuduContext#syncClient.
To diagnose multiple KuduClient instances in a Spark job, look for signs in the logs of the master being overloaded by many GetTableLocations or GetTabletLocations requests coming from different clients, usually around the same time. This symptom is especially likely in Spark Streaming code, where creating a KuduClient per task will result in periodic waves of master requests from new clients.
Does this mean that I can only run one kudu-spark task at a time?
If I have a spark-streaming program that is always writing data to the kudu,
How can I connect to kudu with other spark programs?
In a non-Spark program you use a KUDU Client for accessing KUDU. With a Spark App you use a KUDU Context that has such a Client already, for that KUDU cluster.
Simple JAVA program requires a KUDU Client using JAVA API and maven
approach.
KuduClient kuduClient = new KuduClientBuilder("kudu-master-hostname").build();
See http://harshj.com/writing-a-simple-kudu-java-api-program/
Spark / Scala program of which many can be running at the same time
against the same Cluster using Spark KUDU Integration. Snippet
borrowed from official guide as quite some time ago I looked at this.
import org.apache.kudu.client._
import collection.JavaConverters._
// Read a table from Kudu
val df = spark.read
.options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table"))
.format("kudu").load
// Query using the Spark API...
df.select("id").filter("id >= 5").show()
// ...or register a temporary table and use SQL
df.registerTempTable("kudu_table")
val filteredDF = spark.sql("select id from kudu_table where id >= 5").show()
// Use KuduContext to create, delete, or write to Kudu tables
val kuduContext = new KuduContext("kudu.master:7051", spark.sparkContext)
// Create a new Kudu table from a dataframe schema
// NB: No rows from the dataframe are inserted into the table
kuduContext.createTable("test_table", df.schema, Seq("key"),
new CreateTableOptions()
.setNumReplicas(1)
.addHashPartitions(List("key").asJava, 3))
// Insert data
kuduContext.insertRows(df, "test_table")
See https://kudu.apache.org/docs/developing.html
The more clear statement of "avoid multiple Kudu clients per cluster" is "avoid multiple Kudu clients per spark application".
Instead, application code should use the KuduContext to access a KuduClient using KuduContext#syncClient.

How to pull Spark jobs client logs submitted using Apache Livy batches POST method using AirFlow

I am working on submitting Spark job using Apache Livy batches POST method.
This HTTP request is send using AirFlow. After submitting job, I am tracking status using batch Id.
I want to show driver ( client logs) logs on Air Flow logs to avoid going to multiple places AirFLow and Apache Livy/Resource Manager.
Is this possible to do using Apache Livy REST API?
Livy has an endpoint to get logs /sessions/{sessionId}/log & /batches/{batchId}/log.
Documentation:
https://livy.incubator.apache.org/docs/latest/rest-api.html#get-sessionssessionidlog
https://livy.incubator.apache.org/docs/latest/rest-api.html#get-batchesbatchidlog
You can create python functions like the one shown below to get logs:
http = HttpHook("GET", http_conn_id=http_conn_id)
def _http_rest_call(self, method, endpoint, data=None, headers=None, extra_options=None):
if not extra_options:
extra_options = {}
self.http.method = method
response = http.run(endpoint, json.dumps(data), headers, extra_options=extra_options)
return response
def _get_batch_session_logs(self, batch_id):
method = "GET"
endpoint = "batches/" + str(batch_id) + "/log"
response = self._http_rest_call(method=method, endpoint=endpoint)
# return response.json()
return response
Livy exposes REST API in 2 ways: session and batch. In your case, since we assume you are not using session, you are submitting using batches. You can post your batch using the curl command:
curl http://livy-server-IP:8998/batches
Once you have submitted the job, you would get the batch ID in return. Then you can curl using the command:
curl http://livy-server-IP:8998/batches/{batchId}/log
You can find the documentation at:
https://livy.incubator.apache.org/docs/latest/rest-api.html
If you want to avoid the above steps, you can use a ready-made AMI (namely, LightningFLow) from AWS Marketplace which provides Airflow with a custom Livy operator. Livy operator submits and tracks the status of the job every 30 sec (configurable), and it also provides spark logs at the end of the spark job in Airflow UI logs.
Note: LightningFlow comes pre-integrated with all required libraries, Livy, custom operators, and local Spark cluster.
Link for AWS Marketplace:
https://aws.amazon.com/marketplace/pp/Lightning-Analytics-Inc-LightningFlow-Integrated-o/B084BSD66V
This will enable you to view consolidated logs at one place, instead of shuffling between Airflow and EMR/Spark logs (Ambari/Resource Manager).

AWS - Neptune restore from snapshot using SDK

I'm trying to test restoring Neptune instances from a snapshot using python (boto3). Long story short, we want to spin up and delete the Dev instance daily using automation.
When restoring, my restore seems to only create the cluster without creating the attached instance. I have also tried creating an instance once the cluster is up and add to the cluster, but that doesn't work either. (ref: client.create_db_instance)
My code does as follows, get the most current snapshot. Use that variable to create the cluster so the most recent data is there.
import boto3
client = boto3.client('neptune')
response = client.describe_db_cluster_snapshots(
DBClusterIdentifier='neptune',
MaxRecords=100,
IncludeShared=False,
IncludePublic=False
)
snaps = response['DBClusterSnapshots']
snaps.sort(key=lambda c: c['SnapshotCreateTime'], reverse=True)
latest_snapshot = snaps[0]
snapshot_ID = latest_snapshot['DBClusterSnapshotIdentifier']
print("Latest snapshot: " + snapshot_ID)
db_response = client.restore_db_cluster_from_snapshot(
AvailabilityZones=['us-east-1c'],
DBClusterIdentifier='neptune-test',
SnapshotIdentifier=snapshot_ID,
Engine='neptune',
Port=8182,
VpcSecurityGroupIds=['sg-randomString'],
DBSubnetGroupName='default-vpc-groupID'
)
time.sleep(60)
db_instance_response = client.create_db_instance(
DBName='neptune',
DBInstanceIdentifier='brillium-neptune',
DBInstanceClass='db.r4.large',
Engine='neptune',
DBSecurityGroups=[
'sg-string',
],
AvailabilityZone='us-east-1c',
DBSubnetGroupName='default-vpc-string',
BackupRetentionPeriod=7,
Port=8182,
MultiAZ=False,
AutoMinorVersionUpgrade=True,
PubliclyAccessible=False,
DBClusterIdentifier='neptune-test',
StorageEncrypted=True
)
The documentation doesn't help much at all. It's very good at providing the variables needed for basic creation, but not the actual instance. If I attempt to create an instance using the same Cluster Name, it either errors out or creates a new cluster with the same name appended with '-1'.
If you want to programmatically do a restore from snapshot, then you need to:
Create the cluster snapshot using create-db-cluster-snapshot
Restore cluster from snapshot using restore-db-cluster-from-snapshot
Create an instance in the new cluster using create-db-instance
You mentioned that you did do a create-db-instance call in the end, but your example snippet does not have it. If that call did succeed, then you should see an instance provisioned inside that cluster.
When you do a restore from Snapshot using the Neptune Console, it does steps #2 and #3 for you.
It seems like you did the following:
Create the snapshot via CLI
Create the cluster via CLI
Create an instance in the cluster, via Console
Today, we recommend restoring the snapshot entirely via the Console or entirely using the CLI.

Resources