Databricks API - Instance Pool - How to create with Photon enabled?

I am trying to create an instance pool using the Databricks API, and I need Photon enabled. I could not find a parameter to enable it in the documentation; does anyone know how to do it?
I used this documentation:
https://docs.databricks.com/dev-tools/api/latest/instance-pools.html#create
Here is my payload:
payload = {
    "instance_pool_name": f"DRIVER - {instance_type}",
    "node_type_id": instance_type,
    "idle_instance_autotermination_minutes": 4,
    "enable_elastic_disk": True,
    "max_capacity": 300,
    "min_idle_instances": 0,
    "preloaded_spark_versions": ["12.0.x-scala2.12"],
    # RUNTIME PHOTON TESTS
    "preloaded_runtime_engine": "PHOTON",
    "runtime_engine": "PHOTON",
    "aws_attributes": {
        "availability": "SPOT",
        "zone_id": "us-east-1a",
        "spot_bid_price_percent": 100
    },
    "custom_tags": [
        {...}
    ]
}
I tried adding the cluster-creation options, but neither the runtime engine nor the preloaded parameter works:
"preloaded_runtime_engine": "PHOTON",
"runtime_engine": "PHOTON",

It's not a separate parameter; there are dedicated Photon runtimes such as 11.3.x-photon-scala2.12 that you put in preloaded_spark_versions. You can obtain the exact name by creating an instance pool with a Photon runtime in the UI and querying its definition via the API.
You can also list all available runtimes using the runtime-versions API endpoint:
$ curl "$DATABRICKS_API/clusters/spark-versions" | jq -r '.versions[] | .key' | grep photon | sort
10.4.x-photon-scala2.12
11.1.x-photon-scala2.12
11.2.x-photon-scala2.12
11.3.x-photon-scala2.12
12.0.x-photon-scala2.12
12.1.x-photon-scala2.12
9.1.x-photon-scala2.12
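For reference, here is a minimal sketch of the corrected call, assuming the pool should preload the 12.0 Photon runtime and that DATABRICKS_HOST and DATABRICKS_TOKEN are available as environment variables (the instance type is illustrative and custom tags are omitted):
import os
import requests

# Sketch: create an instance pool that preloads a Photon runtime.
# Photon is selected via the runtime name in preloaded_spark_versions,
# not via a separate runtime_engine parameter.
payload = {
    "instance_pool_name": "DRIVER - i3.xlarge",
    "node_type_id": "i3.xlarge",
    "idle_instance_autotermination_minutes": 4,
    "enable_elastic_disk": True,
    "max_capacity": 300,
    "min_idle_instances": 0,
    "preloaded_spark_versions": ["12.0.x-photon-scala2.12"],
    "aws_attributes": {
        "availability": "SPOT",
        "zone_id": "us-east-1a",
        "spot_bid_price_percent": 100,
    },
}

resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.0/instance-pools/create",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json=payload,
)
print(resp.json())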

Related

How can I run a search job periodically in Azure Log Analytics?

I'm trying to visualize the browser statistics of our app hosted in Azure.
For that, I'm using the nginx logs and running an Azure Log Analytics query like this:
ContainerLog
| where LogEntrySource == "stdout" and LogEntry has "nginx"
| extend logEntry=parse_json(LogEntry)
| extend userAgent=parse_user_agent(logEntry.nginx.http_user_agent, "browser")
| extend browser=parse_json(userAgent)
| summarize count=count() by tostring(browser.Browser.Family)
| sort by ['count']
| render piechart with (legend=hidden)
Then I'm getting this diagram, which is exactly what I want:
But the query is very very slow. If I set the time range to more than just the last few hours it takes several minutes or doesn't work at all.
My solution is to use a search job like this:
ContainerLog
| where LogEntrySource == "stdout" and LogEntry has "nginx"
| extend d=parse_json(LogEntry)
| extend user_agent=parse_user_agent(d.nginx.http_user_agent, "browser")
| extend browser=parse_json(user_agent)
It creates a new table BrowserStats_SRCH, on which I can run this query:
BrowserStats_SRCH
| summarize count=count() by tostring(browser.Browser.Family)
| sort by ['count']
| render piechart with (legend=hidden)
This is much faster now and only takes some seconds.
But my problem is: how can I keep this up to date? Preferably, this search job would run once a day automatically and refresh the BrowserStats_SRCH table, so that new queries on that table always run on the most recent logs. Is this possible? Right now I can't even trigger the search job manually again, because then I get the error "A destination table with this name already exists".
In the end I would like to have a deep link to the pie chart with the browser stats without the need for any further clicks. Any help would be appreciated.
But my problem is: how can I keep this up to date? Preferably, this search job would run once a day automatically and refresh the BrowserStats_SRCH table, so that new queries on that table always run on the most recent logs. Is this possible?
You can leverage the API to create a search job, then use a timer-triggered Azure Function or Logic App to call that API on a schedule.
PUT https://management.azure.com/subscriptions/00000000-0000-0000-0000-00000000000/resourcegroups/testRG/providers/Microsoft.OperationalInsights/workspaces/testWS/tables/Syslog_suspected_SRCH?api-version=2021-12-01-preview
with a request body containing the query:
{
  "properties": {
    "searchResults": {
      "query": "Syslog | where * has 'suspected.exe'",
      "limit": 1000,
      "startSearchTime": "2020-01-01T00:00:00Z",
      "endSearchTime": "2020-01-31T00:00:00Z"
    }
  }
}
Or you can use the Azure CLI:
az monitor log-analytics workspace table search-job create --subscription ContosoSID --resource-group ContosoRG --workspace-name ContosoWorkspace --name HeartbeatByIp_SRCH --search-query 'Heartbeat | where ComputerIP has "00.000.00.000"' --limit 1500 --start-search-time "2022-01-01T00:00:00.000Z" --end-search-time "2022-01-08T00:00:00.000Z" --no-wait
Right now I can't even trigger the search job manually again, because then I get the error "A destination table with this name already exists".
Before you start the job as described above, remove the old result table using an API call:
DELETE https://management.azure.com/subscriptions/{subscriptionId}/resourcegroups/{resourceGroupName}/providers/Microsoft.OperationalInsights/workspaces/{workspaceName}/tables/{tableName}?api-version=2021-12-01-preview
Optionally, you could check the status of the table using this API before you delete it, to make sure it is not InProgress or Deleting.
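Putting those pieces together, here is a rough Python sketch of what the scheduled function could do (the resource names are placeholders, the query is shortened, and DefaultAzureCredential is assumed to have access to the workspace):
from datetime import datetime, timedelta, timezone

import requests
from azure.identity import DefaultAzureCredential

# Placeholders - replace with your subscription, resource group and workspace.
SUBSCRIPTION = "00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP = "myRG"
WORKSPACE = "myWS"
TABLE = "BrowserStats_SRCH"

TABLE_URL = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION}"
    f"/resourcegroups/{RESOURCE_GROUP}/providers/Microsoft.OperationalInsights"
    f"/workspaces/{WORKSPACE}/tables/{TABLE}"
)
PARAMS = {"api-version": "2021-12-01-preview"}

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
headers = {"Authorization": f"Bearer {token}"}

# 1) Delete yesterday's result table (a 404 just means it does not exist yet).
#    In practice you may need to wait for the delete to finish before recreating it.
requests.delete(TABLE_URL, params=PARAMS, headers=headers)

# 2) Start a new search job covering the last 24 hours (put your full query here).
now = datetime.now(timezone.utc)
body = {
    "properties": {
        "searchResults": {
            "query": 'ContainerLog | where LogEntrySource == "stdout" and LogEntry has "nginx"',
            "limit": 100000,
            "startSearchTime": (now - timedelta(days=1)).isoformat(),
            "endSearchTime": now.isoformat(),
        }
    }
}
requests.put(TABLE_URL, params=PARAMS, headers=headers, json=body)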

On GCP, using the Python Pub/Sub client, how to list only a subset of subscriptions based on a filter

On the gcloud cli, when listing the pubsub subscriptions of a project, it is possible to filter results by using the --filter flag. Here is an example:
gcloud --project=my-project pubsub subscriptions list --filter=my-filter-string --format='value(name)'
I did not manage to find out how to do this with the Python library and its list_subscriptions method.
It basically seems to accept only a project string and to return all subscriptions in the project. This means I would need to get all the subscriptions in the project and then loop through them to filter them, as follows:
from google.cloud import pubsub_v1

subscriber_client = pubsub_v1.SubscriberClient()
filter = "my-filter-string"

with subscriber_client:
    page_result = subscriber_client.list_subscriptions(
        project="projects/my-project",
    )
    filtered_subscriptions = [
        subscription.name
        for subscription in page_result
        if filter in subscription.name.split("/")[-1]
    ]

for subscription_name in filtered_subscriptions:
    print(subscription_name)
Is there a more efficient way to do that?
I have been trying to do this with the metadata: Sequence[Tuple[str, str]] argument on the method, but could not find examples of how to do it.
Neither the REST API nor the RPC API provides a way to filter on the server side, so no, there is no more efficient way to do this.
I imagine the gcloud code to do the filter is conceptually similar to what you wrote.

Incorrect component name in App Insights on Azure, using OpenTelemetry

I'm using OpenTelemetry to trace my service running on Azure.
The Azure Application Map is showing an incorrect name for the component sending the trace. It is also sending an incorrect Cloud RoleName, which is probably why this is happening: the Cloud RoleName that App Insights displays is the Function App name and not the Function name.
In my Azure Function (called FirewallCreate), I start a trace using following util method:
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource, SERVICE_NAME, SERVICE_NAMESPACE
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from azure.monitor.opentelemetry.exporter import AzureMonitorTraceExporter


def get_otel_new_tracer(name=__name__):
    # Setup a TracerProvider(). For more details read:
    # https://learn.microsoft.com/en-us/azure/azure-monitor/app/opentelemetry-enable?tabs=python#set-the-cloud-role-name-and-the-cloud-role-instance
    trace.set_tracer_provider(
        TracerProvider(
            resource=Resource.create(
                {
                    SERVICE_NAME: name,
                    SERVICE_NAMESPACE: name,
                    # SERVICE_INSTANCE_ID: "my-instance-id"
                }
            )
        )
    )

    # Send messages to the exporter in batches
    span_processor = BatchSpanProcessor(
        AzureMonitorTraceExporter.from_connection_string(
            os.environ["TRACE_APPINSIGHTS_CONNECTION_STRING"]
        )
    )
    trace.get_tracer_provider().add_span_processor(span_processor)
    return trace.get_tracer(name)


def firewall_create():
    tracer = get_otel_new_tracer("FirewallCreate")
    with tracer.start_as_current_span("span-firewall-create") as span:
        # do some stuff
        ...
The traces show another function name, in the same function app (picture attached).
What mistake could I be making?
The component has 0 ms and 0 calls in it. What does that mean? How should I interpret it?
Hey, thanks for trying out OpenTelemetry with Application Insights! First of all, are you using the azure-monitor-opentelemetry-exporter, and if so, which version are you using? I've also answered your questions inline:
What mistake could I be making?
What name are you trying to set your component to? By default, the exporter will set your cloudRoleName to <service_namespace>.<service_name>, using the values you set in your Resource. If <service_namespace> is not populated, it takes just the value of <service_name>.
The component has 0 ms and 0 calls in it. What does that mean? How should I interpret it?
Is this the only node in your application map? We need to first find whether this node is created by the exporter or by the functions runtime itself.
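To illustrate the mapping described above, here is a small sketch (the namespace value is made up; only the Resource construction is shown):
from opentelemetry.sdk.resources import Resource, SERVICE_NAME, SERVICE_NAMESPACE

# With both attributes set, the exporter derives
# cloudRoleName = "<service_namespace>.<service_name>" = "my-function-app.FirewallCreate"
resource = Resource.create(
    {
        SERVICE_NAMESPACE: "my-function-app",  # illustrative namespace
        SERVICE_NAME: "FirewallCreate",        # the name you want to see in the Application Map
    }
)

# If SERVICE_NAMESPACE is left out, cloudRoleName falls back to just "FirewallCreate".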

How to pass session parameters with Python to Snowflake?

The code below is my attempt at passing a session parameter to Snowflake through Python. This is part of an existing codebase which runs in AWS Glue, and the only part of the following that doesn't work is the session_parameters.
I'm trying to understand how to add session parameters from within this code. Any help in understanding what is going on here is appreciated.
sf_credentials = json.loads(CACHE["SNOWFLAKE_CREDENTIALS"])
CACHE["sf_options"] = {
    "sfURL": "{}.snowflakecomputing.com".format(sf_credentials["account"]),
    "sfUser": sf_credentials["user"],
    "sfPassword": sf_credentials["password"],
    "sfRole": sf_credentials["role"],
    "sfDatabase": sf_credentials["database"],
    "sfSchema": sf_credentials["schema"],
    "sfWarehouse": sf_credentials["warehouse"],
    "session_parameters": {
        "QUERY_TAG": "Something",
    },
}
In AWS CloudWatch, I can see that the parameter was sent with the other options. In Snowflake, the parameter was never set.
I can add more detail where necessary, I just wasn't sure what details are needed.
It turns out that there is no need to specify that a given parameter is a session parameter when you are using the Spark Connector. So instead:
sf_credentials = json.loads(CACHE["SNOWFLAKE_CREDENTIALS"])
CACHE["sf_options"] = {
    "sfURL": "{}.snowflakecomputing.com".format(sf_credentials["account"]),
    "sfUser": sf_credentials["user"],
    "sfPassword": sf_credentials["password"],
    "sfRole": sf_credentials["role"],
    "sfDatabase": sf_credentials["database"],
    "sfSchema": sf_credentials["schema"],
    "sfWarehouse": sf_credentials["warehouse"],
    "QUERY_TAG": "Something",
}
Works perfectly.
I found this in the Snowflake documentation for using the Spark Connector; here is the section on setting session parameters.
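For completeness, a minimal sketch of how those options are then passed to the Spark connector in the Glue job (the table name is a placeholder; spark and CACHE come from the existing code above):
# Sketch: read a table through the Snowflake Spark connector using the options above.
# The QUERY_TAG entry in CACHE["sf_options"] is picked up as a session parameter.
df = (
    spark.read.format("net.snowflake.spark.snowflake")
    .options(**CACHE["sf_options"])
    .option("dbtable", "MY_TABLE")  # placeholder table name
    .load()
)
df.show()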

Groovy (SoapUI): choose which requests are compatible with the deployed API and use them

I want to write SoapUI test cases which can be run against different versions of the API. Before running a test case, it needs to find out which API versions are deployed and then choose which requests can run against those API versions. The API versions change every day, so the SoapUI test cases need to be able to handle different versions.
I wrote a simple script in Groovy which checks the deployed API versions (major versions) on the servers and saves them in the test suite properties. This script also checks whether we have requests for those versions; it fails when we don't have a request for a deployed version.
The Test Suite Properties look something like this:
I have also prepared requests for the different versions. But now I need to write a Groovy script which will choose the appropriate requests. Any idea what the easiest way to do that is?
My idea was to make a map (apiRequests) with API names as keys and request names as values, then use .each so that in every loop it gets the version of that API from the Test Suite Properties. I have made this:
API: onboarding, communication, customer-bill, etc.
Requests: Login, Logout, List of inbox messages, See bill history, Retrieve unpaid bills, etc.
def apiRequests = [
    'onboarding'    : ['Login', 'Logout'],
    'customer-bill' : ['See bill history', 'Retrieve unpaid bills'],
    'communication' : ['List of inbox messages'],
]

apiRequests.each { k, v ->
    def apiVersion = testRunner.testCase.testSuite.project.getTestSuiteByName("Independent functions").getTestCaseByName("Get API Version").getPropertyValue("$k")
    log.info apiVersion // returns the version for the api in that loop (e.g. '2' for the onboarding api)
}
And now I need to build the complete request name (e.g. 'Login - v2'). I think I can use something like this:
def finalRequest = (v + " -v " + apiVersion)
But this doesn't work because I have multiple values for one key.
And then I need to disable the other requests (everything different from the finalRequests):
Loop - disable: 'Login - v1', 'Login - v3', 'Logout - v1', 'Logout - v3'; enable: 'Login - v2', 'Logout - v2'
Loop - disable: 'See bill history - v2'; enable: 'See bill history - v1'
Loop - etc.
I would like to make this Groovy script universal for every test case, so it will loop over every request stored in 'apiRequests' and, if they exist in the actual test case, choose them and disable the others.
This should do what you are looking for.
PS: I assume the properties are added at Project-level scope. If you add the properties at TestSuite/TestCase level, change the context.expand('${#TestSuite/TestCase#....}') accordingly.
testRunner.testCase.testSteps.each { k, v ->
    if (k.inspect().contains(context.expand('${#Project#onboarding}'))) {
        log.info k
        v.setDisabled(true)
    }
    if (k.inspect().contains(context.expand('${#Project#communication}'))) {
        log.info k
        v.setDisabled(true)
    }
    if (k.inspect().contains(context.expand('${#Project#customer-bill}'))) {
        log.info k
        v.setDisabled(true)
    }
}
return
PS:
I would suggest changing the value of your properties by adding a v before the version number, matching the way you named your test steps - Login - v1, Login - v2, etc.
So the properties would look like this:
| Name          | Value |
|---------------|-------|
| onboarding    | v2    |
| communication | v3    |
| customer-bill | v1    |
In any case, the above code will work.
