I am trying to collect the Performance of a Virtual Machine like CPU Utilization, Available Memory, Logical Disk MB/s, and Logical Disk IOPS, which can be seen under Insights via console. I want to collect these data and save them into a CSV file. Is there any API to get the data with Avg, Min, Max, 50th, 90th, and 95th included?
I have tried the following solutions:
az monitor metrics command: az monitor metrics list --resource {ResourceName} --metric "Percentage CPU"
API: https://management.azure.com/subscriptions/{subscription_id}/resourceGroups/{resource_group_name}/providers/Microsoft.Compute/virtualMachines/{vm_name}/providers/microsoft.insights/metrics?api-version=2018-01-01&metricnames=Percentage CPU&timespan={start_time}/{end_time}&interval=PT1H&aggregation=average
Microsoft Azure Monitor Client Library (Python SDK): azure-mgmt-monitor
With all of the above approaches, instead of CPU Utilization I get results for 'Percentage CPU', i.e., they return Azure Monitor platform metrics rather than the Insights data.
One possible solution is to use the Azure Monitor REST API which allows you to collect various metrics from a virtual machine. You can specify the metric names, time span, interval, and aggregation parameters in the request URL. For example:
https://management.azure.com/subscriptions/{subscription_id}/resourceGroups/{resource_group_name}/providers/Microsoft.Compute/virtualMachines/{vm_name}/providers/microsoft.insights/metrics?api-version=2018-01-01&metricnames=Percentage CPU,Available Memory Bytes,Disk Read Bytes/sec,Disk Write Bytes/sec,Disk Read Operations/Sec,Disk Write Operations/Sec&timespan={start_time}/{end_time}&interval=PT1H&aggregation=average,count,maximum,minimum,total
This request will return the average, count, maximum, minimum, and total values for each metric for each hour within the specified time span. Note that these are the only aggregation types the metrics API exposes; the 50th, 90th, and 95th percentiles shown under Insights are computed from the Log Analytics data behind the Insights view rather than being available as metric aggregations.
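For example, here is a minimal sketch of calling that endpoint from Python with the requests library; the bearer-token acquisition and the placeholder variables are assumptions, and passing the query string via params lets requests URL-encode the metric names:
import requests

# Minimal sketch: call the Azure Monitor metrics REST API directly.
# subscription_id, resource_group_name, vm_name, start_time, end_time and
# access_token are placeholders you must supply (e.g. via azure-identity).
url = (
    "https://management.azure.com/subscriptions/{}/resourceGroups/{}/"
    "providers/Microsoft.Compute/virtualMachines/{}/"
    "providers/microsoft.insights/metrics"
).format(subscription_id, resource_group_name, vm_name)

params = {
    "api-version": "2018-01-01",
    "metricnames": "Percentage CPU,Available Memory Bytes",
    "timespan": "{}/{}".format(start_time, end_time),
    "interval": "PT1H",
    "aggregation": "Average,Count,Maximum,Minimum,Total",
}
headers = {"Authorization": "Bearer {}".format(access_token)}

response = requests.get(url, params=params, headers=headers)
response.raise_for_status()
metrics_json = response.json()  # the "value" list holds one entry per metric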
Another possible solution is to use the Azure Monitor library for Python, which wraps the same REST API. You can install the azure-mgmt-monitor package and use the list method of the MetricsOperations class to get the metrics data. For example:
import datetime

from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.monitor import MonitorManagementClient

# Build the ARM id of your resource
resource_id = (
    "subscriptions/{}/"
    "resourceGroups/{}/"
    "providers/Microsoft.Compute/virtualMachines/{}"
).format(subscription_id, resource_group_name, vm_name)

# Get your credentials ready
credentials = ServicePrincipalCredentials(
    client_id=client_id,
    secret=secret,
    tenant=tenant_id
)

# Create a monitor client
monitor_client = MonitorManagementClient(
    credentials,
    subscription_id
)

# Get metrics data
metrics_data = monitor_client.metrics.list(
    resource_id,
    timespan="{}/{}".format(start_time, end_time),
    interval='PT1H',
    metricnames="Percentage CPU,Available Memory Bytes,Disk Read Bytes/sec,Disk Write Bytes/sec,Disk Read Operations/Sec,Disk Write Operations/Sec",
    aggregation="Average,Count,Maximum,Minimum,Total",
)
This code returns a result similar to the REST API request.
To save the metrics data to a CSV file, you can use Python's built-in csv module or a library such as pandas. Iterate over each metric object in metrics_data.value and write its data points as rows of the CSV file.
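As a rough illustration, here is a minimal sketch of that export; the attribute names (name.value, timeseries, data, time_stamp, and the aggregation fields) follow the older azure-mgmt-monitor response models and may need adjusting for your SDK version:
import csv

# Minimal sketch: flatten the metrics response into one row per data point.
with open("vm_metrics.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["metric", "timestamp", "average", "count",
                     "maximum", "minimum", "total"])
    for metric in metrics_data.value:
        for series in metric.timeseries:
            for point in series.data:
                writer.writerow([
                    metric.name.value,
                    point.time_stamp,
                    point.average,
                    point.count,
                    point.maximum,
                    point.minimum,
                    point.total,
                ])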
Related
I am trying to create a dashboard of my services in Azure. I added an Azure Metrics chart for each service and later wanted to add, under it, specific details about the operations included in the service.
But when I try to get those from the logs, I get a much higher number of requests. KQL:
requests
| where cloud_RoleName startswith "notificationengine"
| summarize Count = count() by operation_Name
| order by Count
And result:
The problem is that for some metrics charts I get values that are exactly the same or only minimally different, while for others, like the one I've shown, I get completely different values. I tried modifying the KQL and searching for what might be wrong, but never got anywhere.
My guess is that these are two different values, but in that case, why are both labeled as "requests", and what are the actual differences between them?
I have taken an Azure Function App with 2 HTTP-trigger functions whose names start with "HttpTrigger" and ran both functions a couple of times.
Test Case 1:
In the Logs workspace, this is the request count for the two functions whose names start with "HttpTrigger":
But I have pinned the chart of only one function's request count to the Azure dashboard:
Most likely, you have written the query over the requests of all the services/applications whose names start with "notificationengine" but pinned only some of those apps'/services' log charts to the dashboard.
Test Case 2:
We store some data in HDF5 format on Azure Blob Storage. I have noticed higher-than-expected ingress traffic and used capacity when overwriting and modifying H5 files.
To test the usage, I used a Python script to generate an H5 file that is exactly 256 MB in size. The attached plot from the Azure portal shows usage during the experiments:
The first peak is the initial creation of the H5. Ingress traffic is 256MB and there's no egress, as expected.
The second peak is when I ran the same script again without deleting the file created from the first run. It shows egress traffic of 256MB and also total ingress of 512MB. The resulting file is still 256MB.
I ran it again a third time without deleting the file, and the third peak shows the same usage as the second.
The used capacity seems to be calculated based on ingress traffic, so we are being charged for 512 MB even though we are only using 256 MB. I would like to note that if I were to delete the original file and re-run the script, there would be no egress traffic from the deletion and only 256 MB of ingress from creating the file again. I did similar experiments with CSV files and Python pickles and found no such odd behavior in the usage calculation. All tests were carried out on an Azure VM in the same region as the blob, with the blob storage mounted using blobfuse.
I would like to understand how Azure counts the traffic when modifying existing files. For those of you who use H5 on Azure blob, is there a way to avoid the additional charge?
Python script I used to generate H5:
import tables
import numpy as np

db = 'test.h5'

class TestTable(tables.IsDescription):
    col0 = tables.Float64Col(pos=0)
    col1 = tables.Float64Col(pos=1)
    col2 = tables.Float64Col(pos=2)
    col3 = tables.Float64Col(pos=3)

data = np.zeros((1024*1024, 4))
tablenames = ['Test'+str(i) for i in range(8)]

mdb = tables.open_file(db, mode="w")

# Create tables
for name in tablenames:
    mdb.create_table('/', name, TestTable)
    table = eval('mdb.root.'+name)
    table.append(list(map(tuple, data)))
    table.flush()

mdb.close()
I am trying to find a way to capture DAG stats - i.e., run time (start time, end time), status, dag id, task id, etc. - for various DAGs and their tasks in a separate table.
I found the default logs that go to Elasticsearch/Kibana, but there is no simple way to pull the required logs from there back into the S3 table.
Building a separate process to load those logs into S3 would duplicate data, and there would also be too much data to scan and filter, since tons of other system-related logs are generated as well.
Adding a function to each DAG would mean modifying every DAG.
What other options are there to get this done efficiently, or can any other built-in Airflow feature be used?
You can try using the Ad Hoc Query feature available in Apache Airflow.
This option is available at Data Profiling -> Ad Hoc Query; select airflow_db.
If you wish to get DAG statistics such as start_time, end_time, etc., you can simply query in the format below:
select start_date,end_date from dag_run where dag_id = 'your_dag_name'
The above query returns the start_time and end_time of every run of that DAG. If you wish to get details for a particular run, you can add another filter condition like below:
select start_date,end_date from dag_run where dag_id = 'your_dag_name' and execution_date = '2021-01-01 09:12:59.0000' ##this is a sample time
You can get this execution_date from the tree or graph views. You can also get other stats such as id, dag_id, execution_date, state, run_id, and conf.
You can also refer to https://airflow.apache.org/docs/apache-airflow/1.10.1/profiling.html for more details.
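If you prefer to pull the same information programmatically rather than through the UI, here is a minimal sketch (my own assumption, not part of the Ad Hoc Query feature) that reads the dag_run table through Airflow's ORM session and writes it to a CSV file; the DAG name and output path are placeholders:
import csv

from airflow.models import DagRun
from airflow.settings import Session

# Minimal sketch: query the Airflow metadata DB for the run statistics of one
# DAG and dump them to CSV. 'your_dag_name' is a placeholder.
session = Session()
runs = session.query(DagRun).filter(DagRun.dag_id == 'your_dag_name').all()

with open('dag_run_stats.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['dag_id', 'run_id', 'execution_date',
                     'start_date', 'end_date', 'state'])
    for run in runs:
        writer.writerow([run.dag_id, run.run_id, run.execution_date,
                         run.start_date, run.end_date, run.state])

session.close()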
You did not mention whether you need this information in real time or in batches.
Since you do not want to use the ES logs either, you can try Airflow metrics, if they suit your need.
However, pulling this information from the database is not efficient in any case, but it is still an option if you are not looking for real-time data collection.
I have a vast dataset of ~2.4 million JSON files, each of which contains several records. I've created a simple apache-beam data pipeline (shown below) that follows these steps:
Read data from a GCS bucket using a glob pattern.
Extract records from JSON data.
Transform data: convert dictionaries to JSON strings, parse timestamps, others.
Write to BigQuery.
import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io.gcp.bigquery import WriteToBigQuery
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

# Pipeline
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
p = beam.Pipeline(options=pipeline_options)

# Read
files = p | 'get_data' >> ReadFromText(files_pattern)

# Transform
output = (files
          | 'extract_records' >> beam.ParDo(ExtractRecordsFn())
          | 'transform_data' >> beam.ParDo(TransformDataFn()))

# Write
output | 'write_data' >> WriteToBigQuery(table=known_args.table,
                                         create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                                         write_disposition=beam.io.BigQueryDisposition.WRITE_EMPTY,
                                         insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR',
                                         temp_file_format='NEWLINE_DELIMITED_JSON')

# Run
result = p.run()
result.wait_until_finish()
I've tested this pipeline with a minimal sample dataset and it works as expected. But I'm pretty doubtful regarding the optimal use of BigQuery resources and quotas. The batch load quotas are very restrictive, and due to the massive number of files to parse and load, I want to know if I'm missing some settings that could guarantee the pipeline will respect the quotas and run optimally. I don't want to exceed the quotas, as I am running other loads to BigQuery in the same project.
I haven't fully understood some parameters of the WriteToBigQuery() transform, specifically batch_size, max_file_size, and max_files_per_bundle, or whether they could help optimize the load jobs to BigQuery. Could you help me with this?
Update
I'm not only concerned about BigQuery quotas; the GCP quotas of other resources used by this pipeline are also a concern.
I tried to run my simple pipeline over the target data (~2.4 million files), but I'm receiving the following warning message:
Project [my-project] has insufficient quota(s) to execute this workflow with 1 instances in region us-central1. Quota summary (required/available): 1/16 instances, 1/16 CPUs, 250/2096 disk GB, 0/500 SSD disk GB, 1/99 instance groups, 1/49 managed instance groups, 1/99 instance templates, 1/0 in-use IP addresses. Please see https://cloud.google.com/compute/docs/resource-quotas about requesting more quota.
I don't understand that message completely. The process activated 8 workers successfully and is using 8 of the 8 available in-use IP addresses. Is this a problem? How can I fix it?
If you're worried about load job quotas, you can try streaming data into BigQuery, which comes with a less restrictive quota policy.
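As a rough sketch of what that change could look like in your write step (the batch_size value is an illustrative assumption, and streaming inserts typically use append semantics instead of WRITE_EMPTY):
# Minimal sketch: switch WriteToBigQuery from batch load jobs to streaming
# inserts. batch_size controls how many rows go into each streaming request.
output | 'write_data' >> WriteToBigQuery(
    table=known_args.table,
    method=WriteToBigQuery.Method.STREAMING_INSERTS,
    batch_size=500,  # illustrative value, tune to your row size and quotas
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR')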
To achieve what you want to do, you can try the Google-provided Dataflow templates or just refer to their code:
Cloud Storage Text to BigQuery (Stream) [code]
Cloud Storage Text to BigQuery (Batch)
And last but not least, more detailed information can be found in the Google BigQuery I/O connector documentation.
I am looking for a way to get the raw data from a performance counter in Windows Azure using the diagnostics API.
So far I've noticed that I can configure a counter from the known counters and set the sampling rate for that counter.
Is the sampling rate configured in the diagnostics configuration the sampling rate that the counter calculation is based on?
If not, how can I get the raw data for that counter, since I want to get the CPU user time (for example) and do the calculation myself?
Thanks
Each counter has a sampling frequency, from 1 second up to whatever interval you choose. Azure samples each instance at the given rate, captures the values, and stores them locally on each instance. Furthermore, there is a setting that tells Azure to transfer these values from each instance to the storage account's WADPerformanceCountersTable. The transfer interval is measured in minutes, with a minimum of once per minute.
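Once the samples have been transferred, you can read the raw rows yourself. Here is a minimal sketch using the azure-data-tables package; the connection string, the counter-name filter, and the column names are assumptions based on the usual WADPerformanceCountersTable schema:
from azure.data.tables import TableServiceClient

# Minimal sketch: read raw performance-counter samples that diagnostics has
# transferred into the storage account's WADPerformanceCountersTable.
# The connection string and the counter name are placeholders.
service = TableServiceClient.from_connection_string("<storage-connection-string>")
table = service.get_table_client("WADPerformanceCountersTable")

query = "CounterName eq '\\Processor(_Total)\\% Processor Time'"
for entity in table.query_entities(query):
    # Typical WAD columns include Role, RoleInstance, CounterName,
    # CounterValue and EventTickCount; adjust to what your table contains.
    print(entity.get("RoleInstance"), entity.get("EventTickCount"),
          entity.get("CounterValue"))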
For more details, you may want to read this:
http://convective.wordpress.com/2009/12/10/diagnostics-management-in-windows-azure/
and this:
http://convective.wordpress.com/2010/12/01/configuration-changes-to-windows-azure-diagnostics-in-azure-sdk-v1-3/