HDF5 Usage in Azure Blob

We store some data in HDF5 format on Azure Blob Storage. I have noticed higher than expected ingress traffic and used capacity when overwriting and modifying H5 files.
To test the usage, I used a Python script to generate an H5 file that is exactly 256 MB in size. The attached plot from the Azure portal shows the usage during the experiments:
The first peak is the initial creation of the H5 file. Ingress traffic is 256 MB and there is no egress, as expected.
The second peak is when I ran the same script again without deleting the file created in the first run. It shows egress traffic of 256 MB and also total ingress of 512 MB. The resulting file is still 256 MB.
I ran it a third time without deleting the file, and the third peak shows the same usage as the second.
The used capacity seems to be calculated based on ingress traffic, so we are being charged for 512 MB even though we are only using 256 MB. I would like to note that if I were to delete the original file and re-run the script, we would have no egress traffic from the deletion and only 256 MB of ingress from creating the file again. I did similar experiments with CSV files and Python pickles and found no such odd behavior in the usage calculation. All tests were carried out on an Azure VM in the same region as the blob, with the blob storage mounted using blobfuse.
I would like to understand how Azure counts the traffic when modifying existing files. For those of you who use H5 on Azure Blob, is there a way to avoid the additional charge?
Python script I used to generate H5:
import tables
import numpy as np

db = 'test.h5'

# Table schema: four float64 columns (8 bytes each)
class TestTable(tables.IsDescription):
    col0 = tables.Float64Col(pos=0)
    col1 = tables.Float64Col(pos=1)
    col2 = tables.Float64Col(pos=2)
    col3 = tables.Float64Col(pos=3)

# 1M rows x 4 columns x 8 bytes = 32 MB per table; 8 tables = 256 MB in total
data = np.zeros((1024*1024, 4))
tablenames = ['Test' + str(i) for i in range(8)]

mdb = tables.open_file(db, mode="w")
# Create the tables and append the same block of rows to each one
for name in tablenames:
    mdb.create_table('/', name, TestTable)
    table = mdb.get_node('/', name)  # equivalent to mdb.root.<name>, without eval()
    table.append(list(map(tuple, data)))
    table.flush()
mdb.close()
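A possible way to isolate the blobfuse effect (purely a suggestion for comparison) is to generate the file locally and then upload it in a single call with the azure-storage-blob SDK instead of writing it through the blobfuse mount. A minimal sketch, assuming azure-storage-blob v12 and placeholder connection-string and container names:

from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str=connection_string,       # placeholder
    container_name="my-container",    # placeholder
    blob_name="test.h5",
)
with open(db, "rb") as f:
    # Replaces the blob with one upload rather than an in-place modification
    blob.upload_blob(f, overwrite=True)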

Related

How to get performance of a virtual machine in Azure?

I am trying to collect the performance data of a virtual machine, such as CPU Utilization, Available Memory, Logical Disk MB/s, and Logical Disk IOPS, which can be seen under Insights in the console. I want to collect this data and save it to a CSV file. Is there any API to get the data with Avg, Min, Max, 50th, 90th, and 95th percentiles included?
I have tried the following solutions:
az monitor metrics command: az monitor metrics list --resource {ResourceName} --metric "Percentage CPU"
API: https://management.azure.com/subscriptions/{subscription_id}/resourceGroups/{resource_group_name}/providers/Microsoft.Compute/virtualMachines/{vm_name}/providers/microsoft.insights/metrics?api-version=2018-01-01&metricnames=Percentage CPU&timespan={start_time}/{end_time}&interval=PT1H&aggregation=average
Microsoft Azure Monitor Client Library (Python SDK): azure-mgmt-monitor
In all the above-mentioned approaches, instead of the Insights CPU Utilization I'm getting results for 'Percentage CPU'; i.e., these approaches return platform metrics rather than the Insights data.
One possible solution is to use the Azure Monitor REST API which allows you to collect various metrics from a virtual machine. You can specify the metric names, time span, interval, and aggregation parameters in the request URL. For example:
https://management.azure.com/subscriptions/{subscription_id}/resourceGroups/{resource_group_name}/providers/Microsoft.Compute/virtualMachines/{vm_name}/providers/microsoft.insights/metrics?api-version=2018-01-01&metricnames=Percentage CPU,Average Memory Bytes,Disk Read Bytes/sec,Disk Write Bytes/sec,Disk Read Operations/Sec,Disk Write Operations/Sec&timespan={start_time}/{end_time}&interval=PT1H&aggregation=average,count,maximum,minimum,total
This request will return the average, count, maximum, minimum, and total values for each metric in each hour within the specified time span. You can also use other aggregation types such as percentile.
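If you prefer to call the REST endpoint directly from Python, a minimal sketch could look like the following. It assumes the azure-identity and requests packages and that the {placeholder} values from the URL above are available as Python variables:

import requests
from azure.identity import DefaultAzureCredential

# Acquire a bearer token for the ARM endpoint
token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/resourceGroups/{resource_group_name}"
    f"/providers/Microsoft.Compute/virtualMachines/{vm_name}"
    f"/providers/microsoft.insights/metrics"
)
params = {
    "api-version": "2018-01-01",
    "metricnames": "Percentage CPU",
    "timespan": f"{start_time}/{end_time}",
    "interval": "PT1H",
    "aggregation": "average,maximum,minimum",
}
# requests URL-encodes the query parameters (e.g. the space in "Percentage CPU") automatically
response = requests.get(url, headers={"Authorization": f"Bearer {token}"}, params=params)
metrics = response.json()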
Another possible solution is to use the Azure Monitor libraries for Python, which provide a wrapper around the REST API. You can install the azure-mgmt-monitor package and use the list method of the MetricsOperations class to get the metrics data. For example:
import datetime
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.monitor import MonitorManagementClient

# Build the ARM resource ID of the VM
resource_id = (
    "subscriptions/{}/"
    "resourceGroups/{}/"
    "providers/Microsoft.Compute/virtualMachines/{}"
).format(subscription_id, resource_group_name, vm_name)

# Get your credentials ready (service principal)
credentials = ServicePrincipalCredentials(
    client_id=client_id,
    secret=secret,
    tenant=tenant_id
)

# Create a monitor client
monitor_client = MonitorManagementClient(
    credentials,
    subscription_id
)

# Get metrics data
metrics_data = monitor_client.metrics.list(
    resource_id,
    timespan="{}/{}".format(start_time, end_time),
    interval='PT1H',
    metricnames="Percentage CPU,Average Memory Bytes,Disk Read Bytes/sec,Disk Write Bytes/sec,Disk Read Operations/Sec,Disk Write Operations/Sec",
    aggregation="Average,Count,Maximum,Minimum,Total",
)
This code returns a result similar to the REST API request.
To save the metrics data into a CSV file, you can use Python's built-in csv module or other libraries such as pandas: iterate over each metric value object in metrics_data.value and write its properties into a row of the CSV file.
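For example, a minimal sketch of the CSV step using only the standard csv module (the output file name is a placeholder; the attribute names follow the azure-mgmt-monitor response model of Metric -> timeseries -> data):

import csv

with open("vm_metrics.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["metric", "timestamp", "average", "count", "maximum", "minimum", "total"])
    for metric in metrics_data.value:
        for series in metric.timeseries:
            for point in series.data:
                writer.writerow([
                    metric.name.value,   # metric display name
                    point.time_stamp,
                    point.average,
                    point.count,
                    point.maximum,
                    point.minimum,
                    point.total,
                ])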

Azure Functions "The operation has timed out." for timer trigger blob archival

I have a Python Azure Functions timer trigger that is run once a day and archives files from a general purpose v2 hot storage container to a general purpose v2 cold storage container. I'm using the Linux Consumption plan. The code looks like this:
import logging

from azure.core.exceptions import ResourceExistsError, ResourceNotFoundError
from azure.storage.blob import BlobClient, ContainerClient

# List the blobs to archive from the hot container
container = ContainerClient.from_connection_string(conn_str=hot_conn_str,
                                                   container_name=hot_container_name)
blob_list = container.list_blobs(name_starts_with=hot_data_dir)
files = []
for blob in blob_list:
    files.append(blob.name)

# Copy each blob to the cold container, then delete it from the hot container
for file in files:
    blob_from = BlobClient.from_connection_string(conn_str=hot_conn_str,
                                                  container_name=hot_container_name,
                                                  blob_name=file)
    data = blob_from.download_blob()
    blob_to = BlobClient.from_connection_string(conn_str=cold_conn_str,
                                                container_name=cold_container_name,
                                                blob_name=f'archive/{file}')
    try:
        blob_to.upload_blob(data.readall())
    except ResourceExistsError:
        logging.debug(f'file already exists: {file}')
    except ResourceNotFoundError:
        logging.debug(f'file does not exist: {file}')
    container.delete_blob(blob=file)
This has been working for me for the past few months with no problems, but for the past two days I am seeing this error halfway through the archive process:
The operation has timed out.
There is no other meaningful error message other than that. If I manually call the function through the UI, it will successfully archive the rest of the files. The size of the blobs ranges from a few KB to about 5 MB and the timeout error seems to be happening on files that are 2-3MB. There is only one invocation running at a time so I don't think I'm exceeding the 1.5GB memory limit on the consumption plan (I've seen python exited with code 137 from memory issues in the past). Why am I getting this error all of a sudden when it has been working flawlessly for months?
Update
I think I'm going to try using the method found here for archival instead so I don't have to store the blob contents in memory in Python: https://www.europeclouds.com/blog/moving-files-between-storage-accounts-with-azure-functions-and-event-grid
To summarize the solution from the comments for the reference of other community members:
As mentioned in the comments, the OP used the start_copy_from_url() method instead to implement the same requirement as a workaround.
start_copy_from_url() copies the file from the original blob to the target blob directly on the service side; it is much faster than using data = blob_from.download_blob() to pull the contents into the function and then uploading the data to the target blob.
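For reference, a minimal sketch of that workaround using the same names as in the question (this is an illustration rather than the OP's exact code; if the source container is private, a SAS token would need to be appended to the source URL so the copy operation can read it):

from azure.storage.blob import BlobClient

blob_from = BlobClient.from_connection_string(conn_str=hot_conn_str,
                                              container_name=hot_container_name,
                                              blob_name=file)
blob_to = BlobClient.from_connection_string(conn_str=cold_conn_str,
                                            container_name=cold_container_name,
                                            blob_name=f'archive/{file}')
# Server-side copy: the function never holds the blob contents in memory
blob_to.start_copy_from_url(blob_from.url)
container.delete_blob(blob=file)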

Optimize the use of BigQuery resources to load 2 million JSON files from GCS using Google Dataflow

I have a vast database comprising ~2.4 million JSON files, each of which contains several records. I've created a simple apache-beam data pipeline (shown below) that follows these steps:
Read data from a GCS bucket using a glob pattern.
Extract records from JSON data.
Transform data: convert dictionaries to JSON strings, parse timestamps, others.
Write to BigQuery.
# Pipeline
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
p = beam.Pipeline(options=pipeline_options)

# Read
files = p | 'get_data' >> ReadFromText(files_pattern)

# Transform
output = (files
          | 'extract_records' >> beam.ParDo(ExtractRecordsFn())
          | 'transform_data' >> beam.ParDo(TransformDataFn()))

# Write
output | 'write_data' >> WriteToBigQuery(table=known_args.table,
                                         create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                                         write_disposition=beam.io.BigQueryDisposition.WRITE_EMPTY,
                                         insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR',
                                         temp_file_format='NEWLINE_DELIMITED_JSON')

# Run
result = p.run()
result.wait_until_finish()
I've tested this pipeline with a minimal sample dataset and it works as expected. But I'm pretty doubtful regarding the optimal use of BigQuery resources and quotas. The batch load quotas are very restrictive, and given the massive number of files to parse and load, I want to know whether I'm missing some settings that could guarantee the pipeline will respect the quotas and run optimally. I don't want to exceed the quotas, as I am running other loads to BigQuery in the same project.
I haven't finished understanding some parameters of the WriteToBigQuery() transform, specifically batch_size, max_file_size, and max_files_per_bundle, or whether they could help to optimize the load jobs to BigQuery. Could you help me with this?
Update
I'm not only concerned about BigQuery quotas, but GCP quotas of other resources used by this pipeline are also a matter of concern.
I tried to run my simple pipeline over the target data (~2.4 million files), but I'm receiving the following warning message:
Project [my-project] has insufficient quota(s) to execute this workflow with 1 instances in region us-central1. Quota summary (required/available): 1/16 instances, 1/16 CPUs, 250/2096 disk GB, 0/500 SSD disk GB, 1/99 instance groups, 1/49 managed instance groups, 1/99 instance templates, 1/0 in-use IP addresses. Please see https://cloud.google.com/compute/docs/resource-quotas about requesting more quota.
I don't understand that message completely. The process successfully started 8 workers and is using 8 of the 8 available in-use IP addresses. Is this a problem? How can I fix it?
If you're worried about load job quotas, you can try streaming data into BigQuery, which comes with a less restrictive quota policy.
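A minimal sketch of what switching the write step to streaming inserts could look like, reusing the names from the question (the values are only illustrative, and write_disposition is changed to WRITE_APPEND because streaming inserts append rows):

output | 'write_data' >> WriteToBigQuery(
    table=known_args.table,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    method=WriteToBigQuery.Method.STREAMING_INSERTS,
    insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR',
    batch_size=500,  # illustrative: rows per streaming insert request
)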
To achieve what you want, you can try the Google-provided templates or just refer to their code:
Cloud Storage Text to BigQuery (Stream) [code]
Cloud Storage Text to BigQuery (Batch)
And last but not least, more detailed information can be found in the Google BigQuery I/O connector documentation.
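As for the batch_size, max_file_size, and max_files_per_bundle parameters mentioned in the question: they are keyword arguments of the same WriteToBigQuery transform. batch_size only applies to streaming inserts, while max_file_size and max_files_per_bundle tune the files staged for load jobs. A hedged sketch with purely illustrative values, not tuned recommendations:

output | 'write_data' >> WriteToBigQuery(
    table=known_args.table,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    write_disposition=beam.io.BigQueryDisposition.WRITE_EMPTY,
    insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR',
    temp_file_format='NEWLINE_DELIMITED_JSON',
    max_file_size=256 * 1024 * 1024,  # illustrative: cap each staged file at 256 MB
    max_files_per_bundle=20,          # illustrative: limit files written per worker bundle
)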

MS Azure Data Factory ADF Copy Activity from BLOB to Azure Postgres Gen5 8 cores fails with connection closed by host error

I am using an ADF copy activity to copy files from Azure Blob to Azure Postgres. I'm doing a recursive copy, i.e. there are multiple files within the folder, which is fine. The total size of the 5 files I have to copy is around 6 GB. The activity fails after 30-60 minutes of running. I have used write batch sizes from 100 to 500, but it still fails.
I used 4, 8, or auto DIUs, and similarly tried 1, 2, 4, 8, or auto parallel connections to Postgres; normally it seems to use 1 per source file. The Azure Postgres server has 8 cores and its temp buffer size is 8192 kB; the maximum allowed is around 16000 kB. I even tried using that, but there are 2 errors I have been constantly getting. The MS support team suggested using the retry option. I'm still awaiting a response from their PG team, but below are the errors:
{
    'errorCode': '2200',
    'message': ''Type=Npgsql.NpgsqlException,Message=Exception while reading from stream,Source=Npgsql,''Type=System.IO.IOException,Message=Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.,Source=System,''Type=System.Net.Sockets.SocketException,Message=An existing connection was forcibly closed by the remote host,Source=System,'',
    'failureType': 'UserError',
    'target': 'csv to pg staging data migration',
    'details': []
}
or
Operation on target csv to pg staging data migration failed: 'Type=Npgsql.NpgsqlException,Message=Exception while flushing stream,Source=Npgsql,''Type=System.IO.IOException,Message=Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host.,Source=System,''Type=System.Net.Sockets.SocketException,Message=An existing connection was forcibly closed by the remote host,Source=System
I was also facing this issue recently and contacted our Microsoft rep, who got back to me with the following update on 2020-01-16:
“This is another issue we found in the driver; we just finished our deployment yesterday to fix this issue by upgrading the driver version. Now the customer can have up to 32,767 columns of data in one batch size (which is the limitation in PostgreSQL, we can’t exceed that). Please let the customer make sure that (write batch size * column size) < 32,767 as I mentioned; otherwise they will face the limitation.”
"Column size" refers to the count of columns in the table. The "area" (write batch size * column count) cannot be greater than 32,767. For example, a table with 20 columns allows a write batch size of at most 32,767 / 20 ≈ 1,638 rows.
I was able to change my ADF write batch size on the copy activity to a dynamic formula to ensure optimum batch sizes per table, using the following expression:
@div(32766, length(pipeline().parameters.config))
pipeline().parameters.config refers to an array containing information about the columns of the table; the length of the array equals the number of columns in the table.
Hope this helps! I was able to populate the database (albeit slowly) via ADF, though I would much prefer a COPY-based method for better performance.

s3 timing out when counting number of objects in bucket

I wrote a script to count the number of objects in S3 buckets and the total size of each bucket. The code works when I run it against a few test buckets, but it times out when I include all of the production buckets, which hold thousands of objects.
import boto3

s3 = boto3.resource('s3')

bucket_size = {}                   # bucket name -> [object count, total bytes]
bucket_list = s3.buckets.all()
skip_list = ('some-test-bucket',)  # trailing comma makes this a 1-element tuple, not a string

for bu in bucket_list:
    if bu.name not in skip_list:
        bucket_size[bu.name] = [0, 0]
        print(bu.name)
        # Lists every object in the bucket (one paginated request per 1000 keys)
        for obj in bu.objects.all():
            bucket_size[bu.name][0] += 1
            bucket_size[bu.name][1] += obj.size

print("{0:30} {1:15} {2:10}".format("bucket", "count", "size"))
for i, j in bucket_size.items():
    print("{0:30} {1:15} {2:10}".format(i, j[0], j[1]))
It starts to run, moves along and then gets hung on certain buckets like this:
botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL:
Isn't there a quicker way to get metadata like this? This approach does it the hard way, in a sense, by counting every object.
So I'm asking if there's a better script, not why it times out. When I clicked through some of the buckets that timed out, I noticed there are some .gz files in there; I don't know why that would matter.
Of course I looked at the documentation, but I find it hard to get meaningful, actionable information from it:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html
If you just wish to know the number of objects in a bucket, you can use metrics from Amazon CloudWatch.
From Monitoring Metrics with Amazon CloudWatch - Amazon Simple Storage Service:
BucketSizeBytes
The amount of data in bytes stored in a bucket in the STANDARD storage class, INTELLIGENT_TIERING storage class, Standard - Infrequent Access (STANDARD_IA) storage class, OneZone - Infrequent Access (ONEZONE_IA), Reduced Redundancy Storage (RRS) class, Deep Archive Storage (DEEP_ARCHIVE) class or, Glacier (GLACIER) storage class. This value is calculated by summing the size of all objects in the bucket (both current and noncurrent objects), including the size of all parts for all incomplete multipart uploads to the bucket.
NumberOfObjects
The total number of objects stored in a bucket for all storage classes except for the GLACIER storage class. This value is calculated by counting all objects in the bucket (both current and noncurrent objects) and the total number of parts for all incomplete multipart uploads to the bucket.
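For example, a minimal sketch of pulling those two metrics with boto3 and CloudWatch (the bucket name is a placeholder; S3 storage metrics are reported roughly once a day, so the look-back window and period are set accordingly):

import datetime
import boto3

cloudwatch = boto3.client('cloudwatch')
bucket_name = 'my-bucket'   # placeholder
now = datetime.datetime.utcnow()

def s3_storage_metric(metric_name, storage_type):
    resp = cloudwatch.get_metric_statistics(
        Namespace='AWS/S3',
        MetricName=metric_name,
        Dimensions=[
            {'Name': 'BucketName', 'Value': bucket_name},
            {'Name': 'StorageType', 'Value': storage_type},
        ],
        StartTime=now - datetime.timedelta(days=2),
        EndTime=now,
        Period=86400,
        Statistics=['Average'],
    )
    # Return the most recent daily datapoint, if any
    points = sorted(resp['Datapoints'], key=lambda p: p['Timestamp'])
    return points[-1]['Average'] if points else None

print('objects:', s3_storage_metric('NumberOfObjects', 'AllStorageTypes'))
print('bytes  :', s3_storage_metric('BucketSizeBytes', 'StandardStorage'))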
