Optimize the use of BigQuery resources to load 2 million JSON files from GCS using Google Dataflow - python-3.x

I have a large dataset comprising ~2.4 million JSON files, each of which contains several records. I've created a simple Apache Beam data pipeline (shown below) that follows these steps:
Read data from a GCS bucket using a glob pattern.
Extract records from the JSON data.
Transform the data: convert dictionaries to JSON strings, parse timestamps, and so on.
Write to BigQuery.
import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToBigQuery
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

# Pipeline
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
p = beam.Pipeline(options=pipeline_options)

# Read
files = p | 'get_data' >> ReadFromText(files_pattern)

# Transform
output = (files
          | 'extract_records' >> beam.ParDo(ExtractRecordsFn())
          | 'transform_data' >> beam.ParDo(TransformDataFn()))

# Write
output | 'write_data' >> WriteToBigQuery(table=known_args.table,
                                         create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                                         write_disposition=beam.io.BigQueryDisposition.WRITE_EMPTY,
                                         insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR',
                                         temp_file_format='NEWLINE_DELIMITED_JSON')

# Run
result = p.run()
result.wait_until_finish()
I've tested this pipeline with a minimal sample dataset, and it works as expected. But I'm doubtful about the optimal use of BigQuery resources and quotas. The batch load quotas are very restrictive, and given the massive number of files to parse and load, I want to know whether I'm missing settings that would guarantee the pipeline respects the quotas and runs optimally. I don't want to exceed the quotas, as I'm running other loads to BigQuery in the same project.
I don't yet fully understand some parameters of the WriteToBigQuery() transform, specifically batch_size, max_file_size, and max_files_per_bundle, or whether they could help optimize the load jobs to BigQuery. Could you help me with this?
Update
I'm not only concerned about BigQuery quotas, but GCP quotas of other resources used by this pipeline are also a matter of concern.
I tried to run my simple pipeline over the target data (~2.4 million files), but I'm receiving the following warning message:
Project [my-project] has insufficient quota(s) to execute this workflow with 1 instances in region us-central1. Quota summary (required/available): 1/16 instances, 1/16 CPUs, 250/2096 disk GB, 0/500 SSD disk GB, 1/99 instance groups, 1/49 managed instance groups, 1/99 instance templates, 1/0 in-use IP addresses. Please see https://cloud.google.com/compute/docs/resource-quotas about requesting more quota.
I don't fully understand that message. The process successfully activated 8 workers and is using 8 of the 8 available in-use IP addresses. Is this a problem? How can I fix it?

If you're worried about load job quotas, you can try streaming data into BigQuery, which comes with a less restrictive quota policy.
To achieve what you want to do, you can try the Google-provided Dataflow templates or just refer to their code:
Cloud Storage Text to BigQuery (Stream) [code]
Cloud Storage Text to BigQuery (Batch)
Last but not least, more detailed information can be found in the documentation for the Google BigQuery I/O connector.
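If you go the streaming route, here is a minimal sketch of how the write step of the pipeline above could be switched from load jobs to streaming inserts, assuming the same imports and variables (output, known_args) as before; the batch_size value is illustrative, not tuned, and write_disposition is switched to WRITE_APPEND, which is the usual choice when streaming:

# Sketch: streaming inserts instead of batch load jobs.
# batch_size controls how many rows go into each streaming insert request.
output | 'write_data' >> WriteToBigQuery(
    table=known_args.table,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    method=WriteToBigQuery.Method.STREAMING_INSERTS,
    insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR',
    batch_size=500)

Keep in mind that streaming inserts are billed differently from load jobs (load jobs are free, streaming inserts are charged per GB), so weigh the quota relief against the extra cost.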

Related

How to get performance of a virtual machine in Azure?

I am trying to collect performance data for a virtual machine, such as CPU Utilization, Available Memory, Logical Disk MB/s, and Logical Disk IOPS, which can be seen under Insights in the console. I want to collect these data and save them to a CSV file. Is there an API to get the data with Avg, Min, Max, 50th, 90th, and 95th percentiles included?
I have tried the following solutions:
az monitor metrics command: az monitor metrics list --resource {ResourceName} --metric "Percentage CPU"
API: https://management.azure.com/subscriptions/{subscription_id}/resourceGroups/{resource_group_name}/providers/Microsoft.Compute/virtualMachines/{vm_name}/providers/microsoft.insights/metrics?api-version=2018-01-01&metricnames=Percentage CPU&timespan={start_time}/{end_time}&interval=PT1H&aggregation=average
Microsoft Azure Monitor Client Library (Python SDK): azure-mgmt-monitor
In all of the approaches above, instead of CPU Utilization I'm getting results for 'Percentage CPU'; i.e., these approaches return platform metrics rather than the Insights data.
One possible solution is to use the Azure Monitor REST API, which allows you to collect various metrics from a virtual machine. You can specify the metric names, time span, interval, and aggregation parameters in the request URL. For example:
https://management.azure.com/subscriptions/{subscription_id}/resourceGroups/{resource_group_name}/providers/Microsoft.Compute/virtualMachines/{vm_name}/providers/microsoft.insights/metrics?api-version=2018-01-01&metricnames=Percentage CPU,Average Memory Bytes,Disk Read Bytes/sec,Disk Write Bytes/sec,Disk Read Operations/Sec,Disk Write Operations/Sec&timespan={start_time}/{end_time}&interval=PT1H&aggregation=average,count,maximum,minimum,total
This request will return the average, count, maximum, minimum, and total values for each metric for each hour within the specified time span. You can also use other aggregation types, such as percentile.
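As a rough sketch, here is how that request could be issued from Python with the requests library, assuming you have already obtained an AAD bearer token as access_token and starting with just Percentage CPU (add the other metric names from the URL above as needed):

import requests

# Sketch: call the Azure Monitor REST API directly.
# 'access_token' is assumed to be an AAD bearer token obtained separately.
url = ("https://management.azure.com/subscriptions/{}/resourceGroups/{}/"
       "providers/Microsoft.Compute/virtualMachines/{}/"
       "providers/microsoft.insights/metrics").format(
           subscription_id, resource_group_name, vm_name)
params = {
    "api-version": "2018-01-01",
    "metricnames": "Percentage CPU",  # add further metric names as needed
    "timespan": "{}/{}".format(start_time, end_time),
    "interval": "PT1H",
    "aggregation": "average,count,maximum,minimum,total",
}
response = requests.get(url, params=params,
                        headers={"Authorization": "Bearer " + access_token})
metrics = response.json()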
Another possible solution is to use the Azure Monitor libraries for Python, which provide a wrapper around the REST API. You can install the azure-mgmt-monitor package and use the list method of the MetricsOperations class to get the metrics data. For example:
import datetime
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.monitor import MonitorManagementClient

# Get the ARM id of your resource
resource_id = (
    "subscriptions/{}/"
    "resourceGroups/{}/"
    "providers/Microsoft.Compute/virtualMachines/{}"
).format(subscription_id, resource_group_name, vm_name)

# Get your credentials ready
credentials = ServicePrincipalCredentials(
    client_id=client_id,
    secret=secret,
    tenant=tenant_id
)

# Create a monitor client
monitor_client = MonitorManagementClient(
    credentials,
    subscription_id
)

# Get metrics data
metrics_data = monitor_client.metrics.list(
    resource_id,
    timespan="{}/{}".format(start_time, end_time),
    interval='PT1H',
    metricnames="Percentage CPU,Average Memory Bytes,Disk Read Bytes/sec,Disk Write Bytes/sec,Disk Read Operations/Sec,Disk Write Operations/Sec",
    aggregation="Average,count,maximum,minimum,total",
)
This code will return a result similar to the REST API request.
To save the metrics data into a CSV file, you can use Python's built-in csv module or a library such as pandas. You can iterate over each metric value object in metrics_data.value and write its properties into a row of the CSV file.
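A minimal sketch of that flattening step with the csv module, assuming the azure-mgmt-monitor response model (metric.timeseries[].data[] with average, count, maximum, minimum, and total attributes); adjust attribute names if your SDK version differs:

import csv

# Sketch: flatten the response from monitor_client.metrics.list() into a CSV file.
with open('vm_metrics.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['metric', 'timestamp', 'average', 'count',
                     'maximum', 'minimum', 'total'])
    for metric in metrics_data.value:
        for series in metric.timeseries:
            for point in series.data:
                writer.writerow([metric.name.value, point.time_stamp,
                                 point.average, point.count,
                                 point.maximum, point.minimum, point.total])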

HDF5 Usage in Azure Blob

We store some data in HDF5 format on Azure Blob Storage. I have noticed higher-than-expected ingress traffic and used capacity when overwriting and modifying H5 files.
To test the usage, I used a Python script to generate an H5 file that is exactly 256 MB in size. The attached plot from the Azure portal shows usage during the experiments:
The first peak is the initial creation of the H5 file. Ingress traffic is 256 MB and there's no egress, as expected.
The second peak is when I ran the same script again without deleting the file created in the first run. It shows egress traffic of 256 MB and total ingress of 512 MB. The resulting file is still 256 MB.
I ran it a third time without deleting the file, and the third peak shows the same usage as the second.
The used capacity seems to be calculated based on ingress traffic, so we are being charged for 512 MB even though we are only using 256 MB. I would like to note that if I delete the original file and re-run the script, there is no egress traffic from the deletion and only 256 MB of ingress from creating the file again. I did similar experiments with CSV files and Python pickles and found no such odd behavior in the usage calculation. All tests were carried out on an Azure VM in the same region as the blob, with the blob storage mounted using blobfuse.
I would like to understand how Azure counts the traffic when modifying existing files. For those of you who use H5 on Azure blob storage, is there a way to avoid the additional charge?
Python script I used to generate H5:
import tables
import numpy as np

db = 'test.h5'

# Table schema: four float64 columns
class TestTable(tables.IsDescription):
    col0 = tables.Float64Col(pos=0)
    col1 = tables.Float64Col(pos=1)
    col2 = tables.Float64Col(pos=2)
    col3 = tables.Float64Col(pos=3)

# 1M rows x 4 columns x 8 bytes per table, times 8 tables = 256 MB of data
data = np.zeros((1024 * 1024, 4))
tablenames = ['Test' + str(i) for i in range(8)]

mdb = tables.open_file(db, mode="w")

# Create tables and fill them
for name in tablenames:
    table = mdb.create_table('/', name, TestTable)
    table.append(list(map(tuple, data)))
    table.flush()

mdb.close()

Submitting multiple runs to the same node on AzureML

I want to perform a hyperparameter search using AzureML. My models are small (around 1 GB), so I would like to run multiple models on the same GPU/node to save costs, but I do not know how to achieve this.
The way I currently submit jobs is the following (resulting in one training run per GPU/node):
from azureml.core import Experiment, ScriptRunConfig

experiment = Experiment(workspace, experiment_name)
config = ScriptRunConfig(source_directory="./src",
                         script="train.py",
                         compute_target="gpu_cluster",
                         environment="env_name",
                         arguments=["--args args"])
run = experiment.submit(config)
ScriptRunConfig can be provided with a distributed_job_config. I tried to use an MpiConfiguration there, but if this is done the run fails due to an MPI error that reads as if the cluster is configured to allow only one run per node:
Open RTE detected a bad parameter in hostfile: [...]
The max_slots parameter is less than the slots parameter:
slots = 3
max_slots = 1
[...] ORTE_ERROR_LOG: Bad Parameter in file util/hostfile/hostfile.c at line 407
Using HyperDriveConfig also defaults to submitting one run per GPU, and additionally providing an MpiConfiguration leads to the same error as shown above.
I guess I could always rewrite my training script to train multiple models in parallel, such that each run wraps multiple trainings. I would like to avoid this option though, because logging and checkpoint writes become increasingly messy and it would require a large refactor of the training pipeline. Also, this functionality seems so basic that I hope there is a way to do it gracefully. Any ideas?
Use the Run.create_children method, which starts child runs that are "local" to the parent run and don't need authentication.
For AmlCompute, max_concurrent_runs maps to the maximum number of nodes that will be used for a hyperparameter tuning run, so there would be one execution per node.
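A minimal sketch of that approach inside a single training script, assuming you supply the hyperparameter values yourself (the learning_rates list below is purely illustrative):

from azureml.core import Run

# Sketch: spawn child runs on the node this script is already running on.
parent_run = Run.get_context()
learning_rates = [1e-3, 1e-4, 1e-5]  # illustrative hyperparameter values

children = parent_run.create_children(count=len(learning_rates))
for child, lr in zip(children, learning_rates):
    child.log("learning_rate", lr)
    # ... train one small model here (e.g. in its own process) and log results ...
    child.complete()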
Alternatively, you can deploy a single service but load multiple model versions in init(); the score function can then use a particular model version to score, depending on the request's parameters.
Or you can use the new ML Endpoints (Preview):
What are endpoints (preview) - Azure Machine Learning | Microsoft Docs
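For the single-service option mentioned above, a rough sketch of what the scoring script could look like (the joblib serialization, the file names, and the model_version request field are assumptions for illustration):

import json
import joblib  # assumed serialization format for this sketch

# Sketch: one deployment serving several model versions.
def init():
    global models
    # Hypothetical file names; in practice these would come from Model.get_model_path()
    models = {
        "v1": joblib.load("model_v1.pkl"),
        "v2": joblib.load("model_v2.pkl"),
    }

def run(raw_data):
    payload = json.loads(raw_data)
    # 'model_version' is an assumed request field selecting which model to use
    model = models.get(payload.get("model_version", "v1"), models["v1"])
    prediction = model.predict(payload["data"])
    return {"prediction": list(prediction)}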

Azure Functions "The operation has timed out." for timer trigger blob archival

I have a Python Azure Functions timer trigger that is run once a day and archives files from a general purpose v2 hot storage container to a general purpose v2 cold storage container. I'm using the Linux Consumption plan. The code looks like this:
import logging

from azure.core.exceptions import ResourceExistsError, ResourceNotFoundError
from azure.storage.blob import BlobClient, ContainerClient

container = ContainerClient.from_connection_string(conn_str=hot_conn_str,
                                                   container_name=hot_container_name)
blob_list = container.list_blobs(name_starts_with=hot_data_dir)
files = []
for blob in blob_list:
    files.append(blob.name)
for file in files:
    blob_from = BlobClient.from_connection_string(conn_str=hot_conn_str,
                                                  container_name=hot_container_name,
                                                  blob_name=file)
    data = blob_from.download_blob()
    blob_to = BlobClient.from_connection_string(conn_str=cold_conn_str,
                                                container_name=cold_container_name,
                                                blob_name=f'archive/{file}')
    try:
        blob_to.upload_blob(data.readall())
    except ResourceExistsError:
        logging.debug(f'file already exists: {file}')
    except ResourceNotFoundError:
        logging.debug(f'file does not exist: {file}')
    container.delete_blob(blob=file)
This has been working for me for the past few months with no problems, but for the past two days I have been seeing this error halfway through the archive process:
The operation has timed out.
There is no other meaningful error message beyond that. If I manually call the function through the UI, it successfully archives the rest of the files. The size of the blobs ranges from a few KB to about 5 MB, and the timeout error seems to happen on files that are 2-3 MB. Only one invocation runs at a time, so I don't think I'm exceeding the 1.5 GB memory limit on the Consumption plan (I've seen "python exited with code 137" from memory issues in the past). Why am I suddenly getting this error when it has been working flawlessly for months?
Update
I think I'm going to try using the method found here for archival instead so I don't have to store the blob contents in memory in Python: https://www.europeclouds.com/blog/moving-files-between-storage-accounts-with-azure-functions-and-event-grid
To summarize the solution from the comments for the reference of other community members:
As mentioned in the comments, the OP used the start_copy_from_url() method instead as a workaround that implements the same requirement.
start_copy_from_url() copies the file from the original blob to the target blob directly, which works much faster than using data = blob_from.download_blob() to stage the file temporarily and then uploading the data to the target blob.
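A minimal sketch of the copy loop using that approach, keeping the same connection strings and variable names as in the original code (if the hot container is not otherwise readable by the copy operation, a SAS token may need to be appended to the source URL):

from azure.storage.blob import BlobClient

for file in files:
    blob_from = BlobClient.from_connection_string(conn_str=hot_conn_str,
                                                  container_name=hot_container_name,
                                                  blob_name=file)
    blob_to = BlobClient.from_connection_string(conn_str=cold_conn_str,
                                                container_name=cold_container_name,
                                                blob_name=f'archive/{file}')
    # Server-side, asynchronous copy; the blob contents never pass through the function
    blob_to.start_copy_from_url(blob_from.url)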

AWS Glue Job fails at create_dynamic_frame_from_options when reading from s3 bucket with lot of files

The data inside my s3 bucket looks like this...
s3://bucketName/prefix/userId/XYZ.gz
There are around 20 million users, and within each user's subfolder, there will be 1 - 10 files.
My glue job starts like this...
datasource0 = glueContext.create_dynamic_frame_from_options(
    "s3",
    {'paths': ["s3://bucketname/prefix/"],
     'useS3ListImplementation': True,
     'recurse': True,
     'groupFiles': 'inPartition',
     'groupSize': 100 * 1024 * 1024},
    format="json",
    transformation_ctx="datasource0")
I have attempted a number of optimizations, such as groupFiles, groupSize, and useS3ListImplementation, as shown above.
I'm using G.2X worker instances to provide the maximum memory for the job.
However, this job fails consistently on that first line with 'SDKClientException, Unable to execute HTTP request: Unsupported record version Unknown-0.0', and with an 'Unable to execute HTTP request: Received close_notify during handshake' error when useS3ListImplementation is enabled.
From the monitoring, I observe that this job uses only one executor, even though I have allocated 10 (or 20 in some runs); driver memory grows to 100% and CPU hovers around 50%.
I understand my S3 folders are not organized in the best way. Given this structure, is there a way to make this Glue job work?
My objective is to transform the JSON data inside those historical folders to Parquet in one go. Any better way of achieving this is also welcome.
