Azure Functions "The operation has timed out." for timer trigger blob archival - azure

I have a Python Azure Functions timer trigger that runs once a day and archives files from a general purpose v2 hot storage container to a general purpose v2 cold storage container. I'm using the Linux Consumption plan. The code looks like this:
import logging

from azure.core.exceptions import ResourceExistsError, ResourceNotFoundError
from azure.storage.blob import BlobClient, ContainerClient

container = ContainerClient.from_connection_string(conn_str=hot_conn_str,
                                                   container_name=hot_container_name)
blob_list = container.list_blobs(name_starts_with=hot_data_dir)
files = []
for blob in blob_list:
    files.append(blob.name)

for file in files:
    blob_from = BlobClient.from_connection_string(conn_str=hot_conn_str,
                                                  container_name=hot_container_name,
                                                  blob_name=file)
    data = blob_from.download_blob()
    blob_to = BlobClient.from_connection_string(conn_str=cold_conn_str,
                                                container_name=cold_container_name,
                                                blob_name=f'archive/{file}')
    try:
        blob_to.upload_blob(data.readall())
    except ResourceExistsError:
        logging.debug(f'file already exists: {file}')
    except ResourceNotFoundError:
        logging.debug(f'file does not exist: {file}')
    container.delete_blob(blob=file)
This has been working for me for the past few months with no problems, but for the past two days I am seeing this error halfway through the archive process:
The operation has timed out.
There is no other meaningful error message beyond that. If I manually call the function through the UI, it successfully archives the rest of the files. The blobs range from a few KB to about 5 MB, and the timeout error seems to happen on files that are 2-3 MB. Only one invocation runs at a time, so I don't think I'm exceeding the 1.5 GB memory limit on the Consumption plan (I've seen "python exited with code 137" from memory issues in the past). Why am I getting this error all of a sudden when it has been working flawlessly for months?
Update
I think I'm going to try using the method found here for archival instead so I don't have to store the blob contents in memory in Python: https://www.europeclouds.com/blog/moving-files-between-storage-accounts-with-azure-functions-and-event-grid

To summarize the solution from the comments for the reference of other readers:
As mentioned in the comments, the OP switched to the start_copy_from_url() method to implement the same requirement as a workaround.
start_copy_from_url() copies the file from the source blob to the target blob directly on the service side, so it works much faster than using data = blob_from.download_blob() to hold the file contents temporarily and then uploading that data to the target blob.
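For illustration only (this code is not from the original question or comments), here is a minimal sketch of that approach, reusing the variable names from the code above and assuming the copy operation is allowed to read the source blob URL (e.g. the hot container permits public read, or the URL carries a SAS token):

# Minimal sketch of the server-side copy approach; variable names
# (hot_conn_str, cold_conn_str, file, ...) come from the question above.
# Assumes blob_from.url is readable by the copy operation (public access
# or a SAS-authorized URL).
from azure.storage.blob import BlobClient

blob_from = BlobClient.from_connection_string(conn_str=hot_conn_str,
                                              container_name=hot_container_name,
                                              blob_name=file)
blob_to = BlobClient.from_connection_string(conn_str=cold_conn_str,
                                            container_name=cold_container_name,
                                            blob_name=f'archive/{file}')

# The storage service copies the bytes itself, so the function never holds
# the blob contents in memory.
blob_to.start_copy_from_url(blob_from.url)

# The copy is asynchronous; poll the copy status (a single check is shown
# here) before deleting the source blob.
if blob_to.get_blob_properties().copy.status == 'success':
    blob_from.delete_blob()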

Related

What do rotatePeriod, bufferLogSize, and syncTimeout mean exactly in Winston Azure blob storage? An explanation with simple examples would be appreciated

In our project we are using the winston3-azureblob-transport NPM package to store application logs in blob storage.
However, due to an increase in users we are getting the error "409 - ClientOtherError - BlockCountExceedsLimit|The committed block count cannot exceed the maximum limit of 50,000 blocks".
Could anyone tell us whether using rotatePeriod, bufferLogSize and syncTimeout would help us stop the error "409 - ClientOtherError - BlockCountExceedsLimit|The committed block count cannot exceed the maximum limit of 50,000 blocks"?
Please also suggest any alternative solution; however, the Winston logger itself should not be replaced.
The error "The committed block count cannot exceed the maximum limit of 50,000 blocks" usually occurs when the maximum limits are exceeded.
Each block in a block blob can be a different size. Based on the Service version you are using, maximum blob size differs.
If you attempt to upload a block that is larger than maximum limit your service version is supporting, the service returns status code 409(ClientOtherError - BlockCountExceedsLimit). The service also returns additional information about the error in the response, including the maximum block size permitted in bytes.
rotatePeriod: a moment format, e.g. YYYY-MM-DD, which will generate blobName.2022.03.29
bufferLogSize: the minimum number of logs to buffer before syncing the blob; set it to 1 to sync on every log.
syncTimeout: the maximum time between two syncs; set it to zero to disable the timeout.
For more detail, please refer to this link:
GitHub - agmoss/winston-azure-blob: NEW winston transport for azure blob storage by Andrew Moss agmoss

Optimize the use of BigQuery resources to load 2 million JSON files from GCS using Google Dataflow

I have a vast database comprised of ~2.4 million JSON files that each contain several records. I've created a simple apache-beam data pipeline (shown below) that follows these steps:
Read data from a GCS bucket using a glob pattern.
Extract records from JSON data.
Transform data: convert dictionaries to JSON strings, parse timestamps, others.
Write to BigQuery.
import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io.gcp.bigquery import WriteToBigQuery
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

# Pipeline
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
p = beam.Pipeline(options=pipeline_options)

# Read
files = p | 'get_data' >> ReadFromText(files_pattern)

# Transform
output = (files
          | 'extract_records' >> beam.ParDo(ExtractRecordsFn())
          | 'transform_data' >> beam.ParDo(TransformDataFn()))

# Write
output | 'write_data' >> WriteToBigQuery(table=known_args.table,
                                         create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                                         write_disposition=beam.io.BigQueryDisposition.WRITE_EMPTY,
                                         insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR',
                                         temp_file_format='NEWLINE_DELIMITED_JSON')

# Run
result = p.run()
result.wait_until_finish()
I've tested this pipeline with a minimal sample dataset and it works as expected. But I'm pretty doubtful regarding the optimal use of BigQuery resources and quotas. The batch load quotas are very restrictive, and due to the massive number of files to parse and load, I want to know if I'm missing some settings that could guarantee the pipeline will respect the quotas and run optimally. I don't want to exceed the quotas, as I am running other loads to BigQuery in the same project.
I haven't fully understood some parameters of the WriteToBigQuery() transform, specifically batch_size, max_file_size, and max_files_per_bundle, or whether they could help optimize the load jobs to BigQuery. Could you help me with this?
Update
I'm not only concerned about BigQuery quotas, but GCP quotas of other resources used by this pipeline are also a matter of concern.
I tried to run my simple pipeline over the target data (~2.4 million files), but I'm receiving the following warning message:
Project [my-project] has insufficient quota(s) to execute this workflow with 1 instances in region us-central1. Quota summary (required/available): 1/16 instances, 1/16 CPUs, 250/2096 disk GB, 0/500 SSD disk GB, 1/99 instance groups, 1/49 managed instance groups, 1/99 instance templates, 1/0 in-use IP addresses. Please see https://cloud.google.com/compute/docs/resource-quotas about requesting more quota.
I don't understand that message completely. The process activated 8 workers successfully and is using 8 of the 8 available in-use IP addresses. Is this a problem? How could I fix it?
If you're worried about load job quotas, you can try streaming data into BigQuery, which comes with a less restrictive quota policy.
To achieve what you want to do, you can try the Google-provided templates or just refer to their code.
Cloud Storage Text to BigQuery (Stream) [code]
Cloud Storage Text to BigQuery (Batch)
And last but not least, more detailed information can be found in the Google BigQuery I/O connector documentation.
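As a rough, hedged illustration (not code from the original answer), switching the write step of the pipeline above from batch load jobs to streaming inserts comes down to the method argument of WriteToBigQuery; the write_disposition and batch_size values below are placeholder choices:

# Hedged sketch: the same 'write_data' step using streaming inserts instead
# of batch load jobs. WRITE_APPEND and batch_size=500 are illustrative
# values, not recommendations from the original answer.
output | 'write_data' >> WriteToBigQuery(
    table=known_args.table,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    method=WriteToBigQuery.Method.STREAMING_INSERTS,
    insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR',
    batch_size=500)  # rows per streaming insert request

Regarding the parameters asked about in the question: as far as I understand, batch_size only applies to streaming inserts (rows per insert request), whereas max_file_size and max_files_per_bundle only affect the file-load path, capping the size of each temporary file and how many temporary files a worker writes concurrently.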

MS Azure Data Factory ADF Copy Activity from BLOB to Azure Postgres Gen5 8 cores fails with connection closed by host error

I am using an ADF copy activity to copy files from Azure Blob storage to Azure Postgres. I'm doing a recursive copy, i.e. there are multiple files within the folder; that's fine. The total size of the 5 files I have to copy is around 6 GB. The activity fails after 30-60 minutes of running. I used a write batch size from 100 to 500, but it still fails.
I used 4, 8 or auto DIUs, and similarly tried 1, 2, 4, 8 or auto parallel connections to Postgres; normally it seems to use 1 per source file. The Azure Postgres server has 8 cores and the temp buffer size is 8192 kB (the maximum allowed is around 16000 kB); I even tried using that, but there are 2 errors I have been constantly getting. The MS support team suggested using the retry option. I'm still awaiting a response from their PG team, but below are the errors.
Answer:
{
    'errorCode': '2200',
    'message': ''Type=Npgsql.NpgsqlException,Message=Exception while reading from stream,Source=Npgsql,''Type=System.IO.IOException,Message=Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.,Source=System,''Type=System.Net.Sockets.SocketException,Message=An existing connection was forcibly closed by the remote host,Source=System,'',
    'failureType': 'UserError',
    'target': 'csv to pg staging data migration',
    'details': []
}
or
Operation on target csv to pg staging data migration failed: 'Type=Npgsql.NpgsqlException,Message=Exception while flushing stream,Source=Npgsql,''Type=System.IO.IOException,Message=Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host.,Source=System,''Type=System.Net.Sockets.SocketException,Message=An existing connection was forcibly closed by the remote host,Source=System
I was also facing this issue recently and contacted our Microsoft rep, who got back to me with the following update on 2020-01-16:
“This is another issue we found in the driver, we just finished our
deployment yesterday to fix this issue by upgrading driver version.
Now customer can have up to 32767 columns data in one batch size(which
is the limitation in PostgreSQL, we can’t exceed that).
Please let customer make sure that (Write batch size* column size)<
32767 as I mentioned, otherwise they will face the limitation. “
"Column size" refers to the count of columns in the table. The "area" (row write batch size * column count) cannot be greater than 32,767.
I was able to change my ADF write batch size on copy activity to a dynamic formula to ensure optimum batch sizes per table with the following:
#div(32766,length(pipeline().parameters.config)
pipeline().parameters.config refers to an array containing information about columns for the table. the length of the array = number of columns for table.
hope this helps! I was able to populate the database (albeit slowly) via ADF... would much prefer a COPY based method for better performance.

Out of memory Exception running U-SQL Activity using Azure Data Factory

I am running a U-SQL activity as part of a pipeline in Azure Data Factory for a defined time slice. The U-SQL activity runs a sequence of U-SQL scripts that read in and process data stored in Azure Data Lake. While the data processes successfully in my local run, it throws a System Out of Memory exception when running in the Azure Data Factory cloud environment.
The input data is approximately 200 MB, which should not be a problem to process, as bigger data sets have been processed previously.
Since memory management is assumed to scale as needed, it is surprising to see an Out of Memory exception in an Azure cloud environment. Following are exception snapshots of two runs on the same input data; the only difference is the time at which they occurred.
Exception Snapshot - 1
Exception Snapshot - 2
Any assistance is highly appreciated, thanks.
Further update: On further investigation it was observed that skipping the header row using the variable skipNRow:1 resolved the issue. Our U-SQL code-behind snippet has a loop that is conditioned on a date comparison; it's possible the loop wasn't terminating because of an invalid DateTime cast of the header row column, given that the snippet processes a DateTime-typed column as input. That should ideally give an invalid date time format exception, but we see an Out of Memory exception instead.
It looks like something in the user code is causing the exception. You can try running the failed-vertex debug feature in Visual Studio: open the failed job in VS and it should give you an error bar in the job overview that lets you kick off that process. It will download the failed portion to the desktop and let you step through it.

Start-AzureStorageBlobCopy vs AzCopy: which one takes lesser time

I need to move vhds from one subscription to other. I would like to know which one is better option for the same: Start-AzureStorageBlobCopy or AzCopy?
Which one takes less time?
Both of them would take the same time, as all they do is initiate an asynchronous server-side blob copy. They just tell the service to start copying the blob from source to destination; the actual copy operation is performed by the Azure Blob Storage service. The time it takes to copy the blob depends on a number of factors including but not limited to:
Source & destination location.
Size of the source blob.
Load on storage service.
Running AzCopy without specifying the /SyncCopy option and running the PowerShell command Start-AzureStorageBlobCopy should take the same duration, because they both use server-side asynchronous copy.
If you'd like to copy blobs across regions, you'd better consider specifying the /SyncCopy option when executing AzCopy in order to achieve a consistent speed, because the asynchronous copy runs in the background on the storage service's servers; that being said, you might see inconsistent copying speed among your "copying" operations.
If the /SyncCopy option is specified, AzCopy downloads the content to memory first, and then uploads it back to Azure Storage. To achieve better performance with /SyncCopy, you should run AzCopy in a VM in the same region as the source storage account. Besides that, the VM size (which determines bandwidth and CPU core count) will probably affect the copying performance as well.
For further information, please refer to Getting Started with the AzCopy Command-Line Utility
They don't take the same time.
I've tried to copy from one account to another and have a huge difference.
Start-AzureStorageBlobCopy -SrcBlob $_.Name -SrcContainer $Container -Context $ContextSrc -DestContainer $Container -DestBlob $_.Name -DestContext $ContextDst -Verbose
This takes about 2.5 hours.
& .\AzCopy.exe /Source:https://$StorageAccountNameSrc.blob.core.windows.net/$Container /Dest:https://$StorageAccountNameDst.blob.core.windows.net/$Container /SourceKey:$StorageAccountKeySrc /DestKey:$StorageAccountKeyDst /S
This takes several minutes.
I have about 600 MB and about 7000 files here.
Elapsed time: 00.00:03:41
Finished 44 of total 44 file(s).
[2017/06/22 17:05:35] Transfer summary:
-----------------
Total files transferred: 44
Transfer successfully: 44
Transfer skipped: 0
Transfer failed: 0
Elapsed time: 00.00:00:08
Finished 345 of total 345 file(s).
[2017/06/22 17:06:07] Transfer summary:
-----------------
Total files transferred: 345
Transfer successfully: 345
Transfer skipped: 0
Transfer failed: 0
Elapsed time: 00.00:00:31
Does anyone know why it's so different?
In most scenarios, AzCopy is likely to be quicker than Start-AzureStorageBlobCopy due to the way you initiate the copy, resulting in fewer calls to the Azure API:
[AzCopy] 1 call for the whole container (regardless of blob count)
vs
[Start-AzureStorageBlobCopy] N calls, one per blob in the container.
Initially I thought it would be the same, as both appear to trigger the same asynchronous copies on the Azure side; however, on the client side the difference is directly visible, as @Evgeniy has found in his answer.
In a one-blob-per-container scenario, theoretically both commands would complete at the same time.
*EDIT (possible workaround): I was able to decrease my time tremendously by:
Removing console output, AND
Using the -ConcurrentTaskCount switch, set to 100 in my case. That cut it down to under 5 minutes.
AzCopy offers an SLA that the asynchronous copy service lacks. AzCopy is designed for optimal performance. Use the /SyncCopy parameter to get a consistent copy speed.
