Azure put_block_blob_from_path(): Avoid timeout error

I am uploading a 60 GB file using Python and azure-storage. I get a timeout error (read timeout=65) more often than not:
HTTPSConnectionPool(host='myaccount.blob...', port=443): Read timed out. (read timeout=65)
The code:
bs = BlobService(account_name=storage_account, account_key=account_key)
bs.put_block_blob_from_path(
    container_name=my_container,
    blob_name=azure_blobname,
    file_path=localpath,
    x_ms_blob_content_type="text/plain",
    max_connections=5
)
Is there something I can do to increase the timeout or otherwise fix this issue? put_block_blob_from_path() doesn't seem to have a timeout parameter.
I am using an older version of azure-storage (0.20.0). That's so we don't have to rewrite our code (put_block_blob_from_path no longer exists) and so we avoid the inevitable downtime as we install the new version, switch the code over, and deal with whatever crap is related to installing the new version over the old version. Is this timeout an issue that has been solved in newer versions?

There are a few things you could try:
Increase the timeout: the BlobService constructor has a timeout parameter, whose default value is 65 seconds. You can try increasing that. I believe the maximum timeout value you can specify is 90 seconds. So your code would be:
bs = BlobService(account_name=storage_account, account_key=account_key, timeout=90)
Reduce "max_connections": max_connections property defines the maximum number of parallel threads in which upload is happening. Since you're uploading a 60GB file, SDK automatically splits that file in 4MB chunks and upload 5 chunks (based on your current value) in parallel. You can try by reducing this value and see if that gets rid of the timeout error you're receiving.
Manually implement put_block and put_block_list: By default the SDK splits the blob in 4MB chunks and upload these chunks. You can try by using put_block and put_block_list methods in the SDK where you're reducing the chunk size from 4MB to a smaller value (say 512KB or 1MB).
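For illustration, a minimal sketch of that manual approach, assuming the legacy azure-storage 0.20.0 BlobService API where put_block takes (container, blob, chunk, block_id) and put_block_list takes a plain list of block IDs; verify both signatures against your installed version:
CHUNK_SIZE = 1024 * 1024  # 1 MB chunks instead of the default 4 MB

bs = BlobService(account_name=storage_account, account_key=account_key, timeout=90)

block_ids = []
with open(localpath, 'rb') as f:
    index = 0
    while True:
        chunk = f.read(CHUNK_SIZE)
        if not chunk:
            break
        # Block IDs must be unique and of equal length; the legacy SDK is expected
        # to handle block-ID encoding internally (check this for your version).
        block_id = 'block-{0:08d}'.format(index)
        bs.put_block(my_container, azure_blobname, chunk, block_id)
        block_ids.append(block_id)
        index += 1

# Commit the uploaded blocks as a single block blob.
bs.put_block_list(my_container, azure_blobname, block_ids)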

Related

Encountered a retryable error. Will Retry with exponential backoff 413

Logstash keeps encountering the following error message, and logs cannot be sent to AWS Elasticsearch.
[2021-04-28T16:01:28,253][ERROR][logstash.outputs.amazonelasticsearch]
Encountered a retryable error. Will Retry with exponential backoff
{:code=>413,
:url=>"https://search-xxxx.ap-southeast-1.es.amazonaws.com:443/_bulk"}
Because of this I always need to restart Logstash, and I cannot figure out what causes the issue. Following the Logstash documentation, I reduced pipeline.batch.size to 100, but it didn't help. Please let me know how to resolve this issue. Thanks.
pipeline.batch.size: 125
pipeline.batch.delay: 50
A 413 response is "payload too large". It does not make much sense to retry this, since it will probably recur forever and shut down the flow of events through the pipeline. If there is a proxy or load balancer between logstash and elasticsearch then it is possible that that is returning the error, not elasticsearch, in which case you may need to reconfigure the proxy.
The maximum payload size accepted by amazonelasticsearch will depend on what type of instance you are running on (scroll down to Network Limits). For some instance types it is 10 MB.
In logstash, a batch of events is divided into 20 MB chunks as it is sent to elasticsearch. The 20 MB limit cannot be adjusted (unless you want to edit the source and build your own plugin). However, if there is a single large event it has to be sent in one request, so it is still possible for a request larger than that to be sent.
Since 20 MB is bigger than 10 MB this is going to be a problem if your batch size is over 10 MB. I do not think you have any visibility into the batch size other than the 413 error. You will have to keep reducing the pipeline.batch.size until the error goes away.
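For example, in logstash.yml you would keep stepping the value down (the numbers below are purely illustrative) until every bulk request stays under your instance's max_content_length:
pipeline.batch.size: 50
pipeline.batch.delay: 50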
I've fixed the issue: we needed to choose the correct ES instance size based on max_content_length.
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/aes-limits.html

RequestList throws heap out of memory in Apify for a wordlist of more than 10 million entries

I have a wordlist of 11-character entries which I want to append to a URL. After some modification in request.js, I am able to run a 5-million-entry wordlist in the RequestList array. It starts throwing a JavaScript heap memory error when I go higher, and I have billions of wordlist entries to process. I can generate my wordlist with JS code, and 5 million entries finish in an hour due to the high server capacity I possess. RequestList is a static variable, so I can't add to it again. How can I run it indefinitely for billions of combinations? If any cron script can help, I am open to that as well.
It would be better to use RequestQueue for such a large number of requests. The queue is persisted to disk as an SQLite database, so memory usage is not an issue.
I suggest adding, say, 1000 requests to the queue and immediately starting to crawl while pushing more requests to the queue. Enqueueing tens of millions or billions of requests might take a long time, but you don't need to wait for that.
For best performance, use apify version 1.0.0 or higher.

Azure Functions "The operation has timed out." for timer trigger blob archival

I have a Python Azure Functions timer trigger that is run once a day and archives files from a general purpose v2 hot storage container to a general purpose v2 cold storage container. I'm using the Linux Consumption plan. The code looks like this:
container = ContainerClient.from_connection_string(conn_str=hot_conn_str,
                                                   container_name=hot_container_name)
blob_list = container.list_blobs(name_starts_with=hot_data_dir)
files = []
for blob in blob_list:
    files.append(blob.name)
for file in files:
    blob_from = BlobClient.from_connection_string(conn_str=hot_conn_str,
                                                  container_name=hot_container_name,
                                                  blob_name=file)
    data = blob_from.download_blob()
    blob_to = BlobClient.from_connection_string(conn_str=cold_conn_str,
                                                container_name=cold_container_name,
                                                blob_name=f'archive/{file}')
    try:
        blob_to.upload_blob(data.readall())
    except ResourceExistsError:
        logging.debug(f'file already exists: {file}')
    except ResourceNotFoundError:
        logging.debug(f'file does not exist: {file}')
    container.delete_blob(blob=file)
This has been working for me for the past few months with no problems, but for the past two days I am seeing this error halfway through the archive process:
The operation has timed out.
There is no other meaningful error message other than that. If I manually call the function through the UI, it will successfully archive the rest of the files. The size of the blobs ranges from a few KB to about 5 MB and the timeout error seems to be happening on files that are 2-3MB. There is only one invocation running at a time so I don't think I'm exceeding the 1.5GB memory limit on the consumption plan (I've seen python exited with code 137 from memory issues in the past). Why am I getting this error all of a sudden when it has been working flawlessly for months?
Update
I think I'm going to try using the method found here for archival instead so I don't have to store the blob contents in memory in Python: https://www.europeclouds.com/blog/moving-files-between-storage-accounts-with-azure-functions-and-event-grid
Just to summarize the solution from the comments for other community members' reference:
As mentioned in the comments, the OP uses the start_copy_from_url() method instead to implement the same requirement as a workaround.
start_copy_from_url() copies the file from the original blob to the target blob directly, so it works much faster than using data = blob_from.download_blob() to store the file temporarily and then uploading the data to the target blob.
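A minimal sketch of that copy-based approach, reusing the names from the question's code and assuming the source blob URL is readable by the destination account (append a SAS token if it is private):
import time
from azure.storage.blob import BlobClient

blob_from = BlobClient.from_connection_string(conn_str=hot_conn_str,
                                              container_name=hot_container_name,
                                              blob_name=file)
blob_to = BlobClient.from_connection_string(conn_str=cold_conn_str,
                                            container_name=cold_container_name,
                                            blob_name=f'archive/{file}')

# Kick off a server-side copy instead of downloading and re-uploading the content.
blob_to.start_copy_from_url(blob_from.url)

# Poll until the copy finishes before deleting the source blob.
props = blob_to.get_blob_properties()
while props.copy.status == 'pending':
    time.sleep(1)
    props = blob_to.get_blob_properties()
if props.copy.status == 'success':
    container.delete_blob(blob=file)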

python time out of stream method on gcp firestore

I am using GCP Firestore. For a particular use case, I am querying all the documents present in a collection. I am using the Python API.
Code I am using
from google.cloud import firestore
from tqdm import tqdm

db = firestore.Client()
documents = db.collection(collection_name).stream()
for doc in tqdm(documents):
    ...  # some time-consuming operation (2-3 seconds)
Everything runs fine, but after about 1 minute the for loop ends.
I thought the connection might be timing out. I found this on the documentation page:
The underlying stream of responses will time out after the max_rpc_timeout_millis value set in
the GAPIC client configuration for the RunQuery API. Snapshots not consumed from the iterator
before that point will be lost.
My question is how can I modify this timeout value, to suit my needs. Thank you.
In my case, the "503 The datastore operation timed out, or the data was temporarily unavailable." response from Firestore was also causing AttributeError: '_UnaryStreamMultiCallable' object has no attribute '_retry'.
It looks like the retry policy is not set, even though Python's firebase_admin package is capable of retrying timeout errors too. So I just configured a basic Retry object explicitly, and this solved my issue:
from google.api_core.retry import Retry
documents = db.collection(collection_name).stream(retry=Retry())
A collection of 190K items is exported in 5 minutes in my case. Originally, the iteration also has been interrupted after 60 seconds.
Counterintuitively, as mentioned in the docs, .stream() has a cumulative timeout for consuming the entire collection, not for retrieving a single item or chunk.
So, if your collection has 1000 items and every item processing takes 0.5 seconds, total consumption time will sum up to 500 seconds which is greater than the default (undocumented) timeout of 60 seconds.
Also counterintuitively, a timeout argument of the CollectionReference.stream method does not override the max_rpc_timeout_millis mentioned in the documentation. In fact, it behaves like a client-side timeout, and the operation is effectively timed out after min(max_rpc_timeout_millis / 1000, timeout) seconds.
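As a hedged illustration of that interplay, you could raise both limits when opening the stream; note that the keyword is deadline in older google-api-core releases and timeout in newer ones, so check which your installed version expects:
from google.api_core.retry import Retry
from google.cloud import firestore

db = firestore.Client()

# Allow the stream to run (and be retried) for up to 10 minutes instead of ~60 seconds.
documents = db.collection(collection_name).stream(
    retry=Retry(deadline=600),  # retry budget for the whole stream
    timeout=600,                # client-side timeout passed to the RPC layer
)
for doc in documents:
    ...  # per-document processing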

How to process the remaining payload when the handler times out in AWS Lambda?

I am trying to transform data from CSV to JSON in AWS Lambda (using Python 3). The size of the file is 65 MB, so the function times out before completing the process and the entire execution fails.
I would like to know how to handle such a case, where AWS Lambda should process as much of the data as it can within the timeout period and keep the remaining payload in an S3 bucket.
Below is the transformation code
import json
import boto3
import csv
import os

json_content = {}

def lambda_handler(event, context):
    s3_source = boto3.resource('s3')
    if event:
        fileObj = event['Records'][0]
        fileName = str(fileObj['s3']['object']['key'])
        eventTime = fileObj['eventTime']
        fileObject = s3_source.Object('inputs3', fileName)
        data = fileObject.get()['Body'].read().decode('utf-8-sig').split()
        arr = []
        csvreader = csv.DictReader(data)
        # getFile_extensionName() and extension_type are defined elsewhere in my code (not shown)
        newFile = getFile_extensionName(fileName, extension_type)
        for row in csvreader:
            arr.append(dict(row))
        json_content['Employees'] = arr
        print("Json Content is", json_content)
        s3_source.Object('s3-output', "output.json").put(
            Body=bytes(json.dumps(json_content).encode('utf-8-sig')))
        print("File Uploaded")
        return {
            'statusCode': 200,
            'fileObject': eventTime,
        }
AWS Lambda function configuration:
Memory: 640 MB
Timeout: 15 min
Since your function is timing out, you only have two options:
Increase the amount of assigned memory. This will also increase the amount of CPU assigned to the function, so it should run faster. However, this might not be enough to avoid the timeout.
or
Don't use AWS Lambda.
The most common use-case for AWS Lambda functions is for small microservices, sometimes only running for a few seconds or even a fraction of a second.
If your use-case runs for over 15 minutes, then it probably isn't a good candidate for AWS Lambda.
You can look at alternatives such as running your code on an Amazon EC2 instance or using a Fargate container.
It looks like your function is running out of memory:
Memory Size: 1792 MB Max Memory Used: 1792
Also, it only ran for 12 minutes:
Duration: 723205.42 ms
(723 seconds ≈ 12 minutes)
Therefore, you should either:
Increase memory (but this costs more), or
Change your program so that, instead of accumulating the JSON string in memory, it continually writes the output to a local file under /tmp/ and then uploads the resulting file to Amazon S3 (see the sketch after this answer)
However, the maximum disk storage space provided to an AWS Lambda function is 512MB and it appears that your output file is bigger than this. Therefore, increasing memory would be the only option. The increased expense related to assigning more resources to the Lambda function suggests that you might be better-off using EC2 or Fargate rather than Lambda.
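If the output does fit within /tmp, a streaming rewrite could look roughly like the sketch below. Bucket names, keys, and the 'Employees' wrapper mirror the question; the iter_lines()/upload_file() usage is an assumption to verify against your boto3/botocore versions:
import csv
import json
import boto3

def lambda_handler(event, context):
    s3 = boto3.resource('s3')
    record = event['Records'][0]
    file_name = str(record['s3']['object']['key'])

    # Stream the source object line by line instead of read()-ing it all into memory.
    body = s3.Object('inputs3', file_name).get()['Body']
    lines = (line.decode('utf-8-sig') for line in body.iter_lines())
    reader = csv.DictReader(lines)

    # Write the JSON incrementally to local disk rather than building one big string.
    out_path = '/tmp/output.json'
    with open(out_path, 'w', encoding='utf-8') as out:
        out.write('{"Employees": [')
        for i, row in enumerate(reader):
            if i:
                out.write(',')
            out.write(json.dumps(row))
        out.write(']}')

    # Upload the finished file; the /tmp size limit mentioned above still applies.
    s3.Object('s3-output', 'output.json').upload_file(out_path)
    return {'statusCode': 200}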
