Reduce Service Fabric backup size - azure

I'm trying to use Service Fabric backups with Actors:
var backupDescription = new BackupDescription(BackupOption.Full, BackupCallbackAsync);
await BackupAsync(backupDescription, TimeSpan.FromHours(1), cancellationToken);
But I've noticed that a single backup may contain several files like:
edb0000036A.log 5120 KB
edb0000036B.log 5120 KB
edb00000366.log 5120 KB
...
I haven't found any information about these files, but they appear to be just logs, so perhaps I don't need to include them. Am I right, or must these files be included in the backup?
These files are quite large, so I'm trying to reduce the size of the backups.
UPDATE 1:
I have tried to use incremental backups, but according to MSDN, Actors do not support them. I tested it anyway and got the exception "Invalid backup option. Parameter name: option".

Instead of doing full backups every hour, you can also use incremental backups, which are smaller (for example, a full backup every day and an incremental backup every hour).
The log files are transaction logs; they are not optional for restore. More info here.

Related

What do rotatePeriod, bufferLogSize, and syncTimeout mean exactly in the Winston Azure Blob Storage transport? Explanations with simple examples are appreciated

In our project we are using winston3-azureblob-transport NPM package to store Application logs to blob storage.
However, as the number of users has grown, we are getting the error "409 - ClientOtherError - BlockCountExceedsLimit|The committed block count cannot exceed the maximum limit of 50,000 blocks".
Could anyone tell us whether using rotatePeriod, bufferLogSize and syncTimeout would help us stop this error?
Please also suggest any alternative solution, as long as the Winston logger does not have to be replaced.
The error "The committed block count cannot exceed the maximum limit of 50,000 blocks" occurs when a single blob would end up with more than 50,000 committed blocks.
Each block in a block blob can be a different size. The maximum block size depends on the service version you are using.
Block size is a separate limit from block count: the 409 (ClientOtherError - BlockCountExceedsLimit) status here refers to the number of blocks, not to their size. A block that is larger than the maximum your service version supports is rejected with its own error, and the response includes the maximum block size permitted in bytes.
rotatePeriod: a moment date format used to rotate the blob name; for example, YYYY-MM-DD will generate blobName.2022.03.29
bufferLogSize: the minimum number of log entries to buffer before syncing to the blob; set it to 1 to sync on every log entry.
syncTimeout: the maximum time between two syncs; set it to zero to disable the timeout.
For more detail, please refer to this link:
GitHub - agmoss/winston-azure-blob: NEW winston transport for azure blob storage by Andrew Moss agmoss
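As a rough illustration of why rotatePeriod and bufferLogSize matter for this error, the sketch below estimates how many blocks a single blob accumulates during one rotation period. It assumes that every sync writes exactly one block and that a sync happens each time bufferLogSize entries have accumulated; that is an assumption made for the arithmetic, not documented behaviour of the transport. The function names and the load figures are hypothetical.

```python
import math

BLOCK_LIMIT = 50_000  # committed-block limit quoted in the error message


def blocks_per_blob(logs_per_rotation: int, buffer_log_size: int) -> int:
    """Estimate the blocks written to one blob during one rotation period.

    Assumption: each sync writes exactly one block, and a sync is triggered
    every time `buffer_log_size` log entries have accumulated.
    """
    if buffer_log_size < 1:
        raise ValueError("buffer_log_size must be at least 1")
    return math.ceil(logs_per_rotation / buffer_log_size)


def fits_in_one_blob(logs_per_rotation: int, buffer_log_size: int) -> bool:
    """True if the estimated block count stays within the 50,000 limit."""
    return blocks_per_blob(logs_per_rotation, buffer_log_size) <= BLOCK_LIMIT


if __name__ == "__main__":
    # Hypothetical load: 2,000,000 log entries per day with daily rotation.
    for size in (1, 10, 100):
        print(size, blocks_per_blob(2_000_000, size), fits_in_one_blob(2_000_000, size))
```

Under these assumptions, a shorter rotation period lowers logs_per_rotation and a larger bufferLogSize lowers the number of syncs, so both reduce the block count per blob.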

Azure Functions "The operation has timed out." for timer trigger blob archival

I have a Python Azure Functions timer trigger that is run once a day and archives files from a general purpose v2 hot storage container to a general purpose v2 cold storage container. I'm using the Linux Consumption plan. The code looks like this:
import logging

from azure.core.exceptions import ResourceExistsError, ResourceNotFoundError
from azure.storage.blob import BlobClient, ContainerClient

container = ContainerClient.from_connection_string(conn_str=hot_conn_str,
                                                   container_name=hot_container_name)
blob_list = container.list_blobs(name_starts_with=hot_data_dir)
files = []
for blob in blob_list:
    files.append(blob.name)
for file in files:
    blob_from = BlobClient.from_connection_string(conn_str=hot_conn_str,
                                                  container_name=hot_container_name,
                                                  blob_name=file)
    # Downloads the whole blob into the function's memory before re-uploading it
    data = blob_from.download_blob()
    blob_to = BlobClient.from_connection_string(conn_str=cold_conn_str,
                                                container_name=cold_container_name,
                                                blob_name=f'archive/{file}')
    try:
        blob_to.upload_blob(data.readall())
    except ResourceExistsError:
        logging.debug(f'file already exists: {file}')
    except ResourceNotFoundError:
        logging.debug(f'file does not exist: {file}')
    container.delete_blob(blob=file)
This has been working for me for the past few months with no problems, but for the past two days I am seeing this error halfway through the archive process:
The operation has timed out.
There is no other meaningful error message other than that. If I manually call the function through the UI, it will successfully archive the rest of the files. The size of the blobs ranges from a few KB to about 5 MB and the timeout error seems to be happening on files that are 2-3MB. There is only one invocation running at a time so I don't think I'm exceeding the 1.5GB memory limit on the consumption plan (I've seen python exited with code 137 from memory issues in the past). Why am I getting this error all of a sudden when it has been working flawlessly for months?
Update
I think I'm going to try using the method found here for archival instead so I don't have to store the blob contents in memory in Python: https://www.europeclouds.com/blog/moving-files-between-storage-accounts-with-azure-functions-and-event-grid
Summarizing the solution from the comments for other community members' reference:
As mentioned in the comments, the OP switched to the start_copy_from_url() method as a workaround to meet the same requirement.
start_copy_from_url() copies the file directly from the source blob to the target blob, which is much faster than using data = blob_from.download_blob() to hold the file temporarily and then uploading the data to the target blob.
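A minimal sketch of that workaround follows. It assumes dest_blob is an azure.storage.blob.BlobClient (or any object with the same start_copy_from_url and get_blob_properties methods) and that source_url is readable by the storage service, for example a blob URL with a SAS token appended when the source account is private. The helper name and the polling parameters are hypothetical.

```python
import time


def copy_blob_server_side(source_url, dest_blob, poll_seconds=1.0, max_polls=60):
    """Ask the storage service to copy `source_url` into `dest_blob`.

    The blob contents never pass through this process; the service performs
    the copy. Returns the last copy status seen ("success", "failed",
    "aborted", or "pending" if the copy had not finished after `max_polls`).
    """
    dest_blob.start_copy_from_url(source_url)
    for _ in range(max_polls):
        status = dest_blob.get_blob_properties().copy.status
        if status != "pending":
            return status
        time.sleep(poll_seconds)
    return "pending"
```

In the archival loop from the question, the source blob would only be deleted once the returned status is "success".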

Optimize the use of BigQuery resources to load 2 million JSON files from GCS using Google Dataflow

I have a vast database comprised of ~2.4 million JSON files that by themselves contain several records. I've created a simple apache-beam data pipeline (shown below) that follows these steps:
Read data from a GCS bucket using a glob pattern.
Extract records from JSON data.
Transform data: convert dictionaries to JSON strings, parse timestamps, others.
Write to BigQuery.
import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToBigQuery
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

# Pipeline
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
p = beam.Pipeline(options=pipeline_options)

# Read
files = p | 'get_data' >> ReadFromText(files_pattern)

# Transform
output = (files
          | 'extract_records' >> beam.ParDo(ExtractRecordsFn())
          | 'transform_data' >> beam.ParDo(TransformDataFn()))

# Write
output | 'write_data' >> WriteToBigQuery(table=known_args.table,
                                         create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                                         write_disposition=beam.io.BigQueryDisposition.WRITE_EMPTY,
                                         insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR',
                                         temp_file_format='NEWLINE_DELIMITED_JSON')

# Run
result = p.run()
result.wait_until_finish()
I've tested this pipeline with a minimal sample dataset and it works as expected. But I'm unsure whether it makes optimal use of BigQuery resources and quotas. The batch load quotas are very restrictive, and due to the massive number of files to parse and load, I want to know if I'm missing some settings that could guarantee the pipeline will respect the quotas and run optimally. I don't want to exceed the quotas, as I am running other loads to BigQuery in the same project.
I don't yet fully understand some parameters of the WriteToBigQuery() transform, specifically batch_size, max_file_size, and max_files_per_bundle, or whether they could help optimize the load jobs to BigQuery. Could you help me with this?
Update
I'm not only concerned about BigQuery quotas, but GCP quotas of other resources used by this pipeline are also a matter of concern.
I tried to run my simple pipeline over the target data (~2.4 million files), but I'm receiving the following warning message:
Project [my-project] has insufficient quota(s) to execute this workflow with 1 instances in region us-central1. Quota summary (required/available): 1/16 instances, 1/16 CPUs, 250/2096 disk GB, 0/500 SSD disk GB, 1/99 instance groups, 1/49 managed instance groups, 1/99 instance templates, 1/0 in-use IP addresses. Please see https://cloud.google.com/compute/docs/resource-quotas about requesting more quota.
I don't understand that message completely. The process activated 8 workers successfully and is using 8 from the 8 available in-use IP addresses. Is this a problem? How could I fix it?
If you're worried about load job quotas, you can try streaming the data into BigQuery, which comes with a less restrictive quota policy.
To achieve what you want, you can try the Google-provided templates or just refer to their code.
Cloud Storage Text to BigQuery (Stream) [code]
Cloud Storage Text to BigQuery (Batch)
And last but not the least, more detailed information can be found on the Google BigQuery I/O connector.
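As a hedged sketch of the streaming suggestion, the helper below builds keyword arguments that could be passed to the question's write step as WriteToBigQuery(**kwargs). It relies on the Beam Python SDK accepting its method, disposition and retry-strategy constants as plain strings, and it switches the write disposition to WRITE_APPEND, which is a change from the question's WRITE_EMPTY made on the assumption that appending is acceptable. The helper name is hypothetical, and 500 rows per insert is an illustrative value.

```python
def streaming_write_kwargs(table: str, batch_size: int = 500) -> dict:
    """Build keyword arguments for beam.io.WriteToBigQuery in streaming mode.

    With streaming inserts, `batch_size` is the number of rows sent per
    insert request, so it trades request count against request size.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return {
        "table": table,
        "method": "STREAMING_INSERTS",  # instead of batch load jobs
        "batch_size": batch_size,
        "create_disposition": "CREATE_NEVER",
        "write_disposition": "WRITE_APPEND",  # assumption: appending is acceptable
        "insert_retry_strategy": "RETRY_ON_TRANSIENT_ERROR",
    }


if __name__ == "__main__":
    # Hypothetical table reference, for illustration only.
    print(streaming_write_kwargs("my-project:my_dataset.my_table"))
```

Streaming inserts have their own quotas and pricing, so this moves the constraint rather than removing it.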

Rate limit with Apache Spark GCS connector

I'm using Spark on a Google Compute Engine cluster with the Google Cloud Storage connector (instead of HDFS, as recommended), and get a lot of "rate limit" errors, as follows:
java.io.IOException: Error inserting: bucket: *****, object: *****
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.wrapException(GoogleCloudStorageImpl.java:1600)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$3.run(GoogleCloudStorageImpl.java:475)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 429 Too Many Requests
{
"code" : 429,
"errors" : [ {
"domain" : "usageLimits",
"message" : "The total number of changes to the object ***** exceeds the rate limit. Please reduce the rate of create, update, and delete requests.",
"reason" : "rateLimitExceeded"
} ],
"message" : "The total number of changes to the object ***** exceeds the rate limit. Please reduce the rate of create, update, and delete requests."
}
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$3.run(GoogleCloudStorageImpl.java:472)
... 3 more
Does anyone know a solution for this?
Is there a way to control the read/write rate of Spark?
Is there a way to increase the rate limit for my Google project?
Is there a way to use a local hard disk for temp files that don't have to be shared with other slaves?
Thanks!
Unfortunately, using GCS as the DEFAULT_FS can produce high rates of directory-object creation, whether it is used just for intermediate directories or for final input/output directories. Especially when GCS is the final output directory, it's difficult to apply any Spark-side workaround to reduce the rate of redundant directory-creation requests.
The good news is that most of these directory requests are indeed redundant, just because the system is used to being able to essentially "mkdir -p", and cheaply return true if the directory already exists. In our case, it's possible to fix it on the GCS-connector side by catching these errors and then just checking whether the directory indeed got created by some other worker in a race condition.
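The idea described above (catch the error, then check whether another worker already created the directory) can be sketched as follows. This is an illustration of the pattern only, not the connector's actual code: RateLimitError, create and exists are hypothetical stand-ins for the 429 error and the storage calls.

```python
class RateLimitError(Exception):
    """Hypothetical stand-in for the 429 'rateLimitExceeded' error."""


def ensure_directory(path, create, exists):
    """Create `path`, tolerating a lost race with another worker.

    `create(path)` and `exists(path)` are caller-supplied callables.
    Returns True if the directory exists when the function returns.
    """
    try:
        create(path)
        return True
    except RateLimitError:
        # Another worker may have created the same directory object at the
        # same moment; if it is there now, our request was simply redundant.
        if exists(path):
            return True
        raise
```

This gives the "mkdir -p" behaviour the callers expect: a redundant creation request is treated as success, and only a genuine failure is propagated.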
This should be fixed now with https://github.com/GoogleCloudPlatform/bigdata-interop/commit/141b1efab9ef23b6b5f5910d8206fcbc228d2ed7
To test, just run:
git clone https://github.com/GoogleCloudPlatform/bigdata-interop.git
cd bigdata-interop
mvn -P hadoop1 package
# Or for Hadoop 2
mvn -P hadoop2 package
And you should find the files "gcs/target/gcs-connector-*-shaded.jar" available for use. To plug it into bdutil, simply gsutil cp gcs/target/gcs-connector-*shaded.jar gs://<your-bucket>/some-path/ and then edit bdutil/bdutil_env.sh for Hadoop 1 or bdutil/hadoop2_env.sh to change:
GCS_CONNECTOR_JAR='https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.4.1-hadoop2.jar'
To instead point at your gs://<your-bucket>/some-path/ path; bdutil automatically detects that you're using a gs:// prefixed URI and will do the right thing during deployment.
Please let us know if it fixes the issue for you!
Have you tried setting the spark.local.dir config parameter and attaching a disk (preferably an SSD) for that tmp space to your Google Compute Engine instances?
https://spark.apache.org/docs/1.2.0/configuration.html
You cannot change the rate limit for your project; what you would have to use instead is a back-off algorithm once the limit is reached. Since you mentioned that most of the reads/writes are for tmp files, try configuring Spark to use local disks for those.
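A minimal sketch of such a back-off algorithm is shown below: exponential delays with random jitter, retrying only when the caller's predicate says the error is a rate-limit error. The function name, parameters and default values are illustrative assumptions, not part of Spark or the GCS connector.

```python
import random
import time


def call_with_backoff(operation, is_rate_limited, max_attempts=6,
                      base_delay=1.0, max_delay=32.0, sleep=time.sleep):
    """Run `operation()`, retrying with exponential back-off on rate limits.

    `is_rate_limited(exc)` decides whether an exception is worth retrying.
    The delay ceiling doubles on every attempt (1s, 2s, 4s, ...) up to
    `max_delay`, and the actual wait is a random value below that ceiling
    so that many workers do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_rate_limited(exc):
                raise
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

The sleep function is injectable so the retry logic can be exercised without actually waiting.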

Start-AzureStorageBlobCopy vs AzCopy: which one takes less time

I need to move VHDs from one subscription to another. I would like to know which is the better option for this: Start-AzureStorageBlobCopy or AzCopy?
Which one takes less time?
Both of them would take the same time as all they do is initiate Async Server-Side Blob Copy. They just tell the service to start copying blob from source to destination. The actual copy operation is performed by Azure Blob Storage Service. The time it would take to copy the blob would depend on a number of factors including but not limited to:
Source & destination location.
Size of the source blob.
Load on storage service.
Running AzCopy without specifying the option /SyncCopy and running PowerShell command Start-AzureStorageBlobCopy should take the same duration, because they both use server side asynchronous copy.
If you'd like to copy blobs across regions, consider specifying the /SyncCopy option when running AzCopy in order to achieve a consistent speed. The asynchronous copy runs in the background on the servers, so you might see inconsistent copying speed across your copy operations.
If the /SyncCopy option is specified, AzCopy will download the content to memory first and then upload it back to Azure Storage. To get better performance from /SyncCopy, you should run AzCopy on a VM in the same region as the source storage account. Besides that, the VM size (which determines bandwidth and the number of CPU cores) will probably affect the copying performance as well.
For further information, please refer to Getting Started with the AzCopy Command-Line Utility
They don't take the same time.
I've tried to copy from one account to another and have a huge difference.
Start-AzureStorageBlobCopy -SrcBlob $_.Name -SrcContainer $Container -Context $ContextSrc -DestContainer $Container -DestBlob $_.Name -DestContext $ContextDst -Verbose
This takes about 2.5 hours.
& .\AzCopy.exe /Source:https://$StorageAccountNameSrc.blob.core.windows.net/$Container /Dest:https://$StorageAccountNameDst.blob.core.windows.net/$Container /SourceKey:$StorageAccountKeySrc /DestKey:$StorageAccountKeyDst /S
This takes several minutes.
I have about 600 MB and about 7000 files here.
Elapsed time: 00.00:03:41
Finished 44 of total 44 file(s).
[2017/06/22 17:05:35] Transfer summary:
-----------------
Total files transferred: 44
Transfer successfully: 44
Transfer skipped: 0
Transfer failed: 0
Elapsed time: 00.00:00:08
Finished 345 of total 345 file(s).
[2017/06/22 17:06:07] Transfer summary:
-----------------
Total files transferred: 345
Transfer successfully: 345
Transfer skipped: 0
Transfer failed: 0
Elapsed time: 00.00:00:31
Does anyone know why it's so different?
In most scenarios, AzCopy is likely to be quicker than Start-AzureStorageBlobCopy because of the way the copy is initiated, which results in fewer calls to the Azure API:
[AzCopy] 1 call for the whole container (regardless of blob count)
vs
[Start-AzureStorageBlobCopy] N calls, one per blob in the container.
Initially I thought they would take the same time, as both appear to trigger the same asynchronous copies on the Azure side; however, the difference is directly visible on the client side, as @Evgeniy found in his answer.
In a scenario with a single blob in the container, both commands would theoretically complete in the same time.
*EDIT (possible workaround): I was able to decrease my time tremendously by:
Removing console output AND
Using the -ConcurrentTaskCount switch, set to 100 in my case. Cut it down to under 5 minutes now.
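To illustrate why raising the concurrency helps when there is one call per blob, here is a small sketch that issues the per-blob "start copy" calls from a pool of worker threads instead of one after another. It is an analogy for the -ConcurrentTaskCount switch, not a description of how the cmdlet is implemented; start_copies and the start_copy callable are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def start_copies(blob_names, start_copy, concurrent_task_count=100):
    """Call `start_copy(name)` for every blob using a pool of worker threads.

    `start_copy` is a caller-supplied function that issues one "start copy"
    request and returns its result. Results come back in the same order as
    `blob_names`.
    """
    with ThreadPoolExecutor(max_workers=concurrent_task_count) as pool:
        return list(pool.map(start_copy, blob_names))
```

With N blobs and a fixed round-trip time per request, the sequential version waits for roughly N round trips, while the pooled version waits for roughly N divided by the pool size.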
AzCopy offers an SLA, which the async copy service lacks. AzCopy is designed for optimal performance. Use the /SyncCopy parameter to get a consistent copy speed.

Resources