Parallel Copy using azcopy - azure

I was using AzCopy to copy models from Azure Blob Storage to an Azure VM regularly. But when I copy datasets to my VM, I am using an Azure File Share and moving the data onto the data disk with the cp command. I want to use AzCopy to copy data in parallel. I believe I once heard that AzCopy copies data in parallel, but I am not able to find that statement. Maybe I heard it wrong.
I also saw another question on Stack Overflow that talked about parallelism in AzCopy. The answer provided a link to the AzCopy documentation and mentioned --parallel-level, but when I clicked through, no such option is documented as was stated.
If anyone can point me to the AzCopy parallelism documentation, if it exists, that would be really helpful.

AzCopy copies data in parallel by default, but you can change how many files are copied in parallel.
Throughput can decrease when transferring small files. You can increase throughput by setting the AZCOPY_CONCURRENCY_VALUE environment variable. This variable specifies the number of concurrent requests that can occur. If your computer has fewer than 5 CPUs, then the value of this variable is set to 32. Otherwise, the default value is equal to 16 multiplied by the number of CPUs. The maximum default value of this variable is 3000, but you can manually set this value higher or lower.
https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-configure#optimize-throughput
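For example, here is a minimal sketch of what that looks like in practice; the storage account, container, SAS token, and destination path are placeholders, not values from the question:

    # Raise AzCopy's request concurrency before running a recursive copy.
    # <storage-account>, <container>, <dataset-prefix> and <SAS-token> are placeholders.
    export AZCOPY_CONCURRENCY_VALUE=256

    azcopy copy \
      "https://<storage-account>.blob.core.windows.net/<container>/<dataset-prefix>?<SAS-token>" \
      "/datadrive/datasets" \
      --recursive

Within a single job AzCopy already schedules transfers concurrently; the variable only tunes how many requests are in flight at once.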

Related

Cloning only the filled portion of a read only raw data HDD source (without source partition resizing)

I often copy raw data from HDDs with FAT32 partitions at the file level. I would like to switch to bitwise cloning of this raw data, which consists of thousands of 10 MiB files written sequentially across a single FAT32 partition.
The idea is to have, on the large archival HDD, a small partition containing a shadow directory structure with symbolic links into separate raw-data image partitions. Each additional partition holds the aforementioned raw data, but sized to only the space consumed on the source drive. The number of raw data files on each source drive can range from tens up through tens of thousands.
i.e.: [[sdx1][--sdx2--][-------------sdx3------------][--------sdx4--------][-sdx5-][...]]
Where 'sdx1' = directory of symlinks to sdx2, sdx3, sdx4, ... such that the user can browse to multiple partitions but it appears to them as if they're just in subfolders.
Optimally I'd like to find both a Linux and a Windows solution. If the process can be scripted or a software solution that exists can step through a standard workflow, that'd be best. The process is almost always 1) Insert 4 HDD's with raw data 2) Copy whatever's in them 3) Repeat. Always the same drive slots and process.
AFAIK, in order to clone a source partition without cloning all the free space, one conventionally must resize the source HDD partition first. Since I can't alter the source HDD in any way, how can I get around that?
One way would be to clone the entire source partition (incl. free space) and resize the target backup partition afterward, but that's not going to work out because of all the additional time it would take.
The goal is to retain bitwise accuracy and to save time (dd runs at about 200 MiB/s whereas rsync runs at about 130 MiB/s, but having to copy a ton of blank space every time makes the whole perk moot). I'd also like to run with some kind of --rescue flag so that when bad clusters are hit on the source drive it just behaves like Clonezilla and writes ???????? in place of the bad clusters. I know I said "retain bitwise accuracy", but a bad cluster's a bad cluster.
If you think one of the COTS or GOTS tools like EaseUS, AOMEI, Paragon and whatnot can clone partitions as I've described, please point me in the right direction. If you think there's some way I can dd it up with a script that sizes up the source, makes a target partition of the right size, then modifies the target FAT to its correct size, chime in; I'd love many options, and so would future people with a similar use case who stumble on this thread :)
Not sure if this will fit your case, but it is very simple.
Syncthing (https://syncthing.net/) will sync the contents of two or more folders, and works on Linux and Windows.
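Another angle, not mentioned above and purely a Linux-side sketch: partclone can image only the allocated blocks of a FAT32 partition and has a rescue mode that continues past read errors. The device names and paths below are assumptions, and note the output is a partclone image rather than a raw dd-style image, although the used blocks themselves are preserved bit-for-bit:

    # Image only the used blocks of a read-only FAT32 source partition.
    # /dev/sdb1 and the output path are placeholders for this sketch.
    SRC=/dev/sdb1
    OUT=/mnt/archive/sdb1.partclone.img

    sudo partclone.fat32 -c --rescue -s "$SRC" -o "$OUT"

    # Restore later (writes the used blocks back bit-for-bit):
    #   sudo partclone.fat32 -r -s "$OUT" -o /dev/sdX1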

Batch Copy/Delete some blobs in container

I have thousands of containers and each container has up to 10k blobs inside. I have a list of (container, blob) tuples that I need to
copy to another storage account
delete later from the original storage
The blobs in the containers are not related to each other - random creation dates, random names (GUIDs), nothing in common.
Q: is there any efficient way to do these operations?
I already looked at az-cli and azcopy and haven't found any good way.
I tried, for example, calling azcopy repeatedly for each tuple, but this would take ages. One call to copy a blob took 2 seconds on average. So it's nice that it starts the operation in the background, but if this "starting an operation" takes about 2 seconds, it's pretty useless for my case.
I'm assuming, based on the comments, that within each container it's an arbitrary number (and naming) of blobs to copy and delete, and that the delete is only for the blobs copied (not the full container). If so, and you want to use something besides REST, one suggestion would be a PowerShell script that reads the list of blobs to copy from a file (service-side copy) and then separately does a delete (it's more efficient to copy and, if successful, then delete), e.g. https://learn.microsoft.com/en-us/powershell/module/az.storage/get-azstorageblobcopystate?view=azps-4.7.0#example-4--start-copy-and-pipeline-to-get-the-copy-status
Cheers, Klaas [Microsoft]
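If PowerShell isn't a hard requirement, here is an alternative, hedged sketch that batches the tuples per container and feeds each batch to azcopy via its --list-of-files flag, so one job copies (and a second removes) many blobs at once. The blobs.tsv input file, account names and SAS tokens are assumptions, and the destination containers are assumed to already exist:

    # blobs.tsv: one "container<TAB>blob" pair per line -> build one .list file per container
    awk -F'\t' '{ f = $1 ".list"; print $2 >> f; close(f) }' blobs.tsv

    SRC="https://<source-account>.blob.core.windows.net"
    DST="https://<dest-account>.blob.core.windows.net"
    SRC_SAS="?<source-sas>"
    DST_SAS="?<dest-sas>"

    for list in *.list; do
      container="${list%.list}"
      # Copy every listed blob in this container in one job, then, only if the
      # copy job succeeds, remove the originals in a second job.
      azcopy copy "${SRC}/${container}${SRC_SAS}" "${DST}/${container}${DST_SAS}" \
        --list-of-files "$list" \
      && azcopy remove "${SRC}/${container}${SRC_SAS}" --list-of-files "$list"
    done

With thousands of containers this still means one azcopy invocation per container, but the roughly 2-second startup cost is paid once per container rather than once per blob.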

Preparing archive data for Stream Analytics Import

Before I had time to get an ingestion strategy & process set up, I started collecting data that will eventually go through a Stream Analytics job. Now I'm sitting on an Azure blob storage container with over 500,000 blobs in it (no folder organization), another with 300,000 and a few others with 10,000 - 90,000.
The production collection process now writes these blobs to different containers in the YYYY-MM-DD/HH format, but that's only great going forward. This archived data I have is critical to get into my system and I'd like to just modify the inputs a bit for the existing production ASA job so I can leverage the same logic in the query, functions and other dependencies.
I know ASA doesn't like batches of more than a few hundred / thousand, so I'm trying to figure a way to stage my data in order to work well under ASA. This would be a one time run...
One idea was to write a script that looked at every blob, looked at the timestamp within the blob and re-created the YYYY-MM-DD/HH folder setup, but in my experience the ASA job will fail when the blob's lastModified time doesn't match the folders it's in...
Any suggestions how to tackle this?
EDIT: I failed to mention that (1) there are no folders in these containers... all blobs live at the root of the container, and (2) the LastModifiedTime on my blobs is no longer useful or meaningful. The reason for the latter is that these blobs were collected from multiple other containers and merged together using the Azure CLI copy-batch command.
Can you please try the approach below?
Do this processing in two different jobs: one for the folders with date partitioning (say partitionedJob), another for the old blobs without any date partitioning (say RefillJob).
Since RefillJob has a fixed number of blobs, put a predicate on System.Timestamp to make sure that it only processes old events. Start this job with at least 6 SUs and run it until all the events have been processed. You can confirm by looking at LastOutputProcessedTime or by looking at the input event count or by inspecting your output source. After this check, stop the job. This job is no longer needed.
Start the partitionedJob with a timestamp predicate greater than RefillJob's cutoff. This assumes the folders for the timestamps exist.

moving Cassandra snapshots to a different disk/server/datacenter

I have a Cassandra 1.2.6 cluster running in datacenter A; each node has a solid state drive with somewhat limited space (approx. 50% of disk space is free).
Now I need to implement some way of taking automatic backups of each node. Ideally I want a way of moving all of the cluster's data files to a different disk (standard, cheaper disks), or even to a different server in the same datacenter A, and possibly moving all the data once in a while to a datacenter B in a different location.
From what I've read I can use snapshots on each node to get the files to copy using whatever tool I want and in this case I have the option to move the data to a different disk/server/datacenter.
My question is: since each of my nodes is about 50% full, will taking a snapshot consume all that space, or will the hard links consume far less space than I anticipate? If so, is there a better way of doing this, maybe with an existing tool, or does everything have to be custom made when it comes to this type of backup in Cassandra?
Thanks in advance!
A hard link just creates a new directory entry for the same file (http://en.wikipedia.org/wiki/Hard_link). So a snapshot takes up effectively zero space, but you'll want to clean it up after you're done copying it off to whatever your archive is, because when the "original" sstable is deleted (typically post-compaction), space won't be reclaimed as long as the snapshot reference is still there.
My impression is that tablesnap is the most popular tool for automating backups to s3. It also supports Cassandra incremental backups. If you want more control over where you're backing up to, DataStax OpsCenter supports running a custom script when it takes snapshots.
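For illustration, here is a rough sketch of the snapshot, copy, clean-up cycle on a single node; the keyspace name, data directory, snapshot tag and rsync destination are all assumptions:

    TAG="backup-$(date +%Y%m%d)"
    KEYSPACE="my_keyspace"
    DATA_DIR="/var/lib/cassandra/data"
    DEST="backuphost:/backups/$(hostname)"   # cheaper disk, another server, or a staging box for DC B

    # Snapshots are hard links, so this consumes almost no extra space
    nodetool snapshot -t "$TAG" "$KEYSPACE"

    # Copy only the snapshot directories off the SSD (preserving their paths)
    rsync -a --relative "$DATA_DIR"/"$KEYSPACE"/*/snapshots/"$TAG" "$DEST"

    # Drop the hard links once the copy is safely elsewhere
    nodetool clearsnapshot -t "$TAG" "$KEYSPACE"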

Rackspace cloud files: how to size containers to optimize performance?

Rackspace cloud files uses a flat storage system using 'containers' to store files. According to Rackspace there is no limit to the number of files per container.
My question is whether there is a best/most efficient number of files per container to optimize write/fetch performance.
If I have tens of thousands of files to store, should they all go in a single giant container or partitioned into many smaller containers? And if so, what is the optimal container size?
FYI:
[Snippets taken from rackspace support]
Long story short, the containers are databases, and the more rows in a table, the more time it takes to write them on standard hardware. When a write hasn't been committed to disk, it sits in a queue and is subject to data loss. It's something we noticed with large containers, and the more objects, the more likely it was, so we instituted the limits to protect the data.
because of the rate limits, your data is safe, it just slows down the writes a bit
The limits start as low as 50,000 objects, and at that level you are limited to 100 writes per second.
by 1,000,000 objects in a container, it's 25 per second
and at 5 million and above, you're down to 4 writes per second
We apologize for the limitations, and will be updating our documentation to more clearly express this.
-This has recently hurt us quite badly. Thought I'd share until they get their API docs up to date, so others can plan around this issue.
We recommend no more than 1 million objects per container. The system will return a maximum of 10,000 object names per list request by default.
Update 9/20/2013 from Cloud Files development: The 1 million object per container recommendation is no longer accurate since Cloud Files switched to all SSD container servers. Also, the list is limited to 10,000 containers at a time.
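If you do end up splitting your data, one illustrative (and purely hypothetical) approach is to shard objects across a fixed set of containers by hashing the object name, so no single container accumulates millions of objects or absorbs all the writes:

    # Pick one of N containers deterministically from the object name.
    # N, the container prefix, and the upload command are assumptions.
    N=16
    name="path/to/object.dat"

    idx=$(( $(printf '%s' "$name" | cksum | cut -d' ' -f1) % N ))
    container="files-${idx}"

    echo "would upload $name to container $container"
    # e.g. swift upload "$container" "$name"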
