Omit uploaded files with AzCopy - Azure

I have uploaded some files/folders to my Azure container with Cloudberry Explorer, but now I'm switching from Cloudberry to AzCopy.
What I need is to skip those already-uploaded files. I don't know if that can be done with an AzCopy parameter. The files to be uploaded are stored on a server, so doing it manually is impossible because there are thousands upon thousands of files/folders.
Thanks in advance.

As documented in the azcopy reference:
--overwrite string Overwrite the conflicting files and blobs at the destination if this flag is set to true. Possible values include 'true', 'false', 'ifSourceNewer', and 'prompt'. (default "true")
So something like this should work:
azcopy.exe copy "source location" "destination location" --overwrite=false

Use the /XO flag in the command. It will not copy/replace older files. Sample command:
AzCopy /Source:C:\myfolder /Dest:https://myaccount.blob.core.windows.net/mycontainer /DestKey:key /XO

If the files uploaded by the other tool have a different naming convention than the new ones, you could use the /Pattern option to upload only the new files.
For example, if the old files are named like "abcxxxx" and the new files are named like "xyzxxx", specify /Pattern:xyz* to copy only the new files.
Or use the /XO option (which means exclude older files) to copy only the new files. Note that AzCopy compares the local files' change time with the 'Last Modified Time' of the destination blobs when /XO or /XN is specified, so please make sure the already-uploaded files' 'Last Modified Time' is the same as or newer than their local copies' change time; otherwise the old files will be uploaded again when /XO is specified. You can use the /MT option to set the 'Last Modified Time' to the same value as the local copies' change time during upload.
For more details, please visit http://aka.ms/azcopy
Thanks

Related

AzCopy while source container is written

I tried to do some research but unable to find the answer.
Let's say you have container A.
At 1pm, container A had the following files:
test1.txt (inside the file is abcdefg)
test2.txt (inside the file is klmnopq)
AzCopy was run on container A, copying it from storage C to storage D. It started at 1:01pm and finished at 1:05pm.
At 1:02pm, Mr. X did the following:
Added test3.txt to container A
Modified test1.txt to contain "defgh"
The question is, what does end up in storage D for container A copy?
The original files, i.e. test1.txt (containing abcdefg) and test2.txt (containing klmnopq)?
Or something else?
Thanks
The outcome depends on the exact timing of the writes and copies. AzCopy guarantees file integrity by checking the Last-Modified time of the source during the copy. In other words, whatever is at the source when AzCopy's scan happens is what gets locked in and copied.

Entering a proper path to files on DBFS

I uploaded files to DBFS:
/FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv
I tried to access them with pandas, and I always get a message that such files don't exist.
I tried to use the following paths:
/dbfs/FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv
dbfs/FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv
dbfs:/FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv
./FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv
What is funny is that when I check them with dbutils.fs.ls, I see all the files.
I found this solution, and I tried it already: Databricks dbfs file read issue
I moved them to a new folder:
dbfs:/new_folder/
I tried to access them from this folder, but it still didn't work for me. The only difference is that I copied the files to a different place.
I checked as well the documentation: https://docs.databricks.com/data/databricks-file-system.html
I use Databricks Community Edition.
I don't understand what I'm doing wrong and why it's happening like that.
I don't have any other ideas.
The /dbfs/ mount point isn't available on the Community Edition (that's a known limitation), so you need to do what is recommended in the linked answer:
dbutils.fs.cp(
    'dbfs:/FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv',
    'file:/tmp/file_name.csv')
and then use /tmp/file_name.csv as the input parameter to pandas functions. If you need to write something to DBFS, do it the other way around: write to a local file under /tmp/..., and copy that file to DBFS.
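For example, here is a minimal sketch of the full round trip in a notebook cell, assuming the placeholder paths from the question (the result.csv name is just a hypothetical example):

# Copy from DBFS to the driver's local disk, read with pandas,
# then write a result locally and copy it back up to DBFS.
import pandas as pd

dbutils.fs.cp('dbfs:/FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv',
              'file:/tmp/file_name.csv')
df = pd.read_csv('/tmp/file_name.csv')

df.to_csv('/tmp/result.csv', index=False)                            # write locally first
dbutils.fs.cp('file:/tmp/result.csv', 'dbfs:/FileStore/result.csv')  # then copy to DBFS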

Azcopy - Copy only files without folders

As the title suggests, I am trying to copy all files with a specific extension, within a folder structure, to blob storage without recreating the local folder structure.
This works fine when I run the following:
azcopy cp 'H:\folder1\folder2\*.txt' 'https://storage.blob.core.windows.net/folderA/folderB/?saskey'
This copies all *.txt files to /folderB
I have tried many variations of the following:
azcopy.exe cp 'H:\folder1\*\*' 'https://storage.blob.core.windows.net/folderA/folderB/?saskey' --recursive --include-pattern '*.txt'
Regardless of what I try, I end up with the following:
/folderA/folderB
/folder1/fileA.txt
/folder2/fileB.txt
I was under the impression that this is what the "--recursive" switch is for, but either what I am doing is not supported or my syntax is wrong.
I have read through this:
https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-files#use-wildcard-characters
I could probably script it with something similar to this:
AzCopy - Wildcards In Middle Of Pattern?
But I was hoping this was built-in functionality.
What you are looking for is not supported. Using --recursive retains the subdirectory structure of the source at the destination, and I am not aware of any flag to prevent that.
Actually, that behavior helps avoid conflicts. Say, for example, you have the files /folder1/fileA.txt and /folder2/fileA.txt in the source. If they were copied flat into the destination (without the subpath), they would collide, since both files are named fileA.txt.
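If you do want a flat copy despite that risk, one way to script the workaround hinted at above (a rough sketch, not built-in AzCopy functionality; the local root, container URL, and SAS token are placeholders) is to enumerate the matching files yourself and invoke azcopy once per file, so only the file name reaches the destination:

import glob
import os
import subprocess

SOURCE_ROOT = r'H:\folder1'                                      # placeholder local root
DEST = 'https://storage.blob.core.windows.net/folderA/folderB'   # placeholder container path
SAS = '?saskey'                                                  # placeholder SAS token

# Find every *.txt under the source tree, then copy each file individually
# so the blob name is just the file name (no local subfolders are recreated).
# Note: files with the same name in different folders will overwrite each other.
for path in glob.glob(os.path.join(SOURCE_ROOT, '**', '*.txt'), recursive=True):
    blob_name = os.path.basename(path)
    subprocess.run(['azcopy', 'copy', path, f'{DEST}/{blob_name}{SAS}'], check=True)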

How does `aws s3 sync` determine if a file has been updated?

When I run the command in the terminal back to back, it doesn't sync the second time. Which is great! It shouldn't. But if I run my build process and then run aws s3 sync programmatically, back to back, it syncs all the files both times, as if my build process changed something the second time.
Can't figure out what might be happening. Any ideas?
My build process is basically pug source/ --out static-site/ and stylus -c styles/ --out static-site/styles/
According to this: http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
S3 sync compares the size of the file and the last-modified timestamp to see whether a file needs to be synced.
In your case, I'd suspect the build system results in a newer timestamp even though the file size hasn't changed.
AWS CLI sync:
A local file will require uploading if the size of the local file is
different than the size of the s3 object, the last modified time of
the local file is newer than the last modified time of the s3 object,
or the local file does not exist under the specified bucket and
prefix.
--size-only (boolean) Makes the size of each key the only criteria used to decide whether to sync from source to destination.
You want the --size-only option, which looks only at the file size, not the last-modified date. This is perfect for an asset build system that changes the last-modified date frequently but not the actual contents of the files (I ran into this with webpack builds, where things like fonts kept syncing even though the file contents were identical). If your build method doesn't incorporate a hash of the contents into the filename, you can run into problems (if the build emits a file of the same size but with different contents), so watch out for that.
I did manually test adding a new file that wasn't on the remote bucket and it is indeed added to the remote bucket with --size-only.
This thread is a bit dated, but I'll contribute nonetheless for folks arriving here via Google.
I agree with the accepted answer. To add context: S3 sync behaves differently from standard Linux tools such as rsync in a number of ways. In Linux, an MD5 hash can be computed to determine whether a file has changed. S3 sync does not do this, so it can only decide based on size and/or timestamp. What's worse, AWS does not preserve timestamps when transferring in either direction, so the timestamp is ignored when syncing to local and only used when syncing to S3.
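Putting the quoted criteria together, here is a rough sketch of the decision rule in Python (not the actual AWS CLI implementation; the function name and parameters are made up for illustration):

import os

def needs_upload(local_path, s3_size, s3_last_modified, size_only=False):
    # Rough sketch of the sync rule quoted above, not the real CLI code.
    # s3_size is None if no matching object exists; s3_last_modified is epoch seconds.
    if s3_size is None:                      # no object under the bucket/prefix yet
        return True
    stat = os.stat(local_path)
    if stat.st_size != s3_size:              # sizes differ
        return True
    if size_only:                            # --size-only ignores timestamps
        return False
    return stat.st_mtime > s3_last_modified  # local file is newer than the S3 object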

WinSCP - Do not synchronize subdirectories

I am writing a WinSCP script in VBA to synchronize certain files from remote to local.
The code I am using is
""synchronize -filemask=""""*.xlsx"""" local C:\Users\xx\Desktop /JrnlDetailSFTPDirect""
There are three xlsx files: 14.xlsx, 12.xlsx, 13.xlsx. However, it seems like it runs through all the files even though it does not synchronize them. Besides, one folder under JrnlDetailSFTPDirect is also downloaded from the remote, which is not expected.
Is it possible to avoid looping through all the files and just select those three files and download them?
Thanks
There are separate masks for files and folders.
To exclude all folders, use the */ exclude mask:
synchronize -filemask="*.xlsx|*/" local C:\Users\xx\Desktop /JrnlDetailSFTPDirect
See How do I transfer (or synchronize) directory non-recursively?
I cannot say anything regarding the other problem, as you didn't show us the names of the files. Ideally, append a session log file to your question. Use the /log switch like:
winscp.com /log=c:\writablepath\winscp.log /command ...
