I'm working with Azure Data Factory to copy .txt files from an FTP site. I'm using a binary copy (binary source and sink datasets), but ADF is showing incredibly slow throughput (about 90 KB/s), so it is taking hours to transfer a 4 GB file, which isn't particularly large.
The FTP site is in the US while ADF is located in Europe, but from a VM in the same Europe data center I can download the file from the FTP server in a few minutes. It seems like something is not quite right. Any idea why ADF cannot retrieve a 4 GB .txt file in a reasonable time? I'm copying to Blob storage and using the Azure integration runtime for compute.
The pipeline runs for 6-7 hours, which seems absurd for a reasonably sized file. I have tried different formats (reading directly as delimited text), but it continues to be absurdly slow. I'm assuming the FTP server has reasonable download speeds, considering I can retrieve the file from a desktop in 4-5 minutes. In the ADF monitoring view I can see the activity is continually "reading from source", but the amount of data read is not changing, so I'm wondering if it is dropping the connection.
Any thoughts would help!
I've solved this by, instead of using a binary copy, reading the file in its native format (delimited text) and then writing it out as chunked Parquet files in Blob storage. I've set a maximum number of rows per output file, which forces physical files to be committed as the copy progresses.
For whatever reason, ADF was struggling to read the entire 4 GB file and then write it out in one go. It seems counterintuitive to move away from a binary copy, but it appears to be the only option in this case.
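For anyone who wants to see the idea outside of ADF: in the pipeline this is just the "max rows per file" setting on the Parquet sink, but the sketch below (a local analogue in Python with pandas, using made-up file names, delimiter and row cap) shows the same technique of streaming a large delimited file through in chunks and committing each chunk as its own Parquet file.

    import pandas as pd

    SOURCE_FILE = "big_input.txt"      # hypothetical 4 GB delimited file
    MAX_ROWS_PER_FILE = 1_000_000      # row cap that forces many smaller Parquet files

    # Read the delimited file in chunks instead of loading all 4 GB at once,
    # and commit each chunk as its own Parquet file (to_parquet needs pyarrow or fastparquet).
    for i, chunk in enumerate(pd.read_csv(SOURCE_FILE, sep="\t", chunksize=MAX_ROWS_PER_FILE)):
        chunk.to_parquet(f"part-{i:05d}.parquet", index=False)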
Related
We have an XML file arriving in an Azure File Storage path every day, and we load it into an Azure SQL database using the ADF Copy activity. The source is an XML dataset referring to the XML file and the sink is a table in the database. The copy activity completes in less than 3 minutes when the file is around 500 MB, but when we tried a 680 MB file it ran for nearly 5 hours. We cannot find the reason behind this huge increase in time. We tried changing the DIU and parallelism settings, but it didn't help.
Any idea why there is such a huge increase in loading time?
Does the ADF XML copy activity have a file size limit?
Is there any way to reduce the processing time, apart from rewriting the logic in an Azure Function?
Any help or suggestion is appreciated! Thanks
Why does Stream Analytics create separate files when writing to Azure Data Lake or Azure Blob Storage? Sometimes the stream writes to one file for days, while at other times a couple of new files are created every day. It seems rather random.
I output the data to CSV, the query stays the same, and every now and then there is a new file generated.
I would prefer to have one large CSV file, because I want to be able to run long-term statistics on the data using Power BI, but that seems impossible when everything is split across separate files with seemingly random names.
https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-define-outputs - this page has details about when a new file is created. In your case, it is most likely due to an internal restart.
I have found quite a few answers about copying blobs between Azure storage accounts. I know about the Start-AzureStorageBlobCopy cmdlet. However, I have more than 20 million files to copy between two storage accounts in the same data center, and it seems to take forever (it has been copying for more than a week) because it starts each file copy as a separate operation.
Furthermore, I found that in the most current version of the Azure tools (7.4), the cmdlet downloads the full file list into memory and only then starts the copy process. So it not only takes forever but also uses a large amount of memory. The same is true if I use AzCopy.
So my question is: what is a good approach (that actually works!) for copying a large number of files, each of which is not that big, between two storage accounts in the same data center? Or maybe you know of parameters to set when using the cmdlets (the documentation is awful and not up to date)?
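Not a complete answer, but to illustrate the pattern being asked about: a minimal sketch with the current azure-storage-blob Python SDK, in which the blob listing is consumed lazily (paged, rather than loaded fully into memory) and each copy is started as an asynchronous server-side operation. The connection strings, container names and SAS handling below are placeholder assumptions.

    from azure.storage.blob import BlobServiceClient

    # Placeholder connection details
    src = BlobServiceClient.from_connection_string("<source-connection-string>")
    dst = BlobServiceClient.from_connection_string("<destination-connection-string>")
    src_container = src.get_container_client("source-container")
    dst_container = dst.get_container_client("dest-container")

    # list_blobs() returns a lazy, paged iterator, so the 20-million-entry listing
    # never has to sit in memory all at once.
    for blob in src_container.list_blobs():
        source_url = f"{src_container.url}/{blob.name}"  # append a SAS token if the source is private
        # start_copy_from_url() only *starts* a server-side copy and returns immediately;
        # the data moves inside the data center without passing through this machine.
        dst_container.get_blob_client(blob.name).start_copy_from_url(source_url)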
I know there are two methods available for uploading files to AWS S3 (i.e. PutObject and TransferUtility.Upload). Can someone please explain which one to use?
FYI, I have files ranging from 1 KB to 250 MB.
Thanks in advance.
Amazon deprecated the S3 TransferManager and migrated to the new TransferUtility. The TransferUtility is a simple interface for handling the most common uses of S3. It has a single constructor, which requires an instance of AmazonS3Client, and it is easy to work with, letting developers perform all the common operations with less code.
The following are the key advantages of the TransferUtility over the TransferManager:
When uploading large files, the TransferUtility uses multiple threads to upload multiple parts of a single object at once. When dealing with large content sizes and high bandwidth, this can increase throughput significantly. The TransferUtility detects when a file is large and switches into multipart upload mode. Multipart upload gives better performance, because the parts can be uploaded simultaneously, and if one part fails, only that individual part has to be retried.
Uploading large files to S3 often takes a long time, and in those situations you want progress information, such as the total number of bytes transferred and the amount of data remaining. To track the current progress of a transfer with the TransferManager, developers pass an S3ProgressListener callback to the upload or download, and it is invoked periodically as the transfer proceeds.
Pausing transfers with the TransferManager is not possible for stream-based uploads or downloads, but the TransferUtility provides pause and resume support. It also has simple single-file-based methods for uploads and downloads:
    transferUtility.upload(MY_BUCKET, OBJECT_KEY, FILE_TO_UPLOAD);          // asynchronous upload of a local File
    transferUtility.download(MY_BUCKET, OBJECT_KEY, LOCAL_DOWNLOAD_FILE);   // asynchronous download into a local File
The TransferManager only requires the INTERNET permission. The TransferUtility, however, automatically detects the network state and pauses/resumes transfers based on it. Adding pause functionality on top of the TransferUtility is easy, since all transfers can be paused and resumed. If a transfer is paused because of a loss of network connectivity, it will automatically be resumed when connectivity returns, and there is no action you need to take. Transfers that are automatically paused and waiting for network connectivity have the state WAITING_FOR_NETWORK. Additionally, the TransferUtility stores all of the metadata about transfers in a local SQLite database, so developers do not need to persist anything themselves.
Note: everything else is covered, but the TransferUtility does not support a copy() API. To copy objects, use the copyObject() method of the AmazonS3Client class.
Based on the Amazon docs, I would stick with TransferUtility.Upload:
Provides a high level utility for managing transfers to and from Amazon S3.
TransferUtility provides a simple API for uploading content to and downloading content from Amazon S3. It makes extensive use of Amazon S3 multipart uploads to achieve enhanced throughput, performance, and reliability.
When uploading large files by specifying file paths instead of a stream, TransferUtility uses multiple threads to upload multiple parts of a single upload at once. When dealing with large content sizes and high bandwidth, this can increase throughput significantly.
But please be aware of possible concurrency issues and the recommendation to use BeginUpload (the asynchronous version), as discussed in this related post.
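The same contrast exists in other SDKs as well. As a rough illustration only (Python with boto3 rather than the .NET/Android SDKs discussed above, with placeholder bucket, keys and thresholds): a plain PutObject is a single request, while the managed transfer call switches to a concurrent multipart upload once the file exceeds a configurable threshold.

    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")

    # Low-level single-request upload: fine for the small (1 KB-ish) files.
    with open("small_file.txt", "rb") as f:
        s3.put_object(Bucket="my-bucket", Key="small_file.txt", Body=f)

    # High-level managed transfer: switches to a multipart upload with several
    # concurrent threads once the file exceeds the threshold, so a 250 MB file
    # is uploaded in parallel parts and a failed part can be retried on its own.
    config = TransferConfig(multipart_threshold=8 * 1024 * 1024, max_concurrency=10)
    s3.upload_file("large_file.bin", "my-bucket", "large_file.bin", Config=config)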
I have an existing Azure CloudDrive that I want to make bigger. The simplest way I can think of is to create a new drive and copy everything over. I cannot see any way to just increase the size of the VHD. Is there a way?
Since an Azure drive is essentially a page blob, you can resize it. You'll find this blog post by the Windows Azure Storage team useful: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/04/11/using-windows-azure-page-blobs-and-how-to-efficiently-upload-and-download-page-blobs.aspx. Please read the section titled "Advanced Functionality – Clearing Pages and Changing Page Blob Size" for sample code.
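The blog post predates the current SDKs; with today's azure-storage-blob Python package the resize itself is a one-liner, assuming the drive's VHD lives in the blob shown (the names are placeholders) and keeping in mind that a page blob's size must be a multiple of 512 bytes.

    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string("<connection-string>")
    blob = service.get_blob_client(container="vhds", blob="mydrive.vhd")

    # Grow the underlying page blob; the new size must be a multiple of 512 bytes.
    blob.resize_blob(64 * 1024 * 1024 * 1024)   # e.g. 64 GB

Note that this only grows the blob; the VHD data structure and the partition inside it still have to be adjusted, which is what the answer below goes into.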
Yes, you can. I know this program; it is very easy to use. You can connect to your VHD, create a new one, upload a VHD and connect it to Azure, and upload or download files into the VHD: http://azuredriveexplorer.codeplex.com/
I have found these methods so far:
“The soft way”: increase the size of the page blob and fix the VHD data structure (the last 512 bytes). Theoretically this creates unpartitioned disk space after the current partition. But if the partitioning scheme also expects metadata at the end of the disk (GPT, or dynamic disks), that has to be fixed as well. I'm aware of only one tool that can do this in-place modification. Unfortunately, that tool is not much more than a one-weekend hack (at the time of this writing) and is therefore fragile (see the author's disclaimer), but it is fast. Please notify me (or edit this post) if the tool improves significantly.
Create a larger disk and copy everything over, as you've suggested. This may be enough if you don't need to preserve NTFS features like junctions, soft/hard links, etc.
Plan for the potential expansion and start with a huge (say 1 TB) dynamic VHD, comprised of a small partition and lots of unpartitioned (reserved) space. Windows Disk Manager will see the unpartitioned space in the VHD and can expand the partition into it whenever you want, which is an in-place operation. The subtle point is that the unpartitioned area, as long as it stays unpartitioned, won't be billed, because it isn't written to (see the sketch after this list). Note that either formatting or defragmenting does allocate the area and causes billing. However, the full size will count against the quota of your Azure subscription (100 TB).
“The hard way”: download the VHD file, use a VHD-resizer program to insert unpartitioned disk space, mount the VHD locally, extend the partition into the unpartitioned space, unmount, and upload. This preserves everything, and even works for an OS partition, but it is very slow due to the download/upload and the software installations involved.
The same as above, but performed on a secondary VM in Azure. This speeds up the downloading/uploading a lot. Step-by-step instructions are available here.
Unfortunately, all of these techniques require the drive to be unmounted for quite a long time, i.e. they cannot be performed in a highly available manner.
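As a small illustration of the billing point in the "plan for the potential expansion" option above: a page blob is sparse, so you can create a very large one up front and only the pages you actually write are billed. A minimal sketch with the current azure-storage-blob Python SDK (names and size are placeholders):

    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string("<connection-string>")
    blob = service.get_blob_client(container="vhds", blob="bigdrive.vhd")

    # Create a 1 TB page blob; no pages are written yet, so nothing beyond
    # metadata is billed until data is actually written into those pages.
    blob.create_page_blob(size=1024 * 1024 * 1024 * 1024)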