Best Way to Handle SFTP Files in Azure Data Factory

Best Way to Handle SFTP Files in Azure Data Factory - azure

I'm very new to Azure in general, particularly Data Factory v2, i'm also very new at this company. We have an ask from a vendor to query data out to a file and then drop it to an Amazon S3 bucket, however Azure Data Factory does not appear to support this. The client wants to use an SFTP method, but i'm wondering which is the best option. Apparently we have a Linux server all set-up but i'm not sure if it's within a VM in the cloud. I'm learning about logic apps and functions but not sure which is best way to go. I'm familiar with WinSCP and have scripted a way to handle SFTP files in the past, but wondering if that's truly a good way to handle this.
As I said, i'm very new to this so wanting to get some ideas. Have any of you done this in the past and what would you recommend?
I've done A ton of reading about transferring files to Amazon S3 and SFTP server. My head is spinning. I realize my question is general, but my team is very new and we took over this environment from a consulting firm so we don't have a lot of background.
The outcome which i'm hoping for with this post is the best, least intensive method for sending files via SFTP in Azure Data Factory

Related

How to upload video/audio files from web, to NodeJS, to Google Cloud Storage?

This is unfamiliar ground for me and I've been poking around with resources trying to find a suitable solution. But essentially, I have a website setup that I want to allow users to upload MP3/FLAC files. Then I want to take those files and send them to a Google Cloud Bucket. The second part seems easy enough, plenty of NodeJS tutorials regarding that.
Since I'm pretty in the dark with how this is done, would I need to "upload" a file on my frontend and then hit my node-express api backend with some sort of fs. solution that looks up the file on my machine? If so, how could that be consistent between users, what if their directory structure is different on their machines?
Anyway, kinda shooting in the dark here. Would love to have some advice regarding this.

It's not really feasible for a backend to "reach into" a frontend machine to pull files from it. The client needs to provide the data directly to the backend.
Mostly commonly, Firebase client libraries are used to directly upload contents from a client machine to a storage bucket. If you don't do that, you'll need to create your own backend API that clients can invoke to send data.

Azure Webjobs vs SSIS packages

I have been tasked with creating a scheduled job to first call an api, convert the response to a new format and then pass that data to another api. It doesn't sound like there is any logic in between
The company I work for has a lot of SSIS packages doing a variety of things but also has a healthy Azure platform with a few web jobs running. Several developers on my team have expressed a dislike for SSIS packages so I would like to implement this in Azure, but I want to make sure that is the most reasonable thing to do.
What I am asking for is a pro con list where each option is strong or weak. A good answer will assist readers in making a decision on if their specific situation is best solved using a SSIS package or an Azure webjob, assuming the needed environment is setup for either already.

Trying to figure out if / how Cloud would be an advantage

Compared to plain vanilla PhP/MySQL, what's the upside of Cloud?
A typical block of contents would be approximately 30,000 snippets of text, each 300 characters or less in length.
I'm looking at some good documents on buckets and objects and wondering if there's any reason for me to dive into all that.
Just a rough idea would be appreciated. Am I barking up the wrong tree even thinking of Cloud for this?
p.s. just guessing: is the way to go to run MySQL in the Cloud?

It will depend on the cloud service you choose. On the cloud you can choose between an IaaS, a PaaS or a SaaS.
On an IaaS you will get an infrastructure as a service where you need to install MySQL, the web server, ...
On a PaaS, all these services could be enabled just with click of your mouse and you will just use the service without taking care of the config or the installation process.
This blog article will give you an idea about how to use a MySQL database on a PaaS.
Regarding the web server, for PHP could be something really easy like zip your project and use a command to deploy your application without any config. See here an example.

Azure: SQL Compact possible?

I have a RESTful service running on azure. Currently, it has zero persistence. (It is just a REST gateway to another api.) I run it in a single, minimal Azure instance, and expect this will handle all the load this will ever get.
I now need to add some very lightweight persistence to it. A simple table, of 40-200 rows, eight data columns. The data is very static.
Doing the whole SQL Azure thing seems big overkill for my needs.
My thoughts have been to use:
An XML file, and load it into memory, as the db. XML file is
deployed with code.
Some better way to deploy XML, so it can be
rolled out/updated easier
SQL Compact (can I do this on Azure?)
___ ?
What is the right path here?
Thank you!

SQL Server Compact would need to store its data somewhere in persistent manner, so you would need to sync it regularly to a persistent storage and that's a lot of extra work and I have no idea how to do that reliably, so it's likely not a very good idea.
For your simple table the Azure Table Storage might be just enough. If that's not enough then SQL Azure is the next choice.

You can use the XML file as your store, there is no harm it it, rather this is a very easy and cost efficient solution, but there is a catch. As you mentioned currently you are using only azure instance, in this case you can store the XML file in your App_Data, but if in future if you want to shift to 2 azure instance, you will have to replicate the App_Data folder. In other words you will need to keep App_Data folder in sync.
Suggestion
Instead of storing file in App_Data store it in BLOB, you can retrieve it using WebClient and the store it in memory.
Pros: The advantage of BLOB is, you don't have to sync it.
Cons: There is a cost associated on the number of transactions you can make. This will depend upon how many times you update the file.
Summary
If you are going to work with only one Azure Instance, use App_Data
More than one Azure Instance, use BLOB with no syncing or use App_Data with sync.
Do not use Azure Table, as BLOB is the designated store provided for this purpose only.
EDIT
From MSDN post
As far as I know, Windows Azure does not support SQL Compact Edition. SQL Compact Edition stores data in file system which will not be synchronized in multiple instances (a web role may be deployed to more than one instance. An instance is similar to a virtual machine). And files stored in file system will lost when the instance is restarted or reimaged.
Hope this helps you.

Use Sql Server FileStream or traditional File Server?

I am designing a system that's going to have about 10 millions+ users, each has a photo, which is about 1~2 MB.
We are going to deploy both database and web app using Microsoft Azure
I am wondering the way I should store the photos, there are currently two options,
1, Store all photos use Sql Server FileStream
2, Use File Server
I haven't experienced such large scale BLOB data using FileStream.
Can anybody give my any suggestion? The Cons and Pros?
And anyone with Microsoft Azure experiences concerning the large photos store is really appreciated!
Thx
Ryan.

I vote for neither. Use Windows Azure Blob storage. Simple REST API, $0.15/GB/month. You can even serve the images directly from there, if you make them public (like <img src="http://myaccount.blob.core.windows.net/container/image.jpg" />), meaning you don't have to funnel them through your web app.

Database is almost always a horrible choice for any large-scale binary storage needs. Database is best for relational-only systems, and instead, provide references in your database to the actual storage location. There's a few factors you should consider:
Cost - SQL Azure costs quite a lot per GB of storage, and has small storage limitations (50GB per database), both of which make it a poor choice for binary data. Windows Azure Blob storage is vastly cheaper for serving up binary objects (though has a bit more complicated pricing system, still vastly cheaper per GB).
Throughput - SQL Azure has pretty good throughput, as it can scale well, however, Windows Azure Blog storage has even greater throughput as it can scale to any number of nodes.
Content Delivery Network - A feature not available to SQL Azure (though a complex, custom wrapper could be created), but can easily be setup within minutes to piggy-back off your Windows Azure Blob storage to provide limitless bandwidth to your end-users, so you never have to worry about your binary objects being a bottleneck in your system. CDN costs are similar to that of Blob storage, but you can find all that stuff here: http://www.microsoft.com/windowsazure/pricing/#windows
In other words, no reason not to go with Blob storage. It is simple to use, cost effective, and will scale to any needs.

I can't speak on anything Azure related but for my money the biggest advantage of using FILESTREAM is that that data can get backed up inside the normal SQL Server backup process. The size of the data that you are talking about also suggests that FILESTREAM may be a good choice as well.
I've worked on a SCM system with a RDBMS back end and one of our big decisions was whether to store the file deltas on the file system or inside the DB itself. Because it was cross-RDBMS we had to cook up a generic non-FILESTREAM way of doing it but the ability to do a single shot backup sold us.

FILESTREAM is a horrible option for storing images. I'm surprised MS ever promoted it.
We're currently using it for our images on our website. Mainly the user generated images and any CMS related stuff that admins create. The decision to use FILESTREAM was made before I started. The biggest issue is related to serving the images up. You better have a CDN sitting in front. If not, plan on your system coming to a screeching halt. Of course, most sites have a CDN, but you don't want to be at the mercy of that service going down meaning your system will get overloaded. The amount of stress put on your sql server is the main problem here.
In terms of ease of backup. Your tradeoff there is that your db is MUCH MUCH LARGER and, therefore, the backup takes longer. Potentially, much longer and the system runs slower during the backup. Not to mention, moving backups around takes longer (i.e., restoring prod data in a dev environment or on local machines for dev purposes). Don't use this as a deciding factor.
Most cloud services have automatic redundancy of any files that you store on their system (i.e., aws's S3 and azure's blob). If you're on premise, just make sure you use a shared location for the images and make sure that location is backed up. I think the best option is to set it up so each image (other UGC file types too) has an entry in your db with a path to that file. Going one step further, separate the root path into a config setting and only store the remaining path with the entry. For example, root path in config might be a base url, a shared drive or virtual dir, or a blank entry. Then your entry might have "/files/images/image.jpg". This way, if you move your filestore, you can just update the root config. I would also suggest creating a FileStoreProvider interface (Singleton) that can be used for managing (saving, deleting, updating) these files. This way, if you switch between AWS, Azure, or on premise, you can just create a new Provider.

I have a client server DB, i manage many files (doc, txt, pdf, ...) and all of them go in a filestream BLOB. Customers has 50+ MB dbs. If in azure you can do the same go for it. Having all in the db is a wonderful thing. It is considered good policy also for Postgres and MySQL

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string