Databricks Auto Loader methods: file notifications vs directory listing

When using the Databricks Auto Loader functionality, you can either choose the file notification method or the default directory listing method.
I realise that file notifications is the superior method and that having Azure Functions and Event Grid identify file updates means the solution will scale out "infinitely", and so on (I've also read that this is the recommended method in prod?). But I have a use case where the frequency of incoming files is quite small, and I am wondering if I can just use the directory listing method in the prod environment as well. Has anyone used the directory listing method with Auto Loader in prod? Any thoughts/concerns? The idea at the moment is to start with the directory listing method and then switch to the file notification method as we start ingesting more files, more frequently.

Directory Listing Mode:
Auto Loader locates new files in directory listing mode by listing the input directory. Other than read access to your data in cloud storage, it needs no additional setup, so you can start Auto Loader streams quickly.
In Databricks Runtime 9.1 and later, Auto Loader can automatically detect whether files are arriving in your cloud storage with lexical ordering, which drastically reduces the number of API calls required to discover new files.
The main drawback of directory listing mode is cost and latency as the input directory grows: each micro-batch has to list the path again, so the number of storage list API calls, and the time spent discovering new files, increases with the number of files and subfolders even when little or nothing new has arrived.
File Notification Mode:
File notification mode uses notification and queue services in your cloud infrastructure account. Auto Loader can automatically set up a notification service and a queue service that subscribe to file events from the input directory.
For big input directories or a high volume of files, file notification mode is more efficient and scalable, but it requires additional cloud permissions to set up.
As your directory grows, you might want to switch to file notification mode for better scalability and quicker file discovery.
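For reference, here is a minimal PySpark sketch of the two modes, assuming it runs in a Databricks notebook where spark is predefined; the paths, file format and table name are placeholders to replace with your own:

    # Directory listing mode (the default): only needs read access to the input path.
    df = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "/mnt/checkpoints/schema")  # placeholder
          .load("/mnt/raw/incoming"))                                      # placeholder

    # File notification mode: the same stream plus the option below; Auto Loader then
    # provisions the queue/notification services (extra cloud permissions required).
    df_notif = (spark.readStream
                .format("cloudFiles")
                .option("cloudFiles.format", "json")
                .option("cloudFiles.useNotifications", "true")
                .option("cloudFiles.schemaLocation", "/mnt/checkpoints/schema")
                .load("/mnt/raw/incoming"))

    # The checkpoint tracks which files were already ingested, so starting with
    # directory listing and moving to notifications later is mainly a config change.
    (df.writeStream
       .option("checkpointLocation", "/mnt/checkpoints/ingest")  # placeholder
       .toTable("bronze.incoming"))                              # placeholder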

Related

Do temp files get automatically deleted after Azure function stops

I have an Azure function app with multiple Azure functions in it. The function app is running on the consumption plan and a Linux OS. In one of the Azure functions, each time it runs I'm saving a file to a /tmp directory so that I can use that file in the Azure function. After the Azure function stops running, if you don't delete this file from the /tmp directory, will it automatically be deleted? And if it isn't automatically deleted, what are the security implications of having those files still there? I ask because, as I understand it, using the consumption plan means sharing resources with other people. I don't know if the tmp directory is part of those shared resources or not, but I'd rather not have other people be able to access files I write to the tmp directory. Thanks :)
I had read here:
"Keep in mind though, that Functions is conceived as a serverless platform, and you should not depend on local disk access to do any kind of persistence. Whatever you store in that location will be deleted on other Function invocations."
I ran one of the Azure functions which uploads a file to that tmp directory each time it runs. From that link, I thought the files in tmp would be deleted each time I ran the function, but they persisted across runs. What am I getting wrong here?
Temp directory files are deleted automatically about once every 12 hours, but not after each run of the function app or when the function app is restarted.
This means the data stored in the temp directory (for example, D:\local\Temp) in Azure exists only as long as the function host process is alive, i.e. temp data/files are ephemeral.
Persisted files are not shared among site instances, and you cannot rely on those files staying there. Hence, there are no security concerns here.
Please refer to the Azure App Service File System temporary files GitHub Document for more information.
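If you want to avoid depending on the platform's cleanup at all, you can scope temp files to a single invocation and delete them yourself. A minimal sketch of an HTTP-triggered Python function (the function body and names are illustrative, not taken from your app):

    import os
    import tempfile

    import azure.functions as func


    def main(req: func.HttpRequest) -> func.HttpResponse:
        # Write the working file into the OS temp directory (/tmp on Linux plans).
        fd, path = tempfile.mkstemp(suffix=".dat")
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(req.get_body())
            # ... use the file here ...
            size = os.path.getsize(path)
            return func.HttpResponse(f"processed {size} bytes")
        finally:
            # Clean up explicitly instead of relying on the host's periodic cleanup.
            os.remove(path)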

Azure Storage - File Share - Move 16m files in nested folders

Posting here as Server Fault doesn't seem to have the detailed Azure knowledge.
I have an Azure storage account with a file share. This file share is connected to an Azure VM through a mapped drive. An FTP server on the VM accepts a stream of files and stores them in the file share directly.
There are no other connections. Only I have Azure admin access, limited support people have access to the VM.
Last week, for unknown reasons, 16 million files, which are nested in many sub-folders (origin, date), moved instantly into an unrelated subfolder, 3 levels deep.
I'm baffled how this could happen. There is a clear, instant cut-off when the files moved.
As a result, I'm seeing increased costs on LRS. I assume this is because Azure Storage is internally replicating the change at my expense.
I have attempted to copy the files back using a VM and AzCopy. This process crashed midway through, leaving me with a half-completed copy operation. The failed attempt took days, which makes me confident it wasn't the support guys dragging and dropping a folder by accident.
Questions:
1) Is it possible to just instantly move so many files, and how?
2) Is there a solid way I can move the files back, taking into account the half-copied files? I mean an Azure backend operation rather than writing an app / PowerShell / AzCopy.
3) Is there a cost-efficient way of doing this (I'm on the Transaction Optimized tier)?
4) Do I have a case here to get Microsoft to do something? We didn't move them... I assume something internally messed up.
Thanks
A tool that supports server-side copy (like AzCopy) can move the files quickly, because the data stays inside the storage service instead of being pulled down to your VM and pushed back up. If you want to investigate the root cause, I recommend opening a support case: your best bet is to connect with the Azure support team by filing a ticket, and they can guide you on this matter on a best-effort basis.
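If you do end up scripting it yourself, the Azure Files SDK can drive the same kind of server-side copy per file. A rough sketch with the Python azure-storage-file-share package, assuming same-account copies; the connection string, share name and paths are placeholders, and a real run would also need to walk the folder tree, poll pending copies and skip files the earlier AzCopy attempt already moved:

    from azure.storage.fileshare import ShareFileClient

    CONN = "<storage-connection-string>"   # placeholder
    SHARE = "myshare"                      # placeholder

    def move_file(src_path: str, dst_path: str) -> None:
        src = ShareFileClient.from_connection_string(CONN, share_name=SHARE, file_path=src_path)
        dst = ShareFileClient.from_connection_string(CONN, share_name=SHARE, file_path=dst_path)

        # Server-side copy: the bytes never leave the storage service.
        dst.start_copy_from_url(src.url)

        # Only delete the source once the copy has actually finished.
        if dst.get_file_properties().copy.status == "success":
            src.delete_file()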

saving Image to file system in Asp.net Core

I'm building an application that saves a lot of images using the C# Web API of ASP.NET.
The best way seems to be to save the images in the file system and save their paths in the database.
However, I am concerned about load balancing. Since every server will put the image in its own file system, how can another server behind the same load balancer retrieve the image?
If you have the resources for it, I would state that
the best way is to save them in the file system and save the image path into the database
is not true at all.
Instead, I'd say using an existing file storage service is probably going to produce the best results, if you are willing to pay for the service.
For dotnet the 'go to' would be Azure Blob Storage, which is ideal for non-streamed data like images.
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-dotnet
Otherwise, you can try to create your own file storage service from scratch, in which case you will effectively be creating a separate service API apart from your main cluster that handles your actual app; this secondary API just handles file storage and runs on its own dedicated server.
You then simply create an association between Id <-> File Data on the file server, and your app servers can request and push files to the file server via those Ids.
It's possible, but a file server is for sure one of those projects that seems straightforward at first; very quickly you realize it's a very, very difficult task, and it may have been easier to just pay for an existing service.
There might be existing self hosted file server options out there as well!
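The quickstart linked above covers the .NET SDK; the pattern is the same in any language: the web app streams the upload to Blob Storage and keeps only the blob name/URL in the database, so every server behind the load balancer reads from the same store. For illustration only, a rough sketch with the Python azure-storage-blob package (connection string and container name are placeholders):

    import uuid

    from azure.storage.blob import BlobServiceClient, ContentSettings

    CONN = "<storage-connection-string>"   # placeholder
    CONTAINER = "images"                   # placeholder

    def save_image(data: bytes, content_type: str = "image/jpeg") -> str:
        """Upload the image and return the blob name to store in the database."""
        blob_name = f"{uuid.uuid4()}.jpg"
        blob = (BlobServiceClient.from_connection_string(CONN)
                .get_blob_client(container=CONTAINER, blob=blob_name))
        blob.upload_blob(data, overwrite=True,
                         content_settings=ContentSettings(content_type=content_type))
        return blob_name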

Spring Integration - Reading files from multiple locations & putting them at a central repository

I need to transfer the file contents from multiple servers to a central repository as soon as there is a change in the file. Also the requirement is that only the changed contents should be transferred, not the whole file.
Could someone let me know if it is possible using Spring-Integration File Inbound/Outbound Adapters.
The file adapters only work on local files (but they can work if you can mount the remote filesystems).
The current file adapters do not support transferring parts of files, but we are working on File Tailing Adapters which should be in the code base soon. These will only work for text files, though (and only if you can mount the remote file system). For Windows (and other platforms that don't have a tail command), there's an Apache Commons Tailer implementation but, again, it will only work for text files, and only if you can mount the shares.
If you can't mount the remote files, or they are binary, there's no out-of-the-box solution, but if you come up with a custom solution to transfer the data (e.g. google tailing remote files), it's easy to then hook it into a Spring Integration flow to write the output.
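Not Spring Integration itself, but as a sketch of the "only the changed contents" idea such a custom tailer would implement: remember the last byte offset per file and, on each poll, ship only the bytes appended since then (a real implementation would persist the offsets and handle file rotation):

    import os

    # Last transferred offset per file path (would be persisted in practice).
    offsets: dict[str, int] = {}

    def read_new_bytes(path: str) -> bytes:
        """Return only the bytes appended since the previous call."""
        last = offsets.get(path, 0)
        if os.path.getsize(path) < last:   # file was truncated or replaced
            last = 0
        with open(path, "rb") as f:
            f.seek(last)
            data = f.read()
        offsets[path] = last + len(data)
        return data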

Is there an API to set a NTFS ACL only on a particular folder without flowing permissions down?

In my environment, I have several projects that involve running NTFS ACL audit reports and various ACL cleanup activities on a number of file servers. There are two main reasons why I cannot perform these activities locally on the servers:
1) I do not have local access to the servers as they are actually owned and administered by another company.
2) They are SNAP NAS servers which run a modified Linux OS (called GuardianOS) so even if I could get local access, I'm not sure of the availability of tools to perform the operations I need.
With that out of the way, I ended up rolling my own ACL audit reporting tool that would recurse down the filesystem starting at a specified top-level path and would spit out an HTML report on all the groups/users it encountered on the ACLs as well as showing the changes in permissions as it descended the tree. While developing this tool, I found out that the network overhead was the worst part of doing these operations and by multi-threading the process, I could achieve substantially greater performance.
However, I'm still stuck finding a good tool to perform the ACL modifications and cleanup. The standard out-of-the-box tools (cacls, xcacls, Explorer) seem to be single-threaded and suffer a significant performance penalty when going across the network. I've looked at rolling my own ACL-setting program that is multithreaded, but the only API I'm familiar with is the .NET FileSystemAccessRule stuff, and the problem is that if I set the permissions on a folder, it automatically wants to "flow" the permissions down. This causes a problem because I want to do the "flowing" myself using multi-threading.
I know NTFS "allows" inherited permissions to be inconsistent because I've seen it where a folder/file gets moved on the same volume between two parent folders with different inherited permissions and it keeps the old permissions as "inherited".
The Questions
1) Is there a way to set an ACL that applies to the current folder and all children (your standard "Applies to files, folders, and subfolders" ACL) but not have it automatically flow down to the child objects? Basically, I want to be able to tell Windows that "Yes, this ACL should be applied to the child objects but for now, just set it directly on this object".
Just to be crystal clear, I know about the ACL options for applying to "this folder only" but then I lose inheritance which is a requirement so that option is not valid for my use case.
2) Anyone know of any good algorithms or methodologies for performing ACL modifications in a multithreaded manner? My gut feeling is that any recursive traversal of the filesystem should work in theory especially if you're just defining a new ACL on a top-level folder and just want to "clean up" all the subfolders. You'd stamp the new ACL on the top-level and then recurse down removing any explicit ACEs and then "flowing" the inherited permissions down.
(FYI, this question is partially duplicated from ServerFault since it's really both a sysadmin and a programming problem. On the other question, I was asking if anyone knows of any tools that can do fast ACL setting over the network.)
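To make question 2 concrete, the recursive traversal parallelizes naturally if each directory is its own unit of work. A rough sketch of that shape in Python, where process_acl stands in for whatever per-folder ACL cleanup is actually applied:

    import os
    from concurrent.futures import ThreadPoolExecutor

    def walk_parallel(root: str, process_acl, max_workers: int = 16) -> None:
        """Apply process_acl(path) to root and every subfolder, level by level."""
        def handle(path: str) -> list[str]:
            process_acl(path)  # e.g. strip explicit ACEs so inheritance flows down
            return [e.path for e in os.scandir(path) if e.is_dir(follow_symlinks=False)]

        level = [root]
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            while level:
                # Every directory at the current depth is handled on its own thread.
                level = [child for children in pool.map(handle, level) for child in children]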
Found the answer in a MS KB article:
File permissions that are set on files and folders using Active Directory Services Interface (ADSI) and the ADSI resource kit utility, ADsSecurity.DLL, do not automatically propagate down the subtree to the existing folders and files.
The reason that you cannot use ADSI to set ACEs to propagate down to existing files and folders is because ADsSecurity.dll uses the low-level SetFileSecurity function to set the security descriptor on a folder. There is no flag that can be set by using SetFileSecurity to automatically propagate the ACEs down to existing files and folders. The SE_DACL_AUTO_INHERIT_REQ control flag will only set the SE_DACL_AUTO_INHERITED flag in the security descriptor that is associated with the folder.
So I've got to use the low-level SetFileSecurity Win32 API function (which is marked obsolete in its MSDN entry) to set the ACL and that should keep it from automatically flowing down.
Of course, I'd rather tear my eyeballs out with a spoon than deal with trying to P/Invoke some legacy Win32 API with all its warts, so I may end up just using an old NT4 tool called FILEACL that is like CACLS but has an option to use the SetFileSecurity API so changes don't automatically propagate down.
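If you would rather avoid the P/Invoke route entirely, the same low-level call is also exposed by pywin32. A rough sketch of adding an inheritable ACE directly to one folder via SetFileSecurity, so nothing is propagated to existing children (the account name and rights below are placeholders):

    import ntsecuritycon
    import win32security

    def add_inheritable_ace(folder: str, account: str = "DOMAIN\\SomeGroup") -> None:
        """Add an inheritable allow-ACE to a single folder without propagation."""
        sid, _, _ = win32security.LookupAccountName(None, account)

        sd = win32security.GetFileSecurity(folder, win32security.DACL_SECURITY_INFORMATION)
        dacl = sd.GetSecurityDescriptorDacl() or win32security.ACL()

        # "This folder, subfolders and files": the inheritance flags live on the ACE;
        # SetFileSecurity itself will not push the ACE down to existing children.
        dacl.AddAccessAllowedAceEx(
            win32security.ACL_REVISION,
            ntsecuritycon.CONTAINER_INHERIT_ACE | ntsecuritycon.OBJECT_INHERIT_ACE,
            ntsecuritycon.FILE_GENERIC_READ | ntsecuritycon.FILE_GENERIC_EXECUTE,
            sid,
        )

        sd.SetSecurityDescriptorDacl(1, dacl, 0)
        win32security.SetFileSecurity(folder, win32security.DACL_SECURITY_INFORMATION, sd)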
