Databricks: Removing cluster logs and revisions on root DBFS on cron

While investigating high Databricks expenses, I was surprised to discover that a large part of the cost comes from an auto-created storage account with GRS replication to another region, containing tons of log files (TBs upon TBs of data).
for example:
dbutils.fs.ls('dbfs:/cluster-logs')
dbfs:/cluster-logs/1129-093452-heard78
How can I automate removing this data on a daily basis without removing the logs from the last day or so?
Also, how can I send those logs somewhere else (if I want to)?

One solution would be to create a separate storage account (without the GRS option) for logs only, and set a retention policy that deletes files after a specific amount of time, for example several days. This storage account should be mounted, and the cluster log location changed to point to that mount. You can enforce that via cluster policies, for example.
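For illustration, here is a rough sketch of enforcing the log destination through a cluster policy created via the Databricks Policies REST API (the mount path, policy name, and attribute names are assumptions; check the cluster policy reference for your workspace before using it):

import json
import requests

# Assumptions: host/token are placeholders and /mnt/cluster-logs is the mount
# backed by the non-GRS logging storage account.
DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"

# Pin the cluster log destination so every cluster writes to the mounted account.
policy_definition = {
    "cluster_log_conf.type": {"type": "fixed", "value": "DBFS"},
    "cluster_log_conf.path": {"type": "fixed", "value": "dbfs:/mnt/cluster-logs"},
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"name": "enforce-log-destination", "definition": json.dumps(policy_definition)},
)
resp.raise_for_status()
print(resp.json())  # returns the policy_id of the new policy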
Cluster logs could be sent to Azure Log Analytics using the spark-monitoring library from Microsoft (see the official docs for more details). If you want to send them somewhere else, you can set up init scripts (cluster-scoped or global) and use a specific client to ship the logs wherever you need.
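For the daily cleanup itself, a scheduled Databricks job running a notebook cell along these lines could prune everything under dbfs:/cluster-logs that is older than the retention window. This is a minimal sketch, assuming a recent Databricks Runtime where FileInfo exposes modificationTime; dry-run it before enabling the delete:

import time

RETENTION_DAYS = 1
cutoff_ms = (time.time() - RETENTION_DAYS * 24 * 3600) * 1000  # modificationTime is in milliseconds

def newest_modification(path):
    # Walk the folder and return the newest modificationTime found under it.
    newest = 0
    for entry in dbutils.fs.ls(path):
        if entry.isDir():
            newest = max(newest, newest_modification(entry.path))
        else:
            newest = max(newest, entry.modificationTime)
    return newest

for cluster_dir in dbutils.fs.ls("dbfs:/cluster-logs"):
    if newest_modification(cluster_dir.path) < cutoff_ms:
        print(f"Removing {cluster_dir.path}")
        dbutils.fs.rm(cluster_dir.path, True)  # recursive delete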

Related

How to find who has deleted the file share folder from the storage account in Azure

Someone has deleted the file share folder from the storage account in Azure. It can be recovered since soft delete is enabled, but how can I find out who deleted it?
It is possible to view the operations within an Azure resource by using resource logs, via the monitoring support for Azure Storage that is part of Azure Monitor.
You would first start by creating a diagnostic setting: https://learn.microsoft.com/en-us/azure/storage/blobs/monitor-blob-storage?tabs=azure-portal#creating-a-diagnostic-setting
Then view the logged activity by using a Log Analytics query, or go to the destination you are forwarding the logs to (as configured in the diagnostic setting) and look for the respective API operation, for example "DeleteBlob" or "DeleteContainer".
However, if you have not already set up a diagnostic setting and are not forwarding data to a destination, it may not be possible to retrieve this information right now. Hope this helps!
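For example, once the diagnostic setting is streaming the storage logs to a Log Analytics workspace, a query along these lines could surface the delete operations. This is only a sketch using the azure-monitor-query SDK; the workspace ID is a placeholder, and for Azure Files deletions the table to query is StorageFileLogs rather than StorageBlobLogs:

from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# StorageBlobLogs holds blob operations; for file shares, query StorageFileLogs
# (e.g. OperationName == "DeleteFile" or "DeleteDirectory") instead.
query = """
StorageBlobLogs
| where OperationName in ("DeleteBlob", "DeleteContainer")
| project TimeGenerated, OperationName, AuthenticationType, RequesterObjectId, CallerIpAddress, Uri
| order by TimeGenerated desc
"""

response = client.query_workspace("<workspace-id>", query, timespan=timedelta(days=7))
for table in response.tables:
    for row in table.rows:
        print(dict(zip(table.columns, row)))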

How to move backup logs (1-year retention logs) from a Log Analytics workspace to an Azure storage account

I am trying to fetch 1-year retention logs from an Azure Log Analytics workspace and push them to a storage account.
I tried the data export option, but it only exports logs from the time I enabled Data Export onward, not the retained logs.
As per my understanding, data export in a Log Analytics workspace starts exporting only new data, generated from the time the data export rules are configured. So you are getting the logs created after enabling data export, but not the previously retained logs.
Please check whether you have configured the retention time as per your requirement.
As per this Microsoft doc, "Data in a Log Analytics workspace is retained for a specified period of time after which it's either removed or archived with a reduced retention fee. Set the retention time to balance your requirement for having data available with reducing your cost for data retention."
To set the retention time, follow these steps: go to the Azure Portal -> your Log Analytics workspace -> Usage and estimated costs -> Data Retention.
To do this from PowerShell/CLI, see this link.
Data retention may also differ based on the pricing tier you are using; to learn more about that in detail, please go through this link.
You can also try using other export options as mentioned in this Microsoft Doc.
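If you only need a one-time copy of the historical data rather than an ongoing export, one workaround is to query the workspace for the old time range yourself and land the results in the storage account. A rough sketch, assuming the azure-monitor-query and azure-storage-blob packages, a placeholder workspace ID and storage account, and an existing container named loganalytics-archive:

import csv
import io
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient
from azure.storage.blob import BlobServiceClient

credential = DefaultAzureCredential()
logs_client = LogsQueryClient(credential)
blob_service = BlobServiceClient(
    account_url="https://<storageaccount>.blob.core.windows.net", credential=credential
)
container = blob_service.get_container_client("loganalytics-archive")

# Pull one table for the retained time range (here: the last 365 days) and
# write it to blob storage as CSV. Large tables should be chunked (e.g. per day)
# to stay under the query API's row/size limits.
end = datetime.now(timezone.utc)
start = end - timedelta(days=365)
response = logs_client.query_workspace(
    "<workspace-id>", "AzureActivity | order by TimeGenerated asc", timespan=(start, end)
)

for i, table in enumerate(response.tables):
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(table.columns)
    writer.writerows(table.rows)
    container.upload_blob(f"AzureActivity/export-{i}.csv", buffer.getvalue(), overwrite=True)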
If you are still facing any issues, the references below may be helpful.
References:
Troubleshooting why Log Analytics is no longer collecting data
Other options to export data from Log Analytics workspace
Data Collection Rules in Azure Monitor - Azure Monitor | Microsoft Docs

Azure Filestore & Backup

I have a physical server that holds documents: customer orders in XML and the resulting order PDFs. The locations are mapped, via drive mappings, from the application server that generates them and from the desktops that need to access them.
These files need to be kept for a number of years for regulatory purposes, and the current file server needs expanding. This data will grow to about 5-8 TB over time, as it needs to be held for approximately 10 years before it can be removed.
I could create a VM in Azure with the appropriate storage and then, I presume, use MARS to create a backup strategy as if this were an on-site server. But to meet the disk sizing I would need a large VM, even though it needs very little processing power; it's just storage.
I would still need to be able to map the server and desktops to the drive where the files are stored.
So I was wondering if anyone could suggest an approach. The data would need to be available for the application to access for up to 18 months, so older data could be archived, but it still needs to be backed up, as retrieval of archived data would be via a manual search.
Thanks in advance.
You can map Azure Files storage as a network drive on your desktop and then take backups of the files stored in Azure Files for long-term retention.
Azure Backup for Files is in limited preview today. It enables you to take scheduled copies of your file shares and can be managed from the Recovery Services vault. If you are interested in signing up for this preview, you can drop a mail with your subscription ID(s) to AskAzureBackupTeam@microsoft.com
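The same share can also be worked with programmatically, for example to move older documents into a year-based archive folder for later manual retrieval. A minimal sketch with the azure-storage-file-share SDK, assuming a connection string in an environment variable and a share named "orders" (both placeholders):

import os

from azure.core.exceptions import ResourceExistsError
from azure.storage.fileshare import ShareClient

# The same share can be mounted as an SMB network drive on the desktops and the
# application server; this just shows the SDK view of it.
share = ShareClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"], share_name="orders"
)

# Put one order document into a year-based folder.
directory = share.get_directory_client("archive-2024")
try:
    directory.create_directory()
except ResourceExistsError:
    pass  # folder already exists

with open("order-12345.pdf", "rb") as data:
    directory.upload_file("order-12345.pdf", data)

# List what is in the folder.
for item in directory.list_directories_and_files():
    print(item["name"])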

Service Fabric Log etl files destination

I ran into an issue with the size of the Log folder on the first node of a Service Fabric cluster.
It seems that there isn't any upper limit on the disk space it can use, and it will eventually eat up the entire local disk.
The cluster was created via an ARM template, and I set up two storage accounts associated with the cluster. The variable names for the storage accounts are supportLogStorageAccountName and applicationDiagnosticsStorageAccountName.
However, the .etl files are written only to the local disks of the cluster nodes and not to the storage account (where I could only find .dtr files).
Is there any way to set the destination of the .etl files to external storage, or at least to limit the size of the Log folder? I wonder whether the overallQuotaInMB parameter in the ARM template could be related to that.
You can override the MaxDiskQuotaInMb setting from its default of 10240 to reduce the disk usage by the .etl files. (We use the local logs in cases where Azure Storage isn't available for some reason.)
https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-fabric-settings
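The override goes in the Diagnostics section of fabricSettings in the cluster's ARM template. As a sketch (Python is used here only to patch the template JSON; the file name and the 5120 value are examples, and any existing Diagnostics section should be merged into rather than duplicated):

import json

with open("cluster-template.json") as f:
    template = json.load(f)

# Find the Service Fabric cluster resource and add the Diagnostics override.
cluster = next(
    r for r in template["resources"]
    if r["type"] == "Microsoft.ServiceFabric/clusters"
)
cluster["properties"].setdefault("fabricSettings", []).append(
    {
        "name": "Diagnostics",
        "parameters": [{"name": "MaxDiskQuotaInMb", "value": "5120"}],
    }
)

with open("cluster-template.json", "w") as f:
    json.dump(template, f, indent=2)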

Azure blobs, what are they for?

I'm reading about Azure blobs and storage, and there are things I don't understand.
First, you can pay Azure for just hosting, but when you create a web role, do you need blob storage for the .dlls and other files (.js and .css)? Or is there a small storage quota in a worker role you can use, and how large is it? I can't accept getting charged every time a browser downloads a CSS file, so I guess I can store those things in another kind of storage.
Second, you get charged for transactions and bandwidth, so it's not a good idea to provide direct links to the blobs on your websites. Then what do you do? Download them from your website code and write them to the client output stream on the fly from ASP.NET? I think I've read that internal traffic/transactions are free, so it looks like a too-good-to-be-true solution :D
Is the traffic between hosting and storage also free?
Thanks in advance.
First, to answer your main question: blobs are best used for dynamic data files. If you ran a YouTube-style site, you would use blobs to store the videos in every compressed state, plus the thumbnail images generated from those videos. Tables within Table storage are best for dynamic data that does not consist of files; for example, comments on the videos would likely be best stored in Azure Table Storage (ATS).
You generally want a storage account for at least two things: publishing your deployments into Azure, and having your compute nodes transfer their diagnostic data to it so that you can monitor them once deployed.
Even though you publish your deployments THROUGH a storage account, the deployment code lives on your compute nodes. .CSS/.HTML files served by your app are served from your node's local storage, of which you get plenty (it is NOT a good place for your dynamic data, however).
You pay for traffic/data that crosses the Azure data center boundary, regardless of where it came from. Furthermore, transactions (reads or writes) against your Azure Table storage are not free, wherever they originate. You also pay for storing the data in the storage account (storing data on the compute nodes themselves is not metered). Data that does not leave the data center is not subject to transfer fees. In reality the costs are so low that you have to be pushing gigabytes per day to start noticing.
Don't store any dynamic data only on compute instances: that data will get purged whenever you redeploy your app or whenever Azure decides to move your app onto a different node.
Hope this helps
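To make the blobs-vs-tables split concrete, here is a small sketch with the current Python SDKs and placeholder names (connection string from an environment variable; the thumbnails container is assumed to already exist): file-like dynamic data goes to blob storage, small structured records such as comments go to table storage.

import os
import uuid

from azure.data.tables import TableServiceClient
from azure.storage.blob import BlobServiceClient

conn_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]

# File-like, dynamic data -> blob storage.
blobs = BlobServiceClient.from_connection_string(conn_str)
container = blobs.get_container_client("thumbnails")  # assumed to already exist
with open("video-42-thumb.png", "rb") as data:
    container.upload_blob("video-42-thumb.png", data, overwrite=True)

# Small structured records -> table storage.
tables = TableServiceClient.from_connection_string(conn_str)
comments = tables.create_table_if_not_exists("Comments")
comments.create_entity(
    {
        "PartitionKey": "video-42",   # group comments by video
        "RowKey": str(uuid.uuid4()),  # unique id per comment
        "Author": "someuser",
        "Text": "Nice video!",
    }
)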
