I want to access the Databricks Audit Logs to check user activity. For example, the number of times that a table was viewed by a user.
I'd like to know if there is any way to get the logs as a Databricks table, i.e. saving the logs as a table (say, a Delta table).
Also, I want it to work continuously, adding new logs to the table as new events happen (not just a one-time load).
I am not sure whether I can use SQL for this purpose (instead of the REST API).
Any idea how to do that?
Enable audit logging in the Databricks workspace settings, then follow the steps below:
1. The log data is stored at /mnt/audit-logs, so create a table with that location using SQL:
CREATE TABLE audit_logs
USING delta
LOCATION '/mnt/audit-logs'
This creates a Delta table named audit_logs that points to the /mnt/audit-logs directory in the Databricks File System (DBFS).
2. Run a job or execute some code that generates audit logs in Databricks. The audit logs will be written to the /mnt/audit-logs directory.
3. Use the following code to continuously update the audit_logs Delta table with new audit log events:
INSERT INTO audit_logs
SELECT *
FROM delta.`/mnt/audit-logs`
This code reads the audit log events from the /mnt/audit-logs directory and appends them to the audit_logs Delta table. Note that, as written, it does not skip events that have already been loaded, so duplicates need to be handled separately.
4. Run the code from step 3 periodically (for example, using a Databricks job or a cron job) to keep the audit_logs Delta table up to date with new audit log events; see the sketch at the end of this answer for an incremental version.
In this way you can maintain the audit logs as a continuously updated Delta table.
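As a rough sketch only: since the table in step 1 already points at /mnt/audit-logs, the INSERT in step 3 reads from and writes to the same location. A common alternative is to land the raw audit-log JSON in a separate folder and load it incrementally with Auto Loader. The paths /mnt/audit-logs/raw and /mnt/audit-logs/_checkpoint below are assumptions, not part of the original setup:

# Minimal PySpark sketch (assumed paths): incrementally append new audit-log JSON
# files to the audit_logs Delta table using Databricks Auto Loader.
raw_path = "/mnt/audit-logs/raw"            # hypothetical landing folder for raw audit logs
checkpoint = "/mnt/audit-logs/_checkpoint"  # hypothetical checkpoint/schema location

(spark.readStream
    .format("cloudFiles")                        # Auto Loader: picks up only files it has not seen
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint)
    .load(raw_path)
    .writeStream
    .option("checkpointLocation", checkpoint)
    .trigger(availableNow=True)                  # process whatever is new, then stop (job-friendly)
    .toTable("audit_logs"))

Scheduling this as a Databricks job gives the continuous behaviour without re-inserting rows that were already loaded.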
Related
How can the Hudi metrics be accessed programmatically? After a commit I would like to get metrics like records updated / records inserted and log them into a database.
I tried setting hoodie.metrics.on=true and hoodie.metrics.reporter.type=INMEMORY. But how can I get a HoodieMetrics object that contains the actual information?
I have created a Get Metadata activity in an Azure pipeline to fetch the details of files located in a VM, and I am iterating over the output of the Get Metadata activity using a ForEach loop.
In the ForEach loop, I am calling a stored procedure to update the file details in the database.
If I have 2K files in the VM, the stored procedure is called 2K times, which I feel is not good practice.
Is there any method to update all the file details in one shot?
To my knowledge, you could use the Get Metadata activity to get the output and then pass it into an Azure Function activity.
Inside the Azure Function, you could loop over the output and use an SDK (such as a Java SQL library) to update the tables in a single batch.
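Purely as an illustration of that idea (in Python rather than Java): an HTTP-triggered Azure Function that receives the Get Metadata file list from the pipeline and writes all rows with one batched insert. The dbo.FileDetails table, its columns, the SQL_CONN_STR app setting, and the shape of the incoming JSON are assumptions, not taken from the question:

import os

import azure.functions as func  # v1 programming model; the function.json binding is omitted
import pyodbc


def main(req: func.HttpRequest) -> func.HttpResponse:
    # Expected body: the Get Metadata childItems array forwarded by the pipeline,
    # e.g. [{"name": "a.csv", "lastModified": "...", "size": 123}, ...] (shape assumed)
    files = req.get_json()

    with pyodbc.connect(os.environ["SQL_CONN_STR"]) as conn:  # hypothetical app setting
        cursor = conn.cursor()
        cursor.fast_executemany = True  # send all rows to SQL Server in one batched round trip
        cursor.executemany(
            "INSERT INTO dbo.FileDetails (FileName, LastModified, SizeBytes) VALUES (?, ?, ?)",
            [(f["name"], f.get("lastModified"), f.get("size")) for f in files],
        )
        conn.commit()

    return func.HttpResponse(f"Recorded {len(files)} files", status_code=200)

Calling the function once from the pipeline replaces the 2K stored-procedure calls; if you want to keep the stored procedure itself, you can instead pass the whole array as JSON and unpack it server-side with OPENJSON.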
I need a suggestion on archiving unused data from the search service and reloading it when needed (the reload will be handled later).
The initial design draft looks like this:
Find the keys in the search service, based on some conditions (e.g. inactive, or how old), for the documents that need to be archived.
Run an archiver job (I need a suggestion here; it could be a WebJob or a Function App).
Fetch the data, insert it into blob storage, and delete it from the search service.
Ideally the job should run in a pool and be asynchronous.
There's no right / wrong answer for this question. What you need to do is perform batch queries (up to 1000 docs) and schedule the archiving of past data (e.g. run an Azure Function on a schedule that searches for docs whose createdDate is older than your cutoff).
Then persist that data somewhere (it can be Cosmos DB, or blobs in a storage account). Once you need to upload it again, I would treat it as a new insert, so it should follow your current insert process.
You can also take a look at this tool, which helps to copy data from your index pretty quickly:
https://github.com/liamca/azure-search-backup-restore
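As a rough sketch of the query / persist / delete flow described above, assuming the Python SDKs azure-search-documents and azure-storage-blob, and made-up index, field, and container names (my-index, id, isActive, createdDate, search-archive):

import json
from datetime import datetime, timedelta, timezone

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.storage.blob import BlobServiceClient

# Hypothetical endpoint, key, and names; replace with your own.
search = SearchClient("https://<service>.search.windows.net", "my-index", AzureKeyCredential("<key>"))
container = BlobServiceClient.from_connection_string("<storage-connection-string>") \
    .get_container_client("search-archive")

cutoff = (datetime.now(timezone.utc) - timedelta(days=365)).strftime("%Y-%m-%dT%H:%M:%SZ")

# Batch query: up to 1000 old/inactive documents per run (field names are assumptions).
docs = [dict(d) for d in search.search(search_text="*",
                                       filter=f"isActive eq false and createdDate lt {cutoff}",
                                       top=1000)]

if docs:
    # Persist the batch to blob storage, then remove the documents from the index.
    container.upload_blob(f"archive-{cutoff}.json", json.dumps(docs, default=str))
    search.delete_documents(documents=[{"id": d["id"]} for d in docs])

A timer-triggered Azure Function or WebJob can run this on a schedule, and reloading later is just an upload of the saved JSON through your normal insert path (e.g. merge_or_upload_documents).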
I would like to know the best way to handle the following scenario.
I have an Azure cloud service that uses an Azure Storage table to look up data for incoming requests. The data in the table is generated offline periodically (once a week).
When new data is generated offline, I need to upload it into a separate table, change the configuration (the table name) so the service picks up data from the new table, and re-deploy the service. (Every time the data changes I change the table name, which is stored as a constant in my code, and re-deploy.)
The other way would be to keep a configuration parameter for my Azure web role which specifies the name of the table that holds the current production data. Then, within the service, I would read the config variable on every request, get a reference to the table, and fetch the data from there.
Is the second approach OK, or would it take a performance hit because I read the config and create a table client on every request that comes to the service? (The SLA for my service is under 2 seconds.)
To answer your question, the 2nd approach is definitely better than the 1st one. I don't think you will take a performance hit, because the config settings are cached on the first read (I read that in one of the threads here), and creating a table client does not add network overhead: unless you execute some methods on it, the object just sits in memory. One possibility would be to read the value from the config file and put it in a static variable; when you change the config setting, capture the role-environment-changing event and update the static variable with the new value from the config file.
A 3rd alternative could be to soft-code the table name in another table and have your application read the table name from there. You could update it as part of your upload process: first upload the data, then update this table with the name of the table the data has been uploaded to.
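The answer above is about a .NET web role, so purely to illustrate the caching idea (not the RoleEnvironment API), here is a small Python sketch that rebuilds the table client only when the configured table name changes; the setting names LOOKUP_TABLE_NAME and STORAGE_CONN_STR are made up:

import os

from azure.data.tables import TableClient  # azure-data-tables SDK

# Cache the client so each request does not re-read config or rebuild the client.
_cached_name = None
_cached_client = None


def get_lookup_table() -> TableClient:
    """Return a TableClient for the table named in configuration, rebuilding it
    only when the configured name changes (e.g. after the weekly data refresh)."""
    global _cached_name, _cached_client
    name = os.environ["LOOKUP_TABLE_NAME"]  # hypothetical config setting
    if name != _cached_name:
        _cached_client = TableClient.from_connection_string(
            os.environ["STORAGE_CONN_STR"], table_name=name)  # hypothetical setting
        _cached_name = name
    return _cached_client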
I finished the setup for creating the Azure hub and installing the client agent and the database.
Then I defined the dataset.
At that point, whichever database I chose, when I clicked 'Get latest schema' I got an error.
The error is:
The get schema request is either taking a long time or has failed.
When I checked the log, it said the following:
Getting schema information for the database failed with the exception "There is already an open DataReader associated with this Command which must be closed first." For more information, provide tracing id ‘xxxx’ to customer support.
Any idea for this?
The current release has a maximum of 500 tables per sync group. Also, the drop-down for the table list is restricted to this same limit.
Here's a quick workaround:
script the tables you want to sync
create a new temporary database and run the script to create the tables you want to sync
register and add the new temporary database as a member of the sync group
use the new temporary database to pick the tables you want to sync
add all the other databases that you want to sync with (the on-premises databases and the hub database)
once the provisioning is done, remove the temporary database from the sync group.