Azure Purview - Deleted folders in ADLS Gen2

We are scanning an ADLS Gen 2 data lake successfully with Purview. However, if a folder is deleted in the lake and you re-scan, the scan does not remove the deleted folder. The deleted folder remains in Purview, and its last modified date (from the scan) still shows the previous scan date/time from when it was present. How can I purge these now-invalid entries? Removing the previous scan does not work. Removing the entire source from Purview leaves the scan results behind in the register, and a new scan does not clean them up. There is also no manual delete/purge option. The only option seems to be to remove the entire Purview account from Azure, redeploy and reconfigure everything.
Am I missing a trick?

Reading through https://learn.microsoft.com/en-us/azure/purview/concept-detect-deleted-assets, this mostly seems like expected behavior. Did you try scanning more than twice, with at least 5-minute intervals between scans?
To keep deleted files out of your catalog, it's important to run regular scans. The scan interval is important, because the catalog can't detect deleted assets until another scan is run. So, if you run scans once a month on a particular store, the catalog can't detect any deleted data assets in that store until you run the next scan a month later. 😕

Related

Syncing incremental data with ADF

In Synapse I've set up 3 different pipelines. They all gather data from different sources (SQL, REST and CSV) and sink this to the same SQL database.
Currently they all run during the night, but I already know the question of running them more frequently is coming. I want to prevent my pipelines from running through all the sources when nothing has changed in the source.
Therefore I would like to store the last successful sync run of each pipeline (or pipeline activity). Before the next start of each pipeline I want a new pipeline, a 4th one, which checks whether something has changed in the sources. If so, it triggers the execution of one, two or all three of the pipelines.
I still see some complications in doing this, so I'm not fully sure how to approach it. All help and thoughts are welcome; does anyone have experience with this?
This is (at least in part) the subject of the following Microsoft tutorial:
Incrementally load data from Azure SQL Database to Azure Blob storage using the Azure portal
You're on the correct path - the crux of the issue is creating and persisting "watermarks" for each source from which you can determine if there have been any changes. The approach you use may be different for different source types. In the above tutorial, they create a stored procedure that can store and retrieve a "last run date", and use this to intelligently query tables for only rows modified after this last run date. Of course this requires the cooperation of the data source to take note of when data is inserted or modified.
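If it helps to see the watermark idea in isolation, here's a minimal sketch in PowerShell using the SqlServer module's Invoke-Sqlcmd; the dbo.PipelineWatermark table, its columns and the pipeline name are all hypothetical. Inside ADF the same pattern becomes a Lookup activity (read the watermark), a source query filtered on it, and a Stored Procedure activity to advance it, which is what the tutorial above walks through.
# Read the current watermark for one pipeline (hypothetical control table and names)
$lastRun = Invoke-Sqlcmd -ServerInstance "your-server.database.windows.net" -Database "ControlDb" -Query "SELECT LastRunUtc FROM dbo.PipelineWatermark WHERE PipelineName = 'sql_ingest';"
# Query the source for rows changed since $lastRun.LastRunUtc, run the load, then advance the watermark
Invoke-Sqlcmd -ServerInstance "your-server.database.windows.net" -Database "ControlDb" -Query "UPDATE dbo.PipelineWatermark SET LastRunUtc = SYSUTCDATETIME() WHERE PipelineName = 'sql_ingest';"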
If you have a source that cannot be intelligently queried in part (e.g. a CSV file) you still have options, such as using the Get Metadata activity to query the lastModified property of a source file (or even its contentMD5 if using Blob or ADLS Gen2) and compare this to a value saved during your last run (you would have to pick a place to store this, e.g. an operational DB, Azure Table or small blob file) to determine whether it needs to be reprocessed.
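For the file-based check, a rough equivalent outside ADF looks like this, assuming the Az.Storage module and placeholder account/container/blob names (inside ADF this would be a Get Metadata activity feeding an If Condition):
# Fetch the source blob's last-modified timestamp (all names are placeholders)
$ctx = New-AzStorageContext -StorageAccountName "yourstorage" -StorageAccountKey "<key>"
$blob = Get-AzStorageBlob -Container "landing" -Blob "source.csv" -Context $ctx
# Compare against the value persisted after the previous successful run
$lastProcessed = (Get-Date "2024-01-01T00:00:00Z").ToUniversalTime()   # e.g. read back from a control table or a small blob
if ($blob.LastModified.UtcDateTime -gt $lastProcessed) {
    Write-Output "Source changed - trigger the CSV pipeline"
}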
If you want to go crazy, you can look into streaming patterns (which might require dabbling in HDInsight or getting your hands dirty with Azure Event Hubs to trigger ADF) to move from a scheduled trigger to automatic ingestion as new data appears at the sources.

Azure Storage - File Share - Move 16m files in nested folders

Posting here as Server Fault doesn't seem to have the detailed Azure knowledge.
I have an Azure storage account with a file share. This file share is connected to an Azure VM through a mapped drive. An FTP server on the VM accepts a stream of files and stores them in the file share directly.
There are no other connections. Only I have Azure admin access, limited support people have access to the VM.
Last week, for unknown reasons, 16 million files, which are nested in many sub-folders (origin, date), moved instantly into an unrelated subfolder, 3 levels deep.
I'm baffled how this can happen. There is a clear instant cut off when files moved.
As a result, I'm seeing increased costs on LRS. I'm assuming because internally Azure storage is replicating the change at my expense.
I have attempted to copy the files back using a VM and AzCopy. This process crashed midway through, leaving me with a half-completed copy operation. This failed attempt took days, which makes me confident it wasn't the support guys dragging and dropping a folder by accident.
Questions:
Is it possible to instantly move so many files (and how)?
Is there a solid way I can move the files back, taking into account the half-copied files - I mean an Azure backend operation rather than writing an app / PowerShell / AzCopy?
Is there a cost-efficient way of doing this (I'm on the Transaction Optimised tier)?
Do I have a case here to get Microsoft to do something? We didn't move them... I assume something internally messed up.
Thanks
A tool that supports server-side copy (like AzCopy) can move the files relatively quickly, because the data never travels through your VM - the copy happens entirely within the storage service. If you want to investigate the root cause, your best bet is to open a support ticket; the Azure support team can help with this on a best-effort basis.
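If you do end up pushing the files back with AzCopy, the server-to-server form between two paths in the same share looks roughly like this; the account, share, folder names and SAS tokens are placeholders, and it copies rather than moves, so the misplaced copies still have to be deleted afterwards (worth estimating the transaction cost for 16m files on the Transaction Optimised tier first):
azcopy copy "https://<account>.file.core.windows.net/<share>/misplaced/three/levels/deep?<SAS>" "https://<account>.file.core.windows.net/<share>/intended/parent?<SAS>" --recursive --preserve-smb-info=true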

How do I use perforce checkpoints?

Why is Perforce giving this error when I try to create a checkpoint? Can I restore the entire database from just a checkpoint file and the journal file? What am I doing wrong, and how does this work? Why is the Perforce user guide a giant book, and why are there no video tutorials online?
Why is perforce giving this error when I try to create a checkpoint?
You specified an invalid prefix (//. is not a valid filename). If you want to create a checkpoint that doesn't have a particular prefix, just omit that argument:
p4d -jc
This will create a checkpoint called something like checkpoint.8 in the server root directory (P4ROOT), and back up the previous journal file (journal.7) in the same location.
Can I restore the entire database from just a checkpoint file and the journal file?
Yes. The checkpoint is a snapshot of the database at the moment in time when you took the checkpoint. The journal file records all transactions made after that point.
If you restore from just the checkpoint, you will recover the database as of the point in time at which the checkpoint was taken. If you restore from your latest checkpoint plus the currently running journal, you can recover the entire database up to the last recorded transaction.
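As a minimal sketch, assuming the checkpoint.8 example above and the default file names, the recovery looks like this:
# With the server stopped: replay the checkpoint snapshot, then the current journal on top of it
p4d -r /path/to/p4root -jr checkpoint.8 journal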
The old journal backups that are created as part of the checkpoint process provide a record of everything that happened in between checkpoints. You don't need these to recover the latest state, but they can be useful in extraordinary circumstances (e.g. you discover that important data was permanently deleted by a rogue admin a month ago and you need to recover a copy of the database to the exact moment in time before that happened).
The database (and hence the checkpoint/journal) does not include depot file content! Make sure that your depots are located on reasonably durable storage (e.g. a mirrored RAID) and/or have regular backups (ideally coordinated with your database checkpoints so that you can restore a consistent snapshot in the event of a catastrophe).
https://www.perforce.com/manuals/v15.1/p4sag/chapter.backup.html

Azure Data Lake Storage and Data Factory - Temporary GUID folders and files

I am using Azure Data Lake Store (ADLS), targeted by an Azure Data Factory (ADF) pipeline that reads from Blob Storage and writes in to ADLS. During execution I notice that there is a folder created in the output ADLS that does not exist in the source data. The folder has a GUID for a name and many files in it, also GUIDs. The folder is temporary and after around 30 seconds it disappears.
Is this part of the ADLS metadata indexing? Is it something used by ADF during processing? Although it appears in the Data Explorer in the portal, does it show up through the API? I am concerned it may create issues down the line, even though it is a temporary structure.
Any insight appreciated - a Google turned up little.
So what you're seeing here is something that Azure Data Lake Storage does regardless of the method you use to upload and copy data into it. It's not specific to Data Factory and not something you can control.
For large files it basically parallelises the read/write operation for a single file. You then get multiple smaller files appearing in the temporary directory for each thread of the parallel operation. Once complete the process concatenates the threads into the single expected destination file.
Comparison: this is similar to what PolyBase does in SQLDW with its 8 external readers that hit a file in 512MB blocks.
I understand your concerns here. I've also done battle with this, whereby the operation fails and does not clean up the temp files. My advice would be to be explicit with your downstream services when specifying the target file path.
One other thing: I've had problems when using the Visual Studio Data Lake file explorer tool to upload large files. Sometimes the parallel threads did not concatenate into the single file correctly and caused corruption in my structured dataset. This was with files in the 4 - 8GB region. Be warned!
Side note. I've found PowerShell most reliable for handling uploads into Data Lake Store.
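For example, a single-file upload with the Az.DataLakeStore cmdlets looks roughly like this (account name and paths are placeholders; older scripts use the same cmdlet with the AzureRM prefix):
# Upload a local file into Data Lake Store (Gen1); add -Recurse for folders and tune -Concurrency for very large files
Import-AzDataLakeStoreItem -Account "youradlsaccount" -Path "C:\data\bigfile.csv" -Destination "/raw/bigfile.csv"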
Hope this helps.

Azure Backup fails after installing Kaspersky Total Security

Since installing Kaspersky Total Security two days ago, my Azure backups keep failing. This is the process: taking snapshot of volumes; preparing storage; estimating size of backup items; job failed. The error message for each volume is 'Unable to find changes in a file. This could be due to various reasons' (0x07EF8).
Data in my files has definitely been changing. I have tried two things: 1. Disabled Kaspersky. 2. Completely deleted the backup and rebuilt from scratch. This made no difference at all.
Try excluding all VHD files and adding cbengine.exe to the Trusted Applications list; see this article:
http://www.cantrell.co/blog/2016/3/21/microsoft-azure-backup-feat-kaspersky
