How to zip a folder in Azure Data Lake Storage Gen2? - Databricks

I am looking for a way to zip all the files stored in an ADLS folder and then move the resulting archive to another ADLS folder.
I am trying to do this operation in Databricks.
Note: I am not allowed to mount the ADLS data to DBFS.
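One possible approach, as a minimal sketch that avoids mounting: read the files with the azure-storage-file-datalake Python SDK on the Databricks driver, build the zip in memory, and upload the archive back to ADLS. The account, key, container and folder paths below are placeholders, and the SDK would need to be installed on the cluster (e.g. pip install azure-storage-file-datalake).

import io, os, zipfile
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholders - replace with your own account, key, container and paths
service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential="<account-key>")
fs = service.get_file_system_client("<container>")

source_dir = "source/folder"
target_path = "target/folder/archive.zip"

# Build the zip in memory on the driver (fine for modest data volumes)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    for path in fs.get_paths(path=source_dir, recursive=True):
        if not path.is_directory:
            data = fs.get_file_client(path.name).download_file().readall()
            zf.writestr(os.path.relpath(path.name, source_dir), data)

# Upload the archive into the target folder
buffer.seek(0)
fs.get_file_client(target_path).upload_data(buffer.read(), overwrite=True)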

Related

Matillion: Delete files from Azure Blob Storage Container and Windows Fileshare

I have a use case where I am transferring XML files from a Windows fileshare to Azure Blob Storage and then loading the data into Snowflake tables. I am using Matillion to achieve this.
The Windows fileshare receives a zipped XML file which contains .xml and .xml.chk files. I am using the Azure Blob Storage component of Matillion to copy the .xml files into a Snowflake table and have set Purge = True to delete them afterwards.
I need help deleting the leftover .xml.chk files from the Blob Storage container. Also, once the data load is complete, I would like to delete the zipped files from the Windows fileshare.
Thanks,
Shivroopa
You can delete the files from Blob Storage using the Matillion Python Script component (Orchestration -> Scripting -> Python Script).
Here is an example of Python code to delete blob items and containers:
Delete Blob Example
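For reference, a minimal sketch of that kind of cleanup with the azure-storage-blob Python SDK (usable from the Python Script component if the library is available); the connection string and container name are placeholders, and the .xml.chk suffix filter matches the leftover files in the question.

from azure.storage.blob import BlobServiceClient

# Placeholders - use your own connection string and container name
service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("<container-name>")

# Delete the leftover .xml.chk files
for blob in container.list_blobs():
    if blob.name.endswith(".xml.chk"):
        container.delete_blob(blob.name)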
I don't see a way to delete files on the Windows machine from Matillion other than creating an API endpoint on the fileshare and calling the API from Matillion.

List directories in Azure Data Lake gen2 with hierarchical namespace

I am storing data on Azure Data Lake Gen2 with the hierarchical namespace enabled. This lets me create and rename directories, as in a traditional filesystem.
Is there an efficient way to list the sub-directories of a given directory? I can use FileSystemClient.get_paths("my_directory"), but this method scans through all files and subdirectories under "my_directory".
Using the recursive=False flag in the get_paths() method gets the job done.
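For example, a minimal sketch with the azure-storage-file-datalake SDK (the account, credential and container names are placeholders): passing recursive=False returns only the immediate children of my_directory, and the is_directory flag keeps just the sub-directories.

from azure.storage.filedatalake import DataLakeServiceClient

# Placeholders - replace with your own account, credential and container
service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential="<account-key>")
fs = service.get_file_system_client("<container>")

# recursive=False lists only the direct children of my_directory
subdirs = [p.name for p in fs.get_paths(path="my_directory", recursive=False)
           if p.is_directory]
print(subdirs)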

I'm getting continuous blob files in Blob Storage. I have to load them in Databricks and put them in Azure SQL DB, with Data Factory orchestrating the pipeline

I receive data continuously in Blob Storage. Initially I had 5 blob files, which I was able to load from Blob Storage to Azure SQL DB using Databricks, automated with Data Factory. The problem is that when newer files arrive in Blob Storage, Databricks loads them along with the older files and sends everything to Azure SQL DB. I don't want those old files reloaded; each run should pick up only the newer ones, so the same data is not loaded into Azure SQL DB again and again.
The easiest way to do that is to simply archive the files you have just read into a new folder; call it archiveFolder. Say your Databricks job is reading from the following directory:
mnt
  sourceFolder
    file1.txt
    file2.txt
    file3.txt
You run your code, ingest the files, and load them into SQL Server. Then you simply archive these files (move them from sourceFolder into archiveFolder). In Databricks this can be done with the following command:
dbutils.fs.mv(sourcefilePath, archiveFilePath, True)
So, next time your code runs, you will only have the new files in your sourceFolder.
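A minimal sketch of that pattern, assuming the placeholder paths above and that this runs in a Databricks notebook where dbutils is available:

# Placeholder paths matching the layout above
source_folder = "/mnt/sourceFolder/"
archive_folder = "/mnt/archiveFolder/"

# After the load into Azure SQL DB succeeds, move every ingested file out of the source folder
for f in dbutils.fs.ls(source_folder):
    dbutils.fs.mv(f.path, archive_folder + f.name, True)

On the next run, listing source_folder returns only the files that arrived since the previous archive step.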

Is the transfer process for files from HDFS to ADLS Gen2 using the command line the same as for transfers to BLOB?

In my project, we have been using BLOBs on Azure. We were able to upload ORC files into an existing BLOB container named, say, student_dept in quite a handy manner using:
hdfs dfs -copyFromLocal myfolder/student_remarks/*.orc wasbs://student_dept@universitygroup.blob.core.windows.net/DEPT/STUDENT_REMARKS
And we have a Hive EXTERNAL table, STUDENT_REMARKS, created on the student_dept BLOB. This way, we can very easily access our data from the cloud using Hive queries.
Now, we're trying to shift from BLOB storage to ADLS Gen2 for storing the ORC files, and I'm trying to understand the impact this change would have on our upload/data retrieval process.
I'm totally new to Azure, and what I want to know is how I upload the ORC files from my HDFS to ADLS Gen2 storage. How different is it?
Does the same command with a different destination (ADLS Gen2 instead of BLOB) work, or is there something extra that needs to be done in order to upload data to ADLS Gen2?
Can someone please help me with your inputs on this?
I didn't give it a try, but as per docs like this and this, you can use a command like the one below for ADLS Gen2:
hdfs dfs -copyFromLocal myfolder/student_remarks/*.orc abfs://student_dept@universitygroup.dfs.core.windows.net/DEPT/STUDENT_REMARKS

Downloading parquet files from Azure Blob Storage. File and folder with the same name

I have created Parquet files on Azure Blob Storage, and now I want to download them. The problem is that the download keeps failing. I think it's because there is a file and a folder with the same name. Why is that? Do I just need the folder, since the file is only 0 B?
The error I get looks like:
It's saying that because it already downloaded the 0 B file
As mentioned in the comments, instead of downloading the actual file, you might have downloaded the Block Blob file, which is Azure's implementation for providing filesystem-like access when Blob Storage is used as a filesystem (Azure HDInsight clusters have their HDFS backed by Azure Blob Storage). Object stores like Azure Blob Storage and AWS S3 have this sort of custom implementation to provide (or simulate) a seamless filesystem-like experience.
In short, don't download the Block Blob. Download the actual files.
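A minimal sketch of that with the azure-storage-blob Python SDK (the connection string, container name and blob prefix are placeholders): list the blobs under the Parquet path, skip the zero-byte placeholder, and download only the real part files.

import os
from azure.storage.blob import BlobServiceClient

# Placeholders - use your own connection string, container and blob prefix
service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("<container-name>")

local_dir = "downloaded_parquet"
os.makedirs(local_dir, exist_ok=True)

# Skip the 0 B placeholder blob and fetch only the actual part files
for blob in container.list_blobs(name_starts_with="path/to/table.parquet/"):
    if blob.size == 0:
        continue
    local_path = os.path.join(local_dir, os.path.basename(blob.name))
    with open(local_path, "wb") as f:
        f.write(container.download_blob(blob.name).readall())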
