Databricks reading from a zip file

I have mounted an Azure Blob Storage container in the Azure Databricks workspace filestore. The mounted container has zipped files with CSV files in them.
I mounted the data using dbutils:
dbutils.fs.mount(
    source = f"wasbs://{container}@{storage_account}.blob.core.windows.net",
    mount_point = mountPoint,
    extra_configs = {f"fs.azure.sas.{container}.{storage_account}.blob.core.windows.net": sasKey})
I then followed this tutorial:
https://learn.microsoft.com/en-us/azure/databricks/_static/notebooks/zip-files-python.html
But the shell command in that notebook does not work, probably because the data does not reside in DBFS but in the mounted blob storage, and it gives the error:
unzip: cannot find or open `/mnt/azureblobstorage/file.zip, /mnt/azureblobstorage/Deed/file.zip.zip or /mnt/azureblobstorage/file.zip.ZIP.`
What would be the best way to read the zipped files and write into a delta table?

The "zip" utility in unix does work. I will walk thru the commands so that you can code a dynamic notebook to extract zips files. The zip file is in ADLS Gen 2 and the extracted files are placed there also.
Because we are using a shell command, this runs at the JVM know as the executor not all the worked nodes. Therefore there is no parallelization.
We can see that I have the storage mounted.
The S&P 500 are the top 505 stocks and their data for 2013. All these files are in a windows zip file.
Cell 2 defines widgets (parameters) and retrieves their values. This only needs to be done once. The call program can pass the correct parameters to the program
Cell 3 creates variables in the OS (shell) for both the file path and file name.
In cell 4, we use a shell call to the unzip program to overwrite the existing directory/files with the contents of the zip file. If there is no existing directory, we just get the uncompressed files.
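A minimal sketch of those three cells, assuming a Databricks notebook where dbutils is available; the widget names, paths, and file name are hypothetical:

# Cell 2 - define widgets (parameters) and retrieve their values
dbutils.widgets.text("zip_path", "/dbfs/mnt/datalake/raw/")
dbutils.widgets.text("zip_file", "sp500-2013.zip")
zip_path = dbutils.widgets.get("zip_path")
zip_file = dbutils.widgets.get("zip_file")

# Cell 3 - create OS (shell) variables for the file path and file name
import os
os.environ["ZIP_PATH"] = zip_path
os.environ["ZIP_FILE"] = zip_file

# Cell 4 - in the walkthrough this is a %sh cell:
#   unzip -o $ZIP_PATH$ZIP_FILE -d $ZIP_PATH
# the same call issued from Python on the driver, for completeness:
import subprocess
subprocess.run(["unzip", "-o", zip_path + zip_file, "-d", zip_path], check=True)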
Last but not least, the files do appear in the sub-directory as instructed.
To recap, it is possible to unzip files using Databricks (Spark) with either remote storage or already mounted storage (local). Use the above techniques to accomplish this task in a notebook that can be called repeatedly.

If I remember correctly, gzip is natively supported by Spark. However, it might be slow or chew up a lot of memory when using DataFrames. See this article on RDDs.
https://medium.com/@parasu/dealing-with-large-gzip-files-in-spark-3f2a999fc3fa
On the other hand, if you have a Windows zip file you can use a Unix script to uncompress the files and then read them. Check out this article.
https://docs.databricks.com/external-data/zip-files.html
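For illustration, a minimal sketch that reads gzipped CSVs from a mounted path and writes them to a Delta table, as the original question asked; the path and table name are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark reads .csv.gz files natively; no manual decompression is needed,
# but each gzip file is read by a single task (gzip is not splittable).
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/mnt/azureblobstorage/extracted/*.csv.gz"))

(df.write
   .format("delta")
   .mode("overwrite")
   .saveAsTable("raw_stock_data"))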

Related

Splitting folders in a Google Bucket either from the Console or from a Notebook

Let's say there are 10 folders in my bucket. I want to split the contents of the folders in a ratio of 0.8, 0.1, 0.1 and move them to three new folders: Train, Test and Val. I had earlier done this process by downloading the folders, splitting them and uploading them again. I now want to split the folders in the bucket itself.
I was able to connect to the bucket using the "google-cloud-storage" library from a Notebook using the post here, and I was able to download and upload files. I'm not sure how to achieve splitting the folders without downloading the content.
Appreciate the help.
PS: I don't need the full code, just how to approach will do
With Cloud Storage you can only READ and WRITE (CREATE/DELETE). You can't move a blob inside the bucket; even if a "move" operation exists in the console or in some client libraries, it is a WRITE/CREATE of the content under another path followed by a DELETE of the previous path.
Thus, your strategy must follow the same logic:
Perform a gsutil ls to list all the files
Copy (or move) 80% into one directory, and 10% and 10% into the two other directories
Delete the old directory (useless if you used move operation).
It's quicker than downloading and uploading the files, but it still takes time. Because it's not a file system, only API calls, each file costs a request, and if you have thousands of files it can take hours!
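For example, a minimal sketch of that strategy with the google-cloud-storage client mentioned in the question; the bucket and folder names are hypothetical, and every copy/delete is its own API call:

import random
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")

# List everything under one source folder and shuffle for a random split
blobs = list(client.list_blobs("my-bucket", prefix="folder1/"))
random.shuffle(blobs)

n = len(blobs)
splits = {
    "Train/": blobs[: int(0.8 * n)],
    "Test/": blobs[int(0.8 * n): int(0.9 * n)],
    "Val/": blobs[int(0.9 * n):],
}

for target, subset in splits.items():
    for blob in subset:
        new_name = target + blob.name.split("/")[-1]
        bucket.copy_blob(blob, bucket, new_name)  # server-side copy, nothing is downloaded
        blob.delete()                             # makes the copy a "move"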

Read edge DB files from HDFS or S3 in Spark

I have a list of DB files stored in a local folder. When I run the Spark job in local mode I can provide a local path to read those files, but when running in client or cluster mode the path is not accessible. It seems they need to be kept in HDFS or accessed directly from S3.
I am doing the following:
java.io.File directory = new File(dbPath);
At dbPath, all of the DB files are present. Is there any simple way to access that folder of files from HDFS or from S3, as I am running this Spark job on AWS?
To my knowledge, there isn't a standard way to do this currently. But it seems you could reverse-engineer a dump-reading protocol through a close examination of how the dump is generated.
According to edgedb-cli/dump.rs, it looks like you can open the file with a binary stream reader and ignore the first 15 bytes of a given dump file.
output.write_all(
b"\xFF\xD8\x00\x00\xD8EDGEDB\x00DUMP\x00\
\x00\x00\x00\x00\x00\x00\x00\x01"
).await?;
But then it appears the remaining dump gets written to a mutable async future result via:
header_buf.truncate(0);
header_buf.push(b'H');
header_buf.extend(
    &sha1::Sha1::from(&packet.data).digest().bytes()[..]);
header_buf.extend(
    &(packet.data.len() as u32).to_be_bytes()[..]);
output.write_all(&header_buf).await?;
output.write_all(&packet.data).await?;
with a SHA-1 digest of the packet data.
Unfortunately, we're in the dark at this point because we don't know what the byte sequences of the header_buf actually say. You'll need to investigate how the undigested contents look in comparison to any protocols used by asyncpg and Postgres to verify what your dump resembles.
Alternatively, you could prepare a shim to the restore.rs with some pre-existing data loaders.
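As a starting point for that investigation, a minimal sketch that pulls dump files through Spark and checks them against the magic prefix written by dump.rs above; the S3 path is hypothetical and s3a:// access is assumed to be configured on the cluster:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# The prefix written by dump.rs, as quoted above
MAGIC = (b"\xFF\xD8\x00\x00\xD8EDGEDB\x00DUMP\x00"
         b"\x00\x00\x00\x00\x00\x00\x00\x01")

# binaryFiles yields (path, whole-file-bytes) pairs - fine for modest dumps,
# not for ones that don't fit in executor memory.
dumps = sc.binaryFiles("s3a://my-bucket/edgedb-dumps/*.dump")

checks = dumps.mapValues(lambda data: data.startswith(MAGIC)).collect()
for path, ok in checks:
    print(path, "looks like an EdgeDB dump" if ok else "unexpected header")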

Databricks Filestore = 0

I just ran this:
dbutils.fs.ls("dbfs:/FileStore/")
I see this result:
[FileInfo(path='dbfs:/FileStore/import-stage/', name='import-stage/', size=0),
FileInfo(path='dbfs:/FileStore/jars/', name='jars/', size=0),
FileInfo(path='dbfs:/FileStore/job-jars/', name='job-jars/', size=0),
FileInfo(path='dbfs:/FileStore/plots/', name='plots/', size=0),
FileInfo(path='dbfs:/FileStore/tables/', name='tables/', size=0)]
Shouldn't there be something in filestore? I have hundreds of GB of data in a lake. I am having all kinds of problems getting Databricks to find these files. When I use Azure Data Factory, everything works perfectly fine. It's starting to drive me crazy!
For instance, when I run this:
dbutils.fs.ls("/mnt/rawdata/2019/06/28/parent/")
I get this message:
java.io.FileNotFoundException: File/6199764716474501/mnt/rawdata/2019/06/28/parent does not exist.
I have tens of thousands of files in my lake! I can't understand why I can't get a list of these files!!
In Azure Databricks, this is expected behaviour.
For Files it displays the actual file size.
For Directories it displays size=0.
Example: In dbfs:/FileStore/ I have three files (shown in white) and three folders (shown in blue). Checking the file size using the Databricks CLI:
dbfs ls -l dbfs:/FileStore/
When you check out the result using dbutils as follows:
dbutils.fs.ls("dbfs:/FileStore/")
Important point to remember when reading files larger than 2 GB:
Support only files less than 2GB in size. If you use local file I/O APIs to read or write files larger than 2GB you might see corrupted files. Instead, access files larger than 2GB using the DBFS CLI, dbutils.fs, or Spark APIs or use the /dbfs/ml folder described in Local file APIs for deep learning.
If you write a file using the local file I/O APIs and then immediately try to access it using the DBFS CLI, dbutils.fs, or Spark APIs, you might encounter a FileNotFoundException, a file of size 0, or stale file contents. That is expected because the OS caches writes by default. To force those writes to be flushed to persistent storage (in our case DBFS), use the standard Unix system call sync.
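For illustration, a minimal sketch of that write-then-sync pattern, assuming a Databricks notebook where dbutils is available; the path is hypothetical:

import os

# Write through the local file API (the /dbfs FUSE mount)
local_path = "/dbfs/tmp/example-output.txt"
with open(local_path, "w") as f:
    f.write("hello from the local file API\n")

# Force the OS write cache to be flushed to DBFS before reading it back
os.system("sync")

# The file should now be visible through dbutils.fs / Spark as well
print(dbutils.fs.head("dbfs:/tmp/example-output.txt"))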
There are multiple ways to solve this issue. You may check out a similar SO thread answered by me.
Hope this helps.

Azure Data Factory deflate without creating a folder

I have a Data Factory v2 job which copies files from an SFTP server to an Azure Data Lake Gen2.
There is a mix of .csv files and .zip files (each containing only one csv file).
I have one dataset for copying the csv files and another for copying zip files (with Compression type set to ZipDeflate). The problem is that ZipDeflate creates a new folder that contains the csv file, and I need this to respect the folder hierarchy without creating any folders.
Is this possible in Azure Data Factory?
Good question, I ran into similar trouble* and it doesn't seem to be well documented.
If I remember correctly, Data Factory assumes ZipDeflate could contain more than one file and appears to create a folder no matter what.
If, on the other hand, you have gzip files, which only contain a single file, then it will create only that file.
You'll probably already know this bit, but having it in the forefront of your mind helped me realise the sensible default Data Factory has:
My understanding is that the Zip standard is an archive format which happens to use the Deflate algorithm. Being an archive format, it naturally can contain multiple files.
Whereas gzip (for example) is just the compression algorithm; it doesn't support multiple files (unless tar-archived first), so it will decompress to just a file without a folder.
You could have an additional data factory step to take the hierarchy and copy it to a flat folder perhaps, but that leads to random file names (which you may or may not be happy with). For us it didn't work as our next step in the pipeline needed predictable filenames.
N.B. Data Factory does not move files, it copies them, so if they're very large this could be a pain. You can, however, trigger a metadata move operation via the Data Lake Store API or PowerShell, etc.
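For example, a minimal sketch of such a metadata move with the azure-storage-file-datalake Python SDK (ADLS Gen2); the account, filesystem, and paths are hypothetical, and a rename does not copy any data:

from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://myaccount.dfs.core.windows.net",
    credential="<account-key-or-sas>")

fs = service.get_file_system_client("my-filesystem")
file_client = fs.get_file_client("landing/archive-name/data.csv")

# rename_file is a metadata operation; the new name must include the filesystem
file_client.rename_file(f"{fs.file_system_name}/landing/data.csv")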
*Mine was a slightly crazier situation in that I was receiving files named .gz from a source system that were in fact zip files in disguise! In the end the best option was to ask our source system to change to true gzip files.

Azure Import/Export tool dataset.csv and multiple session folders

I am in the process of copying a large set of data to an Azure Blob Storage area. My source set of data has a large number of files that I do not want to move, so my first thought was to create a DataSet.csv file of just the files I do want to copy. As a test, I created a csv file where each row is a single file that I want to include.
BasePath,DstBlobPathOrPrefix,BlobType,Disposition,MetadataFile,PropertiesFile
"\\SERVER\Share\Folder1\Item1\Page1\full.jpg","containername/Src/Folder1/Item1/Page1/full.jpg",BlockBlob,overwrite,"None",None
"\\SERVER\Share\Folder1\Item1\Page1\thumb.jpg","containername/Src/Folder1/Item1/Page1/thumb.jpg",BlockBlob,overwrite,"None",None
etc.
When I run the Import/Export tool (WAImportExport.exe) it seems to create a single folder on the destination for each file, so that it ends up looking like:
session#1
-session#1-0
-session#1-1
-session#1-2
etc.
All files share the same base path, but their filenames do appear in the CSV. Is there any way to avoid this, so that all the files go into a single "session#1" folder? If possible, I'd like to avoid creating N-thousand folders on the destination drive.
I don't think you should worry about the way the files are stored on the disk, as they will be converted back to the directory structure you specified in the .csv file.
Here's what the documentation says:
How does the WAImportExport tool work on multiple source dir and disks?
If the data size is greater than the disk size, the WAImportExport tool will distribute the data across the disks in an optimized way. The data copy to multiple disks can be done in parallel or sequentially. There is no limit on the number of disks the data can be written to simultaneously. The tool will distribute data based on disk size and folder size. It will select the disk that is most optimized for the object size. The data, when uploaded to the storage account, will be converged back to the specified directory structure.
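If it helps, a minimal sketch for generating dataset CSV rows like the ones in the question, for only the files you want to copy; the share path, container prefix, and file filter are hypothetical:

import csv
from pathlib import Path

src_root = Path(r"\\SERVER\Share")      # hypothetical UNC share
dst_prefix = "containername/Src"        # hypothetical container prefix

with open("DataSet.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["BasePath", "DstBlobPathOrPrefix", "BlobType",
                     "Disposition", "MetadataFile", "PropertiesFile"])
    for path in src_root.rglob("*.jpg"):             # only the files to include
        rel = path.relative_to(src_root).as_posix()
        writer.writerow([str(path), f"{dst_prefix}/{rel}",
                         "BlockBlob", "overwrite", "None", "None"])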
