Read edge DB files from HDFS or S3 in Spark - apache-spark

I have a list of DB files stored in a local folder. When I run the Spark job in local mode I can provide the local path to read those files, but when running in client or cluster mode the path is not accessible; it seems the files need to be kept on HDFS or accessed directly from S3.
I am doing the following:
java.io.File directory = new File(dbPath);
All of the DB files are present at dbPath. Is there any simple way to access that folder from HDFS or from S3, since I am running this Spark job on AWS?

To my knowledge, there isn't a standard way to do this currently. But it seems you could reverse-engineer a dump-reading protocol through a close examination of how the dump is generated.
According to edgedb-cli/dump.rs, it looks like you can open the file with a binary stream reader and skip past the leading marker bytes written at the start of a given dump file:
// dump.rs first writes the file magic and format version:
output.write_all(
    b"\xFF\xD8\x00\x00\xD8EDGEDB\x00DUMP\x00\
    \x00\x00\x00\x00\x00\x00\x00\x01"
).await?;
But then it appears the rest of the dump gets written, packet by packet, to the async output via:
// for each packet: a 1-byte marker, the SHA1 digest of the packet data,
// and the data length as a big-endian u32, followed by the data itself
header_buf.truncate(0);
header_buf.push(b'H');
header_buf.extend(
    &sha1::Sha1::from(&packet.data).digest().bytes()[..]);
header_buf.extend(
    &(packet.data.len() as u32).to_be_bytes()[..]);
output.write_all(&header_buf).await?;
output.write_all(&packet.data).await?;
with a SHA1 digest of each packet's data.
Unfortunately, we're in the dark at this point because we don't know what the byte sequences of the header_buf actually say. You'll need to investigate how the undigested contents look in comparison to any protocols used by asyncpg and Postgres to verify what your dump resembles.
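As a starting point for that investigation, here is a minimal Python sketch, assuming the layout shown above holds for every block (a one-byte marker, a 20-byte SHA1 digest, a big-endian u32 length, then the packet data). The magic constant is copied from the dump.rs snippet; the file name is hypothetical.
import hashlib
import struct

# magic bytes copied from the dump.rs snippet above
MAGIC = (
    b"\xff\xd8\x00\x00\xd8EDGEDB\x00DUMP\x00"   # protocol marker
    b"\x00\x00\x00\x00\x00\x00\x00\x01"          # format version
)

def iter_blocks(path):
    with open(path, "rb") as f:
        if f.read(len(MAGIC)) != MAGIC:
            raise ValueError("not a dump file in the expected format")
        while True:
            marker = f.read(1)          # e.g. b'H' as in the snippet above
            if not marker:
                break                   # end of file
            digest = f.read(20)         # SHA1 of the packet data
            (length,) = struct.unpack(">I", f.read(4))  # big-endian u32
            data = f.read(length)
            if hashlib.sha1(data).digest() != digest:
                raise ValueError("checksum mismatch")
            yield marker, data

for marker, data in iter_blocks("edgedb.dump"):  # hypothetical file name
    print(marker, len(data))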
Alternatively, you could prepare a shim around restore.rs with some pre-existing data loaders.

Related

Databricks reading from a zip file

I have mounted an Azure Blob Storage in the Azure Databricks workspace filestore. The mounted container has zipped files with csv files in them.
I mounted the data using dbutils:
dbutils.fs.mount(
    source = f"wasbs://{container}@{storage_account}.blob.core.windows.net",
    mount_point = mountPoint,
    extra_configs = {f"fs.azure.sas.{container}.{storage_account}.blob.core.windows.net": sasKey})
And then I followed the following tutorial:
https://learn.microsoft.com/en-us/azure/databricks/_static/notebooks/zip-files-python.html
But the shell command from that tutorial does not work, probably because the data does not reside in DBFS but in the mounted blob storage, and it gives the error:
unzip: cannot find or open `/mnt/azureblobstorage/file.zip, /mnt/azureblobstorage/Deed/file.zip.zip or /mnt/azureblobstorage/file.zip.ZIP.`
What would be the best way to read the zipped files and write into a delta table?
The "zip" utility in unix does work. I will walk thru the commands so that you can code a dynamic notebook to extract zips files. The zip file is in ADLS Gen 2 and the extracted files are placed there also.
Because we are using a shell command, this runs on a single JVM (the driver node), not across all the worker nodes. Therefore there is no parallelization.
We can see that I have the storage mounted.
The S&P 500 are the top 505 stocks and their data for 2013. All these files are in a windows zip file.
Cell 2 defines widgets (parameters) and retrieves their values. This only needs to be done once. The calling program can pass the correct parameters to the notebook.
Cell 3 creates variables in the OS (shell) for both the file path and file name.
In cell 4, we use a shell call to the unzip program to overwrite the existing directory/files with the contents of the zip file. If there is no existing directory, we just get the uncompressed files.
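A rough sketch of what cells 2-4 could look like, assuming hypothetical widget names and paths, and an unzip binary on the driver (the original notebook uses a %sh cell for the last step; subprocess is an equivalent shell call):
# Cell 2: widgets (parameters), so a calling job can pass values in
dbutils.widgets.text("file_path", "/dbfs/mnt/adls2/raw/")    # hypothetical default
dbutils.widgets.text("file_name", "sp500-2013.zip")          # hypothetical default
file_path = dbutils.widgets.get("file_path")
file_name = dbutils.widgets.get("file_name")

# Cell 3: expose the values to the OS (shell) as environment variables
import os
os.environ["FILE_PATH"] = file_path
os.environ["FILE_NAME"] = file_name

# Cell 4: shell call to unzip; -o overwrites any existing directory/files,
# -d extracts next to the zip file; runs on a single node, no parallelism
import subprocess
subprocess.run(
    ["unzip", "-o", file_path + file_name, "-d", file_path],
    check=True,
)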
Last but not least, the files do appear in the sub-directory as instructed.
To recap, it is possible to unzip files using Databricks (Spark) with either remote storage or already mounted storage (local). Use the above techniques to accomplish this task in a notebook that can be called repeatedly.
If I remember correctly, gzip is natively supported by Spark. However, it might be slow or chew up a lot of memory using DataFrames. See this article on RDDs.
https://medium.com/@parasu/dealing-with-large-gzip-files-in-spark-3f2a999fc3fa
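For instance, reading a gzipped CSV directly (hypothetical path); note that a .gz file is not splittable, so Spark reads it into a single partition, which is where the memory pressure comes from:
# Spark decompresses .gz transparently, but the whole file lands in one partition
df = spark.read.option("header", "true").csv("dbfs:/mnt/azureblobstorage/sales.csv.gz")
df = df.repartition(64)  # spread the rows out before heavier processing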
On the other hand, if you have a Windows zip file you can use a Unix script to uncompress the files and then read them. Check out this article.
https://docs.databricks.com/external-data/zip-files.html

What is the best practice writing massive amount of files to s3 using Spark

I'm trying to write about 30k-60k parquet files to s3 using Spark and it's taking a massive amount of time (40+ minutes) due to the s3 rate limit.
I wonder if there is a best practice for doing such a thing. I heard that writing the data to HDFS and then copying it using s3-dist-cp may be faster, but I can't understand why. Won't the copy from HDFS take the same amount of time because of the S3 rate limit?
Thanks for your help
There is nothing wrong with this approach and it works absolutely fine in most use cases, but there can be some challenges due to the way files are written in S3.
Two Important Concepts to Understand
S3 (Object Store) != POSIX File System: the Rename Operation
A file rename in a POSIX-based file system is a metadata-only operation: only the pointer changes and the file remains as is on disk. For example, if I have a file abc.txt and I rename it to xyz.txt, the operation is instantaneous and atomic, and xyz.txt's last-modified timestamp remains the same as abc.txt's.
Whereas in AWS S3 (an object store), a file rename under the hood is a copy followed by a delete: the source object is first copied to the destination and then the source is deleted. So "aws s3 mv" changes the last-modified timestamp of the destination file, unlike a POSIX file system. The store here is a key-value store where the key is the file path and the value is the content of the file, and there is no operation that simply changes the key. The rename time therefore depends on the size of the file. If a directory is renamed (there is nothing called a directory in S3; for simplicity we can treat a recursive set of objects under a prefix as a directory), it depends on the number of files inside the directory along with the size of each file. So in a nutshell, rename is a very expensive operation in S3 compared to a normal file system.
S3 Consistency Model
S3 comes with two kinds of consistency: a. read-after-write and b. eventual consistency, which in some cases results in file-not-found exceptions, files being added but not listed, or files being deleted but not removed from the listing.
Deep explanation:
Spark leverages Hadoop's "FileOutputCommitter" implementations to write data. Writing data involves multiple steps: on a high level, staging output files and then committing them, i.e. writing the final files. The rename step I mentioned earlier happens here, from the staging step to the final step. As you know, a Spark job is divided into multiple stages and sets of tasks, and due to the nature of distributed computing the tasks are prone to failure, so there is also a provision to re-launch the same task after a system failure or via speculative execution of slow-running tasks. That leads to the concepts of the task commit and job commit functions. Here we have two readily available algorithms for how job and task commits are done; having said that, neither algorithm is better than the other, it rather depends on where we are committing data.
mapreduce.fileoutputcommitter.algorithm.version=1
commitTask renames the data generated by a task from the task temporary directory to the job temporary directory.
When all the tasks are complete, commitJob renames all the data from the job temporary directory to the final destination and at the end creates the _SUCCESS file.
Here the driver does the work of commitJob at the end, so object stores like S3 may take a long time because of the large number of task temporary files queued up for rename operations (it's not serial, though), and the write performance is not optimized. It can work pretty well for HDFS, since a rename is cheap and just a metadata change. For AWS S3, each rename during commitJob opens up a huge number of API calls to AWS S3 and might cause issues of unexpected API call failures if the number of files is high; then again, it might not. I have seen both cases on the same job running at two different times.
mapreduce.fileoutputcommitter.algorithm.version=2
commitTask moves the data generated by a task from the task temporary directory directly to the final destination as soon as the task is complete.
commitJob basically writes the _SUCCESS file and doesn't do much.
From a high level this looks optimized, but it comes with a limitation: speculative task execution cannot be used, and if any task fails due to corrupt data we might end up with residual data in the final destination that needs a clean-up. So this algorithm doesn't give 100% data correctness and doesn't work for use cases where we need to append data to existing files. Even though it gives optimized results, it comes with a risk. The reason for the better performance is basically the lower number of rename operations compared to algorithm 1 (there are still renames). Here we might encounter file-not-found exceptions because commitTask writes the file in a temporary path and immediately renames it, so there are slight chances of eventual consistency issues.
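For reference, a minimal sketch of switching the committer algorithm when building the session (the same key can be passed with --conf to spark-submit); the app name and output path are hypothetical:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-write-example")
    # algorithm 2: tasks rename straight to the final destination (see trade-offs above)
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)

spark.range(1_000).write.mode("overwrite").parquet("s3://some-bucket/output-data/example/")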
Best Practices
Here are a few I think we can use while writing Spark data processing applications:
If you have an HDFS cluster available, write data from Spark to HDFS and then copy it to S3 to persist it; s3-dist-cp can be used to copy the data from HDFS to S3 optimally, and this way we avoid all the rename operations. With AWS EMR clusters that run only for the duration of the compute and are terminated afterwards, persisting results this way looks preferable (a sketch follows below).
Try to avoid writing files and reading them again and again unless there are consumers for the files; Spark is well known for in-memory processing, and careful persistence/caching of data in memory will help optimize the application's run time.
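A minimal sketch of the first practice, assuming an EMR cluster where s3-dist-cp is available on the master node (it can also be submitted as an EMR step); the paths and bucket are hypothetical:
import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-then-s3").getOrCreate()
df = spark.range(10_000)  # stand-in for the real result

# 1. write to HDFS, where renames are cheap metadata operations
df.write.mode("overwrite").parquet("hdfs:///tmp/output-data/")

# 2. copy the finished files to S3 in one optimized pass
subprocess.run(
    ["s3-dist-cp", "--src", "hdfs:///tmp/output-data/", "--dest", "s3://some-bucket/output-data/"],
    check=True,
)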

Databricks Filestore = 0

I just ran this:
dbutils.fs.ls("dbfs:/FileStore/")
I see this result:
[FileInfo(path='dbfs:/FileStore/import-stage/', name='import-stage/', size=0),
FileInfo(path='dbfs:/FileStore/jars/', name='jars/', size=0),
FileInfo(path='dbfs:/FileStore/job-jars/', name='job-jars/', size=0),
FileInfo(path='dbfs:/FileStore/plots/', name='plots/', size=0),
FileInfo(path='dbfs:/FileStore/tables/', name='tables/', size=0)]
Shouldn't there be something in filestore? I have hundreds of GB of data in a lake. I am having all kinds of problems getting Databricks to find these files. When I use Azure Data Factory, everything works perfectly fine. It's starting to drive me crazy!
For instance, when I run this:
dbutils.fs.ls("/mnt/rawdata/2019/06/28/parent/")
I get this message:
java.io.FileNotFoundException: File/6199764716474501/mnt/rawdata/2019/06/28/parent does not exist.
I have tens of thousands of files in my lake! I can't understand why I can't get a list of these files!!
In Azure Databricks, this is expected behaviour.
For Files it displays the actual file size.
For Directories it displays the size=0
Example: in dbfs:/FileStore/ I have three files and three folders. Checking the file size using the Databricks CLI:
dbfs ls -l dbfs:/FileStore/
When you check out the result using dbutils as follows:
dbutils.fs.ls("dbfs:/FileStore/")
An important point to remember when reading files larger than 2 GB:
Support only files less than 2GB in size. If you use local file I/O APIs to read or write files larger than 2GB you might see corrupted files. Instead, access files larger than 2GB using the DBFS CLI, dbutils.fs, or Spark APIs or use the /dbfs/ml folder described in Local file APIs for deep learning.
If you write a file using the local file I/O APIs and then immediately try to access it using the DBFS CLI, dbutils.fs, or Spark APIs, you might encounter a FileNotFoundException, a file of size 0, or stale file contents. That is expected because the OS caches writes by default. To force those writes to be flushed to persistent storage (in our case DBFS), use the standard Unix system call sync.
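To make that concrete, a small sketch of the write-then-sync pattern in a notebook (the path is hypothetical):
import os

dbutils.fs.mkdirs("dbfs:/FileStore/tmp/")

# write through the local file I/O API (the /dbfs/ FUSE mount)
with open("/dbfs/FileStore/tmp/example.txt", "w") as f:
    f.write("hello from local file I/O\n")

os.sync()  # flush cached writes to persistent storage (DBFS)

display(dbutils.fs.ls("dbfs:/FileStore/tmp/"))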
There are multiple ways to solve this issue. You may check out a similar SO thread answered by me.
Hope this helps.

Using S3 data in a Spark Application

I am new to Spark and have some fundamental doubts. I am working on a PySpark application that is supposed to process 500K items. The current implementation is not efficient and takes forever to complete.
I will briefly explain the tasks.
The application processes an S3 directory. It is supposed to process all the files present under s3://some-bucket/input-data/. The S3 directory structure looks like this:
s3://some-bucket/input-data/item/Q12/sales.csv
s3://some-bucket/input-data/item/Q13/sales.csv
s3://some-bucket/input-data/item/Q14/sales.csv
The CSV files don't have an item identifier column; the name of the directory is the item identifier, like Q11, Q12, etc.
The application has a UDF defined which downloads the data using boto3, processes it, and then dumps the result back to S3 in a directory structure like this:
s3://some-bucket/output-data/item/Q12/profit.csv
s3://some-bucket/output-data/item/Q13/profit.csv
s3://some-bucket/output-data/item/Q14/profit.csv
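For context, roughly what the described per-item UDF approach might look like; the bucket, the toy processing step, and the small item list are hypothetical stand-ins:
import io
import boto3
import pandas as pd
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("per-item-udf").getOrCreate()

# in practice this would hold ~500K item identifiers
items_df = spark.createDataFrame([("Q12",), ("Q13",), ("Q14",)], ["item_id"])

def handle_item(item_id):
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="some-bucket", Key=f"input-data/item/{item_id}/sales.csv")
    sales = pd.read_csv(io.BytesIO(obj["Body"].read()))
    profit = sales.describe()  # placeholder for the real per-item processing
    s3.put_object(
        Bucket="some-bucket",
        Key=f"output-data/item/{item_id}/profit.csv",
        Body=profit.to_csv().encode("utf-8"),
    )
    return "ok"

# at least one GET and one PUT per item, so hundreds of thousands of S3 calls
handle_item_udf = F.udf(handle_item, StringType())
items_df.withColumn("status", handle_item_udf("item_id")).collect()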
Making 500K API calls to S3 for the data doesn't seem right to me. I am running the Spark application on EMR; should I download all the data as a bootstrap step?
Can S3DistCp (s3-dist-cp) solve the issue by downloading the whole data set to HDFS so that the workers/nodes can access it there? Suggestions on how to use s3-dist-cp would be very helpful.

How to refer to the local filesystem where spark-submit is executed on?

Is it possible to write the output of a Spark program's result on the driver node when it is processed on a cluster?
df = sqlContext.read.load("hdfs://....")
result = df.groupby('abc', 'cde').count()
result.write.save("hdfs:...resultfile.parquet", format="parquet")  # this works fine
result = result.collect()
with open("<my drivernode local directory>//textfile", "w") as myfile:
    myfile.write(str(result))  # I'll convert to a Python string before writing
Could someone give me some idea of how to refer to the local filesystem of the machine where I ran spark-submit?
tl;dr Use . (the dot); the current working directory is resolved by the API.
From what I understand from your question, you are asking about saving local files in driver or workers while running spark.
This is possible and is quite straightforward.
The point is that, in the end, the driver and workers are running Python, so you can use Python's "open", "with", "write" and so on.
To do this on the workers you'll need to run "foreach" or "map" on your RDD and then save locally (this can be tricky, as you may have more than one partition on each executor).
Saving from the driver is even easier: after you have collected the data you have a regular Python object and you can save it in any standard Pythonic way.
BUT
When you save any local file, whether on a worker or the driver, that file is created inside the container the worker or driver is running in. Once the execution is over those containers are deleted and you will not be able to access any local data stored in them.
The way to solve this is to move those local files somewhere else while the container is still alive. You can do this with a shell command, by inserting into a database, and so on.
For example, I use this technique to insert the results of calculations into MySQL without needing to collect. I save results locally on the workers as part of a "map" operation and then upload them using MySQL's "LOAD DATA LOCAL INFILE".
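A minimal sketch of that pattern, assuming the pymysql package with local_infile enabled on both client and server, and a hypothetical calc_results table; the toy RDD stands in for the real calculation:
import csv
import os
import tempfile

import pymysql
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-into-mysql").getOrCreate()
results_rdd = spark.sparkContext.parallelize([(1, 2.0), (2, 3.5), (3, 7.25)], 2)

def load_partition(rows):
    rows = list(rows)
    if not rows:
        return iter([])
    # write this partition to a local CSV inside the worker's container
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    # push it to MySQL before the container (and its local disk) goes away
    conn = pymysql.connect(host="mysql.example.com", user="etl", password="secret",
                           database="results", local_infile=True)
    try:
        with conn.cursor() as cur:
            cur.execute(
                f"LOAD DATA LOCAL INFILE '{path}' INTO TABLE calc_results "
                "FIELDS TERMINATED BY ','"
            )
        conn.commit()
    finally:
        conn.close()
        os.remove(path)
    return iter([len(rows)])

loaded_counts = results_rdd.mapPartitions(load_partition).collect()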
