I just ran this:
dbutils.fs.ls("dbfs:/FileStore/")
I see this result:
[FileInfo(path='dbfs:/FileStore/import-stage/', name='import-stage/', size=0),
FileInfo(path='dbfs:/FileStore/jars/', name='jars/', size=0),
FileInfo(path='dbfs:/FileStore/job-jars/', name='job-jars/', size=0),
FileInfo(path='dbfs:/FileStore/plots/', name='plots/', size=0),
FileInfo(path='dbfs:/FileStore/tables/', name='tables/', size=0)]
Shouldn't there be something in filestore? I have hundreds of GB of data in a lake. I am having all kinds of problems getting Databricks to find these files. When I use Azure Data Factory, everything works perfectly fine. It's starting to drive me crazy!
For instance, when I run this:
dbutils.fs.ls("/mnt/rawdata/2019/06/28/parent/")
I get this message:
java.io.FileNotFoundException: File/6199764716474501/mnt/rawdata/2019/06/28/parent does not exist.
I have tens of thousands of files in my lake! I can't understand why I can't get a listing of these files!
In Azure Databricks, this is expected behaviour.
For files, it displays the actual file size.
For directories, it displays size=0.
Example: In dbfs:/FileStore/ I have three files (shown in white) and three folders (shown in blue). Checking the file sizes using the Databricks CLI:
dbfs ls -l dbfs:/FileStore/
When you check out the result using dbutils as follows:
dbutils.fs.ls("dbfs:/FileStore/")
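For example, a small sketch that flags directories versus files in that listing, using only the FileInfo fields shown above:
# Minimal sketch: directories come back with size=0 and a trailing '/' in their name.
for info in dbutils.fs.ls("dbfs:/FileStore/"):
    kind = "dir " if info.name.endswith("/") else "file"
    print(kind, info.path, info.size)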
Important point to remember when reading files larger than 2 GB:
Local file I/O APIs support only files less than 2 GB in size. If you use local file I/O APIs to read or write files larger than 2 GB you might see corrupted files. Instead, access files larger than 2 GB using the DBFS CLI, dbutils.fs, or Spark APIs, or use the /dbfs/ml folder described in Local file APIs for deep learning.
If you write a file using the local file I/O APIs and then immediately try to access it using the DBFS CLI, dbutils.fs, or Spark APIs, you might encounter a FileNotFoundException, a file of size 0, or stale file contents. That is expected because the OS caches writes by default. To force those writes to be flushed to persistent storage (in our case DBFS), use the standard Unix system call sync.
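As a hedged illustration of that last point (the path below is hypothetical):
import os

# Write through the local file API (the /dbfs FUSE mount), then force the OS to
# flush its write cache so that DBFS-level APIs see the finished file.
os.makedirs("/dbfs/FileStore/tmp", exist_ok=True)
with open("/dbfs/FileStore/tmp/example.txt", "w") as f:
    f.write("hello")

os.system("sync")  # flush cached writes to persistent storage
dbutils.fs.ls("dbfs:/FileStore/tmp/")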
There are multiple ways to solve this issue. You may check out a similar SO thread I answered.
Hope this helps.
Related
I am relatively new to Spark/PySpark, so any help is much appreciated.
Currently we have files being delivered to an Azure data lake hourly into a file directory, for example:
hour1.csv
hour2.csv
hour3.csv
I am using Databricks to read the files in the directory using the code below:
sparkdf = spark.read.format("csv").option("recursiveFileLookup", "true").option("header", "true").schema(schema).load(file_location)
Each of the CSV files is about 5 KB, and all have the same schema.
What I am unsure about is how scalable spark.read is. Currently we are processing about 2,000 such small files, and I am worried that there is a limit on the number of files being processed. Is there a limit, such as a maximum of 5,000 files, beyond which my code above breaks?
From what I have read online, I believe data size is not an issue with the method above; Spark can read petabytes' worth of data (comparatively, our total data size is still very small), but there is no mention of the number of files it is able to process - educate me if I am wrong.
Any explanation is very much appreciated.
Thank you
The limit is your driver's memory.
When reading a directory, the driver lists it (depending on the initial size, it may parallelize the listing to executors, but it collects the results either way).
After having the list of files, it creates tasks for the executors to run.
With that in mind, if the list is too large to fit in the driver's memory, you will have issues.
You can always increase the driver's memory to manage it, or add a preprocessing step to merge the files (GCS has a gsutil compose command which can merge files without downloading them).
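If you go the increase-the-driver-memory route, here is a minimal sketch of the relevant setting; the value is illustrative, and on Databricks you would normally pick a larger driver node type or set this in the cluster's Spark config rather than in code:
from pyspark.sql import SparkSession

# Illustrative only: the driver holds the full file listing, so give it headroom.
# spark.driver.memory must be set before the driver JVM starts, so in a notebook
# this belongs in the cluster configuration rather than in notebook code.
spark = (
    SparkSession.builder
    .appName("many-small-csv-files")       # hypothetical app name
    .config("spark.driver.memory", "16g")  # illustrative value
    .getOrCreate()
)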
I have mounted an Azure Blob Storage in the Azure Databricks workspace filestore. The mounted container has zipped files with csv files in them.
I mounted the data using dbutils:
dbutils.fs.mount(
source = f"wasbs://{container}#{storage_account}.blob.core.windows.net",
mount_point = mountPoint,
extra_configs = {f"fs.azure.sas.{container}.{storage_account}.blob.core.windows.net":sasKey})
And then I followed the following tutorial:
https://learn.microsoft.com/en-us/azure/databricks/_static/notebooks/zip-files-python.html
But the shell command in the above does not work, probably because the data does not reside in DBFS but in mounted Blob Storage, and it gives the error:
unzip: cannot find or open `/mnt/azureblobstorage/file.zip, /mnt/azureblobstorage/Deed/file.zip.zip or /mnt/azureblobstorage/file.zip.ZIP.`
What would be the best way to read the zipped files and write them into a Delta table?
The "zip" utility in unix does work. I will walk thru the commands so that you can code a dynamic notebook to extract zips files. The zip file is in ADLS Gen 2 and the extracted files are placed there also.
Because we are using a shell command, this runs on a single JVM (on Databricks, the driver node), not across all the worker nodes. Therefore there is no parallelization.
We can see that I have the storage mounted.
The S&P 500 are the top 505 stocks and their data for 2013. All these files are in a windows zip file.
Cell 2 defines widgets (parameters) and retrieves their values. This only needs to be done once. The calling program can pass the correct parameters to the notebook.
Cell 3 creates variables in the OS (shell) for both the file path and file name.
In cell 4, we use a shell call to the unzip program to overwrite the existing directory/files with the contents of the zip file. If there is no existing directory, we just get the uncompressed files.
Last but not least, the files do appear in the sub-directory as instructed.
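A rough sketch of what cells 2 through 4 might look like (the widget names, mount point, and file names are hypothetical; cell 4 is a separate %sh cell, shown here as comments):
# Cell 2: define widgets (parameters) and read their values.
dbutils.widgets.text("zip_path", "/mnt/adls2/raw/")    # hypothetical default
dbutils.widgets.text("zip_file", "sp500-2013.zip")     # hypothetical default
zip_path = dbutils.widgets.get("zip_path")
zip_file = dbutils.widgets.get("zip_file")

# Cell 3: expose the values to the shell as environment variables.
import os
os.environ["ZIP_PATH"] = zip_path
os.environ["ZIP_FILE"] = zip_file

# Cell 4 (a separate %sh cell): unzip through the /dbfs FUSE mount, overwriting
# any existing files in the target sub-directory.
#   %sh
#   unzip -o /dbfs${ZIP_PATH}${ZIP_FILE} -d /dbfs${ZIP_PATH}unzipped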
To recap, it is possible to unzip files using Databricks (Spark) with both remote storage and already mounted storage (local). Use the above techniques to accomplish this task in a notebook that can be called repeatedly.
If I remember correctly, gzip is natively supported by Spark. However, it might be slow or chew up a lot of memory using DataFrames. See this article on RDDs.
https://medium.com/@parasu/dealing-with-large-gzip-files-in-spark-3f2a999fc3fa
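For the native gzip path, a hedged sketch (the path is hypothetical; in a Databricks notebook spark is already defined):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark reads gzip-compressed CSV transparently; note that .gz files are not
# splittable, so each file is read by a single task.
df = (
    spark.read
    .option("header", "true")
    .csv("/mnt/azureblobstorage/data/*.csv.gz")  # hypothetical path
)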
On the other hand, if you have a Windows zip file you can use a Unix script to uncompress the files and then read them. Check out this article.
https://docs.databricks.com/external-data/zip-files.html
I have been getting intermittent notebook failures relating to querying a TEMPORARY VIEW that selects from a parquet file located on an ADLS Gen2 mount.
Delta cache contains a stale footer and stale page entries for the file dbfs:/mnt/container/folder/parquet.file, these will be removed (4 stale page cache entries). Fetched file stats (modificationTime: 1616064053000, fromCachedFile: false) do not match file stats of cached footer and entries (modificationTime: 1616063556000, fromCachedFile: true).
at com.databricks.sql.io.parquet.CachingParquetFileReader.checkForStaleness(CachingParquetFileReader.java:700)
at com.databricks.sql.io.parquet.CachingParquetFileReader.close(CachingParquetFileReader.java:511)
at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.close(SpecificParquetRecordReaderBase.java:327)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.close(VectorizedParquetRecordReader.java:164)
at com.databricks.sql.io.parquet.DatabricksVectorizedParquetRecordReader.close(DatabricksVectorizedParquetRecordReader.java:484)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.close(RecordReaderIterator.scala:70)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:45)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:291)
A Data Factory Copy Data activity is performed with Source (an MSSQL table) and Sink (the Parquet file) using snappy compression before the notebook command is executed. No other activities or pipelines write to this file. However, multiple notebooks will perform selects against this same parquet file.
From what I can tell from the error message, the Delta cache is older than the parquet file itself. Is there a way to turn off the caching for this particular file (it is a very small dataset) or invalidate the cache prior to the Copy Data activity? I am aware of the CLEAR CACHE command, but this does it for all tables, not a specific temp view.
We have a similar process and we have been having the exact same problem.
If you need to invalidate the cache for a specific file/folder you can use something like the following Spark-SQL command:
REFRESH {file_path}
Where file_path is either the path through DBFS or your mount.
Worth noting is that if you specify a folder instead of a file, all files within that folder (recursively) will be refreshed.
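For example, from a notebook you could run something like this against the mount path from the error message (the exact path is whatever your parquet file or folder is):
# Invalidate cached entries for a specific path before re-querying it.
spark.sql('REFRESH "dbfs:/mnt/container/folder/"')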
This also may very well not solve your problem. It seems to have helped us, but that is more of a gut feeling as we have not been actively looking at the frequency of these failures.
The documentation.
Our Specs:
Azure
Databricks Runtime 7.4
Driver: Standard_L8s_v2
Workers: 24 Standard_L8s_v2
I have a list of DB files stored in a local folder. When I run the Spark job in local mode I can provide the local path to read those local files, but when running in client or cluster mode the path is not accessible. It seems they need to be kept on HDFS or accessed directly from S3.
I am doing the following:
java.io.File directory = new File(dbPath);
At dbPath, all the DB files are present. Is there any simple way to access that folder of files from HDFS or from S3, as I am running this Spark job on AWS?
To my knowledge, there isn't a standard way to do this currently. But it seems you could reverse-engineer a dump-reading protocol through a close examination of how the dump is generated.
According to edgedb-cli/dump.rs, it looks like you can open the file with a binary stream reader and ignore the first 15 bytes of a given dump file.
output.write_all(
b"\xFF\xD8\x00\x00\xD8EDGEDB\x00DUMP\x00\
\x00\x00\x00\x00\x00\x00\x00\x01"
).await?;
But then it appears the remaining dump gets written to a mutable async future result via:
header_buf.truncate(0);
header_buf.push(b'H');
header_buf.extend(
&sha1::Sha1::from(&packet.data).digest().bytes()[..]);
header_buf.extend(
&(packet.data.len() as u32).to_be_bytes()[..]);
output.write_all(&header_buf).await?;
output.write_all(&packet.data).await?;
with a SHA1 digest.
Unfortunately, we're in the dark at this point because we don't know what the byte sequences of the header_buf actually say. You'll need to investigate how the undigested contents look in comparison to any protocols used by asyncpg and Postgres to verify what your dump resembles.
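If you want to poke at the raw blocks, here is a rough, untested Python sketch based purely on the write calls quoted above; the prefix is just the quoted byte string, and the marker/digest/length block layout is an assumption:
import hashlib
import struct

# The magic/version prefix as quoted from dump.rs above (assumed to be written
# exactly once at the start of the file).
PREFIX = (b"\xFF\xD8\x00\x00\xD8EDGEDB\x00DUMP\x00"
          b"\x00\x00\x00\x00\x00\x00\x00\x01")

def iter_dump_blocks(path):
    """Yield (marker, payload) pairs from a dump file, verifying the SHA-1 digests."""
    with open(path, "rb") as f:
        if f.read(len(PREFIX)) != PREFIX:
            raise ValueError("unrecognised dump prefix")
        while True:
            marker = f.read(1)                          # e.g. b'H' as in the code above
            if not marker:
                break                                   # end of file
            digest = f.read(20)                         # SHA-1 of the payload
            (length,) = struct.unpack(">I", f.read(4))  # big-endian u32 length
            data = f.read(length)
            if hashlib.sha1(data).digest() != digest:
                raise ValueError("checksum mismatch")
            yield marker, data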
Alternatively, you could prepare a shim to the restore.rs with some pre-existing data loaders.
I'm trying to write about 30k-60k parquet files to s3 using Spark and it's taking a massive amount of time (40+ minutes) due to the s3 rate limit.
I wonder if there is a best practice to do such a thing. I heard that writing the data to HDFS and then copying it using s3-dist-cp may be faster. I can't understand why. Won't the copy from HDFS take the same amount of time because of the s3 rate limit?
Thanks for your help
There is nothing wrong with this approach, and it works absolutely fine in most use cases, but there can be some challenges due to the way files are written in S3.
Two Important Concepts to Understand
S3 (Object Store) != POSIX File System: the Rename Operation
The file rename process in a POSIX-based file system is a metadata-only operation. Only the pointer changes and the file remains as-is on the disk. For example, if I have a file abc.txt and I want to rename it to xyz.txt, it is instantaneous and atomic; xyz.txt's last-modified timestamp remains the same as abc.txt's last-modified timestamp.
Whereas in AWS S3 (an object store), a file rename is under the hood a copy followed by a delete: the source file is first copied to the destination and then the source file is deleted. So "aws s3 mv" changes the last-modified timestamp of the destination file, unlike a POSIX file system. The store here is a key-value store where the key is the file path and the value is the content of the file, and there is no operation that simply changes the key in place. The rename time therefore depends on the size of the file. If it is a directory rename (there is nothing called a directory in S3; for simplicity we can treat a recursive set of files as a directory), then it depends on the number of files inside the directory along with the size of each file. So, in a nutshell, rename is a very expensive operation in S3 compared to a normal file system.
S3 Consistency Model
S3 comes with two kinds of consistency: (a) read-after-write and (b) eventual consistency, which in some cases results in file-not-found exceptions: files being added but not yet listed, or files being deleted but not yet removed from the listing.
Deep explanation:
Spark leverages Hadoop's "FileOutputCommitter" implementations to write data. Writing data involves multiple steps: at a high level, staging the output files and then committing them, i.e. writing the final files. The rename step is involved here, as mentioned earlier, when going from the staging step to the final step. As you know, a Spark job is divided into multiple stages and sets of tasks, and due to the nature of distributed computing the tasks are prone to failure, so there is also a provision to re-launch the same task after a system failure or to speculatively execute slow-running tasks; that leads to the concepts of the task commit and job commit functions. Here we have two readily available algorithms for how job and task commits are done. Having said this, neither algorithm is better than the other in absolute terms; it rather depends on where we are committing the data.
mapreduce.fileoutputcommitter.algorithm.version=1
commitTask renames the data generated by the task from the task temporary directory to the job temporary directory.
When all the tasks are complete, commitJob renames all the data from the job temporary directory to the final destination and at the end creates the _SUCCESS file.
Here the driver does the work of commitJob at the end, so object stores like S3 may take a longer time because lots of task temporary files are queued up for rename operations (it is not serial, though) and the write performance is not optimized. It might work pretty well for HDFS, where rename is not expensive and is just a metadata change. For AWS S3, during commitJob each rename operation opens up a huge number of API calls to AWS S3 and might cause issues of unexpected API call failures if the number of files is high. It might not, either; I have seen both cases on the same job running at two different times.
mapreduce.fileoutputcommitter.algorithm.version=2
commitTask moves the data generated by the task from the task temporary directory directly to the final destination as soon as the task is complete.
commitJob basically writes the _SUCCESS file and doesn't do much.
From a high level this looks optimized, but it comes with limitations: speculative task execution is not possible, and if any task fails due to corrupt data we might end up with residual data in the final destination that needs a clean-up. So this algorithm doesn't give 100% data correctness, and it doesn't work for use cases where we need data appended to existing files. Even though it produces optimized results, it comes with a risk. The reason for the good performance is basically the smaller number of rename operations compared to algorithm 1 (there are still renames). Here we might encounter file-not-found exceptions because commitTask writes the file to a temporary path and immediately renames it, and there are slight chances of eventual-consistency issues.
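For reference, a minimal sketch of pinning the committer algorithm version explicitly on a Spark session (shown for version 2; whether that trade-off is acceptable depends on the caveats above):
from pyspark.sql import SparkSession

# Sketch: select the FileOutputCommitter algorithm version explicitly.
# Version 2 reduces renames at job commit, with the correctness caveats above.
spark = (
    SparkSession.builder
    .appName("s3-parquet-writer")  # hypothetical app name
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)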
Best Practices
Here are a few I think we can use while writing Spark data processing applications:
If you have an HDFS cluster available, then write data from Spark to HDFS and copy it to S3 to persist it. s3-dist-cp can be used to copy data from HDFS to S3 optimally, and here we avoid all those rename operations. With AWS EMR running only for the duration of the compute and then being terminated afterwards to persist the results, this approach looks preferable (see the sketch after this list).
Try to avoid writing files and reading them again and again unless there are consumers for the files; Spark is well known for in-memory processing, and careful in-memory data persistence/caching will help optimize the run time of the application.
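A rough sketch of the first practice (paths and bucket are hypothetical; the copy step runs outside Spark, typically as an EMR step):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)  # stand-in for the real DataFrame

# 1. Write once to HDFS, where rename during job commit is a cheap metadata change.
df.write.mode("overwrite").parquet("hdfs:///tmp/output/")

# 2. Then copy to S3 in bulk, e.g. with s3-dist-cp on EMR:
#    s3-dist-cp --src hdfs:///tmp/output/ --dest s3://my-bucket/output/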