Does it take longer to copy many small files than to copy one large file of the same total size? Is it just because of the overhead of handling each file's metadata?
Yes, it takes more time to copy many small files as compared to copying the same amount of data in one large file.
And yes, the overhead comes from having to manage all the file system entries and metadata.
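If you want to see the overhead for yourself, here is a minimal, illustrative Python sketch (the file count and sizes are arbitrary assumptions) that times copying many small files against one large file of the same total size:

import os
import shutil
import tempfile
import time

# Build 1,000 x 64 KB small files and one 64 MB large file (same total size).
src_small, src_large = tempfile.mkdtemp(), tempfile.mkdtemp()
dst_small, dst_large = tempfile.mkdtemp(), tempfile.mkdtemp()
chunk = os.urandom(64 * 1024)
for i in range(1000):
    with open(os.path.join(src_small, f"f{i}.bin"), "wb") as f:
        f.write(chunk)
with open(os.path.join(src_large, "big.bin"), "wb") as f:
    for _ in range(1000):
        f.write(chunk)

# Copy the many small files: one open/create/close plus metadata work per file.
t0 = time.perf_counter()
for name in os.listdir(src_small):
    shutil.copy(os.path.join(src_small, name), dst_small)
print("many small files:", time.perf_counter() - t0, "s")

# Copy the single large file: one metadata operation, then a sequential stream.
t0 = time.perf_counter()
shutil.copy(os.path.join(src_large, "big.bin"), dst_large)
print("one large file:  ", time.perf_counter() - t0, "s")

The small-file run is slower almost entirely because of the per-file open/create/close and metadata work, not the amount of data moved.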
I am relatively new to Spark/PySpark, so any help is much appreciated.
Currently we have files delivered hourly into a directory in Azure Data Lake, for example:
hour1.csv
hour2.csv
hour3.csv
I am using Databricks to read the files from the directory using the code below:
sparkdf = spark.read.format("csv").option("recursiveFileLookup", "true").option("header", "true").schema(schema).load(file_location)
Each of the CSV files is about 5 KB, and all have the same schema.
What I am unsure about is how scalable spark.read is. We are currently processing about 2,000 of these small files, and I am worried that there is a limit on the number of files that can be processed. Is there a limit, say a maximum of 5,000 files, at which my code above breaks?
From what I have read online, data size is not an issue with the method above; Spark can read petabytes' worth of data (by comparison, our total data size is still very small). But I have found no mention of the number of files it is able to process - educate me if I am wrong.
Any explanation is very much appreciated.
Thank you.
The limit is your driver's memory.
When reading a directory, the driver lists it (depending on the initial size, it may parallelize the listing to executors, but it collects the results either way).
After having the list of files, it creates tasks for the executors to run.
With that in mind, if the list is too large to fit in the driver's memory, you will have issues.
You can always increase the driver's memory to handle it, or add a preprocessing step that merges the files (GCS, for example, has gsutil compose, which can concatenate files without downloading them).
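To illustrate the merge/compaction idea in PySpark itself, here is a minimal sketch (the output path and the target count of 8 files are assumptions, not something from the original setup):

# Read the many small hourly CSVs once...
small_df = (spark.read.format("csv")
            .option("recursiveFileLookup", "true")
            .option("header", "true")
            .schema(schema)
            .load(file_location))

# ...then rewrite them as a handful of larger files, so later jobs only
# have to list and open a few objects instead of thousands.
# coalesce(8) merges partitions without a full shuffle.
(small_df.coalesce(8)
         .write.mode("overwrite")
         .option("header", "true")
         .csv("dbfs:/mnt/compacted/"))    # hypothetical output path

Downstream jobs can then read the compacted location instead, which keeps the driver's file listing small no matter how many raw hourly files accumulate.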
There seems to be quite a bit of confusion about PAR files, and I'm struggling to find an answer to this.
I have several PAR files, each containing several GB of data. Considering PAR is a type of archive file (similar to tar, I assume), I would like to extract its contents on Linux. However, I can't seem to find out how to do this; I can only find how to repair files or how to create a PAR file.
I am trying to use the par2 command line tool to do this.
Any help would be appreciated
TL;DR: they're not really like .tar archives; they are generally created to accompany other files (including archives) and protect against data damage or loss. Without any of the original data, it is very unlikely that anything can be recovered from these files.
.par files are (if they are genuinely PAR2 files) error recovery files for supporting a set of data stored separately. PAR files are useful, because they can protect the whole of the source data without needing a complete second copy.
For example, you might choose to protect 1GB of data using 100MB of .par files in the form of 10x 10MB files. This means that if any part of the original data (up to 100MB) is damaged or lost, it can be recalculated and repaired using the .par records.
This will still work if some of the .par files are lost, but the amount of data that can be recovered cannot exceed what .par files remain.
So, given that it is rare to create PAR files amounting to 100% of the size of the original data, unless you have some of the original data as well, you probably won't be able to recover anything from these files.
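For reference, typical par2 command-line usage looks like this (the file names here are illustrative):

par2 create -r10 data.par2 data.bin    # create recovery files with ~10% redundancy
par2 verify data.par2                  # check the protected files for damage
par2 repair data.par2                  # rebuild damaged or missing source files

Note that there is no "extract" operation: the .par2 files hold parity blocks, not a copy of the original data.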
http://www.techsono.com/usenet/files/par2
I have a relatively small file of about 6.5 GB, and I tried to split it into files of 5 MB each using split -d --line-bytes=5MB. It took me over 6 minutes to split this file.
I have files over 1TB.
Is there a faster way to do this?
Faster than a tool specifically designed to do this kind of job? Doesn't sound likely in the general case. However, there are a few things you may be able to do:
Save the output files to a different physical storage unit. This avoids reading and writing data to the same disk at the same time, allowing more uninterrupted processing.
If the record size is static, you can use --bytes to avoid the processing overhead of dealing with full lines.
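For example, something like this (the input name and output prefix are illustrative):

split -d --bytes=5MB input.bin chunk_

Unlike --line-bytes, --bytes never has to scan for line endings, so split can stream the file in fixed-size blocks.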
I'm having a problem adding more files to a partition that already holds a large number of them: currently about 10 million files on a Linux file system. For some reason, when I try to add more files, it keeps saying there is not enough space, even though I have 30+ GB left. Any idea why this is happening, and can it be resolved?
The most common cause is too many files in a single directory - directories can hold only a finite number of entries. If that's not the problem, other metadata structures can also limit the total number of files on a disk; on ext file systems the usual one is the inode table, which is sized when the file system is created.
You can differentiate between these two problems by checking if you can add files to another directory.
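A quick way to check the second case on Linux is df -i, which reports inode usage per file system. If df -h shows free space but df -i shows 100% in the IUse% column, you have run out of inodes rather than bytes.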
A piece of software we developed generates more and more files, currently about 70,000 per day at 3-5 MB each. We store these files on a Linux server with an ext3 file system. The software creates a new directory every day and writes that day's files into it. Writing and reading such a large number of files is getting slower and slower (per file, I mean), so one of my colleagues suggested creating a subdirectory for every hour. We will test whether this makes the system faster, but the problem can be generalized:
Has anyone measured the speed of writing and reading files as a function of the number of files in the target directory? Is there an optimal file count above which it is faster to put the files into subdirectories? What are the important parameters that may influence the optimum?
Thank you in advance.
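For anyone who wants to measure this on their own hardware, here is a minimal Python benchmark sketch (the file count, size, and sampling interval are arbitrary assumptions; results depend heavily on the file system and mount options):

import os
import tempfile
import time

# Create files in one directory and report cumulative time every 10,000
# files, so any per-file slowdown shows up as growing interval times.
target = tempfile.mkdtemp(dir=".")   # run this on the file system under test
payload = b"x" * 4096
t0 = time.perf_counter()
for i in range(100_000):
    with open(os.path.join(target, f"f{i:06d}"), "wb") as f:
        f.write(payload)
    if (i + 1) % 10_000 == 0:
        print(f"{i + 1:>7} files: {time.perf_counter() - t0:.2f}s elapsed")

Reading can be measured the same way by re-opening the files in a second pass. On ext3, whether dir_index (hashed directory lookups) is enabled is one of the parameters that most affects where the optimum lies.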