We have some raw data files in our repository that are updated quite frequently and are used during the build process of our app. I usually zip these files because they are x times smaller when zipped, so I'd like to automate this and let the repository do the job for me. Repository performance is also much better when all data files are zipped, due to the reduced file size.
Is there any way in GitLab to scan the received worktree on push for files bigger than 20MB and zip those files before the push finishes?
Or is there anything similar that can be done to achieve a similar goal?
We had an issue in our system that caused a high number of ready messages, and as a result .rdq files piled up. We took a backup of msg_store_persistent to another location, hoping to reprocess the messages later. Now we have a huge number of .rdq files in our backup (msg_store_persistent) directory.
Problem: Is there a way to parse the JSON data out of the .rdq files (over a TB) and reprocess them?
Attempt #1: I came across this project and tried to parse the .rdq files with it, but this didn't help me much.
Let's say there are 10 folders in my bucket. I want to split the contents of the folders in a ratio of 0.8, 0.1, 0.1 and move them to three new folders: Train, Test and Val. I have previously done this by downloading the folders, splitting them and uploading them again. I now want to split the folders in the bucket itself.
I was able to connect to the bucket using the "google-cloud-storage" library from a notebook, following the post here, and I was able to download and upload files. I'm not sure how to split the folders without downloading the content.
Appreciate the help.
PS: I don't need the full code; just how to approach it will do.
With Cloud Storage you can only READ and WRITE (CREATE/DELETE). You can't move a blob inside the bucket: even if the operation exists in the console or in some client libraries, a move is a WRITE/CREATE of the content under another path followed by a WRITE/DELETE of the previous path.
Thus, your strategy must follow the same logic:
Perform a gsutil ls to list all the files.
Copy (or move) 80% into one directory, and 10% each into the two other directories.
Delete the old directory (unnecessary if you used the move operation).
It's quicker than downloading and uploading the files, but it still takes time. Because it's not a file system but only API calls, each file takes time. And if you have thousands of files, it can take hours!
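The steps above can be sketched in Python. This is only a minimal sketch under stated assumptions, not a definitive implementation: `plan_split` and `apply_plan` are hypothetical helper names, the split is done per top-level folder with a seeded shuffle, and `apply_plan` assumes `bucket` is a `google-cloud-storage` `Bucket` object, using its `copy_blob` method followed by an optional delete of the source.

```python
import random
from collections import defaultdict


def plan_split(object_names, ratios=(("Train", 0.8), ("Test", 0.1), ("Val", 0.1)), seed=0):
    """Map every object name to a new name under Train/, Test/ or Val/.

    The split is made separately for each top-level folder so that every
    folder contributes roughly 80/10/10 to the three targets.
    """
    by_folder = defaultdict(list)
    for name in object_names:
        folder = name.partition("/")[0]
        by_folder[folder].append(name)

    rng = random.Random(seed)  # seeded so the plan is reproducible
    plan = {}
    for folder in sorted(by_folder):
        names = sorted(by_folder[folder])
        rng.shuffle(names)
        start, cumulative = 0, 0.0
        for index, (split, ratio) in enumerate(ratios):
            cumulative += ratio
            is_last = index == len(ratios) - 1
            end = len(names) if is_last else round(cumulative * len(names))
            for name in names[start:end]:
                plan[name] = f"{split}/{name}"
            start = end
    return plan


def apply_plan(bucket, plan, delete_source=False):
    """Copy each object to its new name; optionally delete the original.

    Assumes `bucket` is a google-cloud-storage Bucket. Each entry costs
    at least one API call, so this can be slow for many objects.
    """
    for old_name, new_name in plan.items():
        blob = bucket.blob(old_name)
        bucket.copy_blob(blob, bucket, new_name)
        if delete_source:
            blob.delete()
```

The object names passed to `plan_split` would come from a listing of the bucket (for example the `name` of each blob returned by the client's `list_blobs`), so nothing is downloaded; only names are read and server-side copies are issued.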
There seems to be quite some confusion about PAR files and I'm struggling to find an answer to this.
I have several PAR files, each containing several GB of data. Considering PAR is a type of archive file (similar to tar, I assume), I would like to extract its contents using Linux. However, I can't seem to find how to do this. I can only find how to repair files or create a PAR file.
I am trying to use the par2 command line tool to do this.
Any help would be appreciated
TLDR: They're not really like .tar archives - they are generally created to support other files (including archives) to protect against data damage/loss. Without any of the original data, I think it is very unlikely any data can be recovered from these files.
.par files are (if they are genuinely PAR2 files) error recovery files for supporting a set of data stored separately. PAR files are useful, because they can protect the whole of the source data without needing a complete second copy.
For example, you might choose to protect 1GB of data using 100MB of .par files in the form of 10x 10MB files. This means that if any part of the original data (up to 100MB) is damaged or lost, it can be recalculated and repaired using the .par records.
This will still work if some of the .par files are lost, but the amount of data that can be recovered cannot exceed what .par files remain.
So...given that it is rare to create par files constituting 100% of the size of the original data, unless you have some of the original data as well, you probably won't be able to recover anything from the files.
http://www.techsono.com/usenet/files/par2
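The arithmetic in the example above can be written out as a small check. This is a simplified sketch: `can_repair` is a hypothetical helper, and it compares raw sizes in MB, whereas real PAR2 recovery works on fixed-size blocks, so treat it as an illustration of the limit rather than of the tool's behaviour.

```python
def can_repair(damaged_mb, surviving_par_mb):
    """Simplified rule: damage is repairable only if it does not exceed
    the amount of recovery data that is still available."""
    return damaged_mb <= surviving_par_mb


# 1 GB of source data protected by 10 recovery files of 10 MB each.
par_files_mb = [10] * 10

# 80 MB damaged, all recovery files present: repairable.
all_par_present = can_repair(80, sum(par_files_mb))

# 80 MB damaged, but only 5 recovery files (50 MB) remain: not repairable.
half_par_present = can_repair(80, sum(par_files_mb[:5]))

# All 1000 MB of original data missing, only the recovery files left.
no_original_data = can_repair(1000, sum(par_files_mb))
```

The last case is the situation in the question: with none of the original data, the recovery files alone cover only a small fraction of what would need to be rebuilt.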
I'm trying to write about 30k-60k parquet files to S3 using Spark and it's taking a massive amount of time (40+ minutes) due to the S3 rate limit.
I wonder if there is a best practice for this. I heard that writing the data to HDFS and then copying it using s3-dist-cp may be faster, but I can't understand why. Won't the copy from HDFS take the same amount of time because of the S3 rate limit?
Thanks for your help
There is nothing wrong with this approach and it works absolutely fine in most use cases, but there might be some challenges due to the way files are written in S3.
Two Important Concepts to Understand
S3 (Object Store) != POSIX File System: Rename Operation
A file rename in a POSIX-based file system is a metadata-only operation. Only the pointer changes and the file remains as is on the disk. For example, if I have a file abc.txt and I want to rename it to xyz.txt, the rename is instantaneous and atomic. xyz.txt’s last modified timestamp remains the same as abc.txt’s last modified timestamp.
Whereas in AWS S3 (an object store), a file rename is, under the hood, a copy followed by a delete: the source file is first copied to the destination and then the source file is deleted. So “aws s3 mv” changes the last modified timestamp of the destination file, unlike a POSIX file system. The metadata here is a key-value store where the key is the file path and the value is the content of the file, and there is no operation that simply changes the key and completes immediately. The duration of a rename depends on the size of the file. For a directory rename (there is no such thing as a directory in S3, but for simplicity we can treat a recursive set of files as a directory), it depends on the number of files inside the directory along with the size of each file. In a nutshell, rename is a very expensive operation in S3 compared to a normal file system.
S3 Consistency Model
S3 comes with two kinds of consistency, (a) read-after-write and (b) eventual consistency, which in some cases results in file-not-found exceptions: files that have been added may not be listed, and files that have been deleted may still appear in the listing.
Deep explanation:
Spark leverages Hadoop’s “FileOutputCommitter” implementations to write data. Writing data involves multiple steps; at a high level, output files are staged and then committed, i.e. written as final files. The rename step described earlier is involved in going from staging to final. A Spark job is divided into multiple stages and sets of tasks, and due to the nature of distributed computing the tasks are prone to failure, so there is a provision to relaunch the same task after a system failure or for speculative execution of slow-running tasks. That leads to the concepts of task commit and job commit functions. There are two readily available algorithms for how job and task commits are done. Neither algorithm is better than the other in general; the choice depends on where we are committing data.
mapreduce.fileoutputcommitter.algorithm.version=1
commitTask renames the data generated by the task from the task temporary directory to the job temporary directory.
When all the tasks are complete, commitJob renames all the data from the job temporary directory to the final destination and at the end creates the _SUCCESS file.
Here the driver does the work of commitJob at the end, so object stores like S3 may take longer because many task temporary files are queued up for the rename operation (it's not serial, though), and write performance is not optimized. It can work pretty well for HDFS, where rename is not expensive and is just a metadata change. For AWS S3, during commitJob each file rename triggers a huge number of API calls to AWS S3 and might cause issues of unexpected API call closure if the number of files is high. It also might not; I have seen both outcomes for the same job run at two different times.
mapreduce.fileoutputcommitter.algorithm.version=2
commitTask moves the data generated by the task from the task temporary directory directly to the final destination as soon as the task is complete.
commitJob basically writes the _SUCCESS file and doesn't do much else.
At a high level this looks optimized, but it comes with the limitation that speculative task execution should not be used, and if any task fails due to corrupt data we might end up with residual data in the final destination that needs a cleanup. So this algorithm doesn't give 100% data correctness, and it doesn't work for use cases where we need to append data to existing files. Even though it gives better performance, it comes with a risk. The good performance is basically due to the smaller number of rename operations compared to algorithm 1 (there are still renames). Here we might encounter file-not-found exceptions because commitTask writes the file to a temporary path and immediately renames it, and there is a slight chance of eventual consistency issues.
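As an illustration of how the two algorithms might be selected, here is a minimal sketch that builds the settings and renders them as `--conf` arguments for spark-submit. The helper names `committer_conf` and `as_submit_args` are hypothetical; the sketch assumes the Hadoop property above is passed through Spark with the `spark.hadoop.` prefix and that speculation is controlled by `spark.speculation`.

```python
def committer_conf(algorithm_version, speculation=False):
    """Build Spark settings for the chosen FileOutputCommitter algorithm.

    Speculation defaults to off, following the caveat above about
    algorithm version 2 and speculative task execution.
    """
    if algorithm_version not in (1, 2):
        raise ValueError("algorithm_version must be 1 or 2")
    return {
        "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": str(algorithm_version),
        "spark.speculation": str(speculation).lower(),
    }


def as_submit_args(conf):
    """Render the settings as a flat list of spark-submit arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = as_submit_args(committer_conf(2))
```

The same key/value pairs could instead be set on the session builder in the application; which version is appropriate depends on where the data is being committed, as discussed above.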
Best Practices
Here are a few that I think we can use when writing Spark data processing applications:
If you have an HDFS cluster available, write data from Spark to HDFS and then copy it to S3 to persist it. s3-dist-cp can be used to copy data from HDFS to S3 optimally. This way we avoid all of those rename operations. With an AWS EMR cluster that runs only for the duration of the compute and is terminated afterwards, with the result persisted to S3, this approach looks preferable.
Try to avoid writing files and reading them again and again unless there are consumers for the files. Spark is well known for in-memory processing, and careful data persistence/caching in memory will help optimize the run time of the application.
I have a 200GB flat file (one word per line) and I want to sort the file, then remove the duplicates and create one clean final TXT file out of it.
I tried sort with --parallel, but it ran for 3 days; I got frustrated and killed the process because I didn't see any changes to the chunk files it created in /tmp.
I need to see the progress somehow and make sure it's not stuck and is working. What's the best way to do so? Are there any Linux tools or open source projects dedicated to something like this?
I don't use Linux, but if this is GNU sort, you should be able to see the temporary files it creates from another window to monitor progress. The parallel feature only helps during the initial pass that sorts and creates the initial list of temporary files. After that, the default is a 16-way merge.
Say, for example, the first pass creates temp files around 1GB in size. In this case, GNU sort will end up creating 200 of these 1GB temp files before starting the merge phase. The 16-way merge means that 16 of those temp files will be merged at a time, creating temp files of size 16GB, and so on.
So one way to monitor progress is to monitor the creation of those temporary files.
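A rough way to do that from a second terminal is sketched below. It rests on assumptions: the temporary files are in /tmp (GNU sort can be pointed elsewhere with -T or TMPDIR) and their names start with "sort", which matches the naming template in GNU coreutils as far as I know. The function names are hypothetical, and the pass estimate is only the simple "divide by 16 until one file is left" arithmetic from the example above, not sort's exact merge schedule.

```python
import glob
import math
import os


def estimate_merge_passes(initial_runs, merge_width=16):
    """Count how many rounds of merge_width-way merging reduce
    initial_runs temporary files to a single output."""
    passes = 0
    runs = initial_runs
    while runs > 1:
        runs = math.ceil(runs / merge_width)
        passes += 1
    return passes


def snapshot_temp_files(temp_dir="/tmp", pattern="sort*"):
    """Return (file_count, total_bytes) for sort's temporary files.

    Call it repeatedly: a changing count or total size suggests the
    sort is still making progress.
    """
    sizes = []
    for path in glob.glob(os.path.join(temp_dir, pattern)):
        try:
            sizes.append(os.path.getsize(path))
        except OSError:
            pass  # the file was removed between listing and stat
    return len(sizes), sum(sizes)
```

With the numbers from the example, `estimate_merge_passes(200)` gives 2: the 200 files become 13 larger files, which are then merged into the final output. Printing `snapshot_temp_files()` every minute or so would show the initial files accumulating and then being replaced by fewer, larger ones.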