Perforce repository operations - perforce

I have a huge folder in a Perforce repository, say //myroot/foo/..., and the tree structure looks like:
//myroot/foo/sub1/a/... 10MB
//myroot/foo/sub1/b/... 1GB
//myroot/foo/sub2/a/... 20MB
//myroot/foo/sub2/b/... 1.5GB
//myroot/foo/sub3/a/... 30MB
//myroot/foo/sub3/b/... 1GB
Is there any way to check the folder sizes as shown above without syncing //myroot/foo to local disk, since that takes too long?
When creating a new workspace, is it possible to create a mapping that excludes all the "b" subfolders? Something like:
-//myroot/foo/.../b/... //myWorkSpace/foo/.../b/...

As raven mentioned, the first question has been answered:
p4 sizes -s //myroot/foo/...
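To get a per-subfolder breakdown like the listing above, the summary can be run once per depot subdirectory. A minimal sketch using standard p4 commands (the loop itself is just illustrative):
p4 dirs "//myroot/foo/*" | while read -r sub; do
    p4 dirs "$sub/*" | while read -r leaf; do
        p4 sizes -s "$leaf/..."
    done
done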
The second question was also answered by raven: client views support exclusion lines, which are prefixed with a minus sign.
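A sketch of the relevant part of the client spec (//myWorkSpace is a placeholder; some server versions reject more than one "..." wildcard per view line, so the explicit per-subfolder form below is the safe one):
View:
    //myroot/foo/... //myWorkSpace/foo/...
    -//myroot/foo/sub1/b/... //myWorkSpace/foo/sub1/b/...
    -//myroot/foo/sub2/b/... //myWorkSpace/foo/sub2/b/...
    -//myroot/foo/sub3/b/... //myWorkSpace/foo/sub3/b/...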

Related

Cloning only the filled portion of a read only raw data HDD source (without source partition resizing)

I often copy raw data from HDDs with FAT32 partitions at the file level. I would like to switch to bitwise cloning of this raw data, which consists of thousands of 10MiB files written sequentially across a single FAT32 partition.
The idea is that the large archival HDD has a small partition containing a shadow directory structure with symbolic links to separate raw data image partitions. Each additional partition holds the aforementioned raw data, but is sized only to the space consumed on the source drive. The number of raw data files on each source drive can range from tens up to tens of thousands.
i.e.: [[sdx1][--sdx2--][-------------sdx3------------][--------sdx4--------][-sdx5-][...]]
Where 'sdx1' = directory of symlinks to sdx2, sdx3, sdx4, ... such that the user can browse to multiple partitions but it appears to them as if they're just in subfolders.
Optimally I'd like to find both a Linux and a Windows solution. If the process can be scripted, or an existing software solution can step through a standard workflow, that would be best. The process is almost always: 1) insert 4 HDDs with raw data, 2) copy whatever's in them, 3) repeat. Always the same drive slots and process.
AFAIK, in order to clone a source partition without cloning all the free space, one conventionally must resize the source HDD partition first. Since I can't alter the source HDD in any way, how can I get around that?
One way would be to clone the entire source partition (including free space) and resize the target backup partition afterward, but that won't work because of all the additional time it would take.
The goal is to retain bitwise accuracy and to save time (dd runs at about 200MiB/s whereas rsync runs at about 130MiB/s; however, also having to copy a ton of blank space every time makes the whole perk moot). I'd also like to run with some kind of --rescue flag, so that when bad clusters are hit on the source drive it just behaves like Clonezilla and writes ???????? in place of the bad clusters. I know I said "retain bitwise accuracy", but a bad cluster's a bad cluster.
If you think one of the COTS or GOTS tools like EaseUS, AOMEI, Paragon and whatnot is able to clone partitions as I've described, please point me in the right direction. If you think there's some way I can dd it up with a script that sizes up the source, makes a target partition of the right size, then fixes the target FAT to match, chime in. I'd love many options, and so would future people with a similar use case who stumble on this thread :)
Not sure if this will fit your case, but it is very simple.
Syncthing (https://syncthing.net/) will sync the contents of two or more folders, and it works on Linux and Windows.
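If you do want to stay block-level as described in the question, one possible sketch is to limit the copy to what the filesystem reports as used, assuming the data really is written sequentially from the start of the partition (the device, mount point and output paths below are placeholders, and the padding is a guess to make sure the FAT area is covered):
# bytes actually in use on the mounted source partition
used=$(df --output=used -B1 /mnt/source | tail -n1)
# copy only that much (plus some slack), skipping unreadable sectors and logging them in the map file
ddrescue --size=$(( used + 64*1024*1024 )) /dev/sdb1 /archive/sdb1.img /archive/sdb1.map
ddrescue keeps going past bad sectors, which gives the Clonezilla-like rescue behaviour. Note that the resulting image still carries a FAT describing the full partition size, so treat it as read-only when you loop-mount it.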

Fastest way to sort very large files preferably with progress

I have a 200GB flat file (one word per line) and I want to sort the file, then remove the duplicates and create one clean final TXT file out of it.
I tried sort with --parallel, but it ran for 3 days and I got frustrated and killed the process, as I didn't see any changes to the chunk files it created in /tmp.
I need to see the progress somehow and make sure it's not stuck and is actually working. What's the best way to do so? Are there any Linux tools or open source projects dedicated to something like this?
I don't use Linux, but if this is GNU sort, you should be able to see the temporary files it creates from another window to monitor progress. The parallel feature only helps during the initial pass that sorts and creates the initial list of temporary files. After that, the default is a 16-way merge.
Say, for example, the first pass is creating temp files around 1GB in size. In that case, GNU sort will end up creating 200 of these 1GB temp files before starting the merge phase. The 16-way merge means that 16 of those temp files will be merged at a time, creating temp files of size 16GB, and so on.
So one way to monitor progress is to monitor the creation of those temporary files.
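A minimal sketch of that workflow, assuming GNU sort and coreutils (big.txt, sorted.txt and /data/tmp are placeholders; pv is optional but shows how far the first pass has read):
# sort and deduplicate in one go, with a large buffer and a roomy temp directory
LC_ALL=C sort -u --parallel=8 --buffer-size=16G -T /data/tmp big.txt -o sorted.txt
# in another terminal, watch the temporary chunk files being created and merged
watch -n 60 'ls -lh /data/tmp | tail'
# or feed the input through pv for a progress bar on the initial read
pv big.txt | LC_ALL=C sort -u --parallel=8 --buffer-size=16G -T /data/tmp > sorted.txt
LC_ALL=C makes the comparison byte-wise, which is usually much faster than a locale-aware sort for this kind of deduplication.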

Gitlab scan worktree on push and zip files bigger than 20MB

We have some raw data files in our repository that get updated quite frequently and are used during the build process of our app. I usually zip these files because they are x times smaller when zipped, so I'd like to automate this process and let my repository do this job for me. Repo performance is also much better when all data files are zipped, due to the reduced file size.
Is there any possibility in GitLab to scan the received worktree on push for files bigger than 20MB and zip those files before finishing the push?
Or is there anything similar that can be done to achieve a similar goal?
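As far as I know, GitLab itself will not rewrite the content you push; server-side it can only accept or reject a push (for example via push rules). One possible workaround is a client-side pre-commit hook that zips oversized files before they ever reach the server. A rough sketch (the 20MB threshold and zip-in-place behaviour are assumptions, and the loop does not handle filenames with spaces):
#!/bin/sh
# .git/hooks/pre-commit -- replace staged files larger than 20MB with a .zip
limit=$((20 * 1024 * 1024))
for f in $(git diff --cached --name-only --diff-filter=AM); do
    [ -f "$f" ] || continue
    case "$f" in *.zip) continue ;; esac
    if [ "$(wc -c < "$f")" -gt "$limit" ]; then
        zip -q "$f.zip" "$f"      # compress next to the original
        git rm --cached -q "$f"   # drop the raw file from the index
        git add "$f.zip"          # stage the zipped version instead
    fi
done
A CI job could additionally fail the pipeline when an unzipped file above the threshold slips through, but it cannot change a commit that was already pushed.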

Arangodb journal logfiles

What are the logfiles in arango_instance_database/journals/logfile-xxxxxx.db for?
Can I delete them?
How can I reduce their size?
I set
database.maximal-journal-size = 1048576
but those files are still 32M large.
Can I set some directory for them, like /var/log/...?
You're referencing the write-ahead logfiles (WAL), which are, at least temporarily, the files your data is kept in.
So it's a very bad idea to remove them on your own, as long as you'd still like your data to be intact.
The files are used so documents can be written to disk in a continuous fashion. Once the system is idle, the aggregator job will pick the documents from them and move them over into your database files.
You can find interesting documentation of situations where others didn't choose such an architectural approach and wrote data directly into their data files on disk, and of what that then does to your system.
Once all documents in a WAL file have been moved into the database files, ArangoDB will reclaim the allocated space.
Thank you a lot for the reply :-)
So in case of arangodb deployed as "single instance" I can set:
--wal.suppress-shape-information true
--wal.historic-logfiles 0
Anything else?
How about --wal.logfile-size? What are the best/common practices for determining its size?
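For reference, a sketch of how the options from this thread map onto arangod.conf, assuming an ArangoDB 2.x-style instance where these --wal.* options exist (the values are examples only; database.maximal-journal-size controls the per-collection journal files, not the WAL logfiles, which would explain why the 32M files did not shrink):
[wal]
# size of each WAL logfile in bytes (the default is 32M)
logfile-size = 1048576
# number of already-collected logfiles to keep around
historic-logfiles = 0
# safe to enable on a single server without replication
suppress-shape-information = true
# the WAL location can also be moved with --wal.directory
# directory = /path/to/wal

[database]
# size of the per-collection journal files
maximal-journal-size = 1048576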

handling lots of temporary small files

I have a web server which saves cache files and keeps them for 7 days. The file names are md5 hashes, i.e. exactly 32 hex characters long, and are being kept in a tree structure that looks like this:
00/
    00/
        00000ae9355e59a3d8a314a5470753d8
        .
        .
    01/
    .
    .
You get the idea.
My problem is that deleting old files is taking a really long time. I have a daily cron job that runs
find cache/ -mtime +7 -type f -delete
which takes more than half a day to complete. I worry about scalability and the effect this has on the performance of the server. Additionally, the cache directory is now a black hole in my system, trapping the occasional innocent du or find.
The standard solution to LRU cache is some sort of a heap. Is there a way to scale this to the filesystem level?
Is there some other way to implement this in a way which makes it easier to manage?
Here are ideas I considered:
Create 7 top directories, one for each weekday, and empty one directory every day. This increases the seek time for a cache file 7-fold, makes it really complicated when a file is overwritten, and I'm not sure what it would do to the deletion time.
Save the files as blobs in a MySQL table with indexes on name and date. This seemed promising, but in practice it's always been much slower than FS. Maybe I'm not doing it right.
Any ideas?
When you store a file, make a symbolic link to a second directory structure that is organized by date, not by name.
Retrieve your files using the "name" structure, delete them using the "date" structure.
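A minimal bash sketch of that scheme (the paths, the flat per-day layout and the 7-day retention are assumptions; if an entry can be refreshed on a later day you would also need to drop its old date link):
#!/bin/bash
hash=00000ae9355e59a3d8a314a5470753d8     # example cache entry
today=$(date +%F)
mkdir -p "by-date/$today"
# after writing cache/00/00/<hash> as before, record it under today's date as well
ln -sf "$PWD/cache/${hash:0:2}/${hash:2:2}/$hash" "by-date/$today/$hash"

# daily cron job: delete the expired day's files via their links, then the links themselves
expired=$(date -d '8 days ago' +%F)
find "by-date/$expired" -type l -exec sh -c 'rm -f "$(readlink "$1")" "$1"' _ {} \;
rmdir "by-date/$expired"
Deleting one day's directory touches only that day's entries, so the nightly job no longer has to walk the whole cache tree.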
Assuming this is ext2/3, have you tried enabling indexed directories? With a large number of files in any one directory, the lookup needed to delete something becomes painfully slow.
Use tune2fs -O dir_index to enable the dir_index feature (it is a filesystem feature flag, so it is set with -O rather than -o).
When mounting the file system, make sure to use the noatime option, which stops the OS from updating access-time information for the directories (which it would otherwise have to modify on every access).
Looking at the original post, it seems as though you only have two levels of indirection to the files, which means that you can have a huge number of files in the leaf directories. When there are more than a million entries in these, you will find that searches and changes are terribly slow. An alternative is to use a deeper hierarchy of directories, reducing the number of items in any particular directory and thereby reducing the cost of searches and updates within it.
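A sketch of those suggestions in command form (the device and mount point are placeholders; enabling dir_index only affects directories created afterwards, so e2fsck -D is run on the unmounted filesystem to index the existing ones):
tune2fs -O dir_index /dev/sdXN      # enable hashed b-tree directory indexes
e2fsck -fD /dev/sdXN                # force a check and optimize existing directories
# /etc/fstab entry for the cache filesystem with atime updates disabled
/dev/sdXN  /var/www/cache  ext3  defaults,noatime  0  2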
Reiserfs is relatively efficient at handling small files. Did you try different Linux file systems? I'm not sure about delete performance - you can consider formatting (mkfs) as a substitute for individual file deletion. For example, you can create a different file system (cache1, cache2, ...) for each weekday.
How about this:
Have another folder called, say, "ToDelete"
When you add a new item, get today's date and look for a subfolder in "ToDelete" that has a name indicative of the current date
If it's not there, create it
Add a symbolic link to the item you've created in today's folder
Create a cron job that goes into the subfolder of "ToDelete" for the expired date and deletes everything that is linked from it.
Then delete the date subfolder that contained all the links.
How about having a table in your database that uses the hash as the key? The other field would then be the name of the file. That way the files can be stored in a date-related fashion for fast deletion, and the database can be used to quickly look up a file's location from its hash.
