Perforce change list size

Someone is submitting change lists that are using up a lot of disk space on our Perforce server. I want to find the size of recent change lists so I can track down where this is happening. Is there a way to do this?
p4 sizes shows the size of files in the depot but doesn't show history. I suppose I can run this periodically and see what's growing, but I would like to look back in time.
p4 describe shows what's in change lists, but doesn't show the size of the files in the change list.
p4 diskspace shows the size of depots but doesn't show history.

I'd do something like this:
p4 -Ztag -F "#=%change%" changes -s submitted -m 1000 | p4 -x - -F "%fileSize% %path%" sizes -sz | sort -rn
That'll give you output like:
23245 #=3849
22499 #=4109
22438 #=3948
and so on (one line per changelist, ordered by the total size of all revisions submitted in that changelist).
This isn't a perfect representation of the space on disk that each changelist eats (it doesn't account for compression on the back end), but close enough that if someone is periodically submitting a bunch of huge binaries they'll be somewhere near the top of that list, so it won't take you long to find them.
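If you save that output to a file, a short post-processing step can print the top offenders in friendlier units. This is just a hypothetical convenience step (sizes.txt and the sample numbers stand in for the captured pipeline output above):

```shell
# sizes.txt stands in for the captured "bytes #=change" output of the pipeline
printf '22499 #=4109\n23245 #=3849\n22438 #=3948\n' > sizes.txt
# strip the "#=" prefix and report KiB per changelist, largest first
sort -rn sizes.txt | awk '{printf "change %s: %.1f KiB\n", substr($2, 3), $1 / 1024}'
```

With real data the numbers would be large enough that dividing by 1048576 (MiB) is the more readable choice.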

If you want to eyeball changelist by changelist, you can run p4 sizes with a changelist specifier. For example, to see the sizes of the files submitted in changelist 99:
p4 sizes //...@=99
That doesn't give you the history, but if you can narrow it down to a particular path you could do:
p4 sizes //path/to/suspected/troublemakers/...@=50
for example, until you find the culprit. If it's regularly an issue, that's easily scriptable.

Cloning only the filled portion of a read only raw data HDD source (without source partition resizing)

I often copy raw data from HDDs with FAT32 partitions at the file level. I would like to switch to bitwise cloning of this raw data, which consists of thousands of 10MiB files written sequentially across a single FAT32 partition.
The idea is, on the large archival HDD, to have a small partition containing a shadow directory structure with symbolic links into separate raw-data image partitions. Each additional partition holds the aforementioned raw data, but is sized to only the space consumed on the source drive. The number of raw data files on each source drive can range from tens up through tens of thousands.
i.e.: [[sdx1][--sdx2--][-------------sdx3------------][--------sdx4--------][-sdx5-][...]]
Where 'sdx1' = directory of symlinks to sdx2, sdx3, sdx4, ... such that the user can browse to multiple partitions but it appears to them as if they're just in subfolders.
Optimally I'd like to find both a Linux and a Windows solution. If the process can be scripted, or existing software can step through a standard workflow, that'd be best. The process is almost always: 1) insert 4 HDDs with raw data, 2) copy whatever's on them, 3) repeat. Always the same drive slots and process.
AFAIK, in order to clone a source partition without cloning all the free space, one conventionally must resize the source HDD partition first. Since I can't alter the source HDD in any way, how can I get around that?
One way would be to clone the entire source partition (incl. free space) and resize the target backup partition afterward, but that's not going to work out because of all the additional time it would take.
The goal is to retain bitwise accuracy and to save time (dd runs at about 200MiB/s whereas rsync runs at about 130MiB/s, but having to copy a ton of blank space every time makes that advantage moot). I'd also like to run with some kind of --rescue behavior, so that when bad clusters are hit on the source drive it behaves like Clonezilla and just writes ???????? in place of the bad clusters. I know I said "retain bitwise accuracy", but a bad cluster's a bad cluster.
If you think one of the COTS or GOTS tools like EaseUS, AOMEI, Paragon and whatnot can clone partitions as I've described, please point me in the right direction. If you think there's some way I can dd it up with a script that sizes up the source, makes a right-sized target partition, then modifies the target FAT to its correct size, chime in. I'd love many options, and so would future people with a similar use case who stumble on this thread :)
Not sure if this will fit your use case, but it is very simple.
Syncthing https://syncthing.net/ will sync the content of 2 or more folders, works on Linux and Windows.
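On the dd-with-a-script route the asker mentions: partclone is purpose-built for this. Its FAT backend copies only the used blocks of a partition, which sidesteps the source-resizing problem entirely, and GNU ddrescue can image a failing drive while skipping bad sectors (it fills unreadable areas with zeros rather than '?', but the effect is the same). The following is a hypothetical, untested sketch: the device names, backup paths, and the fixed four-slot assumption are placeholders, and the loop only prints the commands so they can be reviewed before touching real drives:

```shell
# Dry run: print a clone command for each of the four fixed drive slots.
# partclone.fat32 -c images only the used blocks of a FAT32 partition.
for dev in sdb1 sdc1 sdd1 sde1; do
  echo "partclone.fat32 -c -s /dev/$dev -o /backup/$dev.img"
done
```

For a drive with bad clusters, one could image it first with something like `ddrescue /dev/sdb1 /backup/sdb1.raw /backup/sdb1.map` and run partclone against the rescued image instead; restoring is the mirror operation (`partclone.fat32 -r`).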

How can I find & delete duplicate strings from ~800gb worth of text files?

I have a dataset of ~800gb worth of text files, with about 50k .txt files in total.
I'd like to go through and make a master .txt file from these, with all duplicate lines removed from all txt files.
I can't find a way to do this that isn't going to take months for my computer to process; ideally I'd like to keep it under a week.
sort -u <data.txt >clean.txt
All you need is a large disk.
sort is quite efficient: it will automatically split the file into manageable chunks, sort each one separately, then merge them (which can be done in O(N) time); and while merging, it will discard the dupes (thanks to the -u option). But you will need at least the space for the output file, plus the space for all the intermediate files.
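Concretely, the two knobs worth setting are -T (put the temp files on a disk with enough free space) and -S (how much RAM to use before spilling to disk). A tiny stand-in for the real dataset:

```shell
# tiny demo input standing in for the real ~800GB of text
printf 'banana\napple\nbanana\ncherry\napple\n' > data.txt
# -T: directory for temp files (point this at a big scratch disk in real use);
# -S: in-memory buffer size before sort spills to temp files.
sort -u -T . -S 64M data.txt > clean.txt
cat clean.txt
```

On the real data you would raise -S to a large fraction of your RAM (e.g. several GB) so fewer, larger temp files are created.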

Fastest way to sort very large files preferably with progress

I have a 200GB flat file (one word per line) and I want to sort the file, then remove the duplicates and create one clean final TXT file out of it.
I tried sort with --parallel but it ran for 3 days and I got frustrated and killed the process, as I didn't see any changes to the chunk files it created in /tmp.
I need to see the progress somehow and make sure it's not stuck and it's working. What's the best way to do so? Are there any Linux tools or open source projects dedicated to something like this?
I don't use Linux, but if this is GNU sort, you should be able to watch the temporary files it creates from another window to monitor progress. The parallel feature only helps during the initial pass, which sorts the input and creates the initial set of temporary files. After that, the default is a 16-way merge.
Say, for example, the first pass creates temp files around 1GB in size. In that case, GNU sort will end up creating 200 of these 1GB temp files before starting the merge phase. The 16-way merge means that 16 of those temp files will be merged at a time, creating temp files of size 16GB, and so on.
So one way to monitor progress is to monitor the creation of those temporary files.
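The merge math above also gives a rough sense of how many passes to expect. A back-of-the-envelope calculation, assuming ~200 one-GB initial temp files from a 200GB input and the default 16-way merge:

```shell
files=200   # ~200 one-GB temp files after the initial sort pass
passes=0
while [ "$files" -gt 1 ]; do
  files=$(( (files + 15) / 16 ))   # each pass merges up to 16 files into 1
  passes=$(( passes + 1 ))
done
echo "merge passes: $passes"   # two passes: 200 -> 13 -> 1
```

So once the first merge pass has produced its thirteen ~16GB files, the final pass writes the output directly; watching the temp directory (e.g. running `du -sh` on it every minute) tells you which phase you're in.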

Perforce repository operations

I got a huge folder on Perforce repository, say //myroot/foo/.... And the tree structure looks like:
//myroot/foo/sub1/a/... 10MB
//myroot/foo/sub1/b/... 1GB
//myroot/foo/sub2/a/... 20MB
//myroot/foo/sub2/b/... 1.5GB
//myroot/foo/sub3/a/... 30MB
//myroot/foo/sub3/b/... 1GB
Is there any way to check the folder sizes like the above without syncing //myroot/foo to local disk, as that takes too long?
When creating a new workspace, is it possible to create a mapping that excludes all the "b" subfolders? Something like:
-//myroot/foo/.../b/... //myWorkSpace/foo/.../b/...
As raven mentioned, the first question has been answered:
p4 sizes -s //myroot/foo/...
The second question is also answered by raven: an exclusionary mapping in the client view, much like the line above, is the right mechanism.
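For the second question, exclusionary lines in the client view spec do exactly that. A sketch of the relevant View section (the workspace name is a placeholder, and this assumes your server version accepts embedded `...` wildcards in view mappings):

```
View:
    //myroot/foo/... //myWorkSpace/foo/...
    -//myroot/foo/.../b/... //myWorkSpace/foo/.../b/...
```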

handling lots of temporary small files

I have a web server which saves cache files and keeps them for 7 days. The file names are md5 hashes, i.e. exactly 32 hex characters long, and are being kept in a tree structure that looks like this:
00/
  00/
    00000ae9355e59a3d8a314a5470753d8
    .
    .
    .
  01/
You get the idea.
My problem is that deleting old files is taking a really long time. I have a daily cron job that runs
find cache/ -mtime +7 -type f -delete
which takes more than half a day to complete. I worry about scalability and the effect this has on the performance of the server. Additionally, the cache directory is now a black hole in my system, trapping the occasional innocent du or find.
The standard solution to LRU cache is some sort of a heap. Is there a way to scale this to the filesystem level?
Is there some other way to implement this in a way which makes it easier to manage?
Here are ideas I considered:
Create 7 top directories, one for each week day, and empty one directory every day. This increases the seek time for a cache file 7-fold, makes it really complicated when a file is overwritten, and I'm not sure what it will do to the deletion time.
Save the files as blobs in a MySQL table with indexes on name and date. This seemed promising, but in practice it's always been much slower than FS. Maybe I'm not doing it right.
Any ideas?
When you store a file, make a symbolic link to a second directory structure that is organized by date, not by name.
Retrieve your files using the "name" structure, delete them using the "date" structure.
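A minimal sketch of that scheme in shell (the directory layout is hypothetical, and the expiry call assumes GNU date for "7 days ago" arithmetic):

```shell
#!/bin/sh
# Store: write the file under by-name/, then link it under by-date/YYYY-MM-DD/.
store() {
    name=$1; src=$2
    namedir="cache/by-name/$(printf %s "$name" | cut -c1-2)/$(printf %s "$name" | cut -c3-4)"
    datedir="cache/by-date/$(date +%F)"
    mkdir -p "$namedir" "$datedir"
    cp "$src" "$namedir/$name"
    ln -s "$(pwd)/$namedir/$name" "$datedir/$name"
}
# Expire: delete the link targets for one old day, then drop the day directory.
expire_day() {
    day=$1    # e.g. "$(date -d '7 days ago' +%F)" with GNU date
    for link in "cache/by-date/$day"/*; do
        [ -L "$link" ] && rm -f "$(readlink "$link")"
    done
    rm -rf "cache/by-date/$day"
}
```

The daily cron job then only touches one day's directory of links instead of walking the whole tree, which is what makes the deletion cheap.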
Assuming this is ext2/ext3, have you tried enabling indexed directories? When you have a large number of files in any particular directory, lookups become painfully slow, which makes deletes slow too.
Use tune2fs -O dir_index to enable the dir_index feature (and run e2fsck -fD afterwards to build indexes for existing directories).
When mounting the file system, make sure to use the noatime option, which stops the OS from updating access times on the directories (their modification times still get updated on content changes).
Looking at the original post it seems as though you only have 2 levels of indirection to the files, which means that you can have a huge number of files in the leaf directories. When there are more than a million entries in these you will find that searches and changes are terribly slow. An alternative is to use a deeper hierarchy of directories, reducing the number of items in any particular directory, therefore reducing the cost of search and updates to the particular individual directory.
Reiserfs is relatively efficient at handling small files. Did you try different Linux file systems? I'm not sure about delete performance - you can consider formatting (mkfs) as a substitute for individual file deletion. For example, you can create a different file system (cache1, cache2, ...) for each weekday.
How about this:
Have another folder called, say, "ToDelete"
When you add a new item, get today's date and look for a subfolder in "ToDelete" that has a name indicative of the current date
If it's not there, create it
Add a symbolic link to the item you've created in today's folder
Create a cron job that goes to the folder in "ToDelete" which is of the expired date and deletes all the items that are linked.
Then delete the folder which contained all the links.
How about having a table in your database that uses the hash as the key, with the file's location as the other field? That way the file can be stored in a date-organized fashion for fast deletion, and the database can quickly map a hash to its location.
