Quickly list a random set of files in a directory in Linux

Question:
I am looking for a performant, concise way to list N randomly selected files in a Linux directory using only Bash. The files must be randomly selected from different subdirectories.
Why I'm asking:
In Linux, I often want to test a random selection of files in a directory for some property. The directories contain thousands of files, so I only want to test a small number of them, but I want to take them from different subdirectories of the directory of interest.
The following returns the paths of 50 "randomly"-selected files:
find /dir/of/interest/ -type f | sort -R | head -n 50
The directory contains many files and resides on a mounted file system with slow read times (accessed over ssh), so the command can take many minutes. I believe the issue is that find enumerates every file (slow), and only then prints a random selection.

If you are using locate and updatedb updates regularly (daily is probably the default), you could:
$ locate /home/james/test | sort -R | head -5
/home/james/test/10kfiles/out_708.txt
/home/james/test/10kfiles/out_9637.txt
/home/james/test/compr/bar
/home/james/test/10kfiles/out_3788.txt
/home/james/test/test
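A small variant of the same idea, assuming GNU coreutils' shuf is available: shuf -n draws the sample directly instead of sorting the whole list, which saves a little work (although locate or find still has to produce every path first).
locate /home/james/test | shuf -n 5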

How often do you need it? Do the work periodically in advance to have it quickly available when you need it.
Create a refreshList script.
#!/usr/bin/env bash
find /dir/of/interest/ -type f | sort -R | head -n 50 >/tmp/rand.list
mv -f /tmp/rand.list ~
Put it in your crontab.
0 7-20 * * 1-5 nice -n 19 ~/refreshList
Then you will always have a ~/rand.list that's under an hour old.
If you don't want to use cron and aren't too picky about how old it is, just write a function that refreshes the file after you use it every time.
randFiles() {
    cat ~/rand.list
    {
        find /dir/of/interest/ -type f |
            sort -R | head -n 50 >/tmp/rand.list
        mv -f /tmp/rand.list ~
    } &
}
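A quick sketch of how the cached list might then be consumed, with GNU stat standing in as a placeholder for whatever property test you actually need:
# run a placeholder check against each cached random file
while read -r f; do
    stat -c '%s %n' "$f"   # print size and name; swap in your real test here
done < ~/rand.list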

If you can't run locate and the find command is too slow, is there any reason this has to be done in real time?
Would it be possible to use cron to dump the output of the find command into a file and then do the random pick out of there?
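A minimal sketch of that cron idea (the schedule and paths are placeholders; the random pick assumes GNU shuf):
# crontab entry: rebuild the full file list once an hour
# 0 * * * * find /dir/of/interest/ -type f > /tmp/all_files.list
# then, whenever a sample is needed, draw 50 paths at random from the cached list
shuf -n 50 /tmp/all_files.list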

Related

How can I list the files in a directory that have zero size/length in the Linux terminal?

I am new to using the Linux terminal, so I'm just starting to learn about the commands I can use. I have figured out how to list the files in a directory using the Linux terminal, and how to list them according to file size. I was wondering if there's a way to list only the files of a specific file size. Right now, I'm trying to list files with zero size, like those that you might create using the touch command. I looked through the flags I could use when I use ls, but I couldn't find exactly what I was looking for. Here's what I have right now:
ls -lsh /mydirectory
The "mydirectory" part is just a placeholder. Is there anything I can add that will only list files that have zero size?
There are a few ways you can go about this; if you want to stick with ls you could use e.g. awk in a pipeline to do the filtering.
ls -lsh /mydirectory | awk '$6 == 0'
Here, $6 is the sixth field in ls's output, the size. (The -s flag prepends a block-count column, which pushes the size from field 5 to field 6; without -s, use $5.)
Another approach would be to use a different tool, find.
find /mydirectory -maxdepth 1 -size 0 -ls
This will also list hidden files, analogous to an ls -la.
The -maxdepth 1 is there so it doesn't traverse the directory tree if you have nested directories.
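A variant restricted to regular files, assuming GNU (or BSD) find, which also has a dedicated -empty test:
# only regular files with size zero, one level deep
find /mydirectory -maxdepth 1 -type f -empty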
A simple script can do this.
for file_name in *
do
    if [[ ! -s "$file_name" ]]
    then
        echo "$file_name"
    fi
done
Explanation:
for is a loop; * expands to the list of all files in the current directory.
-s file_name is true if the file exists and has a size greater than 0.
! negates that, so only the empty files are printed.
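For reference, the same test condensed to a single line (with the quoting needed for names that contain spaces):
for f in *; do [[ -s "$f" ]] || echo "$f"; done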

Finding the oldest folder in a directory in Linux even when files inside are modified

I have two folders, A and B, each containing two files,
created in the order below:
mkdir A
cd A
touch a_1
touch a_2
cd ..
mkdir B
cd B
touch b_1
touch b_2
cd ..
From the above, I need to find which folder was created first (not modified).
ls -c <path_to_root_before_A_and_B> | tail -1
This outputs "A" (no issues here).
Now I delete the file a_1 inside directory A
and execute the command again:
ls -c <path_to_root_before_A_and_B> | tail -1
This time it shows "B".
But directory A still contains the file a_2, yet the ls command now shows "B". How can I overcome this?
How To Get File Creation Date Time In Bash-Debian
You'll want to read the link above for that: files and directories store the same kinds of timestamps, which means directories do not record their creation date. Methods like the ls -i one mentioned earlier may work sometimes, but when I ran it just now it mixed really old files up with really new ones, so I don't think it works quite the way you might expect.
Instead, try touching a file immediately after creating a directory; save it as something like .DIRBIRTH and make it hidden. Then, when trying to find the order the directories were made in, just check which .DIRBIRTH has the oldest modification date.
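A rough sketch of that marker-file idea (the .DIRBIRTH name is just the convention suggested above):
# drop a marker whose mtime records the moment each directory was created
mkdir A && touch A/.DIRBIRTH
mkdir B && touch B/.DIRBIRTH

# ls -t sorts newest first, so the oldest directory's marker comes out last
ls -t */.DIRBIRTH | tail -1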
Assuming that all the stars align (you're using a version of GNU stat(1) that supports the file birth time formats, you're using a filesystem that records them, and your Linux kernel is new enough to support the statx(2) syscall), this script should print all immediate subdirectories of the directory passed as its argument, sorted by creation time:
#!/bin/sh
rootdir=$1
find "$rootdir" -maxdepth 1 -type d -exec stat -c "%W %n" {} + | tail -n +2 \
| sort -k1,1n | cut --complement -d' ' -f1
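A quick way to check whether your stat and filesystem actually record birth times (GNU stat prints 0 when the birth time is unknown); the path here is just a placeholder:
stat -c '%W' /path/to/some/directory   # 0 means no birth time is recorded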

Unable to cat ~9000 files using command line

I am trying to cat ~9000 FASTA-like files into one larger file. All of the files are in a single subfolder. I keep getting the "argument list too long" error.
This is a sample name from one of the files
efetch.fcgi?db=nuccore&id=CL640905.1&rettype=fasta&retmode=text
They are considered a document type file by the computer.
You can't use cat * > concatfile as you have limits on command line size. So take them one at a time and append:
ls | while read; do cat "$REPLY" >> concatfile; done
(Make sure concatfile doesn't exist beforehand.)
EDIT: As user6292850 rightfully points out, I might be overthinking it. This suffices, if your files don't have too weird names:
ls | xargs cat > concatfile
(but files with spaces in them, for example, would blow it up)
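A variant that copes with arbitrary file names (spaces, even newlines), assuming your xargs supports -0; printf is a shell builtin, so the ARG_MAX limit doesn't apply to its argument list:
# write the output outside the folder so it can't be swept up by the glob
printf '%s\0' * | xargs -0 cat > ../concatfile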
There is a limit on how many arguments you can place on the command line.
You could use a loop to handle this:
while read file; do
    cat "${file}" >> path/to/output_file
done < <(find path/to/input_folder -maxdepth 1 -type f -print)
(Keep the output file outside the folder being read, or find will pick it up too.)
This will bypass the problem of an expanded glob with too many arguments.
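For completeness, find can also batch the cat calls itself via -exec ... {} +, which sidesteps the glob limit and keeps the number of spawned processes low; a sketch with placeholder paths (again, keep the output file outside the input folder):
find path/to/input_folder -maxdepth 1 -type f -exec cat {} + > path/to/concatfile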

Linux/Perl Returning list of folders that have not been modified for over x minutes

I have a directory that has multiple folders. I want to get a list of the names of folders that have not been modified in the last 60 minutes.
The folders will have multiple files that will remain old, so I can't use -mmin +60.
I was thinking I could work with the inverse, though: get a list of files that have been modified in the last 60 minutes (-mmin -60) and then output the inverse of that list.
I'm not sure how to go about doing that, or whether there is a simpler way.
Eventually I will take this list of folders into a Perl script and add them to a list or something.
This is what I have so far to get the list of folders
find /path/to/file -mmin -60 | sed 's/\/path\/to\/file\///' | cut -d "/" -f1 | uniq
Above will give me just the names of the folders that have been updated.
There is a neat trick for doing set operations on text lines like this with sort and uniq. You already have the paths that have been updated; assume they are in a file called upd. A simple find -type d can give you all folders; let's assume those are in a file called all. Then run
cat all upd | sort | uniq -c | grep -E '^ *1 '
All paths that appear in both files will be prefixed with a count of 2. All paths appearing only in file all will be prefixed with a 1, so the lines with count 1 are the set difference between all and upd, i.e. the paths that were not touched. (I take it you are able to strip the leading count yourself.)
Surely this can be done with Perl or any other scripting language, but this simple sort | uniq is just too nice :-)
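The same set difference can also be written with comm, which expects sorted input; a variant for reference:
# lines only in "all", i.e. folders with nothing updated recently
comm -23 <(sort all) <(sort upd)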
The diff command is made for this.
Given two files, "all":
# cat all
/dir1/old2
/dir2/old4
/dir2/old5
/dir1/new1
/dir2/old2
/dir2/old3
/dir1/old1
/dir1/old3
/dir1/new4
/dir2/new1
/dir2/old1
/dir1/new2
/dir2/new2
and "updated":
# cat updated
/dir2/new1
/dir1/new4
/dir2/new2
/dir1/new2
/dir1/new1
We can sort the files and run diff. For this task, I prefer sorting inline:
# diff <(sort all) <(sort updated)
4,6d3
< /dir1/old1
< /dir1/old2
< /dir1/old3
9,13d5
< /dir2/old1
< /dir2/old2
< /dir2/old3
< /dir2/old4
< /dir2/old5
If there are any files in "updated" that aren't in "all", they'll be prefixed with '>'.
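Tying this back to the original find pipeline, the two input lists can be generated like this (a sketch assuming GNU find for -printf, with /path/to/file as the root used above):
# every immediate subdirectory name
find /path/to/file -mindepth 1 -maxdepth 1 -type d -printf '%f\n' | sort > all
# names of subdirectories containing anything modified in the last 60 minutes
find /path/to/file -mindepth 1 -mmin -60 -printf '%P\n' | cut -d/ -f1 | sort -u > updated
# lines only in "all" are the untouched folders
diff all updated | grep '^<'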

Searching a particular string pattern out of 10000 files in parallel

Problem Statement:-
I need to search for a particular string pattern in around 10000 files and find the records in those files which contain it. I can use grep here, but it is taking a lot of time.
Below is the command I am using to search a particular string pattern after unzipping the dat.gz file
gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | grep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'
If I simply count how many files are there after unzipping the above dat.gz file
gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | wc -l
I get around 10000 files. I need to search for the above string pattern in all of these files and find the records that contain it. My command above works, but it is very slow.
What is the best approach here? Should I take 100 files at a time and search them in parallel?
Note:
I am running SunOS
bash-3.00$ uname -a
SunOS lvsaishdc3in0001 5.10 Generic_142901-02 i86pc i386 i86pc
Do NOT run this in parallel!!!! That's going to bounce the disk head all over the place; it will be much slower.
Since you are reading archive files, there's one way to get a substantial performance boost: don't write the results of the decompression out. The ideal answer is to decompress to a stream in memory; if that's not viable, decompress to a ramdisk.
In any case you do want some parallelism here: one thread should be obtaining the data and then handing it off to another that does the search. That way you will either be waiting on the disk or on the core doing the decompressing, and you won't waste any of that time doing the search.
(Note that in case of the ramdisk you will want to aggressively read the files it wrote and then kill them so the ramdisk doesn't fill up.)
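A rough sketch of the ramdisk variant on Linux (the mount point, size, and PATTERN are placeholders; mounting tmpfs needs root, and on Solaris /tmp is typically already tmpfs):
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=2g tmpfs /mnt/ramdisk

# decompress one archive at a time into RAM, search it, then delete it
# immediately so the ramdisk never fills up
for f in /data/newfolder/real-time-newdata/*_20120809_0_*.gz; do
    gzcat "$f" > /mnt/ramdisk/current.dat
    grep 'PATTERN' /mnt/ramdisk/current.dat
    rm -f /mnt/ramdisk/current.dat
done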
For starters, you will need to uncompress the file to disk.
This does work (in bash), but you probably don't want to try to start 10,000 processes all at once. Run it inside the uncompressed directory:
for i in `find . -type f`; do ((grep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1' "$i") &); done
So, we need to have a way to limit the number of spawned processes. This will loop as long as the number of grep processes running on the machine exceeds 10 (including the one doing the counting):
while [ `top -b -n1 | grep -c grep` -gt 10 ]; do echo true; done
I have run this, and it works.... but top takes so long to run that it effectively limits you to one grep per second. Can someone improve upon this, adding one to a count when a new process is started and decrementing by one when a process ends?
for i in `find . -type f`; do ((grep -l 'blah' $i)&); (while [ `top -b -n1 | grep -c grep` -gt 10 ]; do sleep 1; done); done
Any other ideas for how to determine when to sleep and when not to? Sorry for the partial solution, but I hope someone has the other bit you need.
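If GNU xargs is available on your system, its -P option is a much simpler way to cap the number of concurrent greps than polling top; a sketch (PATTERN is a placeholder for your search string):
# at most 4 greps at a time, up to 100 files per invocation
find . -type f -print0 | xargs -0 -P 4 -n 100 grep -l 'PATTERN'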
If you are not using regular expressions you can use the -F option of grep or use fgrep. This may provide you with additional performance.
Your gzcat .... | wc -l does not indicate 10000 files, it indicates 10000 lines total for however many files there are.
This is the type of problem that xargs exists for. Assuming your version of gzip came with a script called gzgrep (or maybe just zgrep), you can do this (substituting your actual search string for 'PATTERN'):
find /data/newfolder/real-time-newdata -type f -name "*_20120809_0_*.gz" -print | xargs gzgrep 'PATTERN'
That will run one gzgrep command with batches of as many individual files as will fit on a command line (there are options to xargs to limit how many, or to do a number of other things). Unfortunately, gzgrep still has to uncompress each file and pass it off to grep, but there's not really any good way to avoid having to uncompress the whole corpus in order to search through it. Using xargs this way does, however, cut down on the overall number of new processes that need to be spawned.
