Linux: How to get files with a name that ends with a number greater than 10

I have some files that have names like this: 'abcdefg_y_zz.jpg'
The 'abcdefg' is a sequence of digits, while the 'y' and 'zz' are letters.
I need to get all the files that have the sequence of digits ending with a number greater than 10. That is, the files where 'fg' is greater than 10.
Does anyone have an idea on how to do that in a bash script?

Ok, technically, based on all your info...
ls | grep -E '[0-9]{5}[1-9][0-9]_[[:alpha:]]_[[:alpha:]]{2}\.jpg'

How about this? Just exclude the ones that have 0 in position f. Note that grep takes a regular expression, so . stands for any single character:
ls -1 | grep -v '^.....0._._..\.jpg$'
Update
Since you want > 10 and not >= 10, you'll need to exclude 10 too. So do this:
ls -1 | grep -v '^.....0._._..\.jpg$' | grep -v '^.....10_._..\.jpg$'

with more scripting:
#!/bin/bash
for i in $(seq FIRST INCREMENT LAST)
do
    cp "abcde${i}_y_zz.jpg" /your_new_dir    # or whatever you want to do with those files
done
so in your example the line with seq would be
for i in $(seq 11 1 100000000)

If the filenames are uniformly named, this awk solution works:
ls | awk 'BEGIN { FIELDWIDTHS = "5 2" } $2 > 10'
Explanation
FIELDWIDTHS = "5 2" means that $1 will refer to the first 5 characters and $2 the next 2.
$2 > 10 matches when field 2 is greater than 10 and implicitly invokes the default code block, i.e. '{ print }'
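For instance, a quick check with made-up file names (FIELDWIDTHS is a GNU awk extension, so this sketch assumes gawk):
printf '%s\n' 0000107_y_zz.jpg 0001310_y_zz.jpg 0001311_y_zz.jpg | gawk 'BEGIN { FIELDWIDTHS = "5 2" } $2 > 10'
# prints only 0001311_y_zz.jpg; 07 and 10 are not greater than 10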

Just one process:
ls ?????+(1[1-9]|[2-9]?)_?_??.jpg
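Note that +( ) is ksh-style extended globbing; if you run this in bash it only works after enabling extglob, something like:
shopt -s extglob    # bash: enable the +( ), @( ), !( ) glob operators
ls ?????+(1[1-9]|[2-9]?)_?_??.jpg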

All the solutions provided so far are fine, but anybody who's had some experience with shell programming knows that parsing ls is never a good idea and must be avoided. This actually doesn't apply in this case, where we can assume that the names of the files follow a certain pattern, but it's a rule that should be remembered. More explanation here.
What you want can be achieved much more safely with GNU find - assuming that you run the command in the directory where the files are, it would look something like this:
find . -regextype posix-egrep -regex '\./[0-9]{5}[1-9][0-9]_[[:alpha:]]_[[:alpha:]]{2}\.jpg$'
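If you also want to act on the matches rather than just list them, find can do that in the same pass; a rough sketch, assuming GNU find and cp, with dest/ standing in for wherever the files should go:
find . -regextype posix-egrep \
    -regex '\./[0-9]{5}[1-9][0-9]_[[:alpha:]]_[[:alpha:]]{2}\.jpg' \
    -exec cp -t dest/ {} +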

Related

Make grep exactly match strings with and without a dash "-"

The problem looks simple and common, so I've looked through many answers, but it seems that none of them provides an appropriate general solution.
I need to grep a large tab-separated 6-column file (a *.bed file, in fact) to split it by the content of the first column, using a list of string variables (items). I just need the rows starting with a given string.
I was successfully using
grep -w "$name" inputfile
$name is read from the list of strings
for that purpose, until I hit the case where the strings have the following format (for example): YAL038W but also YAL038W-A, YAL038W-B, ...
So grep with the -w option considers YAL038W identical to YAL038W-A and YAL038W-B, since "-" is a word separator. It would work with "_" but not with "-".
I've found solutions based on awk which are working fine, for example:
awk -F $'\t' -vsearch=$name '$1==search' inputfile
but awk is terribly slow, over 10 times slower; see the time measurements below.
For a 2.5 GB input file and >5000 items to look for, the script has already been running for >24 hours!
Example of inputfile:
YAL038W-A 0 48 HWI-1KL176:101:CC27NACXX:3:2208:17646:92047 0 +
YAL038W-A 0 48 HWI-1KL176:101:CC27NACXX:3:2211:17326:31268 0 +
YAL038W 1 50 HWI-1KL176:101:CC27NACXX:8:1205:16311:19319 3 +
YAL038W 1 27 HWI-1KL176:101:CC27NACXX:8:2103:4951:94527 42 +
time grep -w "YAL038W" inputfile > testfile.txt
real 0m3.569s
time awk -F $'\t' -vsearch="YAL038W" '$1==search' inputfile > testfile.txt
real 0m29.521s
I am looking for a FAST solution using grep or something else, and I need to pass the variable to this command in a loop.
An alternative is to modify the input file by replacing "-" with "_", but that is a last resort, I believe...
Thanks in advance
I've found solutions based on awk which are working fine, for example:
awk -F $'\t' -vsearch=$name '$1==search' inputfile
but awk is terribly slow…
I am looking for FAST solution using grep …
If the above awk command worked for you, then this will do:
grep "^$name"$'\t' inputfile
Just search at the beginning of each line for the name followed by a TAB.
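If looping over >5000 names is still too slow, one further option (just a sketch, assuming the names sit one per line in a hypothetical names.txt and contain no regex metacharacters) is to build all the anchored patterns once and let grep make a single pass over the big file:
awk '{ printf "^%s\t\n", $0 }' names.txt > patterns.txt    # anchor each name and append a literal TAB
grep -f patterns.txt inputfile > matched.bed               # one pass instead of one grep per name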

How to filter lines that contain at least one "x" in the first nine characters

For example, if we have a file txt as below:
drwxr-sr-x 7 abcdefgetdf
drwxr-sr-x 7 abcdef123123sa
drwxr-sr-- 7 abcdefgetdf
drwxr-sr-- 7 abcdeadfvcxvxcvx
drwxr-sr-x 7 abcdef123ewlld
To answer the question in the title strictly:
awk 'substr($0, 1, 9) ~ /x/' txt
Though if you're interested in files with at least one execute permission bit set, then perhaps find -perm /0111 would be something to look into.
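For example, something along these lines (the /mode permission syntax is GNU find, so treat it as a sketch):
find . -maxdepth 1 -type f -perm /0111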
awk solution:
ls -l | awk '$1~/x/'
$1~/x/ - regular expression match; it accepts only lines whose 1st field contains the character x.
I couldn't figure out a one-liner that just checked the first 9 (or 10) characters for 'x'. But this little script does the trick. It uses "ls -l" as input instead of a file, but it's trivial to pass in a file to filter instead.
#!/bin/bash
ls -l | while IFS= read -r line
do
    perms="${line:0:10}"
    if [[ "${perms}" =~ "x" ]]; then
        echo "${line}"
    fi
done
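For what it's worth, a one-liner that looks only at the first nine characters is possible with a bounded regex; a sketch using grep -E:
ls -l | grep -E '^.{0,8}x'    # an x preceded by 0-8 characters, i.e. somewhere in positions 1-9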

How to tell how many files match a description with * in unix

Pretty simple question: say I have a set of files:
a1.txt
a2.txt
a3.txt
b1.txt
And I use the following command:
ls a*.txt
It will return:
a1.txt a2.txt a3.txt
Is there a way in a bash script to tell how many results will be returned when using the * pattern? In the above example, if I were to use a*.txt the answer should be 3, and if I used *1.txt the answer should be 2.
Comment on using ls:
I see all the other answers attempt this by parsing the output of
ls. This is very unpredictable because this breaks when you have
file names with "unusual characters" (e.g. spaces).
Another pitfall is that it is ls-implementation dependent: a
particular implementation might format its output differently.
There is a very nice discussion on the pitfalls of parsing ls output on the bash wiki maintained by Greg Wooledge.
Solution using bash arrays
For the above reasons, using bash syntax would be the more reliable option. You can use a glob to populate a bash array with all the matching file names. Then you can ask bash the length of the array to get the number of matches. The following snippet should work.
files=(a*.txt) && echo "${#files[@]}"
To save the number of matches in a variable, you can do:
files=(a*.txt)
count="${#files[@]}"
One more advantage of this method is you now also have the matching files in an array which you can iterate over.
Note: Although I keep repeating bash syntax above, I believe the above solution applies to all sh-family of shells.
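One caveat: if nothing matches, the unexpanded glob itself ends up in the array and the count comes out as 1. In bash you can guard against that with nullglob; a small sketch:
shopt -s nullglob        # bash-specific: unmatched globs expand to nothing
files=(a*.txt)
count="${#files[@]}"     # 0 when there are no matching files
shopt -u nullglob        # restore the default if later code relies on it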
You can't know ahead of time, but you can count how many results are returned. I.e.
ls -l *.txt | wc -l
ls -l will display the directory entries matching the specified wildcard, wc -l will give you the count.
You can save the value of this command in a shell variable with either
num=$(ls -l *.txt | wc -l)
or
num=`ls -l *.txt | wc -l`
and then use $num to access it. The first form is preferred.
You can use ls in combination with wc:
ls a*.txt | wc -l
The ls command lists the matching files one per line, and wc -l counts the number of lines.
I like suvayu's answer, but there's no need to use an array:
count() { echo $#; }
count *
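Applied to the question's examples it would look like this (with the same unmatched-glob caveat as the array answer above):
count a*.txt    # 3
count *1.txt    # 2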
In order to count files that might have unpredictable names, e.g. containing new-lines, non-printable characters etc., I would use the -print0 option of find and awk with RS='\0':
num=$(find . -mindepth 1 -maxdepth 1 -print0 | awk -v RS='\0' 'END { print NR }')
(-mindepth 1 keeps the directory . itself out of the count.)
Adjust the options to find to refine the count, e.g. if the criterion is files starting with a lower-case a with a .txt extension in the current directory, use:
find . -maxdepth 1 -type f -name 'a*.txt' -print0
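Put together, the refined count might look something like this (same GNU find and NUL-record awk assumptions as above):
num=$(find . -maxdepth 1 -type f -name 'a*.txt' -print0 | awk -v RS='\0' 'END { print NR }')
echo "$num"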

Tail inverse / printing everything except the last n lines?

Is there a (POSIX command line) way to print all of a file EXCEPT the last n lines? Use case being, I will have multiple files of unknown size, all of which contain a boilerplate footer of a known size, which I want to remove. I was wondering if there is already a utility that does this before writing it myself.
Most versions of head(1) - GNU derived, in particular, but not BSD derived - have a feature to do this. It will show the top of the file except the end if you use a negative number for the number of lines to print.
Like so:
head -n -10 textfile
Probably less efficient than the "wc" + "do the math" + "tail" method, but easier to look at:
tail -r file.txt | tail +NUM | tail -r
Where NUM is one more than the number of ending lines you want to remove, e.g. +11 will print all but the last 10 lines. This works on BSD which does not support the head -n -NUM syntax.
The head utility is your friend.
From the man page of head:
-n, --lines=[-]K
print the first K lines instead of the first 10;
with the leading `-', print all but the last K lines of each file
There's no standard command to do that, but you can use awk or sed to fill a buffer of N lines, and print from the head once it's full. E.g. with awk:
awk -v n=5 '{if(NR>n) print a[NR%n]; a[NR%n]=$0}' file
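A quick sanity check of the buffering idea on a made-up 8-line input:
seq 8 | awk -v n=5 '{if(NR>n) print a[NR%n]; a[NR%n]=$0}'    # prints 1, 2, 3 - everything except the last 5 lines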
cat <filename> | head -n -10 # Everything except last 10 lines of a file
cat <filename> | tail -n +11 # Everything except 1st 10 lines of a file
If the footer starts with a consistent line that doesn't appear elsewhere, you can use sed:
sed '/FIRST_LINE_OF_FOOTER/q' filename
That prints the first line of the footer; if you want to avoid that:
sed -n '/FIRST_LINE_OF_FOOTER/q;p' filename
This could be more robust than counting lines if the size of the footer changes in the future. (Or it could be less robust if the first line changes.)
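As a quick illustration (the footer marker and sample.txt are made up here):
printf 'body line 1\nbody line 2\n-- FOOTER --\nlegal boilerplate\n' > sample.txt
sed -n '/^-- FOOTER --$/q;p' sample.txt    # prints only the two body lines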
Another option, if your system's head command doesn't support head -n -10, is to precompute the number of lines you want to show. The following depends on bash-specific syntax:
lines=$(wc -l < filename) ; (( lines -= 10 )) ; head -$lines filename
Note that the head -NUMBER syntax is supported by some versions of head for backward compatibility; POSIX only permits the head -n NUMBER form. POSIX also only permits the argument to -n to be a positive decimal integer; head -n 0 isn't necessarily a no-op.
A POSIX-compliant solution is:
lines=$(wc -l < filename) ; lines=$(($lines - 10)) ; head -n $lines filename
If you need to deal with ancient pre-POSIX shells, you might consider this:
lines=`wc -l < filename` ; lines=`expr $lines - 10` ; head -n $lines filename
Any of these might do odd things if a file is 10 or fewer lines long.
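If that matters, a small guard avoids the surprise; a sketch in the same bash style:
lines=$(wc -l < filename)
if [ "$lines" -gt 10 ]; then
    head -n $((lines - 10)) filename
fi
# files of 10 or fewer lines simply produce no output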
tac file.txt | tail +[n+1] | tac
This answer is similar to user9645's, but it avoids the tail -r command, which is also not a valid option on many systems. See, e.g., https://ubuntuforums.org/showthread.php?t=1346596&s=4246c451162feff4e519ef2f5cb1a45f&p=8444785#post8444785 for an example.
Note that the +1 (in the brackets) was needed on the system I tested on, but it may not be required on your system. So, to remove the last line, I had to put 2 in the brackets. This is probably related to the fact that the last line needs to end with regular line-feed character(s), which, arguably, makes the last line a blank line. If you don't do that, the tac command will combine the last two lines, so removing the "last" line (the first one seen by the tail command) will actually remove the last two lines.
My answer should also be the fastest solution of those listed to date for systems lacking the improved version of head. So, I think it is both the most robust and the fastest of all the answers listed.
head -n $(($(wc -l < Windows_Terminal.json) - 10)) Windows_Terminal.json
This will work on Linux and on macOS; keep in mind that Mac's head does not support a negative value, so this is quite handy.
N.B.: replace Windows_Terminal.json with your file name, and 10 with the number of trailing lines to drop.
It is simple: prefix the number with + to start printing from that line, skipping everything before it.
This example gives you all the lines except the first 9:
tail -n +10 inputfile
(yes, not the first 10, because +N means "start at line N"; if you want to skip the first 10, just type
tail -n +11 inputfile)

Egrep acts strange with -f option

I've got a strangely acting egrep -f.
Example:
$ egrep -f ~/tmp/tmpgrep2 orig_20_L_A_20090228.txt | wc -l
3
$ for lines in `cat ~/tmp/tmpgrep2` ; do egrep $lines orig_20_L_A_20090228.txt ; done | wc -l
12
Could someone give me a hint what could be the problem?
No, the files did not change between executions. The expected answer for the egrep line count is 12.
UPDATE on file contents: the searched file contains circa 13,000 lines, each of them 500 characters long; the pattern file contains 12 lines, each of them 24 characters long. The pattern always (and only) occurs at a fixed position in the searched file (columns 26-49).
UPDATE on pattern contents: every pattern in tmpgrep2 is a 24-character-long number.
If the search patterns are found on the same lines, then you can get the result you see:
Suppose you look for:
abc
def
ghi
jkl
and the data file is:
abcdefghijklmnoprstuvwxzy
then the one-time command will print 1 and the loop will print 4.
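A tiny reproduction of that effect with made-up files:
printf 'abc\ndef\nghi\njkl\n' > patterns.txt
printf 'abcdefghijklmnoprstuvwxzy\n' > data.txt
grep -c -f patterns.txt data.txt                                     # 1: the matching line is counted once
for p in $(cat patterns.txt); do grep "$p" data.txt; done | wc -l    # 4: the line is printed once per pattern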
Could it be that the lines read contain something that the shell is expanding/substituting for you, in the second version? Then that doesn't get done by grep when it reads the patterns itself, thus leading to a different set of patterns being matched.
I'm not totally sure if the shell is doing any expansion on the variable value in an invocation like that, but it's an idea at least.
EDIT: Nope, it doesn't seem to do any substitutions. But it could be a quoting issue: if your patterns contain whitespace, the for loop will step through each token, not through each line. Take a look at the read bash builtin.
Do you have any duplicates in ~/tmp/tmpgrep2? Egrep will only use the dupes one time, but your loop will use each occurrence.
Get rid of dupes by doing something like this:
$ for lines in `sort < ~/tmp/tmpgrep2 | uniq` ; do egrep $lines orig_20_L_A_20090228.txt ; done | wc -l
I second #unwind.
Why don't you run without wc -l and see what each search is finding?
And maybe:
for lines in `cat ~/tmp/tmpgrep2` ; do echo $lines ; done
Just to see how the shell is handling $lines.
The others have already come up with most of the things I would look at. The next thing I would check is the environment variable GREP_OPTIONS, or whatever it is called on your machine. I've gotten the strangest error messages or behaviors when using a command line argument that interfered with the environment settings.
