Illogical number priority in file names in BASH [duplicate] - linux

This question already has answers here:
How to loop over files in natural order in Bash?
(7 answers)
Closed 1 year ago.
It so happens that I wrote a script in BASH, part of which is supposed to take files from a specified directory in numerical order. Obviously, the files in that directory are named as follows: 1, 2, 3, 4, 5, etc. The thing is, I discovered that while running this script with 10 files in the directory, something that appears quite illogical to me occurs: the script takes the files in a strange order: 10, 1, 2, 3, etc.
How do I make it run from the smallest numeric file name to the largest?
Also, I am using the following line of code to define loop and path:
for file in /dir/*
Don't know if it matters, but I'm using Fedora 33 as OS.

Directory entries are sorted in alphabetical order, so "10" comes before "2".
If I list 20 files whose names are the first 20 integers, I get:
1 10 11 12 13 14 15 16 17 18 19 2 20 3 4 5 6 7 8 9
I can use sort -n to sort them numerically rather than alphabetically. The following command:
for i in $(ls | sort -n) ; do echo $i ; done
produces the following output:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
i.e. your command:
for file in /dir/*
should be rewritten so that the names, not the full paths, are sorted numerically:
for file in $(ls /dir/ | sort -n)

If you have GNU sort then use the -V flag.
for file in /dir/* ; do echo "$file" ; done | sort -V
Or store the data in an array.
files=(/dir/*); printf '%s\n' "${files[@]}" | sort -V
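If the loop body itself needs the sorted order (not just sorted output), a small sketch that reads the sorted list back into an array, assuming the names contain no newlines:
# Sort the glob results with GNU sort -V and read them back into an array.
mapfile -t sorted < <(printf '%s\n' /dir/* | sort -V)
for file in "${sorted[@]}"; do
    echo "$file"
done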

As an aside, if you have the option and doing the work once ahead of time is preferable to sorting every time, you could also format the names of your directories with leading zeroes. This is frequently a better design when possible.
I made both for some comparisons.
$: echo [0-9][0-9]/ # perfect list based on default string sort
00/ 01/ 02/ 03/ 04/ 05/ 06/ 07/ 08/ 09/ 10/ 11/ 12/ 13/ 14/ 15/ 16/ 17/ 18/ 19/ 20/
That also filters out any non-numeric names, and any non-directories.
$: for d in [0-9][0-9]/; do echo "${d%/}"; done
00
01
02
03
04
05
06
07
08
09
10
11
12
13
14
15
16
17
18
19
20
If I show both the single- and double-digit versions (I made both):
$: shopt -s extglob
$: echo @(?|??)
0 00 01 02 03 04 05 06 07 08 09 1 10 11 12 13 14 15 16 17 18 19 2 20 3 4 5 6 7 8 9
Only the single-digit versions without leading zeroes get out of order.
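If the names are created up front, a minimal sketch of generating zero-padded names (the width of 2 is an assumption matching the listing above):
# Bash 4+ zero-pads brace ranges when an endpoint has leading zeros:
mkdir -p {00..20}
# Equivalent with printf, useful when the width comes from a variable:
width=2
for n in {0..20}; do
    mkdir -p "$(printf "%0${width}d" "$n")"
done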

The shell sorts the names by the locale order (not necessarily the byte value) of each individual character. Anything that starts with 1 will go before anything that starts with 2, and so on.
There are two main ways to tackle your problem:
sort -n (numeric sort) the file list, and iterate that.
Rename or recreate the target files (if you can), so all numbers are the same length (in bytes/characters). Left pad shorter numbers with 0 (eg. 01). Then they'll expand like you want.
Using sort (properly):
mapfile -td '' myfiles < <(printf '%s\0' * | sort -zn)
for file in "${myfiles[@]}"; do
    # what you were going to do
done
sort -z for zero/null terminated lines is common but not POSIX. It makes processing paths/data that contain newlines safe. Without -z:
mapfile -t myfiles < <(printf '%s\n' * | sort -n)
# Rest is the same.
Rename the target files:
#!/bin/bash
cd /path/to/the/number/files || exit 1
# Gets length of the highest number. Or you can just hardcode it.
length=$(printf '%s\n' * | sort -n | tail -n 1)
length=${#length}
for i in *; do
mv -n "$i" "$(printf "%.${length}d" "$i")"
done
Examples for making new files with zero padded numbers for names:
touch {000..100} # Or
for i in {000..100}; do
> "$i"
done
If it's your script that made the target files, something like $(printf %.Nd [file]) can be used to left pad the names before you write to them. But you need to know the length in characters of the highest number first (N).
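A minimal sketch of that idea, assuming the script knows the highest number it will ever create (100 here, so N is 3):
# Width is the number of digits in the highest name we will create.
max=100
width=${#max}
for ((i = 1; i <= max; i++)); do
    name=$(printf "%.${width}d" "$i")   # 001, 002, ... 100
    : > "$name"                         # create/truncate the file
done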

Related

awk: get log data part by part

the log file is
Oct 01 [time] a
Oct 02 [time] b
Oct 03 [time] c
.
.
.
Oct 04 [time] d
Oct 05 [time] e
Oct 06 [time] f
.
.
.
Oct 28 [time] g
Oct 29 [time] h
Oct 30 [time] i
and it is really big (millions of lines).
I want to get the logs between Oct 01 and Oct 30.
I can do it with gawk
gawk 'some conditions' filter.log
and it works correctly, but it returns millions of log lines, which is not good because I want to fetch them part by part, something like this:
gawk 'some conditions' -limit 100 -offset 200 filter.log
so that every time I change limit and offset I get another part of the output.
How can I do that?
awk solution
I would harness GNU AWK for this task in the following way. Let file.txt content be
1
2
3
4
5
6
7
8
9
and say I want to print the lines whose 1st field is odd, in the part starting at the 3rd line and ending at the 7th line (inclusive); then I can use GNU AWK the following way
awk 'NR<3{next}$1%2{print}NR>=7{exit}' file.txt
which will give
3
5
7
Explanation: NR is a built-in variable which holds the number of the current row. When processing rows before the 3rd, just go to the next row without doing anything; when the remainder of dividing the 1st field by 2 is non-zero, print the line; when processing the 7th or a later row, exit. Using exit might give a noticeable boost in performance if you are processing a relatively small part of the file. Observe the order of the 3 pattern-action pairs in the code above: next comes first, then whatever you want to do, and exit comes last. If you want to know more about NR, read 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
(tested in GNU Awk 5.0.1)
linux solution
If you prefer working with an offset and a limit, then you might exploit a tail-head combination, e.g. for the above file.txt
tail -n +5 file.txt | head -3
gives output
5
6
7
Observe that the offset comes first, with + before its value (start at that line), and then the limit, with - before its value (number of lines to keep).
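The two approaches can also be combined. A minimal sketch, where 'some conditions' is still the question's placeholder, that filters first and then pages through the matches with offset 200 and limit 100:
# Filter first, then page: skip the first 200 matching lines, keep the next 100.
gawk 'some conditions' filter.log | tail -n +201 | head -n 100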
Using OP's pseudo code mixed with some actual awk code:
gawk -v limit=100 -v offset=200 '
some conditions { matches++                    # track number of matches
    if (matches >= offset && limit > 0) {
        print                                  # print current line
        limit--                                # decrement limit
    }
    if (limit == 0) exit                       # optional: abort processing once "limit" matches have been printed
}
' filter.log

arithmetic operation on text file column

I have four columns in my text file. I need to take the average of the fourth column grouped by the second column and save the output in another file, which will contain only two columns with the average results. Kindly help
ABC DEF IGK LMN
21 56700 001000 -98.3
24 56700 002000 -96.3
6 56700 003000 -93.8
9 56700 004000 -47.3
21 56700 005000 -58.3
36 56700 006000 -78.3
21 56701 001000 -98.3
28 56701 002000 -98.3
21 56701 003000 -99.3
20 56701 004000 -58.3
21 56701 005000 -99.3
10 56701 006000 -98.3
2 56701 007000 -87.3
2 56701 008000 -57.3
21 56702 001000 -63.3
1 56702 002000 -67.3
17 56702 003000 -47.3
21 56702 004000 -73.3
13 56702 005000 -60.3
10 56702 006000 -90.3
14 56702 007000 -77.3
11 56702 008000 -97.3
10 56702 009000 -98.3
13 56702 010000 -87.3
17 56702 011000 -77.3
11 56702 012000 -68.3
Expected output:
DEF Average of LMN
56700 -78.71666667
56701 -87.05
56702 -75.63333333
I can get the overall average of the 4th column in one go by using:
awk '{total+= $4} END {print total/NR}' inputfilename.txt
but I need to apply a condition.
Use two arrays: one for the sums, and one for counting how many numbers were added to them. At the end of the file, print the DEFs and the corresponding averages.
awk 'NR>1{count[$2]++;total[$2]+=$4} END{for(key in count) print key, total[key]/count[key]}' file
Note: NR>1 is for excluding the header line; if the actual input doesn't have a header line, simply drop it.
Given your sample its output looks like:
56700 -74.8
56701 -87.05
56702 -75.6333
Then you can sort the output using sort if it's necessary.
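For example, a sketch that sorts the grouped output numerically on the key and writes it with the header from the question (the output file name output.txt is an assumption):
# Group, average, sort numerically on the DEF key, and prepend a header.
{
    echo "DEF Average_of_LMN"
    awk 'NR>1{count[$2]++; total[$2]+=$4} END{for (k in count) printf "%s %.8f\n", k, total[k]/count[k]}' inputfilename.txt | sort -n
} > output.txt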
You may also consider using a more powerful language, especially when you need to do fancier stuff.
E.g. python
DEF_map = {}
with open('in.txt') as file:
    for line in file.readlines()[1:]:
        s = line.split()
        if s[1] not in DEF_map:
            DEF_map[s[1]] = []
        DEF_map[s[1]].append(float(s[3]))
print("DEF Average of LMN")
for DEF, LMN_list in DEF_map.items():
    print("{}\t{}".format(DEF, sum(LMN_list)/len(LMN_list)))
Because your original tags include bash, here is an example with bash and the bc tool (not a one-liner, but sometimes helpful for learning bash):
# only if needed in a short variable; later you can test whether it exists, is readable, ...
in=/path/to/your/testfile.txt
# we build a loop over your keys, possible
#  - for fixed-length lines and a fixed byte position:
#      cut -b 5-10
#  - for variable-width columns with one (or more) spaces as delimiter:
#      sed -e 's/  */ /g' | cut -d ' ' -f 2
for key in $(cat $in | cut -b 5-10 | sort -u) ; do
    # initialize counters for the sum and the number of elements per key
    s=0; a=0
    # grep all relevant data for this key from the input file
    # (depending on your data you can grep on bytes; here: 4 arbitrary characters
    # from the start of line, then your key in bytes 5-10)
    for x in $(grep -E "^.{4}${key}" $in | sed -e 's/  */ /g' | cut -d' ' -f4) ; do
        # add to the sum and add 1 to the number of entries
        s=$(echo "$s+$x" | bc --mathlib)
        ((a++))
    done
    # now print your key (as integer) and the avg (as float with 6 decimals)
    printf "%i %.6f\n" $key $(echo "$s/$a" | bc --mathlib)
done
bc used with the parameter --mathlib works with a scale of 20. If needed, you can use a higher scale and reduce the decimals only when printing the result.
This solution with two loops (one over the keys and another per key) is only acceptable if the number of lines in the input file is not too large (I would not use this example for millions of lines), but it is more readable than some one-liners (especially for beginners).
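As a small illustration of the scale, using the sum and count of the first key in the sample data:
# --mathlib implies scale=20; an explicit scale can be set instead.
echo "-472.3/6" | bc --mathlib     # -78.71666666666666666666
echo "scale=6; -472.3/6" | bc      # -78.716666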

print the count of files for each sub folder iterativly

I have the following folder structure:
A/B/C/D/E/00
A/B/C/D/E/01
.
.
A/B/C/D/E/23
Similarly,
M/N/O/P/Q/00
M/N/O/P/Q/01
.
.
M/N/O/P/Q/23
Now, each folder from 00 to 23 has many files inside, which I would like to count.
If I run this simple command:
ls /A/B/C/D/E/00 | wc -l
I can get the count of files in each of these sub directories. I want to automate this or get it iteratively.
Also, the final output I am looking at is a file that should look like this:
C E RESULT OF ls /A/B/C/D/E/00 | wc -l RESULT OF ls /A/B/C/D/E/01 | wc -l
M Q RESULT OF ls /M/N/O/P/Q/00 | wc -l RESULT OF ls /M/N/O/P/Q/01 | wc -l
So, the output should look like this finally
C E 23 23 4 6 7 4 76 98 57 2 67 9 12 34 67 0 2 3 78 98 12 3 57 213
M Q 12 10 2 34 32 1 35 65 87 8 32 2 65 87 98 0 4 12 1 35 34 76 9 67
Please note, the values after the alphabets are the values of file counts of the 24 folders 00, 01 through 23.
Using the eval approach: I can hardcode and get the exact results. But, I wanted it in a way that would show me the data for the previous day. So this is what I did:
d=`date --date="1 days ago" +%Y%m%d`
month=`date +%Y%m`
eval echo YZ $d '"$(ls "/A/B/YZ/$month/$d/"'{20150800..20150823})'| wc -l)"'
This works perfectly because in the given location there are files inside child directories 20150800,20150801..20150823. However when I try to generalize this like below, it shows no such file or directory:
eval echo YZ $d '"$(ls "/A/B/YZ/$month/$d/"'{"$d"00.."$d"23})'| wc -l)"'
Is there something I am missing in the above line?
A very safe way of counting files:
find . -mindepth 1 -exec printf x \; | wc -c
To not count recursively add -maxdepth 1 before -exec.
Some other notes:
eval is evil. Don't use it. There is only one place I've ever seen where it's appropriate, and that's when using getopt.
You should not parse the output of ls.
Use $() for command substitutions.
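Applied to the layout in the question, a minimal sketch (the exact paths and the output format are assumptions):
# Count the entries directly inside each hourly sub-folder (find also sees dot files).
for hour in /A/B/C/D/E/[0-2][0-9]; do
    count=$(find "$hour" -mindepth 1 -maxdepth 1 -exec printf x \; | wc -c)
    printf '%s %s\n' "$hour" "$count"
done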

Print the count of files in a specific format iteratively in shell script

I have the following folder structure:
A/B/C/D/E/00
A/B/C/D/E/01
.
.
A/B/C/D/E/23
Similarly,
M/N/O/P/Q/00
M/N/O/P/Q/01
.
.
M/N/O/P/Q/23
Now, each folder from 00 to 23 has many files inside, which I would like to count.
If I run this simple command:
ls /A/B/C/D/E/00 | wc -l
I can get the count of files in each of these sub directories. I want to automate this or get it iteratively. Can anyone suggest a way?
Also, the final output I am looking at is a file that should look like this:
C E RESULT OF ls /A/B/C/D/E/00 | wc -l RESULT OF ls /A/B/C/D/E/01 | wc -l
M Q RESULT OF ls /M/N/O/P/Q/00 | wc -l RESULT OF ls /M/N/O/P/Q/01 | wc -l
So, the output should look like this finally
C E 23 23 4 6 7 4 76 98 57 2 67 9 12 34 67 0 2 3 78 98 12 3 57 213
M Q 12 10 2 34 32 1 35 65 87 8 32 2 65 87 98 0 4 12 1 35 34 76 9 67
Please note, the values after the alphabets are the values of file counts of the 24 folders 00, 01 through 23.
Using the eval approach: I can hardcode and get the exact results. But, I wanted it in a way that would show me the data for the previous day. So this is what I did:
d=`date --date="1 days ago" +%Y%m%d`
month=`date +%Y%m`
eval echo YZ $d '"$(ls "/A/B/YZ/$month/$d/"'{20150800..20150823})'| wc -l)"'
This works perfectly because in the given location there are files inside child directories 20150800,20150801..20150823. However when I try to generalize this like below, it gives me the total count of the folder instead of the count of each sub folder:
eval echo YZ $d '"$(ls "/A/B/YZ/$month/$d/"'{"$d"00.."$d"23})'| wc -l)"'
Something like this (not tested):
for d in [A-Z]/[A-Z]/[A-Z]/[A-Z]/[A-Z]/[0-9][0-9]
do
    [[ -d $d ]] && echo $d : $(ls $d|wc -l)
done
Note that this gives an incorrect line count if one of the file names contains a newline character.
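To get closer to the one-row-per-parent format asked for, a minimal sketch (the parent paths are assumptions from the question, and ls | wc -l miscounts names containing newlines):
# One output row per parent: the parent path followed by the 24 hourly counts.
for parent in /A/B/C/D/E /M/N/O/P/Q; do
    counts=""
    for d in "$parent"/[0-2][0-9]; do
        counts+=" $(ls "$d" | wc -l)"
    done
    echo "$parent$counts"
done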

Shellscript: Is it possible to format seq to display n numbers per line?

Is it possible to format seq in a way that it will display the range desired but with N numbers per line?
Let say that I want seq 20 but with the following output:
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
My next guess would be a nested loop but I'm not sure how...
Any help would be appreciated :)
You can use awk to format it as per your needs.
$ seq 20 | awk '{ORS=NR%5?FS:RS}1'
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
ORS is awk's built-in variable which stands for Output Record Separator and has a default value of \n. NR is awk's built-in variable which holds the line number. FS is a built-in variable that stands for Field Separator and has the default value of space. RS is a built-in variable that stands for Record Separator and has the default value of \n.
Our action is a ternary operator that checks whether NR%5 is true. When NR%5 is not 0 (hence true) it uses FS as the Output Record Separator; when it is false it uses RS, which is a newline, as the Output Record Separator.
The 1 at the end triggers awk's default action, which is to print the line.
You can use xargs to limit the sequence displayed per line.
$ seq 20 | xargs -n 5
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
The parameter -n 5 tells xargs to put at most 5 arguments on each line.
If you have bash you can use the built-in brace expansion instead of seq.
echo {1..20} | xargs -n 5
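paste can do the same grouping; a small sketch (one - per output column, so five of them here):
# paste reads one line per "-" and joins each group of five with a space.
seq 20 | paste -d' ' - - - - -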
Using Bash:
while read num; do
    ((num % 5)) && printf "$num " || echo "$num"
done < <(seq 20)
Or:
for i in {1..20}; do
    s+="$i "
    if ! ((i % 5)); then
        echo $s
        s=""
    fi
done
