I'd like to be able to sort a file, but only from a certain line down. From the manual, sort isn't able to skip part of its input on its own, so I'll need a second utility to do this. read? Or awk possibly? Here's the file I'd like to be able to sort:
tar --exclude-from=$EXCLUDE_FILE --exclude=$BACKDEST/$PC-* \
-cvpzf $BACKDEST/$BACKUPNAME.tar.gz \
/etc/X11/xorg.conf \
/etc/X11/xorg.conf.1 \
/etc/fonts/conf.avail.1 \
/etc/fonts/conf.avail/60-liberation.conf \
So in this case, I'd like to begin sorting at line three. I'm thinking I'm going to have to do something like
cat backup.sh | while read LINE; do echo $LINE | sort; done
I'm pretty new to this, and the script looks like it's missing something. Also, I'm not sure how to begin at a certain line number.
Any ideas?
Something like this?
(head -n 2 backup.sh; tail -n +3 backup.sh | sort) > backup-sorted.sh
You may have to fix up the last line of the input... it probably doesn't have the trailing \ for the line continuation, so you might end up with a broken backup-sorted.sh if you just do the above.
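One way to handle that is a sketch like the following: strip any trailing backslashes before sorting, then re-add one to every sorted line except the last:
(head -n 2 backup.sh; tail -n +3 backup.sh | sed 's/ *\\$//' | sort | sed '$!s/$/ \\/') > backup-sorted.sh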
You might want to consider using tar's --files-from (or -T) option, and having the sorted list of files in a data file instead of the script itself.
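For example, a sketch assuming the file list is kept in a separate file such as files-to-backup.txt (the name is just for illustration):
sort -o files-to-backup.txt files-to-backup.txt
tar --exclude-from=$EXCLUDE_FILE --exclude=$BACKDEST/$PC-* \
-cvpzf $BACKDEST/$BACKUPNAME.tar.gz \
-T files-to-backup.txt
Then the list can be re-sorted at any time without touching the script, and the trailing-backslash problem goes away entirely.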
clumsy way:
len=$(wc -l < FILE)
sortable_len=$((len-2))
head -2 FILE > OUT
tail -$sortable_len FILE | sort >> OUT
I'm sure someone will post an elegant 1-liner shortly.
Sort the lines excluding the (2-line) header, just for viewing.
cat file.txt | awk '{if (NR < 3) print $0 > "/dev/stderr"; else print $0}' | sort
Sort the lines excluding the (2-line) header and send the output to another file.
Method #1:
cat file.txt | awk '{if (NR < 3) print $0 > "/dev/stderr"; else print $0}' 2> file_sorted.txt | sort >> file_sorted.txt
Method #2:
cat file.txt | awk '{if (NR < 3) print $0 > "file_sorted.txt"; else print $0}' | sort >> file_sorted.txt
You could try this:
(read line; echo "$line"; sort) < file.txt
It takes one line and echoes it, then sorts the rest. You can also:
cat file.txt | (read line; echo "$line"; sort)
For two lines, just repeat the read and echo:
(read line; echo "$line"; read line; echo "$line"; sort) < file.txt
Using awk:
awk '{ if ( NR > 2 ) { print $0 } }' file.txt | sort
NR is a built-in awk variable and contains the current record/line number. It starts at 1.
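For example:
$ printf 'a\nb\nc\n' | awk '{ print NR, $0 }'
1 a
2 b
3 c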
Extending Vigneswaran R's answer using awk:
using tty to get your current terminal's device file, print the first two lines directly to your terminal (no, it won't run the input) within awk and pipe the rest to sort.
tty
>/dev/pts/3
cat file.txt | awk '{if (NR < 3) print $0 > "/dev/pts/3"; else print $0}' | sort
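To avoid hard-coding the device path, you could let the shell capture it for you (a sketch of the same idea):
cat file.txt | awk -v tty="$(tty)" '{if (NR < 3) print $0 > tty; else print $0}' | sort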
I am writing a function in a Bash shell script that should return lines from CSV files with headers, where a line has more commas than the header. This can happen because some values inside these files can contain commas. For quality control, I must identify these lines to later clean them up. What I have currently:
#!/bin/bash
get_bad_lines () {
local correct_no_of_commas=$(head -n 1 $1/$1_0_0_0.csv | tr -cd , | wc -c)
local no_of_files=$(ls $1 | wc -l)
for i in $(seq 0 $(( ${no_of_files}-1 )))
do
# Check that the file exist
if [ ! -f "$1/$1_0_${i}_0.csv" ]; then
echo "File: $1_0_${i}_0.csv not found!"
continue
fi
# Search for error-lines inside the file and print them out
echo "$1_0_${i}_0.csv has over $correct_no_of_commas commas in the following lines:"
grep -o -n '[,]' "$1/$1_0_${i}_0.csv" | cut -d : -f 1 | uniq -c | awk '$1 > $correct_no_of_commas {print}'
done
}
get_bad_lines products
get_bad_lines users
The output of this program is currently all the comma counts with all of the line numbers in all the files,
and I suspect this is because the input $1 (the folder name, i.e. products and users) conflicts with the $1 in the call to awk (where I want to grab the first column, the comma count for that line in the current file in the loop).
Is this the issue? If so, could it be solved by referring to the first column or the folder name with different variable names instead of both of them using $1?
Example, current output:
5 6667
5 6668
5 6669
5 6670
(should only show lines for that file having more than 5 commas).
I tried a variable declaration in the call to awk as well, with the same effect
(as in the accepted answer to Awk field variable clash with function argument):
get_bad_lines () {
local table_name=$1
local correct_no_of_commas=$(head -n 1 $table_name/${table_name}_0_0_0.csv | tr -cd , | wc -c)
local no_of_files=$(ls $table_name | wc -l)
for i in $(seq 0 $(( ${no_of_files}-1 )))
do
# Check that the file exist
if [ ! -f "$table_name/${table_name}_0_${i}_0.csv" ]; then
echo "File: ${table_name}_0_${i}_0.csv not found!"
continue
fi
# Search for error-lines inside the file and print them out
echo "${table_name}_0_${i}_0.csv has over $correct_no_of_commas commas in the following lines:"
grep -o -n '[,]' "$table_name/${table_name}_0_${i}_0.csv" | cut -d : -f 1 | uniq -c | awk -v table_name="$table_name" '$1 > $correct_no_of_commas {print}'
done
}
You can do the whole thing in awk to achieve that:
get_bad_lines () {
find "$1" -maxdepth 1 -name "$1_0_*_0.csv" | while read -r my_file ; do
awk -v table_name="$1" '
NR==1 { num_comma=gsub(/,/, ""); }
/,/ { if (gsub(/,/, ",", $0) > num_comma) wrong_array[wrong++]=NR":"$0;}
END { if (wrong > 0) {
print(FILENAME" has over "num_comma" commas in the following lines:");
for (i=0;i<wrong;i++) { print(wrong_array[i]); }
}
}' "${my_file}"
done
}
As for why your original awk command failed to give only the lines with too many commas: you are using the shell variable correct_no_of_commas inside a single-quoted awk statement ('$1 > $correct_no_of_commas {print}'). The shell therefore performs no substitution, and awk reads $correct_no_of_commas as is. More precisely, awk looks for an awk variable named correct_no_of_commas, which is undefined in the awk script, so its value is the empty string. awk then evaluates $1 > $"" as the matching condition, and since $"" is equivalent to $0, awk ends up comparing the count in $1 with the full input line (which has the form <spaces><count><space><line_number>). That comparison is true for every line of the uniq -c output, so $1 > $correct_no_of_commas always matches.
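A minimal fix along those lines, reusing the names from the question, is to hand the count to awk explicitly with -v (sketch):
grep -o -n '[,]' "$table_name/${table_name}_0_${i}_0.csv" | cut -d : -f 1 | uniq -c | awk -v max="$correct_no_of_commas" '$1 > max {print}'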
You can identify all the bad lines with a single awk command
awk -F, 'FNR==1{print FILENAME; headerCount=NF;} NF>headerCount{print} ENDFILE{print "#######\n"}' /path/here/*.csv
If you want the line number also to be printed, use this
awk -F, 'FNR==1{print FILENAME"\nLine#\tLine"; headerCount=NF;} NF>headerCount{print FNR"\t"$0} ENDFILE{print "#######\n"}' /path/here/*.csv
I want to print the longest and shortest username found in /etc/passwd. If I run the code below, it works fine for the shortest (head -1), but nothing comes out for the longest (sort -n | tail -1 | awk '{print $2}'). Can anyone help me figure out what's wrong?
#!/bin/bash
grep -Eo '^([^:]+)' /etc/passwd |
while read NAME
do
echo ${#NAME} ${NAME}
done |
sort -n |head -1 | awk '{print $2}'
sort -n |tail -1 | awk '{print $2}'
Here is the issue:
The pipeline feeds only the first sort -n | head -1 | awk '{print $2}' command, so that command gets its input through the pipe and produces output.
The second command gets no input at all. It therefore waits for input on STDIN, i.e. the keyboard; you could type something there and press Ctrl+D to get output from it.
Please run the code as below to get the desired output:
#!/bin/bash
grep -Eo '^([^:]+)' /etc/passwd |
while read NAME
do
echo ${#NAME} ${NAME}
done |
sort -n |head -1 | awk '{print $2}'
grep -Eo '^([^:]+)' /etc/passwd |
while read NAME
do
echo ${#NAME} ${NAME}
done |
sort -n |tail -1 | awk '{print $2}'
All you need is:
$ awk -F: '
NR==1 { min=max=$1 }
length($1) > length(max) { max=$1 }
length($1) < length(min) { min=$1 }
END { print min ORS max }
' /etc/passwd
No explicit loops or pipelines or multiple commands required.
The problem is that you have two pipelines when you really need one. You have grep | while read do ... done | sort | head | awk and then sort | tail | awk: the first sort has an input (the while loop); the second sort doesn't. So the script hangs because your second sort has no input: or rather it does, but it's STDIN (the keyboard).
There's various ways to resolve:
save the output of the while loop to a temporary file and use that as an input to both sort commands
repeat your while loop
use awk to do both the head and tail
The first two involve iterating over the password file twice, which may be okay - depends what you're ultimately trying to do. But using a small awk script, this can give you both the first and last line by way of the BEGIN and END blocks.
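For example, a sketch that reuses the loop from the question and sorts once (it uses NR==1 rather than BEGIN, because the first line has not been read yet when BEGIN runs):
grep -Eo '^([^:]+)' /etc/passwd |
while read NAME
do
echo ${#NAME} ${NAME}
done |
sort -n | awk 'NR==1 { print "shortest:", $2 } END { print "longest:", $2 }'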
While you already have good answers, you can also accomplish your goal in POSIX shell without any pipe at all, using the parameter expansion and string length provided by the shell itself (see: POSIX shell specification). For example, you could do the following:
#!/bin/sh
sl=32;ll=0;sn=;ln=; ## short len, long len, short name, long name
while read -r line; do ## read each line
u=${line%%:*} ## get user
len=${#u} ## get length
[ "$len" -lt "$sl" ] && { sl="$len"; sn="$u"; } ## if shorter, save len, name
[ "$len" -gt "$ll" ] && { ll="$len"; ln="$u"; } ## if longer, save len, name
done </etc/passwd
printf "shortest (%2d): %s\nlongest (%2d): %s\n" $sl "$sn" $ll "$ln"
Example Use/Output
$ sh cketcpw.sh
shortest ( 2): at
longest (17): systemd-bus-proxy
Using either pipe/head/tail/awk or the shell itself is fine. It's good to have alternatives.
(note: if you have multiple users of the same length, this just picks the first, you can use a temp file if you want to save all names and use -le and -ge for the comparison.)
If you want both the head and the tail from the same input, you may want something like sed -e 1b -e '$!d' after you sort the data to get the top and bottom lines using sed.
So your script would be:
#!/bin/bash
grep -Eo '^([^:]+)' /etc/passwd |
while read NAME
do
echo ${#NAME} ${NAME}
done |
sort -n | sed -e 1b -e '$!d'
Alternatively, a shorter way:
cut -d":" -f1 /etc/passwd | awk '{ print length, $0 }' | sort -n | cut -d" " -f2- | sed -e 1b -e '$!d'
I have tried this:
dirs=$1
for dir in $dirs
do
ls -R $dir
done
Like this?:
$ cat > foo
this
nope
$ cat > bar
neither
this
$ sort *|uniq -c
1 neither
1 nope
2 this
and weed out the ones with just 1s:
... | awk '$1>1'
2 this
Use sort with uniq to find the duplicate lines.
#!/bin/bash
dirs=("$#")
for dir in "${dirs[#]}" ; do
cat "$dir"/*
done | sort | uniq -c | sort -n | tail -n1
uniq -c will prepend the number of occurrences to each line
sort -n will sort the lines by the number of occurrences
tail -n1 will only output the last line, i.e. the maximum. If you want to see all the lines with the same number of duplicates, add the following instead of tail:
perl -ane 'if ($F[0] == $n) { push @buff, $_ }
else { @buff = $_ }
$n = $F[0];
END { print for @buff }'
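If you would rather stay in awk for that last step, an equivalent replacement for the tail would be (a sketch):
awk '$1 != n { buf = "" } { buf = buf $0 ORS; n = $1 } END { printf "%s", buf }'
It starts a fresh buffer whenever the count changes, so at the end it holds every line tied for the maximum count.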
You could use awk. If you just want to "count the duplicate lines", we could infer that you're after "all lines which have appeared earlier in the same file". The following would produce these counts:
#!/bin/sh
for file in "$#"; do
if [ -s "$file" ]; then
awk '$0 in a {c++} {a[$0]} END {printf "%s: %d\n", FILENAME, c}' "$file"
fi
done
The awk script first checks whether the current line is already stored in the array a, and if it is, increments a counter. Then it adds the line to the array. At the end of the file, we print the total.
Note that this might have problems on very large files, since the entire input file needs to be read into memory in the array.
Example:
$ printf 'foo\nbar\nthis\nbar\nthat\nbar\n' > inp.txt
$ awk '$0 in a {c++} {a[$0]} END {printf "%s: %d\n", FILENAME, c}' inp.txt
inp.txt: 2
The word 'bar' exists three times in the file, thus there are two duplicates.
To aggregate multiple files, you can just feed multiple files to awk:
$ printf 'foo\nbar\nthis\nbar\n' > inp1.txt
$ printf 'red\nblue\ngreen\nbar\n' > inp2.txt
$ awk '$0 in a {c++} {a[$0]} END {print c}' inp1.txt inp2.txt
2
For this, the word 'bar' appears twice in the first file and once in the second file -- a total of three times, thus we still have two duplicates.
I have a large number of tab-separated text files containing a score I'm interested in in the second column:
test_score_1.txt
Title FRED Chemgauss4 File
24937 -6.111582 A
24972 -7.644171 A
26246 -8.551361 A
21453 -7.291059 A
test_score_2.txt
Title FRED Chemgauss4 File
14721 -7.322331 B
27280 -6.229842 B
21451 -8.407396 B
10035 -7.482369 B
10037 -7.706176 B
I want to check if I have Titles with a score smaller than a number I define.
The following code defines my score in the script and works:
check_scores_1.sh
#!/bin/bash
find . -name 'test_score_*.txt' -type f -print0 |
while read -r -d $'\0' x; do
awk '{FS = "\t" ; if ($2 < -7.5) print $0}' "$x"
done
If I try to pass an argument to awk, as in check_scores_2.sh below (run as ./check_scores_2.sh "-7.5"), it returns all entries from both files.
check_scores_2.sh
#!/bin/bash
find . -name 'test_score_*.txt' -type f -print0 |
while read -r -d $'\0' x; do
awk '{FS = "\t" ; if ($2 < ARGV[1]) print $0}' "$x"
done
Finally, check_scores_3.sh reveals that I'm actually not passing any arguments from my command line.
check_scores_3.sh
#!/bin/bash
find . -name 'test_score_*.txt' -type f -print0 |
while read -r -d $'\0' x; do
awk '{print ARGV[0] "\t" ARGV[1] "\t" ARGV[2]}' "$x"
done
$ ./check_score_3.sh "-7.5" gives the following output:
awk ./test_score_1.txt
awk ./test_score_1.txt
awk ./test_score_1.txt
awk ./test_score_1.txt
awk ./test_score_1.txt
awk ./test_score_2.txt
awk ./test_score_2.txt
awk ./test_score_2.txt
awk ./test_score_2.txt
awk ./test_score_2.txt
awk ./test_score_2.txt
What am I doing wrong?
In your shell script, the first argument to the script is available as $1. You can assign that value to an awk variable as follows:
find . -name 'test_score_*.txt' -type f -exec awk -v a="$1" -F'\t' '$2 < a' {} +
Discussion
Your print0/while read loop is very good. The -exec option offered by find, however, makes it possible to run the same command without any explicit looping.
The command {if ($2 < -7.5) print $0} can optionally be simplified to just the condition $2 < -7.5. This is because the default action for a condition is print $0.
Note that the references $1 and $2 are entirely unrelated to each other. Because $1 is in double-quotes, the shell substitutes in for it before the awk command starts to run. The shell interprets $1 to mean the first argument to the script. Because $2 appears in single quotes, the shell leaves it alone and it is interpreted by awk. Awk interprets it to mean the second field of its current record.
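As a quick check of the simplified condition-only form against test_score_1.txt from the question (assuming the fields really are tab-separated):
$ awk -F'\t' '$2 < -7.5' test_score_1.txt
24972 -7.644171 A
26246 -8.551361 A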
Your first example:
awk '{FS = "\t" ; if ($2 < -7.5) print $0}' "$x"
only works by a happy coincidence that setting FS actually makes no difference for your particular case. Otherwise it would fail for the first line of the input file since you're not setting FS until AFTER the first line is read and has been split into fields. You meant this:
awk 'BEGIN{FS = "\t"} {if ($2 < -7.5) print $0}' "$x"
which can be written more idiomatically as just:
awk -F'\t' '$2 < -7.5' "$x"
For the second case you're just not passing in the argument, as you already realised. All you need to do is:
awk -F'\t' -v max="$1" '$2 < max' "$x"
See http://cfajohnson.com/shell/cus-faq-2.html#Q24.
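If you prefer to keep the structure of your original script, the fixed awk call drops straight into your loop; a sketch:
#!/bin/bash
find . -name 'test_score_*.txt' -type f -print0 |
while read -r -d $'\0' x; do
awk -F'\t' -v max="$1" '$2 < max' "$x"
done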
I'm looking at files that all have a different version number that starts at column 18 of line 7.
What's the best way with Bash to read (into a $variable) the string on line 7, from column (i.e. character) 18 to the end of the line? What about reading only up to the 5th-to-last character of the line?
sed way:
variable=$(sed -n '7s/^.\{17\}//p' file)
EDIT (thanks to commenters): If by columns you mean fields (separated with tabs or spaces), the command can be changed to
variable=$(sed -n '7s/^\(\s\+\S\+\)\{17\}//p' file)
You have a number of different ways you can go about this, depending on the utilities you want to use. One of your options is to make use of Bash's substring expansion in any of the following ways:
sed
line=1
string=$(sed -n "${line}p" /etc/passwd)
echo "${string:17}"
awk
line=1
string=$(awk "NR==${line} {print}; {next}" /etc/passwd)
echo "${string:17}"
coreutils
line=1
string=`{ head -n $line | tail -n1; } < /etc/passwd`
echo "${string:17}"
Use
var=$(head -n 7 filename | tail -n 1 | cut -f 18-)
or
var=$(awk 'NR == 7 {delim = ""; for (i = 18; i <= NF; i++) {printf "%s%s", delim, $i; delim = OFS}; printf "\n"}' filename)
If you mean "characters" instead of "fields":
var=$(head -n 7 filename | tail -n 1 | cut -c 18-)
or
var=$(awk 'NR == 7 {print substr($0, 18)}' filename)
If by 'columns' you mean 'fields':
a=$( awk 'NR==7{ print $18 }' file )
If you really want the 18th byte through the end of line 7, do:
a=$( sed -n 7p file | cut -b 18- )