I have a large number of tab-separated text files; the score I'm interested in is in the second column:
test_score_1.txt
Title FRED Chemgauss4 File
24937 -6.111582 A
24972 -7.644171 A
26246 -8.551361 A
21453 -7.291059 A
test_score_2.txt
Title FRED Chemgauss4 File
14721 -7.322331 B
27280 -6.229842 B
21451 -8.407396 B
10035 -7.482369 B
10037 -7.706176 B
I want to check if I have Titles with a score smaller than a number I define.
The following code defines my score in the script and works:
check_score_1.sh
#!/bin/bash
find . -name 'test_score_*.txt' -type f -print0 |
while read -r -d $'\0' x; do
    awk '{FS = "\t" ; if ($2 < -7.5) print $0}' "$x"
done
If I try to pass the cutoff as an argument instead, running ./check_score_2.sh "-7.5" with the script shown below, it returns all entries from both files.
check_score_2.sh
#!/bin/bash
find . -name 'test_score_*.txt' -type f -print0 |
while read -r -d $'\0' x; do
    awk '{FS = "\t" ; if ($2 < ARGV[1]) print $0}' "$x"
done
Finally, check_score_3.sh reveals that I'm actually not passing any arguments from the command line.
check_score_3.sh
#!/bin/bash
find . -name 'test_score_*.txt' -type f -print0 |
while read -r -d $'\0' x; do
    awk '{print ARGV[0] "\t" ARGV[1] "\t" ARGV[2]}' "$x"
done
$ ./check_score_3.sh "-7.5" gives the following output:
awk ./test_score_1.txt
awk ./test_score_1.txt
awk ./test_score_1.txt
awk ./test_score_1.txt
awk ./test_score_1.txt
awk ./test_score_2.txt
awk ./test_score_2.txt
awk ./test_score_2.txt
awk ./test_score_2.txt
awk ./test_score_2.txt
awk ./test_score_2.txt
What am I doing wrong?
In your shell script, the first argument to the script is available as $1. You can assign that value to an awk variable as follows:
find . -name 'test_score_*.txt' -type f -exec awk -v a="$1" -F'\t' '$2 < a' {} +
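For example, a minimal sketch of a complete script built around that one-liner (check_score.sh is just a placeholder name):
#!/bin/bash
# check_score.sh -- print every record whose second (tab-separated) field is
# below the cutoff passed as the first argument, e.g. ./check_score.sh -7.5
find . -name 'test_score_*.txt' -type f -exec awk -v a="$1" -F'\t' '$2 < a' {} +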
Discussion
Your print0/while read loop is very good. The -exec option offered by find, however, makes it possible to run the same command without any explicit looping.
The command {if ($2 < -7.5) print $0} can optionally be simplified to just the condition $2 < -7.5. This is because the default action for a condition is print $0.
Note that the references $1 and $2 are entirely unrelated to each other. Because $1 is inside double quotes, the shell substitutes its value before the awk command starts to run; the shell interprets $1 as the first argument to the script. Because $2 appears inside single quotes, the shell leaves it alone and it is interpreted by awk, which reads it as the second field of its current record.
Your first example:
awk '{FS = "\t" ; if ($2 < -7.5) print $0}' "$x"
only works by the happy coincidence that setting FS makes no difference for your particular case. Otherwise it would fail for the first line of the input file, since you're not setting FS until AFTER the first line has been read and split into fields. You meant this:
awk 'BEGIN{FS = "\t"} {if ($2 < -7.5) print $0}' "$x"
which can be written more idiomatically as just:
awk -F'\t' '$2 < -7.5' "$x"
For the second case you're just not passing in the argument, as you already realised. All you need to do is:
awk -F'\t' -v max="$1" '$2 < max' "$x"
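Putting the same fix into your original loop, this is a sketch of the corrected check_score_2.sh (max is just my name for the awk variable):
#!/bin/bash
# check_score_2.sh -- the cutoff arrives as "$1" and reaches awk as max
find . -name 'test_score_*.txt' -type f -print0 |
while read -r -d $'\0' x; do
    awk -F'\t' -v max="$1" '$2 < max' "$x"
done
Running ./check_score_2.sh "-7.5" on the sample files above should print the four records scoring below -7.5, in whatever order find lists the files:
24972 -7.644171 A
26246 -8.551361 A
21451 -8.407396 B
10037 -7.706176 B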
See http://cfajohnson.com/shell/cus-faq-2.html#Q24.
Related
I am writing a function in a Bash shell script that should return the lines of CSV files (with headers) that have more commas than the header. This can happen because some values inside these files may contain commas. For quality control, I must identify these lines so I can clean them up later. What I have currently:
#!/bin/bash
get_bad_lines () {
    local correct_no_of_commas=$(head -n 1 $1/$1_0_0_0.csv | tr -cd , | wc -c)
    local no_of_files=$(ls $1 | wc -l)
    for i in $(seq 0 $(( ${no_of_files}-1 )))
    do
        # Check that the file exists
        if [ ! -f "$1/$1_0_${i}_0.csv" ]; then
            echo "File: $1_0_${i}_0.csv not found!"
            continue
        fi
        # Search for error lines inside the file and print them out
        echo "$1_0_${i}_0.csv has over $correct_no_of_commas commas in the following lines:"
        grep -o -n '[,]' "$1/$1_0_${i}_0.csv" | cut -d : -f 1 | uniq -c | awk '$1 > $correct_no_of_commas {print}'
    done
}
get_bad_lines products
get_bad_lines users
The output of this program is currently all of the comma counts together with all of the line numbers in all the files, and I suspect this is because the input $1 (the folder name, i.e. products and users) conflicts with the $1 in the awk call (where I want the first column, i.e. the comma count for that line of the current file in the loop).
Is this the issue? If so, could it be solved by referring to the first column and to the folder name with different variable names instead of both using $1?
Example, current output:
5 6667
5 6668
5 6669
5 6670
(should only show lines for that file having more than 5 commas).
I tried declaring a variable in the call to awk as well (as in the accepted answer to Awk field variable clash with function argument), with the same effect:
get_bad_lines () {
    local table_name=$1
    local correct_no_of_commas=$(head -n 1 $table_name/${table_name}_0_0_0.csv | tr -cd , | wc -c)
    local no_of_files=$(ls $table_name | wc -l)
    for i in $(seq 0 $(( ${no_of_files}-1 )))
    do
        # Check that the file exists
        if [ ! -f "$table_name/${table_name}_0_${i}_0.csv" ]; then
            echo "File: ${table_name}_0_${i}_0.csv not found!"
            continue
        fi
        # Search for error lines inside the file and print them out
        echo "${table_name}_0_${i}_0.csv has over $correct_no_of_commas commas in the following lines:"
        grep -o -n '[,]' "$table_name/${table_name}_0_${i}_0.csv" | cut -d : -f 1 | uniq -c | awk -v table_name="$table_name" '$1 > $correct_no_of_commas {print}'
    done
}
You can use awk all the way to achieve that:
get_bad_lines () {
    find "$1" -maxdepth 1 -name "$1_0_*_0.csv" | while read -r my_file ; do
        awk -v table_name="$1" '
            NR==1 { num_comma=gsub(/,/, ""); }
            /,/ { if (gsub(/,/, ",", $0) > num_comma) wrong_array[wrong++]=NR":"$0;}
            END { if (wrong > 0) {
                print(FILENAME" has over "num_comma" commas in the following lines:");
                for (i=0;i<wrong;i++) { print(wrong_array[i]); }
            }
        }' "${my_file}"
    done
}
As for why your original awk command failed to print only the lines with too many commas: you are using the shell variable correct_no_of_commas inside a single-quoted awk statement ('$1 > $correct_no_of_commas {print}'). There is therefore no substitution by the shell; awk reads $correct_no_of_commas as is and looks up an awk variable named correct_no_of_commas, which is undefined in the awk script and so expands to the empty string. awk then evaluates $1 > $"" as the matching condition, and since $"" is equivalent to $0, it compares the count in $1 with the full input line. That line (from uniq -c) has the form <spaces><count><space><line number>, which as a whole does not look like a number, so awk falls back to a string comparison, and the digit in $1 always sorts after the leading space of $0. Thus $1 > $correct_no_of_commas is always true.
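The most direct fix to your original pipeline is therefore to hand the shell value to awk with -v (the awk variable name max is my choice):
grep -o -n '[,]' "$table_name/${table_name}_0_${i}_0.csv" | cut -d : -f 1 | uniq -c | awk -v max="$correct_no_of_commas" '$1 > max'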
You can identify all the bad lines with a single awk command
awk -F, 'FNR==1{print FILENAME; headerCount=NF;} NF>headerCount{print} ENDFILE{print "#######\n"}' /path/here/*.csv
If you want the line number also to be printed, use this
awk -F, 'FNR==1{print FILENAME"\nLine#\tLine"; headerCount=NF;} NF>headerCount{print FNR"\t"$0} ENDFILE{print "#######\n"}' /path/here/*.csv
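For example, with the folder layout from the question above (products/products_0_*_0.csv), an invocation might look like this; note that ENDFILE is a GNU awk extension, so gawk is assumed:
gawk -F, 'FNR==1{print FILENAME; headerCount=NF;} NF>headerCount{print} ENDFILE{print "#######\n"}' products/products_0_*_0.csv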
In a directory, there are several files, such as:
file1
file2
file3
Is there a simple way to concatenate those file names to get one line (joined by "OR") in bash, as follows:
file1 OR file2 OR file3
Or do I need to write a script for it?
You can use this function to print all the filenames (including ones with spaces, newlines or special characters) with " OR " as the separator (assuming no filename contains the ASCII 4 character):
orfiles() {
    local IFS=$'\4'
    local out="$*"
    echo "${out//$'\4'/ OR }"
}
Then call it as:
orfiles *
How it works:
We set IFS (the Internal Field Separator) to ASCII 4 locally inside the function.
We store the expansion of "$*" in the local variable out; "$*" joins all the filenames with \4 between them.
Finally, using bash string substitution, we globally replace \4 with " OR " while printing $out.
When "$*" is expanded, the shell joins the positional parameters with only the first character of IFS, so a multi-character string like " OR " cannot be used as the joiner directly; hence the two steps shown above.
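For example, in a directory containing the three files from the question:
$ ls
file1  file2  file3
$ orfiles *
file1 OR file2 OR file3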
You can simply do that with
printf '%s OR ' $(ls -1 *) | sed 's/OR $/''/'; echo -e '\n'
where ls -1 * lists the files in the directory.
One thing to consider is that a filename could contain whitespace.
Use the following ls + awk solution:
ls -1 * | awk '{ r=(r)? r" OR "$0 : $0 }END{ print r }'
Workaround for filenames with newline(s):
echo -e $(ls -1b hello* | awk -v RS= '{gsub(/\n/," OR ",$0); gsub(/\\ /," ",$0); print $0}')
-b - ls option to print C-style escapes for nongraphic characters
ls -1|awk -v q='"' '{printf "%s%s", NR==1?"":" OR ", q $0 q}END{print ""}'
The ls & awk way to do it, with an example where a filename contains spaces:
kent$ ls -1
file1
file2
'file with OR and space'
kent$ ls -1|awk -v q='"' '{printf "%s%s", NR==1?"":" OR ", q $0 q}END{print ""}'
"file1" OR "file2" OR "file with OR and space"
$ for f in *; do printf '%s%s' "$s" "$f"; s=" OR "; done; printf '\n'
file1 OR file2 OR file3
Below is my comma-separated input.txt file. I want to read the columns and write the lines to output.txt, skipping any line where a column contains a space.
Content of input.txt:
1,Hello,world
2,worl d,hell o
3,h e l l o, world
4,Hello_Hello,World#c#
5,Hello,W orld
Content of output.txt:
1,Hello,world
4,Hello_Hello,World#c#
Is it possible to achieve this using awk? Please help!
A simple way to filter out lines with spaces is using inverted matching with grep:
grep -v ' ' input.txt
If you must use awk:
awk '!/ /' input.txt
Or perl:
perl -ne '/ / || print' input.txt
Or pure bash:
while read line; do [[ $line == *' '* ]] || echo $line; done < input.txt
# or
while read line; do [[ $line =~ ' ' ]] || echo $line; done < input.txt
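All of the above produce exactly the output.txt shown in the question, e.g.:
$ grep -v ' ' input.txt
1,Hello,world
4,Hello_Hello,World#c#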
UPDATE
To check if let's say field 2 contains space, you could use awk like this:
awk -F, '$2 !~ / /' input.txt
To check if let's say field 2 OR field 3 contains space:
awk -F, '!($2 ~ / / || $3 ~ / /)' input.txt
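On the sample input.txt, checking only field 2 also keeps line 5 (its space is in field 3), while checking fields 2 and 3 together reproduces output.txt:
$ awk -F, '$2 !~ / /' input.txt
1,Hello,world
4,Hello_Hello,World#c#
5,Hello,W orld
$ awk -F, '!($2 ~ / / || $3 ~ / /)' input.txt
1,Hello,world
4,Hello_Hello,World#c#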
For your follow-up question in comments
To do the same using sed, I only know these awkward solutions:
# remove lines if 2nd field contains space
sed -e '/^[^,]*,[^,]* /d' input.txt
# remove lines if 2nd or 3rd field contains space
sed -e '/^[^,]*,[^,]* /d' -e '/^[^,]*,[^,]*,[^,]* /d' input.txt
For your 2nd follow-up question in comments
To disregard leading spaces in the 2nd or 3rd fields:
awk -F', *' '!($2 ~ / / || $3 ~ / /)' input.txt
# or perhaps what you really want is this:
awk -F', *' -v OFS=, '!($2 ~ / / || $3 ~ / /) { print $1, $2, $3 }' input.txt
This can also be done easily with sed
sed '/ /d' input.txt
try this one-liner
awk 'NF==1' file
As @jwpat7 pointed out, it won't give correct output if a line has only leading spaces; in that case this one, with a regex, should do, but it has already been posted in janos's answer.
awk '!/ /' file
or
awk -F' *' 'NF==1'
Pure bash for the fun of it...
#!/bin/bash
while read line
do
    if [[ ! $line =~ " " ]]
    then
        echo $line
    fi
done < input.txt
columnWithSpace=2
ColumnBef=$(( ${columnWithSpace} - 1 ))
sed "/^\([^,]*,\)\{${ColumnBef}\}[^ ,]* /d"
If you know the column directly (for example the 3rd):
sed '/^\([^,]*,\)\{2\}[^ ,]* /d'
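Applied to the question's input.txt with the column set to the 3rd (\{2\}), this happens to reproduce the desired output, because every offending line there also has a space in its third column:
$ sed '/^\([^,]*,\)\{2\}[^ ,]* /d' input.txt
1,Hello,world
4,Hello_Hello,World#c#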
If you can trust the input to always have no more than three fields, simply finding a space somewhere after a comma is sufficient.
grep ',.* ' input.txt
If there can be (or usually are) more fields, you can pull that off with grep -E and a suitable ERE, but you are fast approaching the point at which the equivalent Awk solution will be more readable and maintainable.
I have text files with a line like this in them:
MC exp. sig-250-0 events & $0.98 \pm 0.15$ & $3.57 \pm 0.23$ \\
sig-250-0 is something that can change from file to file (but I always know what it is for each file). There are lines before and after this one, but the string "MC exp. sig-250-0 events" is unique in the file.
For a particular file, is there a good way to extract the second number 3.57 in the above example using bash?
use awk for this:
awk '/MC exp. sig-250-0/ {print $10}' your.txt
Note that this will print $3.57, with the leading $. If you don't want that, pipe the output to tr:
awk '/MC exp. sig-250-0/ {print $10}' your.txt | tr -d '$'
In comments you wrote that you need to call it in a script like this:
while read p ; do
    echo $p,awk '/MC exp. sig-$p/ {print $10}' filename | tr -d '$'
done < grid.txt
Note that you need command substitution $(...) around the awk pipeline. Like this:
echo "$p",$(awk '/MC exp. sig-$p/ {print $10}' filename | tr -d '$')
If you want to pass a shell variable to the awk pattern use the following syntax:
awk -v p="MC exp. sig-$p" '$0 ~ p {print $10}' a.txt | tr -d '$'
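So the loop from your comment, with both fixes applied (grid.txt and filename exactly as in your example), becomes roughly:
while read -r p; do
    echo "$p,$(awk -v p="MC exp. sig-$p" '$0 ~ p {print $10}' filename | tr -d '$')"
done < grid.txt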
More lines of context would have been nice, but I guess you want a simple awk solution.
awk '{print $N}' $file
If you don't tell awk which field separator to use, it splits on whitespace by default. Now you just have to count the fields to find the one you want. In your case it is the 10th.
awk '{print $10}' file.txt
$3.57
Don't want the $?
Pipe your awk result to cut:
awk '{print $10}' foo | cut -d '$' -f2
-d uses the $ as the field separator and -f selects the second field.
If you know you always have the same number of fields, then
#!/bin/bash
file=$1
key=$2
while read -ra f; do
    if [[ "${f[0]} ${f[1]} ${f[2]} ${f[3]}" == "MC exp. $key events" ]]; then
        echo ${f[9]}
    fi
done < "$file"
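A hypothetical invocation, assuming the script is saved as extract_field.sh and your data is in results.txt (both names are placeholders); note that the leading $ is kept here, so pipe through tr -d '$' as shown earlier if you want the bare number:
$ ./extract_field.sh results.txt sig-250-0
$3.57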
I'd like to be able to sort a file, but only from a certain line down. From the manual, sort isn't able to skip lines on its own, so I'll need a second utility to do this. read? Or awk possibly? Here's the file I'd like to be able to sort:
tar --exclude-from=$EXCLUDE_FILE --exclude=$BACKDEST/$PC-* \
-cvpzf $BACKDEST/$BACKUPNAME.tar.gz \
/etc/X11/xorg.conf \
/etc/X11/xorg.conf.1 \
/etc/fonts/conf.avail.1 \
/etc/fonts/conf.avail/60-liberation.conf \
So for this case, I'd like to begin sorting at line three. I'm thinking I'm going to have to write a function to do this, something like:
cat backup.sh | while read LINE; do echo $LINE | sort; done
I'm pretty new to this and the script looks like it's missing something. Also, I'm not sure how to begin at a certain line number.
Any ideas?
Something like this?
(head -n 2 backup.sh; tail -n +3 backup.sh | sort) > backup-sorted.sh
You may have to fix up the last line of the input... it probably doesn't have the trailing \ for the line continuation, so you might end up with a broken backup-sorted.sh if you just do the above.
You might want to consider using tar's --files-from (or -T) option, and having the sorted list of files in a data file instead of the script itself.
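A sketch of that idea, assuming the list of files to back up lives in a separate backup-files.txt (the name is my choice) that you can sort on its own:
sort -o backup-files.txt backup-files.txt
tar --exclude-from="$EXCLUDE_FILE" --exclude="$BACKDEST/$PC-*" \
    -cvpzf "$BACKDEST/$BACKUPNAME.tar.gz" \
    --files-from=backup-files.txt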
clumsy way:
len=$(wc -l < FILE)
sortable_len=$((len-2))
head -n 2 FILE > OUT
tail -n "$sortable_len" FILE | sort >> OUT
I'm sure someone will post an elegant 1-liner shortly.
Sort the lines excluding the (2-line) header, just for viewing:
cat file.txt | awk '{if (NR < 3) print $0 > "/dev/stderr"; else print $0}' | sort
Sort the lines excluding the (2-line) header and send the output to another file.
Method #1:
cat file.txt | awk '{if (NR < 3) print $0 > "/dev/stderr"; else print $0}' 2> file_sorted.txt | sort >> file_sorted.txt
Method #2:
cat file.txt | awk '{if (NR < 3) print $0 > "file_sorted.txt"; else print $0}' | sort >> file_sorted.txt
You could try this:
(read line; echo "$line"; sort) < file.txt
It takes one line and echoes it, then sorts the rest. You can also:
cat file.txt | (read line; echo "$line"; sort)
For two lines, just repeat the read and echo:
(read line; echo "$line"; read line; echo "$line"; sort) < file.txt
Using awk:
awk '{ if ( NR > 2 ) { print $0 } }' file.txt | sort
NR is a built-in awk variable and contains the current record/line number. It starts at 1.
Extending Vigneswaran R's answer using awk:
Using tty to get your current terminal's device file, print the first two lines directly to your terminal (no, it won't execute the input) from within awk, and pipe the rest to sort.
$ tty
/dev/pts/3
$ cat file.txt | awk '{if (NR < 3) print $0 > "/dev/pts/3"; else print $0}' | sort