Print columns from a specific line of a file?

I'm looking at files that all have a different version number that starts at column 18 of line 7.
What's the best way with Bash to read (into a $variable) the string on line 7, from column, i.e. "character," 18 to the end of the line? What about to the 5th to last character of the line?

sed way:
variable=$(sed -n '7s/^.\{17\}//p' file)
EDIT (thanks to commenters): If by columns you mean fields (separated with tabs or spaces), the command can be changed to
variable=$(sed -n '7s/^\(\s\+\S\+\)\{17\}//p' file)

You have a number of different ways you can go about this, depending on the utilities you want to use. One of your options is to make use of Bash's substring expansion in any of the following ways:
sed
line=1
string=$(sed -n "${line}p" /etc/passwd)
echo "${string:17}"
awk
line=1
string=$(awk -v line="$line" 'NR == line { print; exit }' /etc/passwd)
echo "${string:17}"
coreutils
line=1
string=$( { head -n "$line" | tail -n 1; } < /etc/passwd )
echo "${string:17}"
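The question also asks for the string up to the 5th-to-last character. With Bash 4.2 or later, the substring expansion accepts a negative length, so a minimal sketch (reusing the line/string variables from above):
line=7
string=$(sed -n "${line}p" file)   # grab line 7
echo "${string:17}"                # from character 18 to the end of the line
echo "${string:17:-5}"             # from character 18 up to, but not including, the last 5 characters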

Use
var=$(head -n 7 filename | tail -n 1 | cut -f 18-)
or
var=$(awk 'NR == 7 {delim = ""; for (i = 18; i <= NF; i++) {printf "%s%s", delim, $i; delim = OFS}; printf "\n"}' filename)
If you mean "characters" instead of "fields":
var=$(head -n 7 filename | tail -n 1 | cut -c 18-)
or
var=$(awk 'NR == 7 {print substr($0, 18)}' filename)

If by 'columns' you mean 'fields':
a=$( awk 'NR==7{ print $18 }' file )
If you really want the 18th byte through the end of line 7, do:
a=$( sed -n 7p file | cut -b 18- )

Bash function with input fails awk command

I am writing a function in a Bash shell script that should return lines from CSV files with headers that have more commas than the header. This can happen because there are values inside these files that may contain commas. For quality control, I must identify these lines to later clean them up. What I have currently:
#!/bin/bash
get_bad_lines () {
    local correct_no_of_commas=$(head -n 1 $1/$1_0_0_0.csv | tr -cd , | wc -c)
    local no_of_files=$(ls $1 | wc -l)
    for i in $(seq 0 $(( ${no_of_files}-1 )))
    do
        # Check that the file exists
        if [ ! -f "$1/$1_0_${i}_0.csv" ]; then
            echo "File: $1_0_${i}_0.csv not found!"
            continue
        fi
        # Search for error-lines inside the file and print them out
        echo "$1_0_${i}_0.csv has over $correct_no_of_commas commas in the following lines:"
        grep -o -n '[,]' "$1/$1_0_${i}_0.csv" | cut -d : -f 1 | uniq -c | awk '$1 > $correct_no_of_commas {print}'
    done
}
get_bad_lines products
get_bad_lines users
The output of this program is currently all of the comma counts with all of the line numbers in all the files,
and I suspect this is because the input $1 (the folder name, i.e. products and users) conflicts with the awk program's reference to $1 as well (where I want to grab the first column, the comma count for that line in the current file of the loop).
Is this the issue? And if so, could it be solved by referencing the first column and the folder name with different variable names instead of both of them using $1?
Example, current output:
5 6667
5 6668
5 6669
5 6670
(should only show lines for that file having more than 5 commas).
I tried a variable declaration in the call to awk as well, with the same effect (as in the accepted answer to Awk field variable clash with function argument):
get_bad_lines () {
    local table_name=$1
    local correct_no_of_commas=$(head -n 1 $table_name/${table_name}_0_0_0.csv | tr -cd , | wc -c)
    local no_of_files=$(ls $table_name | wc -l)
    for i in $(seq 0 $(( ${no_of_files}-1 )))
    do
        # Check that the file exists
        if [ ! -f "$table_name/${table_name}_0_${i}_0.csv" ]; then
            echo "File: ${table_name}_0_${i}_0.csv not found!"
            continue
        fi
        # Search for error-lines inside the file and print them out
        echo "${table_name}_0_${i}_0.csv has over $correct_no_of_commas commas in the following lines:"
        grep -o -n '[,]' "$table_name/${table_name}_0_${i}_0.csv" | cut -d : -f 1 | uniq -c | awk -v table_name="$table_name" '$1 > $correct_no_of_commas {print}'
    done
}
You can do the whole job in awk to achieve that:
get_bad_lines () {
    find "$1" -maxdepth 1 -name "$1_0_*_0.csv" | while read -r my_file ; do
        awk -v table_name="$1" '
            NR==1 { num_comma=gsub(/,/, ""); }
            /,/   { if (gsub(/,/, ",", $0) > num_comma) wrong_array[wrong++]=NR":"$0; }
            END   {
                if (wrong > 0) {
                    print(FILENAME" has over "num_comma" commas in the following lines:");
                    for (i=0;i<wrong;i++) { print(wrong_array[i]); }
                }
            }' "${my_file}"
    done
}
As for why your original awk command printed every line rather than only the lines with too many commas: you are using the shell variable correct_no_of_commas inside a single-quoted awk program ('$1 > $correct_no_of_commas {print}'). The shell performs no substitution inside single quotes, so awk sees $correct_no_of_commas literally and looks for an awk variable named correct_no_of_commas, which is undefined and therefore the empty string. The condition effectively becomes $1 > $"", and since $"" is equivalent to $0, awk compares the count in $1 against the whole input line from uniq -c (which has the form <spaces><count><space><line_number>). That comparison is effectively always true, so every line is printed.
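A minimal sketch of the fix, passing the shell value into awk with -v (the awk variable name max is arbitrary; the paths and shell variables are the ones from the question):
# compare against an awk variable populated from the shell, instead of the literal string $correct_no_of_commas
grep -o -n '[,]' "$table_name/${table_name}_0_${i}_0.csv" | cut -d : -f 1 | uniq -c | awk -v max="$correct_no_of_commas" '$1 > max {print}'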
You can identify all the bad lines with a single awk command (note that ENDFILE requires GNU awk)
awk -F, 'FNR==1{print FILENAME; headerCount=NF;} NF>headerCount{print} ENDFILE{print "#######\n"}' /path/here/*.csv
If you want the line number also to be printed, use this
awk -F, 'FNR==1{print FILENAME"\nLine#\tLine"; headerCount=NF;} NF>headerCount{print FNR"\t"$0} ENDFILE{print "#######\n"}' /path/here/*.csv
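If your awk lacks ENDFILE, a roughly equivalent portable sketch prints the separator from the FNR==1 rule and an END block instead:
awk -F, '
    FNR==1 { if (NR > 1) print "#######\n"; print FILENAME; headerCount=NF }
    NF > headerCount { print }
    END { if (NR) print "#######\n" }
' /path/here/*.csv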

Using bash, I want to print a number followed by sizes of 2 paths on one line. i.e. output of 3 commands on one line

Using bash, I want to print a number followed by the sizes of 2 paths on one line, i.e. the output of 3 commands on one line.
All 3 items should be separated by ":".
echo -n "10001:"; du -sch /abc/def/* | grep 'total' | awk '{ print $1 }'; du -sch /ghi/jkl/* | grep 'total' | awk '{ print $1 }'
I am getting the output as -
10001:61M
:101M
But I want the output as -
10001:61M:101M
This should work for you. The two key elements added are tr -d '\n', which strips the newline characters from the end of each command's output, and an extra echo ":" to get the colon separator in there.
Hope this helps! Here's a link to the docs for the tr command:
https://ss64.com/bash/tr.html
echo -n "10001:"; du -sch /abc/def/* | grep 'total' | awk '{ print $1 }' | tr -d '\n'; echo ":" | tr -d '\n'; du -sch /ghi/jkl/* | grep 'total' | awk '{ print $1 }'
Save your values to variables, and then use printf:
printf '%s:%s:%s\n' "$first" "$second" "$third"
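For example, a sketch using the du pipelines from the question to populate the variables:
first="10001"
second=$(du -sch /abc/def/* | grep 'total' | awk '{ print $1 }')   # size of the first path
third=$(du -sch /ghi/jkl/* | grep 'total' | awk '{ print $1 }')    # size of the second path
printf '%s:%s:%s\n' "$first" "$second" "$third"                    # e.g. 10001:61M:101M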

Easy way of selecting certain lines from a file in a certain order

I have a text file with many lines. I also have a selected number of lines I want to print out, in a certain order. Let's say, for example, "5, 3, 10, 6". In this order.
Is there some easy and "canonical" way of doing this? (with "standard" Linux tools, and bash)
When I tried the answers from this question
Bash tool to get nth line from a file
it always prints the lines in the order they appear in the file.
A one liner using sed:
for i in 5 3 10 6 ; do sed -n "${i}p" < ff; done
A rather efficient method if your file is not too large is to read it all in memory, in an array, one line per field using mapfile (this is a Bash ≥4 builtin):
mapfile -t array < file.txt
Then you can echo all the lines you want in any order, e.g.,
printf '%s\n' "${array[4]}" "${array[2]}" "${array[9]}" "${array[5]}"
to print the lines 5, 3, 10, 6. Now you'll feel it's a bit awkward that the array fields start with a 0 so that you have to offset your numbers. This can be easily cured with the -O option of mapfile:
mapfile -t -O 1 array < file.txt
this will start assigning to array at index 1, so that you can print your lines 5, 3, 10 and 6 as:
printf '%s\n' "${array[5]}" "${array[3]}" "${array[10]}" "${array[6]}"
Finally, you want to make a wrapper function for this:
printlines() {
    local i
    for i; do printf '%s\n' "${array[i]}"; done
}
so that you can just state:
printlines 5 3 10 6
And it's all pure Bash, no external tools!
As @glennjackmann suggests in the comments, you can make the helper function also take care of reading the file (passed as argument):
printlinesof() {
    # $1 is filename
    # $2,... are the lines to print
    local i array
    mapfile -t -O 1 array < "$1" || return 1
    shift
    for i; do printf '%s\n' "${array[i]}"; done
}
Then you can use it as:
printlinesof file.txt 5 3 10 6
And if you also want to handle stdin:
printlinesof() {
    # $1 is filename or - for stdin
    # $2,... are the lines to print
    local i array file=$1
    [[ $file = - ]] && file=/dev/stdin
    mapfile -t -O 1 array < "$file" || return 1
    shift
    for i; do printf '%s\n' "${array[i]}"; done
}
so that
printf '%s\n' {a..z} | printlinesof - 5 3 10 6
will also work.
Here is one way using awk:
awk -v s='5,3,10,6' 'BEGIN{split(s, a, ","); for (i=1; i<=length(a); i++) b[a[i]]=i}
b[NR]{data[NR]=$0} END{for (i=1; i<=length(a); i++) print data[a[i]]}' file
Testing:
cat file
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
Line 8
Line 9
Line 10
Line 11
Line 12
awk -v s='5,3,10,6' 'BEGIN{split(s, a, ","); for (i=1; i<=length(a); i++) b[a[i]]=i}
b[NR]{data[NR]=$0} END{for (i=1; i<=length(a); i++) print data[a[i]]}' file
Line 5
Line 3
Line 10
Line 6
First, generate a sed expression that would print the lines with a number at the beginning that you can later use to sort the output:
#!/bin/bash
lines=(5 3 10 6)
sed=''
i=0
for line in "${lines[@]}" ; do
    sed+="${line}s/^/$((i++)) /p;"
done
for i in {a..z} ; do echo $i ; done \
    | sed -n "$sed" \
    | sort -n \
    | cut -d' ' -f2-
I'd probably use Perl, though:
for c in {a..z} ; do echo $c ; done \
    | perl -e 'undef @lines{@ARGV};
               while (<STDIN>) {
                   $lines{$.} = $_ if exists $lines{$.};
               }
               print @lines{@ARGV};
              ' 5 3 10 6
You can also use Perl instead of hacking with sed in the first solution:
for c in {a..z} ; do echo $c ; done \
| perl -e ' %lines = map { $ARGV[$_], ++$i } 0 .. $#ARGV;
while (<STDIN>) {
print "$lines{$.} $_" if exists $lines{$.};
}
' 5 3 10 6 | sort -n | cut -d' ' -f2-
l=(5 3 10 6)
printf "%s\n" {a..z} |
    sed -n "$(printf "%d{=;p};" "${l[@]}")" |
    paste - - | {
        while IFS=$'\t' read -r nr text; do
            line[nr]=$text
        done
        for n in "${l[@]}"; do
            echo "${line[n]}"
        done
    }
You can use the nl trick: number the lines in the input and join the output with the list of the actual line numbers. Additional sorts are needed to make the join possible, as it needs sorted input (so the nl trick is used once more to number the expected lines):
#! /bin/bash
LINES=(5 3 10 6)
lines=$( IFS=$'\n' ; echo "${LINES[*]}" | nl )
for c in {a..z} ; do
    echo $c
done | nl \
    | grep -E '^\s*('"$( IFS='|' ; echo "${LINES[*]}")"')\s' \
    | join -12 -21 <(echo "$lines" | sort -k2n) - \
    | sort -k2n \
    | cut -d' ' -f3-

unix - breakdown of how many lines with number of character occurrences

Is there an inbuilt command to do this, or has anyone had any luck with a script that does it?
I am looking to get counts of how many lines had how many occurrences of a specific character (sorted descending by the number of occurrences).
For example, with this sample file:
gkdjpgfdpgdp
fdkj
pgdppp
ppp
gfjkl
Suggested input (for the 'p' character)
bash/perl some_script_name "p" samplefile
Desired output:
occs count
4 1
3 2
0 2
Update:
How would you write a solution that worked off a 2 character string such as 'gd' not a just a specific character such as p?
$ sed 's/[^p]//g' input.txt | awk '{print length}' | sort -nr | uniq -c | awk 'BEGIN{print "occs", "count"}{print $2,$1}' | column -t
occs count
4 1
3 2
0 2
You could give the desired character as the field separator for awk, and do this:
awk -F 'p' '{ print NF-1 }' samplefile |
sort -k1nr |
uniq -c |
awk -v OFS="\t" 'BEGIN { print "occs", "count" } { print $2, $1 }'
For your sample data, it produces:
occs count
4 1
3 2
0 2
If you want to count occurrences of multi-character strings, just give the desired string as the separator, e.g., awk -F 'gd' ... or awk -F 'pp' ....
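For instance, with the question's samplefile and the two-character string 'gd', the pipeline stays the same and only the separator changes:
awk -F 'gd' '{ print NF-1 }' samplefile |
    sort -k1nr |
    uniq -c |
    awk -v OFS="\t" 'BEGIN { print "occs", "count" } { print $2, $1 }'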
#!/usr/bin/env perl
use strict; use warnings;

my $seq = shift @ARGV;
die unless defined $seq;

my %freq;

while ( my $line = <> ) {
    last unless $line =~ /\S/;
    my $occurances = () = $line =~ /(\Q$seq\E)/g;
    $freq{ $occurances } += 1;
}

for my $occurances ( sort { $b <=> $a } keys %freq ) {
    print "$occurances:\t$freq{$occurances}\n";
}
If you want short, you can always use:
#!/usr/bin/env perl
$x=shift;/\S/&&++$f{$a=()=/(\Q$x\E)/g}while<>
;print"$_:\t$f{$_}\n"for sort{$b<=>$a}keys%f;
or, perl -e '$x=shift;/\S/&&++$f{$a=()=/(\Q$x\E)/g}while<>;print"$_:\t$f{$_}\n"for sort{$b<=>$a}keys%f' inputfile, but now I am getting silly.
Pure Bash:
declare -a count
while read ; do
cnt=${REPLY//[^p]/} # remove non-p characters
((count[${#cnt}]++)) # use length as array index
done < "$infile"
for idx in ${!count[*]} # iterate over existing indices
do echo -e "$idx ${count[idx]}"
done | sort -nr
Output as desired:
4 1
3 2
0 2
You can do it in one gawk process (well, with a sort coprocess):
gawk -F p -v OFS='\t' '
    { count[NF-1]++ }
    END {
        print "occs", "count"
        coproc = "sort -rn"
        for (n in count)
            print n, count[n] |& coproc
        close(coproc, "to")
        while ((coproc |& getline) > 0)
            print
        close(coproc)
    }
'
Shortest solution so far:
perl -nE'say tr/p//' | sort -nr | uniq -c |
awk 'BEGIN{print "occs","count"}{print $2,$1}' |
column -t
For multiple characters, use a regex pattern:
perl -ple'$_ = () = /pg/g' | sort -nr | uniq -c |
awk 'BEGIN{print "occs","count"}{print $2,$1}' |
column -t
This one handles overlapping matches (e.g. it finds 3 "pp" in "pppp" instead of 2):
perl -ple'$_ = () = /(?=pp)/g' | sort -nr | uniq -c |
awk 'BEGIN{print "occs","count"}{print $2,$1}' |
column -t
Original cryptic but short pure-Perl version:
perl -nE'
++$c{ () = /pg/g };
}{
say "occs\tcount";
say "$_\t$c{$_}" for sort { $b <=> $a } keys %c;
'

Is there a way to 'uniq' by column?

I have a .csv file like this:
stack2@domain.example,2009-11-27 01:05:47.893000000,domain.example,127.0.0.1
overflow@domain2.example,2009-11-27 00:58:29.793000000,domain2.example,255.255.255.0
overflow@domain2.example,2009-11-27 00:58:29.646465785,domain2.example,256.255.255.0
...
I have to remove duplicate e-mails (the entire line) from the file (i.e. one of the lines containing overflow@domain2.example in the above example). How do I use uniq on only field 1 (separated by commas)? According to man, uniq doesn't have options for columns.
I tried something with sort | uniq but it doesn't work.
sort -u -t, -k1,1 file
-u for unique
-t, so comma is the delimiter
-k1,1 for the key field 1
Test result:
overflow@domain2.example,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
awk -F"," '!_[$1]++' file
-F sets the field separator.
$1 is the first field.
_[val] looks up val in the hash _ (a regular variable).
++ increments, and returns the old value.
! returns logical not.
There is an implicit print at the end.
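Spelled out, the idiom is roughly equivalent to this more verbose version (seen is just a descriptive stand-in for the _ array):
# print a line only the first time its first field is seen
awk -F"," '{ if (!seen[$1]) { print; seen[$1] = 1 } }' file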
To consider multiple columns, sort and give a unique list based on column 1 and column 3:
sort -u -t : -k 1,1 -k 3,3 test.txt
-t : colon is separator
-k 1,1 -k 3,3 based on column 1 and column 3
If you want to use uniq:
<mycvs.cvs tr -s ',' ' ' | awk '{print $3" "$2" "$1}' | uniq -c -f2
gives:
1 01:05:47.893000000 2009-11-27 tack2@domain.example
2 00:58:29.793000000 2009-11-27 overflow@domain2.example
1
If you want to retain the last one of the duplicates, you could use
tac a.csv | sort -u -t, -r -k1,1 | tac
which was my requirement here. tac reverses the file line by line.
Here is a very nifty way.
First format the content such that the column to be compared for uniqueness is a fixed width. One way of doing this is to use awk printf with a field/column width specifier ("%15s").
Now the -f and -w options of uniq can be used to skip preceding fields/columns and to specify the comparison width (column(s) width).
Here are three examples.
In the first example...
1) Temporarily make the column of interest a fixed width greater than or equal to the field's max width.
2) Use -f uniq option to skip the prior columns, and use the -w uniq option to limit the width to the tmp_fixed_width.
3) Remove trailing spaces from the column to "restore" its width (assuming there were no trailing spaces beforehand).
printf "%s" "$str" \
| awk '{ tmp_fixed_width=15; uniq_col=8; w=tmp_fixed_width-length($uniq_col); for (i=0;i<w;i++) { $uniq_col=$uniq_col" "}; printf "%s\n", $0 }' \
| uniq -f 7 -w 15 \
| awk '{ uniq_col=8; gsub(/ */, "", $uniq_col); printf "%s\n", $0 }'
In the second example...
Create a new uniq column 1. Then remove it after the uniq filter has been applied.
printf "%s" "$str" \
| awk '{ uniq_col_1=4; printf "%15s %s\n", uniq_col_1, $0 }' \
| uniq -f 0 -w 15 \
| awk '{ $1=""; gsub(/^ */, "", $0); printf "%s\n", $0 }'
The third example is the same as the second, but for multiple columns.
printf "%s" "$str" \
| awk '{ uniq_col_1=4; uniq_col_2=8; printf "%5s %15s %s\n", uniq_col_1, uniq_col_2, $0 }' \
| uniq -f 0 -w 5 \
| uniq -f 1 -w 15 \
| awk '{ $1=$2=""; gsub(/^ */, "", $0); printf "%s\n", $0 }'
Well, it's simpler than isolating the column with awk: if you need to remove everything with a certain value from a given file, why not just use grep -v?
e.g. to delete everything with the value "col2" in the second position
line: col1,col2,col3,col4
grep -v ',col2,' file > file_minus_offending_lines
If this isn't good enough, because some lines may get improperly stripped by possibly having the matching value show up in a different column, you can do something like this:
awk to isolate the offending column:
e.g.
awk -F, '{print $2 "|" $line}'
the -F sets the field delimiter to ",", $2 means column 2, followed by some custom delimiter and then the entire line. You can then filter by removing lines that begin with the offending value:
awk -F, '{print $2 "|" $line}' | grep -v ^BAD_VALUE
and then strip out the stuff before the delimiter:
awk -F, '{print $2 "|" $line}' | grep -v ^BAD_VALUE | sed 's/.*|//g'
(Note: the sed command is sloppy because it doesn't escape special values. Also, the sed pattern should really be something like "[^|]+", i.e. anything that is not the delimiter. But hopefully this is clear enough.)
By sorting the file with sort first, you can then apply uniq.
It seems to sort the file just fine:
$ cat test.csv
overflow@domain2.example,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
overflow@domain2.example,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
$ sort test.csv
overflow@domain2.example,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
overflow@domain2.example,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
$ sort test.csv | uniq
overflow@domain2.example,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
overflow@domain2.example,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
You could also do some AWK magic:
$ awk -F, '{ lines[$1] = $0 } END { for (l in lines) print lines[l] }' test.csv
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
overflow@domain2.example,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
