Count number of ';' in column - linux

I use the following command to count the number of ; in the first line of a file:
awk -F';' '(NR==1){print NF;}' $filename
I would like to do the same for every line in the file, that is, count the number of ; on each line.
What I have:
$ awk -F';' '(NR==1){print NF;}' $filename
11
What I would like to have :
11
11
11
11
11
11

A straightforward way to count ; per line is:
awk '{print gsub(/;/,"&")}' Input_file
To skip empty lines, try:
awk 'NF{print gsub(/;/,"&")}' Input_file
To do this the OP's way, subtract 1 from the value of NF:
awk -F';' '{print (NF-1)}' Input_file
or, skipping empty lines:
awk -F';' 'NF{print (NF-1)}' Input_file
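As a quick check of the two approaches above, here is a minimal sketch run on made-up input. The sample text and the function names count_gsub and count_nf are illustrative assumptions, not part of the original answer. The NF-based variant is guarded because an empty line has NF equal to 0, so a bare NF-1 would print -1 for it.

```shell
# Hypothetical sample: three semicolons, none, an empty line, then one.
sample=$(printf 'a;b;c;d\nno separators here\n\nx;y')

# Count by substitution: gsub returns the number of replacements it made.
count_gsub() { awk '{ print gsub(/;/, "&") }'; }

# Count by fields: NF is one more than the separator count,
# except on an empty line, where NF is 0.
count_nf() { awk -F';' '{ print (NF ? NF - 1 : 0) }'; }

# Both calls print 3, 0, 0, 1 (one number per line).
printf '%s\n' "$sample" | count_gsub
printf '%s\n' "$sample" | count_nf
```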

I'd say you can solve your problem with the following:
awk -F';' '{if (NF) {a += NF-1;}} END {print a}' test.txt
You want to keep a running count of all the occurrences (the variable a).
NF is the number of fields, which is one more than the number of separators, so you need to subtract 1 for each line. This is the NF-1 part.
However, an empty line has NF equal to 0, and you don't want to add -1 for it. The if (NF) test skips those lines. (A non-empty line without any separator has NF equal to 1 and correctly contributes 0.)
Here's a (perhaps contrived) example:
$ cat test.txt
;;
; ; ; ;;
; asd ;;a
a ; ;
$ awk -F';' '{if (NF) {a += NF-1;}} END {print a}' test.txt
12
Notice the empty line at the end of the file (to test the empty-line case).

A different approach using tr and wc:
$ tr -cd ';' < file | wc -c
42

Your code returns a number one more than the number of semicolons; NF is the number of fields you get from splitting on a semicolon (so for example, if there is one semicolon, the line is split in two).
If you want to add up this number across all lines, that's easy:
awk -F ';' '{ sum += NF-1 } END { print sum }' "$filename"
If the number of fields is the same on every line, you could also just count the lines and multiply:
awk -F ';' 'END { print NR * (NF-1) }' "$filename"
But that's obviously wrong if you can't guarantee that all lines contain exactly the same number of fields.
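If you are unsure whether the field count really is consistent, you can check before multiplying. The following is only a sketch: the helper name uniform_count and the wording of its error message are assumptions made for illustration. It reads lines on standard input, prints the total only when every line has the same number of fields, and otherwise reports the first mismatch on standard error and returns a non-zero status.

```shell
# uniform_count (hypothetical helper): total semicolons, but only if
# every line has the same number of fields.
uniform_count() {
  awk -F';' '
    NR == 1    { want = NF }                  # field count of the first line
    NF != want {                              # first line that disagrees
      printf "line %d has %d fields, expected %d\n", NR, NF, want > "/dev/stderr"
      bad = 1
      exit                                    # jump straight to END
    }
    END {
      if (bad) exit 1
      print (want ? NR * (want - 1) : 0)      # lines times separators per line
    }
  '
}

printf 'a;b;c\nd;e;f\n' | uniform_count    # prints 4
```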

Related

Extract substring of string if position is known

First, I need to extract a substring at a known position from file.txt in bash, but counting from the second line:
>header
cgatgcgctctgtgcgtgcgtgcg
So let's assume I want position 10 from the second line; the output should be:
c
Second, I want to include the surrounding ±5 characters, resulting in:
gcgctctgtgc
{ read -r; read -r; echo "${REPLY:9:1}"; echo "${REPLY:4:11}"; } < file.txt
Output:
c
gcgctctgtgc
The ${parameter:offset:length} syntax for substrings is explained in https://www.gnu.org/software/bash/manual/bashref.html#Shell-Parameter-Expansion.
The read command is explained in https://www.gnu.org/software/bash/manual/bashref.html#index-read.
Input redirection: https://www.gnu.org/software/bash/manual/bashref.html#Redirections.
With awk:
To get the character at position 10, 1-indexed:
awk 'NR==2 {print substr($0, 10, 1)}'
NR==2 checks whether the current record is the second one; if so, the statements inside {} are executed.
substr($0, 10, 1) extracts 1 character starting at position 10 of $0 (the whole record), i.e. only the 10th character. The format for substr() is substr(string, start, length).
Similarly, to get ±5 characters around the 10th:
awk 'NR==2 {print substr($0, (10-5), 11)}'
(10-5) is written out instead of 5 only to show where the start position comes from; the length 11 is 5 characters on each side plus the character itself.
Example:
% cat file.txt
>header
cgatgcgctctgtgcgtgcgtgcg
% awk 'NR==2 {print substr($0, 10, 1)}' file.txt
c
% awk 'NR==2 {print substr($0, (10-5), 11)}' file.txt
gcgctctgtgc
Use sed and cut:
sed -n '2p' file|cut -c 5-15
sed selects the 2nd line and cut prints the desired characters.
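The hard-coded numbers in the answers above can be turned into parameters. This is a minimal sketch, assuming a helper called window (a made-up name) that takes a line number, a 1-indexed position and a radius, and reads the file on standard input. The start position is clamped to 1 so that a position near the beginning of the line still yields the characters that do exist.

```shell
# window LINE POS RADIUS (hypothetical helper): print the character at POS
# on line LINE together with up to RADIUS characters on each side.
window() {
  awk -v line="$1" -v pos="$2" -v r="$3" '
    NR == line {
      start = pos - r
      if (start < 1) start = 1                    # clamp at the start of the line
      print substr($0, start, pos + r - start + 1)
      exit                                        # nothing else to read
    }
  '
}

printf '>header\ncgatgcgctctgtgcgtgcgtgcg\n' | window 2 10 5    # prints gcgctctgtgc
printf '>header\ncgatgcgctctgtgcgtgcgtgcg\n' | window 2 10 0    # prints c
```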

Add a variable to a column in a CSV file

I have a large file (~10GB) and I want to duplicate that file 10 times but each time add a variable to the first column:
for i in (1, 10):
var = (i-1) * 1000
# add var to the first column of the file and save the file as file(i).csv
So far I have tried:
#!/bin/bash
for i in {1..10}
do
t=1
j=$(( $i - t ))
s=1000
person_id=$(( j * add ))
awk -F"," 'BEGIN{OFS=","} NR>1{$1=$1+$person_id} {print $0}' file.csv > file$i.csv
done
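but there is no change in the column values.
Awk variables are different from shell variables. Inside the single-quoted awk program the shell does not expand $person_id, so awk never sees its value. Note also that the script stores the multiplier in s but computes j * add; add is never set, so bash evaluates it as 0 and person_id is always 0. Use j * s instead.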
Replace:
awk -F"," 'BEGIN{OFS=","} NR>1{$1=$1+$person_id} {print $0}' file.csv > file$i.csv
With:
awk -F"," -v id="$person_id" 'BEGIN{OFS=","} NR>1{$1=$1+id} {print $0}' file.csv > "file$i.csv"
This uses the -v option to define an awk variable id whose value is the value of the shell variable person_id.
Because , is not a shell-active character, the code can be simplified. Also, changing the location of the definition of OFS can further shorten the code:
awk -F, -v id="$person_id" 'NR>1{$1+=id} 1' OFS=, file.csv > "file$i.csv"
Lastly, we replaced {print $0} with the cryptic shorthand 1. (This works because awk interprets 1 as a logical condition which it evaluates to true and, since no action was supplied, awk will perform the default action which is to print the line.)
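Putting the pieces together, the whole loop might look like the sketch below. The function name shift_copies and its three arguments (source file, number of copies, step) are assumptions made for illustration; the awk call is the simplified one from above, with OFS passed through -v instead. Output files are named after the source, so file.csv produces file1.csv, file2.csv and so on.

```shell
# shift_copies SRC COPIES STEP (hypothetical helper): write COPIES files in
# which the first column of every non-header row is increased by
# 0, STEP, 2*STEP, ...
shift_copies() {
  src=$1 copies=$2 step=$3
  i=1
  while [ "$i" -le "$copies" ]; do
    # -v hands the shell arithmetic result to awk as the awk variable id.
    awk -F, -v OFS=, -v id="$(( (i - 1) * step ))" \
      'NR > 1 { $1 += id } 1' "$src" > "${src%.csv}$i.csv"
    i=$(( i + 1 ))
  done
}
```

For the question as asked, the call would be shift_copies file.csv 10 1000. Each copy is a separate pass over the input, so a 10GB file is read ten times.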

using variable defined outside Awk

I have coded the following lines:
ARRAY=($(awk 'FS = ";" {print $3}' file.txt))
LINE_CREATOR=`echo "aaaa;bbbb;cccccccc" |
'{awk -F";"};
END
for (i in ARRAY)
{
print $'${ARRAY['i']}'
}
}'`
file.txt looks like this:
1;8;3
4;6;1
7;9;2
Explanation:
The array contains the values 3 1 2,
so the loop should iterate over the array and extract fields $3, $1 and $2 from "aaaa;bbbb;cccccccc" using awk.
The final output should be:
ccccccccaaaabbbb
I still get errors when I run my script.
I'm making a few guesses here but I think that this does what you want:
$ echo "aaaa;bbbb;cccccccc" | awk -F\; 'NR == FNR { n = split($0, a); next }
{ printf "%s", a[$3] } END { print "" }' - file
ccccccccaaaabbbb
NR == FNR means that the block is only run for the first input. - as an argument tells awk to read first from standard input. The string is split on FS (;) into the array a. next skips the rest of the script.
The second block is only run for the second input (the text file). The values in the third field are used to print the elements in the array a.
If you want to pass the indices as an awk variable, here is another way:
$ awk -F';' -v ix="$(cut -d\; -f3 file | paste -sd\;)" '
BEGIN{n=split(ix,a)}
{for(i=1;i<n;i++) printf "%s",$a[i];
printf "%s\n",$a[n]}' <<< "aaaa;bbbb;cccccccc"
ccccccccaaaabbbb
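Both answers rely on awk accepting an expression after $, so a field number can come from an array element. A minimal sketch of just that idea, assuming a helper named pick_fields (made up for this example) that takes the field order as a space-separated string:

```shell
# pick_fields ORDER (hypothetical helper): ORDER is a space-separated list of
# field numbers; the chosen fields are printed back to back.
pick_fields() {
  awk -F';' -v order="$1" '
    BEGIN { n = split(order, idx, " ") }       # idx[1..n] hold the field numbers
    {
      out = ""
      for (i = 1; i <= n; i++)
        out = out $(idx[i])                    # $ applied to an expression
      print out
    }
  '
}

echo "aaaa;bbbb;cccccccc" | pick_fields "3 1 2"    # prints ccccccccaaaabbbb
```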

How to replace fields using substr comparison

I have two files. I need to take the last 6 characters of field 11 from file1 and look them up in file2; if they match, I need to replace field 9 of file1 with field 1 and field 2 of file2.
file1:
12345||||||756432101000||756432||756432101000||
aaaaa||||||986754812345||986754||986754812345||
ccccc||||||134567222222||134567||134567222222||
file2:
101000|AAAA
812345|20030
The expected output is:
12345||||||756432101000||101000AAAA ||756432101000||
aaaaa||||||986754812345||81234520030||986754812345||
ccccc||||||134567222222||134567||134567222222||
I have tried:
awk -F '|' -v OFS='|' 'NR==FNR{a[$1,$2];next} {b=substr($11,length($11)-7)} b in a {$9=a[$1,$2]}1'
I'd write it this way as a full script in a file, rather than a one-liner:
#!/usr/bin/awk -f
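BEGIN {
    FS = "|"
    OFS = FS
}
NR == FNR {    # file2, given first on the command line: the replacements to use
    map[$1] = $2
    next
}
{    # file1, given second on the command line: the main file to manipulate
    b = substr($11, length($11) - 5)    # last six characters of field 11
    if (b in map) {                     # membership test; works even if the value is 0 or empty
        $9 = b map[b]
    }
    print
}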
$ awk -F '|' -v OFS='|' 'NR==FNR{a[$1]=$2;next} {b=substr($11,length($11)-5)} b in a {$9=b a[b]}1' file2 file1
12345||||||756432101000||101000AAAA||756432101000||
aaaaa||||||986754812345||81234520030||986754812345||
ccccc||||||134567222222||134567||134567222222||
How it works
awk implicitly loops through every line in both files, starting with file2 because it is specified first on the command line.
-F '|'
This tells awk to use | as the field separator on input
-v OFS='|'
This tells awk to use | as the field separator on output
NR==FNR{a[$1]=$2;next}
While reading the first file, file2, this saves the second field, $2, as the value of associative array a with the first field, $1, as the key.
next tells awk to skip the rest of the commands and start over on the next line.
b=substr($11,length($11)-5)
This extracts the last six characters of field 11 and saves them in variable b.
b in a {$9=b a[b]}
This tests to see if b is one of the keys of associative array a. If it is, this assigns the ninth field, $9, to the combination of b and a[b].
1
This is awk's cryptic shorthand for print-the-line.
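The b in a test checks whether the key exists, not what value is stored under it. A small sketch of why that distinction can matter, assuming a made-up lookup line whose second field is 0 (the function name membership_demo is likewise only illustrative):

```shell
# membership_demo (hypothetical): compare a key-existence test with a test
# on the stored value when that value is 0.
membership_demo() {
  printf '101000|0\n' | awk -F'|' '
    { a[$1] = $2 }                                          # the stored value is 0
    END {
      k = "101000"
      print ((k in a) ? "in: yes" : "in: no")               # key exists
      print (a[k] ? "value test: yes" : "value test: no")   # 0 counts as false
    }
  '
}

membership_demo
# in: yes
# value test: no
```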
You are almost there:
$ awk -F '|' -v OFS='|' 'NR==FNR{a[$1]=$2;next} {b=substr($11,length($11)-5)} b in a {$9=b a[b]}1' file2 file1
12345||||||756432101000||101000AAAA ||756432101000||
aaaaa||||||986754812345||81234520030||986754812345||
ccccc||||||134567222222||134567||134567222222||
$

How to print third column to last column?

I'm trying to remove the first two columns (which I'm not interested in) from a DbgView log file. I can't seem to find an example that prints from column 3 onwards until the end of the line. Note that each line has a variable number of columns.
...or a simpler solution: cut -f 3- INPUTFILE. Just add the correct delimiter (-d) and you get the same effect.
awk '{for(i=3;i<=NF;++i)print $i}'
awk '{ print substr($0, index($0,$3)) }'
solution found here:
http://www.linuxquestions.org/questions/linux-newbie-8/awk-print-field-to-end-and-character-count-179078/
Jonathan Feinberg's answer prints each field on a separate line. You could use printf to rebuild the record for output on the same line, but you can also just shift the fields two places to the left.
awk '{for (i=1; i<=NF-2; i++) $i = $(i+2); NF-=2; print}' logfile
awk '{$1=$2=$3=""}1' file
NB: this method leaves blanks in place of fields 1, 2 and 3, which is not a problem if you just want to look at the output.
If you want to print the columns from the 3rd onward on the same line, you can use:
awk '{for(i=3; i<=NF; ++i) printf "%s ", $i; print ""}'
For example:
Mar 09:39 20180301_123131.jpg
Mar 13:28 20180301_124304.jpg
Mar 13:35 20180301_124358.jpg
Feb 09:45 Cisco_WebEx_Add-On.dmg
Feb 12:49 Docker.dmg
Feb 09:04 Grammarly.dmg
Feb 09:20 Payslip 10459 %2828-02-2018%29.pdf
It will print:
20180301_123131.jpg
20180301_124304.jpg
20180301_124358.jpg
Cisco_WebEx_Add-On.dmg
Docker.dmg
Grammarly.dmg
Payslip 10459 %2828-02-2018%29.pdf
As we can see, the payslip entry stays on a single line even though its name contains spaces.
What about the following line:
awk '{$1=$2=$3=""; print}' file
Based on @ghostdog74's suggestion. Mine should behave better when you filter lines, e.g.:
awk '/^exim4-config/ {$1=""; print }' file
awk -v m="\x0a" -v N="3" '{$N=m$N ;print substr($0, index($0,m)+1)}'
This chops off everything before the given field number N and prints the rest of the line, including field N. It doesn't matter if the text of the field also appears somewhere else in the line, which is the problem with daisaa's answer.
Define a function:
fromField () {
awk -v m="\x0a" -v N="$1" '{$N=m$N; print substr($0,index($0,m)+1)}'
}
And use it like this:
$ echo " bat bi iru lau bost " | fromField 3
iru lau bost
$ echo " bat bi iru lau bost " | fromField 2
bi iru lau bost
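Everything from the chosen field onward is kept. Because assigning to a field makes awk rebuild the record, the fields in the output are separated by single spaces and leading and trailing whitespace is dropped.
This works well for files where '\n' is the record separator, so the newline character cannot occur inside a line. If you want to use it with other record separators, use: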
awk -v m="\x01" -v N="3" '{$N=m$N ;print substr($0, index($0,m)+1)}'
for example. Works well with almost all files as long as they don't use hexadecimal char nr. 1 inside the lines.
awk '{a=match($0, $3); print substr($0,a)}'
First you find the position of the start of the third column.
With substr you will print the whole line ($0) starting at the position(in this case a) to the end of the line.
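Approaches that search the line for the text of $3 find its first occurrence, which is not necessarily the third field. The sketch below shows this on a made-up line and compares it with a pattern that strips the first two whitespace-separated fields instead. The function names by_index and by_regex are assumptions for illustration, and the pattern assumes fields are separated by spaces or tabs.

```shell
# Search for the text of $3: breaks when that text occurs earlier in the line.
by_index() { awk '{ print substr($0, index($0, $3)) }'; }

# Strip optional leading blanks, then two fields with their trailing blanks.
# The rest of the line is left untouched, so its spacing is preserved.
by_regex() { awk '{ sub(/^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]+/, ""); print }'; }

echo 'ab a b   c' | by_index    # prints "b a b   c" (matched the b inside "ab")
echo 'ab a b   c' | by_regex    # prints "b   c"
```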
The following awk command prints field 6 through the last field of each line, followed by a newline character:
awk '{for( i=6; i<=NF; i++ ){printf( "%s ", $i )}; printf( "\n"); }'
The example below lists the contents of the /usr/bin directory, keeps the last 3 lines, and then prints the last 4 columns of each line using awk:
$ ls -ltr /usr/bin/ | tail -3
-rwxr-xr-x 1 root root 14736 Jan 14 2014 bcomps
-rwxr-xr-x 1 root root 10480 Jan 14 2014 acyclic
-rwxr-xr-x 1 root root 35868448 May 22 2014 skype
$ ls -ltr /usr/bin/ | tail -3 | awk '{for( i=6; i<=NF; i++ ){printf( "%s ", $i )}; printf( "\n"); }'
Jan 14 2014 bcomps
Jan 14 2014 acyclic
May 22 2014 skype
Perl solution:
perl -lane 'splice @F,0,2; print join " ",@F' file
These command-line options are used:
-n loop around every line of the input file, do not automatically print every line
-l removes newlines before processing, and adds them back in afterwards
-a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace
-e execute the perl code
splice @F,0,2 cleanly removes columns 0 and 1 from the @F array
join " ",@F joins the elements of the @F array, using a space in-between each element
If your input file is comma-delimited, rather than space-delimited, use -F, -lane
Python solution:
python -c "import sys;[sys.stdout.write(' '.join(line.split()[2:]) + '\n') for line in sys.stdin]" < file
Well, you can easily accomplish the same effect using a regular expression. Assuming the separator is a space, it would look like:
awk '{ sub(/[^ ]+ +[^ ]+ +/, ""); print }'
awk '{print ""}{for(i=3;i<=NF;++i)printf "%s ", $i}'
A bit late here, but none of the above seemed to work for me. Try this: it uses printf and inserts a space after each field. I chose not to print a newline at the end.
awk '{for(i=3;i<=NF;++i) printf("%s ", $i) }'
awk '{for (i=4; i<=NF; i++) printf("%s ", $i); printf("\n");}'
Prints the fields from the 4th to the last, in the same order they were in the original file.
In Bash you can read each line into an array and use array slicing:
while read -r -a cols; do echo "${cols[@]:2}"; done < file.txt
Learn more: Handling positional parameters at Bash Hackers Wiki
If it's only about ignoring the first two fields and you don't want a space left behind when masking those fields (as some of the answers above do):
awk '{gsub($1" "$2" ",""); print;}' file
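awk '{$1=$2=""}1' FILENAME | sed 's/^\s\+//'
The first two columns are cleared; sed removes the leading spaces.
In AWK, columns are called fields, and NF holds the number of fields in the current line. If what you need is the third-from-last column, use $(NF-2):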
all rows:
awk -F '<column separator>' '{print $(NF-2)}' <filename>
first row only:
awk -F '<column separator>' 'NR<=1{print $(NF-2)}' <filename>
