AWK reporting duplicate lines and count, program explanation


I found the following AWK program on the internet and tweaked it slightly to look at column $2:
{ a[$2,NR]=$0; c[$2]++ }
END {
for( k in a ) {
split(k,b,SUBSEP)
t=c[b[1]] # added this bit to capture count
if( b[1] in c && t>1 ) { # added && t>1 only print if count more than 1
print RS "TIMES ID" RS c[b[1]] " " b[1] RS
delete c[b[1]]
}
for(i=1;i<=NR;i++) if( a[b[1],i] ) {
if(t>1){print a[b[1],i]} # added if(t>1) only print lines if count more than 1
delete a[b[1],i]
}
}
}
Given the following file:
abc,2,3
def,3,4
ghi,2,3
jkl,5,9
mno,3,2
The output is as follows when the command is run:
Command: awk -F, -f find_duplicates.awk duplicates
Output:
TIMES ID
2 2
abc,2,3
ghi,2,3
TIMES ID
2 3
def,3,4
mno,3,2
This is fine.
I would like to understand what is happening in the AWK program.
I understand that the first line is loading each line into a multidimensional array?
So the first line of the file would be a['2','1']='abc,2,3' and so on.
However, I'm a bit confused as to what c[$2]++ does, and also what the significance of split(k,b,SUBSEP) is.
Would appreciate it if someone could explain line by line what is going on in this AWK program.
Thanks.

The increment operator simply adds one to the value of the referenced variable. So c[$2]++ takes the value for c[$2] and adds one to it. If $2 is "a" and c["a"] was 3 before, its value will be 4 after this. So c keeps track of how many of each $2 value you have seen.
for (k in a) loops over the keys of a, in no guaranteed order. If the value of $2 on the first line was "a", one value of k will be "a" SUBSEP "1" (with 1 being the line number). Another will be the combination of the value of $2 from the second line and the line number 2, etc.
The split(k,b,SUBSEP) will create a new array b from the compound value in k, i.e. basically reconstruct the parts of the compound key that went into a. The value in b[1] will now be the value which was in $2 when the corresponding value in a was created, and the value in b[2] will be the corresponding line number.
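As a rough illustration in Python rather than awk, the bookkeeping described above can be sketched as follows. The function names load and split_key are made up for this sketch and are not part of the original program. awk joins the subscripts of a[$2,NR] into one string using SUBSEP, which by default is the character with octal code 034; the sketch mimics that with an explicit join and split.

```python
from collections import Counter

SUBSEP = "\x1c"  # awk's default SUBSEP (octal 034)


def load(lines, sep=","):
    """Mimic { a[$2,NR]=$0; c[$2]++ } for a list of input lines."""
    a = {}         # compound key "ID<SUBSEP>line number" -> whole line
    c = Counter()  # ID -> number of lines seen with that ID
    for nr, line in enumerate(lines, start=1):
        field2 = line.split(sep)[1]
        a[field2 + SUBSEP + str(nr)] = line
        c[field2] += 1
    return a, c


def split_key(key):
    """Mimic split(k, b, SUBSEP): recover the ID and the line number."""
    ident, nr = key.split(SUBSEP)
    return ident, int(nr)
```

Here split_key plays the role of b[1] and b[2] in the awk program: the first part is the value that was in $2, the second is the line number.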
The final loop is somewhat inefficient; it loops over all possible line numbers, then skips immediately to the next one if an entry for that ID and line number did not exist. Because this runs inside the outer loop for (k in a) it will be repeated a large number of times if you have a large number of inputs (it will loop over all input line numbers for each input line). It would be more efficient, at the expense of some additional memory, to just build a final output incrementally, then print it all after you have looped over all of a, by which time you have processed all input lines anyway. Perhaps something like this:
END {
for (k in a) {
split (k,b,SUBSEP)
if (c[b[1]] > 1) {
if (! o[b[1]]) o[b[1]] = c[b[1]] " " b[1] RS
o[b[1]] = o[b[1]] RS a[k]
}
delete a[k]
}
for (q in o) print o[q] RS
}
Update: Removed the premature deletion of c[b[1]].
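For comparison, here is a minimal Python sketch of the same idea: collect lines per ID in a single pass, then report only the IDs seen more than once. The name report_duplicates is hypothetical, the blank lines that the awk version prints via RS are left out, and the sketch assumes every line has at least two fields.

```python
from collections import defaultdict


def report_duplicates(lines, sep=","):
    """Group lines by their second field and report groups with 2+ members."""
    groups = defaultdict(list)  # ID -> lines carrying that ID, in input order
    for line in lines:
        groups[line.split(sep)[1]].append(line)

    report = []
    for ident, members in groups.items():
        if len(members) > 1:
            report.append("TIMES ID")
            report.append(f"{len(members)} {ident}")
            report.extend(members)
    return report
```

Because each line is touched once while loading and once while reporting, there is no inner loop over all line numbers.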

Related

Merge multiple columns from different files with a partial match via awk

I have two files, A and B with the columns separated by \.
Column 2 of file A is exactly the same as column 1 of file B.
I want to merge these two files keeping file B the same, add a new column based on the same fields between the two files and a partial match between column 1 of file A and column 2 of file B.
By partial match I mean something like this:
File A (column 1)
File B (column 2)
A=B?
A
A?
True
A
Asd
True
B
B
True
C
c
True
C
CA
True
D
A
False
If there are values with the same column 1 and 2 in file A, they must be added to file B separated by ;
File A
A\2022.10.10\note a
A\2022.10.10\note b
B\2022.10.14\note c
A\2022.10.14\note d
C\2022.10.15\note e
File B
2022.10.10\A?
2022.10.14\B?
2022.10.14\a
2022.10.15\C
2022.10.15\D
Desired output
2022.10.10\A?\note a;note b\
2022.10.14\B?\note c\
2022.10.14\a\note d\
2022.10.15\C\note e\
2022.10.15\D\
How can I do this with awk?

An awk script might work depending on the details of your requirements (e.g. whether the keys are case sensitive or not).
An awk script like this might work:
function make_key(k)
{
# either return k or an uppercase version for case-insensitive keys
# return k;
return toupper(k);
}
BEGIN {
FS="\\";
}
NR==FNR {
key=make_key($2 "\\" $1);
if( key in notes){
notes[key]=notes[key] ";" $3
}
else {
notes[key]=$3
}
}
NR!=FNR {
for(k in notes){
pos=index(make_key($0),k);
if(pos==1){
printf "%s%s%s%s\n", $0, FS, notes[k], FS;
next;
}
}
print $0 FS;
}
You would use it like this:
awk -f script.awk file_A file_B
In the function make_key you can configure the case sensitiveness by either returning k or an uppercase version.
The NR==FNR block is used while reading the first file (file_A). Here the notes are stored under a key made out of the second and first column. Notes are appended if the key already exists.
In the NR!=FNR block the second file (file_B) is read. Here we print the line of file_B and the matching notes by checking, for every key, whether the line of file_B starts with that key. If no key matches, then only the line of file_B is printed.
The BEGIN block just sets up the field separator.
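The same two-pass logic can be sketched in Python, which may make the prefix match easier to follow. This is only an illustration under the assumptions of the script above: the key is "date\name", a line of file B matches when it starts with a key, and the first matching key wins. The name merge_notes and its parameters are invented for the sketch.

```python
def merge_notes(lines_a, lines_b, sep="\\", ignore_case=True):
    """Append the notes from file A to the lines of file B by prefix match."""
    def make_key(text):
        return text.upper() if ignore_case else text

    # First pass: collect notes under "date<sep>name", joining repeats with ";"
    notes = {}
    for line in lines_a:
        name, date, note = line.split(sep, 2)
        key = make_key(date + sep + name)
        if key in notes:
            notes[key] += ";" + note
        else:
            notes[key] = note

    # Second pass: a line of file B matches a key if it starts with that key
    merged = []
    for line in lines_b:
        match = next((k for k in notes if make_key(line).startswith(k)), None)
        if match is None:
            merged.append(line + sep)
        else:
            merged.append(line + sep + notes[match] + sep)
    return merged
```

With the sample files from the question this yields the desired output, including the bare trailing separator for the line without notes.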

To process the first file, we could use condition NR==FNR. During the first file processing, I used two hash tables, namely pattern_ht to store column 1 values in lower case and note_ht to store column 3 values. Since date string itself is not unique to serve as key, we will use the concatenation of columns 2 and 3 as key for both hash tables.
When processing the second file, firstly check if date string matches column 1 exactly after performing a split of the key string. If matched, perform the partial matching of column 2 to pattern_ht values after converting to lower case. If it matches and this is the first match, record the third_col value. If it already has another value, append to it with ";". Finally, display accordingly:
awk -F'\' 'NR==FNR {pattern_ht[$2";"$3] = tolower($1); note_ht[$2";"$3] = $3; next}
{third_col="";
for (key in pattern_ht) {
split(key,date_str,";");
if (date_str[1] == $1) {
if (tolower($2) ~ pattern_ht[key]) {
if (length(third_col) == 0)
third_col = note_ht[key]
else
third_col = third_col ";" note_ht[key]
}
}
}
if (length(third_col) != 0)
print $1"\\"$2"\\"third_col"\\"
else
print $1"\\"$2"\\"
}' fileA fileB
Here is the output:
2022.10.10\A?\note a;note b\
2022.10.14\B?\note c\
2022.10.14\a\note d\
2022.10.15\C\note e\
2022.10.15\D\

Compare multiple columns for each row

Using a csv file, I would like to compare multiple columns to check whether all values are the same or not.
The first row contains the headers.
The first column is the label.
The constant values should be from column 2 to the end (there can be 100 columns); for the example I put only 8 columns.
The purpose is to check that all values are the same and, when they are not, report it.
input file
Number,V2 1563,V03-1555,V4 - 294,V-05 1580,V6-1561,V7-1562,V05-1601,V9-1587
Code,4.1.06,4.1.03,4.1.06,4.1.06,4.1.06,4.1.06,4.1.06,4.1.06
Host Id,b90c27,b90c13,3.30E+65,b90c46,b90c21,b90c1f,b88a63,b90c49
SR,SR_2_MS,SR_2_MS,SR_4_MS,SR_2_MS,SR_2_MS,SR_2_MS,SR_2_MS,SR_2_MS
output desired
Bad code in V03-1555
Bad SR in V4 - 294
Appreciate your support

awk to the rescue!
I improvised a little bit. How do we know which values are correct and which are not? By popular vote: count the occurrences and assume the majority is right. As a side benefit, if all values are different, as in your "Host Id" row, nothing is reported.
$ awk -F, 'NR==1 {split($0,h); next}
{delete r;
for(i=2;i<=NF;i++) {r[$i]++; idx[$i]=i}
max=0;
for(k in r) if(max<r[k]) max=r[k];
if(length(r)>1)
for(k in r)
if(r[k]!=max)
print "Bad " $1 " in " h[idx[k]] " -> " k}' file
returns
Bad Code in V03-1555 -> 4.1.03
Bad SR in V4 - 294 -> SR_4_MS
you can remove the values printed, which I put for verification.
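A rough Python equivalent of the majority-vote idea is sketched below. It is not a translation of the awk one-liner: the name report_outliers is hypothetical, and unlike the awk version (which remembers only the last column for each distinct value) it reports every column whose value is not one of the most frequent values.

```python
from collections import Counter


def report_outliers(rows):
    """rows[0] is the header; each other row is [label, value, value, ...]."""
    header, *body = rows
    messages = []
    for row in body:
        label, values = row[0], row[1:]
        counts = Counter(values)
        top = max(counts.values())  # size of the majority group
        for col, value in enumerate(values, start=1):
            if counts[value] != top:
                messages.append(f"Bad {label} in {header[col]} -> {value}")
    return messages
```

When every value in a row is different, all counts equal the maximum, so nothing is reported for that row.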

character position in string

I have a data frame with character strings in column1 and ID in column2. The string contains A,T,G or C.
I would like to print the lines that have an A at position 1.
Then I would like to print the lines that have A at position 2 and so on and save them in separate files.
So far I have used biostrings in R for similar analysis, but it won't work for this problem exactly. I would like to use perl.
Sequence ID
TATACAAGGGCAAGCTCTCTGT mmu-miR-381-3p
TCGGATCCGTCTGAGCT mmu-miR-127-3p
ATAGTAGACCGTATAGCGTACG mmu-miR-411-5p
......
600 more lines

Biostrings will work perfectly, and will be pretty fast. Let's call your DNA stringset mydata
HasA <- sapply(mydata,function(x) as.character(x[2]) == "A")
Now you have a vector of TRUE or FALSE indicating which sequence has an A at position 2. You can make that into a nice data frame like this
HasA.df <- data.frame("SeqName" = names(mydata), "A_at_2" = HasA)

Not sure about the expected result,
mydata <- read.table(text="Sequence ID
TATACAAGGGCAAGCTCTCTGT mmu-miR-381-3p
TCGGATCCGTCTGAGCT mmu-miR-127-3p
ATAGTAGACCGTATAGCGTACG mmu-miR-411-5p",sep="",header=T,stringsAsFactors=F)
mCh <- max(nchar(mydata[,1])) #gives the maximum number of characters in the first column
sapply(seq(mCh), function(i) substr(mydata[,1],i,i)=="A") #gives the index
You can use which to get the index of the row that satisfies the condition for each position
res <- stack(setNames(sapply(seq(mCh),
function(i) which(substr(mydata[,1],i,i)=="A")),1:mCh))[,2:1]
tail(res, 5) #for the 13th position, 1st and 3rd row of the sequence are TRUE
ind values
#11 13 1
#12 13 3
#13 14 2
#14 15 3
#15 20 3
use the index values to extract the rows. For the 1st position
mydata[res$values[res$ind==1],]
# Sequence ID
# 3 ATAGTAGACCGTATAGCGTACG mmu-miR-411-5p

Using a perl one-liner
perl -Mautodie -lane '
BEGIN {($f) = @ARGV}
next if $. == 1;
my @c = split //, $F[0];
for my $i (grep {$c[$_] eq "A"} (0..$#c)) {
open my $fh, ">>", "$f.$i";
print $fh $_;
}
' file
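The grouping that the one-liner performs can be sketched in Python as well. This is an illustrative sketch only: lines_by_position is a made-up name, positions are counted from 1 (the Perl version numbers its output files from 0), and the header line is assumed to have been removed already. Writing each group to its own file is left to the caller.

```python
from collections import defaultdict


def lines_by_position(lines, base="A"):
    """Map each 1-based position to the lines whose sequence has `base` there."""
    hits = defaultdict(list)
    for line in lines:
        sequence = line.split()[0]  # first column holds the sequence
        for pos, char in enumerate(sequence, start=1):
            if char == base:
                hits[pos].append(line)
    return dict(hits)
```

Each value of the returned dictionary is the content of one output file: all lines that have the chosen base at that position.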

How to efficiently interlace multiple groups of lines in Vim?

I am trying to interlace three groups of lines of text. For example, the following text:
a
a
a
b
b
b
c
c
c
is to be transformed into:
a
b
c
a
b
c
a
b
c
Is there an efficient way of doing this?

Somewhere in the depths of my ~/.vim files I have an :Interleave command (appended below). Without any arguments, :Interleave will just interleave as normal. With 2 arguments, however, it will specify how many lines are to be grouped together, e.g. :Interleave 2 1 will take 2 rows from the top and then interleave them with 1 row from the bottom.
Now to solve your problem
:1,/c/-1Interleave
:Interleave 2 1
1,/c/-1 range starting with the first row and ending 1 row above the first line matching a letter c.
:1,/c/-1Interleave basically interleave the groups of a's and b's
:Interleave 2 1 the range is the entire file this time.
:Interleave 2 1 interleave the group of mixed a's and b's with the group of cs. With a mixing ratio of 2 to 1.
The :Interleave code is below.
command! -bar -nargs=* -range=% Interleave :<line1>,<line2>call Interleave(<f-args>)
fun! Interleave(...) range
if a:0 == 0
let x = 1
let y = 1
elseif a:0 == 1
let x = a:1
let y = a:1
elseif a:0 == 2
let x = a:1
let y = a:2
elseif a:0 > 2
echohl WarningMsg
echo "Argument Error: can have at most 2 arguments"
echohl None
return
endif
let i = a:firstline + x - 1
let total = a:lastline - a:firstline + 1
let j = total / (x + y) * x + a:firstline
while j < a:lastline
let range = y > 1 ? j . ',' . (j+y) : j
silent exe range . 'move ' . i
let i += y + x
let j += y
endwhile
endfun

Here is a "oneliner" (almost), but you have to redo it for every unique line minus 1, in your example 2 times. Perhaps of no use, but I think it was a good exercise to learn more about patterns in VIM. It handles all kind of lines as long as the whole line is unique (e.g. mno and mnp are two unique lines).
First make sure of this (and do not have / mapped to anything, or anything else in the line):
:set nowrapscan
Then map e.g. these (should be recursive, not nnoremap):
<C-R> and <CR> should be typed literally.
\v in patterns means "very magic", @! negative look-ahead. \2 use what's found in second parenthesis.
:nmap ,. "xy$/\v^<C-R>x$<CR>:/\v^(<C-R>x)@!(.*)$\n(\2)$/m-<CR>j,.
:nmap ,, gg,.
Then do ,, as many times as it takes, in your example 2 times. One for all bs and one for all cs.
EDIT: explanation of the mapping. I will use the example in the question as if it has run one time with this mapping.
After one run:
1. a
2. b
3. a
4. b
5. a
6. b
7. c
8. c
9. c
The cursor is then at the last a (line 5). Typing ,, first goes back to the first line and then runs the mapping for ,., which does the following:
"xy$ # yanks current line (line 1) to reg. "x" ("a") "
/\v^<C-R>x$<CR> # finds next line matching reg. "x" ("a" at line 3)
:/\v^(<C-R>x)@!(.*)$\n(\2)$/m-<CR>
# finds next line that have a copy under it ("c" in line 7) and moves that line
# to current line (to line 3, if no "-" after "m" it's pasted after current line)
# Parts in the pattern:
- ^(<C-R>x)@!(.*)$ # matches next line that doesn't start with what's in reg. "x"
- \n(\2)$ # ...and followed by newline and same line again ("c\nc")
- m-<CR> # inserts found line at current line (line 3)
j # down one line (to line 4, where second "a" now is)
,. # does all again (recursive), this time finding "c" in line 8
...
,. # gives error since there are no more repeated lines,
# and the "looping" breaks.

I just ran into this issue independently tonight. Mine's not as elegant as some of the answers, but it's easier to understand I think. It makes many assumptions, so it's a bit of a hack:
A) It assumes there's some unique character (or arbitrary character string) not present in any of the lines - I assume # below.
B) It assumes you don't want leading or trailing white space in any of the a, b, or c sections.
C) It assumes you can easily identify the maximum line length, and then pad all lines to be that length (e.g. perhaps using %! into awk or etc., using printf).
Pad all lines with spaces to the same maximum length.
Visual Select just the a and b sections, then %s/$/#
Block copy and paste the b section to precede the c section.
Block copy and paste the a section to precede the bc section.
%s/#/\r
%s/^ *//g
%s/ *$//g
delete the lines left where the a and b sections were.

If you have xclip you can cut the lines and use paste to interleave them:
Visual select one set of lines
Type "+d to cut them to the clipboard
Visual select the other set of lines
Type !paste -d '\n' /dev/stdin <(xclip -o -selection clipboard)

Put the following as interleave.awk in your path, make it executable.
#!/usr/bin/awk -f
BEGIN { C = 2; if (ARGC > 1) C = ARGV[1]; ARGV[1]="" }
{ g = (NR - 1) % C; if (!g) print $0; else O[g] = O[g] $0 "\n" }
END { for (i = 1; i < C; i++) printf "%s", O[i] }
Then from vim highlight the lines in visual mode, then call :'<,'>!interleave.awk 3, where 3 is the number of lines in each group (in the example this happens to equal the number of groups); leave the argument out for 2.
You asked for an efficient way. Interpreted languages aside, this may be the most efficient algorithm for interleaving arbitrary lines - the first group are immediately printed, saving some RAM. If RAM was at a premium (eg, massive lines or too many of them) you could instead store offsets to the start of each line, and if the lines had a consistent well defined length (at least within groups), you wouldn't even need to store offsets. However, this way the file is scanned only once (permitting use of stdin), and CPUs are fast at copying blocks of data, while file pointer operations probably each require a context switch as they would normally have to trigger a system call.
Perhaps most importantly, the code is simple and short - and efficiency of reading and implementation are usually the most important of all.
Edit: looks like others have come to the same solution - just found https://stackoverflow.com/a/16088069/118153 when reframing the question in a search engine to see if I'd missed something obvious.
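The modulo trick in interleave.awk amounts to reading the input with a fixed stride. A minimal Python sketch of that idea follows; the name interleave is made up here, and it assumes all groups have the same number of lines, with group_len being that number.

```python
def interleave(lines, group_len):
    """Interleave consecutive groups of `group_len` lines each."""
    if group_len < 1 or len(lines) % group_len:
        raise ValueError("input must consist of whole groups of group_len lines")
    # Take line 1 of every group, then line 2 of every group, and so on
    return [line
            for offset in range(group_len)
            for line in lines[offset::group_len]]
```

For the nine-line example, interleave(lines, 3) picks input lines 1, 4, 7, then 2, 5, 8, then 3, 6, 9.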

How do I read a delimited file with strings/numbers with Octave?

I am trying to read a text file containing digits and strings using Octave. The file format is something like this:
A B C
a 10 100
b 20 200
c 30 300
d 40 400
e 50 500
but the delimiter can be space, tab, comma or semicolon. The textread function works fine if the delimiter is space/tab:
[A,B,C] = textread ('test.dat','%s %d %d','headerlines',1)
However, it does not work if the delimiter is comma/semicolon. I tried to use dlmread:
dlmread ('test.dat',';',1,0)
but it does not work because the first column is a string.
Basically, with textread I can't specify the delimiter and with dlmread I can't specify the format of the first column. Not with the versions of these functions in Octave, at least. Has anybody ever had this problem before?

textread allows you to specify the delimiter: it honors the property arguments of strread. The following code worked for me:
[A,B,C] = textread( 'test.dat', '%s %d %d', 'delimiter', ',', 'headerlines', 1 )

I couldn't find an easy way to do this in Octave currently. You could use fopen() to loop through the file and manually extract the data. I wrote a function that would do this on arbitrary data:
function varargout = coltextread(fname, delim)
% Initialize the variable output argument
varargout = cell(nargout, 1);
% Initialize elements of the cell array to nested cell arrays
% This syntax is due to {:} producing a comma-separated list
[varargout{:}] = deal(cell());
fid = fopen(fname, 'r');
while true
% Get the current line
ln = fgetl(fid);
% Stop if EOF
if ln == -1
break;
endif
% Split the line string into components and parse numbers
elems = strsplit(ln, delim);
nums = str2double(elems);
nans = isnan(nums);
% Special case of all strings (header line)
if all(nans)
continue;
endif
% Find the indices of the NaNs
% (i.e. the indices of the strings in the original data)
idxnans = find(nans);
% Assign each corresponding element in the current line
% into the corresponding cell array of varargout
for i = 1:nargout
% Detect if the current index is a string or a num
if any(ismember(idxnans, i))
varargout{i}{end+1} = elems{i};
else
varargout{i}{end+1} = nums(i);
endif
endfor
endwhile
% Close the file once every line has been read
fclose(fid);
endfunction
It accepts two arguments: the file name, and the delimiter. The function is governed by the number of return variables that are specified, so, for example, [A B C] = coltextread('data.txt', ';'); will try to parse three different data elements from each row in the file, while A = coltextread('data.txt', ';'); will only parse the first elements. If no return variable is given, then the function won't return anything.
The function ignores rows that have all-strings (e.g. the 'A B C' header). Just remove the if all(nans)... section if you want everything.
By default, the 'columns' are returned as cell arrays, although the numbers within those arrays are actually converted numbers, not strings. If you know that a cell array contains only numbers, then you can easily convert it to a column vector with: cell2mat(A)'.
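For readers who want to see the column-wise parsing outside Octave, here is a small Python sketch of the same approach. It is an assumption-laden illustration, not a port: read_columns is a hypothetical name, any run of the given delimiter characters counts as one separator (so empty fields are not supported), and a column becomes numeric only if every value in it parses as a number.

```python
def read_columns(text, delimiters=" \t,;", skip_header=True):
    """Split delimited text into columns; numeric columns become floats."""
    rows = []
    for line in text.splitlines():
        for d in delimiters:
            line = line.replace(d, " ")  # normalise every delimiter to a space
        fields = line.split()
        if fields:
            rows.append(fields)
    if skip_header:
        rows = rows[1:]

    columns = []
    for column in zip(*rows):
        try:
            columns.append([float(v) for v in column])
        except ValueError:
            columns.append(list(column))  # keep non-numeric columns as strings
    return columns
```

Usage would look like A, B, C = read_columns(open('test.dat').read()), mirroring the three output variables in the textread call.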
