Generate ranges from entries in a file - linux

I have a txt file containing entries as below
1
2
3
4
7
8
9
12
14
15
I need to generate ranges as below
1-4
7-9
12-12
14-15
How do I achieve the above output?
This is what I tried:
awk '{q=$1}{f=$1}{print $q} $1!=p+1{print l"-"f}{l=p+1}{p=$1} END{print}' filename

I would say...
awk -v OFS=- 'prev+1<$0 {print first ? first : 1,prev; first=$0}
{prev=$0}
END {print first, prev}' file
For your given file it returns:
$ awk -v OFS=- 'prev+1<$0 {print first ? first : 1,prev; first=$0} {prev=$0} END {print first, prev}' file
1-4
7-9
12-12
14-15
I won't go through your attempt awk '{q=$1}{f=$1}{print $q} $1!=p+1{print l"-"f}{l=p+1}{p=$1} END{print}' filename, but I do suggest you use more descriptive variable names, and that you start from a small piece and grow the script from there. Otherwise it becomes a jungle you end up throwing away once it stops working.
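Note that first ? first : 1 hard-codes 1 as the start of the very first range, which happens to match your sample data. If the file might not start at 1, a small variant (my own sketch, not part of the answer above) that reads the starting value from the first line could look like this:
awk 'NR==1 {first=$1}
     NR>1 && $1 > prev+1 {print first "-" prev; first=$1}
     {prev=$1}
     END {if (NR) print first "-" prev}' file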

Related

find lines existing in one file and not in another, based on a portion of the line

I have two files A.dat and B.dat.
A.dat
112381550RSAP002839002C00000000020200600000110102020-05-26
112539961RSAP002839002C00000000020200700000140102020-05-26
140823748RSAP002839002C00000000020210200000050102020-05-26
110604754RSAP002839002C00000000020200600000110102020-05-26
B.dat
112381550RSAP002839002C00000000020200600000000102020-05-26
112539961RSAP002839002C00000000020200700000000102020-05-26
119A06559RSAP002839002C00000000020210100000000102020-05-26
119231672RSAP002839002C00000000020200900000000102020-05-26
118372226RSAP002839002C00000000020200800000000102020-05-26
I want to find records in B.dat that do not exist in A.dat, based on the first 22 characters of each line.
the output should be below
119A06559RSAP002839002C00000000020210100000000102020-05-26
119231672RSAP002839002C00000000020200900000000102020-05-26
118372226RSAP002839002C00000000020200800000000102020-05-26
I tried using grep like below:
grep -Fvxf B.dat A.dat > c.dat
but I didn't find a way to compare only that portion of the data.
Could you please try the following.
awk 'FNR==NR{array[substr($0,1,22)];next} !(substr($0,1,22) in array)' A.dat B.dat
Explanation: a detailed, commented version of the above.
awk ' ##Start of the awk program.
FNR==NR{ ##When FNR==NR we are reading the first file (A.dat), so do the following.
array[substr($0,1,22)] ##Create an array element whose index is the first 22 characters of the current line.
next ##next skips all further statements for this line.
}
!(substr($0,1,22) in array) ##While reading B.dat: if the first 22 characters of the current line are NOT in the array, print the line.
' A.dat B.dat ##Input file names.
I would use the following method based on awk:
awk '{s=substr($0,1,22)}(FNR==NR){a[s];next}!(s in a)' A.dat B.dat
This ensures that you will always match the first 22 characters.
It essentially does the following: every time a line is read (regardless of which file it comes from) it builds a small string s containing the first 22 characters of the line. While we are processing the first file (FNR==NR) the string is stored as a key of array a; while processing the second file, we check whether that string is a key of a and, if not, print the line.
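With the sample A.dat and B.dat above, this should print the three records that exist only in B.dat:
$ awk '{s=substr($0,1,22)}(FNR==NR){a[s];next}!(s in a)' A.dat B.dat
119A06559RSAP002839002C00000000020210100000000102020-05-26
119231672RSAP002839002C00000000020200900000000102020-05-26
118372226RSAP002839002C00000000020200800000000102020-05-26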
You could also attempt a grep-based solution, but this could lead to false positives, depending on what your input looks like:
cut -c1-22 A.dat | grep -vFf - B.dat
This however could match the first 22 characters of the lines of A.dat anywhere in the lines of B.dat (not necessarily the first 22 characters)
You can do this with just grep and colrm as follows (a filename of "-" is understood as stdin and you can use that with "-f"):
colrm 23 < A.dat | grep -F -v -f - B.dat
If you're not 100% sure those 22-character patterns are going to match only at the starts of lines, you need to add a '^' to each line of output from colrm and elide the "-F" flag from grep's flags, like so:
colrm 23 < A.dat | sed -e 's/^/\^/;' | grep -v -f - B.dat
If the order of the output is unimportant, here's a grep-free method using bash, sort, and GNU uniq:
sort {A,A,B}.dat | uniq -uw 22
...or in POSIX shell:
sort A.dat A.dat B.dat | uniq -uw 22
Listing A.dat twice means every 22-character key from A.dat occurs at least twice, so uniq -u (print only lines whose first 22 characters are not repeated) keeps just the records unique to B.dat. Output of either method:
118372226RSAP002839002C00000000020200800000000102020-05-26
119231672RSAP002839002C00000000020200900000000102020-05-26
119A06559RSAP002839002C00000000020210100000000102020-05-26

Remove duplicates, but keeping only the last occurrence in linux file [duplicate]

This question already has answers here:
Eliminate partially duplicate lines by column and keep the last one
(4 answers)
Closed 6 years ago.
INPUT FILE :
5,,OR1,1000,Nawras,OR,20160105T05:30:17+0400,20181231T23:59:59+0400,,user,,aaa8016058f008ddceae6329f0c5d551,50293277591,,,30001,C
5,,OR1,1000,Nawras,OR,20160105T05:30:17+0400,20181231T23:59:59+0400,20160217T01:45:18+0400,,user,aaa8016058f008ddceae6329f0c5d551,50293277591,,,30001,H
5,,OR2,2000,Nawras,OR,20160216T06:30:18+0400,20191231T23:59:59+0400,,user,,f660818af5625b3be61fe12489689601,50328589469,,,30002,C
5,,OR2,2000,Nawras,OR,20160216T06:30:18+0400,20191231T23:59:59+0400,20160216T06:30:18+0400,,user,f660818af5625b3be61fe12489689601,50328589469,,,30002,H
5,,OR1,1000,Nawras,OR,20150328T03:00:13+0400,20171230T23:59:59+0400,,user,,22bf18b024e1d4f42ac79943062cf576,50212935879,,,10001,C
5,,OR1,1000,Nawras,OR,20150328T03:00:13+0400,20171230T23:59:59+0400,20150328T03:00:13+0400,,user,22bf18b024e1d4f42ac79943062cf576,50212935879,,,10001,H
0,,OR5,5000,Nawras,OR,20160421T02:45:16+0400,20191231T23:59:59+0400,,user,,c7c501ac92d85a04bb26c575929e9317,50329769192,,,11001,C
0,,OR5,5000,Nawras,OR,20160421T02:45:16+0400,20191231T23:59:59+0400,20160421T02:45:16+0400,,user,c7c501ac92d85a04bb26c575929e9317,50329769192,,,11001,H
0,,OR1,1000,Nawras,OR,20160330T02:00:14+0400,20181231T23:59:59+0400,,user,,d4ea749306717ec5201d264fc8044201,50285524333,,,11001,C
DESIRED OUTPUT :
5,,OR1,1000,UY,OR,20160105T05:30:17+0400,20181231T23:59:59+0400,20160217T01:45:18+0400,,user,aaa8016058f008ddceae6329f0c5d551,50293277591,,,30001,H
5,,OR2,2000,UY,OR,20160216T06:30:18+0400,20191231T23:59:59+0400,20160216T06:30:18+0400,,user,f660818af5625b3be61fe12489689601,50328589469,,,30002,H
5,,OR1,1000,UY,OR,20150328T03:00:13+0400,20171230T23:59:59+0400,20150328T03:00:13+0400,,user,22bf18b024e1d4f42ac79943062cf576,50212935879,,,10001,H
0,,OR5,5000,UY,OR,20160421T02:45:16+0400,20191231T23:59:59+0400,20160421T02:45:16+0400,,user,c7c501ac92d85a04bb26c575929e9317,50329769192,,,11001,H
0,,OR1,1000,UY,OR,20160330T02:00:14+0400,20181231T23:59:59+0400,,user,,d4ea749306717ec5201d264fc8044201,50285524333,,,11001,C
CODE USED :
for i in `cat file | awk -F, '{print $13}' | sort | uniq`
do
grep $i file | tail -1 >> TESTINGGGGGGG_SV
done
This took a long time, as the file has 300 million records, with 65 million unique values in the 13th column.
So I need a faster way to get, for each value of the 13th column, its last occurrence in the file.
awk to the rescue!
awk -F, 'p!=$13 && p0 {print p0} {p=$13; p0=$0} END{print p0}' file
This expects input sorted on the 13th field.
Please post the timing if you can run the script successfully.
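If the file is not already grouped by the 13th field, one sketch (assuming GNU sort; the -s flag keeps the original relative order of lines that share a key, so "last occurrence" still means last in the original file) would be to sort it first:
sort -t, -k13,13 -s file |
awk -F, 'p!=$13 && p0 {print p0} {p=$13; p0=$0} END{print p0}'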
If sorting is not possible, another option is
tac file | awk -F, '!a[$13]++' | tac
reverse the file, take the first entry for $13 and reverse the results back.
Here's a solution that should work:
awk -F, '{rows[$13]=$0} END {for (i in rows) print rows[i]}' file
Explanation:
rows is an associative array indexed by field 13 ($13), and its value is the whole line ($0). The element indexed by $13 is overwritten every time field 13 repeats, so when the file ends it holds the last occurrence for each key. Note that for (i in rows) visits the keys in an unspecified order, so the output order may differ from the input order.
This is also inefficient in terms of memory, because the last whole line for every distinct key has to be kept in the array.
An improvement to the above solution that's still not using sorting is to just save the line numbers in the associative array:
awk -F, '{rows[$13]=NR}END {for(i in rows) print rows[i]}' file|while read lN; do sed "${lN}q;d" file; done
Explanation:
rows as before but the values are the line numbers and not the whole lines
awk -F, '{rows[$13]=NR}END {for(i in rows) print rows[i]}' file outputs a list of row numbers containing the sought lines
sed "${lN}q;d" fetches line number lN from file
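The sed loop rereads the whole file once per selected line, which is costly for 300 million records. A sketch of a variant (not from the original answer) that collects the wanted line numbers and then makes a single extra pass with awk, preserving the original line order:
awk -F, '{rows[$13]=NR} END {for (i in rows) print rows[i]}' file |
awk 'NR==FNR {want[$1]; next} FNR in want' - file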

How to format decimal places using awk in Linux

original file :
a|||a 2 0.111111
a|||book 1 0.0555556
a|||is 2 0.111111
Now I need to format the third column to 6 decimal places.
Then I tried awk {'print $1,$2; printf "%.6f\t",$3'}
but the output is not what I want.
Result:
a|||a 2
0.111111 a|||book 1
0.055556 a|||is 2
That's weird. How can I modify just the third column?
Your print is adding a newline character. Include your third field inside the print, but formatted. Try the sprintf() function, like:
awk '{print $1,$2, sprintf("%.6f", $3)}' infile
That yields:
a|||a 2 0.111111
a|||book 1 0.055556
a|||is 2 0.111111
Print adds a newline on the end of printed strings, whereas printf by default doesn't. This means a newline is added after every second field and none is added after the third.
You can use printf for the whole string and manually add a newline.
Also, I'm not sure why you are adding a tab to the end of the lines, so I removed that:
awk '{printf "%s %d %.6f\n",$1,$2,$3}' file
a|||a 2 0.111111
a|||book 1 0.055556
a|||is 2 0.111111
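If you want to touch only the third field and let awk print the rest unchanged, another option (a sketch, not one of the answers above) is to reassign $3; note that reassigning a field rebuilds the record using OFS, which is a single space by default:
awk '{$3 = sprintf("%.6f", $3)} 1' file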

Removing last column from rows that have three columns using bash

I have a file that contains several lines of data. Some lines contain three columns, but most contain only two. All lines are single-tab separated. For those that contain three columns, the third column is typically redundant and contains the same data as the second so I'd like to remove it.
I imagine awk or cut would be appropriate, but I'm drawing a blank on how to test the row for three columns so my script will only work on those rows. I know awk is a very powerful language with logic and whatnot built into it, I'm just not that strong with it.
I looked at a similar question, but I'm not sure what is going on with the awk answer. Should the -4 be -1 since I only want to remove one column? What about if the row has two columns; will it remove the second even though I don't want to do anything?
I modified it to what I think it would be:
awk -F"\t" -v OFS="\t" '{ for (i=1;i<=NF-4;i++){ print $i }}'
But when I run it (with the file) nothing happens. If I change it to NF-1 or NF-2 I get some output, but it's only a handful of lines, and only the first column.
Can anyone clue me into what I should be doing?
If you just want to remove the third column, you could just print the first and the second:
awk -F '\t' '{print $1 "\t" $2}'
And it's similar to cut:
cut -f 1,2
The awk variable NF gives you the number of fields. So an expression like this should work for you:
awk -F, 'NF == 3 {print $1 "," $2} NF != 3 {print $0}'
Running it on an input file like so
a,b,c
x,y
u,v,w
l,m
gives me
$ cat test | awk -F, 'NF == 3 {print $1 "," $2} NF != 3 {print $0}'
a,b
x,y
u,v
l,m
This might work for you (GNU sed):
sed 's/\t[^\t]*//2g' file
This restricts the file to two columns: the 2g flag tells GNU sed to delete the second and every subsequent occurrence of a tab followed by a field.
awk 'NF==3{print $1"\t"$2}NF==2{print}' your_file
Tested below:
> cat temp
1 2
3 4 5
6 7
8 9 10
>
> awk 'NF==3{print $1"\t"$2}NF==2{print}' temp
1 2
3 4
6 7
8 9
>
Or in a simpler way in awk:
awk 'NF==3{print $1"\t"$2}NF==2' your_file
Or you can also go with perl:
perl -lane 'print "$F[0]\t$F[1]"' your_file
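Another approach (a sketch; decreasing NF is not guaranteed by POSIX, but GNU awk rebuilds the record when NF is assigned) is to truncate the field count, so any extra columns are dropped while two-column lines pass through untouched:
awk -F'\t' -v OFS='\t' 'NF>2 {NF=2} 1' your_file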

write a two column file from two files using awk

I have two files of one column each
1
2
3
and
4
5
6
I want to write a unique file with both elements as
1 4
2 5
3 6
It should be really simple I think with awk.
You could try paste -d ' ' <file1> <file2>. (Without -d ' ' the delimiter would be tab.)
paste works okay for the example given, but it doesn't handle variable-length lines very well. A nice little-known coreutils tool, pr, provides a more flexible solution (-m merges the files in parallel columns, -t omits headers, and -w sets the output width):
$ pr -mtw 4 file1 file2
1 4
2 5
3 6
A variable length example:
$ pr -mtw 22 file1 file2
10 4
200 5
300,000,00 6
And since you asked about awk here is one way:
$ awk '{a[FNR]=a[FNR]$0" "}END{for(i=1;i<=length(a);i++)print a[i]}' file1 file2
1 4
2 5
3 6
Using awk
awk 'NR==FNR { a[FNR]=$0;next } { print a[FNR],$0 }' file{1,2}
Explanation:
NR==FNR ensures our first action block runs for the first file only.
a[FNR]=$0 stores each line of the first file in array a, indexed by its line number.
Once the first file is done, we move to the second action block,
which prints the stored line from the first file followed by the corresponding line from the second file (see the sample run below).
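With the two sample files from the question (assumed here to be named file1 and file2), a run looks like this:
$ printf '1\n2\n3\n' > file1
$ printf '4\n5\n6\n' > file2
$ awk 'NR==FNR { a[FNR]=$0;next } { print a[FNR],$0 }' file1 file2
1 4
2 5
3 6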
