How to change this file:
335
339
666665
666668
to this result:
335
336
337
338
339
666665
666666
666667
666668
Explanation: between two numbers with the same number of digits, insert the missing numbers so the sequence is in ascending numeric order. Many thanks.
I believe this does what you want.
awk 'alen==length($1) {for (i=a;i<=$1;i++) print i}; {a=$1; alen=length(a); if (a==(i-1)) {a++}}'
When alen (the length of a) is the same as the length of the current line, loop from a to $1, printing all the missing values.
Then set a to the new $1 and alen to the length of a; when we have just dealt with a missing range (a is the same as i - 1), increment a so we don't duplicate that number (this handles sequential lines like 335, 339, 350 without printing 339 twice).
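For readability, here is the same script spread over multiple lines with comments; it should behave identically to the one-liner above:
awk '
  # current number has the same digit count as the previous one:
  # print everything from the previous number up to this one
  alen == length($1) { for (i = a; i <= $1; i++) print i }
  # remember this number and its digit count for the next line
  { a = $1; alen = length(a) }
  # if we just printed a range ending at a, step past it so the
  # next range does not print that value twice
  a == (i - 1) { a++ }
'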
With credit to @fedorqui for the basic idea.
Edit: I believe this fixes the problem I noted in the comments (which I think is what @JohnB was indicating as well):
awk '{f=0; if (alen==length($1)) {for (i=a;i<=$1;i++) print i} else {f=1}} {a=$1; alen=length(a)} a==(i-1){a++} f{print; a++}'
I feel like there should be a simpler way to do that but I don't see it at the moment.
Edit again: The input file I ended up testing with:
335
339
340
345
3412
34125
666665
666668
A first approach is this:
$ awk 'NR%2 {a=$1; next} $1>a {for (i=a;i<=$1;i++) print i}' file
335
336
337
338
339
666665
666666
666667
666668
It can be improved in proportion to the info and effort you put into your question :)
Explanation
NR%2 {a=$1; next} : NR stands for the number of the record (the line number in this case), so NR%2 is 1 when NR is not a multiple of 2. This stores the value of the line in the variable a on odd lines. Then next stops processing the current line.
$1>a {for (i=a;i<=$1;i++) print i} : in the other case (even lines), if the value is bigger than the one that was stored, it loops from that value up to the current one, printing all the values in between.
I have a text file supplied.tsv with file paths and a column with the file size, as follows. I want to ensure that the filenames are unique.
./statistics/variant_calls/v12_HG03486_hgsvc_pbsq2-ccs_1000.snv.QUAL10.GQ100.vcf.cluster.stats 676
./statistics/variant_calls/v12_HG03486_hgsvc_pbsq2-ccs_1000.snv.QUAL10.GQ100.vcf.stats 788
./v12_config_20200721-092246_HG02818_HG03125_HG03486.json 887
./v12_config_20200721-092246_HG02818_HG03125_HG03486.json 887
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz 566
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz 566
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz.tbi 772
Expected output
Yes all unique filenames
MY PLAN
I will extract the first column from file
awk -F"\t" '{print $1}' supplied.tsv > supplied_firstcolumn.txt
Then extract the filename and check the distinct lines. Kindly let me know how to do this efficiently.
awk '{ fil[$1]++ } END { for (i in fil) { if (fil[i]>1) { print i" - "fil[i];dup++ } } if (dup < 1) { print "No duplicates" } }' files.txt
Create an array called fil with the filename as the index and increment the value every time the file is seen. At the end, loop through the fil array and, if the value is greater than 1, print the filename and the count, and also increment a duplicate count (dup). If the dup variable is less than 1 at the end of the loop, print "No duplicates".
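If awk isn't a requirement, a rough equivalent with standard tools (assuming the columns are tab-separated, as in the awk command above) is to pull out the first column, sort it, and let uniq report the repeats:
cut -f1 supplied.tsv | sort | uniq -d
uniq -d prints one copy of each line that occurs more than once (which is why the input must be sorted first); no output means no duplicates.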
how to do this efficiently
As you are interested in whether there are duplicates, not in how many, I suggest stopping after hitting the first duplicate. I would do it the following way. Let the content of file.txt be:
./statistics/variant_calls/v12_HG03486_hgsvc_pbsq2-ccs_1000.snv.QUAL10.GQ100.vcf.cluster.stats 676
./statistics/variant_calls/v12_HG03486_hgsvc_pbsq2-ccs_1000.snv.QUAL10.GQ100.vcf.stats 788
./v12_config_20200721-092246_HG02818_HG03125_HG03486.json 887
./v12_config_20200721-092246_HG02818_HG03125_HG03486.json 887
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz 566
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz 566
./variant_calls/v12_HG02818_hgsvc_pbsq2-ccs_1000.wh-phased.vcf.bgz.tbi 772
then
awk 'BEGIN{uniq=1}(++arr[$1]>=2){uniq=0;exit}END{print uniq ? "all unique" : "found nonunique"}' file.txt
output
found nonunique
Explanation: First I set uniq to 1, which will stay that way if no duplicates are found. Then for every line I increment the counter in arr for the given path ($1) and check whether, after that operation, it is greater than or equal to 2. If it is, this means it is the 2nd or a later occurrence, so I set uniq to 0 and stop processing the file using exit, in other words jump to END. In END I print depending on the value of uniq; if you prefer to print only when no duplicates were found, you might use if(uniq){print "unique"} in END.
(tested in gawk 4.2.1)
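A shorter sketch of the same early-exit idea, reporting the result through awk's exit status instead of a printed flag:
awk 'seen[$1]++{exit 1}' file.txt && echo "all unique" || echo "found nonunique"
Here seen[$1]++ is zero (false) the first time a path is seen and non-zero afterwards, so awk exits with status 1 on the first repeat and status 0 otherwise.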
I want to add the symbol " >>" at the end of the 1st line, then the 5th line, and so on: lines 1, 5, 9, 13, 17, .... I searched the web and went through the article below, but I'm unable to achieve it. Please help.
How can I append text below the specific number of lines in sed?
retentive
good at remembering
The child was very sharp, and her memory was extremely retentive.
— Rowlands, Effie Adelaide
unconscionable
greatly exceeding bounds of reason or moderation
For generations in the New York City public schools, this has become the norm with devastating consequences rooted in unconscionable levels of student failure.
— New York Times (Nov 4, 2011)
The output should be like:
retentive >>
good at remembering
The child was very sharp, and her memory was extremely retentive.
— Rowlands, Effie Adelaide
unconscionable >>
greatly exceeding bounds of reason or moderation
For generations in the New York City public schools, this has become the norm with devastating consequences rooted in unconscionable levels of student failure.
— New York Times (Nov 4, 2011)
You can do it with awk:
awk '{if ((NR-1) % 5) {print $0} else {print $0 " >>"}}'
We check if line number minus 1 is a multiple of 5 and if it is we output the line followed by a >>, otherwise, we just output the line.
Note: The above code outputs the suffix every 5 lines, because that's what is needed for your example to work.
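A quick way to sanity-check which lines get the suffix, using seq to generate numbered lines:
seq 12 | awk '{if ((NR-1) % 5) {print $0} else {print $0 " >>"}}'
This marks lines 1, 6 and 11, i.e. every 5th line starting from the first.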
You can do it multiple ways. sed is kind of odd when it comes to selecting lines, but it's doable. E.g.:
sed:
sed -i -e 's/$/ >>/;n;n;n;n' file
You can also do it as a Perl one-liner:
perl -pi.bak -e 's/(.*)/$1 >>/ if not (( $. - 1 ) % 5)' file
You're thinking about this wrong. You should append to the end of the first line of every paragraph; don't worry about how many lines there happen to be in any given paragraph. That's just:
$ awk -v RS= -v ORS='\n\n' '{sub(/\n/," >>&")}1' file
retentive >>
good at remembering
The child was very sharp, and her memory was extremely retentive.
— Rowlands, Effie Adelaide
unconscionable >>
greatly exceeding bounds of reason or moderation
For generations in the New York City public schools, this has become the norm with devastating consequences rooted in unconscionable levels of student failure.
— New York Times (Nov 4, 2011)
This might work for you (GNU sed):
sed -i '1~4s/$/ >>/' file
Here are a couple more:
$ awk 'NR%5==1 && sub(/$/,">>>") || 1 ' foo
$ awk '$0=$0(NR%5==1?">>>":"")' foo
Here is a non-numeric way in Awk. This works if we have an Awk that supports the RS variable being more than one character long. We break the data into records based on the blank line separation: "\n\n". Inside these records, we break fields on newlines. Thus $1 is the word, $2 is the definition, $3 is the quote and $4 is the source:
awk 'BEGIN {OFS=FS="\n";ORS=RS="\n\n"} $1=$1" >>"'
We use the same output separators as input separators. Our only pattern/action step is then to edit $1 so that it has >> appended. The default action is { print }, which is what we want (print each record), so we can omit it.
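A minimal check of this record/field splitting, independent of the dictionary data (again assuming an awk, such as gawk, that supports a multi-character RS):
printf 'a\nb\n\nc\nd\n' | awk 'BEGIN {FS="\n"; RS="\n\n"} {print NR, $1}'
This prints 1 a and then 2 c: one record per blank-line-separated block, with the block's first line in $1.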
Shorter: Initialize RS from catenation of FS.
awk 'BEGIN {OFS=FS="\n";ORS=RS=FS FS} $1=$1" >>"'
This is nicely expressive: it says that the format uses two consecutive field separators to separate records.
What if we use a flag, initially zero, which is reset on every blank line? This solution still doesn't depend on a hard-coded number, just the blank-line separation. The rule fires on the first line, because C evaluates to zero, and then after every blank line, because we reset C to zero:
awk 'C++?1:$0=$0" >>";!NF{C=0}'
Shorter version of accepted Awk solution:
awk '(NR-1)%5?1:$0=$0" >>"'
We can use a ternary conditional expression cond ? then : else as a pattern, leaving the action empty so that it defaults to {print}, which of course means {print $0}. If the zero-based record number is not congruent to 0, modulo 5, we produce 1 to trigger the print action. Otherwise we evaluate $0=$0" >>" to add the required suffix to the record. The result of this expression is also Boolean true, which triggers the print action.
Shave off one more character: we don't have to subtract 1 from NR and then test for congruence to zero. Basically whenever the 1-based record number is congruent to 1, modulo 5, then we want to add the >> suffix:
awk 'NR%5==1?$0=$0" >>":1'
Though we have to add ==1 (+3 chars), we win because we can drop two parentheses and -1 (-4 chars).
We can do better (with some assumptions): instead of editing $0, we can create a second field which contains >> by assigning to the field $2. The implicit print action will print this, offset by a space:
awk 'NR%5==1?$2=">>":1'
But this only works when the definition line contains one word. If any of the words in this dictionary are compound nouns (separated by space, not hyphenated), this fails. If we try to repair this flaw, we are sadly brought back to the same length:
awk 'NR%5==1?$++NF=">>":1'
Slight variation on the approach: instead of trying to tack >> onto the record or the last field, why don't we conditionally install " >>\n" as ORS, the output record separator?
awk 'ORS=(NR%5==1?" >>\n":"\n")'
Not the tersest, but worth mentioning. It shows how we can dynamically play with some of these variables from record to record.
A different way of testing NR == 1 (mod 5): namely, a regexp!
awk 'NR~/[16]$/?$0=$0" >>":1'
Again, not tersest, but seems worth mentioning. We can treat NR as a string representing the integer as decimal digits. If it ends with 1 or 6 then it is congruent to 1, mod 5. Obviously, not easy to modify to other moduli, not to mention computationally disgusting.
I have two files, similar to the ones below:
File 1, with phenotype information; the first column is the individual. The original file has 400 rows:
215 2 25 13.8354303 15.2841303
222 2 25.2 15.8507278 17.2994278
216 2 28.2 13.0482192 14.4969192
223 11 15.4 9.2714745 11.6494745
File 2, with SNP information; the original file has 400 lines and 42,000 characters per line.
215 20211111201200125201212202220111202005111102
222 20111011212200025002211001111120211015112111
216 20210005201100025210212102210212201005101001
223 20222120201200125202202102210121201005010101
217 20211010202200025201202102210121201005010101
218 02022000252012021022101212010050101012021101
And I need to remove from file 2 the individuals that do not appear in file 1, leaving for example:
215 20211111201200125201212202220111202005111102
222 20111011212200025002211001111120211015112111
216 20210005201100025210212102210212201005101001
223 20222120201200125202202102210121201005010101
I could do this with this code:
awk 'NR==FNR{a[$1]; next} $1 in a{print $0}' file1 file2 > file3
However, when I run my main analysis with the generated file, the following errors appear:
*** Error in `./airemlf90': free(): invalid size: 0x00007f5041cc2010 ***
*** Error in `./postGSf90': free(): invalid size: 0x00007fec4a04f010 ***
airemlf90 and postGSf90 are software packages. But when I use the original file, this problem does not occur. Is the command I used to remove individuals adequate? Another detail I did not mention is that some individuals have IDs that are 4 characters long; could this be the cause of the error?
Thanks
I wrote a small Python script in a few minutes. It works well; I have tested it with 42,000-character lines and it works fine.
import sys, re

# rudimentary argument parsing
file1 = sys.argv[1]
file2 = sys.argv[2]
file3 = sys.argv[3]

present = set()

# first read file 1, discarding all fields except the first one (the key)
with open(file1, "r") as f1:
    for l in f1:
        toks = re.split(r"\s+", l)  # same as awk fields
        if toks:  # robustness against empty lines
            present.add(toks[0])

# now read the second file and write a line to the third one
# only if its id is in the set
with open(file2, "r") as f2:
    with open(file3, "w") as f3:
        for l in f2:
            toks = re.split(r"\s+", l)
            if toks and toks[0] in present:
                f3.write(l)
(First install python if not already present.)
Call my sample script mytool.py and run it like this:
python mytool.py file1.txt file2.txt file3.txt
Processing several files at once from a bash script (to replace the original solution) is easy, although not optimal, since it could all be done in one pass in Python:
<whatever for loop you need>; do
    python mytool.py "$1" "$2" "$3"
done
exactly like you would call awk with 3 files.
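For instance (the sample_* directory names here are made up purely for illustration):
for d in sample_A sample_B sample_C; do
    python mytool.py "$d/file1.txt" "$d/file2.txt" "$d/file3.txt"
done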
I have a file 'test' of DNA sequences, each with a header or ID like so:
>new
ATCGGC
>two
ACGGCTGGG
>tre
ACAACGGTAGCTACTATACGGTCGTATTTTTT
I would like to print the length of each contiguous string before and after a match to a given string, e.g. CGG.
The output would then look like this:
>new
2 1
>two
1 5
>tre
4 11 11
or the output could just have the character lengths before and after the matches for each line:
2 1
1 5
4 11 11
My first attempts used sed to print the next line after finding '>', then found the byte offset for each grep match of "CGG", which I was going to convert to lengths, but this produced the following:
sed -n '/>/ {n;p}' test | grep -aob "CGG"
2:CGG
8:CGG
21:CGG
35:CGG
Essentially, grep is printing the byte offset for each match, counting up, while I want the byte offset for each line independently (i.e. resetting after each line).
I suppose I need to use sed for the search as well, as it operates line by line, but I'm not sure how to count the byte offset or characters in a given string.
Any help would be much appreciated.
By using your given string as the field separator in awk, it's as easy as iterating through the fields on each line and printing their lengths. (Lines starting with > we just print as they are.)
This gives the desired output for your sample data, though you'll probably want to check edge cases like a sequence that starts with CGG, ends with CGG, or contains only CGG (there is a quick check of these after the explanation below).
$ awk -F CGG '/^>/ {print; next} {for (i=1; i<=NF; ++i) {printf "%s%s", length($i), (i==NF)?"\n":" "}}' file.txt
>new
2 1
>two
1 5
>tre
4 11 11
awk -F CGG
Invoke awk using "CGG" as the field separator. This parses each line into a set of fields separated by each occurrence (if any) of the string "CGG". The "CGG" strings themselves are not part of any field.
Thus the line ACAACGGTAGCTACTATACGGTCGTATTTTTT is parsed into the three fields: ACAA, TAGCTACTATA, and TCGTATTTTTT, denoted in the awk program by $1, $2, and $3, respectively.
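To see that splitting in isolation, here is a quick check (any POSIX awk treats a multi-character -F value as a regular expression):
echo 'ACAACGGTAGCTACTATACGGTCGTATTTTTT' | awk -F CGG '{print NF; print $1; print $2; print $3}'
This prints 3 followed by ACAA, TAGCTACTATA and TCGTATTTTTT.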
/^>/ {print; next}
This pattern/action tells awk that if the line starts with > to print the line and go immediately to the next line of input, without considering any further patterns or actions in the awk program.
{for (i=1; i<=NF; ++i) {printf "%s%s", length($i), (i==NF)?"\n":" "}}
If we arrive at this action, we know the line did not start with > (see above). Since there is only an action and no pattern, the action is executed for every line of input that reaches it.
The for loop iterates through all the fields (NF is a special awk variable that contains the number of fields in the current line) and prints their length. By checking if we've arrived at the last field, we know whether to print a newline or just a space.
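As for the edge cases mentioned above: a match at the start or end of a line simply produces a zero-length field, so the count for that side comes out as 0. A quick check with made-up sequences:
printf 'CGGAAA\nAAACGG\nCGG\n' | awk -F CGG '{for (i=1; i<=NF; ++i) printf "%s%s", length($i), (i==NF)?"\n":" "}'
This prints 0 3, then 3 0, then 0 0.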
Here is an example of one line of the log:
2016-04-24 23:59:45 -1 6bd3fbb8-65ac-4d16-bf32-48659a76c499 2 +15173583107 14 +161760555935 14 de.xxxx-O2 layxxxd 0 1
I know how to group by one field, so this is the solution:
awk '{arr[$11]+=$12} END {for (i in arr) {print i,arr[i]}}' example.log
and these would be the results:
xx 144
layxxxd 49.267
My question is: how can I group by two fields instead of one, where the first is $11 and the second is $10? The results should then change to:
layxxxd unknown 100
layxxxd de.xxxx-O2 44
how can I group by two fields instead of one, where the first is $11 and the second is $10?
You can use $11 FS $10 as the key for your associative array:
awk '{arr[$11 FS $10] += $12} END {for (i in arr) {print i,arr[i]}}' example.log
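An equivalent sketch using awk's multidimensional-array subscript, where the two key parts are joined with the built-in SUBSEP and split back apart for printing:
awk '{arr[$11,$10] += $12} END {for (k in arr) {split(k, p, SUBSEP); print p[1], p[2], arr[k]}}' example.log
This avoids putting FS inside the key itself, which matters if a field could ever contain the separator.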