Repeat each line multiple times and add ascending numbers - linux

I would like to have each line in a file repeated a fixed number of times, with ascending numbers added, like this:
I have
wwx.domain.com/pageA/?page=1
wwx.domain.com/pageB/?page=1
wwx.domain.com/pageC/?page=1
I want
wwx.domain.com/pageA/?page=1
wwx.domain.com/pageA/?page=2
wwx.domain.com/pageA/?page=3
wwx.domain.com/pageB/?page=1
wwx.domain.com/pageB/?page=2
wwx.domain.com/pageB/?page=3
wwx.domain.com/pageC/?page=1
wwx.domain.com/pageC/?page=2
wwx.domain.com/pageC/?page=3
How can I do this?

awk '{ sub(/.$/,""); for(i=1; i<4; i++) print $0 i }' inputfile > outputfile
Explanation: Remove the last character from the input line, and in a loop print the (modified) input line followed by the loop index.
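The same idea can be parameterized so the repeat count is not hard-coded. This is only a sketch, assuming every input line ends in the number to be replaced. The function name repeat_pages is hypothetical (not part of the answer above), and it strips all trailing digits instead of just the last character:

```shell
# repeat_pages N [FILE...]  (hypothetical helper name)
# Prints each input line N times, replacing its trailing number with 1..N.
repeat_pages() {
    count=$1
    shift
    awk -v n="$count" '{
        sub(/[0-9]+$/, "")                    # drop the trailing number, however many digits
        for (i = 1; i <= n; i++) print $0 i   # re-append the loop counter
    }' "$@"
}

# Example: repeat one URL three times
printf '%s\n' 'wwx.domain.com/pageA/?page=1' | repeat_pages 3
```

With no FILE argument the function reads standard input, as in the example call.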

This might work for you (GNU sed):
sed -E 'h;:a;s/[^\n]*/&/3;t;x;s/(.*=)(.*)/echo "\1$((\2+1))"/e;x;G;ta' file
While the pattern space contains fewer than n (in this case 3) lines, append the current line, incrementing the last field by 1.
The solution uses the hold space to keep the last incremented line and shell arithmetic to increment the last field of the line.

Related

awk command to split filename based on substring

I have a directory in which the file names look like this:
Abc_def_ijk.txt-1
Abc_def_ijk.txt-2
Abc_def_ijk.txt-3
Abc_def_ijk.txt-4
Abc_def_ijk.txt-5
Abc_def_ijk.txt-6
Abc_def_ijk.txt-7
Abc_def_ijk.txt-8
Abc_def_ijk.txt-9
I would like to divide them into 4 variables as below
v1=Abc_def_ijk.txt-1,Abc_def_ijk.txt-5,Abc_def_ijk.txt-9
V2=Abc_def_ijk.txt-2,Abc_def_ijk.txt-6
V3=Abc_def_ijk.txt-3,Abc_def_ijk.txt-7
V4=Abc_def_ijk.txt-4,Abc_def_ijk.txt-8
If the number of files increases, the extra files should go into one of the above variables. I'm looking for an awk one-liner to achieve this.
I would do it using GNU AWK in the following way. Let the content of file.txt be
Abc_def_ijk.txt-1
Abc_def_ijk.txt-2
Abc_def_ijk.txt-3
Abc_def_ijk.txt-4
Abc_def_ijk.txt-5
Abc_def_ijk.txt-6
Abc_def_ijk.txt-7
Abc_def_ijk.txt-8
Abc_def_ijk.txt-9
then
awk '{arr[NR%4]=arr[NR%4] "," $0}END{print substr(arr[1],2);print substr(arr[2],2);print substr(arr[3],2);print substr(arr[0],2)}' file.txt
output
Abc_def_ijk.txt-1,Abc_def_ijk.txt-5,Abc_def_ijk.txt-9
Abc_def_ijk.txt-2,Abc_def_ijk.txt-6
Abc_def_ijk.txt-3,Abc_def_ijk.txt-7
Abc_def_ijk.txt-4,Abc_def_ijk.txt-8
Explanation: I store the lines in the array arr and decide where to put each line based on the line number (NR) modulo (%) four (4). I concatenate what is currently stored (an empty string if nothing so far) with , and the content of the current line ($0). This results in a leading , which I remove using the substr function, i.e. by starting at the 2nd character.
(tested in GNU Awk 5.0.1)

Using sed to delete specific lines after LAST occurrence of pattern

I have a file that looks like:
this: name
this: age
Remove these lines and space above.
Remove here too and space below
Keep everything below here.
I don't want to hardcode 2, as the number of lines containing "this" can change. How can I delete 4 lines after the last occurrence of the string? I am trying sed -e '/this: /{n;N;N;N;N;d}', but it deletes after the first occurrence of the string.
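Could you please try the following.
awk '
FNR==NR{                    # first pass over the file
  if($0~/this/){
    line=FNR                # remember the line number of the latest match
  }
  next
}
FNR<=line || FNR>(line+4)   # second pass: skip the 4 lines after the last match
' Input_file Input_file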
Output will be as follows with shown samples.
this: name
this: age
Keep everything below here.
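The two-pass approach can be wrapped so that the pattern and the number of lines are arguments. This is a sketch only: the function name delete_after_last is hypothetical, it assumes the input is a regular file (it is read twice), and it adds a guard so that nothing is deleted when the pattern never matches:

```shell
# delete_after_last REGEX N FILE  (hypothetical helper name)
# Prints FILE without the N lines that follow the last line matching REGEX.
delete_after_last() {
    awk -v pat="$1" -v n="$2" '
        FNR == NR { if ($0 ~ pat) last = FNR; next }   # pass 1: line number of the last match
        !last || FNR <= last || FNR > last + n          # pass 2: print all but the n lines after it
    ' "$3" "$3"
}
```

A call such as delete_after_last '^this:' 4 input_file would correspond to the command above. Note that awk processes backslash escapes in -v values, so a regex containing backslashes would need them doubled.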
You can also use this minor change to make your original sed command work.
sed '/^this:/ { :k ; n ; // b k ; N ; N ; N ; d }' input_file
It uses a loop which prints the current line and reads the next one (n) while it keeps matching the regex (the empty regex // recalls the latest one evaluated, i.e. /^this:/, and the command b k goes back to the label k on a match). Then you can append the next 3 lines and delete the whole pattern space as you did.
Another possibility, more concise, using GNU sed could be this.
sed '/^this:/ b ; /^/,$ { //,+3 d }' input_file
This one prints any line beginning with this: (b without label goes directly to the next line cycle after the default print action).
On the first line not matching this:, two nested ranges are triggered. The outer range is "one-shot": it is triggered right away because /^/ matches any line, and it stays active up to the last line ($). The inner range is a "toggle" range. It is also triggered right away because // recalls /^/ on this line (and only on this line, hence the one-shot outer range), and it stays triggered for 3 additional lines (the end address +3 is a GNU extension). After that, /^/ is no longer evaluated, so the inner range cannot trigger again because // now recalls /^this:/ (whose matches are short-circuited earlier by the b command).
This might work for you (GNU sed):
sed -E ':a;/this/n;//ba;$!N;$!ba;s/^([^\n]*\n?){4}//;/./!d' file
If the pattern space (PS) contains this, print the PS and fetch the next line.
If the following line contains this repeat.
If the current line is not the last line, append the next line and repeat.
Otherwise, remove the first four lines of the PS and print the remainder.
Unless the PS is empty in which case delete the PS entirely.
N.B. This only reads the file once. Also the OP says
How can I delete 4 lines after the last occurrence of the string
However the example would seem to expect 5 lines to be deleted.

Replace fasta headers using sed command

I have a fasta file which looks like this.
>header1
ATGC....
>header2
ATGC...
My list file looks like this
organism1
organism2
and contains a list of organisms that I want to replace the headers with.
I tried to use a for loop using sed command which is as follows:
for i in `cat list7b`; do sed "s/^>/$i/g" sequence.fa; done
but it didn't work. Please tell me how I can achieve this task.
The result file should look like this
>organism1
ATGC...
>organism2
ATGC....
That is, >header1 is replaced with >organism1, and so on.
The header lines are distinguished from the ATGC sequence lines because a header always starts with a > (greater-than) sign, whereas a sequence line does not.
The header lines should be replaced by the order of appearance, i.e. first header* replaced with first-line from file, 2nd header from the second and so on.
I also request to explain the logic if possible.
thanks in advance.
With awk this is easy to do in one run.
Assuming your fasta file is named sequence.fa and your organisms list file is named list7b as in the question you can use
awk 'NR == FNR { o[n++] = $0; next } /^>/ && i < n { $0 = ">" o[i++] } 1' list7b sequence.fa > output.fa
Explanation:
NR == FNR is a condition for doing something with the first file only. (total number of records is equal to number of records in current file)
{ o[n++] = $0; next } puts the input line into array o, counts the entries and skips further processing of the input line, so o will contain all your organism lines.
The next part is executed for the remaining file(s).
/^>/ && i < n is valid for lines that start with > as long as i is less than the number of elements n that were put into array o.
{ $0 = ">" o[i++] } replaces the current line with > followed by the array element (i.e. a line from the first file) and increments the index i to the next element.
1 is an "always true" condition with the implicit default action { print } to print the current line for every input line.
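The pieces explained above can be tried on throwaway sample files. This is a sketch only: the function name rename_headers and the temporary file contents are made up for illustration. The list is deliberately shorter than the number of headers to show that surplus headers are left unchanged:

```shell
# rename_headers LIST FASTA  (hypothetical helper name)
rename_headers() {
    awk 'NR == FNR { o[n++] = $0; next }        # first file: collect replacement names
         /^>/ && i < n { $0 = ">" o[i++] }      # header line and names left: swap it
         1                                      # print every line of the second file
    ' "$1" "$2"
}

tmp=$(mktemp -d)
printf '%s\n' 'organism1' > "$tmp/list7b"
printf '%s\n' '>header1' 'ATGC' '>header2' 'ATGC' > "$tmp/sequence.fa"
rename_headers "$tmp/list7b" "$tmp/sequence.fa"   # second header stays >header2
rm -r "$tmp"
```

One caveat of the NR == FNR idiom: if the list file is empty, the condition also holds for the second file, so the sequence lines would be swallowed instead of printed.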

extracting first line from file command such that

I have a file with almost 5*(10^6) lines of integer numbers, so it is fairly big.
The question is about extracting specific lines, filtering them by a condition.
For example, I'd like to:
1. Extract the first N lines without reading the entire file.
2. Extract the lines with numbers less than or equal to X (or >=, <, >).
3. Extract the lines satisfying a condition on the number (a math predicate).
Is there a clever way to perform these tasks? (using sed or awk or cat or head)
Thanks in advance.
To extract the first $NUMBER lines,
head -n "$NUMBER" filename
Assuming every line contains just a number (although it will also work if the first token is one), 2 can be solved like this:
awk '$1 >= 1234 && $1 < 5678' filename
And keeping in spirit with that, 3 is just the extension
awk 'condition' filename
It would have helped if you had specified what condition is supposed to be, though. This way, you'll have to read the awk documentation to find out how to code it. Again, the number will be represented by $1.
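Since the question does not say what the condition is, here are two guesses at what it might look like, purely as a sketch. The function names filter_le and filter_even are hypothetical, and both assume the number is the first token on each line:

```shell
# filter_le X [FILE...]: lines whose first field is <= X  (hypothetical helper name)
filter_le() {
    limit=$1
    shift
    awk -v x="$limit" '$1 + 0 <= x + 0' "$@"   # adding 0 forces a numeric comparison
}

# filter_even [FILE...]: lines whose first field is an even number  (hypothetical helper name)
filter_even() {
    awk '$1 % 2 == 0' "$@"
}

# Example
printf '%s\n' 3 10 7 4 | filter_le 5
printf '%s\n' 3 10 7 4 | filter_even
```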
I don't think I can explain anything about the head call, it's really just what it says on the tin. As for the awk lines: awk, like sed, works linewise. awk fetches lines in a loop and applies your code to each line. This code takes the form
condition1 { action1 }
condition2 { action2 }
# and so forth
For every line awk fetches, the conditions are checked in the order they appear, and the associated action to each condition is performed if the condition is true. It would, for example, have been possible to extract the first $NUMBER lines of a file with awk like this:
awk -v number="$NUMBER" '1 { print } NR == number { exit }' filename
where 1 is synonymous with true (like in C) and NR is the line number. The -v command line option initializes the awk variable number to $NUMBER. If no action is specified, the default action is { print }, which prints the whole line. So
awk 'condition' filename
is shorthand for
awk 'condition { print }' filename
...which prints every line where the condition holds.
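To illustrate how several condition/action pairs are checked in order for each line, here is a small made-up example. The function name classify and the labels are hypothetical. A line can trigger more than one action, and a line that satisfies no condition produces no output:

```shell
# classify [FILE...]  (hypothetical helper name)
classify() {
    awk '
        $1 < 0      { print $1, "negative" }
        $1 % 2 == 0 { print $1, "even" }
        $1 > 100    { print $1, "large" }
    ' "$@"
}

# -3 matches only the first pair, 7 matches none, 200 matches the last two
printf '%s\n' -3 7 200 | classify
```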

Removing two columns from csv without removing the column heading

I've been stuck on this for a while. I managed to remove two columns completely, but now I need to remove two parts (of 3 in total) within the 1st column, under its column heading. I've attached a snippet from my csv file.
timestamp;CPU;%usr;%nice;%sys;%iowait;%steal;%irq;%soft;%guest;%idle
2014-09-17 10-20-39 UTC;-1;6.53;0.00;4.02;0.00;0.00;0.00;0.00;0.00;89.45
2014-09-17 10-20-41 UTC;-1;0.50;0.00;1.51;0.00;0.00;0.00;0.00;0.00;97.99
2014-09-17 10-20-43 UTC;-1;1.98;0.00;1.98;5.45;0.00;0.50;0.00;0.00;90.10
2014-09-17 10-20-45 UTC;-1;0.50;0.00;1.51;0.00;0.00;0.00;0.00;0.00;97.99
2014-09-17 10-20-47 UTC;-1;0.50;0.00;1.50;0.00;0.00;0.00;0.00;0.00;98.00
2014-09-17 10-20-49 UTC;-1;0.50;0.00;1.01;3.02;0.00;0.00;0.00;0.00;95.48
What I'm wanting to do is remove yyyy-mm-dd and also UTC, leaving just 10-20-39 underneath the timestamp column heading. I've tried removing them but I can't seem to do it without taking out the headings.
Thanks to anyone who can help me with this
A perl way:
perl -pe 's/^.+? (.+?) .+?;/$1;/ if $.>1' file
Explanation
The -pe means "print each line after applying the script to it". The script simply identifies the first 3 space-separated words and replaces them with the 2nd of the three ($1, since that part of the pattern was captured). This is only run if the current line number ($.) is greater than 1.
An awk way
awk -F';' '(NR>1){sub(/[^ ]* /,"",$1); sub(/ [^ ]*$/,"",$1)}1;' OFS=";" file
Here, we set the input field delimiter to ; and use sub() to remove the 1st and last word from the 1st field.
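An alternative way to express the same thing in awk is to split the first field on spaces and keep only its second part. This is a sketch, assuming the timestamp field always has the form "date time zone". The function name strip_timestamp is hypothetical:

```shell
# strip_timestamp [FILE...]  (hypothetical helper name)
strip_timestamp() {
    awk -F';' -v OFS=';' '
        NR > 1 { split($1, t, " "); $1 = t[2] }   # keep only the time part, skip the heading line
        1                                         # print every line
    ' "$@"
}

# Example with a shortened heading and one data row
printf '%s\n' 'timestamp;CPU;%usr' '2014-09-17 10-20-39 UTC;-1;6.53' | strip_timestamp
```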
The following sed command should work for you:
sed '1!s/^[^ ]\+ //;1!s/ UTC//'
Explanations:
1! Do not apply to the first line.
s/^[^ ]\+ // Remove the first group of non-space characters at line beginning ("2014-09-17 " in your case).
s/ UTC// Remove the string " UTC".
Assuming the csv file is stored as a.csv, then
sed '1!s/^[^ ]\+ //;1!s/ UTC//' < a.csv
prints the results to standard output, and
sed '1!s/^[^ ]\+ //;1!s/ UTC//' < a.csv > b.csv
saves the result to b.csv.
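The \+ in the command above is a GNU extension. Where that is a concern, a variant using only basic regular expression syntax might look like the sketch below. The function name strip_date_utc is hypothetical, and the second substitution is anchored to the following ; on the assumption that UTC only needs removing from the first field:

```shell
# strip_date_utc [FILE...]  (hypothetical helper name)
strip_date_utc() {
    # [^ ][^ ]* stands in for [^ ]\+ (one or more non-space characters)
    sed -e '1!s/^[^ ][^ ]* //' -e '1!s/ UTC;/;/' "$@"
}

# Example with a shortened heading and one data row
printf '%s\n' 'timestamp;CPU;%usr' '2014-09-17 10-20-39 UTC;-1;6.53' | strip_date_utc
```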
EDITED:
Added: sample results:
[pengyu@GLaDOS tmp]$ sed '1!s/^[^ ]\+ //;1!s/ UTC//' < a.csv
timestamp;CPU;%usr;%nice;%sys;%iowait;%steal;%irq;%soft;%guest;%idle
10-20-39;-1;6.53;0.00;4.02;0.00;0.00;0.00;0.00;0.00;89.45
10-20-41;-1;0.50;0.00;1.51;0.00;0.00;0.00;0.00;0.00;97.99
10-20-43;-1;1.98;0.00;1.98;5.45;0.00;0.50;0.00;0.00;90.10
10-20-45;-1;0.50;0.00;1.51;0.00;0.00;0.00;0.00;0.00;97.99
10-20-47;-1;0.50;0.00;1.50;0.00;0.00;0.00;0.00;0.00;98.00
10-20-49;-1;0.50;0.00;1.01;3.02;0.00;0.00;0.00;0.00;95.48
