How to cut a file into chunks - linux

How to get information from specimen1 to specimen3 and paste it into another file 'DNA_combined.txt'?
I tried the cut command and the awk command, but I found it tricky to cut by paragraph(?) or sequence.
My attempt was something like cut -d '>' -f 1-3 dna1.fasta > DNA_combined.txt

You can display the line number for each row in vi by pressing Esc, then : and typing set nu.
Once you can see the line number corresponding to each row:
Note down the line number of the line containing >Specimen1 (say X) and of the last line belonging to Specimen3 (say Y).
Then, use the sed command to get the text between those two lines:
sed -n 'X,Yp' dna1.fasta > DNA_combined.txt
Please let me know if you have any questions.
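If you would rather not look up the line numbers by hand, sed can also select by pattern. A rough sketch (my own, untested), assuming the headers are literally >Specimen1 through >Specimen4 and that a >Specimen4 entry follows Specimen3 (adjust the patterns to your real headers):
sed -n '/^>Specimen1/,/^>Specimen4/p' dna1.fasta | sed '$d' > DNA_combined.txt
The range prints everything from the Specimen1 header up to and including the Specimen4 header, and the second sed drops that trailing header line.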

If you want the first three sequences irrespective of the content after >, you can use this:
$ cat ip.txt
>one
ACGTA
TCGAAA
>two
TGACA
>three
ACTG
AAAAC
>four
ATGC
>five
GTA
$ awk '/^>/ && ++count==4{exit} 1' ip.txt
>one
ACGTA
TCGAAA
>two
TGACA
>three
ACTG
AAAAC
/^>/ matches the start of a sequence
for such sequences, increment the count variable
if count reaches 4, the exit command will terminate the script
1 is the idiomatic awk way to print the contents of the current record
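Not part of the original answer, but the sequence count can also be passed in as a variable, so the same one-liner works for any number of leading sequences, for example:
awk -v n=3 '/^>/ && ++count==n+1{exit} 1' ip.txt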

Would you please try the following:
awk '
BEGIN {print ">Specimen1-3"} # print header
/^>Specimen/ {f = match($0, "^>Specimen[1-3]") ? 1 : 0; next}
# set the flag depending on the number
f # print if f == 1
' dna1.fasta > DNA_combined.txt

Related

Export every line (except the last line) of a file to create a new file for each line using AWK

For example, if there are n lines, then n-1 files should be created.
Here is what I have achieved so far. I am manually inserting 8 (i.e. the total number of lines). I don't know how to get the total number of lines into a variable and then use it.
awk '{if (NR<8) F=NR".ndjson"; print >> F; close (F)}' export.ndjson
One way, IMHO, could be to take the total number of lines with the wc command and store it in a variable. Fair warning: since the OP has not shown samples I couldn't test it; I edited/tweaked the OP's attempt here.
awk -v lines="$(wc -l < "export.ndjson")" '{if (NR<lines) {F=NR".ndjson"; print >> F; close(F)}}' export.ndjson
NOTE: Another way could be to read your Input_file export.ndjson twice: the first pass only counts the total number of lines, and the second pass uses that count in the condition.
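A minimal sketch of that two-pass idea (my own, untested; it reuses the NR".ndjson" naming from above):
awk 'NR==FNR{total=FNR; next} FNR<total{F=FNR".ndjson"; print > F; close(F)}' export.ndjson export.ndjson
The first pass only counts the lines; the second pass writes every line except the last to its own file.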
2nd solution: If my assumption is correct and you only want to leave out the last line, then you could try a tac + awk combination, where the first line of the reversed input is skipped.
tac export.ndjson | awk 'FNR>1{F=NR".ndjson"; print >> F; close (F)}'
If you want your output file numbers to start at 2:
awk 'NR>1{out=NR".json"; print prev > out; close(out)} {prev=$0}' export.ndjson
or at 1:
awk 'NR>1{out=(NR-1)".json"; print prev > out; close(out)} {prev=$0}' export.ndjson
or:
awk 'NR>1{print prev > out; close(out)} {prev=$0; out=NR".json"}' export.ndjson
Each of these prints every line to its own file, except the last line.
$ awk 'p{close(f); f="object"(NR-1)".ndjson"; print p > f} {p=$0}' file
note also that empty lines won't create empty files.
The logic is simply to delay printing by one record.
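A quick check of that behaviour (my own toy example, not from the answer): with a 3-line file only two files are created and the last line is never written.
$ printf 'a\nb\nc\n' > file
$ awk 'p{close(f); f="object"(NR-1)".ndjson"; print p > f} {p=$0}' file
$ cat object1.ndjson object2.ndjson
a
b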

changing two lines of a text file

I have a bash script which gets a text file as input and takes two parameters (line N° one and line N° two), then swaps those two lines with each other in the text. Here is the code:
#!/bin/bash
awk -v var="$1" -v var1="$2" 'NR==var {
s=$0
for(i=var+1; i < var1 ; i++) {
getline; s1=s1?s1 "\n" $0:$0
}
getline; print; print s1 s
next
}1' Ham > newHam_changed.txt
It works fine for any two lines which are not consecutive, but for lines which follow each other (for example lines 5 and 6) it works but creates a blank line between them. How can I fix that?
I think your actual script is not what you posted in the question. I think the line with all the prints contains:
print s1 "\n" s
The problem is that when the lines are consecutive, s1 will be empty (the for loop is skipped), but it will still print a newline before s, producing a blank line.
So you need to make that newline conditional.
awk -v var="4" -v var1="6" 'NR==var {
s=$0
for(i=var+1; i < var1 ; i++) {
getline; s1=s1?s1 "\n" $0:$0
}
getline; print; print (s1 ? s1 "\n" : "") s
next
}1' Ham > newHam_changed.txt
Using getline always makes awk scripts a bit complicated. It is better to avoid getline and just make use of the awk pattern { action } syntax; this makes for perfectly readable scripts. In any other language you would just loop and get the next line, but in awk I think it is best to make good use of this feature.
awk -v var="$1" -v var1="$2" '
NR==var {s=$0; collect=1; next;}
NR==var1 {collect=0; print; printf inbetween; print s; next}
collect {inbetween=inbetween""$0"\n"; next;}
1' Ham
Here I capture the first line in s when I find it and set the collect flag. This triggers the collect block on the following iterations, which collects all lines in between. When the second line is found, it sets collect back to zero and prints first the current line, then the in-between lines and then s. If the lines are consecutive, inbetween is empty and printf then does nothing; the trailing next prevents the second line from being printed again by the final 1 rule.
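As a quick sanity check (my own example, with the values inlined instead of $1/$2 and a file Ham holding the letters a to f), swapping lines 2 and 5 gives:
$ printf '%s\n' a b c d e f > Ham
$ awk -v var=2 -v var1=5 'NR==var{s=$0;collect=1;next} NR==var1{collect=0;print;printf inbetween;print s;next} collect{inbetween=inbetween $0 "\n";next} 1' Ham
a
e
c
d
b
f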
Too complex for my taste, here is something quite simple that achieves the same task:
#!/bin/bash
ORIGFILE='original.txt' # original text file
PROCFILE='processed.txt' # copy of the original file to be proccesed
CHGL1=`sed "$1q;d" $ORIGFILE` # get original $1 line
CHGL2=`sed "$2q;d" $ORIGFILE` # get original $2 line
cat $ORIGFILE > $PROCFILE # start from a fresh copy of the original
sed -i "$2s/^.*/$CHGL1/" $PROCFILE # replace
sed -i "$1s/^.*/$CHGL2/" $PROCFILE # replace
More code doesn't mean more useful; keep it simple. This code does not use a for loop and instead goes directly to the specific lines.
EDIT:
A simple way on one line to do this task:
printf '%s\n' 14m26 26-m14- w q | ed -s file
Found in this answer.
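As a hedged generalization of that one-liner (my own sketch, not from the linked answer), the two line numbers can come from the script's parameters, assuming $1 and $2 are the line numbers with $1 smaller than $2, and $3 is the file name:
printf '%s\n' "${1}m${2}" "${2}-m${1}-" w q | ed -s "$3"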

How to delete first 10 lines containing certain string?

Suppose I have a file with this structure:
1001.txt
1002.txt
1003.txt
1004.txt
1005.txt
2001.txt
2002.txt
2003.txt
...
Now how can I delete the first 10 lines which start with '2'? There might be more than 10 lines starting with '2'.
I know I can use grep '^2' file | wc -l to find the number of lines which start with '2', but how do I delete the first 10 of them?
You can pipe your list through this Perl one-liner:
perl -p -e '$_="" if (/^2/ and $i++ < 10)'
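For example (my own addition), writing the result to a new file, or editing in place with Perl's -i switch (keeping a .bak backup here):
perl -p -e '$_="" if (/^2/ and $i++ < 10)' file > newfile
perl -i.bak -p -e '$_="" if (/^2/ and $i++ < 10)' file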
Another in awk. Testing with the value 2, since your sample data only had 3 related lines; replace the latter 2 with a 10.
$ awk '/^2/ && ++c<=2 {next} 1' file
1001.txt
1002.txt
1003.txt
1004.txt
1005.txt
2003.txt
.
.
.
Explained:
$ awk '/^2/ && ++c<=2 { # if it starts with a 2 and counter still got iterations left
next # skip to the next record
} 1 # (else) output
' file
awk alternative:
awk '{ if (substr($0,1,1)=="2") { count++ } if ( count > 10 || substr($0,1,1)!="2") { print $0 } }' filename
If the first character of the line is 2, increment a counter. Then only print the line if count is greater than 10 or the first character isn't 2.

Paste corresponding characters from multiple lines together

I'm writing a Linux command that pastes corresponding characters from multiple lines together. For example, I want to change these lines
A---
-B--
---C
--D-
to this:
A----B-----D--C-
So far, I've made this:
cat sanger.a sanger.c sanger.g sanger.t | cut -c 1
This does the trick for only the first column, but it has to work for all the columns.
Is there anyone who can help?
EDIT: This is a better example. I want this:
SUGAR
HONEY
CANDY
to become
SHC UOA GNN AED RYY (without spaces)
Awk way for updated spec
awk -vFS= '{for(i=1;i<=NF;i++)a[i]=a[i]$i}
END{for(i=1;i<=NF;i++)printf "%s",a[i];print ""}' file
Output
A----B-----D--C-
SHCUOAGNNAEDRYY
P.S. for a large file this will use lots of memory.
A terrible way not using awk; also, you need to know the number of fields beforehand.
for i in {1..4};do cut -c $i test | tr -d "\n" ; done;echo
Here's a solution without awk or sed, assuming the file is named f:
paste -s -d "" <(for i in $(seq 1 $(wc -L < f)); do cut -c $i f; done)
wc -L is a GNUism which returns the length of the longest line in the input file, which might not work depending on your version/locale. You could instead find the longest line by doing something like:
awk '{if (length > x) {x = length}} END {print x}' f
Then using this value in the seq command instead of the above command substitution.
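Putting those two pieces together, a sketch of the same command with the awk substitution instead of wc -L (my own, untested; same file f) would be:
paste -s -d "" <(for i in $(seq 1 $(awk '{if (length > x) {x = length}} END {print x}' f)); do cut -c $i f; done)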
All right, time for some sed insanity! :D
Disclaimer: If this is for something serious, use something less brittle than this. awk comes to mind. Unless you feel confident enough in your sed abilities to maintain this lunacy.
cat file1 file2 etc | sed -n '1h; 1!H; $ { :loop; g; s/$/\n/; s/\([^\n]\)[^\n]*\n/\1/g; p; g; s/^.//; s/\n./\n/g; h; /[^\n]/ b loop }' | tr -d '\n'; echo
This comes in three parts: Say you have a file foo.txt
12345
67890
abcde
fghij
then
cat foo.txt | sed -n '1h; 1!H; $ { :loop; g; s/$/\n/; s/\([^\n]\)[^\n]*\n/\1/g; p; g; s/^.//; s/\n./\n/g; h; /[^\n]/ b loop }'
produces
16af
27bg
38ch
49di
50ej
After that, tr -d '\n' deletes the newlines, and ;echo adds one at the end.
The heart of this madness is the sed code, which is
1h
1!H
$ {
:loop
g
s/$/\n/
s/\([^\n]\)[^\n]*\n/\1/g
p
g
s/^.//
s/\n./\n/g
h
/[^\n]/ b loop
}
This first follows the basic pattern
1h # if this is the first line, put it in the hold buffer
1!H # if it is not the first line, append it to the hold buffer
$ { # if this is the last line,
do stuff # do stuff. The whole input is in the hold buffer here.
}
which assembles all input in the hold buffer before working on it. Once the whole input is in the hold buffer, this happens:
:loop
g # copy the hold buffer to the pattern space
s/$/\n/ # put a newline at the end
s/\([^\n]\)[^\n]*\n/\1/g # replace every line with only its first character
p # print that
g # get the hold buffer again
s/^.// # remove the first character from the first line
s/\n./\n/g # remove the first character from all other lines
h # put that back in the hold buffer
/[^\n]/ b loop # if there's something left other than newlines, loop
And there you have it. I might just have summoned Cthulhu.

How to add data beside each other in a csv file

If I have 3 csv files, and I want to merge the data all into one, but beside each other, how would I do it? For example:
Initial Merged file:
,,,,,,,,,,,,
File 1:
20,09/05,5694
20,09/06,3234
20,09/08,2342
File 2:
20,09/05,2341
20,09/06,2334
20,09/09,342
File 3:
20,09/05,1231
20,09/08,3452
20,09/10,2345
20,09/11,372
Final merged File:
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,,,09/11,372
Basically data from each file goes into a specific column of the merged file.
I know awk can be used for this, but I have no clue how to start.
EDIT: Only the 2nd and 3rd Columns of each file are being printed. I was using this to print out the 2nd and 3rd columns:
awk -v f="${i}" -F, 'match ($0,f) { print $2","$3 }' file3.csv > d$i.csv
However, if, say, file1 and file2 were null in that row, the data for that row would be shifted to the left, so I came up with this to account for the shift:
awk -v x="${i}" -F, 'match($0,x) { if ($2=="NULL") { print "," } else { print $2","$3 } }' alld.csv > d$i.csv
Using GNU awk for ARGIND:
$ gawk '{ a[FNR,ARGIND]=$0; maxFnr=(FNR>maxFnr?FNR:maxFnr) }
END {
for (i=1;i<=maxFnr;i++) {
for (j=1;j<ARGC;j++)
printf "%s%s", (j==1?"":",,,"), (a[i,j]?a[i,j]:",")
print ""
}
}
' file1 file2 file3
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,,,09/11,372
If you don't have GNU awk, just add an initial line that says FNR==1{ARGIND++}.
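In other words, a portable sketch of the same script without GNU awk's ARGIND might look like this (my own adaptation, untested):
awk 'FNR==1{ARGIND++}
     { a[FNR,ARGIND]=$0; maxFnr=(FNR>maxFnr?FNR:maxFnr) }
     END {
       for (i=1;i<=maxFnr;i++) {
         for (j=1;j<ARGC;j++)
           printf "%s%s", (j==1?"":",,,"), (a[i,j]?a[i,j]:",")
         print ""
       }
     }' file1 file2 file3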
Commented version per request:
$ gawk '
{ a[FNR,ARGIND]=$0; # Store the current line in a 2-D array `a` indexed by
# the current line number `FNR` and file number `ARGIND`.
maxFnr=(FNR>maxFnr?FNR:maxFnr) # save the max FNR value
}
END{
for (i=1;i<=maxFnr;i++) { # Loop from 1 to the max number of lines
# seen across all files and for each:
for (j=1;j<ARGC;j++) # Loop from 1 to total number of files parsed and:
printf "%s%s", # Print 2 strings, specifically:
(j==1?"":",,,"), # A separator - empty if we're printing
# the first file's value, three commas otherwise.
(a[i,j]?a[i,j]:",") # The value stored in the array if it was
# present in the files, a comma otherwise.
print "" # Print a newline
}
}
' file1 file2 file3
I originally was using an array fnr[FNR] to track the max value of FNR, but IMHO that's kinda obscure, and it has a flaw: if no file had, say, a 2nd line, then a loop on for (i=1; i in fnr; i++) in the END section would bail out before getting to the 3rd line.
paste is made for this:
$ paste -d";" f1 f2 f3 | sed 's/;/,,,/g'
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,09/11,372
Note that paste alone will output just one comma:
$ paste -d, f1 f2 f3
09/05,5694,09/05,2341,09/05,1231
09/06,3234,09/06,2334,09/08,3452
09/08,2342,09/09,342,09/10,2345
,,09/11,372
So to get multiple commas, we can use another delimiter such as ; and then replace it with ,,, using sed:
$ paste -d";" f1 f2 f3 | sed 's/;/,,,/g'
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,09/11,372
Using pr:
$ pr -mts',,,' file[1-3]
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,09/11,372
