How to cut a file into chunks - linux
How can I get the information from specimen1 through specimen3 and paste it into another file, 'DNA_combined.txt'?
I tried the cut command and the awk command, but I found that it is tricky to cut by paragraph (or sequence).
My attempt was something like cut -d '>' -f 1-3 dna1.fasta > DNA_combined.txt
In vim, you can display the line number for each row by pressing Esc and typing :set nu
Once you have the line number corresponding to each row:
Note down the line number of the line containing >Specimen1 (say X) and the last line of Specimen3 (say Y)
Then use the sed command to print the text between the two lines:
sed -n 'X,Yp' dna1.fasta > DNA_combined.txt
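As a concrete illustration (the line numbers here are assumptions about a hypothetical dna1.fasta, not taken from the question): grep -n '^>' dna1.fasta lists the line number of every header, so if it shows >Specimen1 on line 1 and >Specimen4 on line 13, the first three records end on line 12 and the extraction is:
$ sed -n '1,12p' dna1.fasta > DNA_combined.txt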
Please let me know if you have any questions.
If you want the first three sequences irrespective of the content after >, you can use this:
$ cat ip.txt
>one
ACGTA
TCGAAA
>two
TGACA
>three
ACTG
AAAAC
>four
ATGC
>five
GTA
$ awk '/^>/ && ++count==4{exit} 1' ip.txt
>one
ACGTA
TCGAAA
>two
TGACA
>three
ACTG
AAAAC
/^>/ matches the start of a sequence header
for each such header, ++count increments the count variable
if count reaches 4, the exit command terminates the script
1 is the idiomatic awk way to print the contents of the input record
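A small generalization (my sketch, not part of the original answer): pass the number of wanted sequences as an awk variable instead of hard-coding it, so the same one-liner works for any count:
$ awk -v n=3 '/^>/ && ++count==n+1{exit} 1' ip.txt
With n=3 this reproduces the output above; any other value keeps that many leading sequences.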
Would you please try the following:
awk '
BEGIN {print ">Specimen1-3"} # print header
/^>Specimen/ {f = match($0, "^>Specimen[1-3]") ? 1 : 0; next}
# set the flag depending on the number
f # print if f == 1
' dna1.fasta > DNA_combined.txt
Related
Export every line (except the last line) of a file to create a new file for each line using AWK
For example, if there are n lines, then n-1 files should be created. Here is what I have achieved so far; I am manually inserting 8 (i.e. the total number of lines). I don't know how to get the total number of lines into a variable and then use it.
awk '{if (NR<8) F=NR".ndjson"; print >> F; close (F)}' export.ndjson
One way, IMHO, could be to take the total number of lines with the wc command and store it in a variable. Fair warning: since the OP has not shown samples, I couldn't test it; I have edited/tweaked the OP's attempt here.
awk -v lines="$(wc -l < "export.ndjson")" '{if (NR<lines) F=NR".ndjson"; print >> F; close (F)}' export.ndjson
NOTE: Another way could be to read your Input_file export.ndjson twice: the first pass takes only the total number of lines, and the second pass uses that count in the condition.
2nd solution: If my assumption is correct and you want to leave out only the last line, then you could try a tac + awk combination, where the first line (of the reversed input) is skipped.
tac export.ndjson | awk 'FNR>1{F=NR".ndjson"; print >> F; close (F)}'
If you want your output file numbers to start at 2:
awk 'NR>1{out=NR".json"; print prev > out; close(out)} {prev=$0}' export.ndjson
or at 1:
awk 'NR>1{out=(NR-1)".json"; print prev > out; close(out)} {prev=$0}' export.ndjson
or:
awk 'NR>1{print prev > out; close(out)} {prev=$0; out=NR".json"}' export.ndjson
To print every line to its own file, except the last line:
$ awk 'p{close(f); f="object"(NR-1)".ndjson"; print p > f} {p=$0}' file
Note also that empty lines won't create empty files. The logic is simply to delay printing by one record.
Changing two lines of a text file
I have a bash script which gets a text file as input and takes two parameters (line number one and line number two), then swaps those two lines with each other in the text. Here is the code:
#!/bin/bash
awk -v var="$1" -v var1="$2" 'NR==var {
  s=$0
  for(i=var+1; i < var1 ; i++) {
    getline; s1=s1?s1 "\n" $0:$0
  }
  getline; print; print s1 s
  next
}1' Ham > newHam_changed.txt
It works fine for any two lines which are not consecutive, but for lines that follow each other (for example lines 5 and 6) it works yet creates a blank line between them. How can I fix that?
I think your actual script is not what you posted in the question. I think the line with all the prints contains:
print s1 "\n" s
The problem is that when the lines are consecutive, s1 will be empty (the for loop is skipped), but it will still print a newline before s, producing a blank line. So you need to make that newline conditional:
awk -v var="4" -v var1="6" 'NR==var {
  s=$0
  for(i=var+1; i < var1 ; i++) {
    getline; s1=s1?s1 "\n" $0:$0
  }
  getline; print; print (s1 ? s1 "\n" : "") s
  next
}1' Ham > newHam_changed.txt
Using getline always makes awk scripts a bit complicated. It is better to avoid getline and just make use of the awk pattern { action } syntax; this makes for perfectly readable scripts. In any other language you would just loop and get the next line, but in awk I think it is best to make good use of this feature.
awk -v var="$1" -v var1="$2" '
  NR==var  {s=$0; collect=1; next;}
  NR==var1 {collect=0; print; printf inbetween; print s}
  collect  {inbetween=inbetween""$0"\n"; next;}
  1' Ham
Here I capture the first line in s when it is found and set the collect flag. This triggers the collect block on the next iteration, which collects all the lines in between. When the second line is found, it sets collect back to zero and prints first the current line, then the in-between lines, and then s. If the lines are consecutive, inbetween is empty and printf will then do nothing.
Too complex for my taste; here is something quite simple that achieves the same task:
#!/bin/bash
ORIGFILE='original.txt'        # original text file
PROCFILE='processed.txt'       # copy of the original file to be processed
CHGL1=`sed "$1q;d" $ORIGFILE`  # get original line $1
CHGL2=`sed "$2q;d" $ORIGFILE`  # get original line $2
cat $ORIGFILE > $PROCFILE
sed -i "$2s/^.*/$CHGL1/" $PROCFILE # replace line $2 with line $1
sed -i "$1s/^.*/$CHGL2/" $PROCFILE # replace line $1 with line $2
More code doesn't mean more useful; keep it simple. This code does not use a for loop and instead goes directly to the specific lines.
EDIT: A simple way to do this task on one line:
printf '%s\n' 14m26 26-m14- w q | ed -s file
(14m26 moves line 14 to after line 26; 26-m14- then moves the original line 26, now shifted up by one, back to position 14; w writes the file and q quits.) Found in this answer.
How to delete the first 10 lines containing a certain string?
Suppose I have a file with this structure:
1001.txt
1002.txt
1003.txt
1004.txt
1005.txt
2001.txt
2002.txt
2003.txt
...
Now how can I delete the first 10 lines which start with '2'? There might be more than 10 lines starting with '2'. I know I can use grep '^2' file | wc -l to find the number of lines which start with '2', but how do I delete the first 10 of them?
You can pipe your list through this Perl one-liner, which empties the first 10 lines that start with 2:
perl -p -e '$_="" if (/^2/ and $i++ < 10)'
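If you want to edit the file in place rather than filter stdin, Perl's -i switch can be added (a usage sketch; file stands for your input file):
perl -i -p -e '$_="" if (/^2/ and $i++ < 10)' file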
Another in awk. Testing with the value 2, as your sample data only had 3 lines of related data; replace that 2 with a 10 for the real thing.
$ awk '/^2/ && ++c<=2 {next} 1' file
1001.txt
1002.txt
1003.txt
1004.txt
1005.txt
2003.txt
...
Explained:
$ awk '/^2/ && ++c<=2 { # if it starts with a 2 and the counter still has iterations left
    next                # skip to the next record
}
1                       # (else) output
' file
awk alternative:
awk '{ if (substr($0,1,1)=="2") { count++ } if ( count > 10 || substr($0,1,1)!="2") { print $0 } }' filename
If the first character of the line is 2, increment a counter. Then only print the line if count is greater than 10 or the first character isn't 2.
Paste corresponding characters from multiple lines together
I'm writing a Linux command that pastes corresponding characters from multiple lines together. For example, I want to change these lines
A---
-B--
---C
--D-
to this:
A----B-----D--C-
So far, I've made this:
cat sanger.a sanger.c sanger.g sanger.t | cut -c 1
This does the trick for only the first column, but it has to work for all the columns. Is there anyone who can help?
EDIT: This is a better example. I want this:
SUGAR
HONEY
CANDY
to become SHC UOA GND AED RYY (without spaces)
Awk way for the updated spec:
awk -vFS= '{for(i=1;i<=NF;i++)a[i]=a[i]$i} END{for(i=1;i<=NF;i++)printf "%s",a[i];print ""}' file
Output (for the two examples respectively):
A----B-----D--C-
SHCUOAGNNAEDRYY
P.S. For a large file this will use lots of memory.
A terrible way not using awk; also, you need to know the number of fields beforehand:
for i in {1..4};do cut -c $i test | tr -d "\n" ; done;echo
Here's a solution without awk or sed, assuming the file is named f:
paste -s -d "" <(for i in $(seq 1 $(wc -L < f)); do cut -c $i f; done)
wc -L is a GNUism which returns the length of the longest line in the input file, so it might not work depending on your version/locale. You could instead find the longest line by doing something like:
awk '{if (length > x) {x = length}} END {print x}' f
Then use this value in the seq command instead of the above command substitution, as shown below.
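Putting that together, a sketch of the substitution the answer describes (untested, under the same assumption that the file is named f):
paste -s -d "" <(for i in $(seq 1 $(awk '{if (length > x) {x = length}} END {print x}' f)); do cut -c $i f; done)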
All right, time for some sed insanity! :D
Disclaimer: If this is for something serious, use something less brittle than this. awk comes to mind. Unless you feel confident enough in your sed abilities to maintain this lunacy.
cat file1 file2 etc | sed -n '1h; 1!H; $ { :loop; g; s/$/\n/; s/\([^\n]\)[^\n]*\n/\1/g; p; g; s/^.//; s/\n./\n/g; h; /[^\n]/ b loop }' | tr -d '\n'; echo
This comes in three parts. Say you have a file foo.txt
12345
67890
abcde
fghij
then
cat foo.txt | sed -n '1h; 1!H; $ { :loop; g; s/$/\n/; s/\([^\n]\)[^\n]*\n/\1/g; p; g; s/^.//; s/\n./\n/g; h; /[^\n]/ b loop }'
produces
16af
27bg
38ch
49di
50ej
After that, tr -d '\n' deletes the newlines, and ;echo adds one at the end.
The heart of this madness is the sed code, which is
1h
1!H
$ {
  :loop
  g
  s/$/\n/
  s/\([^\n]\)[^\n]*\n/\1/g
  p
  g
  s/^.//
  s/\n./\n/g
  h
  /[^\n]/ b loop
}
This first follows the basic pattern
1h   # if this is the first line, put it in the hold buffer
1!H  # if it is not the first line, append it to the hold buffer
$ {  # if this is the last line, do stuff
     # do stuff. The whole input is in the hold buffer here.
}
which assembles all input in the hold buffer before working on it. Once the whole input is in the hold buffer, this happens:
:loop
g                         # copy the hold buffer to the pattern space
s/$/\n/                   # put a newline at the end
s/\([^\n]\)[^\n]*\n/\1/g  # replace every line with only its first character
p                         # print that
g                         # get the hold buffer again
s/^.//                    # remove the first character from the first line
s/\n./\n/g                # remove the first character from all other lines
h                         # put that back in the hold buffer
/[^\n]/ b loop            # if there's something left other than newlines, loop
And there you have it. I might just have summoned Cthulhu.
How to add data beside each other in a CSV file
If I have 3 CSV files and I want to merge the data all into one, but beside each other, how would I do it? For example:
Initial merged file:
,,,,,,,,,,,,
File 1:
20,09/05,5694
20,09/06,3234
20,09/08,2342
File 2:
20,09/05,2341
20,09/06,2334
20,09/09,342
File 3:
20,09/05,1231
20,09/08,3452
20,09/10,2345
20,09/11,372
Final merged file:
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,,,09/11,372
Basically, data from each file goes into a specific column of the merged file. I know awk can be used for this, but I have no clue how to start.
EDIT: Only the 2nd and 3rd columns of each file are being printed. I was using this to print out the 2nd and 3rd columns:
awk -v f="${i}" -F, 'match ($0,f) { print $2","$3 }' file3.csv > d$i.csv
However, say for example file1 and file2 were null in that row; the data for that row would be shifted to the left. So I came up with this to account for the shift:
awk -v x="${i}" -F, 'match ($0,x) { if ($2='/NULL') { print "," }; else { print $2","$3}; }' alld.csv > d$i.csv
Using GNU awk for ARGIND:
$ gawk '
{ a[FNR,ARGIND]=$0; maxFnr=(FNR>maxFnr?FNR:maxFnr) }
END {
  for (i=1;i<=maxFnr;i++) {
    for (j=1;j<ARGC;j++)
      printf "%s%s", (j==1?"":",,,"), (a[i,j]?a[i,j]:",")
    print ""
  }
}
' file1 file2 file3
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,,,09/11,372
If you don't have GNU awk, just add an initial line that says FNR==1{ARGIND++}.
Commented version per request:
$ gawk '
{
  a[FNR,ARGIND]=$0                 # Store the current line in a 2-D array `a` indexed by
                                   # the current line number `FNR` and file number `ARGIND`.
  maxFnr=(FNR>maxFnr?FNR:maxFnr)   # save the max FNR value
}
END {
  for (i=1;i<=maxFnr;i++) {        # Loop from 1 to max number of lines
                                   # seen across all files and for each:
    for (j=1;j<ARGC;j++)           # Loop from 1 to total number of files parsed and:
      printf "%s%s",               # Print 2 strings, specifically:
        (j==1?"":",,,"),           # A field separator - empty if we are printing
                                   # the first field, three commas otherwise.
        (a[i,j]?a[i,j]:",")        # The value stored in the array if it was
                                   # present in the files, a comma otherwise.
    print ""                       # Print a newline
  }
}
' file1 file2 file3
I originally was using an array fnr[FNR] to track the max value of FNR, but IMHO that's kinda obscure, and it has a flaw: if no file had, say, a 2nd line, then a loop on for (i=1;i in fnr;i++) in the END section would bail out before getting to the 3rd line.
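Spelled out, in case it helps: this is just the answer's own script with that one suggested line added, so it should run under a POSIX awk (an untested sketch):
awk '
FNR==1 { ARGIND++ }   # emulate gawk's built-in ARGIND as an ordinary variable
{ a[FNR,ARGIND]=$0; maxFnr=(FNR>maxFnr?FNR:maxFnr) }
END {
  for (i=1;i<=maxFnr;i++) {
    for (j=1;j<ARGC;j++)
      printf "%s%s", (j==1?"":",,,"), (a[i,j]?a[i,j]:",")
    print ""
  }
}
' file1 file2 file3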
paste is made for this:
$ paste -d";" f1 f2 f3 | sed 's/;/,,,/g'
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,09/11,372
Note that paste alone will output just one comma:
$ paste -d, f1 f2 f3
09/05,5694,09/05,2341,09/05,1231
09/06,3234,09/06,2334,09/08,3452
09/08,2342,09/09,342,09/10,2345
,,09/11,372
So to have multiple ones, we can use another delimiter like ; and then replace it with ,,, using sed:
$ paste -d";" f1 f2 f3 | sed 's/;/,,,/g'
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,09/11,372
Using pr:
$ pr -mts',,,' file[1-3]
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,09/11,372
Here -m merges the files side by side, -t omits the page headers pr would normally add, and the string after -s is used as the column separator.