Extract n-th line from file in bash loop - linux

I would like to extract every n-th group of lines from a file and save it to a new file. For example, I have index.txt:
cat index.txt
1 AAAGCGT
2 ACGAAGT
3 ACCTTGT
4 ATAATGT
5 AGGGTGT
6 AGCCAGT
7 AGTTCGT
8 AATGCAG
9 AAAGCGT
10 ACGAAGT
and the output should be
cat index.1.txt:
1 AAAGCGT
2 ACGAAGT
cat index.2.txt:
3 ACCTTGT
4 ATAATGT
cat index.3.txt:
5 AGGGTGT
6 AGCCAGT
And so on. So I would like to extract the first 2 rows from the input file in a cycle and save each pair to a new file.

It doesn't give you exactly the names you want, but:
split -l 2 index.txt index.
seems like the easiest solution. It will create files whose names begin with the final argument, so you will get names like 'index.aa' and 'index.ab'.
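If your split is the GNU coreutils version, its numeric-suffix options get you very close to the exact names in the question (a sketch; these options are GNU extensions, so this won't work with BSD/macOS split):

```shell
# Sample input mirroring the question
cd "$(mktemp -d)"
printf '%s\n' '1 AAAGCGT' '2 ACGAAGT' '3 ACCTTGT' '4 ATAATGT' > index.txt

# GNU split: 2 lines per chunk, numeric suffixes starting at 1,
# one suffix character, plus a trailing ".txt"
split -l 2 --numeric-suffixes=1 -a 1 --additional-suffix=.txt index.txt index.

ls index.?.txt
```

This produces index.1.txt, index.2.txt, and so on, with no renaming step needed afterwards.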

This will work for any number of grouped lines just by changing the 2 to a 3 or whatever number you like:
$ awk 'NR%2==1{++i} {print > ("index." i ".txt")}' index.txt
$ ls index.?.txt
index.1.txt index.2.txt index.3.txt index.4.txt index.5.txt
$ tail index.?.txt
==> index.1.txt <==
1 AAAGCGT
2 ACGAAGT
==> index.2.txt <==
3 ACCTTGT
4 ATAATGT
==> index.3.txt <==
5 AGGGTGT
6 AGCCAGT
==> index.4.txt <==
7 AGTTCGT
8 AATGCAG
==> index.5.txt <==
9 AAAGCGT
10 ACGAAGT
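The chunk size can also be passed in with -v instead of editing the script, which makes the one-liner reusable (a sketch; n is any positive chunk length):

```shell
# Sample input: six numbered lines
cd "$(mktemp -d)"
seq 6 > index.txt

# write every n consecutive lines into its own index.<i>.txt
n=3
awk -v n="$n" '(NR-1)%n==0{++i} {print > ("index." i ".txt")}' index.txt
```

With n=3 this yields index.1.txt holding lines 1-3 and index.2.txt holding lines 4-6.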

awk '{print >"index."(x+=NR%2)".txt"}' file
NR%2 is 1 on odd-numbered lines, so x is incremented at the start of every pair of lines; each line is then printed into the file with that name.
cat index.1.txt:
1 AAAGCGT
2 ACGAAGT
cat index.2.txt:
3 ACCTTGT
4 ATAATGT
cat index.3.txt:
5 AGGGTGT
6 AGCCAGT
In some awks, extra parentheses are required, as commented by Ed Morton:
awk '{print >("index."(x+=NR%2)".txt")}' file

I would say:
awk '{file=int((NR+1)/2)".txt"; print > file}' file
int((NR+1)/2) maps every line number:
1 --> 1
2 --> 1
3 --> 2
x --> int((x+1)/2)
So you get these files:
$ cat 1.txt
1 AAAGCGT
2 ACGAAGT
or
$ cat 3.txt
5 AGGGTGT
6 AGCCAGT

Negative arguments to head

I was trying the head command on macOS using zsh; code below.
a.txt:
1
2
3
4
5
6
7
8
9
10
tail -n +5 a.txt  # line 5 to the end
tail -n -5 a.txt  # last 5 lines
head -n +5 a.txt  # line 1 to line 5
head -n -5 a.txt  # What did this do?
The last command shows an error.
head: illegal line count -- -5
What did head -n -5 actually do?
Some implementations of head, like GNU head, support negative arguments for -n.
But that's not standard! Your macOS head clearly doesn't support it.
Where it is supported, the negative argument means: drop the last 5 lines, then print the rest.
It becomes clearer if we use 3 instead of 5. Note the signs!
# print 10 lines:
seq 10
1
2
3
4
5
6
7
8
9
10
#-------------------------
# get the last 3 lines:
seq 10 | tail -n 3
8
9
10
#--------------------------------------
# start at line 3 (skip first 2 lines)
seq 10 | tail -n +3
3
4
5
6
7
8
9
10
#-------------------------
# get the first 3 lines:
seq 10 | head -n 3
1
2
3
#-------------------------
# skip the last 3 lines:
seq 10 | head -n -3
1
2
3
4
5
6
7
btw, man tail and man head explain this behavior.
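On a system whose head rejects negative counts (like the macOS one in the question), the same "all but the last n lines" effect can be had portably. This sketch keeps a ring buffer of n lines in awk and only prints a line once n newer lines have arrived:

```shell
# portable equivalent of GNU `head -n -3`: print all but the last 3 lines
seq 10 | awk -v n=3 'NR>n{print buf[(NR-n)%n]} {buf[NR%n]=$0}'
# prints 1 through 7
```

The print must come before the store in each record: at line NR we emit the line saved n records earlier, then overwrite that slot with the current line.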

How to sort a group of data in a columnwise manner?

I have a group of data like the attached raw data. When I sort the raw data with sort -n, the data are sorted row by row, and the output looks like this:
3 6 9 22
2 3 4 5
1 7 16 20
I want to sort the data in a columnwise manner, the output would look like this:
1 2 4 3
3 6 9 16
5 7 20 22
Ok, I did try something.
My primary idea is to extract the data column by column, sort each column, and then paste them back together, but I can't get it to work. Here is my script:
for ((i=1; i<=4; i=i+1))
do
awk '{print $i}' file | sort -n >>output
done
The output:
1 7 20 16
3 6 9 22
5 2 4 3
1 7 20 16
3 6 9 22
5 2 4 3
1 7 20 16
3 6 9 22
5 2 4 3
1 7 20 16
3 6 9 22
5 2 4 3
It seems that $i doesn't change and is equal to $0.
Thanks a lot.
raw data1
3 6 9 22
5 2 4 3
1 7 20 16
raw data2
488.000000 1236.000000 984.000000 2388.000000 788.000000 704.000000
600.000000 1348.000000 872.000000 2500.000000 900.000000 816.000000
232.000000 516.000000 1704.000000 1668.000000 68.000000 16.000000
244.000000 504.000000 1716.000000 1656.000000 56.000000 28.000000
2340.000000 3088.000000 868.000000 4240.000000 2640.000000 2556.000000
2588.000000 3336.000000 1116.000000 4488.000000 2888.000000 2804.000000
Let me introduce a flexible solution using cut and sort that you can use on any M×N tab-delimited input matrix.
$ cat -vTE data_to_sort.in
3^I6^I9^I22$
5^I2^I4^I3$
1^I7^I20^I16$
$ col=4; line=3;
$ for i in $(seq ${col}); do cut -f$i data_to_sort.in |\
> sort -n; done | paste $(for i in $(seq ${line}); do echo -n "- "; done) |\
> datamash transpose
1 2 4 3
3 6 9 16
5 7 20 22
If the input file is not tab-delimited, you need to pass the proper delimiter with -d"$DELIM_CHAR" for cut to work properly.
for i in $(seq ${col}); do cut -f$i data_to_sort.in | sort -n; done will separate each column of the file and sort it
paste $(for i in $(seq ${line}); do echo -n "- "; done) then recreates a matrix structure from that single column
datamash transpose is needed to transpose the intermediate matrix
Thanks to feedback from Sundeep, here is a better solution using pr instead of paste to generate the columns:
$ col=4; line=3
$ for i in $(seq ${col}); do cut -f$i data_to_sort.in |\
> sort -n; done | pr -${line}ats | datamash transpose
Last but not least,
$ col=4; for i in $(seq ${col}); do cut -f$i data_to_sort.in |\
> sort -n; done | pr -${col}ts
1 2 4 3
3 6 9 16
5 7 20 22
That last command avoids datamash entirely!
(many thanks to Sundeep)
Proof that it works, for the skeptics and the downvoters:
2nd run, with 6 columns:
$ col=6; for i in $(seq ${col}); do cut -f$i <(sed 's/^ \+//g;s/ \+/\t/g' data2) | sort -n; done | pr -${col}ts | tr '\t' ' '
232.000000 504.000000 868.000000 1656.000000 56.000000 16.000000
244.000000 516.000000 872.000000 1668.000000 68.000000 28.000000
488.000000 1236.000000 984.000000 2388.000000 788.000000 704.000000
600.000000 1348.000000 1116.000000 2500.000000 900.000000 816.000000
2340.000000 3088.000000 1704.000000 4240.000000 2640.000000 2556.000000
2588.000000 3336.000000 1716.000000 4488.000000 2888.000000 2804.000000
awk to the rescue!! (note: asort is a GNU awk extension)
awk '{f1[NR]=$1; f2[NR]=$2; f3[NR]=$3; f4[NR]=$4}
END{asort(f1); asort(f2); asort(f3); asort(f4);
for(i=1;i<=NR;i++) print f1[i],f2[i],f3[i],f4[i]}' file
1 2 4 3
3 6 9 16
5 7 20 22
there may be a smarter way of doing this as well...
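When the column count is small and known, bash process substitution gives a datamash-free variant of the same cut-and-sort idea (a sketch; assumes tab-delimited input and a bash shell):

```shell
# Tab-delimited sample matrix from the question
cd "$(mktemp -d)"
printf '3\t6\t9\t22\n5\t2\t4\t3\n1\t7\t20\t16\n' > data_to_sort.in

# sort each column independently, then reassemble the rows with paste
paste <(cut -f1 data_to_sort.in | sort -n) \
      <(cut -f2 data_to_sort.in | sort -n) \
      <(cut -f3 data_to_sort.in | sort -n) \
      <(cut -f4 data_to_sort.in | sort -n)
```

Each <(...) behaves like a temporary file holding one sorted column, so paste stitches them straight back into rows with no transpose step.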

Adding new line to file with sed

I want to add a new line to the top of a data file with sed, and write something to that line.
I tried this as suggested in How to add a blank line before the first line in a text file with awk :
sed '1i\
\' ./filename.txt
but it printed a backslash at the beginning of the first line of the file instead of creating a new line. The terminal also throws an error if I try to put it all on the same line ("1i\": extra characters after \ at the end of i command).
Input :
1 2 3 4
1 2 3 4
1 2 3 4
Expected output
14
1 2 3 4
1 2 3 4
1 2 3 4
$ sed '1i\14' file
14
1 2 3 4
1 2 3 4
1 2 3 4
but just use awk for clarity, simplicity, extensibility, robustness, portability, and every other desirable attribute of software:
$ awk 'NR==1{print "14"} {print}' file
14
1 2 3 4
1 2 3 4
1 2 3 4
Basically you are concatenating two files: a file containing one line, and the original file. As its name suggests, this is a task for cat:
cat - file <<< 'new line'
# or
echo 'new line' | cat - file
where - stands for stdin.
You can also use cat together with process substitution if your shell supports it:
cat <(echo 'new line') file
Btw, with sed it should be simply:
sed '1i\new line' file
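A command group does the same concatenation without sed or process substitution, and works in any POSIX shell (a sketch; writing to a temp file first since you can't redirect onto the file being read):

```shell
# Sample input mirroring the question
cd "$(mktemp -d)"
printf '1 2 3 4\n1 2 3 4\n1 2 3 4\n' > file

# prepend the header line: both commands in the group share one redirect
{ echo '14'; cat file; } > file.new && mv file.new file

head -n 1 file
# prints 14
```

The group's combined stdout goes to file.new, so echo's line lands first, followed by the original contents.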

How to use paste command for different lengths of columns

I have:
file1.txt file2.txt file3.txt
8 2 2
1 2 1
8 1 0
3 3
5 3
3
4
I want to paste all these three columns in ofile.txt
I tried with
paste file1.txt file2.txt file3.txt > ofile.txt
Result I got in ofile.txt:
ofile.txt:
8 2 2
1 2 1
8 1 0
3 3
5 3
3
4
What it should look like:
ofile.txt
8 2 2
1 2 1
8 1 0
3 3
5 3
3
4
You can try this paste command in bash using process substitution:
paste <(sed 's/^[[:blank:]]*//' file1.txt) file2.txt file3.txt
8 2 2
1 2 1
8 1 0
3 3
5 3
3
4
The sed command removes leading whitespace from file1.txt.
I can reproduce your output when I create input files with tabs.
paste also uses tabs between the columns and lines things up as best it can.
You can see the results when I replace the tabs with -:
# more x* | tr '\t' '-'
::::::::::::::
x1
::::::::::::::
-1a
-1b
-1c
-1d
::::::::::::::
x2
::::::::::::::
-2a
-2b
::::::::::::::
x3
::::::::::::::
-3a
-3b
-3c
-3d
-3e
-3f
-3g
# paste x? | tr '\t' '-'
-1a--2a--3a
-1b--2b--3b
-1c---3c
-1d---3d
---3e
---3f
---3g
Think about how you want it. If you want correct alignment, you need to append a tab to the lines of the files with fewer lines. Or manipulate the result: turn 3 tabs into 4, and 4 tabs at the beginning of a line into 5:
sed -e 's/\t\t\t/\t\t\t\t/' -e 's/^\t\t\t\t/\t\t\t\t\t/'
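One way to keep every value in its own column is to pad the shorter files with blank lines up to the longest file before pasting (a sketch; the file names and sample values are taken from the question):

```shell
# Inputs of unequal length, as in the question
cd "$(mktemp -d)"
printf '8\n1\n8\n3\n5\n3\n4\n' > file1.txt
printf '2\n2\n1\n3\n3\n' > file2.txt
printf '2\n1\n0\n' > file3.txt

# find the longest line count among the inputs
max=0
for f in file1.txt file2.txt file3.txt; do
  n=$(wc -l < "$f")
  if [ "$n" -gt "$max" ]; then max=$n; fi
done

# pad each file with empty lines up to max, then paste the padded copies
for f in file1.txt file2.txt file3.txt; do
  awk -v n="$max" '{print} END{for(i=NR;i<n;i++) print ""}' "$f" > "$f.pad"
done
paste file1.txt.pad file2.txt.pad file3.txt.pad > ofile.txt
```

Rows past the end of a short file come out as empty fields, so every surviving value stays under its own tab-separated column.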

Missing numbers from two sequences

How do I find the missing numbers from two sequences using a bash script?
For example, I have a file which contains the following data:
1 1
1 2
1 3
1 5
2 1
2 3
2 5
Output: the missing numbers are
1 4
2 2
2 4
This awk one-liner gives the requested output for the specified input:
$ awk '$2!=l2+1&&$1==l1{for(i=l2+1;i<$2;i++)print l1,i}{l1=$1;l2=$2}' file
1 4
2 2
2 4
a solution using grep:
printf "%s\n" {1..2}" "{1..5} | grep -vf file
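The brace ranges above are hard-coded to the data; adding -F and -x also makes grep treat each line of the file as a literal whole-line match, which is safer once the numbers grow and could appear as substrings of each other (a sketch; assumes the {1..2} and {1..5} ranges cover your data, and a bash shell for brace expansion):

```shell
# Sample input from the question
cd "$(mktemp -d)"
printf '1 1\n1 2\n1 3\n1 5\n2 1\n2 3\n2 5\n' > file

# enumerate every expected "prefix value" pair, drop the ones present;
# -F = fixed strings, -x = whole-line match, -v = invert, -f = patterns from file
printf '%s\n' {1..2}" "{1..5} | grep -vFx -f file
```

The printf expands to the full cross product (1 1, 1 2, ..., 2 5), one pair per line, and grep filters out every pair that already occurs in the file.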
