How to delete the first 10 lines containing a certain string? - linux

Suppose I have a file with this structure:
1001.txt
1002.txt
1003.txt
1004.txt
1005.txt
2001.txt
2002.txt
2003.txt
...
Now how can I delete the first 10 lines that start with '2'? There might be more than 10 lines starting with '2'.
I know I can use grep '^2' file | wc -l to find the number of lines that start with '2'. But how do I delete the first 10 of them?

You can pipe your list through this Perl one-liner:
perl -p -e '$_="" if (/^2/ and $i++ < 10)'
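To edit the file in place instead of piping, the same one-liner takes Perl's -i switch (here with a .bak backup, a standard Perl feature):
perl -i.bak -p -e '$_="" if (/^2/ and $i++ < 10)' file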

Another one in awk. Testing with the value 2, since your sample data only has three matching lines; replace that 2 with 10 for the real task.
$ awk '/^2/ && ++c<=2 {next} 1' file
1001.txt
1002.txt
1003.txt
1004.txt
1005.txt
2003.txt
...
Explained:
$ awk '/^2/ && ++c<=2 {   # if the line starts with 2 and the counter still has iterations left
    next                  # skip to the next record
} 1                       # (else) print the record
' file

awk alternative:
awk '{ if (substr($0,1,1)=="2") { count++ } if ( count > 10 || substr($0,1,1)!="2") { print $0 } }' filename
If the first character of the line is 2, increment a counter. Then only print the line if count is greater than 10 or the first character isn't 2.
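Building on the grep idea from the question, the line numbers of the first 10 matches can also be collected with grep -n and handed to sed as delete commands (a roundabout sketch; GNU sed assumed for -i):
lines=$(grep -n '^2' file | head -10 | cut -d: -f1 | sed 's/$/d/' | paste -sd ';')
sed -i "$lines" file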

Related

How to cut a file into chunks

How to get information from specimen1 to specimen3 and paste it into another file 'DNA_combined.txt'?
I tried the cut command and the awk command, but I found it tricky to cut by paragraph (or sequence).
My attempt was something like cut -d '>' -f 1-3 dna1.fasta > DNA_combined.txt
In vi/vim you can display the line number of each row by pressing Esc and typing :set nu
Once the line numbers are shown:
Note down the line number of the line containing >Specimen1 (say X) and the last line of the Specimen3 record (say Y)
Then use sed to print the text between those two lines:
sed -n 'X,Yp' dna1.fasta > DNA_combined.txt
Please let me know if you have any questions.
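If you prefer not to open the file in an editor, grep -n '>Specimen' dna1.fasta prints each matching header with its line number, giving you X and Y directly. For example, if Specimen1 starts on line 1 and the Specimen3 record ends on line 12 (hypothetical numbers for illustration):
sed -n '1,12p' dna1.fasta > DNA_combined.txt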
If you want the first three sequences irrespective of the content after >, you can use this:
$ cat ip.txt
>one
ACGTA
TCGAAA
>two
TGACA
>three
ACTG
AAAAC
>four
ATGC
>five
GTA
$ awk '/^>/ && ++count==4{exit} 1' ip.txt
>one
ACGTA
TCGAAA
>two
TGACA
>three
ACTG
AAAAC
/^>/ matches the start of a sequence header
for each such header, increment the count variable
when count reaches 4, the exit command terminates the script, leaving only the first three sequences
1 is the idiomatic awk way to print the input record
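If the number of sequences to keep varies, the cut-off can be passed in as an awk variable instead of hard-coding the 4 (a small generalization of the same one-liner; -v n=3 keeps the first three):
$ awk -v n=3 '/^>/ && ++count==n+1{exit} 1' ip.txt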
Would you please try the following:
awk '
BEGIN {print ">Specimen1-3"} # print header
/^>Specimen/ {f = match($0, "^>Specimen[1-3]") ? 1 : 0; next}
# set the flag depending on the number
f # print if f == 1
' dna1.fasta > DNA_combined.txt

Separate columns of a text file

Hi experts, I have a big text file that contains many columns. Now I want to extract each column into a separate text file serially, adding two strings at the top.
Suppose I have an input file like this:
2 3 4 5 6
3 4 5 6 7
2 3 4 5 6
1 2 2 2 2
Then I need to extract each column into a separate text file with two strings at the top:
file1.txt    file2.txt    ....    filen.txt
s=5          s=5
r=9          r=9
2            3
3            4
2            3
1            2
I tried the script below, but it doesn't work properly. Need help from the experts. Thanks in advance.
#!/bin/sh
for i in $(seq 1 1 5)
do
  echo $i
  awk '{print $i}' inp_file > file_$i
done
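For reference, the script fails because $i inside the single-quoted awk program is an awk field expression, not the shell variable: i is unset in awk, so $i evaluates to $0 and every output file receives whole lines. A minimal repair, still one pass per column and with the header strings added (a sketch; the single-pass answers that follow avoid the per-column loop entirely):
#!/bin/sh
for i in $(seq 1 1 5)
do
  awk -v col="$i" 'BEGIN{print "s=5"; print "r=9"} {print $col}' inp_file > "file_$i"
done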
Could you please try the following, written and tested in GNU awk with the shown samples. This version doesn't close the output files, because your sample shows only 5 columns in Input_file. It also creates 2 awk variables (var1 and var2) which are printed before the actual column values are written to each output file.
awk -v var1="s=5" -v var2="r=9" '
{
  count++
  for (i=1; i<=NF; i++) {
    outputFile = "file" i ".txt"
    if (count==1) {
      print (var1 ORS var2) > (outputFile)
    }
    print $i > (outputFile)
  }
}
' Input_file
In case you can have many more columns, it is better to close each output file as you go using close(), to avoid a "too many open files" error:
awk -v var1="s=5" -v var2="r=9" '
{
  count++
  for (i=1; i<=NF; i++) {
    outputFile = "file" i ".txt"
    if (count==1) {
      print (var1 ORS var2) > (outputFile)
    }
    print $i >> (outputFile)
    close(outputFile)
  }
}
' Input_file
Pretty simple to do in one pass through the file with awk using its output redirection:
awk 'NR==1 { for (n = 1; n <= NF; n++) print "s=5\nr=9" > ("file_" n) }
     { for (n = 1; n <= NF; n++) print $n > ("file_" n) }' inp_file
With GNU awk to internally handle more than a dozen or so simultaneously open files:
NR == 1 {
  for (i=1; i<=NF; i++) {
    out[i] = "file" i ".txt"
    print "s=5" ORS "r=9" > out[i]
  }
}
{
  for (i=1; i<=NF; i++) {
    print $i > out[i]
  }
}
or with any awk just close them as you go:
NR == 1 {
  for (i=1; i<=NF; i++) {
    out[i] = "file" i ".txt"
    print "s=5" ORS "r=9" > out[i]
    close(out[i])
  }
}
{
  for (i=1; i<=NF; i++) {
    print $i >> out[i]
    close(out[i])
  }
}
split -nr/$(wc -w <(head -1 input) | cut -d' ' -f1) -t' ' --additional-suffix=".txt" -a4 --numeric-suffixes=1 --filter "cat <(echo -e 's=5 r=9') - | tr ' ' '\n' >\$FILE" <(tr -s '\n' ' ' <input) file
This uses the nifty split command in a unique way to rearrange the columns. Hopefully it's faster than awk, although after spending a considerable amount of time coding it, testing it, and writing it up, I find that it may not be scalable enough for you since it requires a process per column, and many systems are limited in user processes (check ulimit -u). I submit it though because it may have some limited learning usefulness, to you or to a reader down the line.
Decoding:
split -- Divide a file up into subfiles. Normally this is by lines or by size but we're tweaking it to use columns.
-nr/$(...) -- Use round-robin output: Sort records (in our case, matrix cells) into the appropriate number of bins in a round-robin fashion. This is the key to making this work. The part in parens means, count (wc) the number of words (-w) in the first line (<(head -1 input)) of the input and discard the filename (cut -d' ' -f1), and insert the output into the command line.
-t' ' -- Use a single space as a record delimiter. This breaks the matrix cells into records for split to split on.
--additional-suffix=".txt" -- Append .txt to output files.
-a4 -- Use four-digit numbers; you probably won't get 1,000 files out of it but just in case ...
--numeric-suffixes=1 -- Use numeric suffixes (normally they are letter combinations) and start numbering at 1. This is pretty pedantic but it matches the example. (The -a4 above already allows for up to 9,999 files.)
--filter ... -- Pipe each file through a shell command.
Shell command:
cat -- Concatenate the next two arguments.
<(echo -e 's=5 r=9') -- This means execute the echo command and use its output as the input to cat. We use a space instead of a newline to separate because we're converting spaces to newlines eventually and it is shorter and clearer to read.
- -- Read standard input as an argument to cat -- this is the binned data.
| tr ' ' '\n' -- Convert spaces between records to newlines, per the desired output example.
>\$FILE -- Write to the output file, which split stores in $FILE (escaped so the outer shell doesn't expand it in the initial command).
Shell command over -- rest of split arguments:
<(tr -s '\n' ' ' < input) -- Use, as input to split, the example input file but convert newlines to spaces because we don't need them and we need a consistent record separator. The -s means only output one space between each record (just in case we got multiple ones on input).
file -- This is the prefix to the output filenames. The output in my example would be file0001.txt, file0002.txt, ..., file0005.txt.

Using bash to copy contents of row in one file to specific character location in another file

I'm new to bash and need help copying row 2 onwards from one file into a specific position (150 characters in) in another file. Looking through the forum, I found a way to insert specific literal text at this position:
sed -i -E 's/^(.{150})/\1specifictextlisted/' destinationfile.txt
However, I can't seem to find a way to copy content from one file into this.
Basically, I'm working with these 2 starting files and need the following output:
File 1 contents:
Sequence
AAAAAAAAAGGGGGGGGGGGCCCCCCCCCTTTTTTTTT
File 2 contents:
chr2
tccccagcccagccccggccccatccccagcccagcctatccccagcccagcctatccccagcccagccccggccccagccccagccccggccccagccccagccccggccccagccccggccccatccccggccccggccccatccccggccccggccccggccccggccccggccccatccccagcccagccccagccccatccccagcccagccccggcccagccccagcccagccccagccacagcccagccccggccccagccccggcccaggcccagcccca
Desired output contents:
chr2
tccccagcccagccccggccccatccccagcccagcctatccccagcccagcctatccccagcccagccccggccccagccccagccccggccccagccccagccccggccccagccccggccccatccccggccccggccccatccccgAAAAAAAAAGGGGGGGGGGGCCCCCCCCCTTTTTTTTTgccccggccccggccccggccccggccccatccccagcccagccccagccccatccccagcccagccccggcccagccccagcccagccccagccacagcccagccccggccccagccccggcccaggcccagcccca
Can anybody put me on the right track to achieving this?
If the file is really huge instead of just 327 characters you might want to use dd:
dd if=chr2 bs=1 count=150 status=none of=destinationfile.txt
tr -d '\n' < Sequence >> destinationfile.txt
dd if=chr2 bs=1 skip=150 seek=189 status=none of=destinationfile.txt
189 is 150 + the length of Sequence; count the bytes of your actual sequence (the sketch below computes it instead of hard-coding it).
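Rather than hard-coding the seek offset, it can be computed from the actual sequence length (a sketch along the same lines, assuming as above that the file named Sequence holds only the raw bases):
seqlen=$(tr -d '\n' < Sequence | wc -c)
dd if=chr2 bs=1 count=150 status=none of=destinationfile.txt
tr -d '\n' < Sequence >> destinationfile.txt
dd if=chr2 bs=1 skip=150 seek=$((150 + seqlen)) status=none of=destinationfile.txt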
You can use awk for that:
awk 'NR==FNR{a=$2;next}{print $1, substr($2, 1, 150) a substr($2, 151)}' file1 file2
Explanation:
# Total row number == row number in file;
# this is only true when processing file1
NR==FNR {
  a = $2   # store column 2 in variable 'a'
  next     # do not process the block below
}
# Because of the 'next' statement above, this
# block is only executed for file2
{
  # insert 'a' after the first 150 characters of the second column and print
  print $1, substr($2, 1, 150) a substr($2, 151)
}
I assume that both files contain only a single line, like in your example.
Edit: In comments you said that the files actually span two lines; in that case you can use the following awk script:
# usage: awk -f this_file.awk file1 file2
# True for the second line in each file
FNR==2 {
  # Total line number equals line number in file;
  # this is only true while we are processing file1
  if (NR==FNR) {
    insert = $0   # store the string to be inserted in a variable
  } else {
    # Insert the string into file2's line.
    # Assigning to $0 modifies the current line.
    $0 = substr($0, 1, 150) insert substr($0, 151)
  }
}
# Print lines of file2 (line 2 has been modified above)
NR!=FNR
You can use bash and read one char at the time from the file:
i=0
while IFS= read -n 1 -r; do
  echo -n "$REPLY"
  let i++
  if [ $i -eq 150 ]; then
    echo -n "AAAAAAAAAGGGGGGGGGGGCCCCCCCCCTTTTTTTTT"
  fi
done < chr2 > destinationfile.txt
This simply reads one character at a time, echoes it, and increments a counter. When the counter reaches 150, it echoes your sequence. You can replace the echo of the sequence with a file read through tr -d '\n' (sketched below). Just make sure to remove any newlines, as tr does here; that is also why echo -n is used, so nothing extra is added.
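For example, to pull the inserted text from a file instead of hard-coding it (assuming, as in the dd answer above, a file named Sequence that holds only the raw bases), the echo of the sequence can be replaced with:
tr -d '\n' < Sequence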

Select lines having count greater than x and less than y using linux command

I have the following file:
1,A
2,B
3,C
10000,D
20,E
4000,F
I want to select the lines having a count greater than 10 and less than 5000; the output should be E and F. In C++ or any other language it is a piece of cake. I really want to know how I can do it with a Linux command.
I tried the following command
awk -F ',' '{$1 >= 10 && $1 < 5000} { count++ } END { print $1,$2}' test.txt
But it only gives 4000,F.
Just do:
awk -F',' '$1 > 10 && $1 < 5000' test.txt
You put a boolean check inside {....} and never use the result, which doesn't make sense. You should write either {if (...) ...} or booleanExpression {action}.
The count++ is useless.
You have only a print statement in the END block, so only the last line gets printed.
What your script actually does is print the last line of test.txt, no matter what it is.
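Spelled out, the two correct forms are equivalent:
awk -F',' '{ if ($1 > 10 && $1 < 5000) print }' test.txt    # explicit if inside an action block
awk -F',' '$1 > 10 && $1 < 5000 { print }' test.txt         # condition as a pattern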

Insert new line if different number appears in a column

I have a column
1
1
1
2
2
2
I would like to insert a blank line when the value in the column changes:
1
1
1
<- blank line
2
2
2
I would recommend using awk:
awk -v i=1 'NR>1 && $i!=p { print "" }{ p=$i } 1' file
On any line after the first, if the value of the i-th column differs from the previous value p, print a blank line. Always update p. The 1 at the end evaluates to true, which makes awk print the line. i can be set to the column number of your choice.
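With the sample column this produces:
$ awk -v i=1 'NR>1 && $i!=p { print "" }{ p=$i } 1' file
1
1
1

2
2
2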
while read L; do [[ "$L" != "$PL" && "$PL" != "" ]] && echo; echo "$L"; PL="$L"; done < file
awk(1) seems like the obvious answer to this problem:
#!/usr/bin/awk -f
BEGIN { prev = "" }
/./ {
  if (prev != "" && prev != $1) print ""
  print
  prev = $1
}
You can also do this with SED:
sed '{N;s/^\(.*\)\n\1$/\1\n\1/;tx;P;s/^.*\n/\n/;P;D;:x;P;D}'
The long version with explanations is:
sed '{
N # read second line; (terminate if there are no more lines)
s/^\(.*\)\n\1$/\1\n\1/ # try to replace two identical lines with themselves
tx # if replacement succeeded then goto label x
P # print the first line
s/^.*\n/\n/ # replace first line by empty line
P # print this empty line
D # delete empty line and proceed with input
:x # label x
P # print first line
D # delete first line and proceed with input
}'
One thing I like about using (GNU) sed (though it is not clear from your question whether that is available to you) is that you can easily apply the changes in place with the -i switch, e.g.
sed -i '{N;s/^\(.*\)\n\1$/\1\n\1/;tx;P;s/^.*\n/\n/;P;D;:x;P;D}' FILE
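Applied to the sample column, the one-liner produces the desired output:
$ sed '{N;s/^\(.*\)\n\1$/\1\n\1/;tx;P;s/^.*\n/\n/;P;D;:x;P;D}' file
1
1
1

2
2
2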
You could use the getline function in awk to match the current line against the following one (checking getline's return value so the last line isn't printed twice when the file has an odd number of lines):
awk '{f=$1; print; if ((getline) <= 0) next} f!=$1{print ""} 1' file
