Copy all lines hit by search in Notepad++

I have a huge file with my DB dump, which looks like the small snippet shown below.
903 09-JAN-14 4 2 "false" "false" "false" 7505 7459 2139 66.51 0.18 69.72 1
903 09-JAN-14 5 3 "false" "false" "false" 7468 7415 2173 66.24 0.37 70.19 4
860 17-FEB-13 1 1 "false" "false" "false" 7014 6973 2371 67.21 0.97 68.31 16
860 17-FEB-13 2 2 "false" "false" "false" 6992 6954 2401 66.95 0.62 68.78 8
891 10-DEC-13 1 1 "false" "false" "false" 1010 1001 10965 17.75 11.3 71.49 505
903 17-DEC-13 5 3 "false" "false" "false" 7468 7415 2173 66.24 0.37 70.19 4
903 10-JAN-14 7 4 "false" "false" "false" 7421 7380 2225 65.83 0.01 71.14 0
860 11-JAN-14 1 1 "false" "false" "false" 7014 6973 2371 67.21 0.97 68.31 16
I want to copy all the lines with "903" in them. Is there a way to do this in Notepad++?

You could do:
Search > Mark...
Find what : ^903\b
check Mark the lines and Regular Expression
Click on Find All
All the lines that begin with 903 are now marked.
After that:
Search > Bookmark > Copy marked lines
go to the destination file, then press Ctrl+V
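If the dump is also available as a plain file, the same filtering can be done on the command line instead; a minimal sketch, assuming the dump is saved as dump.txt:
grep '^903\b' dump.txt > 903_lines.txt    # GNU grep understands \b; '^903[[:space:]]' is a portable alternative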

Related

Create new file from two files with a common (unsorted) column

This is probably a very basic problem but I am stumped.
I am attempting to create a new file from two large tab-delimited files with a common column. The heads of the two files are:
file1
k141_1 319 4 0
k141_2 400 9 0
k141_3 995 43 0
k141_4 670 21 0
k141_5 372 8 0
k141_6 359 9 0
k141_7 483 18 0
k141_8 1826 76 0
k141_9 566 15 0
k141_10 462 14 0
file2
U k141_1 0
U k141_11 0
U k141_24 0
U k141_30 0
C k141_32 2 18 77133,212695,487010, 5444279,5444689,68971626, TIEYSSLHACRSTLEDPT, cellular organisms; Bacteria;
C k141_38 1566886 16 1566886, 50380646, ELVMDREAWCAAIHGV, cellular organisms; Bacteria; Terrabacteria group; Actinobacteria; Actinobacteria; Corynebacteriales; Mycobacteriaceae; Mycobacterium; Mycobacterium sp. WCM 7299;
U k141_46 0
C k141_57 186802 23 1496,1776046,1776047, 64601048,64601468,64601628,64603689,64604310,64605360,71436886,71436980,71437249,71437272,71437295, CLLYTSDAADDLLCVDLGGRRII, cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Clostridia; Clostridiales;
U k141_64 0
C k141_73 131567 14 287,305,1496,2209,1483596, 47871795,47873311,47873322,47880313,47880625,53485494,53485498,62558724,71434583,71434608, LSRGLGDVYKRQIL,SCLVGSEMCIRDRY,YLSLIHISEPTRQE, cellular organisms;
I want the new file to contain all 4 columns from file 1 and the 8th column of file 2 (taxonomic information separated by semicolons).
I have attempted to sort the files based on the common column but the outputs are not the same despite the columns having the exact same values.
For example,
[user@compute02 Contigs]$ sort -k 1 file1 | head
k141_1000 312 253 0
k141_1001 553 13 0
k141_1002 518 19 0
k141_1003 812 30 0
k141_1004 327 13 0
k141_1005 454 18 0
k141_100 595 20 0
k141_1006 1585 78 0
k141_1007 537 23 0
[user@compute02 Contigs]$ sort -k 2 file2 | head
U k141_1 0
C k141_1000 305 26 305, 62554095,62558735, PVSYTHLRAHETRGNLVCRLLLEKKK, cellular organisms; Bacteria; Proteobacteria; Betaproteobacteria; Burkholderiales; Burkholderiaceae; Ralstonia; Ralstonia solanacearum;
C k141_1001 946362 11 946362, 5059526, SGRNGLPLKVR, cellular organisms; Eukaryota; Opisthokonta; Choanoflagellida; Craspedida; Salpingoecidae; Salpingoeca; Salpingoeca rosetta;
C k141_1002 131567 15 287,305,2209,1483596, 47870166,47873029,47873592,53485045,55518854,62558495, RTCLLYTSPSPRDKR,NLSLIHISEPTRQEA,EPVSYTHLRAHETRG, cellular organisms;
C k141_100 2 14 287,1496,1776047, 53544868,64603691,71437007, SRSSAASDVYKRQV, cellular organisms; Bacteria;
U k141_1003 0
C k141_1004 2 14 518,1776046,1776047, 28571314,64603094,64605737, LFFFNDTATTEIYT, cellular organisms; Bacteria;
U k141_1005 0
C k141_1006 948 13 948, 73024016, QAPLSMGFSRQEY, cellular organisms; Bacteria; Proteobacteria; Alphaproteobacteria; Rickettsiales; Anaplasmataceae; Anaplasma; phagocytophilum group; Anaplasma phagocytophilum;
C k141_1007 287 14 287, 50594737, RRQRQMCIRDRVGS, cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; Pseudomonadales; Pseudomonadaceae; Pseudomonas; Pseudomonas aeruginosa group; Pseudomonas aeruginosa;
Any assistance would be greatly appreciated :)
This solution should work.
# loop over every contig id in column 1 of file1
for i in `cat file1.txt | awk -F" " '{print $1}'`
do
    F1=`grep -w $i file1.txt`    # the full matching row from file1
    # fields 8 onwards of the matching file2 row (the taxonomy text)
    F2=`grep -w $i file2.txt | awk -F" " '{$1=$2=$3=$4=$5=$6=$7=""; print $0}'`
    echo $F1 $F2
done
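Note that this greps file2 once for every line of file1, which gets slow on large inputs. A single-pass awk alternative is sketched below; it assumes both files are whitespace-delimited, that the contig id sits in column 1 of file1 and column 2 of file2, and that the taxonomy text starts at field 8 of file2 (the output file name is just an example):
awk 'NR==FNR {                        # first pass: file2
         tax = ""
         for (i = 8; i <= NF; i++)    # collect fields 8..NF (the taxonomy text)
             tax = tax (tax == "" ? "" : " ") $i
         map[$2] = tax                # key on the contig id in column 2 of file2
         next
     }
     { print $0, map[$1] }            # second pass: append the taxonomy to each file1 row
    ' file2.txt file1.txt > combined.txt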

Problems combining awk scripts

I am trying to use awk to parse a tab delimited table -- there are several duplicate entries in the first column, and I need to remove the duplicate rows that have a smaller total sum of the other 4 columns in the table. I can remove the first or second row easily, and sum the columns, but I'm having trouble combining the two. For my purposes there will never be more than 2 duplicates.
Example file: http://pastebin.com/u2GBnm2D
Desired output in this case would be to remove the rows:
lmo0330 1 1 0 1
lmo0506 7 21 2 10
And keep the other two rows with the same gene id in the column. The final parsed file would look like this: http://pastebin.com/WgDkm5ui
Here's what I have tried (this doesn't do anything as written, but the first part removes the second duplicate and the second part sums the counts):
awk 'BEGIN {!a[$1]++} {for(i=1;i<=NF;i++) t+=$i; print t; t=0}'
I tried modifying the 2nd part of the script in the best answer of this question: Removing lines containing a unique first field with awk?
awk 'FNR==NR{a[$1]++;next}(a[$1] > 1)' ./infile ./infile
But unfortunately I don't really understand what's going on well enough to get it working. Can anyone help me out? I think I need to replace the a[$1] > 1 part with [remove (first duplicate count or 2nd duplicate count depending on which is larger].
EDIT: I'm also using GNU Awk 3.1.7 if that matters.
You can use this awk command:
awk 'NR == 1 {                 # pass the header line through untouched
    print;
    next
} {
    s = $2+$3+$4+$5            # total of the four count columns for this row
} s >= sum[$1] {               # keep the row with the larger (or equal) total per gene id
    sum[$1] = s;
    if (!($1 in rows))         # remember first-seen order of the gene ids
        a[++n] = $1;
    rows[$1] = $0
} END {
    for(i=1; i<=n; i++)        # print the surviving row for each gene, in input order
        print rows[a[i]]
}' file | column -t
Output:
gene SRR034450.out.rpkm_0 SRR034451.out.rpkm_0 SRR034452.out.rpkm_0 SRR034453.out.rpkm_0
lmo0001 160 323 533 293
lmo0002 135 317 504 306
lmo0003 1 4 5 3
lmo0004 35 59 58 48
lmo0005 113 218 257 187
lmo0006 279 519 653 539
lmo0007 563 1053 1165 1069
lmo0008 34 84 203 107
lmo0009 13 45 90 49
lmo0010 57 210 237 169
lmo0011 65 224 247 179
lmo0012 65 226 250 215
lmo0013 342 500 738 682
lmo0014 662 1032 1283 1311
lmo0015 321 413 631 637
lmo0016 175 253 273 325
lmo0017 3 6 6 6
lmo0018 33 38 46 45
lmo0019 13 1 39 1
lmo0020 3 12 28 15
lmo0021 3 4 14 12
lmo0022 2 3 5 1
lmo0023 2 0 3 2
lmo0024 1 0 2 6
lmo0330 1 1 1 3
lmo0506 151 232 60 204
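As a quick check against the rows from the question: for lmo0330 the kept row sums to 1+1+1+3 = 6 versus 1+1+0+1 = 3 for the dropped one, and for lmo0506 the kept row sums to 151+232+60+204 = 647 versus 7+21+2+10 = 40, so in both cases the duplicate with the larger total survives.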

How can I swap numbers inside data block of repeating format using linux commands?

I have a huge data file in the format shown below, and I want to swap some numbers in the 2nd column only. The file holds 25,000,000 datasets with 8768 data lines each.
%% Edited: shorter 10-line example. Sorry for the inconvenience. This is one typical data block.
# Dataset 1
#
# Number of lines 10
#
# header lines
5 11 3 10 120 90 0 0.952 0.881 0.898 2.744 0.034 0.030
10 12 3 5 125 112 0 0.952 0.897 0.905 2.775 0.026 0.030
50 10 3 48 129 120 0 1.061 0.977 0.965 3.063 0.001 0.026
120 2 4 5 50 186 193 0 0.881 0.965 0.899 0.917 3.669 0.000 -0.005
125 3 4 10 43 186 183 0 0.897 0.945 0.910 0.883 3.641 0.000 0.003
186 5 4 120 125 249 280 0 0.899 0.910 0.931 0.961 3.727 0.000 -0.001
193 6 4 120 275 118 268 0 0.917 0.895 0.897 0.937 3.799 0.000 0.023
201 8 4 278 129 131 280 0 0.921 0.837 0.870 0.934 3.572 0.000 0.008
249 9 4 186 355 179 317 0 0.931 0.844 0.907 0.928 3.615 0.000 0.008
280 10 4 186 201 340 359 0 0.961 0.934 0.904 0.898 3.700 0.000 0.033
#
# Dataset 1
#
# Number of lines 10
...
As you can see, there are 7 repeating header lines at the top and 1 trailing line at the end of each dataset. The header and trailing lines all begin with #. As a result, each data block has 7 header lines, 8768 data lines, and 1 trailing line, 8776 lines in total. The trailing line contains only a single '#'.
I want to swap some numbers in the 2nd column only. First, I want to replace
1, 9, 10, 11 => 666
2, 6, 7, 8 => 333
3, 4, 5 => 222
in the 2nd column, and then
666 => 6
333 => 3
222 => 2
in the 2nd column. I want to apply this replacement to every dataset in the file.
I tried this with Python, but the data is too big and it runs out of memory. How can I perform this swapping with Linux tools like sed or awk?
Thanks
Best,
This might work for you, but you'd have to use GNU awk, as it relies on the gensub() function and $0 reassignment.
Put the following into an executable awk file (like script.awk):
#!/usr/bin/awk -f
BEGIN {
    a[1] = a[9] = a[10] = a[11] = 6
    a[2] = a[6] = a[7] = a[8] = 3
    a[3] = a[4] = a[5] = 2
}

function swap( c2, val ) {
    val = a[c2]
    return( val=="" ? c2 : val )
}

/^( [0-9]+ )/ { $0 = gensub( /^( [0-9]+)( [0-9]+)/, "\\1 " swap($2), 1 ) }

47 # print the line
Here's the breakdown:
BEGIN - set up an array a with mappings of the new values.
create a user-defined function swap to provide the replacement value for the 2nd column from the a array, or the value itself when there is no mapping. The c2 element is passed in, while val is a local variable (because no 2nd argument is passed in).
when a line starts with a space followed by a number and a space (the pattern), use gensub to replace the first occurrence of the first number pattern with itself concatenated with a space and the return from swap (the action). In this case, I'm using gensub's replacement text to preserve the first column data. The second column is passed to swap using the field identifier $2. Using gensub should preserve the formatting of the data lines.
47 - an expression that evaluates to true provides the default action of printing $0, which for data lines might have been modified. Any line that wasn't "data" will be printed out here w/o modifications.
The provided data doesn't show all the cases, so I made up my own test file:
# 2 skip me
9 2 not going to process me
 1 1 don't change the for matting
 2 2 4 23242.223 data
 3 3 data that's formatted
 4 4 7 that's formatted
 5 5 data that's formatted
 6 6 data that's formatted
 7 7 data that's formatted
 8 8 data that's formatted
 9 9 data that's formatted
 10 10 data that's formatted
 11 11 data that's formatted
 12 12 data that's formatted
 13 13 data that's formatted
 14 s data that's formatted
# some other data
Running the executable awk (like ./script.awk data) gives the following output:
# 2 skip me
9 2 not going to process me
 1 6 don't change the for matting
 2 3 4 23242.223 data
 3 2 data that's formatted
 4 2 7 that's formatted
 5 2 data that's formatted
 6 3 data that's formatted
 7 3 data that's formatted
 8 3 data that's formatted
 9 6 data that's formatted
 10 6 data that's formatted
 11 6 data that's formatted
 12 12 data that's formatted
 13 13 data that's formatted
 14 s data that's formatted
# some other data
which looks alright to me, but I'm not the one with 25 million datasets.
You'd also most definitely want to try this on a smaller sample of your data first (the first few datasets?) and redirect stdout to a temp file, perhaps like:
head -n 26328 data | ./script.awk - > tempfile
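Here 26328 is three complete blocks (3 × 8776 lines), so the sample covers the first three datasets.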
You can learn more about the elements used in this script here:
awk basics (the man page)
Arrays
User defined functions
String functions - gensub()
And of course, you should spend some quality time reviewing awk related questions and answers on Stack Overflow ;)

Turning rows into columns based on a value in excel?

I have a bunch of data listed one below another in Excel (actually OpenOffice). I need it consolidated by the first column, but in a way that still shows all the data:
From this (the first column actually goes up to 100):
1 283.038 244
1 279.899 494
1 255.139 992
1 254.606 7329
1 254.5 17145
1 251.008 23278
1 250.723 28758
1 247.753 92703
1 243.43 315278
1 242.928 485029
1 237.475 1226549
1 233.851 2076295
1 232.833 9826327
1 229.656 15965410
1 229.656 30000235
2 286.535 231
2 275.968 496
2 267.927 741
2 262.647 2153
2 258.925 3130
2 253.954 4857
2 249.551 9764
2 244.725 36878
2 243.825 318455
2 242.86 921618
2 238.401 1405028
2 234.984 3170031
2 233.168 4403799
2 229.317 8719139
2 224.395 26986035
2 224.395 30000056
3 269.715 247
3 268.652 469
3 251.214 957
3 249.04 30344
3 245.883 56115
3 241.753 289668
3 241.707 954750
3 240.684 1421766
3 240.178 1865750
3 235.09 2626524
3 233.579 5129755
3 232.517 7018880
3 232.256 18518741
3 228.75 19117443
3 228.75 30000051
to:
1 2 3
283.038 244 286.535 231 269.715 247
279.899 494 275.968 496 268.652 469
255.139 992 267.927 741 251.214 957
254.606 7329 262.647 2153 249.04 30344
254.5 17145 258.925 3130 245.883 56115
251.008 23278 253.954 4857 241.753 289668
250.723 28758 249.551 9764 241.707 954750
247.753 92703 244.725 36878 240.684 1421766
243.43 315278 243.825 318455 240.178 1865750
242.928 485029 242.86 921618 235.09 2626524
237.475 1226549 238.401 1405028 233.579 5129755
233.851 2076295 234.984 3170031 232.517 7018880
232.833 9826327 233.168 4403799 232.256 18518741
229.656 15965410 229.317 8719139 228.75 19117443
229.656 30000235 224.395 26986035 228.75 30000051
224.395 30000056
This must be really simple. But I couldn't find it. I tried a pivot table, but that only allows me to summarise or count etc the fields, while I want them all to be displayed. Any ideas?
To elaborate on the pivot table: I put column 1 as the row, column 2 as the column and column 3 in the middle, but that comes out with a lot of empty cells and summarised values.
I am not sure which search terms to look for; searching around unconsolidated pivot tables hasn't provided an answer.
After some discussions with my colleagues, this seemed undoable in Excel. So I just created a short script to save each "run" (which is the first column) in a separate file, from CSV:
awk -F, '{print > $1}' rw.txt
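If a side-by-side layout is still wanted afterwards, the per-run files (named 1, 2, 3, ... by the command above, since the run number is the first CSV field) can be glued together with paste; a rough sketch, noting that each pasted block still carries its run number in its first field:
paste -d',' $(seq 1 100) > wide.csv    # join runs 1..100 column-wise, comma-separated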

Supplement patterns

I have these kind of records in a file:
1867 121 2 56
1868 121 1 6
1868 121 2 65
1868 122 0 53
1869 121 0 41
1869 121 1 41
1871 121 1 13
1871 121 2 194
I would like to get this output:
1867 121 2 56
1868 121 1 6
1868 121 2 65
1868 122 0 53
1869 121 0 41
1869 121 1 41
1870 121 0 0
1871 121 1 13
1871 121 2 194
The difference is the 1870 121 0 0 row.
So, if the difference between the numbers in the first column is greater than 1, a line with the missing number (1870 in the case above) and the other columns has to be inserted. The other columns should be filled as follows: the second column gets the minimum of the values occurring in that column (in the example these values are 121 and 122), and the same goes for the third column. The last column is always zero.
Can anybody suggest something? Thanks in advance!
I am trying to solve it with awk, but maybe there are other, nicer or more practical solutions for this...
Something like this could work -
awk 'BEGIN{getline;a=$1;b=$2;c=$3}
NR==FNR{if (b>$2) b=$2; if (c>$3) c=$3;next}
{if ($1-a>1) {x=($1-a); for (i=1;i<x;i++) {print (a+1)"\t"b,c,"0";a++};a=$1} else a=$1;print}' file file
Explanation:
BEGIN{getline;a=$1;b=$2;c=$3} -
In this BEGIN block we read the first line and assign values in column 1 to variable a, column 2 to variable b and column 3 to variable c.
NR==FNR{if (b>$2) b=$2; if (c>$3) c=$3;next} -
In this we scan through the entire file (NR==FNR) and keep track of the lowest possible values in column 2 and column 3 and store them in variables b and c respectively. We use next to avoid running the second pattern{action} statement.
{if ($1-a>1) {x=($1-a); for (i=1;i<x;i++) {print (a+1)"\t"b,c,"0";a++};a=$1} else a=$1;print} -
This action statement takes the value in column 1 and compares it with a. If the difference is more than 1, we run a for loop to add all the missing lines and set the value of a to $1. If the difference between column 1 on successive lines is not greater than 1, we simply assign the value of column 1 to a and print the line.
Test:
[jaypal:~/Temp] cat file
1867 121 2 56
1868 121 1 6
1868 121 2 65
1868 122 0 53
1869 121 0 41
1869 121 1 41
1871 121 1 13 # <--- 1870 skipped
1871 121 2 194
1875 120 1 12 # <--- 1872, 1873, 1874 skipped
[jaypal:~/Temp] awk 'BEGIN{getline;a=$1;b=$2;c=$3}
NR==FNR{if (b>$2) b=$2; if (c>$3) c=$3;next}
{if ($1-a>1) {x=($1-a); for (i=1;i<x;i++) {print (a+1)"\t"b,c,"0";a++};a=$1} else a=$1;print}' file file
1867 121 2 56
1868 121 1 6
1868 121 2 65
1868 122 0 53
1869 121 0 41
1869 121 1 41
1870 120 0 0 # Assigned minimum value in col 2 (120) and col 3 (0).
1871 121 1 13
1871 121 2 194
1872 120 0 0 # Assigned minimum value in col 2 (120) and col 3 (0).
1873 120 0 0 # Assigned minimum value in col 2 (120) and col 3 (0).
1874 120 0 0 # Assigned minimum value in col 2 (120) and col 3 (0).
1875 120 1 12
Perl solution. Should work for large files, too, as it does not load the whole file into memory, but goes over the file two times.
#!/usr/bin/perl
use warnings;
use strict;

my $file = shift;
open my $IN, '<', $file or die $!;
my @mins;
while (<$IN>) {
    my @cols = split;
    for (0, 1) {
        $mins[$_] = $cols[$_ + 1] if $cols[$_ + 1] < $mins[$_]
                                     or ! defined $mins[$_];
    }
}
seek $IN, 0, 0;
my $last;
while (<$IN>) {
    my @cols = split;
    $last //= $cols[0];
    for my $i ($last .. $cols[0]-2) {
        print $i + 1, "\t@mins 0\n";
    }
    print;
    $last = $cols[0];
}
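The script takes the input file name as its only argument (my $file = shift), so a run would look something like this, with supplement.pl as a stand-in name for the script:
perl supplement.pl file > filled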
A Bash solution:
# initialize the minimums of the 2nd and 3rd columns from the first line
read no min2 min3 c4 < "$infile"

# get the minimums of the 2nd and 3rd columns
while read c1 c2 c3 c4 ; do
    [ $c2 -lt $min2 ] && min2=$c2
    [ $c3 -lt $min3 ] && min3=$c3
done < "$infile"

while read c1 c2 c3 c4 ; do
    # insert missing line(s) ?
    while (( c1 - no > 1 )) ; do
        ((no++))
        echo -e "$no $min2 $min3 0"
    done
    # now print the existing line
    echo -e "$c1 $c2 $c3 $c4"
    no=$c1
done < "$infile"
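The snippet expects $infile to name the input file, so saved as, say, fill.sh (a hypothetical name) it could be run as:
infile=file bash fill.sh > filled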
One way using awk:
BEGIN {
    if ( ARGC > 2 ) {
        print "Usage: awk -f script.awk <file-name>"
        exit 0
    }

    ## Need to process file twice, duplicate the input filename.
    ARGV[2] = ARGV[1]
    ++ARGC

    col2 = -1
    col3 = -1
}

## First processing of file. Get min values of second and third columns.
FNR == NR {
    col2 = col2 < 0 || col2 > $2 ? $2 : col2
    col3 = col3 < 0 || col3 > $3 ? $3 : col3
    next
}

## Second processing of file.
FNR < NR {
    ## Get value of column 1 in first row.
    if ( FNR == 1 ) {
        col1 = $1
        print
        next
    }

    ## Compare current value of column 1 with value of previous row.
    ## Add a new row while difference is bigger than '1'.
    while ( $1 - col1 > 1 ) {
        ++col1
        printf "%d\t%d %d %d\n", col1, col2, col3, 0
    }

    ## Assign new value of column 1.
    col1 = $1
    print
}
Running the script:
awk -f script.awk infile
Result:
1867 121 2 56
1868 121 1 6
1868 121 2 65
1868 122 0 53
1869 121 0 41
1869 121 1 41
1870 121 0 0
1871 121 1 13
1871 121 2 194
