Problem getting desired output using Grep with just numbers as pattern - linux

I am trying to grep rows from file 2 that match the values in file 1, but the output contains more lines than expected.
File 1 looks like this:
$ head b.txt
5
11
26
27
File 2, a.txt, looks like this:
1509 5
1506 11
1507 12
339 26
1000 27
1000 100
Command I use:
grep -wFf b.txt a.txt
Results I want:
1509 5
1506 11
339 26
1000 27
It gives me all the rows I want, but some extra lines too, e.g.:
1509 5
1506 11
1507 12
339 26
1000 27
1000 100
How can I fix this?

I simulated your problem and believe I know what's going on. With an empty line at the end of b.txt, I get the same output as you do. If I remove the empty line at the end of b.txt, I get your desired output.
$ grep -wFf b.txt a.txt
1509 5
1506 11
339 26
1000 27
From grep's manpage:
-f file, --file=file
Read one or more newline separated patterns from file. Empty pattern lines match every input line. Newlines are not considered part of a pattern. If file is empty, nothing is matched.
I believe "Empty pattern lines match every input line." is the cause of your erroneous output.
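If that is what is happening here, dropping empty lines from the pattern file should give the expected output. A minimal sketch (assuming bash for the process substitution; file names as in the question):
# filter blank lines out of the pattern file on the fly
grep -wFf <(grep -v '^$' b.txt) a.txt
# or clean up b.txt in place first (GNU sed) and rerun the original command
sed -i '/^$/d' b.txt
grep -wFf b.txt a.txt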

Maybe you want to join files.
join -12 -21 -o1.1,1.2 <(<a.txt sort -k2) <(<b.txt sort)
will output:
1506 11
339 26
1000 27
1509 5
The command joins the second field of a.txt with the first field of b.txt. "Joining" means finding lines in both files where the specified fields have equal values; here those fields are the second column of the first file and the first column of the second file. join needs its inputs to be sorted by the join fields, so each file is piped through sort first. Sadly, this method will not preserve the original order of the lines, since they have to be reordered for join to work.
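If keeping the original order of a.txt matters, an awk lookup is a common alternative to join (a sketch, not part of the answer above): read the values of b.txt into an array, then print the lines of a.txt whose second field is one of them.
# NR==FNR is true only while reading the first file (b.txt); blank lines in b.txt are harmless here
awk 'NR==FNR { want[$1]; next } ($2 in want)' b.txt a.txt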

Related

Find strings from one file that are not in lines of another file

In a bash shell script, I need to create a file with strings from file 1 that are not found in lines from file 2. File 1 is opened through a for loop of files in a directory.
files=./Output/*
for f in $files
do
done
I have very large files, so using grep isn't ideal. I previously tried:
awk 'NR==FNR{A[$2]=$0;next}!($2 in A){print }' file2 file1 > file3
file 1:
NB551674:136:HHVMJAFX2:1:11101:18246:1165
NB551674:136:HHVMJAFX2:1:11101:10296:1192
NB551674:136:HHVMJAFX2:1:11101:13281:1192
NB551674:136:HHVMJAFX2:2:21204:11743:6409
file 2:
aggggcgttccgcagtcgacaagggctgaaaaa|AbaeA1 NB551674:136:HHVMJAFX2:2:21204:11743:6409 100.000 32 0 0 1 32 83 114 7.30e-10 60.2
taccaacaattcagcgttacgccaacggtaac|AbaeB1 NB551674:136:HHVMJAFX2:4:21611:6341:1845 100.000 32 0 0 1 32 27 58 6.70e-10 60.2
taccaacaattcagcgttacgccaacggtaac|AbaeB1 NB551674:136:HHVMJAFX2:4:11504:1547:13124 100.000 32 0 0 1 32 88 119 6.70e-10 60.2
taccaacaattcagcgttacgccaacggtaac|AbaeB1 NB551674:136:HHVMJAFX2:3:11410:11337:15451 100.000 32 0 0 1 32 27 58 6.70e-10 60.2
expected output:
NB551674:136:HHVMJAFX2:2:21204:11743:6409
You were close - file1 only has 1 field ($1) but you were trying to use $2 in the hash lookup ($2 in A). Do this instead:
$ awk 'NR==FNR{a[$2]; next} !($1 in a)' file2 file1
NB551674:136:HHVMJAFX2:1:11101:18246:1165
NB551674:136:HHVMJAFX2:1:11101:10296:1192
NB551674:136:HHVMJAFX2:1:11101:13281:1192
By the way, don't use all-uppercase names for user-defined variables in awk or shell, to avoid clashes with builtin variables (among other reasons).
Use comm, which requires sorted files. Print the second field of file2 using a Perl one-liner (or cut):
comm -23 <(sort file1) <(perl -lane 'print $F[1]' file2 | sort)
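If file2 is tab-delimited (an assumption; cut's default delimiter is a tab, whereas the Perl one-liner above splits on any whitespace), the cut variant might look like:
comm -23 <(sort file1) <(cut -f2 file2 | sort)
If the columns are separated by runs of spaces instead, awk '{print $2}' is the safer substitute for the cut stage.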
Don't do that one-line-left, compare-one-line-right mess.
Use gawk in bytes mode or, preferably, mawk. Preload every single line from file 1 into an array, using the strings themselves as the array's hash indices instead of just numeric 1, 2, 3, ....
Also set FS to the same value as ORS (to keep awk from unnecessarily trying to split each string into fields).
Close file 1. Open file 2, then take each of the strings in file 2 and delete the corresponding entry in the array.
Close file 2.
In the END section, print out whatever is left inside that array. That's your set.
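A minimal sketch of that approach, as I read the description above (assuming, as in the sample data, that the IDs sit in file2's second column; output order is not preserved):
awk '
    NR == FNR { seen[$0]; next }       # file 1: store every whole line as an array key
    ($2 in seen) { delete seen[$2] }   # file 2: delete the entry matching its 2nd field
    END { for (id in seen) print id }  # whatever is left was never seen in file 2
' file1 file2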

Rearrange column with empty values using awk or sed

I want to rearrange the columns of a txt file, but there are empty values, which cause a problem. For example:
testfile:
Name    ID      Count   Date    Other
A       1       10      513     x
        6       15      312     x
        3       18      314     x
B       19      31      942     x
        8       29      722     x
When I tried $ more testfile | awk '{print $2"\t"$1"\t"$3"\t"$4"\t"$5}'
it becomes:
ID Name Count Date Other
1 A 10 513 x
15 6 312 x
18 3 314 x
19 B 31 942 x
29 8 722 x
which is not what I want. Please help; I want it to be:
ID      Name    Count   Date    Other
1       A       10      513     x
6               15      312     x
3               18      314     x
19      B       31      942     x
8               29      722     x
Moreover, I am not sure which columns might contain empty values, and the column length is not fixed. Thank you.
Assuming your input file is not tab-separated and you have (or can get) GNU awk then I recommend:
$ awk -v FIELDWIDTHS="8 8 8 8 8" -v OFS='\t' '{
    for (i=1;i<=NF;i++) {
        gsub(/^\s+|\s+$/,"",$i)
    }
    t=$1; $1=$2; $2=t
}1' file
ID Name Count Date Other
1 A 10 513 x
6 15 312 x
3 18 314 x
19 B 31 942 x
8 29 722 x
If your file is tab-separated then all you need is:
awk 'BEGIN{FS=OFS="\t"} {t=$1; $1=$2; $2=t}1' file
Another awk alternative uses the number of fields. If you know your data and the missing values occur only in the first column, you can try this:
awk -v OFS="\t" 'NF==4{$5=$4;$4=$3;$3=$2;$2=$1;$1=""} {print $2,$1,$3,$4,$5}'
However, the output will be tab-separated rather than fixed-width. You can achieve fixed-width output using printf (or by changing OFS), but perhaps tab-separated is what you really want for a tabular representation.
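For example, a printf variant that keeps fixed 8-character columns might look like this (a sketch; the NF==4 shifting is the same as above):
awk 'NF==4{$5=$4;$4=$3;$3=$2;$2=$1;$1=""} {printf "%-8s%-8s%-8s%-8s%-8s\n", $2,$1,$3,$4,$5}' file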
The most natural model for awk to use is columns as defined by the transitions from white-space to non-white-space and back. Since you have columns that may themselves be white-space, the natural model won't work.
However, you can revert to using a model based on column positions rather than transitions, meaning that a file containing only spaces (the presence of tabs will complicate things):
Name    ID      Count   Date    Other
A       1       10      513     x
        6       15      312     x
        3       18      314     x
B       19      31      942     x
        8       29      722     x
can still be rearranged, though not as succinctly as transition-based columns.
The following awk script will do the trick, swapping name and id:
{
    name  = substr($0, 1,7);
    id    = substr($0, 9,7);
    count = substr($0,17,7);
    date  = substr($0,25,7);
    other = substr($0,33);
    print id" "name" "count" "date" "other;
}
If the original file is called pax.in and the awk script is stored in pax.awk, the command awk -f pax.awk pax.in will give you, as desired:
ID      Name    Count   Date    Other
1       A       10      513     x
6               15      312     x
3               18      314     x
19      B       31      942     x
8               29      722     x
Keep in mind I've written that script to be relatively flexible, allowing you to change the order of the columns quite easily. If all you want is to swap the first two columns, you could use:
awk '{print substr($0,9,8)substr($0,1,8)substr($0,17)}' qq.in
or the slightly shorter (if you're allowed to use other tools):
sed -E 's/^(.{8})(.{8})/\2\1/' qq.in

How does Linux store negative numbers in text files?

I made a file using ed and named it numeric. Its content is as follows:
-100
-10
0
99
11
-56
12
Then I executed this command on terminal:
sort numeric
And the result was:
0
-10
-100
11
12
-56
99
And of course this output was not at all expected!
sort needs to be asked to sort numerically (otherwise it defaults to lexicographic sorting):
$ sort -n numbers.dat
-100
-56
-10
0
11
12
99
Note the "-n" parameter (see the manual).
Text files are text files, they contain text. Your numbers are sorted alphabetically. If you want sort to sort based on numerical value, use sort -n.
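To see that the file really is just characters (which answers the title question: the minus sign and the digits are stored as ordinary text, there is no binary integer anywhere), you can dump its bytes; for the file above the output will look roughly like this:
$ od -c numeric
0000000   -   1   0   0  \n   -   1   0  \n   0  \n   9   9  \n   1   1
0000020  \n   -   5   6  \n   1   2  \n
0000030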
Also, your sort result is strange; when I run the same test I get:
$ sort numeric
-10
-100
-56
0
11
12
99
Sorted alphabetically, as expected.
See https://glot.io/snippets/e555jjumx6
Use sort -n to make sort do numerical sorting instead of alphabetical
That's because by default, sort is alphabetical, not numeric. sort -n does numbers.
Otherwise you'll get
1
10
2
3
etc.

Sed command to find multiple patterns and print the line of the pattern and next nth lines after it

I have a tab-delimited file with 43,075 lines and 7 columns. I sorted the file by column 4 from the highest to the smallest value. Now I need to find 342 genes whose IDs are in column 2. See the example below:
miR Target Transcript Score Energy Length miR Length target
aae-bantam-3p AAEL007110 AAEL007110-RA 28404 -565.77 22 1776
aae-let-7 AAEL007110 AAEL007110-RA 28404 -568.77 21 1776
aae-miR-1 AAEL007110 AAEL007110-RA 28404 -567.77 22 1776
aae-miR-100 AAEL007110 AAEL007110-RA 28404 -567.08 22 1776
aae-miR-11-3p AAEL007110 AAEL007110-RA 28404 -564.03 22 1776
.
.
.
aae-bantam-3p AAEL018149 AAEL018149-RA 28292 -569.7 22 1769
aae-bantam-5p AAEL018149 AAEL018149-RA 28292 -570.93 23 1769
aae-let-7 AAEL018149 AAEL018149-RA 28292 -574.26 21 1769
aae-miR-1 AAEL018149 AAEL018149-RA 28292 -568.34 22 1769
aae-miR-10 AAEL018149 AAEL018149-RA 28292 -570.08 22 1769
There are 124 lines for each gene. However, I want to extract only the top hits for each, for example the top 5, since the file is sorted. I can do it for one gene with the following script:
sed -n '/AAEL018149/ {p;q}' myfile.csv > top-hits.csv
However, it prints only the line of the match. I was wondering if I could use a script to get all the 342 genes at once. It would be great if I could get the line of the match and the next 4. Then I would have the top 5 hits for each gene.
Any suggestion will be welcome. Thanks
You can also use awk for this:
awk '++a[$2]<=5' myfile.csv
Here, $2 is the 2nd column. Since the file is already sorted by the 4th column, this will print the top 5 lines for each gene (2nd column). All 342 genes will be covered, and the header line will be preserved.
Use grep:
grep -A 4 AAEL018149 myfile.csv
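To cover all 342 genes in one go rather than one ID at a time, you could put the IDs in a file, one per line (ids.txt here is a made-up name), and let grep read its patterns from that file:
# -F treats the IDs as fixed strings, -w avoids partial matches, -A 4 prints the 4 lines after each hit
grep -wF -A 4 -f ids.txt myfile.csv
Note that grep prints a "--" separator between match groups; filter it out with | grep -v '^--$' if it gets in the way.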

Slice 3TB log file with sed, awk & xargs?

I need to slice several TB of log data, and would prefer the speed of the command line.
I'll split the file up into chunks before processing, but need to remove some sections.
Here's an example of the format:
uuJ oPz eeOO 109 66 8
uuJ oPz eeOO 48 0 221
uuJ oPz eeOO 9 674 3
kf iiiTti oP 88 909 19
mxmx lo uUui 2 9 771
mxmx lo uUui 577 765 27878456
The gaps between the first 3 alphanumeric strings are spaces. Everything after that is tabs. Lines are separated with \n.
I want to keep only the last line in each group.
If there's only 1 line in a group, it should be kept.
Here's the expected output:
uuJ oPz eeOO 9 674 3
kf iiiTti oP 88 909 19
mxmx lo uUui 577 765 27878456
How can I do this with sed, awk, xargs and friends, or should I just use something higher level like Python?
awk -F '\t' '
NR==1 {key=$1}
$1!=key {print line; key=$1}
{line=$0}
END {print line}
' file_in > file_out
Try this:
awk 'BEGIN{FS="\t"}
{if($1!=prevKey) {if (NR > 1) {print lastLine}; prevKey=$1} lastLine=$0}
END{print lastLine}'
It saves the last line and prints it only when it notices that the key has changed.
This might work for you:
sed ':a;$!N;/^\(\S*\s\S*\s\S*\)[^\n]*\n\1/s//\1/;ta;P;D' file
