"Minus" operation on two files using Linux commands - linux

I have 4 files sorted alphabetically, A, B, C, and D.
These files contain a single string on each line.
Essentially, what needs to happen is that anything in B gets deleted from A.
The result of that will then be stripped of anything in C.
And similarly, the result of that will be stripped of D.
Is there a way to do this using Linux commands?

comm is good for this, either:
cat B C D | sort | comm -2 -3 A -
or:
comm -2 -3 A B | comm -2 -3 - C | comm -2 -3 - D
depending on what's easier/clearer for your script.
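For a concrete picture, here is a minimal session with made-up file contents (comm needs both inputs sorted; -2 suppresses lines unique to the second file and -3 suppresses lines common to both, leaving only the lines unique to the first):
$ printf 'apple\nbanana\ncherry\n' > A
$ printf 'banana\n' > B
$ comm -2 -3 A B
apple
cherry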

grep -x -v -f B A | grep -x -v -f C | grep -x -v -f D
The -v switch is an inverse match (i.e. match all except). The -f switch takes a file with a list of patterns to match. The -x switch forces it to match whole lines (so that lines that are substrings of other lines don't cause the longer lines to be removed).
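A small session with made-up data showing why -x matters; without it, the pattern foo would also remove foobar:
$ printf 'foo\nfoobar\nbaz\n' > A
$ printf 'foo\n' > B
$ grep -x -v -f B A
foobar
baz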

Look at the join command, in particular its -v option: join -v 1 prints the lines of the first file that have no match in the second, which is exactly a "minus" on sorted input. Read its man page and you should find what you seek.

join -v 1 A B | join -v 1 - C | join -v 1 - D

Related

Split sorted file without cutting blocks

I have a large file (20 GB) consisting of 30 million data records. The first field of each record is a non-unique key. The file is sorted by this key. I'd like to split this file into chunks, using anything available in a bash shell, such that the chunks are approximately the same size in bytes and all records with the same key go into the same chunk in the same order as in the original file.
I obviously do not want awk -F";" '{print > $1}' theFile because I'd prefer on the order of 10 large chunks, not one file per key. Also, split alone won't cut it, because I need a way to keep identical keys together.
You can pre-process the file such that split knows where it is allowed to split.
Here, we insert the null byte \0 to mark that splitting is allowed. Afterwards, we remove all \0 from the generated files. This assumes that your original data never contains \0.
awk -F\; '$1!=last {last=$1; if(NR>1) printf "\0"} 1' file > file.tmp
split -t\\0 --filter 'tr -d \\0 > "$FILE"' -n l/10 file.tmp file.
rm file.tmp
You can adapt split's options to your liking. Here we split into 10 chunks. Due to our pre-processing and the changed delimiter -t\\0, the chunk option l/… keeps identical keys together.
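If you would rather target a byte size than a fixed number of chunks, the same NUL trick should combine with split's -C (--line-bytes) option, which puts at most SIZE bytes of whole records per output file. A sketch along those lines, assuming chunks of roughly 2 GB:
split -t\\0 -C 2G --filter 'tr -d \\0 > "$FILE"' file.tmp file.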
To verify that everything worked, you can run
for i in file.*; do
echo "--- $i ---"
head -n1 "$i"
echo "[...]"
tail -n1 "$i"
done
I generated a test file (the adjacent "" in n=1""000""000 are empty strings, serving only as visual digit grouping) using
n=1""000""000; paste -d\; <(shuf -i1-500 -rn$n) <(shuf -rn$n /usr/share/dict/words) | sort -t\; -k1,1n > file
and got
--- file.aa ---
1;abbesses
[...]
55;Zoroaster
--- file.ab ---
56;abase
[...]
107;zoologists
--- file.ac ---
108;abattoir
[...]
and so on.
You mention split as a tag, but do you know that the UNIX/Linux command split exists especially for this purpose?
The man page mentions (among others):
-b, --bytes=SIZE
put SIZE bytes per output file
There are plenty of examples all over the internet.
Edit: apparently split is not a good option
I've created a file, try.txt, with the following content:
x1 a b
x1 b c
x1 c d
x2 a b
x2 a d
x3 a b
x3 a c
x3 b c
x3 c c
First, I need to know how many lines there are per key:
Linux prompt>awk '{print $1}' try.txt | sort | uniq -c
3 x1
2 x2
4 x3
(Remark: uniq -c prefixes each distinct entry with its number of occurrences)
So "x1" occurs 3 times, "x2" twice, and "x3" 4 times. Now let's take those parts:
Linux prompt>head -n 3 try.txt
x1 a b
x1 b c
x1 c d
Linux prompt>head -n $((3+2)) try.txt | tail -n 2
x2 a b
x2 a d
Linux prompt>head -n $((3+2+4)) try.txt | tail -n 4
x3 a b
x3 a c
x3 b c
x3 c c
It's not an entirely scripted solution, but I guess it might be helpful for you. A rough scripted version of the same idea is sketched below.
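A sketch of that scripted version (assuming, as above, that try.txt is already sorted by its first field, so uniq -c can count the key groups in order; the chunk file names are made up):
#!/bin/bash
offset=0
i=0
awk '{print $1}' try.txt | uniq -c | while read -r count key; do
    i=$((i+1))
    # take the next $count lines after the current offset
    head -n $((offset + count)) try.txt | tail -n "$count" > "chunk.$i"
    offset=$((offset + count))
done
This writes one file per key group; adjacent groups could then be concatenated until the desired chunk size is reached.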

4 lines invert grep search in a directory that contains many files

I have many log files in a directory. In those files, there are many lines. Some of these lines contain ERROR word.
I am using grep ERROR abc* to get the error lines from all the abc1, abc2, abc3, etc. files.
Now, there are 4-5 ERROR lines that I want to avoid.
So, I am using
grep ERROR abc* | grep -v 'str1\| str2'
This works fine. But when I insert 1 more string,
grep ERROR abc* | grep -v 'str1\| str2\| str3'
it doesn't get affected.
I need to avoid 4-5 strings. Can anybody suggest a solution?
You are using multiple search patterns, i.e. in effect a regular expression. grep supports this via -e and -E, as you can see from the man page excerpts below:
-e PATTERN, --regexp=PATTERN
Use PATTERN as the pattern. This can be used to specify multiple search patterns, or to protect a pattern beginning with a hyphen (-). (-e is specified by POSIX.)
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by POSIX.)
So you can combine the -E flag with the -v inverted match:
grep ERROR abc* | grep -Ev 'str1|str2|str3|str4|str5'
An example of the usage for your reference:
$ cat sample.txt
ID F1 F2 F3 F4 ID F1 F2 F3 F4
aa aa
bb 1 2 3 4 bb 1 2 3 4
cc 1 2 3 4 cc 1 2 3 4
dd 1 2 3 4 dd 1 2 3 4
xx xx
$ grep -vE "aa|xx|yy|F2|cc|dd" sample.txt
bb 1 2 3 4 bb 1 2 3 4
Your example should work, but you can also use
grep ERROR abc* | grep -e 'str1' -e 'str2' -e 'str3' -v
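If the exclusion list keeps growing, it may be cleaner to keep the strings in a file and use grep's -f option (here excludes.txt is a hypothetical file with one string per line):
$ printf 'str1\nstr2\nstr3\nstr4\nstr5\n' > excludes.txt
$ grep ERROR abc* | grep -v -f excludes.txt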

Finding the Number of strings in a File

I'm trying to write a very small program that will check the number of sub strings in a large text file. All it will do is count the first 2000 lines of the text file, find any "TTT" sub-strings, count them, and set a variable to that total. I'm a bit new to shell, so any help would be amazingly appreciated!
#!/bin/bash
$counter=(head -2000 [file name] | grep TTT | grep -o TTT | wc -l)
echo $counter
For what it's worth, you might find awk better suited for this task:
awk -F"ttt" '{j=(NF-1)+j}END{print j}' filename
This will split each record in your file by delimiter "ttt". Then it counts the number of fields, subtracts one, and adds that to the total.
A file like:
ttt tttttt something
1 5 ttt
tt
one more ttt record
Would be split (visualizing each field boundary with a pipe) like:
| || something
1 5 |
tt
one more | record
Counting the number of fields per record:
4
2
1
2
Subtracting one from that:
3
1
0
1
Which totals to 5, which is how many "ttt" substrings are present.
To incorporate this into your script (fixing your other issue, and restoring the 2000-line limit and the uppercase TTT from your question):
#!/bin/bash
counter=$(head -n 2000 filename | awk -F"TTT" '{j=(NF-1)+j} END{print j}')
echo $counter
The change here is that when we set a variable in Bash we don't include the $ sign at the front. Only in referencing the variable do we include the $.
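In other words (with a made-up variable name):
name=value      # assignment: no $ on the left-hand side
echo "$name"    # reference: use $ when reading the variable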
You have some minor syntax errors there, probably you meant this:
counter=$(head -2000 [file name] | grep TTT | grep -o TTT | wc -l)
echo $counter
Notice the tiny changes I made there to make it work.
By the way, the grep TTT in the middle is redundant; you can simply drop it, that is:
counter=$(head -2000 [file name] | grep -o TTT | wc -l)
grep can already do the counting: counter=$(grep -c TTT $infile). Note, though, that -c counts matching lines rather than individual occurrences, so a line containing TTT twice contributes only 1. You can also stop early with -m NUM (--max-count=NUM), which makes grep stop reading after NUM matching lines.
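A quick illustration of the difference, with made-up input:
$ printf 'TTTxTTT\nyTTT\n' > demo.txt
$ grep -c TTT demo.txt
2
$ grep -o TTT demo.txt | wc -l
3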

Joining a pair of lines with specific starting points

I know that with sed I can convert
A
B
C
D
E
F
to
A,B
C,D
E,F
using
sed 'N;s/\n/,/' current.txt > new.txt
What I would like to do is following:
A
B
C
D
E
F
to
A,D
B,E
C,F
I'd like to join 1 with 4, 2 with 5, 3 with 6 and so on.
Is this possible with sed? Any idea how it could be achieved?
Thank you.
Try printing in columns:
pr -s, -t -2 current.txt
Here -2 produces two-column output, filling the first column with the first half of the file; -t omits headers and trailers; -s, separates the columns with a comma.
This is longer than I was hoping, but:
$ lc=$(( $(wc -l current.txt | sed 's/ .*//') / 2 ))
$ paste <(head -"$lc" current.txt) <(tail -"$lc" current.txt) | column -t -o,
The variable lc stores the number of lines in current.txt divided by two. Then head and tail are used to print lc first and lc last lines, respectively (i.e. the first and second half of the file); then paste is used to put the two together and column changes tabs to commas.
An awk version
awk '{a[NR]=$0} NR>3 {print a[NR-3]","$0}' current.txt
A,D
B,E
C,F
This solution is easy to adjust if you want a different interval.
Just change NR>3 and NR-3 to the desired number. A generalized variant that computes the interval itself is sketched below.
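If you'd rather not hardcode the interval, a variant can buffer the whole file and compute the halfway point at the end (same idea, assuming an even number of lines):
awk '{a[NR]=$0} END{h=NR/2; for(i=1;i<=h;i++) print a[i] "," a[i+h]}' current.txt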

Making horizontal String vertical shell or awk

I have a string
ABCDEFGHIJ
I would like it to print:
A
B
C
D
E
F
G
H
I
J
i.e. from horizontal to vertical, with no editing between the characters. Bonus points for how to put a number next to each one with a single line. It'd be nice if this were an awk or shell script, but I am open to learning new things. :) Thanks!
If you just want to convert a string to one-char-per-line, you just need to tell awk that each input character is a separate field (FS=), that each output field should be separated by a newline (OFS='\n'), and then recompile each record by assigning a field to itself:
awk -v FS= -v OFS='\n' '{$1=$1}1'
e.g.:
$ echo "ABCDEFGHIJ" | awk -v FS= -v OFS='\n' '{$1=$1}1'
A
B
C
D
E
F
G
H
I
J
and if you want field numbers next to each character, see @Kent's solution or pipe to cat -n.
The sed solution you posted is non-portable and will fail with some seds on some OSs. It also adds an undesirable blank line to the end of the output, which becomes a trailing line number after your pipe to cat -n, so it's not a good alternative. You should accept @Kent's answer.
awk one-liner:
awk 'BEGIN{FS=""}{for(i=1;i<=NF;i++)print i,$i}'
test :
kent$ echo "ABCDEF"|awk 'BEGIN{FS=""}{for(i=1;i<=NF;i++)print i,$i}'
1 A
2 B
3 C
4 D
5 E
6 F
So I figured this one out on my own with sed.
sed 's/./&\n/g' horiz.txt > vert.txt
One more awk:
echo "ABCDEFGHIJ" | awk '{gsub(/./,"&\n")}1'
A
B
C
D
E
F
G
H
I
J
This might work for you (GNU sed; \B matches the empty position between two word characters, so a newline is inserted between every pair of adjacent characters without leaving a trailing blank line):
sed 's/\B/\n/g' <<<ABCDEFGHIJ
for line numbers:
sed 's/\B/\n/g' <<<ABCDEFGHIJ | sed = | sed 'N;y/\n/ /'
or:
sed 's/\B/\n/g' <<<ABCDEFGHIJ | cat -n
