Concatenating Linux program outputs, and returning only those lines which repeat

I have multiple programs which each produce lines of output. How do I concatenate those outputs, and then return one copy of each line that appears more than once? In other words, I want the set intersection of all the output lines.
For example:
$ progA
9
13
14
15
$ progA --someFlag
13
14
15
100
$ progB
14
15
-42
$ magicFunction 'progA' 'progA --someFlag' 'progB'
14
15
This doesn't have to be a function per se. I just wanted a Unix command-line way.

How about:
( progA; progA --someFlag; progB ) | sort | uniq -d
The -d option makes uniq output one copy of each line that occurs more than once. uniq only detects adjacent duplicates, which is why the sort comes first.
Here's a variant of the one-liner above that does not use a subshell:
{ progA; progA --someFlag; progB; } | sort | uniq -d
This works at least in bash. Note the required terminating semicolon (;) after the last command in the curly braces.

The solutions above don't really compute the set intersection of all 3 outputs: uniq -d will also output lines which are produced by only 2 of the 3 programs.
Here's my take on it:
progA | sort > f1
progA --someFlag | sort > f2
progB | sort > f3
comm -1 -2 f1 f2 | comm -1 -2 f3 -
rm f[123]
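A counting approach generalizes this to any number of commands. Here is a sketch of a magicFunction along those lines; it is an illustration, not a vetted implementation: the command strings are run through eval (so they must be trusted), and each command's output goes through sort -u first so that a line one program prints twice is still counted only once:
magicFunction() {
    for cmd in "$@"; do
        eval "$cmd" | sort -u        # at most one copy of each line per command
    done |
    sort | uniq -c |                 # count how many commands produced each line
    awk -v n="$#" '$1 == n {         # keep lines produced by all n commands
        sub(/^[[:space:]]*[0-9]+ /, "")   # strip the count uniq -c prepended
        print
    }'
}
With the sample programs above, magicFunction 'progA' 'progA --someFlag' 'progB' prints 14 and 15.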

Related

Calculating a sum of numbers in C shell

I'm trying to calculate a sum of numbers, one per line, using the C shell.
I must do it with specific commands using pipes.
There is a chain of commands: command... | command... | (commands...)
printing lines in the following form:
1
2
8
4
7
The result should be 22, since 1 + 2 + 8 + 4 + 7 = 22.
I tried ... | bc | tr "\n" "+" | bc, but it didn't work.
I can't use AWK, or variables. That is why I am asking for help.
You actually can use C shell variables; they are part of the syntax. But without variables, you need to pipe, and pipe again:
your-command | sed '2~1 s/^/+/' | xargs | bc
The sed command prepends a plus sign to every line starting from the second (the 2~1 address is a GNU sed extension); xargs then joins the lines into a single line, which bc evaluates.
The sed expression can be improved to filter out non-numeric lines:
'/^[^0-9]\+$/ d; 2~1 s/\([0-9]\+\)/+\1/'
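For reference, a shorter pipeline reaches the same result (a sketch, with your-command standing for the producing pipeline): paste -s joins all input lines into one line, -d+ inserts + between them without the trailing + that made the tr attempt fail, and bc evaluates the sum.
your-command | paste -sd+ - | bc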

4 lines invert grep search in a directory that contains many files

I have many log files in a directory. In those files, there are many lines. Some of these lines contain ERROR word.
I am using grep ERROR abc.* to get error lines from all the abc1,abc2,abc3,etc files.
Now, there are 4-5 ERROR lines that I want to avoid.
So, I am using
grep ERROR abc* | grep -v 'str1\| str2'
This works fine. But when I insert 1 more string,
grep ERROR abc* | grep -v 'str1\| str2\| str3'
the extra string has no effect.
I need to avoid 4-5 strings.. can anybody suggest a solution?
You are using multiple search patterns, in effect a regular-expression alternation. grep's -E option enables extended regular expressions, as you can see from the man page below:
-e PATTERN, --regexp=PATTERN
Use PATTERN as the pattern. This can be used to specify multiple search patterns, or to protect a pattern beginning with a hyphen (-). (-e is specified by POSIX.)
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by POSIX.)
So you need to use the -E flag along with the -v invert search
grep ERROR abc* | grep -Ev 'str1|str2|str3|str4|str5'
An example of the usage for your reference:
$ cat sample.txt
ID F1 F2 F3 F4 ID F1 F2 F3 F4
aa aa
bb 1 2 3 4 bb 1 2 3 4
cc 1 2 3 4 cc 1 2 3 4
dd 1 2 3 4 dd 1 2 3 4
xx xx
$ grep -vE "aa|xx|yy|F2|cc|dd" sample.txt
bb 1 2 3 4 bb 1 2 3 4
Your example should work, but you can also use
grep ERROR abc* | grep -e 'str1' -e 'str2' -e 'str3' -v
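With 4-5 or more exclusions, grep's -f option, which reads one pattern per line from a file, keeps the command line manageable. The file name excludes.txt here is illustrative:
$ cat excludes.txt
str1
str2
str3
str4
str5
$ grep ERROR abc* | grep -vf excludes.txt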

How can I sort a 10GB file?

I'm trying to sort a big table stored in a file. The format of the file is
(ID, intValue)
The data is sorted by ID, but what I need is to sort the data using the intValue, in descending order.
For example
ID | IntValue
1 | 3
2 | 24
3 | 44
4 | 2
to this table
ID | IntValue
3 | 44
2 | 24
1 | 3
4 | 2
How can I use the Linux sort command to do the operation? Or do you recommend another way?
As others have already pointed out, see man sort for the -k and -t command-line options, which select the field to sort by and the field delimiter.
sort also has a facility to help with huge files which potentially don't fit into RAM: the -m command-line option, which merges already-sorted files into one. (See merge sort for the concept.) The overall process is fairly straightforward:
1. Split the big file into small chunks, for example with the split tool's -l option:
split -l 1000000 huge-file small-chunk
2. Sort the smaller files:
for X in small-chunk*; do sort -t'|' -k2 -nr < "$X" > "sorted-$X"; done
3. Merge the sorted smaller files:
sort -t'|' -k2 -nr -m sorted-small-chunk* > sorted-huge-file
4. Clean up: rm small-chunk* sorted-small-chunk*
The only thing you have to take special care of is the column header.
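Putting the steps together (a sketch under assumptions: a '|'-delimited file named huge-file whose first line is the column header; the chunk size is illustrative):
head -n 1 huge-file > sorted-huge-file            # the header starts the result
tail -n +2 huge-file > body                       # set the header aside
split -l 1000000 body small-chunk                 # 1. split into chunks
for X in small-chunk*; do                         # 2. sort each chunk
    sort -t'|' -k2 -nr < "$X" > "sorted-$X"
done
sort -t'|' -k2 -nr -m sorted-small-chunk* >> sorted-huge-file   # 3. merge
rm body small-chunk* sorted-small-chunk*          # 4. clean up
Note that GNU sort performs an external merge sort on its own, spilling to temporary files as needed; its -S (memory buffer size) and -T (temporary directory) options tune this, so a single sort invocation may be all you need.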
How about:
sort -t' ' -k2 -nr < test.txt
where test.txt is:
$ cat test.txt
1 3
2 24
3 44
4 2
gives sorting in descending order (option -r)
$ sort -t' ' -k2 -nr < test.txt
3 44
2 24
1 3
4 2
while this sorts in ascending order (without option -r)
$ sort -t' ' -k2 -n < test.txt
4 2
1 3
2 24
3 44
in case you have duplicates
$ cat test.txt
1 3
2 24
3 44
4 2
4 2
use the uniq command like this
$ sort -t' ' -k2 -n < test.txt | uniq
4 2
1 3
2 24
3 44
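A caveat if you reach for sort -u instead of piping to uniq: when a key is given with -k, -u compares only the key, so rows that differ in column 1 but share column 2 would be collapsed to one, whereas sort | uniq removes only fully identical lines. For this test.txt both happen to give the same result:
$ sort -t' ' -k2 -n -u < test.txt
4 2
1 3
2 24
3 44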

How to extract one column from multiple files, and paste those columns into one file?

I want to extract the 5th column from multiple files, named in a numerical order, and paste those columns in sequence, side by side, into one output file.
The file names look like:
sample_problem1_part1.txt
sample_problem1_part2.txt
sample_problem2_part1.txt
sample_problem2_part2.txt
sample_problem3_part1.txt
sample_problem3_part2.txt
......
Each problem file (1,2,3...) has two parts (part1, part2). Each file has the same number of lines.
The content looks like:
sample_problem1_part1.txt
1 1 20 20 1
1 7 21 21 2
3 1 22 22 3
1 5 23 23 4
6 1 24 24 5
2 9 25 25 6
1 0 26 26 7
sample_problem1_part2.txt
1 1 88 88 8
1 1 89 89 9
2 1 90 90 10
1 3 91 91 11
1 1 92 92 12
7 1 93 93 13
1 5 94 94 14
sample_problem2_part1.txt
1 4 330 30 a
3 4 331 31 b
1 4 332 32 c
2 4 333 33 d
1 4 334 34 e
1 4 335 35 f
9 4 336 36 g
The output should look like this (in the sequence problem1_part1, problem1_part2, problem2_part1, problem2_part2, problem3_part1, problem3_part2, etc.):
1 8 a ...
2 9 b ...
3 10 c ...
4 11 d ...
5 12 e ...
6 13 f ...
7 14 g ...
I was using:
paste sample_problem1_part1.txt sample_problem1_part2.txt > \
sample_problem1_partall.txt
paste sample_problem2_part1.txt sample_problem2_part2.txt > \
sample_problem2_partall.txt
paste sample_problem3_part1.txt sample_problem3_part2.txt > \
sample_problem3_partall.txt
And then:
for i in `find . -name "sample_problem*_partall.txt"`
do
    l=`echo $i | sed 's/sample/extracted_col_/'`
    awk '{print $5, $10}' $i > $l
done
And:
paste extracted_col_problem1_partall.txt \
extracted_col_problem2_partall.txt \
extracted_col_problem3_partall.txt > \
extracted_col_problemall_partall.txt
It works fine with a few files, but it's a crazy method when the number of files is large (over 4000).
Could anyone help me with simpler solutions that are capable of dealing with multiple files, please?
Thanks!
Here's one way using awk and a sorted glob of files:
awk '{ a[FNR] = (a[FNR] ? a[FNR] FS : "") $5 } END { for(i=1;i<=FNR;i++) print a[i] }' $(ls -1v *)
Results:
1 8 a
2 9 b
3 10 c
4 11 d
5 12 e
6 13 f
7 14 g
Explanation:
For each line of each input file:
Add the file's line number (FNR) to an array, with column 5 as the value.
(a[FNR] ? a[FNR] FS : "") is a ternary expression, set up to build the array's value into a record. It asks whether the file's line number is already in the array: if so, take the existing value followed by the default field separator and append the fifth column; if not, don't prepend anything and just set it to the fifth column.
At the end of the script:
Use a C-style loop to iterate through the array, printing each of its values.
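A variant of the same one-liner that avoids parsing ls (a sketch; sort -V produces the same natural "version" ordering as ls -1v). One caveat: if the file list exceeds the argument-length limit, xargs runs awk more than once and the END block fires once per batch:
find . -name 'sample_problem*_part*.txt' | sort -V |
    xargs awk '{ a[FNR] = (a[FNR] ? a[FNR] FS : "") $5 }
               END { for (i = 1; i <= FNR; i++) print a[i] }'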
For only ~4000 files, you should be able to do:
find . -name 'sample_problem*_part*.txt' | xargs paste
If find is giving names in the wrong order, pipe it to sort:
find . -name 'sample_problem*_part*.txt' | sort ... | xargs paste
# print filenames in sorted order
find -name sample\*.txt | sort |
# extract 5-th column from each file and print it on a single line
xargs -n1 -I{} sh -c '{ cut -s -d " " -f 5 $0 | tr "\n" " "; echo; }' {} |
# transpose
python transpose.py '?'
where transpose.py:
#!/usr/bin/env python
"""Write lines from stdin as columns to stdout."""
import sys
from itertools import zip_longest  # izip_longest on Python 2

missing_value = sys.argv[1] if len(sys.argv) > 1 else '-'
for row in zip_longest(*[column.split() for column in sys.stdin],
                       fillvalue=missing_value):
    print(" ".join(row))
Output
1 8 a
2 9 b
3 10 c
4 11 d
5 ? e
6 ? f
? ? g
Assuming the first and second files have fewer lines than the third one (missing values are replaced by '?').
Try this one. My script assumes that every file has the same number of lines.
# get number of lines
lines=$(wc -l sample_problem1_part1.txt | cut -d' ' -f1)
for ((i=1; i<=$lines; i++)); do
    for file in sample_problem*; do
        # get line number $i and delete everything except the last column,
        # then print it; echo -n means that no newline is appended
        echo -n $(sed -n ${i}'s%.*\ %%p' $file)" "
    done
    echo
done
This works. For 4800 files, each 7 lines long, it took 2 minutes 57.865 seconds on an AMD Athlon(tm) X2 Dual Core Processor BE-2400.
PS: The time for my script increases linearly with the number of lines. It would take a very long time to merge files with 1000 lines. You should consider learning awk and using the script from steve. I tested it: for 4800 files, each with 1000 lines, it took only 65 seconds!
You can pass awk output to paste and redirect it to a new file as follows:
paste <(awk '{print $3}' file1) <(awk '{print $3}' file2) <(awk '{print $3}' file3) > file.txt

Sorting space delimited numbers with Linux/Bash

Is there a Linux utility or a Bash command I can use to sort a space delimited string of numbers?
Here's a simple example to get you going:
echo "81 4 6 12 3 0" | tr " " "\n" | sort -g
tr translates the spaces delimiting the numbers into newlines, because sort works on lines of text. The -g option tells sort to sort by "general numerical value".
See man sort for further details.
This is a variation on @JamesMorris's answer:
echo "81 4 6 12 3 0" | xargs -n1 | sort -g | xargs
Instead of tr, I use xargs -n1 to convert to new lines. The final xargs converts back to a space-separated sequence of numbers.
This is a variation on ghostdog74's answer that's too big to fit in a comment. It shows digits instead of names of numbers and both the original string and the result are in space-delimited strings (instead of an array which becomes a newline-delimited string).
$ s="3 2 11 15 8"
$ sorted=$(echo $(printf "%s\n" $s | sort -n))
$ echo $sorted
2 3 8 11 15
$ echo "$sorted"
2 3 8 11 15
If you didn't use the echo when setting sorted, the string would contain newlines. Echoing it without quotes would still put it all on one line, but echoing it with quotes would show each number on its own line. This is the case whether the original is an array or a string.
# demo
$ s="3 2 11 15 8"
$ sorted=$(printf "%s\n" $s | sort -n)
$ echo $sorted
2 3 8 11 15
$ echo "$sorted"
2
3
8
11
15
$ s=(one two three four)
$ sorted=$(printf "%s\n" "${s[@]}" | sort)
$ echo $sorted
four one three two
Using Bash parameter expansion (to replace spaces with newlines) we can do:
str="3 2 11 15 8"
sort -n <<< "${str// /$'\n'}"
# alternative
NL=$'\n'
str="3 2 11 15 8"
sort -n <<< "${str// /${NL}}"
If you actually have a space-delimited string of numbers, then one of the other answers provided would work fine. If your list is a bash array, then:
oldIFS="$IFS"
IFS=$'\n'
array=($(sort -g <<< "${array[*]}"))
IFS="$oldIFS"
might be a better solution. The newline delimiter would help if you want to generalize to sorting an array of strings instead of numbers.
Improving on Evan Krall's nice Bash "array sort" by limiting the scope of IFS to a single command:
printf "%q\n" "${IFS}"
array=(3 2 11 15 8)
array=($(IFS=$'\n' sort -n <<< "${array[*]}"))
echo "${array[#]}"
printf "%q\n" "${IFS}"
$ awk 'BEGIN{split(ARGV[1], numbers);for(i in numbers) {print numbers[i]} }' \
"6 7 4 1 2 3" | sort -n
I added this to my .zshrc (or .bashrc) file:
#sort a space-separated list of words (e.g. a list of HTML classes)
sortwords() {
echo $1 | xargs -n1 | sort -g | xargs
}
Call it from the terminal like this:
sortwords "banana date apple cherry"
# apple banana cherry date
Thanks to @FranMowinckel and others for inspiration.
