Find lines common to several files - linux

I'm trying to determine which header declares a specific function. I've used grep to find instances of the function's use; now, I want to find which header is included by all the files. I'm aware of the comm utility; however, it can only compare two sorted files. Is there a Unix utility that can find the common lines between an arbitrary number of unsorted files, or must I write my own?

cat *.c | sort | uniq -c | grep -e '^ *COUNT #include'
where COUNT is the number of files passed to cat. While playing around, I used this variant to see which headers I #include at least 10 times:
cat *.c | sort | uniq -c | grep -e '^ *[0-9][0-9]\+ #include'
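If you'd rather not hardcode COUNT, a hedged variant (assuming no filename contains a newline and no single file repeats the same #include) computes it from the file list first:
count=$(ls *.c | wc -l)    # number of files being scanned
cat *.c | sort | uniq -c | grep -E "^ *${count} #include"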

Related

Optimizing search in linux

I have a huge log file close to 3GB in size.
My task is to generate some reporting based on # of times something is being logged.
I need to find the number of times StringA, StringB, and StringC are logged, each counted separately.
What I am doing right now is:
grep "StringA" server.log | wc -l
grep "StringB" server.log | wc -l
grep "StringC" server.log | wc -l
This is a long process and my script takes close to 10 minutes to complete. What I want to know is whether this can be optimized. Is it possible to run one grep command and find out the number of times StringA, StringB, and StringC have been logged individually?
You can use grep -c instead of wc -l:
grep -c "StringA" server.log
grep can't report per-string counts in a single pass. You can use awk:
out=$(awk '/StringA/{a++;} /StringB/{b++;} /StringC/{c++;} END{print a, b, c}' server.log)
Then you can extract each count with a simple bash array:
arr=($out)
echo "StringA="${arr[0]}
echo "StringA="${arr[1]}
echo "StringA="${arr[2]}
Using grep -c (without piping to wc) is certainly going to be faster, and the awk solution may be faster still, but I haven't measured either.
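As a hedged alternative (the label formatting here is my own), awk can print the labelled counts directly and skip the bash array:
awk '/StringA/{a++} /StringB/{b++} /StringC/{c++}
     END{printf "StringA=%d\nStringB=%d\nStringC=%d\n", a, b, c}' server.log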
This approach could certainly be optimized, since grep doesn't perform any text indexing. I would use a text indexing engine like one of those from this review or this stackexchange QA. You may also consider using journald from systemd, which stores logs in a structured and indexed format so that lookups are more efficient.
So many greps so little time... :-)
According to David Lyness, a straight grep search is about 7 times as fast as an awk in large file searches.
If that is the case, the current approach could be optimized by changing grep to fgrep, but only if the patterns being searched for are not regular expressions. fgrep is optimized for fixed patterns.
If the number of instances is relatively small compared to the original log file entries, it may be an improvement to use the egrep version of grep to create a temporary file filled with all three instances:
egrep "StringA|StringB|StringC" server.log > tmp.log
grep "StringA" tmp.log | wc -c
grep "StringB" tmp.log | wc -c
grep "StringC" tmp.log | wc -c
The egrep variant of grep allows a | (vertical bar/pipe) character to be used between two or more separate search strings, so that you can find multiple strings in a single statement. You can use grep -E to do the same thing.
Full documentation is in the man grep page; information about the extended regular expressions that egrep uses is in man 7 re_format.
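A hedged sketch that combines the temporary-file idea with the grep -c suggestion from the earlier answer (StringA/StringB/StringC are placeholders):
grep -E "StringA|StringB|StringC" server.log > tmp.log
grep -c "StringA" tmp.log
grep -c "StringB" tmp.log
grep -c "StringC" tmp.log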

How to find files with same name part in directory using the diff command?

I have two directories with files in them. Directory A contains a list of photos with numbered endings (e.g. janet1.jpg, laura2.jpg) and directory B has the same files but with different numbered endings (e.g. janet41.jpg, laura33.jpg). How do I find the files in one directory that have no corresponding file in the other, ignoring the numbered endings? For example, there is a rachael3 in directory A but no rachael\d in directory B. I think there's a way to do it with the diff command in bash, but I do not see an obvious way.
I can't see a way to use diff for this directly. It will probably be easier to use a sums tool (md5, sha1, etc.) on both directories and then sort both files based on the first (sum) column and diff/compare those output files.
Alternatively, something like findimagedupes (which isn't as simple a comparison as diff or a sums check) might be a simpler (and possibly more useful) solution.
It seems you know that your files are the same if they exist, and that there is only one of a kind per directory.
So to diff the contents of the directory according to this, you need to get only the relevant parts of the file name ("laura", "janet").
This could be done by simply grepping the appropriate parts from the output of ls, like this:
ls dir1/ | egrep -o '^[a-zA-Z]+'
Then to compare, let's say dir1 and dir2, you can use:
diff <(ls dir1/ | egrep -o '^[a-zA-Z]+') <(ls dir2/ | egrep -o '^[a-zA-Z]+')
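A hedged alternative, assuming the name part is purely alphabetic: feed the sorted, de-duplicated prefixes to comm so that only names present in exactly one directory are printed:
comm -3 <(ls dir1/ | grep -oE '^[a-zA-Z]+' | sort -u) \
        <(ls dir2/ | grep -oE '^[a-zA-Z]+' | sort -u)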
Assuming the files are simply renamed and otherwise identical, a simple solution to find the missing ones is to use md5sum (or sha or somesuch) and uniq:
#!/bin/bash
md5sum A/*.jpg B/*.jpg >index
awk '{print $1}' <index | sort >sums # delete dir/file
# list unique files (missing from one directory)
uniq -u sums | while read s; do
grep "$s" index | sed 's/^[a-z0-9]\{32\} //'
done
This fails in the case where a folder contains several copies of the same file renamed (such that the hash matches multiple files in one folder), but that is easily fixed:
#!/bin/bash
md5sum A/*.jpg B/*.jpg > index
sed 's/\/.*//' <index | sort >sums # just delete /file
# list unique files (missing from one directory)
uniq sums | awk '{print $1}' |\
uniq -u | while read s junk; do
grep "$s" index | sed 's/^[a-z0-9]\{32\} //'
done

diff command to get number of different lines only

Can I use the diff command to find out how many lines two files differ in?
I don't want the contextual difference, just the total number of lines that are different between two files. Best if the result is just a single integer.
diff can do all the first part of the job but no counting; wc -l does the rest:
diff -y --suppress-common-lines file1 file2 | wc -l
Yes you can, and in true Linux fashion you can use a number of commands piped together to perform the task.
First you need to use the diff command, to get the differences in the files.
diff file1 file2
This will give you a list of changes. The ones you're interested in are the lines prefixed with a '>' symbol (the lines that come from file2).
You use the grep tool to filter these out as follows
diff file1 file2 | grep "^>"
Finally, once you have the list of changes you're interested in, use the wc command in line-counting mode (wc -l) to count them.
diff file1 file2 | grep "^>" | wc -l
and you have a perfect example of the philosophy that Linux is all about.
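Note that grep "^>" only counts the lines coming from file2; a hedged variant that counts differing lines from either file is:
diff file1 file2 | grep -c "^[<>]"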

Bash Local vs remote directory comparison

I'm trying to compare a local vs a remote directory and identify files which are either not present on the remote directory or different by checksum.
The goal is for the script to return a list of files to iterate through. So far I have the following, but it's not the best.
rsync -avnc /path/to/files remoteuser@remoteserver:/path/to/files/ | grep -v "sending incremental file list" | grep -v "bytes received" | grep -v "total size is" | grep -v "./"
I've just used piped grep -v calls to remove the bits I don't care about. Is there a better way to compare a local and remote directory using SSH? It seems like there should be. The important constraint is that I have to compare directories across two separate machines.
comm -3 <(ls -l /path/to/files | awk '{print $5"\t"$9}' | sort) <(ssh remoteuser@remoteserver ls -l /path/to/files | awk '{print $5"\t"$9}' | sort)
$5 is the size
$9 is the filename
comm -3 then prints the entries unique to either side: local-only files in the first column, files that exist only on the remote server in the second.
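If you only want the entries that exist only on the remote server, a hedged refinement is comm -13, which suppresses columns 1 and 3:
comm -13 <(ls -l /path/to/files | awk '{print $5"\t"$9}' | sort) \
         <(ssh remoteuser@remoteserver ls -l /path/to/files | awk '{print $5"\t"$9}' | sort)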
I would do so using a matching pair of find calls and a call to comm.
# comm -3 produces two-column output, skipping lines in common.
comm -3 <(find $LOCALDIR | sort) <(ssh remote@host find $REMOTEDIR | sort)
If you write your local and remote output to temporary files, you can easily print a list of missing files on either system; with a little cleverness in your find commands, you could likely compare file checksums between the two systems.
Note that this solution uses line-based text comparison and thus is not immune to bizarre filenames. You may need to investigate a more-clever solution (probably involving find ... -print0) if you need to handle filenames with newlines or other special characters.
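Along the lines of the checksum idea above, a hedged sketch (host and paths are placeholders) that flags a file whenever it is missing on one side or its checksum differs:
comm -3 <(cd /path/to/files && find . -type f -exec md5sum {} + | sort) \
        <(ssh remoteuser@remoteserver 'cd /path/to/files && find . -type f -exec md5sum {} + | sort')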

Extract and count value from standard .gz log files on an hourly basis

I'm trying to count the number of occurrences of a particular string from a bunch of .gz logfiles on an hourly basis. Each logfile statement starts with the following time format:
2013-11-21;09:07:23.433.
For example, to be more clear, find the count of occurrences of string "abc" between 8am to 9am, then 9am to 10am and so on. Any ideas on how to do it?
Since you just want to count occurrences, you can simply zcat the contents of the files, grep for the word you're looking for, extract the date/hour portion, and finally sort and count (sort | uniq -c) the entries. The following would probably suffice:
zcat *.gz | grep <word> | grep -oP "^\d{4}-\d{2}-\d{2};\d{2}" | sort | uniq -c
The above command finds the lines in your logfiles that contain the <word> you're looking for, extracts the date and hour from those entries, and then counts the occurrences per hour.
In case you don't want to take into account days/months/years, you may use:
zcat *.gz | grep <word> | grep -oP "^\d{4}-\d{2}-\d{2};\K\d{2}" | sort | uniq -c
The \K in the grep expression resets the start of the reported match, acting as a variable-length look-behind in PCRE (Perl Compatible Regular Expressions).
Try this:
zgrep -c '2013-11-21;0[89]:.*abc' file.gz
Note that 0[89] matches the 08 and 09 hours combined; use a single hour, e.g. ;08:, to count one hour at a time.
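To get one count per hour, as the question asks, a hedged sketch loops over the hours (the date, hour list, and search string are placeholders):
for h in 08 09 10; do
    printf '%s: %s\n' "$h" "$(zgrep -c "2013-11-21;${h}:.*abc" file.gz)"
done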
Or awk (gawk in linux) will work:
zcat *.gz | awk -F'[\.;:]' '/abc/{arr[$2]++} END{for(i in arr){print i, arr[i]} }' 2>/dev/null
the stderr redirection is there because some awks, notably gawk, warn that the \. escape is unnecessary (the dot is not a metacharacter inside a bracket expression)
