How to get grep -m1 to work in OSX - linux

I have a script which I used perfectly fine in Linux, but now that I've switched over to Mac, the script still runs but has slightly different behavior.
This is a script for tallying student attendance at departmental functions. We use a portable barcode scanner to scan their IDs, and then save all scans in one csv file per date.
I used grep -m1 $ID csvfolder/* | wc -l in the past to get a count of how many files their ID shows up in. The -m1 is necessary to make sure they don't get "extra credit" for repeatedly scanning in at the same event.
However, when I use this same command on the Mac, grep exits after it has found the first match in the first file. So if the student shows up in 4 files, wc -l still returns 1.
How can I (without installing the GNU versions) emulate this feature?

I don't have Mac OS X handy to test with, but the following is POSIX-standard, AFAIK:
grep -l "$ID" csvfolder/* | wc -l
grep -l prints the name of each file that contains a match, and it works the same with GNU grep.

You could alternatively use awk for this task (a plain exit after the first match would stop awk entirely, so a per-file flag is used instead to count each file at most once; this stays within POSIX awk):
awk -v id="$ID" 'FNR == 1 { seen = 0 } $0 ~ id && !seen { seen = 1; n++ } END { print n + 0 }' csvfolder/*
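For context, here is one way this per-student tally might be wired into the attendance script. This is only a sketch; ids.txt (one student ID per line) and the attendance_totals.csv output name are assumptions, not part of the original question:
# ids.txt is hypothetical: one student ID per line.
while IFS= read -r ID; do
    count=$(grep -l "$ID" csvfolder/* | wc -l)
    printf '%s,%s\n' "$ID" "$count"
done < ids.txt > attendance_totals.csv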

Related

Loop to filter out lines from apache log files

I have several apache access files that I would like to clean up a bit before I analyze them. I am trying to use grep in the following way:
grep -v term_to_grep apache_access_log
I have several terms that I want to grep, so I am piping every grep action as follows:
grep -v term_to_grep_1 apache_access_log | grep -v term_to_grep_2 | grep -v term_to_grep_3 | grep -v term_to_grep_n > apache_access_log_cleaned
Up to here my rudimentary script works as expected! But I have many Apache access logs, and I don't want to do that for every file. I have started to write a bash script, but so far I couldn't make it work. This is my attempt:
for logs in ./access_logs/*;
do
cat $logs | grep -v term_to_grep | grep -v term_to_grep_2 | grep -v term_to_grep_3 | grep -v term_to_grep_n > $logs_clean
done;
Could anyone point out what I am doing wrong?
If you have a variable and you append _clean to its name, that's a new variable, and not the value of the old one with _clean appended. To fix that, use curly braces:
$ var=file.log
$ echo "<$var>"
<file.log>
$ echo "<$var_clean>"
<>
$ echo "<${var}_clean>"
<file.log_clean>
Without the braces, your pipeline tries to redirect to the empty string, which results in an error. Note that "$logs"_clean would also work.
As for your pipeline, you could combine that into a single grep command:
grep -Ev 'term_to_grep|term_to_grep_2|term_to_grep_3|term_to_grep_n' "$logs" > "${logs}_clean"
No cat needed, only a single invocation of grep.
Or you could stick all your terms into a file:
$ cat excludes
term_to_grep_1
term_to_grep_2
term_to_grep_3
term_to_grep_n
and then use the -f option:
grep -vf excludes "$logs" > "${logs}_clean"
If your terms are strings and not regular expressions, you might be able to speed this up by using -F ("fixed strings"):
grep -vFf excludes "$logs" > "${logs}_clean"
I think GNU grep checks that for you on its own, though.
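Putting the pieces together, a corrected version of the original loop might look like the following (only a sketch; the excludes file is the one shown above):
# Filter each log with the excludes file and write the result
# next to the original with a _clean suffix.
for logs in ./access_logs/*; do
    grep -vFf excludes "$logs" > "${logs}_clean"
done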
You are looping over several files, but in your loop you constantly overwrite your result file, so it will only contain the last result from the last file.
You don't need a loop; you can let grep read all the files at once and redirect everything to a single output file (with multiple input files grep prefixes each line with its filename; add -h if you don't want that):
egrep -v 'term_to_grep|term_to_grep_2|term_to_grep_3' ./access_logs/* > all_access_logs_clean
Note: it is always helpful to start a Bash script with set -eEuCo pipefail. This catches many common errors; with -u, for example, it would have stopped with an "unbound variable" error at the undefined $logs_clean.

Show not logged users processes linux bash script

I am writing a bash script and I am trying to show the processes of users who are not logged in, which are typically daemon processes. For this, the exercise recommends:
To process the command line, we will use the cut command, which allows
selecting the different columns of the list through a filter.
I used:
ps -A | grep -v w
ps -A | grep -v who
ps -A | grep -v $USER
but with all of these options, the processes of all users are printed to the output file, and I only want the processes of users who are not logged in.
I appreciate your help
Thank you.
grep -v w will remove lines matching the regular expression w (which is simply anything that contains the letter w). To use the output of the command w, you have to actually run it; and as hinted in the instructions, you will also need cut to post-process that output.
So as not to give the answer away completely, here's rough pseudocode.
w | cut something >tempfile
ps -A | grep -Fvf tempfile
It would be nice if you could pass the post-processed results of w in a pipe, but standard input is already tied to ps -A. If you have a shell which supports process substitution, you can use that.
ps -A | grep -Fvf <(w | cut something)
Unfortunately, the output from w is not properly machine-readable -- you will probably want to cut out the header line(s), too. (On my machine, there are two header lines. Yours might differ.) You'll probably learn a bit of Awk later on in the course, but until then, maybe something like
ps -A | grep -Fvf <(w | tail -n +3 | cut something)
This still doesn't completely handle all possible situations. What if someone's account name is grep?
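For what it's worth, one concrete shape of the above might be the following. It assumes the username is the first space-delimited field of w's output and that there are two header lines; both assumptions may differ on your system:
# List processes whose lines mention none of the currently logged-in users.
# The header count and field position are guesses; adjust for your w output.
ps -A -o user= -o pid= -o comm= | grep -Fvf <(w | tail -n +3 | cut -d ' ' -f 1 | sort -u)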

Optimizing search in linux

I have a huge log file close to 3GB in size.
My task is to generate some reporting based on # of times something is being logged.
I need to find the number of times StringA, StringB, and StringC are logged, each counted separately.
What I am doing right now is:
grep "StringA" server.log | wc -l
grep "StringB" server.log | wc -l
grep "StringC" server.log | wc -l
This is a long process and my script takes close to 10 minutes to complete. What I want to know is whether this can be optimized or not. Is it possible to run one grep command and find out the number of times StringA, StringB, and StringC have been logged individually?
You can use grep -c instead of wc -l:
grep -c "StringA" server.log
grep can't report counts for several different strings in a single run. You can use awk:
out=$(awk '/StringA/{a++;} /StringB/{b++;} /StringC/{c++;} END{print a, b, c}' server.log)
Then you can extract each count with a simple bash array:
arr=($out)
echo "StringA="${arr[0]}
echo "StringA="${arr[1]}
echo "StringA="${arr[2]}
Using grep -c (and dropping wc) is certainly going to be faster, and the awk solution may well be faster still, but I haven't measured either.
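If you prefer named variables over an array, read can split the same awk output (a small variation, assuming $out is set as above):
# Counts may be empty if a string never appears, just as with the array version.
read -r a b c <<< "$out"
echo "StringA=$a"
echo "StringB=$b"
echo "StringC=$c"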
Certainly this approach could be optimized, since grep doesn't perform any text indexing. I would use a text indexing engine, like one of those from this review or this Stack Exchange Q&A. You may also consider using journald from systemd, which stores logs in a structured and indexed format, so lookups are more efficient.
So many greps so little time... :-)
According to David Lyness, a straight grep search is about 7 times as fast as awk for large-file searches.
If that is the case, the current approach could be optimized by changing grep to fgrep, but only if the patterns being searched for are not regular expressions. fgrep is optimized for fixed patterns.
If the number of instances is relatively small compared to the original log file entries, it may be an improvement to use the egrep version of grep to create a temporary file filled with all three instances:
egrep "StringA|StringB|StringC" server.log > tmp.log
grep "StringA" tmp.log | wc -c
grep "StringB" tmp.log | wc -c
grep "StringC" tmp.log | wc -c
The egrep variant of grep allows a | (vertical bar/pipe) character between two or more separate search strings, so that you can find multiple strings in one statement. You can use grep -E to do the same thing.
Full documentation is in the grep man page, and information about the Extended Regular Expressions that egrep uses is in man 7 re_format.
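The two suggestions can also be combined; a rough sketch, reusing tmp.log and the string names from above:
# One pre-filter pass over the big log, then cheap per-string counts on the smaller file.
egrep "StringA|StringB|StringC" server.log > tmp.log
for s in StringA StringB StringC; do
    printf '%s=%d\n' "$s" "$(grep -c "$s" tmp.log)"
done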

Bash Local vs remote directory comparison

I'm trying to compare a local vs a remote directory and identify files which are either not present on the remote directory or different by checksum.
The goal is for the script to return a list of files to iterate through. So far I have the following, but it's not the best.
rsync -avnc /path/to/files remoteuser@remoteserver:/path/to/files/ | grep -v "sending incremental file list" | grep -v "bytes received" | grep -v "total size is" | grep -v "./"
I've just used piped grep -v calls to remove the bits I don't care about. Is there a better way to compare a local and a remote directory over SSH? It seems like there should be. The important constraint is that I have to compare directories across two separate machines.
comm -3 <(ls -l /path/to/files | awk '{print $5"\t"$9}' | sort) <(ssh remoteuser@remoteserver ls -l /path/to/files | awk '{print $5"\t"$9}' | sort)
$5 is the size and $9 is the filename; comm -3 then prints the files that differ, with local-only files in the first column and remote-only files in the second.
I would do so using a matching pair of find calls and a call to comm.
# comm -3 produces two-column output, skipping lines in common.
comm -3 <(find $LOCALDIR | sort) <(ssh remote@host find $REMOTEDIR | sort)
If you write your local and remote output to temporary files, you can easily print a list of missing files on either system; with a little cleverness in your find commands, you could likely compare file checksums between the two systems.
Note that this solution uses line-based text comparison and thus is not immune to bizarre filenames. You may need to investigate a more-clever solution (probably involving find ... -print0) if you need to handle filenames with newlines or other special characters.
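As a rough sketch of that checksum variant (it assumes md5sum is available on both machines and compares both trees relative to their own roots):
# A line differs when a file is missing on one side or its checksum differs.
comm -3 <(cd "$LOCALDIR" && find . -type f -exec md5sum {} + | sort -k 2) \
        <(ssh remote@host "cd '$REMOTEDIR' && find . -type f -exec md5sum {} + | sort -k 2")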

How do I grep multiple lines (output from another command) at the same time?

I have a Linux driver running in the background that is able to return the current system data/stats. I view the data by running a console utility (let's call it dump-data) in a console. All data is dumped every time I run dump-data. The output of the utility is like below
Output:
- A=reading1
- B=reading2
- C=reading3
- D=reading4
- E=reading5
...
- variableX=readingX
...
The list of readings returned by the utility can be really long. Depending on the scenario, certain readings would be useful while everything else would be useless.
I need a way to grep only the useful readings, whose names might have nothing in common (via a bash script). I.e. sometimes I'll need to collect A, D, E; and other times I'll need C, D, E.
I'm attempting to graph the readings over time to look for trends, so I can't run something like this:
# forgive my pseudocode
Loop
dump-data | grep A
dump-data | grep D
dump-data | grep E
End Loop
to collect A, D, and E, because that would actually give me readings from 3 separate calls of dump-data, which would not be accurate.
If you want to save all of the grep results in the same file, you can just join all the expressions into one:
grep -E 'expr1|expr2|expr3'
But if you want the results (for expr1, expr2 and expr3) in separate files, things get more interesting.
You can do this using tee >(command).
For example, here I process the same pipe with three different commands:
$ echo abc | tee >(sed s/a/_a_/ > file1) | tee >(sed s/b/_b_/ > file2) | sed s/c/_c_/ > file3
$ grep "" file[123]
file1:_a_bc
file2:a_b_c
file3:ab_c_
But the command seems to be too complex.
I would better save dump-data results to a file and then grep it.
TEMP=$(mktemp /tmp/dump-data-XXXXXXXX)
dump-data > "${TEMP}"
grep A "${TEMP}"
grep B "${TEMP}"
grep C "${TEMP}"
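As a usage sketch tied to the graphing goal from the question (the A/D/E selection and the readings.csv name are assumptions; the "- A=reading1" line format is taken from the question's sample output), one dump can be turned into a timestamped CSV row like this:
# One dump per sample; each selected reading is cut out of the same temporary file.
TEMP=$(mktemp /tmp/dump-data-XXXXXXXX)
dump-data > "${TEMP}"
printf '%s,%s,%s,%s\n' "$(date +%s)" \
    "$(grep '^- A=' "${TEMP}" | head -n 1 | cut -d= -f2)" \
    "$(grep '^- D=' "${TEMP}" | head -n 1 | cut -d= -f2)" \
    "$(grep '^- E=' "${TEMP}" | head -n 1 | cut -d= -f2)" >> readings.csv
rm -f "${TEMP}"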
You can use dump-data | grep -E "A|D|E". Note the -E option of grep. Alternatively you could use egrep without the -E option.
You can simply use:
dump-data | grep -E 'A|D|E'
If you also need the matches from several input files split into per-file output files, awk can do that:
awk '/MY PATTERN/{print > "matches-"FILENAME;}' myfile{1,3}
thx Guru at Stack Exchange
