Read one file to search another file and print out missing lines - linux

I am following the example in this post finding contents of one file into another file in unix shell script but want to print out differently.
Basically file "a.txt", with the following lines:
alpha
0891234
beta
Now, the file "b.txt", with the lines:
Alpha
0808080
0891234
gamma
I would like the output of the command is:
alpha
beta
The first one is "incorrect case" and second one is "missing from b.txt". The 0808080 doesn't matter and it can be there.
This is different from using grep -f "a.txt" "b.txt" and print out 0891234 only.
Is there an elegant way to do this?
Thanks.

Use grep with following options:
grep -Fvf b.txt a.txt
The key is to use -v:
-v, --invert-match
Invert the sense of matching, to select non-matching lines.
When reading patterns from a file I recommend to use the -F option as long as you not explicitly want that patterns are treated as regular expressions.
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings (instead of regular expressions), separated by newlines, any of which
is to be matched.

Related

Bash script - Get part of a line of text from another file

I'm quite new to bash scripting. I have a script where I want to extract part of the value of a particular line in a separate config file and then use that value as a variable in the script.
For example:
Line 75 in a file named config.cfg
"ssl_cert_location=/etc/ssl/certs/thecert.cer"
I want just the value at the end of "thecert.cer" to then use in the script. I've tried awk and various uses of grep but I can't quite get just the name of the certificate.
Any help would be appreciated. Thanks
These are some examples of the commands I ran:
awk -F "/" '{print $4}' config.cfg
grep -o *.cer config.cfg
Is this possible to extract the value on that line and then edit the output so it just contains the name of the certificate file?
This is a pure Bash version of the basic functionality of basename:
cert=${line##*/}
which removes everything up to and including the final slash. It presupposes that you've already read the line.
Or, using sed:
cert=$(sed -n '75s/^.*\///p' filename)
or
cert=$(sed -n '/^ssl_cert_location=/s/^.*\///p' filename)
This gets the specified line based on the line number or the setting name and replaces everything up to and including the final slash with nothing. It ignores all other lines in the file (unless the setting is repeated in the case of the text match version). The text match version is better because it works no matter what line number the setting is on.
grep uses regular expressions (as does sed). The grep command in your command appears to have a glob expression which won't work. One way to use grep (GNU grep) is to use the PCRE feature (Perl Compatible Regular Expressions):
cert=$(grep -Po '^ssl_cert_location=.*/\K.*' filename)
This works similarly to the sed command.
I have anchored the regular expressions to the beginning of the line. If there may be leading white spaces (the line may be indented), change the regex so it looks something like this:
^[[:space:]]*ssl_cert_location=
which works for both indented and unindented lines.
There are many variants, but a simple one that comes to mind with grep is first getting the line, then matching only non-slashes at the end of the line:
<config.cfg grep '^ssl_cert_location=' | grep -o '[^/]*$'
Why didn't your grep command (grep -o *.cer config.cfg) work? Becasue *.cer is a shell glob pattern and will be expanded by the shell to matching file names, even before the grep process is even started. If there are no matching files, it will be passed verbatim, but * in regular expressions is a quantifier which needs a preceeding expression. . in regex is "match any single character". So what you wanted is probably grep -o '.*\.cer', but .* matches anything, including slashes.
An awk solution would look like the following:
awk -F/ '/^ssl_cert_location=/{print $NF}' config.cfg
It uses "/" as separator, finds only lines starting with "ssl_cert_location" and then prints the last (NF) field in from this line.
Or an equivalent sed solution, which matches the same line and then deletes everything including the last slash:
sed -n '/^ssl_cert_location=/s#^.*/##p' config.cfg
To store the output of any command in a variable, use command substitution:
var="$(command with arguments)"

Find the words do not appear in the dictionary in linux platform

Here's the text maybe have some words ,each line has one word.and I accept it as a command line arguments. For example the textile a.txt is like:
about
catb
west
eastren
And what I want to do is to find the words do not in the dictionary, if the words are dictionary words, delete it in the textfile.
I use the following commands:
word=$1
grep "$1$" /usr/share/dict/linux.words -q
for word in $(<a.txt)
do
if [ $word -eq 0 ]
then
sed '/$word/d'
fi
done
Nothing Happened.
grep alone is enough from what I understand
$ grep -xvFf /usr/share/dict/linux.words a.txt
catb
eastren
catb and eastren are words not found in /usr/share/dict/linux.words. The options used are
-x, --line-regexp
Select only those matches that exactly match the whole line. For a regular expression pattern, this is
like parenthesizing the pattern and then surrounding it with ^ and $.
-v, --invert-match
Invert the sense of matching, to select non-matching lines.
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings (instead of regular expressions), separated by newlines,
any of which is to be matched.
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. If this option is used multiple times or is combined with the
-e (--regexp) option, search for all patterns given. The empty file contains zero patterns, and
therefore matches nothing.
An alternative would be to use a spell checker like hunspell, so you can apply it to any text, not only a pre-formatted file with only one word per line. You can also specify several dictionaries, and it shows only words which are in none of them.
For example, after copy/pasting the content of this page into test.txt
lang1=en_US
lang2=fr_FR
hunspell -d $lang1,$lang2 -l test.txt | sort -u
produces a list of 46 words:
'
Apps
Arqade
catb
'cba'
Cersei
ceving
Drupal
eastren
...
WordPress
Worldbuilding
xvFf
xz
Yodeya

Filter out only matched values from a text file in each line

I have a file "test.txt" with the lines below and also lot bunch of extra stuff after the "version"
soainfra_metrics{metric_group="sca_composite",partition="test",is_active="true",state="on",is_default="true",composite="test123"} map:stats version:1.0
soainfra_metrics{metric_group="sca_composite",partition="gello",is_active="true",state="on",is_default="true",composite="test234"} map:stats version:1.8
soainfra_metrics{metric_group="sca_composite",partition="bolo",is_active="true",state="on",is_default="true",composite="3415"} map:stats version:3.1
soainfra_metrics{metric_group="sca_composite",partition="solo",is_active="true",state="on",is_default="true",composite="hji"} map:stats version:1.1
I tried:
egrep -r 'partition|is_active|state|is_default|composite' test.txt
It's displaying every line, but I need only specific mentioned fields like this below,ignoring rest of the data/stuff or lines
in a nut shell, i want to display only these fields from a line not the rest
partition="test",is_active="true",state="on",is_default="true",composite="test123"
partition="gello",is_active="true",state="on",is_default="true",composite="test234"
partition="bolo",is_active="true",state="on",is_default="true",composite="3415"
partition="solo",is_active="true",state="on",is_default="true",composite="hji"
If your version of grep supports Perl-style regular expressions, then I'd use this:
grep -oP '.*?,\K[^}]+' file
It removes everything up to the first comma (\K kills any previous output) and prints everything up to the }.
Alternatively, using awk:
awk -F'}' '{ sub(/[^,]+,/, ""); print $1 }' file
This sets the field separator to } so the part you're interested in is the first field. It then uses sub to remove the part up to the first comma.
For completeness, you could also use sed:
sed 's/[^,]*,\([^}]*\).*/\1/' file
This captures the part after the first , up to the } and replaces the content of the line with it.
After the grep to pick out the lines you want, use sed to edit the lines:
sed 's/.*\(partition[^}]*\)} map.*/\1/'
This means: "whenever you see anything .*, followed by partition and
any number of non-}, then } map and anything else, grab the part
from partition up to but not including the brace \(...\) as group 1.
The replacement text is just group 1 \1.
Use a pipe | to connect the output of egrep to the input of sed:
egrep ... | sed ...
As far as i understood your file might have more lines you don't want to see, so i would use:
sed -n 's/.*\(partition.*\)}.*/\1/p' file
we use -n p to show only lines where we made substitution. The substitution part just gets the part of the line you need substituting the whole line with the pattern.
This might work for you (GNU sed):
sed -r 's/(partition|is_active|state|is_default|composite)="[^"]*"/\n&\n/g;s/[^\n]*\n([^\n]*)\n[^\n]*/\1,/g;s/,$//' file
Treat the problem as if it were a "decomposed club sandwich". Identify the fillings, remove the bread and tidy up.

grep based on blacklist -- without procedural code?

It's a well-known task, simple to describe:
Given a text file foo.txt, and a blacklist file of exclusion strings, one per line, produce foo_filtered.txt that has only the lines of foo.txt that do not contain any exclusion string.
A common application is filtering compiler warnings from a build log, but to ignore warnings on files that are not yours. The file foo.txt is the warnings file (itself filtered from the build log), and a blacklist file excluded_filenames.txt with file names, one per line.
I know how it's done in procedural languages like Perl or AWK, and I've even done it with combinations of Linux commands such as cut, comm, and sort.
But I feel that I should be really close with xargs, and just can't see the last step.
I know that if excluded_filenames.txt has only 1 file name in it, then
grep -v foo.txt `cat excluded_filenames.txt`
will do it.
And I know that I can get the filenames one per line with
xargs -L1 -a excluded_filenames.txt
So how do I combine those two into a single solution, without explicit loops in a procedural language?
Looking for the simple and elegant solution.
You should use the -f option (or you can use fgrep which is the same):
grep -vf excluded_filenames.txt foo.txt
You could also use -F which is more directly the answer to what you asked:
grep -vF "`cat excluded_filenames.txt`" foo.txt
from man grep
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains zero patterns, and therefore matches nothing.
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched.

Is there a way to put the following logic into a grep command?

For example suppose I have the following piece of data
ABC,3,4
,,ExtraInfo
,,MoreInfo
XYZ,6,7
,,XyzInfo
,,MoreXyz
ABC,1,2
,,ABCInfo
,,MoreABC
It's trivial to get grep to extract the ABC lines. However if I want to also grab the following lines to produce this output
ABC,3,4
,,ExtraInfo
,,MoreInfo
ABC,1,2
,,ABCInfo
,,MoreABC
Can this be done using grep and standard shell scripting?
Edit: Just to clarify there could be a variable number of lines in between. The logic would be to keep printing while the first column of the CSV is empty.
grep -A 2 {Your regex} will output the two lines following the found strings.
Update:
Since you specified that it could be any number of lines, this would not be possible as grep focuses on matching on a single line see the following questions:
How can I search for a multiline pattern in a file?
Regex (grep) for multi-line search needed
Why can't i match the pattern in this case?
Selecting text spanning multiple lines using grep and regular expressions
You can use this, although a bit hackity due to the grep at the end of the pipeline to mute out anything that does not start with 'A' or ',':
$ sed -n '/^ABC/,/^[^,]/p' yourfile.txt| grep -v '^[^A,]'
Edit: A less hackity way is to use awk:
$ awk '/^ABC/ { want = 1 } !/^ABC/ && !/^,/ { want = 0 } { if (want) print }' f.txt
You can understand what it does if you read out loud the pattern and the thing in the braces.
The manpage has explanations for the options, of which you want to look at -A under Context Line Control.

Resources