grep with variable numbers of whitespaces - linux

I want to grep for a string that I know exists in my files. However, the source it comes from changed the amount of whitespace, so the content of the string is identical but its length differs, and an ordinary grep does not find it. Is there a way to adjust for this?
I don't see any system behind the additional whitespace.
Here's the original string
4FD0-A tr|A5ZLA0|A5ZLA0_9BACE Bacterial
and here's the modified string
4FD0-A tr|A5ZLA0|A5ZLA0_9BACE Bacterial

I believe that egrep (that is, grep -E) would be your friend here. Try the following command:
egrep '4FD0-A\s+tr[|]A5ZLA0[|]A5ZLA0_9BACE\s+Bacterial' filename
I used a rather simple pattern for my example. Feel free to change it to suit.

You can use the "one or more" + operator. For example:
grep '^4FD0-A\s\+tr|' myfile
The \s matches any whitespace character (it is a GNU extension; [[:space:]] is the POSIX equivalent). If you want to limit it to spaces only, use a single space in place of the \s.
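As a quick check of the patterns above, here is a minimal sketch that runs the match against a small sample file. The file contents and the variable names are made up for illustration, and the pattern uses the POSIX class [[:space:]] instead of \s on the assumption that portability beyond GNU grep matters.

```shell
# Hypothetical sample data: the same record with different spacing,
# plus one record that should not match.
tmp=$(mktemp)
printf '%s\n' \
  '4FD0-A tr|A5ZLA0|A5ZLA0_9BACE Bacterial' \
  '4FD0-A    tr|A5ZLA0|A5ZLA0_9BACE   Bacterial' \
  '4FD0-B tr|A5ZLA0|A5ZLA0_9BACE Bacterial' > "$tmp"

# [[:space:]]+ means "one or more whitespace characters";
# [|] matches a literal pipe in an extended regex.
match_count=$(grep -cE '4FD0-A[[:space:]]+tr[|]A5ZLA0[|]A5ZLA0_9BACE[[:space:]]+Bacterial' "$tmp")
echo "$match_count"
rm -f "$tmp"
```

Both spacing variants of the 4FD0-A record are counted, while the 4FD0-B record is not.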

You can use all kinds of utilities to squeeze repeated spaces before matching.
Try this:
tr -s ' ' < file | grep '4FD0-A tr|A5ZLA0|A5ZLA0_9BACE Bacterial'

Related

I am trying to replace text in gvim. For example, I want to change

"word" -nothing
to
word" - nothing
I tried
:%s/^.*\"/
But what I get is: -nothing
I am new to scripting, so I would like to know whether this can be done some other way, for example with gvim, awk or sed.
In vim: match \(word character + quote + space + hyphen\) as the first group, followed directly by another \(word character\) as the second group, and replace with the first group + space + the second group. The g suffix makes sure the find/replace can happen multiple times on a line.
:%s/\(\w" -\)\(\w\)/\1 \2/g
Note that I left out the leading quote: I suppose you might have spaces in the quoted text, and I think this form works better for that case. Now for sed. The really nice thing about the *nix tools is that they all use similar (or the same) regular expression language, so the exact same pattern can be used in sed (using : as the delimiter for clarity).
sed 's:\(\w" -\)\(\w\):\1 \2:g'
POSIX awk has no backreferences in sub() and gsub() (GNU awk's gensub() does support them), so it can be done, but it is not as convenient.
Could you please try the following and let me know if it helps.
awk '{sub(/^"/,"");sub(/-/,"- ")} 1' Input_file
2nd solution: with sed.
sed 's/^"//;s/-/- /' Input_file
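To see the two steps (drop the leading quote, then put a space after the hyphen) on the sample line from the question, here is a minimal sketch. The function name fix_line is made up for illustration, and the second expression only touches a hyphen that directly follows a quote and a space, which is an assumption about the input rather than something the question guarantees.

```shell
# Reads lines on stdin; removes a leading double quote, then turns
# '" -x' into '" - x' using a backreference to the character after the hyphen.
fix_line() {
  sed -e 's/^"//' -e 's/" -\([^ ]\)/" - \1/g'
}

result=$(printf '%s\n' '"word" -nothing' | fix_line)
printf '%s\n' "$result"
```

Anchoring on the quote before the hyphen avoids changing hyphens inside the quoted word itself.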
Since you also tagged grep: GNU grep has the -P switch for PCRE (Perl-compatible regular expressions), which supports \K: whatever was matched to the left of \K is kept out of the reported match ($& in Perl terms), so:
$ echo \"word\" | grep -oP "\"\Kword\""
word"
If I understand your question correctly, you want to replace the first " in each line with an empty string. In sed that is just:
sed 's/"//'
Without g flag it will replace only first occurrence in each line.
EDIT:
The same way it will work in Vim (unless you have 'gdefault' option set), so in Vim you can:
:%s/"//
Try this (in Vim's default "magic" mode, capture groups are written \( and \)):
:%s/"\(.*\)"/\1"/gc

Embedding quotation marks in command string generated by AWK?

I need to match all instances of strings in one file, with a master list in another. However, if my string is abc I want only that, not abcdef, abc1234 and so on.
So, a word boundary for the regex? Right now, I'm using a simple awk one liner:
cat results_file| sort -k 1| awk -F" " '{ print $1" /home/owner/file_2_search"}'|
xargs -L 1 /bin/grep -i
However, to force a word boundary, I'd need to grep string\b and the quotes (single or double) seem to be required.
In awk, \b is a special character, you need \\b ... And the quoted quotes ... (arg) ... Or am I missing something and overdoing this?
This is a Linux box, so presumably gawk. I have gone over quoting rules for awk, and realize this has got to be simple (and not complex ... but), but am not seeing it.
Had meant to post as an answer, not a comment. Will try to pose a more readable question, but confess to having second thoughts about doing this as a one-liner in the first place -- may be best to follow an alternate method. Appreciate the willingness to help.
--Joe
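One alternative that sidesteps the quoting problem entirely is to let grep read the search strings from a file and match whole words with -w, rather than building one grep command per string. The following is a minimal sketch under that assumption; the directory, the sample data and the patterns file are made up for illustration, and -F is added on the assumption that the strings are fixed text rather than regular expressions.

```shell
# Hypothetical sample data standing in for results_file and the file to search.
workdir=$(mktemp -d)
printf '%s\n' 'abc 10' 'xyz 20' > "$workdir/results_file"
printf '%s\n' 'ABC is here' 'abcdef is not' 'abc1234 is not' 'an xyz entry' \
  > "$workdir/file_2_search"

# First column of the results becomes the list of search strings.
awk '{ print $1 }' "$workdir/results_file" | sort -u > "$workdir/patterns"

# -i ignore case, -w whole words only, -F fixed strings, -f read patterns from a file.
matches=$(grep -i -w -F -f "$workdir/patterns" "$workdir/file_2_search")
printf '%s\n' "$matches"
rm -rf "$workdir"
```

With -w, a match must be bounded by non-word characters (anything other than letters, digits and underscore), so abc is found but abcdef and abc1234 are not.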

How can I get substring from a string in linux?

I am trying to extract a specific string from a string in linux.
For example, I want to extract 'android.content.pm.PackageParser.parseBaseApplication' from the below string.
The string has a regular format, and only the part within the parentheses changes.
Join point 'method-execution(boolean android.content.pm.PackageParser.parseBaseApplication(android.content.pm.PackageParser$Package, android.content.res.Resources, org.xmlpull.v1.XmlPullParser, android.util.AttributeSet, int, java.lang.String[]))' in Type
However, I am having trouble finding a proper approach to do this.
At first I tried the sed command, but it was too complicated and I couldn't complete the work.
Could you recommend any other approach?
Thanks a lot.
If the string of interest is always the second word after the first (, then
echo "..." | awk -F '[()]' '{split($2,a," "); print a[2]}'
extracts it.
It splits the line using ( and ) as delimiters, so $2 is the text between the first two delimiters. split then breaks $2 on spaces, and the second piece is
android.content.pm.PackageParser.parseBaseApplication
for your example.
This looks like AOP syntax. Under certain assumptions, this can be done as:
echo "Join point...." | cut -d'(' -f2 | cut -d' ' -f2
Explanation : cut based on ( and get second field, which is the method signature except parameters. Since we are not interested in return type as well, split the signature based on blank space and get the second field, which is the method name.
This is based on your stated invariant that the substring you're capturing is the only part that varies from file to file. Here is a perl solution:
Extract=$(perl -ne 'print $1 if /\s*Join point \x27method-execution\(boolean\s+([^(]*)/' file_to_search)
echo "$Extract"
android.content.pm.PackageParser.parseBaseApplication
I used the full lead-in because it reduces the chance of a false positive, but if you find that other things change and want to use only a substring of it (e.g., "method-execution(boolean "), that's your choice to make.
The pattern matches up to where the variant substring starts. That substring runs to the next invariant, the open parenthesis, so we can simply capture everything that is not an open parenthesis. Since it's probably some human interaction changing the variant, I allowed for extra spaces with \s+ (one or more whitespace characters).
You could use almost the same regex with sed, but you would need to consume the entire string to keep the rest of it out of the output. E.g., in shorthand:
sed -r 's/.*LEAD_IN(CAPTURE_TEXT).*/\1/'
Here LEAD_IN is the constant leader, "Join point...", and CAPTURE_TEXT is the same capture group as in the perl solution. The main difference is the leading and trailing ".*" to consume the entire subject.
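Filling in that shorthand for the sample line from the question might look like the sketch below. The function name extract_method_name is made up for illustration, the lead-in is shortened to "method-execution(boolean ", and -E is used as the portable spelling of GNU sed's -r; treat it as one possible instantiation rather than the only one.

```shell
# Reads lines on stdin and prints the text between "method-execution(boolean "
# and the next open parenthesis. The leading and trailing .* consume the rest.
extract_method_name() {
  sed -E 's/.*method-execution\(boolean +([^(]*)\(.*/\1/'
}

line="Join point 'method-execution(boolean android.content.pm.PackageParser.parseBaseApplication(android.content.pm.PackageParser\$Package, android.content.res.Resources, org.xmlpull.v1.XmlPullParser, android.util.AttributeSet, int, java.lang.String[]))' in Type"

method=$(printf '%s\n' "$line" | extract_method_name)
printf '%s\n' "$method"
```

Lines that do not contain the lead-in pass through unchanged, so in practice you may want sed -n with the p flag on the substitution to print only the lines that matched.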

Grep filtering of the dictionary

I'm having a hard time getting a grasp of grep for a class I am in, and I was hoping someone could help guide me through this assignment. The assignment is as follows.
Using grep, print all 5-letter lower-case words from the Linux dictionary that have a single letter duplicated one time (aabbe or ababe are not valid because both a and b are in the word twice). Next to each word, print the duplicated letter followed by the non-duplicated letters in alphabetically ascending order.
The teacher noted that we will need to use several (6) grep statements (piping the results to the next grep) and a sed statement (stream editor) to reformat the final set of words, then pipe them into a read loop where you tear apart the three non-dup letters and sort them.
Sample Output:
aback a bck
abaft a bft
abase a bes
abash a bhs
abask a bks
abate a bet
I haven't figured out how to do more than print 5-character words:
grep "^.....$" /usr/share/dict/words |
Didn't check it thoroughly, but this might work
tr '[:upper:]' '[:lower:]' | egrep -x '[a-z]{5}' | sed -r 's/^(.*)(.)(.*)\2(.*)$/\2 \1\3\4/' | grep " " | egrep -v "(.).*\1"
But do it your own way, because someone might see it here.
All in one sed
sed -n '
# filter 5 letter word
/^[a-zA-Z]\{5\}$/ {
# lower letters
y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
# filter non single double letter
/\(.\).*\1/ !b
/\(.\).*\1.*\1/ b
/\(.\).*\1.*\(.\).*\2/ b
/\(.\).*\(.\).*\1.*\2/ b
/\(.\).*\(.\).*\2.*\1/ b
# extract peer and single
s/\(.*\)\(.\)\(.*\)\2\(.*\)/a & \2:\1\3\4/
# sort singles
:sort
s/:\([^a]*\)a\(.*\)$/:\1\2a/
y/abcdefghijklmnopqrstuvwxyz/zabcdefghijklmnopqrstuvwxy/
/^a/ !b sort
# clean and print
s/..//
s/:/ /p
}' YourFile
This is POSIX sed, so use --posix with GNU sed.
The first bit, obviously, is to use grep to get it down to just the words that have a single duplication in. I will give you some clues on how to do that.
The key is to use backreferences, which allow you to specify that something that matched a previous expression should appear again. So if you write
grep -E "^(.)...\1...\1$"
then you'll get all the words that have the starting letter reappearing in fifth and ninth positions. The point of the brackets is to allow you to refer later to whatever matched the thing in brackets; you do that with a \1 (to match the thing in the first lot of brackets).
You want to say that there should be a duplicate anywhere in the word, which is slightly more complicated, but not much. You want a character in brackets, then any number of characters, then the repeated character (with no ^ or $ specified).
That will also include ones where there are two or more duplicates, so the next stage is to filter them out. You can do that by a grep -v invocation. Once you've got your list of 5-character words that have at least one duplicate, pipe them through a grep -v call that strips out anything with two (or more) duplicates in. That'll have a (.), and another (.), and a \1, and a \2, and these might appear in several different orders.
You'll also need to strip out anything that has a (.) and a \1 and another \1, since that will have a letter with three occurrences.
That should be enough to get you started, at any rate.
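The filtering stages described above can be sketched as a chain of grep calls. This is only one way to arrange them: the function name single_dup_words and the sample words are made up for illustration, and backreferences with -E are a GNU grep extension rather than something POSIX guarantees.

```shell
# Reads words on stdin, one per line, and keeps the five-letter lowercase
# words in which exactly one letter occurs exactly twice.
single_dup_words() {
  grep -E '^[a-z]{5}$' |            # exactly five lowercase letters
  grep -E '(.).*\1' |               # at least one repeated letter
  grep -Ev '(.).*\1.*\1' |          # drop letters occurring three or more times
  grep -Ev -e '(.).*\1.*(.).*\2' \
           -e '(.).*(.).*\1.*\2' \
           -e '(.).*(.).*\2.*\1'    # drop two different repeated letters (aabb, abab, abba)
}

printf '%s\n' aback kebab apple mamma daddy eerie llama cocoa rotor Aback abc world |
  single_dup_words
```

Of the sample words, only aback, kebab and apple survive: the others are too short, contain an uppercase letter, have no repeat, have a letter three times, or have two repeated letters.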
Your next step should be to find the 5-letter words containing a duplicate letter. To do that, you will need to use back-references. Example:
grep '[a-z]*\([a-z]\)[a-z]*\1[a-z]*'
The \1 picks up the contents of the first parenthesized group and expects to match that group again. In this case, it matches a single letter. See: http://www.thegeekstuff.com/2011/01/advanced-regular-expressions-in-grep-command-with-10-examples--part-ii/ for more description of this capability.
You will next need to filter out those cases that have either a letter repeated 3 times or a word with 2 letters repeated. You will need to use the same sort of back-reference trick, but you can use grep -v to filter the results.
sed can be used for the final display. Grep will merely allow you to construct the correct lines to consider.
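For the final display step, a minimal sketch of the sed reformatting followed by the read loop from the assignment could look like this. The function name format_word is made up for illustration, and it assumes its argument has already passed the grep filters, that is, it is a five-letter word with exactly one repeated letter.

```shell
# Prints "word dup rest", where dup is the repeated letter and rest is
# the three remaining letters in ascending order.
format_word() {
  printf '%s\n' "$1" |
    sed -E 's/^(.*)(.)(.*)\2(.*)$/\2 \1\3\4/' |   # "abase" -> "a bse"
    while read -r dup rest; do
      # Split the remaining letters one per line, sort them, and rejoin.
      sorted=$(printf '%s\n' "$rest" | fold -w1 | sort | tr -d '\n')
      printf '%s %s %s\n' "$1" "$dup" "$sorted"
    done
}

format_word abase
format_word kebab
```

The first call reproduces the "abase a bes" line from the sample output in the question.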
Note that the dictionary contains capital letters and also non-letter characters, plus the accented characters used in Southern Europe, say "è".
If you want to distinguish "A" from "a", that is automatic. If, on the other hand, "A" and "a" count as the same letter, you must use the -i option in ALL grep invocations to instruct grep to ignore case.
Next, you always want to pass the -E option, to avoid the so-called backslashitis gravis in the regexps that you pass to grep.
Further, if you want to exclude the lines matching a regexp from the output, the correct option is -v.
Finally, if you want to give many different regexes to a single grep invocation, this is the way (just an example, by the way):
grep -E -i -v -e 'regexp_1' -e 'regexp_2' ... -e 'regexp_n'
The preliminaries are behind us. Looking forward, use the answer from chiastic-security as a reference to understand the proceedings.
There are only these possibilities for finding a duplicate in a 5-character string:
(.)\1
(.).\1
(.)..\1
(.)...\1
grep -E -i -e 'regexp_1' ...
Now you have all the doubles, but this doesn't exclude triples etc., which are identified by the following patterns (edit: added a couple of additional patterns that match triples):
(.)\1\1
(.).\1\1
(.)\1.\1
(.)..\1\1
(.).\1.\1
(.)\1\1\1
(.).\1\1\1
(.)\1\1\1\1
You want to exclude these patterns, so grep -E -i -v -e 'regexp_1' ...
At this point, you have a list of words with at least one pair of identical characters and no triples etc., and you want to drop the double doubles. These are the regexes that match double doubles:
(.)(.)\1\2
(.)(.)\2\1
(.).(.)\1\2
(.).(.)\2\1
(.)(.).\1\2
(.)(.).\2\1
(.)(.)\1.\2
(.)(.)\2.\1
and you want to exclude the lines with these patterns, so it's grep -E -i -v ...
A final hint: to play with my answer, copy a few hundred lines of the dictionary into your working directory,
head -n 3000 /usr/share/dict/words | tail -n 300 > ./300words
so that you can really understand what you're doing without being overwhelmed by the volume of the output.
And yes, this is not a complete answer, but it is maybe too much, isn't it?

truncate output in BASH

How do I truncate output in BASH?
For example, if I "du file.name" how do I just get the numeric value and nothing more?
Later addition:
All solutions work perfectly. I chose to accept the most enlightening "cut" answer because I prefer the simplest approach in bash files that others are supposed to be able to read.
If you know what the delimiters are, then cut is your friend:
du | cut -f1
Cut defaults to tab delimiters so in this case you are selecting the first field.
You can change delimiters: cut -d ' ' would use a space as a delimiter. (from Tomalak)
You can also select individual character positions or ranges:
ls | cut -c1-2
I'd recommend cut, as others have said. But another alternative that is sometimes useful because it allows any whitespace as separators, is to use awk:
du file.name | awk '{print $1}'
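To compare the two approaches side by side, here is a minimal sketch that captures the numeric field into a variable both ways. The temporary file and the variable names are made up for illustration, and -k is added on the assumption that a fixed unit (kilobytes) is wanted.

```shell
# Hypothetical sample file to measure.
tmpfile=$(mktemp)
printf 'hello\n' > "$tmpfile"

# du prints "<size><TAB><name>"; cut takes the first tab-separated field,
# awk takes the first whitespace-separated field.
size_cut=$(du -k "$tmpfile" | cut -f1)
size_awk=$(du -k "$tmpfile" | awk '{ print $1 }')

printf 'cut: %s\nawk: %s\n' "$size_cut" "$size_awk"
rm -f "$tmpfile"
```

Both variables end up holding the same number; awk is the more forgiving choice when the separator might be spaces instead of a tab.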
du | cut -f 1
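If you just want the number of bytes of a single file, note that the shell's -s test only checks that a file exists and is not empty; it does not return the size. To get the byte count, use wc -c:
SIZE=$(wc -c < file.name)
That gives you a different number than du (which reports disk usage rather than file length), but I'm not sure how exactly you're using this.
This has the advantage of not having to run du and parse its output.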
It's hard to answer questions like this in a vacuum, because we don't know how you're going to use the data. Knowing that might suggest an entirely different answer.

Resources