Grep expression filter out lines of the form [alnum][punct][alnum] - text

Hi all my first post is for what I thought would be simple ...
I haven't been able to find an example of a similar problem/solution.
I have thousands of text files with thousands of lines of content in the form
<word><space><word><space><number>
Example:
example for 1
useful when 1
for. However 1
,boy wonder 1
,hary-horse wondered 2
In the above example I want to exclude line 3 as it contains internal punctuation
I'm trying to use the GNU grep 2.25 however not having luck
my initial attempt was (however this does not allow the "-" internal to the pattern):
grep -v [:alnum:]*[:punct:]*[:alnum:]* filename
so tried this however
grep -v [:alnum:]*[:space:]*[!]*["]*[#]*[$]*[%]*[&]*[']*[(]*[)]*[*]*[+]*[,]*[.]*[/]*[:]*[;]*[<]*[=]*[>]*[?]*[#]*[[]*[\]*[]]*[^]*[_]*[`]*[{]*[|]*[}]*[~]*[.]*[:space:]*[:alnum:]* filename
however I need to factor in spaces and - as these are acceptable internal to the string.
I had been trying with the :punct" set however now see it contains - so clearly that will not work
I do currently have a stored procedure in TSQL to process these however would prefer to preprocess prior to loading if possible as the routine takes some seconds per file.
Has someone been able to achieve something similar?

On the face of it, you're looking for the 'word space word space number' schema, assuming 'word' is 'one alphanumeric optionally followed by zero or one occurrences of zero or more alphanumeric or punctuation characters and ending with an alphanumeric', and 'space' is 'one or more spaces' and 'number' is 'one or more digits'.
In terms of grep -E (aka egrep):
grep -E '[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:digit:]]+'
That contains:
[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?
That detects a word with any punctuation surrounded by alphanumerics, and:
[[:space:]]+
[[:digit:]]+
which look for one or more spaces or digits.
Using a mildly extended data file, this produces:
$ cat data
example for 1
useful when 1
for. However 1
,boy wonder 1
,hary-horse wondered 2
O'Reilly Books 23
Coelecanths, Dodos Etc 19
$ grep -E '[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:digit:]]+' data
example for 1
useful when 1
,boy wonder 1
,hary-horse wondered 2
O'Reilly Books 23
Coelecanths, Dodos Etc 19
$
It eliminates the for. However 1 line as required.

Your regex contains a long string of ordered optional elements, but that means it will fail if something happens out of order. For example,
[!]*[?]*
will capture !? but not ?! (and of course, a character class containing a single character is just equivalent to that single character, so you might as well say !*?*).
You can instead use a single character class which contains all of the symbols you want to catch. As soon as you see one next to an alphanumeric character, you are done, so you don't need for the regex to match the entire input line.
grep -v '[[:alnum:]][][!"#$%&'"'"'()*+,./:;<=>?#\^_`{|}~]' filename
Also notice how the expression needs to be in single quotes in order for the shell not to interfere with the many metacharacters here. In order for a single-quoted string to include a literal single quote, I temporarily break out into a double-quoted string; see here for an explanation (I call this "seesaw quoting").
In a character class, if the class needs to include ], it needs to be at the beginning of the enumerated list; for symmetry and idiom, I also moved [ next to it.
Moreover, as pointed out by Jonathan Leffler, a POSIX character class name needs to be inside a character class; so to match one character belonging to the [:alnum:] named set, you say [[:alnum:]]. (This means you can combine sets, so [-[:alnum:].] covers alphanumerics plus dash and period.)
If you need to constrain this to match only on the first field, change the [[:alnum:]] to ^[[:alnum:]]\+.
Not realizing that a*b*c* matches anything is a common newbie error. You want to avoid writing an expression where all elements are optional, because it will match every possible string. Focus on what you want to match (the long list of punctuation characters, in your case) and then maybe add optional bits of context around it if you really need to; but the fewer of these you need, the faster it will run, and the easier it will be to see what it does. As a quick rule of thumb, a*bc* is effectively precisely equivalent to just b -- leading or trailing optional expressions might as well not be specified, because they do not affect what is going to be matched.

Related

How to replace a range of numbers from a range of strings with sed

I'm trying to modify a given text file, wherein I want to change/alter the following strings, eg:
lcl|NC_018257.1_cds_XP_003862892.1_5067
lcl|NC_018241.1_cds_XP_003859498.1_1683
lcl|NC_018256.1_cds_XP_003862456.1_4633
lcl|NC_018237.1_cds_XP_003858978.1_1163
lcl|NC_018254.1_cds_XP_003861926.1_4104
so that it only contains the XP_n.1 part of the string.
I have successfully removed the lcl|NC\_*.1_cds\_ part out of the strings for which
I used the following sed command:
sed 's/lcl|NC\_.\*_cds_//g' cds.fa > cds4.fa
The resultant text file contains strings like XP_003862892.1_5067.
There are about 8014 strings like this ranging from XP_*.1_1 to XP_*.1_8014. I want to delete the _1 to _8014 part of the string and replace it with 1.
I tried using
sed 's/1\_./1/g'
and it seemed to have worked, however when I scrolled further down the list of strings, the double digit numbers didn't get replaced - only one of the digits was replaced, which immediately followed the '_', resulting in the first digit turning into 1 and the rest retaining their original identity. Same with triple and quadruple digit numbers.
eg:
XP_003857837.1_23 ---> XP_003857837.13
XP_003857942.1_228 ---> XP_003857942.128
I have absolutely no idea how to remove this, all my attempts have led to failure. Some people have asked me for what my desired output should look like, the ideal output would be: XP_003857837.1, each string should be followed by a .1 instead of .1_SomeNumberRangingFrom1to8014
You can do everything in one go with a slightly more complex regex.
sed 's/lcl|NC_.*_cds_\(XP_[0-9.]*\)_.*/\1/' cds.fa > cds4.fa
The backslashed parentheses create a capturing group, and \1 in the replacement recalls the first captured group (\2 for the second, etc, if you have more than one). The regex inside the group looks for XP_ followed by digits and dots, and the expression after matches the rest of the line from the next uderscore on.
In other words, this basically says "replace the whole line with just the part we care about".
By the by, there is no reason to backslash underscores anywhere, and the /g option to the s command only makes sense when you want to replace multiple occurrences on the same input line.
Using sed
$ sed 's/.*_\?\(XP_[^.]*\.\)[^_]*_[0-9]\(.*\)/\11\2/'
XP_003862892.1067
XP_003859498.1683
XP_003862456.1633
XP_003858978.1163
XP_003861926.1104
XP_003857837.13
XP_003857942.128

inserting a number from stdout into a string from stdout

I'm working on a Linux terminal.
I have a string followed by a number as stdout and I need a command that replaces the middle of the string by the number and writes the result to stdout.
This is the string and number: librarian 16
and this is what the output should be: l16n
I have tried using echo librarian 16|sed s/[a-z]*/16/g and this gives me 9 999 the problems are that it replaces every letter separitaly and that it also replaces the first and last letter and that I can't make it use the number from stdout.
I have also tried using cut -c 1-1 , sed s/[^0-9]*//g and cut-c 9-9 to generate l, 16 and n respectively but I can't find how to combine their outputs into a single line.
Lastly I have tried using text editors to copy the number and paste it into the string but I haven't made much progress since I don't know how to use editors directly from the command line.
So what you want is to capture the first letter, the last letter and the number while ignoring the middle.
In regex we use ( and ) to tell the engine what we want to capture, anything else simply gets matched, or "eaten", but not captured. So the pattern should look like this:
([a-z])[a-z]*([a-z]) ([0-9]+)
([a-z]) to capture the first letter
[a-z]* to match zero or more characters but not capture. We choose "*" here because there might not be anything to match in the middle, like when there are two or less letters.
([a-z]) to capture the last letter.
to "eat" the whitespace.
([0-9]+) to capture the number. We use + instead of * because we require a number at this position.
sed uses a different syntax for some fo these constructs so we'll use the -E flag. You could do without it but you'd have to escape the ()+ characters which IMO makes pattern a little bit confusing.
Now, to retrieve the captured content, we have to use an engine-specific sequence of characters. sed uses \n where n is the number of the capturing group, so our final pattern should look like this:
\1\3\2
\1: First letter
\3: Number
\2: Last letter
Now we put everything together:
$ echo librarian 16|sed -r 's/([a-z])[a-z]*([a-z]) ([0-9]+)/\1\3\2/g'
l16n

How to perform following search and replace in vim?

I have the following string in the code at multiple places,
m_cells->a[ Id ]
and I want to replace it with
c(Id)
where the string Id could be anything including numbers also.
A regular expression replace like below should do:
%s/m_cells->a\[\s\(\w\+\)\s\]/c(\1)/g
If you wish to apply the replacement operation on a number of files you could use the :bufdo command.
Full explanation of #BasBossink's answer (as a separate answer because this won't fit in a comment), because regexes are awesome but non-trivial and definitely worth learning:
In Command mode (ie. type : from Normal mode), s/search_term/replacement/ will replace the first occurrence of 'search_term' with 'replacement' on the current line.
The % before the s tells vim to perform the operation on all lines in the document. Any range specification is valid here, eg. 5,10 for lines 5-10.
The g after the last / performs the operation "globally" - all occurrences of 'search_term' on the line or lines, not just the first occurrence.
The "m_cells->a" part of the search term is a literal match. Then it gets interesting.
Many characters have special meaning in a regex, and if you want to use the character literally, without the special meaning, then you have to "escape" it, by putting a \ in front.
Thus \[ and \] match the literal '[' and ']' characters.
Then we have the opposite case: literal characters that we want to treat as special regex entities.
\s matches white*s*pace (space, tab, etc.).
\w matches "*w*ord" characters (letters, digits, and underscore _).
(. matches any character (except a newline). \d matches digits. There are more...)
If a character is not followed by a quantifier, then exactly one such character matches. Thus, \s will match one space or tab, but not fewer or more.
\+ is a quantifier, and means "one or more". (\? matches 0 or 1; * (with no backslash) matches any number: zero or more. Warning: matching on zero occurrences takes a little getting used to; when you're first learning regexes, you don't always get the results you expected. It's also possible to match on an arbitrary exact number or range of occurrences, but I won't get into that here.)
\( and \) work together to form a "capturing group". This means that we don't just want to match on these characters, we also want to remember them specially so that we can do something with them later. You can have any number of capturing groups, and they can be nested too. You can refer to them later by number, starting at 1 (not 0). Just start counting (escaped) left-parantheses from the left to determine the number.
So here, we are matching a space followed by a group (which we will capture) of at least one "word" character followed by a space, within the square brackets.
Then section between the second and third / is the replacement text.
The "c" is literal.
\1 means the first captured group, which in this case will be the "Id".
In summary, we are finding text that matches the given description, capturing part of it, and replacing the entire match with the replacement text that we have constructed.
Perhaps a final suggestion: c after the final / (doesn't matter whether it comes before or after the 'g') enables *c*onfirmation: vim will highlight the characters to be replaced and will show the replacement text and ask whether you want to go ahead. Great for learning.
Yes, regexes are complicated, but super powerful and well worth learning. Once you have them internalized, they're actually fairly easy. I suggest that, as with learning vim itself, you start with the basics, get fluent in them, and then incrementally add new features to your repertoire.
Good luck and have fun.

replacing part of regex matches

I have several functions that start with get_ in my code:
get_num(...) , get_str(...)
I want to change them to get_*_struct(...).
Can I somehow match the get_* regex and then replace according to the pattern so that:
get_num(...) becomes get_num_struct(...),
get_str(...) becomes get_str_struct(...)
Can you also explain some logic behind it, because the theoretical regex aren't like the ones used in UNIX (or vi, are they different?) and I'm always struggling to figure them out.
This has to be done in the vi editor as this is main work tool.
Thanks!
To transform get_num(...) to get_num_struct(...), you need to capture the correct text in the input. And, you can't put the parentheses in the regular expression because you may need to match pointers to functions too, as in &get_distance, and uses in comments. However, and this depends partially on the fact that you are using vim and partially on how you need to keep the entire input together, I have checked that this works:
%s/get_\w\+/&_struct/g
On every line, find every expression starting with get_ and continuing with at least one letter, number, or underscore, and replace it with the entire matched string followed by _struct.
Darn it; I shouldn't answer these things on spec. Note that other regex engines might use \& instead of &. This depends on having magic set, which is default in vim.
For an alternate way to do it:
%s/get_\(\w*\)(/get_\1_struct(/g
What this does:
\w matches to any "word character"; \w* matches 0 or more word characters.
\(...\) tells vim to remember whatever matches .... So, \(w*\) means "match any number of word characters, and remember what you matched. You can then access it in the replacement with \1 (or \2 for the second, etc.)
So, the overall pattern get_\(\w*\)( looks for get_, followed by any number of word chars, followed by (.
The replacement then just does exactly what you want.
(Sorry if that was too verbose - not sure how comfortable you are with vim regex.)

Decrypt obfuscated perl script

Had some spam issues on my server and, after finding out and removing some Perl and PHP scripts I'm down to checking what they really do, although I'm a senior PHP programmer I have little experience with Perl, can anyone give me a hand with the script here:
http://pastebin.com/MKiN8ifp
(It was one long line of code, script was called list.pl)
The start of the script is:
$??s:;s:s;;$?::s;(.*); ]="&\%[=.*.,-))'-,-#-*.).<.'.+-<-~-#,~-.-,.+,~-{-,.<'`.{'`'<-<--):)++,+#,-.{).+,,~+{+,,<)..})<.{.)-,.+.,.)-#):)++,+#,-.{).+,,~+{+,,<)..})<*{.}'`'<-<--):)++,+#,-.{).+:,+,+,',~+*+~+~+{+<+,)..})<'`'<.{'`'<'<-}.<)'+'.:*}.*.'-|-<.+):)~*{)~)|)++,+#,-.{).+:,+,+,',~+*+~+~+{+<+,)..})
It continues with precious few non-punctuation characters until the very end:
0-9\;\\_rs}&a-h;;s;(.*);$_;see;
Replace the s;(.*);$_;see; with print to get this. Replace s;(.*);$_;see; again with print in the first half of the payload to get this, which is the decryption code. The second half of the payload is the code to decrypt, but I can't go any further with it, because as you see, the decryption code is looking for a key in an envvar or a cookie (so that only the script's creator can control it or decode it, presumably), and I don't have that key. This is actually reasonably cleverly done.
For those interested in the nitty gritty... The first part, when de-tangled looks like this:
$? ? s/;s/s;;$?/ :
s/(.*)/...lots of punctuation.../;
The $? at the beginning of the line is the pre-defined variable containing the child error, which no doubt serves only as obfuscation. It will be undefined, as there can be no child error at this point.
The questionmark following it is the start of a ternary operator
CONDITION ? IF_TRUE : IF_FALSE
Which is also added simply to obfuscate. The expression returned for true is a substitution regex, where the / slash delimiter has been replaced with colon s:pattern:replacement:. Above, I have put back slashes. The other expression, which is the one that will be executed is also a substitution regex, albeit an incredibly long one. The delimiter is semi-colon.
This substitution replaces .* in $_ - the default input and pattern-searching space - with a rather large amount of punctuation characters, which represents the bulk of the code. Since .* matches any string, even the empty string, it will simply get inserted into $_, and is for all intents and purposes identical to simply assigning the string to $_, which is what I did:
$_ = q;]="&\%[=.*.,-))'-,-# .......;;
The following lines are a transliteration and another substitution. (I inserted comments to point out the delimiters)
y; -"[%-.:<-#]-`{-}#~\$\\;{\$()*.0-9\;\\_rs}&a-h;;
#^ ^ ^ ^
#1 2 3
(1,2,3 are delimiters, the semi-colon between 2 and 3 is escaped)
The basic gist of it is that various characters and ranges -" (space to double quote), and something that looks like character classes (with ranges) [%-.:<-#], but isn't, get transliterated into more legible characters e.g. curly braces, dollar sign, parentheses,0-9, etc.
s;(.*);$_;see;
The next substitution is where the magic happens. It is also a substitution with obfuscated delimiters, but with three modifers: see. s does nothing in this case, as it only allows the wildcard character . to match newline. ee means to evaluate the expression twice, however.
In order to see what I was evaluating, I performed the transliteration and printed the result. I suspect that I somewhere along the line got some characters corrupted, because there were subtle errors, but here's the short (cleaned up) version:
s;(.*);73756220656e6372797074696f6e5f6 .....;; # very long line of alphanumerics
s;(..);chr(hex($1));eg;
s;(.*);$_;see;
s;(.*);704b652318371910023c761a3618265 .....;; # another long line
s;(..);chr(hex($1));eg;
&e_echr(\$_);
s;(.*);$_;see;
The long regexes are once again the data containers, and insert data into $_ to be evaluated as code.
The s/(..)/chr(hex($1))/eg; is starting to look rather legible. It is basically reading two characters at the time from $_ and converting it from hex to corresponding character.
The next to last line &e_echr(\$_); stumped me for a while, but it is a subroutine that is defined somewhere in this evaluated code, as hobbs so aptly was able to decode. The dollar sign is prefixed by backslash, meaning it is a reference to $_: I.e. that the subroutine can change the global variable.
After quite a few evaluations, $_ is run through this subroutine, after which whatever is contained in $_ is evaluated a last time. Presumably this time executing the code. As hobbs said, a key is required, which is taken from the environment %ENV of the machine where the script runs. Which we do not have.
Ask the B::Deparse module to make it (a little more) readable.

Resources