How to replace a range of numbers from a range of strings with sed - linux

I'm trying to modify a given text file, wherein I want to change/alter the following strings, eg:
lcl|NC_018257.1_cds_XP_003862892.1_5067
lcl|NC_018241.1_cds_XP_003859498.1_1683
lcl|NC_018256.1_cds_XP_003862456.1_4633
lcl|NC_018237.1_cds_XP_003858978.1_1163
lcl|NC_018254.1_cds_XP_003861926.1_4104
so that it only contains the XP_n.1 part of the string.
I have successfully removed the lcl|NC\_*.1_cds\_ part out of the strings for which
I used the following sed command:
sed 's/lcl|NC\_.\*_cds_//g' cds.fa > cds4.fa
The resultant text file contains strings like XP_003862892.1_5067.
There are about 8014 strings like this ranging from XP_*.1_1 to XP_*.1_8014. I want to delete the _1 to _8014 part of the string and replace it with 1.
I tried using
sed 's/1\_./1/g'
and it seemed to have worked, however when I scrolled further down the list of strings, the double digit numbers didn't get replaced - only one of the digits was replaced, which immediately followed the '_', resulting in the first digit turning into 1 and the rest retaining their original identity. Same with triple and quadruple digit numbers.
eg:
XP_003857837.1_23 ---> XP_003857837.13
XP_003857942.1_228 ---> XP_003857942.128
I have absolutely no idea how to remove this, all my attempts have led to failure. Some people have asked me for what my desired output should look like, the ideal output would be: XP_003857837.1, each string should be followed by a .1 instead of .1_SomeNumberRangingFrom1to8014

You can do everything in one go with a slightly more complex regex.
sed 's/lcl|NC_.*_cds_\(XP_[0-9.]*\)_.*/\1/' cds.fa > cds4.fa
The backslashed parentheses create a capturing group, and \1 in the replacement recalls the first captured group (\2 for the second, etc, if you have more than one). The regex inside the group looks for XP_ followed by digits and dots, and the expression after matches the rest of the line from the next uderscore on.
In other words, this basically says "replace the whole line with just the part we care about".
By the by, there is no reason to backslash underscores anywhere, and the /g option to the s command only makes sense when you want to replace multiple occurrences on the same input line.

Using sed
$ sed 's/.*_\?\(XP_[^.]*\.\)[^_]*_[0-9]\(.*\)/\11\2/'
XP_003862892.1067
XP_003859498.1683
XP_003862456.1633
XP_003858978.1163
XP_003861926.1104
XP_003857837.13
XP_003857942.128

Related

inserting a number from stdout into a string from stdout

I'm working on a Linux terminal.
I have a string followed by a number as stdout and I need a command that replaces the middle of the string by the number and writes the result to stdout.
This is the string and number: librarian 16
and this is what the output should be: l16n
I have tried using echo librarian 16|sed s/[a-z]*/16/g and this gives me 9 999 the problems are that it replaces every letter separitaly and that it also replaces the first and last letter and that I can't make it use the number from stdout.
I have also tried using cut -c 1-1 , sed s/[^0-9]*//g and cut-c 9-9 to generate l, 16 and n respectively but I can't find how to combine their outputs into a single line.
Lastly I have tried using text editors to copy the number and paste it into the string but I haven't made much progress since I don't know how to use editors directly from the command line.
So what you want is to capture the first letter, the last letter and the number while ignoring the middle.
In regex we use ( and ) to tell the engine what we want to capture, anything else simply gets matched, or "eaten", but not captured. So the pattern should look like this:
([a-z])[a-z]*([a-z]) ([0-9]+)
([a-z]) to capture the first letter
[a-z]* to match zero or more characters but not capture. We choose "*" here because there might not be anything to match in the middle, like when there are two or less letters.
([a-z]) to capture the last letter.
to "eat" the whitespace.
([0-9]+) to capture the number. We use + instead of * because we require a number at this position.
sed uses a different syntax for some fo these constructs so we'll use the -E flag. You could do without it but you'd have to escape the ()+ characters which IMO makes pattern a little bit confusing.
Now, to retrieve the captured content, we have to use an engine-specific sequence of characters. sed uses \n where n is the number of the capturing group, so our final pattern should look like this:
\1\3\2
\1: First letter
\3: Number
\2: Last letter
Now we put everything together:
$ echo librarian 16|sed -r 's/([a-z])[a-z]*([a-z]) ([0-9]+)/\1\3\2/g'
l16n

Grep expression filter out lines of the form [alnum][punct][alnum]

Hi all my first post is for what I thought would be simple ...
I haven't been able to find an example of a similar problem/solution.
I have thousands of text files with thousands of lines of content in the form
<word><space><word><space><number>
Example:
example for 1
useful when 1
for. However 1
,boy wonder 1
,hary-horse wondered 2
In the above example I want to exclude line 3 as it contains internal punctuation
I'm trying to use the GNU grep 2.25 however not having luck
my initial attempt was (however this does not allow the "-" internal to the pattern):
grep -v [:alnum:]*[:punct:]*[:alnum:]* filename
so tried this however
grep -v [:alnum:]*[:space:]*[!]*["]*[#]*[$]*[%]*[&]*[']*[(]*[)]*[*]*[+]*[,]*[.]*[/]*[:]*[;]*[<]*[=]*[>]*[?]*[#]*[[]*[\]*[]]*[^]*[_]*[`]*[{]*[|]*[}]*[~]*[.]*[:space:]*[:alnum:]* filename
however I need to factor in spaces and - as these are acceptable internal to the string.
I had been trying with the :punct" set however now see it contains - so clearly that will not work
I do currently have a stored procedure in TSQL to process these however would prefer to preprocess prior to loading if possible as the routine takes some seconds per file.
Has someone been able to achieve something similar?
On the face of it, you're looking for the 'word space word space number' schema, assuming 'word' is 'one alphanumeric optionally followed by zero or one occurrences of zero or more alphanumeric or punctuation characters and ending with an alphanumeric', and 'space' is 'one or more spaces' and 'number' is 'one or more digits'.
In terms of grep -E (aka egrep):
grep -E '[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:digit:]]+'
That contains:
[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?
That detects a word with any punctuation surrounded by alphanumerics, and:
[[:space:]]+
[[:digit:]]+
which look for one or more spaces or digits.
Using a mildly extended data file, this produces:
$ cat data
example for 1
useful when 1
for. However 1
,boy wonder 1
,hary-horse wondered 2
O'Reilly Books 23
Coelecanths, Dodos Etc 19
$ grep -E '[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:digit:]]+' data
example for 1
useful when 1
,boy wonder 1
,hary-horse wondered 2
O'Reilly Books 23
Coelecanths, Dodos Etc 19
$
It eliminates the for. However 1 line as required.
Your regex contains a long string of ordered optional elements, but that means it will fail if something happens out of order. For example,
[!]*[?]*
will capture !? but not ?! (and of course, a character class containing a single character is just equivalent to that single character, so you might as well say !*?*).
You can instead use a single character class which contains all of the symbols you want to catch. As soon as you see one next to an alphanumeric character, you are done, so you don't need for the regex to match the entire input line.
grep -v '[[:alnum:]][][!"#$%&'"'"'()*+,./:;<=>?#\^_`{|}~]' filename
Also notice how the expression needs to be in single quotes in order for the shell not to interfere with the many metacharacters here. In order for a single-quoted string to include a literal single quote, I temporarily break out into a double-quoted string; see here for an explanation (I call this "seesaw quoting").
In a character class, if the class needs to include ], it needs to be at the beginning of the enumerated list; for symmetry and idiom, I also moved [ next to it.
Moreover, as pointed out by Jonathan Leffler, a POSIX character class name needs to be inside a character class; so to match one character belonging to the [:alnum:] named set, you say [[:alnum:]]. (This means you can combine sets, so [-[:alnum:].] covers alphanumerics plus dash and period.)
If you need to constrain this to match only on the first field, change the [[:alnum:]] to ^[[:alnum:]]\+.
Not realizing that a*b*c* matches anything is a common newbie error. You want to avoid writing an expression where all elements are optional, because it will match every possible string. Focus on what you want to match (the long list of punctuation characters, in your case) and then maybe add optional bits of context around it if you really need to; but the fewer of these you need, the faster it will run, and the easier it will be to see what it does. As a quick rule of thumb, a*bc* is effectively precisely equivalent to just b -- leading or trailing optional expressions might as well not be specified, because they do not affect what is going to be matched.

Vim - Substitute command and regexp

I have a little JSON files with some entries, here is a section:
"i":{
"normale":"3c",
"bold":"4b",
"doppio":"6c"},
"is":{
"normale":"2c",
"bold":"33",
"doppio":"66"},
I realized I have to add "\u25" in front of all the values, so I tried this command:
:%s:\("\)\(\d\d"\)\|\("\)\(\d\w"\):"\\u25\2
The idea is to search for either "dd" or "dw", and substitute the first double quote with "\u25 while keeping the rest.This is the result:
"i":{
"normale":"\u25,
"bold":"\u25,
"doppio":"\u25},
"is":{
"normale":"\u25,
"bold":"\u2533",
"doppio":"\u2566"},
If the matching string has only the two digits, the command works fine: the first double quote (the first group) is substituted and the second group is left as it was.
However, if the matching string has a digit and a character, it seems to ignore the second group, substituting the whole string. The two patterns are identical, except for \w, so it should work exactly the same. What's happening?
Vim matches \d to digits; you'd need \x to match hex digits.
But it seems you want to replace all occurrences of :" with :"\u25.
Can you use:
:%s/:"/:"\\u25"/
Or, if you want to prepend \u25 to all occurrences of 2 hex digits,
:%s/\x\x/\\u25&/

Substitute `number` with `(number)` in multiple lines

I am a beginner at Vim and I've been reading about substitution but I haven't found an answer to this question.
Let's say I have some numbers in a file like so:
1
2
3
And I want to get:
(1)
(2)
(3)
I think the command should resemble something like :s:\d\+:........ Also, what's the difference between :s/foo/bar and :s:foo:bar ?
Thanks
Here is an alternative, slightly less verbose, solution:
:%s/^\d\+/(&)
Explanation:
^ anchors the pattern to the beginning of the line
\d is the atom that covers 0123456789
\+ matches one or more of the preceding item
& is a shorthand for \0, the whole match
Let me address those in reverse.
First: there's no difference between :s/foo/bar and :s:foo:bar; whatever delimiter you use after the s, vim will expect you to use from then on. This can be nice if you have a substitution involving lots of slashes, for instance.
For the first: to do this to the first number on the current line (assuming no commas, decimal places, etc), you could do
:s:\(\d\+\):(\1)
The \(...\) doesn't change what is matched - rather, it tells vim to remember whatever matched what is inside, and store it. The first \(...\) is stored in \1, the second in \2, etc. So, when you do the replacement, you can reference \1 to get the number back.
If you want to change ALL numbers on the current line, change it to
:s:\(\d\+\):(\1):g
If you want to change ALL numbers on ALL lines, change it to
:%s:\(\d\+\):(\1):g
You can do what you want with:
:%s/\([0-9]\)/(\1)/
%s means global search and replace, that is do the search/replace for every line in the file. the \( \) defines a group, which in turn is referenced by \1. So the above search and replace, finds all lines with a single digit ([0-9]), and replaces it with the matched digit surrounded by parentheses.

How do you delete everything but a specific pattern in Vim?

I have an XML file where I only care about the size attribute of a certain element.
First I used
global!/<proto name="geninfo"/d
to delete all lines that I don't care about. That leaves me a whole bunch of lines that look like this:
<proto name="geninfo" pos="0" showname="General information" size="174">
I want to delete everything but the value for "size."
My plan was to use substitute to get rid of everything not matching 'size="[digit]"', the remove the string 'size' and the quotes but I can't figure out how to substitute the negation of a string.
Any idea how to do it, or ideas on a better way to achieve this? Basically I want to end up with a file with one number (the size) per line.
You can use matching groups:
:%s/^.*size="\([0-9]*\)".*$/\1/
This will replace lines that contain size="N" by just N and not touch other lines.
Explanation: this will look for a line that contains some random characters, then somewhere the chain size=", then digits, then ", then some more random characters, then the end of the line. Now what I did is that I wrapped the digits in (escaped) parenthesis. That creates a group. In the second part of the search-and-replace command, I essentially say "I want to replace the whole line with just the contents of that first group" (referred to as \1).
:v:size="\d\+":d|%s:.*size="\([^"]\+\)".*:\1:
The first command (until the | deletes every line which does not match the size="<SOMEDIGIT(S)>" pattern, the second (%s... removes everything before and after size attr's " (and " will also be removed).
HTH

Resources