inserting a number from stdout into a string from stdout - linux

I'm working on a Linux terminal.
I have a string followed by a number as stdout and I need a command that replaces the middle of the string by the number and writes the result to stdout.
This is the string and number: librarian 16
and this is what the output should be: l16n
I have tried using echo librarian 16|sed s/[a-z]*/16/g and this gives me 9 999. The problems are that it replaces every letter separately, that it also replaces the first and last letters, and that I can't make it use the number from stdout.
I have also tried using cut -c 1-1, sed s/[^0-9]*//g and cut -c 9-9 to generate l, 16 and n respectively, but I can't find how to combine their outputs into a single line.
Lastly I have tried using text editors to copy the number and paste it into the string but I haven't made much progress since I don't know how to use editors directly from the command line.

So what you want is to capture the first letter, the last letter and the number while ignoring the middle.
In regex we use ( and ) to tell the engine what we want to capture, anything else simply gets matched, or "eaten", but not captured. So the pattern should look like this:
([a-z])[a-z]*([a-z]) ([0-9]+)
([a-z]) to capture the first letter
[a-z]* to match zero or more characters without capturing them. We choose "*" here because there might not be anything to match in the middle, like when the word has only two letters.
([a-z]) to capture the last letter.
" " (a single literal space) to "eat" the whitespace.
([0-9]+) to capture the number. We use + instead of * because we require a number at this position.
By default sed uses a different syntax for some of these constructs, so we'll use the -E flag (GNU sed also accepts -r for the same purpose). You could do without it, but you'd have to escape the ( ) and + characters, which IMO makes the pattern a little confusing.
Now, to retrieve the captured content, we have to use an engine-specific sequence of characters. sed uses \n where n is the number of the capturing group, so our replacement should look like this:
\1\3\2
\1: First letter
\3: Number
\2: Last letter
Now we put everything together:
$ echo librarian 16|sed -E 's/([a-z])[a-z]*([a-z]) ([0-9]+)/\1\3\2/g'
l16n
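For comparison, here is a rough sketch of the same substitution without -E, which is the escaped form mentioned above. The function name abbreviate is made up for illustration, and \{1,\} is used in place of + because + is not part of POSIX basic regular expressions.

```shell
# Same substitution in basic regular expression (BRE) syntax: groups are
# written \( \) and "one or more" is written \{1,\}.
# abbreviate is a hypothetical helper name, not part of the original answer.
abbreviate() {
  sed 's/\([a-z]\)[a-z]*\([a-z]\) \([0-9]\{1,\}\)/\1\3\2/'
}

echo 'librarian 16' | abbreviate   # prints: l16n
echo 'ab 7' | abbreviate           # prints: a7b
```

The second call shows why the middle part uses *: with only two letters there is nothing between the captured first and last letter.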

Related

How to replace a range of numbers from a range of strings with sed

I'm trying to modify a given text file, wherein I want to change/alter the following strings, eg:
lcl|NC_018257.1_cds_XP_003862892.1_5067
lcl|NC_018241.1_cds_XP_003859498.1_1683
lcl|NC_018256.1_cds_XP_003862456.1_4633
lcl|NC_018237.1_cds_XP_003858978.1_1163
lcl|NC_018254.1_cds_XP_003861926.1_4104
so that it only contains the XP_n.1 part of the string.
I have successfully removed the lcl|NC\_*.1_cds\_ part out of the strings for which
I used the following sed command:
sed 's/lcl|NC\_.\*_cds_//g' cds.fa > cds4.fa
The resultant text file contains strings like XP_003862892.1_5067.
There are about 8014 strings like this ranging from XP_*.1_1 to XP_*.1_8014. I want to delete the _1 to _8014 part of the string and replace it with 1.
I tried using
sed 's/1\_./1/g'
and it seemed to have worked, however when I scrolled further down the list of strings, the double digit numbers didn't get replaced - only one of the digits was replaced, which immediately followed the '_', resulting in the first digit turning into 1 and the rest retaining their original identity. Same with triple and quadruple digit numbers.
eg:
XP_003857837.1_23 ---> XP_003857837.13
XP_003857942.1_228 ---> XP_003857942.128
I have absolutely no idea how to fix this; all my attempts have failed. Some people have asked what my desired output should look like. The ideal output would be XP_003857837.1, i.e. each string should end in .1 instead of .1_SomeNumberRangingFrom1to8014.
You can do everything in one go with a slightly more complex regex.
sed 's/lcl|NC_.*_cds_\(XP_[0-9.]*\)_.*/\1/' cds.fa > cds4.fa
The backslashed parentheses create a capturing group, and \1 in the replacement recalls the first captured group (\2 for the second, etc, if you have more than one). The regex inside the group looks for XP_ followed by digits and dots, and the expression after it matches the rest of the line from the next underscore on.
In other words, this basically says "replace the whole line with just the part we care about".
By the by, there is no reason to backslash underscores anywhere, and the /g option to the s command only makes sense when you want to replace multiple occurrences on the same input line.
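As a quick check of the one-pass command above, here is a small sketch that feeds it two of the sample lines from the question on stdin instead of reading cds.fa. The function name extract_xp is only illustrative.

```shell
# extract_xp is a hypothetical wrapper around the sed command shown above.
# In basic regular expressions the | after "lcl" is a literal character.
extract_xp() {
  sed 's/lcl|NC_.*_cds_\(XP_[0-9.]*\)_.*/\1/'
}

printf '%s\n' \
  'lcl|NC_018257.1_cds_XP_003862892.1_5067' \
  'lcl|NC_018237.1_cds_XP_003858978.1_1163' | extract_xp
# prints:
# XP_003862892.1
# XP_003858978.1
```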
Using sed
$ sed 's/.*_\?\(XP_[^.]*\.\)[^_]*_[0-9]\(.*\)/\11\2/'
XP_003862892.1067
XP_003859498.1683
XP_003862456.1633
XP_003858978.1163
XP_003861926.1104
XP_003857837.13
XP_003857942.128

Replace each column with different spacing using sed

I am trying to replace a different pattern for each column of my input file.
Input file
this- START
this-          START
Result I want
/this/ -START-
/this/ -START-
My code
sed 's|^\([a-zA-Z]*\)-\s\([a-zA-Z]*\)$|/\1/ -\2-|' inputfile
Output
/this/ -START-
this-          START
The first input works but the 2nd input with a huge amount of spaces does not. How can I deal with both of them using the same line of code?
sed uses POSIX Basic Regular Expressions, which are, like the name suggests, very basic, without a lot of the syntactical sugar or features of other RE packages you might be more used to. But they can still handle this:
$ cat input.txt
this- START
this-          START
$ sed 's!^\([a-zA-Z]*\)-[[:space:]]\{1,\}\([a-zA-Z]*\)$!/\1/ -\2-!' input.txt
/this/ -START-
/this/ -START-
The key here is the [[:space:]]\{1,\} portion: [:space:] inside a [] character class matches any whitespace character, like \s in other RE implementations, and \{1,\} matches 1 or more of the preceding atom, like + in pretty much every other flavor (which also support this notation, though without needing the backslashes). So combined it matches 1 or more whitespace characters. And since regular expressions are greedy, it matches the longest sequence of whitespace characters instead of stopping after seeing just one.
If you only have spaces, not spaces and/or tabs, between columns, it can be simplified to ' \{1,\}' (the quotes are only there to make the leading literal space visible). And you can use [[:alpha:]] instead of [a-zA-Z] to match all alphabetic characters, which makes a difference when matching non-English text. You might also want to use \{1,\} instead of * to avoid matching 0-length/missing columns if they can show up in your input.
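For readers who prefer the + notation, here is a sketch of the same substitution in extended regular expression syntax, assuming a sed that supports -E (both GNU and BSD sed do). The function name reformat_columns is made up, and the input lines are supplied with printf rather than read from a file.

```shell
# reformat_columns is a hypothetical name; -E selects extended regular
# expressions, so + and ( ) need no backslashes.
reformat_columns() {
  sed -E 's!^([[:alpha:]]+)-[[:space:]]+([[:alpha:]]+)$!/\1/ -\2-!'
}

# One space, many spaces, and a tab between the two columns.
printf 'this- START\nthis-          START\nthis-\tSTART\n' | reformat_columns
# each of the three lines prints: /this/ -START-
```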

Vim or sed : Replace character(s) within a pattern

I want to replace underscores with hyphens in all places where the character ('_') is preceded and followed by uppercase letters, e.g. QWQW_IOIO, OP_FD_GF_JK, TRT_JKJ, etc. The replacement is needed throughout one document.
I tried to replace this in vim using:
:%s/[A-Z]_[A-Z]/[A-Z]-[A-Z]/g
But that resulted in QWQW_IOIO with QWQ[A-Z]-[A-Z]OIO :(
I tried using a sed command:
sed -i '/[A-Z]_[A-Z]/ s/_/-/g' ./file_name
This resulted in replacement over the whole line, e.g. the line
QWQW_IOIO variable may contain '_' or '-'
was replaced by
QWQW-IOIO variable may contain '-' or '-'
You had the right idea with your first vim approach. But you need to use a capturing group to remember what character was found in the [A-Z] section. Those are nicely explained here and under :h /\1. As a side note, I would recommend using \u instead of [A-Z], since it is both shorter and faster. That means the solution you want is:
:%s/\(\u\)_\(\u\)/\1-\2/g
Or, if you would like to use the "very magic" setting (\v) to make it more readable:
:%s/\v(\u)_(\u)/\1-\2/g
Another option would be to limit the part of the search that gets replaced with the \zs and \ze atoms:
:%s/\u\zs_\ze\u/-/g
This is the shortest solution I'm aware of.
This should do what you want, assuming GNU sed.
sed -i -r -e 's/([A-Z]+)_([A-Z]+)/\1-\2/g' ./file_name
Explanation:
-r flag enables extended regex
[A-Z]+ is "one or more uppercase letters"
() groups a pattern together and creates a numbered memorized match
\1, \2 put those memorized matches in the replacement.
So basically this finds a chunk of uppercase letters followed by an underscore, followed by another chunk of uppercase letters, memorizes only the letter chunks as 2 groups,
([A-Z]+)_([A-Z]+)
Then it replays those groups, but with a hyphen in between instead of an underscore.
\1-\2
The g flag at the end says to do this even if the pattern shows up multiple times on one line.
Note that this falls apart a little in this case:
QWQW_IOIO_ABAB
Because it matches the first time, but not the second; the second part won't match because IOIO was consumed by the first match. So that would result in
QWQW-IOIO_ABAB
This version drops the + so it only matches one uppercase letter, and won't break in the same way:
sed -i -r -e 's/([A-Z])_([A-Z])/\1-\2/g'
It still has a small flaw, if you have a string like this:
A_B_C
Same issue as before, just with one letter now instead of multiple: the B is consumed by the first match, so _C is left alone and the result is A-B_C.
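One possible way around the consumed-letter problem is to repeat the substitution until nothing is left to replace. This is a sketch using sed's label and t (branch if a substitution was made) commands, not something from the answer above. The function name hyphenate is made up, and it filters stdin instead of editing a file with -i.

```shell
# hyphenate is a hypothetical name. ":a" defines a label, and "ta" jumps back
# to it whenever the preceding s command made a replacement, so the
# substitution is retried until no letter_letter pair is left on the line.
hyphenate() {
  sed -E -e ':a' -e 's/([A-Z])_([A-Z])/\1-\2/' -e 'ta'
}

echo 'A_B_C' | hyphenate                       # prints: A-B-C
echo "QWQW_IOIO may contain '_'" | hyphenate   # prints: QWQW-IOIO may contain '_'
```

The second call shows that an underscore without uppercase letters on both sides is still left untouched.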

Grep expression filter out lines of the form [alnum][punct][alnum]

Hi all my first post is for what I thought would be simple ...
I haven't been able to find an example of a similar problem/solution.
I have thousands of text files with thousands of lines of content in the form
<word><space><word><space><number>
Example:
example for 1
useful when 1
for. However 1
,boy wonder 1
,hary-horse wondered 2
In the above example I want to exclude line 3, as it contains internal punctuation.
I'm trying to use GNU grep 2.25 but am not having any luck.
My initial attempt was this (however, it does not allow the "-" internal to the pattern):
grep -v [:alnum:]*[:punct:]*[:alnum:]* filename
so I tried this:
grep -v [:alnum:]*[:space:]*[!]*["]*[#]*[$]*[%]*[&]*[']*[(]*[)]*[*]*[+]*[,]*[.]*[/]*[:]*[;]*[<]*[=]*[>]*[?]*[#]*[[]*[\]*[]]*[^]*[_]*[`]*[{]*[|]*[}]*[~]*[.]*[:space:]*[:alnum:]* filename
However, I need to allow for spaces and -, as these are acceptable inside the string.
I had been trying the [:punct:] set, but I now see that it contains -, so clearly that will not work.
I currently have a stored procedure in TSQL to process these, but I would prefer to preprocess the files before loading if possible, as the routine takes some seconds per file.
Has someone been able to achieve something similar?
On the face of it, you're looking for the 'word space word space number' schema, assuming 'word' is 'one alphanumeric optionally followed by zero or one occurrences of zero or more alphanumeric or punctuation characters and ending with an alphanumeric', and 'space' is 'one or more spaces' and 'number' is 'one or more digits'.
In terms of grep -E (aka egrep):
grep -E '[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:digit:]]+'
That contains:
[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?
That detects a word with any punctuation surrounded by alphanumerics, and:
[[:space:]]+
[[:digit:]]+
which look for one or more spaces or digits.
Using a mildly extended data file, this produces:
$ cat data
example for 1
useful when 1
for. However 1
,boy wonder 1
,hary-horse wondered 2
O'Reilly Books 23
Coelecanths, Dodos Etc 19
$ grep -E '[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:digit:]]+' data
example for 1
useful when 1
,boy wonder 1
,hary-horse wondered 2
O'Reilly Books 23
Coelecanths, Dodos Etc 19
$
It eliminates the for. However 1 line as required.
Your regex contains a long string of ordered optional elements, but that means it will fail if something happens out of order. For example,
[!]*[?]*
will capture !? but not ?! (and of course, a character class containing a single character is just equivalent to that single character, so you might as well say !*?*).
You can instead use a single character class which contains all of the symbols you want to catch. As soon as you see one next to an alphanumeric character, you are done, so you don't need for the regex to match the entire input line.
grep -v '[[:alnum:]][][!"#$%&'"'"'()*+,./:;<=>?#\^_`{|}~]' filename
Also notice how the expression needs to be in single quotes in order for the shell not to interfere with the many metacharacters here. In order for a single-quoted string to include a literal single quote, I temporarily break out into a double-quoted string; see here for an explanation (I call this "seesaw quoting").
In a character class, if the class needs to include ], it needs to be at the beginning of the enumerated list; for symmetry and idiom, I also moved [ next to it.
Moreover, as pointed out by Jonathan Leffler, a POSIX character class name needs to be inside a character class; so to match one character belonging to the [:alnum:] named set, you say [[:alnum:]]. (This means you can combine sets, so [-[:alnum:].] covers alphanumerics plus dash and period.)
If you need to constrain this to match only on the first field, change the [[:alnum:]] to ^[[:alnum:]]\+.
Not realizing that a*b*c* matches anything is a common newbie error. You want to avoid writing an expression where all elements are optional, because it will match every possible string. Focus on what you want to match (the long list of punctuation characters, in your case) and then maybe add optional bits of context around it if you really need to; but the fewer of these you need, the faster it will run, and the easier it will be to see what it does. As a quick rule of thumb, a*bc* is effectively precisely equivalent to just b -- leading or trailing optional expressions might as well not be specified, because they do not affect what is going to be matched.
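The point about all-optional patterns is easy to check with a small sketch. The helper name count_matches and the three sample lines (one of them empty) are made up for illustration.

```shell
# count_matches is a hypothetical helper: it prints how many of three sample
# lines (xyz, an empty line, abc) are matched by the given pattern.
# "|| true" only hides grep's non-zero exit status when the count is 0.
count_matches() {
  printf '%s\n' 'xyz' '' 'abc' | grep -c -- "$1" || true
}

count_matches 'a*b*c*'   # prints: 3 (every line matches, even the empty one)
count_matches 'a*bc*'    # prints: 1 (only the line containing b)
count_matches 'b'        # prints: 1 (same lines as a*bc*)
```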

Substitute `number` with `(number)` in multiple lines

I am a beginner at Vim and I've been reading about substitution but I haven't found an answer to this question.
Let's say I have some numbers in a file like so:
1
2
3
And I want to get:
(1)
(2)
(3)
I think the command should resemble something like :s:\d\+:........ Also, what's the difference between :s/foo/bar and :s:foo:bar ?
Thanks
Here is an alternative, slightly less verbose, solution:
:%s/^\d\+/(&)
Explanation:
^ anchors the pattern to the beginning of the line
\d is the atom that covers 0123456789
\+ matches one or more of the preceding item
& is a shorthand for \0, the whole match
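The & shorthand is not specific to Vim: in sed it also stands for the whole match. As a rough sketch, the same edit can be applied from the command line like this. The function name wrap_numbers is made up, and [0-9][0-9]* is used as the portable sed spelling of \d\+.

```shell
# wrap_numbers is a hypothetical name. It wraps a number at the start of each
# line in parentheses; & in the replacement is the whole matched text.
wrap_numbers() {
  sed 's/^[0-9][0-9]*/(&)/'
}

printf '1\n2\n30\n' | wrap_numbers
# prints:
# (1)
# (2)
# (30)
```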
Let me address those in reverse.
First, the second question: there's no difference between :s/foo/bar and :s:foo:bar; whatever delimiter you use after the s, vim will expect you to use from then on. This can be nice if you have a substitution involving lots of slashes, for instance.
As for the first question: to do this to the first number on the current line (assuming no commas, decimal places, etc), you could do
:s:\(\d\+\):(\1)
The \(...\) doesn't change what is matched - rather, it tells vim to remember whatever matched what is inside, and store it. The first \(...\) is stored in \1, the second in \2, etc. So, when you do the replacement, you can reference \1 to get the number back.
If you want to change ALL numbers on the current line, change it to
:s:\(\d\+\):(\1):g
If you want to change ALL numbers on ALL lines, change it to
:%s:\(\d\+\):(\1):g
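The effect of the trailing g can also be tried outside Vim. This sketch uses sed, whose s command treats capture groups and the g flag the same way for this purpose. The function names and the sample sentence are made up for illustration.

```shell
# wrap_first and wrap_all are hypothetical names. Both capture a run of
# digits in group 1; only wrap_all carries the g flag.
wrap_first() {
  sed 's/\([0-9][0-9]*\)/(\1)/'
}
wrap_all() {
  sed 's/\([0-9][0-9]*\)/(\1)/g'
}

echo '7 apples and 12 pears' | wrap_first   # prints: (7) apples and 12 pears
echo '7 apples and 12 pears' | wrap_all     # prints: (7) apples and (12) pears
```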
You can do what you want with:
:%s/\([0-9]\)/(\1)/
%s applies the substitution to every line in the file. The \( \) defines a group, which in turn is referenced by \1. So the above command finds the first digit ([0-9]) on each line and replaces it with that digit surrounded by parentheses. That is enough for the single-digit example in the question; for multi-digit numbers, use \([0-9]\+\) so that the whole number is captured.
