'StringCut' to the left or right of a defined position using Mathematica

On reading this question, I thought the following problem would be simple using StringSplit
Given the following string, I want to 'cut' it to the left of every "D" such that:
I get a List of fragments (with sequence unchanged)
StringJoin@fragments gives back the original string (but it does not matter if I have to reorder the fragments to obtain this). That is, the sequence within each fragment is important, and I do not want to lose any characters.
(The example I am interested in is a protein sequence (string) where each character represents an amino acid in one-letter code. I want to obtain the theoretical list of ALL fragments obtained by treating with an enzyme known to split before "D")
str = "MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"
The best I can come up with is to insert a space before each "D" using StringReplace and then use StringSplit. This seems quite awkward, to say the least.
frags1 = StringSplit@StringReplace[str, "D" -> " D"]
giving as output:
{"MTP", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"}
or, alternatively, using StringReplacePart:
frags1alt =
StringSplit@StringReplacePart[str, " D", StringPosition[str, "D"]]
Finally (and more realistically), if I want to split before "D" provided that the residue immediately preceding it is not "P" [i.e. P-D (Pro-Asp) bonds are not cleaved], I do it as follows:
StringSplit@StringReplace[str, (x_ /; x != "P") ~~ "D" -> x ~~ " D"]
Is there a more elegant way?
Speed is not necessarily an issue. I am unlikely to be dealing with strings of greater than, say, 500 characters. I am using Mma 7.
Update
I have added the bioinformatics tag, and I thought it might be of interest to add an example from that field.
The following imports a protein sequence (bovine serum albumin, accession number 3336842) from the NCBI database using eutils and then generates a (theoretical) trypsin digest. I have assumed that the enzyme trypsin cleaves between residues A1-A2 when A1 is either "R" or "K", provided that A2 is not "R", "K" or "P". If anyone has any suggestions for improvements, please feel free to suggest modifications.
Using a modification of sakra's method (a carriage return after '?db=' possibly needs to be removed):
StringJoin /@
Split[Characters[#],
And @@ Function[x, #1 != x] /@ {"R", "K"} ||
Or @@ Function[xx, #2 == xx] /@ {"R", "K", "P"} &] & @
StringJoin@
Rest@Import[
"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=\
protein&id=3336842&rettype=fasta&retmode=text", "Data"]
My possibly ham-fisted attempt at using the regex method (Sasha/WReach) to do the same thing:
StringSplit[#, RegularExpression["(?![PKR])(?<=[KR])"]] &@
StringJoin@Rest@Import[...]
Output
{MK,WVTFISLLLLFSSAYSR,GVFRR,<<69>>,CCAADDK,EACFAVEGPK,LVVSTQTALA}
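For comparison outside Mathematica, the same simplified cleavage rule can be sketched in Python. This is only an illustration and not part of the original post: the name `trypsin_digest` and the toy peptide are made up, and the rule encoded is just the one stated above (cut after "R" or "K" unless the next residue is "R", "K" or "P").

```python
import re

# Zero-width cut site: preceded by K or R, and not followed by K, R or P.
CUT_SITE = re.compile(r"(?<=[KR])(?![KRP])")

def trypsin_digest(sequence):
    """Return the fragments of `sequence` under the simplified rule above."""
    # Skip a site at the very end so that no empty fragment is produced.
    cuts = [m.start() for m in CUT_SITE.finditer(sequence)
            if 0 < m.start() < len(sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]

peptide = "AKRGKPLRDE"  # hypothetical toy peptide, not a real protein
fragments = trypsin_digest(peptide)
print(fragments)  # ['AKR', 'GKPLR', 'DE']
```

Joining the fragments gives back the input, which is the property asked for in the question.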

I cannot build anything much simpler than your code. Here is a regex version, which you might happen to like:
In[281]:= StringSplit@
StringReplace[str, RegularExpression["(?<!P)D"] -> " D"]
Out[281]= {"MTPDKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", \
"DYFRYLSEVASG", "DN"}
It uses a negative lookbehind pattern, borrowed from this site.
EDIT Adding WReach's cool solution:
In[2]:= StringSplit[str, RegularExpression["(?<!P)(?=D)"]]
Out[2]= {"MTPDKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", \
"DYFRYLSEVASG", "DN"}

Here are some alternate solutions:
Splitting by any occurrence of "D":
In[18]:= StringJoin /@ Split[Characters["MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"], #2!="D" &]
Out[18]= {"MTP", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"}
Splitting by any occurrence of "D" provided it is not preceded by "P":
In[19]:= StringJoin /@ Split[Characters["MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"], #2!="D" || #1=="P" &]
Out[19]= {"MTPDKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"}
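The idea behind Split (keep adjacent elements together while a pairwise test holds) is easy to reproduce in other languages. Here is a minimal Python sketch of it; `split_runs` is a hypothetical helper written for this illustration, not a library function.

```python
def split_runs(items, same_run):
    """Group consecutive items; start a new group when same_run(prev, cur) is false."""
    groups = []
    for item in items:
        if groups and same_run(groups[-1][-1], item):
            groups[-1].append(item)   # test passed: stay in the current run
        else:
            groups.append([item])     # test failed (or first item): new run
    return groups

seq = "MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"

# Same test as above: stay together unless the next residue is "D",
# except when the current residue is "P".
fragments = ["".join(g) for g in split_runs(seq, lambda a, b: b != "D" or a == "P")]
print(fragments)
```

The pairwise test only ever sees two neighbouring characters, which is why this works for "cut before X unless preceded by Y" rules but not for rules that need a longer context.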

Your first solution isn't that bad, is it? Everything that I can think of is longer or uglier than that. Is the problem that there might be spaces in the original string?
StringCases[str, "D" | StartOfString ~~ Longest[Except["D"] ..]]
or
Prepend["D" <> # & /@ Rest[StringSplit[str, "D"]], First[StringSplit[str, "D"]]]


Python 3.5: Is it possible to align punctuation (e.g. £, $) to the left side of a word using regex?

As part of my code, I need to align things like the pound sign to the left of a string. For example my code starts with:
"A price of £ 8 is roughly the same as $ 10.23!"
and needs to end with:
"A price of £8 is roughly the same as $10.23!"
I've created the following function to solve this however I feel that it is very inefficient and was wondering if there was a way to do this with regular expressions in Python?
for i in sentence:
    if i == "(" or i == "{" or i == "[" or i == "£" or i == "$":
        if i != len(sentence):
            corrected_sentence.append(" ")
        corrected_sentence.append(i)
    else:
        corrected_sentence.append(i)
What this does right now is go through the 'sentence' list, in which I have split up all of the words and punctuation, and then re-form it with each item followed by a space EXCEPT where the listed characters are used, adding the items to another list to be made into a single string again.
I only want to do this with the characters I have listed above (so I need to ignore things like full stops or exclamation marks etc).
Thanks!
I'm not sure what you want to do with the brackets, but from the description you can use a regex to find and replace whitespace preceded by the characters (lookbehind) and followed by a digit (lookahead).
>>> print(re.sub(r"(?<=[\{\[£\$])\s+(?=\d)", "", "A price of £ 8 is roughly the same as $ 10.23!"))
A price of £8 is roughly the same as $10.23!
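If the opening brackets from the question should be handled the same way, a variant along these lines might do. This is a sketch under assumptions: `attach_left` is a made-up name, and unlike the expression above it drops the digit lookahead, so whitespace after any of the listed characters is removed whatever follows.

```python
import re

# Opening brackets and currency symbols that should hug the token on their right.
LEFT_ATTACHED = re.compile(r"(?<=[(\[{£$])\s+")

def attach_left(text):
    """Remove whitespace that directly follows one of the listed characters."""
    return LEFT_ATTACHED.sub("", text)

print(attach_left("A price of £ 8 is roughly the same as $ 10.23!"))
print(attach_left("see ( note 1 ) and [ ref ]"))
```

Full stops, exclamation marks and closing brackets are not in the character class, so they are left alone.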

String searching in Rebol or Red

I'm interested in searching on a lot of long strings, to try and hack out a sed-like utility in rebol as a learning exercise. As a baby step I decided to search for a character:
>> STR: "abcdefghijklmopqrz"
>> pos: index? find STR "z"
== 18
>> pos
== 18
Great! Let's search for something else...
>> pos: index? find STR "n"
** Script Error: index? expected series argument of type: series port
** Where: halt-view
** Near: pos: index? find STR "n"
>> pos
== 18
What? :-(
Yeah, there was no "n" in the string I was searching. But what is the benefit of an interpreter blowing up instead of doing something sensible, such as returning a testable "null" char in pos?
I was told I should have done this:
>> if found? find STR "z" [pos: index? find STR "z"]
== 18
>> if found? find STR "n" [pos: index? find STR "n"]
== none
>> pos
== 18
Really? I have to search the string TWICE; the first time just to be sure it is "safe" to search AGAIN?
So I have a three-part question:
How would a wizard implement my search function? I presume there is a wizardly way better than this...
Is Red going to change this? Ideally I'd think find should return a valid string position or a NULL if it hits end of string (NULL delimited, may I presume?). The NULL is FALSE so that would set up for a really easy if test.
What is the most CPU-effective way to do a replace once I have a valid index? There appear to be so many choices in Rebol (a good thing) that it is possible to get stuck choosing, or stuck with a suboptimal choice.
I was told I should have done this:
>> if found? find STR "z" [pos: index? find STR "z"]
== 18
>> if found? find STR "n" [pos: index? find STR "n"]
== none
>> pos
== 18
Really? I have to search the string TWICE; the first time just to be sure it is "safe" to search AGAIN?
You certainly don't have to search the string twice. But index? (likely future name since it doesn't return a yes/no: index-of) doesn't return a NONE! value if given a NONE! input. It assumes the caller wants an integer position back and raises an error if it can't give you one.
How would a wizard implement my search function?
To eliminate the double search, you can use a short circuit evaluation...
>> all [pos: find STR "z" pos: index? pos]
== 18
>> pos
== 18
>> all [pos: find STR "n" pos: index? pos]
== none
>> pos
== none
But note that without introducing a second variable you will overwrite your previous pos. Let's say you call your variable index instead and pos is a temporary:
>> all [pos: find STR "z" index: index? pos]
== 18
>> index
== 18
>> all [pos: find STR "n" index: index? pos]
== none
>> index
== 18
The ability to throw set-words at arbitrary points in mid-expression is quite powerful, and it's why things like multiple initialization (a: b: c: 0) are not special features of the language, but something that falls out of the evaluator model.
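For readers more at home in Python, the same "search once, then test the result" pattern can be sketched as follows. This is only an analogy, not Rebol semantics: `find_index` is a hypothetical helper that returns a 1-based position (as index? does) or None, and the assignment expression needs Python 3.8 or later.

```python
def find_index(text, needle):
    """Return the 1-based position of `needle` in `text`, or None if absent."""
    pos = text.find(needle)          # single search; -1 means "not found"
    return pos + 1 if pos != -1 else None

STR = "abcdefghijklmopqrz"

index = find_index(STR, "z")         # found: index is 18

# Search once and only overwrite `index` when there is a hit.
if (hit := find_index(STR, "n")) is not None:
    index = hit                      # not reached: "n" is absent

print(index)                         # still 18
```

As in the second Rebol example, a failed search leaves the previously stored index untouched because the temporary (`hit` here, pos there) takes the miss.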
Is Red going to change this?
It's not likely that the benefits of index? (cough index-of) returning a NONE! value if given a NONE! input outweigh the problems it would cause by being so tolerant. It's always a balance.
Note that FIND does indeed behave as you expect. FOUND? is just a syntactic convenience that transforms a position found into a true value, and a NONE! returned into a false one. It is equivalent to calling TRUE? (but just a little more literate when reading). There is no need to use it in the condition of an IF or UNLESS or EITHER...as they will treat a NONE result as if it were false and any position as if it were true.
What is the most CPU effective way to do a replace once I have a valid index?
What would have been fastest would probably have been to have hung onto the position, and said change pos #"x". (Though internally "positions" are implemented by index plus series, and not an independent pointer. So the advantage is not that significant in micro-optimization world, where we're counting things like additions of offsets...)
As for which operation with an index: I'd say choose how you like it best and micro-optimize later.
I don't personally think STR/:index: #"x" looks all that great, but it's the briefest in characters.
STR/(index): #"x" does the same thing and looks better IMO. But at the cost of the source code structure blowing up a bit. That's a SET-PATH! series containing a PAREN! series followed by a CHAR!...all embedded in the original series "vector" that's holding the code. Under the hood there's going to be locality problems. And we know how important that is these days...
It's likely that the seemingly naive POKE is the fastest. poke STR index #"x". It may look like "4 elements instead of 2", but the "2 elements" of the path cases are an illusion.
In Rebol it's always a bit of a hard thing to guess, so you have to gather data. You can run some repeated iterative tests to find out. To time a block of code, see the builtin delta-time.
In Red the compiled forms should be equivalent, but if somehow this winds up being interpreted you'd probably have similar timings to Rebol.
No surprise that HostileFork's answer covers everything beautifully! +1
Just wanted to add an alternative solution to point 1 that I use regularly:
>> attempt [index? find STR "z"]
== 18
>> attempt [index? find STR "n"]
== none
Online documentation for Rebol 2 attempt & Rebol 3 attempt
Searching strings in Red/Rebol is very simple and convenient. About the issues you have encountered, let me unpack the details for you:
First of all, the interpreter is giving you a good hint about what you are doing wrong, in form of an error message: index? expected series argument of type: series port. This means that you used index? on the wrong datatype. How did that happen? Simply because the find function returns a none value in case the search fails:
>> str: "abcdefghijklmopqrz"
>> find str "o"
== "opqrz"
>> type? find str "o"
== string!
>> find str "n"
== none
>> type? find str "n"
== none!
So, using index? directly on the result of find is unsafe, unless you know that the search won't fail. If you need to extract the index information anyway, the safe approach is to test the result of find first:
>> all [pos: find str "o" index? pos]
== 14
>> all [pos: find str "n" index? pos]
== none
>> if pos: find str "o" [print index? pos]
14
>> print either pos: find str "n" [index? pos][-1]
-1
Those were just examples of safe ways to achieve it, depending on your needs. Note that none acts as false for conditional tests in if or either, so using found? in such cases is superfluous.
Now let's shed some light on the core issue that caused the confusion.
Rebol languages have a fundamental concept called a series, from which the string! datatype is derived. Understanding and properly using series is a key part of being able to use Rebol languages in an idiomatic way. Series look like the usual list and string-like datatypes in other languages, but they are not the same. A series is made of:
a list of values (for strings, it is a list of characters)
an implicit index (we can call it a cursor for the sake of simplicity)
The following description will only focus on strings, but the same rules apply to all series datatypes. I will use index? function in the examples below just to display the implicit index as an integer number.
By default, when you create a new string, the cursor is at head position:
>> s: "hello"
>> head? s
== true
>> index? s
== 1
But the cursor can be moved to point to other places in the string:
>> next s
== "ello"
>> skip s 3
== "lo"
>> length? skip s 3
== 2
As you can see, the string with a moved cursor is not only displayed from the cursor position, but also all the other string (or series) functions will take that position into account.
Additionally, you can also set the cursor for each reference pointing to the string:
>> a: next s
== "ello"
>> b: skip s 3
== "lo"
>> s: at s 5
== "o"
>> reduce [a b s]
== ["ello" "lo" "o"]
>> reduce [index? a index? b index? s]
== [2 4 5]
As you can see, you can have as many different references to a given string (or series) as you wish, each having its own cursor value, but all pointing to the same underlying list of values.
One important consequence of series properties: you do not need to rely on integer indexes to manipulate strings (and other series) like you would do in other languages, you can simply leverage the cursor which comes with any series reference to do whatever computation you need, and your code will be short, clean and very readable. Still, integer indexes can be useful sometimes on series, but you rarely need them.
Now let's go back to your use-case for searching in strings.
>> STR: "abcdefghijklmopqrz"
>> find STR "z"
== "z"
>> find STR "n"
== none
That is all you need, you do not have to extract the index position in order to use the resulting values for pretty much any computation you need to do.
>> pos: find STR "o"
>> if pos [print "found"]
found
>> print ["sub-string from `o`:" pos]
sub-string from `o`: opqrz
>> length? pos
== 5
>> index? pos
== 14
>> back pos
== "mopqrz"
>> skip pos 4
== "z"
>> pos: find STR "n"
>> print either pos ["found"]["not found"]
not found
>> print either pos [index? pos][-1]
-1
Here is a simple example to show how to do sub-string extraction without any explicit usage of integer indexes:
>> s: "The score is 1:2 after 5 minutes"
>> if pos: find/tail s "score is " [print copy/part pos find pos " "]
1:2
With a little practice (the console is great for such experimentation), you will see how much simpler and more efficient it is to rely fully on series in Rebol languages rather than on plain integer indexes.
Now, here is my take on your questions:
No wizardry required, just use series and find function adequately, as shown above.
Red is not going to change that. Series are a cornerstone of what makes Rebol languages simple and powerful.
change should be the fastest way. However, if you have many replacements to make on a long string, building a new string instead of changing the original one often performs better, as it avoids moving chunks of memory around when the replacement strings are not the same size as the parts they replace.

Split string into individual characters

I am having two problems while working in Lisp and I can't find any tutorials or sites that explain this. How do you split up a string into its individual characters? And how would I be able to change those characters into their corresponding ASCII values? If anyone knows any sites or tutorial videos explaining these, they would be greatly appreciated.
CL-USER 87 > (coerce "abc" 'list)
(#\a #\b #\c)
CL-USER 88 > (map 'list #'char-code "abc")
(97 98 99)
Get the Common Lisp Quick Reference.
A Lisp string is already split into its characters, in a way. It is a vector of characters, and depending upon what you need to do, you can use either whole string operations on it, or any operations applicable to vectors (like all the operations of the sequence protocol) to handle the individual characters.
split-string splits string into substrings based on the regular expression separators
Each match for separators defines a splitting point; the substrings between splitting points are made into a list, which is returned. If omit-nulls is nil (or omitted), the result contains null strings whenever there are two consecutive matches for separators, or a match is adjacent to the beginning or end of string. If omit-nulls is t, these null strings are omitted from the result. If separators is nil (or omitted), the default is the value of split-string-default-separators.
As a special case, when separators is nil (or omitted), null strings are always omitted from the result. Thus:
(split-string " two words ") -> ("two" "words")
The result is not ("" "two" "words" ""), which would rarely be useful. If you need such a result, use an explicit value for separators:
(split-string " two words " split-string-default-separators) -> ("" "two" "words" "")
More examples:
(split-string "Soup is good food" "o") -> ("S" "up is g" "" "d f" "" "d")
(split-string "Soup is good food" "o" t) -> ("S" "up is g" "d f" "d")
(split-string "Soup is good food" "o+") -> ("S" "up is g" "d f" "d")
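To see the effect of omit-nulls outside Lisp, here is a rough Python sketch that mimics the three examples above with re.split. It is an approximation only: `split_string` is a hypothetical wrapper, Python's regex syntax is not the same as Emacs's, and the special case for omitted separators is not modelled.

```python
import re

def split_string(string, separators, omit_nulls=False):
    """Split on a regex; optionally drop the empty strings between matches."""
    parts = re.split(separators, string)
    return [p for p in parts if p] if omit_nulls else parts

print(split_string("Soup is good food", "o"))        # keeps the empty strings
print(split_string("Soup is good food", "o", True))  # empty strings dropped
print(split_string("Soup is good food", "o+"))       # runs of "o" match once
```

The second and third calls give the same list here, but for different reasons: one filters out the empty pieces afterwards, the other never produces them because "oo" is consumed as a single separator.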
You can also use elt or aref to get specific characters out of a string.
One of the best sites for an in-depth introduction to Common Lisp is the site for the Practical Common Lisp book (link to the section on numbers, chars and strings). The whole book is available online for free. Check it out.

Any other ways to emulate `tr` in J?

I picked up J a few weeks ago, about the same time the CodeGolf.SE beta opened to the public.
A recurrent issue (of mine) when using J over there is reformatting input and output to fit the problem specifications. So I tend to use code like this:
( ] ` ('_'"0) ) @. (= & '-')
This one is untested for various reasons (edit me if wrong); the intended meaning is "convert - to _". Other conversions come up frequently too: newlines to spaces (and the converse), merging numbers with j, changing brackets.
This takes up quite a few characters, and is not that convenient to integrate to the rest of the program.
Is there any other way to proceed with this? Preferably shorter, but I'm happy to learn anything else if it's got other advantages. Also, a solution with an implied functional obverse would relieve a lot.
It sometimes goes against the nature of code golf to use library methods, but in the string library, the charsub method is pretty useful:
'_-' charsub '_123'
-123
('_-', LF, ' ') charsub '_123', LF, '_stuff'
-123 -stuff
rplc is generally short for simple replacements:
'Test123' rplc 'e';'3'
T3st123
Amend m} is very short for special cases:
'*' 0} 'aaaa'
*aaa
'*' 0 2} 'aaaa'
*a*a
'*&' 0 2} 'aaaa'
*a&a
but becomes messy when the list has to be a verb:
b =: 'abcbdebf'
'L' (]g) } b
aLcLdeLf
where g has to be something like g =: ('b' E. ]) # ('b' E. ]) * [: i. #.
There are a lot of other "tricks" that work on a case by case basis. Example from the manual:
To replace lowercase 'a' through 'f' with uppercase 'A'
through 'F' in a string that contains only 'a' through 'f':
('abcdef' i. y) { 'ABCDEF'
Extending the previous example: to replace lowercase 'a' through
'f' with uppercase 'A' through 'F' leaving other characters unchanged:
(('abcdef' , a.) i. y) { 'ABCDEF' , a.
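The lookup-table idea in these two manual examples is the same one behind tr itself. For comparison, here is how it might look in Python using the built-in translation tables; `tr` is a hypothetical wrapper name, and unlike the real tr it simply rejects source and target sets of different lengths.

```python
def tr(text, source, target):
    """Map each character of `source` to the one at the same position in `target`."""
    if len(source) != len(target):
        raise ValueError("source and target must have the same length")
    # Characters not listed in `source` are left unchanged.
    return text.translate(str.maketrans(source, target))

print(tr("deadbeef xyz", "abcdef", "ABCDEF"))   # DEADBEEF xyz
print(tr("_123\n_stuff", "_\n", "- "))          # -123 -stuff
```

Swapping the `source` and `target` arguments gives the inverse mapping whenever the two sets have no characters in common.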
I've only dealt with the newlines and CSV, rather than the general case of replacement, but here's how I've handled those. I assume Unix line endings (or line endings fixed with toJ) with a final line feed.
Single lines of input: ".{:('1 2 3',LF) (Haven't gotten to use this yet)
Rectangular input: (".;._2) ('1 2 3',LF,'4 5 6',LF)
Ragged input: probably (,;._2) or (<;._2) (Haven't used this yet either.)
One line, comma separated: ".;._1}:',',('1,2,3',LF)
This doesn't replace tr at all, but does help with line endings and other garbage.
You might want to consider using the 8!:2 foreign:
8!:2]_1
-1

How to parse a string (by a "new" markup) with R?

I want to use R to do string parsing that (I think) is like a simplistic HTML parsing.
For example, let's say we have the following two variables:
Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
Say that I want to parse "Seq" According to "Str", by using the legend here
Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA
Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
     |     |  |              | |               |     |               ||     |
     +-----+  +--------------+ +---------------+     +---------------++-----+
        |          Stem 1           Stem 2                Stem 3         |
        |                                                                |
        +----------------------------------------------------------------+
                                      Stem 0
Assume that we always have 4 stems (0 to 3), but that the number of letters before and after each of them can vary.
The output should be something like the following list structure:
list(
"Stem 0 opening" = "GCCTCGA",
"before Stem 1" = "TA",
"Stem 1" = list(opening = "GCTC",
inside = "AGTTGGGA",
closing = "GAGC"
),
"between Stem 1 and 2" = "G",
"Stem 2" = list(opening = "TACGA",
inside = "CTGAAGA",
closing = "TCGTA"
),
"between Stem 2 and 3" = "AGGtC",
"Stem 3" = list(opening = "ACCAG",
inside = "TTCGATC",
closing = "CTGGT"
),
"After Stem 3" = "",
"Stem 0 closing" = "TCGGGGC"
)
I don't have any experience with programming a parser, and would like advice on what strategy to use when programming something like this (and any recommended R commands to use).
What I was thinking of is to first get rid of the "Stem 0", then go through the inner string with a recursive function (let's call it "seperate.stem") that each time will split the string into:
1. before stem
2. opening stem
3. inside stem
4. closing stem
5. after stem
Where the "after stem" will then be recursively entered into the same function ("seperate.stem")
The thing is that I am not sure how to try and do this coding without using a loop.
Any advice is most welcome.
Update: someone sent me a bunch of questions; here they are.
Q: Does each sequence have the same number of ">>>>" for the opening sequence as it does for "<<<<" on the ending sequence?
A: Yes
Q: Does the parsing always start with a partial stem 0 as your example shows?
A: No. Sometimes it will start with a few "."
Q: Is there a way of making sure you have the right sequences when you start?
A: I am not sure I understand what you mean.
Q: Is there a chance of error in the middle of the string that you have to restart from?
A: Sadly, yes. In which case, I'll need to ignore one of the inner stems...
Q: How long are these strings that you want to parse?
A: Each string has between 60 to 150 characters (and I have tens of thousands of them...)
Q: Is each one a self contained sequence like you show in your example, or do they go on for thousands of characters?
A: each sequence is self contained.
Q: Is there always at least one '.' between stems?
A: No.
Q: A full set of rules as to how the parsing should be done would be useful.
A: I agree. But since I don't have even a basic idea on how to start coding this, I thought first to have some help on the beginning and try to tweak with the other cases that will come up before turning back for help.
Q: Do you have the BNF syntax for parsing?
A: No. Your e-mail is the first time I came across it (http://en.wikipedia.org/wiki/Backus–Naur_Form).
You can simplify the task by using run length encoding.
First, convert Str to be a vector of individual characters, then call rle.
split_Str <- strsplit(Str, "")[[1]]
rle_Str <- rle(split_Str)
Run Length Encoding
lengths: int [1:14] 7 2 4 8 4 1 5 7 5 5 ...
values : chr [1:14] ">" "." ">" "." "<" "." ">" "." "<" "." ">" "." "<" "."
Now you just need to parse rle_Str$values, which is perhaps simpler. For instance, an inner stem will always look like ">" "." "<".
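The same run-length idea can be sketched in Python, going one step further and slicing Seq with the run lengths. This is an illustration only (the variable names are mine, and it does not yet decide which runs belong to which stem).

```python
from itertools import groupby

seq = "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
structure = ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# Run-length encode the structure string: [(symbol, run_length), ...]
runs = [(symbol, len(list(group))) for symbol, group in groupby(structure)]

# Cut the sequence into pieces with the same run lengths.
segments = []
start = 0
for symbol, length in runs:
    segments.append((symbol, seq[start:start + length]))
    start += length

for symbol, piece in segments:
    print(symbol, piece)
```

Note that the closing run of Stem 3 and the closing run of Stem 0 are adjacent, so they come out as one long "<" run; it has to be divided using the lengths of the matching ">" runs.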
I think the main thing that you need to think about is the structure of the data. Does a "." always have to come between ">" and "<", or is it optional? Can you have a "." at the start? Do you need to be able to generalise to stems within stems within stems, or even more complex structures?
Once you have this solved, constructing your list output should be straightforward.
Also, don't worry about using loops, they are in the language because they are useful. Get the thing working first, then worry about speed optimisations (if you really have to) afterwards.
