How to parse a string (by a "new" markup) with R? - string

I want to use R to do string parsing that (I think) is like a simplistic HTML parsing.
For example, let's say we have the following two variables:
Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
Say that I want to parse "Seq" According to "Str", by using the legend here
Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA
Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
| | | | | | | || |
+-----+ +--------------+ +---------------+ +---------------++-----+
| Stem 1 Stem 2 Stem 3 |
| |
+----------------------------------------------------------------+
Stem 0
Assume that we always have 4 stems (0 to 3), but that the length of letters before and after each of them can very.
The output should be something like the following list structure:
list(
"Stem 0 opening" = "GCCTCGA",
"before Stem 1" = "TA",
"Stem 1" = list(opening = "GCTC",
inside = "AGTTGGGA",
closing = "GAGC"
),
"between Stem 1 and 2" = "G",
"Stem 2" = list(opening = "TACGA",
inside = "CTGAAGA",
closing = "TCGTA"
),
"between Stem 2 and 3" = "AGGtC",
"Stem 3" = list(opening = "ACCAG",
inside = "TTCGATC",
closing = "CTGGT"
),
"After Stem 3" = "",
"Stem 0 closing" = "TCGGGGC"
)
I don't have any experience with programming a parser, and would like advices as to what strategy to use when programming something like this (and any recommended R commands to use).
What I was thinking of is to first get rid of the "Stem 0", then go through the inner string with a recursive function (let's call it "seperate.stem") that each time will split the string into:
1. before stem
2. opening stem
3. inside stem
4. closing stem
5. after stem
Where the "after stem" will then be recursively entered into the same function ("seperate.stem")
The thing is that I am not sure how to try and do this coding without using a loop.
Any advices will be most welcomed.
Update: someone sent me a bunch of question, here they are.
Q: Does each sequence have the same number of ">>>>" for the opening sequence as it does for "<<<<" on the ending sequence?
A: Yes
Q: Does the parsing always start with a partial stem 0 as your example shows?
A: No. Sometimes it will start with a few "."
Q: Is there a way of making sure you have the right sequences when you start?
A: I am not sure I understand what you mean.
Q: Is there a chance of error in the middle of the string that you have to restart from?
A: Sadly, yes. In which case, I'll need to ignore one of the inner stems...
Q: How long are these strings that you want to parse?
A: Each string has between 60 to 150 characters (and I have tens of thousands of them...)
Q: Is each one a self contained sequence like you show in your example, or do they go on for thousands of characters?
A: each sequence is self contained.
Q: Is there always at least one '.' between stems?
A: No.
Q: A full set of rules as to how the parsing should be done would be useful.
A: I agree. But since I don't have even a basic idea on how to start coding this, I thought first to have some help on the beginning and try to tweak with the other cases that will come up before turning back for help.
Q: Do you have the BNF syntax for parsing?
A: No. Your e-mail is the first time I came across it (http://en.wikipedia.org/wiki/Backus–Naur_Form).

You can simplify the task by using run length encoding.
First, convert Str to be a vector of individual characters, then call rle.
split_Str <- strsplit(Str, "")[[1]]
rle_Str <- rle(split_Str)
Run Length Encoding
lengths: int [1:14] 7 2 4 8 4 1 5 7 5 5 ...
values : chr [1:14] ">" "." ">" "." "<" "." ">" "." "<" "." ">" "." "<" "."
Now you just need to parse rle_Str$values, which is perhaps simpler. For instance, an inner stem will always look like ">" "." "<".
I think the main thing that you need to think about is the structure of the data. Does a "." always have to come between ">" and "<", or is it optional? Can you have a "." at the start? Do you need to be able to generalise to stems within stems within stems, or even more complex structures?
Once you have this solved, contructing your list output should be straightforward.
Also, don't worry about using loops, they are in the language because they are useful. Get the thing working first, then worry about speed optimisations (if you really have to) afterwards.

Related

Python ord() and chr()

I have:
txt = input('What is your sentence? ')
list = [0]*128
for x in txt:
list[ord(x)] += 1
for x in list:
if x >= 1:
print(chr(list.index(x)) * x)
As per my understanding this should just output every letter in a sentence like:
))
111
3333
etc.
For the string "aB)a2a2a2)" the output is correct:
))
222
B
aaaa
For the string "aB)a2a2a2" the output is wrong:
)
222
)
aaaa
I feel like all my bases are covered but I'm not sure what's wrong with this code.
When you do list.index(x), you're searching the list for the first index that value appears. That's not actually what you want though, you want the specific index of the value you just read, even if the same value occurs somewhere else earlier in the list too.
The best way to get indexes along side values from a sequence is with enuemerate:
for i, x in enumerate(list):
if x >= 1:
print(chr(i) * x)
That should get you the output you want, but there are several other things that would make your code easier to read and understand. First of all, using list as a variable name is a very bad idea, as that will shadow the builtin list type's name in your namespace. That makes it very confusing for anyone reading your code, and you even confuse yourself if you want to use the normal list for some purpose and don't remember you've already used it for a variable of your own.
The other issue is also about variable names, but it's a bit more subtle. Your two loops both use a loop variable named x, but the meaning of the value is different each time. The first loop is over the characters in the input string, while the latter loop is over the counts of each character. Using meaningful variables would make things a lot clearer.
Here's a combination of all my suggested fixes together:
text = input('What is your sentence? ')
counts = [0]*128
for character in text:
counts[ord(character)] += 1
for index, count in enumerate(counts):
if count >= 1:
print(chr(index) * count)

How to compare a string in nodejs?

I have a string which will be in DB like below.
let text = "Hi {#var#}, ghkk{#Var#}"
But from user we will get the value like below.
let text2 = "Hi xya, ghkkhoww"
It can be upto 10 variables. I want to compare both text and text2 and in place of {#var}, if some value is there, then the string is equal, else it is not. Can anybody please give me an idea how to compare this ?
There are a couple of ways here are a few:
String(var1).includes(var2) | Determine if var1 includes var2
String(var1).indexOf(var2) | returns the index if var2 in var1
String(var1).search(var2) | searches for var2 in var1 (best for paragraphs or sentences)
String(var1).equals(var2) | do both strings equal one-another
String(var1).compareTo(var3) | are these strings comparable
var == var2 | do they equal note do not use === which is know as the equivalence operator.
As you can see its completely dependent on the use case. This is a standard problem in CS but, the common rule of thumb would be using the ==, indexOf, and equals. SideBar I guess you could use Regular expressions, commonly known as Regex. It's pretty powerful, with a small learning curve but, you can easily get expressions wrong and cause an edge case where your code is letting scenarios sneak by which can be difficult to find if you don't know what your looking for or how its affecting your code.
Resources:
https://www.w3schools.com/js/js_string_methods.asp

String Operations Confusion? ELI5

I'm extremely new to python and I have no idea why this code gives me this output. I tried searching around for an answer but couldn't find anything because I'm not sure what to search for.
An explain-like-I'm-5 explanation would be greatly appreciated
astring = "hello world"
print(astring[3:7:2])
This gives me : "l"
Also
astring = "hello world"
print(astring[3:7:3])
gives me : "lw"
I can't wrap my head around why.
This is string slicing in python.
Slicing is similar to regular string indexing, but it can return a just a section of a string.
Using two parameters in a slice, such as [a:b] will return a string of characters, starting at index a up to, but not including, index b.
For example:
"abcdefg"[2:6] would return "cdef"
Using three parameters performs a similar function, but the slice will only return the character after a chosen gap. For example [2:6:2] will return every second character beginning at index 2, up to index 5.
ie "abcdefg"[2:6:2] will return ce, as it only counts every second character.
In your case, astring[3:7:3], the slice begins at index 3 (the second l) and moves forward the specified 3 characters (the third parameter) to w. It then stops at index 7, returning lw.
In fact when using only two parameters, the third defaults to 1, so astring[2:5] is the same as astring[2:5:1].
Python Central has some more detailed explanations of cutting and slicing strings in python.
I have a feeling you are over complicating this slightly.
Since the string astring is set statically you could more easily do the following:
# Sets the characters for the letters in the consistency of the word
letter-one = "h"
letter-two = "e"
letter-three = "l"
letter-four = "l"
letter-six = "o"
letter-7 = " "
letter-8 = "w"
letter-9 = "o"
letter-10 = "r"
letter11 = "l"
lettertwelve = "d"
# Tells the python which of the character letters that you want to have on the print screen
print(letter-three + letter-7 + letter-three)
This way its much more easily readable to human users and it should mitigate your error.

String searching in Rebol or Red

I'm interested in searching on a lot of long strings, to try and hack out a sed-like utility in rebol as a learning exercise. As a baby step I decided to search for a character:
>> STR: "abcdefghijklmopqrz"
>> pos: index? find STR "z"
== 18
>> pos
== 18
Great! Let's search for something else...
>> pos: index? find STR "n"
** Script Error: index? expected series argument of type: series port
** Where: halt-view
** Near: pos: index? find STR "n"
>> pos
== 18
What? :-(
Yeah, there was no "n" in the string I was searching. But what is the benefit of an interpreter blowing up instead of doing something sensible, such as returning a testable "null" char in pos?
I was told I should have done this:
>> if found? find STR "z" [pos: index? find STR "z"]
== 18
>> if found? find STR "n" [pos: index? find STR "n"]
== none
>> pos
== 18
Really? I have to search the string TWICE; the first time just to be sure it is "safe" to search AGAIN?
So I have a three-part question:
How would a wizard implement my search function? I presume there is a wizardly better way better than this....
Is Red going to change this? Ideally I'd think find should return a valid string position or a NULL if it hits end of string (NULL delimited, may I presume?). The NULL is FALSE so that would set up for a really easy if test.
What is the most CPU effective way to do a replace once I have a valid index? There appear to so many choices in Rebol (a good thing) that it is possible to get stuck in choosing or stuck in a suboptimal choice.
I was told I should have done this:
>> if found? find STR "z" [pos: index? find STR "z"]
== 18
>> if found? find STR "n" [pos: index? find STR "n"]
== none
>> pos
== 18
Really? I have to search the string TWICE; the first time just to be sure it is "safe" to search AGAIN?
You certainly don't have to search the string twice. But index? (likely future name since it doesn't return a yes/no: index-of) doesn't return a NONE! value if given a NONE! input. It assumes the caller wants an integer position back and raises an error if it can't give you one.
How would a wizard implement my search function?
To eliminate the double search, you can use a short circuit evaluation...
>> all [pos: find STR "z" pos: index? pos]
== 18
>> pos
== 18
>> all [pos: find STR "n" pos: index? pos]
== none
>> pos
== none
But note that without introducing a second variable you will overwrite your previous pos. Let's say you call your variable index instead and pos is a temporary:
>> all [pos: find STR "z" index: index? pos]
== 18
>> index
== 18
>> all [pos: find STR "n" index: index? pos]
== none
>> index
== 18
The ability to throw set-words at arbitrary points in mid-expression is quite powerful, and it's why things like multiple initialization (a: b: c: 0) are not special features of the language, but something that falls out of the evaluator model.
Is Red going to change this?
It's not likely that the benefits of index? (cough index-of) returning a NONE! value if given a NONE! input outweigh the problems it would cause by being so tolerant. It's always a balance.
Note that FIND does indeed behave as you expect. FOUND? is just a syntactic convenience that transforms a position found into a true value, and a NONE! returned into a false one. It is equivalent to calling TRUE? (but just a little more literate when reading). There is no need to use it in the condition of an IF or UNLESS or EITHER...as they will treat a NONE result as if it were false and any position as if it were true.
What is the most CPU effective way to do a replace once I have a valid index?
What would have been fastest would probably have been to have hung onto the position, and said change pos #"x". (Though internally "positions" are implemented by index plus series, and not an independent pointer. So the advantage is not that significant in micro-optimization world, where we're counting things like additions of offsets...)
As for which operation with an index: I'd say choose how you like it best and micro-optimize later.
I don't personally think STR/:index: #"x" looks all that great, but it's the briefest in characters.
STR/(index): #"x" does the same thing and looks better IMO. But at the cost of the source code structure blowing up a bit. That's a SET-PATH! series containing a PAREN! series followed by a CHAR!...all embedded in the original series "vector" that's holding the code. Under the hood there's going to be locality problems. And we know how important that is these days...
It's likely that the seemingly naive POKE is the fastest. poke STR index #"x". It may look like "4 elements instead of 2", but the "2 elements" of the path cases are an illusion.
In Rebol it's always a bit of a hard thing to guess, so you have to gather data. You can run some repeated iterative tests to find out. To time a block of code, see the builtin delta-time.
In Red the compiled forms should be equivalent, but if somehow this winds up being interpreted you'd probably have similar timings to Rebol.
No surprises that HostileFork answer covers everything beautifully! +1
Just wanted to add an alternative solution to point 1 that i use regularly:
>> attempt [index? find STR "z"]
== 18
>> attempt [index? find STR "n"]
== none
Online documentation for Rebol 2 attempt & Rebol 3 attempt
Searching strings in Red/Rebol is very simple and convenient. About the issues you have encountered, let me unpack the details for you:
First of all, the interpreter is giving you a good hint about what you are doing wrong, in form of an error message: index? expected series argument of type: series port. This means that you used index? on the wrong datatype. How did that happen? Simply because the find function returns a none value in case the search fails:
>> str: "abcdefghijklmopqrz"
>> find str "o"
== "pqrz"
>> type? find str "o"
== string!
>> find str "n"
== none
>> type? find str "n"
== none!
So, using index? directly on the result of find is unsafe, unless you know that the search won't fail. If you need to extract the index information anyway, the safe approach is to test the result of find first:
>> all [pos: find str "o" index? pos]
== 14
>> all [pos: find str "n" index? pos]
== none
>> if pos: find str "o" [print index? pos]
== 14
>> print either pos: find str "n" [index? pos][-1]
== -1
Those were just examples of safe ways to achieve it, depending on your needs. Note that none acts as false for conditional tests in if or either, so that using found? in such case, is superfluous.
Now let's shed some lights on the core issue which brought confusion to you.
Rebol languages have a fundamental concept called a series from which string! datatype is derived. Understanding and using properly series is a key part of being able to use Rebol languages in an idiomatic way. Series look like usual lists and string-like datatypes in other languages, but they are not the same. A series is made of:
a list of values (for strings, it is a list of characters)
a implicit index (we can call it a cursor for sake of simplicity)
The following description will only focus on strings, but the same rules apply to all series datatypes. I will use index? function in the examples below just to display the implicit index as an integer number.
By default, when you create a new string, the cursor is at head position:
>> s: "hello"
>> head? s
== true
>> index? s
== 1
But the cursor can be moved to point to other places in the string:
>> next s
== "ello"
>> skip s 3
== "lo"
>> length? skip s 3
== 2
As you can see, the string with a moved cursor is not only displayed from the cursor position, but also all the other string (or series) functions will take that position into account.
Additionally, you can also set the cursor for each reference pointing to the string:
>> a: next s
== "ello"
>> b: skip s 3
== "lo"
>> s: at s 5
== "o"
>> reduce [a b s]
== ["ello" "lo" "o"]
>> reduce [index? a index? b index? s]
== [2 4 5]
As you can see, you can have as many different references to a given string (or series) as you wish, each having its own cursor value, but all pointing to the same underlying list of values.
One important consequence of series properties: you do not need to rely on integer indexes to manipulate strings (and other series) like you would do in other languages, you can simply leverage the cursor which comes with any series reference to do whatever computation you need, and your code will be short, clean and very readable. Still, integer indexes can be useful sometimes on series, but you rarely need them.
Now let's go back to your use-case for searching in strings.
>> STR: "abcdefghijklmopqrz"
>> find STR "z"
== "z"
>> find STR "n"
== none
That is all you need, you do not have to extract the index position in order to use the resulting values for pretty much any computation you need to do.
>> pos: find STR "o"
>> if pos [print "found"]
found
>> print ["sub-string from `o`:" pos]
sub-string from `o`: opqrz
>> length? pos
== 5
>> index? pos
== 14
>> back pos
== "mopqrz"
>> skip pos 4
== "z"
>> pos: find STR "n"
>> print either pos ["found"]["not found"]
not found
>> print either pos [index? pos][-1]
-1
Here is a simple example to show how to do sub-string extraction without any explicit usage of integer indexes:
>> s: "The score is 1:2 after 5 minutes"
>> if pos: find/tail s "score is " [print copy/part pos find pos " "]
1:2
With a little practice (the console is great for such experimentations), you will see how simpler and more efficient it is to rely fully on series in Rebol languages than just plain integer indexes.
Now, here is my take on your questions:
No wizardry required, just use series and find function adequately, as shown above.
Red is not going to change that. Series are a cornerstone of what makes Rebol languages simple and powerful.
change should be the fastest way, though, if you have many replacements to operate on a long string, reconstructing a new string instead of changing the original one, leads often to better performances, as it would avoid moving memory chunks around when replacement strings are not of same size as the part they replace.

Standard ML string to a list

Is there a way in ML to take in a string and output a list of those string where a separation is a space, newline or eof, but also keeping strings inside strings intact?
EX) hello world "my id" is 5555
-> [hello, world, my id, is, 5555]
I am working on a tokenizing these then into:
->[word, word, string, word, int]
Sure you can! Here's the idea:
If we take a string like "Hello World, \"my id\" is 5555", we can split it at the quote marks, ignoring the spaces for now. This gives us ["Hello World, ", "my id", " is 5555"]. The important thing to notice here is that the list contains three elements - an odd number. As long as the string only contains pairs of quotes (as it will if it's properly formatted), we'll always get an odd number of elements when we split at the quote marks.
A second important thing is that all the even-numbered elements of the list will be strings that were unquoted (if we start counting from 0), and the odd-numbered ones were quoted. That means that all we need to do is tokenize the ones that were unquoted, and then we're done!
I put some code together - you can continue from there:
fun foo s =
let
val quoteSep = String.tokens (fn c => c = #"\"") s
val spaceSep = String.tokens (fn c => c = #" ") (* change this to include newlines and stuff *)
fun sepEven [] = []
| sepEven [x] = (* there were no quotes in the string *)
| sepEven (x::y::xs) = (* x was unquoted, y was quoted *)
in
if length quoteSep mod 2 = 0
then (* there was an uneven number of quote marks - something is wrong! *)
else (* call sepEven *)
end
String.tokens brings you halfway there. But if you really want to handle quotes like you are sketching then there is no way around writing an actual lexer. MLlex, which comes with SML/NJ and MLton (but is usable with any SML) could help. Or you just write it by hand, which should be easy enough in this case as well.

Resources