String searching in Rebol or Red

I'm interested in searching on a lot of long strings, to try and hack out a sed-like utility in rebol as a learning exercise. As a baby step I decided to search for a character:
>> STR: "abcdefghijklmopqrz"
>> pos: index? find STR "z"
== 18
>> pos
== 18
Great! Let's search for something else...
>> pos: index? find STR "n"
** Script Error: index? expected series argument of type: series port
** Where: halt-view
** Near: pos: index? find STR "n"
>> pos
== 18
What? :-(
Yeah, there was no "n" in the string I was searching. But what is the benefit of an interpreter blowing up instead of doing something sensible, such as returning a testable "null" char in pos?
I was told I should have done this:
>> if found? find STR "z" [pos: index? find STR "z"]
== 18
>> if found? find STR "n" [pos: index? find STR "n"]
== none
>> pos
== 18
Really? I have to search the string TWICE; the first time just to be sure it is "safe" to search AGAIN?
So I have a three-part question:
How would a wizard implement my search function? I presume there is a wizardly way that is better than this...
Is Red going to change this? Ideally I'd think find should return a valid string position or a NULL if it hits end of string (NULL delimited, may I presume?). The NULL is FALSE so that would set up for a really easy if test.
What is the most CPU effective way to do a replace once I have a valid index? There appear to be so many choices in Rebol (a good thing) that it is possible to get stuck in choosing or stuck in a suboptimal choice.

I was told I should have done this:
>> if found? find STR "z" [pos: index? find STR "z"]
== 18
>> if found? find STR "n" [pos: index? find STR "n"]
== none
>> pos
== 18
Really? I have to search the string TWICE; the first time just to be sure it is "safe" to search AGAIN?
You certainly don't have to search the string twice. But index? (likely future name since it doesn't return a yes/no: index-of) doesn't return a NONE! value if given a NONE! input. It assumes the caller wants an integer position back and raises an error if it can't give you one.
How would a wizard implement my search function?
To eliminate the double search, you can use a short circuit evaluation...
>> all [pos: find STR "z" pos: index? pos]
== 18
>> pos
== 18
>> all [pos: find STR "n" pos: index? pos]
== none
>> pos
== none
But note that without introducing a second variable you will overwrite your previous pos. Let's say you call your variable index instead and pos is a temporary:
>> all [pos: find STR "z" index: index? pos]
== 18
>> index
== 18
>> all [pos: find STR "n" index: index? pos]
== none
>> index
== 18
The ability to throw set-words at arbitrary points in mid-expression is quite powerful, and it's why things like multiple initialization (a: b: c: 0) are not special features of the language, but something that falls out of the evaluator model.
Is Red going to change this?
It's not likely that the benefits of index? (cough index-of) returning a NONE! value if given a NONE! input outweigh the problems it would cause by being so tolerant. It's always a balance.
Note that FIND does indeed behave as you expect. FOUND? is just a syntactic convenience that transforms a position found into a true value, and a NONE! returned into a false one. It is equivalent to calling TRUE? (but just a little more literate when reading). There is no need to use it in the condition of an IF or UNLESS or EITHER...as they will treat a NONE result as if it were false and any position as if it were true.
What is the most CPU effective way to do a replace once I have a valid index?
What would have been fastest would probably have been to have hung onto the position, and said change pos #"x". (Though internally "positions" are implemented by index plus series, and not an independent pointer. So the advantage is not that significant in micro-optimization world, where we're counting things like additions of offsets...)
As for which operation with an index: I'd say choose how you like it best and micro-optimize later.
I don't personally think STR/:index: #"x" looks all that great, but it's the briefest in characters.
STR/(index): #"x" does the same thing and looks better IMO. But at the cost of the source code structure blowing up a bit. That's a SET-PATH! series containing a PAREN! series followed by a CHAR!...all embedded in the original series "vector" that's holding the code. Under the hood there's going to be locality problems. And we know how important that is these days...
It's likely that the seemingly naive POKE is the fastest. poke STR index #"x". It may look like "4 elements instead of 2", but the "2 elements" of the path cases are an illusion.
In Rebol it's always a bit of a hard thing to guess, so you have to gather data. You can run some repeated iterative tests to find out. To time a block of code, see the builtin delta-time.
In Red the compiled forms should be equivalent, but if somehow this winds up being interpreted you'd probably have similar timings to Rebol.

No surprise that HostileFork's answer covers everything beautifully! +1
Just wanted to add an alternative solution to point 1 that I use regularly:
>> attempt [index? find STR "z"]
== 18
>> attempt [index? find STR "n"]
== none
Online documentation for Rebol 2 attempt & Rebol 3 attempt

Searching strings in Red/Rebol is very simple and convenient. About the issues you have encountered, let me unpack the details for you:
First of all, the interpreter is giving you a good hint about what you are doing wrong, in form of an error message: index? expected series argument of type: series port. This means that you used index? on the wrong datatype. How did that happen? Simply because the find function returns a none value in case the search fails:
>> str: "abcdefghijklmopqrz"
>> find str "o"
== "opqrz"
>> type? find str "o"
== string!
>> find str "n"
== none
>> type? find str "n"
== none!
So, using index? directly on the result of find is unsafe, unless you know that the search won't fail. If you need to extract the index information anyway, the safe approach is to test the result of find first:
>> all [pos: find str "o" index? pos]
== 14
>> all [pos: find str "n" index? pos]
== none
>> if pos: find str "o" [print index? pos]
14
>> print either pos: find str "n" [index? pos][-1]
-1
Those were just examples of safe ways to achieve it, depending on your needs. Note that none acts as false for conditional tests in if or either, so using found? in such cases is superfluous.
Now let's shed some light on the core issue which brought confusion to you.
Rebol languages have a fundamental concept called a series, from which the string! datatype is derived. Understanding series and using them properly is a key part of being able to use Rebol languages in an idiomatic way. Series look like the usual lists and string-like datatypes in other languages, but they are not the same. A series is made of:
a list of values (for strings, it is a list of characters)
an implicit index (we can call it a cursor for the sake of simplicity)
The following description will only focus on strings, but the same rules apply to all series datatypes. I will use index? function in the examples below just to display the implicit index as an integer number.
By default, when you create a new string, the cursor is at head position:
>> s: "hello"
>> head? s
== true
>> index? s
== 1
But the cursor can be moved to point to other places in the string:
>> next s
== "ello"
>> skip s 3
== "lo"
>> length? skip s 3
== 2
As you can see, the string with a moved cursor is not only displayed from the cursor position, but also all the other string (or series) functions will take that position into account.
Additionally, you can also set the cursor for each reference pointing to the string:
>> a: next s
== "ello"
>> b: skip s 3
== "lo"
>> s: at s 5
== "o"
>> reduce [a b s]
== ["ello" "lo" "o"]
>> reduce [index? a index? b index? s]
== [2 4 5]
As you can see, you can have as many different references to a given string (or series) as you wish, each having its own cursor value, but all pointing to the same underlying list of values.
One important consequence of series properties: you do not need to rely on integer indexes to manipulate strings (and other series) like you would do in other languages, you can simply leverage the cursor which comes with any series reference to do whatever computation you need, and your code will be short, clean and very readable. Still, integer indexes can be useful sometimes on series, but you rarely need them.
Now let's go back to your use-case for searching in strings.
>> STR: "abcdefghijklmopqrz"
>> find STR "z"
== "z"
>> find STR "n"
== none
That is all you need, you do not have to extract the index position in order to use the resulting values for pretty much any computation you need to do.
>> pos: find STR "o"
>> if pos [print "found"]
found
>> print ["sub-string from `o`:" pos]
sub-string from `o`: opqrz
>> length? pos
== 5
>> index? pos
== 14
>> back pos
== "mopqrz"
>> skip pos 4
== "z"
>> pos: find STR "n"
>> print either pos ["found"]["not found"]
not found
>> print either pos [index? pos][-1]
-1
Here is a simple example to show how to do sub-string extraction without any explicit usage of integer indexes:
>> s: "The score is 1:2 after 5 minutes"
>> if pos: find/tail s "score is " [print copy/part pos find pos " "]
1:2
With a little practice (the console is great for such experiments), you will see how much simpler and more efficient it is to rely fully on series in Rebol languages than on plain integer indexes.
Now, here is my take on your questions:
No wizardry required, just use series and find function adequately, as shown above.
Red is not going to change that. Series are a cornerstone of what makes Rebol languages simple and powerful.
change should be the fastest way. However, if you have many replacements to make on a long string, building a new string instead of changing the original one often performs better, as it avoids moving memory chunks around when the replacement strings are not the same size as the parts they replace.

Related

trouble with tripling letters [duplicate]

How can I iterate over a string in Python (get each character from the string, one at a time, each time through a loop)?
As Johannes pointed out,
for c in "string":
    # do something with c
You can iterate pretty much anything in python using the for loop construct,
for example, open("file.txt") returns a file object (and opens the file), iterating over it iterates over lines in that file
with open(filename) as f:
    for line in f:
        # do something with line
If that seems like magic, well it kinda is, but the idea behind it is really simple.
There's a simple iterator protocol that can be applied to any kind of object to make the for loop work on it.
Simply implement an iterator that defines a next() method (spelled __next__() in Python 3), and implement an __iter__ method on a class to make it iterable. (The __iter__ method, of course, should return an iterator object, that is, an object that defines next().)
See official documentation
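A minimal sketch of the protocol described above, written with the Python 3 spelling (__next__ instead of next). The class name Countdown and its behaviour are invented purely for illustration and are not part of the original answer:

```python
class Countdown:
    """Hypothetical iterable that counts down from `start` to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator simply returns itself from __iter__.
        return self

    def __next__(self):
        # Raising StopIteration tells the for loop that we are done.
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


for n in Countdown(3):
    print(n)
```

Because the for loop only relies on __iter__ and __next__, the same object also works with list(), sum() and anything else that consumes an iterable.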
If you need access to the index as you iterate through the string, use enumerate():
>>> for i, c in enumerate('test'):
...     print i, c
...
0 t
1 e
2 s
3 t
Even easier:
for c in "test":
    print c
Just to make a more comprehensive answer, the C way of iterating over a string can apply in Python, if you really wanna force a square peg into a round hole.
i = 0
while i < len(str):
    print str[i]
    i += 1
But then again, why do that when strings are inherently iterable?
for i in str:
    print i
You can also do something interesting like this and get the job done with a for loop:
# suppose you have a variable called name
name = "Mr.Suryaa"
for index in range(len(name)):
    print(name[index])  # just like C and C++
The output is the following characters, each printed on its own line:
M r . S u r y a a
However, since range() only produces the index values and the string is itself a sequence, you can iterate over name directly:
for e in name:
    print(e)
This produces the same result, looks better, and works with any iterable such as a list, tuple, or dictionary (iterating over a dictionary yields its keys).
We have used two built-in functions (BIFs in the Python community):
1) range() - the range() BIF is used to create indexes
Example
for i in range(5):
produces 0, 1, 2, 3, 4
2) len() - the len() BIF is used to find the length of the given string
If you would like to use a more functional approach to iterating over a string (perhaps to transform it somehow), you can split the string into characters, apply a function to each one, then join the resulting list of characters back into a string.
A string is inherently a list of characters, hence 'map' will iterate over the string - as second argument - applying the function - the first argument - to each one.
For example, here I use a simple lambda approach since all I want to do is a trivial modification to the character: here, to increment each character value:
>>> ''.join(map(lambda x: chr(ord(x)+1), "HAL"))
'IBM'
or more generally:
>>> ''.join(map(my_function, my_string))
where my_function takes a char value and returns a char value.
Several answers here use range. In Python 2, xrange is generally better as it returns a lazy sequence object rather than a fully instantiated list. Where memory or iterables of widely varying lengths can be an issue, xrange is superior. (In Python 3, range is already lazy and xrange no longer exists.)
You can also do the following:
txt = "Hello World!"
print (*txt, sep='\n')
This does not use an explicit loop; the print function takes care of it internally.
* unpacks the string into individual characters and passes each one to print as a separate argument
sep='\n' ensures that each character is printed on a new line
The output will be:
H
e
l
l
o
W
o
r
l
d
!
If you do need a loop statement, then as others have mentioned, you can use a for loop like this:
for x in txt: print (x)
If you ever run into a situation where you need to get the next char of the word using __next__(), remember to create a string_iterator and iterate over it and not over the original string (the string itself does not have the __next__() method).
In this example, when I find the char [ I keep consuming the following chars until I find ], so I need to use __next__.
Here a plain for loop over the string wouldn't help.
myString = "'string' 4 '['RP0', 'LC0']' '[3, 4]' '[3, '4']'"
processedInput = ""
word_iterator = myString.__iter__()
for idx, char in enumerate(word_iterator):
    if char == "'":
        continue
    processedInput+=char
    if char == '[':
        next_char=word_iterator.__next__()
        while(next_char != "]"):
            processedInput+=next_char
            next_char=word_iterator.__next__()
        else:
            processedInput+=next_char
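The same idea can be sketched with the built-ins iter() and next() instead of calling the dunder methods directly. The function name bracket_contents and the choice to collect each bracketed group into a list are assumptions made for this illustration, not part of the answer above:

```python
def bracket_contents(text):
    """Collect the text found between each '[' and the next ']'."""
    chars = iter(text)              # one iterator shared by both loops
    groups = []
    for ch in chars:                # the for loop advances `chars` ...
        if ch == "[":
            inner = []
            # ... and so does next(); the default "]" ends the scan
            # cleanly if the closing bracket is missing.
            nxt = next(chars, "]")
            while nxt != "]":
                inner.append(nxt)
                nxt = next(chars, "]")
            groups.append("".join(inner))
    return groups


print(bracket_contents("a[bc]d[ef]"))
```

Passing a default to next() avoids an unhandled StopIteration when the input ends before the closing bracket.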

How to assign the value of an index of a list to a variable in Python if you only know the index number and don't know or care about the element

I just spent 4 hours trying to google the answer asked pretty much as above. There were 10,000 results on how to find the index of an element. But I am not interested in any of the elements and never know what they will be, how long they will be, nor what they will contain since the string is user's input.
I have studied the Python manual for "lists" and tried several methods. I used loops of several types. List comprehensions were simply too complicated for my tiny brain although I tried. In fact, I tried for 4 hours googling and tweaking things maybe a hundred times but I only received errors after error of every type which I chased down one by one. I am not providing any code because I am not interested in having anyone fix it (its for an online course and that would be cheating). I just want to know the syntax for one little thing that is preventing me from making it work.
I need to assign the position of any element (i.e., the index integer value) anywhere in the list to a variable so I can control some boundaries. The most important thing for me is to set some conditions for the first character of the string converted to a list.
I was not very clear explaining what I am trying to do so I edited to add this:
pseudo code:
1. Ask the user for input and assign it a variable
2 Convert the string into a list where each letter of the string is an element of the list.
#The operation is dependent only on the position and not on the content of each character in the list.
3 For some elements of the list (but not all of them, which is why a loop won't work) perform an operation depending on their position (or index value)in the list.
in other words:"if the element is in position 0 of the list (or 3 or 27 etc) of the list then do something to the element." And I won't know or care what the content of the original element was.
If I know how to do that then I can extrapolate it for other character positions in the list.
I am a total beginner, and am not familiar with technical jargon, so please provide the simplest, least complex method! :-) Thank you in advance.
I'm an amateur myself but I will take a shot. If you can please do share some more context and some code for clarity.
I just checked the comment you made 50 mins ago. If I understand correctly, you want to assign the indexes to a variable. If that's correct, you can use the enumerate function. It's a built-in function that works on sequences. If we apply enumerate to our list named text, it will return the position (i.e. the index) and the value at that position.
text = ["B", "O", "O", "M"]
for index, value in enumerate(text):
    print(index, value)
This code will give you the following result:
0 B
1 O
2 O
3 M
Inside the for loop, you have the index variable that will now refer to the position of each value. You can now apply further conditions, like if index == 0:... and do your thing.
Does this help?
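To make the "if index == 0" idea concrete, here is a small sketch that applies an operation only at chosen positions, regardless of what the characters are. The names transform_at, positions and operation are hypothetical and chosen for this example only:

```python
def transform_at(text, positions, operation):
    """Apply `operation` only to the characters whose index is in `positions`."""
    chars = list(text)                      # one list element per character
    for index, value in enumerate(chars):
        if index in positions:
            chars[index] = operation(value)
    return chars


print(transform_at("boom", {0, 3}, str.upper))
```

Here positions 0 and 3 are upper-cased and every other element is left untouched; any other function of one character could be passed in place of str.upper.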
I also tried this one:
I think this is what you want. It assigns the value of an index (0,1,2,3) of a list to a variable.
list = ["item 1", "item 2", "item 3"]
item = input("Enter some text: ")
if item == list[0]:
    index = 0
elif item == list[1]:
    index = 1
elif item == list[2]:
    index = 2
else:
    index = None  # no match, so the print below does not fail on an unset name
print(index)
I tried this because you said: "if this is index 0 of the list then do this"
It checks if the value of the input is the item with index 0 of the list.
list = ["item 1", "item 2", "item 3"]
item = input("Enter some text: ")
if item == list[0]:
    print("This is index 0 of the list")
else:
    print("This is not index 0 of the list")
If this is not the thing you're looking for, could you please try to explain it in a different way? Or maybe try writing some code too, please.

Python ord() and chr()

I have:
txt = input('What is your sentence? ')
list = [0]*128
for x in txt:
    list[ord(x)] += 1
for x in list:
    if x >= 1:
        print(chr(list.index(x)) * x)
As per my understanding this should just output every letter in a sentence like:
))
111
3333
etc.
For the string "aB)a2a2a2)" the output is correct:
))
222
B
aaaa
For the string "aB)a2a2a2" the output is wrong:
)
222
)
aaaa
I feel like all my bases are covered but I'm not sure what's wrong with this code.
When you do list.index(x), you're searching the list for the first index that value appears. That's not actually what you want though, you want the specific index of the value you just read, even if the same value occurs somewhere else earlier in the list too.
The best way to get indexes alongside values from a sequence is with enumerate:
for i, x in enumerate(list):
    if x >= 1:
        print(chr(i) * x)
That should get you the output you want, but there are several other things that would make your code easier to read and understand. First of all, using list as a variable name is a very bad idea, as that will shadow the builtin list type's name in your namespace. That makes it very confusing for anyone reading your code, and you even confuse yourself if you want to use the normal list for some purpose and don't remember you've already used it for a variable of your own.
The other issue is also about variable names, but it's a bit more subtle. Your two loops both use a loop variable named x, but the meaning of the value is different each time. The first loop is over the characters in the input string, while the latter loop is over the counts of each character. Using meaningful variables would make things a lot clearer.
Here's a combination of all my suggested fixes together:
text = input('What is your sentence? ')
counts = [0]*128
for character in text:
    counts[ord(character)] += 1
for index, count in enumerate(counts):
    if count >= 1:
        print(chr(index) * count)
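As an alternative sketch (a different technique from the fixed-size list used above), collections.Counter from the standard library can do the counting. The function name repeated_characters is made up for this example; sorting the keys reproduces the code-point order of the original output:

```python
from collections import Counter


def repeated_characters(text):
    """Return each distinct character repeated by its count, in code-point order."""
    counts = Counter(text)
    return [character * counts[character] for character in sorted(counts)]


for line in repeated_characters("aB)a2a2a2"):
    print(line)
```

One design difference worth noting: this version is not limited to characters with code points below 128, because it does not index into a preallocated list.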

String Operations Confusion? ELI5

I'm extremely new to python and I have no idea why this code gives me this output. I tried searching around for an answer but couldn't find anything because I'm not sure what to search for.
An explain-like-I'm-5 explanation would be greatly appreciated
astring = "hello world"
print(astring[3:7:2])
This gives me : "l"
Also
astring = "hello world"
print(astring[3:7:3])
gives me : "lw"
I can't wrap my head around why.
This is string slicing in python.
Slicing is similar to regular string indexing, but it can return just a section of a string.
Using two parameters in a slice, such as [a:b] will return a string of characters, starting at index a up to, but not including, index b.
For example:
"abcdefg"[2:6] would return "cdef"
Using three parameters works similarly, but the third parameter is a step: the slice returns only every step-th character. For example, [2:6:2] will return every second character beginning at index 2, up to index 5.
ie "abcdefg"[2:6:2] will return ce, as it only counts every second character.
In your case, astring[3:7:3], the slice begins at index 3 (the second l) and moves forward the specified 3 characters (the third parameter) to w. It then stops at index 7, returning lw.
In fact when using only two parameters, the third defaults to 1, so astring[2:5] is the same as astring[2:5:1].
Python Central has some more detailed explanations of cutting and slicing strings in python.
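One way to see which characters a slice picks is to spell out the indexes with range(). The helper slice_by_hand below is a hypothetical name, and the sketch only covers non-negative start and stop values with a positive step:

```python
def slice_by_hand(text, start, stop, step=1):
    """Mimic text[start:stop:step] for non-negative start/stop and positive step."""
    stop = min(stop, len(text))             # a slice never reads past the end
    return "".join(text[i] for i in range(start, stop, step))


astring = "hello world"
print(repr(slice_by_hand(astring, 3, 7, 2)))   # picks indexes 3 and 5
print(repr(slice_by_hand(astring, 3, 7, 3)))   # picks indexes 3 and 6
```

Printing with repr() makes the trailing space in the first result visible: index 5 of "hello world" is the space, which is why that output looks like a lone "l" on screen.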
I have a feeling you are over complicating this slightly.
Since the string astring is set statically you could more easily do the following:
# Sets the characters for the letters in the consistency of the word
letter_one = "h"
letter_two = "e"
letter_three = "l"
letter_four = "l"
letter_six = "o"
letter_7 = " "
letter_8 = "w"
letter_9 = "o"
letter_10 = "r"
letter11 = "l"
lettertwelve = "d"
# Tells the python which of the character letters that you want to have on the print screen
print(letter_three + letter_7 + letter_three)
This way it's much more easily readable to human users and it should mitigate your error.

How to parse a string (by a "new" markup) with R?

I want to use R to do string parsing that (I think) is like a simplistic HTML parsing.
For example, let's say we have the following two variables:
Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
Say that I want to parse "Seq" According to "Str", by using the legend here
Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA
Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
     |     |  |              | |               |     |               ||     |
     +-----+  +--------------+ +---------------+     +---------------++-----+
        |          Stem 1            Stem 2                Stem 3        |
        |                                                                |
        +----------------------------------------------------------------+
                                      Stem 0
Assume that we always have 4 stems (0 to 3), but that the length of letters before and after each of them can vary.
The output should be something like the following list structure:
list(
  "Stem 0 opening" = "GCCTCGA",
  "before Stem 1" = "TA",
  "Stem 1" = list(opening = "GCTC",
                  inside = "AGTTGGGA",
                  closing = "GAGC"
  ),
  "between Stem 1 and 2" = "G",
  "Stem 2" = list(opening = "TACGA",
                  inside = "CTGAAGA",
                  closing = "TCGTA"
  ),
  "between Stem 2 and 3" = "AGGtC",
  "Stem 3" = list(opening = "ACCAG",
                  inside = "TTCGATC",
                  closing = "CTGGT"
  ),
  "After Stem 3" = "",
  "Stem 0 closing" = "TCGGGGC"
)
I don't have any experience with programming a parser, and would like advice as to what strategy to use when programming something like this (and any recommended R commands to use).
What I was thinking of is to first get rid of the "Stem 0", then go through the inner string with a recursive function (let's call it "seperate.stem") that each time will split the string into:
1. before stem
2. opening stem
3. inside stem
4. closing stem
5. after stem
Where the "after stem" will then be recursively entered into the same function ("seperate.stem")
The thing is that I am not sure how to try and do this coding without using a loop.
Any advice will be most welcome.
Update: someone sent me a bunch of questions; here they are.
Q: Does each sequence have the same number of ">>>>" for the opening sequence as it does for "<<<<" on the ending sequence?
A: Yes
Q: Does the parsing always start with a partial stem 0 as your example shows?
A: No. Sometimes it will start with a few "."
Q: Is there a way of making sure you have the right sequences when you start?
A: I am not sure I understand what you mean.
Q: Is there a chance of error in the middle of the string that you have to restart from?
A: Sadly, yes. In which case, I'll need to ignore one of the inner stems...
Q: How long are these strings that you want to parse?
A: Each string has between 60 to 150 characters (and I have tens of thousands of them...)
Q: Is each one a self contained sequence like you show in your example, or do they go on for thousands of characters?
A: each sequence is self contained.
Q: Is there always at least one '.' between stems?
A: No.
Q: A full set of rules as to how the parsing should be done would be useful.
A: I agree. But since I don't have even a basic idea on how to start coding this, I thought first to have some help on the beginning and try to tweak with the other cases that will come up before turning back for help.
Q: Do you have the BNF syntax for parsing?
A: No. Your e-mail is the first time I came across it (http://en.wikipedia.org/wiki/Backus–Naur_Form).
You can simplify the task by using run length encoding.
First, convert Str to be a vector of individual characters, then call rle.
split_Str <- strsplit(Str, "")[[1]]
rle_Str <- rle(split_Str)
Run Length Encoding
lengths: int [1:14] 7 2 4 8 4 1 5 7 5 5 ...
values : chr [1:14] ">" "." ">" "." "<" "." ">" "." "<" "." ">" "." "<" "."
Now you just need to parse rle_Str$values, which is perhaps simpler. For instance, an inner stem will always look like ">" "." "<".
I think the main thing that you need to think about is the structure of the data. Does a "." always have to come between ">" and "<", or is it optional? Can you have a "." at the start? Do you need to be able to generalise to stems within stems within stems, or even more complex structures?
Once you have this solved, constructing your list output should be straightforward.
Also, don't worry about using loops, they are in the language because they are useful. Get the thing working first, then worry about speed optimisations (if you really have to) afterwards.

Resources