What's the fastest/most efficient way to count lines in Rebol?

Given a string string, what is the fastest/most-efficient way to count lines therein? Will accept best answers for any flavour of Rebol. I've been working under the assumption that the parse [some [thru]] combination was the fastest way to traverse a string, but then I don't know that for certain, hence turning to SO:
count-lines: func [string [string!] /local count][
parse/all string [
(count: 1) some [thru newline (count: count + 1)]
]
count
]
Or:
count-lines: func [string [string!] /local count][
count: 0
until [
count: count + 1
not string: find/tail string newline
]
count
]
And how about counters? How efficient is repeat?
count-lines: func [string [string!]][
repeat count length? string [
unless string: find/tail string newline [
break/return count
]
]
]
Update: line count goes by the Text Editor principle:
An empty document still has a line count of one. So:
>> count-lines ""
== 1
>> count-lines "^/"
== 2

count-lines: func [
str
/local sort-str ][
sort-str: sort join str "^/"
1 + subtract index? find/last sort-str "^/" index? find sort-str "^/"
]

Enhanced PARSE version, as suggested by BrianH:
i: 1 ; add one as TextMate
parse text [any [thru newline (++ i)]]
print i

Here's the best simple non-parse version I can think of:
count-lines: function [text [string!]] [
i: 1
find-all text newline [++ i]
i
]
It uses function and ++ from more recent versions of Rebol, and find-all from either R3 or R2/Forward. You could look at the source of find-all and inline what you find and optimize, but situations like this are exactly what we wrote find-all for, so why not use it?

Here is the best for me:
temp: read/lines %mytext.txt
length? temp

remove-each can be fast as it is native
s: "1^/2^/3"
a: length? s
print a - length? remove-each v s [v = #"^/"]
; >> 2
or as a function
>> f: func [s] [print [(length? s) - (length? remove-each v s [v = #"^/"])]]
>> f "1^/2^/3"
== 2

I wonder why no one came up with the simplest solution :)
t: "abc^/de^/f^/ghi"
i: 0 until [i: i + 1 not t: find/tail t newline] i
== 4
Not sure about the performance but I think it's quite fast, as UNTIL and FIND are natives.
WHILE could be used as well.
i: 1 while [t: find/tail t newline] [i: i + 1] i
== 4
Just need to check for empty string. And if it would be a function, argument series needs to be HEADed.

Not the most memory-efficient, but probably one of the fastest solutions (if a benchmark is run, I would like to see how this solution performs):
>> s: "1^/2^/ ^/^/3"
>> (length? s) - length? trim/with copy s newline
== 4

I do not know about the performance, or how this treats the last-line rule (R3).
>> length? parse "1^/2^/3" "^/"
== 3

Hehe, read/lines followed by length? temp is a great approach; I had thought about read/lines and then foreach line temp [count: count + 1].
Another way to do it would be:
temp: "line 1 ^M line2 ^M line3 ^M "
length? parse temp newline ; that cuts the string into a block
; of multiple strings, one per line: [ "line 1" "line2" "line3" ]
; then you count how many strings the block holds with length?
I like to code in Rebol, it is so much fun.
Edit: I didn't read the whole post, so my solution had already been proposed in a different form.
To make amends for posting an already-posted solution, here is an unexpected behavior of that solution: multiple chained carriage returns are not counted (using Rebol 3 on Linux):
>> a: "line1 ^M line2 ^M line3 ^M^M"
== "line1 ^M line2 ^M line3 ^M^M"
>> length? parse a newline
== 3

Related

Extracting a string from a substring in bash (yes, that way around)

I have a string of several words in bash called comp_line, which can have any number of spaces inside. For example:
"foo bar apple banana q xy"
And I have a zero-based index comp_point pointing to one character in that string, e.g. if comp_point is 4, it points to the first 'b' in 'bar'.
Based on the comp_point and comp_line alone, I want to extract the word being pointed to by the index, where the "word" is a sequence of letters, numbers, punctuation or any other non-whitespace character, surrounded by whitespace on either side (if the word is at the start or end of the string, or is the only word in the string, it should work the same way.)
The word I'm trying to extract will become cur (the current word)
Based on this, I've come up with a set of rules:
Read the current character curchar, the previous character prevchar, and the next character nextchar. Then:
If curchar is a graph character (non-whitespace), set cur to the letters before and after curchar (stopping when you reach a whitespace character or the string start/end on either side.)
Else, if prevchar is a graph character, set cur to the letters from the previous letter, backwards until the previous whitespace character/string start.
Else, if nextchar is a graph character, set cur to the letters from the next letter, forwards until the next whitespace character/string end.
If none of the above conditions are hit (meaning curchar, nextchar and prevchar are all whitespace characters,) set cur to "" (empty string)
I've written some code which I think achieves this. Rules 2, 3 and 4 are relatively straightforward, but rule 1 is the most difficult to implement - I've had to do some complicated string slicing. I'm not convinced that my solution is in any way ideal, and want to know if anyone knows of a better way to do this within bash only (not outsourcing to Python or another easier language.)
Tested on https://rextester.com/l/bash_online_compiler
#!/bin/bash
# GNU bash, version 4.4.20
comp_line="foo bar apple banana q xy"
comp_point=19
cur=""
curchar=${comp_line:$comp_point:1}
prevchar=${comp_line:$((comp_point - 1)):1}
nextchar=${comp_line:$((comp_point + 1)):1}
echo "<$prevchar> <$curchar> <$nextchar>"
if [[ $curchar =~ [[:graph:]] ]]; then
# Rule 1 - Extract current word
slice="${comp_line:$comp_point}"
endslice="${slice%% *}"
slice="${slice#"$endslice"}"
slice="${comp_line%"$slice"}"
cur="${slice##* }"
else
if [[ $prevchar =~ [[:graph:]] ]]; then
# Rule 2 - Extract previous word
slice="${comp_line::$comp_point}"
cur="${slice##* }"
else
if [[ $nextchar =~ [[:graph:]] ]]; then
# Rule 3 - Extract next word
slice="${comp_line:$comp_point+1}"
cur="${slice%% *}"
else
# Rule 4 - Set cur to empty string ""
cur=""
fi
fi
fi
echo "Cur: <$cur>"
The current example will return 'banana' as comp_point is set to 19.
I'm sure that there must be a neater way to do it that I hadn't thought of, or some trick that I've missed. Also it works so far, but I think there may be some edge cases I hadn't thought of. Can anyone advise if there's a better way to do it?
(The XY problem, if anyone asks)
I'm writing a tab completion script, and trying to emulate the functionality of COMP_WORDS and COMP_CWORD, using COMP_LINE and COMP_POINT. When a user is typing a command to tab complete, I want to work out which word they are trying to tab complete just based on the latter two variables. I don't want to outsource this code to Python because performance takes a big hit when Python is involved in tab complete.
Another way in bash without array.
#!/bin/bash
string="foo bar apple banana q xy"
wordAtIndex() {
local index=$1 string=$2 ret='' last first
if [ "${string:index:1}" != " " ] ; then
last="${string:index}"
first="${string:0:index}"
ret="${first##* }${last%% *}"
fi
echo "$ret"
}
for ((i=0; i < "${#string}"; ++i)); do
printf '%s <-- "%s"\n' "${string:i:1}" "$(wordAtIndex "$i" "$string")"
done
if anyone knows of a better way to do this within bash only
Use regexes. With ^.{4} you can skip the first four letters to navigate to index 4. With [[:graph:]]* you can match the rest of the word at that index. * is greedy and will match as many graphical characters as possible.
wordAtIndex() {
local index=$1 string=$2 left right indexFromRight
[[ "$string" =~ ^.{$index}([[:graph:]]*) ]]
right=${BASH_REMATCH[1]}
((indexFromRight=${#string}-index-1))
[[ "$string" =~ ([[:graph:]]*).{$indexFromRight}$ ]]
left=${BASH_REMATCH[1]}
echo "$left${right:1}"
}
And here is full test for your example:
string="foo bar   apple  banana q xy"
for ((i=0; i < "${#string}"; ++i)); do
printf '%s <-- "%s"\n' "${string:i:1}" "$(wordAtIndex "$i" "$string")"
done
This outputs the input string vertically on the left, and on each index extracts the word that index points to on the right.
f <-- "foo"
o <-- "foo"
o <-- "foo"
<-- ""
b <-- "bar"
a <-- "bar"
r <-- "bar"
<-- ""
<-- ""
<-- ""
a <-- "apple"
p <-- "apple"
p <-- "apple"
l <-- "apple"
e <-- "apple"
<-- ""
<-- ""
b <-- "banana"
a <-- "banana"
n <-- "banana"
a <-- "banana"
n <-- "banana"
a <-- "banana"
<-- ""
q <-- "q"
<-- ""
x <-- "xy"
y <-- "xy"
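Both wordAtIndex versions above return an empty string whenever the index sits on whitespace, so they do not cover rules 2 and 3 from the question (falling back to the previous or the next word). A minimal sketch of one way to cover all four rules with parameter expansion only; the name word_at_point and its argument order are hypothetical, not taken from the posts above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the word at (or adjacent to) a zero-based index.
word_at_point() {
    local line=$1 point=$2 left right
    if [[ ${line:point:1} != [[:graph:]] ]]; then
        if (( point > 0 )) && [[ ${line:point-1:1} == [[:graph:]] ]]; then
            point=$((point - 1))      # rule 2: use the previous word
        elif [[ ${line:point+1:1} == [[:graph:]] ]]; then
            point=$((point + 1))      # rule 3: use the next word
        else
            printf '\n'               # rule 4: whitespace on all sides
            return
        fi
    fi
    # rule 1: join what precedes the index (back to the last whitespace)
    # with what follows it (up to the next whitespace)
    left=${line:0:point}
    right=${line:point}
    printf '%s\n' "${left##*[[:space:]]}${right%%[[:space:]]*}"
}

word_at_point "foo bar   apple  banana q xy" 19   # index inside a word
word_at_point "foo bar   apple  banana q xy" 3    # index just after a word
word_at_point "foo bar   apple  banana q xy" 8    # index surrounded by spaces
```

Because an index equal to the string length reads as an empty character, the same fallback also picks up the last word when the cursor sits at the end of the line, which is the usual case during tab completion.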

Convert carriage return (\r) to actual overwrite

Questions
Is there a way to convert the carriage returns to actual overwrite in a string so that 000000000000\r1010 is transformed to 101000000000?
Context
1. Initial objective:
Having a number x (between 0 and 255) in base 10, I want to convert this number to base 2, add trailing zeros to get a 12-digit binary representation, generate 12 different numbers (each of them made of the first n digits in base 2, with n between 1 and 12) and print the base 10 representation of these 12 numbers.
2. Example:
With x = 10
Base 2 is 1010
With trailing zeros 101000000000
Extract the 12 "leading" numbers: 1, 10, 101, 1010, 10100, 101000, ...
Convert to base 10: 1, 2, 5, 10, 20, 40, ...
3. What I have done (it does not work):
x=10
x_base2="$(echo "obase=2;ibase=10;${x}" | bc)"
x_base2_padded="$(printf '%012d\r%s' 0 "${x_base2}")"
for i in {1..12}
do
t=$(echo ${x_base2_padded:0:${i}})
echo "obase=10;ibase=2;${t}" | bc
done
4. Why it does not work
Because the variable x_base2_padded contains the whole sequence 000000000000\r1010. This can be confirmed using hexdump for instance. In the for loop, when I extract the first 12 characters, I only get zeros.
5. Alternatives
I know I can find an alternative by literally adding zeros to the variable as follows:
x_base2=1010
x_base2_padded="$(printf '%s%0.*d' "${x_base2}" $((12-${#x_base2})) 0)"
Or by padding with zeros using printf and rev
x_base2=1010
x_base2_padded="$(printf '%012s' "$(printf "${x_base2}" | rev)" | rev)"
Although these alternatives solve my problem for now and let me continue my work, they do not really answer my question.
Related issue
The same problem may be observed in different contexts. For instance if one tries to concatenate multiple strings containing carriage returns. The result may be hard to predict.
str=$'bar\rfoo'
echo "${str}"
echo "${str}${str}"
echo "${str}${str}${str}"
echo "${str}${str}${str}${str}"
echo "${str}${str}${str}${str}${str}"
The first echo will output foo. Although you might expect the other echo to output foofoofoo..., they all output foobar.
The following function overwrite transforms its argument such that after each carriage return \r the beginning of the string is actually overwritten:
overwrite() {
local segment result=
while IFS= read -rd $'\r' segment; do
result="$segment${result:${#segment}}"
done < <(printf '%s\r' "$@")
printf %s "$result"
}
Example
$ overwrite $'abcdef\r0123\rxy'
xy23ef
Note that the printed string is actually xy23ef, unlike echo $'abcdef\r0123\rxy' which only seems to print the same string, but still prints \r which is then interpreted by your terminal such that the result looks the same. You can confirm this with hexdump:
$ echo $'abcdef\r0123\rxy' | hexdump -c
0000000 a b c d e f \r 0 1 2 3 \r x y \n
000000f
$ overwrite $'abcdef\r0123\rxy' | hexdump -c
0000000 x y 2 3 e f
0000006
The function overwrite also supports overwriting by arguments instead of \r-delimited segments:
$ overwrite abcdef 0123 xy
xy23ef
To convert a variable in place, use command substitution: myvar=$(overwrite "$myvar")
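Applied to the input from the question, the function can be used roughly like this. This is only a sketch that repeats the overwrite definition so it runs on its own; the loop uses bash's base#number arithmetic instead of bc, which is a substitution rather than the asker's original approach:

```shell
#!/usr/bin/env bash
overwrite() {
    local segment result=
    # Each \r-terminated segment replaces the start of what came before it.
    while IFS= read -rd $'\r' segment; do
        result="$segment${result:${#segment}}"
    done < <(printf '%s\r' "$@")
    printf %s "$result"
}

# Twelve zeros, a carriage return, then the binary digits of 10.
x_base2_padded=$(overwrite "$(printf '%012d\r%s' 0 1010)")
echo "$x_base2_padded"

# Print the decimal value of each leading prefix of the padded string.
for i in {1..12}; do
    echo "$((2#${x_base2_padded:0:i}))"
done
```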
With awk, you'd set the field delimiter to \r and iterate through fields printing only the visible portions of them.
awk -F'\r' '{
offset = 1
for (i=NF; i>0; i--) {
if (offset <= length($i)) {
printf "%s", substr($i, offset)
offset = length($i) + 1
}
}
print ""
}'
This is indeed too long to put into a command substitution, so it is better to wrap it in a function and pipe the lines to be resolved into it.
To answer the specific question, how to convert 000000000000\r1010 to 101000000000, refer to Socowi's answer.
However, I wouldn't introduce the carriage return in the first place and solve the problem like this:
#!/usr/bin/env bash
x=$1
# Start with 12 zeroes
var='000000000000'
# Convert input to binary
binary=$(bc <<< "obase = 2; $x")
# Rightpad with zeroes: ${#binary} is the number of characters in $binary,
# and ${var:x} removes the first x characters from $var
var=$binary${var:${#binary}}
# Print 12 substrings, convert to decimal: ${var:0:i} extracts the first
# i characters from $var, and $((x#$var)) interprets $var in base x
for ((i = 1; i <= ${#var}; ++i)); do
echo "$((2#${var:0:i}))"
done

Finding character location of all instances of a string in bash

I'm trying to find the location of all instances of a string in a particular file; however, the code I'm currently running only returns the location of the first instance and then stops there. Here is what I'm currently running:
str=$(cat temp1.txt)
tmp="${str%%<C>*}"
if [ "$tmp" != "$str" ]; then
echo ${#tmp}
fi
The file is only one line of string and I would display it but the format questions need to be in won't allow me to add the proper amount of spaces between each character.
I am not sure of many details of your requirements, however this is an awk one-liner:
awk -vRS='<C>' '{printf("%u:",a+=length($0));a+=length(RS)}END{print ""}' temp1.txt
Let’s test it with an actual line of input:
$ awk -vRS='<C>' \
'{printf("%u:",a+=length($0));a+=length(RS)}END{print ""}' \
<<<"    <C>       <C>  "
4:14:20:
This means: the first <C> is at byte 4, the second <C> is at byte 14 (including the three bytes of the first <C>), and the whole line is 20 bytes long (including final newline).
Is this what you want?
Explanation
We set (-v) record separator (RS) as <C>. Then we keep a variable a with the count of all bytes processed so far. For each “line” (i.e., <C>-separated substrings) we add the length of the current line to a, printf it with a suitable format "%u:", and increase a by the length of the separator which ended the current line. Since no printing so far included newlines, at the END we print an empty string, which is an idiom to output a final newline.
Look at basically the same question asked here.
In particular, your question may be answered for multiple instances thanks to user JRFerguson's response using Perl.
EDIT: I found another solution that might just do the trick here. (The main question and response post is found here.)
I changed the shell from ksh to bash, changed the searched string to include multiple <C>'s to better demonstrate an answer the question, and named it "tester":
#!/bin/bash
printf '%s\n' '<C>abc<C>xyz<C>123456<C>zzz<C>' | awk -v s="$1" '
{ d = ""
for(i = 1; x = index(substr($0, i), s); i = i + x + length(s) - 1) {
printf("%s%d", d, i + x - 1)
d = ":"
}
print ""
}'
This is how I ran it:
$ tester '<C>'
1:7:13:22:28
I haven't figured the code out (I like to know why it works) but it seems to work! It would be nice to get an explanation and an elegant way to feed your string into this script. Cheers.
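The question's own ${str%%<C>*} idea can also be extended into a loop that stays within bash. A minimal sketch, assuming zero-based character offsets are wanted (the awk script above prints one-based positions); find_all is a hypothetical name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the zero-based offset of every occurrence of
# a literal needle in a string, joined with ':'.
find_all() {
    local str=$1 needle=$2 offset=0 prefix
    local -a positions=()
    [[ -n $needle ]] || return 1          # an empty needle would never advance
    while [[ $str == *"$needle"* ]]; do
        prefix=${str%%"$needle"*}         # text before the first occurrence
        positions+=( $(( offset + ${#prefix} )) )
        offset=$(( offset + ${#prefix} + ${#needle} ))
        str=${str:${#prefix}+${#needle}}  # continue after this occurrence
    done
    ( IFS=:; printf '%s\n' "${positions[*]}" )
}

find_all '<C>abc<C>xyz<C>123456<C>zzz<C>' '<C>'
find_all "$(cat temp1.txt 2>/dev/null)" '<C>'   # the file name from the question
```

Each pass strips everything up to and including the first match, so the running offset has to be added back to get a position in the original string. Note that ${#prefix} counts characters in the current locale, not bytes.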

Bash script string processing

I wrote a script that reads a plain text and a key, then loops through each character of the plain text and shifts it by the value of the corresponding character in the key text, with a=0 b=1 c=2 ... z=25.
The code works fine, but with a string of 1K characters it takes almost 3s to execute.
This is the code:
small="abcdefghijklmnopqrstuvwxyz" ## used to search and return the position of some small letter in a string
capital="ABCDEFGHIJKLMNOPQRSTUVWXYZ" ## used to search and return the position of some capital letter in a string
read Plain_text
read Key_text
## saving the length of each string
length_1=${#Plain_text}
length_2=${#Key_text}
printf " Your Plain text is: %s\n The Key is: %s\n The resulting Cipher text is: " "$Plain_text" "$Key_text"
for(( i=0,j=0;i<$length_1;++i,j=`expr $(($j + 1)) % $length_2` )) ## variable 'i' is the index for the first string, 'j' is the index of the second string
do
## return a substring starting from position 'i' and with length 1
c=${Plain_text:$i:1}
d=${Key_text:$j:1}
## function index takes two parameters, the string to search in and a substring,
## and returns the index of the first occurrence of the substring (1-based)
x=`expr index "$small" $c`
y=`expr index "$small" $d`
## shift the current letter to the right by the value of the corresponding letter in the key, mod 26
z=`expr $(($x + $y - 2)) % 26`
## print the resulting letter from the capital letter string
printf "%s" "${capital:$z:1}"
done
echo ""
How can I improve the performance of this code?
Thank you.
You are creating 4 new processes in each iteration of your for loop by using command substitution (3 substitutions in the body, 1 in the head). You should use arithmetic expansion instead of calling expr (search for $(( in the bash(1) manpage). Note that you don't need the $ to substitute variables inside $(( and )).
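A rough sketch of what that advice could look like, with every expr call replaced by arithmetic expansion and the letter position found by prefix removal instead of expr index. It assumes the plain text and the key contain only lowercase a-z and that the key is not empty; the function name vigenere is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical rewrite of the loop without any external process.
vigenere() {
    local plain=$1 key=$2 out='' i j c d x y z
    local small=abcdefghijklmnopqrstuvwxyz
    local capital=ABCDEFGHIJKLMNOPQRSTUVWXYZ
    local klen=${#key}
    for (( i = 0; i < ${#plain}; i++ )); do
        j=$(( i % klen ))          # wrap around the key
        c=${plain:i:1}
        d=${key:j:1}
        x=${small%%"$c"*}          # letters before c, so ${#x} is its 0-based position
        y=${small%%"$d"*}
        z=$(( (${#x} + ${#y}) % 26 ))
        out+=${capital:z:1}
    done
    printf '%s\n' "$out"
}

vigenere attackatdawn lemon
```

Collecting the result in a variable and printing it once also avoids one printf call per character.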
You can replace a character like this:
a=( soheil )
echo ${a/${a:0:1}/${a:1:1}}
To change every character, use a loop such as for.
To convert characters to upper case:
echo soheil | tr "[:lower:]" "[:upper:]"
I hope I understood your question.
be at peace
You will have a lot of repeating chars in a 1K string.
Imagine the input was 1M.
You should precompute all request/response pairs up front, so that your routine only has to look up the replacement.
I think a solution with arrays is the best approach here.

Remove escapes from a string, or, "how can I get \ out of the way?"

Escape characters cause a lot of trouble in R, as evidenced by previous questions:
Change the values in a column
Can R paste() output "\"?
Replacing escaped double quotes by double quotes in R
How to gsub('%', '\%', ... in R?
Many of these previous questions could be simplified to special cases of "How can I get \ out of my way?"
Is there a simple way to do this?
For example, I can find no arguments to gsub that will remove all escapes from the following:
test <- c('\01', '\\001')
The difficulty here is that "\1", although it's printed with two glyphs, is actually, in R's view a single character. And in fact, it's the very same character as "\001" and "\01":
nchar("\1")
# [1] 1
nchar("\001")
# [1] 1
identical("\1", "\001")
# [1] TRUE
So, you can in general remove all backslashes with something like this:
(test <- c("\\hi\\", "\n", "\t", "\\1", "\1", "\01", "\001"))
# [1] "\\hi\\" "\n" "\t" "\\1" "\001" "\001" "\001"
eval(parse(text=gsub("\\", "", deparse(test), fixed=TRUE)))
# [1] "hi" "n" "t" "1" "001" "001" "001"
But, as you can see, "\1", "\01", and "\001" will all be rendered as 001 (since to R they are all just different names for "\001").
EDIT: For more on the use of "\" in escape sequences, and on the great variety of characters that can be represented using them (including the disallowed nul string mentioned by Joshua Ulrich in a comment above), see this section of the R language definition.
I just faced the same issue. If you want to handle any \x where x is a character, I am not sure how (I wish I knew), but to fix it for one specific escape sequence, say \n, you can do:
new = gsub("\n","",old,fixed=T)
In my case, I only had \n.

Resources