I am learning Lua programming from an online book.
It talks about string indices for an array:
If the indices are strings, you can create a single index
concatenating both indices with a character in between to separate
them. For instance, you can index a matrix m with string indices s and
t with the code m[s..':'..t], provided that both s and t do not
contain colons (otherwise, pairs like ("a:", "b") and ("a", ":b")
would collapse into a single index "a::b"). When in doubt, you can use
a control character like '\0' to separate the indices.
https://www.lua.org/pil/11.2.html
I don't understand what is wrong with the index "a::b". What is the difference between "a:b" and "a::b"?
What's the trick behind it?
The documentation you linked describes a way to represent a multidimensional matrix in a single-dimensional table. It first gives an example of how to do this with number indices:
mt = {} -- create the matrix
for i=1,N do
for j=1,M do
mt[i*M + j] = 0
end
end
they describe a way to do the same thing with strings: "If the indices are strings, you can create a single index concatenating both indices with a character in between to separate them." A code snippet that would fit the description would look like:
str_idxs = {"foo", "bar", "baz"} -- table of the string indices
mt = {} -- matrix
for i = 1, #str_idxs do
  for j = 1, #str_idxs do
    -- build one combined key from the two string indices
    mt[str_idxs[i] .. ":" .. str_idxs[j]] = 0
  end
end
print(mt["foo:bar"]) -- 0
print(mt["foo" .. ":" .. "bar"]) -- 0
print(mt["foo::bar"]) -- nil
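As you can see in this example, there is nothing special about the ":" character; you can choose any string as the separator (including "::" if you really wanted to). mt["foo::bar"] is nil here simply because no value was ever assigned to the key "foo::bar". The book's warning is about ambiguity: if the indices themselves may contain the separator, the pairs ("a:", "b") and ("a", ":b") both produce the key "a::b", so two different cells would share one table entry.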
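The collision the book warns about can be reproduced in a few lines. This is a minimal sketch in Python rather than Lua, with a dict standing in for the Lua table; `make_key` is a hypothetical helper introduced only for illustration.

```python
def make_key(s, t, sep=":"):
    """Join two string indices into one key (hypothetical helper)."""
    return s + sep + t

# With ":" as separator, two different index pairs collapse into one key.
m = {}
m[make_key("a:", "b")] = 1
m[make_key("a", ":b")] = 2  # same key "a::b", overwrites the first value

# With a control character that the indices never contain, the keys stay distinct.
safe = {}
safe[make_key("a:", "b", sep="\0")] = 1
safe[make_key("a", ":b", sep="\0")] = 2

print(len(m), len(safe))
```

The first dict ends up with a single entry and the second with two, which is the whole reason for picking a separator that cannot occur in the indices.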
I have a function for shuffling strings, taken from another article and adapted to reorder the characters using a table of predefined numbers. It works perfectly, but I also need a function to unscramble the string using the same number table. I have no idea how to do this, and several attempts have failed.
Shuffle function:
randomValues = {}
for i = 1, 60 do
table.insert(randomValues, 1, math.random())
end
function shuffle(str)
math.randomseed(4)
local letters = {}
local idx = 0
for letter in str:gmatch'.[\128-\191]*' do
idx = idx + 1
table.insert(letters, {letter = letter, rnd = randomValues[idx]})
end
table.sort(letters, function(a, b) return a.rnd < b.rnd end)
for i, v in ipairs(letters) do
letters[i] = v.letter
end
return table.concat(letters)
end
Any tips?
I will assume that all you're trying to do is:
Split a string into unicode characters.
Shuffle these characters.
Restore the characters to their original position.
I have separated the splitting up of unicode characters and doing the actual shuffle, to make it a bit easier to follow.
1. Splitting the characters
Starting off with the splitting of characters:
-- Splits a string into a table of unicode characters.
local function splitLetters(str)
local letters = {}
for letter in str:gmatch'.[\128-\191]*' do
table.insert(letters, letter)
end
return letters
end
This is mostly copied from the first part of your function.
2. Shuffling the table of characters
Now that we have a nice table of characters, that we can work with, it's time to shuffle them. Shuffling a list can be done by going through each character in order and swapping it with a randomly chosen (but still unshuffled) item. While we do that, we also keep a table of all indices that got swapped, which I call swapTable here.
-- Shuffles in place and returns a table, which can be used to unshuffle.
local function shuffle(items)
local swapTable = {}
for i = 1, #items - 1 do
-- Swap item i with a random item from position i onward (including itself).
local j = math.random(i, #items)
items[i], items[j] = items[j], items[i]
-- Keep track of each swap so we can undo it.
table.insert(swapTable, j)
-- Everything up to i is now random.
-- The last iteration can be skipped, as it would always swap with itself.
-- See #items - 1 at the top of the loop.
end
return swapTable
end
3. Restoring the letters to their original positions
Using this swapTable, it is now pretty straightforward to just do the whole shuffle again, but in reverse.
-- Restores a previous shuffle in place.
local function unshuffle(items, swapTable)
-- Go through the swap table backwards, as we need to do everything in reverse.
for i = #swapTable, 1, -1 do
-- Do the same as before, but using the swap table.
local j = swapTable[i]
items[i], items[j] = items[j], items[i]
end
end
A full example using all those functions
Using those few functions (and table.concat to build up the list of letters into a string again) we can do everything you want:
-- Make our output reproducible
math.randomseed(42)
-- Split our test string into a table of unicode characters
local letters = splitLetters("Hellö Wörld! Höw are yoü?")
-- Shuffle them in-place, while also getting the swapTable
local swapTable = shuffle(letters)
-- Print out the shuffled string
print(table.concat(letters)) --> " rH?doröWüle Hl lwa eyöö!"
-- Unshuffle them in-place using the swapTable
unshuffle(letters, swapTable)
-- And we're back to the original string
print(table.concat(letters)) --> "Hellö Wörld! Höw are yoü?"
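For comparison, the same swap-table idea can be sketched in Python. This is an illustration under assumptions, not a port of the exact Lua output: the function name `shuffle_with_swaps` and the explicit `rng` parameter are made up here, and Python strings already iterate by code point, so no UTF-8 splitting step is needed.

```python
import random

def shuffle_with_swaps(items, rng):
    """Shuffle items in place and return the list of swap partners."""
    swaps = []
    for i in range(len(items) - 1):
        # Pick a partner among the not-yet-fixed positions (i itself included).
        j = rng.randint(i, len(items) - 1)
        items[i], items[j] = items[j], items[i]
        swaps.append(j)
    return swaps

def unshuffle(items, swaps):
    """Undo shuffle_with_swaps by replaying the swaps in reverse order."""
    for i in range(len(swaps) - 1, -1, -1):
        j = swaps[i]
        items[i], items[j] = items[j], items[i]

letters = list("Hellö Wörld! Höw are yoü?")
original = letters[:]
swaps = shuffle_with_swaps(letters, random.Random(42))
shuffled = "".join(letters)
unshuffle(letters, swaps)
restored = "".join(letters)
print(shuffled)
print(restored)
```

Because each swap is its own inverse, replaying them last to first restores the original order regardless of which random partners were chosen.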
Creating the swapTable upfront
In your example, you generate the swapTable upfront (and it also works slightly differently). You can of course split that part out and have your shuffle function work similarly to how unshuffle is currently implemented. Let me know if you want me to elaborate on that.
Imagine the problem of determining whether one string "STR1" is a rotated version of another string "STR2". This problem is simple and just requires searching for either string in the other string concatenated with itself. However, how would you solve this problem efficiently if you are allowed to replace up to K characters of a string and are allowed unlimited rotations?
For example consider two strings X = "abcdefgh" and Y = "pefgwarc" where K = 3. These strings are equal as you can first rotate X three moves to the left to get "defghabc". Then we replace "d" with "p", "h" with "w" and "b" with "r" to finally get "pefgwarc". Since we only replaced three characters, these strings are considered equal by this metric.
How would you go about solving this efficiently? It seems related to edit distance, but with rotations costing nothing and only replacements being counted, and I can't seem to find any well-known algorithm online that solves this.
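As a baseline for checking examples like the one above, a brute-force test simply tries every rotation and counts mismatching positions. This is only a naive O(n²) sketch, not the efficient algorithm being asked for, and both function names are made up for illustration.

```python
def min_replacements_over_rotations(x, y):
    """Fewest replacements turning some rotation of x into y, or None if lengths differ."""
    if len(x) != len(y):
        return None
    n = len(x)
    if n == 0:
        return 0
    # For each left rotation by `shift`, count positions where the strings disagree.
    return min(
        sum(1 for i in range(n) if x[(i + shift) % n] != y[i])
        for shift in range(n)
    )

def equal_within_k(x, y, k):
    """True if y can be reached from x by rotating and replacing at most k characters."""
    d = min_replacements_over_rotations(x, y)
    return d is not None and d <= k

print(equal_within_k("abcdefgh", "pefgwarc", 3))
```

For the strings in the example, the only rotation with any matching positions is the left rotation by three, which leaves exactly three mismatches.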
I'm extremely new to python and I have no idea why this code gives me this output. I tried searching around for an answer but couldn't find anything because I'm not sure what to search for.
An explain-like-I'm-5 explanation would be greatly appreciated
astring = "hello world"
print(astring[3:7:2])
This gives me : "l"
Also
astring = "hello world"
print(astring[3:7:3])
gives me : "lw"
I can't wrap my head around why.
This is string slicing in Python.
Slicing is similar to regular string indexing, but it can return just a section of a string.
Using two parameters in a slice, such as [a:b] will return a string of characters, starting at index a up to, but not including, index b.
For example:
"abcdefg"[2:6] would return "cdef"
Using three parameters works similarly, but the third parameter is a step: the slice returns only every step-th character. For example, [2:6:2] will return every second character beginning at index 2, up to index 5.
i.e. "abcdefg"[2:6:2] will return ce, as it only takes every second character.
In your case, astring[3:7:3], the slice begins at index 3 (the second l) and moves forward the specified 3 characters (the third parameter) to w at index 6. The next step would land on index 9, which is past the stop index 7, so the slice ends there and returns lw. Likewise, astring[3:7:2] picks indices 3 and 5, which are l and a space, so it prints "l " (the trailing space is easy to miss).
In fact when using only two parameters, the third defaults to 1, so astring[2:5] is the same as astring[2:5:1].
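One way to see which characters a slice picks is to list the indices it visits. The sketch below uses `range` with the same start, stop and step, and `repr` so that a trailing space becomes visible; the helper name `show_slice` is made up for illustration.

```python
astring = "hello world"

def show_slice(s, start, stop, step=1):
    """Return the indices a slice visits together with the resulting string."""
    indices = list(range(start, stop, step))
    return indices, s[start:stop:step]

# repr() keeps the quotes, so a trailing space is not lost in the output.
print(show_slice(astring, 3, 7, 2))
print(show_slice(astring, 3, 7, 3))
print(repr(astring[3:7]))
```

The first call visits indices 3 and 5, the second visits 3 and 6, and the plain two-parameter slice visits 3 through 6.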
Python Central has some more detailed explanations of cutting and slicing strings in python.
I have a feeling you are over complicating this slightly.
Since the string astring is set statically you could more easily do the following:
# Store each character of the word in its own variable
letter_1 = "h"
letter_2 = "e"
letter_3 = "l"
letter_4 = "l"
letter_5 = "o"
letter_6 = " "
letter_7 = "w"
letter_8 = "o"
letter_9 = "r"
letter_10 = "l"
letter_11 = "d"
# Choose which of the characters you want printed to the screen
print(letter_3 + letter_6 + letter_3)
This way it's much easier for humans to read, and it should avoid your error.
I have a collection S, typically containing 10-50 long strings. For illustrative purposes, suppose the length of each string ranges between 1000 and 10000 characters.
I would like to find strings of specified length k (typically in the range of 5 to 20) that are substrings of every string in S. This can obviously be done using a naive approach - enumerating every k-length substring in S[0] and checking if they exist in every other element of S.
Are there more efficient ways of approaching the problem? As far as I can tell, there are some similarities between this and the longest common subsequence problem, but my understanding of LCS is limited and I'm not sure how it could be adapted to the situation where we bound the desired common substring length to k, or if subsequence techniques can be applied to finding substrings.
Here's one fairly simple algorithm, which should be reasonably fast.
1. Using a rolling hash as in the Rabin-Karp string search algorithm, construct a hash table H0 of all the |S0|-k+1 length-k substrings of S0. That's roughly O(|S0|), since each hash is computed in O(1) from the previous hash, but it will take longer if there are collisions or duplicate substrings. Using a better hash will help with collisions, but if there are a lot of duplicate k-length substrings in S0 then you could end up using O(k|S0|).
2. Now use the same rolling hash on S1. This time, look each substring up in H0 and if you find it, remove it from H0 and insert it into a new table H1. Again, this should be around O(|S1|) unless you have some pathological case, like both S0 and S1 being just long repetitions of the same character. (It's also going to be suboptimal if S0 and S1 are the same string, or have lots of overlapping pieces.)
3. Repeat step 2 for each Si, each time creating a new hash table. (At the end of each iteration of step 2, you can delete the hash table from the previous step.)
At the end, the last hash table will contain all the common k-length substrings.
The total run time should be about O(Σ|Si|) but in the worst case it could be O(kΣ|Si|). Even so, with the problem size as described, it should run in acceptable time.
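The steps above can be sketched roughly as follows. This is a minimal illustration under assumptions: the polynomial hash parameters (`base`, `mod`) and both function names are arbitrary choices made here, and instead of removing entries from the previous table it just builds a fresh table for each string. Every hash hit is verified against the actual substring, so collisions cannot produce false positives.

```python
def kgram_hashes(s, k, base=257, mod=(1 << 61) - 1):
    """Yield (hash, start) for every length-k substring of s via a rolling hash."""
    if k <= 0 or len(s) < k:
        return
    high = pow(base, k - 1, mod)  # weight of the character leaving the window
    h = 0
    for ch in s[:k]:
        h = (h * base + ord(ch)) % mod
    yield h, 0
    for i in range(k, len(s)):
        # Drop s[i - k], shift the window, append s[i]: O(1) per step.
        h = ((h - ord(s[i - k]) * high) * base + ord(s[i])) % mod
        yield h, i - k + 1

def common_substrings(strings, k):
    """Return the set of length-k substrings present in every string."""
    if not strings:
        return set()
    # Table for the first string: hash -> set of substrings with that hash.
    current = {}
    for h, i in kgram_hashes(strings[0], k):
        current.setdefault(h, set()).add(strings[0][i:i + k])
    for s in strings[1:]:
        survivors = {}
        for h, i in kgram_hashes(s, k):
            bucket = current.get(h)
            if bucket:
                sub = s[i:i + k]
                if sub in bucket:  # verify, in case of a hash collision
                    survivors.setdefault(h, set()).add(sub)
        current = survivors  # the previous table is no longer needed
    return {sub for bucket in current.values() for sub in bucket}

print(sorted(common_substrings(["abcdefg", "xxcdefyy", "cdef123"], 3)))
```

Only substrings that survive every pass remain in the final table, which mirrors the chain of tables H0, H1, and so on described above.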
Some thoughts (N is number of strings, M is average length, K is needed substring size):
Approach 1:
Walk through all strings, computing rolling hash for k-length strings and storing these hashes in the map (store tuple {key: hash; string_num; position})
time O(NxM), space O(NxM)
Extract groups with equal hash, check step-by-step:
1) that size of group >= number of strings
2) all strings are represented in this group
3) thorough checking of real substrings for equality (sometimes hashes of distinct substrings might coincide)
Approach 2:
Build suffix array for every string
time O(N x MlogM) space O(N x M)
Find intersection of suffix arrays for the first string pair, using merge-like approach (suffixes are sorted), considering only part of suffixes of length k, then continue with the next string and so on
I would treat each long string as a collection of overlapped short strings, so ABCDEFGHI becomes ABCDE, BCDEF, CDEFG, DEFGH, EFGHI. You can represent each short string as a pair of indexes, one specifying the long string and one the starting offset in that string (if this strikes you as naive, skip to the end).
I would then sort each collection into ascending order.
Now you can find the short strings common to the first two collections by merging the sorted lists of indexes, keeping only those from the first collection which are also present in the second collection. Check the survivors of this against the third collection, and so on; the survivors at the end correspond to those short strings which are present in all long strings.
(Alternatively you could maintain a set of pointers into each sorted list and repeatedly look to see if every pointer points at short strings with the same text, then advancing the pointer which points at the smallest short string).
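A simplified variant of this idea is sketched below: every short string is represented as a pair (string index, start offset), all pairs are sorted once by the text they point at, and each run of equal text is checked for coverage of all long strings. Using one global sort instead of merging per-collection sorted lists is a simplification made here, and the function name is made up for illustration.

```python
from itertools import groupby

def common_kgrams_sorted(strings, k):
    """Return the sorted list of length-k substrings that occur in every string."""
    # Each short string is an index pair: (which long string, start offset).
    refs = [(n, i) for n, s in enumerate(strings) for i in range(len(s) - k + 1)]

    def text(ref):
        n, i = ref
        return strings[n][i:i + k]

    # Sorting by the referenced text puts equal short strings next to each other.
    refs.sort(key=text)
    result = []
    for sub, group in groupby(refs, key=text):
        owners = {n for n, _ in group}
        if len(owners) == len(strings):  # seen in every long string
            result.append(sub)
    return result

print(common_kgrams_sorted(["ABCDEFGHI", "xxCDEFGyy", "CDEFG"], 5))
```

Each comparison during the sort looks at up to k characters, which is the extra factor of k mentioned for the worst case.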
Time is O(n log n) for the initial sort, which dominates. In the worst case - e.g. when every string is AAAAAAAA..AA - there is a factor of k on top of this, because all string compares check all characters and take time k. Hopefully, there is a clever way round this with https://en.wikipedia.org/wiki/Suffix_array which allows you to sort in time O(n) rather than O(nk log n) and the https://en.wikipedia.org/wiki/LCP_array, which should allow you to skip some characters when comparing substrings from different suffix arrays.
Thinking about this again, I think the usual suffix array trick of concatenating all of the strings in question, separated by a character not found in any of them, works here. If you look at the LCP of the resulting suffix array you can split it into sections, splitting at points where the difference between suffixes occurs less than k characters in. Now each offset in any particular section starts with the same k characters. Now look at the offsets in each section and check to see if there is at least one offset from every possible starting string. If so, this k-character sequence occurs in all starting strings, but not otherwise. (There are suffix array constructions which work with arbitrarily large alphabets so you can always expand your alphabet to produce a character not in any string, if necessary).
I would try a simple method using HashSets:
Build a HashSet for each long string in S with all its k-strings.
Sort the sets by number of elements.
Scan the first set.
Look up each of its terms in the other sets.
The first step takes care of repetitions in each long string.
The second ensures the minimum number of comparisons.
let getHashSet k (lstr:string) =
    let strs = System.Collections.Generic.HashSet<string>()
    for i in 0..lstr.Length - k do
        strs.Add lstr.[i..i + k - 1] |> ignore
    strs
let getCommons k lstrs =
    let strss = lstrs |> Seq.map (getHashSet k) |> Seq.sortBy (fun strs -> strs.Count)
    match strss |> Seq.tryHead with
    | None -> [||]
    | Some h ->
        let rest = Seq.tail strss |> Seq.toArray
        [| for s in h do
             if rest |> Array.forall (fun strs -> strs.Contains s) then yield s
        |]
Test:
let random = System.Random System.DateTime.Now.Millisecond
let generateString n =
    [| for i in 1..n do
         yield random.Next 20 |> (+) 65 |> System.Convert.ToByte
    |] |> System.Text.Encoding.ASCII.GetString
[ for i in 1..3 do yield generateString 10000 ]
|> getCommons 4
|> fun l -> printfn "found %d\n %A" l.Length l
result:
found 40
[|"PPTD"; "KLNN"; "FTSR"; "CNBM"; "SSHG"; "SHGO"; "LEHS"; "BBPD"; "LKQP"; "PFPH";
"AMMS"; "BEPC"; "HIPL"; "PGBJ"; "DDMJ"; "MQNO"; "SOBJ"; "GLAG"; "GBOC"; "NSDI";
"JDDL"; "OOJO"; "NETT"; "TAQN"; "DHME"; "AHDR"; "QHTS"; "TRQO"; "DHPM"; "HIMD";
"NHGH"; "EARK"; "ELNF"; "ADKE"; "DQCC"; "GKJA"; "ASME"; "KFGM"; "AMKE"; "JJLJ"|]
Here it is in fiddle: https://dotnetfiddle.net/ZK8DCT
What is the difference between string and character class in MATLAB?
a = 'AX'; % This is a character.
b = string(a) % This is a string.
The documentation suggests:
There are two ways to represent text in MATLAB®. You can store text in character arrays. A typical use is to store short pieces of text as character vectors. And starting in Release 2016b, you can also store multiple pieces of text in string arrays. String arrays provide a set of functions for working with text as data.
This is how the two representations differ:
Element access. To represent char vectors of different length, one had to use cell arrays, e.g. ch = {'a', 'ab', 'abc'}. With strings, they can be created in actual arrays: str = [string('a'), string('ab'), string('abc')].
However, to index characters in a string array directly, the curly bracket notation has to be used:
str{3}(2) % == 'b'
Memory use. Chars use exactly two bytes per character. Strings have overhead:
a = 'abc'
b = string('abc')
whos a b
returns
Name Size Bytes Class Attributes
a 1x3 6 char
b 1x1 132 string
The best place to start for understanding the difference is the documentation. The key difference, as stated there:
A character array is a sequence of characters, just as a numeric array is a sequence of numbers. A typical use is to store short pieces of text as character vectors, such as c = 'Hello World';.
A string array is a container for pieces of text. String arrays provide a set of functions for working with text as data. To convert text to string arrays, use the string function.
Here are a few more key points about their differences:
They are different classes (i.e. types): char versus string. As such they will have different sets of methods defined for each. Think about what sort of operations you want to do on your text, then choose the one that best supports those.
Since a string is a container class, be mindful of how its size differs from an equivalent character array representation. Using your example:
>> a = 'AX'; % This is a character.
>> b = string(a) % This is a string.
>> whos
Name Size Bytes Class Attributes
a 1x2 4 char
b 1x1 134 string
Notice that the string container lists its size as 1x1 (and takes up more bytes in memory) while the character array is, as its name implies, a 1x2 array of characters.
They can't always be used interchangeably, and you may need to convert between the two for certain operations. For example, string objects can't be used as dynamic field names for structure indexing:
>> s = struct('a', 1);
>> name = string('a');
>> s.(name)
Argument to dynamic structure reference must evaluate to a valid field name.
>> s.(char(name))
ans =
1
Strings do have a bit of overhead, but still grow by 2 bytes per character. The size of the variable increases in steps, after every 8 characters. The red line is y=2x+127.
The figure was created using:
v=[];N=100;
for ct = 1:N
s=char(randi([0 255],[1,ct]));
s=string(s);
a=whos('s');v(ct)=a.bytes;
end
figure(1);clf
plot(v)
xlabel('# characters')
ylabel('# bytes')
p=polyfit(1:N,v,1);
hold on
plot([0,N],[127,2*N+127],'r')
hold off
One important practical thing to note is that strings and chars behave differently inside square brackets. This can be especially confusing when coming from Python. Consider the following example:
>> ['asdf' '123']
ans =
'asdf123'
>> ["asdf" "123"]
ans =
1×2 string array
"asdf" "123"