Lua split string to table - string

I'm looking for the most efficient way to split a Lua string into a table.
I found two possible ways using gmatch or gsub and tried to make them as fast as possible.
function string:split1(sep)
local sep = sep or ","
local result = {}
local i = 1
for c in (self..sep):gmatch("(.-)"..sep) do
result[i] = c
i = i + 1
end
return result
end
function string:split2(sep)
local sep = sep or ","
local result = {}
local pattern = string.format("([^%s]+)", sep)
local i = 1
self:gsub(pattern, function (c) result[i] = c i = i + 1 end)
return result
end
The second option takes ~50% longer than the first.
What is the right way and why?
Added: I added a third function with the same pattern.
It shows the best result.
function string:split3(sep)
local sep = sep or ","
local result = {}
local i = 1
for c in self:gmatch(string.format("([^%s]+)", sep)) do
result[i] = c
i = i + 1
end
return result
end
"(.-)"..sep - works with a sequence.
"([^" .. sep .. "]+)" works with a single character. In fact, for each character in the sequence.
string.format("([^%s]+)", sep) is faster than "([^" .. sep .. "]+)".
The string.format("(.-)%s", sep) shows almost the same time as "(.-)"..sep.
result[i]=c i=i+1 is faster than result[#result+1]=c and table.insert(result,c)
Code for test:
local init = os.clock()
local initialString = [[1,2,3,"afasdaca",4,"acaac"]]
local temTable = {}
for i = 1, 1000 do
table.insert(temTable, initialString)
end
local dataString = table.concat(temTable,",")
print("Creating data: ".. (os.clock() - init))
init = os.clock()
local data1 = {}
for i = 1, 1000 do
data1 = dataString:split1(",")
end
print("split1: ".. (os.clock() - init))
init = os.clock()
local data2 = {}
for i = 1, 1000 do
data2 = dataString:split2(",")
end
print("split2: ".. (os.clock() - init))
init = os.clock()
local data3 = {}
for i = 1, 1000 do
data3 = dataString:split3(",")
end
print("split3: ".. (os.clock() - init))
Times:
Creating data: 0.000229
split1: 1.189397
split2: 1.647402
split3: 1.011056

The gmatch version is preferred. gsub is intended for "global substitution" - string replacement - rather than iterating over matches; accordingly it presumably has to do more work.
The comparison isn't quite fair though as your patterns differ: For gmatch you use "(.-)"..sep and for gsub you use "([^" .. sep .. "]+)". Why don't you use the same pattern for both? In newer Lua versions you could even use the frontier pattern.
The different patterns also lead to different behavior: The gmatch-based func will return empty matches whereas the others won't. Note that the "([^" .. sep .. "]+)" pattern allows you to omit the parentheses.

Related

How'd i loop through every string.gmatch in a string then replacing the matches with a bytecoded version of that?

How'd I change everything that it matches in the string without changing the non matches?
local a = "\" Hello World! I want to replace this with a bytecoded version of this!\" but not this!"
for i in string.gmatch(a, "\".*\"") do
print(i)
end
For example I want [["Hello World!" Don't Replace this!]] to [["\72\101\108\108\111\32\87\111\114\108\100\33" Don't Replace this!]]
Your question is a little bit tricky because it could involve:
Lua patterns
string.gsub function
Access to string's bytes with string.byte
String concatenations with table.concat
First things first, if you need to implement Lua patterns, please know that there is a very handy Lua syntax which is very appropriate for dealing with quoted strings. With this syntax, instead of opening/closing a string with a double-quote, you do it with the characters [[ and ]]. The key difference is that between these markers, you don't have to escape the quoted strings anymore!
String = [["Hello World!" Don't Replace this!]]
Then, we need to build the proper Lua pattern, a possibility could be to match a double-quote (") and then match all the characters which are not a double-quote ("), this gives us the following pattern:
[["([^"]+)"]]
| **** |
| \-> the expression to match
| |
quote quote
Then if we study the function string.gsub, we can learn that the function can call a callback when a pattern is matched, the matched string will be replaced by the return value of the callback.
function ConvertToByteString (MatchedString)
local ByteStrings = {}
local Len = #MatchedString
ByteStrings[#ByteStrings+1] = [[\"]]
for Index = 1, Len do
local Byte = MatchedString:byte(Index)
ByteStrings[#ByteStrings+1] = string.format([[\%d]], Byte)
end
ByteStrings[#ByteStrings+1] = [[\"]]
return table.concat(ByteStrings)
end
In this function, we iterate through all the characters of the matched string. Then for each of the characters, we extract its byte value with the function string.byte and convert it to a string using the string.format function. We put this string in a temporary array that we will concatenate at the end of the function.
The function to concatenate the sub-strings into a larger string is table.concat. This is a very convenient function which could be used as follow:
> table.concat({ [[\10]], [[\11]], [[\12]] })
\10\11\12
The remaining thing we need to do is to test this outstanding function:
> String = [["Hello World!" Don't Replace this!]]
> NewString = String:gsub([["([^"]+)"]], ConvertToByteString)
> NewString
\"\72\101\108\108\111\32\87\111\114\108\100\33\" Don't Replace this!
Edit: I got some remarks regarding code performances, I personally don't focus much on performances, I focus on getting the code correct & simple. In order to address the performance question, I wrote a micro-benchmark to compare the versions:
function SOLUTION_DarkWiiPlayer (String)
local result = String:gsub('"[^"]*"', function(str)
return str:gsub('[^"]', function(char)
return "\\" .. char:byte()
end)
end)
return result
end
function SOLUTION_Robert (String)
local function ConvertToByteString (MatchedString)
local ByteStrings = {}
local Len = #MatchedString
ByteStrings[#ByteStrings+1] = [[\"]]
for Index = 1, Len do
local Byte = MatchedString:byte(Index)
ByteStrings[#ByteStrings+1] = string.format([[\%d]], Byte)
end
ByteStrings[#ByteStrings+1] = [[\"]]
return table.concat(ByteStrings)
end
local Result = String:gsub([["([^"]+)"]], ConvertToByteString)
return Result
end
function SOLUTION_Piglet (String)
return String:gsub('%b""' , function (match)
local ret = ""
for _,v in ipairs{match:byte(1, -1)} do
ret = ret .. string.format("\\%d", v)
end
return ret
end)
end
function SOLUTION_Renshaw (String)
local function convert(str)
local byte_str = ""
for i = 1, #str do
byte_str = byte_str .. "\\" .. tostring(string.byte(str, i))
end
return byte_str
end
local Result = string.gsub(String, "\"(.*)\"", function(matched_str)
return "\"" .. convert(matched_str) .. "\""
end)
return Result
end
String = "\"Hello World!\" Don't Replace this!"
print("INITIAL REQUIREMENT FROM OP ", [[\"\72\101\108\108\111\32\87\111\114\108\100\33\" Don't Replace this!]])
print("TEST SOLUTION_Robert: ", SOLUTION_Robert(String))
print("TEST SOLUTION_DarkWiiPlayer:", SOLUTION_DarkWiiPlayer(String))
print("TEST SOLUTION_Piglet: ", SOLUTION_Piglet(String))
print("TEST SOLUTION_Renshaw: ", SOLUTION_Renshaw(String))
The results show that only one answer fulfill 100% of OP's requirements. The other answers doesn't handle the first and ending double-quotes " properly.
INITIAL REQUIREMENT FROM OP \"\72\101\108\108\111\32\87\111\114\108\100\33\" Don't Replace this!
TEST SOLUTION_Robert: \"\72\101\108\108\111\32\87\111\114\108\100\33\" Don't Replace this!
TEST SOLUTION_DarkWiiPlayer: "\72\101\108\108\111\32\87\111\114\108\100\33" Don't Replace this!
TEST SOLUTION_Piglet: \34\72\101\108\108\111\32\87\111\114\108\100\33\34 Don't Replace this! 1
TEST SOLUTION_Renshaw: "\72\101\108\108\111\32\87\111\114\108\100\33" Don't Replace this!
To finalize this post, one could dive a little deeper and check the code performances with a micro-benchmark which could be copy/paste directly in a Lua interpreter.
function SOLUTION_DarkWiiPlayer (String)
local result = String:gsub('"[^"]*"', function(str)
return str:gsub('[^"]', function(char)
return "\\" .. char:byte()
end)
end)
return result
end
function SOLUTION_Robert (String)
local function ConvertToByteString (MatchedString)
local ByteStrings = {}
local Len = #MatchedString
ByteStrings[#ByteStrings+1] = [[\"]]
for Index = 1, Len do
local Byte = MatchedString:byte(Index)
ByteStrings[#ByteStrings+1] = string.format([[\%d]], Byte)
end
ByteStrings[#ByteStrings+1] = [[\"]]
return table.concat(ByteStrings)
end
local Result = String:gsub([["([^"]+)"]], ConvertToByteString)
return Result
end
function SOLUTION_Piglet (String)
return String:gsub('%b""' , function (match)
local ret = ""
for _,v in ipairs{match:byte(1, -1)} do
ret = ret .. string.format("\\%d", v)
end
return ret
end)
end
function SOLUTION_Renshaw (String)
local function convert(str)
local byte_str = ""
for i = 1, #str do
byte_str = byte_str .. "\\" .. tostring(string.byte(str, i))
end
return byte_str
end
local Result = string.gsub(String, "\"(.*)\"", function(matched_str)
return "\"" .. convert(matched_str) .. "\""
end)
return Result
end
---
--- Micro-benchmark environment
---
COUNT = 600000
function TEST_Function (Name, Function, String, Count)
local TimerStart = os.clock()
for Index = 1, Count do
Function(String)
end
local ElapsedSeconds = (os.clock() - TimerStart)
print(string.format("[%25.25s] %f sec", Name, ElapsedSeconds))
end
String = "\"Hello World!\" Don't Replace this!"
TEST_Function("SOLUTION_DarkWiiPlayer", SOLUTION_DarkWiiPlayer, String, COUNT)
TEST_Function("SOLUTION_Robert", SOLUTION_Robert, String, COUNT)
TEST_Function("SOLUTION_Piglet", SOLUTION_Piglet, String, COUNT)
TEST_Function("SOLUTION_Renshaw", SOLUTION_Renshaw, String, COUNT)
The results shows that #DarkWiiPlayer's answer is the fastest one.
[ SOLUTION_DarkWiiPlayer] 6.363000 sec
[ SOLUTION_Robert] 9.605000 sec
[ SOLUTION_Piglet] 7.943000 sec
[ SOLUTION_Renshaw] 8.875000 sec
local a = "\"Hello World!\" but not this!"
print(a:gsub('"[^"]*"', function(str)
return str:gsub('[^"]', function(char)
return "\\" .. char:byte()
end)
end))
you need string.gsub.
local a = "\"Hello World!\" Don't Replace this!"
local function convert(str)
local byte_str = ""
for i = 1, #str do
byte_str = byte_str .. "\\" .. tostring(string.byte(str, i))
end
return byte_str
end
a = string.gsub(a, "\"(.*)\"", function(matched_str)
return "\"" .. convert(matched_str) .. "\""
end)
print(a)
local a = "\" Hello World! I want to replace this with a bytecoded version of this!\" but not this!"
print((a:gsub('%b""' , function (match)
local ret = ""
for _,v in ipairs{match:byte(1, -1)} do
ret = ret .. string.format("\\%d", v)
end
return ret
end)))

Lua: split string into words unless quoted

So I have the following code to split a string between whitespaces:
text = "I am 'the text'"
for string in text:gmatch("%S+") do
print(string)
end
The result:
I
am
'the
text'
But I need to do this:
I
am
the text --[[yep, without the quotes]]
How can I do this?
Edit: just to complement the question, the idea is to pass parameters from a program to another program. Here is the pull request that I am working, currently in review: https://github.com/mpv-player/mpv/pull/1619
There may be ways to do this with clever parsing, but an alternative way may be to keep track of a simple state and merge fragments based on detection of quoted fragments. Something like this may work:
local text = [[I "am" 'the text' and "some more text with '" and "escaped \" text"]]
local spat, epat, buf, quoted = [=[^(['"])]=], [=[(['"])$]=]
for str in text:gmatch("%S+") do
local squoted = str:match(spat)
local equoted = str:match(epat)
local escaped = str:match([=[(\*)['"]$]=])
if squoted and not quoted and not equoted then
buf, quoted = str, squoted
elseif buf and equoted == quoted and #escaped % 2 == 0 then
str, buf, quoted = buf .. ' ' .. str, nil, nil
elseif buf then
buf = buf .. ' ' .. str
end
if not buf then print((str:gsub(spat,""):gsub(epat,""))) end
end
if buf then print("Missing matching quote for "..buf) end
This will print:
I
am
the text
and
some more text with '
and
escaped \" text
Updated to handle mixed and escaped quotes. Updated to remove quotes. Updated to handle quoted words.
Try this:
text = [[I am 'the text' and '' here is "another text in quotes" and this is the end]]
local e = 0
while true do
local b = e+1
b = text:find("%S",b)
if b==nil then break end
if text:sub(b,b)=="'" then
e = text:find("'",b+1)
b = b+1
elseif text:sub(b,b)=='"' then
e = text:find('"',b+1)
b = b+1
else
e = text:find("%s",b+1)
end
if e==nil then e=#text+1 end
print("["..text:sub(b,e-1).."]")
end
Lua Patterns aren't powerful to handle this task properly. Here is an LPeg solution adapted from the Lua Lexer. It handles both single and double quotes.
local lpeg = require 'lpeg'
local P, S, C, Cc, Ct = lpeg.P, lpeg.S, lpeg.C, lpeg.Cc, lpeg.Ct
local function token(id, patt) return Ct(Cc(id) * C(patt)) end
local singleq = P "'" * ((1 - S "'\r\n\f\\") + (P '\\' * 1)) ^ 0 * "'"
local doubleq = P '"' * ((1 - S '"\r\n\f\\') + (P '\\' * 1)) ^ 0 * '"'
local white = token('whitespace', S('\r\n\f\t ')^1)
local word = token('word', (1 - S("' \r\n\f\t\""))^1)
local string = token('string', singleq + doubleq)
local tokens = Ct((string + white + word) ^ 0)
input = [["This is a string" 'another string' these are words]]
for _, tok in ipairs(lpeg.match(tokens, input)) do
if tok[1] ~= "whitespace" then
if tok[1] == "string" then
print(tok[2]:sub(2,-2)) -- cut off quotes
else
print(tok[2])
end
end
end
Output:
This is a string
another string
these
are
words

Lua split strings into Keys and Values of a table

So I want to split two strings, and be able to return a table with one string equaling the Keys and another the Values.
So if:
String1 = "Key1,Key2,Key3,Key4,Key Ect..."
String2 = "Value1,Value2,Value3,Value4,Value Ect..."
The output would be a table as folows:
Key1 - Value1
Key2 - Value2
Key3 - Value3
Key4 - Value4
Key Ect... - Value Ect...
I have been looking at this split function I found on the Lua wiki
split(String2, ",")
function split(String, pat)
local t = {} -- NOTE: use {n = 0} in Lua-5.0
local fpat = "(.-)" .. pat
local last_end = 1
local s, e, cap = str:find(fpat, 1)
while s do
if s ~= 1 or cap ~= "" then
table.insert(t,cap)
end
last_end = e+1
s, e, cap = str:find(fpat, last_end)
end
if last_end <= #str then
cap = str:sub(last_end)
table.insert(t, cap)
end
return t
end
But of course this only returns:
1 - Value1
2 - Value2
and so on...
I'm going to start trying to modify this code, but I don't know how far I'll get.
You can use it directly like this:
local t1 = split(String1, ",")
local t2 = split(String2, ",")
local result = {}
for k, v in ipairs(t1) do
result[v] = t2[k]
end
Or, create your own iterator:
local function my_iter(t1, t2)
local i = 0
return function() i = i + 1; return t1[i], t2[i] end
end
local result = {}
for v1, v2 in my_iter(t1, t2) do
result[v1] = v2
end
The code below avoids creating two temporary tables:
function join(s1,s2)
local b1,e1,k=1
local b2,e2,v=1
local t={}
while true do
b1,e1,k=s1:find("([^,]+)",b1)
if b1==nil then break end
b1=e1+1
b2,e2,v=s2:find("([^,]+)",b2)
if b2==nil then break end
b2=e2+1
t[k]=v
end
return t
end
String1 = "Key1,Key2,Key3,Key4"
String2 = "Value1,Value2,Value3,Value4"
for k,v in pairs(join(String1,String2)) do
print(k,v)
end

Lua - convert string to table

I want to convert string text to table and this text must be divided on characters. Every character must be in separate value of table, for example:
a="text"
--converting string (a) to table (b)
--show table (b)
b={'t','e','x','t'}
You could use string.gsub function
t={}
str="text"
str:gsub(".",function(c) table.insert(t,c) end)
Just index each symbol and put it at same position in table.
local str = "text"
local t = {}
for i = 1, #str do
t[i] = str:sub(i, i)
end
The builtin string library treats Lua strings as byte arrays.
An alternative that works on multibyte (Unicode) characters is the
unicode library that
originated in the Selene project.
Its main selling point is that it can be used as a drop-in replacement
for the string library, making most string operations “magically”
Unicode-capable.
If you prefer not to add third party dependencies your task can easily
be implemented using LPeg.
Here is an example splitter:
local lpeg = require "lpeg"
local C, Ct, R = lpeg.C, lpeg.Ct, lpeg.R
local lpegmatch = lpeg.match
local split_utf8 do
local utf8_x = R"\128\191"
local utf8_1 = R"\000\127"
local utf8_2 = R"\194\223" * utf8_x
local utf8_3 = R"\224\239" * utf8_x * utf8_x
local utf8_4 = R"\240\244" * utf8_x * utf8_x * utf8_x
local utf8 = utf8_1 + utf8_2 + utf8_3 + utf8_4
local split = Ct (C (utf8)^0) * -1
split_utf8 = function (str)
str = str and tostring (str)
if not str then return end
return lpegmatch (split, str)
end
end
This snippet defines the function split_utf8() that creates a table
of UTF8 characters (as Lua strings), but returns nil if the string
is not a valid UTF sequence.
You can run this test code:
tests = {
en = [[Lua (/ˈluːə/ LOO-ə, from Portuguese: lua [ˈlu.(w)ɐ] meaning moon; ]]
.. [[explicitly not "LUA"[1]) is a lightweight multi-paradigm programming ]]
.. [[language designed as a scripting language with "extensible ]]
.. [[semantics" as a primary goal.]],
ru = [[Lua ([лу́а], порт. «луна») — интерпретируемый язык программирования, ]]
.. [[разработанный подразделением Tecgraf Католического университета ]]
.. [[Рио-де-Жанейро.]],
gr = [[Η Lua είναι μια ελαφρή προστακτική γλώσσα προγραμματισμού, που ]]
.. [[σχεδιάστηκε σαν γλώσσα σεναρίων με κύριο σκοπό τη δυνατότητα ]]
.. [[επέκτασης της σημασιολογίας της.]],
XX = ">\255< invalid"
}
-------------------------------------------------------------------------------
local limit = 14
for lang, str in next, tests do
io.write "\n"
io.write (string.format ("<%s %3d> ->", lang, #str))
local chars = split_utf8 (str)
if not chars then
io.write " INVALID!"
else
io.write (string.format (" <%3d>", #chars))
for i = 1, #chars > limit and limit or #chars do
io.write (string.format (" %q", chars [i]))
end
end
end
io.write "\n"
Btw., building a table with LPeg is significantly faster than calling
table.insert() repeatedly.
Here are stats for splitting the whole of Gogol’s Dead Souls (in
Russian, 1023814 bytes raw, 571395 characters UTF) on my machine:
library method time in ms
string table.insert() 380
string t [#t + 1] = c 310
string gmatch & for loop 280
slnunicode table.insert() 220
slnunicode t [#t + 1] = c 200
slnunicode gmatch & for loop 170
lpeg Ct (C (...)) 70
You can below code to achieve this easily.
t = {}
str = "text"
for i=1, string.len(str) do
t[i]= (string.sub(str,i,i))
end
for k , v in pairs(t) do
print(k,v)
end
-- 1 t
-- 2 e
-- 3 x
-- 4 t
Using string.sub
string.sub(s, i [, j])
Return a substring of the string passed. The substring starts at i. If the third argument j is not given, the substring will end at the end of the string. If the third argument is given, the substring ends at and includes j.

Find the last index of a character in a string

I want to have ability to use a lastIndexOf method for the strings in my Lua (Luvit) project. Unfortunately there's no such method built-in and I'm bit stuck now.
In Javascript it looks like:
'my.string.here.'.lastIndexOf('.') // returns 14
function findLast(haystack, needle)
local i=haystack:match(".*"..needle.."()")
if i==nil then return nil else return i-1 end
end
s='my.string.here.'
print(findLast(s,"%."))
print(findLast(s,"e"))
Note that to find . you need to escape it.
If you have performance concerns, then this might be a bit faster if you're using Luvit which uses LuaJIT.
local find = string.find
local function lastIndexOf(haystack, needle)
local i, j
local k = 0
repeat
i = j
j, k = find(haystack, needle, k + 1, true)
until j == nil
return i
end
local s = 'my.string.here.'
print(lastIndexOf(s, '.')) -- This will be 15.
Keep in mind that Lua strings begin at 1 instead of 0 as in JavaScript.
Here’s a solution using
LPeg’s position capture.
local lpeg = require "lpeg"
local Cp, P = lpeg.Cp, lpeg.P
local lpegmatch = lpeg.match
local cache = { }
local find_last = function (str, substr)
if not (str and substr)
or str == "" or substr == ""
then
return nil
end
local pat = cache [substr]
if not pat then
local p_substr = P (substr)
local last = Cp() * p_substr * Cp() * (1 - p_substr)^0 * -1
pat = (1 - last)^0 * last
cache [substr] = pat
end
return lpegmatch (pat, str)
end
find_last() finds the last occurence of substr in the string
str, where substr can be a string of any length.
The first return value is the position of the first character of
substr in str, the second return value is the position of the
first character following substr (i.e. it equals the length of the
match plus the first return value).
Usage:
local tests = {
A = [[fooA]], --> 4, 5
[""] = [[foo]], --> nil
FOO = [[]], --> nil
K = [[foo]], --> nil
X = [[X foo X bar X baz]], --> 13, 14
XX = [[foo XX X XY bar XX baz X]], --> 17, 19
Y = [[YYYYYYYYYYYYYYYYYY]], --> 18, 19
ZZZ = [[ZZZZZZZZZZZZZZZZZZ]], --> 14, 17
--- Accepts patterns as well!
[P"X" * lpeg.R"09"^1] = [[fooX42barXxbazX]], --> 4, 7
}
for substr, str in next, tests do
print (">>", substr, str, "->", find_last (str, substr))
end
To search for the last instance of string needle in haystack:
function findLast(haystack, needle)
--Set the third arg to false to allow pattern matching
local found = haystack:reverse():find(needle:reverse(), nil, true)
if found then
return haystack:len() - needle:len() - found + 2
else
return found
end
end
print(findLast("my.string.here.", ".")) -- 15, because Lua strings are 1-indexed
print(findLast("my.string.here.", "here")) -- 11
print(findLast("my.string.here.", "there")) -- nil
If you want to search for the last instance of a pattern instead, change the last argument to find to false (or remove it).
Can be optimized but simple and does the work.
function lastIndexOf(haystack, needle)
local last_index = 0
while haystack:sub(last_index+1, haystack:len()):find(needle) ~= nil do
last_index = last_index + haystack:sub(last_index+1, haystack:len()):find(needle)
end
return last_index
end
local s = 'my.string.here.'
print(lastIndexOf(s, '%.')) -- 15

Resources