ArangoDB sharding cluster performance issue

I have a query that runs well in a single-instance setup. However, when I run it on a sharded cluster, performance drops (4x longer execution time).
The query plan shows that practically all processing is done on the coordinator node, not on the DB servers.
How can I push the query to be executed on the DB servers?
To give a bit of context: I have a collection of ~120k multi-level JSON documents (which will grow to several million) with nested arrays, and the query needs to unnest these arrays before getting to the proper node (a minimal example of the document shape follows the query below).
AQL Query:
for doc IN doccollection
  for arrayLevel1Elem in doc.report.container.children.container
    for arrayLevel2Elem in arrayLevel1Elem.children.container.children.num
      for arrayLevel3Elem in arrayLevel2Elem.children.code
        filter doc.report.container.concept.simpleCodedValue == 'A'
        filter arrayLevel1Elem.concept.codedValue == "B"
        filter arrayLevel2Elem.concept.simpleCodedValue == "C"
        filter arrayLevel3Elem.concept.simpleCodedValue == 'X'
        filter arrayLevel3Elem.value.simpleCodedValue == 'Y'
        collect studyUid = doc.report.study.uid,
                personId = doc.report.person.id,
                metricName = arrayLevel2Elem.concept.meaning,
                value = to_number(arrayLevel2Elem.value)
        return {studyUid, personId, metricName, value}
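For reference, a minimal sketch of the document shape this query assumes, reconstructed from the attribute paths above (all field values are made up):
{
  "report": {
    "study": { "uid": "1.2.3.4" },
    "person": { "id": "person-1" },
    "container": {
      "concept": { "simpleCodedValue": "A" },
      "children": {
        "container": [
          {
            "concept": { "codedValue": "B" },
            "children": {
              "container": {
                "children": {
                  "num": [
                    {
                      "concept": { "simpleCodedValue": "C", "meaning": "some metric" },
                      "value": "12.5",
                      "children": {
                        "code": [
                          {
                            "concept": { "simpleCodedValue": "X" },
                            "value": { "simpleCodedValue": "Y" }
                          }
                        ]
                      }
                    }
                  ]
                }
              }
            }
          }
        ]
      }
    }
  }
}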
Query Plan:
Id NodeType Site Est. Comment
1 SingletonNode DBS 1 * ROOT
2 EnumerateCollectionNode DBS 121027 - FOR doc IN doccollection /* full collection scan, projections: `report`, 2 shard(s) */ FILTER (doc.`report`.`container`.`concept`.`simpleCodedValue` == "A") /* early pruning */
3 CalculationNode DBS 121027 - LET #8 = doc.`report`.`container`.`children`.`container` /* attribute expression */ /* collections used: doc : doccollection */
19 CalculationNode DBS 121027 - LET #24 = doc.`report`.`study`.`uid` /* attribute expression */ /* collections used: doc : doccollection */
20 CalculationNode DBS 121027 - LET #26 = doc.`report`.`person`.`id` /* attribute expression */ /* collections used: doc : doccollection */
29 RemoteNode COOR 121027 - REMOTE
30 GatherNode COOR 121027 - GATHER /* parallel, unsorted */
4 EnumerateListNode COOR 12102700 - FOR arrayLevel1Elem IN #8 /* list iteration */
11 CalculationNode COOR 12102700 - LET #16 = (arrayLevel1Elem.`concept`.`codedValue` == "B") /* simple expression */
12 FilterNode COOR 12102700 - FILTER #16
5 CalculationNode COOR 12102700 - LET #10 = arrayLevel1Elem.`children`.`container`.`children`.`num` /* attribute expression */
6 EnumerateListNode COOR 1210270000 - FOR arrayLevel2Elem IN #10 /* list iteration */
13 CalculationNode COOR 1210270000 - LET #18 = (arrayLevel2Elem.`concept`.`simpleCodedValue` == "C") /* simple expression */
14 FilterNode COOR 1210270000 - FILTER #18
7 CalculationNode COOR 1210270000 - LET #12 = arrayLevel2Elem.`children`.`code` /* attribute expression */
21 CalculationNode COOR 1210270000 - LET #28 = arrayLevel2Elem.`concept`.`meaning` /* attribute expression */
22 CalculationNode COOR 1210270000 - LET #30 = TO_NUMBER(arrayLevel2Elem.`value`) /* simple expression */
8 EnumerateListNode COOR 121027000000 - FOR arrayLevel3Elem IN #12 /* list iteration */
15 CalculationNode COOR 121027000000 - LET #20 = ((arrayLevel3Elem.`concept`.`simpleCodedValue` == "X") && (arrayLevel3Elem.`value`.`simpleCodedValue` == "Y")) /* simple expression */
16 FilterNode COOR 121027000000 - FILTER #20
23 CollectNode COOR 96821600000 - COLLECT studyUid = #24, personId = #26, metricName = #28, value = #30 /* hash */
26 SortNode COOR 96821600000 - SORT studyUid ASC, personId ASC, metricName ASC, value ASC /* sorting strategy: standard */
24 CalculationNode COOR 96821600000 - LET #32 = { "studyUid" : studyUid, "personId" : personId, "metricName" : metricName, "value" : value } /* simple expression */
25 ReturnNode COOR 96821600000 - RETURN #32
Thanks a lot for any hint.

Queries are not actually executed on the DB servers - the coordinators handle query compilation and execution, only asking the DB server(s) for data.
This means the memory load for query execution happens on the coordinators (good!), but the coordinator has to transport (sometimes LARGE amounts of) data across the network. This is THE BIGGEST downside to moving to a cluster - and not one that is easily solved.
I walked this same road in the beginning and found ways to optimize some of my queries, but in the end it was easier to go with a "one-shard" cluster or an "active-failover" setup.
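If the "one-shard" route is an option for you (note it is an Enterprise Edition feature), a minimal arangosh sketch of setting it up - the database name here is just an example:
// Create a database whose collections are all placed on a single DB server,
// so queries against it can largely be executed there instead of on the coordinator.
db._createDatabase("reportsdb", { sharding: "single" });
// Collections created inside this database inherit the single-shard placement.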
It's tricky to make architecture suggestions because each use case can be so different, but there are some general AQL guidelines I follow:
Stacking all FOR statements together before the FILTERs is not recommended - apply each FILTER right after the FOR it belongs to (this ties into the second point below). Try this version to see if it runs any faster (and try indexing report.container.concept.simpleCodedValue; a sketch of creating that index follows the query):
FOR doc IN doccollection
  FILTER doc.report.container.concept.simpleCodedValue == 'A'
  FOR arrayLevel1Elem IN doc.report.container.children.container
    FILTER arrayLevel1Elem.concept.codedValue == 'B'
    FOR arrayLevel2Elem IN arrayLevel1Elem.children.container.children.num
      FILTER arrayLevel2Elem.concept.simpleCodedValue == 'C'
      FOR arrayLevel3Elem IN arrayLevel2Elem.children.code
        FILTER arrayLevel3Elem.concept.simpleCodedValue == 'X'
        FILTER arrayLevel3Elem.value.simpleCodedValue == 'Y'
        COLLECT
          studyUid = doc.report.study.uid,
          personId = doc.report.person.id,
          metricName = arrayLevel2Elem.concept.meaning,
          value = to_number(arrayLevel2Elem.value)
        RETURN { studyUid, personId, metricName, value }
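A minimal arangosh sketch of adding that index (assuming a persistent index is suitable for your ArangoDB version):
// Index the attribute used in the earliest FILTER so the full collection
// scan on doccollection can be narrowed down on the DB servers.
db.doccollection.ensureIndex({
  type: "persistent",
  fields: [ "report.container.concept.simpleCodedValue" ]
});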
The FOR doc IN doccollection pattern will recall the ENTIRE document from the DB server for each item in doccollection. Best practice is to either limit the number of documents you retrieve (best done with an index-backed search) and/or return only a few attributes. Don't be afraid of using LET - working in memory on the coordinator can be faster than doing it on the DB server. This example does both - it filters and returns a smaller set of data:
LET filteredDocs = (
  FOR doc IN doccollection
    FILTER doc.report.container.concept.simpleCodedValue == 'A'
    RETURN {
      study_id: doc.report.study.uid,
      person_id: doc.report.person.id,
      arrayLevel1: doc.report.container.children.container
    }
)
FOR doc IN filteredDocs
  FOR arrayLevel1Elem IN doc.arrayLevel1
    FILTER arrayLevel1Elem.concept.codedValue == 'B'
    ...

Related

Difference in Performance when using vertices and edges VS Joins in ArangoDB

Below are a few details.
Query 1: using graph traversal (execution plan attached as well).
Here I am using an edge collection between CollectionA and CollectionB.
Query string:
for u in CollectionA
  filter u.FilterA == #opId and u.FilterB >= #startTimeInLong and u.FilterB <= #endTimeInLong
  for v in 1..1 OUTBOUND u CollectionALinksCollectionB
    filter v.FilterC == null
    return v
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
9 IndexNode 45088 - FOR u IN CollectionA /* skiplist index scan */
5 TraversalNode 1 - FOR v /* vertex */ IN 1..1 /* min..maxPathDepth */ OUTBOUND u /* startnode */ CollectionALinksCollectionB
6 CalculationNode 1 - LET #6 = (v.`ReceivedRating` == null) /* simple expression */
7 FilterNode 1 - FILTER #6
8 ReturnNode 1 - RETURN v
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
9 skiplist CollectionA false false 100.00 % [ `FilterA`, `FilterB` ] ((u.`FilterA` == "8277") && (u.`FilterB` >= 1526947200000) && (u.`FilterB` <= 1541030400000))
5 edge CollectionALinksCollectionB false false 100.00 % [ `_from` ] base OUTBOUND
Traversals on graphs:
Id Depth Vertex collections Edge collections Options Filter conditions
5 1..1 CollectionALinksCollectionB uniqueVertices: none, uniqueEdges: path
Optimization rules applied:
Id RuleName
1 use-indexes
2 remove-filter-covered-by-index
3 remove-unnecessary-calculations-2
Query 2:
Query string:
for u in CollectionA
  filter u.FilterA == #opId and u.FilterB >= #startTimeInLong and u.FilterB <= #endTimeInLong
  for v in CollectionB
    filter v._key == u._key and v.FilterC == null
    return v
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
8 CalculationNode 1 - LET #6 = CollectionB /* all collection documents */ /* v8 expression */
11 IndexNode 45088 - FOR u IN CollectionA /* skiplist index scan */
10 IndexNode 45088 - FOR v IN CollectionB /* primary index scan, scan only */
12 CalculationNode 45088 - LET #4 = (CollectionB /* all collection documents */.`FilterC` == null) /* v8 expression */
7 FilterNode 45088 - FILTER #4
9 ReturnNode 45088 - RETURN #6
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
11 skiplist CollectionA false false 100.00 % [ `FilterA`, `FilterB` ] ((u.`FilterA` == "8277") && (u.`FilterB` >= 1526947200000) && (u.`FilterB` <= 1541030400000))
10 primary CollectionB true false 100.00 % [ `_key` ] (CollectionB.`_key` == u.`_key`)
Optimization rules applied:
Id RuleName
1 move-calculations-up
2 use-indexes
3 remove-filter-covered-by-index
4 remove-unnecessary-calculations-2
How does Query 1 perform better than Query 2? The query results are almost the same for a smaller dataset, but Query 1 performs better with larger data.
Can someone explain to me in detail how graph traversal helps here?

ReadStream skipping bytes?

I'm writing a keyboard-events parser for Linux, using node.js. It's working somewhat okay, but sometimes it seems like node is skipping a few bytes. I'm using a ReadStream to get the data, handle it, process it, and eventually output it when a separator character is encountered (in my case, \n).
Here is the part of my class that handles the read data:
// This method is called through this callback:
// this.readStream = fs.createReadStream(this.path);
// this.readStream.on("data", function(a) { self.parse_data(self, a); });
EventParser.prototype.parse_data = function(self, data)
{
    /*
     * Data format (16 bytes per event):
     * {
     *   0x00 : struct timeval time { long sec (4), long usec (4) } (8 bytes)
     *   0x08 : __u16 type  (2 bytes)
     *   0x0A : __u16 code  (2 bytes)
     *   0x0C : __s32 value (4 bytes)
     * } = 16 bytes
     */
    var dataBuffer = new Buffer(data);
    var slicedBuffer = dataBuffer.slice(0, 16);
    dataBuffer = dataBuffer.slice(16, dataBuffer.length);
    while (dataBuffer.length > 0 && slicedBuffer.length == 16)
    {
        var type = GetDataType(slicedBuffer),
            code = GetDataCode(slicedBuffer),
            value = GetDataValue(slicedBuffer);
        if (type == CST.EV.KEY)
        { // Key was pressed: KEY event type
            if (code == 42 && value == 1) { self.shift_pressed = true; }
            if (code == 42 && value == 0) { self.shift_pressed = false; }
            console.log(type + "\t" + code + "\t" + value + "\t(" + GetKey(self.shift_pressed, code) + ")");
            // GetKey uses a static array to get the actual character
            // based on whether the shift key is held or not
            if (value == 1)
                self.handle_processed_data(GetKey(self.shift_pressed, code));
            // handle_processed_data adds characters together, and outputs the string
            // when encountering a separator character (in this case, '\n')
        }
        // Take a new slice, and loop.
        slicedBuffer = dataBuffer.slice(0, 16);
        dataBuffer = dataBuffer.slice(16, dataBuffer.length);
    }
}
// My system is little-endian!
function GetDataType(dataBuffer)  { return dataBuffer.readUInt16LE(8); }
function GetDataCode(dataBuffer)  { return dataBuffer.readUInt16LE(10); }
function GetDataValue(dataBuffer) { return dataBuffer.readInt32LE(12); }
I'm basically filling up the data structure explained at the top using a Buffer. The interesting part is the console.log near the end, which will print everything interesting (related to the KEY event) that passes in our callback! Here is the result of such log, complete with the expected result, and the actual result:
EventParserConstructor: Listening to /dev/input/event19
/* Expected result: CODE-128 */
/* Note that value 42 is the SHIFT key */
1 42 1 ()
1 46 1 (C)
1 42 0 ()
1 46 0 (c)
1 42 1 ()
1 24 1 (O)
1 42 0 ()
1 24 0 (o)
1 42 1 ()
1 32 1 (D)
1 42 0 ()
1 32 0 (d)
1 42 1 ()
1 18 1 (E)
1 42 0 ()
1 18 0 (e)
1 12 0 (-)
1 2 0 (1)
1 3 1 (2)
1 3 0 (2)
1 9 1 (8)
1 9 0 (8)
1 28 1 (
)
[EventParser_Handler]/event_parser.handle_processed_data: CODE28
/* Actual result: CODE28 */
/* The '-' and '1' events can be seen in the logs, but only */
/* as key RELEASED (value: 0), not key PRESSED */
We can clearly see the - and 1 character events passing by, but only as key releases (value: 0), not key presses. The weirdest thing is that most of the time the events are correctly translated, but about 10% of the time this happens.
Is ReadStream eating up some bytes, occasionally? If yes, what alternative should I be using?
Thanks in advance!
Well, it turns out that my loop was rotten.
I was assuming that the data would only come in chunks of 16 bytes... which obviously isn't always the case. So sometimes I had packets of fewer than 16 bytes left over and lost between two 'data' event callbacks.
I fixed this by adding an excessBuffer field to my class and using it to fill my initial slicedBuffer when receiving data.
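A minimal sketch of that carry-over, assuming an excessBuffer field as described (the body of the loop stays exactly as before):
EventParser.prototype.parse_data = function(self, data)
{
    // Prepend the bytes left over from the previous 'data' event so that
    // partial 16-byte events are completed instead of being dropped.
    var dataBuffer = Buffer.concat([self.excessBuffer || new Buffer(0), data]);
    while (dataBuffer.length >= 16)
    {
        var slicedBuffer = dataBuffer.slice(0, 16);
        dataBuffer = dataBuffer.slice(16, dataBuffer.length);
        // ... handle slicedBuffer (type/code/value) exactly as in the original loop ...
    }
    // Keep any incomplete trailing bytes for the next 'data' callback.
    self.excessBuffer = dataBuffer;
}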

ArangoDB edge lookup cost

Hi, I have the collections and relations below:
broadcast (24,518 doc)
videoGroup (5,699 doc)
episode (124,893 doc)
videoClip (485,878 doc)
character (55,541 doc)
These collections are related to each other as follows:
broadcast has many videoGroup (m:n), so I created a broadcastToVideoGroup edge collection
videoGroup has many episode (1:n), so I created a videoGroupToEpisode edge collection
episode has many videoClip (1:n), so I created an episodeToVideoClip edge collection
I ran the query below to fetch the fully joined result:
FOR b IN broadcast
  FILTER b.reg_title > NULL
  return merge(b, {series: (
    FOR s IN OUTBOUND b._id tvToSeries
      return merge(s, {episode: (
        FOR e IN OUTBOUND s._id seriesToEpisode
          return merge(e, {clip: (
            FOR c IN OUTBOUND e._id episodeToClip
              return c
          )})
      )})
  )})
The explain is below
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
2 EnumerateCollectionNode 24518 - FOR b IN broadcast /* full collection scan */
7 SubqueryNode 24518 - LET #2 = ... /* subquery */
3 SingletonNode 1 * ROOT
4 CalculationNode 1 - LET #13 = b.`_id` /* attribute expression */ /* collections used: b : broadcast */
5 TraversalNode 5 - FOR c /* vertex */ IN 1..1 /* min..maxPathDepth */ OUTBOUND #13 /* startnode */ broadcastToCharacter
6 ReturnNode 5 - RETURN c
24 SubqueryNode 24518 - LET #11 = ... /* subquery */
8 SingletonNode 1 * ROOT
9 CalculationNode 1 - LET #17 = b.`_id` /* attribute expression */ /* collections used: b : broadcast */
10 TraversalNode 1 - FOR s /* vertex */ IN 1..1 /* min..maxPathDepth */ OUTBOUND #17 /* startnode */ broadcastToVideoGroup
21 SubqueryNode 1 - LET #9 = ... /* subquery */
11 SingletonNode 1 * ROOT
12 CalculationNode 1 - LET #21 = s.`_id` /* attribute expression */
13 TraversalNode 25 - FOR e /* vertex */ IN 1..1 /* min..maxPathDepth */ OUTBOUND #21 /* startnode */ videoGroupToEpisode
18 SubqueryNode 25 - LET #7 = ... /* subquery */
14 SingletonNode 1 * ROOT
15 CalculationNode 1 - LET #25 = e.`_id` /* attribute expression */
16 TraversalNode 8 - FOR c /* vertex */ IN 1..1 /* min..maxPathDepth */ OUTBOUND #25 /* startnode */ episodeToClip
17 ReturnNode 8 - RETURN c
19 CalculationNode 25 - LET #29 = MERGE(e, { "clips" : #7 }) /* simple expression */
20 ReturnNode 25 - RETURN #29
22 CalculationNode 1 - LET #31 = MERGE(s, { "episodes" : #9 }) /* simple expression */
23 ReturnNode 1 - RETURN #31
25 CalculationNode 24518 - LET #33 = MERGE(b, { "character" : #2, "videoGroup" : #11 }) /* simple expression */ /* collections used: b : broadcast */
26 ReturnNode 24518 - RETURN #33
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
5 edge broadcastToCharacter false false 19.42 % [ `_from` ] base OUTBOUND
10 edge broadcastToVideoGroup false false 90.89 % [ `_from` ] base OUTBOUND
13 edge videoGroupToEpisode false false 3.99 % [ `_from` ] base OUTBOUND
16 edge episodeToClip false false 11.55 % [ `_from` ] base OUTBOUND
In the execution plan, I wonder why the estimate at id 13 (TraversalNode) is 25 rather than 1.
Isn't the estimate 1 for an ArangoDB edge collection lookup?

How to improve Update query in arangodb

I have a collection which holds more than 15 million documents. Out of those 15 million documents, I update 20k records every hour. But the update query takes a long time to finish (around 30 minutes).
Document:
{ "inst" : "instance1", "dt": "2015-12-12T00:00:000Z", "count": 10}
I have an array which holds 20k instances to be updated.
My Query looks like this:
For h in hourly
  filter h.dt == DATE_ISO8601(1450116000000)
  For i in instArr
    filter i.inst == h.inst
    update h with {"inst": i.inst, "dt": i.dt, "count": i.count} in hourly
Is there any optimized way of doing this? I have a hash index on inst and a skiplist index on dt.
Update
I could not put all 20k instances into the query manually, so the following is the execution plan for just 2 instances:
FOR r IN hourly
  FILTER r.dt == DATE_ISO8601(1450116000000)
  FOR i IN [{"inst": "0e649fa22bcc5200d7c40f3505da153b", "dt": "2015-12-14T18:00:00.000Z"}, {}]
    FILTER i.inst == r.inst
    UPDATE r WITH {"inst": i.inst, "dt": i.dt, "max": i.max, "min": i.min, "sum": i.sum, "avg": i.avg, "samples": i.samples} IN hourly
    OPTIONS { ignoreErrors: true }
    RETURN NEW.inst
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
5 CalculationNode 1 - LET #6 = [ { "inst" : "0e649fa22bcc5200d7c40f3505da153b", "dt" : "2015-12-14T18:00:00.000Z" }, { } ] /* json expression */ /* const assignment */
13 IndexRangeNode 103067 - FOR r IN hourly /* skiplist index scan */
6 EnumerateListNode 206134 - FOR i IN #6 /* list iteration */
7 CalculationNode 206134 - LET #8 = i.`inst` == r.`inst` /* simple expression */ /* collections used: r : hourly */
8 FilterNode 206134 - FILTER #8
9 CalculationNode 206134 - LET #10 = { "inst" : i.`inst`, "dt" : i.`dt`, "max" : i.`max`, "min" : i.`min`, "sum" : i.`sum`, "avg" : i.`avg`, "samples" : i.`samples` } /* simple expression */
10 UpdateNode 206134 - UPDATE r WITH #10 IN hourly
11 CalculationNode 206134 - LET #12 = $NEW.`inst` /* attribute expression */
12 ReturnNode 206134 - RETURN #12
Indexes used:
Id Type Collection Unique Sparse Selectivity Est. Fields Ranges
13 skiplist hourly false false n/a `dt` [ `dt` == "2015-12-14T18:00:00.000Z" ]
Optimization rules applied:
Id RuleName
1 move-calculations-up
2 move-filters-up
3 move-calculations-up-2
4 move-filters-up-2
5 remove-data-modification-out-variables
6 use-index-range
7 remove-filter-covered-by-index
Write query options:
Option Value
ignoreErrors true
waitForSync false
nullMeansRemove false
mergeObjects true
ignoreDocumentNotFound false
readCompleteInput true
I assume the selection part (not the update part) will be the bottleneck in this query.
The query seems problematic because for each document matching the first filter (h.dt == DATE_ISO8601(...)), there will be an iteration over the 20,000 values in the instArr array. If the instArr values are unique, then only one of them will match. Additionally, no index will be used for the inner loop, as the index selection has already happened in the outer loop.
Instead of looping over all values in instArr, it would be better to turn the accompanying == comparison into an IN comparison. That would already work if instArr were an array of instance names, but it seems to be an array of instance objects (consisting of at least the attributes inst and count). In order to use the instance names in an IN comparison, it is better to have a dedicated array of instance names, plus a translation table for the count and dt values.
Following is an example for generating these with JavaScript:
var instArr = [ ], trans = { };
for (var i = 0; i < 20000; ++i) {
var instance = "instance" + i;
var count = Math.floor(Math.random() * 10);
var dt = (new Date(Date.now() - Math.floor(Math.random() * 10000))).toISOString();
instArr.push(instance);
trans[instance] = [ count, dt ];
}
instArr would then look like this:
[ "instance0", "instance1", "instance2", ... ]
and trans:
{
"instance0" : [ 4, "2015-12-16T21:24:45.106Z" ],
"instance1" : [ 0, "2015-12-16T21:24:39.881Z" ],
"instance2" : [ 2, "2015-12-16T21:25:47.915Z" ],
...
}
These data can then be injected into the query using bind variables (named like the variables above):
FOR h IN hourly
  FILTER h.dt == DATE_ISO8601(1450116000000)
  FILTER h.inst IN @instArr
  RETURN @trans[h.inst]
Note that ArangoDB 2.5 does not yet support the @trans[h.inst] syntax. In that version, you will need to write:
LET trans = @trans
FOR h IN hourly
  FILTER h.dt == DATE_ISO8601(1450116000000)
  FILTER h.inst IN @instArr
  RETURN trans[h.inst]
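A minimal arangosh sketch of passing these bind variables, assuming ArangoDB 2.6 or later and that instArr and trans were built as in the JavaScript snippet above:
var query = "FOR h IN hourly " +
            "FILTER h.dt == DATE_ISO8601(1450116000000) " +
            "FILTER h.inst IN @instArr " +
            "RETURN @trans[h.inst]";
// The bind variable names match the @instArr / @trans placeholders in the query.
var result = db._query(query, { instArr: instArr, trans: trans }).toArray();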
Additionally, 2.5 has a problem with longer IN lists. IN-list performance decreases quadratically with the length of the IN list. So in this version, it will make sense to limit the length of instArr to at most 2,000 values. That may require issuing multiple queries with smaller IN lists instead of just one with a big IN list.
The better alternative would be to use ArangoDB 2.6, 2.7 or 2.8, which do not have that problem, and thus do not require the workaround. Apart from that, you can get away with the slightly shorter version of the query in the newer ArangoDB versions.
Also note that in all of the above examples I used a RETURN ... instead of the UPDATE statement from the original query. This is because all my tests revealed that the selection part of the query is the major problem, at least with the data I had generated.
A final note on the original version of the UPDATE: updating each document's inst value with i.inst seems redundant, because i.inst == h.inst, so the value won't change.
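Putting the pieces together, a sketch of what the update could look like with this approach (the [count, dt] positions follow the trans layout generated above; treat this as a starting point rather than a tested query):
LET trans = @trans
FOR h IN hourly
  FILTER h.dt == DATE_ISO8601(1450116000000)
  FILTER h.inst IN @instArr
  UPDATE h WITH { count: trans[h.inst][0], dt: trans[h.inst][1] } IN hourly
  OPTIONS { ignoreErrors: true }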

lpeg grammar to parse comma separated groups that may have internal groups

I need to parse comma-separated groups (enclosed in brackets) that may have internal groups nested inside them. It should only split on the outside groups.
I have a function that does this:
function lpeg.commaSplit(arg)
  local P, C, V, sep = lpeg.P, lpeg.C, lpeg.V, lpeg.P(",")
  local p = P{
    "S";
    S = lpeg.T_WSpace * C(V"Element") * (lpeg.T_WSpace * sep * lpeg.T_WSpace * C(V"Element"))^0 * lpeg.T_WSpace,
    Element = (V"Group")^0 * (1 - lpeg.T_Group - sep)^0 * (V"Group" * (1 - lpeg.T_Group - sep)^0)^0 * (1 - sep)^0,
    Group = lpeg.T_LGroup * ((1 - lpeg.T_Group) + V"Group")^0 * lpeg.T_RGroup
  }^-1
  return lpeg.match(lpeg.Ct(p), arg)
end
But the problem is removing the extra brackets that may enclose the groups.
Here is a test string:
[[a,b,[c,d]],[e,[f,g]]]
should parse to
[a,b,[c,d]] & [e,[f,g]]
Notice that the internal groups are left alone. A simple removal of the extra brackets at the ends does not work, since you'll end up with a string like a,b,[c,d]],[e,[f,g].
Any ideas how to modify the lpeg grammar to allow for the outside groups?
As I am not an expert at writing LPeg grammars, I found this exercise interesting to do...
I couldn't manage to use your grammar, so I went ahead and made my own, built from smaller chunks that are easier to understand and where I could put the captures I needed.
I think I got a decent empirical result. It works on your test case; I don't know whether groups can be nested more deeply, etc. The post-processing of the capture is a bit ad hoc...
require"lpeg"
-- Guesswork...
lpeg.T_WSpace = lpeg.P" "^0
lpeg.T_LGroup = lpeg.P"["
lpeg.T_RGroup = lpeg.P"]"
lpeg.T_Group = lpeg.S"[]"
function lpeg.commaSplit(arg)
local P, C, Ct, V, sep = lpeg.P, lpeg.C, lpeg.Ct, lpeg.V, lpeg.P","
local grammar =
{
"S";
S = lpeg.T_WSpace * V"Group" * lpeg.T_WSpace,
Group = Ct(lpeg.T_LGroup * C(V"Units") * lpeg.T_RGroup),
Units = V"Unit" *
(lpeg.T_WSpace * sep * lpeg.T_WSpace * V"Unit")^0,
Unit = V"Element" + V"Group",
Element = (1 - sep - lpeg.T_Group)^1,
}
return lpeg.match(Ct(P(grammar)^-1), arg)
end
local test = "[[a,b,[c,d]],[e,[f,g]]]"
local res = lpeg.commaSplit(test)
print(dumpObject(res))
print(res[1], res[1][1], res[1][2])
local groups = res[1]
local finalResult = {}
for n, v in ipairs(groups) do
if type(v) == 'table' then
finalResult[#finalResult+1] = "[" .. v[1] .. "]"
end
end
print(dumpObject(finalResult))
dumpObject is just a table dump of my own. The output of this code is as follows:
local T =
{
{
"[a,b,[c,d]],[e,[f,g]]",
{
"a,b,[c,d]",
{
"c,d"
}
},
{
"e,[f,g]",
{
"f,g"
}
}
}
}
table: 0037ED48 [a,b,[c,d]],[e,[f,g]] table: 0037ED70
local T =
{
"[a,b,[c,d]]",
"[e,[f,g]]"
}
Personally, I wouldn't pollute the lpeg table with my stuff, but I kept your style here.
I hope this will be useful (or at least a starting point you can build on).
