ArangoDB: what does execution cost represent in an execution plan?

This is part of an execution plan that ArangoDB gave me:
Execution plan:
Id NodeType Est. Comment
40 CalculationNode 900 - LET now = DATE_NOW() /* v8 expression */
As you can see, DATE_NOW() is costing 900.
However, when I write a simple query that only returns the value of DATE_NOW(), the execution cost is 1, as shown below.
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
2 CalculationNode 1 - LET #0 = DATE_NOW() /* v8 expression */
3 ReturnNode 1 - RETURN #0
I wish to know:
1. How does ArangoDB calculate the execution cost?
2. What does the execution cost represent?

You need to provide a complete plan for the first query, like this one:
Query String:
LET now = DATE_NOW() RETURN now
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
2 CalculationNode 1 - LET now = 1548315869319 /* json expression */ /* const assignment */
4 ReturnNode 1 - RETURN now
Indexes used:
none
Optimization rules applied:
Id RuleName
1 move-calculations-up
2 remove-redundant-calculations
3 remove-unnecessary-calculations
The estimate is the number of documents produced or seen by that node.
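To illustrate with a made-up example (the collection items and its createdAt attribute are hypothetical): if a calculation depends on the loop variable of a FOR over a collection holding roughly 900 documents, its CalculationNode will show an estimate of about 900, because it is expected to be evaluated once per document, not because the expression itself is expensive.

FOR doc IN items
    LET age = DATE_NOW() - doc.createdAt
    RETURN { key: doc._key, ageMs: age }

So the 900 in your plan simply reflects how many rows that node is expected to see, not the cost of evaluating DATE_NOW() a single time.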

Related

How does Linux use values for PCIDs?

I'm trying to understand how Linux uses PCIDs (aka ASIDs) on the Intel architecture. While investigating the Linux kernel's source code and patches, I found the following define with this comment:
/*
* 6 because 6 should be plenty and struct tlb_state will fit in two cache
* lines.
*/
#define TLB_NR_DYN_ASIDS 6
This seems to say that Linux uses only 6 PCID values, but then what about this comment:
/*
* The x86 feature is called PCID (Process Context IDentifier). It is similar
* to what is traditionally called ASID on the RISC processors.
*
* We don't use the traditional ASID implementation, where each process/mm gets
* its own ASID and flush/restart when we run out of ASID space.
*
* Instead we have a small per-cpu array of ASIDs and cache the last few mm's
* that came by on this CPU, allowing cheaper switch_mm between processes on
* this CPU.
*
* We end up with different spaces for different things. To avoid confusion we
* use different names for each of them:
*
* ASID - [0, TLB_NR_DYN_ASIDS-1]
* the canonical identifier for an mm
*
* kPCID - [1, TLB_NR_DYN_ASIDS]
* the value we write into the PCID part of CR3; corresponds to the
* ASID+1, because PCID 0 is special.
*
* uPCID - [2048 + 1, 2048 + TLB_NR_DYN_ASIDS]
* for KPTI each mm has two address spaces and thus needs two
* PCID values, but we can still do with a single ASID denomination
* for each mm. Corresponds to kPCID + 2048.
*
*/
As said in the previous comment, I suppose that Linux uses only 6 values for PCIDs, so in the brackets we see just single values (not arrays). So ASID here can only be 0 and 5, kPCID can only be 1 and 6, and uPCID can only be 2049 and 2048 + 6 = 2054, right?
At this moment I have a few questions:
Why are there only 6 values for PCIDs? (Why is it plenty?)
Why will tlb_state structure fit in two cache lines if we choose 6 PCIDs?
Why does Linux use exactly these values for ASID, kPCID, and uPCID (I'm referring to the second comment)?
As said in the previous comment, I suppose that Linux uses only 6 values for PCIDs, so in the brackets we see just single values (not arrays)
No, this is wrong, those are ranges. [0, TLB_NR_DYN_ASIDS-1] means from 0 to TLB_NR_DYN_ASIDS-1 inclusive. Keep reading for more details.
There are a few things to consider:
1. The difference between ASID (Address Space IDentifier) and PCID (Process-Context IDentifier) is just nomenclature: Linux calls this feature ASID across all architectures, while Intel calls its implementation PCID. Linux ASIDs start at 0; Intel's PCIDs start at 1, because 0 is special and means "no PCID".
2. On x86 processors that support the feature, PCIDs are 12-bit values, so technically 4095 different PCIDs are possible (1 through 4095, as 0 is special).
3. Due to Kernel Page-Table Isolation, Linux nonetheless needs two different PCIDs per task. The distinction between kPCID and uPCID is made for this reason: each task effectively has two different virtual address spaces whose address translations need to be cached separately, and thus use different PCIDs. So we are down to 2047 usable pairs of PCIDs (plus one last unpaired value that would just go unused).
4. Any normal system can easily exceed 2047 tasks on a single CPU, so no matter how many bits you use, you will never have enough PCIDs for all existing tasks. On systems with a lot of CPUs you will also not have enough PCIDs for all active tasks.
5. Due to point 4, you cannot implement PCID support as a simple assignment of a unique value to each existing/active task (e.g. like it is done for PIDs). Multiple tasks will need to "share" the same PCID sooner or later (not at the same time, but at different points in time). The logic to manage PCIDs therefore needs to be different.
6. The choice made by the Linux developers was to use PCIDs as a way to optimize access to the most recently used mms (struct mm). This is implemented with a per-CPU array (cpu_tlbstate.ctxs) that is linearly scanned on each mm switch. Even small values of TLB_NR_DYN_ASIDS can easily trash performance instead of improving it; apparently, 6 was a good number to choose, as it provided a decent performance improvement. This means that only the 6 most-recently-used mms use non-zero PCIDs (OK, technically the 6 most-recently-used user/kernel mm pairs).
You can see this reasoning explained more concisely in the commit message of the patch that implemented PCID support.
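As a rough illustration of that linear scan, here is a simplified sketch in C. This is not the actual kernel code; the names and structure are invented for brevity (the real logic lives in arch/x86/mm/tlb.c).

#define TLB_NR_DYN_ASIDS 6

struct tlb_context_sketch {
    unsigned long ctx_id;                 /* identifies the cached mm */
};

/*
 * Pick an ASID for the mm we are switching to: reuse a cached slot on a
 * hit, otherwise evict the next slot round-robin and request a TLB flush.
 */
static int choose_asid(struct tlb_context_sketch ctxs[TLB_NR_DYN_ASIDS],
                       unsigned short *next_asid,
                       unsigned long ctx_id, int *need_flush)
{
    int asid;

    for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
        if (ctxs[asid].ctx_id == ctx_id) {
            *need_flush = 0;              /* cached translations stay valid */
            return asid;
        }
    }

    asid = *next_asid;                    /* not cached: evict round-robin */
    *next_asid = (*next_asid + 1) % TLB_NR_DYN_ASIDS;
    ctxs[asid].ctx_id = ctx_id;
    *need_flush = 1;                      /* stale TLB entries must be flushed */
    return asid;
}

The point is simply that the lookup is a linear scan over at most TLB_NR_DYN_ASIDS entries, so keeping the array tiny is what keeps the scheme cheap.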
Why will tlb_state structure fit in two cache lines if we choose 6 PCIDs?
Well that's just simple math:
struct tlb_state {
struct mm_struct * loaded_mm; /* 0 8 */
union {
struct mm_struct * last_user_mm; /* 8 8 */
long unsigned int last_user_mm_spec; /* 8 8 */
}; /* 8 8 */
u16 loaded_mm_asid; /* 16 2 */
u16 next_asid; /* 18 2 */
bool invalidate_other; /* 20 1 */
/* XXX 1 byte hole, try to pack */
short unsigned int user_pcid_flush_mask; /* 22 2 */
long unsigned int cr4; /* 24 8 */
struct tlb_context ctxs[6]; /* 32 96 */
/* size: 128, cachelines: 2, members: 8 */
/* sum members: 127, holes: 1, sum holes: 1 */
};
(information extracted through pahole from a kernel image with debug symbols)
The array of struct tlb_context is used to keep track of ASIDs, and it holds TLB_NR_DYN_ASIDS (6) entries. Each entry is 16 bytes, so the array takes 6 × 16 = 96 bytes; add the 32 bytes occupied by the members before it (31 bytes of fields plus the 1-byte hole) and you get exactly 128 bytes, i.e. two 64-byte cache lines.

What is the random factor in the Node v10 event loop?

My question is about the Node.js event loop.
Consider this code:
(async () => {
    let val = 1
    const promise = new Promise(async resolve => {
        resolve()
        await new Promise(async r => {
            setTimeout(r)
        })
        await promise
        val = 2
    })
    await promise
    await new Promise(resolve => setTimeout(resolve))
    console.log(val)
})()
With node 10.20.1 (latest version of node 10)
for ((i = 0; i < 30; i++)); do /opt/node-v10.20.1-linux-x64/bin/node race-timeout.js; done
With node 12.0.0 (the first version of node 12)
for ((i = 0; i < 30; i++)); do /opt/node-v12.0.0-linux-x64/bin/node race-timeout.js; done
The result of node 10
1
2
2
1
1
2
2
1
2
1
1
1
1
1
2
1
1
2
1
2
1
1
2
2
1
2
1
1
2
1
The result of node 12
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
So far, I have understood that Node.js is single-threaded.
Everything is deterministic and executed in an exact order, except when the poll phase intervenes.
The above code does not involve any non-deterministic factors (like I/O or the network).
I expected the result to always be the same. However, with Node v10, it is not.
What is the random factor in node v10?
It is all explained here.
In a nutshell, before v11, calling Promise.resolve().then( fn1 ); setTimeout( fn2, 0 ); could lead to fn2 being pushed to the queue of timers belonging to the current event loop iteration, and thus fn2 would fire before the event loop even entered the nextTicks queue (a.k.a. the "microtask queue").
According to one comment there, the discrepancy came from the fact that "the timers could get set on a different ms and thus expire 1ms apart", pushing fn2 to the next event loop iteration and leaving the nextTicks queue to be processed in between.
In other words, v10 should output 1 most of the time, except when the main job happens to call setTimeout in the next millisecond: for instance, if the main job started at ms 12.9 and it took more than 0.1 ms to reach the setTimeout call, then that setTimeout's 0 delay would actually match the next timer queue (ms 13) and not the current one (ms 12).
In later versions, nextTicks (and microtasks) are run after each immediate and timer callback, hence we are sure the Promise's callback will be executed right after the job that queued it, just like in browsers.
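A minimal way to see this change in isolation (my own reduced snippet, not the code from the question) is to queue a microtask from inside a timer callback while a second timer is already pending:

setTimeout(() => {
    console.log('timer 1');
    Promise.resolve().then(() => console.log('microtask after timer 1'));
});
setTimeout(() => console.log('timer 2'));

On Node 11+ (and in browsers) this prints timer 1, microtask after timer 1, timer 2, because the microtask queue is drained after each timer callback; on Node 10 you would typically get timer 1, timer 2, microtask after timer 1, since both expired timers are processed before the microtask queue is entered (unless the two timers happen to land on different milliseconds).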

ArangoDB parent-to-child edge creation on an existing 1 million documents for nested levels not working / very slow

I created an events document collection in ArangoDB and loaded 1 million records as shown below, which completes in 40 seconds:
FOR I IN 1..1000000
INSERT {
"source": "ABC",
"target": "ABC",
"type": "REST",
"attributes" : { "MyAtrib" : TO_STRING(I)},
"mynum" : I
} INTO events
So record 1 is the top-level parent, 2 is a child of 1, and so on:
1 --> 2 --> 3 --> 4 --> ... 1000000
I created an empty edge collection ChildEvents and tried to establish the parent-to-child edge relations with the query below, but it never completes (I created a hash index on mynum, but no luck):
FOR p IN events
FOR c IN events
FILTER p.mynum == ( c.mynum + 1 )
INSERT { _from: p._id, _to: c._id} INTO ChildEvents
Any help would be greatly appreciated.
Creating the event documents took around 50 seconds on my system. I added an index on mynum to the events collection and ran your second query (with an added RETURN NEW at the end), and it took roughly 70 seconds to process the edges (plus some time to render a subset of them).
I used ArangoDB 3.6.0 with RocksDB engine under Windows 10, Intel i7-6700K 4x4.0 GHz, 32 GB RAM, Samsung Evo 850 SSD.
Are you sure that the index is set up correctly? Explain the query and check the execution plan - maybe something is different for you. For comparison, here is the plan I get:
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
3 EnumerateCollectionNode 1000000 - FOR c IN events /* full collection scan, projections: `mynum`, `_id` */
9 IndexNode 1000000 - FOR p IN events /* persistent index scan, projections: `_id` */
6 CalculationNode 1000000 - LET #5 = { "_from" : p.`_id`, "_to" : c.`_id` } /* simple expression */ /* collections used: p : events, c : events */
7 InsertNode 1000000 - INSERT #5 IN ChildEvents
8 ReturnNode 1000000 - RETURN $NEW
Indexes used:
By Name Type Collection Unique Sparse Selectivity Fields Ranges
9 idx_1655926293788622848 persistent events true false 100.00 % [ `mynum` ] (p.`mynum` == (c.`mynum` + 1))
Optimization rules applied:
Id RuleName
1 move-calculations-up
2 move-filters-up
3 interchange-adjacent-enumerations
4 move-calculations-up-2
5 move-filters-up-2
6 remove-data-modification-out-variables
7 use-indexes
8 remove-filter-covered-by-index
9 remove-unnecessary-calculations-2
10 reduce-extraction-to-projection
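For reference, such an index can be created from arangosh roughly like this (a sketch, not necessarily the exact command used above; adjust the options as needed, the plan expects a persistent index on mynum):

db.events.ensureIndex({ type: "persistent", fields: ["mynum"], unique: true });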

What does this DTrace script output mean?

I am tracing DTrace probes in my restify.js application (restify is an HTTP server framework for Node.js that provides DTrace support). I am using the sample DTrace script from the restify documentation:
#!/usr/sbin/dtrace -s
#pragma D option quiet
restify*:::route-start
{
track[arg2] = timestamp;
}
restify*:::handler-start
/track[arg3]/
{
h[arg3, copyinstr(arg2)] = timestamp;
}
restify*:::handler-done
/track[arg3] && h[arg3, copyinstr(arg2)]/
{
@[copyinstr(arg2)] = quantize((timestamp - h[arg3, copyinstr(arg2)]) / 1000000);
h[arg3, copyinstr(arg2)] = 0;
}
restify*:::route-done
/track[arg2]/
{
@[copyinstr(arg1)] = quantize((timestamp - track[arg2]) / 1000000);
track[arg2] = 0;
}
And the output is:
use_restifyRequestLogger
value ------------- Distribution ------------- count
-1 | 0
0 |######################################## 2
1 | 0
use_validate
value ------------- Distribution ------------- count
-1 | 0
0 |######################################## 2
1 | 0
pre
value ------------- Distribution ------------- count
0 | 0
1 |#################### 1
2 |#################### 1
4 | 0
handler
value ------------- Distribution ------------- count
128 | 0
256 |######################################## 2
512 | 0
route_user_read
value ------------- Distribution ------------- count
128 | 0
256 |######################################## 2
512 | 0
I was wondering what the value field is - what does it mean?
Why is there 128/256/512, for example? I guess it means the time/duration, but it is in a strange format - is it possible to show milliseconds, for example?
The output is a histogram. You are getting a histogram because you are using the quantize function in your D script. The DTrace documentation says the following on quantize:
A power-of-two frequency distribution of the values of the specified expressions. Increments the value in the highest power-of-two bucket that is less than the specified expression.
The 'value' column is the result of (timestamp - track[arg2]) / 1000000, where timestamp is the current time in nanoseconds. So the value shown is the duration in milliseconds.
Putting this all together, the route_user_read result graph is telling you that you had 2 requests that took between 128 and 256 milliseconds.
This output is useful when you have a lot of requests and want to get a general sense of how your server is performing (you can quickly identify a bi-modal distribution for example). If you just want to see how long each request is taking, try using the printf function instead of quantize.
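For instance, an untested per-request variant using the same route probes as the script above might look like this, printing each route's duration in milliseconds as it completes:

#!/usr/sbin/dtrace -s
#pragma D option quiet

restify*:::route-start
{
        track[arg2] = timestamp;
}

restify*:::route-done
/track[arg2]/
{
        /* duration of this request in milliseconds */
        printf("%s took %d ms\n", copyinstr(arg1), (timestamp - track[arg2]) / 1000000);
        track[arg2] = 0;
}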

LTL model checking using Spin and Promela syntax

I'm trying to reproduce the ALGOL 60 code written by Dijkstra in the paper titled "Cooperating sequential processes"; the code is the first attempt to solve the mutex problem. Here is the syntax:
begin integer turn; turn:= 1;
parbegin
process 1: begin Ll: if turn = 2 then goto Ll;
critical section 1;
turn:= 2;
remainder of cycle 1; goto L1
end;
process 2: begin L2: if turn = 1 then goto L2;
critical section 2;
turn:= 1;
remainder of cycle 2; goto L2
end
parend
end
So I tried to reproduce the above code in Promela and here is my code:
#define true 1
#define Aturn true
#define Bturn false
bool turn, status;
active proctype A()
{
L1: (turn == 1);
status = Aturn;
goto L1;
/* critical section */
turn = 1;
}
active proctype B()
{
L2: (turn == 2);
status = Bturn;
goto L2;
/* critical section */
turn = 2;
}
never{ /* ![]p */
if
:: (!status) -> skip
fi;
}
init
{ turn = 1;
run A(); run B();
}
What I'm trying to do is verify that the fairness property will never hold, because the loop at label L1 runs infinitely.
The issue here is that my never claim block is not producing any error; the output I get simply says that my statement was never reached.
Here is the actual output from iSpin:
spin -a dekker.pml
gcc -DMEMLIM=1024 -O2 -DXUSAFE -DSAFETY -DNOCLAIM -w -o pan pan.c
./pan -m10000
Pid: 46025
(Spin Version 6.2.3 -- 24 October 2012)
+ Partial Order Reduction
Full statespace search for:
never claim - (not selected)
assertion violations +
cycle checks - (disabled by -DSAFETY)
invalid end states +
State-vector 44 byte, depth reached 8, errors: 0
11 states, stored
9 states, matched
20 transitions (= stored+matched)
0 atomic steps
hash conflicts: 0 (resolved)
Stats on memory usage (in Megabytes):
0.001 equivalent memory usage for states (stored*(State-vector + overhead))
0.291 actual memory usage for states
128.000 memory used for hash table (-w24)
0.534 memory used for DFS stack (-m10000)
128.730 total actual memory usage
unreached in proctype A
dekker.pml:13, state 4, "turn = 1"
dekker.pml:15, state 5, "-end-"
(2 of 5 states)
unreached in proctype B
dekker.pml:20, state 2, "status = 0"
dekker.pml:23, state 4, "turn = 2"
dekker.pml:24, state 5, "-end-"
(3 of 5 states)
unreached in claim never_0
dekker.pml:30, state 5, "-end-"
(1 of 5 states)
unreached in init
(0 of 4 states)
pan: elapsed time 0 seconds
No errors found -- did you verify all claims?
I've read all the Spin documentation on the never{..} block but couldn't find my answer (here is the link). I've also tried using ltl{..} blocks (link), but that just gave me a syntax error, even though it is explicitly mentioned in the documentation that they can be placed outside init and the proctypes. Can someone help me correct this code, please?
Thank you
You've redefined 'true', which can't possibly be good. I axed that redefinition and the never claim fails. But the failure is immaterial to your goal - the initial state of 'status' is 'false', and thus the never claim exits, which is a failure.
Also, it is slightly bad form to assign 1 or 0 to a bool; assign true or false instead - or use bit. Why not follow the Dijkstra code more closely and use an 'int' or 'byte'? It is not as if performance will be an issue in this problem.
You don't need 'active' if you are going to call 'run' - just use one or the other.
My translation of 'process 1' would be:
proctype A()
{
L1:     turn != 2 ->
            /* critical section */
            status = Aturn;
            turn = 2;
            /* remainder of cycle 1 */
            goto L1;
}
but I could be wrong on that.
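Putting that advice together, an untested sketch of the complete model with a matching 'process 2' might look like this (variable names kept from the question, with turn as a byte as suggested above):

#define Aturn true
#define Bturn false

byte turn = 1;
bool status;

proctype A()
{
L1:     turn != 2 ->
            /* critical section 1 */
            status = Aturn;
            turn = 2;
            /* remainder of cycle 1 */
            goto L1;
}

proctype B()
{
L2:     turn != 1 ->
            /* critical section 2 */
            status = Bturn;
            turn = 1;
            /* remainder of cycle 2 */
            goto L2;
}

init
{
        run A(); run B()
}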

Resources