I have a problem I do not understand. For example, a cluster has three nodes, and each node runs an agent, a coordinator, and a DB-Server. I execute an AQL query on one of the nodes, and the execution plan is as follows:
Query string:
FOR v,p IN outbound
SHORTEST_PATH #startnode
TO #endnode
GRAPH #graphname options {bfs:true}
filter v.c_id==#c_id
return v
Execution plan:
Id NodeType Site Est. Comment
1 SingletonNode COOR 1 * ROOT
2 ShortestPathNode COOR 10000000 - FOR v /* vertex */, p /* edge */ IN OUTBOUND SHORTEST_PATH 'contact_vertex/f69375e854a34250986399d1909c0776' /* startnode */ TO 'contact_vertex/e11a10f848cd4a4392b8f10ea307eac9' /* targetnode */ GRAPH 'an'
3 CalculationNode COOR 10000000 - LET #4 = (v.c_id == -2147483643) /* simple expression */
4 FilterNode COOR 10000000 - FILTER #4
5 ReturnNode COOR 10000000 - RETURN v
Indexes used:
none
Shortest paths on graphs:
Id Vertex collections Edge collections
2 contact_vertex contact_edge
Optimization rules applied:
none
I want to know whether the query runs on a single node or is distributed across all nodes. In other words, is the query executed as a distributed computation?
As in the query above, I have a lot of data. I have two collections, a vertex collection and an edge collection, which together form a graph, and I only use the default indexes. Is there a way to quickly improve the search speed, for example by creating an index (a small arangosh sketch follows below), increasing the memory available to the query, and so on? Any other suggestions would also be welcome.
My cluster starts as follows:
arangodb --starter.join 10.200.11.32,10.200.11.34,10.200.11.35
How should I set up my cluster?
All three nodes have the same configuration:
RAM: 254 GB, cores: 40
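To illustrate the "create an index" idea, here is a minimal arangosh sketch (my own assumption, not from the original post; collection and document names are taken from the plan above). Note that a FILTER applied after SHORTEST_PATH is evaluated per result vertex on the coordinator, so it will not necessarily use such an index:
// Create a persistent index on the filtered attribute of the vertex collection.
db.contact_vertex.ensureIndex({ type: "persistent", fields: ["c_id"] });
// Re-check where the query runs and which indexes it uses.
db._explain(
  "FOR v, p IN OUTBOUND SHORTEST_PATH 'contact_vertex/f69375e854a34250986399d1909c0776' " +
  "TO 'contact_vertex/e11a10f848cd4a4392b8f10ea307eac9' GRAPH 'an' OPTIONS { bfs: true } " +
  "FILTER v.c_id == -2147483643 RETURN v"
);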
Related
I have a table containing the market data of 5,000 unique stocks. Each stock has 24 records a day and each record has 1,000 fields (factors). I want to pivot the table for cross-sectional analysis. You can find my script below.
I have two questions: (1) The current script is a bit complex. Is there a simpler implementation? (2) The execution takes 521 seconds. Any way to make it faster?
1. Create table
CREATE TABLE tb
(
tradeTime DateTime,
symbol String,
factor String,
value Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(tradeTime)
ORDER BY (symbol, tradeTime)
SETTINGS index_granularity = 8192
2. Insert test data
INSERT INTO tb SELECT
tradetime,
symbol,
untuple(factor)
FROM
(
SELECT
tradetime,
symbol
FROM
(
WITH toDateTime('2022-01-01 00:00:00') AS start
SELECT arrayJoin(timeSlots(start, toUInt32((22 * 23) * 3600), 3600)) AS tradetime
)
ARRAY JOIN arrayMap(x -> concat('symbol', toString(x)), range(0, 5000)) AS symbol
)
ARRAY JOIN arrayMap(x -> (concat('f', toString(x)), toFloat64(x) + toFloat64(0.1)), range(0, 1000)) AS factor
3. Finally, send the query
SELECT
tradeTime,
sumIf(value, factor = 'factor1') AS factor1,
sumIf(value, factor = 'factor2') AS factor2,
sumIf(value, factor = 'factor3') AS factor3,
sumIf(value, factor = 'factor4') AS factor4,
-- ... so many factors to list out ...
sumIf(value, factor = 'factor1000') AS factor1000
FROM tb
GROUP BY tradeTime,symbol
ORDER BY tradeTime,symbol ASC
Have you considered building a materialized view to solve this, with the inserts going into a SummingMergeTree?
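A rough sketch of that idea (the names tb_wide and mv_tb_wide are illustrative, not from the original post): pivot at insert time into a SummingMergeTree target, so the final read no longer has to run 1,000 sumIf aggregations over the raw rows. Only three factor columns are shown; the rest would follow the same pattern.
CREATE TABLE tb_wide
(
    tradeTime DateTime,
    symbol String,
    factor1 Float64,
    factor2 Float64,
    factor3 Float64
    -- ... factor4 through factor1000 declared the same way
)
ENGINE = SummingMergeTree
PARTITION BY toYYYYMMDD(tradeTime)
ORDER BY (tradeTime, symbol);

-- A view with a TO clause only applies to new inserts; existing rows in tb
-- would need a one-off INSERT INTO tb_wide SELECT ... to be backfilled.
CREATE MATERIALIZED VIEW mv_tb_wide TO tb_wide AS
SELECT
    tradeTime,
    symbol,
    sumIf(value, factor = 'factor1') AS factor1,
    sumIf(value, factor = 'factor2') AS factor2,
    sumIf(value, factor = 'factor3') AS factor3
    -- ... factor4 through factor1000 the same way
FROM tb
GROUP BY tradeTime, symbol;

-- Rows with the same (tradeTime, symbol) key are summed on background merges,
-- so read with GROUP BY/sum() or FINAL if you need exact values before merging:
SELECT * FROM tb_wide FINAL ORDER BY tradeTime, symbol;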
The documentation says that one of them triggers a Spark job and the other does not. I am not sure I understand what that means. Could you help me understand the difference between the two?
The source of truth is the latest code:
/**
* Zips this RDD with its element indices. The ordering is first based on the partition index
* and then the ordering of items within each partition. So the first item in the first
* partition gets index 0, and the last item in the last partition receives the largest index.
*
* This is similar to Scala's zipWithIndex but it uses Long instead of Int as the index type.
* This method needs to trigger a spark job when this RDD contains more than one partitions.
*
* @note Some RDDs, such as those returned by groupBy(), do not guarantee order of
* elements in a partition. The index assigned to each element is therefore not guaranteed,
* and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee
* the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
*/
def zipWithIndex(): RDD[(T, Long)] = withScope {
new ZippedWithIndexRDD(this)
}
/**
* Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k,
* 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method
* won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].
*
* @note Some RDDs, such as those returned by groupBy(), do not guarantee order of
* elements in a partition. The unique ID assigned to each element is therefore not guaranteed,
* and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee
* the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
*/
def zipWithUniqueId(): RDD[(T, Long)]
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1396
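To make the difference concrete, here is a small example (mine, not from the original post), assuming a SparkContext sc and 4 elements in 2 partitions:
val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)

// zipWithIndex: consecutive indices across partitions. To know the starting
// offset of each partition it must first count the elements per partition,
// which runs a job when there is more than one partition.
rdd.zipWithIndex().collect()
// Array((a,0), (b,1), (c,2), (d,3))

// zipWithUniqueId: element i of partition k gets id i * numPartitions + k,
// computed locally from the partition index alone, so no job is triggered.
// The ids are unique but may have gaps and are not consecutive.
rdd.zipWithUniqueId().collect()
// Array((a,0), (b,2), (c,1), (d,3))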
I have a simple query that runs inside Foxx:
FOR u IN collection
  FILTER u.someIndexedSparseFiler != null
  RETURN { _id: u._id }
This returns millions of results. In the logs, ArangoDB reports that the memory heap limit was reached and terminates the process:
reached heap-size limit of #3 interrupting V8 execution (heap size limit 3232954528, used 3060226424) during V8 internal collection
Even though I added the flag --javascript.v8-max-heap 3000 at startup, it still fails with the same error. What should I do? Is there a better approach than this?
I'm not sure why you're getting out-of-memory errors, but it looks like the data you're returning is overflowing the V8 heap size. Another possibility is that something is causing the engine to miss/ignore the index, causing the engine to load every document before evaluating the someIndexedSparseFiler attribute.
Evaluating millions of documents (or lots of large documents) would not only cost a lot of disk/memory I/O, but could also require a lot of RAM. Try using the explain feature to return a query analysis - it should tell you what is going wrong.
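If you prefer the shell to the web UI, a roughly equivalent check in arangosh (assuming the query from the question) is:
// Prints the execution plan, the indexes used and the optimizer rules applied,
// without actually running the query.
db._explain(
  "FOR u IN myCollection FILTER u.someIndexedSparseFiler != null RETURN u._id"
);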
For comparison, my query...
FOR u IN myCollection
FILTER u.someIndexedSparseFiler != null
RETURN u._id
...returns this when I click "explain":
Query String (82 chars, cacheable: true):
FOR u IN myCollection
FILTER u.someIndexedSparseFiler != null
RETURN u._id
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
7 IndexNode 5 - FOR u IN myCollection /* persistent index scan, projections: `_id` */
5 CalculationNode 5 - LET #3 = u.`_id` /* attribute expression */ /* collections used: u : myCollection */
6 ReturnNode 5 - RETURN #3
Indexes used:
By Name Type Collection Unique Sparse Selectivity Fields Ranges
7 idx_1667363882689101824 persistent myCollection false true 100.00 % [ `someIndexedSparseFiler` ] *
Optimization rules applied:
Id RuleName
1 move-calculations-up
2 move-filters-up
3 move-calculations-up-2
4 move-filters-up-2
5 use-indexes
6 remove-filter-covered-by-index
7 remove-unnecessary-calculations-2
8 reduce-extraction-to-projection
Note that it lists my sparse index under "Indexes used:". Also, try changing the != to == and you will see that it now ignores the index! This is because the optimizer knows a sparse index will never have a null value, so it skips it.
If you aren't familiar with it, the "explain" functionality is extremely useful (indispensable, really) when tuning queries and creating indexes. Also, remember that indexes should match your query; in this case, the index should only have one attribute or the "selectivity" quotient may be too low and the engine will ignore it.
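For reference, a sparse persistent index like the one in the explain output above can be created roughly like this (a sketch; adjust the names to your own collection and attribute):
// Sparse means documents where the attribute is null or missing are not indexed,
// which is exactly why this index can serve `!= null` but not `== null`.
db.myCollection.ensureIndex({
  type: "persistent",
  fields: ["someIndexedSparseFiler"],
  sparse: true
});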
With a data structure like
Departure -> Trip -> Driver
an ArangoDB Spring Data derived query in the Trip repository, such as findByDriverIdNumberAndDepartureStartTimeBetween(String idNumber, String startTime, String endTime), results in an AQL query like
WITH driver, departure
FOR e IN trip
FILTER
(FOR e1 IN 1..1 OUTBOUND e._id tripToDriver FILTER e1.idNumber == '999999-9999' RETURN 1)[0] == 1
AND
(FOR e1 IN 1..1 INBOUND e._id departureToTrip FILTER e1.startTime >= '2019-08-14T00:00:00' AND e1.startTime <= '2019-08-14T23:59:59' RETURN 1)[0] == 1
RETURN e
which performs fine (~1 s) with a single-instance setup, but after setting up a cluster with the Kubernetes ArangoDB Operator with default settings (3 nodes and coordinators), the query time increased tenfold. This is probably due to sharding and the multi-machine communication needed to fulfil the query.
This attempt to optimise the query gave better results, with query times of around 3 to 4 seconds:
WITH driver, departure
FOR doc IN trip
LET drivers = (FOR v IN 1..1 OUTBOUND doc tripToDriver RETURN v)
FILTER drivers[0].idNumber == '999999-9999'
LET departures = (FOR v in 1..1 INBOUND doc departureToTrip RETURN v)
FILTER departures[0].startTime >= '2019-08-14T00:00:00' AND departures[0].startTime <= '2019-08-14T23:59:59'
RETURN doc
But can I optimise the query further for the cluster setup, to come closer to the single instance query time of one second?
AQL supports basic paging via LIMIT offset, count, but I need the total number of results in order to know the total number of pages. How do I get the total count of a query?
I know the LENGTH function can count a collection, but it doesn't seem to fit the following:
FOR v IN 2 ANY 'Collection/id1' GRAPH 'graph-name' FILTER ... LIMIT 10 RETURN DISTINCT v
I want to get the total number, but I can't get it with RETURN DISTINCT LENGTH(v).
I can currently implement this in an ungraceful way:
LET nodeList=(FOR v IN 2 any 'Collection/id1' GRAPH 'graph-name' FILTER ... RETURN distinct v)
FOR v IN 2 any 'Collection/id1' GRAPH 'graph-name' FILTER ... limit 10 RETURN distinct {'nodes': v, 'total':LENGTH(nodeList)}
Is there a better way to get this?
I found this answer in the ArangoDB Spring Data project:
AqlQueryOptions has a fullCount() function that makes the query return the total count,
and you can return a PageImpl, which contains the query results and the pagination info.
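With the plain Java driver this looks roughly like the following sketch (method names may differ slightly between driver versions; the database name is made up and the FILTER from the question is omitted to keep it runnable):
import com.arangodb.ArangoCursor;
import com.arangodb.ArangoDB;
import com.arangodb.ArangoDatabase;
import com.arangodb.entity.BaseDocument;
import com.arangodb.model.AqlQueryOptions;

public class FullCountExample {
    public static void main(String[] args) {
        ArangoDatabase db = new ArangoDB.Builder().build().db("myDatabase");

        // fullCount asks the server to also report how many results there would
        // have been without the final LIMIT.
        AqlQueryOptions options = new AqlQueryOptions().fullCount(true);

        ArangoCursor<BaseDocument> cursor = db.query(
            "FOR v IN 2 ANY 'Collection/id1' GRAPH 'graph-name' LIMIT 10 RETURN DISTINCT v",
            null, options, BaseDocument.class);

        Long total = cursor.getStats().getFullCount(); // total ignoring LIMIT
        // e.g. build a Spring Data PageImpl from cursor.asListRemaining() and total
    }
}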