I have a simple query that runs inside Foxx:
FOR u IN collection
FILTER u.someIndexedSparseFiler != null
RETURN {_id: u._id}
This returns millions of results. The ArangoDB log reports that the heap memory limit was reached, and the process is terminated:
reached heap-size limit of #3 interrupting V8 execution (heap size limit 3232954528, used 3060226424) during V8 internal collection
Even though I added the flag --javascript.v8-max-heap 3000 at start-up, it still runs into the same error. What should I do? Is there a better approach than this?
I'm not sure why you're getting out-of-memory errors, but it looks like the data you're returning is overflowing the V8 heap size. Another possibility is that something is causing the engine to miss/ignore the index, causing the engine to load every document before evaluating the someIndexedSparseFiler attribute.
Evaluating millions of documents (or lots of large documents) would not only cost a lot of disk/memory I/O, but could also require a lot of RAM. Try using the explain feature to return a query analysis - it should tell you what is going wrong.
For comparison, my query...
FOR u IN myCollection
FILTER u.someIndexedSparseFiler != null
RETURN u._id
...returns this when I click "explain":
Query String (82 chars, cacheable: true):
FOR u IN myCollection
FILTER u.someIndexedSparseFiler != null
RETURN u._id
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
7 IndexNode 5 - FOR u IN myCollection /* persistent index scan, projections: `_id` */
5 CalculationNode 5 - LET #3 = u.`_id` /* attribute expression */ /* collections used: u : myCollection */
6 ReturnNode 5 - RETURN #3
Indexes used:
By Name Type Collection Unique Sparse Selectivity Fields Ranges
7 idx_1667363882689101824 persistent myCollection false true 100.00 % [ `someIndexedSparseFiler` ] *
Optimization rules applied:
Id RuleName
1 move-calculations-up
2 move-filters-up
3 move-calculations-up-2
4 move-filters-up-2
5 use-indexes
6 remove-filter-covered-by-index
7 remove-unnecessary-calculations-2
8 reduce-extraction-to-projection
Note that it lists my sparse index under "Indexes used:". Also, try changing the != to == and you will see that it now ignores the index! This is because the optimizer knows a sparse index will never contain a null value, so it skips it.
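To make the sparse-index behaviour above concrete, here is a minimal Python sketch. It is a toy model, not ArangoDB's implementation: the attribute name flag, the sample documents and the helper functions are all made up for illustration. It only shows why a "!= null" filter can be answered from a sparse index while an "== null" filter cannot.

```python
# Toy model of a sparse index: documents whose indexed attribute is
# null or missing are simply not stored in the index.
docs = [
    {"_id": "c/1", "flag": "a"},
    {"_id": "c/2", "flag": None},
    {"_id": "c/3"},                 # attribute missing entirely
    {"_id": "c/4", "flag": "b"},
]

def build_sparse_index(documents, attribute):
    """Map each non-null attribute value to the ids of documents having it."""
    index = {}
    for doc in documents:
        value = doc.get(attribute)
        if value is not None:
            index.setdefault(value, []).append(doc["_id"])
    return index

sparse_index = build_sparse_index(docs, "flag")

def ids_not_null(index):
    """flag != null: every index entry qualifies, so the index alone is enough."""
    return sorted(_id for ids in index.values() for _id in ids)

def ids_null(documents, attribute):
    """flag == null: these documents are absent from the index, so scan the collection."""
    return sorted(d["_id"] for d in documents if d.get(attribute) is None)

print(ids_not_null(sparse_index))
print(ids_null(docs, "flag"))
```

In this model the "== null" case has no choice but to walk every document, which matches what the explain output suggests for the real optimizer.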
If you aren't familiar with it, the "explain" functionality is extremely useful (indispensable, really) when tuning queries and creating indexes. Also, remember that indexes should match your query; in this case, the index should only have one attribute or the "selectivity" quotient may be too low and the engine will ignore it.
I have a question similar to this one. Basically, I have been testing different ways to use partition key, and have noticed that at any time, the more a partition key is referenced in a query, the higher the RUs. It is quite consistent, and doesn't even matter how the partition key is used. So I narrowed it down to the basic queries for test.
To start, this database has about 850K documents, all more than 1KB in size. The partition key is basically a 100 modulus of the id in number form, is set to /partitionKey, and the container uses a default indexing policy:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*"
}
],
"excludedPaths": [
{
"path": "/\"_etag\"/?"
}
]
}
Here is my basic query test:
SELECT c.id, c.partitionKey
FROM c
WHERE c.partitionKey = 99 AND c.id = '99999'
-- Yields One Document; Actual Request Charge: 2.95 RUs
SELECT c.id, c.partitionKey
FROM c
WHERE c.id = '99999'
-- Yields One Document; Actual Request Charge: 2.85 RUs
The Azure Cosmos documentation says that without the partition key, the query will "fan out" to all logical partitions. I would therefore expect the first query to target a single partition and the second to target all of them, meaning the first one should cost fewer RUs. I am using the RU results as evidence of whether or not Cosmos is fanning out and scanning each partition, and comparing that to what the documentation says should happen.
I know these results are just 0.1 RUs in difference. But my point is the more complex the query, the bigger the difference. For example, here is another query ever so slightly more complex:
SELECT c.id, c.partitionKey
FROM c
WHERE (c.partitionKey = 98 OR c.partitionKey = 99) AND c.id = '99999'
-- Yields One Document; Actual Request Charge: 3.05 RUs
Notice that the RUs continue to grow and move further away from the cost of not specifying a partition key at all. I would instead expect the above query to target only two partitions, compared to the query without a partition key check, which supposedly fans out to all partitions.
I am starting to suspect the partition key check is happening after the other filters (or inside each partition scan). For example, going back to the first query but changing the id to something which does not exist:
SELECT c.id, c.partitionKey
FROM c
WHERE c.partitionKey = 99 AND c.id = '99999x'
-- Yields Zero Documents; Actual Request Charge: 2.79 RUs
SELECT c.id, c.partitionKey
FROM c
WHERE c.id = '99999x'
-- Yields Zero Documents; Actual Request Charge: 2.79 RUs
Notice that the RUs are exactly the same, and both queries (including the one with the partition filter) cost fewer RUs than when a document exists. This looks like a symptom of the partition filter being executed on the results rather than restricting a fan-out. But this is not what the documentation says.
Why does Cosmos have higher RUs when a partition key is specified?
As the comment says, if you test via the portal (or via code, but with the query you provided), it becomes more expensive because you are not querying a specific partition. You are querying everything and then applying another filter, which costs more.
What you should do instead is pass the partition key the proper way in code. My results were quite impressive: 3 RUs with the PK and 20.000 RUs without the PK, so I'm quite confident it works (I had a really large dataset).
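The difference between routing a request to one partition and filtering on the partition key inside a fan-out query can be sketched with a small in-memory model. This is only an illustration under stated assumptions: it reuses the "id modulo 100" partition key from the question, the partitions dictionary and find_by_id helper are hypothetical, and it does not model RU charges or how the real service plans queries.

```python
PARTITION_COUNT = 100  # the question derives the partition key as id modulo 100

def partition_key_for(doc_id: str) -> int:
    """Partition key scheme taken from the question (hypothetical helper)."""
    return int(doc_id) % PARTITION_COUNT

# Hypothetical in-memory stand-in for a partitioned container.
partitions = {pk: {} for pk in range(PARTITION_COUNT)}
for n in range(1, 1001):
    doc_id = str(n)
    pk = partition_key_for(doc_id)
    partitions[pk][doc_id] = {"id": doc_id, "partitionKey": pk}

def find_by_id(doc_id, partition_key=None):
    """Return (matching documents, number of partitions visited)."""
    if partition_key is None:
        candidates = list(partitions)      # fan out to every partition
    else:
        candidates = [partition_key]       # request routed to one partition
    matches = [partitions[pk][doc_id] for pk in candidates if doc_id in partitions[pk]]
    return matches, len(candidates)

routed, visited_routed = find_by_id("999", partition_key=99)
fanned, visited_fanned = find_by_id("999")
print(visited_routed, visited_fanned)
```

Both calls find the same document; only the number of partitions touched differs, which is the effect the answer attributes to passing the partition key through the SDK rather than only in the WHERE clause.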
AQL supports basic paging with LIMIT offset, count. But I need the total number of results of the query in order to know the total number of pages. How can I get the total count of the query?
I know the LENGTH function returns the count of a collection, but it doesn't seem to fit the following:
FOR v in 2 any 'Collection/id1' GRAPH 'graph-name' FILTER ... LIMIT 10 RETURN distinct v
I want to get the total number, but I can't get it by RETURN distinct LENGTH(v)
For now I can implement this in an ungraceful way:
LET nodeList=(FOR v IN 2 any 'Collection/id1' GRAPH 'graph-name' FILTER ... RETURN distinct v)
FOR v IN 2 any 'Collection/id1' GRAPH 'graph-name' FILTER ... limit 10 RETURN distinct {'nodes': v, 'total':LENGTH(nodeList)}
Is there a better way to get this?
I found this answer in the ArangoDB Spring Data project.
AqlQueryOptions has a fullCount() function, which makes the query return its total count.
You can then return a PageImpl, which contains the query content and the pagination info.
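Once the full count is available, the number of pages is simple arithmetic. The following Python sketch mimics the idea with a plain list: paginate and total_pages are hypothetical helper names, and the list stands in for the query result before the LIMIT is applied. It is not the Spring Data or driver API.

```python
import math

def paginate(rows, offset, count):
    """Mimic LIMIT offset, count and also report how many rows existed before the LIMIT."""
    page = rows[offset:offset + count]
    full_count = len(rows)
    return page, full_count

def total_pages(full_count, page_size):
    """Number of pages needed to show full_count rows at page_size rows per page."""
    return math.ceil(full_count / page_size)

rows = list(range(25))                    # stand-in for the unpaged query result
page, full_count = paginate(rows, offset=10, count=10)
print(page, full_count, total_pages(full_count, 10))
```

The point of fullCount is that the database reports the pre-LIMIT total alongside the page, so the query does not have to be run a second time just to count.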
I have a problem I do not understand. For example, a cluster has three nodes,
and each node is an agent, a coordinator, and a DB server. I execute an AQL query on one node. The execution plan is as follows:
Query string:
FOR v,p IN outbound
SHORTEST_PATH #startnode
TO #endnode
GRAPH #graphname options {bfs:true}
filter v.c_id==#c_id
return v
Execution plan:
Id NodeType Site Est. Comment
1 SingletonNode COOR 1 * ROOT
2 ShortestPathNode COOR 10000000 - FOR v /* vertex */, p /* edge */ IN OUTBOUND SHORTEST_PATH 'contact_vertex/f69375e854a34250986399d1909c0776' /* startnode */ TO 'contact_vertex/e11a10f848cd4a4392b8f10ea307eac9' /* targetnode */ GRAPH 'an'
3 CalculationNode COOR 10000000 - LET #4 = (v.c_id == -2147483643) /* simple expression */
4 FilterNode COOR 10000000 - FILTER #4
5 ReturnNode COOR 10000000 - RETURN v
Indexes used:
none
Shortest paths on graphs:
Id Vertex collections Edge collections
2 contact_vertex contact_edge
Optimization rules applied:
none
I want to know whether the query is executed on one node or distributed across all nodes. In other words, is the query computed in a distributed fashion?
As in the above query, I have a lot of data. I have two collections, one vertex collection and one edge collection, which together constitute a graph, and I use the default indexes. Is there a way to quickly improve the search speed, for example by setting an index or increasing the query memory? Any other suggestions would be welcome.
My cluster starts as follows:
arangodb --starter.join 10.200.11.32,10.200.11.34,10.200.11.35
How should I set up my cluster?
All three nodes have the same configuration:
RAM: 254G, cores: 40
I'm sure there is an easy and fast way to do this but it's escaping me. I have a large dataset that has some duplicate records, and I want to get rid of the duplicates. (the duplicates are uniquely identified by one property, but the rest of the document should be identical as well).
I've attempted to create a new collection that only has unique values a few different ways, but they are all quite slow. For example:
FOR doc IN Documents
COLLECT docId = doc.myId, doc2 = doc
INSERT doc2 IN Documents2
or
FOR doc IN Documents
LET existing = (FOR doc2 IN Documents2
FILTER doc.myId == doc2.myId
RETURN doc2)
UPDATE existing WITH doc IN Documents2
or (this gives me a "violated unique constraint" error)
FOR doc IN Documents
UPSERT {myId: doc.myId}
INSERT doc
UPDATE doc IN Documents2
TL;DR
It does not take that long to de-duplicate the records and write them to another collection (less than 60 seconds), at least on my desktop machine (Windows 10, Intel 6700K 4x4.0GHz, 32GB RAM, Evo 850 SSD).
Certain queries require proper indexing however, or they will last forever. Indexes require some memory, but compared to the needed memory during query execution for grouping the records, it is negligible. If you're short of memory, performance will suffer because the operating system needs to swap data between memory and mass storage. This is especially a problem with spinning disks, not so much with fast flash storage devices.
Preparation
I generated 2.2 million records with 5-20 random attributes and 160 chars of gibberish per attribute. In addition, every record has an attribute myid. 1.87m records have a unique myid, 60k myids exist twice, and 70k three times. The collection size was reported as 4.83GB:
// 1..2000000: 300s
// 1..130000: 20s
// 1..70000: 10s
FOR i IN 1..2000000
LET randomAttributes = MERGE(
FOR j IN 1..FLOOR(RAND() * 15) + 5
RETURN { [CONCAT("attr", j)]: RANDOM_TOKEN(160) }
)
INSERT MERGE(randomAttributes, {myid: i}) INTO test1
Memory consumption before starting ArangoDB was at 3.4GB, after starting 4.0GB, and around 8.8GB after loading the test1 source collection.
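The composition of the test data can be checked with a little arithmetic. This sketch assumes the three ranges in the comments of the generator query (1..2000000, 1..130000, 1..70000) were each run once and that each run assigns myid: i. The variable names are illustrative only.

```python
# Upper bounds of the three generator runs listed in the query comments.
runs = [2_000_000, 130_000, 70_000]

total_records = sum(runs)            # every run inserts one record per i
distinct_ids = max(runs)             # the smaller ranges only repeat existing ids
ids_three_times = 70_000             # ids 1..70000 occur in all three runs
ids_twice = 130_000 - 70_000         # ids 70001..130000 occur in two runs
ids_once = 2_000_000 - 130_000       # the remaining ids occur in one run only
redundant = total_records - distinct_ids

print(total_records, distinct_ids, ids_once, ids_twice, ids_three_times, redundant)
```

Under these assumptions the collection holds 2.2m records with 2m distinct myid values, so a correct de-duplication should drop 200k records.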
Baseline
Reading from test1 and inserting all documents (2.2m) into test2 took 20s on my system, with a memory peak of ~17.6GB:
FOR doc IN test1
INSERT doc INTO test2
Grouping by myid without writing took approx. 9s for me, with 9GB RAM peak during query:
LET result = (
FOR doc IN test1
COLLECT myid = doc.myid
RETURN 1
)
RETURN LENGTH(result)
Failed grouping
I tried your COLLECT docId = doc.myId, doc2 = doc approach on a dataset with just 3 records and one duplicate myid. It showed that the query does not actually group/remove duplicates. I therefore tried to find alternative queries.
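One plausible explanation, which is my assumption rather than something verified against the query engine here, is that grouping by both the id and the whole document makes every group key unique, because each stored document carries its own _key. A small Python model with 3 records and one duplicate myId shows the effect. The sample data and the group_count helper are made up for illustration.

```python
import json

# Three records, one duplicate myId; each stored document has its own _key.
docs = [
    {"_key": "1", "myId": 7, "payload": "x"},
    {"_key": "2", "myId": 7, "payload": "x"},   # duplicate of the first by myId
    {"_key": "3", "myId": 8, "payload": "y"},
]

def group_count(documents, key_func):
    """Number of distinct groups produced by the given grouping key."""
    return len({key_func(d) for d in documents})

# Grouping by (myId, whole document): the differing _key keeps the groups apart.
by_id_and_doc = group_count(docs, lambda d: (d["myId"], json.dumps(d, sort_keys=True)))

# Grouping by myId alone: the duplicate collapses into one group.
by_id_only = group_count(docs, lambda d: d["myId"])

print(by_id_and_doc, by_id_only)
```

Grouping on the id alone is what actually collapses duplicates, which is why the queries that follow use COLLECT myid = doc.myid and fetch the document body separately.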
Grouping with INTO
To group duplicate myids together but retain the possibility to access the full documents, COLLECT ... INTO can be used. I simply picked the first document of every group to remove redundant myids. The query took about 40s for writing the 2m records with unique myid attribute to test2. I didn't measure memory consumption accurately, but I saw different memory peaks spanning 14GB to 21GB. Maybe truncating the test collections and re-running the queries increases the required memory because of some stale entries that get in the way somehow (compaction / key generation)?
FOR doc IN test1
COLLECT myid = doc.myid INTO groups
INSERT groups[0].doc INTO test2
Grouping with subquery
The following query showed a more stable memory consumption, peaking at 13.4GB:
FOR doc IN test1
COLLECT myid = doc.myid
LET doc2 = (
FOR doc3 IN test1
FILTER doc3.myid == myid
LIMIT 1
RETURN doc3
)
INSERT doc2[0] INTO test2
Note however that it required a hash index on myid in test1 to achieve a query execution time of ~38s. Otherwise the subquery will cause millions of collection scans and take ages.
Grouping with INTO and KEEP
Instead of storing the whole documents that fell into a group, we can assign just the _id to a variable and KEEP it so that we can look up the document bodies using DOCUMENT():
FOR doc IN test1
LET d = doc._id
COLLECT myid = doc.myid INTO groups KEEP d
INSERT DOCUMENT(groups[0].d) INTO test2
Memory usage: 8.1GB after loading the source collection, 13.5GB peak during the query. It only took 30 seconds for the 2m records!
Grouping with INTO and projection
Instead of KEEP I also tried a projection out of curiosity:
FOR doc IN test1
COLLECT myid = doc.myid INTO groups = doc._id
INSERT DOCUMENT(groups[0]) INTO test2
RAM was at 8.3GB after loading test1, and the peak at 17.8GB (there were actually two heavy spikes during the query execution, both going over 17GB). It took 35s to complete for the 2m records.
Upsert
I tried something with UPSERT, but saw some strange results. It turned out to be an oversight in ArangoDB's upsert implementation. v3.0.2 contains a fix and I get correct results now:
FOR doc IN test1
UPSERT {myid: doc.myid}
INSERT doc
UPDATE {} IN test2
It took 40s to process with a (unique) hash index on myid in test2, with a RAM peak around 13.2GB.
Delete duplicates in-place
I first copied all documents from test1 to test2 (2.2m records), then I tried to REMOVE just the duplicates in test2:
FOR doc IN test2
COLLECT myid = doc.myid INTO keys = doc._key
LET allButFirst = SLICE(keys, 1) // or SHIFT(keys)
FOR k IN allButFirst
REMOVE k IN test2
Memory was at 8.2GB (with only test2 loaded) and went up to 13.5GB during the query. It took roughly 16 seconds to delete the duplicates (200k).
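The "keep the first, remove the rest" logic of the query above can be sketched in Python as follows. The keys_to_remove helper and the sample documents are hypothetical. The sketch only mirrors the grouping and the SLICE(keys, 1) step, not the actual REMOVE operation.

```python
def keys_to_remove(documents):
    """Group _key values by myid and return every key except the first of each group."""
    groups = {}
    for doc in documents:
        groups.setdefault(doc["myid"], []).append(doc["_key"])
    # keys[1:] plays the role of SLICE(keys, 1) in the AQL query.
    return [key for keys in groups.values() for key in keys[1:]]

sample = [
    {"_key": "a", "myid": 1},
    {"_key": "b", "myid": 1},
    {"_key": "c", "myid": 2},
    {"_key": "d", "myid": 1},
]
print(keys_to_remove(sample))
```

Which document of a group counts as "first" depends on the order in which the grouping sees them, so this approach only makes sense when the duplicates are interchangeable, as in the question.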
Verification
The following query groups myid together and aggregates how often every id occurs. Run against the target collection test2, the result should be {"1": 2000000}, otherwise there are still duplicates. I double-checked the query results above and everything checked out.
FOR doc IN test2
COLLECT myid = doc.myid WITH COUNT INTO count
COLLECT c = count WITH COUNT INTO cc
RETURN {[c]: cc}
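The verification query is a "count of counts": first count how often each myid occurs, then count how many ids share each occurrence count. A minimal Python equivalent, with occurrence_histogram as a hypothetical helper name:

```python
from collections import Counter

def occurrence_histogram(myids):
    """Map 'number of occurrences' to 'how many ids occur that often'."""
    per_id = Counter(myids)                  # first COLLECT ... WITH COUNT INTO count
    return dict(Counter(per_id.values()))    # second COLLECT over the counts

print(occurrence_histogram([1, 2, 3]))       # no duplicates: a single bucket for count 1
print(occurrence_histogram([1, 1, 2]))       # one id occurs twice, one occurs once
```

A de-duplicated collection yields a single bucket for the count 1; any other bucket means duplicates are still present.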
Conclusion
The performance appears to be reasonable with ArangoDB v3.0, although it may degrade if not enough RAM is available. The different queries completed roughly within the same time, but showed different RAM usage characteristics. For certain queries, indexes are necessary to avoid high computational complexity (here: full collection scans; 2,200,000,000,000 reads in the worst case?).
Can you try my presented solutions on your data and check what the performance is on your machine?
I built a Spark cluster.
workers:2
Cores:12
Memory: 32.0 GB Total, 20.0 GB Used
Each worker gets 1 CPU, 6 cores, and 10.0 GB of memory.
My program reads its data from a MongoDB cluster. The Spark and MongoDB clusters are on the same LAN (1000 Mbps).
MongoDB document format:
{name:string, value:double, time:ISODate}
There are about 13 million documents.
I want to get the average value for a specific name over a specific hour, which contains 60 documents.
Here is my key function:
/*
*rdd=sc.newAPIHadoopRDD(configOriginal, classOf[com.mongodb.hadoop.MongoInputFormat], classOf[Object], classOf[BSONObject])
Apache-Spark-1.3.1 scala doc: SparkContext.newAPIHadoopFile[K, V, F <: InputFormat[K, V]](path: String, fClass: Class[F], kClass: Class[K], vClass: Class[V], conf: Configuration = hadoopConfiguration): RDD[(K, V)]
*/
def findValueByNameAndRange(rdd:RDD[(Object,BSONObject)],name:String,time:Date): RDD[BasicBSONObject]={
val nameRdd = rdd.map(arg=>arg._2).filter(_.get("name").equals(name))
val timeRangeRdd1 = nameRdd.map(tuple=>(tuple, tuple.get("time").asInstanceOf[Date]))
val timeRangeRdd2 = timeRangeRdd1.map(tuple=>(tuple._1,duringTime(tuple._2,time,getHourAgo(time,1))))
val timeRangeRdd3 = timeRangeRdd2.filter(_._2).map(_._1)
val timeRangeRdd4 = timeRangeRdd3.map(x => (x.get("name").toString, x.get("value").toString.toDouble)).reduceByKey(_ + _)
if(timeRangeRdd4.isEmpty()){
return basicBSONRDD(name, time)
}
else{
return timeRangeRdd4.map(tuple => {
val bson = new BasicBSONObject()
bson.put("name", tuple._1)
bson.put("value", tuple._2/60)
bson.put("time", time)
bson })
}
}
Here is part of Job information
My program runs very slowly. Is it because of isEmpty and reduceByKey? If yes, how can I improve it? If not, why?
=======update ===
timeRangeRdd3.map(x => (x.get("name").toString, x.get("value").toString.toDouble)).reduceByKey(_ + _)
is on line 34.
I know reduceByKey is a global operation and may cost a lot of time; however, what it costs is beyond my budget. How can I improve it, or is this a defect of Spark? With the same calculation and hardware, it takes just several seconds if I use multiple threads in Java.
First, isEmpty is merely the point at which the RDD stage ends. The maps and filters do not create a need for a shuffle, and the method shown in the UI is always the method that triggers a stage change or shuffle, in this case isEmpty. Why it's running slowly is not as easy to discern from this perspective, especially without seeing the composition of the originating RDD. I can tell you that isEmpty first checks the partition size, then does a take(1) and verifies whether data was returned or not. So the odds are that there is a bottleneck in the network or something else blocking along the way. It could even be a GC issue. Click into the isEmpty and see what more you can discern from there.
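The point that isEmpty is only where the work becomes visible can be illustrated with lazy evaluation in plain Python. This is an analogy, not Spark: generators stand in for lazy transformations, and build_pipeline, expensive_filter, and is_empty are made-up names.

```python
work_log = []          # records every element the "expensive" step actually touched
_SENTINEL = object()   # marks "no element was produced"

def expensive_filter(record):
    """Stand-in for the map/filter chain; logs each record it processes."""
    work_log.append(record)
    return record % 2 == 0

def build_pipeline(source):
    """Lazy transformation: defining it performs no work."""
    return (r for r in source if expensive_filter(r))

def is_empty(pipeline):
    """Action: pulls elements until the first one survives the filter."""
    return next(iter(pipeline), _SENTINEL) is _SENTINEL

pipeline = build_pipeline(range(1, 6))
print(len(work_log))       # nothing has been processed yet
print(is_empty(pipeline))  # the filter runs only now
print(work_log)            # the work done on behalf of is_empty
```

If no element passes the filter, is_empty has to drive the whole source through the pipeline, so all of the upstream cost shows up under that one call. That is why a slow isEmpty in the UI says little about isEmpty itself.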