Datomic exceeds GC overhead limit

I'm trying to count entities in Datomic with this query:
(d/q '[:find (count ?a) . :where [?a :type]] (d/db (conn)))
It fails with:
OutOfMemoryError GC overhead limit exceeded [trace missing]
It works, though, if I try to count a smaller subset, like:
(d/q '[:find (count ?a) :where [?a :type "psp"]] (d/db (conn)))
[[400541]]
I'm using the dev backend.
Am I doing something wrong, or should I try a different backend, or something else?
Here is the stack trace: http://pastebin.com/C76mEhEJ. It leads somewhere inside datomic.datalog.

Queries in Datomic are eager: even when you use aggregates, the entire intermediate representation is realized. In your case, that is the collection of all [entity-id :type value] tuples for all entities in your database. You will see errors like this when the entire intermediate set can't be realized in memory but the query isn't structured in a way that lets Datomic trivially detect a full database scan (in those cases it throws instead).
If you're scanning the entire database, datoms - documented here - is a better fit, as it lazily traverses all datoms matching a given index prefix. A lazy-seq approach to a database scan with datoms for your use case could look like:
(count (dedupe (map :e (d/datoms (d/db conn) :aevt :type))))
This gets all datoms from the :aevt index that have the attribute :type (the attribute is the leading component of the index, which narrows the results). We map over the datoms as a seq, take the :e (entity id) from each one, and dedupe so that each entity is counted only once; in :aevt, the datoms for a given entity are adjacent, so dedupe suffices. You could skip the dedupe step entirely if :type is a cardinality-one attribute.
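If you're unsure whether the dedupe is needed, you can ask the schema directly. A minimal sketch with the peer API (assuming :type is the attribute's ident):

(require '[datomic.api :as d])

;; d/attribute is part of the peer API; :cardinality is either
;; :db.cardinality/one or :db.cardinality/many
(:cardinality (d/attribute (d/db conn) :type))
;; => :db.cardinality/one would mean each entity has at most one
;;    :type datom, so the dedupe step can be dropped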

Related

Memory usage of node-influx when querying

I use node-influx, and influx.query(...) uses too much heap.
In my application I have something like:
const data = await influx.query(`SELECT "f_power" as "power", "nature", "scope", "participantId"
FROM "power"
WHERE
"nature" =~ /^Conso$|^Prod$|^Autoconso$/ AND
"source" =~ /^$|^One$|^Two AND
"participantId" =~ /${participantIdFilter.join('|')}/ AND
"time" >= '${req.query.dateFrom}' AND
"time" <= '${req.query.dateTo}'
GROUP BY "timestamp"
`);
const wb = dataToWb(data);
XLSX.writeFile(wb, filename);
data is a result set of about 50 MB (I used this code to measure it),
and the heap used by this method is about 350 MB (from process.memoryUsage().heapUsed).
I'm surprised by the difference between these two values...
So, is it possible to make this query less resource-intensive?
Actually, I use data to build an xlsx file, and the generation of this file makes the node process run out of memory. The XLSX.writeFile(wb, filename) method uses about 100 MB, which by itself is not enough to fill my 512 MB of RAM. So I figured that it is the heap used by the influx query that is never collected by the GC.
I don't understand why the file generation triggers this error. Why can't V8 free the memory used by a method that has already finished when a later method runs in another context?
The node-influx (1.x) client reads the whole response, parses it into JSON, and transforms it into the results (your data array). During that processing there are many more intermediate objects on the heap. You should also run the Node garbage collector before and after the query to get a better estimate of how much heap the result actually retains. For now you can only control the result's memory usage by reducing the result columns or rows (with limits, time ranges, or aggregations). You can also split the work into queries with smaller results to reduce the peak heap usage caused by temporary objects; of course, that is paid for with time and with the complexity of your code.
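A small sketch of that measurement (run node with --expose-gc so that global.gc() is available; the query here is a placeholder, and this must sit inside an async function):

// node --expose-gc app.js
global.gc(); // collect first so the baseline is meaningful
const before = process.memoryUsage().heapUsed;
const data = await influx.query(`SELECT "f_power" FROM "power" LIMIT 10000`);
global.gc(); // collect the temporaries created while parsing the response
const after = process.memoryUsage().heapUsed;
console.log(`heap retained by the result: ${((after - before) / 1048576).toFixed(1)} MB`);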
You can expect less memory usage with the 2.x client (https://github.com/influxdata/influxdb-client-js). It does not create intermediate JSON objects, it internally processes results in smaller chunks, and it has an observable-like API that lets the user decide how to represent each result row. It uses Flux as the query language and requires InfluxDB 2.x or InfluxDB 1.8 (with the 2.x API enabled).
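A minimal sketch of that row-by-row API (the url, token, org, bucket, and Flux query here are placeholder assumptions):

const {InfluxDB} = require('@influxdata/influxdb-client');

const queryApi = new InfluxDB({url: 'http://localhost:8086', token: 'my-token'})
  .getQueryApi('my-org');

queryApi.queryRows('from(bucket:"power") |> range(start: -1d)', {
  next(row, tableMeta) {
    const o = tableMeta.toObject(row); // handle one row at a time (e.g. append it to the xlsx)
  },
  error(err) {
    console.error(err);
  },
  complete() {
    console.log('done');
  },
});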

array_intersect giving performance issue in presto

I have a query running in Presto that has an array_intersect condition. It takes around 5 hours to run; if I remove the array_intersect, it takes less than an hour.
CARDINALITY(ARRAY_INTERSECT(links, ARRAY['504949547', '504949616', '515604515', '515604526', '515604527', '515604528'])) > 0
Can anyone please let me know how to improve the performance? I have to get it under 5 minutes.
I have tried enabling spill to disk, but it didn't help. The input data size is around 1 TB.
Thanks
array_intersect materializes the result (the intersection), whereas the only thing you are checking is the membership of certain predefined elements.
In this case I'd recommend using any_match instead.
any_match(links, e -> e IN ('504949547', '504949616', ...))
If you're using a Presto version that doesn't have any_match, you can use reduce:
reduce(
links, -- array to reduce
false, -- initial state
(s, e) -> s OR e IN ('504949547', '504949616', ...), -- reduction function
s -> s) -- output function
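Depending on your Presto version, arrays_overlap may also be worth a try; it only tests whether the two arrays share an element, without materializing the intersection:

arrays_overlap(links, ARRAY['504949547', '504949616', '515604515',
                            '515604526', '515604527', '515604528'])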
I have tried enabling spill to disk, but it didn't help.
Note: In Presto, spill is supported only for certain operators (most joins, aggregations, ORDER BY, window functions); it is not applicable to scalar functions operating on ARRAYs. Also, you should not expect spill to improve performance: it can only reduce the memory footprint, at the cost of performance.

Task server on ML

I have a query that may return up to 2000 documents.
Within these documents I need six pcdata items returned as string values.
Since the document sizes range from small to very large, there is a possibility of an expanded tree cache error.
I am looking at xdmp:spawn-function to break up my result set.
I will pass wildcard values based on a known "unique key structure", and I will know the maximum number of results possible; each wildcard value will return 100 documents max.
Note: The pcdata for the unique key structure does have a range index on it.
Am I on the right track with the code below?
The task server will create three tasks.
The task server will allow multiple queries to run, but what stops them all from running simultaneously and blowing out the expanded tree cache?
i.e. What, if anything, forces one thread to wait for another, or one task to wait for another, so that they don't all blow out the expanded tree cache together?
xquery version "1.0-ml";

let $messages :=
  (: each wildcard value will return 100 documents max :)
  for $message in ("WILDCARDVAL1", "WILDCARDVAL2", "WILDCARDVAL3")
  let $_ := xdmp:log("Starting")
  return
    xdmp:spawn-function(
      function() {
        let $_ := xdmp:sleep(5000)
        let $_ := xdmp:log(concat("Searching on wildcard val=", $message))
        return concat("100 pcdata items from the matched documents for ", $message)
      },
      <options xmlns="xdmp:eval">
        <result>true</result>
        <transaction-mode>update-auto-commit</transaction-mode>
      </options>)
return $messages
The Task Server configuration listed in the Admin UI defines the maximum number of simultaneous threads. If more tasks are spawned than there are threads, they are queued (FIFO I think, although ML9 has task priority options that modify that behavior), and the first queued task takes the next available thread.
The <result>true</result> option will force the spawning query to block until the tasks return. The tasks themselves are run independently and in parallel, and they don't wait on each other to finish. You may still run into problems with the expanded tree cache, but by splitting up the query into smaller ones, it could be less likely.
For a better understanding of why you are blowing out the cache, take a look at the functions xdmp:query-trace() and xdmp:query-meters(). Using the Task Server is more of a brute force solution, and you will probably get better results by optimizing your queries using information from those functions.
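For instance, a quick sketch of that kind of check (the element names reported by xdmp:query-meters are from memory, so verify them against your version):

(: run the suspect query first, then inspect the cache counters :)
let $result := ()  (: replace () with your search :)
return xdmp:query-meters()//*:expanded-tree-cache-misses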
If you can't make your query more selective than 2000 documents, but you only need a few string values, consider creating range indexes on those values and using cts:values to select only those values directly from the index, filtered by the query. That method would avoid forcing the database to load documents into the cache.
It might be more efficient to use MarkLogic's capability to return co-occurrences, or even 3+ tuples of value combinations from within documents, using functions like cts:value-tuples. You can blend in a cts:uri-reference (http://docs.marklogic.com/cts:uri-reference) to get the document URI returned as part of the tuples.
It requires having range indexes on all those values, though.
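A hedged sketch of that approach (the element name keyElement and the wildcarded query are assumptions standing in for your unique key structure):

cts:value-tuples(
  (cts:element-reference(xs:QName("keyElement")),
   cts:uri-reference()),
  (),
  cts:element-value-query(xs:QName("keyElement"), "WILDCARDVAL1*", "wildcarded"))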
HTH!

Traversing the optimum path between nodes

In a graph where there are multiple paths from point (:A) to (:B) through nodes (:C), I'd like to extract the paths from (:A) to (:B) that go through nodes of type (c:C) where c.Value is maximal. For instance: connect all movies with only their oldest common actors.
match p=(m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
return m1.Name, m2.Name, a.Name, max(a.Age)
The above query returns the proper age for the oldest actor, but not always the correct name.
Conversely, I noticed that the following query returns both correct age and name.
match p=(m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
with m1, m2, a order by a.Age desc
return m1.Name, m2.Name, max(a.Age), head(collect(a.Name))
Would this always be true? I guess so.
Is there a better way to do the job without the sorting, which may cost a lot?
You need to use ORDER BY ... LIMIT 1 for this:
match p=(m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
return m1.Name, m2.Name, a.Name, a.Age order by a.Age desc limit 1
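Note that ORDER BY ... LIMIT 1 yields a single row overall; if you want the oldest actor per movie pair, a sketch along the lines of the question's own second query:

match (m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
with m1, m2, a order by a.Age desc
return m1.Name, m2.Name, max(a.Age) as oldestAge, head(collect(a.Name)) as oldestActor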
Be aware that you basically want to do a weighted shortest path. Neo4j can do this more efficiently with Java code and the GraphAlgoFactory; see the chapter on this in the reference manual.
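A rough sketch of that Java route (embedded API; the relationship type, the "cost" property, and how nodeA/nodeB are looked up are assumptions, and the exact classes vary by Neo4j version):

import org.neo4j.graphalgo.GraphAlgoFactory;
import org.neo4j.graphalgo.PathFinder;
import org.neo4j.graphalgo.WeightedPath;
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.PathExpanders;
import org.neo4j.graphdb.RelationshipType;

// Dijkstra over ACTED_IN relationships, weighted by a "cost" property
PathFinder<WeightedPath> finder = GraphAlgoFactory.dijkstra(
    PathExpanders.forTypeAndDirection(
        RelationshipType.withName("ACTED_IN"), Direction.BOTH),
    "cost");
WeightedPath path = finder.findSinglePath(nodeA, nodeB); // nodes resolved elsewhere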
For those who want to do similar things, consider reading this post from #_nicolemargaret, which describes how to extract the n oldest actors acting in pairs of movies, rather than just the first as with head(collect()).

Internal working of Spark - Communication/Synchronization

I am quite new to Spark but already have programming experience in the BSP model. In BSP (e.g. Apache Hama), we have to handle all the communication and synchronization between nodes ourselves. That is good on one hand, because we have finer control over what we want to achieve, but on the other hand it adds complexity.
Spark, on the other hand, takes all the control and handles everything on its own (which is great), but I don't understand how it works internally, especially in cases where there is a lot of data and message passing between nodes. Let me give an example:
zb = sc.broadcast(z)  // ship the current z to all workers
r_i = x_i.map(x => Math.pow(norm(x - zb.value), 2))  // squared residual for each entry
r_i.checkpoint()
u_i = u_i.zip(x_i).map(ux => ux._1 + ux._2 - zb.value)  // elementwise u + x - z
u_i.checkpoint()
x_i = f.prox(u_i.map(ui => {zb.value - ui}), rho)  // proximal step on z - u
x_i.checkpoint()
x = x_i.reduce(_+_) / f.numSplits.toDouble  // distributed sum, then average
u = u_i.reduce(_+_) / f.numSplits.toDouble
z = g.prox(x+u, f.numSplits*rho)  // driver-side update, broadcast next iteration
r = Math.sqrt(r_i.reduce(_+_))  // residual norm for convergence checks
This is a method taken from here, which runs in a loop (let's say 200 times). x_i contains our data (let's say 100,000 entries).
In a BSP-style program, if we had to process this map operation, we would partition the data and distribute it over multiple nodes. Each node would process a sub-part of the data (the map operation) and return its result to the master (after barrier synchronization). Since the master node wants to process each individual result returned (a centralized master; see the figure below), we send the result of each entry to the master (the reduce operator in Spark). So the master (alone) receives 100,000 messages after each iteration. It processes this data and sends the new values back to the slaves, which then start processing the next iteration.
Now, since Spark takes control from the user and does everything internally, I am unable to understand how Spark collects all the data after the map operations. Is it asynchronous message passing? I heard it has p2p message passing. What about synchronization between map tasks? If it does synchronize, is it then right to say that Spark is actually a BSP model? Then, in order to apply the reduce function, does it collect all the data on a central machine (if so, does a single machine receive 100,000 messages?), or does it reduce in a distributed fashion (and if so, how is that performed)?
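For reference, a sketch of the two reduction shapes in the RDD API (both calls are real Spark API; the depth value is arbitrary). reduce combines values inside each partition first, so the driver receives one partial result per partition rather than one message per element, while treeReduce merges the partials on executors in rounds before they reach the driver:

val total = x_i.reduce(_ + _)  // per-partition combine, then merge on the driver
val totalTree = x_i.treeReduce(_ + _, depth = 2)  // multi-level merge across executors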
The following figure shows my reduce function on the master. x_i^(k-1) represents the value calculated (in the previous iteration) for the x_i data entry of my input, and x_i^k represents the value of x_i calculated in the current iteration. Clearly, this equation needs the results to be collected.
I actually want to compare both styles of distributed programming to understand when to use Spark and when to move to BSP. Furthermore, I looked around a lot on the internet, but all I found was how map/reduce works; nothing useful was available on the actual communication/synchronization. Any helpful material would be welcome as well.
