Task server on ML - multithreading

I have a query that may return up to 2000 documents.
Within these documents I need six pcdata items returned as string values.
Since the document sizes range from small to very large, there is a possibility of an expanded tree cache error.
I am looking at xdmp:spawn-function to break up my result set.
I will pass wildcard values, based on a known "unique key structure", and will know the max number of results possible; each wildcard value will return 100 documents max.
Note: The pcdata for the unique key structure does have a range index on it.
Am I on the right track with below?
The task server will create three tasks.
The task server will allow multiple queries to run, but what stops them all running simultaneously and blowing out the expanded tree cache?
i.e. What, if anything, forces one thread to wait for another? Or one task to wait for another, so they do not all blow out the expanded tree cache together?
xquery version "1.0-ml";

let $messages :=
  (: each wildcard value will return 100 documents max :)
  for $message in ("WILDCARDVAL1", "WILDCARDVAL2", "WILDCARDVAL3")
  let $_ := xdmp:log("Starting")
  return
    xdmp:spawn-function(
      function() {
        let $_ := xdmp:sleep(5000)
        let $_ := xdmp:log(concat("Searching on wildcard val=", $message))
        return concat("100 pcdata items from the matched documents for ", $message)
      },
      <options xmlns="xdmp:eval">
        <result>true</result>
        <transaction-mode>update-auto-commit</transaction-mode>
      </options>)
return $messages

The Task Server configuration listed in the Admin UI defines the maximum number of simultaneous threads. If more tasks are spawned than there are threads, they are queued (FIFO I think, although ML9 has task priority options that modify that behavior), and the first queued task takes the next available thread.
The <result>true</result> option will force the spawning query to block until the tasks return. The tasks themselves run independently and in parallel, and they don't wait on each other to finish. You may still run into problems with the expanded tree cache, but splitting the query into smaller ones makes that less likely.
For a better understanding of why you are blowing out the cache, take a look at the functions xdmp:query-trace() and xdmp:query-meters(). Using the Task Server is more of a brute force solution, and you will probably get better results by optimizing your queries using information from those functions.
If you can't make your query more selective than 2000 documents, but you only need a few string values, consider creating range indexes on those values and using cts:values to select only those values directly from the index, filtered by the query. That method would avoid forcing the database to load documents into the cache.
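As a rough illustration, such a lexicon lookup might look like the following; the element names are placeholders, and this assumes element range indexes exist on them:

(: Pull values straight from the range index; no documents are loaded,
   so the expanded tree cache is not touched. :)
cts:values(
  cts:element-reference(xs:QName("item1")),
  (),
  (),
  cts:element-value-query(xs:QName("unique-key"), "WILDCARDVAL1*", "wildcarded")
)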

It might be more efficient to use MarkLogic's capability to return co-occurrences, or even 3+ tuples of value combinations from within documents, using functions like cts:value-co-occurrences and cts:value-tuples. You can blend in a [cts:uri-reference](http://docs.marklogic.com/cts:uri-reference) to get the document URI returned as part of the tuples.
It requires having range indexes on all those values, though.
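A minimal sketch, again with placeholder element names and assuming range indexes on each (the URI lexicon must be enabled for cts:uri-reference):

(: Returns one tuple per co-occurrence: (item1, item2, document URI). :)
cts:value-tuples(
  (
    cts:element-reference(xs:QName("item1")),
    cts:element-reference(xs:QName("item2")),
    cts:uri-reference()
  ),
  (),
  cts:element-value-query(xs:QName("unique-key"), "WILDCARDVAL1*", "wildcarded")
)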
HTH!

Get an incrementing number in Logic App Select

I am using a Logic App to transform some data for an integration. I am trying to avoid using For Each loops as the amount of data I am working with is high, and these incur a cost for each action and iteration of the for each loop.
However, the integration I am working with requires a unique incrementing number for each line. They don't have to be sequential, or even start at 1, but the order should be kept the same.
So with the above, the first one would get LineNumber 1, the second LineNumber 2, etc. (or, like I said, it could be 67829, 67835, etc.)
I tried to set a variable with ticks(utcNow()) before the start of the mapping, and then use sub(ticks(utcNow()), variables('startTicks')) but this is evaluated once and the same number is applied to all.
My next thought is to use an Azure Function / inline JavaScript to go through afterward and assign them, but I am wondering if there is a way to accomplish this in the Select.
Answering the requirement that the numbers need not be sequential ("or like I said, it could be 67829, 67835, etc."), you can use this inside the Select action:
indexOf(string(variables('<DATA Variable>')),string(item()))
Explanation: item() is the current item in the Select. Both the current item and the entire data variable are stringified, and indexOf returns the position of the item's string within the data string. That index increases with each item's position in the data, so the order is preserved even though the numbers are not sequential.
Please note: I did not get a chance to check this on a very large dataset. It may also fail if an entire row (all values in the row) is repeated, but I assume that is not your case, since the order number should be unique.
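For reference, a minimal sketch of what the Select action could look like in code view; the variable name DATA and the OrderNumber field are assumptions for illustration:

{
    "type": "Select",
    "inputs": {
        "from": "@variables('DATA')",
        "select": {
            "LineNumber": "@indexOf(string(variables('DATA')), string(item()))",
            "OrderNumber": "@item()?['OrderNumber']"
        }
    }
}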

How to use synchronous messages on rabbit queue?

I have a Node.js function that needs to be executed for each order on my application. In this function my app gets an order number from an Oracle database, processes the order, and then adds +1 to that number in the database (it needs to be the last thing in the function, because an order can fail, in which case the number will not be used).
If all received orders at time T are processed at the same time (concurrently), then the same order number will be used for multiple orders, and I don't want that.
So I used RabbitMQ to try to remedy this situation, since it is a queue. The processes seem to finish in the order they should, but a second process does NOT wait for the first one to finish (ack) before it begins, so in the end I'm having the same problem of using the same order number multiple times.
Is there any way I can configure my queue to process one message at a time? To only start processing message n+1 when message n has been acknowledged?
This would be a life saver to me!
If the problem is to avoid duplicate order numbers, then use an Oracle sequence, or use an identity column when you insert into a table to generate the order number:
CREATE TABLE mytab (
    id   NUMBER GENERATED BY DEFAULT ON NULL AS IDENTITY (START WITH 1),
    data VARCHAR2(20)
);

INSERT INTO mytab (data) VALUES ('abc');
INSERT INTO mytab (data) VALUES ('def');

SELECT * FROM mytab;
This will give:
        ID DATA
---------- --------------------
         1 abc
         2 def
If the problem is that you want orders to be processed sequentially, then don't pull an order from the queue until the previous one is finished. This will limit your throughput, so you need to understand your requirements and make some architectural decisions.
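If you stay on RabbitMQ, one common way to do this is a single consumer with prefetch set to 1, so the broker will not deliver message n+1 until message n is acked. A minimal sketch using the amqplib package; the queue name and the processOrder handler are assumptions:

const amqp = require('amqplib');

async function consumeSequentially() {
    const conn = await amqp.connect('amqp://localhost');
    const ch = await conn.createChannel();
    await ch.assertQueue('orders', { durable: true });
    // Deliver at most one unacknowledged message at a time to this consumer.
    ch.prefetch(1);
    ch.consume('orders', async (msg) => {
        try {
            await processOrder(msg.content.toString()); // your order logic (assumed)
            ch.ack(msg); // the next message is only delivered after this ack
        } catch (err) {
            ch.nack(msg, false, true); // requeue on failure
        }
    }, { noAck: false });
}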
Overall, it sounds like Oracle Advanced Queuing (AQ) would be a good fit. See the node-oracledb documentation on AQ.

How to implement SUM with @QuerySqlFunction?

The examples seen so far that cover @QuerySqlFunction are trivial. I put one below. However, I'm looking for an example / solution / hint for providing a cross-row calculation, e.g. average, sum, ... Is this possible?
In the example, the function returns value 0 from an array, basically an implementation of ARRAY_GET(x, 0). All other examples I've seen are similar: one row, get a value, do something with it. But I need to be able to calculate the sum of a grouped result, or possibly a lot more business logic. If somebody could provide me with the QuerySqlFunction for SUM, I assume it would allow me to do much more than just SUM.
Step 1: Write a function
public class MyIgniteFunctions {
    @QuerySqlFunction
    public static double value1(double[] values) {
        return values[0];
    }
}
Step 2: Register the function
CacheConfiguration<Long, MyFact> factResultCacheCfg = ...
factResultCacheCfg.setSqlFunctionClasses(new Class[] { MyIgniteFunctions.class });
Step 3: Use it in a query
SELECT
    MyDimension.groupBy1,
    MyDimension.groupBy2,
    SUM(VALUE1(MyFact.values))
FROM
    "dimensionCacheName".DimDimension,
    "factCacheName".FactResult
WHERE
    MyDimension.uid = MyFact.dimensionUid
GROUP BY
    MyDimension.groupBy1,
    MyDimension.groupBy2
I don't believe Ignite currently has clean API support for custom user-defined QuerySqlFunction that spans multiple rows.
If you need something like this, I would suggest that you make use of IgniteCompute APIs and distribute your computations, lambdas, or closures to the participating Ignite nodes. Then from inside of your closure, you can either execute local SQL queries, or perform any other cache operations, including predicate-based scans over locally cached data.
This approach will be executed across multiple Ignite nodes in parallel and should perform well.
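A rough sketch of that approach, reusing the cache names from the question and assuming ignite is your Ignite instance; the local query and the final reduction are illustrative, not a drop-in implementation:

import java.util.Collection;
import java.util.List;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;

// Broadcast a closure to the nodes holding fact data; each node runs a
// local-only SQL query over its own partitions and returns a partial sum.
Collection<Double> partials = ignite.compute(ignite.cluster().forDataNodes("factCacheName"))
    .broadcast(() -> {
        SqlFieldsQuery qry = new SqlFieldsQuery(
            "SELECT SUM(VALUE1(values)) FROM FactResult").setLocal(true);
        List<List<?>> rows = Ignition.localIgnite()
            .cache("factCacheName").query(qry).getAll();
        Object v = rows.get(0).get(0);
        return v == null ? 0d : ((Number) v).doubleValue();
    });

// Reduce the partial results on the caller.
double total = partials.stream().mapToDouble(Double::doubleValue).sum();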

Number of threads decreases as Parallel.Foreach loop goes on

I have a Parallel Foreach loop which loops through a list of items, and performs some actions against them. Some of these actions take longer than others, depending on the item.
Parallel.ForEach(list, new ParallelOptions { MaxDegreeOfParallelism = 5 }, item =>
{
    var subItems = item.subItems;
    foreach (var subItem in subItems)
    {
        // do some actions for subItem
    }
    Console.WriteLine("Action Complete for {0}", item);
});
After a while, when there are only about 5-10 items left in the list to run, it seems that there is only 1 thread left running. This is not ideal, because some items will then be stuck behind another one to finish.
If I stop the script, and then start it again, with only the leftover 5-10 items in the list, it spins up multiple threads to do each of the items in parallel again.
How can I ensure that the other threads will keep being used, without me needing to restart the script?
The problem here is that the default partitioner chunks the work up into blocks of N items per task. It assumes that the number of items is large and that each item takes roughly the same amount of time, in which case your five threads would each work through their final block of ~N items and all finish at about the same time. In your case that assumption does not hold, so one thread can be left grinding through a long final block while the others sit idle.
You could write your own Partitioner to use a smaller number of items per block; see the Partitioner Class. This may improve performance, but if the work done per item is very small, you will worsen the ratio of useful work to task-management overhead and possibly degrade performance.
You could also write a dynamic partitioner that decreases the partition size, so that the last few items are in smaller partitions, ensuring that you are still using all the available threads. The MSDN article Custom Partitioners for PLINQ and TPL covers writing custom partitioners.
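Before writing a custom partitioner, it may be worth trying the built-in load-balancing one, which hands out items in small chunks on demand instead of pre-assigning large fixed ranges. A minimal sketch, assuming list is an IList<T>:

using System.Collections.Concurrent;

// Load-balancing partitioner: idle threads keep pulling small chunks,
// so the last few slow items no longer pile up behind one thread.
var partitioner = Partitioner.Create(list, loadBalance: true);

Parallel.ForEach(partitioner, new ParallelOptions { MaxDegreeOfParallelism = 5 }, item =>
{
    // same per-item work as before
});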

How to get total rows for cypher with skip limit?

I am able to use skip and limit (and order by) to fetch the contents of a particular page in the UI.
E.g. to render the nth page with page size m, the UI asks for skip n*m and limit m.
But the UI also wants to generate links for all the possible pages. For that I have to return the total number of rows available in Neo4j.
E.g. for a total of p rows, the UI will generate the hyperlinks 1, 2, 3, ... (p/m).
What is the best way (in terms of performance) to get the total number of rows while using skip and limit in the Cypher query?
In general it is not advisable, as counting all results requires fetching large swaths of the graph into memory.
You have two options:
use a simpler version of your query as a separate count query (which might also run asynchronously); see the sketch after the example below
merge the count query and your real query into one, as shown below; it will be much more expensive than your plain skip/limit query, in the worst case totalcount/pageSize times more expensive
start n=node:User(name={username})
match n-[:KNOWS]->()
with n,count(*) as total
match n-[:KNOWS]->m
return m.name, total
skip {offset}
limit {pagesize}
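For the first option, the separate count query is just a simpler version of the same traversal, e.g.:

start n=node:User(name={username})
match n-[:KNOWS]->()
return count(*) as total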
