How to approximate execution time of ArangoDB count function

How to approximate execution time of ArangoDB count function - arangodb

I am considering using ArangoDB for a new project of mine, but I have been unable to find very much information regarding its scalability.
Specifically, I am looking for some information regarding the count function. Is there a reliable way (perhaps a formula) to approximate how long it will take to count the number of documents in a collection which match a simple Boolean value?
All documents in the collection would have the same fields, however with different values. How can I determine how long would it take to count several hundred million documents?

Just create a collection users and insert as many random documents as you need.
FOR i IN 1..1100000
INSERT {
name: CONCAT("test", i),
year: 1970 + FLOOR(RAND() * 55),
gender: i % 2 == 0 ? 'male' : 'female'
} IN users
Then do the count:
FOR user IN users
FILTER user.gender == 'male'
COLLECT WITH COUNT INTO number
RETURN {
number: number
}
And if you use this query in production, make sure to add an index too. On my machine it reduces the execution time by factor > 100x (0.043 sec / 1.1mio documents).
Check your query with EXPLAIN to further estimate how "expensive" the execution will be.
Query string:
FOR user IN users
FILTER user.gender == 'male'
COLLECT WITH COUNT INTO number
RETURN {
number: number
}
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
8 IndexRangeNode 550001 - FOR user IN users /* hash index scan */
5 AggregateNode 1 - COLLECT WITH COUNT INTO number /* sorted*/
6 CalculationNode 1 - LET #4 = { "number" : number } /* simple expression */
7 ReturnNode 1 - RETURN #4
Indexes used:
Id Type Collection Unique Sparse Selectivity Est. Fields Ranges
8 hash users false false 0.00 % `gender` [ `gender` == "male" ]
Optimization rules applied:
Id RuleName
1 use-index-range
2 remove-filter-covered-by-index

Related

How to map sales against purchases sequentially using python?

I have a transaction dataframe as under:
Item Date Code Qty Price Value
0 A 01-01-01 Buy 10 100.5 1005.0
1 A 02-01-01 Buy 5 120.0 600.0
2 A 03-01-01 Sell 12 125.0 1500.0
3 A 04-01-01 Buy 9 110.0 990.0
4 A 04-01-01 Sell 1 100.0 100.0
#and so on... there are a million rows with about thousand items (here just one item A)
What I want is to map each selling transaction against purchase transaction in a sequential manner of FIRST IN FIRST OUT. So, the purchase that was made first will be sold out first.
For this, I have added a new column bQty with opening balance same as purchase quantity. Then I iterate through the dataframe for each sell transaction to set the sold quantity off against purchase transaction before that date.
df['bQty'] = df[df['Code']=='Buy']['Quantity']
for each in df[df['Code']=='Sell']:
for each in df[(df['Code']=='Buy') & (df['Date'] <= sellDate)]:
#code#
Now this requires me to go through the whole dataframe again and again for each sell transaction.
For 1000 records it takes about 10 seconds to complete. So, we can assume that for a million records, this approach will take a lot time.
Is there any faster way to do this?

If you are only interested in the resulting final balance values per item, here is a fast way to calculate them:
Add two additional columns that contain the same absolute values as Qty and Value, but with a negative sign in those rows where the Code value is Sell. Then you can group by item and sum these values for each item, to get the remaining number of items and the money spent for them on balance.
sale = df.Code == 'Sell'
df['Qty_signed'] = df.Qty.copy()
df.loc[sale, 'Qty_signed'] *= -1
df['Value_signed'] = df.Value.copy()
df.loc[sale, 'Value_signed'] *= -1
qty_remaining = df.groupby('Item')['Qty_signed'].sum()
print(qty_remaining)
money_spent = df.groupby('Item')['Value_signed'].sum()
print(money_spent)
Output:
Item
A 11
Name: Qty_signed, dtype: int64
Item
A 995.0
Name: Value_signed, dtype: float64

How to create a leaderboard that sorts by score and time mongodb?

I am trying to implement a leaderboard which acts like a queue instead of a score board i.e.
user score time position
1 1 1 1
2 1 2 2
3 0 1 3
4 0 2 4
My question is how one should structure a query to derive a user's position in this queue considering the fact that the time is also taken into account, the following depicts a function that derives the user position from the collection:
public async position(leaderboardId: string, id: any) {
const user = await this.get(leaderboardId, id);
return await this.getCollection(leaderboardId).
find({
score: {
$gt: user.score
}
}).
count() + 1;
}
However, if multiple users have the same score, it will mean a draw for all users with that score (position) i.e.
user score time position
1 1 1 1
2 1 2 1
3 0 1 3
4 0 2 3
How does one modify the aforementioned query to implement the respective logic?
Thanks

NetSuite - Saved search join with location inventory data

I'm looking to find a way to join/add location inventory data to a search (this will then be used in a script)
I have a search as below that takes several work orders and sums the total usage requirement by item. I would like to also include the current on-hand qty per item but for multiple locations (i.e. LocA and LocB).
i.e
Item Remaining Quantity Built onHandLocA onHandLocB
A 10 20 10 99 0
B 20 20 0 23 659
C 30 30 0 2 33
I know I can get this data by iterating through each line and then get the value from the sublist, but wondered if there is a way to do this via the search join.
var workorderSearch = nlapiSearchRecord("workorder",null,
[
["type","anyof","WorkOrd"],
"AND",
["status","anyof","WorkOrd:D","WorkOrd:A","WorkOrd:B"],
"AND",
["itemsource","anyof","STOCK"],
"AND",
["formulanumeric: INSTR('11518,11624', {number}) ","notequalto","0"]
],
[
new nlobjSearchColumn("item",null,"GROUP").setSort(false),
new nlobjSearchColumn("formulanumeric",null,"SUM").setFormula("{quantity}-{built}"),
new nlobjSearchColumn("quantity",null,"SUM"),
new nlobjSearchColumn("built",null,"SUM")
]
);

new nlobjSearchColumn("locationquantityavailable","item","SUM")
or
new nlobjSearchColumn("formulanumeric", null, "SUM").setFormula("DECODE({item.inventorylocation}, 'LocA',{locationquantityavailable},0)")
The problem with grabbing item inventory location data is the rest of the data gets duplicated for each location, so if your work order requires a quantity of 10 and you have 2 locations, the summarized saved search will return quantity of 20 (10 quantity x 2 locations).
So, you will need to make all of your other aggregate columns into formulas as well.
[
new nlobjSearchColumn("item",null,"GROUP").setSort(false),
new nlobjSearchColumn("formulanumeric", null, "SUM").setFormula("DECODE({item.inventorylocation},'LocA',{quantity}-{built},0)"),
new nlobjSearchColumn("formulanumeric", null, "SUM").setFormula("DECODE({item.inventorylocation},'LocA',{quantity},0)",null,"SUM"),
new nlobjSearchColumn("formulanumeric", null, "SUM").setFormula("DECODE({item.inventorylocation},'LocA',{built},0)",null,"SUM")
]

ArangoDB graph traversal not utilizing combined index

I have those two queries, which should - based on my understanding - do basically the same. One is doing a filter on my edge collection and is performing very well, while the other query is doing a graph traversal of depth 1 and performs quite poor, due to not utilizing the correct index.
I have an accounts collection and a transfers collection and a combined index on transfers._to and transfers.quantity.
This is the filter query:
FOR transfer IN transfers
FILTER transfer._to == "accounts/testaccount" && transfer.quantity > 100
RETURN transfer
Which is correctly using the combined index:
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
6 IndexNode 18930267 - FOR transfer IN transfers /* skiplist index scan */
5 ReturnNode 18930267 - RETURN transfer
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
6 skiplist transfers false false 10.11 % [ `_to`, `quantity` ] ((transfer.`_to` == "accounts/testaccount") && (transfer.`quantity` > 100))
Optimization rules applied:
Id RuleName
1 use-indexes
2 remove-filter-covered-by-index
3 remove-unnecessary-calculations-2
On the other hand this is my graph traversal query:
FOR account IN accounts
FILTER account._id == "accounts/testaccount"
FOR v, e IN 1..1 INBOUND account transfers
FILTER e.quantity > 100
RETURN e
Which only uses _to from the combined index for filtering the inbound edges, but fails to utilize quantity:
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
9 IndexNode 1 - FOR account IN accounts /* primary index scan */
5 TraversalNode 9 - FOR v /* vertex */, e /* edge */ IN 1..1 /* min..maxPathDepth */ INBOUND account /* startnode */ transfers
6 CalculationNode 9 - LET #7 = (e.`quantity` > 100) /* simple expression */
7 FilterNode 9 - FILTER #7
8 ReturnNode 9 - RETURN e
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
9 primary accounts true false 100.00 % [ `_key` ] (account.`_id` == "accounts/testaccount")
5 skiplist transfers false false n/a [ `_to`, `quantity` ] base INBOUND
Traversals on graphs:
Id Depth Vertex collections Edge collections Options Filter conditions
5 1..1 transfers uniqueVertices: none, uniqueEdges: path
Optimization rules applied:
Id RuleName
1 use-indexes
2 remove-filter-covered-by-index
3 remove-unnecessary-calculations-2
However, as I want to use the graph traversal, is there a way to utilize this combined index correctly?
Edit: I'm using ArangoDB 3.4.2

Vertex centric indexes (indexes that are created on an edge and include either the '_from' or the '_to' properties) are normally used in traversals when the filtering is done on the path rather than the edge itself. ( assuming the optimizer does not find a better plan of course)
So in your query, try something like the following:
FOR account IN accounts
FILTER account._id == "accounts/testaccount"
FOR v, e IN 1..1 INBOUND account transfers
FILTER p.edges[*].quantity ALL > 100
RETURN e
You can find the docs about this index type here

SSAS - Data Warehouse structure and the Unknown value

I have a table that shows summed monthly values grouped by different analysis codes
TableId Month Value Analysis1ID Analysis2ID
1 1 100 1 NULL
2 1 50 NULL 3
3 1 50 2 NULL
4 1 50 3 NULL
I have set the above as a fact table (also have a dimension for the analysis values).
As you can see the table has a new row for each unique ID for the analysis column.
We are then analysing the data in excel, Simply summing the Value column and grouping by Analyis1ID, Month
This give us :
AnalysisID1 1 = 100
AnalysisID1 2 = 50
AnalysisID1 3 = 50
Unknown = 50
Total = 250
This all looks ok apart from the Unknown, which is summed total of NULL....
I have tried excluding the NULL Value in the Dimension by setting the UnknownMember to "Hidden".
This does work but it does not exclude the amount from the total. How can i exclude it from the total value?
I am guessing that the table structure is not correct for that data, I'm unsure though how else to structure it?
Any help or guidance would be appreciated

I would not have NULL values in dimension members, in the past i've always used an Unallocated Member with a -1 ID.
You could then use Cube Security to filter out the Unknown or Unallocated members.

I would Filter that row out using Excel. Right-click on the cell labelled 'Unknown' and you can choose Filter / Hide Selected Items.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

How to approximate execution time of ArangoDB count function - arangodb

Related

How to map sales against purchases sequentially using python?

How to create a leaderboard that sorts by score and time mongodb?

NetSuite - Saved search join with location inventory data

ArangoDB graph traversal not utilizing combined index

SSAS - Data Warehouse structure and the Unknown value

Categories

Resources