ArangoDB graph traversal not utilizing combined index - arangodb

I have these two queries, which - based on my understanding - should do basically the same thing. One does a filter on my edge collection and performs very well, while the other does a graph traversal of depth 1 and performs quite poorly, because it does not utilize the correct index.
I have an accounts collection and a transfers collection and a combined index on transfers._to and transfers.quantity.
This is the filter query:
FOR transfer IN transfers
FILTER transfer._to == "accounts/testaccount" && transfer.quantity > 100
RETURN transfer
Which is correctly using the combined index:
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
6 IndexNode 18930267 - FOR transfer IN transfers /* skiplist index scan */
5 ReturnNode 18930267 - RETURN transfer
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
6 skiplist transfers false false 10.11 % [ `_to`, `quantity` ] ((transfer.`_to` == "accounts/testaccount") && (transfer.`quantity` > 100))
Optimization rules applied:
Id RuleName
1 use-indexes
2 remove-filter-covered-by-index
3 remove-unnecessary-calculations-2
On the other hand this is my graph traversal query:
FOR account IN accounts
FILTER account._id == "accounts/testaccount"
FOR v, e IN 1..1 INBOUND account transfers
FILTER e.quantity > 100
RETURN e
Which only uses _to from the combined index for filtering the inbound edges, but fails to utilize quantity:
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
9 IndexNode 1 - FOR account IN accounts /* primary index scan */
5 TraversalNode 9 - FOR v /* vertex */, e /* edge */ IN 1..1 /* min..maxPathDepth */ INBOUND account /* startnode */ transfers
6 CalculationNode 9 - LET #7 = (e.`quantity` > 100) /* simple expression */
7 FilterNode 9 - FILTER #7
8 ReturnNode 9 - RETURN e
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
9 primary accounts true false 100.00 % [ `_key` ] (account.`_id` == "accounts/testaccount")
5 skiplist transfers false false n/a [ `_to`, `quantity` ] base INBOUND
Traversals on graphs:
Id Depth Vertex collections Edge collections Options Filter conditions
5 1..1 transfers uniqueVertices: none, uniqueEdges: path
Optimization rules applied:
Id RuleName
1 use-indexes
2 remove-filter-covered-by-index
3 remove-unnecessary-calculations-2
However, as I want to use the graph traversal, is there a way to utilize this combined index correctly?
Edit: I'm using ArangoDB 3.4.2

Vertex-centric indexes (indexes created on an edge collection that include either the '_from' or the '_to' attribute) are normally used in traversals when the filtering is done on the path rather than on the edge itself (assuming the optimizer does not find a better plan, of course).
So in your query, try filtering on the path variable p, something like the following:
FOR account IN accounts
FILTER account._id == "accounts/testaccount"
FOR v, e, p IN 1..1 INBOUND account transfers
FILTER p.edges[*].quantity ALL > 100
RETURN e
You can find the docs about this index type here

Related

How to compute Multicompare Tukey HSD in python?

I am trying to compute a multi-comparison Tukey test on a list that contains 5 lists of values (a list of lists). I was reading some documentation about numpy.recarray, which seems to fit this topic, but I don't really know how it works.
Let's suppose that my list of lists is as follows:
my_list_of_lists = [[0.75,0.78,0.80,0.77,0.71,0.69,0.73],[0.76,0.73,0.88,0.71,0.72,0.80,0.72],[0.71,0.75,0.77,0.79,0.68,0.77,0.66],[0.72,0.79,0.82,0.73,0.75,0.60,0.72],[0.73,0.71,0.66,0.79,0.72,0.67,0.71]]
The output should be like this with 3 elements to compare:
Multiple Comparison of Means - Tukey HSD, FWER=0.05
============================================
group1 group2 meandiff lower upper reject
--------------------------------------------
0 1 1.5 0.3217 2.6783 True
0 2 1.0 -0.1783 2.1783 False
1 2 -0.5 -1.6783 0.6783 False
--------------------------------------------
Moreover, is there a way to plot the multiple comparison of means?
Thanks in advance
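For reference, a minimal sketch of how this could be done with statsmodels' pairwise_tukeyhsd, which expects one flat array of values and a parallel array of group labels; the group labels 0..4 below are just illustrative:

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

my_list_of_lists = [
    [0.75, 0.78, 0.80, 0.77, 0.71, 0.69, 0.73],
    [0.76, 0.73, 0.88, 0.71, 0.72, 0.80, 0.72],
    [0.71, 0.75, 0.77, 0.79, 0.68, 0.77, 0.66],
    [0.72, 0.79, 0.82, 0.73, 0.75, 0.60, 0.72],
    [0.73, 0.71, 0.66, 0.79, 0.72, 0.67, 0.71],
]

# Flatten the values and build a parallel array of group labels (0..4).
values = np.concatenate(my_list_of_lists)
groups = np.concatenate([[i] * len(lst) for i, lst in enumerate(my_list_of_lists)])

result = pairwise_tukeyhsd(endog=values, groups=groups, alpha=0.05)
print(result.summary())        # pairwise mean differences, confidence intervals, reject flags
result.plot_simultaneous()     # plots the group means with confidence intervals (needs matplotlib)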

Find feature or combination of features that has an effect

I am looking for a statistical model or test to answer following question and would be grateful for some help:
I have m products p1,...,pm (m=5 in my case) that my customers can subscribe to.
I have divided my customers into groups A1,...,An, and for each group and each combination of products I have counted how many customers have this combination of products and how it has affected their sales:
Customer_group has_p1 has_p2 [...] has_p5 cust_count total_sales
A1 0 0 0 124 1234
A1 1 0 0 315 999
A1 1 1 0 199 7777
[...]
An 1 1 1 233 663
Now I want to find out which group of customers benefit from which product or combination of products.
My first idea was to use a paired t-test comparing, within each combination of the other products, the customer group that has a given product against the group that does not; i.e., to measure the effect of p1 I would pair {A1, 1, 0, 0, 1, 0} with {A1, 0, 0, 0, 1, 0} and compare the resulting series of total_sales/cust_count values.
However, with this test I only find out which of the products has an effect, not which group it has an effect for, or whether it is significant that the product is sold in combination with another product.
Any good ideas?
So after thinking a day, I found a way:
First I did a one-hot encoding of the groups, i.e. I replaced the customer_group column with n columns containing 0s and 1s.
Then I made a linear regression model with mixed terms:
product_i * product_j + group_k * product_i + group_k * product_i * product_j
By reducing the model I found which product x product combinations, and which group x product and group x product x product combinations, were significant.
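If it helps, here is a rough sketch of how such a model could be set up with statsmodels formulas; the column names (has_p1, group_A1, avg_sales) and the CSV file are assumptions based on the table above:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file/column names based on the table above:
# has_p1..has_p5, group_A1..group_An (one-hot), cust_count, total_sales.
df = pd.read_csv("products_per_group.csv")
df["avg_sales"] = df["total_sales"] / df["cust_count"]

# Mixed interaction terms (shown for p1, p2 and group A1 only); in a formula
# a*b expands to a + b + a:b. cust_count could be used as weights via smf.wls.
model = smf.ols(
    "avg_sales ~ has_p1*has_p2 + group_A1*has_p1 + group_A1*has_p1*has_p2",
    data=df,
)
fit = model.fit()
print(fit.summary())   # inspect p-values, then drop non-significant terms and refit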

Group people based on their hobbies in Spark

I am working with PySpark on a case where I need to group people based on their interests. Let's say I have n persons:
person1, movies, sports, dramas
person2, sports, trekking, reading, sleeping, movies, dramas
person3, movies, trekking
person4, reading, trekking, sports
person5, movies, sports, dramas
.
.
.
Now I want to group people based on their interests.
Group people who have at least m common interests (m is user input, it could be 2, 3, 4...)
Let's assume m=3
Then the groups are:
(person1, person2, person5)
(person2, person4)
Find users who belong to x groups (x is user input)
Let's assume x=2
Then
person2 is in two groups
My response will be algebraic and not Spark/Python specific, but can be implemented in Spark.
How can we express the data in your problem?
I will go with a matrix - each row represents a person, each column represents an interest. So, following your example:
movies,sports,trekking,reading,sleeping,dramas
P1: 1 1 0 0 0 1
P2: 1 1 1 1 1 1
P3: 1 0 1 0 0 0
P4: 0 0 1 1 0 1
P5: 1 1 0 0 0 1
If we would like to investigate the similarity of P2 and P3 - i.e., check how many interests they share - we could use the following formula:
(movies)+(sports)+(trekking)+(reading)+(sleeping)+(dramas)
1*1 + 1*0 + 1*1 + 1*0 + 1*0 + 1*0 = 2
It may look familiar to you - it looks like part of a matrix multiplication.
To take full advantage of this observation, we have to transpose the matrix - it will look like this:
P1,P2,P3,P4,P5
movies 1 1 1 0 1
sports 1 1 0 0 1
trekking 0 1 1 1 0
reading 0 1 0 1 0
sleeping 0 1 0 0 0
dramas 1 1 0 1 1
Now if we multiply the matrices (original and transposed) we get a new matrix:
P1 P2 P3 P4 P5
P1 3 3 1 1 3
P2 3 6 2 3 4
P3 1 2 2 1 1
P4 1 3 1 3 1
P5 3 3 1 1 3
What you see here is the result you are looking for - check the value on the row/column crossing and you will get number of shared interests.
How many interests does P2 share with P4? Answer: 3
Who shares 3 interests with P1? Answer: P2 and P5
Who shares 2 interests with P3? Answer: P2
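For readers who want to try this locally, a minimal numpy sketch of the same matrix trick (the Spark links below cover distributed implementations):

import numpy as np

# Person x interest matrix from the example above
# (columns: movies, sports, trekking, reading, sleeping, dramas).
M = np.array([
    [1, 1, 0, 0, 0, 1],   # P1
    [1, 1, 1, 1, 1, 1],   # P2
    [1, 0, 1, 0, 0, 0],   # P3
    [0, 0, 1, 1, 0, 1],   # P4
    [1, 1, 0, 0, 0, 1],   # P5
])

shared = M @ M.T                     # shared[i, j] = interests person i shares with person j
print(shared[1, 3])                  # interests P2 shares with P4 -> 3
print(np.where(shared[0] == 3)[0])   # who shares 3 interests with P1 (index 0 is P1 itself)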
Some hints on how to apply this idea into Apache Spark
How to operate on matrices using Apache Spark?
Matrix Multiplication in Apache Spark
How to transpose matrix using Apache Spark?
Matrix Transpose on RowMatrix in Spark
EDIT 1: Adding a more realistic method (after the comments)
We have a table/RDD/Dataset "UserHobby":
movies,sports,trekking,reading,sleeping,dramas
P1: 1 1 0 0 0 1
P2: 1 1 1 1 1 1
P3: 1 0 1 0 0 0
P4: 0 0 1 1 0 1
P5: 1 1 0 0 0 1
Now to find all the people that share 2 hobbies with P1, you would have to execute:
SELECT * FROM UserHobby
WHERE movies*1 + sports*1 + trekking*0 +
reading*0 + sleeping*0 + dramas*1 = 2
Now you would have to repeat this query for all the users (changing the 0s and 1s to their actual values). The algorithm's complexity is O(n^2 * m), where n is the number of users and m the number of hobbies.
What is nice about this method is that you don't have to generate subsets.
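As a hedged illustration, the per-user query could be generated and run through Spark SQL roughly like this, assuming the UserHobby data is already registered as a temporary view:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

hobbies = ["movies", "sports", "trekking", "reading", "sleeping", "dramas"]
p1 = {"movies": 1, "sports": 1, "trekking": 0, "reading": 0, "sleeping": 0, "dramas": 1}

# Build the dot-product expression movies*1 + sports*1 + ... for the reference user.
dot = " + ".join(f"{h}*{p1[h]}" for h in hobbies)

# Assumes a temporary view named UserHobby with one 0/1 column per hobby.
shared_with_p1 = spark.sql(f"SELECT * FROM UserHobby WHERE {dot} = 2")
shared_with_p1.show()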
My answer might not be the best, but it will do the job. If you know the total list of hobbies up front, you can write a piece of code that computes the combinations before going into the Spark part.
For example:
Filter out the people whose hobbies_count < input_number at the very start to discard unwanted records.
If the total list of hobbies is {a,b,c,d,e,f} and the input_number is 2, the list of combinations in this case would be
{(ab)(ac)(ad)(ae)(af)(bc)(bd)(be)(bf)(cd)(ce)(cf)(de)(df)(ef)}
So you will need to generate the possible combinations up front for the given input_number (see the sketch below).
Later, perform a filter for each combination and track the record count.
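A small sketch of that pre-computation step, using itertools to enumerate the hobby combinations of size input_number before the Spark filtering stage:

from itertools import combinations

all_hobbies = ["a", "b", "c", "d", "e", "f"]
input_number = 2

# All hobby combinations of the requested size: (a, b), (a, c), ..., (e, f).
combos = list(combinations(all_hobbies, input_number))
print(len(combos))   # 15 combinations for 6 hobbies taken 2 at a time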
If the number of users is large, you can't possibly think about going for any User x User approach.
Step 0. As a first step, we should ignore all the users who don't have at least m interests (since they cannot have at least m common interests with anyone).
Step 1. Possible approaches:
i) Brute force: If the maximum number of interests is small, say 10, you can generate all possible interest combinations in a HashMap and assign an interest-group id to each of them. You will need just one pass over a user's interest set to find out which interest groups they qualify for. This solves the problem in one pass.
ii) Locality Sensitive Hashing: After step 0 is done, we know that we only have users that have a minimum of m hobbies. If you are fine with an approximate answer, locality sensitive hashing can help. How to understand Locality Sensitive Hashing?
A sketch of the LSH approach:
a) First map each of the hobbies to an integer (which you should do anyways, if the dataset is large). So we have User -> Set of integers (hobbies)
b) Obtain a signature for each user by min-hashing: hash each of the integers with k hash functions and take the min per hash function. This gives User -> Signature.
c) Explore the users with the same signature in more detail, if you want (or you can be done with an approximate answer).
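A rough pure-Python illustration of steps a) to c), using K random hash functions of the form (a*x + b) mod p; in practice you would use Spark's MinHashLSH or a dedicated library, this is only meant to show the signature idea:

import random

PRIME = 2_147_483_647     # a large prime for the hash functions
K = 4                     # signature length (number of hash functions)
random.seed(0)
hash_params = [(random.randint(1, PRIME - 1), random.randint(0, PRIME - 1)) for _ in range(K)]

def minhash_signature(hobby_ids):
    # Step b): for each of the K hash functions, take the min over all hobby ids.
    return tuple(min((a * h + b) % PRIME for h in hobby_ids) for a, b in hash_params)

# Step a): hobbies already mapped to integers, e.g. movies=0, sports=1, ...
users = {"P1": {0, 1, 5}, "P2": {0, 1, 2, 3, 4, 5}, "P5": {0, 1, 5}}

# Step c): users with identical signatures (here P1 and P5) are candidates worth exploring further.
for name, hobby_ids in users.items():
    print(name, minhash_signature(hobby_ids))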

How to approximate execution time of ArangoDB count function

I am considering using ArangoDB for a new project of mine, but I have been unable to find very much information regarding its scalability.
Specifically, I am looking for some information regarding the count function. Is there a reliable way (perhaps a formula) to approximate how long it will take to count the number of documents in a collection which match a simple Boolean value?
All documents in the collection would have the same fields, however with different values. How can I determine how long would it take to count several hundred million documents?
Just create a collection users and insert as many random documents as you need.
FOR i IN 1..1100000
INSERT {
name: CONCAT("test", i),
year: 1970 + FLOOR(RAND() * 55),
gender: i % 2 == 0 ? 'male' : 'female'
} IN users
Then do the count:
FOR user IN users
FILTER user.gender == 'male'
COLLECT WITH COUNT INTO number
RETURN {
number: number
}
And if you use this query in production, make sure to add an index too. On my machine it reduces the execution time by a factor of more than 100 (0.043 s for 1.1 million documents).
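For completeness, a sketch of how that index could be created from Python, assuming the python-arango driver and its add_hash_index helper (the same index can be created via arangosh or the web UI); the connection details are placeholders:

from arango import ArangoClient   # assumes the python-arango driver is installed

client = ArangoClient()                                   # default http://localhost:8529
db = client.db("_system", username="root", password="")   # placeholder credentials

# Hash index on gender so the COLLECT WITH COUNT query can use an index scan.
db.collection("users").add_hash_index(fields=["gender"], unique=False)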
Check your query with EXPLAIN to further estimate how "expensive" the execution will be.
Query string:
FOR user IN users
FILTER user.gender == 'male'
COLLECT WITH COUNT INTO number
RETURN {
number: number
}
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
8 IndexRangeNode 550001 - FOR user IN users /* hash index scan */
5 AggregateNode 1 - COLLECT WITH COUNT INTO number /* sorted*/
6 CalculationNode 1 - LET #4 = { "number" : number } /* simple expression */
7 ReturnNode 1 - RETURN #4
Indexes used:
Id Type Collection Unique Sparse Selectivity Est. Fields Ranges
8 hash users false false 0.00 % `gender` [ `gender` == "male" ]
Optimization rules applied:
Id RuleName
1 use-index-range
2 remove-filter-covered-by-index

Error in Proc Freq

I have a data set with multiple visits, 2 treatment arms, and a Vehicle group. I also have a variable, say "SSA", with two values, 1 and 0, where 1 stands for responder and 0 for non-responder subjects. While performing PROC FREQ for chi-square statistics I am getting the following error. Here is the code I used:
PROC FREQ DATA=P&V1;
TABLE TREATMENT*SSA/CHISQ ;
WHERE TREATMENT IN (1 &TR1); *(&TR1 is treatment 2 or treatment 3);
RUN;
NOTE: No statistics are computed for TREATMENT * SSA since SSA has less than 2 nonmissing levels.
WARNING: No OUTPUT data set is produced because no statistics can be computed for this table, which has a row or column variable with less than 2 nonmissing levels.
This error occurs for my last visit, where SSA has the single value 0 in all treatment groups and the Vehicle group.
