The optimal data structure for filtering objects that match criteria

I'll try to present the problem as generally as I can, and a response in any language will do.
Suppose there are a few sets of varying sizes, each containing arbitrary values belonging to a category:
var colors = ["red", "yellow", "blue"] // 3 items
var letters = ["A", "B", "C", ... ] // 26 items
var digits = [0, 1, 2, 3, 4, ... ] // 10 items
... // each set has a fixed number of items
Each object in the master list I already have (which I want to restructure somehow to optimize searching) has properties whose values are each drawn from one of these sets, like so:
var masterList = [
  { id: 1, color: "red", letter: "F", digit: 5, ... },
  { id: 2, color: "blue", letter: "Q", digit: 0, ... },
  { id: 3, color: "red", letter: "Z", digit: 3, ... },
  ...
]
The purpose of the search would be to create a new list of acceptable objects from the master list. The program would filter the master list by given search criteria that, for each property, contains a list of acceptable values.
var criteria = {
  color: ["red", "yellow"],
  letter: ["A", "F", "P"],
  digit: [1, 3, 5],
  ...
};
I'd like to think that some sort of tree would be most applicable. My understanding is that it would need to be balanced, so the root node would be the "median" object. I suppose each level would be defined by one of the properties so that as the program searches from the root, it would only continue down the branches that fit the search criteria, each time eliminating the objects that don't fit given the particular property for that level.
However, I understand that many of the objects in this master list will have matching property values. This connects them in a graph-like way that could perhaps be conducive to a speedy search.
My current search algorithm is fairly intuitive and can be done with just the master list as it is. The program:
1. iterates through the properties in the search criteria,
2. for each property, iterates over the master list, eliminating the objects whose value for that property doesn't match, and
3. eventually removes all the objects that don't fit the criteria.
There is surely some quicker filtering system that involves a more organized data structure.
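For concreteness, here is a minimal sketch of that approach (in Python, for consistency with the answer below; naive_search is just an illustrative name, not my actual code):
def naive_search(master_list, criteria):
    remaining = list(master_list)
    for key, allowed_values in criteria.items():  # one pass per property in the criteria
        # keep only the objects whose value for this property is acceptable;
        # the "in" test scans the whole list of allowed values each time
        remaining = [obj for obj in remaining if obj[key] in allowed_values]
    return remaining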
Where could I go from here? I'm open to a local database instead of another data structure I suppose - GraphQL looks interesting. This is my first Stack Overflow question, so my apologies for any bad manners 😶

Since I don't know the number of sets or the number of elements in each set, I'll suggest some very small changes which will at least make things relatively fast for you.
To keep things mathematical, I will define a few terms here:
number of sets - n
size of the master list - k
number of acceptable values per property in the search criteria - p
So, with the algorithm I believe you are using, you do n iterations over the search criteria, because there can be n possible keys in the criteria.
In each of these n iterations, you do p iterations over the acceptable values of that particular property. Finally, in each of these np iterations, you iterate over the master list (k iterations) and check whether the record should be kept or not.
Thus, in the average case, this runs in O(npk) time.
So I won't suggest changing the overall approach much here.
The best you can do is change the values in the search criteria to sets (hash sets) instead of lists, and then iterate over the master list. Consider this Python code:
def is_possible(criteria, master_list_entry):
    # check every criterion; properties not mentioned in the criteria (such as "id") are ignored
    for key, allowed in criteria.items():  # O(n)
        if master_list_entry.get(key) not in allowed:  # O(1) average, since allowed is a set
            return False
    return True

def search(master_list, criteria):
    ans = []
    for each_entry in master_list:  # O(k)
        if is_possible(criteria, each_entry):  # O(n), see above
            ans.append(each_entry)
    return ans
Just call the search function and it will return the filtered master list.
As for the change itself, convert your search criteria to:
criteria = {
    "color": {"red", "yellow"},  # a set, instead of a list
    "letter": {"A", "F", "P"},
    "digit": {1, 3, 5},
    ...
}
As you can see, I have noted the complexity alongside each line, and with this change the problem is reduced to O(nk) in the average case.
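For completeness, a quick usage sketch with illustrative sample data (output shown as a comment):
master_list = [
    {"id": 1, "color": "red", "letter": "F", "digit": 5},
    {"id": 2, "color": "blue", "letter": "Q", "digit": 0},
    {"id": 3, "color": "red", "letter": "Z", "digit": 3},
]
criteria = {
    "color": {"red", "yellow"},
    "letter": {"A", "F", "P"},
    "digit": {1, 3, 5},
}
print(search(master_list, criteria))
# [{'id': 1, 'color': 'red', 'letter': 'F', 'digit': 5}]
With, say, n = 3 properties, p = 3 acceptable values each, and k = 1,000 objects, that is roughly 3,000 set lookups instead of roughly 9,000 list comparisons.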

Related

Extract values for all rows, for a given fieldname, from an Octave struct [duplicate]

How can I get all values of a complete column (fieldname), for all rows, from an Octave struct?
I'd like to get it into a cell array or a regular vector, preferably without looping.
You seem to be confusing a few things. Partly because of your equivalence comparison of structs to "R dataframes / python pandas".
Structs are better thought of as being similar to python dicts, R lists, etc. They are a special object that can hold 'fields', which can be accessed by a 'fieldname' ( or values accessed by keys, if you prefer ).
Also, like any other object in octave, they are valid elements for an array. This means you can have something like this:
octave:1> struct( 'name', { 'Tom', 'Jim'; 'Ann', 'Sue' }, 'age', { 20, 21; 22, 23 } )
S =
2x2 struct array containing the fields:
name
age
In general, when one deals with such a struct array, accessing a field on more than one element of the array produces a comma-separated list. E.g.
octave:6> S(2,:).name
ans = Ann
ans = Sue
This can be passed to (i.e. "expanded into") any function that expects such a comma separated list as arguments. E.g.
octave:7> fprintf( 'The girls names are %s, and %s.\n', S(2,:).name )
The girls names are Ann, and Sue.
If you want, you can also pass that list straight into a 'cell constructor', to create a cell array (though if you want it to have a particular shape, you'll have to reshape it afterwards). E.g.
octave:9> reshape( { S.age }, size(S) )
ans =
{
[1,1] = 20
[2,1] = 22
[1,2] = 21
[2,2] = 23
}
There is also struct2cell but this does something different. Try it to see what it does (e.g. C = struct2cell(S) ).
Finally, to avoid confusion, given the fact that when one deals with struct arrays, "columns" refer to columns in the 'array', I would avoid referring to "fieldnames" by that term.

ArangoDB - Aggregate sum of descendant attributes in DAG

I have a bill of materials represented in ArangoDB as a directed acyclic graph. The quantity of each part in the bill of materials is represented on the edges while the part names are represented by the keys of the nodes. I'd like to write a query which traverses down the DAG from an ancestor node and sums the quantities of each part by its part name. For example, consider the following graph:
Widget --(Qty: 2)--> Gadget
Gadget --(Qty: 1)--> Stuff
Gadget --(Qty: 4)--> Thing
Widget --(Qty: 1)--> Thing
Widget contains two Gadgets, each of which contains one Stuff and four Things. Widget also contains one Thing. Thus I'd like to write an AQL query which traverses the graph starting at Widget and returns:
{
  "Gadget": 2,
  "Stuff": 2,
  "Thing": 9
}
I believe collect aggregate may be my friend here, but I haven't quite found the right incantation yet. Part of the challenge is that all descendant quantities of a part need to be multiplied by their parent quantities. What might such a query look like that efficiently performs this summation on DAGs of depths around 10 layers?
Three possible options come to mind:
1.- return the values from the path and then summarize the data in the app server:
FOR v,e,p IN 1..2 OUTBOUND 'test/4719491'
testRel
RETURN {v:v.name, p:p.edges[*].qty}
This returns Gadget 2, Stuff [2,1], Thing [2,4], Thing [ 1 ]
2.- enumerate the edges on the path, to get the results directly :
FOR v,e,p IN 1..2 OUTBOUND 'test/4719491'
testRel
let e0 = p.edges[0].qty
let e1 = NOT_NULL(p.edges[1].qty,1)
collect itemName = v.name aggregate items = sum(e0 * e1)
Return {itemName: itemName, items: items}
This correctly returns Gadget 2, Stuff 2, Thing 9.
This obviously requires that you know the number of levels beforehand.
3.- Write a custom function "multiply", similar to the existing "SUM" function, so that you can multiply the values of an array. The query would be similar to this:
let vals = (FOR v,e,p IN 1..2 OUTBOUND 'test/4719491'
testRel
RETURN {itemName:v.name, items:SUM(p.edges[*].qty)})
for val in vals
collect itemName = val.itemName Aggregate items = sum(val.items)
return {itemName: itemName, items: items}
So your function would replace the SUM in the inner sub-select. See the ArangoDB documentation on custom AQL functions.

Select a Range of Elements from a List in Terraform

Is there a way to select a range of elements from a list in Terraform?
For example - if we have:
[a, bb, ccc, dddd, eeeee]
How can the first 3 elements be selected?
a, bb, ccc
And then the 4th and 5th elements?
dddd, eeeee
Subsets of lists like the ones you are looking for are often referred to as slices. Terraform has a built-in function for this, called slice, which has been available since version 0.8.8. You are looking for:
slice(<put_reference_to_list_here>, 0, 3)
slice(<put_reference_to_list_here>, 3, 5)
From the documentation:
slice(list, from, to) - Returns the portion of list between from (inclusive) and to (exclusive).
Most notably, slice is picky about the to parameter, which must be less than or equal to the list's length; otherwise Terraform will complain.
These fromIndex/toIndex interfaces are not intuitive for me, so I started to keep code snippets around for every language I (have to) use. This is my helper for Terraform:
variable "mylist" { default = [ 101, 102, 103, 104, 105 ] }
locals{
everything = "${slice(var.mylist, 0 , length(var.mylist) )}"
butlast = "${slice(var.mylist, 0 , length(var.mylist)-1)}"
thelast = "${slice(var.mylist, length(var.mylist)-1, length(var.mylist) )}"
}
data "null_data_source" "slices" {
inputs {
everything = "${join(",",local.everything)}"
butlast = "${join(",",local.butlast)}"
thelast = "${join(",",local.thelast)}"
}
}
output "slices" {
value = "${data.null_data_source.slices.outputs}"
}
To spare you the effort of terraform init; terraform refresh:
data.null_data_source.slices: Refreshing state...
Outputs:
slices = {
  butlast = 101,102,103,104
  everything = 101,102,103,104,105
  thelast = 105
}
Use the slice function. The Terraform documentation describes the full expression language available to you.
Depending on where that list comes from, you might find it more convenient to split it up at its source. For instance, instead of declaring 5 aws_instance resources then trying to slice their output this way, have two separate aws_instance declarations of 3 and 2 instances respectively, and just deal with the entire list of outputs.

How to enable string and numeric comparison temporarily?

Given this simplified example to sort:
l = [10, '0foo', 2.5, 'foo', 'bar']
I want to sort l so that numbers always come before strings. In this case, I'd like to get [2.5, 10, '0foo', 'foo', 'bar']. Is it possible to make numbers and strings temporarily comparable (with strings always larger than numbers)?
Note that it is not easy to provide a key function to sorted, if that is what you are thinking of. For example, converting the numbers to strings won't work, because "10" < "2.5".
A way that you might do this does involve passing a key to sorted. It looks like this:
sorted(l, key=lambda x: (isinstance(x, str), x))
This works because the key returns a tuple of whether x is a string and the value itself, and because of the way tuples are compared: the items at index 0 are compared first, and only if they are equal are the next items compared, and so on. This sorts the values by type (string or not a string) first, and then by value within each group.
A more robust solution that can also handle further types might use a dictionary in the key function, like this:
sorted(l,key=lambda x:({int:0, float:0, str:1, list:2, set:3}[type(x)], x))
Further types can be added as necessary.
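For instance, applying the first key to the example list from the question gives (output shown as a comment):
l = [10, '0foo', 2.5, 'foo', 'bar']
print(sorted(l, key=lambda x: (isinstance(x, str), x)))
# [2.5, 10, '0foo', 'bar', 'foo']
Note that the strings end up in alphabetical order among themselves, which still satisfies the numbers-before-strings requirement.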

Methods for nearby numbers in Groovy

In Groovy, are there any methods that can find the nearest number? For example:
def list = [22,33,37,56]
def number = 25
// any method to find that $number is nearer to 22 than to 33
Is there any method for the above-mentioned purpose, or do I have to construct my own method or closure for it?
Thanks in advance.
The following combination of Groovy's collection methods will give you the closest number in the list:
list.groupBy { (it - number).abs() }.min { it.key }.value.first()
The list.groupBy { (it - number).abs() } will transform the list into a map, where each map entry consists of the distance to the number as key and the original list entry as the value:
[3:[22], 8:[33], 12:[37], 31:[56]]
The values are now each a list on their own, as theoretically the original list could contain two entries with equal distance. On the map you then select the entry with the smallest key, take its value and return the first entry of the value's list.
Edit:
Here's a simpler version that sorts the original list based on the distance and returns the first value of the sorted list:
list.sort { (it - number).abs() }.first()
If it's a sorted List, Collections.binarySearch() does nearly the same job. So does Arrays.binarySearch().
