I have run into problems on multiple occasions where all but one field is defined in a search query (some against databases, some not), but it is not known at indexing time which field will be the unknown one. My usual approach is to create an index for every field and do set operations on them, or in the case of SQL, to index each field and then select. With many fields, this may or may not be better than a linear search.
But if I know that only one field will be missing at a time, is it possible to do better than this in terms of time complexity or disk access? Perhaps with a data structure I am not familiar with?
Taking inspiration from Lucene, you could start with an inverted index structure on disk, and then step a leapfrog iterator across all fields in question.
E.g.
Sample Docs
{"id": 1, "network": "NBC", "show": "Wheel of Fortune", "host": "Pat Sajak", "sponsor": "NordicTrack"}
{"id": 2, "network": "NBC", "show": "Jeopardy", "host": "Alex Trebek", "sponsor": "IBM Watson"}
{"id": 3, "network": "NBC", "show": "The Wizard of Odds", "host": "Alex Trebek", "sponsor": "NordicTrack"}
Inverted indices
network := "NBC" (1,2,3)
show := "Jeopardy" (2), "The Wizard of Odds" (3), "Wheel of Fortune" (1)
host := "Alex Trebek" (2,3), "Pat Sajak" (1)
sponsor := "IBM Watson" (2), "NordicTrack" (1,3),
Query issued for:
Network: NBC
Host: Alex Trebek
Sponsor: NordicTrack
Show: Unknown
Query execution
Iteration Step 1
network := "NBC" (1...
update consensus id to 1
advance to consensus id in all other queried fields (if exists)
Iteration Step 2
network := "NBC" (1...
host := "Alex Trebek" (2...
Fault: id:2 is higher than the consensus id,
update consensus id to 2
advance to new consensus id in all other queried fields (if exists)
Iteration Step 3
network := "NBC" (1, 2...
host := "Alex Trebek" (2...
sponsor := "NordicTrack" (1, 3)
Fault: id:3 is higher than the consensus id,
update consensus id to 3
advance to new consensus id in all other queried fields (if exists)
Iteration Step 4
network := "NBC" (1, 2, 3)
host := "Alex Trebek" (2, 3)
sponsor := "NordicTrack" (1, 3)
Match: all queried fields concur,
add 3 to the list of matches
...
With linear iteration, the number of comparisons performed by leapfrog is bounded by the sum of the lengths of the postings lists for all queried fields.
The number of comparisons can be reduced further using skip lists, though this requires fast random access to the postings.
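For concreteness, here is a minimal sketch of the leapfrog intersection described in the walkthrough above, using plain in-memory arrays as postings lists (a real index would seek through on-disk postings, ideally via skip lists; all names here are illustrative):
function leapfrogIntersect(postings) {
  var cursors = postings.map(function () { return 0; });
  var matches = [];
  outer:
  while (true) {
    // The consensus id is the largest id currently under any cursor.
    var consensus = -Infinity;
    for (var i = 0; i < postings.length; i++) {
      if (cursors[i] >= postings[i].length) break outer; // a list is exhausted
      consensus = Math.max(consensus, postings[i][cursors[i]]);
    }
    var allConcur = true;
    for (var j = 0; j < postings.length; j++) {
      // Seek: advance this cursor to the first id >= consensus.
      while (cursors[j] < postings[j].length && postings[j][cursors[j]] < consensus) {
        cursors[j]++;
      }
      if (cursors[j] >= postings[j].length) break outer;
      if (postings[j][cursors[j]] !== consensus) allConcur = false; // fault
    }
    if (allConcur) {
      matches.push(consensus); // all queried fields concur
      cursors[0]++;            // move past the match and continue
    }
  }
  return matches;
}
// Postings from the example: network:"NBC", host:"Alex Trebek", sponsor:"NordicTrack"
leapfrogIntersect([[1, 2, 3], [2, 3], [1, 3]]); // => [3]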
We hit a unique scenario while using Azure Search on one of our projects. Our clients wanted to respect users' privacy, so we have a feature where a user can restrict search over any of their PII data. If a user has opted for privacy, we can only search for him/her by UserId; otherwise we can search using Name, Phone, City, UserId, etc.
JSON where Privacy is opted:
{
"Id": "<Any GUID>",
"Name": "John Smith", //searchable
"Phone": "9987887856", //searchable
"OtherInfo": "some info" //non-searchable
"Address" : {}, //searchable
"Privacy" : "yes", //searchable
"UserId": "XXX1234", //searchable
...
}
JSON where Privacy is not opted:
{
"Id": "<Any GUID>",
"Name": "Tom Smith", //searchable
"Phone": "7997887856", //searchable
"OtherInfo": "some info" //non-searchable
"Address" : {}, //searchable
"Privacy" : "no", //searchable
"UserId": "XXX1234", //searchable
...
}
Now we provide a search service that takes any searchText as input and fetches all data that matches it (across all searchable fields).
With the above scenario:
We need to remove results that have "Privacy" set to "yes" if searchText does not match UserId.
If searchText matches UserId, we include the record in the result.
If "Privacy" is set to "no" and searchText matches any searchable field, the record is included in the result.
So we have gone with Lucene query syntax to check this while querying, resulting in a very long query, as shown below. Let us assume searchText = "abc":
((Name: abc OR Phone: abc OR UserId: abc ...) AND Privacy: no) OR
((UserId: abc ) AND Privacy: yes)
This is done because we show paginated results, i.e., we bring data in batches like 1-10, 11-20, and so on; hence we fetch the top 10 records in each query along with the total result count.
Is there any other, more optimised approach to do this?
Or does Azure Search provide any internal mechanism for conditional queries?
If I understand your requirement correctly, it can be solved quite easily. You determine which properties should be searchable and which should not in your data model. You don't need to construct a complicated query that repeats the end user's input for every property, and you don't need to do any batching or post-processing of results.
If searchText is your user's input, you can use this:
(*searchText* AND Privacy:no)
This will search all searchable fields, but it will only return records that have allowed search in PII data.
You also have a requirement that allows users to search by UserId in all records regardless of the PII setting for the record. To support this, extend the query to:
(*searchText* AND Privacy:no) OR (UserId:*searchText*)
This allows users to search all fields in records where Privacy is "no", and for all other records it allows search in UserId only. This query pattern solves all of your requirements with one optimized query.
From the client side you could dynamically add the "SearchFields" parameter as part of the query; that way, if the user has the Privacy flag set to true, only UserId is included in the available search fields.
https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.search.models.searchparameters.searchfields?view=azure-dotnet
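As a sketch of that suggestion (using the newer @azure/search-documents JavaScript SDK rather than the .NET API linked above; the endpoint, index name, and key below are placeholders):
const { SearchClient, AzureKeyCredential } = require("@azure/search-documents");

const client = new SearchClient(
  "https://<service>.search.windows.net", // placeholder endpoint
  "users-index",                          // hypothetical index name
  new AzureKeyCredential("<api-key>")
);

async function searchUsers(searchText, privacyOptedIn) {
  // When privacy is opted in, restrict the search to UserId only;
  // otherwise let the service search all searchable fields.
  const options = privacyOptedIn ? { searchFields: ["UserId"] } : {};
  const results = await client.search(searchText, options);
  for await (const result of results.results) {
    console.log(result.document);
  }
}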
I've just started using Cloudant and I can't get my head around map functions. I've been fiddling with the data below, but it isn't working out as I expected.
The relationship is: a user can have many vehicles, and a vehicle belongs to one user. The vehicle's 'userId' is the key of the user. There is a bit of redundancy, in that the user's _id and userId are the same; I guess the latter is not required.
Anyhow, how can I find, for a/every user, the vehicles which belong to it? The closest I've come through trial and error is a result which displays the owner of every vehicle, but I would like it the other way around: the user and the vehicles belonging to it. All the examples I've found use another document which 'joins' two or more documents, but I don't need to do that, do I?
Any pointer in the right direction is appreciated; I really have no idea.
function (doc) {
if (doc.$doctype == "vehicle")
{
emit(doc.userId, {_id: doc.userId});
}
}
EDIT: Getting closer. I'm not sure exactly what I was expecting, but the result seems a bit 'messy'. Row[0] is the user document, rows[n > 0] are the vehicle documents. I guess it's fine when a startkey/endkey is used, but without them the results are a bit jumbled up.
function (doc) {
if (doc.$doctype == 'user') {
emit([doc._id, 0], doc);
} else if (doc.$doctype == 'vehicle') {
emit([doc.userId, 1, doc._id], doc);
}
}
A user is described as,
{
"_id": "user:10",
"firstname": “firstnamehere",
"secondname": “secondnamehere",
"userId": "user:10",
"$doctype": "user"
}
a vehicle is described as,
{
"_id": "vehicle:4002”,
“name”: “avehicle”,
"userId": "user:10",
"$doctype": "vehicle",
}
You're heading in the right direction! You already got it right with the globally unique IDs. Having the type of the document as part of the ID in some form is a very good idea, so that you don't get confused later (all documents live in the same "pot").
Here are some minor problems with your current solution (before getting to your actual question):
Don't emit the doc as the value in emit(key, value). You can always ask for the document that belongs to a view row by querying with include_docs=true. Having the doc as the view value increases the size of the view index a lot. When you don't need a specific value, use emit(key, null).
You also don't need the ID in the emit value. You'll get the ID of the document that belongs to a view row as part of the row anyway.
View Collation
Now to your problem of aggregating the vehicles with their user. You got the basic pattern right. This pattern is called view collation; you can read more about it in the CouchDB docs (ignore that it is in the "Couchapp" section).
The trick with view collation is that you return two or more types of documents, but make sure that they are sorted in a way that allows for direct grouping. Thus it is important to understand how CouchDB sorts the view result. See the collation specification for more information on that one. An important key to understanding view collation is that rows with array keys are sorted by key elements. So when two rows have the same key[0], they sort by key[1]. If that's equal as well, key[2] is considered, and so on.
Your map function first groups users and vehicles by user ID (key[0]). It then uses the fact that 0 sorts before 1 in the second element of the key, so your view will contain the following:
user 1
vehicle of user 1
vehicle of user 1
vehicle of user 1
user 2
user 3
vehicle of user 3
user 4
etc.
As you can see, the vehicles of a user immediately follow their user. Thus you can group this result into aggregates without performing expensive sort or lookup operations.
Note that users are sorted according to their ID, and vehicles within users also according to their ID. This is because you use the IDs in the key array.
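As an illustration (this is not part of the original answer), here is a minimal sketch of folding the collated rows into aggregates on the client; rows is assumed to be the "rows" array of a view response queried with include_docs=true:
function groupByUser(rows) {
  var result = [];
  var current = null;
  rows.forEach(function (row) {
    if (row.key[1] === 0) {
      // A user row starts a new aggregate; its vehicle rows follow immediately.
      current = { user: row.doc, vehicles: [] };
      result.push(current);
    } else if (current) {
      current.vehicles.push(row.doc);
    }
  });
  return result;
}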
Creating Queries
Now, that view isn't worth much if you can't query it according to your needs. A view like yours supports the following queries:
Get all users with their vehicles
Get a range of users with their vehicles
Get a single user with its vehicles
Get a single user without vehicles (you could also use the _all_docs view for that though)
Example query for "all users between user 1 and user 3 (inclusive) with their vehicles"
We want to query for a range, so we use startkey and endkey in the query:
startkey=["user:1", 0]
endkey=["user:3", 1, {}]
Note the use of {} as a sentinel value, which is required so that the end key is larger than any row that has a key of ["user:3", 1, (anyConceivableVehicleId)].
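Assuming the view lives in a design document (the database, design document, and view names below are hypothetical), the full request could look like this; note curl's -g flag, which keeps it from treating the brackets as globs:
curl -g 'https://ACCOUNT.cloudant.com/mydb/_design/app/_view/users_with_vehicles?include_docs=true&startkey=["user:1",0]&endkey=["user:3",1,{}]'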
I am struggling to implement a map/reduce function that joins two document types and sums the result in the reduce step.
The first document type is a category. Each category has an ID, and within the attributes I stored a detail category, a main category, and a division ("Bereich").
{
"_id": "a124",
"_rev": "8-089da95f148b446bd3b33a3182de709f",
"detCat": "Life_Ausgehen",
"mainCat": "COL_LEBEN",
"mainBereich": "COL",
"type": "Cash",
"dtCAT": true
}
The second document type is a transaction. The attributes show all the details for each transaction, including the field "newCat" which is a reference to the category ID.
{
"_id": "7568a6de86e5e7c6de0535d025069084",
"_rev": "2-501cd4eaf5f4dc56e906ea9f7ac05865",
"Value": 133.23,
"Sender": "Comtech",
"Booking Date": "11.02.2013",
"Detail": "Oki Drucker",
"newCat": "a124",
"dtTRA": true
}
Now I want to develop a map/reduce view that gets the result in the form:
e.g.: "Name of Main Category", "Sum of all values in transactions".
I figured out that I could reference another document via its _id and ?include_docs=true, but in that case I cannot use a reduce function.
I looked at other postings here, but couldn't find a suitable example.
It would be great if somebody had an idea how to solve this issue.
I understand that multiple category documents may have the same mainCat value. The technique called view collation is suitable for some cases where a single join would be used in the relational model. In your case it will not help: although you use two document schemas, you really have a three-level structure: main-category <- category <- transaction. I think you should consider changing the DB design a bit.
Duplicating the data, by storing the mainCat value in the transaction document as well, would help. I suggest using a meaningful ID for the transaction instead of a generated one. You could, for example, use "COL_LEBEN-7568a6de86e5e" (the mainCat concatenated with some random value, where the - delimiter never occurs in the mainCat). Then, with a simple parser in the map function, you emit ["COL_LEBEN", "7568a6de86e5e"] for transactions and ["COL_LEBEN"] for categories, and reduce to get the sum.
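A minimal sketch of that design (the dtTRA flag and Value field come from the documents above; the ID scheme and split-based parser follow the suggestion, simplified to emit only the main category for transactions):
function (doc) {
  if (doc.dtTRA) {
    // The _id is assumed to be "<mainCat>-<random>", e.g. "COL_LEBEN-7568a6de86e5e".
    var mainCat = doc._id.split("-")[0];
    emit(mainCat, doc.Value);
  }
}
With the built-in _sum reduce and a query using ?group=true, each row is then a main category together with the sum of all its transaction values.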
I have a CouchDB database that contains a stack of IP address documents like this:
{
"_id": "09eea172ea6537ad0bf58c92e5002199",
"_rev": "1-67ad27f5ab008ad9644ce8ae003b1ec5",
"1stOctet": "10",
"2ndOctet": "1",
"3rdOctet": "3",
"4thOctet": "55"
}
The documents represent multiple IPs that are part of different subnet ranges.
I need a way to reduce/group these documents based on the 1st, 2nd, 3rd and 4th octets in order to produce a reduced list of subnets.
Has anybody done anything like this before?
Best Regards,
Carlskii
I'm not sure if this is exactly what you're looking for; if you can provide more of an example of your desired output, I can likely be of more help.
First, I would have your document structure look like this (if you can't change that structure, it's not a big deal):
{
"ip": "10.1.3.55"
}
Your map function would look like:
function (doc) {
  // The key is the array of octets, e.g. ["10", "1", "3", "55"]
  emit(doc.ip.split("."), null);
}
You'll need a reduce function; I've just used this in my testing:
_count
Then I would use the group_level view query parameter to group based on each octet.
1 = group based on 1st octet
2 = group based on 1st-2nd octet
3 = group based on 1st-3rd octet
4 = group based on the entire address
group=true is functionally the same in this case as group_level=4
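For example (the design document and view names here are hypothetical), grouping the sample document above at level 3 collapses it into its /24 subnet:
curl -g 'https://HOST/ips/_design/subnets/_view/by_octet?group_level=3'
{"rows":[{"key":["10","1","3"],"value":1}]}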
Consider the following documents in a CouchDB:
{
"name":"Foo1",
"tags":["tag1", "tag2", "tag3"],
"otherTags":["otherTag1", "otherTag2"]
}
{
"name":"Foo2",
"tags":["tag2", "tag3", "tag4"],
"otherTags":["otherTag2", "otherTag3"]
}
{
"name":"Foo3",
"tags":["tag3", "tag4", "tag5"],
"otherTags":["otherTag3", "otherTag4"]
}
I'd like to query all documents that contain ALL (not any!) tags given as the key.
For example, if I request using '["tag2", "tag3"]' I'd like to retrieve Foo1 and Foo2.
I'm currently doing this by querying by tag, first for "tag2", then for "tag3", and creating the intersection manually afterwards.
This seems to be awfully inefficient and I assume that there must be a better way.
My second question - but they are quite related, I think - would be:
How would I query for all documents that contain "tag2" AND "tag3" AND "otherTag3"?
I hope a question like this hasn't been asked/answered before. I searched for it and didn't find one.
Do you have a maximum number of:
Tags per document, and
Tags allowed in the query?
If so, you have an upper bound on the number of tag combinations to be indexed. For example, with a maximum of 5 tags per document and 5 tags allowed in the AND query, you could simply output every 1-, 2-, 3-, 4-, and 5-tag combination into your index, for a maximum of 1 (five-tag combo) + 5 (four-tag combos) + 10 (three-tag combos) + 10 (two-tag combos) + 5 (one-tag combos) = 31 rows in the view for that document.
That may be acceptable to you, considering that it's quite a powerful query. The disk usage may be acceptable too, especially if you simply emit(tags, {_id: doc._id}) to minimize the data in the view; you can use ?include_docs=true to get the full document later. The final thing to remember is to always emit the key array sorted, and always query it the same way, because you are emitting only tag combinations, not permutations.
That can get you quite far; however, it does not scale up indefinitely. For full-blown arbitrary AND queries, you will indeed be required to split into multiple queries, or else look into CouchDB-Lucene.
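For the bounded case, a sketch of a map function that emits every non-empty tag combination (sorted, as described above) could look like this:
function (doc) {
  var tags = (doc.tags || []).slice().sort();
  // Enumerate every non-empty subset with a bitmask; with at most
  // 5 tags that is 2^5 - 1 = 31 rows per document.
  for (var mask = 1; mask < (1 << tags.length); mask++) {
    var combo = [];
    for (var i = 0; i < tags.length; i++) {
      if (mask & (1 << i)) combo.push(tags[i]);
    }
    emit(combo, null);
  }
}
Querying the view with key=["tag2","tag3"] then returns exactly the documents containing both tags (Foo1 and Foo2 in the sample data).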