Google Datastore query filter for multiple values for same property - node.js

I have a query I wish to run on Google Datastore that is intended to retrieve data from multiple devices. However, I couldn't find anything in the documentation that would allow me to get data from e.g. device-1 or device-2 or device-3, i.e. filtering one property on several alternative values. Is this a Datastore limitation, or am I just missing something?
Based on the NodeJS client library, the query might look something like the filter criteria below:
var query = datastore.createQuery('data')
.filter('device_id', 1)
.filter('device_id', 2)
.filter('device_id', 3);
Otherwise, I might have to run separate queries for the various devices, which doesn't seem like a very elegant solution, especially if there are a lot of devices to simultaneously run queries on.
Any suggestions for the Datastore API or alternative approaches are welcome!

Yes, this would be an OR operation, which falls under the Restrictions on queries (emphasis mine):
The nature of the index query mechanism imposes certain restrictions on what a query can do. Cloud Datastore queries do not support substring matches, case-insensitive matches, or so-called full-text search. The NOT, OR, and != operators are not natively supported, but some client libraries may add support on top of Cloud Datastore.

Related

Usage of the DISTINCT keyword through haskell persist

Is there some way in Haskell persistent to perform a select distinct (column_name_1, column_name_2) .... Note that I do not mean unique; I really want to select the distinct records for these columns. I can of course perform some filter-magic afterwards, but I would like the database (in my case Postgres) to solve it, and I did not really find this in the documentation.
A search of the persistent repo shows that the DISTINCT keyword is not used in any meaningful context, meaning that persistent does not support DISTINCT queries at all.
The reason is that an explicit design goal of persistent is to be backend-agnostic, and many non-SQL backends do not natively support certain SQL features, such as DISTINCT queries and joins.
I opened a Github issue to query this, and Matt Parsons, a persistent maintainer, responded recommending the Esqueleto package, which is written on top of persistent and aims to provide SQL-specific functionality.

Google Datastore filter with OR condition

I am working with NodeJS on Google App Engine with the Datastore database.
I am using composite query filter and just need a basic "OR" condition.
Example: Query Tasks that have Done = false OR priority = 4
const query = datastore.createQuery('Task')
.filter('done', '=', false) //How to make this an OR condition?
.filter('priority', '=', 4);
However, according to the documentation:
Cloud Datastore currently only natively supports combining filters with the AND operator.
What is a good way to achieve a basic OR condition without running two entirely separate queries and then combining the results?
UPDATE
I have my solution described in detail here in my other post. Any feedback for improvements to the solution would be appreciated since I'm still learning NodeJS.
It is not currently possible to achieve a query with an OR condition; this is what the note you quoted means.
Some client libraries provide some (limited) support for OR-like operations. From Restrictions on queries:
The nature of the index query mechanism imposes certain restrictions on what a query can do. Cloud Datastore queries do not support substring matches, case-insensitive matches, or so-called full-text search. The NOT, OR, and != operators are not natively supported, but some client libraries may add support on top of Cloud Datastore.
But AFAIK no such library is available for NodeJS.
If you only need a few specific such queries, one possible approach is to compute (at the time of writing the entities) an additional property holding the desired result for such a query, and to use equality filters on that property instead.
For example, assuming you'd like a query with OR-ing the equivalents of these filters:
.filter('status', '=', 'queued')
.filter('status', '=', 'running')
You could compute a property like not_done every time status changes and set it to true if status is either queued or running and false otherwise. Then you can use .filter('not_done', '=', true) which would have the same semantics. Granted, it's not convenient, but it may get you past the hurdle.
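A minimal sketch of that write-time computation with the Node.js client (the helper name and entity shape are assumptions for illustration):
// At write time, derive not_done from status so an equality
// filter can stand in for the unsupported OR.
function buildTaskEntity(key, task) {
  return {
    key,
    data: {
      ...task,
      not_done: task.status === 'queued' || task.status === 'running',
    },
  };
}

// e.g. datastore.save(buildTaskEntity(datastore.key(['Task']), task));

// At query time, the OR of the two status filters collapses to:
const query = datastore.createQuery('Task').filter('not_done', '=', true);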
I wrote an answer on your other question, regarding using Array properties on Cloud Datastore to work around some cases where having the OR operator would have helped: https://stackoverflow.com/a/74958631/963901
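The array-property idea relies on Datastore's behavior that an equality filter on an array-valued property matches when any element of the array equals the filter value. A generic sketch (kind, property names, and values invented for the example):
const { Datastore } = require('@google-cloud/datastore');
const datastore = new Datastore();

async function example() {
  // Store the set of alternative values the entity should match on.
  await datastore.save({
    key: datastore.key(['Task']),
    data: { title: 'Ship release', states: ['queued', 'active'] },
  });

  // A single equality filter matches if ANY element of the array equals
  // the value, covering some OR-like cases over stored alternatives.
  const query = datastore.createQuery('Task').filter('states', '=', 'queued');
  const [tasks] = await datastore.runQuery(query);
  return tasks;
}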

Look ahead search on document fields in azure DocumentDb

We are interested in using DocumentDb as a data store for a number of data sources and as such we are running a quick POC to establish whether it meets the criteria we are looking for.
One of the capabilities we are keen to provide is look-ahead search on certain fields. This is traditionally provided using the SQL LIKE syntax, which does not appear to be supported at present.
Searching online I have seen people talking about integrating Azure Search, but this appears to be a very costly mechanism for such a simple use case.
I have also seen people mention the use of UDFs, but this appears to require an entire collection scan, which is not practical from a performance perspective.
Does anyone have any alternative suggestions? One thing I considered was simply using a SQL table and initiating an update each time a document was inserted/updated/deleted.
DocumentDB offers STARTSWITH and range indexes to support prefix/look-ahead searching.
You can progressively make queries like the following based on what your user types in a text box:
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "H")
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "Hi")
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "Hil")
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "Hilton")
Note that you must configure the collection, or the specific path/property you're using for these queries, with a range index. You can extend this approach to handle additional cases as well:
To query in a case-insensitive manner, you must store the lower-cased form of the search property and use that for querying.
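As a sketch of both points, here is how the write-time shadow property and the parameterized prefix query might look with the current Node.js client for the service (now Cosmos DB; the @azure/cosmos package usage, database/container names, and the nameLower property are assumptions, since the original answer targets the older DocumentDB SDK):
const { CosmosClient } = require('@azure/cosmos');

const client = new CosmosClient(process.env.COSMOS_CONNECTION_STRING);
const container = client.database('poc').container('hotels');

// At write time, store a lower-cased shadow of the searchable field.
async function upsertHotel(hotel) {
  await container.items.upsert({ ...hotel, nameLower: hotel.name.toLowerCase() });
}

// At query time, lower-case the user's input and prefix-match the shadow field.
async function lookAhead(prefix) {
  const { resources } = await container.items
    .query({
      query: 'SELECT TOP 10 * FROM c WHERE STARTSWITH(c.nameLower, @prefix)',
      parameters: [{ name: '@prefix', value: prefix.toLowerCase() }],
    })
    .fetchAll();
  return resources;
}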
I faced a similar situation, where a fast lookup was required as a user typed search terms.
In my scenario, potentially thousands of simultaneous users would be performing such lookups. When testing under load, we found that, to avoid saturation and throttling, we would have to increase the DocumentDB Request Unit (RU) throughput to a level that was not financially viable for us in our specific circumstances.
We decided that DocumentDB was best used as the persistent store and for 'full' data retrieval (a role it performs exceptionally well), while a small ElasticSearch cluster performed the role it was designed for: text search, faceted search, weighted search, stemming and, most relevant to your question, autocomplete analyzers and completion suggesters.
Type-ahead queries, index creation, autocomplete analyzers, and query-time 'search as you type' are covered in the ElasticSearch documentation.
The fact that you plan to have several data sources would also potentially make the ElasticSearch cluster approach more attractive, to aggregate search data.
I used the Bitnami template available in the Azure Marketplace to create relatively small instances, and, most importantly, this allowed me to place the cluster on the same Virtual Network as my other components, which greatly increased performance.
Cost was lower than Azure Search (which uses ElasticSearch under the hood).

Dynamic queries with ArangoDB

I am looking to write dynamic queries for an ArangoDB graph database and am wondering if there are best practices or standard approaches to doing it.
By 'dynamic queries' I mean that users would have the ability to build a query that is then executed on the dataset.
Methods by which ArangoDB could support this include:
Dynamically generate AQL queries by manually injecting bindvars
Write Foxx functions to deliver on supported queries, and have another Foxx function bind those together to build a response.
Write a workflow which extracts data into a temporary collection and then invokes Foxx functions to filter/sort the data to the desired outcome.
The queries would be very open ended, where someone would (for example):
Query all countries with population over 10,000,000
Sort countries by land in square kilometers
Pick the top 10 countries in land coverage
Select primary language spoken in each country
Count occurrences of each language.
That query alone is straightforward to execute, but if a user were able to tick checkboxes or select from a range of supported query options, order them in their own defined way, and receive the output, it's a little more involved.
Are there some supported or recommended approaches to doing this?
My current approach would be to write blocks of AQL that deliver on each part, probably in a LET Q1 = (....), LET Q2 = (...) format, and then at the bottom of the query have a generic way of processing those blocks to generate a response.
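As an illustration of that LET-block style, combined with bind variables to keep user input out of the AQL text, here is a sketch using the arangojs driver (the collection and attribute names are invented to follow the country example above):
const { Database, aql } = require('arangojs');
const db = new Database({ url: 'http://localhost:8529' });

// Each building block is a LET subquery; the aql template tag
// turns the ${} interpolations into bind variables.
async function topLanguages(minPopulation, topN) {
  const cursor = await db.query(aql`
    LET q1 = (
      FOR c IN countries
        FILTER c.population > ${minPopulation}
        SORT c.areaSqKm DESC
        LIMIT ${topN}
        RETURN c
    )
    FOR c IN q1
      COLLECT language = c.primaryLanguage WITH COUNT INTO occurrences
      RETURN { language, occurrences }
  `);
  return cursor.all();
}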
But I have a feeling that smart use of Foxx functions could help here as well, having Foxx-Query-Q1 and Foxx-Query-Q2 coded to support each query type, then an aggregation Foxx app that invoked the right queries in the right order to build the right response.
If anyone has seen best ways of doing this, it would be great to get some hints/advice.
Thanks!

Using Lucene to index private data, should I have a separate index for each user or a single index

I am developing an Azure based website and I want to provide search capabilities using Lucene. (Structured JSON objects would be indexed and stored in Lucene, and other content such as Word documents would be indexed in Lucene but stored in blob storage.) I want the search to be secure, such that one user would never see a document belonging to another user. I want to allow ad-hoc searches as typed by the user. Lastly, I want to query programmatically to return predefined sets of data, such as "all notes for user X". I think I understand how to add properties to each document to achieve these 3 objectives. (I am listing them here so that if anyone is kind enough to answer, they will have a better idea of what I am trying to do.)
My questions revolve around performance and security.
Can I improve document security by having a separate index for each user, or is including the user's ID as a parameter in each search sufficient?
Can I improve indexing speed and total throughput of the system by having a separate index for each user? My thinking is that having separate indexes would allow me to scale the system by having multiple index writers (perhaps even on different server instances) working at the same time, each on their own index.
Any insight would be greatly appreciated.
Regards,
Nate
Of course, one index.
You can do even better than what you suggested by using ManifoldCF (an Apache product that knows how to handle Solr) to manage security.
And one off-topic, uninformed suggestion: I'd rather use CloudBees or Heroku (or Amazon) instead of Azure.
Until you are using several machines for indexing, I think it's more convenient to use a single index. The Lucene community has done a lot of work to make the indexing process as efficient as it can be, so unless you intentionally want to implement distributed indexing, I don't recommend splitting indexes.
However, there are several reasons why you might want to split indexes:
If your machine has several IO devices that could be utilized in parallel: in this case, if you are IO-bound, splitting indexes is a good idea.
Splitting document fields between indexes (this is what ParallelReader is intended for): this is a more exotic form of splitting, but it may be a good idea if searches are performed using different groups of fields. Suppose we have two search query types: the first uses the fields name and type, and the second uses the fields price and discount. If those fields are updated at different rates (I would guess name updates are far less frequent than price updates), updating only part of the index would require fewer IO resources. This gives more overall throughput to the system.
