I am using DynamoDB to store my data. I am creating a dashboard application where users can sort by fields, search by fields, and add multiple filters at once. There will be approx 100 - 1000 entries in the table.
To achieve this search/filter/sort functionality, I can see two options:
Use FilterExpression. A simple solution, but the filter is applied only after DynamoDB has read the items, so a Scan still pulls (and consumes read capacity for) the whole table before filtering (not a 'true' query), requires more processing, and FilterExpression is often seen as bad practice.
Create a GSI for each field individually. This lets me search and sort by a field using a true query, reducing processing, since I can directly get the items I need. The issue is combining multiple filters, as it is not possible to use multiple GSIs in a single Query call. With several filters active, this approach would require multiple Query calls and manually aggregating / finding the common items client-side (see the sketch below).
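To make the trade-off concrete, here is a minimal boto3 sketch of both options. The table name dashboard-items, the attribute names, and the GSI status-index are invented for illustration; this is just to show the shape of each call.

```python
import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource("dynamodb").Table("dashboard-items")  # hypothetical table

# Option 1: Scan + FilterExpression.
# DynamoDB reads every item (and consumes read capacity for all of them),
# then applies the filter before returning each page.
scanned = table.scan(
    FilterExpression=Attr("status").eq("open") & Attr("owner").eq("alice")
)
items = scanned["Items"]

# Option 2: Query a GSI keyed on one field.
# Only matching items are read, but every additional filter field needs its
# own GSI and its own Query call, with the intersection done client-side.
queried = table.query(
    IndexName="status-index",  # hypothetical GSI
    KeyConditionExpression=Key("status").eq("open"),
)
open_items = queried["Items"]
```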
Would it be acceptable to use FilterExpression in this situation? It would simplify the process so much from a coding / maintenance perspective, but I am unsure if it's good practice. If GSIs are the better option, how would you deal with multiple filters?
Lastly, would there be a better approach, aside from the two options listed above?
Thanks so much in advance!
Honestly, sorting/searching is where DDB falls down.
With the amount of data you're talking about, I'd simply use Aurora.
At scale, assuming you hit the limits of Aurora, you'd be better served by front-ending DDB with Elasticsearch.
Related
I am working on a system where I need fast filtering queries. Basically, it is a set of 50 different fields: booleans, amounts, codes, and dates; just like a web-shop filter.
There are roughly 10,000,000 items.
For the moment I am using MSSQL, with one big table and various indexes, apart from a few separate tables where I found it much faster to join than to just filter the result within one table.
I usually get a response time around 1 second, with a fairly fast server.
I was considering using ArangoDB for this and wonder what approach is best. Is it better to keep some of the "flags" in separate collections and join, or is it more efficient to put everything in the same document and store each flag as an indexed attribute? Or would there be any benefit in using the graph/edge feature and linking back to the same object (or to an object representing the code, for instance)?
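For what it's worth, here is a rough python-arango sketch of the "everything in one document, flags as indexed attributes" variant I'm asking about (the database, collection, and field names are all made up):

```python
from arango import ArangoClient

# Hypothetical database and collection names.
db = ArangoClient(hosts="http://localhost:8529").db("shop", username="root", password="")
items = db.collection("items")

# Keep the "flags" as plain attributes on each document and index the
# combinations that are filtered on most often.
items.add_persistent_index(fields=["in_stock", "price"])
items.add_persistent_index(fields=["category_code"])

# A web-shop style filter is then a single AQL query over one collection.
cursor = db.aql.execute(
    """
    FOR i IN items
        FILTER i.in_stock == true
           AND i.price <= @max_price
           AND i.category_code IN @codes
        SORT i.price ASC
        LIMIT 50
        RETURN i
    """,
    bind_vars={"max_price": 100, "codes": ["A1", "B2"]},
)
results = list(cursor)
```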
The reason I am considering ArangoDB is that I plan to move to a more complex model and will most likely use the graph features in the future, even though the first priority is to get the system up to the current feature level at a similar speed.
Any thoughts?
I am setting up a large (2000+ records) "task tracking register" using a SharePoint list, and intend to use PowerApps as the UI.
As you would imagine, there are numerous drop-down fields in the list which I would like to use as filters within the PowerApp, but because these are "Complex" fields, they are non-delegable.
I'm led to believe that I can avoid this by creating additional columns in the SharePoint list and using a Flow to populate them with plain text based on the drop-down selection.
This is a bit of a pain, so I'd like to limit the number of these helper columns as much as possible.
Can anyone advise whether a PowerApps gallery will first filter the results using the delegable functions, and then perform the non-delegable search functions on those items, or whether the inclusion of a non-delegable search criterion means that the whole query is performed in a non-delegable manner?
i.e.
Filter 3000 records down to 800 using a delegable search, then perform the additional filtering of those 800 in the app for the non-delegable search criteria.
I understand that it may be possible to do this by loading the initial filtered results into a collection within the app and filtering that collection, but I have read some conflicting information about the efficacy of this method, so I'm not sure if this is the route I should take.
Delegation can be a challenge. Here are some methods for handling it:
Users rarely need more than a few dozen records at any time in a mobile app. Try to use delegable queries to create a collection locally. From there, it's lightning fast.
If you MUST pull in all 3k+ of your records, here's my favorite hack: collect chunks of your data source, then combine them into a single collection.
If you want the function (and the user's wait time) to scale, you can look up the first and last ID and build the chunking formula dynamically.
Good luck!
I'm using Cognos 10.1 and I have a report that uses two queries each with the same primary key.
Query 1: UniqueIds
Query 2: DetailedInfo
I'm not sure how to tell whether it's better to build the report using the DetailedInfo query with a filter that says PrimaryKey in (UniqueIds.PrimaryKey), or whether I should create a third query that joins UniqueIds to DetailedInfo on PrimaryKey.
I'm new to Cognos and I'm learning to think differently. Using Microsoft SQL Server I'd just use an inner join.
So my question is: in Cognos 10.1, which way is better, and how can I tell what the performance differences are?
You'd better start from the beginning.
Your queries (I hope Query Subjects) should be joined in Framework Manager, in a model. Then you can easily filter the second query by applying filters to the first query.
Joins in Report Studio are the last resort.
The report writer's ultimate weapon is a well-indexed data warehouse with a solid framework model built on top.
You want all of your filtering and joining to happen on the database side as much as possible. If not, then large data sets are brought over to the Cognos server before they are joined and filtered by Cognos.
The more work that happens on the database, the faster your reports will be. By building your reports in certain ways, you can mitigate Cognos side processing, and promote database side processing.
The first and best way to do this is with a good Framework Model, as Alexey pointed out. This will allow your reports to be simpler, and pushes most of the work to the database.
However a good model still exposes table keys to report authors so that they can have the flexibility to create unique data sets. Not every report warrants a new Star Schema, and sometimes you want to join the results of queries against two different Star Schema sources.
When using a join or a filter, Cognos attempts to push all of the work to the database as a default. It wants to have the final data set sent to it, and nothing else.
However, when creating your filters, you have two ways of defining variables: with explicit names that refer to modeled data sources (i.e. [Presentation View].[Sales].[Sales Detail].[Net Profit]) or by referring to a column in the current data set (such as [Net Profit]). Using explicit columns from the model will help ensure the filters are applied at the database.
Sometimes that is not possible, such as with a calculated column. For example, if you don't have Net Profit in your database or within your model, you may establish it with a calculated column. If you filter on [Net Profit] > 1000, Cognos will pull the data set into the Cognos server before applying your filter. Your final result will be the same, but depending on the size of the data before and after the filter is applied, you could see a performance decrease.
It is possible to have nested queries within your report, and Cognos will generate a single large SQL statement for the highest-level query, which includes subqueries for all the lower-level data. You can generate the SQL/MDX to see how Cognos is building the queries.
Also, try experimenting. Save your report with a new name, try it one way and time it. Run it a few times and take an average execution speed. Time it again with the alternate method and compare.
With smaller data sets, you are unlikely to see any difference. The larger your data set gets, the bigger the difference your chosen method will make to report speed.
Use joins to merge two queries together so that columns from both queries can be used in the report. Use IN() syntax if your only desire is to filter one query by the existence of corresponding rows in a second. That said, there are likely to be many cases where both methods are equally performant, depending on the number of rows involved, indexes, etc.
By the way, within a report Cognos only supports joins and unions between different queries. You can reference other queries directly in filters even without an established relationship but I've seen quirks with this, like it works when run interactively but not scheduled or exported. I would avoid doing this in reports.
CouchDB has a special _all_docs view, which returns documents sorted by ID. But as IDs are random by default, that sort order is meaningless.
I always need to sort by 'date added'. Now I have two options:
Generate my own IDs and make sure they start with a timestamp
Use standard GUIDs, but add a timestamp field in the JSON and sort on that
Now the second solution is less hackish, but I suspect the first solution to be much more efficient and faster, because all queries will be done on the real row id, which is indexed.
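For reference, option 1 would look roughly like this against CouchDB's HTTP API (sketched with Python's requests; the database and field names are invented):

```python
import time
import uuid

import requests

DB = "http://localhost:5984/notes"  # hypothetical database

# Option 1: make the document ID start with a timestamp, so the built-in
# _all_docs index is effectively sorted by "date added".
doc_id = f"{int(time.time() * 1000):013d}-{uuid.uuid4().hex}"
requests.put(f"{DB}/{doc_id}", json={"title": "hello", "added_at": int(time.time())})

# Newest first, straight from the primary index.
resp = requests.get(f"{DB}/_all_docs", params={"descending": "true", "include_docs": "true"})
rows = resp.json()["rows"]
```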
Is it true that both solutions differ in performance? And if it's true, which one is likely to be faster or preferred?
Is it true that both solutions differ in performance?
Your examples describe the primary-index and the secondary-index approach in CouchDB.
_all_docs is the only primary index and is always up to date. Secondary indexes (views), as in your second solution, are updated when they are requested.
That's why, from the requester's point of view, _all_docs might feel "faster". In reality there is no difference when you request an index that is already up to date. Two workarounds for potentially outdated views (secondary indexes) are the query parameter stale=ok (respond from the existing index without updating it; stale=update_after updates the view after the response is sent) or so-called "view heaters" (a simple HTTP GET to the view to trigger the update process).
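As a rough sketch of that second option, a view keyed on the stored date (the design-document, view, and field names here are invented):

```python
import requests

DB = "http://localhost:5984/notes"  # hypothetical database

# A secondary index (view) that emits the timestamp stored in each document.
requests.put(
    f"{DB}/_design/sorting",
    json={"views": {"by_added": {"map": "function (doc) { emit(doc.added_at, null); }"}}},
)

# Query the view, newest first. On older CouchDB releases, stale=ok answers
# from the existing index without waiting for it to be rebuilt.
resp = requests.get(
    f"{DB}/_design/sorting/_view/by_added",
    params={"descending": "true", "include_docs": "true", "stale": "ok"},
)
rows = resp.json()["rows"]
```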
And if it's true, which one is [...] preferred?
The capability to build a useful index and response payload is significantly greater with secondary indexes.
If you want to use the primary index, you have to "design" your IDs as you have described. You can imagine that this pre-determines a lot of what can otherwise be done with the docs and their IDs.
My recommendation would be to use secondary indexes (views). Only if you need the data in real time, or in high-concurrency scenarios, should you include the primary index in your search for the best way to request data.
I am developing an Azure-based website and I want to provide search capabilities using Lucene. (Structured JSON objects would be indexed and stored in Lucene, and other content such as Word documents would be indexed in Lucene but stored in blob storage.) I want the search to be secure, such that one user would never see a document belonging to another user. I want to allow ad-hoc searches as typed by the user. Lastly, I want to query programmatically to return predefined sets of data, such as "all notes for user X". I think I understand how to add properties to each document to achieve these 3 objectives. (I am listing them here so that anyone kind enough to answer will have a better idea of what I am trying to do.)
My questions revolve around performance and security.
Can I improve document security by having a separate index for each user, or is including the user's ID as a parameter in each search sufficient?
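(For concreteness, the "user's ID as a parameter" idea would look something like the sketch below; it is written against PyLucene just to show the shape, and the ownerId field name is an assumption.)

```python
import lucene
from org.apache.lucene.index import Term
from org.apache.lucene.search import BooleanClause, BooleanQuery, TermQuery

lucene.initVM()

def secure_query(user_query, owner_id):
    # Combine whatever the user typed with a mandatory clause on the
    # (hypothetical) ownerId field, so results never cross user boundaries.
    builder = BooleanQuery.Builder()
    builder.add(user_query, BooleanClause.Occur.MUST)
    builder.add(TermQuery(Term("ownerId", owner_id)), BooleanClause.Occur.MUST)
    return builder.build()
```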
Can I improve indexing speed and total throughput of the system by having a separate index for each user? My thinking is that having separate indexes would allow me to scale the system by having multiple index writers (perhaps even on different server instances) working at the same time, each on their own index.
Any insight would be greatly appreciated.
Regards,
Nate
Of course, one index.
You can do even better than what you suggested by using ManifoldCF (an Apache product that knows how to handle Solr) to manage security.
And one off-topic, uninformed suggestion: I'd rather use CloudBees or Heroku (or Amazon) instead of Azure.
Until you are using several machines for indexing, I think it's more convenient to use a single index. The Lucene community has done a lot of work to make the indexing process as efficient as it can be. So unless you intentionally want to implement distributed indexing, I don't recommend splitting indexes.
However, there are several reasons why you might want to split indexes:
If your machine has several IO devices which could be utilized in parallel. In this case, if you are IO-bound, splitting indexes is a good idea.
Splitting document fields between indexes (this is what ParallelReader is intended for). This is a more exotic form of splitting, but it may be a good idea if searches are performed using different groups of fields. Suppose we have two types of search query: the first uses the fields name and type, and the second uses the fields price and discount. If those fields are updated at different rates (presumably name changes far more rarely than price), updating only part of the index requires fewer IO resources. This gives the system more overall throughput.