During a post-processor evaluation, I need to refer to the underlying lines to get data that cannot be aggregated.
Say my cube schema is composed of fields from 3 stores, A, B and C.
It then looks something like this:
<datastoreSelection baseStore="storeA">
<fields>
<field name="fieldAmount1" expression="fieldAmount1" />
<field name="fieldAmount2" expression="fieldAmount2" />
<field name="fieldRefValue" expression="fieldRefValue" />
<field name="fieldForDimension1" expression="storeB/fieldForDimension1" />
<field name="fieldForDimension2" expression="storeC/fieldForDimension2" />
</fields>
</datastoreSelection>
My cube has 2 dimensions:
- dimension1, based on "fieldForDimension1"
- dimension2, based on "fieldForDimension2"
I have a post-processor that compares fieldRefValue to some external value 'externalValue' (let's assume this external value is a context value that users can change):
if (fieldRefValue <= externalValue) {
    return fieldAmount1;
} else {
    return fieldAmount2;
}
Since externalValue can vary on the fly, I cannot pre-process this if statement. I also cannot work at the aggregated level (or, more precisely, there is an interval where I have to work on detail lines).
Working with ActivePivot v5, it seems there is no such element as the indexer. Moreover, my limited knowledge of the datastore makes me think it may not fill my need, since my schema is composed of fields from multiple stores. In other words, I haven't found a way to query the datastore with a location.
For example, if a query on my PP comes with a location
[[myPP][value1][value2]]
I will struggle to retrieve detail lines using datastore queries, for I would have to:
- query storeB to get the ID of "value1"
- query storeC to get the ID of "value2"
- query storeA with the results from the 2 previous queries
- perform my "if" statement on those lines
Quite complicated, isn't it?
Meanwhile, the drillthrough functionality does a similar job efficiently.
So I ended up with the idea of using the drillthrough to get the retrieving job done.
However, I have never heard of such a use case. Hence my questions:
Does anyone have a similar use case, and how did you solve it? Are there any caveats?
From my first attempts, it seems that performing a drillthrough query during a post-processor execution can raise issues related to the context.
In your post-processor you can access the facts thanks to a datastore query.
Say you compute an amount and you want to express that amount converted into a given currency. Then what you have to do in your PP is:
- somehow get the coordinate of the location involved in the conversion (e.g. the currency), then use that coordinate to query a store and get the rate that allows you to convert your amount.
Long story short:
- if you have to deal with aggregated values, use a post-processor
- if you have to query the underlying facts, use a datastore query (similar to indexer queries in AP4)
- you can use a datastore query inside a post-processor
So it all depends on your use case.
Hope this helps; feel free to detail your use case if this is not helpful.
Regards,
I already have an answer:
const faunadb = require('faunadb')
const q = faunadb.query
exports.handler = async (event, context) => {
const client = new faunadb.Client({
secret: process.env.FAUNADB_SERVER_SECRET
})
try {
// Getting the refs with a first query
let refs = await client.query(q.Paginate(q.Match(q.Index('skus'))))
// Forging a second query with the retrieved refs
const bigQuery = refs.data.map((ref) => q.Get(ref))
// Sending over that second query
let allDocuments = await client.query(bigQuery)
// All my documents are here!
console.log('#allDocuments: ', allDocuments);
//...
} catch (err) {
// ...
}
}
But I find it unsatisfying because I'm making 2 queries for what seems like one of the most trivial DB calls. It seems inefficient and wordy to me.
As I'm just learning about FaunaDB, there's probably something I don't grasp here.
My question could be split into 3:
Can I query for all documents in a single call?
If not, why not? What's the logic behind such a design?
Could I make such a query without an index?
FaunaDB's FQL language is quite similar to JavaScript (which helps a lot if you want to do conditional transactions etc).
In essence, FaunaDB also has a Map. Given that your index contains only one value, which is the reference, you can write this:
q.Map(
q.Paginate(q.Match(q.Index('skus'))),
q.Lambda(x => q.Get(x))
)
For this specific case, you actually do not need an index since each collection has a built-in default index to do a select all via the 'Documents' function.
q.Map(
q.Paginate(q.Documents(q.Collection('<your collection>'))),
q.Lambda(x => q.Get(x))
)
Now, in case the index that you are using returns multiple values (because you would want to sort on something other than 'ref'), you need to provide the same number of parameters to the Lambda as the number of values that were defined in the index. Let's say my index has ts and ref in values because I want to sort on time; then the query to get all values becomes:
q.Map(
q.Paginate(q.Match(q.Index('<your index with ts and ref values>'))),
q.Lambda((ts, ref) => q.Get(ref))
)
Values are used for range queries and sorting, but they also define what the index returns.
Coming back to your questions:
- Can I query for all documents in a single call?
Absolutely, and I would advise you to do so. Note that the documents you get back are paginated automatically. You can set the page size by providing a parameter to Paginate, and you will get back an 'after' or 'before' attribute in case the result set is bigger than the page. That after or before cursor can be passed back to the Paginate function as a parameter to get the next or previous page: https://docs.fauna.com/fauna/current/api/fql/functions/paginate
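For example, here is a sketch of setting a page size and feeding the cursor back in (the size of 200 is arbitrary, and it reuses the 'skus' collection and the client from your snippet, inside a similar async handler):
// First page of 200 documents, via the built-in Documents index
const firstPage = await client.query(
  q.Map(
    q.Paginate(q.Documents(q.Collection('skus')), { size: 200 }),
    q.Lambda((ref) => q.Get(ref))
  )
)
// If there is more data, the result carries an 'after' cursor that can be
// passed straight back to Paginate to fetch the next page
if (firstPage.after) {
  const nextPage = await client.query(
    q.Map(
      q.Paginate(q.Documents(q.Collection('skus')), { size: 200, after: firstPage.after }),
      q.Lambda((ref) => q.Get(ref))
    )
  )
}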
- Could I make such a query without an index?
No, but you can use the built-in index as explained above. FaunaDB protects users from querying without an index. Since it is a scalable database that could contain massive data and is pay-as-you-go it's a good idea to prevent users from shooting themselves in the foot :). Pagination and mandatory Indexes help to do that.
As to why FQL is different: FQL is not declarative like many query languages. Instead it's procedural: you write exactly how you fetch data. That has advantages:
- By writing how data is retrieved, you can predict exactly how a query behaves, which is nice to have in a pay-as-you-go system.
- The same language can be used for security rules or complex conditional transactions (update certain entities, or many entities ranging over different collections, depending on certain conditions). It's quite common in Fauna to write a query that does many things in one transaction.
- Our flavour of 'stored procedures', called User Defined Functions, is just written in FQL and not another language.
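To give a flavour of that conditional-transaction style, here is a sketch (the 'products' and 'orders' collections, the 'stock' field, and the hard-coded ref are all made up): it decrements stock and creates an order in a single transaction, or aborts if there is no stock left. Run it with client.query(...) as usual.
q.Let(
  { product: q.Get(q.Ref(q.Collection('products'), '1')) },
  q.If(
    // Only proceed if at least one item is in stock
    q.GTE(q.Select(['data', 'stock'], q.Var('product')), 1),
    q.Do(
      // Decrement the stock
      q.Update(q.Select('ref', q.Var('product')), {
        data: { stock: q.Subtract(q.Select(['data', 'stock'], q.Var('product')), 1) }
      }),
      // Create the order referencing the product
      q.Create(q.Collection('orders'), {
        data: { product: q.Select('ref', q.Var('product')) }
      })
    ),
    // Roll everything back otherwise
    q.Abort('out of stock')
  )
)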
Querying is also discussed in this tutorial that comes with code in a GitHub repository which might give you a more complete picture: https://css-tricks.com/rethinking-twitter-as-a-serverless-app/
Can I query for all documents in a single call?
Yes, if your collection is small. The Paginate function defaults to fetching 64 documents per page. You can adjust the page size up to 100,000 documents. If your collection has more than 100,000 documents, then you have to execute multiple queries, using cursors to fetch subsequent documents.
See the Pagination tutorial for details: https://docs.fauna.com/fauna/current/tutorials/indexes/pagination
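As a sketch of walking a large collection page by page with that cursor (reusing the client and the 'skus' collection from the question; the page size of 1000 is arbitrary):
// Inside an async function, as in the original handler
let after = undefined
do {
  const page = await client.query(
    q.Map(
      q.Paginate(
        q.Documents(q.Collection('skus')),
        after ? { size: 1000, after } : { size: 1000 }
      ),
      q.Lambda('ref', q.Get(q.Var('ref')))
    )
  )
  // process page.data here
  after = page.after
} while (after)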
If not, why not? What's the logic behind such a design?
For an SQL database, SELECT * FROM table is both convenient and, potentially, a resource nightmare. If the table contains billions of rows, attempting to serve results for that query could consume the available resources on the server and/or the client.
Fauna is a shared database resource. We want queries to perform well for any user with any database, and that requires that we put sensible limits on the number of documents involved in any single transaction.
Could I make such a query without an index?
No, and yes.
Retrieving multiple results from Fauna requires an index, unless you are independently tracking the references for documents. However, with the Documents function, Fauna maintains an internal index so you don't need to create your own index to access all documents in a collection.
See the Documents reference page for details: https://docs.fauna.com/fauna/current/api/fql/functions/documents
Returning to your example code, you are executing two queries, but they could easily be combined into one. FQL is highly composable. For example:
let allDocuments = await client.query(
q.Map(
q.Paginate(q.Documents(q.Collection("skus"))),
q.Lambda("X", q.Get(q.Var("X")))
)
)
Your observation that FQL is wordy is correct. Many functional languages exhibit that wordiness. The advantage is that any functions that accept expressions can be composed at will. One of the best examples of composability, and of how to manage inter-document references, is presented in our E-commerce tutorial, specifically the section describing the submit_order function: https://docs.fauna.com/fauna/current/tutorials/ecommerce#function
How can I use indexes in aggregate?
I saw this in the documentation (https://docs.mongodb.com/manual/core/aggregation-pipeline/#pipeline-operators-and-indexes):
The $match and $sort pipeline operators can take advantage of an index when they occur at the beginning of the pipeline.
Is there any way to use an index when the operator is not at the beginning of the pipeline, for example with $sort, $match, or $group?
Please help me.
An index works by keeping a record of certain pieces of data that point to a given record in your collection. Think of it like having a novel, and then having a sheet of paper that lists the names of various people or locations in that novel with the page numbers where they're mentioned.
Aggregation is like taking that novel and transforming the different pages into an entirely different stream of information. You don't know where the new information is located until the transformation actually happens, so you can't possibly have an index on that transformed information.
In other words, it's impossible to use an index in any aggregation pipeline stage that is not at the very beginning because that data will have been transformed and MongoDB has no way of knowing if it's even possible to efficiently make use of the newly transformed data.
If your aggregation pipeline is too large to handle efficiently, then you need to limit the size of your pipeline in some way such that you can handle it more efficiently. Ideally this would mean having a $match stage that sufficiently limits the documents to a reasonably-sized subset. This isn't always possible, however, so additional effort may be required.
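As a rough sketch (the collection, index, and field names here are invented), the index-friendly shape is a leading $match that hits an index before any documents are transformed:
// An index the leading $match can use
db.transactions.createIndex({ status: 1, createdAt: 1 })

db.transactions.aggregate([
  // First stage runs against the raw collection, so the index above applies
  { $match: { status: "completed", createdAt: { $gte: ISODate("2024-01-01") } } },
  // From here on the documents are transformed, so no index can help
  { $group: { _id: "$type", total: { $sum: "$amount" }, count: { $sum: 1 } } },
  { $sort: { total: -1 } }
])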
One possibility is generating "summary" documents that are the result of aggregating all new data together, then performing your primary aggregation pipeline using only these summary documents. For example, if you have a log of transactions in your system that you wish to aggregate, then you could generate a daily summary of the quantities and types of the different transactions that have been logged for the day, along with any other additional data you would need. You would then limit your aggregation pipeline to only these daily summary documents and avoid using the normal transaction documents.
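One possible shape for that daily roll-up, as a sketch (collection and field names are invented, and the $merge stage needs MongoDB 4.2 or newer):
// Run once per day, e.g. for 2024-06-01
db.transactions.aggregate([
  { $match: { createdAt: { $gte: ISODate("2024-06-01"), $lt: ISODate("2024-06-02") } } },
  { $group: {
      _id: { day: "2024-06-01", type: "$type" },
      total: { $sum: "$amount" },
      count: { $sum: 1 }
  } },
  // Upsert the summaries so re-running the job for a day is safe
  { $merge: { into: "dailySummaries", whenMatched: "replace", whenNotMatched: "insert" } }
])
Later aggregations then read dailySummaries instead of the raw transaction documents.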
An actual solution is beyond the scope of this question, however. Just be aware that the index usage is a limitation that you cannot avoid.
I am using Azure DocumentDB and all my experience in NoSql has been in MongoDb. I looked at the pricing model and the cost is per collection. In MongoDb I would have created 3 collections for what I was using: Users, Firms, and Emails. I noted that this approach would cost $24 per collection per month.
I was told by the people I work with that I'm doing it wrong. I should have all three of those things stored in a single collection with a field to describe what the data type is, and each collection should be organized by date or geographic area so one part of the world has a smaller portion to search.
They also told me to:
"Combine different types of documents into a single collection and add a field across all to separate them in searching, like a type field or something"
I would never have dreamed of doing that in Mongo, as it would make indexing, shard keys, and other things hard to get right.
There might not be many fields that overlap between the objects (for example, the Email and Firm objects).
I can do it this way, but I can't seem to find a single example of anyone else doing it that way, which indicates to me that maybe it isn't right. Now, I don't need an example, but can someone point me to something that describes the 'right' way to do it? Or, if you do create a single collection for all data, what are the advantages/disadvantages in doing that, other than Azure's pricing model?
Any good articles on DocumentDb schema design?
Yes. In order to leverage CosmosDb to its full potential, you need to think of a collection as an entire database system and not as a "table" designed to hold only one type of object.
Sharding in Cosmos is exceedingly simple. You just specify a field that all of your documents will populate and select that as your partition key. If you pick a generic name such as key or partitionKey, you can easily separate the storage of your inbound emails from your users and from anything else by picking appropriate values.
class InboundEmail
{
public string Key {get; set;} = "EmailsPartition";
// other properties
}
class User
{
public string Key {get; set;} = "UsersPartition";
// other properties
}
What I'm showing is still only an example, though. In reality your partition key values should be even more dynamic. It's important to understand that queries against a known partition are extremely quick. As soon as you need to scan across multiple partitions you'll see much slower and more costly results.
So, in an app that ingests a lot of user data, keeping a single user's activity together in one partition might make sense for that particular entity.
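As a sketch of that idea using the JavaScript SDK (@azure/cosmos; the database, container, and field names are made up, and the same thing can be done from the .NET SDK):
const { CosmosClient } = require("@azure/cosmos");

async function example() {
  const client = new CosmosClient({ endpoint: process.env.COSMOS_ENDPOINT, key: process.env.COSMOS_KEY });
  const { database } = await client.databases.createIfNotExists({ id: "appdb" });
  const { container } = await database.containers.createIfNotExists({
    id: "documents",
    partitionKey: { paths: ["/key"] }  // the generic partition key field
  });

  // Heterogeneous documents, separated by a 'type' field and grouped per user
  await container.items.create({ id: "email-1", key: "user-42", type: "inboundEmail", subject: "Hello" });
  await container.items.create({ id: "user-42", key: "user-42", type: "user", name: "Ada" });

  // Single-partition query: stays inside "user-42", so it is fast and cheap
  const { resources } = await container.items
    .query(
      { query: "SELECT * FROM c WHERE c.type = @type", parameters: [{ name: "@type", value: "inboundEmail" }] },
      { partitionKey: "user-42" }
    )
    .fetchAll();
  return resources;
}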
If you want evidence that this is the appropriate way to use CosmosDb, consider the addition of the new Gremlin Graph APIs. Graphs are inherently heterogeneous, as they contain many different entities and entity types as well as the relationships between them. The query boundary of Cosmos is at the collection level, so if you tried putting your entities all in different collections, none of the Graph APIs or queries would work.
EDIT:
I noticed in the comments you made this statement: "And you would have an index on every field in both objects." CosmosDb does automatically index every field of every document. It uses a special proprietary path-based indexing mechanism that ensures every path of your JSON tree has indices on it. You have to specifically opt out of this auto-indexing feature.
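For completeness, here is a sketch of opting a path out of that automatic indexing via the container's indexing policy (reusing the database handle from the sketch above; the container name and the '/largeBlob' path are hypothetical):
const { container } = await database.containers.createIfNotExists({
  id: "documents",
  partitionKey: { paths: ["/key"] },
  indexingPolicy: {
    indexingMode: "consistent",
    automatic: true,                           // auto indexing stays on...
    includedPaths: [{ path: "/*" }],           // ...for every path by default
    excludedPaths: [{ path: "/largeBlob/*" }]  // ...except the paths you opt out
  }
});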
I am looking to write dynamic queries for an ArangoDB graph database and am wondering if there are best practices or standard approaches to doing it.
By 'dynamic queries' I mean that users would have the ability to build a query that is then executed on the dataset.
Methods by which ArangoDB could support this include:
- Dynamically generating AQL queries by manually injecting bind variables
- Writing Foxx functions to deliver on supported queries, and having another Foxx function bind those together to build a response
- Writing a workflow which extracts data into a temporary collection and then invokes Foxx functions to filter/sort the data to the desired outcome
The queries would be very open-ended, where someone might (for example):
- Query all countries with population over 10,000,000
- Sort countries by land area in square kilometers
- Pick the top 10 countries in land coverage
- Select the primary language spoken in each country
- Count occurrences of each language
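For what it's worth, that example could be expressed as a single AQL query; here is a sketch using arangojs, where the collection and attribute names (countries, population, areaKm2, primaryLanguage) are assumptions:
const { Database, aql } = require("arangojs");
const db = new Database({ url: "http://localhost:8529", databaseName: "world" });

async function topLanguages(minPopulation) {
  // The aql template tag turns ${...} values into bind variables automatically
  const cursor = await db.query(aql`
    FOR c IN countries
      FILTER c.population > ${minPopulation}
      SORT c.areaKm2 DESC
      LIMIT 10
      COLLECT language = c.primaryLanguage WITH COUNT INTO occurrences
      RETURN { language: language, occurrences: occurrences }
  `);
  return cursor.all();
}

// e.g. topLanguages(10000000)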
That query alone is straightforward to execute, but if a user was able to [x] check or select from a range of supported query options, order them in their own defined way, and receive the output, it's a little more involved.
Are there some supported or recommended approaches to doing this?
My current approach would be to write blocks of AQL that deliver on each part, probably in a LET Q1 = (....), LET Q2 = (...) format, and then finally, at the bottom of the query, have a generic way of processing the sub-queries to generate a response.
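For illustration only, that LET-block composition might look roughly like this (same assumed collection and attribute names as in the sketch above, with each block toggled on or off by the query builder; run inside an async function):
const minPopulation = 10000000;
const query = aql`
  LET filtered = (
    FOR c IN countries
      FILTER c.population > ${minPopulation}
      RETURN c
  )
  LET top = (
    FOR c IN filtered
      SORT c.areaKm2 DESC
      LIMIT 10
      RETURN c
  )
  FOR c IN top
    COLLECT language = c.primaryLanguage WITH COUNT INTO occurrences
    RETURN { language: language, occurrences: occurrences }
`;
const result = await (await db.query(query)).all();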
But I have a feeling that smart use of Foxx functions could help here as well, having Foxx-Query-Q1 and Foxx-Query-Q2 coded to support each query type, and then an aggregation Foxx app that invokes the right queries in the right order to build the right response.
If anyone has seen best ways of doing this, it would be great to get some hints/advice.
Thanks!
We have a decent sized object-oriented application. Whenever an object in the app is changed, the object changes are saved back to the DB. However, this has become less than ideal.
Currently, changes are stored as a transaction and a set of transactionLIs.
The transaction table has fields for who, what, when, why, foreignKey, and foreignTable. The first four are self-explanatory. ForeignKey and foreignTable are used to determine which object changed.
TransactionLI has timestamp, key, val, oldVal, and a transactionID. This is basically a key/value/oldValue storage system.
The problem is that these two tables are used for every object in the application, so they're pretty big tables now. Using them for anything is slow. Indexes only help so much.
So we're thinking about other ways to do something like this. Things we've considered so far:
- Sharding these tables by something like the timestamp.
- Denormalizing the two tables and merge them into one.
- A combination of the two above.
- Doing something along the lines of serializing each object after a change and storing it in subversion.
- Probably something else, but I can't think of it right now.
The whole problem is that we'd like to have some mechanism for properly storing and searching through transactional data. Yes, you can force-feed that into a relational database, but really, it's transactional data and should be stored accordingly.
What is everyone else doing?
We have taken the following approach:
All objects are serialised (using the standard XmlSerializer), but we have decorated our classes with serialisation attributes so that the resultant XML is much smaller (storing elements as attributes and dropping vowels on field names, for example). This could be taken a stage further by compressing the XML if necessary.
The object repository is accessed via a SQL view. The view fronts a number of tables that are identical in structure, but with the table name appended with a GUID. A new table is generated when the previous table has reached critical mass (a pre-determined number of rows).
We run a nightly archiving routine that generates the new tables and modifies the views accordingly so that calling applications do not see any differences.
Finally, as part of the overnight routine we archive any old object instances that are no longer required to disk (and then tape).
I've never found a great end-all solution for this type of problem. One thing you can try, if your DB supports partitioning (or even if it doesn't, you can implement the same concept yourself), is to partition this log table by object type; you can then further partition by date/time or by your object ID (if your ID is numeric this works nicely; I'm not sure how a GUID would partition).
This will help maintain the size of the table and keep all the transactions related to a single instance of an object together.
One idea you could explore, instead of storing each field in a name/value pair table, is to store the data as a blob (either text or binary). For example, serialize the object to XML and store it in a field.
The downside of this is that as your object changes, you have to consider how this affects all historical data. If you're using XML there are easy ways to update the historical XML structures; if you're using binary there are ways too, but you have to be more conscious of the effort.
I've had awesome success storing a rather complex object model that has tons of interrelations as a blob (the XML serializer in .NET didn't handle the relationships between the objects). I could very easily see myself storing the binary data. A huge downside of storing it as binary data is that you have to take it out of the database to access it; with XML, if you're using a modern database like MSSQL, you can access the data in place.
One last approach is to split the two patterns: you could define a difference schema (and I assume more than one property changes at a time), so for example imagine storing this XML:
<objectDiff>
<field name="firstName" newValue="Josh" oldValue="joshua"/>
<field name="lastName" newValue="Box" oldValue="boxer"/>
</objectDiff>
This will help alleviate the number of rows, and if you're using MSSQL you can define an XML schema and get some of the rich querying ability around the object. You can still partition the table.
Josh
Depending on the characteristics of your specific application, an alternative approach is to keep revisions of the entities themselves in their respective tables, together with the who, what, why, and when per revision. The who, what, and when can still be foreign keys.
I would be very careful with this approach, though, since it is only viable for applications with a relatively small number of changes per entity/entity type.
If querying the data is important, I would use true partitioning in SQL Server 2005 and above, if you have the Enterprise edition of SQL Server. We have millions of rows partitioned by year, down to day for the current month; you can be as granular as your application demands, with a maximum of 1,000 partitions.
Alternatively, if you are using SQL 2008, you could look into filtered indexes.
These are solutions that will enable you to retain the simplified structure you have whilst providing the performance you need to query that data.
Splitting/Archiving older changes obviously should be considered.