Index multiple MongoDB fields, make only one unique - node.js

I've got a MongoDB database of metadata for about 300,000 photos. Each has a native unique ID that needs to be unique to protect against duplication insertions. It also has a time stamp.
I frequently need to run aggregate queries to see how many photos I have for each day, so I also have a date field in the format YYYY-MM-DD. This is obviously not unique.
Right now I only have an index on the id property, like so (using the Node driver):
collection.ensureIndex(
{ id:1 },
{ unique:true, dropDups: true },
function(err, indexName) { /* etc etc */ }
);
The group query for getting the photos by date takes quite a long time, as one can imagine:
collection.group(
{ date: 1 },
{},
{ count: 0 },
function ( curr, result ) {
result.count++;
},
function(err, grouped) { /* etc etc */ }
);
I've read through the indexing strategy, and I think I need to also index the date property. But I don't want to make it unique, of course (though I suppose it's fine to make it unique in combine with the unique id). Should I do a regular compound index, or can I chain the .ensureIndex() function and only specify uniqueness for the id field?

MongoDB does not have "mixed" type indexes which can be partially unique. On the other hand why don't you use _id instead of your id field if possible. It's already indexed and unique by definition so it will prevent you from inserting duplicates.
Mongo can only use a single index in a query clause - important to consider when creating indexes. For this particular query and requirements I would suggest to have a separate unique index on id field which you would get if you use _id. Additionally, you can create a non-unique index on date field only. If you run query like this:
db.collection.find({"date": "01/02/2013"}).count();
Mongo will be able to use index only to answer the query (covered index query) which is the best performance you can get.
Note that Mongo won't be able to use compound index on (id, date) if you are searching by date only. You query has to match index prefix first, i.e. if you search by id then (id, date) index can be used.

Another option is to pre aggregate in the schema itself. Whenever you insert a photo you can increment this counter. This way you don't need to run any aggregation jobs. You can also run some tests to determine if this approach is more performant than aggregation.

Related

TypeORM count grouping with different left joins each time

I am using NestJS with TypeORM and PostgreSQL. I have a queryBuilder which joins other tables based on the provided array of relations.
const query = this.createQueryBuilder('user');
if (relations.includes('relation1') {
query.leftJoinAndSelect('user.relation1', 'r1');
}
if (relations.includes('relation2') {
query.leftJoinAndSelect('user.relation2', 'r2');
}
if (relations.includes('relation3') {
query.leftJoinAndSelect('user.relation3', 'r3');
}
// 6 more relations
Following that I select a count on another table.
query
.leftJoin('user.relation4', 'r4')
.addSelect('COUNT(case when r4.value > 10 then r4.id end', 'user_moreThan')
.addSelect('COUNT(case when r4.value < 10 then r4.id end', 'user_lessThan')
.groupBy('user.id, r1.id, r2.id, r3.id ...')
And lastly I use one of the counts (depending on the request) for ordering the result with orderBy.
Now, of course, based on the relations parameter, the requirements for the groupBy query change. If I join all tables, TypeORM expects all of them to be present in groupBy.
I initially had the count query separated, but that was before I wanted to use the result for ordering.
Right now I planned to just dynamically create the groupBy string, but this approach somehow feels wrong and I am wondering if it is in fact the way to go or if there is a better approach to achieving what I want.
You can add group by clause conditionally -
if (relations.includes('relation1') {
query.addGroupBy('r1.id');
}

Unable to execute a timeseries query using a timeuuid as the primary key

My goal is to do a sum of the messages_sent and emails_sent per each DISTINCT provider_id value for a given time range (fromDate < stats_date_id < toDate), but without specifying a provider_id. In other words, I need to know about any and all Providers within the specified time range and to sum their messages_sent and emails_sent.
I have a Cassandra table using an express-cassandra schema (in Node.js) as follows:
module.exports = {
fields: {
stats_provider_id: {
type: 'uuid',
default: {
'$db_function': 'uuid()'
}
},
stats_date_id: {
type: 'timeuuid',
default: {
'$db_function': 'now()'
}
},
provider_id: 'uuid',
provider_name: 'text',
messages_sent: 'int',
emails_sent: 'int'
},
key: [
[
'stats_date_id'
],
'created_at'
],
table_name: 'stats_provider',
options: {
timestamps: {
createdAt: 'created_at', // defaults to createdAt
updatedAt: 'updated_at' // defaults to updatedAt
}
}
}
To get it working, I was hoping it'd be as simple as doing the following:
let query = {
stats_date_id: {
'$gt': db.models.minTimeuuid(fromDate),
'$lt': db.models.maxTimeuuid(toDate)
}
};
let selectQueries = [
'provider_name',
'provider_id',
'count(direct_sent) as direct_sent',
'count(messages_sent) as messages_sent',
'count(emails_sent) as emails_sent',
];
// Query stats_provider table
let providerData = await db.models.instance.StatsProvider.findAsync(query, {select: selectQueries});
This, however, complains about needing to filter the results:
Error during find query on DB -> ResponseError: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance.
I'm guessing you can't have a primary key and do date range searches on it? If so, what is the correct approach to this sort of query?
So while not having used Express-Cassandra, I can tell you that running a range query on your partition key is a hard "no." The reason for this, is that Cassandra can't determine a single node for that query, so it has to poll every node. As that's essentially a full scan of your table across multiple nodes, it throws that error to prevent you from running a bad query.
However, you can run a range query on a clustering key, provided that you are filtering on all of the keys prior to it. In your case, if I'm reading this right, your PRIMARY KEY looks like:
PRIMARY KEY (stats_date_id, created_at)
That primary key definition is going to be problematic for two reasons:
stats_date_id is a TimeUUID. This is great for data distribution. But it sucks for query flexibility. In fact, you will need to provide that exact TimeUUID value to return data for a specific partition. As a TimeUUID has millisecond precision, you'll need to know the exact time to query down to the millisecond. Maybe you have the ability to do that, but usually that doesn't make for a good partition key.
Any rows underneath that partition (created_at) will have to share that exact time, which usually leads to a lot of 1:1 cardinality ratios for partition:clustering keys.
My advice on fixing this, is to partition on a date column that has a slightly lower level of cardinality. Think about how many provider messages are usually saved within a certain timeframe. Also pick something that won't store too many provider messages together, as you don't want unbound partition growth (Cassandra has a hard limit of 2 billion cells per partition).
Maybe something like: PRIMARY KEY (week,created_at)
So then your CQL queries could look something like:
SELECT * FROM stats_provider
WHERE week='201909w1'
AND created_at > '20190901'
AND created_at < '20190905';
TL;DR;
Partition on a time bucket not quite as precise as something down to the ms, yet large enough to satisfy your usual query.
Apply the range filter on the first clustering key, within a partition.

Azure CosmosDB: how to ORDER BY id?

Using a vanilla CosmosDB collection (all default), adding documents like this:
{
"id": "3",
"name": "Hannah"
}
I would like to retrieve records ordered by id, like this:
SELECT c.id FROM c
ORDER BY c.id
This give me the error Order-by item requires a range index to be defined on the corresponding index path.
I expect this is because /id is hash indexed and not range indexed. I've tried to change the Indexing Policy in various ways, but any change I make which would touch / or /id gets wiped when I save.
How can I retrieve documents ordered by ID?
The best way to do this is to store a duplicate property e.g. id2 that has the same value of id, and is indexed using a range index, then use that for sorting, i.e. query for SELECT * FROM c ORDER BY c.id2.
PS: The reason this is not supported is because id is part of a composite index (which is on partition key and row key; id is the row key part) The Cosmos DB team is working on a change that will allow sorting by id.
EDIT: new collections now support ORDER BY c.id as of 7/12/19
I found this page CosmosDB Indexing Policies , which has the below Note that may be helpful:
Azure Cosmos DB returns an error when a query uses ORDER BY but
doesn't have a Range index against the queried path with the maximum
precision.
Some other information from elsewhere in the document:
Range supports efficient equality queries, range queries (using >, <,
>=, <=, !=), and ORDER BY queries. ORDER By queries by default also require maximum index precision (-1). The data type can be String or
Number.
Some guidance on types of queries assisted by Range queries:
Range Range over /prop/? (or /) can be used to serve the following
queries efficiently:
SELECT FROM collection c WHERE c.prop = "value"
SELECT FROM collection c WHERE c.prop > 5
SELECT FROM collection c ORDER BY c.prop
And a code example from the docs also:
var rangeDefault = new DocumentCollection { Id = "rangeCollection" };
// Override the default policy for strings to Range indexing and "max" (-1) precision
rangeDefault.IndexingPolicy = new IndexingPolicy(new RangeIndex(DataType.String) { Precision = -1 });
await client.CreateDocumentCollectionAsync(UriFactory.CreateDatabaseUri("db"), rangeDefault);
Hope this helps,
J

IndexedDB getAll() ordering

I'm using getAll() method to get all items from db.
db.transaction('history', 'readonly').objectStore('history').getAll().onsuccess = ...
My ObjectStore is defined as:
db.createObjectStore('history', { keyPath: 'id', autoIncrement: true });
Can I count on the ordering of the items I get? Will they always be sorted by primary key id?
(or is there a way to specify sort explicitly?)
I could not find any info about ordering in official docs
If the docs don't help, consult the specs:
getAll refers to "steps for retrieving multiple referenced values"
the retrieval steps refer to "first count records in index"
the specification of index contains the following paragraph:
The records in an index are always sorted according to the record's
key. However unlike object stores, a given index can contain multiple
records with the same key. Such records are additionally sorted
according to the index's record's value (meaning the key of the record
in the referenced object store).
Reading backwards: An index is sorted. getAll retrieves the first N of an index, i.e. it is order-preserving. Therefore the result itself should retain the sort order.

not in query and select one field from second collection

My requirement is to count all the data whose particular id is not in reference collection. The equivalent SQL query would go as below:
select count(*) from tbl1 where tbl.arr.id not in (select id from tbl2)
I've tried as below, but got stuck up on fetching single field i.e. id from 2nd query.
db.coll1.find(
{$not:
{"arr.id":
{$in:
{db.coll2.find()}//how would I fetch a single column from
//2nd coll2
}
}
}
).count()
Also, Please note that arr.id is an ObjectId stored in collection coll1 and same will go with collection coll2. Should special care be taken while fetching the id like say ObjectId(id)?
Update - I am using mongo db version 3.0.9
I had to use $nin to check for not in condition and get the array in a different format as the version of mongodb was 3.0.9. Below is how I did it.
db.coll1.find({"arr.id":{$nin:[db.coll2.find({},["id"])]}}).count()
For mongodb v>=3.2 it would be as below
db.coll1.find({"arr.id":{$nin:[db.coll2.find({},"id")]}}).count()

Resources