I've got a mid-size Elasticsearch index (1.46 TB, ~1e8 docs). It's running on 4 servers, each with 64 GB of RAM split evenly between Elasticsearch and the OS (for caching).
I want to try out the new "Significant terms" aggregation, so I fired off the following query:
{
  "query": {
    "ids": {
      "type": "document",
      "values": [
        "xCN4T1ABZRSj6lsB3p2IMTffv9-4ztzn1R11P_NwTTc"
      ]
    }
  },
  "aggregations": {
    "Keywords": {
      "significant_terms": {
        "field": "Body"
      }
    }
  },
  "size": 0
}
This should compare the body of the specified document with the rest of the index and find terms that are significant to that document but not common across the index.
Unfortunately, this invariably results in an
ElasticsearchException[org.elasticsearch.common.breaker.CircuitBreakingException: Data too large, data would be larger than limit of [25741911654] bytes];
nested: UncheckedExecutionException[org.elasticsearch.common.breaker.CircuitBreakingException: Data too large, data would be larger than limit of [25741911654] bytes];
nested: CircuitBreakingException[Data too large, data would be larger than limit of [25741911654] bytes];
after a minute or two, which seems to imply I haven't got enough memory.
The Elasticsearch servers in question are actually VMs, so I shut down the other VMs and gave each Elasticsearch instance 96 GB, and each OS another 96 GB.
The same problem occurred (with different numbers, and it took longer). I don't have hardware to hand with more than 192 GB of memory available, so I can't go higher.
Are aggregations not meant to be used against the index as a whole? Am I making a mistake with regard to the query format?
There is a warning in the documentation for this aggregation about RAM use on free-text fields for very large indices [1]. On large indices it works OK for lower-cardinality fields with a smaller vocabulary (e.g. hashtags), but the combination of many free-text terms and many docs is a memory hog. You could look at specifying a filter on the loading of the FieldData cache [2] for the Body field to trim the long tail of low-frequency terms (e.g. doc frequency < 2), which would reduce RAM overheads; see the sketch below.
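As a sketch (assuming Elasticsearch 1.x mapping syntax; the cutoff values are illustrative, not tuned), such a FieldData frequency filter on the Body field could look like this:

# A hedged sketch (Elasticsearch 1.x mapping syntax): only load terms into
# FieldData whose doc frequency is at least 2, trimming the long tail of
# rare terms. The cutoff values are illustrative.
body_mapping = {
    "properties": {
        "Body": {
            "type": "string",
            "fielddata": {
                "filter": {
                    "frequency": {
                        "min": 2,                 # absolute doc-frequency cutoff
                        "min_segment_size": 500,  # skip very small segments
                    }
                }
            }
        }
    }
}
# Applied with the Put Mapping API, e.g. PUT /<your-index>/document/_mapping
# (the index name is yours; "document" is the type from your query).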
I have used a variation of this algorithm before where only a sample of the top-matching docs was analysed for significant terms, and that approach requires less RAM because only the top N docs are read from disk and tokenised (using TermVectors or an Analyzer). However, for now the implementation in Elasticsearch relies on the FieldData cache and looks up terms for ALL matching docs.
One more thing - when you say you want to "compare the body of the document specified", note that the usual mode of operation is to compare a set of documents against the background, not just one. All analysis is based on doc-frequency counts, so with a sample set of just one doc all terms will have a foreground frequency of 1, meaning you have less evidence to reinforce any analysis.
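For example (a sketch; the ids below are placeholders), a foreground set of several documents instead of one gives the aggregation doc frequencies above 1 to work with:

# Sketch: the same request as above, but with a multi-document foreground
# set (placeholder ids) so terms can have a foreground doc frequency > 1.
request = {
    "query": {
        "ids": {
            "type": "document",
            "values": ["doc-id-1", "doc-id-2", "doc-id-3"],  # placeholders
        }
    },
    "aggregations": {
        "Keywords": {"significant_terms": {"field": "Body"}},
    },
    "size": 0,
}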
Selecting a partition key is a simple but important design choice in Azure Cosmos DB in terms of performance and cost (RUs). Azure Cosmos DB does not allow changing the partition key of a container, so it is very important to select the right one.
I have gone through the Microsoft documentation (link), but I am still unsure how to choose a partition key.
Below is the item structure I am planning to create:
{
  "id": "unique id like UUID",  # just to keep some unique ID for the item
  "file_location": "/videos/news/finance/category/sharemarket/it-sectors/semiconductors/nvidia.mp4",  # this value sometimes contains special symbols like spaces, dollars, caps and many more
  "createatedby": "andrew",
  "ts": "2022-01-10 16:07:25.773000",
  "directory_location": "/videos/news/finance/category/sharemarket/it-sectors/semiconductors/",
  "metadata": [
    {
      "codec": "apple",
      "date_created": "2020-07-23 05:42:37",
      "date_modified": "2020-07-23 05:42:37",
      "format": "mp4",
      "internet_media_type": "video/mp4",
      "size": "1286011"
    }
  ],
  "version_id": "48ad8200-7231-11ec-abda-34519746721"
}
I am using the Azure Cosmos DB SQL API. By default, Azure Cosmos DB takes care of indexing all data; in the case above, all properties are indexed.
For reading items I use the file_location property. Can I make file_location the partition key, or is there anything else I should consider?
A few notes:
file_location values contain special characters like spaces, commas, dollar signs and many more.
Some containers hold 150 million entries and some just 20 million.
My operations are mostly reads, with frequent writes as new videos are added, and fewer updates in case videos change.
A few things to keep in mind while selecting partition keys:
Observe the query parameters while reading data; they give you good hints as to what the partition key candidates are.
You mentioned that some containers contain 150 million documents and some contain 20 million. Rather than the number of documents stored in a container, what matters is which containers receive the higher number of requests. If a few containers are getting too many requests, that is a good indicator of poorly designed partition keys.
Try to distribute the request load as evenly as possible among containers so that it gets distributed evenly among the physical partitions. Otherwise you will get hot-partition issues and will work around them by increasing throughput, which will cost you more money.
Try to limit cross-partition queries as much as possible (see the sketch below).
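As an illustration of that last point, if file_location were chosen as the partition key, reads by file_location stay single-partition. A minimal sketch with the Python SDK (account, database and container names are made up):

from azure.cosmos import CosmosClient

# Hypothetical account, database and container names, for illustration only.
client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("media").get_container_client("videos")

# With /file_location as the partition key, this query is served by a
# single logical partition rather than fanning out across all of them.
loc = "/videos/news/finance/category/sharemarket/it-sectors/semiconductors/nvidia.mp4"
items = list(container.query_items(
    query="SELECT * FROM c WHERE c.file_location = @loc",
    parameters=[{"name": "@loc", "value": loc}],
    partition_key=loc,
))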
I can't find any information on whether "item size" refers to the original document size, or to the result size of the query after projection.
I can observe that simple queries like these
documents.find({ /*...*/ }, { name: 1 })
consume more than 1000 RUs for a result of 400 items (the query fields are indexed). The original documents are pretty large, about 500 KB each. The actually received data is tiny thanks to the projection. If I remove the projection, the query runs for several seconds but doesn't consume significantly more RUs (it's actually slightly more, but that seems to be because it's split into more GetMore calls).
It seems really strange to me that the cost of a query depends mainly on the size of the original documents in the collection, not on the data retrieved. Is that really true? Can I reduce the cost of this query without splitting the data into multiple collections? The logic is basically: "just get the name of all these big documents in the collection".
(No partitioning on the db...)
Microsoft unfortunately doesn't seem to publish their formula for determining RU costs, just broad descriptions. They do say about RU considerations:
"As the size of an item increases, the number of RUs consumed to read or write the item also increases."
So it is the case that cost depends on the raw size of the item, not just the portion of it output from a read operation. If you use the Data Explorer to run some queries and inspect the Query Stats, you'll see two metrics, Retrieved Document Size and Output Document Size. By projecting a subset of properties, you reduce the output size, but not the retrieved size. In tests on my data, I see a very small decrease in RU charge by selecting the return properties -- definitely not a savings in proportion to the reduced output.
Fundamentally, keeping items small is probably the most important thing to work towards, both in terms of the property data size and the number of properties. You definitely don't want 500 KB items if you can avoid it.
We’re using CosmosDB in production to store HTTP request/response audit data. The structure of this data generally looks as follows:
{
  "id": "5ff4c51d3a7a47c0b5697520ae024769",
  "Timestamp": "2019-06-27T10:08:03.2123924+00:00",
  "Source": "Microservice",
  "Origin": "Client",
  "User": "SOME-USER",
  "Uri": "GET /some/url",
  "NormalizedUri": "GET /SOME/URL",
  "UserAgent": "okhttp/3.10.0",
  "Client": "0.XX.0-ssffgg;8.1.0;samsung;SM-G390F",
  "ClientAppVersion": "XX-ssffgg",
  "ClientAndroidVersion": "8.1.0",
  "ClientManufacturer": "samsung",
  "ClientModel": "SM-G390F",
  "ResponseCode": "OK",
  "TrackingId": "739f22d01987470591556468213651e9",
  "Response": "[ REDACTED ]",  <— usually quite long (thousands of chars)
  "PartitionKey": 45,
  "InstanceVersion": 1,
  "_rid": "TIFzALOuulIEAAAAAACACA==",
  "_self": "dbs/TIFzAA==/colls/TIFzALOuulI=/docs/TIFzALOuulIEAAAAAACACA==/",
  "_etag": "\"0d00c779-0000-0d00-0000-5d1495830000\"",
  "_attachments": "attachments/",
  "_ts": 1561630083
}
We're currently writing around 150,000 - 200,000 documents similar to the above per day, with /PartitionKey as the partition key path configured on the container. The value of PartitionKey is a randomly generated number in C#/.NET between 0 and 999.
However, we are seeing daily hotspots where a single physical partition can hit a maximum of 2.5K - 4.5K RU/s while others sit very low (around 200 RU/s). This has knock-on cost implications, as we need to provision throughput for our most heavily utilised partition.
The second factor is that we're storing a fair bit of data, close to 1 TB of documents, and we add a few GB each day. As a result we currently have around 40 physical partitions.
Combining these two factors means we end up having to provision somewhere between 120,000 and 184,000 RU/s at a minimum.
I should mention that we barely ever need to query this data, apart from the very occasional ad-hoc, manually constructed query in the Cosmos Data Explorer.
My question is: would we be a lot better off, in terms of the RU/s required and the distribution of data, by simply using the "id" column as our partition key (or a randomly generated GUID), and then setting a sensible TTL so we don't have a continually growing dataset?
I understand this would require us to re-create the collection.
Thanks very much.
While using the id or a GUID would give you better cardinality than the random number you have today, any query you run would be very expensive as it would always be cross-partition and over a huge amount of data.
I think a better choice would be to use a synthetic key that combines multiple properties that have high cardinality and are also used to query the data. You can learn more about these here: https://learn.microsoft.com/en-us/azure/cosmos-db/synthetic-partition-keys
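For example (a sketch; the choice of properties is illustrative, not prescriptive), combining the User with the day portion of the Timestamp gives a key with far higher cardinality than a random 0-999 number, while still supporting targeted queries:

# Hedged sketch of a synthetic partition key built from two properties of
# the audit document; the specific combination is illustrative.
def synthetic_partition_key(doc: dict) -> str:
    day = doc["Timestamp"][:10]       # e.g. "2019-06-27"
    return f"{doc['User']}-{day}"     # e.g. "SOME-USER-2019-06-27"

doc = {
    "User": "SOME-USER",
    "Timestamp": "2019-06-27T10:08:03.2123924+00:00",
}
doc["PartitionKey"] = synthetic_partition_key(doc)  # stored with the item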
As far as TTL goes, I would definitely set it to whatever retention you need for this data. Cosmos deletes expired data using unused throughput, so it will never get in the way.
Lastly, you should also consider (if you haven't already) using a custom indexing policy and excluding any paths which are never queried, especially the "Response" property, since you say it is thousands of characters long. This can save considerable RU/s in write-heavy scenarios like yours.
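A sketch of such a policy (the paths follow the document shown above; which paths to exclude depends on your actual queries):

# Hedged sketch of a custom indexing policy: stop indexing the large
# Response property while leaving everything else indexed.
indexing_policy = {
    "indexingMode": "consistent",
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [
        {"path": "/Response/?"},    # the multi-thousand-character payload
        {"path": "/\"_etag\"/?"},   # system property, commonly excluded
    ],
}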
From my experience, Cosmos tends to degrade as new data arrives: more data means more physical partitions, so more throughput needs to be allocated to each of them. We are currently starting to archive old data into Blob Storage to avoid this kind of problem and keep the number of physical partitions stable. We use Cosmos as hot storage, and the old data goes to Blob Storage as cold storage. That lets us reduce the RUs allocated to each physical partition, and we save money.
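A minimal sketch of that hot/cold pattern (all names and the cutoff are illustrative; a production version would batch and handle failures):

import json
from azure.cosmos import CosmosClient
from azure.storage.blob import BlobServiceClient

# Hypothetical names and credentials, for illustration only.
cosmos = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
container = cosmos.get_database_client("audit").get_container_client("requests")
archive = BlobServiceClient.from_connection_string("<connection-string>") \
    .get_container_client("audit-archive")

# Copy documents older than a cutoff to Blob Storage, then delete them from
# Cosmos so the hot store (and physical partition count) stays small.
cutoff = 1561630083  # epoch seconds; illustrative
for doc in container.query_items(
    query="SELECT * FROM c WHERE c._ts < @cutoff",
    parameters=[{"name": "@cutoff", "value": cutoff}],
    enable_cross_partition_query=True,
):
    archive.upload_blob(name=f"{doc['id']}.json", data=json.dumps(doc))
    container.delete_item(item=doc["id"], partition_key=doc["PartitionKey"])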
I'm creating a logging system to monitor our (200-ish) main application installations, and Cosmos DB seems like a good fit, both for the amount of data we'll collect and because it allows a varying schema for the log data (particularly the Tags property - see the document schema below).
But, never having used CosmosDb before I'm slightly unsure of what to use for my partition key.
If I partitioned by CustomerId, there would likely be several GB of data in each of the 200 partitions, and the data will usually be queried by CustomerId, so this was my first choice for the partition key.
However I was planning to have a 'log stream' view in the logging system, showing logs coming in for all customers.
Would this lead to running a horribly slow / expensive cross partition query?
If so, is there an obvious way to avoid / limit the cost & speed implications of this cross partition querying? (Other than just taking out the log stream view for all customers!)
{
  "CustomerId": "be806507-7cc4-4db4-881b",
  "CustomerName": "Our Customer",
  "SystemArea": 1,
  "SystemAreaName": "ExchangeSync",
  "Message": "Updated OK",
  "Details": "",
  "LogLevel": 2,
  "Timestamp": "2018-11-23T10:59:29.7548888+00:00",
  "Tags": {
    "appointmentId": "109654",
    "appointmentGroupId": "86675",
    "exchangeId": "AAMkA",
    "exchangeAlias": "customer.name#customer.com"
  }
}
(Note - there isn't a defined list of SystemArea types we'll use yet, but there would be far fewer of them than the 200 customers)
Cross-partition queries should be avoided as much as possible. If your queries will typically filter on customer id, then CustomerId is a good logical partition key. However, you have to keep in mind that there is a limit of 10 GB of data per logical partition.
A cross-partition query across the whole database will be a very slow and very expensive operation, but if it's not functionally critical and is just used for infrequent reporting, it's not too much of a problem.
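To make the trade-off concrete, a sketch with the Python SDK (account, database and container names are made up):

from azure.cosmos import CosmosClient

# Hypothetical account, database and container names, for illustration only.
client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("logging").get_container_client("logs")

# Partitioned by /CustomerId: the per-customer view is a single-partition query.
per_customer = container.query_items(
    query="SELECT * FROM c WHERE c.CustomerId = @id",
    parameters=[{"name": "@id", "value": "be806507-7cc4-4db4-881b"}],
    partition_key="be806507-7cc4-4db4-881b",
)

# The all-customers 'log stream' view must fan out across every partition.
log_stream = container.query_items(
    query="SELECT TOP 100 * FROM c ORDER BY c.Timestamp DESC",
    enable_cross_partition_query=True,
)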
I have the following document in a couchdb database:
{
  "_id": "000013a7-4df6-403b-952c-ed767b61554a",
  "_rev": "1-54dc1794443105e9d16ba71531dd2850",
  "tags": [
    "auto_import"
  ],
  "ZZZZZZZZZZZ": "910111",
  "UUUUUUUUUUUUU": "OOOOOOOOO",
  "RECEIVING_OPERATOR": "073",
  "type": "XXXXXXXXXXXXXXXXXXX",
  "src_file": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
}
This JSON file takes exactly 319 bytes if saved in my local filesystem. My documents are all like this (give or take a couple of bytes, since some of the fields have varying lengths).
In my database I currently have around 6 million documents, and they use 15 GB. That works out to around 2.5 KB/document, which means the documents take about 8 times more space in CouchDB than they would on disk.
Why is that?
The problem is related to the way the document id is used: it is stored not only in the document but also in other data structures. That means that using a standard UUID (000013a7-4df6-403b-952c-ed767b61554a, 36 characters) is going to use up a lot of disk space. If collision is a minor concern, with base64 you can number 16 million documents (64^4) with just 4 characters, and over a billion (64^5) with 5 characters. A good choice of dictionary is one which is ordered (in the "View Collation" sense):
-#0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
Using this method, I have reduced the size of my database from 2.5 KB/doc to 0.4 KB/doc. My new database uses only 16% of the space of the old one, which I would say is a very big improvement.
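A minimal sketch of that encoding (counter-based, so uniqueness is the caller's responsibility; a fixed width keeps ids sorting correctly against each other):

# Encode a counter as a fixed-width id over the 64-character, collation-
# ordered alphabet above; 4 characters cover 64**4 = 16,777,216 documents.
ALPHABET = "-#0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ"

def short_id(n: int, width: int = 4) -> str:
    chars = []
    for _ in range(width):
        n, r = divmod(n, 64)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))

print(short_id(0))         # '----'
print(short_id(16777215))  # 'ZZZZ'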
CouchDB uses something called MVCC, which basically means it keeps previous versions of documents as you modify them. It uses these previous versions to help with replication in case of conflicts, and by default it keeps 1000 revisions (see this for more info).
You can lower the number of revisions to keep if you aren't using replication or somehow know that those sorts of conflicts will never happen.
You might also want to familiarize yourself with compaction, as that can (temporarily) lower the storage footprint as well.
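For instance (a sketch over CouchDB's HTTP API; the host and database name are illustrative, and both calls need appropriate admin credentials):

import requests

db = "http://localhost:5984/mydb"  # illustrative database URL
requests.put(f"{db}/_revs_limit", json=10)  # keep at most 10 revisions per doc
requests.post(f"{db}/_compact",             # reclaim space from old revisions
              headers={"Content-Type": "application/json"})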