I'm looking to implement a process that will occasionally pull all "new" records from a DocumentDb, where new is "all documents added or modified since the last time the process was run."
SQL Server has rowversion for this, which is guaranteed unique and monotonically increasing across all rows and columns in a database.
I see DocumentDb has _ts, which (according to the documentation) is used as a high water mark for Azure Search indexing, but how does that work? If multiple documents are inserted at the same time as a read takes place, it's possible that all of them have the same _ts value. On the next read, if the comparison against _ts is strictly greater-than, then some documents will be missed; if it's greater-than-or-equals, some documents will be pulled a second time.
Is _ts safe to use for this?
The _ts property is specific to a document, not a collection of documents. It represents the time that a particular document was updated (in seconds, since Jan 1 1970).
The _ts property will not give you a high water mark across all documents in a collection. Each document has its own independent _ts property (which may have the same value as another document's _ts property).
See this answer for a bit more detail.
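For what it's worth, here is a minimal sketch of one way to live with the duplicate-timestamp problem described in the question: read with greater-than-or-equals and remember which document versions were already seen at the current watermark. The function name `pull_changes` and the in-memory `docs` list are hypothetical stand-ins for a real query (something like `SELECT * FROM c WHERE c._ts >= @watermark`), and the sketch assumes each document carries the `id`, `_ts` and `_etag` system properties. It does not handle a document that becomes visible later with a `_ts` below the stored watermark.

```python
def pull_changes(docs, watermark, seen_at_watermark):
    """Return (changed_docs, new_watermark, new_seen).

    docs: iterable of dicts with "id", "_ts" and (optionally) "_etag".
    watermark: highest _ts processed so far (use 0 on the first run).
    seen_at_watermark: set of (id, _etag) pairs already processed
        whose _ts equals the watermark.
    """
    changed = []
    for doc in docs:
        ts = doc["_ts"]
        version = (doc["id"], doc.get("_etag"))
        if ts < watermark:
            continue  # older than the watermark: already processed
        if ts == watermark and version in seen_at_watermark:
            continue  # same second, same version: a duplicate read
        changed.append(doc)

    if not changed:
        return changed, watermark, seen_at_watermark

    new_watermark = max(doc["_ts"] for doc in changed)
    new_seen = {
        (doc["id"], doc.get("_etag"))
        for doc in changed
        if doc["_ts"] == new_watermark
    }
    if new_watermark == watermark:
        # Still inside the same second: keep the earlier versions too.
        new_seen |= seen_at_watermark
    return changed, new_watermark, new_seen
```

The caller would persist both the watermark and the small "seen" set between runs. The set only ever holds versions from a single second, so it should stay small.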
Related
I have a large collection storing messages, and I would like to produce time series data from it.
I have had issues with time series data before, when I had 10 million records to group by time interval and then count or average the values.
Timestamp => values
I sort of fixed it by putting all my data into one collection, grouped by day, so now I have fewer but larger documents. This helped reduce the seek and search time the db needs to find the relevant document. However, I am not sure how I could speed up my queries on documents that are not time series. I also want to search text in these large documents, so I have to scan all documents, no exception.
As I said, I am storing messages in a single document, and the schema looks something like this:
Id: string
Author: string
MessageType: string
Group: string
Message: string
Votes: number
Date: date
I would like to count all the records that contain a word in the message, or all the records that have the author Joe, or sum the votes, and so on.
That way I would end up with time series data that I can put on a chart.
If I have to go through one year of data, that is about 50 million records. The query is going to take forever, since it has to fetch so many records and filter out the ones I am interested in.
How could I achieve better performance?
I have indexing set up on the date and author fields only. Yet my queries are slow and the database is super busy processing one query.
Should I pre-aggregate my data somehow, and what would be a good way?
Or generate the time series data in a background worker?
Can someone direct me to the right way, so I can implement a proper solution that can either reduce the load on the database or increase query performance?
What are the best practices for handling such a large collection that contains messages?
How could I segment this kind of data?
Would it be a good idea to set up a replica set and shard the database between multiple machines already?
Any help and input would be appreciated.
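To make the pre-aggregation idea above concrete, here is a rough sketch of what a background worker could compute: one bucket per day and author, holding a message count and a vote total. The function name `aggregate_daily` is hypothetical, the field names follow the schema above, and the messages are assumed to arrive as plain dicts with `Date` already parsed into a `datetime`. Where the buckets are stored afterwards is left open.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_daily(messages):
    """Roll messages up into (day, author) buckets.

    Returns a dict mapping ("YYYY-MM-DD", author) to
    {"count": <number of messages>, "votes": <sum of Votes>}.
    """
    buckets = defaultdict(lambda: {"count": 0, "votes": 0})
    for message in messages:
        day = message["Date"].date().isoformat()
        key = (day, message["Author"])
        buckets[key]["count"] += 1
        buckets[key]["votes"] += message["Votes"]
    return dict(buckets)


sample = [
    {"Author": "Joe", "Votes": 2, "Date": datetime(2020, 1, 1, 9, 0)},
    {"Author": "Joe", "Votes": 3, "Date": datetime(2020, 1, 1, 17, 30)},
    {"Author": "Ann", "Votes": 1, "Date": datetime(2020, 1, 2, 8, 15)},
]
daily = aggregate_daily(sample)
```

A chart query would then read a few hundred small buckets instead of scanning millions of messages. Free-text search on the message body is a separate problem that this sketch does not address.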
We've set up an Azure Search Index on our Azure SQL Database of ~2.7 million records all contained in one Capture table. Every night, our data scrapers grab the latest data, truncate the Capture table, then rewrite all the latest data - most of which will be duplicates of what was just truncated, but with a small amount of new data. We don't have any feasible way of only writing new records each day, due to the large amounts of unstructured data in a couple fields of each record.
How should we best manage our index in this scenario? Running the indexer on a schedule requires you to indicate this "high watermark column." Because of the nature of our database (erase/replace once a day) we don't have any column that would apply here. Further, what really needs to happen for our Azure Search Index is either it also needs to go through a full daily erase/replace, or some other approach so that we don't keep adding 2.7 million duplicate records every day to the index. The former likely won't work for us because it takes 4 hours minimum to index our whole database. That's 4 hours where clients (worldwide) may not have a full dataset to query on.
Can someone from Azure Search make a suggestion here?
What's the proportion of the data that actually changes every day? If that proportion is small, then you don't need to recreate the search index. Simply reset the indexer after the SQL table has been recreated, and trigger reindexing (resetting an indexer clears its high water mark state, but doesn't change the target index). Even though it may take several hours, your index is still there with the mostly full dataset. Presumably if you update the dataset once a day, your clients can tolerate hours of latency for picking up latest data.
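As a sketch of the reset-then-reindex step, the snippet below builds the two REST calls (Reset Indexer and Run Indexer) without sending them. The helper name `indexer_request`, the service and indexer names, the key, and the chosen `api-version` are all placeholders to be replaced with your own values.

```python
import urllib.request

API_VERSION = "2020-06-30"  # assumption: use the version your service supports


def indexer_request(service, indexer, action, api_key):
    """Build (but do not send) a POST request for an indexer action.

    action must be "reset" (clears the high water mark state)
    or "run" (triggers indexing on demand).
    """
    if action not in ("reset", "run"):
        raise ValueError("action must be 'reset' or 'run'")
    url = (
        f"https://{service}.search.windows.net"
        f"/indexers/{indexer}/{action}?api-version={API_VERSION}"
    )
    return urllib.request.Request(
        url,
        data=b"",  # empty body; both calls are plain POSTs
        method="POST",
        headers={"api-key": api_key},
    )


# Placeholder names; nothing is sent until urlopen() is called.
reset = indexer_request("my-service", "capture-indexer", "reset", "<admin-key>")
run = indexer_request("my-service", "capture-indexer", "run", "<admin-key>")
# urllib.request.urlopen(reset); urllib.request.urlopen(run)
```

The nightly job would send the reset first and the run second, once the Capture table has been rewritten.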
In Azure Search is there a mechanism to set an "Expiration Date" on items within the index? I have a need for items to only be in the search index for a pre-defined period of time.
Not at this time. For now, you need to send a delete request to delete an item in an index.
We often refer to this capability as Time to Live. It would be great if you could vote for this feature to help us prioritize it, if you would find it valuable.
http://feedback.azure.com/forums/263029-azure-search/suggestions/6328648-time-to-live-for-data
Liam
As Liam mentions, there isn't one at the moment.
One option might be to add an "Expiry" field with a type of Edm.DateTimeOffset into your documents and have all your queries only request documents whose expiry date is greater than the current timestamp.
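A small sketch of that workaround: build the OData `$filter` expression from the current UTC time and attach it to every query. The helper name `expiry_filter` is hypothetical, and the field name `Expiry` is the one suggested above. The function expects a timezone-aware `datetime`.

```python
from datetime import datetime, timezone


def expiry_filter(now=None, field="Expiry"):
    """Return an OData filter that keeps only unexpired documents.

    now: timezone-aware datetime; defaults to the current UTC time.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{field} gt {stamp}"


example = expiry_filter(datetime(2015, 1, 1, tzinfo=timezone.utc))
# example == "Expiry gt 2015-01-01T00:00:00Z"
```

Expired documents still occupy space in the index, so a periodic job that sends delete requests for them would still be needed.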
I am storing account information in Cassandra. Each account has lists of data associated with it. For example, an account may have a list of friends and a list of liked books. Queries on accounts will always want all friends or all liked books or all of both. No filtering or searching is needed on either. The list of friends and books can grow and shrink.
Is it better to use a set column type or composite columns for this scenario?
I would suggest not using sets if:
You are concerned about disk space (each value is allocated its own cell on disk, plus the metadata for each cell, which is 15 bytes if I am not wrong; that adds up quickly if your data keeps growing).
You are not going to grow a lot of data in that particular row, as each time the cells have to be fetched from different SSTables.
In these kinds of cases, the preferred option would be a JSON array. You can store it as text and read the data back from that.
Sets (and the other collection types) were introduced for a completely different use case. If you need to look up a particular value inside the list, or a value inside the same collection has to be updated frequently, then you should make use of collections.
My take on your query would be this:
Store all account-specific info in a JSON object of friends, with the list of books as a value.
Sets are good for smaller collections of data. If you expect your friends / liked books lists to grow constantly and get large (there isn't a golden number here), it would be better to go with composite columns, as that model scales out better than collections and allows for straight-up querying, rather than requiring secondary indexes on collections.
I have a need to query a store of 200 million entities in Windows Azure. Ideally, I would like to use the Table Service, rather than SQL Azure, for this task.
The use case is this: a POST containing a new entity will be incoming from a web-facing API. We must query about 200 million entities to determine whether or not we may accept the new entity.
With the entity limit of 1,000: does this apply to this type of query, i.e. I have to query 1,000 at a time and perform my comparisons / business rules, or can I query all 200 million entities in one shot? I think I would hit a timeout in the latter case.
Ideas?
Expanding on Shiraz's comment about Table storage: Tables are organized into partitions, and then your entities are indexed by a Row key. So, each row can be found extremely fast using the combination of partition key + row key. The trick is to choose the best possible partition key and row key for your particular application.
For your example above, where you're searching by telephone number, you can make TelephoneNumber the partition key. You could very easily find all rows related to that telephone number (though, not knowing your application, I don't know just how many rows you'd be expecting). To refine things further, you'd want to define a row key that you can index into, within the partition key. This would give you a very fast response to let you know whether a record exists.
Table storage (actually Azure Storage in general: tables, blobs, queues) has a well-known SLA. You can execute up to 500 transactions per second on a given partition. With the example above, the query for rows for a given telephone number would equate to one transaction (unless you exceed 1000 rows returned, in which case you'd need additional fetches to see all rows); adding a row key to narrow the search would, indeed, yield a single transaction. So would inserting a new row. You can also batch up multiple row inserts, within a single partition, and save them in a single transaction.
For a nice overview of Azure Table Storage, with some good labs, check out the Platform Training Kit.
For more info about transactions within tables, see this msdn blog post.
The limit of 1000 is the number of rows returned from a query, not the number of rows queried.
Pulling all of the 200 million rows into the web server to check them will not work.
The trick is to store the rows with a key that can be used to check if the record should be accepted.
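To illustrate the key-based check, here is a toy in-memory stand-in for a table, not the Table Service API itself. Each entity is addressed by a (partition key, row key) pair, so deciding whether a new entity can be accepted is a single point lookup rather than a scan. The class name `KeyedTable`, the sample telephone number and the row key format are made up for the example.

```python
class KeyedTable:
    """Toy model of a table indexed by (partition key, row key)."""

    def __init__(self):
        self._rows = {}

    def exists(self, partition_key, row_key):
        # Point lookup: cost does not depend on the table size.
        return (partition_key, row_key) in self._rows

    def try_insert(self, partition_key, row_key, entity):
        """Insert the entity unless the key is taken; return success."""
        key = (partition_key, row_key)
        if key in self._rows:
            return False
        self._rows[key] = entity
        return True


table = KeyedTable()
# Hypothetical keys: telephone number as partition key, as in the
# example above, plus a row key that identifies the record within it.
first = table.try_insert("555-0100", "order-1", {"status": "new"})
second = table.try_insert("555-0100", "order-1", {"status": "dup"})
```

With the real service, the same idea means encoding whatever the acceptance rule depends on into the partition key and row key, so the check becomes one lookup per incoming POST.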