I am looking to implement a page view counter in Azure Table Storage. If, say, two users visit the page at the same time and the current value of PageViews is 100, is it guaranteed that PageViews will be 102 after both update operations?
The answer depends on how you implement your counter. :-)
Table storage doesn't have an "increment" operator, so you'd need to read the current value (100) and update it to the new value (101). Table storage employs optimistic concurrency, so if you do what comes naturally when using the .NET storage client library, you'd likely see an exception when two processes tried to do this simultaneously. This would be the flow:
Process A reads the value of PageViews and receives 100.
Process B reads the value of PageViews and receives 100.
Process A makes a conditional update to PageViews that means "set PageViews to 101 as long as it's currently 100." This succeeds.
Process B performs the same operation and fails, because the precondition (PageViews == 100) is no longer true.
The obvious thing to do when you receive the error is to repeat the process. (Read the current value, which is now 101, and update to 102.) This will always (eventually) result in your counter having the correct value.
There are other possibilities, and we did an entire Cloud Cover episode about how to implement a truly scalable counter: http://channel9.msdn.com/Shows/Cloud+Cover/Cloud-Cover-Episode-43-Scalable-Counters-with-Windows-Azure.
What's described in that video is probably overkill if collisions are unlikely. I.e., if your hit rate is one per second, the normal "read, increment, write" pattern will be safe and efficient. If, on the other hand, you receive 1,000 hits per second, you'll want to do something smarter.
EDIT
Just wanted to clarify for people who read this to understand optimistic concurrency... the conditional operation isn't really "set PageViews to 101 as long as it's currently 100." It's more like "set PageViews to 101 as long as it hasn't changed since the last time I looked at it." (This is accomplished by using the ETag that came back with the HTTP response, which is sent back to the server in the If-Match header of the update.)
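For concreteness, here's a minimal sketch of that read-increment-retry loop using the classic .NET storage client library (a CloudTable named table and the partition/row key names are assumptions):

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

public class CounterEntity : TableEntity
{
    public CounterEntity() { }
    public int PageViews { get; set; }
}

// Read-increment-write, retrying until our conditional update wins.
while (true)
{
    var retrieved = table.Execute(TableOperation.Retrieve<CounterEntity>("counters", "MyPage.aspx"));
    var counter = (CounterEntity)retrieved.Result;
    counter.PageViews++;

    try
    {
        // Replace sends If-Match with the ETag we just read, so it only
        // succeeds if nobody else updated the entity in the meantime.
        table.Execute(TableOperation.Replace(counter));
        break;
    }
    catch (StorageException ex) when (ex.RequestInformation.HttpStatusCode == 412)
    {
        // 412 Precondition Failed: somebody beat us to it; loop and retry.
    }
}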
You could also rethink the 'count' part. Why not turn this into a two-step process?
Step 1 - Recording Page Views
Each time someone views a page, add a record to a table (let's call it PageViews). Each of these records would contain the following:
PartitionKey = PageName
RowKey = Random GUID
After a few views you would have something like this:
MyPage.aspx - someGuid
MyPage.aspx - someGuid
SomePage.aspx - someGuid
MyPage.aspx - someGuid
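Recording a view is then a single unconditional insert, so concurrent visitors never conflict. A minimal sketch with the .NET storage client (the table variable is assumed):

// One insert per view; no read-modify-write, so no concurrency conflicts.
var view = new DynamicTableEntity("MyPage.aspx", Guid.NewGuid().ToString());
table.Execute(TableOperation.Insert(view));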
Step 2 - Counting Page Views
What we want to do now is get all those records, count them, increase a counter somewhere, and delete the counted records. Let's assume you have multiple workers running. Each worker would run a loop on a random interval of between 1 and 10 minutes. Each time a worker's interval elapses, it will take a lease on a blob if no lease has been taken yet (this should always be the same blob; you can use AutoRenewLease).
The first worker getting the lock can go ahead and do the counting:
Get all records from the PageViews table, or from cache
Count all page views per page
Update count somewhere
Delete the records that were taken into account when counting
The issue here is that it's very hard to turn this into an idempotent process. What happens if your instance crashes between the count and the delete? You'll have an increased page count, but since the items were not deleted they'll be added to the total count the next time you process them.
This is why I would suggest the following: in the same table (PageViews) and in that same partition, you will also record the total page views. But the data will be a bit different - this will be a single record in that partition holding the total count:
PartitionKey = PageName
RowKey = Guid.Empty (don't use a random GUID here; this is how we tell the difference between a recorded page view and the record holding the total count).
Count = The current page view count
This is perfectly possible because Table Storage is schemaless. And why are we doing this? Because we do get transactions if we limit ourselves to the same table and partition, with a maximum of 100 entities. What can we do with this?
Using Take, we get 100 records from that table + partition.
The first record we'll get is the 'counter' record. Why? Because its RowKey is Guid.Empty and sorting is lexicographical.
Count these records (-1 because the first record isn't a page view, it's just our counter placeholder)
Update the Count property of the counter record
Delete the 99 (or fewer) other records
SaveChanges using the Batch option.
Repeat until there is only 1 record left (the counter record).
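Concretely, one pass of that loop might look like this - a sketch using TableBatchOperation from the .NET storage client (the equivalent of SaveChanges with the Batch option in the older TableServiceContext API); the table variable and page name are assumptions, and the counter record is assumed to have been seeded beforehand:

// Take 100 entities from the page's partition. The counter record
// (RowKey = Guid.Empty, i.e. all zeros) sorts first lexicographically.
var query = new TableQuery<DynamicTableEntity>()
    .Where(TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "MyPage.aspx"))
    .Take(100);
var entities = table.ExecuteQuery(query).ToList();   // needs System.Linq

var counter = entities[0];                  // the counter placeholder
int newViews = entities.Count - 1;          // everything else is a recorded view
counter.Properties["Count"] = EntityProperty.GeneratePropertyForInt(
    (counter.Properties["Count"].Int32Value ?? 0) + newViews);

// One entity group transaction: update the counter and delete the
// 99 (or fewer) view records atomically.
var batch = new TableBatchOperation();
batch.Replace(counter);
foreach (var view in entities.Skip(1))
    batch.Delete(view);
table.ExecuteBatch(batch);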
And every X minutes, your workers will check whether there is a lease on the blob; if not, they take the lease and restart the process.
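The lease hand-off could be sketched like this (the container variable, blob name, and CountPageViews helper are illustrative; AutoRenewLease wraps this same idea and renews the lease for you):

var blob = container.GetBlockBlobReference("pageview-counter-lock");
string leaseId = null;
try
{
    // Only one worker can hold the lease at any moment.
    leaseId = blob.AcquireLease(TimeSpan.FromSeconds(60), null);
    CountPageViews();   // hypothetical: runs the batch loop sketched above
}
catch (StorageException ex) when (ex.RequestInformation.HttpStatusCode == 409)
{
    // 409 Conflict: another worker holds the lease; try again next cycle.
}
finally
{
    if (leaseId != null)
        blob.ReleaseLease(AccessCondition.GenerateLeaseCondition(leaseId));
}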
Is this answer clear enough or should I add some code?
I came across the same question. With the Azure Python library, I'm developing a simple counter increment using the ETag and If-Match headers instead of a lock. The basic idea is to retry the increment until the update succeeds under the condition that no other update has interfered with it. If the update load is heavy, sharding should be brought in.
https://github.com/flyakite/simple-scalable-datastore/blob/master/datastore/azuretable.py
If using Azure Websites, then Azure Queues and WebJobs is another option.
In one scenario of mine, though, I am actually going to take the sharding approach and have WebJobs update the aggregates periodically: an Azure Table Storage table of UserPageViews with PartitionKey = User and RowKey = Page. Two simultaneous users with the same user ID will not be allowed.
I have a use case in which I use ScyllaDB to limit users' actions in the past 24h. Let's say a user is only allowed to place an order 3 times in the last 24h. I am using ScyllaDB's TTL and counting the number of records in the table to achieve this. I am also using https://github.com/spaolacci/murmur3 to get the hash for the partition key.
However, I would like to know the most efficient way to query the table. I have a few queries whose behavior I'd like to understand better and compare (please correct me if any of my statements are wrong):
using count()
count() will perform a full scan, meaning that it may read more records than necessary from the table.
SELECT COUNT(1) FROM orders WHERE hash_id=? AND user_id=?;
using limit
limit will only limit the number of records returned to the client. That means it will still read all records that match its predicates, but return only the first N.
SELECT user_id FROM orders WHERE hash_id=? AND user_id=? LIMIT ?;
using paging
I'm a bit new to this, but if I read the docs correctly, it should only read up until it has received the first N records, without having to scan the whole table. So if I set the page size to the number of records I want to fetch and query only the first page, would it work correctly, and will it give a consistent result?
docs: https://java-driver.docs.scylladb.com/stable/manual/core/paging/index.html
My query still uses LIMIT, but utilizes the driver's paging, via https://github.com/gocql/gocql:
iter := conn.Query(
    "SELECT user_id FROM orders WHERE hash_id=? AND user_id=? LIMIT ?",
    hashID,
    userID,
    3,
).PageSize(3).PageState(nil).Iter()
Please let me know if my analysis is correct and which method would be the best to choose.
Your client should always use paging - otherwise you risk adding pressure to the query coordinator, which may introduce latency and memory fragmentation. If you use the Scylla Monitoring stack (and you should if you don't!), refer to the CQL Optimization dashboard and - more specifically - to the Paged Queries panel.
Now, to your question. It seems that your example is a bit minimal for what you actually want to achieve - and even if it isn't, we have to consider such a setup at scale. E.g., there may be one tenant allowed to place 3 orders within a day, but another tenant allowed to place 1 million orders within a week.
If the above assumption is correct - given the options at hand - you are better off using LIMIT with paging. The reason is that there are some particular problems with the description you've given:
First, you want to retrieve N records within a particular time frame, but your queries don't specify such a time frame.
Second, either COUNT or LIMIT will initiate a partition scan, and it is not clear how a hash_id + user_id combination alone can determine the number of records within a time frame.
Of course, I may be wrong, but I'd like to suggest some different approaches which may or may not be applicable to you and your use case.
Consider making a timestamp part of the clustering key. This will allow you to avoid full partition scans, with queries such as:
SELECT something FROM orders WHERE hash_id=? AND user_id=? AND ts >= ? AND ts < ?;
If the above is not applicable, then perhaps a Counter Table would suit your needs. You could simply increment a counter after an order is placed and then query the counter table, as in:
SELECT count FROM counter_table WHERE hash_id=? AND user_id=? AND date=?;
I hope that helps!
I have a few points I want to add to what Felipe wrote already:
First, you don't need to hash the partition key yourself. You can use anything you want for the partition key, even consecutive numbers; the partition key doesn't need to look random. Scylla will internally hash the partition key on its own to improve load balancing. You don't need to know or care which hashing algorithm ScyllaDB uses, but interestingly, it's a variant of murmur3 too (not identical to the one you used - it's a modified algorithm originally picked by the Cassandra developers).
Second, you should know - and decide whether you care - that the limit you are trying to enforce is not a hard limit in the face of concurrent operations: imagine that a given partition already has two records, and now two concurrent record-addition requests come in. Both can check that there are just two records, decide it's fine to add a third, and both add their record - and you end up with four records. You'll need to decide whether it's fine that a lucky user can get in 4 requests in a day, or whether it's a disaster. Note that theoretically you can get even more than 4: if the user manages to send N requests at exactly the same time, they may be able to get 2+N records into the database (though in the usual case they won't manage to get many superfluous records). If you want 3 to be a hard limit, you'll probably need to change your solution - perhaps to one based on LWT, and not use TTL.
Third, I want to note that there is no important performance difference between COUNT and LIMIT when you know a priori that there will only be up to 3 (or perhaps, as explained above, 4 or some other similarly small number of) results. If the SELECT can only yield three or so results and can never yield a thousand, it doesn't really matter whether you retrieve them or count them - just do whichever is convenient for you. In any case, I think paging is not a good fit for your need: for such short results you'll never reach even the default page size, and paging also hints to the server that you will likely continue reading on the next page - so it caches the buffers it needs to do that - while in this case you know you'll never continue past the first three results. In short, don't use any special paging setup here; just use the default page size (which is 1MB), as it will never be reached anyway.
I'm working with Azure Table (storage) in order to store information about websites I'm working with. So, I planned this structure:
Partition Key - domain name
Row key - Webpage address
Valid until (date time) - after this date, the record will be deleted.
Other crucial data here...
Those columns will be stored in a table named after the website address (e.g. "cnn.com").
I have two main use cases (from high to low frequency):
1. Check if URL "x" is in the table - find by combination of Partition Key and Row Key - very efficient.
2. Delete old data - remove all expired data (according to the "Valid until" column). This operation takes place every midnight and may delete millions of rows - very heavy.
So, our first task (checking whether a URL exists) is implemented efficiently with this data model. The second task is not, and I want to avoid batch deletion.
I also worry about creating "hot spots", which would hurt my performance because of the partition key choice: I expect that at certain hours I will run many more queries against a specific domain, which would make that partition a hot spot and hit my performance. To avoid this, I thought of running the URL through a hash function and using the result as the partition key. Is this a good idea?
I also thought about other ways to implement this, and they seem to have problems:
Storing the rows in tables named with the deletion date (e.g. "cnn.com-1-1-2016"). This gives us great deletion performance, but a bad search experience (the row can exist in more than one table, e.g. "cnn.com-1-1-2016" or "cnn.com-2-1-2016"...).
What is the right solution for my problem?
Have you seen the Azure Table Storage Design Guide? It describes principles and patterns for designing table solutions at scale. For hot spots, take a look at the prepend/append anti-pattern for some extra information. This is the situation where all your operations occur within a single partition, which prevents additional resources from being added. For these types of scenarios you will get better scale if you can distribute the operations across partitions instead.
Let's assume you have a page like https://www.yahoo.com/news/death-omar-al-shishani-could-mean-war-against-203132664.html?nhp=1. You can make the PK the domain name + "/news/" + the first 2 letters of the page address - in this example, https://www.yahoo.com/news/de - and the RK the rest of the full address. This will split your domain partition into nearly 1,000 partitions. If that's not enough, use the first 3 letters in the PK.
Remove obsolete data every 15 minutes (create a separate service for it); your millions will become just tens of thousands. Or keep less data (e.g. 2 weeks instead of a month). And do not forget to optimize deletion (retrieve the PK and RK only, set the ETag to "*", remove as a DynamicTableEntity, and batch where possible).
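A sketch of that optimized deletion with the .NET storage client (assuming "Valid until" is stored as a ValidUntil date property; the table variable is assumed, and System.Linq is needed for GroupBy):

// Project only the keys, then delete unconditionally (ETag "*"),
// batching per partition (entity group transactions cannot span partitions).
var query = new TableQuery<DynamicTableEntity>()
    .Where(TableQuery.GenerateFilterConditionForDate(
        "ValidUntil", QueryComparisons.LessThan, DateTimeOffset.UtcNow))
    .Select(new[] { "PartitionKey", "RowKey" });

foreach (var partition in table.ExecuteQuery(query).GroupBy(e => e.PartitionKey))
{
    var batch = new TableBatchOperation();
    foreach (var entity in partition)
    {
        entity.ETag = "*";   // delete regardless of concurrent changes
        batch.Delete(entity);
        if (batch.Count == 100) { table.ExecuteBatch(batch); batch = new TableBatchOperation(); }
    }
    if (batch.Count > 0)
        table.ExecuteBatch(batch);
}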
I am implementing a session table in DynamoDB with Node.js, which will grow to a huge number of items. Each hash key is a UUID representing a user.
In order to delete expired sessions, I must scan the table on an expiry attribute and delete the old sessions. I am planning to run this scan once every few days; other than that, I don't really need high read capacity.
I came up with two solutions and would like to hear some feedback on them.
1) UpdateTable to higher capacity for only that scheduled routine, and after the scan is done, simply reduce the table's capacity back to its original values.
2) Perform the scan, and each time 'LastEvaluatedKey' is returned after reading x MB, introduce a delay (so as not to consume all the read units per second), then continue the scan with 'ExclusiveStartKey'.
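For reference, the pacing in option 2 might look roughly like this (sketched with the AWS SDK for .NET for concreteness; TableName, Limit, and ExclusiveStartKey are the same API-level parameters in the Node.js SDK, and the client variable, table name, page size, delay, and DeleteExpiredAsync helper are all illustrative):

Dictionary<string, AttributeValue> startKey = null;
do
{
    var response = await client.ScanAsync(new ScanRequest
    {
        TableName = "Sessions",         // illustrative
        Limit = 100,                    // caps the read units consumed per request
        ExclusiveStartKey = startKey
    });

    await DeleteExpiredAsync(response.Items);    // hypothetical deletion helper

    startKey = response.LastEvaluatedKey;
    await Task.Delay(TimeSpan.FromSeconds(1));   // pacing delay between pages
}
while (startKey != null && startKey.Count > 0);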
If you're doing a scan, option 1 is your best bet. This is the only real way to guarantee that you won't affect your application's performance while the scan is ongoing.
The only thing you need to be sure of is that you only run this operation once a day -- I believe you can only downgrade the provisioned throughput of a DynamoDB table twice per day (at most).
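The throughput change itself is a single UpdateTable call on either side of the scan; a sketch with the AWS SDK for .NET (the client variable, table name, and capacity numbers are illustrative):

// Raise read capacity before the scheduled scan...
await client.UpdateTableAsync(new UpdateTableRequest
{
    TableName = "Sessions",
    ProvisionedThroughput = new ProvisionedThroughput { ReadCapacityUnits = 500, WriteCapacityUnits = 10 }
});

// ...run the scan/cleanup, then dial it back down (this second call is
// the step the daily downgrade limit applies to).
await client.UpdateTableAsync(new UpdateTableRequest
{
    TableName = "Sessions",
    ProvisionedThroughput = new ProvisionedThroughput { ReadCapacityUnits = 5, WriteCapacityUnits = 10 }
});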
This is an old question, but I saw it through a related question.
There is now a much better native solution: DynamoDB Time to Live
It allows you to specify one attribute per table that serves as the time to live value for each item. You can then set the attribute per item with a Unix-Timestamp that specifies when the item should be deleted.
Within about 24 hours of that timestamp, the item will be deleted at no additional charge.
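Enabling TTL is a one-time call, after which each item just carries an epoch-seconds attribute. A sketch with the AWS SDK for .NET (the client variable, "ExpiresAt" attribute name, and 30-day window are illustrative):

// One-time: tell DynamoDB which attribute holds the expiry timestamp.
await client.UpdateTimeToLiveAsync(new UpdateTimeToLiveRequest
{
    TableName = "Sessions",
    TimeToLiveSpecification = new TimeToLiveSpecification
    {
        AttributeName = "ExpiresAt",
        Enabled = true
    }
});

// Per item: store the expiry as a Unix timestamp in seconds.
item["ExpiresAt"] = new AttributeValue
{
    N = DateTimeOffset.UtcNow.AddDays(30).ToUnixTimeSeconds().ToString()
};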
OK, so a simple task such as generating a sequential number has caused us an issue in the cloud.
Once you have more than one server, it gets harder and harder to guarantee that the numbers allocated by the different servers don't clash.
We are using Azure servers if it helps.
We thought about using the app cache but you cannot guarantee it will be updated between servers.
We are limited to using:
a SQL table with an identity column
or
some peer to peer method between servers
or
use a blob store and utilize its locks to store the most up-to-date number (this could have scaling issues)
I just wondered if anyone has an idea for a solution to this?
Surely it's a simple problem and must have been solved by now.
If you can live with a use case where the numbers you get from this central location are not always sequential (but guaranteed to be unique), I would suggest considering the following pattern. I've helped a large e-commerce client implement this, since they needed unique int PKs to synchronize back on-premises:
Create a queue and a small always-running process that populates this queue with sequential integers (this process should remember which number it generated last and keep replenishing the pool with more numbers once the queue gets close to empty)
Now, you can have your code poll the next number from the queue, delete it from the queue, and then attempt to save it into the SQL Azure database. In case of failure, all you'll have is a "hole" in your sequential numbers. In scenarios with frequent inserts, you may also be saving things out of order to the database (two processes poll from the queue; one polls first but saves last, so the PKs saved to the database are no longer sequential)
The biggest downside is that you now have to maintain/monitor a process that replenishes the pool of PKs.
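A sketch of that pattern with Azure storage queues (classic .NET storage client; the queueClient variable, queue name, block size, and how lastIssued is persisted are all assumptions):

// Replenisher (the single always-on process): push the next block of numbers.
// lastIssued must be persisted durably between runs.
var queue = queueClient.GetQueueReference("sequence-pool");
for (int n = lastIssued + 1; n <= lastIssued + 1000; n++)
    queue.AddMessage(new CloudQueueMessage(n.ToString()));

// Consumer (any server): take a number off the pool.
var msg = queue.GetMessage();
if (msg != null)
{
    int next = int.Parse(msg.AsString);
    queue.DeleteMessage(msg);   // crashing after this just leaves a harmless "hole"
    // ... insert the database row with PK = next ...
}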
After reading this, I would not trust identity columns.
I think the best way is, before inserting, to get the last stored id and increment it by one (programmatically). Another option is to create a trigger, but it could be a mess if you receive a lot of concurrent requests on the DB or if your table has millions of records.
create trigger trigger_name
on table_name
after insert
as
-- caution: max(id) is read without any serialization, so two concurrent
-- inserts can compute the same @seq; this also assumes single-row inserts
declare @seq int
set @seq = (select max(id) + 1 from table_name)
update table_name
set table_name.id = @seq
from table_name
inner join inserted
on table_name.id = inserted.id
More info:
http://msdn.microsoft.com/en-us/library/windowsazure/ee336242.aspx
If you're worried about scaling the number generation when using blobs, then you can use the SnowMaker library which is available on GitHub and Nuget. It gets around the scale problem by retrieving blocks of ids into a local cache. This guarantees that the Ids are unique, but not necessarily sequential if you have more than one server. I'm not sure if that would achieve what you're after.
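If I remember SnowMaker's API correctly, usage looks roughly like this (a sketch; the container and scope names are illustrative, and you should verify the exact signatures against the library itself):

var dataStore = new BlobOptimisticDataStore(CloudStorageAccount.Parse(connectionString), "unique-ids");
var generator = new UniqueIdGenerator(dataStore) { BatchSize = 100 };   // ids reserved per local block
long orderId = generator.NextId("orders");   // unique across servers, sequential only within a block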
I have a table in Azure Table Storage, with rows that are regularly updated by various processes. I want to efficiently monitor when rows haven't been updated within a specific time period, and to cause alerts to be generated if that occurs.
Most task-scheduler implementations I've seen for Azure work by making sure only one worker will perform a given job at a time. However, setting up a scheduled task that waits n minutes and then queries the latest time-stamp to determine whether action should be taken seems inefficient, since the work won't be spread across workers. It also seems generally inefficient to have to poll so many records.
An example use of this would be to send an email to a user that hasn't logged into a web site in the last 30 days. Assume that the number of users is a "large number" for the purposes of producing an efficient algorithm.
Does anyone have any recommendations for strategies that could be used to check for recent activity without forcing only one worker to do the job?
Keep a LastActive table with a timestamp as a rowkey (DateTime.UtcNow.Ticks.ToString("d19")). Update it by doing a batch transaction that deletes the old row and inserts the new row.
Now the query for inactive users is just something like from user in LastActive where user.PartitionKey == string.Empty && user.RowKey < (DateTime.UtcNow - TimeSpan.FromDays(30)).Ticks.ToString("d19") select user. That will be quite efficient for any size table.
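Concretely, the update might look like this (a sketch with the .NET storage client; appending the user id to the row key is an illustrative tweak to keep row keys unique when two updates land on the same tick, and oldRowKey is the user's previous row key, which you'd track somewhere):

// Replace the user's activity row in one entity group transaction
// (both rows live in the same partition, string.Empty).
string newRowKey = DateTime.UtcNow.Ticks.ToString("d19") + "-" + userId;

var batch = new TableBatchOperation();
batch.Delete(new DynamicTableEntity(string.Empty, oldRowKey) { ETag = "*" });
batch.Insert(new DynamicTableEntity(string.Empty, newRowKey)
{
    Properties = { { "UserId", new EntityProperty(userId) } }
});
table.ExecuteBatch(batch);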
Depending on what you're going to do with that information, you might want to then put a message on a queue and then delete the row (so it doesn't get noticed again the next time you check). Multiple workers can now pull those queue messages and take action.
I'm confused about your desire to do this on multiple worker instances... you presumably want to act on an inactive user only once, so you want only one instance to do the check. (The work of sending emails or whatever else you're doing can then be spread about by using a queue, but that initial check should be done by exactly one instance.)