Azure table storage - customer data per day

What do you recommend for the following scenario?
If I have a table with 100,000,000 items (a lot of items, in other words), how can I retrieve those items per day?
Once the items are added to the table, they are never modified or deleted. Basically it is just insert and read.
My question is about retrieving them without having to loop through all 100,000,000 items.
Should I make the PartitionKey a datetime or just a date, and then retrieve by PartitionKey where it is equal to, for example, 22.10.2013?
What do you recommend?

If you are reading the items per day, then using the date (just the Date part, not the full DateTime) as the PartitionKey is the best solution.
When using a Date as the Key, I prefer converting it to a String in the YYYYMMDD (or YYYY-MM-DD) format.

The use of a datetime as a PartitionKey is an anti-pattern since all writes go in the same partition - which limits scalability. The Azure Storage scalability targets indicate that you can do 2,000 operations a second against a partition but 20,000 operations a second against a storage account. You can get round this by sharding inserts across a set of buckets for the day - and prepending the date with the bucket name.
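
The bucketing idea can be sketched as follows, in Python for brevity. This is only an illustration under stated assumptions: the bucket count of 10, the "NN_YYYYMMDD" key layout and all helper names are hypothetical and not part of any Azure SDK.

```python
import hashlib
from datetime import date
from typing import List

BUCKET_COUNT = 10  # assumed number of write buckets per day


def bucket_for(item_id: str, bucket_count: int = BUCKET_COUNT) -> int:
    """Map an item id to a stable bucket so writes spread over several partitions."""
    digest = hashlib.sha256(item_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % bucket_count


def partition_key(day: date, item_id: str) -> str:
    """Build a key with the bucket prepended to the date, e.g. '03_20131022'."""
    return f"{bucket_for(item_id):02d}_{day:%Y%m%d}"


def partition_keys_for_day(day: date, bucket_count: int = BUCKET_COUNT) -> List[str]:
    """List every partition key that has to be queried to read one full day."""
    return [f"{bucket:02d}_{day:%Y%m%d}" for bucket in range(bucket_count)]
```

Reading a day back then means issuing one query per key returned by partition_keys_for_day and merging the results in the application.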

Related

How to get Last 6 Month data comparing with timestamp column using cassandra query?

I need to get all account statements from the last 3/6 months by comparing updatedTime (a timestamp column) with the current time.
In SQL, for example, we use the DateAdd() function for this. I don't know how to do this in Cassandra.
If anyone knows, please reply. Thanks in advance.

Cassandra 2.2 and later allows users to define functions (UDFs) that can be applied to data stored in a table as part of a query result.
If you use Cassandra 2.2 or later, you can create your own function as a UDF:
CREATE FUNCTION monthadd(date timestamp, month int)
CALLED ON NULL INPUT
RETURNS timestamp
LANGUAGE java
AS $$
  java.util.Calendar c = java.util.Calendar.getInstance();
  c.setTime(date);
  c.add(java.util.Calendar.MONTH, month);
  return c.getTime();
$$;
This function takes two parameters:
date timestamp: the date to which you want to add, or from which you want to subtract, a number of months
month int: the number of months to add (+) or subtract (-)
It returns the resulting timestamp.
Here is how you can use this :
SELECT * FROM ttest WHERE id = 1 AND updated_time >= monthAdd(dateof(now()), -6) ;
Here the monthAdd function subtracts 6 months from the current timestamp, so this query returns the data from the last 6 months.
Note: By default, user-defined functions are disabled in cassandra.yaml. Set enable_user_defined_functions=true to enable them if you are aware of the security risks.

In cassandra you have to build the queries upfront.
Also be aware that you will probably have to bucket the data depending on the number of accounts that you have within some period of time.
If your whole database doesn't contain more than, let's say, 100k entries, you are fine with defining a single generic partition, let's say with the name 'all'. But usually people have a lot of data, which goes into a bucket that carries the name of a month, week or hour. Which one depends on the number of inserts you get.
The reason for creating buckets is that every node can find a partition by its partition key. This is the first part of the primary key definition. Then on every node the data is sorted by the second piece of information that you pass in to the primary key. Having the data sorted enables you to "scan" over it, i.e. you will be able to retrieve rows by giving a timestamp parameter.
Let's say you want to retrieve accounts from the last 6 months and that you are saving all the accounts from one month in the same bucket.
The schema might be something on the lines of:
create table accounts (
  month text,
  created_time timestamp,
  account text,
  PRIMARY KEY (month, created_time)
);
You will usually do this at the application level by issuing one query per bucket. Merging the results of several queries is an anti-pattern, but it is acceptable for a small number of queries:
select account
from accounts
where month = '201701';
Then repeat the query for the remaining buckets:
'201702'
'201703'
and so on.
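
As a rough sketch of the application-level part, the bucket names for the last N months could be computed like this. Python is used for illustration, and the function names and the "YYYYMM" bucket format are assumptions taken from the example above. The statement text is meant to be prepared once and executed with each bucket bound as a parameter.

```python
from datetime import date
from typing import List, Tuple

# Assumed statement, matching the accounts table above; '?' is a bind marker.
QUERY = "select account from accounts where month = ?"


def last_month_buckets(today: date, months: int) -> List[str]:
    """Return 'YYYYMM' bucket names for the last `months` months, oldest first."""
    buckets = []
    year, month = today.year, today.month
    for _ in range(months):
        buckets.append(f"{year:04d}{month:02d}")
        month -= 1
        if month == 0:  # step back over a year boundary
            year, month = year - 1, 12
    return list(reversed(buckets))


def bucket_queries(today: date, months: int) -> List[Tuple[str, Tuple[str]]]:
    """Pair the statement with the parameters for each bucket to query."""
    return [(QUERY, (bucket,)) for bucket in last_month_buckets(today, months)]
```

The results of the individual queries are then concatenated by the caller. The newest bucket may also need a created_time filter if only part of that month is wanted.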
If you have something really simple with, let's say, an expected 100,000 entries, then you could use a single-bucket variant of the schema and do something like:
create table accounts (
  bucket text,
  created_time timestamp,
  account text,
  PRIMARY KEY (bucket, created_time)
);
select account
from accounts
where bucket = 'some_predefined_name'
and created_time > '2016-10-04 00:00:00';
Once more as a wrap-up, with cassandra you always have to prepare the structures for the access pattern you are going to use.

how to query WADLogsTable by cloud service, role and role instance?

Say I want to fetch WADLogsTable logs for last 24h of a given cloud service, perhaps narrow down to a given role, and even further down to a given instance. What does the query look like? Is there any high level api for that?

As such there is no API specific for querying diagnostics data. Because diagnostics data is stored in Azure Tables, you can simply query the table to fetch the desired data.
One thing you will notice is that the PartitionKey in these tables represents time (rounded to the nearest minute), expressed in ticks. If you want to fetch the last "x" hours from WADLogsTable, here is what you need to do:
Convert the from and to date/time values (UTC) into ticks.
Prepend a 0 to each of them.
Use these in a query on PartitionKey. Assuming you're using the REST API, the query would look something like:
(PartitionKey ge 'from date/time ticks with prepended zero' and PartitionKey le 'to date/time ticks with prepended zero')
To filter further by role or role instance, add the appropriate filter criteria. For example:
(PartitionKey ge 'from date/time ticks with prepended zero' and PartitionKey le 'to date/time ticks with prepended zero') and (Role eq 'Role Name')
I wrote a blog post a long time back which you may find useful: http://gauravmantri.com/2012/02/17/effective-way-of-fetching-diagnostics-data-from-windows-azure-diagnostics-table-hint-use-partitionkey/ (Caution: The code showing the use of the storage client library is based on an older version of the library, so you may want to use it only for understanding purposes).
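
The steps above can be sketched in Python as follows. This is a minimal illustration, not an official API: the helper names are hypothetical, and it assumes .NET ticks (100-nanosecond intervals counted from 0001-01-01 UTC) and timezone-aware datetime inputs.

```python
from datetime import datetime, timezone
from typing import Optional

TICKS_PER_SECOND = 10_000_000  # one .NET tick is 100 nanoseconds
DOTNET_EPOCH = datetime(1, 1, 1, tzinfo=timezone.utc)


def to_ticks(moment: datetime) -> int:
    """Convert a timezone-aware datetime to .NET ticks using integer arithmetic."""
    delta = moment.astimezone(timezone.utc) - DOTNET_EPOCH
    return (delta.days * 86_400 + delta.seconds) * TICKS_PER_SECOND + delta.microseconds * 10


def wad_partition_key(moment: datetime) -> str:
    """Ticks with a 0 prepended, as described in the steps above."""
    return "0" + str(to_ticks(moment))


def wad_filter(start: datetime, end: datetime, role: Optional[str] = None) -> str:
    """Build the filter string for a time window, optionally narrowed to a role."""
    clause = (
        f"(PartitionKey ge '{wad_partition_key(start)}' "
        f"and PartitionKey le '{wad_partition_key(end)}')"
    )
    if role is not None:
        escaped = role.replace("'", "''")  # OData escapes a quote by doubling it
        clause += f" and (Role eq '{escaped}')"
    return clause
```

For the last 24 hours, end would be the current UTC time and start would be 24 hours earlier.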

Using unique partition keys will it create a range partition?

I'm considering using a timestamp for partition key for logging messages to always keep it ordered with the most recents first.
PartitionKey = String.Format("{0:D19}", DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks);
Since the key is not increasing every time with the same pattern, will the partitions still be grouped in a range partition in this case?
It's just for logging: there will be no updates, only inserts of new entries and maybe eventual deletion of older ones. The main properties to query are date range, user and entityName (where it happened).

Take a look at Understanding the Table Service Data Model and Designing a Scalable Partitioning Strategy for Azure Table Storage - A TableEntity already has a Timestamp property. Since you aren't doing updates the Timestamp should be sufficient for your date range query.

Azure Tables - Partition Key and Row Key - Correct Choice

I am new to Azure Tables and have read a lot of articles, but I would like some reassurance on the above, given that it's fundamental.
I have data which is similar to this:
CustomerId, GUID
TripId, GUID
JourneyStep, GUID
Time, DateTime
AverageSpeed, int
Based on what I have read, is CustomerId a good PartitionKey? Where I become stuck is the combination of CustomerId and TripId that does not make a unique row. My justification for TripId as the Row Key is because every query will be a dataset based on CustomerId and TripId.
Just for context, the CustomerId is clearly unique, the TripId represents one journey in a vehicle and within that journey the JourneyStep represents a unit within that Trip which may be 10 steps or 1000.
The intention is aggregate the data into further tables with each level being used for a different purpose. At the most aggregated level, the customer will be given some scores.
The amount of data will obviously be huge so need to think about query performance from the outset.
Updated:
As requested: the solution is for vehicle telematics, so think of yourself in your own car. A black box ships data to a server, which in turn passes it to Azure Tables. In relational DB terms, I would have a Customer table and a Trip table with a foreign key back to the Customer table.
The TripId is auto-generated by the black box. The TripId does not need to be stored by date/time from a query point of view, although that may be relevant from a query performance point of view.
Queries will be split into two:
Display a map of a single journey for each customer, so filter by customer and then Trip to then iterate each row (journeystep) to a map.
Per customer, I will score each trip and then retrieve trips for, let's say, the last month to aggregate a score. I do have SQL Database to enrich data with client records etc but for the volume data (the trip data) I wish to use Azure Tables.
The aggregates from the second query will probably be stored in a separate table, so if someone made 10 trips in one month, I would run the second query which would score each trip, then produce a score for all trips that month and store both answers so potentially a table of trip aggregates and a table of monthly aggregates.

The thing about the Partition Key is that it represents a logical grouping. You cannot, for example, insert data spanning multiple partition keys in a single batch transaction. Similarly, rows with the same partition key are likely to be stored on the same server, making it quick to retrieve all the data for a given partition key.
As such, it is important to look at your domain and figure out what aggregate you are likely to work with.
If I understand your domain model correctly, I would actually be tempted to use the TripId as the Partition Key and the JourneyStep as the Row Key.
You will also need a separate table that lists all the Trip IDs that belong to a given Customer, which sort of makes sense as you probably want to store some data, such as "trip name" etc., in such a table anyway.

Your design has to be driven by your queries. You can filter your data efficiently on two columns, PartitionKey and RowKey. PartitionKey is the most important column, since your queries will hit it first.
In your case CustomerId should be your PartitionKey, since most of the time you will reach your data by customer. (You may also need to keep another table for your client list.)
The RowKey can be your TripId or the time. If I were you, I would probably use a RowKey in the format yyyyMMddHHmm|tripId, which lets you run range queries with start and end bounds on the key.
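
A small sketch of what such a RowKey and a matching range filter might look like, in Python for illustration. The function names are hypothetical, and the filter assumes the PartitionKey is the CustomerId as suggested above. Because every key continues with "|tripId" after the minute, using the plain minute string as an exclusive upper bound cuts the range off cleanly.

```python
from datetime import datetime


def row_key(moment: datetime, trip_id: str) -> str:
    """Build a RowKey such as '201310220905|trip-1' that sorts by time."""
    return f"{moment:%Y%m%d%H%M}|{trip_id}"


def time_range_filter(customer_id: str, start: datetime, end: datetime) -> str:
    """Filter for one customer's rows with start <= time < end (minute precision)."""
    lower = f"{start:%Y%m%d%H%M}"  # sorts before every key in the start minute
    upper = f"{end:%Y%m%d%H%M}"    # sorts before every key in the end minute
    return (
        f"PartitionKey eq '{customer_id}' "
        f"and RowKey ge '{lower}' and RowKey lt '{upper}'"
    )
```

Keys built this way compare correctly as strings only because every date component is zero-padded to a fixed width.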

Adding to @Frans's answer:
One thing you could do is create a separate table for each customer. So you could have a table named like Customer. That way each customer's data is nicely segregated into different tables. Then you could use TripId as the PartitionKey and JourneyStep as the RowKey, as suggested by @Frans. For storing some metadata about the trip, instead of going to a separate table, I would still use the same table, but keep the RowKey empty for that entity and put the other information about the trip there.

I would suggest considering the following approach to your PK/RK design. I believe it would yield the best performance for your outlined queries:
PartitionKey: combination of CustomerId and TripId.
string.Format("{0}_{1}", customerId.ToString(), tripId.ToString())
RowKey: combination of the DateTime.MaxValue.Ticks - Time.Ticks formatted to a large 0-padded string with the JourneyStep.
string.Format("{0}_{1}", (DateTime.MaxValue.Ticks - Time.Ticks).ToString("00000000000000000"), JourneyStep.ToString())
Such combination will allow you to do the following queries "quickly".
Get data by CustomerId only. Example: context.Trips.Where(n => string.Compare(id + "_00000000-0000-0000-0000-000000000000", n.PartitionKey) <= 0 && string.Compare(id + "_zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz", n.PartitionKey) >= 0).AsTableServiceQuery(context);
Get data by CustomerId and TripId. Example: context.Trips.Where(n => n.PartitionKey == string.Format("{0}_{1}", customerId, tripId)).AsTableServiceQuery(context);
Get last X amount of journey steps if you were to search by either CustomerId or CustomerId/TripId by using the "Take" function
Get data via date-range queries by translating timestamps into Ticks
Save data into a trip with a single storage transaction (assuming you have less than 100 steps)
If you can guarantee uniqueness of Times of Steps within each Trip, you don't even have to put JourneyStep into the RowKey as it is somewhat inconvenient
The only downside to this schema is not being able to retrieve a particular single journey step without knowing its Time and Id. However, unless you have very specific use cases, downloading all of the steps inside a trip and then picking a particular one from the list should not be so bad.
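
To make the PartitionKey part of this scheme concrete, here is a hedged Python sketch of the key construction and the customer-only range. The helper names are hypothetical. It assumes GUIDs are formatted in lowercase, so that the "zzzz" sentinel sorts after any real GUID.

```python
import uuid
from typing import Tuple

GUID_MIN = "00000000-0000-0000-0000-000000000000"
GUID_MAX = "zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz"  # sorts after any lowercase GUID


def partition_key(customer_id: uuid.UUID, trip_id: uuid.UUID) -> str:
    """Combine CustomerId and TripId as 'customer_trip'."""
    return f"{customer_id}_{trip_id}"


def customer_bounds(customer_id: uuid.UUID) -> Tuple[str, str]:
    """Inclusive PartitionKey bounds covering every trip of one customer."""
    return f"{customer_id}_{GUID_MIN}", f"{customer_id}_{GUID_MAX}"


def customer_filter(customer_id: uuid.UUID) -> str:
    """Filter string selecting all partitions that belong to one customer."""
    lower, upper = customer_bounds(customer_id)
    return f"PartitionKey ge '{lower}' and PartitionKey le '{upper}'"
```

A query by CustomerId and TripId needs no range at all: it is an equality match on partition_key(customer_id, trip_id).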
HTH

Table design should optimize for two major capabilities of Azure Tables:
Scalability
Search performance
As @Frans already pointed out, Azure Tables uses the PartitionKey to decide how to scale out your data across multiple storage server nodes. Because of this, I would advise against having unique PartitionKeys, since in theory you will have Azure spanning out storage nodes that will each be able to serve one customer only. I say "in theory" because, in practice, Azure uses smart algorithms to identify whether there are patterns in your PartitionKeys and is thus able to group them (for example, if your IDs are consecutive numbers). You don't want to fall into this scenario, because the scalability of your storage will be unpredictable and in the hands of obscure algorithms that make those decisions. See HERE for more information about scalability.
Regarding performance, the fastest way to search is to hit both partitionkey+rowkey in your search queries. Contrary to Amazon DynamoDB, Azure Tables does not support secondary column indexes. If you have your search queries search for attributes stored in columns apart from those two, Azure will need to do a full table scan.
I faced a situation similar to yours, where the design of the partition/row keys was not trivial. In the end, we expanded our data model to include more information so we could design our table in such a way that ~80% of all search queries can be matched to partition+row keys, while the remaining 20% require a table scan. We decided to include the user's location, so our partition key is the user's country and the rowkey is a customer unique ID. This means our data model had to be expanded to include the user's country, which was not a big deal. Maybe you can do the same thing? Group your customers by segment, or by location, or by email address SMTP domain?

Is this a correct choice for partition and row key in Azure table?

I have a table that stores the online status of a user.
columns(userid, IsOnline,date)
If the user is online, the IsOnline bool flag is true; if the user goes offline, IsOnline is false. This way I can see between which times the user was online.
Would it be fine choosing partitionKey: userId,
Rowkey: the date and time of the event
?
The user cannot go online and offline at the same time, so the RowKey should be unique. What I like about this is that it keeps all data for a user in the same partition. Also, does choosing the date as the RowKey make sorting more efficient?

UserId is a good strategy for the PartitionKey.
With respect to RowKey, I would suggest using "DateTime.MaxValue.Ticks - dateOfEvent.Ticks", formatted to the maximum number of digits.
This will keep your RowKeys in descending time order and thus let you pick the latest status of the user without fetching all the data in the partition.
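
The reverse-ticks idea can be sketched like this, in Python for illustration. The function names are hypothetical. The constant is the value of DateTime.MaxValue.Ticks in .NET, and inputs are assumed to be timezone-aware datetimes.

```python
from datetime import datetime, timezone

TICKS_PER_SECOND = 10_000_000  # one .NET tick is 100 nanoseconds
DOTNET_EPOCH = datetime(1, 1, 1, tzinfo=timezone.utc)
MAX_TICKS = 3_155_378_975_999_999_999  # DateTime.MaxValue.Ticks


def to_ticks(moment: datetime) -> int:
    """Convert a timezone-aware datetime to .NET ticks using integer arithmetic."""
    delta = moment.astimezone(timezone.utc) - DOTNET_EPOCH
    return (delta.days * 86_400 + delta.seconds) * TICKS_PER_SECOND + delta.microseconds * 10


def reverse_ticks_row_key(moment: datetime) -> str:
    """Zero-pad to 19 digits so that newer events sort first as strings."""
    return f"{MAX_TICKS - to_ticks(moment):019d}"
```

Because table storage returns rows within a partition in ascending RowKey order, the first row for a user is then the most recent status.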

Sounds reasonable to me. This groups all of a given user's actions together in a single partition. Each action is then delineated by an individual row with the Timestamp for the key.
You might want to keep in mind that every row in Azure Table Storage has a Timestamp column that is populated automatically on create/update. You could consider using this column for your Timestamp but searching/sorting will be slow since it is part of the tertiary data set associated with a Table Storage row.
