I know secondary indexes are not here yet: they're on the wish list and "planned".
I'd like to get some ideas (or information from a reliable source) about the upcoming secondary indexes.
1st question: I noticed Microsoft has planned "secondary indexes". Does that mean we will be able to create as many indexes as we want on one table?
2nd question: The current index is "PartitionKey + RowKey". If the answer to the first question is no, will the secondary index be "RowKey + PartitionKey", or is there a good chance we will be able to customize it?
I'd like to gather some ideas because I am currently designing a table. Since there won't be much data at the beginning, I think I can wait for the secondary index feature instead of creating multiple tables right now.
Please share your ideas or any sources you have, thanks.
There's currently no information on secondary indexes, other than what's written at the site you referenced. So, there's no way to answer either of your two questions.
Several customers I work with who use Table Storage have taken the multiple-table approach to provide additional indexing. For those requiring extensive index coverage, that data has typically found its way into SQL Azure (or a combination of SQL Azure + Table Storage).
As a Windows Azure MVP I don't have any information about secondary indexes in the table service. If we need more indexes in the table service but don't want to use SQL Azure (not just because of the pricing...), then I would denormalize my data: split the same data into more than one table, with a different row key in each table acting as the index.
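To make that denormalization idea concrete, here is a minimal sketch using the classic Microsoft.WindowsAzure.Storage table SDK. The table names, the CustomerEntity shape and the choice of e-mail as the second key are just illustrative assumptions, not a prescription:

using Microsoft.WindowsAzure.Storage.Table;

// Hypothetical entity, stored twice: once per "index" table.
public class CustomerEntity : TableEntity
{
    public CustomerEntity() { }
    public CustomerEntity(string partitionKey, string rowKey)
    {
        PartitionKey = partitionKey;
        RowKey = rowKey;
    }
    public string Email { get; set; }
    public string Region { get; set; }
}

public static class CustomerWriter
{
    public static void Save(CloudTableClient client, string region, string email, string customerId)
    {
        // Primary layout: keyed the way most queries arrive (by region, then customer id).
        var byRegion = client.GetTableReference("CustomersByRegion");
        byRegion.CreateIfNotExists();
        byRegion.Execute(TableOperation.Insert(
            new CustomerEntity(region, customerId) { Email = email, Region = region }));

        // Manual "secondary index": the same data keyed by e-mail, so that lookup is also a point query.
        var byEmail = client.GetTableReference("CustomersByEmail");
        byEmail.CreateIfNotExists();
        byEmail.Execute(TableOperation.Insert(
            new CustomerEntity(email, customerId) { Email = email, Region = region }));
    }
}

Note that the two inserts go to different tables, so they can't share an entity group transaction; you have to decide yourself how to handle a failure between the two writes.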
This question is now two years old. And still no sign of secondary indexes in Azure Table Storage. My guess is that it is now very unlikely to ever eventuate.
Azure Cosmos DB provides the Table API for applications that are written for Azure Table storage and that need capabilities like:
Automatic Secondary Indexing
From: Introduction to Azure Cosmos DB: Table API
So if you are willing to move your tables over to Cosmos, then you will get all the indexing you could ever want.
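For illustration, a query like the one sketched below (using the Microsoft.Azure.Cosmos.Table SDK; the Email property is made up) filters on a plain property rather than on PartitionKey/RowKey. Against classic Table Storage that means a scan, while against the Cosmos DB Table API it is served by the automatic indexes:

using Microsoft.Azure.Cosmos.Table;

public static class NonKeyQuery
{
    public static void Run(CloudTable table)
    {
        // Filter on a non-key property; Cosmos DB's automatic indexing covers it.
        var query = new TableQuery<DynamicTableEntity>().Where(
            TableQuery.GenerateFilterCondition("Email", QueryComparisons.Equal, "user@example.com"));

        foreach (var entity in table.ExecuteQuery(query))
        {
            // process each match...
        }
    }
}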
Related
We are using Cassandra 3 and have come up with a model based on the initial requirements. Since there have been very frequent requirement changes, this model has subsequently changed many times as well. Given these requirement and model changes, there has been no major improvement in terms of development, so the team has decided to go with the BLOB data type and store the entire data in the BLOB. Can you please share the drawbacks of using a BLOB in such a scenario? Thanks in advance.
We migrated from Astyanax on Cassandra 1.1 directly to CQL on Cassandra 3.0, so we still have a lot of column families whose values are stored as BLOBs.
Major issues we face right now are:
1) It is difficult to visualize data directly from the database: the biggest advantage of CQL is that it supports SQL-like queries, so logging into the cql terminal and getting results directly from there normally saves a lot of time. If you use a BLOB you will not be able to do any of that.
2) CQL performs better when your table has a well-defined schema than when you use a BLOB to store a big chunk of data together.
If you are creating a new table, I would suggest using Collections for your use case. You will be able to store different types of data and performance will also be good.
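As a rough illustration of the difference (using the DataStax C# driver; the keyspace, table and columns here are made up), a map collection keeps individual fields visible to CQL instead of burying everything in one opaque blob column:

using System.Collections.Generic;
using Cassandra;

public static class CollectionsSketch
{
    public static void Run()
    {
        var cluster = Cluster.Builder().AddContactPoint("127.0.0.1").Build();
        var session = cluster.Connect("demo_ks");

        // Typed columns plus a map collection: still flexible, but readable from cqlsh.
        session.Execute(@"CREATE TABLE IF NOT EXISTS user_profile (
                              user_id    uuid PRIMARY KEY,
                              name       text,
                              email      text,
                              attributes map<text, text>)");

        session.Execute(new SimpleStatement(
            "INSERT INTO user_profile (user_id, name, email, attributes) VALUES (uuid(), ?, ?, ?)",
            "Alice", "alice@example.com",
            new Dictionary<string, string> { { "theme", "dark" }, { "lang", "en" } }));
    }
}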
Here are some nice slides comparing the performance of schemaless tables and tables with a schema and collections. You can skip to slide 26 if you just want the summary.
https://www.slideshare.net/DataStax/migration-from-thrift-to-cql-brij-bhushan-ravat-ericsson-cassandra-summit-2016
I am designing an app with a feature that allows a user to store geolocation-based data, and later allows other users to query for the data that falls within a given radius of their current location.
The question is: what is the best approach to designing the table so that it is scalable and performs well? I was thinking of having a table containing latitude as the partition key (pk) and longitude as the row key (rk), plus a dataid column that maps to another table that uses that dataid as its partition key.
My thinking is that this two-way lookup will boost performance since both lookups would be instant. Is this the right way of thinking? I was reading somewhere that looking for partition keys that fall into some range is bad. If that is the case, how should I approach this? How do Google Maps and Apple's MapKit typically implement this feature?
This is difficult to answer completely without knowing your access patterns and scenarios fully. You can use the recently published Azure Storage Table design guide to help come up with a good design for your problem
http://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/
For reads, Azure Table Storage is designed for fast point queries where the client knows the partition key and row key, so you need to factor that into your data model. For writes, you want a uniform distribution and should avoid append/prepend patterns to achieve high scale.
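A point query in code is simply a retrieve by both keys. The sketch below uses the Microsoft.WindowsAzure.Storage SDK; the LocationEntity type and its DataId property are assumed from the question, not a fixed schema:

using Microsoft.WindowsAzure.Storage.Table;

// Assumed shape: the row that maps a (latitude, longitude) key pair to a dataid.
public class LocationEntity : TableEntity
{
    public string DataId { get; set; }
}

public static class PointQuerySketch
{
    public static LocationEntity Get(CloudTable table, string partitionKey, string rowKey)
    {
        // Both keys are known up front, so this touches exactly one partition and one row.
        var result = table.Execute(TableOperation.Retrieve<LocationEntity>(partitionKey, rowKey));
        return (LocationEntity)result.Result;
    }
}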
I am working with Azure table services. What I would like to ask is whether the Azure internals care about the notion of a Table at all.
Making things fast largely depends on the partition key and row key. Tables do not look like containers or groupings of entities, because there is no limit on the number of tables you can create and the total storage size is tied to my storage account.
So is a Table just a notion to help people transition from RDBMS land, or does it serve a purpose internally? Can I build applications with a single-table design without worrying about performance? After all, if a table is just a tag then I may as well include it as part of the partition key.
EDIT
To give an example, Table partition keys look very much like Cassandra rows, and Table rows are like Cassandra columns. It is okay to treat the storage as a big bucket of key (RowKey)-value pairs. Partition keys are a sharding mechanism. The table then just comes across as a "labeling" notion.
You can have all your entities in a single Azure table and get optimum performance by thoughtfully choosing the right PartitionKey and RowKey combination, but IMHO it is always better to keep related entities in separate tables, as that is easier to manage (from a developer perspective): you know what part of your application is hitting which table.
You may want to view this recorded session which discusses best practices and internals.
Windows Azure Storage: What’s Coming, Best Practices, and Internals
You can think of the union of a Table name and a Partition Key as being a unit of performance.
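If you do go the single-table route from the question, the usual trick is to fold the "table name" into the PartitionKey as a type prefix. A rough sketch (entity shapes and key formats are made up for illustration):

using Microsoft.WindowsAzure.Storage.Table;

// Two logical kinds of entity living in one physical table,
// distinguished by a type tag baked into the PartitionKey.
public class PersonRow : TableEntity
{
    public PersonRow() { }
    public PersonRow(string personId)
    {
        PartitionKey = "Person_" + personId;
        RowKey = personId;
    }
    public string Name { get; set; }
}

public class OrderRow : TableEntity
{
    public OrderRow() { }
    public OrderRow(string personId, string orderId)
    {
        PartitionKey = "Order_" + personId; // type tag + natural partition
        RowKey = orderId;
    }
    public double Total { get; set; }
}

Since the partition is still the unit that matters, this performs the same as separate tables; the trade-off is really about manageability, as noted above.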
You're probably on the mindset that will re-invent FatEntities, developed by Lokad.
The concept of a "Table" has many restrictions and issues. For example, if you have a large table with 100,000 entries in a partition, then you can't easily jump to entry 99,001 without iterating through each of the entries. Going backwards is impossible (you can't start at the last entry and work backwards).
I believe in the past the answer to this question was no. However, has anything changed with the recent releases, or does anyone know of a way that I can do this? I am using DataTables and would love to be able to do something like skip 50 and retrieve 50 rows, skip 100 and retrieve 50 rows, etc.
It is still not possible to skip rows. The only navigation construct supported is top. The Table Service REST API is the definitive way to access Windows Azure Storage, so its documentation is the go-to location for what is or is not possible.
What you're asking here is possible using continuation tokens. Scott Densmore blogged about this a while ago to explain how you can use continuation tokens for paging when you're displaying a table (like what you're asking here with DataTables): Paging with Windows Azure Table Storage. The blog post shows how to display pages of 3 items while using continuation tokens to move forward and back between pages.
Besides that there's also Steve's post that describes the same concept: Paging Over Data in Windows Azure Tables
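The pattern in both posts boils down to fetching one segment at a time and holding on to the continuation token between requests. A minimal sketch with the Microsoft.WindowsAzure.Storage SDK (page size and entity type are placeholders):

using Microsoft.WindowsAzure.Storage.Table;

public static class PagingSketch
{
    public static TableQuerySegment<DynamicTableEntity> GetPage(
        CloudTable table, TableContinuationToken token, int pageSize)
    {
        // Take() caps the segment at one page; pass null as the token for the first page.
        var query = new TableQuery<DynamicTableEntity>().Take(pageSize);
        var segment = table.ExecuteQuerySegmented(query, token);

        // segment.Results holds up to pageSize entities;
        // segment.ContinuationToken is null once the last page has been read.
        return segment;
    }
}

Tokens only move forward, so "previous page" means caching the tokens you have already seen, which is essentially what the linked posts do.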
Yes (kinda) and no. No, in the sense that the Skip operation is not directly supported at the REST head. You could of course do it in memory, but that would defeat the purpose.
However, you can of course actually do this pattern if you structure your data correctly. We do something like this ourselves. We align our partition key to the datetime and use the RowKey as a discriminator. This means we can always pinpoint the partition range we are interested in and then Take() some amount of data. So, for example, we can easily Take() the first 20 rows per hour by specifying a unique query (skipping over data we don't want). The partition key is simply aligned per hour and then we optionally discriminate further using the RowKey - finally, we just take data. When executed in parallel, this works just dandy.
Again, the more technically correct answer is NO. However, you can approximate it cleverly using the PK and RK.
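A sketch of that shape (the hour-aligned key format and entity type here are assumptions, not the exact scheme described above):

using System;
using System.Collections.Generic;
using Microsoft.WindowsAzure.Storage.Table;

public static class HourlyTakeSketch
{
    public static IList<DynamicTableEntity> FirstRowsOfHour(CloudTable table, DateTime hour, int count)
    {
        // Partition key aligned per hour, e.g. "2013060514".
        string partitionKey = hour.ToString("yyyyMMddHH");

        var query = new TableQuery<DynamicTableEntity>()
            .Where(TableQuery.GenerateFilterCondition(
                "PartitionKey", QueryComparisons.Equal, partitionKey))
            .Take(count);

        // One segment is enough here: it returns at most 'count' rows from that hour's partition.
        return table.ExecuteQuerySegmented(query, null).Results;
    }
}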
Hey everyone, I am building a website that needs to be able to store the physical location of a person and allow someone to search for other people within some radius of that location. For the sake of an example let's pretend it's a dating website (it's not), so as a user you want to find people within 50 miles of your current location who meet some other set of search criteria.
Currently I am storing all of my user information in Azure Table Storage. However this is the first time I've ever attempted to create a geo-aware search algorithm so I wanted to verify that I am not going to waste my time doing something totally insane. To that end the idea for my implementation is as follows:
Store the longitude and latitude for the user's location in the Azure Table
The PartitionKey for each entry is the state (or country if outside the US) that the person lives in
I want to calculate the distance between the current user and all other users using the haversine equation. I'm assuming I can embed this into my LINQ query?
Eventually I'd like to be able to get a list of all states/countries within the radius so I can optimize based on PartitionKeys
Naturally this implementation strategy leads to a few questions:
If I do the haversine equation in a LINQ query, where is it being executed? The last thing I want is for my WebRole to pull every record from Azure storage and run the haversine equation on it within the application process
Is the idea of doing this with Azure Storage totally crazy? I know there are solutions like MongoDB that have the capability to do spatial search queries built-in. I like the scalability of Azure but is there some better alternative I should investigate instead?
Thanks for the help in advance!
If you want the query to be fast against Azure tables, the query has to run against the partition key and row key. Also, if you're using LINQ to query Azure tables, you need to be careful about which functions you use. If the value can be calculated once for all rows, LINQ will be clever and evaluate it before sending the query to AZT; however, if it needs to be evaluated for each row, then you can only use the functions that Azure supports (which is a pretty short list).
You might be able to make this work if you use a bounding square for your area rather than a bounding circle. Store latitude in the PartitionKey and longitude in the RowKey and have a query that looks like this:
// PartitionKey and RowKey are strings, so minimumLatitude etc. must be encoded
// in a fixed-width, lexicographically sortable form for CompareTo to match numeric order.
var query = from UserLocation ul in repository.All()
            where ul.PartitionKey.CompareTo(minimumLatitude) > 0
               && ul.PartitionKey.CompareTo(maximumLatitude) < 0
               && ul.RowKey.CompareTo(minimumLongitude) > 0
               && ul.RowKey.CompareTo(maximumLongitude) < 0
            select ul;
This is probably not as clever as you were hoping for though. If this is not going to work for you, then you'll need to look at other options. SQL Azure supports geospatial queries if you want to stay within the Microsoft family.
Having used spatial queries with SQL Azure (approximately 5 million records) I can confirm that it can be very quick - it may well be worth a look for you.
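If you do stay on Table Storage with the bounding-box query above, the exact radius check from the question can still be done, just client-side over the pre-filtered candidates rather than inside the table query. A rough sketch (the class below is a simplified stand-in for the UserLocation entity, assuming numeric copies of the coordinates are stored as extra properties):

using System;
using System.Collections.Generic;
using System.Linq;

// Simplified stand-in for the UserLocation entity above.
public class UserLocation
{
    public double Latitude { get; set; }
    public double Longitude { get; set; }
}

public static class RadiusFilterSketch
{
    const double EarthRadiusMiles = 3958.8;

    // Great-circle (haversine) distance between two points, in miles.
    public static double HaversineMiles(double lat1, double lon1, double lat2, double lon2)
    {
        double ToRad(double deg) => deg * Math.PI / 180.0;

        double dLat = ToRad(lat2 - lat1);
        double dLon = ToRad(lon2 - lon1);
        double a = Math.Sin(dLat / 2) * Math.Sin(dLat / 2) +
                   Math.Cos(ToRad(lat1)) * Math.Cos(ToRad(lat2)) *
                   Math.Sin(dLon / 2) * Math.Sin(dLon / 2);
        return EarthRadiusMiles * 2 * Math.Asin(Math.Sqrt(a));
    }

    // Run this in the web role over the rows the bounding-box query already narrowed down,
    // never over the whole table.
    public static IEnumerable<UserLocation> WithinRadius(
        IEnumerable<UserLocation> candidates, double lat, double lon, double radiusMiles)
    {
        return candidates.Where(u => HaversineMiles(lat, lon, u.Latitude, u.Longitude) <= radiusMiles);
    }
}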