How to store and read millions of records in Azure with secondary indexing

I need to store tens of millions of records in Azure. I need to be able to sort them by insert date in descending order. Finally, I need to issue where clauses on 3 fields.
public class Record
{
    public DateTime CreatedOn { get; set; }
    public string Filter1 { get; set; }
    public string Filter2 { get; set; }
    public bool Filter3 { get; set; }
}
I believe Azure Table Storage does not support secondary indexes, so I'm considering Cosmos DB. The problem is that each logical partition in Cosmos DB is limited to 20 GB, so I can't dump everything into a single partition. I'm thinking of slicing the data horizontally based on the CreatedOn field (dd/mm/yyyy without the time component). This helps me distribute the data, but I'm stuck on the query part when reading it back.
The data is user generated; thus, I have no idea what values CreatedOn will be assigned. If I partition on CreatedOn, how can I make my query smart so that I know what the next partition key is when I reach the end of the previous partition?
Example:
Record1 CreatedOn => 5/28/2020
Record2 CreatedOn => 5/28/2020
Record3 CreatedOn => 5/22/2020
Record4 CreatedOn => 5/10/2020
There could be a large gap between dates, so I don't want to brute force the next partition by scanning the range day by day.
Regarding queries:
Filter1 and Filter2 can have 5 and 12 different values respectively. Filter3 is just a boolean.
A sample query would look like:
select top 20 from record
where Filter1 = 'Value1' and Filter2 = 'Value2' and Filter3 = false
order by CreatedOn desc
Another sample query is the same as above without Filter2.
select top 20 from record
where Filter1 = 'Value1' and Filter3 = false
order by CreatedOn desc

Use the Cosmos DB SQL API. You can use ORDER BY to sort at query time.
All fields are indexed by default. You can customize the indexing policy to include composite indexes to make these queries more efficient.
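As a sketch of such a policy for the two sample queries (the property paths are assumed to match the Record class as stored; the ORDER BY direction must match the composite index's last entry):

```json
{
  "indexingMode": "consistent",
  "compositeIndexes": [
    [
      { "path": "/Filter1", "order": "ascending" },
      { "path": "/Filter2", "order": "ascending" },
      { "path": "/CreatedOn", "order": "descending" }
    ],
    [
      { "path": "/Filter1", "order": "ascending" },
      { "path": "/CreatedOn", "order": "descending" }
    ]
  ]
}
```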
Pick a PK that has a wide range of possible values, so that request unit (RU) consumption and data storage are spread evenly across all logical partitions. You should also know the PK value for the majority of your queries: you will use it in your WHERE clause so that the query engine only queries the specific partitions where the data of interest resides.
In your case, what should that PK property be?
public class Record
{
    public DateTime CreatedOn { get; set; }
    public string Filter1 { get; set; }
    public string Filter2 { get; set; }
    public bool Filter3 { get; set; }
}
"CreatedOn" is not a good choice. First, as you mentioned, you would not know its value at query time, so you would always have to do cross-partition queries. Second, all your writes for a given day would go to the same partition, resulting in a hot partition, which in turn causes rate limiting, inefficient use of the provisioned throughput, and higher costs.
Do any of the other properties help the cause? You may need to create another property, e.g. UserID; I'm just making one up since I don't know the context.
public class Record
{
    public DateTime CreatedOn { get; set; }
    public string Filter1 { get; set; }
    public string Filter2 { get; set; }
    public bool Filter3 { get; set; }
    public string UserID { get; set; }
}
Now you can query:
select top 20 from record
where Filter1 = 'Value1' and Filter2 = 'Value2' and Filter3 = false and UserID = 'somevalue'
order by CreatedOn desc
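Continuing the made-up UserID idea, here is a hedged sketch of how an application might derive a synthetic partition key. The bucket suffix, the bucket count, and all names here are assumptions, not part of the original model; the point is only that the key is computable at both write and query time, and that heavy users get spread across several logical partitions instead of piling into one:

```java
// Sketch: synthetic partition key = stable property + deterministic bucket.
// "userId", "recordId", and BUCKETS are illustrative assumptions.
public class PartitionKeys {
    static final int BUCKETS = 16; // spreads one user's data over 16 logical partitions

    // The same (userId, recordId) always yields the same key, so queries
    // that know both values can target a single partition.
    static String forRecord(String userId, long recordId) {
        int bucket = (int) (Math.abs(recordId) % BUCKETS);
        return userId + "#" + bucket;
    }

    public static void main(String[] args) {
        System.out.println(forRecord("user42", 7));  // user42#7
        System.out.println(forRecord("user42", 23)); // user42#7 (23 % 16 == 7)
    }
}
```

A query that only knows the userId would fan out over at most 16 partitions, which is still far cheaper than a full cross-partition scan.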

Related

How to use static column in scylladb and cassandra?

I am new to ScyllaDB and Cassandra, and I am facing some issues querying data from the table. Following is the schema I have created:
CREATE TABLE usercontacts (
    userID bigint,           -- userID
    contactID bigint,        -- contact ID (lynkApp userID)
    contactDeviceToken text, -- device token
    modifyDate timestamp static,
    PRIMARY KEY (contactID, userID)
);
CREATE MATERIALIZED VIEW usercontacts_by_contactid
AS SELECT userID, contactID, contactDeviceToken
FROM usercontacts
WHERE contactID IS NOT NULL AND userID IS NOT NULL AND modifyDate IS NOT NULL
-- Need the NOT NULLs as these are the primary keys in the main
-- table; same structure as the main table
PRIMARY KEY (userID, contactID);
CREATE MATERIALIZED VIEW usercontacts_by_modifyDate
AS SELECT userID,contactID,contactDeviceToken,modifyDate
FROM usercontacts WHERE
contactID IS NOT NULL AND userID IS NOT NULL AND modifyDate IS NOT NULL
-- Need to not null as these are the primary keys in main
-- table same structure as the main table
PRIMARY KEY (modifyDate,contactID);
I want to create materialized views for the contacts table, namely usercontacts_by_userid and usercontacts_by_modifydate.
I need the following queries to work when modifydate (timestamp) is static:
update usercontacts set modifydate='newdate' where contactid='contactid'
select * from usercontacts_by_modifydate where modifydate='modifydate'
delete from usercontacts where contactid='contactid'
It is not currently possible to create a materialized view that includes a static column, either as part of the primary key or just as a regular column.
Including a static column would require the whole base-table partition (usercontacts) to be read whenever the static column changes, so that the view rows could be recalculated. This has a significant performance penalty.
Having the static column be the view's partition key would mean there is only one entry in the view for all the rows of a partition. However, secondary indexes do work in this case, and you can use one instead.
This is valid for both Scylla and Cassandra at the moment.
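The secondary-index workaround mentioned above could look like this against the usercontacts table (a sketch; it assumes a Cassandra/Scylla version that supports indexing static columns):

```sql
-- Index the static column instead of building a view on it
CREATE INDEX usercontacts_modifydate_idx ON usercontacts (modifyDate);

-- Then query by the static column directly on the base table:
SELECT * FROM usercontacts WHERE modifyDate = '2017-01-01 00:00:00+0000';
```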

assign default value to DateTime property in table entity of azure storage table

How can I assign a default value to a DateTime property in a table entity of Azure Table Storage?
I am trying the code below, but I'm not able to set the value (Azure Table throws an error while inserting, as the DateTime still has 1/1/0001):
[DefaultValue(DateTime.Now)]
public DateTime LastUpdate { get; set; }
As per MSDN, the value that can go in a DateTime property is "A 64-bit value expressed as Coordinated Universal Time (UTC). The supported DateTime range begins from 12:00 midnight, January 1, 1601 A.D. (C.E.), UTC. The range ends at December 31, 9999."
So you cannot assign DateTime.MinValue. The lowest value you can store is new DateTime(1601, 1, 1).
As mentioned by @Aravind in the answer, the minimum value for a DateTime attribute supported by Azure Tables is Jan 1, 1601 UTC. Thus setting the default value to DateTime.MinValue will result in an error.
One possible solution is to make this attribute nullable in your model:
public DateTime? LastUpdate { get; set; }
So, when no value is supplied, the default would be null. You would need to remove the DefaultValue attribute, as mentioned in the comments above.

How to define a class that represents an age group?

I need to be able to define various age groups, i.e. 10-11, 12-13, etc.
How can I represent an AgeGroup class such that the age can appropriately group people together today, tomorrow, and in the future? I.e. today's 11-year-old could turn 12 tomorrow, in which case he would be grouped into the 12-13 group instead of the 10-11 group, as he was yesterday.
So far I have this:
public class AgeGroup
{
    public string Name { get; protected set; }
    public DateRange BirthDateRange { get; protected set; }
    public Guid Id { get; protected set; }
    public Status Status { get; protected set; }
    public virtual DateRange ApplicableDateRange { get; set; }

    public AgeGroup(Guid id, string name, Status status, DateRange birthDateRange, DateRange applicableDateRange)
    {
        Id = id;
        Name = name;
        Status = status;
        BirthDateRange = birthDateRange;
        ApplicableDateRange = applicableDateRange;
    }
}
public class DateRange
{
    public DateRange(DateTime startDate, DateTime endDate)
    {
        StartDate = startDate;
        EndDate = endDate;
    }

    DateTime _startDate;
    DateTime _endDate;

    public virtual DateTime StartDate
    {
        get { return _startDate; }
        set { _startDate = value.ToUniversalTime(); }
    }

    public virtual DateTime EndDate
    {
        get { return _endDate; }
        set { _endDate = value.ToUniversalTime(); }
    }
}
This class defines a range of time during which the age group is applicable. When the applicable date range has passed, we would need to shift everything in time. But is there a better way to do this?
If you're thinking of storing the age range within a person object of some description, I wouldn't, since (1) it's an easily calculable value from the birth date; and (2) it's not really a property of the object (it could be considered so but its transient nature will make things more complex than need be).
Instead, store only the birth date then provide some code which can give an age range based on that birth date and the current date.
You can still have age ranges but the people themselves do not exist "inside" them. Instead, you would be able to pass a person object to a function and have it give you back the relevant range.
And, if possible, the age ranges should be based on ages rather than birth dates, since the former is unchanging - you don't have to "move" objects between ranges since that's automatic when their range is calculated from their birth date and the current date.
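The age-based approach above can be sketched in a few lines; the two-year banding rule here is an assumption to match the question's 10-11 / 12-13 examples, not something from the original:

```java
import java.time.LocalDate;
import java.time.Period;

// Sketch: compute the age (and thus the group) on demand instead of storing it,
// so nobody ever has to be "moved" between groups.
public class Ages {
    static int ageOn(LocalDate birthDate, LocalDate onDate) {
        return Period.between(birthDate, onDate).getYears();
    }

    // Assumed banding: two-year groups with even lower bounds ("10-11", "12-13", ...)
    static String groupFor(int age) {
        int lower = (age / 2) * 2;
        return lower + "-" + (lower + 1);
    }

    public static void main(String[] args) {
        LocalDate birth = LocalDate.of(2008, 5, 28);
        System.out.println(groupFor(ageOn(birth, LocalDate.of(2020, 5, 28)))); // 12-13
    }
}
```

Because the group is derived from the birth date and the current date at call time, an 11-year-old who turns 12 tomorrow automatically lands in the 12-13 group tomorrow.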
If that's not possible (and from your supplied chart it appears it may be difficult), you can use birth dates as you currently are, but with a slight modification to allow for things to change from year to year.
Have a season object which has the following fields:
Start date.
A collection of range objects.
This will allow you to have multiple season objects if, for example, the age ranges change in following years. As long as they don't change, you'll only have a single season object with a start date of whenever your 2014/15 season starts.
The range objects themselves are then a collection of objects of the form:
Start birth date End birth date Range name
---------------- -------------- ----------
1993-01-01 1993-07-31 U22
1993-08-01 1994-07-31 U21
1994-08-01 1995-07-31 U20
:
2009-08-01 2010-07-31 U5
2010-08-01 2010-12-31 U4
Then, in order to find out which age group someone belongs to, you:
Find the current season object, based on the date of the season (usually today, though it's conceivable you may want to look at earlier seasons as well).
Go through all the range objects for that season until you find one where the person's birth date falls between the start and end dates for the range. If you don't find an eligible range, then they're not allowed to play.
Then extract the range name.
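The season/range lookup described in these steps could be sketched like this (all class and field names are illustrative, not taken from the question's code):

```java
import java.time.LocalDate;
import java.util.List;

// Sketch of the season -> ranges -> name lookup described above.
public class AgeGroups {
    static class Range {
        final LocalDate startBirth, endBirth;
        final String name;
        Range(LocalDate startBirth, LocalDate endBirth, String name) {
            this.startBirth = startBirth;
            this.endBirth = endBirth;
            this.name = name;
        }
    }

    static class Season {
        final LocalDate startDate;          // e.g. start of the 2014/15 season
        final List<Range> ranges;           // birth-date windows with names
        Season(LocalDate startDate, List<Range> ranges) {
            this.startDate = startDate;
            this.ranges = ranges;
        }
    }

    // Walk the season's ranges until one contains the person's birth date;
    // null means "no eligible range", i.e. not allowed to play.
    static String groupFor(Season season, LocalDate birthDate) {
        for (Range r : season.ranges) {
            if (!birthDate.isBefore(r.startBirth) && !birthDate.isAfter(r.endBirth)) {
                return r.name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Season s = new Season(LocalDate.of(2014, 8, 1), List.of(
                new Range(LocalDate.of(1993, 8, 1), LocalDate.of(1994, 7, 31), "U21"),
                new Range(LocalDate.of(1994, 8, 1), LocalDate.of(1995, 7, 31), "U20")));
        System.out.println(groupFor(s, LocalDate.of(1994, 3, 15))); // U21
        System.out.println(groupFor(s, LocalDate.of(1990, 1, 1)));  // null
    }
}
```

When the ranges change in a later year, you add a new Season with its own list of Range objects and pick the season by date, exactly as the steps above describe.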

Cassandra based Mahout user friend recommendations

I want to recommend to a user a list of users which the current user can add as friends.
I am using Cassandra and Mahout. There is already an implementation of CassandraDataModel in the mahout-integration package, and I want to use this class.
So my recommender class looks as follows:
public class UserFriendsRecommender {

    @Inject
    private CassandraDataModel dataModel;

    public List<RecommendedItem> recommend(Long userId, int number) throws TasteException {
        UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel);
        // Optional:
        userSimilarity.setPreferenceInferrer(new AveragingPreferenceInferrer(dataModel));
        UserNeighborhood neighborhood =
                new NearestNUserNeighborhood(3, userSimilarity, dataModel);
        Recommender recommender = new GenericUserBasedRecommender(dataModel, neighborhood, userSimilarity);
        Recommender cachingRecommender = new CachingRecommender(recommender);
        List<RecommendedItem> recommendations = cachingRecommender.recommend(userId, number);
        return recommendations;
    }
}
CassandraDataModel has 4 column families:
static final String USERS_CF = "users";
static final String ITEMS_CF = "items";
static final String USER_IDS_CF = "userIDs";
static final String ITEM_IDS_CF = "itemIDs";
I have a hard time understanding this class, especially the column families. Is there an example I can look at, or could someone explain it with a small example?
The javadoc says this:
* <p>
* First, it uses a column family called "users". This is keyed by the user ID
* as an 8-byte long. It contains a column for every preference the user
* expresses. The column name is item ID, again as an 8-byte long, and value is
* a floating point value represented as an IEEE 32-bit floating point value.
* </p>
*
* <p>
* It uses an analogous column family called "items" for the same data, but
* keyed by item ID rather than user ID. In this column family, column names are
* user IDs instead.
* </p>
*
* <p>
* It uses a column family called "userIDs" as well, with an identical schema.
* It has one row under key 0. It contains a column for every user ID in the
* model. It has no values.
* </p>
*
* <p>
* Finally it also uses an analogous column family "itemIDs" containing item
* IDs.
* </p>
All the following instructions about the column families required by CassandraDataModel should be performed in cassandra-cli under the keyspace you created (recommender or another name).
1: Table users
userID is the row key, each itemID has a separate column name, and value is the preference:
CREATE COLUMN FAMILY users
WITH comparator = LongType
AND key_validation_class=LongType
AND default_validation_class=FloatType;
Insert values:
set users[0][0]='1.0';
set users[1][0]='3.0';
set users[2][2]='1.0';
2: Table items
itemID is the row key, each userID has a separate column name, and value is the preference:
CREATE COLUMN FAMILY items
WITH comparator = LongType
AND key_validation_class=LongType
AND default_validation_class=FloatType;
Insert Values:
set items[0][0]='1.0';
set items[0][1]='3.0';
set items[2][2]='1.0';
3: Table userIDs
This table just has one row, but many columns, i.e. each userID has a separate column:
CREATE COLUMN FAMILY userIDs
WITH comparator = LongType
AND key_validation_class=LongType;
Insert Values:
set userIDs[0][0]='';
set userIDs[0][1]='';
set userIDs[0][2]='';
4: Table itemIDs:
This table just has one row, but many columns, i.e. each itemID has a separate column:
CREATE COLUMN FAMILY itemIDs
WITH comparator = LongType
AND key_validation_class=LongType;
Insert Values:
set itemIDs[0][0]='';
set itemIDs[0][1]='';
set itemIDs[0][2]='';
Complementing the answer above: for Cassandra 2.0 the new syntax is the following, since cassandra-cli is deprecated.
Table users:
CREATE TABLE users (userID bigint, itemID bigint, value float, PRIMARY KEY (userID, itemID));
Table items:
CREATE TABLE items (itemID bigint, userID bigint, value float, PRIMARY KEY (itemID, userID));
Table userIDs:
CREATE TABLE userIDs (id bigint, userID bigint, PRIMARY KEY (id, userID));
Table itemIDs:
CREATE TABLE itemIDs (id bigint, itemID bigint, PRIMARY KEY (id, itemID));
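For completeness, the cli `set` commands from the earlier answer map onto plain INSERTs in this CQL schema, e.g.:

```sql
-- Equivalent of: set users[0][0]='1.0';
INSERT INTO users (userID, itemID, value) VALUES (0, 0, 1.0);

-- Equivalent of: set userIDs[0][1]='';  (the value column is simply omitted)
INSERT INTO userIDs (id, userID) VALUES (0, 1);
```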

Querying columns with prefix not working using Astyanax

I have a schema as follows:
TimeStamp
Device ID
Device Name
Device Owner
Device location
I have created this column family using CQL and defined the primary key as (TimeStamp, Device ID, Device Name). Through a serializable object that has fields for the device ID, the device name, and a field name (which stores either Device Owner or Device Location), I have inserted some records using Astyanax.
As per my understanding, the columns for a row are created by combining Device ID, Device Name, and the field name as the column name, with the value being that particular field's value.
Thus, for a particular timestamp and device, the column names would follow the pattern (Device ID:Device Name: ...).
So I believe we can use these two fields as a prefix to obtain all the entries for a particular time-device combination.
I am using the following query to obtain the results:
RowSliceQuery<String, ApBaseData> query = adu.keyspace
.prepareQuery(columnFamily)
.getKeySlice(timeStamp)
.withColumnRange(new RangeBuilder()
.setStart(deviceID+deviceName+"_\u00000")
.setEnd(deviceID+deviceName+"_\uffff")
.setLimit(batch_size)
.build());
But on executing the above query I get the following Exception:
BadRequestException: [host=localhost(127.0.0.1):9160, latency=6(6), attempts=1]InvalidRequestException(why:Not enough bytes to read value of component 0)
@abhi The data I can have in this column family is as follows:
stime | devName | devId | Owner | Location
1361260800 | dev:01:02 | 1 | 1234 | 1
1361260800 | dev:02:03 | 2 | 34 | 2
1361260800 | dev:05:06 | 1 | 12 | 1
1361260800 | dev:03:02 | 2 | 56 | 3
The Java serializable class that I am using for this is:
public class BaseData implements Serializable {
    private static final long serialVersionUID = 1L;

    @Component(ordinal = 0)
    private String devName;

    @Component(ordinal = 1)
    private int devID;

    @Component(ordinal = 2)
    private String field;
}
Following the structure of the above class, I can see columns in the column family such as:
Column Name: dev\:01\:02:1:location Column Value: 00000001
FYI, I am using Astyanax 1.56.31.
You could use PlayOrm for Cassandra (which I hear is now working on MongoDB as well). Using that, you could create an entity and add the @NoSqlPartitionByField annotation to the beginOfMonthTimestamp AND the deviceId column, and in addition have the timestamp column. That then allows you to query into partitions like so:
PARTITIONS s('time', :partitionId) select s from TABLE as s where s.deviceName='mike'
This is OK only if you are sure your partitions will not grow to more than millions of rows; you can have an unlimited number of partitions, though. So basically, you might use beginOfMonth or beginOfWeek depending on how fast data is coming into the system. There is more info at
http://buffalosw.com/wiki/playorm-documentation/
Two things here:
1. You are using ":" in your device ID, which is also used by Cassandra for composite columns, so avoid it if you can.
2. Your column name is "dev\:01\:02:1:location", which is device name + device ID + field as per your test data. However, you are passing deviceID + deviceName + "_\u00000" in your query.
To overcome this problem, I changed the way I was storing data in the table. Instead of using a composite primary key, I changed the schema to have only a single primary key and dynamic columns. So now I have the primary key on stime, and my column names are constructed dynamically and explicitly by my insertion code, e.g. "dev\:01\:02:1:location", the same way they were before, but this has completely eliminated the need for a serialized object. This has solved my problem for now. Let's see what problems it poses later.
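The reshaped schema described here could be expressed in CQL roughly as follows (a sketch; the table and column names are assumed, not taken from the original code):

```sql
-- Single partition key (stime) with application-built dynamic column names
CREATE TABLE device_data (
    stime bigint,
    colname text,   -- e.g. 'dev:01:02:1:location', composed by the insert code
    value text,
    PRIMARY KEY (stime, colname)
);

-- Prefix scan for one device within a timestamp
-- (';' is the ASCII character after ':', bounding the prefix):
SELECT * FROM device_data
WHERE stime = 1361260800
  AND colname >= 'dev:01:02:' AND colname < 'dev:01:02;';
```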