What does "unique records" mean in GetStream?

The GetStream pricing page mentions that the Free plan has 600,000 unique records. Is there a section in the dashboard where I can see how many unique records I have used?
Also, what exactly does a unique record mean?

Unique records are all activities and feeds.

Related

Cassandra data modeling for one-to-many lookup

Consider the problem of storing users and their contacts. There are about 100 million users, each has a few hundred contacts, and on average a contact is about 1 KB in size. There may be some users with very many contacts (>5000), and there may be some contacts that are much (say 10x) bigger than the 1 KB average. Users actively add contacts and, less often, also delete them. Contacts are not pointers to other users but just a bundle of information.
There are two kinds of queries:
Given a user and a contact name, look up the contact details.
Given a user, look up all associated contact names.
I was thinking of a contacts table like this -
CREATE TABLE contacts (
    user_name text,
    contact_name text,
    contact_details map<text, text>,
    PRIMARY KEY ((user_name, contact_name))  -- note the composite partition key
);
The choice of composite partition key is due to the number and size of contacts per user. I wanted one contact per row.
This table easily addresses the query of looking up a contact's details given a user and a contact name.
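For concreteness, that first lookup is a single point read against the composite partition key:
SELECT contact_details FROM contacts WHERE user_name = ? AND contact_name = ?;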
I'm looking for suggestions to address the second query.
Two options (with related concerns) are on my mind:
Create a second table called contact_names_by_user, with user_name as the partition key and contact_name as a clustering key. Concern: if there is a user with far too many contacts (say 20k), would that result in a non-optimally wide row?
Create an index on user_name. Concern: given the ratio of the total number of users (100M) to the average contacts per user (say 200), would that column be considered high-cardinality, and hence bad for indexing?
In general, are there guidelines for looking up many items (like contacts here) referenced by one item (like a user here) without running into wide rows or non-optimal indexes?
Creating the index itself should not be a problem, IMHO. An average cardinality of 200 sounds fine.
The other option is maintaining your own index, like:
CREATE TABLE contacts_by_user (
    user_name text PRIMARY KEY,
    contacts set<text>
);
though your index and contacts can go out of sync.
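For completeness, a minimal sketch of the lookup-table option from the question (option 1); it keeps one partition per user and avoids the out-of-sync risk of a denormalized set:
CREATE TABLE contact_names_by_user (
    user_name text,
    contact_name text,
    PRIMARY KEY (user_name, contact_name)  -- user_name partitions, contact_name clusters
);
-- Query 2: all contact names for a given user, served from a single partition
SELECT contact_name FROM contact_names_by_user WHERE user_name = ?;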

Is there a way to "Expire" items in an Azure Search Index?

In Azure Search is there a mechanism to set an "Expiration Date" on items within the index? I have a need for items to only be in the search index for a pre-defined period of time.
Not at this time. For now, you need to send a delete request to delete an item in an index.
We often refer to this capability as Time to Live. It would be great if you could vote for this feature to help us prioritize it, if you would find it valuable.
http://feedback.azure.com/forums/263029-azure-search/suggestions/6328648-time-to-live-for-data
Liam
As Liam mentions, there isn't one at the moment.
One option might be to add an "Expiry" field of type Edm.DateTimeOffset to your documents and have all your queries request only documents whose expiry date is greater than the current timestamp.
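As a rough sketch, such a filter would look something like this against the REST API (the service, index, and field names, and the API version, are illustrative):
GET https://[service].search.windows.net/indexes/myindex/docs?search=*&$filter=Expiry gt 2016-01-01T00:00:00Z&api-version=2015-02-28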

Show specific document on top in search for specific keywords in Solr

Suppose I have 1000 sellers (S1.....S1000) of apparel listed on my site. Since all the sellers are paying some amount to me, I am giving them equal weighting, and the results are shown based on relevancy.
Now I am planning to start a premium service, where I list one supplier on top for certain keywords in the search results. Say S1 has been given premium placement for the keyword 'jeans'; if a user searches 'jeans', I first want to display this supplier on top, then display the other suppliers based on relevancy. Also, this premium service is only for one month, so another supplier, say S2, can avail of this service in the next month, and so on.
Is there any plugin wherein I can store which supplier should be shown for which keyword? I am even OK with making 2 queries to achieve the desired results.
Please suggest.
I think the Query Elevation Component is your friend: you can configure which documents (and hence which suppliers) come first for any given query; see
https://wiki.apache.org/solr/QueryElevationComponent
If that's too much work, you could also add a new boolean field in your documents, indicating whether the document is to be promoted or not, and in the query, sort by this field first (so promoted documents come on top), and by score next (so most relevant documents come right after the promoted ones).
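For reference, a minimal sketch of the elevation config (document ids are illustrative); elevate.xml maps a query string to the documents that should be pinned on top:
<elevate>
  <query text="jeans">
    <doc id="S1-doc-1" />
  </query>
</elevate>
The boolean-field alternative is just a sort clause, e.g. q=jeans&sort=promoted desc,score desc (assuming a sortable boolean field named promoted).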
You can maybe also use the Re-Ranking component:
https://cwiki.apache.org/confluence/display/solr/Query+Re-Ranking
Using a query like this:
q=jean&rq={!rerank reRankQuery=$rqq reRankDocs=1000 reRankWeight=3}&rqq=(brand:S1)
The top 1000 results from the query 'jean' will be re-ranked thanks to the boost (of 3) added to documents that contain the field brand with the value S1.
It can be useful, but in your case I think the QueryElevationComponent is the best fit.
Be careful: re-ranking is only available since Solr 4.9.

Azure Tables - Partition Key and Row Key - Correct Choice

I am new to Azure Tables and have read a lot of articles, but I would like some reassurance on the choice below, given how fundamental it is.
I have data which is similar to this:
CustomerId, GUID
TripId, GUID
JourneyStep, GUID
Time, DateTime
AverageSpeed, int
Based on what I have read, is CustomerId a good PartitionKey? Where I get stuck is that the combination of CustomerId and TripId does not make a unique row. My justification for TripId as the RowKey is that every query will be a dataset based on CustomerId and TripId.
Just for context: the CustomerId is clearly unique, the TripId represents one journey in a vehicle, and the JourneyStep represents a unit within that trip, of which there may be 10 or 1000.
The intention is to aggregate the data into further tables, with each level being used for a different purpose. At the most aggregated level, the customer will be given some scores.
The amount of data will obviously be huge, so I need to think about query performance from the outset.
Updated:
As requested: the solution is for vehicle telematics, so think of yourself in your own car. A black box ships data to a server, which in turn passes it to Azure Tables. In relational DB terms, I would have a customer table and a trip table with a foreign key back to the customer table.
The TripId is auto-generated by the black box. TripId does not need to be ordered by date/time from a query point of view, although that may be relevant from a query performance point of view.
Queries will be split into two:
Display a map of a single journey for each customer: filter by customer and then trip, then iterate over each row (journey step) to plot the map.
Per customer, I will score each trip and then retrieve trips for, let's say, the last month to aggregate a score. I do have a SQL Database to enrich the data with client records etc., but for the volume data (the trip data) I wish to use Azure Tables.
The aggregates from the second query will probably be stored in a separate table, so if someone made 10 trips in one month, I would run the second query to score each trip, then produce a score for all trips that month, and store both answers: potentially a table of trip aggregates and a table of monthly aggregates.
The thing about the PartitionKey is that it represents a logical grouping; you cannot insert data spanning multiple partition keys in a single batch transaction, for example. Similarly, rows with the same partition key are likely to be stored on the same server, making it quick to retrieve all the data for a given partition key.
As such, it is important to look at your domain and figure out what aggregate you are likely to work with.
If I understand your domain model correctly, I would actually be tempted to use the TripId as the Partition Key and the JourneyStep as the Row Key.
You will, separately, need a table that lists all the TripIds that belong to a given customer - which sort of makes sense, as you probably want to store some data, such as "trip name", in such a table anyway.
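A minimal sketch of that per-customer trip list, assuming the classic WindowsAzure.Storage SDK (class and property names are illustrative):
using System;
using Microsoft.WindowsAzure.Storage.Table;

public class TripEntity : TableEntity
{
    public string TripName { get; set; }

    public TripEntity(Guid customerId, Guid tripId)
    {
        PartitionKey = customerId.ToString();  // all trips for a customer live in one partition
        RowKey = tripId.ToString();
    }

    public TripEntity() { }  // parameterless constructor required by the SDK
}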
Your design has to be driven by your queries. You can filter your data on two columns, PartitionKey and RowKey. PartitionKey is your most important column, since your queries will hit that column first.
In your case CustomerId should be your PartitionKey since most of the time you will try to reach your data based on the customer. (you may also need to keep another table for your client list)
Now, the RowKey can be your TripId or the time. If I were you, I would probably use a RowKey in yyyyMMddHHmm|tripId format, which will let you query with startsWith- and endsWith-style range comparisons.
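Azure Tables has no native startsWith operator, but a prefix match can be expressed as a range over the RowKey. A rough sketch with the classic WindowsAzure.Storage SDK (variable names are illustrative):
using Microsoft.WindowsAzure.Storage.Table;

// All trips for a customer in January 2016, i.e. RowKeys of the form "201601...|tripId"
string prefix = "201601";
string filter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, customerId),
    TableOperators.And,
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, prefix),
        TableOperators.And,
        // '~' (0x7E) sorts after both the digits and the '|' separator, closing the prefix range
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThan, prefix + "~")));
var query = new TableQuery<DynamicTableEntity>().Where(filter);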
Adding to @Frans's answer:
One thing you could do is create a separate table for each customer, i.e. a table named after each customer. That way each customer's data is nicely segregated into different tables. Then you could use TripId as the PartitionKey and JourneyStep as the RowKey, as suggested by @Frans. For storing metadata about the trip, instead of going to a separate table, I would still use the same table, but keep the RowKey empty and put the other information about the trip there.
I would suggest considering the following approach to your PK/RK design. I believe it would yield the best performance for your outlined queries:
PartitionKey: combination of CustomerId and TripId.
string.Format("{0}_{1}", customerId.ToString(), tripId.ToString())
RowKey: a combination of DateTime.MaxValue.Ticks - Time.Ticks, formatted as a zero-padded string, with the JourneyStep.
string.Format("{0}_{1}", (DateTime.MaxValue.Ticks - Time.Ticks).ToString("D19"), JourneyStep.ToString())
Such a combination will allow you to do the following queries quickly:
Get data by CustomerId only. Example: context.Trips.Where(n => string.Compare(id + "_00000000-0000-0000-0000-000000000000", n.PartitionKey) <= 0 && string.Compare(id + "_zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz", n.PartitionKey) >= 0).AsTableServiceQuery(context);
Get data by CustomerId and TripId. Example: context.Trips.Where(n => n.PartitionKey == string.Format("{0}_{1}", customerId, tripId)).AsTableServiceQuery(context);
Get the last X journey steps, whether you search by CustomerId or by CustomerId/TripId, by using the "Take" function
Get data via date-range queries by translating timestamps into Ticks
Save data into a trip with a single storage transaction (assuming you have less than 100 steps)
If you can guarantee uniqueness of the Times of the Steps within each Trip, you don't even have to put JourneyStep into the RowKey, as it is somewhat inconvenient there
The only downside to this schema is not being able to retrieve a particular single journey step without knowing its Time and Id. However, unless you have very specific use cases, downloading all of the steps inside a trip and then picking a particular one from the list should not be so bad.
HTH
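Putting the suggested keys together, a minimal entity sketch under the classic WindowsAzure.Storage SDK (class and property names are illustrative):
using System;
using Microsoft.WindowsAzure.Storage.Table;

public class JourneyStepEntity : TableEntity
{
    public double AverageSpeed { get; set; }

    public JourneyStepEntity(Guid customerId, Guid tripId, DateTime time, Guid journeyStep)
    {
        PartitionKey = string.Format("{0}_{1}", customerId, tripId);
        // Reverse the ticks so newer steps sort first; pad to 19 digits so lexical order matches numeric order
        RowKey = string.Format("{0}_{1}", (DateTime.MaxValue.Ticks - time.Ticks).ToString("D19"), journeyStep);
    }

    public JourneyStepEntity() { }  // parameterless constructor required by the SDK
}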
Your table storage design should optimize for the two major capabilities of Azure Tables:
Scalability
Search performance
As @Frans already pointed out, Azure Tables uses the PartitionKey to decide how to scale out your data across multiple storage server nodes. Because of this, I would advise against having unique partition keys, since in theory you would have Azure spreading your data out across storage nodes that are each able to serve one customer only. I say "in theory" because, in practice, Azure uses smart algorithms to identify patterns in your partition keys and group them (for example, if your ids are consecutive numbers). You don't want to fall into this scenario, because the scalability of your storage would be unpredictable and at the mercy of the obscure algorithms making those decisions. See HERE for more information about scalability.
Regarding performance, the fastest way to search is to hit both PartitionKey and RowKey in your search queries. Contrary to Amazon DynamoDB, Azure Tables does not support secondary indexes. If your queries filter on attributes stored in columns other than those two, Azure will need to do a full table scan.
I faced a situation similar to yours, where the design of the partition/row keys was not trivial. In the end, we expanded our data model to include more information so we could design our table in such a way that ~80% of all search queries can be matched to partition+row keys, while the remaining 20% require a table scan. We decided to include the user's location, so our partition key is the user's country and the rowkey is a customer unique ID. This means our data model had to be expanded to include the user's country, which was not a big deal. Maybe you can do the same thing? Group your customers by segment, or by location, or by email address SMTP domain?

Azure - Querying 200 million entities

I have a need to query a store of 200 million entities in Windows Azure. Ideally, I would like to use the Table Service, rather than SQL Azure, for this task.
The use case is this: a POST containing a new entity will be incoming from a web-facing API. We must query about 200 million entities to determine whether or not we may accept the new entity.
With the entity limit of 1,000: does this apply to this type of query, i.e. do I have to query 1,000 at a time and perform my comparisons / business rules, or can I query all 200 million entities in one shot? I think I would hit a timeout in the latter case.
Ideas?
Expanding on Shiraz's comment about Table storage: Tables are organized into partitions, and then your entities are indexed by a Row key. So, each row can be found extremely fast using the combination of partition key + row key. The trick is to choose the best possible partition key and row key for your particular application.
For your example above, where you're searching by telephone number, you can make TelephoneNumber the partition key. You could very easily find all rows related to that telephone number (though, not knowing your application, I don't know just how many rows you'd be expecting). To refine things further, you'd want to define a row key that you can index into, within the partition key. This would give you a very fast response to let you know whether a record exists.
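As a rough sketch with the storage SDK (table and variable names are illustrative), fetching every row for a phone number is then a single-partition query:
using Microsoft.WindowsAzure.Storage.Table;

var query = new TableQuery<DynamicTableEntity>().Where(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, telephoneNumber));
foreach (var row in table.ExecuteQuery(query))
{
    // apply your business rules / comparisons to each matching row
}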
Table storage (actually Azure Storage in general - tables, blobs, queues) has a well-known SLA. You can execute up to 500 transactions per second on a given partition. With the example above, the query for rows for a given telephone number would equate to one transaction (unless you exceed 1,000 rows returned - to see all rows, you'd need additional fetches); adding a row key to narrow the search would, indeed, yield a single transaction. So would inserting a new row. You can also batch up multiple row inserts within a single partition and save them in a single transaction.
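A rough sketch of such a batched insert (up to 100 entities, all sharing the same PartitionKey; names are illustrative):
using Microsoft.WindowsAzure.Storage.Table;

var batch = new TableBatchOperation();
foreach (var entity in entitiesForOnePartition)  // every entity must share one PartitionKey
{
    batch.Insert(entity);
}
table.ExecuteBatch(batch);  // runs as a single entity-group transaction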
For a nice overview of Azure Table Storage, with some good labs, check out the Platform Training Kit.
For more info about transactions within tables, see this msdn blog post.
The limit of 1000 is the number of rows returned from a query, not the number of rows queried.
Pulling all of the 200 million rows into the web server to check them will not work.
The trick is to store the rows with a key that can be used to check if the record should be accepted.
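For example, if the attribute that decides acceptance is baked into the PartitionKey/RowKey, the check becomes a single point read instead of a scan. A minimal sketch (the key layout is an assumption):
using Microsoft.WindowsAzure.Storage.Table;

// Point read: does a record with this key already exist?
var retrieve = TableOperation.Retrieve<DynamicTableEntity>(partitionKey, rowKey);
var result = table.Execute(retrieve);
bool exists = result.Result != null;  // accept the new entity only if no match was found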
