Hector Cassandra Data Retrieval

Is there any way to get all the data from a column family or from a key space?
I can't think of a way of doing this without knowing every single key for every single entry made to the database.
My problem is that I'm trying to create a Twitter clone where each message has its own id, and I store them all in the same column family in the same keyspace.
But then how do I get them back? I'll have to keep track of every single id, and that can't possibly work.
Any help/ideas would be appreciated.

You can retrieve all data from a column family using get_range_slices, setting both the range start and end to an empty value to indicate that you want all the data.
See the Cassandra FAQ
See http://aquiles.codeplex.com/discussions/278245 for a Thrift example.
Haven't yet found a handy Hector example but I think it uses RangeSlicesQuery...
However, it's not clear why you want to do this - for this sort of application you would normally look up messages by ID and use an index to determine which IDs you need, for example by storing a row for each user that lists all their messages. In the messages column family you might have something like:
MsgID0001 -> time      text
             1234567   Hello world
MsgID0300 -> time      text
             3456789   LOL ROTFL
And then in a "user2msg" column family, store the message IDs, perhaps using timestamp column names so the messages are stored sorted in time order:
UserID001 -> 1234567     3456789
             MsgID0001   MsgID0300
This can then be used to look up a particular user's messages, possibly filtered by time.
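A rough CQL sketch of that layout (table and column names here are placeholders of mine, and the wide-row user2msg column family becomes a table with a clustering key):

CREATE TABLE messages (
    msg_id     text PRIMARY KEY,
    created_at bigint,
    body       text
);

-- one row per user; message IDs cluster by timestamp, so they come
-- back in time order and can be filtered by time range
CREATE TABLE user2msg (
    user_id  text,
    msg_time bigint,
    msg_id   text,
    PRIMARY KEY (user_id, msg_time)
);

-- a particular user's messages within a time window
SELECT msg_id FROM user2msg
WHERE user_id = 'UserID001' AND msg_time >= 1234567 AND msg_time <= 3456789;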
You'd then also need further column families to store user profiles etc.
Perhaps you need to add more detail to your question?
Update in response to comment: Yes, if you have one message per row, you have to retrieve each message individually. But what is your alternative? Retrieving all messages is only useful for doing batch processing of messages, not for (for example) showing a user their recent messages. Bear in mind that retrieving all messages could take a very long time - you have not explained why you want to retrieve all messages and what you are going to do with them all. How many messages are you expecting to have?
One possibility is to denormalise, i.e. store the full message content in a row for each user, so you don't have to do a separate lookup step for each message. This doubles the amount of storage required, however.
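A hedged sketch of that denormalised option (again with invented names), where the full message body lives directly in the per-user row:

CREATE TABLE user_timeline (
    user_id    text,
    created_at bigint,
    body       text,
    PRIMARY KEY (user_id, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);

-- a user's most recent messages, in one query with no second lookup
SELECT created_at, body FROM user_timeline
WHERE user_id = 'UserID001' LIMIT 20;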

The answer I was looking for is CQL, Cassandra's query language. It works similarly to SQL, which is what I need for the function I'm after.
This link has some excellent tutorials.
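For instance, against a messages table like the one sketched in the first answer (names assumed), a full scan of the column family can be paged with the token function:

SELECT msg_id, created_at, body FROM messages LIMIT 1000;

-- next page: continue after the last partition key seen
SELECT msg_id, created_at, body FROM messages
WHERE token(msg_id) > token('MsgID0300') LIMIT 1000;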

Related

Tally unread (chat) messages in database

My goal is to create daily reports for users about chat messages they've missed/not read yet. Right now, all data is getting stored in a ScyllaDB, and that is working out well for the most part. But when it comes to these reports I've no idea whether there is a good way to achieve that without changing the database system.
Thing is, I don't want to query the unread messages for each user separately. (I could do that because messages have a timeuuid I can compare with a last_read timestamp, but it's slow because it means multiple queries for every single user there is.) Therefore, I tried to create a dedicated table for the reporting:
CREATE TABLE missed_report (    -- table name is a placeholder; the original omits it
    user uuid,
    channel uuid,
    count_start_time timestamp,
    missed_count int,
    PRIMARY KEY (channel, user)
)
Once a new message in the channel arrives, I can retrieve all users in that channel (from another table). My idea was to increment missed_count, or decrement it in case a message was deleted and its creation timestamp is > count_start_time (I figure I could achieve that with an IF condition on the update). Once a user reads his messages, I reset count_start_time to the current date and missed_count to 0.
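To make that concrete, a hedged CQL sketch of those updates (using the placeholder table name from above; since missed_count is a plain int rather than a counter, an "increment" has to be a read followed by a compare-and-set, and conditional updates with < need lightweight-transaction support in your Cassandra/ScyllaDB version):

-- read, then conditionally write back the incremented value
SELECT missed_count FROM missed_report WHERE channel = ? AND user = ?;

UPDATE missed_report SET missed_count = 5        -- previous value + 1
WHERE channel = ? AND user = ?
IF missed_count = 4;

-- decrement on deletion, only for messages newer than count_start_time
UPDATE missed_report SET missed_count = 4
WHERE channel = ? AND user = ?
IF missed_count = 5 AND count_start_time < '2021-06-01 12:00:00+0000';

-- reset when the user reads the channel
UPDATE missed_report
SET missed_count = 0, count_start_time = toTimestamp(now())
WHERE channel = ? AND user = ?;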
But several issues arise here:
Since I can't use a Counter, my updates aren't atomic. But I think I could live with that.
For the reasons below it would be ideal if I could just delete a row once messages get read, instead of resetting the timestamp and counter. But I've read that many deletions might cause performance issues (and I'm also not sure what would happen if the entry gets recreated after a short period because new messages arrive in the channel again).
The real bummer: since I did not want to iterate over all users on the system in the first place, I don't want to iterate over all entries here either. The naive idea would be to query with WHERE missed_count > 0, but missed_count isn't part of the clustering key, so as far as I understand that's not feasible.
Since I have to paginate, it could happen that I get the missed messages for a single user in different chunks. I mean, it could happen that I report to user1 that he has unread messages from channel1 first, and only later that he has unread messages from channel2. That means additional overhead in case I want to avoid multiple reports for the same user.
Is there a way I could structure my table to solve that problem, especially how to query only entries with missed_count > 0 or to utilize row deletion? Or is my goal beyond the design of Cassandra/ScyllaDB?
Thanks in advance!

PostgreSQL: Is it possible to limit inserts per user based on time difference between timestamp column and current time?

I have an issue where two almost concurrent requests (+- 10 ms apart) by the same user (unintentionally duplicated by the client side) successfully execute the whole use-case logic twice. I can't really solve this situation in the code of my API, so I've been thinking about how to limit one user_id to inserting a row into the order table at most once every second, for example.
I want to achieve this: if a row with user_id X exists in the order table and that row was created (inserted) less than 1 second ago, an insert with user_id X should fail.
This could be an effective way of avoiding unintentionally duplicated requests from the client side, because I can't imagine a situation where a user would intentionally send two complex requests less than 1 second apart. I'm also interested in any other ideas, for example what's the proper way to deal with similar situations in APIs.
There is one problem with your idea: if the server becomes really slow for just a second, the orders will arrive more than one second apart at the database and both will be inserted.
I'd recommend generating a unique ID, like a UUID, in the front-end, and sending that with the request. You could, for example, generate a new one every page load. Then, if the server sees that the received UUID already exists in the database, the order is skipped.
This avoids any potential timing issues, but also retains the possibility of someone re-ordering the exact same products.
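A hedged PostgreSQL sketch of that approach (table and column names are made up here): give the client-generated UUID a unique constraint and let duplicate submissions be ignored.

CREATE TABLE orders (
    order_id     bigserial PRIMARY KEY,
    user_id      int NOT NULL,
    request_uuid uuid NOT NULL UNIQUE,   -- generated by the front-end, resent on retries
    created_at   timestamptz NOT NULL DEFAULT now()
);

-- a second submission of the same request_uuid is simply skipped
INSERT INTO orders (user_id, request_uuid)
VALUES (42, '7f9c2f0e-5b1a-4c3d-9e8f-0a1b2c3d4e5f')
ON CONFLICT (request_uuid) DO NOTHING;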
You can do it with an EXCLUDE constraint. You need to create your own immutable helper function, and use an extension.
create extension btree_gist;

create function addsec(timestamptz) returns tstzrange immutable language sql as $$
    select tstzrange($1, $1 + interval '1 second')
$$;

create table orders (
    userid int,
    t timestamptz,
    exclude using gist (userid with =, addsec(t) with &&)
);
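For illustration (timestamps invented), the second insert below is rejected because its one-second range overlaps the first one for the same userid:

insert into orders values (1, '2021-06-01 12:00:00.200+00');
-- ERROR: conflicting key value violates exclusion constraint
insert into orders values (1, '2021-06-01 12:00:00.700+00');
-- the same user a second and a half later does not overlap, so it succeeds
insert into orders values (1, '2021-06-01 12:00:01.500+00');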
But you should probably change the front end anyway to include a validation token, as currently it may be subject to CSRF attacks.
Note that EXCLUDE constraints may be much less efficient than UNIQUE constraints. Also, I'm not 100% sure that addsec really is immutable. There might be weird things with leap seconds or something that messes it up.

Best way to run a script for large userbase?

I have users stored in a PostgreSQL database (~10 M) and I want to send all of them emails.
Currently I have written a Node.js script which basically fetches users 1000 at a time (OFFSET and LIMIT in SQL) and queues the requests in RabbitMQ. This seems clumsy to me, because if the node process fails at any time I have to restart it (I am currently keeping track of the number of users skipped per query, and can restart from the previous number found in the logs). This might lead to some users receiving duplicate emails and some not receiving any. I could create a new table with a column indicating whether an email has been sent to that person or not, but in my current situation I can't do so: I can neither create a new table nor add a new column to the existing table. (Seems to me like an idempotency problem?)
How would you approach this problem? Do you think compound indexes might help? Please explain.
The best way to handle this is indeed to store who received an email, so there's no chance of doing it twice.
If you can't add tables or columns to your existing database, just create a new database for this purpose. If you want to be able to recover from crashes, you will need to store who got the email somewhere so if you are given hard restrictions on not storing this in your main database, get creative with another storage mechanism.
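A hedged sketch of that (all table and column names invented): track sends in a separate database and page through users by key rather than by OFFSET, so the script can resume exactly where it stopped.

-- lives in a separate database, since the main one can't be changed
CREATE TABLE email_log (
    user_id bigint PRIMARY KEY,
    sent_at timestamptz NOT NULL DEFAULT now()
);

-- fetch the next batch by key instead of OFFSET/LIMIT, restarting from
-- the last user_id that was queued (0 on the first run)
SELECT id, email
FROM users
WHERE id > $1
ORDER BY id
LIMIT 1000;

-- record each queued email; re-running a batch can't double-send
INSERT INTO email_log (user_id)
VALUES ($1)
ON CONFLICT (user_id) DO NOTHING;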

Cassandra - multiple counters based on timeframe

I am building an application and using Cassandra as my datastore. In the app, I need to track event counts per user, per event source, and need to query the counts for different windows of time. For example, some possible queries could be:
Get all events for user A for the last week.
Get all events for all users for yesterday where the event source is source S.
Get all events for the last month.
Low latency reads are my biggest concern here. From my research, the best way I can think of to implement this is a different counter table for each permutation of source, user, and predefined time. For example, create a count_by_source_and_user table, where the partition key is a combination of source and user ID, and then create a count_by_user table for just the user counts.
This seems messy. What's the best way to do this, or could you point towards some good examples of modeling these types of problems in Cassandra?
You are right. If latency is your main concern (and it should be if you have already chosen Cassandra), you need to create a table for each of your queries. This is the recommended way to use Cassandra: optimize for reads and don't worry about redundant storage. And since within every table data is stored sequentially according to the primary key, you cannot index a table in more than one way (as you would with a relational DB). I hope this helps. Look for the "Data Modeling" presentation that is usually given at "Cassandra Day" events. You may find it on "Planet Cassandra" or Jon Haddad's blog.
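For example, one of those query tables might look like this (a hedged sketch; the names and the per-day bucketing are my own, not from the question): a counter table keyed by source and user, with one counter per day so that time-window queries become a clustering-key range.

CREATE TABLE count_by_source_and_user (
    source  text,
    user_id text,
    day     text,      -- e.g. '2015-06-01', one counter bucket per day
    events  counter,
    PRIMARY KEY ((source, user_id), day)
);

-- on every event
UPDATE count_by_source_and_user SET events = events + 1
WHERE source = 'S' AND user_id = 'A' AND day = '2015-06-01';

-- events for user A from source S over the last week
SELECT day, events FROM count_by_source_and_user
WHERE source = 'S' AND user_id = 'A'
AND day >= '2015-05-25' AND day <= '2015-06-01';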

How to structure an Azure Table to hold user messages

I'm still trying to get my head around the correct way to use Azure Tables. I understand that they have a partition key and a row key, and that that's it. Everything else is just data that you keep in that row.
Use Case
My web app gets files uploaded by a user, puts them in a queue, then has a worker role process the queue and do analytics on those files.
I would like to put messages about those files in an Azure Table based on what we find when we process those files.
I then plan on making an AJAX call to get a member's messages when they visit a webpage. If the user clicks on the message or closes the message then I'll delete it from the table. Very StackOverflowish.
Question
My question is on how to best store these messages in my Azure Table.
Here's my thinking so far:
PartitionKey: MemberID
RowKey: ???(not sure what to have)
Column Data: Message data including any links and a time stamp. Probably a view count too.
I can't think of what I would put in a separate index for the row key. Timestamp could work so I can order messages correctly, but I don't think I'll get much bang for my buck with that.
I have found that the best way to choose partition and row keys is to think about the data access patterns, and to have a single row/entity represent something meaningful in your system. In your case it sounds like userid/fileid uniquely identifies the entity. From this, you have three options:
userid for partition key, fileid for row key
constant value for partition key, and a combination of userid and fileid for row key
constant value for row key, and a combination of userid and fileid for partition key
The decision between these comes down to your other access patterns. Are you going to be querying for all files for a particular user? Then you would want userid as the partition or row key. If you will only ever be querying based on fileid/userid, then it doesn't really matter.
Erick
Before thinking about actual storage, you should try to think about what entities you're going to have.
Sounds like something like this:
User entity
UserFile entity
FileMessage entity
Do you have one FileMessage per UserFile or can you have more than one? From your explanation of the deletion logic, it sounds like you would only have one FileMessage per File.
If my assumptions so far are correct and if it were me, the FileMessage table would have the following structure:
PartitionKey: userId
RowKey: fileId (name/url/etc)
Other columns: as you see fit
HTH
I would think of it as: the Partition Key is how you want to break data out, so if data is related, you want to keep the partition key the same. If you are doing something with a lot of data, you may want to use something like the date for the Partition Key. The Row Key is the index, so that is what you will use to query the data.
