Is Cassandra really suitable for storing log messages?

I am looking for a NoSQL database to store firewall traffic log messages from thousands of firewall devices. I hope the NoSQL database can achieve 100% uptime with built-in multiple data center support; Cassandra seems like a good choice to me.
The firewall log messages I would like to store in Cassandra look like this:
date=2014-07-04 time=14:26:59 type=traffic subtype=local
level=notice vd=vdom1 srcip=10.6.30.254 srcport=54705 srcintf="mgmt1"
dstip=10.6.30.1 dstport=80 dstintf="vdom1" sessionid=350696 status=close
policyid=0 dstcountry="Reserved" srccountry="Reserved" trandisp=noop service=HTTP
When I tried to create a table (column family) with multiple columns corresponding to the key-value pairs in the log message above, I found it hard to define the table's primary key/composite key, because hundreds of log messages similar to the example above can be generated within the same second!
In order to uniquely identify each row, the primary key would probably need to include almost all of the columns... which feels wrong to me.
Is Cassandra really a good fit for storing time-series log messages, or should I consider another NoSQL database like MongoDB? Thanks.
Regards
Ro
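
One common Cassandra pattern for exactly this situation (a minimal sketch; the table, column, and bucket names below are illustrative, not prescribed by the question) is to partition by device plus a time bucket and add a timeuuid clustering column, so that many messages arriving in the same second stay unique without putting every field into the key:

CREATE TABLE firewall_logs_by_device (
    device_id   text,
    day         date,      -- daily bucket keeps partitions from growing without bound
    logged_at   timeuuid,  -- unique even for many messages per second
    srcip       inet,
    srcport     int,
    dstip       inet,
    dstport     int,
    raw_message text,      -- remaining key-value pairs could be columns instead
    PRIMARY KEY ((device_id, day), logged_at)
) WITH CLUSTERING ORDER BY (logged_at DESC);

Inserting with logged_at = now() lets Cassandra generate the uniqueness, and a query like "the last hour of traffic for device X" then hits a single partition.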

Related

Redis and Postgresql synchronization (online users status)

In a NodeJS application I have to maintain a "who was online in the last N minutes" state. Since there are potentially thousands of online users, for performance reasons I decided not to update my Postgresql user table for this task.
I chose to use Redis to manage the online status. It's very easy and efficient.
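For reference, a minimal sketch of that Redis pattern, assuming a sorted set named online whose scores are Unix timestamps of last activity (the key name, user id, and timestamps are illustrative):

ZADD online 1404480419 42                # user 42 was just seen
ZRANGEBYSCORE online 1404480119 +inf     # users active in the last 5 minutes
ZREMRANGEBYSCORE online 0 1404480119     # evict entries older than 5 minutes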
But now I want to make complex queries on the user table, sorted by the online status.
I was thinking of creating an online table filled every minute from a Redis snapshot, but I'm not sure it's the best solution.
After the table is filled, will the next query referencing the online table take a big hit from creating or loading the new indexes?
Does anyone know a better solution?
I had to solve almost this exact same issue, but I took a different approach because I didn't like the issues caused by trying to mix Redis and Postgres.
My solution was to collect the online data in a queue (ZeroMQ in my case, but any queueing system should work, as would a stream-processing facility like Amazon Kinesis, the alternative I looked at). I then inserted the data in batches into a second table (not the users table). I don't delete or update that table; only inserts and queries are allowed.
Doing things this way preserved the ability to do joins between the last-online data and the users table without bogging down the database or creating many updates on the users table. It has the side effect of giving us a lot of other useful data.
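A minimal sketch of that layout in Postgres (the table and column names are assumptions, not the poster's actual schema, and users is assumed to have an id primary key):

-- Append-only table fed in batches from the queue
CREATE TABLE user_online_events (
    user_id bigint      NOT NULL REFERENCES users (id),
    seen_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX ON user_online_events (seen_at);

-- Join against users: who was seen in the last 5 minutes, newest first
SELECT u.*, max(e.seen_at) AS last_seen
FROM users u
JOIN user_online_events e ON e.user_id = u.id
WHERE e.seen_at > now() - interval '5 minutes'
GROUP BY u.id
ORDER BY last_seen DESC;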
One thing I have thought about when considering other solutions to this problem is that your users table is transactional (OLTP) data, while the latest-online information is really analytics (OLAP) data. So if you have a data warehouse, data lake, big data platform, or whatever term of the week you prefer for storing and querying this type of data, that may be a better solution.

NoSQL: separate data by client

I have to develop a project using a NoSQL database, either Couchbase or Cassandra.
I would like to know whether it is recommended to partition the data of each customer into its own bucket.
In my case, there will never be requests between the different clients.
The data can be completely separated.
For Couchbase, I saw that a memory capacity is reserved for each bucket.
Or does the separation have to be done at another level, such as per document, or per super column for Cassandra?
Thank you
Or does the separation have to be done at another level, such as per document, or per super column for Cassandra?
Tip #1: when working with Cassandra, completely erase the term "super column" from your vocabulary.
I would like to know whether it is recommended to partition the data of each customer into its own bucket.
That depends. It sounds like your queries would be mostly based on a customer id, so it makes sense to have it as a part of your partition key. However, if each customer partition has millions of rows and/or columns underneath it, that's going to get very big.
Tip #2: proper Cassandra modeling is done based on what your required queries look like. So without actually seeing the kinds of queries you need to serve, it's going to be difficult to be any more specific than that.
If you have customer data relating to accounts and addresses, etc, then building a customers table with a PRIMARY KEY of only customer_id might make sense. But if you find that you need to query your customers (for example) by email_address, then you'll want to create a customers_by_email table, duplicate your data into that, and create a PRIMARY KEY that supports that.
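As a sketch of that duplication (the non-key columns are assumptions):

CREATE TABLE customers (
    customer_id   uuid PRIMARY KEY,
    email_address text,
    name          text
);

-- Same data, duplicated and keyed for the email lookup
CREATE TABLE customers_by_email (
    email_address text PRIMARY KEY,
    customer_id   uuid,
    name          text
);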
Additionally, if you find yourself storing data on customer activity, you may want to consider a customer_activity table with a PRIMARY KEY of ((customer_id, month), activity_time). That will use both customer_id and month as a partition key, storing the customer's activity clustered by activity_time. In this case, if we didn't use month as an additional partition key, each customer_id partition would be continually written to until it became too ungainly to write to or query (unbounded row growth).
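In CQL, that table could look like this (the non-key columns and the month encoding are illustrative):

CREATE TABLE customer_activity (
    customer_id   uuid,
    month         int,        -- e.g. 201407; bounds partition growth
    activity_time timestamp,
    details       text,
    PRIMARY KEY ((customer_id, month), activity_time)
) WITH CLUSTERING ORDER BY (activity_time DESC);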
Summary:
If anyone tells you to use a super column in Cassandra, slap them.
You need to know your queries before you design your tables.
Yes, customer_id would be a good way to keep your data separate and ensure that each query is restricted to a single node.
Build your partition keys to account for unbounded row growth, to save you from writing too much data into the same partition.

Where can I observe writes to a Cassandra database, aka where are they logged?

Trying to track down a problem with one of our developers, mainly a program he wrote that modifies (adds some flags to) existing entries in the various tables in our Cassandra keyspace.
The issue is that it seems to work just fine for many of the tables, but for at least 3 so far I've discovered that it isn't writing anything to them. The only thing his logs can tell me is that x number of rows were committed to the database, but of course when I query a specific row I see that is not the case.
I was just wondering if there is somewhere that Cassandra logs each INSERT, so I can look at the log and figure out what was going on when it was supposedly inserting that data into the table. I know that when a write command is issued it is written to the commit log, but I believe that is not human-readable, so I need to be able to check somewhere that is.
The only thing his logs can tell me is that x number of rows were committed to the database, but of course when I query a specific row I see that is not the case.
This sounds like it might be a consistency issue; can you query using consistency level (CL) ALL?
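In cqlsh, for example (the keyspace, table, and key are placeholders):

cqlsh> CONSISTENCY ALL;
cqlsh> SELECT * FROM my_keyspace.my_table WHERE id = 1234;

If a row shows up at ALL but not at your usual consistency level, some replicas are missing the write and the data is simply out of sync, not absent.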
I was just wondering if there is somewhere that Cassandra logs each INSERT, so I can look at the log and figure out what was going on when it was supposedly inserting that data into the table.
Bad news:
Cassandra does not have audit logging
Good news:
DSE does have audit logging -- http://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/sec/secAuditingCassandraTable.html
Remember there is a performance penalty for audit logging. You may just want to turn it on temporarily.

Cassandra - multiple counters based on timeframe

I am building an application and using Cassandra as my datastore. In the app, I need to track event counts per user, per event source, and need to query the counts for different windows of time. For example, some possible queries could be:
Get all events for user A for the last week.
Get all events for all users for yesterday where the event source is source S.
Get all events for the last month.
Low-latency reads are my biggest concern here. From my research, the best way I can think of to implement this is a different counter table for each permutation of source, user, and predefined time window. For example, create a count_by_source_and_user table, where the partition key is a combination of source and user ID, and then create a count_by_user table for just the user counts.
This seems messy. What's the best way to do this, or could you point towards some good examples of modeling these types of problems in Cassandra?
You are right. If latency is your main concern (and it should be, if you have already chosen Cassandra), you need to create a table for each of your queries. This is the recommended way to use Cassandra: optimize for reads and don't worry about redundant storage. And since within every table data is stored sequentially according to the index, you cannot index a table in more than one way (as you would with a relational DB). I hope this helps. Look for the "Data Modeling" presentation that is usually given at "Cassandra Day" events. You may find it on "Planet Cassandra" or Jon Haddad's blog.
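A sketch of what two of those tables might look like in CQL, using the table names from the question (the column types and the daily granularity are assumptions; coarser windows like weeks or months can be served by summing day buckets at read time):

CREATE TABLE count_by_source_and_user (
    source      text,
    user_id     uuid,
    day         date,
    event_count counter,
    PRIMARY KEY ((source, user_id), day)
);

CREATE TABLE count_by_user (
    user_id     uuid,
    day         date,
    event_count counter,
    PRIMARY KEY (user_id, day)
);

-- Each incoming event increments every table that tracks it
UPDATE count_by_source_and_user SET event_count = event_count + 1
  WHERE source = ? AND user_id = ? AND day = ?;
UPDATE count_by_user SET event_count = event_count + 1
  WHERE user_id = ? AND day = ?;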

Making Cassandra store data on a local node

What is a simple way of configuring a Cassandra cluster so that if I try to store a key in it, it will be stored on the local node to which I issue the set/write command?
I am looking at the IPartitioner interface, which allows me to specify how the key will be hashed, but it seems a bit heavyweight for something like the above.
Thanks!
If you were able to arbitrarily write keys to arbitrary nodes, then on lookup the system would not know where the data for that key lived. The system would have to do a full cluster lookup, which would be super slow.
By design, Cassandra spreads the data around in a known way so that lookups are quick.
Check out this post by Jonathan Ellis, the primary maintainer of Cassandra.
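If you want to see that placement for yourself, nodetool can show which nodes own a given key under the configured partitioner (the keyspace, table, and key here are placeholders):

nodetool getendpoints my_keyspace my_table 1234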
