I have read various sources on this subject and understand the idea of modelling around the queries needed, but I wondered how far this can be stretched with Cassandra.
I need to store processing events which contain both measures and dimension data, to borrow terms from a traditional data warehouse.
The format of the data is something like
log_timestamp (timestamp): user_id (text): measure_1 (num): measure_2 (num) : measure_3 (num) : dim_1 (text) : dim_2 (text) : ... dim_n(text)
where there may be 10 or more dim data items.
The queries I would like to model include:
user_id by time (minute/hour/day/week/month/year) with measure aggregates
user_id by single dim by time with measure aggregates
single dim by time with measure aggregates
Some of the dimension fields form a natural hierarchy, so I would like the above queries with more than one dim field as well.
Before embarking on the creation of a large number of discrete column families to try and cover the permutations, I would like to know if anyone can recommend a better approach,
e.g. using a single CF for the dim data, with one column identifying the dim type and another holding its value, and a similar idea for hierarchy data with the hierarchy type and the member dims and values.
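To make that idea concrete, a rough CQL sketch of such a single table might look something like this (the table name, key layout and measure types are only assumptions, and the measures would be repeated for each dim entry):
create table events_by_user_dim (
    user_id text,
    dim_name text,            -- which dimension this entry describes (dim_1 ... dim_n)
    dim_value text,
    log_timestamp timestamp,
    measure_1 decimal,
    measure_2 decimal,
    measure_3 decimal,
    primary key ((user_id, dim_name), dim_value, log_timestamp)
);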
Alternatively, what might be a good model for storing the data at a relatively granular level, such that it could be read back out into an aggregation tool, e.g. Hive or Spark (which looks really quite interesting)?
Thanks.
Suppose you want to be able to query for aggregated data by week. You could use the following data structure.
Column Family = day
Row Key: Date = day_identifier (e.g., time at beginning of some day this week)
Column Name: Date = timestamp, Long = field_ordinal
Column Value: field value
Column Family = week
Row Key: Date = week_identifier (e.g., time at beginning of first day of a week)
Column Name: Date = timestamp, Long = field_ordinal
Column Value: field value
At the end of each week, you'd take the entries in the day column family and aggregate them into an entry in the week column family. Then you can remove the per-day data if it's no longer useful to you.
This concept allows you to store much less data, but you can still accomplish a lot. For instance, if you want to query for data aggregated over a month, you'd just access all the weeks in that month. Alternatively, you could use the same concept to roll up data for an entire month as well.
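If you're on CQL rather than the old Thrift column families, a rough sketch of the same two tables might look like this (the column names and types are just an assumption):
create table events_by_day (
    day_bucket timestamp,       -- midnight at the start of the event's day
    log_timestamp timestamp,
    field_ordinal int,
    field_value text,
    primary key (day_bucket, log_timestamp, field_ordinal)
);

create table events_by_week (
    week_bucket timestamp,      -- midnight at the start of the week's first day
    log_timestamp timestamp,
    field_ordinal int,
    field_value text,
    primary key (week_bucket, log_timestamp, field_ordinal)
);
A weekly job would then read a day_bucket partition, aggregate it, and write the result into the matching week_bucket partition.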
Good luck.
I have a Databricks Delta table of financial transactions that is essentially a running log of all changes that ever took place on each record. Each record is uniquely identified by 3 keys, so given that uniqueness, each record can have multiple instances in this table, each representing a historical entry of a change (across one or more columns of that record). Now, if I want to find out cases where a specific column value changed, I can easily achieve that by doing something like this:
SELECT t1.Key1, t1.Key2, t1.Key3, t1.Col12 AS `Before`, t2.Col12 AS `After`
FROM table1 t1
INNER JOIN table1 t2
  ON t1.Key1 = t2.Key1 AND t1.Key2 = t2.Key2 AND t1.Key3 = t2.Key3
WHERE t1.Col12 != t2.Col12
However, these tables have a large number of columns. What I'm trying to achieve is a way to identify any columns that changed in a self-join like this: essentially a list of all columns that changed. I don't care about the actual values that changed, just a list of column names that changed across all records. It doesn't even have to be per row, but the 3 keys will always be excluded, since they uniquely define a record.
Essentially I'm trying to find any columns that are susceptible to change, so that I can focus on them specifically for some other purpose.
Any suggestions would be really appreciated.
Databricks has change data feed (CDF / CDC) functionality that can simplify this type of use case: https://docs.databricks.com/delta/delta-change-data-feed.html
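As a minimal sketch (assuming the table from the question is called table1), you would enable the feed and then read row-level changes; note it only covers changes made after it is enabled:
-- turn on the change data feed for the existing Delta table
ALTER TABLE table1 SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- read the before/after images of updated rows starting from a given table version
SELECT Key1, Key2, Key3, _change_type, _commit_version
FROM table_changes('table1', 1)
WHERE _change_type IN ('update_preimage', 'update_postimage');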
Can someone explain in simple terms what a static column in Cassandra is, and what it is used for?
I came across this link, Static column, but wasn't able to understand much of it.
A static column is a way to associate data with the whole partition, so it is shared between all rows inside that partition. There are legitimate cases where all rows need to have the same data, and when that data is updated, we don't need to update every row.
One example that comes to mind is e-commerce. For example, you're selling something in different countries, with different currencies and different prices, but some things are common between them, like the description, the sizes, etc. In this case we can model it as follows:
create table articles (
    sku text,
    description text static,
    country text,
    currency text,
    price decimal,
    primary key (sku, country)
);
In this case, when you do select * from articles where sku = ... and country = ... you get the description anyway, and you can update the description for the whole partition with just update articles set description = '...' where sku = ..., so the next select will pull the updated description.
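For example, with some made-up sample values:
insert into articles (sku, description, country, currency, price)
values ('sku-1', 'Blue cotton t-shirt', 'US', 'USD', 19.99);

insert into articles (sku, country, currency, price)
values ('sku-1', 'DE', 'EUR', 17.99);

-- both rows come back with the same (static) description
select sku, country, price, description from articles where sku = 'sku-1';

-- one update changes the description for every row in the partition
update articles set description = 'Blue organic cotton t-shirt' where sku = 'sku-1';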
Also, static columns may exist in a partition without any rows. One of the use cases that I've seen is the collection of aggregated information, where the detailed data was stored as individual rows with some TTL, and a job aggregated the data into the static column, so when the rows expired, the partition still remained with only the aggregated data.
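A rough sketch of that pattern, with made-up table and column names:
create table user_events (
    user_id text,
    event_time timestamp,
    event_value int,
    total_value int static,    -- aggregate survives after the detail rows expire
    primary key (user_id, event_time)
);

-- detail rows expire after a day; the static aggregate stays in the partition
insert into user_events (user_id, event_time, event_value)
values ('u1', '2023-01-01 10:00:00', 5) using ttl 86400;

-- a periodic job would write the aggregated value into the static column
update user_events set total_value = 5 where user_id = 'u1';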
I have a column of data [Sales ID] that is bringing in duplicate data for an analysis. My goal is to limit the data to pull unique Sales ID values for the max day of every month in the analysis only (instead of daily). I'm basically trying to get it to only pull in unique Sales ID values for the last day of every month in the analysis, and if the current day is the last day so far, then it should pull that in. So it should pull in the MAX date in any given month. How do I write an expression with the [Sales ID] column and [Date] column to achieve this?
Probably the two easiest options are to
1) Adjust the SQL as niko mentioned
2) Limit the visualization with the "Limit Data Using Expression" option, using the following:
Rank(Day([DATE]), "desc", Month([DATE]), Year([DATE])) = 1
If you had to do it in the Data on Demand section (maybe the IL itself is a usp or you don't have permission to edit it), my preference would be to create another data table that only has the max dates for each month, and then filter your first data table by that.
However, if you really need to do it in the Data on Demand section, then I'm guessing you don't have the ability to create your own information links. This would mean you can't key off additional data tables, and you're probably going to have to get creative.
Constraints of creativity include needing to know the "rules" of your data -- are you pulling the data in daily? Once a week? Do you have today's data, or today - 2? You could probably write a Python script to grab the last day of every month for the last 10 years, plus whatever yesterday's date was, and throw all those values into a document property. This would allow you to do a "Values from Property".
(Side Note: I want to say you could also do it directly in the expression portion with something like an extremely long
Date(DateTimeNow()),DateAdd("dd",-1,Date(Year(DateTimeNow()), Month(DateTimeNow()), 1))
But Spotfire is refusing to accept that as multiple values. Interestingly, when I pull the logic for a StringList property, it gives this: $map("${udDates}", ","), which suggests commas are an accurate methodology, but I get an error reading "Expected 'End of expression' but found ','". Uncertain if this is a Spotfire issue, or related to my database connection)
tl;dr -- Doing it in the Data on Demand section is probably convoluted. Recommend adjusting in SQL if possible, and otherwise limiting in the visualization.
I am trying to create a summary calculation on a set of tables which I have added to an Excel data model. I have 5 tables with the following columns:
Datetime (UTC)
Measured Data 1
Simulated Data 1
Measured Data 2
Simulated Data 2
etc.
I have created a master Date-time table which links these 5 tables together on their UTC date-time column.
My query is how to optimally create calculated fields on each of these tables without needing to explicitly specify the calculations on each individual table, as is the case with PivotTables (I need to select the specific measured and simulated data columns from one individual table in the data model). Instead I would like to be able to map all measured fields to one column and all simulated fields to another, and then use filters to select out the fields (or groups of fields) I want to compare.
I started by creating a summary table which lists all my tables in my data model by name along with the names of measured and simulated columns within each. However, I'm unsure what the next step should be. This seems like a pretty straightforward problem, but it has me stumped this morning. I'd appreciate any help. Let me know if I haven't fully explained anything.
I'm new to Cassandra, and I'm not familiar with super columns.
Consider this scenario: suppose we have some fields of a customer entity, like
Name
Contact_no
address
and we can store all these values in normal columns. I want to arrange that when a person moves from one location to another (the representative field could store the longitude and latitude), the values are stored consecutively with respect to the customer's location. I think we can do this with super columns, but I'm confused about how to design the schema to accomplish this.
Please help me create this schema and understand the concepts behind super columns.
Super columns are really not recommended anymore... they are still used, but more and more people have switched to composite columns. For example, playOrm uses this concept for indexing. If I am indexing an integer, an indexing row may look like this
rowkey = 10.pk56 10.pk39 11.pk50
Here the column name type is a composite of an integer and a string. These rows can hold up to about 10 million columns, though I have only run experiments up to 1 million myself. For example, playOrm's queries use these types of indexes to do a query that took 60 ms on 1,000,000 rows.
With playOrm, you can do scalable relational models in NoSQL... you just need to figure out how to partition your data correctly, as you can have as many partitions as you want in each table, but a partition should really not be over 10 million rows.
Back to the example though: if you have a table with columns numShares, price, username, age, you may want to index numShares, and the above row would be that index, so you could grab the index by key or, better yet, grab all column names with numShares > 20 and numShares < 50.
Once you have those columns, you can then get the second half of each column name, which is the primary key. The reason the primary key is not a value is that, as in the example above, there are two rows, pk56 and pk39, with the same indexed value 10, and you can't have two columns named 10, but you can have 10.pk56 and 10.pk39.
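In today's CQL, roughly the same index row can be modelled with clustering columns instead of composite column names (the table and column names here are only illustrative):
create table num_shares_index (
    index_name text,        -- which index this partition represents, e.g. 'numShares'
    num_shares int,         -- the indexed value (the 10 / 11 part above)
    pk text,                -- the primary key of the indexed row (the pk56 part)
    primary key (index_name, num_shares, pk)
);

-- "grab all column names with numShares > 20 and numShares < 50"
select num_shares, pk from num_shares_index
where index_name = 'numShares' and num_shares > 20 and num_shares < 50;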
later,
Dean