Maintain statistics across rows in Accumulo

I am relatively new to Accumulo, so I would greatly appreciate general tips for doing this better.
I have rowIds that are made up of a time component and a geographic component. I'd like to maintain statistics (counts, sums, etc.) in an iterator of some sort, but would like to emit mutations to other rows as part of the ingest. In other words, as I insert a row:
<timeA>_<geoX> colFam:colQual value
In addition to the mutation above, I'd like to maintain stats in separate rows in the same table (or a different one) as follows:
timeA_countRow colFam:colQual count++
geoX_countRow colFam:colQual count++
timeA_sumRow colFam:colQual sum += value
geoX_sumRow colFam:colQual sum += value
What is the best way to accomplish such a thing? I have definitely seen the stats combiner, but that works within a single row to my understanding. I'd like to maintain stats based on parts of the key...
Thanks!

In addition to the mutation above, I'd like to maintain stats in separate rows in the same table (or a different one) as follows
This is something that fundamentally does not work with Accumulo. You cannot know, within the confines of an Iterator, about data in a separate row. That's why the StatsCombiner is written in the context of a single row. Any other row is not guaranteed to be contained in the Tablet (physical data boundary).
A common approach is to maintain this information client-side via a separate table or locality group with a SummingCombiner. When you insert an update for a specific column, you also submit an update to your stats table.
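To make the client-side approach concrete, here is a minimal sketch of the fan-out you would do at ingest time. It assumes a separate stats table whose count/sum columns have a SummingCombiner configured, and the write_mutation helper is hypothetical, standing in for whatever Accumulo client / BatchWriter you actually use:

# Sketch: client-side fan-out of stat updates for one ingest record.
# Assumes a stats table whose columns are aggregated server-side by a
# SummingCombiner; write_mutation() is a hypothetical helper standing in
# for your Accumulo client / BatchWriter.

def stat_mutations(row_id, col_fam, col_qual, value):
    """Given a rowId like '<timeA>_<geoX>', build the extra stat updates."""
    time_part, geo_part = row_id.split("_", 1)
    return [
        # (row, family, qualifier, delta) -- the SummingCombiner adds these up
        (time_part + "_countRow", col_fam, col_qual, 1),
        (geo_part + "_countRow", col_fam, col_qual, 1),
        (time_part + "_sumRow", col_fam, col_qual, value),
        (geo_part + "_sumRow", col_fam, col_qual, value),
    ]

def ingest(write_mutation, row_id, col_fam, col_qual, value):
    # primary record
    write_mutation("data", row_id, col_fam, col_qual, value)
    # stat updates go to the stats table alongside it
    for row, cf, cq, delta in stat_mutations(row_id, col_fam, col_qual, value):
        write_mutation("stats", row, cf, cq, delta)

Because the SummingCombiner aggregates the deltas server-side, the client never needs to read the current count or sum before writing.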
You could also look into Fluo which allows you to perform cross-row transactions. This is a different beast than normal Accumulo and is still in beta.

Related

How to identify all columns that have different values in a Spark self-join

I have a Databricks Delta table of financial transactions that is essentially a running log of all changes that ever took place on each record. Each record is uniquely identified by 3 keys, so given that uniqueness, each record can have multiple instances in this table, each representing a historical entry of a change (across one or more columns of that record). Now, if I wanted to find out cases where a specific column value changed, I can easily achieve that by doing something like this:
SELECT t1.Key1, t1.Key2, t1.Key3, t1.Col12 as "Before", t2.Col12 as "After"
from table1 t1 inner join table1 t2 on t1.Key1 = t2.Key1 and t1.Key2 = t2.Key2
and t1.Key3 = t2.Key3 where t1.Col12 != t2.Col12
However, these tables have a large number of columns. What I'm trying to achieve is a way to identify any columns that changed in a self-join like this: essentially a list of all the column names that changed across all records. I don't care about the actual values that changed, and it doesn't even have to be per row. The 3 keys will always be excluded, since they uniquely define a record.
Essentially I'm trying to find any columns that are susceptible to change. So that I can focus on them dedicatedly for some other purpose.
Any suggestions would be really appreciated.
Databricks has change data feed (CDF / CDC) functionality that can simplify these types of use cases: https://docs.databricks.com/delta/delta-change-data-feed.html
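As a rough illustration of how CDF could answer the "which columns ever changed" question, here is a PySpark sketch. It assumes the notebook's spark session, that change data feed has already been enabled on the table (delta.enableChangeDataFeed = true), and it reuses the table1 / Key1..Key3 names from the question; the starting version is a placeholder:

# Sketch (PySpark): read the change feed and list columns that ever changed.
# 'spark' is the notebook's SparkSession on Databricks.
from pyspark.sql import functions as F

changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 0)      # placeholder; or startingTimestamp
           .table("table1"))

keys = ["Key1", "Key2", "Key3"]
pre = changes.filter(F.col("_change_type") == "update_preimage")
post = changes.filter(F.col("_change_type") == "update_postimage")

# pair each update's before/after images on the record keys and commit version
joined = pre.alias("b").join(post.alias("a"), keys + ["_commit_version"])

value_cols = [c for c in changes.columns
              if c not in keys and not c.startswith("_")]

# one pass: flag each column that differs (null-safe) in at least one update
flags = joined.select([
    F.max((~F.col("b." + c).eqNullSafe(F.col("a." + c))).cast("int")).alias(c)
    for c in value_cols
]).first().asDict()

changed = [c for c, flag in flags.items() if flag == 1]
print(changed)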

Classification and Grouping of Sorted Data

I have a dataset in H3:J12 where components are classified based on Type. I have summed the count for similar components and sorted by Count with the UNIQUE and SORT formulae, and the result is in L3:M7.
In my actual case there are several thousand such components, sorted as in L:L, and I would now like to add the Type column next to the Component with the sorted Count, as shown in P3:R12. Is it possible to extract them directly from L3:M7, or directly from H3:J12, as I will not be able to do this manually?
Mechanical approaches include pivots, VB, etc. However, you could consider a more dynamic approach that doesn't require constant updating / refreshing (of code, pivots, etc.) whenever underlying data is appended/amended.
YES: you can retrieve directly from source data as follows:
=SORT(UNIQUE(C3:D20))
=SUMIFS(E3:E30,C3:C30,H3:H17,D3:D30,I3:I17)
Notes:
Could make this AVERAGEIF(S), COUNTIF(S), PERCENTILE, etc. Another significant advantage over alternative methods (VB, pivots, etc.), besides the dynamic nature, is the flexibility of measures; the restrictions present in pivots are substantive compared with direct Excel calculation.
The disadvantage is the inability to automatically chart the data using the pivot chart functionality that accompanies pivot tables.
See the linked sheet's opening line for a conditional formatting sample used to recreate the 'look / feel' you might otherwise lose with this approach.
It seemed as though you were almost there! I'm not sure what relevance the other tables have (I ignored this in light of the question).

Spark moving average

I am trying to implement a moving average for a dataset containing a number of time series. Each column represents one parameter being measured, while one row contains all parameters measured in a second. So a row would look something like:
timestamp, parameter1, parameter2, ..., parameterN
I found a way to do something like that using window functions, but the following bugs me:
Partitioning Specification: controls which rows will be in the same partition with the given row. Also, the user might want to make sure all rows having the same value for the category column are collected to the same machine before ordering and calculating the frame. If no partitioning specification is given, then all data must be collected to a single machine.
The thing is, I don't have anything to partition by. So can I use this method to calculate moving average without the risk of collecting all the data on a single machine? If not, what is a better way to do it?
Every nontrivial Spark job demands partitioning. There is just no way around it if you want your jobs to finish before the apocalypse. The question is simple: When it comes time to do the inevitable aggregation (in your case, an average), how can you partition your data in such a way as to minimize shuffle by grouping as much related data as possible on the same machine?
My experience with moving averages is with stocks. In that case it's easy; the partition would be on the stock ticker symbol. After all, the calculation of the 50-day moving average for stock A has nothing to do with that for stock B, so those data don't need to be on the same machine. The obvious partition makes this simpler than your situation, not to mention that it only requires (probably) one data point per day (the closing price of the stock at the end of trading), while you have one per second.
So I can only say that you need to consider adding a feature to your data set whose sole purpose is to serve as a partition key, even if it is irrelevant to what you're measuring. I would be surprised if there isn't one, but if not, then consider a time-based partition, on days for example.
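As a rough sketch of that suggestion, here is a PySpark version that uses a derived day column purely as a partition key and a rows-based trailing window. The table name, column names, and the 60-second window are placeholder assumptions:

# Sketch (PySpark): moving average over a per-second series, partitioned by day
# so no single machine has to hold the whole dataset.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.table("measurements")                 # placeholder; load your data however you do

df = df.withColumn("day", F.to_date("timestamp"))

w = (Window.partitionBy("day")
     .orderBy("timestamp")
     .rowsBetween(-59, 0))                       # trailing 60-second window

df = df.withColumn("parameter1_ma", F.avg("parameter1").over(w))

Note that with a per-day partition the window does not look back across midnight, so the first rows of each day average over fewer points; if that matters, you may need to overlap the partitions or handle the boundaries separately.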

Cassandra sets or composite columns

I am storing account information in Cassandra. Each account has lists of data associated with it. For example, an account may have a list of friends and a list of liked books. Queries on accounts will always want all friends or all liked books or all of both. No filtering or searching is needed on either. The list of friends and books can grow and shrink.
Is it better to use a set column type or composite columns for this scenario?
I would suggest not using sets if:
You are concerned about disk space (each value is allocated its own cell on disk, plus metadata for each cell, about 15 bytes if I am not wrong; that adds up quickly if your data keeps growing).
The data in that particular row is going to grow a lot, since the cells may then have to be fetched from different SSTables on each read.
In these kinds of cases, the preferred option would be a JSON array: store it as text and read the data back from it.
The use case for sets (and the other collections) was brought in from a completely different perspective. If you need to check for a particular value inside the list, or a value has to be updated frequently inside the same collection, then you should make use of collections.
My take on your query is this:
Store all the account-specific info in a JSON object that holds the list of friends and the list of liked books as values.
Sets are good for smaller collections of data. If you expect your friends / liked-books lists to grow constantly and get large (there isn't a golden number here), it would be better to go with composite columns, as that model scales out better than collections and allows for straight-up querying, compared to requiring secondary indexes on collections.
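For illustration, here is a rough sketch of both models through the Python driver, with "composite columns" written as its CQL equivalent, a table with clustering columns; the keyspace, table, and column names are placeholders:

# Sketch: the two options expressed in CQL via the Python driver.
import uuid
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# Option 1: collection columns -- simple, fine while the sets stay small.
session.execute("""
    CREATE TABLE IF NOT EXISTS accounts (
        account_id  uuid PRIMARY KEY,
        friends     set<text>,
        liked_books set<text>
    )""")

# Option 2: one clustering row per item -- scales better as the lists grow,
# and 'all friends of an account' is still a single-partition read.
session.execute("""
    CREATE TABLE IF NOT EXISTS account_items (
        account_id uuid,
        item_type  text,   -- 'friend' or 'book'
        item       text,
        PRIMARY KEY (account_id, item_type, item)
    )""")

account_id = uuid.uuid4()  # placeholder account
session.execute(
    "INSERT INTO account_items (account_id, item_type, item) VALUES (%s, %s, %s)",
    (account_id, "friend", "alice"))
rows = session.execute(
    "SELECT item FROM account_items WHERE account_id = %s AND item_type = %s",
    (account_id, "friend"))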

Azure Table Storage Delete where Row Key is Between two Row Key Values

Is there a good way to delete entities that are in the same partition given a row key range? It looks like the only way to do this would be to do a range lookup and then batch the deletes after looking them up. I'll know my range at the time that entities will be deleted so I'd rather skip the lookup.
I want to be able to delete things to keep my partitions from getting too big. As far as I know, a single partition cannot be scaled across multiple servers. Each partition is going to represent a type of message that a user sends. There will probably be fewer than 50 types. I need a way to show all the messages of each type that were sent (e.g. show recent messages of type 0, regardless of who sent them). This is why I plan to make the type the partition key. Since the types don't scale with the number of users/messages, though, I don't want to let each partition grow indefinitely.
Unfortunately, you need to know precise Partition Keys and Row Keys in order to issue deletes. You do not need to retrieve entities from storage if you know precise RowKeys, but you do need to have them in order to issue batch delete. There is no magic "Delete from table where partitionkey = 10" command like there is in SQL.
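To illustrate the lookup-then-batch-delete pattern, here is a rough sketch using the current azure-data-tables Python package (the SDK has changed since this answer was written); the connection string, table name, and key range are placeholders:

# Sketch: range lookup followed by batched deletes within one partition.
from azure.data.tables import TableClient

table = TableClient.from_connection_string(conn_str="...", table_name="Messages")

flt = "PartitionKey eq '0' and RowKey ge '2014-01-01' and RowKey lt '2014-02-01'"
entities = table.query_entities(flt, select=["PartitionKey", "RowKey"])

batch = []
for e in entities:
    batch.append(("delete", e))
    if len(batch) == 100:                 # transactions are capped at 100 operations
        table.submit_transaction(batch)
        batch = []
if batch:
    table.submit_transaction(batch)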
However, consider breaking your data up into tables that represent archivable time units. For example, in AzureWatch we store all of the metric data in tables that each represent one month of data, e.g. Metrics201401, Metrics201402, etc. Thus, when it comes time to archive, a full table is purged for a particular month.
The obvious downside of this approach is the need to "union" data from multiple tables if your queries span wide time ranges. However, if you keep your time ranges to a minimum, the number of unions will not be as big. Basically, this approach allows you to use the table name as another partitioning opportunity.
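A small sketch of the table-per-month idea, again with the azure-data-tables package and placeholder names and dates:

# Sketch: month-named tables make archiving a whole-table drop.
from datetime import date
from azure.data.tables import TableServiceClient

svc = TableServiceClient.from_connection_string(conn_str="...")

def metrics_table(d: date) -> str:
    return f"Metrics{d:%Y%m}"             # e.g. Metrics201401

# purge an entire month in one call instead of row-by-row deletes
svc.delete_table(metrics_table(date(2014, 1, 1)))

# querying a span means unioning results from each month's table
rows = []
for m in (date(2014, 2, 1), date(2014, 3, 1), date(2014, 4, 1)):
    client = svc.get_table_client(metrics_table(m))
    rows.extend(client.query_entities("PartitionKey eq 'cpu'"))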
