Druid rollup: exclude a long column (transform)

My Druid version is 0.16.
I have a long column, status_code, which just represents the HTTP response status.
When rollup is enabled, this column gets aggregated, but I don't want that. Even if I create a transform column that adds a string prefix to status_code to make it a string column, it is still aggregated as a long column.
How can I exclude this column from rollup? Please heeeeeelp...

It may also be useful to make sure that the column is specified as a dimension column, and not as a metric. Metric columns are fields that are used in aggregation functions, like sum. Dimensions are "fixed" values which can be used to filter for specific records and to group by.

I found a solution: create a transform that converts the long column to a string column, then add the transformed string column manually as a dimension in the "Edit JSON spec" tab.
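As a rough sketch, the relevant fragments of the ingestion spec could look like the following (the transform name, the prefix and the expression are illustrative, and in 0.16 the dimensionsSpec normally sits under parser.parseSpec rather than directly under dataSchema):

"transformSpec": {
  "transforms": [
    { "type": "expression", "name": "status_code_str", "expression": "concat('code_', status_code)" }
  ]
},
"dimensionsSpec": {
  "dimensions": [
    { "type": "string", "name": "status_code_str" }
  ]
}

Because status_code_str is declared as a string dimension rather than a metric, rollup groups by it instead of aggregating it.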

Related

Delete bottom two rows in Azure Data Flow

I would like to delete the bottom two rows of an Excel file in ADF, but I don't know how to do it.
The flow I am thinking of is a filter that removes the rows marked for deletion (highlighted in yellow in my screenshot).
The file has over 40,000 rows of data and is updated once a month. (The number of rows changes with each update, so the condition must be specified with a function.)
The contents of the file are shown in a second screenshot; the bottom two rows contain spaces and asterisks.
I'm new to Azure and having trouble, so any help would be appreciated.
Add a surrogate key transformation to put a row number on each row. Add a new branch to duplicate the stream and in that new branch, add an aggregate.
Use the aggregate transformation to find the max() value of the surrogate key counter.
Then subtract 2 from that max number and filter for just the rows up to that max-2.
Let me provide a more detailed answer here ... I think I can get it in here without writing a separate blog.
The simplest way to filter out the final 2 rows is the following pattern. Instead of the new branch, I just created 2 sources, both pointing to the same data source. The 2nd stream is there just to get a row count and store it in a cached sink. For the aggregation expression I used count(1) as the row count aggregator.
In the first stream, which is the primary data processing stream, I add a Surrogate Key transformation so that I have a row number for each row. I called my key column "sk".
Finally, set the Filter transformation to only allow rows with a row number <= the max row count from the cached sink minus 2.
The Filter expression looks like this: sk <= cachedSink#output().rowcount-2
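Putting the pieces together, and keeping the names used above (the sink name cachedSink, the aggregate column rowcount and the key column sk are just the names chosen here), the transformations look roughly like this:

Aggregate (2nd stream, written to the cached sink cachedSink):
    rowcount = count(1)
Surrogate Key (main stream):
    key column sk, starting at 1
Filter (main stream):
    sk <= cachedSink#output().rowcount - 2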

How can I transpose and summarize data appropriately in PowerQuery?

I'm working on achieving the following data transformation/wrangling within Power Query but can't seem to get there on my own. I have read a lot of different questions and answers on the forum but it seems just a bit beyond my grasp.
I have a table which has the ticker of a specific currency in the first column.
There is a second column with the date and time when a certain event, related to that specific currency, happens. This second column is basically the different 5-minute intervals which exist on any given day.
Finally there is a third column which describes the magnitude of the event.
The table therefore looks like this
What I would like to do in Power Query is transpose the unique names of the currencies into the first row of a new table. The first column of this table would be the largest time interval for any given currency; in this case, as you can see in the data I am attaching, the longest timeseries is that of the currency ETH. Using that longest calendar as the first column, I would then like to place the magnitude values described above as rows in the new table.
The new layout would look like this
My steps to transform the raw data in the first table are detailed in the attached image. Basically I am just expanding a JSON file and getting all the data I need into the first format I described previously.
What I then do is:
Pivot using the first column
Transpose
That gives me a whole bunch of new columns. Way more than I want. Any idea what I can do differently?
In Power Query,
click to select the pair column,
then Transform .. Pivot Column .. Values Column: basis; Advanced options: Don't Aggregate.
code:
#"Pivoted Column6" = Table.Pivot(YourPriorStepName, List.Distinct(Source[pair]), "pair", "basis", List.Sum)
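As a self-contained sketch (the table name RawData and the column names pair, time and basis are assumptions based on the description above), the whole query could look like this:

let
    // assumed source table with columns: pair, time, basis
    Source = Excel.CurrentWorkbook(){[Name="RawData"]}[Content],
    // pivot the distinct currency tickers into columns; no aggregation
    // function is supplied, so each cell keeps the single basis value
    // for that (time, pair) combination
    #"Pivoted Column" = Table.Pivot(Source, List.Distinct(Source[pair]), "pair", "basis")
in
    #"Pivoted Column"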

Azure Data Flow - Can we have dynamic columns or a change in projection for Unpivot functionality

The Excel file consists of 62 columns; 7 columns are fixed and the rest contain the weeks of the year (week1 to week52).
I have used a data flow task to unpivot the 53 columns into rows, with 2 extra columns, year and value.
The problem is that the 52 week column names keep changing on every weekly data load. How do I handle this change in column names in the data flow? For a single run it gives the exact output.
What you'll want to do here is to implement late-binding of your schema, or what ADF refers to as "schema drift". Instead of setting a hardened "early binding" schema in your Source projection, leave the dataset schema and projection empty.
Next, add a Derived Column after your source and call it "Projection". This is where you'll build your projection using rules to account for your evolving schema.
Build out your canonical model with the column names for your entire year using byName('columnname'). That will tell ADF to look for the existence of the column in single quotes from your source data while also providing a schema that you can use to build out your pivot table.
If you need to cast the values, wrap byName() inside of a casting function, i.e. toString(), toDate(), etc.
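As a sketch, the "Projection" derived column could define expressions along these lines (the canonical column names and the toInteger() cast are illustrative assumptions; byName('...') simply looks up whatever column of that name arrives from the drifted source):

Derived Column "Projection":
    week1  : toInteger(byName('week1'))
    week2  : toInteger(byName('week2'))
    ...
    week52 : toInteger(byName('week52'))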

How to identify notesView columns that have column totals defined for them?

I have several views that have 'column totals' defined in some columns of these views. The totals can be in different positions in each view. I'm looking for a fast, reliable way of identifying which columns have totals before I scan these columns and views.
Ideally, I want an 'isTotal' property on the column definition (NotesViewColumn), but that property is not defined/available.
I can see that the totals in the ColumnValues array come back with a 'double' datatype where a total exists, but I can only see this once I've started scanning the data in the view, and I want this detail before I start looking at the data. (For information, the ColumnValues array for a category NotesViewEntry contains strings for category columns, 'empty' for untotalled fields, and doubles for totals.)
I can (of course) hard-code this detail somewhere, but it seems archaic to have to do this. I can 'getFirstDoc' to work out the ColumnValues in a 'pre-loop' check, but this seems wasteful.
PS: I have seen something called 'ColumnValuesIndex' but this appears to be an undocumented feature which I would prefer not to use. However, if there were an 'isTotal' undocumented feature - I'd be ok with it!
The only solution I can think of to do this before you scan the data is to export the view design to DXL, then check the DXL for attributes or elements that specify whether each column shows totals.
I'm assuming view columns in DXL have an attribute or child element for this purpose. I haven't checked.
In case you've never done anything like this, using a NotesDXLExporter with either a NotesDOMParser or a NotesSAXParser, you can export selected design elements to in-memory DXL and programmatically analyse it.
As outlined in the original post, looking at the ColumnValues of a NotesViewEntry (NVE) on a category row does provide an array of values you can use to determine if any specific column is either a category string, empty, or a total. The totals have a datatype of double so stepping over the columns in a loop can easily flag the totals.
If the view has categories, then the first NVE in the view will give these details. A simple 'NotesView.GetFirstEntry().ColumnValues' will return the array. If the view has totals, but not categories, you can 'GetLastEntry' for the totals row at the bottom of the view.
Reading the totals is then just a case of looking for category rows (nve.IsCategory) and extracting the totals from the nve.ColumnValues.
Performance is reasonable and can be made a bit quicker by building a boolean array pre-loop of where the totals exist.
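A minimal LotusScript sketch of that pre-loop check, assuming a categorised view (the view name is illustrative); DataType() returns 5 (V_DOUBLE) for the total cells on a category row:

Dim session As New NotesSession
Dim db As NotesDatabase
Dim view As NotesView
Dim entry As NotesViewEntry
Dim vals As Variant
Dim isTotal() As Boolean
Dim i As Integer

Set db = session.CurrentDatabase
Set view = db.GetView("MyCategorisedView")   ' illustrative view name
Set entry = view.GetFirstEntry()             ' first entry is a category row in a categorised view
vals = entry.ColumnValues
Redim isTotal(Ubound(vals))
For i = 0 To Ubound(vals)
    ' totals come back as doubles on the category row
    isTotal(i) = (DataType(vals(i)) = 5)
Next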

How to arrange data in Cassandra to get data in last in first out format

As we cannot sort data in Cassandra, I wanted to store data in such a format that when I retrieve it, I get it in 'last in, first out' order, i.e. if users enter comments, when I retrieve the data I should get the very latest comment first and then the older comments. I think it's something to do with the comparator.
I have set following when configuring Cassandra:
assume posts comparator as utf8;
assume posts validator as utf8;
assume posts keys as utf8;
Please help - how should I create the columns so that the data is arranged by time, with the latest data first?
Columns in a row are always sorted, and you can iterate over the columns in a row in reverse order. Given these two facts, we can model the situation you're describing by storing comments in a column family called "comments", where the row key is the post ID and the columns represent the comments on the corresponding post. The column names are timestamps (either ISO-formatted dates, UNIX timestamps or time UUIDs) and the values are the comment text bodies.
If you now get the columns for a row and specify that you want them in reverse order, you get exactly what you're after. How to specify reverse order depends on your driver, but it's usually just an option to the command that retrieves a row or a column slice.
Another, more hackish, way would be to take the UNIX timestamp of a post, subtract it from a large integer like 2^31, and use the result as the column name. That way the columns sort in reverse order by default. It's not pretty, and the method above is more elegant.
If you worry about using timestamps because there could be collisions where two comments are posted at exactly the same time, use Cassandra's time UUID type.
You need to organize your data such that the comparator is a timestamp. You store your data in natural order and specify reverse order in your slice query.
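For example, in cassandra-cli terms (the column family name and the choice of TimeUUIDType are illustrative), the setup could look roughly like this:

create column family comments
    with comparator = 'TimeUUIDType'
    and key_validation_class = 'UTF8Type'
    and default_validation_class = 'UTF8Type';

Each row key would be a post ID, each column name a time UUID for the moment the comment was written, and each column value the comment body. Requesting the slice in reverse order is then a client-side option, e.g. the reversed flag on a Thrift SliceRange, or the equivalent setting in your driver.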
