How can Cassandra provide an optimized solution for an overlapping range check? - cql

In my input I have a key, a lower bound of range R1, an upper bound of range R1, and some data. I have to insert this data only after ensuring that the input range R1 does not overlap any range already present in Cassandra.
So before each insert I have to fire a select query:
 key  | lowerbound | upperbound | data
------+------------+------------+--------
 1024 |       1000 |       1100 | <blob>
 1024 |       1500 |       1900 | <blob>
 1024 |       2000 |       2900 | <blob>
 1024 |       3000 |       3900 | <blob>
 1024 |       4000 |       4500 | <blob>
Case 1: Given range R(S,E) = (1,999)
This is a positive case, so the system should insert the data.
Case 2: Given range R(S,E) = (1001,1010)
This is a negative case, so the system should discard the data.
I have a solution that uses one range query and one programmatic check.
Please let me know whether this kind of problem statement has a solution in Cassandra, and if so, whether it can be optimized for better performance.
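For illustration, here is a minimal sketch of that approach; the table definition and clustering layout below are assumed, not a confirmed schema. Because the stored ranges are non-overlapping and sorted by lowerbound, the only existing range that could overlap a new range (S,E) is the one with the greatest lowerbound <= E, so one range query plus one client-side comparison is enough:

-- Assumed schema: cluster by lowerbound so each partition stays sorted.
CREATE TABLE ranges (
    key int,
    lowerbound int,
    upperbound int,
    data blob,
    PRIMARY KEY (key, lowerbound)
);

-- One range query, e.g. for Case 2 with (S,E) = (1001,1010):
SELECT lowerbound, upperbound FROM ranges
WHERE key = 1024 AND lowerbound <= 1010
ORDER BY lowerbound DESC LIMIT 1;

-- Programmatic check: insert (S,E) only if no row comes back, or if the
-- returned upperbound is < S. Here (1000,1100) comes back and
-- 1100 >= 1001, so the data is discarded.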

You don't have a better solution for your problem: this is the only way. In the future, Lightweight Transactions might help in these situations too, but for now the only option you have is to read before writing. One more consideration: make sure you avoid double insertion in concurrent situations (if that can happen in your application).
Cheers,
Carlo

Related

Calculate sum of differences between rows but only if the previous cell is bigger than the current one

I have been getting a headache over a problem that I cannot figure out for the love of me.
There is an unknown number of rows, if that makes any difference, but basically each row needs to be compared to the previous one, and ONLY when the previous value is greater does the difference between them get added to the sum.
So for example I have this table:
  |  A  |
--+-----+
1 | 100 |
2 |  90 |
3 |  80 |
4 | 100 |
5 |  70 |
6 |  20 |
7 | 100 |
...
Expected result: 100, derived from ((100-90) + (90-80) + (100-70) + (70-20))
I have spent a whole day browsing every single excel tutorial page and cannot find a single helpful answer. Please help :(
Formula for cell B2 (pull down through the rows):
=IF(A1>A2;A1-A2;0)+B1
Logic: if the previous value is larger than the current value, add the difference to the running total.
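Assuming B1 is left blank (or 0), pulling that formula down over the sample data gives:

  |  A  |  B  |
--+-----+-----+
1 | 100 |   0 |
2 |  90 |  10 |
3 |  80 |  20 |
4 | 100 |  20 |
5 |  70 |  50 |
6 |  20 | 100 |
7 | 100 | 100 |

so the expected total of 100 appears in the last row.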
If you want to do it in one formula, a basic way would be to use two ranges offset by one cell:
=SUMPRODUCT((A1:A6-A2:A7)*(A1:A6>A2:A7))
If you wanted to make it a bit more dynamic (assuming there are no gaps in the data) you could try:
=SUMPRODUCT((A1:INDEX(A:A,COUNT(A:A)-1)-A2:INDEX(A:A,COUNT(A:A)))*(A1:INDEX(A:A,COUNT(A:A)-1)>A2:INDEX(A:A,COUNT(A:A))))
If there are blanks between the numbers this won't work, and you would probably need to go back to a simpler pull-down formula.

Calculating the size of a table in Cassandra

In "Cassandra The Definitive Guide" (2nd edition) by Jeff Carpenter & Eben Hewitt, the following formula is used to calculate the size of a table on disk (apologies for the blurred part):
ck: primary key columns
cs: static columns
cr: regular columns
cc: clustering columns
Nr: number of rows
Nv: used for counting the total size of the timestamps (I don't completely get this part, but I'll ignore it for now).
There are two things I don't understand in this equation.
First: why does the size of the clustering columns get counted for every regular column? Shouldn't we multiply it by the number of rows instead? It seems to me that by calculating it this way, we're saying the data in each clustering column gets replicated for each regular column, which I suppose is not the case.
Second: why don't the primary key columns get multiplied by the number of partitions? From my understanding, if we have a node with two partitions, then we should multiply the size of the primary key columns by two, because we'll have two different primary keys on that node.
It's because of Cassandra's internal storage structure in versions < 3:
- There is only one entry for each distinct partition key value.
- For each distinct partition key value, there is only one entry per static column.
- For each row, there is an empty entry for the clustering key prefix.
- For each regular column in a row, there is a single entry prefixed by the clustering key values.
Let's take an example :
CREATE TABLE my_table (
    pk1 int,
    pk2 int,
    ck1 int,
    ck2 int,
    d1 int,
    d2 int,
    s int static,
    PRIMARY KEY ((pk1, pk2), ck1, ck2)
);
Insert some dummy data:
 pk1 | pk2 | ck1 | ck2  |   s   |   d1   |   d2
-----+-----+-----+------+-------+--------+---------
   1 |  10 | 100 | 1000 | 10000 | 100000 | 1000000
   1 |  10 | 100 | 1001 | 10000 | 100001 | 1000001
   2 |  20 | 200 | 2000 | 20000 | 200000 | 2000000
The internal structure will be:
     |   s   | 100:1000: | 100:1000:d1 | 100:1000:d2 | 100:1001: | 100:1001:d1 | 100:1001:d2 |
-----+-------+-----------+-------------+-------------+-----------+-------------+-------------+
1:10 | 10000 |           |      100000 |     1000000 |           |      100001 |     1000001 |

     |   s   | 200:2000: | 200:2000:d1 | 200:2000:d2 |
-----+-------+-----------+-------------+-------------+
2:20 | 20000 |           |      200000 |     2000000 |
So the size of the table will be:
Single Partition Size = (4 + 4 + 4 + 4) + 4 + 2 * ((4 + (4 + 4)) + (4 + (4 + 4))) bytes = 68 bytes
Estimated Table Size  = Single Partition Size * Number of Partitions
                      = 68 * 2 bytes
                      = 136 bytes
Here every field is of type int (4 bytes). There are 4 primary key columns, 1 static column, 2 clustering key columns, and 2 regular columns.
More : http://opensourceconnections.com/blog/2013/07/24/understanding-how-cql3-maps-to-cassandras-internal-data-structure/
As the author, I greatly appreciate the question and your engagement with the material!
With respect to the original questions - remember that this is not the formula to calculate the size of the table; it is the formula to calculate the size of a single partition. The intent is to use this formula with a "worst case" number of rows to identify overly large partitions. You'd need to multiply the result of this equation by the number of partitions to get an estimate of the total data size for the table. And of course this does not take replication into account.
Also thanks to those who responded to the original question. Based on your feedback I spent some time looking at the new (3.0) storage format to see whether that might impact the formula. I agree that Aaron Morton's article is a helpful resource (link provided above).
The basic approach of the formula remains sound for the 3.0 storage format. The way the formula works, you're basically adding:
- the sizes of the partition key and static columns
- the size of the clustering columns per row, times the number of rows
- 8 bytes of metadata for each cell
Updating the formula for the 3.0 storage format requires revisiting the constants. For example, the original equation assumes 8 bytes of metadata per cell to store a timestamp. The new format treats the timestamp on a cell as optional since it can be applied at the row level. For this reason, there is now a variable amount of metadata per cell, which could be as low as 1-2 bytes, depending on the data type.
After reading this feedback and rereading that section of the chapter, I plan to update the text to add some clarifications as well as stronger caveats about this formula being useful as an approximation rather than an exact value. There are factors it doesn't account for at all such as writes being spread over multiple SSTables, as well as tombstones. We're actually planning another printing this spring (2017) to correct a few errata, so look for those changes soon.
Here is the updated formula from Artem Chebotko:
The t_avg is the average amount of metadata per cell, which can vary depending on the complexity of the data, but 8 is a good worst-case estimate.
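Based on that description, the updated formula presumably keeps the same shape as before, with t_avg replacing the fixed 8 bytes of metadata per cell:

S_t = sum(sizeOf(c_k)) + sum(sizeOf(c_s)) + N_r * (sum(sizeOf(c_r)) + sum(sizeOf(c_c))) + t_avg * N_v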

Apache POI: after shiftRows() some of the cell ranges don't get extended

The project I'm working on uses Apache POI to manage Excel output. For all the output values, a number of statistical values are calculated by Excel. By default, 10 output values are expected and are written to the spreadsheet in one column. Starting from row 11, several rows are dedicated to the statistical summary calculations mentioned above.
For instance:
   |         A          | B |
 1 |
 2 |
 3 |
 4 |
.. |
10 |
11 | $(AVERAGE(A1:A10))
12 | $(STDEV.S(A1:A10))
13 | // other statistical values (st. error, confidence intervals, etc.)
If the total number of output entries exceeds 10, the shiftRows() function is used to move the statistical calculations down by the number of rows exceeding 10. By using shiftRows() starting from row 10, the cell ranges used in rows 11+ get extended as expected. For instance, if 13 output values are produced, the cell range should become A1:A13. This is true for all rows except for the standard deviation, which happens to be the second row:
   |         A          | B |
 1 |
 2 |
 3 |
 4 |
.. |
13 |
14 | $(AVERAGE(A1:A13))
15 | $(STDEV.S(A1:A10)) // should be A1:A13
16 | $(func(A1:A13))
17 | // other statistical values (st. error, confidence intervals, etc.)
I cannot find a reasonable explanation for why it doesn't work for the standard deviation row.
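For reference, a minimal sketch of the kind of shift being performed (file names, sheet index, and row offsets are illustrative, not the project's actual code):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class ShiftStatsRows {
    public static void main(String[] args) throws Exception {
        try (Workbook wb = new XSSFWorkbook(new FileInputStream("stats.xlsx"))) {
            Sheet sheet = wb.getSheetAt(0);
            int extraRows = 3; // e.g. 13 output values instead of the default 10
            // Shift everything from row index 10 (row 11, the summary block) down;
            // POI is expected to adjust ranges like AVERAGE(A1:A10) to A1:A13.
            sheet.shiftRows(10, sheet.getLastRowNum(), extraRows);
            // Recalculate so the adjusted formulas are evaluated.
            wb.getCreationHelper().createFormulaEvaluator().evaluateAll();
            try (FileOutputStream out = new FileOutputStream("stats-shifted.xlsx")) {
                wb.write(out);
            }
        }
    }
}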
Updating Apache POI from 3.13 to 3.15 resolved the issue.
Thanks to everyone for offering your help.

Dynamic Data Validation lists based on VLookup

I'm trying to add a custom 'discount' list to my spreadsheet.
I've got a table that contains all the data, and has costs for the standard 'used' value, then also the values at a 5% discount and a 10% discount.
Example:
+---------+-------------------+------+------------+-------------+
| Code | Role | Used | Used - 5% | Used - 10% |
+=========+===================+======+============+=============+
| Test001 | Employee | 5.67 | | |
+---------+-------------------+------+------------+-------------+
| Test002 | Junior Technician | 9.80 | 9.31 | 8.38 |
+---------+-------------------+------+------------+-------------+
| Test003 | Project Manager | 15 | | |
+---------+-------------------+------+------------+-------------+
| Test004 | Engineer | 20 | 19 | 17.10 |
+---------+-------------------+------+------------+-------------+
I've then got a data validation list which returns all of the 'Roles' to select from. On the back of this, it populates the Cost cell.
Example:
+----------+----------+----------+-------+
| Role     | VLOOKUP  | Discount | Cost  |
+==========+==========+==========+=======+
| Employee |          |          |  5.67 |
+----------+----------+----------+-------+
| Engineer | 5%,10%   | 10%      | 17.10 |
+----------+----------+----------+-------+
What I want is for the list to be populated with 5%, 10% when those options exist. I'd like to achieve this without VBA (I could easily achieve this with VBA, but I'm trying to keep it all in the worksheet).
My VLOOKUP Column is populated using:
=CONCATENATE(IF(VLOOKUP(A2,INDIRECT("Test[[Role]:[Used - 10%]]"), 3, FALSE) <> "", "5%", ""),
IF(VLOOKUP(A2,INDIRECT("Test[[Role]:[Used - 10%]]"), 4, FALSE) <> "", ",10%", ""))
The issue comes when trying to do the data validation. It accepts the formula (I tried using the above in the data validation, to no avail) but populates the drop-down list with just the one value "5%,10%" instead of interpreting it as a CSV.
I'm currently using this to attempt to populate the Discount Drop Down
=OFFSET(INDIRECT(ADDRESS(ROW(), COLUMN())),0, -1)
It is possible, assuming your version of Excel has access to the dynamic array functions FILTER and UNIQUE. Let's go through a couple of things, and here is a google doc where this is demonstrated. I also included an online excel file*.
It isn't necessary to calculate the cost in the setup table (A:E). You can just use a character to mark availability (in some versions it was difficult to make FILTER work with comparisons like <>"", etc., when ="x" worked fine).
You can get an array of available discounts by using FILTER, INDEX and MATCH. See Col P. You use INDEX/MATCH to return a single row of the array containing the discounts (in this case D:E), and then use that row to filter the top row (D1:E1) which has the friendly discount names and return it as an array.
It isn't necessary to concat the discount list the way you're doing. You can use TEXTJOIN, FILTER, INDEX and MATCH. See Col I. You just wrap the calculation that generates the array of discount names (step 2) in TEXTJOIN to get a string.
The validation is accomplished by referencing the output of step 2. I don't think that the data validation dialog can handle the full formula, so I pointed it to Cols O:Q. Col O is included in the validation so that you can get an empty spot at the top of the list, but Google Docs seems to strip it out.
You can just calculate the discounted cost from the selected option. See Col K. I included the original cost in Col L so you can see it.
* You will need a Microsoft account to view.
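As an illustrative sketch of the FILTER and TEXTJOIN steps above (every range here, and the "x" availability marker, is an assumption about the layout: roles in B2:B5 of the setup table, the two discount columns in D:E, and the selected role in A2):

Array of available discounts:
=FILTER($D$1:$E$1, INDEX($D$2:$E$5, MATCH($A2, $B$2:$B$5, 0), 0)="x")

Joined string for display:
=TEXTJOIN(",", TRUE, FILTER($D$1:$E$1, INDEX($D$2:$E$5, MATCH($A2, $B$2:$B$5, 0), 0)="x"))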

How to resolve duplicate column names in excel file with Alteryx?

I have a wide excel file with price data, looking like this:
Product | 2015-08-01 | 2015-09-01 | 2015-09-01 | 2015-10-01
ABC     |         13 |         12 |         15 |         14
CDE     |         69 |         70 |         71 |         67
FGH     |         25 |         25 |         26 |         27
The date 2015-09-01 appears twice, which is valid in this context but obviously messes up my workflow.
The first value is to be understood as the minimum price, the second one as the maximum price. If there is only one column, min and max are the same.
Is there a way to resolve this issue?
An idea I had was the following:
I also have cells that contain a value like "38 - 42", again indicating min and max. I resolved this by splitting it based on a regex expression. A possible solution would be to join two columns that have the same header, and afterwards split the values according to my rules. That, however, would require me to detect dynamically whether the headers are duplicates.
Is that something that is possible in Alteryx or is there an easier solution for this problem?
And of course, asking the supplier of the file to change it is not really an option, unfortunately.
Thanks
EDIT:
Just got another idea:
I transpose the table to have the format
Product | Date | Price Low | Price High
So if I could check for duplicates in that table and somehow merge these records into one, that would do the trick as well.
EDIT2:
Since I don't seem to have made that clear: my final result should look like the transposed table in EDIT 1. If there is only one value, it should go in "Price Low" (and then I will probably copy it to "Price High" anyway). If there are two values, they should go in the corresponding columns. @Poornima's suggestion resolves the duplicate issue in a more sophisticated way than putting a "_2" behind the column name, but doesn't put the value in the required column.
If this format works for you:
Product | Date | Price Low | Price High
Then:
- Transpose with Product as a key field.
- Use a Select tool to truncate your Name field to 10 characters. This will remove any _2 values that Alteryx has automatically renamed.
- Summarize:
  - Group by Product
  - Group by Name
  - Then apply Min and Max operations to Value.
The result is:
Product | Name       | Min_Value | Max_Value
ABC     | 2015-08-01 |        13 |        13
ABC     | 2015-09-01 |        12 |        15
ABC     | 2015-10-01 |        14 |        14
For this problem, you can leverage the native Excel (.xlsx) driver available in Alteryx 9.1. If multiple columns in Excel use the same string, they are renamed by the native driver with an underscore suffix, e.g., 2015-09-01, 2015-09-01_1. By leveraging this, we can reformat the data in three steps:
- As you suggested, we start by transposing the data so that we can leverage the column headers.
- We can then write a formula with the Formula Tool that evaluates whether the column header for the date is the first or the last one, based on the header length (see the expression sketch below).
- The final step is to bring the data back into the same format as before, which can be done via the Crosstab Tool.
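A possible Formula Tool expression for that second step (the [Name] field and the 10-character length of a yyyy-mm-dd header are assumptions based on the sample data):

IF Length([Name]) > 10 THEN "Price High" ELSE "Price Low" ENDIF

Renamed duplicates such as 2015-09-01_1 are longer than 10 characters, so they land in Price High, while single-occurrence dates land in Price Low.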
You can review the configurations for each of these tools here. The end result would be as follows.
Hope this helps.
Regards,
Poornima
