While I am running nodetool tablehistograms, I get the message below:
nodetool tablehistograms keyspace TableName
Column counts are larger than 1996099046, unable to calculate percentiles
Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
                      (micros)       (micros)      (bytes)
50%         0.00      0.00           0.00          268650950       NaN
75%         0.00      0.00           0.00          3449259151      NaN
95%         0.00      0.00           0.00          25628284214     NaN
98%         0.00      0.00           0.00          44285675122     NaN
99%         0.00      0.00           0.00          44285675122     NaN
Min         0.00      0.00           0.00          105779          0
Max         0.00      0.00           0.00          44285675122     9223372036854776000
Cassandra version:
[cqlsh 5.0.1 | Cassandra 3.11.2 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
Replication factor: 3
4-node cluster
I am getting the above message on one node only.
I tried repairing the table, but it failed with a streaming error:
ERROR [StreamReceiveTask:53] 2019-06-10 13:54:33,684 StreamSession.java:593 - [Stream #c9214180-8b82-11e9-90ce-399bac480141] Streaming error occurred on session with peer <IP ADDRESS>
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Unable to compute ceiling for max when histogram overflowed
    at org.apache.cassandra.utils.Throwables.maybeFail(Throwables.java:51) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.utils.FBUtilities.waitOnFutures(FBUtilities.java:373) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.index.SecondaryIndexManager.buildIndexesBlocking(SecondaryIndexManager.java:383) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.index.SecondaryIndexManager.buildAllIndexesBlocking(SecondaryIndexManager.java:270) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:216) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_144]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_144]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_144]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_144]
ERROR [Reference-Reaper:1] 2019-06-10 13:54:33,907 Ref.java:224 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State#7bd8303d) to class org.apache.cassandra.io.util.ChannelProxy$Cleanup#1084465868:PATH/talename-5b621cd0c53311e7a612ffada4e45177/mc-26405-big-Index.db was not released before the reference was garbage collected
The table description includes:
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Any idea why this is happening? Any help or suggestions are welcome.
You cannot have 2 billion cells in a partition. Also, having a secondary index on a table with a 44 GB partition is going to cause issues for multiple reasons. There really isn't much you can do to fix this short of dropping your index and building a new data model to migrate into. You could build a custom version of Cassandra to ignore that exception, but something else will come up very soon, as you are at the extreme limits of what's even theoretically possible. You are already past a point that I am surprised is running.
If the streaming error is from repairs, you can ignore it while you fix your data model. If it's from bootstrapping, I think you will need a custom version of Cassandra to stay running in the meantime (or you can just ignore the down node you are replacing). Keep in mind that node failures are a serious threat to you now, as bootstrapping likely will not work. When you put so much into a single partition, it cannot be scaled out, so there are limited options.
Having scoured numerous posts, I am still struggling to find a solution for a report I am trying to transition over to Power BI from MS Excel.
Problem
Create a table in the report section of Power BI that has a unique list of currencies (based on 2 columns) and their corresponding FX exposure, which is defined based on each currency leg from 2 columns. Below I have shown the source data and the workings I use in Excel, which I am trying to replicate.
Source data (from database table)
a             b           c           d             e             f               g
Instrument    Currency 1  Currency 2  FX nominal 1  FX nominal 2  FXNom1 - Gross  FXNom2 - Gross
FWD EUR/USD   EUR         USD         -7.965264529  7.90296523    7.97            7.90
FWD USD/JPY   USD         JPY         1.030513307   -1.070305687  1.03            1.07
Instrument 1  USD                     1.75862819                  1.76            0.00
Instrument 2  USD         TRY         0             3.45E-04      0.00            0.00
Instrument 3  JPY                     1.121782037                 1.12            0.00
Instrument 4  EUR                     6.2505079                   6.25            0.00
FWD EUR/CNH   EUR         CNH         0.007591392   3.00E-09      0.01            0.00
Instrument 5  RUB                     6.209882675                 6.21            0.00
F2 = ABS(FX nominal 1)
G2 = ABS(FX nominal 2)
Report output in excel
a
b
c
d
e
FX
Long
Short
Net
**Gross **
0
0.00
0.00
0.00
0.00
RUB
6.21
0.00
6.21
6.21
EUR
6.26
-7.97
-1.71
14.22
JPY
1.12
-1.07
0.05
2.19
USD
10.69
0.00
10.69
10.69
CNH
0.00
0.00
0.00
0.00
TRY
0.00
0.00
0.00
0.00
My Excel formulas to recreate what I am looking for are below.
A2: =IFERROR(LOOKUP(2, 1/(COUNTIF(Report!$A$1:A1,Data!$B$2:$B$553)=0), Data!$B$2:$B$553), LOOKUP(2, 1/(COUNTIF(Report!$A$1:A1, Data!$C$2:$C$553)=0), Data!$C$2:$C$553))
B2: =((SUMIFS(Data!$D$2:$D$553, Data!$B$2:$B$553, Report!$A2, Data!$D$2:$D$553, ">0"))+(SUMIFS(Data!$E$2:$E$553, Data!$C$2:$C$553, Report!$A2, Data!$E$2:$E$553, ">0")))
C2: =((SUMIFS(Data!$D$2:$D$553, Data!$B$2:$B$553, Report!$A3, Data!$D$2:$D$553, "<0"))+(SUMIFS(Data!$E$2:$E$553, Data!$C$2:$C$553, Report!$A3, Data!$E$2:$E$553, "<0")))
D2: =(SUMIF(Data!$B$1:$B$553,Report!$A3,Data!$D$1:$D$553)+SUMIF(Data!$C$1:$C$553,Report!$A3,Data!$E$1:$E$553))
E2: =(SUMIF(Data!$B$1:$B$554,Report!$A3,Data!$F$1:$F$554)+SUMIF(Data!$C$1:$C$554,Report!$A3,Data!$G$1:$G$554))
Now I believe I've managed to find a hack using the UNIQUE/SELECTCOLUMNS functions, but when you try to graph the output it is very small (as if there is other data it is trying to find behind the scenes). Note that I tend to filter on date to get the output I need (this is mapped using relationships across other data tables).
FX =
DISTINCT (
UNION (
SELECTCOLUMNS ( DATA, "Date", [DATE], "Currency", [CURRENCY1], "FXNom", [FXNOMINAL1] ),
SELECTCOLUMNS ( DATA, "Date", [DATE], "Currency", [CURRENCY2], "FXNom", [FXNOMINAL2] )
)
)
If anyone has any ideas I would be very grateful as I still feel my workaround is more of a lucky hack.
Thanks!
The approach that you're using looks nearly ideal. From a dimensional-model perspective, you want one column for values and one column for currency labels, so selecting those pairs as separate tables and appending them with UNION is the right way to go. Generally, I think it's better to do all the transformation you can in Power Query; using DAX this way can lead to some limitations.
But if we're going with DAX, I do think you want to get rid of DISTINCT. It could cause identical positions to be collapsed into a single row, and you'd lose data that way.
FX =
UNION (
SELECTCOLUMNS ( FX_Raw, "Date", "FakeDate", "Currency", [CURRENCY 1], "FXNom", [FX nominal 1] ),
SELECTCOLUMNS ( FX_Raw, "Date", "FakeDate", "Currency", [CURRENCY 2], "FXNom", [FX nominal 2] )
)
And then a few measures:
Long =
CALCULATE(sum(FX[FXNom]), FX[FXNom] >= 0)
Short =
CALCULATE(sum(FX[FXNom]), FX[FXNom] < 0)
Gross =
SUMX( FX, if(FX[FXNom] > 0, FX[FXNom], 0-FX[FXNom]))
Net =
SUM(FX[FXNom])
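As a rough sanity check against the Excel output above: for EUR these measures come out to roughly Long ≈ 6.251 + 0.008 ≈ 6.26, Short ≈ -7.97, Net ≈ -1.71 and Gross ≈ 7.965 + 6.251 + 0.008 ≈ 14.22, which matches the EUR row of the Excel report.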
This seems to produce the desired result.
I have a DataFrame that I am importing from a SharePoint list. Within the DataFrame I have a column called 'Identifier' (a text field) which separates tasks based on their hierarchy (x / x.x / x.x.x / ..., where x is a digit).
The df looks something like below:
Task Name Assigned To % Complete ... Modified By Version Identifier
0 Equipment Intelligent Special Network System 0.00 ... Dominic Leow 1.0 1
1 Core Switch, 24SFP+8GE, Combo+4SFP 0.00 ... Dominic Leow 1.0 1.1
2 Level 1 0.00 ... Dominic Leow 1.0 1.1.1
3 Hacking 0.30 ... Dominic Leow 2.0 1.1.1.1
4 PVC Piping 0.20 ... Dominic Leow 2.0 1.1.1.2
5 Trunking 0.45 ... Dominic Leow 2.0 1.1.1.3
6 Cabling 0.90 ... Dominic Leow 2.0 1.1.1.4
7 Testing 0.25 ... Dominic Leow 2.0 1.1.1.5
8 Termination 0.10 ... Dominic Leow 2.0 1.1.1.6
9 Level 2 0.00 ... Dominic Leow 1.0 1.1.2
10 Hacking 0.00 ... Dominic Leow 1.0 1.1.2.1
11 PVC Piping 0.00 ... Dominic Leow 1.0 1.1.2.2
12 Trunking 0.00 ... Dominic Leow 1.0 1.1.2.3
13 Cabling 0.00 ... Dominic Leow 1.0 1.1.2.4
14 Testing 0.00 ... Dominic Leow 1.0 1.1.2.5
15 Termination 0.00 ... Dominic Leow 1.0 1.1.2.6
16 Level 3 0.00 ... Dominic Leow 1.0 1.1.3
I want to slice all the rows between two Identifier values having the format x.x.x. For example, I want all rows from 1.1.1 up to 1.1.2, then from 1.1.2 up to 1.1.3, and so on. The objective is to group the rows between two consecutive identifiers of the format x.x.x and save each sliced DataFrame into a variable, so that I can call it later and do some calculations on it while iterating through the entire process.
I have tried the code below, but it doesn't seem to work.
UPDATE 1 (1/28/2020): Based on the responses, I have made the 'Identifier' column the index, and it does slice the df, but only when I provide the actual index value. The full dataset has thousands of columns, and the 'Identifier' values only follow a pattern of x.x.x (where x is a digit). Code updated below.
from shareplum import Site
from shareplum import Office365
import pandas as pd
import re
import numpy as np
pd.set_option('display.max_rows', None)
authcookie = Office365('https://speedmax.sharepoint.com', username='username', password='password').GetCookies()
site = Site('https://speedmax.sharepoint.com/sites/jdtstadium', authcookie=authcookie)
sp_list = site.List('joblist')
data = sp_list.GetListItems('All Tasks', rowlimit=5000)
df = pd.DataFrame(data)
stringMatch_mainTask = re.compile(r'^\d$')         # e.g. 1
stringMatch_bqItem = re.compile(r'^\d\.\d$')       # e.g. 1.1
stringMatch_level = re.compile(r'^\d\.\d\.\d$')    # e.g. 1.1.1
stringMatch_job = re.compile(r'^\d\.\d\.\d\.\d$')  # e.g. 1.1.1.1
mainTaskdf = df[df['Identifier'].str.contains(stringMatch_mainTask)]
bqItemdf = df[df['Identifier'].str.contains(stringMatch_bqItem)]
leveldf = df[df['Identifier'].str.contains(stringMatch_level)]
jobdf = df[df['Identifier'].str.contains(stringMatch_job)]
df = df.set_index("Identifier")
dfSlice = df.loc["1.1.1":"1.1.2"]  # label-based slice between two identifiers
print(dfSlice)
Please let me know how I can find a viable solution for this. This data will feed our custom reports, so I am desperate to get a solution.
Have you tried setting the index to 'Identifier' and just slicing from there? A rough sketch of that idea is below.
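For example, here is a minimal sketch of that idea on a small made-up frame (standing in for the SharePoint data, with only the columns the grouping logic needs): every row is assigned to the most recent x.x.x identifier seen, so each group runs from one x.x.x row up to, but not including, the next one.
import pandas as pd

# Hypothetical miniature of the SharePoint list; the real data would come from shareplum as above.
df = pd.DataFrame({
    "Task Name": ["Equipment", "Core Switch", "Level 1", "Hacking", "PVC Piping", "Level 2", "Cabling"],
    "Identifier": ["1", "1.1", "1.1.1", "1.1.1.1", "1.1.1.2", "1.1.2", "1.1.2.1"],
})

# True on the rows whose Identifier looks like x.x.x (the group headers).
# \d+ is used here in case a level ever goes past a single digit.
is_level = df["Identifier"].str.contains(r"^\d+\.\d+\.\d+$", regex=True)

# Each x.x.x row starts a new group; the rows after it (1.1.1.1, 1.1.1.2, ...)
# inherit the running count of the last x.x.x row seen.
group_id = is_level.cumsum()

# Dict keyed by the x.x.x identifier; each value is the sliced DataFrame for that block.
slices = {
    grp["Identifier"].iloc[0]: grp
    for key, grp in df.groupby(group_id)
    if key > 0  # rows before the first x.x.x identifier (1, 1.1) are skipped
}

print(slices["1.1.1"])  # rows from 1.1.1 up to, but not including, 1.1.2
The same group_id trick works on the full frame without hard-coding any index labels, so each block can be picked up later for calculations.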
I'm trying to convert from returns to a price index to simulate close prices for the ffn library, but without success.
import pandas as pd
times = pd.to_datetime(pd.Series(['2014-07-4',
'2014-07-15','2014-08-25','2014-08-25','2014-09-10','2014-09-15']))
strategypercentage = [0.01, 0.02, -0.03, 0.04,0.5,-0.3]
df = pd.DataFrame({'llt_return': strategypercentage}, index=times)
df['llt_close']=1
df['llt_close']=df['llt_close'].shift(1)*(1+df['llt_return'])
df.head(10)
llt_return llt_close
2014-07-04 0.01 NaN
2014-07-15 0.02 1.02
2014-08-25 -0.03 0.97
2014-08-25 0.04 1.04
2014-09-10 0.50 1.50
2014-09-15 -0.30 0.70
How can I make this correct?
You can use the cumulative product of return-relatives.
A return-relative is one-plus that day's return.
>>> start = 1.0
>>> df['llt_close'] = start * (1 + df['llt_return']).cumprod()
>>> df
llt_return llt_close
2014-07-04 0.01 1.0100
2014-07-15 0.02 1.0302
2014-08-25 -0.03 0.9993
2014-08-25 0.04 1.0393
2014-09-10 0.50 1.5589
2014-09-15 -0.30 1.0912
This assumes the price index starts at start on the close of the trading day prior to 2014-07-04.
On 7-04, you have a 1% return and the price index closes at 1 * (1 + .01) = 1.01.
On 7-15, return was 2%; close price will be 1.01 * (1 + .02) = 1.0302.
Granted, this is not completely realistic given you're forming a price index from irregular-frequency data (missing dates), but hopefully this answers your question.
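If a regular daily series does turn out to matter for ffn, one possible follow-up, sketched on the df built above, is to keep the last close per day and forward-fill across business days:
import pandas as pd

# Collapse duplicate dates (e.g. the two 2014-08-25 rows) to the last close of that day,
# then forward-fill the close onto a business-day calendar.
last_per_day = df['llt_close'].groupby(level=0).last()
bdays = pd.bdate_range(last_per_day.index.min(), last_per_day.index.max())
daily_close = last_per_day.reindex(bdays).ffill()
print(daily_close.head())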
I'm using Cassandra DSC 2.1.5 with 3 nodes, and the following table description:
cqlsh> DESCRIBE TABLE mykeyspace.mytable;
CREATE TABLE mykeyspace.mytable (
a text,
b text,
c timestamp,
d timestamp,
e text,
PRIMARY KEY ((a, b), c)
) WITH CLUSTERING ORDER BY (c ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
The first and third nodes are working fine (all nodes have the same cassandra.yaml), but the second node has started accumulating more and more pending compaction tasks, which I can see with the nodetool compactionstats -H command. The situation is so bad that my Spark jobs get stuck and only work when I completely shut down the second node.
I have around 130G free on the second node...
Also, here is the cfstats output:
> nodetool cfstats mykeyspace.mytable
Keyspace: mykeyspace
Read Count: 0
Read Latency: NaN ms.
Write Count: 54316
Write Latency: 0.1877597945356801 ms.
Pending Flushes: 0
Table: mytable
SSTable count: 1249
Space used (live): 1125634027755
Space used (total): 1125634027755
Space used by snapshots (total): 0
Off heap memory used (total): 1202327957
SSTable Compression Ratio: 0.11699340657338655
Number of keys (estimate): 34300801
Memtable cell count: 758856
Memtable data size: 351011415
Memtable off heap memory used: 0
Memtable switch count: 10
Local read count: 0
Local read latency: NaN ms
Local write count: 54319
Local write latency: 0.188 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 48230904
Bloom filter off heap memory used: 48220912
Index summary off heap memory used: 11161093
Compression metadata off heap memory used: 1142945952
Compacted partition minimum bytes: 925
Compacted partition maximum bytes: 52066354
Compacted partition mean bytes: 299014
Average live cells per slice (last five minutes): 0.0
Maximum live cells per slice (last five minutes): 0.0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0.0
What can be the problem?
I've successfully set up a Cassandra cluster with 7 nodes. However, I can't get it to work for basic queries.
CREATE TABLE lgrsettings (
siteid bigint,
channel int,
name text,
offset float,
scalefactor float,
units text,
PRIMARY KEY (siteid, channel)
)
insert into lgrsettings (siteid,channel,name,offset,scalefactor,units) values (999,1,'Flow',0.0,1.0,'m');
Then on one node:
select * from lgrsettings;
Request did not complete within rpc_timeout.
And on another:
select * from lgrsettings;
Bad Request: unconfigured columnfamily lgrsettings
Even though the keyspace and column family shows up on all nodes.
Any ideas where I could start looking?
Alex
Interesting results. The node that handled the keyspace creation and insert shows:
Keyspace: testdata
Read Count: 0
Read Latency: NaN ms.
Write Count: 2
Write Latency: 0.304 ms.
Pending Tasks: 0
Column Family: lgrsettings
SSTable count: 0
Space used (live): 0
Space used (total): 0
Number of Keys (estimate): 0
Memtable Columns Count: 10
Memtable Data Size: 129
Memtable Switch Count: 0
Read Count: 0
Read Latency: NaN ms.
Write Count: 2
Write Latency: NaN ms.
Pending Tasks: 0
Bloom Filter False Positives: 0
Bloom Filter False Ratio: 0.00000
Bloom Filter Space Used: 0
Compacted row minimum size: 0
Compacted row maximum size: 0
Compacted row mean size: 0
Column Family: datapoints
SSTable count: 0
Space used (live): 0
Space used (total): 0
Number of Keys (estimate): 0
Memtable Columns Count: 0
Memtable Data Size: 0
Memtable Switch Count: 0
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Tasks: 0
Bloom Filter False Positives: 0
Bloom Filter False Ratio: 0.00000
Bloom Filter Space Used: 0
Compacted row minimum size: 0
Compacted row maximum size: 0
Compacted row mean size: 0
Other nodes don't have this in the cfstats but do show it in DESCRIBE KEYSPACE testdata; in the CQL3 clients...
Request did not complete within rpc_timeout
Check your Cassandra logs to confirm if there is any issue - sometimes exceptions in Cassandra lead to timeouts on the client.
In a comment, the OP said he found the cause of his problem:
I've managed to solve the issue. It was due to time sync between the nodes, so I installed ntpd on all nodes, waited 5 minutes, tried again, and now I have a working cluster!