In our test environment, we have a single-node Cassandra cluster with RF=1 for all keyspaces.
The JVM arguments of interest are listed below:
-XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms2G -Xmx2G -Xmn1G -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
We noticed full GCs happening frequently, and Cassandra becomes unresponsive during GC:
INFO [Service Thread] 2016-12-29 15:52:40,901 GCInspector.java:252 - ParNew GC in 238ms. CMS Old Gen: 782576192 -> 802826248; Par Survivor Space: 60068168 -> 32163264
INFO [Service Thread] 2016-12-29 15:52:40,902 GCInspector.java:252 - ConcurrentMarkSweep GC in 1448ms. CMS Old Gen: 802826248 -> 393377248; Par Eden Space: 859045888 -> 0; Par Survivor Space: 32163264 -> 0
We are getting java.lang.OutOfMemoryError with the exception below:
ERROR [SharedPool-Worker-5] 2017-01-26 09:23:13,694 JVMStabilityInspector.java:94 - JVM state determined to be unstable. Exiting forcefully due to:
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57) ~[na:1.7.0_80]
at java.nio.ByteBuffer.allocate(ByteBuffer.java:331) ~[na:1.7.0_80]
at org.apache.cassandra.utils.memory.SlabAllocator.getRegion(SlabAllocator.java:137) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.utils.memory.SlabAllocator.allocate(SlabAllocator.java:97) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.utils.memory.ContextAllocator.allocate(ContextAllocator.java:57) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.utils.memory.ContextAllocator.clone(ContextAllocator.java:47) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.utils.memory.MemtableBufferAllocator.clone(MemtableBufferAllocator.java:61) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.db.Memtable.put(Memtable.java:192) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1237) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:400) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:363) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.db.Mutation.apply(Mutation.java:214) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.service.StorageProxy$7.runMayThrow(StorageProxy.java:1033) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.service.StorageProxy$LocalMutationRunnable.run(StorageProxy.java:2224) ~[apache-cassandra-2.1.8.jar:2.1.8]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_80]
at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) [apache-cassandra-2.1.8.jar:2.1.8]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_80]
We were able to restore Cassandra after executing nodetool repair.
nodetool status
Datacenter: DC1
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.3.211.3 5.74 GB 256 ? 32251391-5eee-4891-996d-30fb225116a1 RAC1
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
nodetool info
ID : 32251391-5eee-4891-996d-30fb225116a1
Gossip active : true
Thrift active : true
Native Transport active: true
Load : 5.74 GB
Generation No : 1485526088
Uptime (seconds) : 330651
Heap Memory (MB) : 812.72 / 1945.63
Off Heap Memory (MB) : 7.63
Data Center : DC1
Rack : RAC1
Exceptions : 0
Key Cache : entries 68, size 6.61 KB, capacity 97 MB, 1158 hits, 1276 requests, 0.908 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 0, size 0 bytes, capacity 48 MB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Token : (invoke with -T/--tokens to see all 256 tokens)
In system.log, I see lots of "Compacting large partition" warnings:
WARN [CompactionExecutor:33463] 2016-12-24 05:42:29,550 SSTableWriter.java:240 - Compacting large partition mydb/Table_Name:2016-12-23 00:00+0530 (142735455 bytes)
WARN [CompactionExecutor:33465] 2016-12-24 05:47:57,343 SSTableWriter.java:240 - Compacting large partition mydb/Table_Name_2:22:0c2e6c00-a5a3-11e6-a05e-1f69f32db21c (162203393 bytes)
Regarding tombstones, I see the following node configuration in system.log (note tombstone_warn_threshold and tombstone_failure_threshold):
[main] 2016-12-28 18:23:06,534 YamlConfigurationLoader.java:135 - Node
configuration:[authenticator=PasswordAuthenticator;
authorizer=CassandraAuthorizer; auto_snapshot=true;
batch_size_warn_threshold_in_kb=5;
batchlog_replay_throttle_in_kb=1024;
cas_contention_timeout_in_ms=1000;
client_encryption_options=; cluster_name=bankbazaar;
column_index_size_in_kb=64; commit_failure_policy=ignore;
commitlog_directory=/var/cassandra/log/commitlog;
commitlog_segment_size_in_mb=32; commitlog_sync=periodic;
commitlog_sync_period_in_ms=10000;
compaction_throughput_mb_per_sec=16; concurrent_counter_writes=32;
concurrent_reads=32; concurrent_writes=32;
counter_cache_save_period=7200; counter_cache_size_in_mb=null;
counter_write_request_timeout_in_ms=15000; cross_node_timeout=false;
data_file_directories=[/cryptfs/sdb/cassandra/data,
/cryptfs/sdc/cassandra/data, /cryptfs/sdd/cassandra/data];
disk_failure_policy=best_effort; dynamic_snitch_badness_threshold=0.1;
dynamic_snitch_reset_interval_in_ms=600000;
dynamic_snitch_update_interval_in_ms=100;
endpoint_snitch=GossipingPropertyFileSnitch;
hinted_handoff_enabled=true; hinted_handoff_throttle_in_kb=1024;
incremental_backups=false; index_summary_capacity_in_mb=null;
index_summary_resize_interval_in_minutes=60;
inter_dc_tcp_nodelay=false; internode_compression=all;
key_cache_save_period=14400; key_cache_size_in_mb=null;
listen_address=127.0.0.1; max_hint_window_in_ms=10800000;
max_hints_delivery_threads=2; memtable_allocation_type=heap_buffers;
native_transport_port=9042; num_tokens=256;
partitioner=org.apache.cassandra.dht.Murmur3Partitioner;
permissions_validity_in_ms=2000; range_request_timeout_in_ms=20000;
read_request_timeout_in_ms=10000;
request_scheduler=org.apache.cassandra.scheduler.NoScheduler;
request_timeout_in_ms=20000; row_cache_save_period=0;
row_cache_size_in_mb=0; rpc_address=127.0.0.1; rpc_keepalive=true;
rpc_port=9160; rpc_server_type=sync;
saved_caches_directory=/var/cassandra/data/saved_caches;
seed_provider=[{class_name=org.apache.cassandra.locator.SimpleSeedProvider,
parameters=[{seeds=127.0.0.1}]}];
server_encryption_options=;
snapshot_before_compaction=false; ssl_storage_port=9001;
sstable_preemptive_open_interval_in_mb=50;
start_native_transport=true; start_rpc=true; storage_port=9000;
thrift_framed_transport_size_in_mb=15;
tombstone_failure_threshold=100000; tombstone_warn_threshold=1000;
trickle_fsync=false; trickle_fsync_interval_in_kb=10240;
truncate_request_timeout_in_ms=60000;
write_request_timeout_in_ms=5000]
nodetool tpstats
Pool Name Active Pending Completed Blocked All time blocked
CounterMutationStage 0 0 0 0 0
ReadStage 32 4061 50469243 0 0
RequestResponseStage 0 0 0 0 0
MutationStage 32 22 27665114 0 0
ReadRepairStage 0 0 0 0 0
GossipStage 0 0 0 0 0
CacheCleanupExecutor 0 0 0 0 0
AntiEntropyStage 0 0 0 0 0
MigrationStage 0 0 0 0 0
Sampler 0 0 0 0 0
ValidationExecutor 0 0 0 0 0
CommitLogArchiver 0 0 0 0 0
MiscStage 0 0 0 0 0
MemtableFlushWriter 0 0 7769 0 0
MemtableReclaimMemory 1 57 13433 0 0
PendingRangeCalculator 0 0 1 0 0
MemtablePostFlush 0 0 9279 0 0
CompactionExecutor 3 47 169022 0 0
InternalResponseStage 0 0 0 0 0
HintedHandoff 0 1 148 0 0
Is there any YAML or other configuration that can be used to avoid these large compactions?
What is the correct compaction strategy to use? Can an OutOfMemoryError occur because of the wrong compaction strategy?
In one keyspace, each row is written once and read multiple times.
Another keyspace holds time-series data: insert-only, with multiple reads.
Seeing this: Heap Memory (MB): 812.72 / 1945.63 tells me that your one machine is probably underpowered, and there's a good chance that you're not able to keep up with GC.
While in this case the problem is probably related to being undersized, access patterns, data model, and payload size can also affect GC, so if you'd like to update your post with that information, I can update my answer to reflect it.
EDIT to reflect new information
Thanks for adding the additional information. Based on what you posted, there are two immediate things I notice that can cause your heap to blow up:
Large partitions:
It looks like compaction had to compact two partitions that exceeded 100 MB (roughly 140 MB and 160 MB respectively). Normally that would still be OK (not great), but because you're running on underpowered hardware with such a small heap, that's quite a lot.
The thing about compaction:
It uses a healthy mix of resources when it runs. It's business as usual, so it's something you should test and plan for. In this case, I'm certain that compaction is working harder because of the large partitions, which consumes CPU (that GC also needs), heap, and I/O.
This brings me to another concern:
Pool Name Active Pending Completed Blocked All time blocked
CounterMutationStage 0 0 0 0 0
ReadStage 32 4061 50469243 0 0
Thirty-two active reads with over 4,000 pending is usually a sign that you need to scale up and/or scale out; in your case, you might want to do both. You can exhaust a single, underpowered node pretty quickly with an unoptimized data model. You also don't get to experience the nuances of a distributed system when you test in a single-node environment.
So the TL;DR:
For a read-heavy workload (which this seems to be), you'll need a larger heap. For overall sanity and cluster health, you'll also need to revisit your data model to make sure the partitioning logic is sound. If you're not sure how or why to do either, I suggest spending some time here: https://academy.datastax.com/courses
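On the compaction-strategy question itself: the strategy is a per-table setting applied with ALTER TABLE, not something configured in cassandra.yaml. Below is a minimal sketch using the DataStax Java driver from Scala; the table names are hypothetical, and the choices (LeveledCompactionStrategy for the write-once/read-many table, and DateTieredCompactionStrategy, which is available in 2.1, for the insert-only time-series table) are assumptions to validate against your workload, not a prescription.

import com.datastax.driver.core.Cluster

// Sketch only: table names are hypothetical and the strategy choice must be
// verified against the actual workload and Cassandra version (2.1.8 here).
val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()
try {
  // Write-once, read-many table: LCS keeps each read to a small number of SSTables.
  session.execute(
    "ALTER TABLE mydb.read_mostly_table " +
      "WITH compaction = {'class': 'LeveledCompactionStrategy'}")
  // Insert-only time-series table: DTCS groups data written around the same time.
  session.execute(
    "ALTER TABLE mydb.timeseries_table " +
      "WITH compaction = {'class': 'DateTieredCompactionStrategy'}")
} finally {
  cluster.close() // closes the session as well
}

No compaction strategy shrinks an already oversized partition, though; as the answer says, the partition key design still has to keep individual partitions well below the roughly 100 MB warning threshold that shows up in the logs.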
I am using azure-kusto-spark to write data to ADX. I can see the schema created in ADX, but I do not see any data, and there is no error in the log. Note that I am trying it with local Spark.
df.show();
df.write()
.format("com.microsoft.kusto.spark.datasource")
.option(KustoSinkOptions.KUSTO_CLUSTER(), cluster)
.option(KustoSinkOptions.KUSTO_DATABASE(), db)
.option(KustoSinkOptions.KUSTO_TABLE(), table)
.option(KustoSinkOptions.KUSTO_AAD_APP_ID(), client_id)
.option(KustoSinkOptions.KUSTO_AAD_APP_SECRET(), client_key)
.option(KustoSinkOptions.KUSTO_AAD_AUTHORITY_ID(), "microsoft.com")
.option(KustoSinkOptions.KUSTO_TABLE_CREATE_OPTIONS(), "CreateIfNotExist")
.mode(SaveMode.Append)
.save();
22/12/13 12:06:14 INFO QueuedIngestClient: Creating a new IngestClient
22/12/13 12:06:14 INFO ResourceManager: Refreshing Ingestion Auth Token
22/12/13 12:06:16 INFO ResourceManager: Refreshing Ingestion Resources
22/12/13 12:06:16 INFO KustoConnector: ContainerProvider: Got 2 storage SAS with command :'.create tempstorage'. from service 'ingest-engineermetricdata.eastus'
22/12/13 12:06:16 INFO KustoConnector: ContainerProvider: Got 2 storage SAS with command :'.create tempstorage'. from service 'ingest-engineermetricdata.eastus'
22/12/13 12:06:16 INFO KustoConnector: KustoWriter$: finished serializing rows in partition 0 for requestId: '9065b634-3b74-4993-830b-16ee534409d5'
22/12/13 12:06:16 INFO KustoConnector: KustoWriter$: finished serializing rows in partition 1 for requestId: '9065b634-3b74-4993-830b-16ee534409d5'
22/12/13 12:06:17 INFO KustoConnector: KustoWriter$: Ingesting from blob - partition: 0 requestId: '9065b634-3b74-4993-830b-16ee534409d5'
22/12/13 12:06:17 INFO KustoConnector: KustoWriter$: Ingesting from blob - partition: 1 requestId: '9065b634-3b74-4993-830b-16ee534409d5'
22/12/13 12:06:19 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2135 bytes result sent to driver
22/12/13 12:06:19 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2135 bytes result sent to driver
22/12/13 12:06:19 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 6306 ms on 192.168.50.160 (executor driver) (1/2)
22/12/13 12:06:19 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 6231 ms on 192.168.50.160 (executor driver) (2/2)
22/12/13 12:06:19 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
22/12/13 12:06:19 INFO DAGScheduler: ResultStage 0 (foreachPartition at KustoWriter.scala:107) finished in 7.070 s
22/12/13 12:06:19 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
22/12/13 12:06:19 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
22/12/13 12:06:19 INFO DAGScheduler: Job 0 finished: foreachPartition at KustoWriter.scala:107, took 7.157414 s
22/12/13 12:06:19 INFO KustoConnector: KustoClient: Polling on ingestion results for requestId: 9065b634-3b74-4993-830b-16ee534409d5, will move data to destination table when finished
22/12/13 12:13:30 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.50.160:56364 in memory (size: 4.9 KiB, free: 2004.6 MiB)
Local Spark writes data to ADX
The following code works. Tested on Azure Databricks, runtime 11.3 LTS (includes Apache Spark 3.3.0, Scala 2.12), with com.microsoft.azure.kusto:kusto-spark_3.0_2.12:3.1.6.
import com.microsoft.kusto.spark.datasink.KustoSinkOptions
import org.apache.spark.sql.{SaveMode, SparkSession}
val cluster = "..."
val client_id = "..."
val client_key = "..."
val authority = "..."
val db = "mydb"
val table = "mytable"
val df = spark.range(10)
df.show()
df.write
.format("com.microsoft.kusto.spark.datasource")
.option(KustoSinkOptions.KUSTO_CLUSTER, cluster)
.option(KustoSinkOptions.KUSTO_DATABASE, db)
.option(KustoSinkOptions.KUSTO_TABLE, table)
.option(KustoSinkOptions.KUSTO_AAD_APP_ID, client_id)
.option(KustoSinkOptions.KUSTO_AAD_APP_SECRET, client_key)
.option(KustoSinkOptions.KUSTO_AAD_AUTHORITY_ID, authority)
.option(KustoSinkOptions.KUSTO_TABLE_CREATE_OPTIONS, "CreateIfNotExist")
.mode(SaveMode.Append)
.save()
The ingestion time depends on the ingestion batching policy of the table.
Defaults and limits:

Type             Property                 Default  Low latency setting  Minimum value  Maximum value
Number of items  MaximumNumberOfItems     1000     1000                 1              25,000
Data size (MB)   MaximumRawDataSizeMB     1024     1024                 100            4096
Time (sec)       MaximumBatchingTimeSpan  300      20 - 30              10             1800
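If the table still has the default policy, data written by the connector can take up to MaximumBatchingTimeSpan (300 seconds) to become queryable, which would explain seeing the schema but no rows right after the job finishes. The policy can be lowered per table with an .alter ... policy ingestionbatching control command. A hedged sketch using the azure-kusto-data Java client from Scala; the class and package names here are from memory and differ slightly between SDK versions (ConnectionStringBuilder lives under the data.auth package in newer releases), and the cluster, database, and table names are placeholders.

import com.microsoft.azure.kusto.data.{ClientFactory, ConnectionStringBuilder}

// Sketch: lowers the ingestion batching window so small test writes surface sooner.
// Verify class locations against the SDK version you depend on.
val csb = ConnectionStringBuilder.createWithAadApplicationCredentials(
  "https://<cluster>.<region>.kusto.windows.net", // engine endpoint, not the ingest- endpoint
  sys.env("KUSTO_APP_ID"),
  sys.env("KUSTO_APP_SECRET"),
  sys.env("KUSTO_TENANT_ID"))
val client = ClientFactory.createClient(csb)

val command =
  """.alter table mytable policy ingestionbatching @'{"MaximumBatchingTimeSpan":"00:00:30","MaximumNumberOfItems":500,"MaximumRawDataSizeMB":1024}'"""

client.execute("mydb", command)

Keep in mind the minimum MaximumBatchingTimeSpan is 10 seconds, and very aggressive values produce more small extents for the cluster to merge later.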
I get a FetchFailedException when joining tables with spark.sql.shuffle.partitions = 2700,
but the job runs successfully with spark.sql.shuffle.partitions = 500.
As far as I know, increasing shuffle.partitions should decrease the amount of data in each task during the shuffle read.
Am I missing something?
Exception:
FetchFailed(BlockManagerId(699, nfjd-hadoop02-node120.jpushoa.com, 7337, None), shuffleId=4, mapId=59, reduceId=1140, message=
org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 byte(s) of direct memory (used: 2147483648, max: 2147483648)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:485)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCode
Config:
spark.executor.cores = 1
spark.dynamicAllocation.maxExecutors = 800
After reading the shuffle fetch code:
The problem I hit is that the real block written by the ShuffleMapTask is too large to fetch into memory at once, while the block size reported by the driver is only an average block size once there are more than 2000 shuffle partitions (per spark.shuffle.minNumPartitionsToHighlyCompress), and that average is smaller than the real size when the data is skewed.
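Given that diagnosis, a common mitigation (besides repartitioning away the skew) is to keep oversized blocks out of Netty direct memory and to throttle how much shuffle data is in flight. A sketch with Spark 2.4-era configuration keys (spark.maxRemoteBlockSizeFetchToMem was renamed to spark.network.maxRemoteBlockSizeFetchToMem in Spark 3.x); the values are illustrative starting points, not tuned numbers.

import org.apache.spark.sql.SparkSession

// Sketch: illustrative values only.
val spark = SparkSession.builder()
  .appName("skewed-join")
  // Blocks above this size are streamed to disk instead of being fetched into
  // Netty direct memory, so one skewed shuffle block can no longer exhaust the pool.
  .config("spark.maxRemoteBlockSizeFetchToMem", "128m")
  // Throttle how much shuffle data each reducer pulls concurrently.
  .config("spark.reducer.maxSizeInFlight", "24m")
  .config("spark.reducer.maxReqsInFlight", "8")
  .config("spark.sql.shuffle.partitions", "2700")
  .getOrCreate()

Dropping spark.sql.shuffle.partitions back below spark.shuffle.minNumPartitionsToHighlyCompress (2000), as the successful 500-partition run shows, also avoids the issue because the driver then tracks per-block sizes instead of an average, so the in-flight throttling works with realistic numbers.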
When I start a crawl using Nutch 1.15 with this command:
/usr/local/nutch/bin/crawl --i -s urls/seed.txt crawldb 5
Then it starts to run and I get this error when it tries to fetch:
2019-02-10 15:29:32,021 INFO mapreduce.Job - Running job: job_local1267180618_0001
2019-02-10 15:29:32,145 INFO fetcher.FetchItemQueues - Using queue mode : byHost
2019-02-10 15:29:32,145 INFO fetcher.Fetcher - Fetcher: threads: 50
2019-02-10 15:29:32,145 INFO fetcher.Fetcher - Fetcher: time-out divisor: 2
2019-02-10 15:29:32,149 INFO fetcher.QueueFeeder - QueueFeeder finished: total 1 records hit by time limit : 0
2019-02-10 15:29:32,234 WARN mapred.LocalJobRunner - job_local1267180618_0001
java.lang.Exception: java.lang.NullPointerException
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.NullPointerException
at org.apache.nutch.net.URLExemptionFilters.<init>(URLExemptionFilters.java:39)
at org.apache.nutch.fetcher.FetcherThread.<init>(FetcherThread.java:154)
at org.apache.nutch.fetcher.Fetcher$FetcherRun.run(Fetcher.java:222)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2019-02-10 15:29:33,023 INFO mapreduce.Job - Job job_local1267180618_0001 running in uber mode : false
2019-02-10 15:29:33,025 INFO mapreduce.Job - map 0% reduce 0%
2019-02-10 15:29:33,028 INFO mapreduce.Job - Job job_local1267180618_0001 failed with state FAILED due to: NA
2019-02-10 15:29:33,038 INFO mapreduce.Job - Counters: 0
2019-02-10 15:29:33,039 ERROR fetcher.Fetcher - Fetcher job did not succeed, job status:FAILED, reason: NA
2019-02-10 15:29:33,039 ERROR fetcher.Fetcher - Fetcher: java.lang.RuntimeException: Fetcher job did not succeed, job status:FAILED, reason: NA
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:503)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:543)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:517)
And I get this error in the console, which shows the command it was running:
Error running:
/usr/local/nutch/bin/nutch fetch -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180 crawlsites/segments/20190210152929 -noParsing -threads 50
I had to delete the Nutch folder and do a fresh install, and it worked after that.
I'm experiencing node crashes, and the system.log file shows a bunch of ReadTimeoutException entries hitting 500 ms.
The cassandra.yaml file has read_request_timeout_in_ms: 10000.
Can you folks please share how I can address these timeouts? Thanks in advance!
Error stack:
ERROR [SharedPool-Worker-241] 2017-02-01 13:18:27,663 Message.java:611 - Unexpected exception during request; channel = [id: 0x5d8abf33, /172.18.30.62:47580 => /216.12.225.9:9042]
java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:497) ~[apache-cassandra-2.2.8.jar:2.2.8]
at org.apache.cassandra.auth.CassandraRoleManager.canLogin(CassandraRoleManager.java:306) ~[apache-cassandra-2.2.8.jar:2.2.8]
at org.apache.cassandra.service.ClientState.login(ClientState.java:269) ~[apache-cassandra-2.2.8.jar:2.2.8]
at org.apache.cassandra.transport.messages.AuthResponse.execute(AuthResponse.java:79) ~[apache-cassandra-2.2.8.jar:2.2.8]
at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:507) [apache-cassandra-2.2.8.jar:2.2.8]
at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:401) [apache-cassandra-2.2.8.jar:2.2.8]
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.23.Final.jar:4.0.23.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) [netty-all-4.0.23.Final.jar:4.0.23.Final]
at io.netty.channel.AbstractChannelHandlerContext.access$700(AbstractChannelHandlerContext.java:32) [netty-all-4.0.23.Final.jar:4.0.23.Final]
at io.netty.channel.AbstractChannelHandlerContext$8.run(AbstractChannelHandlerContext.java:324) [netty-all-4.0.23.Final.jar:4.0.23.Final]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_111]
at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) [apache-cassandra-2.2.8.jar:2.2.8]
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) [apache-cassandra-2.2.8.jar:2.2.8]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_111]
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:110) ~[apache-cassandra-2.2.8.jar:2.2.8]
at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:147) ~[apache-cassandra-2.2.8.jar:2.2.8]
at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1441) ~[apache-cassandra-2.2.8.jar:2.2.8]
at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1365) ~[apache-cassandra-2.2.8.jar:2.2.8]
at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1282) ~[apache-cassandra-2.2.8.jar:2.2.8]
at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:224) ~[apache-cassandra-2.2.8.jar:2.2.8]
at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:176) ~[apache-cassandra-2.2.8.jar:2.2.8]
at org.apache.cassandra.auth.CassandraRoleManager.getRoleFromTable(CassandraRoleManager.java:505) ~[apache-cassandra-2.2.8.jar:2.2.8]
at org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:493) ~[apache-cassandra-2.2.8.jar:2.2.8]
... 13 common frames omitted
INFO [ScheduledTasks:1] 2017-02-01 13:18:27,682 MessagingService.java:946 - READ messages were dropped in last 5000 ms: 149 for internal timeout and 0 for cross node timeout
INFO [Service Thread] 2017-02-01 13:18:27,693 StatusLogger.java:106 - enterprise.t_sf_venue_test 0,0
INFO [ScheduledTasks:1] 2017-02-01 13:18:27,699 MessagingService.java:946 - REQUEST_RESPONSE messages were dropped in last 5000 ms: 7 for internal timeout and 0 for cross node timeout
INFO [Service Thread] 2017-02-01 13:18:27,699 StatusLogger.java:106 - enterprise.alestnstats 0,0
INFO [ScheduledTasks:1] 2017-02-01 13:18:27,699 MessagingService.java:946 - RANGE_SLICE messages were dropped in last 5000 ms: 116 for internal timeout and 0 for cross node timeout
As you can see in your logs, the failing query is not actually the one you are trying to execute.
The failing query is internal to Cassandra:
"SELECT * FROM system_auth.roles;"
These internal Cassandra queries (misc queries) do not use read_request_timeout_in_ms; instead, they use request_timeout_in_ms.
I am seeing the exception below in my Cassandra logs (/var/log/cassandra/system.log):
INFO [ScheduledTasks:1] 2014-02-13 13:13:57,641 GCInspector.java (line 119) GC for ParNew: 273 ms for 1 collections, 2319121816 used; max is 4456448000
INFO [ScheduledTasks:1] 2014-02-13 13:14:02,695 GCInspector.java (line 119) GC for ParNew: 214 ms for 1 collections, 2315368976 used; max is 4456448000
INFO [OptionalTasks:1] 2014-02-13 13:14:08,093 MeteredFlusher.java (line 64) flushing high-traffic column family CFS(Keyspace='comsdb', ColumnFamily='product_update') (estimated 213624220 bytes)
INFO [OptionalTasks:1] 2014-02-13 13:14:08,093 ColumnFamilyStore.java (line 626) Enqueuing flush of Memtable-product_update#1067619242(31239028/213625108 serialized/live bytes, 222393 ops)
INFO [FlushWriter:94] 2014-02-13 13:14:08,127 Memtable.java (line 400) Writing Memtable-product_update#1067619242(31239028/213625108 serialized/live bytes, 222393 ops)
INFO [ScheduledTasks:1] 2014-02-13 13:14:08,696 GCInspector.java (line 119) GC for ParNew: 214 ms for 1 collections, 2480175160 used; max is 4456448000
INFO [FlushWriter:94] 2014-02-13 13:14:10,836 Memtable.java (line 438) Completed flushing /cassandra1/data/comsdb/product_update/comsdb-product_update-ic-416-Data.db (15707248 bytes) for commitlog position ReplayPosition(segmentId=1391568233618, position=13712751)
ERROR [Thrift:13] 2014-02-13 13:15:45,694 CustomTThreadPoolServer.java (line 213) Thrift error occurred during processing of message.
org.apache.thrift.TException: Negative length: -2147418111
at org.apache.thrift.protocol.TBinaryProtocol.checkReadLength(TBinaryProtocol.java:388)
at org.apache.thrift.protocol.TBinaryProtocol.readBinary(TBinaryProtocol.java:363)
at org.apache.cassandra.thrift.Cassandra$batch_mutate_args.read(Cassandra.java:20304)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:21)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:34)
at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:199)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:679)
ERROR [Thrift:103] 2014-02-13 13:21:25,719 CustomTThreadPoolServer.java (line 213) Thrift error occurred during processing of message.
org.apache.thrift.TException: Negative length: -2147418111
Below are the Cassandra and Hector client versions currently in use:
Cassandra-version: 1.2.11
Hector-client: 1.0-2
Any lead would be appreciated. We are planning to move to Cassandra 2.0 with the Java driver, but that may take some time; in the meantime we need to find the root cause and resolve this issue.