Service Fabric - Main method not called - Azure

I am building a service and I want to deploy it to my local one-node cluster. As soon as I publish the service to the local cluster, the cluster goes unhealthy with the following errors:
Partitions - Error - Unhealthy partitions: 100% (1/1), MaxPercentUnhealthyPartitionsPerService=0%.
Partition - Error - Unhealthy partition: PartitionId='...', AggregatedHealthState='Error'.
Event - Error - Error event: SourceId='System.FM', Property='State'. Partition is below target replica or instance count. fabric:/MyFabric/MyFabricService 1 1 ...
(Showing 0 out of 0 replicas. Total up replicas: 0.)
The strange thing is this: I set a breakpoint at the very beginning of my Main method, and it is never hit, i.e. the Main method of my service is never called.
I read that comparable issues can show up when you have too little memory or disk space available. The system I am running on has 10 GB of RAM and 80 GB of free disk space, so I don't think that can be the issue here.
Any idea what might be wrong, and how to fix it?

Related

Reaper failed to run repair on Cassandra nodes

After Reaper failed to run repair on the 18 nodes of our Cassandra cluster, I ran a full repair on each node to fix the failed repair issue. After the full repair, Reaper executed successfully, but after a few days Reaper failed to run again. I can see the following error in system.log:
ERROR [RMI TCP Connection(33673)-10.196.83.241] 2021-09-01 09:01:18,005 RepairRunnable.java:276 - Repair session 81540931-0b20-11ec-a7fa-8d6977dd3c87 for range [(-606604147644314041,-98440495518284645], (-3131564913406859309,-3010160047914391044]] failed with error Terminate session is called
java.io.IOException: Terminate session is called
at org.apache.cassandra.service.ActiveRepairService.terminateSessions(ActiveRepairService.java:191) ~[apache-cassandra-3.11.0.jar:3.11.0]
INFO [Native-Transport-Requests-2] 2021-09-01 09:02:52,020 Message.java:619 - Unexpected exception during request; channel = [id: 0x1e99a957, L:/10.196.18.230:9042 ! R:/10.254.252.33:62100]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
In nodetool tpstats I can see some pending tasks:
Pool Name               Active   Pending
ReadStage                    0         0
Repair#18                    3        90
ValidationExecutor           3         3
Also in nodetool compactionstats there are 4 pending tasks:
-bash-4.2$ nodetool compactionstats
pending tasks: 4
- Main.visit: 1
- Main.post: 1
- Main.stream: 2
My question is: why is Reaper still failing even after a full repair, and what is the root cause of the pending repair tasks?
PS: the Reaper version is 2.2.3; I'm not sure whether this is a bug in Reaper!
You most likely don't have enough segments in your Reaper repair definition, or the default timeout (30 mins) is too low for your repair.
Segments (and the associated repair session) get terminated when they reach the timeout, in order to avoid stuck repairs. When tuned inappropriately, this can give the behavior you're observing.
Nodetool doesn't set a timeout on repairs, which explains why it passes there. The good news is that nothing will prevent repair from passing with Reaper once tuned correctly.
We're currently working on adaptive repairs to have Reaper deal with this situation automatically, but in the meantime you'll need to deal with this manually.
Check the list of segments in the UI and apply the following rules:
If you have less than 20% of segments failing, double the timeout by adjusting the hangingRepairTimeoutMins value in the config yaml (see the sketch after these rules).
If you have more than 20% of segments failing, double the number of segments.
Once repair passes at least twice, check the maximum duration of segments and further tune the number of segments to have them last at most 15 mins.
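As a minimal sketch of the first rule, assuming the stock cassandra-reaper.yaml file name and the default 30-minute value (both may differ in your deployment):
# cassandra-reaper.yaml -- double the per-segment timeout from the default 30 minutes
hangingRepairTimeoutMins: 60
Reaper reads this file at startup, so restart the Reaper service after changing it. The segment count, by contrast, is a property of the repair run itself, so raising it means recreating the repair run/schedule with a higher segment count.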
Assuming you're not running Cassandra 4.0 yet: now that you have run repair through nodetool, you have sstables which are marked as repaired, the way incremental repair would mark them. This creates a problem because Reaper's repairs don't mark sstables as repaired, so you now have two different sstable pools (repaired and unrepaired), which cannot be compacted together.
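If you want to check which pool your sstables currently sit in, sstablemetadata shows it (a sketch; the data path, keyspace and table names are placeholders, and on tarball installs the tool lives under tools/bin):
sstablemetadata /var/lib/cassandra/data/my_keyspace/my_table-*/*-Data.db | grep "Repaired at"
# "Repaired at: 0" means the sstable is in the unrepaired pool; any other value puts it in the repaired pool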
You'll need to use the sstablerepairedset tool to mark all sstables as unrepaired and put them back in a single pool. Please read the documentation to learn how to achieve this.
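A rough sketch of that reset, assuming the data lives under the default /var/lib/cassandra path (stop Cassandra on the node before running it, then restart, and repeat node by node):
# collect the sstables of the affected keyspace, then mark them all as unrepaired
find /var/lib/cassandra/data/my_keyspace -name "*-Data.db" > sstables.txt
sstablerepairedset --really-set --is-unrepaired -f sstables.txt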
There could be a number of things taking place, such as Reaper not being able to connect to the nodes via JMX (for whatever reason). It isn't possible to diagnose the problem with the limited information you've provided.
You'll need to check the Reaper logs for clues on the root cause.
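A quick way to rule the JMX path in or out is to run nodetool from the Reaper host against each node (a sketch; 7199 is only the default JMX port and the hostname is a placeholder):
nodetool -h node1.example.com -p 7199 status
# add -u/-pw if JMX authentication is enabled; a timeout or refusal here means Reaper can't reach that node over JMX either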
As a side note, this isn't related to repairs and is a client/driver/app connecting to the node on the CQL port:
INFO [Native-Transport-Requests-2] 2021-09-01 09:02:52,020 Message.java:619 - Unexpected exception during request; channel = [id: 0x1e99a957, L:/10.196.18.230:9042 ! R:/10.254.252.33:62100]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
Cheers!

Cassandra-2.2.3 : Repeatedly facing "writing large partition error" even after multiple repairs

We have a production Cassandra cluster with 2 datacenters of 6 nodes each. We keep encountering a large-partition warning. We ran 2 successful repairs, but this is still not resolved. How can I analyze and fix this?
BigTableWriter.java:184 - Writing large partition system_distributed/repair_history:rf_key_space:my_table (108140638 bytes)
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 1171896
Mismatch (Blocking): 808
Mismatch (Background): 131
Pool Name           Active   Pending   Completed
Large messages         n/a        11           0
Small messages         n/a         0    48881938
Gossip messages        n/a         0      113659
The system_distributed.repair_history table is not one that you really need to concern yourself with. Unfortunately, this can happen when a lot of repairs have been run. With 2.2, the only real solution is to TRUNCATE that table every now and then.
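A sketch of that cleanup from cqlsh (the table name is the one from the warning above; TRUNCATE requires all nodes to be up, so run it while the cluster is healthy):
cqlsh> TRUNCATE system_distributed.repair_history;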

Java Heap Space issue in Grakn 1.6.0

I have data of 100 nodes and 165 relations to be inserted into one keyspace. My Grakn image has 4 CPU cores and 3 GB of memory. While I try to insert the data, I get the following error:
[grpc-request-handler-4] ERROR grakn.core.server.Grakn - Uncaught exception at thread [grpc-request-handler-4] java.lang.OutOfMemoryError: Java heap space
It was noticed that the image used only 346% CPU and 1.46 GB of RAM. Another finding for the issue in the log was:
Caused by: com.datastax.oss.driver.api.core.AllNodesFailedException: Could not reach any contact point, make sure you've provided valid addresses (showing first 1, use getErrors() for more: Node(endPoint=/127.0.0.1:9042, hostId=null, hashCode=3cb85440): io.netty.channel.ChannelException: Unable to create Channel from class class io.netty.channel.socket.nio.NioSocketChannel)
Could you please help me with this?
It sounds like Cassandra ran out of memory - currently, Grakn spawns two processes: one for Cassandra and one for the Grakn server. You can increase the memory limits with the following flags (unix):
SERVER_JAVAOPTS=-Xms1G STORAGE_JAVAOPTS=-Xms2G ./grakn server start
This would give the server 1 GB and the storage engine (Cassandra) 2 GB of memory, for instance. 3 GB total may be a bit on the low end once your data grows, so keep these flags in mind :)
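Since the stack trace also shows the driver failing to reach 127.0.0.1:9042, it's worth confirming that the storage process actually comes up after raising the limits; a minimal check (assuming nc is available in the image):
nc -zv 127.0.0.1 9042
# success means the embedded Cassandra is listening; a refusal usually means it died on startup, often from the same memory pressure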

High disk I/O on Cassandra nodes

Setup:
We have a 3-node Cassandra cluster with around 850 GB of data on each node. The Cassandra data directory sits on an LVM setup (currently consisting of 3 drives: 800 GB + 100 GB + 100 GB), and there is a separate, non-LVM volume for cassandra_logs.
Versions:
Cassandra v2.0.14.425
DSE v4.6.6-1
Issue:
After adding the 3rd (100 GB) volume to the LVM on each node, disk I/O went very high on all the nodes and they go down quite often. The servers also become inaccessible and we need to reboot them; they don't stay stable and we have to reboot roughly every 10-15 minutes.
Other Info:
We have the DSE-recommended server settings (vm.max_map_count, file descriptors) configured on all nodes
RAM on each node : 24G
CPU on each node : 6 cores / 2600MHz
Disk on each node : 1000G (Data dir) / 8G (Logs)
As I suspected, you are having throughput problems on your disk. Here's what I looked at to give you background. The nodetool tpstats output from your three nodes had these lines:
Pool Name      Active   Pending   Completed   Blocked   All time blocked
FlushWriter         0         0          22         0                  8
FlushWriter         0         0          80         0                  6
FlushWriter         0         0          38         0                  9
The column I'm concerned about is the All Time Blocked. As a ratio to completed, you have a lot of blocking. The flushwriter is responsible for flushing memtables to the disk to keep the JVM from running out of memory or creating massive GC problems. The memtable is an in-memory representation of your tables. As your nodes take more writes, they start to fill and need to be flushed. That operation is a long sequential write to disk. Bookmark that. I'll come back to it.
When flushwriters are blocked, the heap starts to fill. If they stay blocked, you will see the requests starting to queue up and eventually the node will OOM.
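If you want to watch this happening while load is applied, a simple loop over tpstats makes the trend obvious (a sketch; the interval is arbitrary):
watch -n 30 'nodetool tpstats | grep -E "Pool Name|FlushWriter"'
# a climbing "All time blocked" count for FlushWriter means memtable flushes cannot keep up with the disk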
Compaction might be running as well. Compaction is a long sequential read of SSTables into memory and then a long sequential flush of the merge sorted results. More sequential IO.
So all these operations on disk are sequential. Not random IOPs. If your disk is not able to handle simultaneous sequential read and write, IOWait shoots up, requests get blocked and then Cassandra has a really bad day.
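One way to confirm this on the disks themselves is iostat while the node is under load (a sketch; it needs the sysstat package, and the interesting device is whichever one backs the LVM data volume):
iostat -x 5
# %util pinned near 100 with high await on the data device during flushes/compactions points at the disk, not at Cassandra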
You mentioned you are using Ceph. I haven't seen a successful deployment of Cassandra on Ceph yet. It will hold up for a while and then tip over on sequential load. Your easiest solution in the short term is to add more nodes to spread out the load. The medium term is to find some ways to optimize your stack for sequential disk loads, but that will eventually fail. Long term is get your data on real disks and off shared storage.
I have been telling consulting clients this for years when they use Cassandra: "If your storage has an ethernet plug, you are doing it wrong." It's a good rule of thumb.

Cassandra node down...any ideas why?

I've put up a test cluster - four nodes. Severely underpowered(!) - OK CPU, only 2 GB of RAM, shared non-SSD storage. Hey, it's a test :)
I just kept it running for three days. No data going in or out... everything's just idle. Connected with OpsCenter.
This morning, we found that one of the nodes went down around 2 am last night. The OS didn't go down (it was responding to pings). The Cassandra log around that time is:
INFO [MemtableFlushWriter:114] 2014-07-29 02:07:34,952 Memtable.java:360 - Completed flushing /var/lib/cassandra/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/system-sstable_activity-ka-107-Data.db (686 bytes) for commitlog position ReplayPosition(segmentId=1406304454537, position=29042136)
INFO [ScheduledTasks:1] 2014-07-29 02:08:24,227 GCInspector.java:116 - GC for ParNew: 276 ms for 1 collections, 648591696 used; max is 1040187392
Next entry is:
INFO [main] 2014-07-29 09:18:41,661 CassandraDaemon.java:102 - Hostname: xxxxx
i.e. when we restarted the node through opscenter.
Does that mean it crashed on GC, or that GC finished and something else crashed? Is there some other log I should be looking at?
Note: In opscenter eventlog, we see this:
7/29/2014, 2:15am Warning Node reported as being down: xxxxxxx
I appreciate the nodes are underpowered, but for being completely idle, it shouldn't crash, should it?
Using 2.1.0-rc4 btw.
My guess is your node was shut down by the OOM killer. Because Linux overcommits RAM, when the system is under heavy memory pressure it may kill applications to recover memory for the OS. With 2 GB of total RAM this can happen very easily.
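You can usually confirm or rule that out from the kernel log on the node (a sketch; the log file location varies by distro):
dmesg | grep -i -E "out of memory|killed process"
# or check /var/log/syslog (Debian/Ubuntu) or /var/log/messages (RHEL/CentOS)
# an entry naming the java process around 2 am would confirm the OOM killer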
