Cassandra keyspace fails when using symbolic link - cassandra

Need: create keyspace on alternate device
Problem: the service aborts on startup with the directory-creation failure messages below.
INFO [main] 2017-01-06 00:45:03,300 ViewManager.java:137 - Not submitting build tasks for views in keyspace system_schema as storage service is not initialized
ERROR [main] 2017-01-06 00:45:03,393 Directories.java:239 - Failed to create /var/lib/cassandra/data/opus/aa-15be7240d3db11e6ad0eed0a1d791016 directory
ERROR [main] 2017-01-06 00:45:03,397 DefaultFSErrorHandler.java:92 - Exiting forcefully due to file system exception on startup, disk failure policy "stop"
Context: Cassandra 3.9, single node, Ubuntu 16.04; directory permissions are shown below.
01:52 opus/ cd /var/lib/cassandra/data
01:52 opus/ ls -l
total 24
drwxr-xr-x 3 cassandra cassandra 4096 Jan 6 00:41 opus
drwxr-xr-x 24 cassandra cassandra 4096 Jan 5 23:49 system
drwxr-xr-x 6 cassandra cassandra 4096 Jan 5 23:50 system_auth
drwxr-xr-x 5 cassandra cassandra 4096 Jan 5 23:50 system_distributed
drwxr-xr-x 12 cassandra cassandra 4096 Jan 5 23:50 system_schema
drwxr-xr-x 4 cassandra cassandra 4096 Jan 5 23:50 system_traces
01:52 opus/ cd opus
01:52 opus/ ls -l
total 4
drwxr-xr-x 3 cassandra cassandra 4096 Jan 6 00:41 aa-15be7240d3db11e6ad0eed0a1d791016
When the symbolic link is installed:
01:57 data/ ls -l
total 20
lrwxrwxrwx 1 root root 35 Jan 6 01:57 opus -> /media/opus/quantdrive/opus
Steps:
Vanilla install of Cassandra 3.9
Create keyspace in cqlsh: create keyspace opus with replication = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
Create table: use opus; create table aa(aa int, primary key(aa));
Stop Cassandra
Move the keyspace directory: mv /var/lib/cassandra/data/opus /media/opus/quantdrive
Create the symbolic link: ln -s /media/opus/quantdrive/opus /var/lib/cassandra/data/opus
Start Cassandra [FAILS AS ABOVE] with a directory-creation error, even though the directory is already present
The permissions on the opus keyspace directory did not change; I just moved it. When I move it back, Cassandra starts fine.
I would be grateful for any help with this, and I apologize in advance if the solution to my problem is described elsewhere or if I'm missing the obvious.

Move the mount point for the target drive from a user-owned directory to a root-owned one. In my case I moved the mount point from /media/opus/quantdrive, which is owned by user opus, to /mnt/quantdrive, which is owned by root, and everything worked fine.
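For reference, a minimal sketch of those steps, assuming the drive is currently mounted at /media/opus/quantdrive and the block device is /dev/sdb1 (the device name is an assumption; check lsblk):
sudo service cassandra stop
sudo umount /media/opus/quantdrive                 # unmount from the user-owned location
sudo mkdir -p /mnt/quantdrive                      # root-owned mount point
sudo mount /dev/sdb1 /mnt/quantdrive               # device name is an assumption
sudo ln -sfn /mnt/quantdrive/opus /var/lib/cassandra/data/opus
sudo chown -h cassandra:cassandra /var/lib/cassandra/data/opus
sudo service cassandra start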

Related

ZooKeeper log + snapshot files are created very frequently

Under the folder /var/hadoop/zookeeper/version-2/ we can see that ZooKeeper transaction logs and snapshot files are created very frequently (multiple files every minute), and that fills up the filesystem in a very short time.
ROOT CAUSE
One or more applications are creating or modifying znodes too frequently, causing too many transactions in a short duration. This leads to the creation of too many transaction log files and snapshot files, since they get rolled over after 100,000 entries by default (as defined by the ZooKeeper property 'snapCount').
-rw-r--r-- 1 zookeeper hadoop 67108880 Jul 28 17:24 log.570021fa92
-rw-r--r-- 1 zookeeper hadoop 490656299 Jul 28 17:24 snapshot.5700232ffa
-rw-r--r-- 1 zookeeper hadoop 67108880 Jul 28 17:29 log.5700232ffc
-rw-r--r-- 1 zookeeper hadoop 490656389 Jul 28 17:29 snapshot.5700249d7f
-rw-r--r-- 1 zookeeper hadoop 67108880 Jul 28 17:33 log.5700249d78
-rw-r--r-- 1 zookeeper hadoop 490656275 Jul 28 17:33 snapshot.570025fdaf
-rw-r--r-- 1 zookeeper hadoop 67108880 Jul 28 17:36 log.570025fdae
-rw-r--r-- 1 zookeeper hadoop 490656275 Jul 28 17:36 snapshot.570026c447
-rw-r--r-- 1 zookeeper hadoop 67108880 Jul 28 17:40 log.570026c449
-rw-r--r-- 1 zookeeper hadoop 490658969 Jul 28 17:40 snapshot.570027caed
-rw-r--r-- 1 zookeeper hadoop 67108880 Jul 28 17:43 log.570027caef
-rw-r--r-- 1 zookeeper hadoop 490658981 Jul 28 17:43 snapshot.570028a0d0
-rw-r--r-- 1 zookeeper hadoop 67108880 Jul 28 17:48 log.570028a0d2
-rw-r--r-- 1 zookeeper hadoop 165081088 Jul 28 17:48 snapshot.57002a0268
-rw-r--r-- 1 zookeeper hadoop 67108880 Jul 28 17:48 log.57002a026b
...
When we opened one of the logs, log.57002a026b, we saw what looked like encrypted (binary) content.
Any suggestion on how to decode the logs above?
Or how to find out which application is creating or modifying the znodes too frequently?
PROBLEM
ZooKeeper transaction logs and snapshot files are created very frequently (multiple files every minute), and that fills up the filesystem in a very short time.
ROOT CAUSE
One or more applications are creating or modifying znodes too frequently, causing too many transactions in a short duration. This leads to the creation of too many transaction log files and snapshot files, since they get rolled over after 100,000 entries by default (as defined by the ZooKeeper property 'snapCount').
RESOLUTION
The resolution for such cases involves reviewing the ZooKeeper transaction logs to find the znodes that are updated or created most frequently, using the following command on one of the ZooKeeper servers:
# cd /usr/hdp/current/zookeeper-server
# java -cp "zookeeper.jar:lib/*" org.apache.zookeeper.server.LogFormatter /hadoop/zookeeper/version-2/logxxx
(where 'dataDir' is set to '/hadoop/zookeeper' in the ZooKeeper configuration)
Once the frequently updated znodes are identified using the above command, one should continue by fixing the related application that is creating such a large number of updates on ZooKeeper.
An example of an application that can cause this problem is HBase, when a very large number of regions are stuck in transition and repeatedly fail to come online.
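A rough sketch of feeding the LogFormatter output through standard shell tools to count the most frequently touched znode paths (the grep pattern is an assumption about the formatter's output and may need adjusting for your ZooKeeper version):
cd /usr/hdp/current/zookeeper-server
# Decode every transaction log, pull out znode paths, and list the most common ones.
for f in /hadoop/zookeeper/version-2/log.*; do
  java -cp "zookeeper.jar:lib/*" org.apache.zookeeper.server.LogFormatter "$f"
done | grep -o "'/[^,']*" | sort | uniq -c | sort -rn | head -20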

Cassandra - Node stuck in joining as another node is down

I am trying to add another node to a production Cassandra cluster because the disk space utilization across nodes is reaching over 90%. However, the node has been in joining state for over 2 days. I also noticed that one of the nodes went down (DN) as it is at 100% disk space utilization. The Cassandra server is unable to run on that instance!
Will this affect bootstrapping completion of the new node?
Any immediate solutions for restoring space on the node that went down?
If I remove this node from the ring, it may add more data load and increase disk usage on the other nodes.
Can I temporarily move some SSTables (like the files listed below) off the instance, bring up the server, perform cleanup, and then add these files back?
-rw-r--r--. 1 polkitd input 5551459 Sep 17 2020 mc-572-big-CompressionInfo.db
-rw-r--r--. 1 polkitd input 15859691072 Sep 17 2020 mc-572-big-Data.db
-rw-r--r--. 1 polkitd input 8 Sep 17 2020 mc-572-big-Digest.crc32
-rw-r--r--. 1 polkitd input 22608920 Sep 17 2020 mc-572-big-Filter.db
-rw-r--r--. 1 polkitd input 5634549206 Sep 17 2020 mc-572-big-Index.db
-rw-r--r--. 1 polkitd input 12538 Sep 17 2020 mc-572-big-Statistics.db
-rw-r--r--. 1 polkitd input 44510338 Sep 17 2020 mc-572-big-Summary.db
-rw-r--r--. 1 polkitd input 92 Sep 17 2020 mc-572-big-TOC.txt
If you are using vnodes, then the downed node will surely impact bootstrapping. For immediate relief, identify tables that are not used by live traffic and move their SSTables out to a backup location.
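A hedged sketch of what that might look like for one table that carries no live traffic (the keyspace and table names here are hypothetical; stop Cassandra on the node first and keep the files so they can be restored later):
sudo service cassandra stop
sudo mkdir -p /backup/unused_ks/unused_table
# Move the SSTable files for the unused table out of the data directory.
sudo mv /var/lib/cassandra/data/unused_ks/unused_table-*/mc-*-big-* /backup/unused_ks/unused_table/
sudo service cassandra start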
I resolved this by temporarily increasing the EBS volume (disk space) on that node, bringing up the server, then removing the node from the cluster, clearing out the Cassandra data folders, decreasing the EBS volume, and then adding the node back to the cluster.
One thing I noticed was that removing the node from the cluster increased disk usage on the other nodes. So I added additional nodes to distribute the load, then ran cleanup on all the other nodes before moving on to removing the node from the cluster.
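A sketch of that sequence in commands (the host ID is a placeholder; take the real one from nodetool status):
nodetool status                        # note the Host ID of the node being removed
nodetool removenode <host-id>          # run from a live node; use 'nodetool decommission' on the node itself if it is still up
nodetool cleanup                       # run on each remaining node after new nodes have joined, to reclaim space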

convert spring boot tomcat azure k8s deployment to standalone application

I have created an Azure DevOps project for Java, Spring Boot, and Kubernetes as a way to learn about the Azure technology set. It does work: the simple Spring Boot web application is deployed, runs, and is rebuilt if I make code changes.
However, the application uses a very old version of Spring Boot (1.5.7.RELEASE) and is deployed in a Tomcat server in Kubernetes.
I am looking for some guidance on how to run it as a standalone Spring Boot version 2 application in Kubernetes. My attempts so far have resulted in the deployment timing out after 15 minutes in the Helm upgrade step.
The existing Dockerfile:
FROM maven:3.5.2-jdk-8 AS build-env
WORKDIR /app
COPY . /app
RUN mvn package
FROM tomcat:8
RUN rm -rf /usr/local/tomcat/webapps/ROOT
COPY --from=build-env /app/target/*.war /usr/local/tomcat/webapps/ROOT.war
How do I change the Dockerfile to build the image of a standalone Spring Boot app?
I changed the pom to generate a jar file, then modified the Dockerfile to this:
FROM maven:3.5.2-jdk-8 AS build-env
WORKDIR /app
COPY . /app
RUN mvn package
FROM openjdk:8-jdk-alpine
VOLUME /tmp
COPY --from=build-env /app/target/ROOT.jar .
RUN ls -la
ENTRYPOINT ["java","-jar","ROOT.jar"]
The Dockerfile builds; see the output from the log for the 'Build an image' step:
...
2019-06-25T23:33:38.0841365Z Step 9/20 : COPY --from=build-env /app/target/ROOT.jar .
2019-06-25T23:33:41.4839851Z ---> b478fb8867e6
2019-06-25T23:33:41.4841124Z Step 10/20 : RUN ls -la
2019-06-25T23:33:41.6653383Z ---> Running in 4618c503ac5c
2019-06-25T23:33:42.2022890Z total 50156
2019-06-25T23:33:42.2026590Z drwxr-xr-x 1 root root 4096 Jun 25 23:33 .
2019-06-25T23:33:42.2026975Z drwxr-xr-x 1 root root 4096 Jun 25 23:33 ..
2019-06-25T23:33:42.2027267Z -rwxr-xr-x 1 root root 0 Jun 25 23:33 .dockerenv
2019-06-25T23:33:42.2027608Z -rw-r--r-- 1 root root 51290350 Jun 25 23:33 ROOT.jar
2019-06-25T23:33:42.2027889Z drwxr-xr-x 2 root root 4096 May 9 20:49 bin
2019-06-25T23:33:42.2028188Z drwxr-xr-x 5 root root 340 Jun 25 23:33 dev
2019-06-25T23:33:42.2028467Z drwxr-xr-x 1 root root 4096 Jun 25 23:33 etc
2019-06-25T23:33:42.2028765Z drwxr-xr-x 2 root root 4096 May 9 20:49 home
2019-06-25T23:33:42.2029376Z drwxr-xr-x 1 root root 4096 May 11 01:32 lib
2019-06-25T23:33:42.2029682Z drwxr-xr-x 5 root root 4096 May 9 20:49 media
2019-06-25T23:33:42.2029961Z drwxr-xr-x 2 root root 4096 May 9 20:49 mnt
2019-06-25T23:33:42.2030257Z drwxr-xr-x 2 root root 4096 May 9 20:49 opt
2019-06-25T23:33:42.2030537Z dr-xr-xr-x 135 root root 0 Jun 25 23:33 proc
2019-06-25T23:33:42.2030937Z drwx------ 2 root root 4096 May 9 20:49 root
2019-06-25T23:33:42.2031214Z drwxr-xr-x 2 root root 4096 May 9 20:49 run
2019-06-25T23:33:42.2031523Z drwxr-xr-x 2 root root 4096 May 9 20:49 sbin
2019-06-25T23:33:42.2031797Z drwxr-xr-x 2 root root 4096 May 9 20:49 srv
2019-06-25T23:33:42.2032254Z dr-xr-xr-x 12 root root 0 Jun 25 23:33 sys
2019-06-25T23:33:42.2032355Z drwxrwxrwt 2 root root 4096 May 9 20:49 tmp
2019-06-25T23:33:42.2032656Z drwxr-xr-x 1 root root 4096 May 11 01:32 usr
2019-06-25T23:33:42.2032945Z drwxr-xr-x 1 root root 4096 May 9 20:49 var
2019-06-25T23:33:43.0909881Z Removing intermediate container 4618c503ac5c
2019-06-25T23:33:43.0911258Z ---> 0d824ce4ae62
2019-06-25T23:33:43.0911852Z Step 11/20 : ENTRYPOINT ["java","-jar","ROOT.jar"]
2019-06-25T23:33:43.2880002Z ---> Running in bba9345678be
...
The build completes, but the deployment fails in the Helm upgrade step, timing out after 15 minutes. This is the log:
2019-06-25T23:38:06.6438602Z ##[section]Starting: Helm upgrade
2019-06-25T23:38:06.6444317Z ==============================================================================
2019-06-25T23:38:06.6444448Z Task : Package and deploy Helm charts
2019-06-25T23:38:06.6444571Z Description : Deploy, configure, update a Kubernetes cluster in Azure Container Service by running helm commands
2019-06-25T23:38:06.6444648Z Version : 0.153.0
2019-06-25T23:38:06.6444927Z Author : Microsoft Corporation
2019-06-25T23:38:06.6445006Z Help : https://learn.microsoft.com/azure/devops/pipelines/tasks/deploy/helm-deploy
2019-06-25T23:38:06.6445300Z ==============================================================================
2019-06-25T23:38:09.1285973Z [command]/opt/hostedtoolcache/helm/2.14.1/x64/linux-amd64/helm upgrade --tiller-namespace dev2134 --namespace dev2134 --install --force --wait --set image.repository=stephenacr.azurecr.io/stephene991 --set image.tag=20 --set applicationInsights.InstrumentationKey=643a47f5-58bd-4012-afea-b3c943bc33ce --set imagePullSecrets={stephendockerauth} --timeout 900 azuredevops /home/vsts/work/r1/a/Drop/drop/sampleapp-v0.2.0.tgz
2019-06-25T23:53:13.7882713Z UPGRADE FAILED
2019-06-25T23:53:13.7883396Z Error: timed out waiting for the condition
2019-06-25T23:53:13.7885043Z Error: UPGRADE FAILED: timed out waiting for the condition
2019-06-25T23:53:13.7967270Z ##[error]Error: UPGRADE FAILED: timed out waiting for the condition
2019-06-25T23:53:13.7976964Z ##[section]Finishing: Helm upgrade
I have had another look at this now that I am more familiar with all the technologies, and I have located the problem.
The helm upgrade command is timing out waiting for the newly deployed pod to become live, but this never happens because the Kubernetes liveness probe defined for the pod is failing. This can be seen with this command:
kubectl get po -n dev5998 -w
NAME READY STATUS RESTARTS AGE
sampleapp-86869d4d54-nzd9f 0/1 CrashLoopBackOff 17 48m
sampleapp-c8f84c857-phrrt 1/1 Running 0 1h
sampleapp-c8f84c857-rmq8w 1/1 Running 0 1h
tiller-deploy-79f84d5f-4r86q 1/1 Running 0 2h
The new pod is repeatedly killed and restarted. This seems to repeat forever, or until another deployment is run.
In the events for the pod:
kubectl describe po sampleapp-86869d4d54-nzd9f -n dev5998
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 39m default-scheduler Successfully assigned sampleapp-86869d4d54-nzd9f to aks-agentpool-24470557-1
Normal SuccessfulMountVolume 39m kubelet, aks-agentpool-24470557-1 MountVolume.SetUp succeeded for volume "default-token-v72n5"
Normal Pulling 39m kubelet, aks-agentpool-24470557-1 pulling image "devopssampleacreg.azurecr.io/devopssamplec538:52"
Normal Pulled 39m kubelet, aks-agentpool-24470557-1 Successfully pulled image "devopssampleacreg.azurecr.io/devopssamplec538:52"
Normal Created 37m (x3 over 39m) kubelet, aks-agentpool-24470557-1 Created container
Normal Started 37m (x3 over 39m) kubelet, aks-agentpool-24470557-1 Started container
Normal Killing 37m (x2 over 38m) kubelet, aks-agentpool-24470557-1 Killing container with id docker://sampleapp:Container failed liveness probe.. Container will be killed and recreated.
Warning Unhealthy 36m (x6 over 38m) kubelet, aks-agentpool-24470557-1 Liveness probe failed: HTTP probe failed with statuscode: 404
Warning Unhealthy 34m (x12 over 38m) kubelet, aks-agentpool-24470557-1 Readiness probe failed: HTTP probe failed with statuscode: 404
Normal Pulled 9m25s (x12 over 38m) kubelet, aks-agentpool-24470557-1 Container image "devopssampleacreg.azurecr.io/devopssamplec538:52" already present on machine
Warning BackOff 4m10s (x112 over 34m) kubelet, aks-agentpool-24470557-1 Back-off restarting failed container
So there must be a difference in which URLs the application serves depending on how it is deployed, Tomcat or standalone, which now seems obvious.
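A sketch of how to confirm this, using the pod from the listing above (the deployment name and the / path are assumptions; the chart defines the real probe paths):
# See which paths the liveness/readiness probes are configured to hit.
kubectl get deployment sampleapp -n dev5998 -o yaml | grep -B2 -A6 -i "Probe"
# Port-forward to the pod and request the probe path directly; a 404 here matches the events above.
kubectl port-forward sampleapp-86869d4d54-nzd9f 8080:8080 -n dev5998 &
curl -i http://localhost:8080/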

Restoring data after upgrading Cassandra

I'm trying to upgrade from Cassandra to the latest DataStax Enterprise, and everything went fine except for the fact that I can't get my data back.
Basically, I had a clean Cassandra after the upgrade; I then recreated the schema and am trying to somehow link the files left over from the old database to the new one.
This is what I have right now in the /var/lib/cassandra/data/wowch directory, for example:
drwxr-x--- 4 cassandra cassandra 4.0K Feb 27 13:05 users-247834809d2011e58d82b7a748b1d9c2/
drwxr-xr-x 2 cassandra cassandra 4.0K Feb 27 18:53 users-f41a5300dd5611e58bc7b7a748b1d9c2/
As I understand it, the older directory is what was in the database before the upgrade. It contains some db files:
total 144K
drwxr-x--- 4 cassandra cassandra 4.0K Feb 27 13:05 ./
drwxr-x--- 60 cassandra cassandra 20K Feb 27 14:35 ../
drwxr-x--- 2 cassandra cassandra 4.0K Dec 7 21:21 backups/
-rwxr-x--- 2 cassandra cassandra 51 Jan 20 00:05 ma-46-big-CompressionInfo.db*
-rwxr-x--- 2 cassandra cassandra 828 Jan 20 00:05 ma-46-big-Data.db*
-rwxr-x--- 2 cassandra cassandra 10 Jan 20 00:05 ma-46-big-Digest.crc32*
-rwxr-x--- 2 cassandra cassandra 16 Jan 20 00:05 ma-46-big-Filter.db*
-rwxr-x--- 2 cassandra cassandra 83 Jan 20 00:05 ma-46-big-Index.db*
-rwxr-x--- 2 cassandra cassandra 4.9K Jan 20 00:05 ma-46-big-Statistics.db*
-rwxr-x--- 2 cassandra cassandra 92 Jan 20 00:05 ma-46-big-Summary.db*
-rwxr-x--- 2 cassandra cassandra 92 Jan 20 00:05 ma-46-big-TOC.txt*
-rwxr-x--- 2 cassandra cassandra 43 Feb 12 15:05 ma-47-big-CompressionInfo.db*
-rwxr-x--- 2 cassandra cassandra 41 Feb 12 15:05 ma-47-big-Data.db*
-rwxr-x--- 2 cassandra cassandra 10 Feb 12 15:05 ma-47-big-Digest.crc32*
-rwxr-x--- 2 cassandra cassandra 16 Feb 12 15:05 ma-47-big-Filter.db*
-rwxr-x--- 2 cassandra cassandra 20 Feb 12 15:05 ma-47-big-Index.db*
-rwxr-x--- 2 cassandra cassandra 4.5K Feb 12 15:05 ma-47-big-Statistics.db*
-rwxr-x--- 2 cassandra cassandra 92 Feb 12 15:05 ma-47-big-Summary.db*
-rwxr-x--- 2 cassandra cassandra 92 Feb 12 15:05 ma-47-big-TOC.txt*
-rwxr-x--- 2 cassandra cassandra 43 Feb 12 16:05 ma-48-big-CompressionInfo.db*
-rwxr-x--- 2 cassandra cassandra 169 Feb 12 16:05 ma-48-big-Data.db*
-rwxr-x--- 2 cassandra cassandra 10 Feb 12 16:05 ma-48-big-Digest.crc32*
-rwxr-x--- 2 cassandra cassandra 16 Feb 12 16:05 ma-48-big-Filter.db*
-rwxr-x--- 2 cassandra cassandra 20 Feb 12 16:05 ma-48-big-Index.db*
-rwxr-x--- 2 cassandra cassandra 4.9K Feb 12 16:05 ma-48-big-Statistics.db*
-rwxr-x--- 2 cassandra cassandra 92 Feb 12 16:05 ma-48-big-Summary.db*
-rwxr-x--- 2 cassandra cassandra 92 Feb 12 16:05 ma-48-big-TOC.txt*
-rwxr-x--- 1 cassandra cassandra 31 Dec 7 21:26 manifest.json*
drwxr-x--- 3 cassandra cassandra 4.0K Feb 27 13:05 snapshots/
I tried to copy all the files from here to the users-f41a5300dd5611e58bc7b7a748b1d9c2/ directory and run nodetool repair or nodetool refresh -- wowch users, but had no success: the data is still not loaded.
Did I forget something? What is the right way to do this, and how do I get the data back?
Depending on the version of Cassandra/DSE you were previously running, you may need to run nodetool upgradesstables. You can see the documentation here:
https://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsUpgradeSstables.html
It's possible that you've run into this issue, but without more info I can't say for sure.
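If that is the case, a minimal sketch using the keyspace and table from the question:
nodetool upgradesstables wowch users
# or rewrite every SSTable regardless of its version:
nodetool upgradesstables -a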
You also haven't provided info on which version you started with and which you ended on. A little more info would be very helpful. Can you also clarify: are you upgrading from community Cassandra to DSE? I couldn't tell from the way your question was worded.
Stuff to check: Do you have the token assignments from the old version? I didn't use vnodes, and I found that I had to manually set initial_token in cassandra.yaml after a backup/restore of my cluster. Make sure that cassandra owns all of the directories and files. After you import the schema, stop DSE and then empty the contents of the commitlog directory. Move your data, if necessary, into the new folders and then restart DSE. Hope this helps.
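Putting those checks together, a rough sketch of the sequence for the layout shown above (the ma-* files from the old users directory get moved into the directory that matches the recreated table's ID; paths follow the question, and the service name may differ on your install):
sudo service dse stop                      # or 'sudo service cassandra stop' on a plain package install
cd /var/lib/cassandra/data/wowch
sudo mv users-247834809d2011e58d82b7a748b1d9c2/ma-*-big-* users-f41a5300dd5611e58bc7b7a748b1d9c2/
sudo chown -R cassandra:cassandra /var/lib/cassandra
sudo rm -f /var/lib/cassandra/commitlog/*  # empty the commitlog, as suggested above
sudo service dse start
nodetool refresh -- wowch users            # pick up the newly placed SSTables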

svn permission issue - txn-current-lock: Permission denied

I set up SVN on my local system at /svn/repos/myproject by following this tutorial. I'm able to view the repo in a browser.
But when I try to import a new project through an SVN client (RapidSVN), it fails with the following error:
Execute: Import
Error while performing action:
Can't open file '/svn/repos/myproject/db/txn-current-lock': Permission denied
Svn directory permissions:
→ ls -l /svn
total 12
drwxrwxr-x 2 root root 4096 Feb 15 12:09 permissions
drwxrwxr-x 4 apache apache 4096 Feb 15 12:09 repos
drwxrwxr-x 2 root root 4096 Feb 15 12:09 users
Repo directory:
→ ls -l
total 8
drwxrwxr-x 3 root root 4096 Feb 15 12:09 conf
drwxrwxr-x 7 apache apache 4096 Feb 15 12:09 myproject
How to solve this issue?
I gave 777 permissions to the repos directory, which solved this issue. But then I got another issue: Couldn't perform atomic initialization.
I think this is due to an SQLite version that is incompatible with the Subversion version we're using; it can be solved by updating the svnadmin command:
svnadmin create --pre-1.6-compatible --fs-type fsfs /svn/repos/myproject
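Rather than leaving the repository world-writable, a hedged alternative is to recreate the repo and give the Apache user ownership (the import URL is hypothetical):
svnadmin create --pre-1.6-compatible --fs-type fsfs /svn/repos/myproject   # after moving any old repo aside
chown -R apache:apache /svn/repos/myproject
chmod -R u+rwX,g+rwX /svn/repos/myproject
svn import ./myproject http://localhost/svn/myproject -m "initial import"  # URL is hypothetical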
