Kafka logs grow too large - Linux

I can see that the Kafka logs are growing rapidly and flooding the filesystem.
How can I configure Kafka to write less log output and to rotate these logs frequently?
The files live in /opt/kafka/kafka_2.12-2.2.2/logs; their sizes:
5.9G server.log.2020-11-24-14
5.9G server.log.2020-11-24-15
5.9G server.log.2020-11-24-16
5.7G server.log.2020-11-24-17
Sample log lines from the files above:
[2020-11-24 14:59:59,999] WARN Exception when following the leader (org.apache.zookeeper.server.quorum.Learner)
java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at org.apache.zookeeper.common.AtomicFileOutputStream.write(AtomicFileOutputStream.java:74)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
at java.io.BufferedWriter.flush(BufferedWriter.java:254)
at org.apache.zookeeper.server.quorum.QuorumPeer.writeLongToFile(QuorumPeer.java:1391)
at org.apache.zookeeper.server.quorum.QuorumPeer.setCurrentEpoch(QuorumPeer.java:1426)
at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:454)
at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:83)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:981)
[2020-11-24 14:59:59,999] INFO shutdown called (org.apache.zookeeper.server.quorum.Learner)
java.lang.Exception: shutdown Follower
at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:169)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:985)
[2020-11-24 14:59:59,999] INFO Shutting down (org.apache.zookeeper.server.quorum.FollowerZooKeeperServer)
[2020-11-24 14:59:59,999] INFO LOOKING (org.apache.zookeeper.server.quorum.QuorumPeer)
[2020-11-24 14:59:59,999] INFO New election. My id = 1, proposed zxid=0x1000001d2 (org.apache.zookeeper.server.quorum.FastLeaderElection)
[2020-11-24 14:59:59,999] INFO Notification: 1 (message format version), 1 (n.leader), 0x1000001d2 (n.zxid), 0x2 (n.round), LOOKING (n.state), 1 (n.sid), 0x1 (n.peerEpoch) LOOKING (my state) (org.apache.zookeeper.server.quorum.FastLeaderElection)
It also writes to /opt/kafka/kafka_2.12-2.2.2/kafka.log:
[2020-12-05 16:51:10,109] INFO [GroupMetadataManager brokerId=1] Finished loading offsets and group metadata from __consumer_offsets-30 in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-12-05 16:51:10,109] INFO [GroupMetadataManager brokerId=1] Finished loading offsets and group metadata from __consumer_offsets-36 in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-12-05 16:51:10,109] INFO [GroupMetadataManager brokerId=1] Finished loading offsets and group metadata from __consumer_offsets-42 in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-12-05 16:51:10,110] INFO [GroupMetadataManager brokerId=1] Finished loading offsets and group metadata from __consumer_offsets-48 in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-12-05 17:01:09,528] INFO [GroupMetadataManager brokerId=1] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2020-12-05 17:11:09,528] INFO [GroupMetadataManager brokerId=1] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
Kafka is used for the Elastic Stack.
Below is the relevant entry from the server.properties file:
# A comma seperated list of directories under which to store log files
log.dirs=/var/log/kafka
The contents of /var/log/kafka are:
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 heartbeat-1
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 __consumer_offsets-12
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 auditbeat-0
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 apm-2
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 __consumer_offsets-28
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 filebeat-2
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 __consumer_offsets-38
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 __consumer_offsets-44
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 __consumer_offsets-6
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 __consumer_offsets-16
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 metricbeat-0
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 __consumer_offsets-22
drwxr-xr-x 2 kafka users 4.0K Dec 5 16:51 __consumer_offsets-32
-rw-r--r-- 1 kafka users 747 Dec 5 18:02 recovery-point-offset-checkpoint
-rw-r--r-- 1 kafka users 4 Dec 5 18:02 log-start-offset-checkpoint
-rw-r--r-- 1 kafka users 749 Dec 5 18:03 replication-offset-checkpoint
No DEBUG-level logging is enabled in any of the files under /opt/kafka/kafka_2.12-2.2.2/config.
How can I make sure Kafka doesn't produce such huge files in /opt/kafka/kafka_2.12-2.2.2/logs, and how can I rotate them regularly with compression?
Thanks,

log.dirs is the broker's actual data storage, not its process logs, so it should not be in /var/log alongside other process logs.
Almost 6G a day is not unreasonable, but you can modify the log4j.properties file so the rolling file appender keeps only around 1 or 2 days of history.
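For context, the stock config/log4j.properties in this Kafka version (log4j 1.x) rolls server.log hourly via a DailyRollingFileAppender (hence the server.log.2020-11-24-14 style names) and never deletes old files. A minimal sketch of a size-bounded alternative - the 100MB/10 values are illustrative, not from the original post:
# config/log4j.properties - cap server.log at roughly 10 x 100MB
log4j.appender.kafkaAppender=org.apache.log4j.RollingFileAppender
log4j.appender.kafkaAppender.File=${kafka.logs.dir}/server.log
log4j.appender.kafkaAppender.MaxFileSize=100MB
log4j.appender.kafkaAppender.MaxBackupIndex=10
log4j.appender.kafkaAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.kafkaAppender.layout.ConversionPattern=[%d] %p %m (%c)%n
Kafka does not watch this file for changes, so restart the broker after editing. To write less in the first place, you can also raise thresholds (the stock root logger is INFO; WARN cuts the volume considerably).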
Generally, as with any Linux administration task, you'd have separate disk volumes for /var/log, your OS storage, and any dedicated disks for server data - say, a mount at /kafka.
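On the compression part of the question: log4j 1.x's RollingFileAppender cannot compress rotated files itself. One hedged option is to rotate at the OS level instead, with a logrotate stanza dropped into /etc/logrotate.d/kafka (copytruncate is needed because Kafka keeps the file open, and it can drop a few lines during the copy window):
/opt/kafka/kafka_2.12-2.2.2/logs/server.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
If you go this route, leave log4j writing to a single server.log (e.g. a plain FileAppender) so the two rotation mechanisms don't fight over the same files.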

Related

"mount: /dev/mqueue: must be superuser to use mount" when starting a Yocto Linux system via NFS and TFTP

I followed the guide "Yocto NFS & TFTP boot" from the i.MX knowledge base to make my embedded Linux device boot a kernel and root filesystem served from my development machine.
The kernel seems to be correctly loaded via TFTP, but the system doesn't boot up properly and systemd goes into maintenance mode.
Here's the first error in the log:
[ 10.637534] systemd[1]: dev-mqueue.mount: Mount process exited, code=exited, status=32/n/a
[ 10.657077] systemd[1]: dev-mqueue.mount: Failed with result 'exit-code'.
[ 10.666907] systemd[1]: Failed to mount POSIX Message Queue File System.
[FAILED] Failed to mount POSIX Message Queue File System.
See 'systemctl status dev-mqueue.mount' for details.
It seems similar to the log included in an unanswered comment on that same guide.
Looking at systemctl status dev-mqueue.mount, I see:
* dev-mqueue.mount - POSIX Message Queue File System
Loaded: loaded (/lib/systemd/system/dev-mqueue.mount; static)
Active: failed (Result: exit-code) since Sun 2022-03-06 11:52:40 UTC; 5h 29min ago
Where: /dev/mqueue
What: mqueue
Docs: man:mq_overview(7)
https://www.freedesktop.org/wiki/Software/systemd/APIFileSystems
Mar 06 11:52:40 pico-imx7 mount[180]: mount: /dev/mqueue: must be superuser to use mount.
Notice: journal has been rotated since unit was started, output may be incomplete.
Not sure what's wrong, or why the system would fail like this.
The message "must be superuser to use mount" is a hint at a permission problem.
The Linux system expects most system files to be owned by UID 0 (root), but when reading the NFS filesystem set up in the guide it actually reads UID 1000, or the UID of whoever built the system in the development machine. If I list the contents of ${YOCTO_BUILD_DIR}/tmp/work/${TARGET}-poky-linux-gnueabi/${IMAGE}/1.0-r0/rootfs, I get:
drwxr-xr-x 20 1000 1000 4096 Mar 9 2018 ./
drwxr-xr-x 14 1000 1000 4096 Mar 5 19:19 ../
drwxr-xr-x 2 1000 1000 4096 Mar 9 2018 bin/
drwxr-xr-x 2 1000 1000 4096 Mar 9 2018 boot/
drwxr-xr-x 2 1000 1000 4096 Mar 9 2018 dev/
drwxr-xr-x 59 1000 1000 4096 Mar 9 2018 etc/
drwxr-xr-x 4 1000 1000 4096 Mar 9 2018 home/
drwxr-xr-x 10 1000 1000 4096 Mar 9 2018 lib/
drwxr-xr-x 2 1000 1000 4096 Mar 9 2018 media/
drwxr-xr-x 2 1000 1000 4096 Mar 9 2018 mnt/
drwxr-xr-x 4 1000 1000 4096 Mar 9 2018 opt/
drwxr-xr-x 2 1000 1000 4096 Mar 9 2018 proc/
drwxr-xr-x 2 1000 1000 4096 Mar 9 2018 run/
drwxr-xr-x 3 1000 1000 4096 Mar 9 2018 sbin/
drwxr-xr-x 2 1000 1000 4096 Mar 9 2018 srv/
drwxr-xr-x 2 1000 1000 4096 Mar 9 2018 sys/
drwxr-xr-t 2 1000 1000 4096 Mar 9 2018 tmp/
drwxr-xr-x 38 1000 1000 4096 Mar 9 2018 unit_tests/
drwxr-xr-x 11 1000 1000 4096 Mar 9 2018 usr/
drwxr-xr-x 9 1000 1000 4096 Mar 9 2018 var/
Notice the 1000 as UID and GID.
Compare that with the listing of the filesystem image tarball, made with tar --exclude=\*/\*/\* --no-wildcards-match-slash -tjvf ${YOCTO_BUILD_DIR}/tmp/deploy/images/${TARGET}/${IMAGE}-${TARGET}.tar.bz2:
drwxr-xr-x 0/0 0 2018-03-09 13:34 ./
drwxr-xr-x 0/0 0 2018-03-09 13:34 ./bin/
drwxr-xr-x 0/0 0 2018-03-09 13:34 ./boot/
drwxr-xr-x 0/0 0 2018-03-09 13:34 ./dev/
drwxr-xr-x 0/0 0 2018-03-09 13:34 ./etc/
drwxr-xr-x 0/0 0 2018-03-09 13:34 ./home/
drwxr-xr-x 0/0 0 2018-03-09 13:34 ./lib/
drwxr-xr-x 0/0 0 2018-03-09 13:34 ./media/
drwxr-xr-x 0/0 0 2018-03-09 13:34 ./mnt/
drwxr-xr-x 0/0 0 2018-03-09 13:34 ./opt/
dr-xr-xr-x 0/0 0 2018-03-09 13:34 ./proc/
drwxr-xr-x 0/0 0 2018-03-09 13:34 ./run/
drwxr-xr-x 0/0 0 2018-03-09 13:34 ./sbin/
drwxr-xr-x 0/0 0 2018-03-09 13:34 ./srv/
dr-xr-xr-x 0/0 0 2018-03-09 13:34 ./sys/
drwxrwxrwt 0/0 0 2018-03-09 13:34 ./tmp/
drwxr-xr-x 0/0 0 2018-03-09 13:34 ./unit_tests/
drwxr-xr-x 0/0 0 2018-03-09 13:34 ./usr/
drwxr-xr-x 0/0 0 2018-03-09 13:34 ./var/
Here the root directories are correctly owned by root.
The same permissions are in the .ext4 image file, which is part of the SD/MMC image file.
Two possible solutions:
mount the ${YOCTO_BUILD_DIR}/tmp/deploy/images/${TARGET}/${IMAGE}-${TARGET}.ext4 file as a loop device on a directory, then export that directory via NFS (might require root privileges);
extract ${YOCTO_BUILD_DIR}/tmp/deploy/images/${TARGET}/${IMAGE}-${TARGET}.tar.bz2 into a directory, then export that directory via NFS; this requires root privileges and some extra time to extract the embedded filesystem. Both options are sketched below.
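A minimal sketch of both options - the export path /srv/nfs/rootfs and the 192.168.1.0/24 subnet are placeholders, so adjust /etc/exports to your LAN:
# Option 1: loop-mount the ext4 image and export it
sudo mkdir -p /srv/nfs/rootfs
sudo mount -o loop ${YOCTO_BUILD_DIR}/tmp/deploy/images/${TARGET}/${IMAGE}-${TARGET}.ext4 /srv/nfs/rootfs
echo '/srv/nfs/rootfs 192.168.1.0/24(rw,sync,no_root_squash,no_subtree_check)' | sudo tee -a /etc/exports
sudo exportfs -ra
# Option 2: extract the tarball as root, preserving ownership
sudo tar -xjpf ${YOCTO_BUILD_DIR}/tmp/deploy/images/${TARGET}/${IMAGE}-${TARGET}.tar.bz2 -C /srv/nfs/rootfs --numeric-owner
The no_root_squash option matters here: without it, the device's root would be squashed to an anonymous UID and you'd hit similar permission errors.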

Duplicated Cassandra SSTable files

After an unsuccessful nodetool repair operation I ended up with two big SSTable files (the last two in the listing below) instead of one, each having the same size as the single file had before. These files cannot be merged back by the usual tools (nodetool clean, nodetool compact, nodetool repair). The tables are replicated to another Cassandra node (replication_factor: 2), and there are two big SSTable files there as well now.
-rw-r--r-- 1 cassandra cassandra 16M Mar 5 12:36 mc-116413-big-Data.db
-rw-r--r-- 1 cassandra cassandra 34M Mar 5 01:21 mc-116320-big-Index.db
-rw-r--r-- 1 cassandra cassandra 39M Mar 3 22:46 mc-116125-big-Index.db
-rw-r--r-- 1 cassandra cassandra 66M Mar 5 12:25 mc-116412-big-Data.db
-rw-r--r-- 1 cassandra cassandra 262M Mar 5 05:51 mc-116365-big-Data.db
-rw-r--r-- 1 cassandra cassandra 263M Mar 5 08:46 mc-116386-big-Data.db
-rw-r--r-- 1 cassandra cassandra 263M Mar 5 11:42 mc-116407-big-Data.db
-rw-r--r-- 1 cassandra cassandra 7.2G Mar 5 03:18 mc-116345-big-Data.db
-rw-r--r-- 1 cassandra cassandra 43G Mar 3 22:46 mc-116125-big-Data.db
-rw-r--r-- 1 cassandra cassandra 48G Mar 5 01:21 mc-116320-big-Data.db
I suppose that one of these files contains duplicated data. How can I compact the files back into a single file?
Maybe I'm not looking properly, but I don't see any duplicate SSTable files in the listing you posted.
If you're referring to these 2:
-rw-r--r-- 1 cassandra cassandra 43G Mar 3 22:46 mc-116125-big-Data.db
-rw-r--r-- 1 cassandra cassandra 48G Mar 5 01:21 mc-116320-big-Data.db
They're not duplicates because they have 2 different generation IDs -- 116125 and 116320. This means they also have different ancestors.
If you're referring to these:
-rw-r--r-- 1 cassandra cassandra 39M Mar 3 22:46 mc-116125-big-Index.db
-rw-r--r-- 1 cassandra cassandra 43G Mar 3 22:46 mc-116125-big-Data.db
-rw-r--r-- 1 cassandra cassandra 34M Mar 5 01:21 mc-116320-big-Index.db
-rw-r--r-- 1 cassandra cassandra 48G Mar 5 01:21 mc-116320-big-Data.db
Again, they're not duplicates of each other. The *-Data.db files contain the actual data. The *-Index.db files are component files containing the partition index, i.e. the index of the partitions within the data files, which is used for fast retrieval.
If you're interested, I've explained it in a bit more detail in this post -- https://community.datastax.com/questions/5219/. Cheers!
[UPDATE] To respond to this follow-up question:
Could you suggest why these two files don't get compacted into a single file, as would usually happen?
Assuming the table is configured with SizeTieredCompactionStrategy, it will require similar-sized sstables as candidates before they get compacted together.
The default minimum number of sstable candidates is min_threshold: 4, so you need 4 similarly-sized sstables before a compaction is triggered.
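As an illustration only (the keyspace/table names are placeholders, and this changes the live setting on one node rather than the schema), you could lower the minimum so that two similar-sized sstables become compaction candidates:
nodetool setcompactionthreshold my_keyspace my_table 2 32
With min_threshold at 2, the two large sstables from the listing would be eligible to compact together on the next STCS round.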

Does Spark support multiple users?

I have a 3-node Spark 2.3.1 cluster running at the moment, and I'm also running a Zeppelin server as a normal user, like ulab.
From zeppelin, I ran the commands:
%spark
val file = sc.textFile("file:///mnt/glusterfs/test/testfile")
file.saveAsTextFile("/mnt/glusterfs/test/testfile2")
It reports a lot of error messages, something like:
WARN [2018-09-14 05:44:50,540] ({pool-2-thread-8} NotebookServer.java[afterStatusChange]:2302) - Job 20180907-130718_39068508 is finished, status: ERROR, exception: null, result: %text file: org.apache.spark.rdd.RDD[String] = file:///mnt/glusterfs/test/testfile MapPartitionsRDD[49] at textFile at <console>:51
org.apache.spark.SparkException: Job aborted.
...
... 64 elided
Caused by: java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus{path=file:/mnt/glusterfs/test/testfile2/_temporary/0/task_20180914054253_0050_m_000018/part-00018; isDirectory=false; length=33554979; replication=1; blocksize=33554432; modification_time=1536903780000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} to file:/mnt/glusterfs/test/testfile2/part-00018
And I found that some temporary files are owned by user root, while some are owned by ulab, like the following:
bash-4.4# ls -l testfile2
total 32773
drwxr-xr-x 3 ulab ulab 4096 Sep 14 05:42 _temporary
-rw-r--r-- 1 ulab ulab 33554979 Sep 14 05:44 part-00018
bash-4.4# ls -l testfile2/_temporary/
total 4
drwxr-xr-x 210 ulab ulab 4096 Sep 14 05:44 0
bash-4.4# ls -l testfile2/_temporary/0
total 832
drwxr-xr-x 2 root root 4096 Sep 14 05:42 task_20180914054253_0050_m_000000
drwxr-xr-x 2 root root 4096 Sep 14 05:42 task_20180914054253_0050_m_000001
drwxr-xr-x 2 root root 4096 Sep 14 05:42 task_20180914054253_0050_m_000002
drwxr-xr-x 2 root root 4096 Sep 14 05:42 task_20180914054253_0050_m_000003
....
Is there any setup to have all of these temporary files created as ulab, so we can use multiple users in the Spark driver and isolate their privileges?
You can enable the 'User Impersonate' option for the Spark interpreter, which will start the Spark job as the logged-in user.
Refer to this link for more info.
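A sketch of the usual setup (assuming the Zeppelin service account has the necessary sudo rights; exact steps vary by Zeppelin version): set the impersonation command in conf/zeppelin-env.sh, then in the interpreter settings switch the spark interpreter to instantiate "Per User" in isolated mode and tick "User Impersonate".
# conf/zeppelin-env.sh - run interpreter processes as the logged-in user
export ZEPPELIN_IMPERSONATE_CMD='sudo -H -u ${ZEPPELIN_IMPERSONATE_USER} bash -c '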

kafka remove content from topic index files

I am testing Kafka integration with Spark as a consumer. For debugging, I have set log.retention.minutes=2 in server.properties, which cleans up the .log file every 2 minutes. But the .index file is not cleaned up:
[cloudera#quickstart airline1-1]$ ls -l
total 0
-rw-r--r-- 1 root root 10485760 Apr 29 15:08 00000000000000000101.index
-rw-r--r-- 1 root root 0 Apr 29 15:08 00000000000000000101.log
-rw-r--r-- 1 root root 10485756 Apr 29 15:08 00000000000000000101.timeindex
Wondering why the .index files are not cleaned up. Any insight would be helpful to understand what's happening in the background.
Also, please share the recommended approach to cleaning up the log and index files during testing. Many search results suggest stopping the Kafka server -> removing the topic partition files -> restarting Kafka, but I'm not inclined towards this approach, as it could impact the offset state maintained in ZooKeeper.
Thanks very much!

git gc: no space left on device, even though 3GB available and tmp_pack only 16MB

> git gc --aggressive --prune=now
Counting objects: 68752, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (66685/66685), done.
fatal: sha1 file '.git/objects/pack/tmp_pack_cO6T53' write error: No space left on device
sigh, ok
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 19G 15G 3.0G 84% /
udev 485M 4.0K 485M 1% /dev
tmpfs 99M 296K 99M 1% /run
none 5.0M 0 5.0M 0% /run/lock
none 494M 0 494M 0% /run/shm
cgroup 494M 0 494M 0% /sys/fs/cgroup
doesn't look that bad
ls -lh .git/objects/pack/
total 580M
-r--r--r-- 1 foouser root 12K Oct 30 05:47 pack-0301f67f3b080de7eb0139b982fa732338c49064.idx
-r--r--r-- 1 foouser root 5.1M Oct 30 05:47 pack-0301f67f3b080de7eb0139b982fa732338c49064.pack
-r--r--r-- 1 foouser root 5.1K Oct 14 10:51 pack-27da727e362bcf2493ac01326a8c93f96517a488.idx
-r--r--r-- 1 foouser root 100K Oct 14 10:51 pack-27da727e362bcf2493ac01326a8c93f96517a488.pack
-r--r--r-- 1 foouser root 11K Oct 25 10:35 pack-4dce80846752e6d813fc9eb0a0385cf6ce106d9b.idx
-r--r--r-- 1 foouser root 2.6M Oct 25 10:35 pack-4dce80846752e6d813fc9eb0a0385cf6ce106d9b.pack
-r--r--r-- 1 foouser root 1.6M Apr 3 2014 pack-4dcef34b411c8159e3f5a975d6fcac009a411850.idx
-r--r--r-- 1 foouser root 290M Apr 3 2014 pack-4dcef34b411c8159e3f5a975d6fcac009a411850.pack
-r--r--r-- 1 foouser root 40K Oct 26 11:53 pack-87529eb2c9e58e0f3ca0be00e644ec5ba5250973.idx
-r--r--r-- 1 foouser root 6.1M Oct 26 11:53 pack-87529eb2c9e58e0f3ca0be00e644ec5ba5250973.pack
-r--r--r-- 1 foouser root 1.6M Apr 19 2014 pack-9d5ab71d6787ba2671c807790890d96f03926b84.idx
-r--r--r-- 1 foouser root 102M Apr 19 2014 pack-9d5ab71d6787ba2671c807790890d96f03926b84.pack
-r--r--r-- 1 foouser root 1.6M Oct 3 10:12 pack-af6562bdbbf444103930830a13c11908dbb599a8.idx
-r--r--r-- 1 foouser root 151M Oct 3 10:12 pack-af6562bdbbf444103930830a13c11908dbb599a8.pack
-r--r--r-- 1 foouser root 4.7K Oct 20 11:02 pack-c0830d7a0343dd484286b65d380b6ae5053ec685.idx
-r--r--r-- 1 foouser root 125K Oct 20 11:02 pack-c0830d7a0343dd484286b65d380b6ae5053ec685.pack
-r--r--r-- 1 foouser root 6.2K Oct 2 15:38 pack-c20278ebc16273d24880354af3e395929728481a.idx
-r--r--r-- 1 foouser root 4.2M Oct 2 15:38 pack-c20278ebc16273d24880354af3e395929728481a.pack
-r--r--r-- 1 root root 16M Feb 27 08:19 tmp_pack_cO6T53
So, git gc bails out on a tmp pack that's only 16MB big, while my disk appears to have 3GB free. What am I missing? How can I get git gc to work more reliably? I've tried without the aggressive option and with --prune instead of --prune=now as well - same story.
Update
Doing a df -h during the repack action shows that it is now using all of my disk (100% usage). A little while later the repack action fails and leaves another 14MB file in the .git/objects/pack/ folder. So, to recap: my packs use a total of 580MB, and git repack somehow manages to use up 3GB to repack them. I have ~800MB free RAM after it's done, btw. - maybe it's using so much working memory that it clogs up the swap? I guess my question comes down to: are there options to make git repack less resource hungry?
versions: git version 1.7.9.5 on Ubuntu 12.04
Update 2
I've updated git to 2.3. Didn't change anything unfortunately.
> git --version
git version 2.3.0
> git repack -Ad && git prune
Counting objects: 68752, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (36893/36893), done.
fatal: sha1 file '.git/objects/pack/tmp_pack_N9jyVJ' write error: No space left on device
Update 3
Ok, so I just noticed something curious: the .git directory actually uses much more disk space than the 580MB previously reported.
> du -h -d 1 ./.git
8.0K ./.git/info
40K ./.git/hooks
24M ./.git/modules
28K ./.git/refs
4.0K ./.git/branches
140K ./.git/logs
5.0G ./.git/objects
5.0G ./.git
Upon further inspection, .git/objects/pack actually uses 4.5GB. The difference lies in hidden temp files I hadn't noticed before:
ls -lha ./.git/objects/pack/
total 4.5G
drwxr-xr-x 2 foouser root 56K Feb 27 15:40 .
drwxr-xr-x 260 foouser root 4.0K Oct 26 14:24 ..
-r--r--r-- 1 foouser root 12K Oct 30 05:47 pack-0301f67f3b080de7eb0139b982fa732338c49064.idx
-r--r--r-- 1 foouser root 5.1M Oct 30 05:47 pack-0301f67f3b080de7eb0139b982fa732338c49064.pack
-r--r--r-- 1 foouser root 5.1K Oct 14 10:51 pack-27da727e362bcf2493ac01326a8c93f96517a488.idx
-r--r--r-- 1 foouser root 100K Oct 14 10:51 pack-27da727e362bcf2493ac01326a8c93f96517a488.pack
-r--r--r-- 1 foouser root 11K Oct 25 10:35 pack-4dce80846752e6d813fc9eb0a0385cf6ce106d9b.idx
-r--r--r-- 1 foouser root 2.6M Oct 25 10:35 pack-4dce80846752e6d813fc9eb0a0385cf6ce106d9b.pack
-r--r--r-- 1 foouser root 1.6M Apr 3 2014 pack-4dcef34b411c8159e3f5a975d6fcac009a411850.idx
-r--r--r-- 1 foouser root 290M Apr 3 2014 pack-4dcef34b411c8159e3f5a975d6fcac009a411850.pack
-r--r--r-- 1 foouser root 40K Oct 26 11:53 pack-87529eb2c9e58e0f3ca0be00e644ec5ba5250973.idx
-r--r--r-- 1 foouser root 6.1M Oct 26 11:53 pack-87529eb2c9e58e0f3ca0be00e644ec5ba5250973.pack
-r--r--r-- 1 foouser root 1.6M Apr 19 2014 pack-9d5ab71d6787ba2671c807790890d96f03926b84.idx
-r--r--r-- 1 foouser root 102M Apr 19 2014 pack-9d5ab71d6787ba2671c807790890d96f03926b84.pack
-r--r--r-- 1 foouser root 1.6M Oct 3 10:12 pack-af6562bdbbf444103930830a13c11908dbb599a8.idx
-r--r--r-- 1 foouser root 151M Oct 3 10:12 pack-af6562bdbbf444103930830a13c11908dbb599a8.pack
-r--r--r-- 1 foouser root 4.7K Oct 20 11:02 pack-c0830d7a0343dd484286b65d380b6ae5053ec685.idx
-r--r--r-- 1 foouser root 125K Oct 20 11:02 pack-c0830d7a0343dd484286b65d380b6ae5053ec685.pack
-r--r--r-- 1 foouser root 6.2K Oct 2 15:38 pack-c20278ebc16273d24880354af3e395929728481a.idx
-r--r--r-- 1 foouser root 4.2M Oct 2 15:38 pack-c20278ebc16273d24880354af3e395929728481a.pack
-r--r--r-- 1 root root 1.1K Feb 27 15:37 .tmp-7729-pack-00447364da9dfe647c89bb7797c48c79589a4e44.idx
-r--r--r-- 1 root root 14M Feb 27 15:29 .tmp-7729-pack-00447364da9dfe647c89bb7797c48c79589a4e44.pack
-r--r--r-- 1 root root 1.1K Feb 27 15:32 .tmp-7729-pack-020efaa9c7caf8b792081f89b27361093f00c2db.idx
-r--r--r-- 1 root root 41M Feb 27 15:30 .tmp-7729-pack-020efaa9c7caf8b792081f89b27361093f00c2db.pack
-r--r--r-- 1 root root 1.1K Feb 27 15:37 .tmp-7729-pack-051980133b8f0052b66dce418b4d3899de0d1342.idx
(continuing for a *long* while).
Now I'd like to know: Is it safe to just delete those?
So here is what I found out so far: I couldn't find any documentation about these hidden '.tmp-XXXX-pack' files in the .git/objects/pack folder. All other threads I can find are about non-hidden files with a tmp_ prefix in the same folder. The hidden ones are clearly created during the repack action, and it's possible that these get stuck as well. I can't confirm whether that's still possible in git 2.3.0 (which I've updated to since), but at least the disk space requirement doesn't seem to have changed in this newer version - it still can't complete gc/repack.
By deleting these .tmp- files I was able to recover my last 4GB, and git still seems to behave fine afterwards - your results may vary, though, so please make sure you have a backup before doing this.
Finally, even 4GB wasn't enough to repack with gc --aggressive. My .git folder is 1.1GB after the cleanup, and my entire repository is 1.7GB. So 2x the size of your repository is possibly not enough for git gc, even with the aggressive option (which should save space). I had to recover more space from elsewhere first.
Here is the command I used to clean up (again, have backups!):
git gc --aggressive --prune=now || rm -f .git/objects/*/tmp_* && rm -f .git/objects/*/.tmp-*
Similar scenario (about 2.3G available), except git gc itself would also fail with fatal: Unable to create '/home/ubuntu/my-app-here/.git/gc.pid.lock': No space left on device
What worked was to run git prune first, and then run the gc.
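That is, the sequence was simply:
git prune
git gc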
I had an instance of this problem. I was able to free up a considerable amount of disk space, but of course, that didn't solve the problem of what to do about the .tmp-* files. I ran git fsck and the Git repository wasn't damaged in that way.
I did the conventional pack-and-garbage-collect operations
git repack -Ad
git prune
but that didn't remove the .tmp-* files, though it would have ensured that all of the necessary objects were in the standard pack-* files if they needed to be copied from transient files left over from the crashed Git processes in the past.
Eventually I realized that I could safely move the .tmp-* files to a scratch directory, then run git fsck to see if what remained in the .git directory was complete. It turned out that it was, so I deleted the scratch directory and the files it contained. If git fsck had reported problems, I could have moved the .tmp-* files back into the .git directory and researched another solution.
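A sketch of that move-and-verify approach (the scratch path is arbitrary; keep a backup regardless):
# park the hidden tmp packs outside the repo
mkdir -p /tmp/pack-scratch
mv .git/objects/pack/.tmp-* /tmp/pack-scratch/
# verify the repository is still complete without them
git fsck --full
# if fsck is clean, the tmp files were not needed
rm -rf /tmp/pack-scratch
If git fsck reports missing objects instead, move the .tmp-* files back into .git/objects/pack and research another solution before deleting anything.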
The hidden ones are also clearly created during the repack action and it's possible that these get stuck as well
The way "git repack"(man) created temporary files when it received a signal was prone to deadlocking, which has been corrected with Git 2.39 (Q4 2022).
See commit 9b3fadf (23 Oct 2022), and commit 1934307, commit 9cf10d8, commit a4880b2, commit b639606, commit d3d9c51 (21 Oct 2022) by Jeff King (peff).
(Merged by Taylor Blau -- ttaylorr -- in commit c88895e, 30 Oct 2022)
repack: use tempfiles for signal cleanup
Reported-by: Jan Pokorný
Signed-off-by: Jeff King
When git-repack(man) exits due to a signal, it tries to clean up by calling its remove_temporary_files() function, which walks through the packs dir looking for ".tmp-$$-pack-*" files to delete (where "$$" is the pid of the current process).
The biggest problem here is that remove_temporary_files() is not safe to call in a signal handler.
It uses opendir(), which isn't on the POSIX async-signal-safe list.
The details will be platform-specific, but a likely issue is that it needs to allocate memory; if we receive a signal while inside malloc(), etc, we'll conflict on the allocator lock and deadlock with ourselves.
We can fix this by just cleaning up the files directly, without walking the directory.
We already know the complete list of .tmp-* files that were generated, because we recorded them via populate_pack_exts().
When we find files there, we can use register_tempfile() to record the filenames.
If we receive a signal, then the tempfile API will clean them up for us, and it's async-safe and pretty battle-tested.
And:
repack: drop remove_temporary_files()
Signed-off-by: Jeff King
After we've successfully finished the repack, we call remove_temporary_files(), which looks for and removes any files matching ".tmp-$$-pack-*", where $$ is the pid of the current process.
But this is pointless.
If we make it this far in the process, we've already renamed these tempfiles into place, and there is nothing left to delete.
Nor is there a point in trying to call it to clean up when we aren't successful.
It's not safe for using in a signal handler, and the previous commit already handed that job over to the tempfile API.
It might seem like it would be useful to clean up stray .tmp files left by other invocations of git-repack.
But it won't clean those files; it only matches ones with its pid, and leaves the rest.
Fortunately, those are cleaned up naturally by successive calls to git-repack; we'll consider .tmp-*.pack the same as normal packfiles, so "repack -ad", etc, will roll up their contents and eventually delete them.
