kafka remove content from topic index files - apache-spark

I am testing Kafka integration with Spark as a consumer. For debugging, I have set log.retention.minutes=2 in server.properties, which cleans up the .log file every 2 minutes. But the .index file is not cleaned up:
[cloudera@quickstart airline1-1]$ ls -l
total 0
-rw-r--r-- 1 root root 10485760 Apr 29 15:08 00000000000000000101.index
-rw-r--r-- 1 root root 0 Apr 29 15:08 00000000000000000101.log
-rw-r--r-- 1 root root 10485756 Apr 29 15:08 00000000000000000101.timeindex
I'm wondering why the .index files are not cleaned up. Any insight would be helpful for understanding what's happening in the background.
Also, please share a recommended approach for cleaning up the log and index files during testing. I found many links suggesting: stop the Kafka server -> remove the topic partition files -> restart Kafka. But I'm not inclined towards this approach, as it could impact the offset state maintained in ZooKeeper.
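For reference, the retention setting described above sits in server.properties roughly like this (a sketch; only log.retention.minutes=2 is taken from the question, the other values are illustrative defaults and may differ on your broker):
# server.properties (excerpt)
log.retention.minutes=2
log.retention.check.interval.ms=300000
log.segment.bytes=1073741824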
Thanks very much!

Related

Duplicated Cassandra SSTable files

After an unsuccessful nodetool repair operation I got two big SSTable files (the last two in the listing below) instead of one, each having the same size as the single file before. These files now cannot be merged back by the common tools (nodetool cleanup, nodetool compact, nodetool repair). The table is replicated to another Cassandra node (replication_factor: 2), and there are two big SSTable files there as well now.
-rw-r--r-- 1 cassandra cassandra 16M Mar 5 12:36 mc-116413-big-Data.db
-rw-r--r-- 1 cassandra cassandra 34M Mar 5 01:21 mc-116320-big-Index.db
-rw-r--r-- 1 cassandra cassandra 39M Mar 3 22:46 mc-116125-big-Index.db
-rw-r--r-- 1 cassandra cassandra 66M Mar 5 12:25 mc-116412-big-Data.db
-rw-r--r-- 1 cassandra cassandra 262M Mar 5 05:51 mc-116365-big-Data.db
-rw-r--r-- 1 cassandra cassandra 263M Mar 5 08:46 mc-116386-big-Data.db
-rw-r--r-- 1 cassandra cassandra 263M Mar 5 11:42 mc-116407-big-Data.db
-rw-r--r-- 1 cassandra cassandra 7.2G Mar 5 03:18 mc-116345-big-Data.db
-rw-r--r-- 1 cassandra cassandra 43G Mar 3 22:46 mc-116125-big-Data.db
-rw-r--r-- 1 cassandra cassandra 48G Mar 5 01:21 mc-116320-big-Data.db
I suppose that one of these files contains duplicated data. How can I compact the files back into a single file?
Maybe I'm not looking properly but I don't see any duplicate SSTable files in the file listing you posted.
If you're referring to these 2:
-rw-r--r-- 1 cassandra cassandra 43G Mar 3 22:46 mc-116125-big-Data.db
-rw-r--r-- 1 cassandra cassandra 48G Mar 5 01:21 mc-116320-big-Data.db
They're not duplicates because they have 2 different generation IDs -- 116125 and 116320. This means they also have different ancestors.
If you're referring to these:
-rw-r--r-- 1 cassandra cassandra 39M Mar 3 22:46 mc-116125-big-Index.db
-rw-r--r-- 1 cassandra cassandra 43G Mar 3 22:46 mc-116125-big-Data.db
-rw-r--r-- 1 cassandra cassandra 34M Mar 5 01:21 mc-116320-big-Index.db
-rw-r--r-- 1 cassandra cassandra 48G Mar 5 01:21 mc-116320-big-Data.db
Again, they're not duplicates of each other. The *-Data.db files contain the actual data. The *-Index.db files are component files that contain the partition index, i.e. the index of the partitions within the data files, which is used for fast retrieval.
If you're interested, I've explained it in a bit more detail in this post -- https://community.datastax.com/questions/5219/. Cheers!
[UPDATE] To respond to this follow-up question:
Could you explain why these two files don't get compacted into a single file, as usually happens?
Assuming the table is configured with SizeTieredCompactionStrategy, it will require similar-sized sstables as candidates before they get compacted together.
The default minimum number of sstable candidates is min_threshold: 4, so you need 4 similarly-sized sstables for a compaction to be triggered.
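For illustration only (the keyspace and table names are placeholders, and changing the threshold is not necessarily recommended), the setting lives in the table's compaction options and can be adjusted from cqlsh:
cqlsh -e "ALTER TABLE my_keyspace.my_table WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'min_threshold': 4, 'max_threshold': 32};"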

/var/log/daemon.log is taking up too much space, how do I reduce it?

Below are the files:
-rw-r----- 1 root adm 4.4G Mar 6 09:04 daemon.log
-rw-r----- 1 root adm 6.2G Mar 1 06:26 daemon.log.1
-rw-r----- 1 root adm 50M Feb 23 06:26 daemon.log.2.gz
-rw-r----- 1 root adm 41M Feb 17 06:25 daemon.log.3.gz
-rw-r----- 1 root adm 72K Feb 9 06:25 daemon.log.4.gz
How can I remove it? Will anything be affected if I delete it directly?
Thanks in advance.
The best way to manage the logs would be to use Logrotate.
This is Serhii's comment on your other similar question:
Have a look at this Logrotate tutorial: linode.com/docs/uptime/logs/use-logrotate-to-manage-log-files. You can use size to force log rotation when the log grows bigger than the specified [value], and you can use rotate to control how many times a log is rotated before old logs are removed (if you set it to 0, logs will be removed immediately after they are rotated).
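As a minimal sketch of how those directives fit together (the path is from the question; the size and rotate values are only examples), an /etc/logrotate.d/daemon entry might look like:
/var/log/daemon.log {
    size 500M
    rotate 4
    compress
    delaycompress
    missingok
    notifempty
}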
You can delete the logs, but it depends on the software you're running - if any of it needs some part of the logs or utilises them in any way, deleting them may stop it working as intended.
You can also have a look at the logs and analyse them to see which software writes the most data, and try to reconfigure it so that the amount of log data generated drops significantly. That, combined with logrotate, should yield satisfactory results.
And if that's not enough you can store your logs in a bucket and mount it as a disk in your VM's filesystem. That way any software installed on your VM will be able to write to it.
But this will incur some charges for using the bucket storage so keep that in mind.
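As a rough sketch of the bucket approach - assuming a Google Cloud Storage bucket and the gcsfuse tool, neither of which the answer above names explicitly - the mount could look like:
gcsfuse my-log-bucket /mnt/logs
Any path under /mnt/logs would then be backed by the bucket, so logrotate could move rotated logs there (for example with olddir /mnt/logs) instead of keeping them on the VM's disk.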

Log4j Rolling File appender uses wrong logfile

This is my first question - please keep calm ;-)
Our problem is that log4j uses the wrong logfile. Our configuration is a simple DailyRollingFileAppender:
log4j.appender.dx4wsa=org.apache.log4j.DailyRollingFileAppender
log4j.appender.dx4wsa.File=${env.WFL_DIR}/log/dx4wsa-agents.log
log4j.appender.dx4wsa.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.dx4wsa.layout=org.apache.log4j.PatternLayout
log4j.appender.dx4wsa.layout.ConversionPattern=%d{dd.MM.yyyy HH:mm:ss}: %5p %-30c{1} %-50x %m%n
What we see is that sometimes the logger uses old logfiles after rolling over. For example:
dx4wsa-agents.log.2016-10-12-18:12.10.2016 20:39:03: INFO VorgangLoeschen 21974690 Start executing agent on process instance = 21974690 and work item = 14f27076-f31f-48a7-849d-669189918730
You can see that this entry is from 20:39:03, yet the logfile has already rotated - it is the 18:00 logfile. An "ls -ltr" shows the last access:
-rw-r--r-- 1 tpdx4wf2 gpdoxis4 61944 Oct 12 17:59 dx4wsa-agents.log.2016-10-12-17
-rw-r--r-- 1 tpdx4wf2 gpdoxis4 51668039 Oct 12 17:59 dx4wsa-agents.log.2016-10-12-16
-rw-r--r-- 1 tpdx4wf2 gpdoxis4 40437528 Oct 12 19:59 dx4wsa-agents.log.2016-10-12-19
-rw-r--r-- 1 tpdx4wf2 gpdoxis4 1463292 Oct 12 20:54 dx4wsa-agents.log
-rw-r--r-- 1 tpdx4wf2 gpdoxis4 67702368 Oct 12 20:54 dx4wsa-agents.log.2016-10-12-18
Today we had a release: we stopped the server, killed all Java threads, deployed new jar files and restarted the server at 19:00. And log4j still logged into the 18:00 file?

git gc: no space left on device, even though 3GB available and tmp_pack only 16MB

> git gc --aggressive --prune=now
Counting objects: 68752, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (66685/66685), done.
fatal: sha1 file '.git/objects/pack/tmp_pack_cO6T53' write error: No space left on device
sigh, ok
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 19G 15G 3.0G 84% /
udev 485M 4.0K 485M 1% /dev
tmpfs 99M 296K 99M 1% /run
none 5.0M 0 5.0M 0% /run/lock
none 494M 0 494M 0% /run/shm
cgroup 494M 0 494M 0% /sys/fs/cgroup
doesn't look that bad
ls -lh .git/objects/pack/
total 580M
-r--r--r-- 1 foouser root 12K Oct 30 05:47 pack-0301f67f3b080de7eb0139b982fa732338c49064.idx
-r--r--r-- 1 foouser root 5.1M Oct 30 05:47 pack-0301f67f3b080de7eb0139b982fa732338c49064.pack
-r--r--r-- 1 foouser root 5.1K Oct 14 10:51 pack-27da727e362bcf2493ac01326a8c93f96517a488.idx
-r--r--r-- 1 foouser root 100K Oct 14 10:51 pack-27da727e362bcf2493ac01326a8c93f96517a488.pack
-r--r--r-- 1 foouser root 11K Oct 25 10:35 pack-4dce80846752e6d813fc9eb0a0385cf6ce106d9b.idx
-r--r--r-- 1 foouser root 2.6M Oct 25 10:35 pack-4dce80846752e6d813fc9eb0a0385cf6ce106d9b.pack
-r--r--r-- 1 foouser root 1.6M Apr 3 2014 pack-4dcef34b411c8159e3f5a975d6fcac009a411850.idx
-r--r--r-- 1 foouser root 290M Apr 3 2014 pack-4dcef34b411c8159e3f5a975d6fcac009a411850.pack
-r--r--r-- 1 foouser root 40K Oct 26 11:53 pack-87529eb2c9e58e0f3ca0be00e644ec5ba5250973.idx
-r--r--r-- 1 foouser root 6.1M Oct 26 11:53 pack-87529eb2c9e58e0f3ca0be00e644ec5ba5250973.pack
-r--r--r-- 1 foouser root 1.6M Apr 19 2014 pack-9d5ab71d6787ba2671c807790890d96f03926b84.idx
-r--r--r-- 1 foouser root 102M Apr 19 2014 pack-9d5ab71d6787ba2671c807790890d96f03926b84.pack
-r--r--r-- 1 foouser root 1.6M Oct 3 10:12 pack-af6562bdbbf444103930830a13c11908dbb599a8.idx
-r--r--r-- 1 foouser root 151M Oct 3 10:12 pack-af6562bdbbf444103930830a13c11908dbb599a8.pack
-r--r--r-- 1 foouser root 4.7K Oct 20 11:02 pack-c0830d7a0343dd484286b65d380b6ae5053ec685.idx
-r--r--r-- 1 foouser root 125K Oct 20 11:02 pack-c0830d7a0343dd484286b65d380b6ae5053ec685.pack
-r--r--r-- 1 foouser root 6.2K Oct 2 15:38 pack-c20278ebc16273d24880354af3e395929728481a.idx
-r--r--r-- 1 foouser root 4.2M Oct 2 15:38 pack-c20278ebc16273d24880354af3e395929728481a.pack
-r--r--r-- 1 root root 16M Feb 27 08:19 tmp_pack_cO6T53
So, git gc bails out on a tmp pack that's only 16MB big, while my disk appears to have 3GB free. What am I missing? How can I get git gc to work more reliably? I've tried without the aggressive option and with --prune instead of --prune=now as well - same story.
Update
Doing a df -h during the repack action shows that it is now using all of my disk (100% usage). A little while later the repack action fails and leaves another 14MB file in the .git/objects/pack/ folder. So, to recap: my packs use a total of 580MB, and git repack somehow manages to use up 3GB to repack that. I have ~800MB free in RAM after it's done, by the way - maybe it's using so much working memory that it clogs up the swap? I guess my question comes down to: are there options to make git repack less resource hungry?
versions: git version 1.7.9.5 on Ubuntu 12.04
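(Side note on the resource question above, as a sketch of the kind of knobs that exist - the values below are arbitrary and not verified against git 1.7.9 - repack's delta window memory and pack size can be capped either on the command line or via config:)
git repack -a -d --window-memory=100m --max-pack-size=100m
# or persistently:
git config pack.windowMemory 100m
git config pack.packSizeLimit 100m
git config pack.threads 1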
Update 2
I've updated git to 2.3. Didn't change anything unfortunately.
> git --version
git version 2.3.0
> git repack -Ad && git prune
Counting objects: 68752, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (36893/36893), done.
fatal: sha1 file '.git/objects/pack/tmp_pack_N9jyVJ' write error: No space left on device
Update 3
Ok, so I just noticed something curious: the .git directory actually uses much more disk space than the 580MB previously reported.
> du -h -d 1 ./.git
8.0K ./.git/info
40K ./.git/hooks
24M ./.git/modules
28K ./.git/refs
4.0K ./.git/branches
140K ./.git/logs
5.0G ./.git/objects
5.0G ./.git
Upon further inspection, .git/objects/pack actually uses 4.5GB. The difference lies in hidden temp files I didn't notice before:
ls -lha ./.git/objects/pack/
total 4.5G
drwxr-xr-x 2 foouser root 56K Feb 27 15:40 .
drwxr-xr-x 260 foouser root 4.0K Oct 26 14:24 ..
-r--r--r-- 1 foouser root 12K Oct 30 05:47 pack-0301f67f3b080de7eb0139b982fa732338c49064.idx
-r--r--r-- 1 foouser root 5.1M Oct 30 05:47 pack-0301f67f3b080de7eb0139b982fa732338c49064.pack
-r--r--r-- 1 foouser root 5.1K Oct 14 10:51 pack-27da727e362bcf2493ac01326a8c93f96517a488.idx
-r--r--r-- 1 foouser root 100K Oct 14 10:51 pack-27da727e362bcf2493ac01326a8c93f96517a488.pack
-r--r--r-- 1 foouser root 11K Oct 25 10:35 pack-4dce80846752e6d813fc9eb0a0385cf6ce106d9b.idx
-r--r--r-- 1 foouser root 2.6M Oct 25 10:35 pack-4dce80846752e6d813fc9eb0a0385cf6ce106d9b.pack
-r--r--r-- 1 foouser root 1.6M Apr 3 2014 pack-4dcef34b411c8159e3f5a975d6fcac009a411850.idx
-r--r--r-- 1 foouser root 290M Apr 3 2014 pack-4dcef34b411c8159e3f5a975d6fcac009a411850.pack
-r--r--r-- 1 foouser root 40K Oct 26 11:53 pack-87529eb2c9e58e0f3ca0be00e644ec5ba5250973.idx
-r--r--r-- 1 foouser root 6.1M Oct 26 11:53 pack-87529eb2c9e58e0f3ca0be00e644ec5ba5250973.pack
-r--r--r-- 1 foouser root 1.6M Apr 19 2014 pack-9d5ab71d6787ba2671c807790890d96f03926b84.idx
-r--r--r-- 1 foouser root 102M Apr 19 2014 pack-9d5ab71d6787ba2671c807790890d96f03926b84.pack
-r--r--r-- 1 foouser root 1.6M Oct 3 10:12 pack-af6562bdbbf444103930830a13c11908dbb599a8.idx
-r--r--r-- 1 foouser root 151M Oct 3 10:12 pack-af6562bdbbf444103930830a13c11908dbb599a8.pack
-r--r--r-- 1 foouser root 4.7K Oct 20 11:02 pack-c0830d7a0343dd484286b65d380b6ae5053ec685.idx
-r--r--r-- 1 foouser root 125K Oct 20 11:02 pack-c0830d7a0343dd484286b65d380b6ae5053ec685.pack
-r--r--r-- 1 foouser root 6.2K Oct 2 15:38 pack-c20278ebc16273d24880354af3e395929728481a.idx
-r--r--r-- 1 foouser root 4.2M Oct 2 15:38 pack-c20278ebc16273d24880354af3e395929728481a.pack
-r--r--r-- 1 root root 1.1K Feb 27 15:37 .tmp-7729-pack-00447364da9dfe647c89bb7797c48c79589a4e44.idx
-r--r--r-- 1 root root 14M Feb 27 15:29 .tmp-7729-pack-00447364da9dfe647c89bb7797c48c79589a4e44.pack
-r--r--r-- 1 root root 1.1K Feb 27 15:32 .tmp-7729-pack-020efaa9c7caf8b792081f89b27361093f00c2db.idx
-r--r--r-- 1 root root 41M Feb 27 15:30 .tmp-7729-pack-020efaa9c7caf8b792081f89b27361093f00c2db.pack
-r--r--r-- 1 root root 1.1K Feb 27 15:37 .tmp-7729-pack-051980133b8f0052b66dce418b4d3899de0d1342.idx
(continuing for a *long* while).
Now I'd like to know: Is it safe to just delete those?
So here is what I found out so far: I couldn't find any documentation about these hidden '.tmp-XXXX-pack' files in the .git/objects/pack folder. All the other threads I can find are about non-hidden files with a tmp_ prefix in the same folder. The hidden ones are also clearly created during the repack action and it's possible that these get stuck as well. I can't confirm whether that's still possible in git 2.3.0 (which I've since updated to), but at least the disk space requirement doesn't seem to have changed in this newer version - it still can't complete gc/repack. By deleting these .tmp files I was able to recover my last 4GB, and git still seems to behave fine afterwards - your results may vary, though, so please make sure you have a backup before doing this. Finally, even 4GB wasn't enough to repack with gc --aggressive. My .git folder is 1.1GB after the cleanup and my entire repository is 1.7GB, so 2x the size of your repository is possibly not enough for git gc, even with the aggressive option (which should save space). So I had to recover more space from elsewhere first.
Here is the command I used to clean up (again, have backups!):
git gc --aggressive --prune=now || rm -f .git/objects/*/tmp_* && rm -f .git/objects/*/.tmp-*
Similar scenario here (about 2.3G available), except git gc itself would also fail with fatal: Unable to create '/home/ubuntu/my-app-here/.git/gc.pid.lock': No space left on device.
What worked was to run git prune first, and then run the gc.
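In other words, roughly this sequence:
git prune
git gc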
I had an instance of this problem. I was able to free up a considerable amount of disk space, but of course, that didn't solve the problem of what to do about the .tmp-* files. I ran git fsck and the Git repository wasn't damaged in that way.
I did the conventional pack-and-garbage-collect operations
git repack -Ad
git prune
but that didn't remove the .tmp-* files, though it would have ensured that all of the necessary objects were in the standard pack-* files in case any needed to be copied from the transient files left over from crashed Git processes in the past.
Eventually I realized that I could safely move the .tmp-* files to a scratch directory, then run git fsck to see if what remained in the .git directory was complete. It turned out that it was, so I deleted the scratch directory and the files it contained. If git fsck had reported problems, I could have moved the .tmp-* files back into the .git directory and researched another solution.
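A rough sketch of that procedure (the scratch path is arbitrary; keep a backup first):
mkdir /tmp/pack-scratch
mv .git/objects/pack/.tmp-* /tmp/pack-scratch/
git fsck
# if fsck reports no problems, remove /tmp/pack-scratch;
# otherwise move the files back and investigate further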
The hidden ones are also clearly created during the repack action and it's possible that these get stuck as well
The way "git repack"(man) created temporary files when it received a signal was prone to deadlocking, which has been corrected with Git 2.39 (Q4 2022).
See commit 9b3fadf (23 Oct 2022), and commit 1934307, commit 9cf10d8, commit a4880b2, commit b639606, commit d3d9c51 (21 Oct 2022) by Jeff King (peff).
(Merged by Taylor Blau -- ttaylorr -- in commit c88895e, 30 Oct 2022)
repack: use tempfiles for signal cleanup
Reported-by: Jan Pokorný
Signed-off-by: Jeff King
When git-repack(man) exits due to a signal, it tries to clean up by calling its remove_temporary_files() function, which walks through the packs dir looking for ".tmp-$$-pack-*" files to delete (where "$$" is the pid of the current process).
The biggest problem here is that remove_temporary_files() is not safe to call in a signal handler.
It uses opendir(), which isn't on the POSIX async-signal-safe list.
The details will be platform-specific, but a likely issue is that it needs to allocate memory; if we receive a signal while inside malloc(), etc, we'll conflict on the allocator lock and deadlock with ourselves.
We can fix this by just cleaning up the files directly, without walking the directory.
We already know the complete list of .tmp-* files that were generated, because we recorded them via populate_pack_exts().
When we find files there, we can use register_tempfile() to record the filenames.
If we receive a signal, then the tempfile API will clean them up for us, and it's async-safe and pretty battle-tested.
And:
repack: drop remove_temporary_files()
Signed-off-by: Jeff King
After we've successfully finished the repack, we call remove_temporary_files(), which looks for and removes any files matching ".tmp-$$-pack-*", where $$ is the pid of the current process.
But this is pointless.
If we make it this far in the process, we've already renamed these tempfiles into place, and there is nothing left to delete.
Nor is there a point in trying to call it to clean up when we aren't successful.
It's not safe for using in a signal handler, and the previous commit already handed that job over to the tempfile API.
It might seem like it would be useful to clean up stray .tmp files left by other invocations of git-repack.
But it won't clean those files; it only matches ones with its pid, and leaves the rest.
Fortunately, those are cleaned up naturally by successive calls to git-repack; we'll consider .tmp-*.pack the same as normal packfiles, so "repack -ad", etc, will roll up their contents and eventually delete them.

Not able to copy data in hdfs with hdfs dfs commands

I'm trying to copy data to HDFS, but none of the commands are working for me.
I followed an online tutorial to install a single-node cluster. It seems to be installed correctly, because the jps command shows me all 6 processes. But when I try to copy a file over to HDFS it shows me an error.
The command I'm running is:
hduser@naren-Vostro-3560:~$ hdfs dfs -copyFromLocal /home/nare/Desktop/data/first.txt /app/hadoop/tmp
Error
14/12/30 02:18:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
copyFromLocal: '/app/hadoop/tmp': No such file or directory
I have given full permissions to the input file (first.txt):
naren@naren-Vostro-3560:~$ ls -al /home/naren/Desktop/data
total 3612
drwxrwxr-x 2 naren naren 4096 Dec 30 01:40 .
drwxr-xr-x 3 naren naren 4096 Dec 30 01:40 ..
-rwxrwxrwx 1 naren naren 674570 Dec 30 01:37 first.txt
-rwxrwxrwx 1 naren naren 1423803 Dec 30 01:39 second.txt
-rwxrwxrwx 1 naren naren 1573151 Dec 30 01:40 third.txt
The permissions on the hdfs folder also look right to me:
hduser@naren-Vostro-3560:~$ ls -l /app/hadoop
total 4
drwxr-x--- 5 hduser hadoop 4096 Dec 26 01:22 tmp
I'm new to Hadoop and Linux and got stuck here.
I also tried creating a new directory with
hduser@naren-Vostro-3560:~$ hadoop fs -mkdir -p /user/hduser/sample
and it didn't create any directory for me.
Please let me know where I'm going wrong.
Thanks in advance!
Hadoop Version: Hadoop 2.5.2
OS: Ubuntu 14.04
You need to make sure /app/hadoop is created in HDFS, not on the local fs. In your check for the directory you use ls -l, which checks the local filesystem, which is separate from the HDFS namespace. Try hadoop fs -ls /app/hadoop. If it's not there, then create it with hadoop fs -mkdir. – snkherv
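A minimal sketch of the suggested sequence (paths taken from the question; this assumes the hduser account is allowed to create directories under /app in HDFS):
hdfs dfs -mkdir -p /app/hadoop/tmp
hdfs dfs -ls /app/hadoop
hdfs dfs -copyFromLocal /home/naren/Desktop/data/first.txt /app/hadoop/tmp/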
