I am now testing my application against corrupted files, but it is hard to find suitable test files.
So I'm wondering whether there are existing tools that can write random/garbage bytes into a file of some format.
Basically, I need this tool to:
Write random garbage bytes into the file.
Work without knowing the format of the file; just writing random bytes is OK for me.
Ideally, write at random positions in the target file.
Batch processing is also a bonus.
Thanks.
The /dev/urandom pseudo-device, along with dd, can do this for you:
dd if=/dev/urandom of=newfile bs=1M count=10
This will create a file named newfile of size 10M.
The /dev/random device will often block if there is not sufficient randomness built up; urandom will not block. If you're using the randomness for crypto-grade work, you may want to steer clear of urandom. For anything else, it should be sufficient and most likely faster.
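If you want to damage an existing file in place rather than create a new one, dd can also do that by seeking to an offset and overwriting a few bytes without truncating the rest. Here is a minimal sketch, assuming GNU coreutils (stat -c and shuf) and a placeholder file name target.bin; adjust the 16-byte corruption size to taste:
# pick a random offset that leaves room for 16 bytes of garbage
size=$(stat -c%s target.bin)
offset=$(shuf -i 0-$((size - 16)) -n 1)
# overwrite 16 bytes in place; conv=notrunc keeps the rest of the file intact
dd if=/dev/urandom of=target.bin bs=1 seek="$offset" count=16 conv=notrunc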
If you want to corrupt just bits of your file (not the whole file), you can simply use C-style random functions. Use rand() to pick an offset and a length n, then call it n more times to grab random bytes to overwrite that part of your file with.
The following Perl script shows how this can be done (without having to worry about compiling C code):
use strict;
use warnings;
sub corrupt ($$$$) {
# Get parameters, names should be self-explanatory.
my $filespec = shift;
my $mincount = shift;
my $maxcount = shift;
my $charset = shift;
# Work out position and size of corruption.
my @fstat = stat ($filespec);
my $size = $fstat[7];
my $count = $mincount + int (rand ($maxcount + 1 - $mincount));
my $pos = 0;
if ($count >= $size) {
$count = $size;
} else {
$pos = int (rand ($size - $count));
}
# Output for debugging purposes.
my $last = $pos + $count - 1;
print "'$filespec', $size bytes, corrupting $pos through $last\n";
# Open file, seek to position, corrupt and close.
open (my $fh, "+<$filespec") || die "Can't open $filespec: $!";
seek ($fh, $pos, 0);
while ($count-- > 0) {
my $newval = substr ($charset, int (rand (length ($charset))), 1);
print $fh $newval;
}
close ($fh);
}
# Test harness.
system ("echo =========="); #DEBUG
system ("cp base-testfile testfile"); #DEBUG
system ("cat testfile"); #DEBUG
system ("echo =========="); #DEBUG
corrupt ("testfile", 8, 16, "ABCDEFGHIJKLMNOPQRSTUVWXYZ ");
system ("echo =========="); #DEBUG
system ("cat testfile"); #DEBUG
system ("echo =========="); #DEBUG
It consists of the corrupt function that you call with a file name, minimum and maximum corruption size and a character set to draw the corruption from. The bit at the bottom is just unit testing code. Below is some sample output where you can see that a section of the file has been corrupted:
==========
this is a file with nothing in it except for lowercase
letters (and spaces and punctuation and newlines).
that will make it easy to detect corruptions from the
test program since the character range there is from
uppercase a through z.
i have to make it big enough so that the random stuff
will work nicely, which is why i am waffling on a bit.
==========
'testfile', 344 bytes, corrupting 122 through 135
==========
this is a file with nothing in it except for lowercase
letters (and spaces and punctuation and newlines).
that will make iFHCGZF VJ GZDYct corruptions from the
test program since the character range there is from
uppercase a through z.
i have to make it big enough so that the random stuff
will work nicely, which is why i am waffling on a bit.
==========
It's tested at a basic level, but you may find there are edge cases which need to be taken care of. Do with it what you will.
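For the batch-processing bonus, the corrupt function can be driven from a small shell loop. This is only a sketch: it assumes you save the sub in a file called corrupt.pl together with a one-line driver that passes the first command-line argument through, and that your test files live in a testdata directory:
# hypothetical driver line inside corrupt.pl:
#   corrupt ($ARGV[0], 8, 16, "ABCDEFGHIJKLMNOPQRSTUVWXYZ ");
for f in testdata/*; do
    perl corrupt.pl "$f"
done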
Just for completeness, here's another way to do it:
shred -s 10 - > my-file
Writes 10 random bytes to stdout and redirects them to a file. shred is usually used for destroying (securely overwriting) data, but it can be used to create new random files too.
So if you already have a file that you want to fill with random data, use this:
shred my-existing-file
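Note that by default shred overwrites the whole file several times. If you only want a single pass, or only want to clobber the beginning of the file, GNU shred's -n and -s options can limit that; the 1024-byte size below is just an example:
shred -n 1 -s 1024 my-existing-file   # one pass, only the first 1024 bytes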
You could read from /dev/random:
# generate a 50MB file named `random.stuff` filled with random stuff ...
dd if=/dev/random of=random.stuff bs=1000000 count=50
You can also specify the size in a human-readable way:
# generate just 2MB ...
dd if=/dev/random of=random.stuff bs=1M count=2
You can also use cat and head. Both are usually installed.
# write 1024 random bytes to my-file-to-override
cat /dev/urandom | head -c 1024 > my-file-to-override
Related
I have the source file below (~10 GB) and I need to split it into several small files (<100 MB each), where each file has the same header record. The tricky part is that I can't just split the file at an arbitrary line with a plain split command: records belonging to an agent must not be split across multiple files. For simplicity I am only showing a few agents here (there are thousands of them in the real file).
Inout.csv
Src,AgentNum,PhoneNum
DWH,Agent_1234,phone1
NULL,NULL,phone2
NULL,NULL,phone3
DWH,Agent_5678,phone1
NULL,NULL,phone2
NULL,NULL,phone3
DWH,Agent_9999,phone1
NULL,NULL,phone2
NULL,NULL,phone3
Output1.csv
Src,AgentNum,PhoneNum
DWH,Agent_1234,phone1
NULL,NULL,phone2
NULL,NULL,phone3
Output2.csv
Src,AgentNum,PhoneNum
DWH,Agent_5678,phone1
NULL,NULL,phone2
NULL,NULL,phone3
DWH,Agent_9999,phone1
NULL,NULL,phone2
NULL,NULL,phone3
#!/bin/bash
#Calculate filesize in bytes
FileSizeBytes=`du -b $FileName | cut -f1`
#Check for the file size
if [[ $FileSizeBytes -gt 100000000 ]]
then
echo "Filesize is greater than 100MB"
NoOfLines=`wc -l < $FileName`
AvgLineSize=$((FileSizeBytes / NoOfLines))
LineCountInEachFile=$((100000000 / AvgLineSize))
#Section for splitting the files
else
echo "Filesize is already less than 100MB. No splitting needed"
exit 0
fi
I am new to UNIX but am trying this bash script on my own and am kind of stuck at splitting the files. I am not expecting somebody to give me a full script; I am looking for any simple approach/recommendation, possibly using other simple alternatives like sed or such. Many thanks in advance!
Here is a rough idea of how to do it in Perl. Please modify the regular expression if it doesn't exactly match your actual data. I have only tested it on your dummy data.
#!/usr/bin/perl -w
my $l=<>; chomp($l); my $header=$l;
my $agent=""; my $fh;
while ($l=<>) {
chomp($l);
if ($l=~m/^\s*[^,]+,(Agent_\d+),[^,]+/) {
$agent="$1";
open($fh,">","${agent}.txt") or die "$!";
print $fh $header."\n";
}
print $fh $l."\n";
}
Use it as follows:
./perlscript.pl < inputfile.txt
If you don't have perl (check for perl at /usr/bin/perl or some other such location), I can try to write an awk script instead. Let me know if you run into problems running the script above.
In response to your updated request (you only want to split the file, with each output file under 100 MB, with no agent's records split across two files, and with the header printed in each output file), here is a rough idea of how you can accomplish that. It doesn't do an exact cut, because for that you would need to buffer the data before writing. If you set $maxfilesize to a value like 95*1024*1024 or 99*1024*1024, each output file should stay under 100 MB (for example, if the largest single agent's records are less than 5 MB, set $maxfilesize to 95*1024*1024).
#!/usr/bin/perl -w
# Max file size, approximately in bytes
#
# For 99MB make it as 99*1024*1024
#
my $maxfilesize=95*1024*1024;
#my $maxfilesize=400;
my $l=<>; chomp($l); my $header=$l;
my $fh;
my $filecounter=0;
my $filename="";
my $filesize=1000000000000; # big dummy size for first iteration
while ($l=<>) {
chomp($l);
if ($l=~m/^\s*[^,]+,Agent_\d+,[^,]+/) {
if ($filesize>$maxfilesize) {
print "FileSize: $filesize\n";
$filecounter++; $filename=sprintf("outfile_%05d",$filecounter);
print "Opening New File: $filename\n";
open($fh,">","${filename}.txt") or die "$!";
print $fh $header."\n";
$filesize=length($header);
}
}
print $fh $l."\n";
$filesize+=length($l);
print "FileSize: $filesize\n";
}
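You would run it the same way as the first script; the file name splitscript.pl here is just a placeholder for whatever you save it as:
./splitscript.pl < Inout.csv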
If you want more precise cuts than this, I can update it to buffer the data before printing.
Step 1. Save the header.
Step 2. Create a variable content to temporarily hold the records the program reads.
Step 3. Start reading the following lines; in Python:
# write_output() and size_limit are placeholders for your own output routine and size threshold
if line.startswith("DWH"):               # a new agent block starts here
    if len(content) >= size_limit:       # the current chunk is big enough, so flush it
        write_output(header + content)   # emit the header plus the accumulated records
        content = ""                     # and start a new, empty chunk
content += line                          # every line joins the current chunk
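Since you asked about simple alternatives like sed or awk, the same accumulate-and-flush idea can be sketched in awk as well. Treat this as a rough sketch only: the 100000000 byte limit, the Output<N>.csv naming and the assumption that every agent record starts with DWH in the first field are all things to adapt to your real data:
awk -v maxbytes=100000000 '
NR == 1 { header = $0; next }            # remember the header line
/^DWH/ && size > maxbytes {              # agent boundary and the current chunk is full
    close(out); size = 0; filenum++      # so start a new chunk
}
size == 0 {                              # first line of a new chunk: open it and write the header
    out = sprintf("Output%d.csv", filenum + 1)
    print header > out
    size = length(header) + 1
}
{ print > out; size += length($0) + 1 }  # append the record and track the approximate byte count
' Inout.csv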
Problem
I have created a simple Perl script to read log files and process the data asynchronously.
The reading sub also checks for changes in the inode number, so a new filehandle is created when the logs rotate.
The problem I am facing is that when copytruncate is used in the logrotate configuration, the inode does not change when the file is rotated.
This shouldn't be an issue, as the script should just continue reading the file, but for some reason that I cannot immediately see, as soon as the logs rotate no new lines are ever read.
Question
How can I modify the script below (or completely scrap it and start again) to continuously tail a file that is logrotated using copytruncate, in Perl?
Code
use strict;
use warnings;
use threads;
use Thread::Queue;
use threads::shared;
my $logq = Thread::Queue->new();
my %Servers :shared;
my %servername :shared;
#########
#This sub just reads the data off the queue and processes it; I have
#reduced it to a simple print statement for simplicity.
#The sleep is to prevent it from eating CPU.
########
sub process_data
{
while(sleep(5)){
if ($logq->pending())
{
while($logq->pending() > 0){
my $data = $logq->dequeue();
print "Data:$data\n";
}
}
}
}
sub read_file
{
my $myFile=$_[0];
#Get the argument and assign to var.
open(my $logfile,'<',$myFile) || die "error";
#open file
my $Inode=(stat($logfile))[1];
#Get the current inode
seek $logfile, 0, 2;
#Go to the end of the file
for (;;) {
while (<$logfile>) {
chomp( $_ );
$logq->enqueue( $_ );
#Add lines to queue for processing
}
sleep 5;
if($Inode != (stat($myFile))[1]){
close($logfile);
while (! -e $myFile){
sleep 2;
}
open($logfile,'<',$myFile) || die "error";
$Inode=(stat($logfile))[1];
}
#Above checks if the inode has changed and the file exists still
seek $logfile, 0, 1;
#Remove eof
}
}
my $thr1 = threads->create(\&read_file,"test");
my $thr4 = threads->create(\&process_data);
$thr1->join();
$thr4->join();
#Creating the threads, can add more log files for processing or multiple processing sections.
Possibly relevant info
Log config for logrotate contains
compress
compresscmd /usr/bin/bzip2
uncompresscmd /usr/bin/bunzip2
daily
rotate 5
notifempty
missingok
copytruncate
for this file.
Specs
GNU bash, version 3.2.57(1)-release (s390x-ibm-linux-gnu)
perl, v5.10.0
(if logrotate has a version number and someone knows how to check it, I will add that as well)
If any more info is needed, just ask.
So the reason this was failing is pretty obvious when you look at what copytruncate does: it copies the original file and then truncates the current one.
Whilst this ensures that the inode is kept, it creates another problem.
Since I tail the file by simply staying at the end and clearing the EOF flag, when the file is truncated the file pointer stays at the offset of the last line read before truncation, which in turn means that no new lines are read until the file grows past that offset again.
The obvious solution, then, is to simply check the size of the file and reset the pointer if it is ever pointing past the end of the file.
I found it easier to just check that the file size never gets smaller, using the two changes below.
my $fileSize=(stat($logfile))[7];
#Added after the inode is assigned
and changing
if($Inode != (stat($myFile))[1]){
to
if($Inode != (stat($myFile))[1] || (stat($myFile))[7] < $fileSize){
I have a simulation running and expect it to go on for at least 10 more hours. I have directed the console output to a .txt file using
(binary) > out.txt
This out.txt is becoming too huge. I do not need a lot of the contents in this file. How can I delete the older parts of this file without harming the writing process? The contents written towards the end of the simulation are important to me.
As Carl mentioned in the comments, you cannot really do this on an actively written log file. However, if the initial data is not relevant to you, you can do the following (though beware that you will lose all existing data):
> out.txt
For the future, you can use a utility called logrotate(8).
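A minimal sketch of what the corresponding entry under /etc/logrotate.d/ might look like; the path, size limit and rotation count are placeholders to adapt, and copytruncate is used because the simulation keeps the file open while writing:
/path/to/out.txt {
    size 100M
    rotate 3
    copytruncate
    compress
    missingok
}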
You could use tail to only store the end of the file:
# Say you want to save the last 100 lines
your_binary | tail -n 100 > out.txt
This assumes that the output ends at some point.
Saw your comments: the file is 10 GB now. Try using sed -i to reduce the size so that it will work with the other tools; if you want to completely erase it, use :> logfile.
Tools can only cope with a file as big as their buffer; anything larger has to be streamed. Something like split would not work on a 4 GB file; I don't know if they have made a code adjustment for this, as it has been a long time since I had to work with a file that big.
Two suggestions:
1
There were a few methods I could think of, like using split, but almost all of them involve creating a separate (reduced) file from the log and renaming it, or redirecting to it.
Use split to break the log into smaller logs (split -l 100 ...) and just redirect the program output to the most recent log found using ls -1.
This seems to work fine.
2
I also tried a second method to edit/truncate the top 10 lines in the same file:
Kaizen ~/shell_prac
$ cat zcntr.sh
## test truncate a log file
##set -xv
:> zcntr.log ;
## fxn
cntr_log()
{
limit=$1 ;
start=0 ;
while [ $start -lt $limit ]
do
echo "count is $start" >> zcntr.log ; ## generate a continuous log
start=$(($start + 1));
sleep 1;
cnt=$(($start % 10)) ;
if [ $cnt -eq 0 ] ## check to truncate the top 10 lines using sed
then
echo "truncate at $start " >> zcntr.log ;
sed -i "1,10d" zcntr.log ;
fi
done ;
}
## main cntrlr
echo "enter a limit" ;
read lmt ;
cntr_log $lmt ;
This seems to work.
I tested it with a counter printing up to the value 25.
Output:
Kaizen ~/shell_prac
$ cat zcntr.log
count is 19
truncate at 20
count is 20
count is 21
count is 22
count is 23
count is 24
I think either of the two will help.
Let me know if there is something else on your mind!
Truncate the file with cat:
cat /dev/null > out.txt
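If you have GNU coreutils, truncate does the same thing a little more explicitly:
truncate -s 0 out.txt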
How can I send these values
24.215729
24.815729
25.055134
27.123499
27.159186
28.843474
28.877798
28.877798
to a Tcl script as input arguments?
As you know, we can't use a pipe command because Tcl doesn't accept input that way!
What can I do to store these numbers in the Tcl script? (The count of the numbers is variable and can be 0 to N; in this example it is 7.)
This is pretty easy to do in bash: dump the list of values into a file and then run:
tclsh myscript.tcl $(< datafilename)
And then the values are accessible in the script with the argument variables:
puts $argc; # This is a count of all values
puts $argv; # This is a list containing all the arguments
You can read data piped to stdin with commands like
set data [gets stdin]
or from temporary files, if you prefer. For example, the first part of the following program (an example from wiki.tcl.tk) reads some data from a file, and the other part then reads data from stdin. To test it, put the code into a file (e.g. reading.tcl), make it executable, create a small file somefile, and execute it via e.g.
./reading.tcl < somefile
#!/usr/bin/tclsh
# Slurp up a data file
set fsize [file size "somefile"]
set fp [open "somefile" r]
set data [read $fp $fsize]
close $fp
puts "Here is file contents:"
puts $data
puts "\nHere is from stdin:"
set momo [read stdin $fsize]
puts $momo
A technique I use when coding is to put data in my scripts as a literal:
set values {
24.215729
24.815729
25.055134
27.123499
27.159186
28.843474
28.877798
28.877798
}
Now I can just feed them into a command one at a time with foreach, or send them as a single argument:
# One argument
TheCommand $values
# Iterating
foreach v $values {
TheCommand $v
}
Once you've got your code working with a literal, switching it to pull the data from a file is pretty simple. You just replace the literal with code to read a file:
set f [open "the/data.txt"]
set values [read $f]
close $f
You can also pull the data from stdin:
set values [read stdin]
If there's a lot of values (more than, say, 10–20MB) then you might be better off processing the data one line at a time. Here's how to do that with reading from stdin…
while {[gets stdin v] >= 0} {
TheCommand $v
}
I have to process single PDFs that have each been created by 'merging' multiple PDFs. Each merged PDF marks the place where each component PDF starts with a bookmark.
Is there any way to automatically split this up by bookmarks with a script?
We only have the bookmarks to indicate the parts, not the page numbers, so we would need to infer the page numbers from the bookmarks. A Linux tool would be best.
pdftk can be used to split the PDF file and extract the page numbers of the bookmarks.
To get the page numbers of the bookmarks, run
pdftk in.pdf dump_data
and make your script read the page numbers from the output.
Then use
pdftk in.pdf cat A-B output out_A-B.pdf
to get the pages from A to B into out_A-B.pdf.
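For example, this pipeline prints just the bookmark page numbers, one per line, which is what the script below builds on (grep and cut are assumed to be the usual GNU tools):
pdftk in.pdf dump_data | grep '^BookmarkPageNumber: ' | cut -d' ' -f2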
The script could be something like this:
#!/bin/bash
infile=$1 # input pdf
outputprefix=$2
[ -e "$infile" -a -n "$outputprefix" ] || exit 1 # Invalid args
pagenumbers=( $(pdftk "$infile" dump_data | \
grep '^BookmarkPageNumber: ' | cut -f2 -d' ' | uniq)
end )
for ((i=0; i < ${#pagenumbers[@]} - 1; ++i)); do
a=${pagenumbers[i]} # start page number
b=${pagenumbers[i+1]} # end page number
[ "$b" = "end" ] || b=$[b-1]
pdftk "$infile" cat $a-$b output "${outputprefix}"_$a-$b.pdf
done
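Assuming you save the script as splitpdf.sh and make it executable (both the name and the out prefix below are placeholders), an invocation looks like this and produces one out_A-B.pdf per bookmark range:
./splitpdf.sh in.pdf out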
There's a command-line tool written in Java called Sejda where you can find the splitbybookmarks command, which does exactly what you asked. It's Java, so it runs on Linux, and since it's a command-line tool you can script it.
Disclaimer
I'm one of the authors
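An invocation might look roughly like the one below; the flag spellings are from memory and may differ between Sejda versions, so check sejda-console --help (or the online docs) before relying on them:
sejda-console splitbybookmarks -f merged.pdf -o out_dir -l 1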
There are programs built for PDF splitting, such as A-PDF Split, that can do that for you:
A-PDF Split is a very simple, lightning-quick desktop utility program that lets you split any Acrobat pdf file into smaller pdf files. It provides complete flexibility and user control in terms of how files are split and how the split output files are uniquely named. A-PDF Split provides numerous alternatives for how your large files are split - by pages, by bookmarks and by odd/even page. Even you can extract or remove part of a PDF file. A-PDF Split also offers advanced defined splits that can be saved and later imported for use with repetitive file-splitting tasks. A-PDF Split represents the ultimate in file splitting flexibility to suit every need.
A-PDF Split works with password-protected pdf files, and can apply various pdf security features to the split output files. If needed, you can recombine the generated split files with other pdf files using a utility such as A-PDF Merger to form new composite pdf files.
A-PDF Split does NOT require Adobe Acrobat, and produces documents compatible with Adobe Acrobat Reader Version 5 and above.
Edit:
I also found a free, open-source program Here if you do not want to pay.
Here's a little Perl program I use for the task. Perl isn't special; it's just a wrapper around pdftk that interprets its dump_data output and turns it into page numbers to extract:
#!perl
use v5.24;
use warnings;
use Data::Dumper;
use File::Path qw(make_path);
use File::Spec::Functions qw(catfile);
my $pdftk = '/usr/local/bin/pdftk';
my $file = $ARGV[0];
my $split_dir = $ENV{PDF_SPLIT_DIR} // 'pdf_splits';
die "Can't find $ARGV[0]\n" unless -e $file;
# Read the data that pdftk spits out.
open my $pdftk_fh, '-|', $pdftk, $file, 'dump_data';
my @chapters;
while( <$pdftk_fh> ) {
state $chapter = 0;
next unless /\ABookmark/;
if( /\ABookmarkBegin/ ) {
my( $title ) = <$pdftk_fh> =~ /\ABookmarkTitle:\s+(.+)/;
my( $level ) = <$pdftk_fh> =~ /\ABookmarkLevel:\s+(.+)/;
my( $page_number ) = <$pdftk_fh> =~ /\ABookmarkPageNumber:\s+(.+)/;
# I only want to split on chapters, so I skip higher
# level numbers (higher means more nesting, 1 is lowest).
next unless $level == 1;
# If you have front matter (preface, etc) then this numbering
# will be off. Chapter 1 might be called Chapter 3.
push @chapters, {
title => $title,
start_page => $page_number,
chapter => $chapter++,
};
}
}
# The end page for one chapter is one before the start page for
# the next chapter. There might be some blank pages at the end
# of the split for PDFs where the next chapter needs to start on
# an odd page.
foreach my $i ( 0 .. $#chapters - 1 ) {
my $last_page = $chapters[$i+1]->{start_page} - 1;
$chapters[$i]->{last_page} = $last_page;
}
$chapters[$#chapters]->{last_page} = 'end';
make_path $split_dir;
foreach my $chapter ( @chapters ) {
my( $start, $end ) = $chapter->@{qw(start_page last_page)};
# slugify the title so use it as a filename
my $title = lc( $chapter->{title} =~ s/[^a-z]+/-/gri );
my $path = catfile( $split_dir, "$title.pdf" );
say "Outputting $path";
# Use pdftk to extract that part of the PDF
system $pdftk, $file, 'cat', "$start-$end", 'output', $path;
}
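Assuming you save it as split-chapters.pl (the name is arbitrary) and pdftk really is at /usr/local/bin/pdftk as hard-coded above, a run looks like this, with the output directory overridable through PDF_SPLIT_DIR:
PDF_SPLIT_DIR=chapters perl split-chapters.pl book.pdf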