How do I use the dd command when the destination size is greater than the source file size?

I generated a fairly large file using /dev/urandom. I know urandom is slow. I use these files for some I/O verification. When I need a new file of a bigger size, I have to create it from urandom all over again, which slows me down.
What I basically want is:
Use the same file to create destination files of larger sizes. The randomness of the destination's contents does not matter to me, but I cannot use /dev/zero either.
Is there a way to instruct the dd command to write the same input file repeatedly until the destination is filled up?

Maybe you can use cat to build a larger output:
cat randomfile1 randomfile2 randomfile1 ...
and only use the first n bytes you need.
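For instance, a rough sketch assuming GNU coreutils (head -c understands size suffixes like 25M), a 10M base file named random_1, and an arbitrary 25M target:
for i in 1 2 3; do cat random_1; done | head -c 25M > random_big
ls -lh random_big   # should report 25M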

If you have created a file random_1 (10M) using dd,
dd if=/dev/urandom of=./random_1 bs=1M count=10 # create a base file
and you want a new file random_2 (12M), use the first file to create another 2M file (random_1_1) and join the two:
dd if=./random_1 of=./random_1_1 bs=1M count=2 # create a dummy 2M file
cat random_1 random_1_1 > random_2 # join both files
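A quick sanity check of the sizes, assuming the 10M base and 2M slice from above:
ls -lh random_1 random_1_1 random_2   # expect roughly 10M, 2M and 12M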

Related

Partially expand VCF bgz file in Linux

I have downloaded gnomAD files from - https://gnomad.broadinstitute.org/downloads
This is the bgz file
https://storage.googleapis.com/gnomad-public/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.2.vcf.bgz
When I expand using:
zcat gnomad.genomes.r2.1.1.sites.2.vcf.bgz > gnomad.genomes.r2.1.1.sites.2.vcf
The output VCF file becomes more than 330GB. I do not have that kind of space available on my laptop.
Is there a way to expand just part of the bgz file - say, 1 GB of it, or just 100,000 rows?
From what I've been able to determine, a bgz file is compatible with gzip, and a VCF file is a plain text file. Since it's a gzip file and not a .tar.gz, there is no archive listing to deal with, which simplifies things a bit.
This can probably be accomplished in several ways, and I doubt this is the best way, but I've been able to successfully decompress the first 100,000 rows into a file using the following code in python3 (it should also work under earlier versions back to 2.7):
#!/usr/bin/env python3
import gzip

# bgz is gzip-compatible, so GzipFile can read it directly
ifile = gzip.GzipFile("gnomad.genomes.r2.1.1.sites.2.vcf.bgz")
ofile = open("truncated.vcf", "wb")

LINES_TO_EXTRACT = 100000
for line in range(LINES_TO_EXTRACT):
    ofile.write(ifile.readline())   # copy one decompressed line at a time

ifile.close()
ofile.close()
I tried this on your example file, and the truncated file is about 1.4 GiB. It took about 1 minute 40 seconds on a Raspberry Pi-like computer, so while it's slow, it's not unbearably so.
While this solution is somewhat slow, it's good for your application for the following reasons:
It minimizes disk and memory usage, which could otherwise be problematic with a large file like this.
It cuts the file to exactly the given number of lines, which avoids truncating your output file mid-line.
The three input parameters can be easily parsed from the command line in case you want to make a small CLI utility for parsing other files in this manner.
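Since the file is gzip-compatible, a plain shell pipeline is another rough sketch of the same idea (this assumes GNU zcat and head are available; once head has printed 100,000 lines it exits and the decompression stops, so the full 330GB is never written out):
zcat gnomad.genomes.r2.1.1.sites.2.vcf.bgz | head -n 100000 > truncated.vcf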

Making a text file grow by replicating it on Mac

I am doing some unit tests. I have a small text file (a few kilobytes), and I would like to make a new file in which the same text is replicated over and over, a user-specified number of times. The reason I want to do this is to ensure that my algorithm can handle large files and that the results are correct (I can extrapolate the correct results from the tests run on the smaller text file).
Is there a utility on the Mac or Linux platform that lets me do that?
You can use a for loop and concatenate the contents of the file to a temporary file.
COUNT=10 # larger or smaller, depending on how large you want the file
FILENAME=test.txt
# remove the mv command if you do not wish the original file to be overwritten
for i in $(seq 1 "$COUNT"); do cat "$FILENAME" >> "$FILENAME.tmp"; done && mv "$FILENAME.tmp" "$FILENAME"
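If test.txt is, say, 4 KB and COUNT=10, the result should be roughly 40 KB; a quick way to confirm:
wc -c test.txt   # size in bytes after the loop has run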

Which log file is this stored in, or how do I locate this log file?

I have perhaps a dumb question, but this might be an easy one...
I run a dd command at the console and, when it is done, I get a message like:
0+1 records in
0+1 records out
424 bytes (424 B) copied, 0.0191003 s, 22.2 kB/s
The question is, which log file or record file is this info stored in? To be CLEAR, I need to access the above message and not the output file.
Thanks in advance
If you're talking about the file being created by dd, it's either going to be whatever file you specified with the of= option, or standard output, possibly redirected.
That's the way dd works: it writes to standard output by default but you can override this by specifying the output file explicitly.
For example:
pax> dd if=testprog.c of=/dev/null
6+1 records in
6+1 records out
3454 bytes (3.5 kB) copied, 8.3585e-05 s, 41.3 MB/s
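For contrast, a small sketch of the default behaviour when no of= is given: the copied data goes to standard output and the status summary to standard error, so the byte count here matches the 3454 bytes reported above:
dd if=testprog.c 2>/dev/null | wc -c   # prints 3454, the size of the data copied above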
If you're after the actual status output of the dd command rather than the file being copied, dd is simply writing this to standard error, so you can capture it with:
dd if=somefile of=someotherfile 2>dd.stderr
This will send standard error through to the file dd.stderr. If you don't redirect it, then it has almost certainly gone to your default standard error, which is usually your terminal. The only way to get it from there is to cut and paste it with your terminal program. As far as the file system is concerned, it's gone.
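If you want to both watch the status on the terminal and keep a copy of it, one rough sketch (assuming tee is available) is:
dd if=somefile of=someotherfile 2>&1 | tee dd.log
Here 2>&1 merges dd's standard error into standard output before the pipe; that is safe in this case because the copied data is going to someotherfile rather than to standard output, so only the status text travels down the pipe.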

Script command to manipulate a binary file (on Linux)

I am looking for a mechanism to patch my EEPROM image with a unique device ID. I'd like to do this in a makefile so that each device automatically obtains a new ID, which is then written into the data image before flashing. In pseudocode:
wget http://my.centralized.uid.service/new >new.id
binedit binary.image -write 0xE6 new.id
flash binary.image into device
So first we get an ID into a separate file, then we overwrite the image (starting at a given offset) with the contents of this ID file, and then we flash it. But how do I do the second part? I looked at bvi, which seems to have some scripting abilities, but I did not fully understand it, and to be honest vi has always given me the creeps.
Thanks in advance for any help!
(Full disclosure: I made the initial vote to close as a duplicate. This answer is adapted from the referenced question.)
Use dd with the notrunc option:
offset=$(( 0xe6 ))            # byte offset at which to write, 0xE6 = 230
length=$( wc -c < new.id )    # number of bytes in the ID file
# conv=notrunc leaves the rest of binary.image intact beyond the patched region
dd bs=1 if=new.id of=binary.image count=$length seek=$offset conv=notrunc
You may want to try this on a copy first, just to make sure it works properly.
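A rough sketch of that dry run, reusing $offset and $length from above (binary.test is just a scratch copy):
cp binary.image binary.test
dd bs=1 if=new.id of=binary.test count=$length seek=$offset conv=notrunc
cmp -l binary.image binary.test   # lists differing byte positions; all should fall inside the patched range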
If you know the offset in the file at which you want to start replacing, you can use the split command to cut the initial file at that offset. The cat command can then be used to join the required pieces back together.
Another useful tool when working with binary files is od which will let you examine the binary files in human readable format.
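A sketch of that splice-and-inspect approach; head and tail are used here in place of split because they make exact byte offsets easier to express (this assumes GNU coreutils, the 0xE6 offset from above, i.e. 230 bytes, and the ID sitting in new.id):
idlen=$( wc -c < new.id )
head -c 230 binary.image > patched.image                        # bytes before the offset
cat new.id >> patched.image                                     # the new ID
tail -c +$(( 230 + idlen + 1 )) binary.image >> patched.image   # everything after the patched region
od -A x -t x1 -j 230 -N 16 patched.image                        # examine the patched bytes in hex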
I would perhaps use something like Perl. See here and in particular the section labelled Updating a Random-Access File (example here)

Shortening a large CSV on Debian

I have a very large CSV file and I need to write an app that will parse it, but testing against the >6GB file is painful. Is there a simple way to extract the first hundred or two hundred lines without having to load the entire file into memory?
The file resides on a Debian server.
Did you try the head command?
head -n 200 inputfile > outputfile
head -n 10 file.csv > truncated.csv
will take the first 10 lines of file.csv and store them in a file named truncated.csv.
"The file resides on a Debian server."- this is interesting. This basically means that even if you use 'head', where does head retrieve the data from? The local memory (after the file has been copied) which defeats the purpose.
