Split files using tar, gz, zip, or bzip2 [closed] - linux

I need to compress a large file of about 17-20 GB and split it into several files of around 1 GB each.
I searched for a solution via Google and found ways using the split and cat commands, but they did not work for large files at all. Also, they won't work on Windows; I need to extract the archive on a Windows machine.

You can use the split command with the -b option:
split -b 1024m file.tar.gz
It can be reassembled on a Windows machine using @Joshua's answer:
copy /b file1 + file2 + file3 + file4 filetogether
Edit: As @Charlie noted in the comment below, you might want to set a prefix explicitly because otherwise split uses x as the prefix, which can be confusing.
split -b 1024m "file.tar.gz" "file.tar.gz.part-"
// Creates files: file.tar.gz.part-aa, file.tar.gz.part-ab, file.tar.gz.part-ac, ...
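If numeric part names are easier to deal with (for example when listing them in the copy /b command), GNU split also accepts -d for numeric suffixes; a small variant of the same command:
split -b 1024m -d "file.tar.gz" "file.tar.gz.part-"
// Creates files: file.tar.gz.part-00, file.tar.gz.part-01, file.tar.gz.part-02, ...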
Edit: Adding this here because the question is closed and the most effective solution is very close to the content of this answer:
# create archives
$ tar cz my_large_file_1 my_large_file_2 | split -b 1024MiB - myfiles_split.tgz_
# uncompress
$ cat myfiles_split.tgz_* | tar xz
This solution avoids the need for an intermediate large file when (de)compressing. Use the tar -C option to place the resulting files in a different directory (a short illustration follows the gzip variant below). By the way, if the archive consists of only a single file, tar could be avoided and gzip used on its own:
# create archives
$ gzip -c my_large_file | split -b 1024MiB - myfile_split.gz_
# uncompress
$ cat myfile_split.gz_* | gunzip -c > my_large_file
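As mentioned above, the tar -C option puts the extracted files in a different directory; a short illustration for the multi-file case (the output path is just a placeholder, and -f - is spelled out so tar reads from the pipe regardless of its compiled-in default):
cat myfiles_split.tgz_* | tar xzf - -C /path/to/output-dir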
For Windows, you can download ported versions of the same commands or use Cygwin.

If you are splitting on Linux, you can still reassemble on Windows:
copy /b file1 + file2 + file3 + file4 filetogether

Use tar to split the data into multiple archives.
There are plenty of programs that can work with tar files on Windows, including Cygwin.
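Presumably this refers to GNU tar's multi-volume mode. A minimal sketch, assuming GNU tar (note that -M cannot be combined with compression such as -z, so compress the input beforehand if needed; -L is given in units of 1024 bytes, so 1048576 means 1 GiB per volume):
# create three 1 GiB volumes; tar prompts for more if these are not enough
tar -cM -L 1048576 -f part1.tar -f part2.tar -f part3.tar my_large_file
# extract by listing the volumes again, in order
tar -xM -f part1.tar -f part2.tar -f part3.tar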

Tested code; it first creates a single archive file, then splits it:
gzip -c file.orig > file.gz
CHUNKSIZE=1073741824
PARTCNT=$(( $(stat -c%s file.gz) / CHUNKSIZE ))
# The remainder is taken care of: for a file of 1 GiB + 1 byte,
# PARTCNT is 1 and seq 0 $PARTCNT covers the whole file.
for n in $(seq 0 $PARTCNT)
do
    dd if=file.gz of=part.$n bs=$CHUNKSIZE skip=$n count=1
done
This variant omits creating a single archive file and goes straight to creating parts:
gzip -c file.orig |
( CHUNKSIZE=1073741824
  i=0
  while true; do
      i=$((i+1))
      head -c "$CHUNKSIZE" > "part.$i"
      [ "$CHUNKSIZE" -eq $(stat -c%s "part.$i") ] || break
  done )
In this variant, if the archive's size is an exact multiple of $CHUNKSIZE, the last part file will be 0 bytes long.
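Either way, the parts are reassembled by concatenating them in order and decompressing. A small sketch (worth spelling out because with ten or more parts a plain glob would put part.10 before part.2, so the names are sorted numerically here):
cat $(ls part.* | sort -t. -k2 -n) | gunzip -c > file.orig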

Related

In Linux, compare files byte by byte in 2 folders and look for duplicates [closed]

I have several image files (.jpg, .png and more) containing photos in 2 directories. How can I compare the files in the 2 directories byte by byte using Bash under Linux so as to:
1) list the duplicate files present in the two directories on stdout or in a file
2) delete only 1 of the duplicate files, e.g. the most recent one.
You probably don't need a byte-by-byte comparison. Calculating checksums and working with those is easier, and the probability of a collision is very low. It can also save some time if you want to do this multiple times on a slow disk.
I have two directories (a and b) with these files:
$ ls *
a:
agetty agetty-2 badblocks bridge btrfs btrfs-image lvreduce lvreduce-2 resize2fs
b:
agetty agetty-2 bridge
1 Calculate checksums first
I will calculate checksums for all files and sort them:
find a b -type f | xargs sha256sum | sort > cksums
You can also use md5sum and others. md5sum is faster than sha256sum, but the probability of a collision (two different files having the same checksum) is a bit higher (although still low enough for this purpose).
Content of the file:
b1a58ac886f70cb65cc124bcc8e12a52659fbf5ce841956953d70d29b74869d7 a/resize2fs
c0e532634d14783bbd2ec1a1ed9bfc0b64da4a1efea2e9936fb97c6777ac1e10 a/btrfs-image
d00cdf58189e2171e3cb6610e6290c70ba03ecc0dc46b0570595d9187d769d2e a/btrfs
fadc2874feb053947ac1a4d8f14df58dabc093fa00b92f01125497ac9a171999 a/badblocks
424cf438ac1b6db45d1f25e237f28cc22bd7098a7fdf0f9c402744dc3f6ea6f2 a/agetty
424cf438ac1b6db45d1f25e237f28cc22bd7098a7fdf0f9c402744dc3f6ea6f2 a/agetty-2
424cf438ac1b6db45d1f25e237f28cc22bd7098a7fdf0f9c402744dc3f6ea6f2 b/agetty
424cf438ac1b6db45d1f25e237f28cc22bd7098a7fdf0f9c402744dc3f6ea6f2 b/agetty-2
424cf438ac1b6db45d1f25e237f28cc22bd7098a7fdf0f9c402744dc3f6ea6f2 b/bridge
7e177d31c45ab550b27ca743e4502cc4be519de4c75b2f479f427930bcb7c7bd a/bridge
9954909c3436bef767729b8f6034e5f12ef300fad16dc0e540bfa3c89c38b9c6 a/lvreduce
9954909c3436bef767729b8f6034e5f12ef300fad16dc0e540bfa3c89c38b9c6 a/lvreduce-2
You can even compare the files visually: files with the same content have the same checksum. Notice that a SHA-256 checksum is 64 hex digits (32 bytes) long.
2 Find repeated lines
cat cksums | uniq -Dw 64 | sed 's/^\S*\s*//'
Output:
a/agetty
a/agetty-2
b/agetty
b/agetty-2
b/bridge
a/lvreduce
a/lvreduce-2
You can also group files with the same contents:
cat cksums | uniq --group -w 64 | sed 's/^\S*\s*//'
a/resize2fs

a/btrfs-image

a/btrfs

a/badblocks

a/agetty
a/agetty-2
b/agetty
b/agetty-2
b/bridge

a/bridge

a/lvreduce
a/lvreduce-2
3 List files for deletion
count=0
# --group=append also prints an empty line after the last group,
# so the loop below flushes that group too
cat cksums | uniq --group=append -w 64 | sed 's/^\S*\s*//' | while read filename
do
    if [[ -z "$filename" ]]
    then
        # end of a group: if it held more than one file, list its last one
        if [[ 1 -lt "$count" ]]
        then
            echo "$prev"
        fi
        count=0
    else
        prev="$filename"
        ((count++))
    fi
done
To actually delete them, append | xargs rm -v to the done line, as sketched below.
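That is, the last line of the loop becomes the following (note that xargs splits on whitespace, so this assumes filenames without spaces, and it is worth reviewing the listed files before letting rm loose):
done | xargs rm -v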

Extract 3 smallest files from Tar archive in descending order by size [duplicate]

How can I extract from a Tar file in Linux the 3 smallest files in descending order using command line?
You can list the file details, sort them by size, pick the top 3 files, build the tar x command, and execute it to extract the 3 files:
tar tvf foo.tar |
  awk '$0=$3"\x99"$NF' |
  sort -n |
  awk -F'\x99' 'NR<4{s=s" "$2}END{print "tar xvf foo.tar "s}' |
  sh
Note:
The above one-liner assumes that no filename in the tarball contains spaces or other special characters.
The tarball name foo.tar is hardcoded; replace it with your real tarball.
You can test the command without the last | sh stage: it will then only print the generated tar -x command. If that looks right, add | sh back to do the real extraction.
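For comparison, a sketch of the same idea that sorts tar's listing on the size column directly, without the \x99 separator trick (same assumptions: foo.tar is a placeholder and the filenames contain no spaces):
tar tvf foo.tar | sort -k3,3n | head -3 | awk '{print $NF}' | xargs tar xvf foo.tar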

shell script for copying log files into a single compressed file

We have a folder "statuslogs" on our embedded board; this folder contains logs with names of the format daily_status_date_time.log.
We need to get all the files of a particular year into a single file, for fetching from the server.
We did the following in our script:
gzip -c statuslogs/daily_status_2017*.log > status_2017.gz
gzip -c statuslogs/daily_status_2018*.log > status_2018.gz
gzip -c statuslogs/daily_status_2019*.log > status_2019.gz
gzip -c statuslogs/daily_status_2020*.log > status_2020.gz
gzip -c statuslogs/daily_status_2021*.log > status_2021.gz
The problem with this logic is that it will still create a status_*.gz file for the years 2019, 2020 and 2021, even though there are no logs for those years yet.
I tried writing the following logic:
if [ - f statuslogs/daily_status_2017*.log ], but it fails, maybe because of the wildcard. And I am not using bash; the interpreter is ash.
Can you please help me optimize the script?
Thanks for your time.
You have a syntax error. It's -f, not - f. Example:
if [ -f statuslogs/daily_status_2017*.log ]; then
gzip -c statuslogs/daily_status_2017*.log > status_2017.gz
fi
However, with this you will probably run into a "too many arguments" error, which will happen if you have more than one matching file. So this would work better:
if find statuslogs/daily_status_2017*.log -mindepth 0 -maxdepth 0 2>/dev/null | grep -q .; then
gzip -c statuslogs/daily_status_2017*.log > status_2017.gz
fi
It would be better to instead stop the loop when you reach the current year. For example,
for year in $(seq 2017 $(date +%Y)); do
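Filling that in, a minimal sketch of the whole loop (assuming the naming scheme from the question and a POSIX shell such as ash; it uses an ls test rather than find to check whether any logs exist for the year):
for year in $(seq 2017 $(date +%Y)); do
    if ls statuslogs/daily_status_${year}*.log >/dev/null 2>&1; then
        gzip -c statuslogs/daily_status_${year}*.log > status_${year}.gz
    fi
done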
gzip only compresses a single file (or stream), so concatenating several logs into one .gz loses the file boundaries. If you want to keep the individual files separate inside the archive, you need to do one of the following:
Combine the files using tar:
tar czf status_2017.tar.gz statuslogs/daily_status_2017*.log
OR use zip which supports multiple files directly
zip status_2017.zip statuslogs/daily_status_2017*.log
Now, if the problem is just that you want one archive for every year, but only for the years for which files exist, you can handle all the years using a for loop:
for year in `ls statuslogs/daily_status_* | cut -d _ -f 3 | sort | uniq`; do
    tar czf status_$year.tar.gz statuslogs/daily_status_$year*.log
done
If your shell doesn't support that format of calling, you can try this instead
ls statuslogs/daily_status_* | cut -d _ -f 3 | sort | uniq > years
cat years | while read year; do
    tar czf status_$year.tar.gz statuslogs/daily_status_$year*.log
done
If you just want one file for all the logs, you can forget about the year part completely:
tar czf statuslogs.tar.gz statuslogs/daily_status*.log

Pipe files included in tar.gz file to c++ program in bash

I have a C++ program that can be run in the following format (two cases):
./main 10000 file1.txt
OR
./main 10000 file1.txt file2.txt
where file1.txt and file2.txt are huge text files. I have a file.tar.gz that may include:
Just one file (file1.txt)
Two files file1.txt and file2.txt
Is there a way in bash to use a pipe to read the files directly from the .gz file, in both cases, i.e., whether the archive contains one file or two? I have looked at Pipe multiple files (gz) into C program, but I am not too savvy at bash and I have trouble understanding the answers there.
This isn't going to be particularly simple. Your question is really too broad, as it stands. One approach would be:
1. Determine whether the archive contains one or two files
2. Set up named pipes (fifos) for each of the files (the "mkfifo" command)
3. Run commands that write the content of each file in the archive to the appropriate fifo, as background processes
4. Run the primary command, specifying the fifos as the filename arguments
Giving a full rundown of all of this is, I think, beyond the scope of a Stack Overflow question. For (1), you could probably do something like:
FILECOUNT=`tar -tzf (filename.tar.gz) | wc -l`
This lists the files within the archive (tar -tzf) and counts the lines of output (wc -l). It's not foolproof, but it should work if the filenames are simple names like the ones you suggested (file1.txt, file2.txt).
For (2), make either one or two fifos as appropriate:
mkfifo file1-fifo.txt
if [ $FILECOUNT = 2 ]; then
mkfifo file2-fifo.txt
fi
For (3), use tar with -O to extract file contents from the archive, and redirect it to the fifo(s), as a background process:
tar -O -xf (filename.tar.gz) file1.txt > file1-fifo.txt &
if [ $FILECOUNT = 2 ]; then
tar -O -xf (filename.tar.gz) file2.txt > file2-fifo.txt &
fi
And then (4) is just:
SECONDFILE=""
if [ $FILECOUNT = 2 ]; then
SECONDFILE=file2-fifo.txt
fi
./main 10000 file1-fifo.txt $SECONDFILE
Finally, you should delete the fifo nodes:
rm file1-fifo.txt
rm file2-fifo.txt
Note that this will involve extracting the archive contents twice (in parallel), once for each file. There's no way (that I can think of) of getting around this.
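As an aside, and not part of the fifo approach above: if ./main only reads each input file sequentially from start to finish (the <(...) paths cannot be seeked), bash process substitution can stand in for the explicit fifos. A sketch using the names from the question:
./main 10000 <(tar -xzOf file.tar.gz file1.txt) <(tar -xzOf file.tar.gz file2.txt)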

What is cat for and what is it doing here? [closed]

I have this script I'm studying, and I would like to know what cat is doing in this section.
if cat downloaded.txt | grep "$count" >/dev/null
then
echo "File already downloaded!"
else
echo $count >> downloaded.txt
cat $count | egrep -o "http://server.*(png|jpg|gif)" | nice -n -20 wget --no-dns-cache -4 --tries=2 --keep-session-cookies --load-cookies=cookies.txt --referer=http://server.com/wallpaper/$number -i -
rm $count
fi
Like most cats, this is a useless cat.
Instead of:
if cat downloaded.txt | grep "$count" >/dev/null
It could have been written:
if grep "$count" download.txt > /dev/null
In fact, by eliminating the pipe, you've also eliminated any ambiguity about which command's exit status the if statement is testing.
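For what it's worth, grep -q does the same job without the explicit redirect, since it suppresses output and only reports a match via its exit status:
if grep -q "$count" downloaded.txt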
Most Unix cats you'll see are of the useless variety. However, people like cats almost as much as they like using a grep/awk pipe, or using multiple grep or sed commands instead of combining everything into a single command.
The cat command's name stands for concatenate: it lets you concatenate files. It pairs naturally with the split command, which splits a file into multiple parts. This was useful if you had a really big file but had to put it on floppy disks that couldn't hold the entire file:
split -b140K -a4 my_really_big_file.txt my_smaller_files.txt.
Now, I'll have my_smaller_files.txt.aaaa and my_smaller_files.txt.aaab and so forth. I can put them on the floppies, and then on the other computer. (Heck, I might go all high tech and use UUCP on you!).
Once I get my files on the other computer, I can do this:
cat my_smaller_files.txt.* > my_really_big_file.txt
And, that's one cat that isn't useless.
cat prints the contents of the named file (to standard output, or to wherever that is redirected). The result can be piped to some other command, in this case (e)grep, to find something in the file's contents. Concretely, the script adds the name of the file to downloaded.txt so it won't be processed again (that is what the check in the if was about), then downloads the images referenced in that file.
http://www.linfo.org/cat.html
"cat" is a unix command that reads the contents of one or more files sequentially and by default prints out the information the user console ("stdout" or standard output).
In this case cat is being used to read the contents of the file "downloaded.txt", the pipe "|" is redirecting/feeding its output to the grep program, which is searching for whatever is in the variable "$count" to be matched with.
