Extracting the last n percent of a file output from the zcat command - Linux

I am trying to extract the last 2 percent of a file's output coming from the zcat command. I tried something like:
numlines=$(zcat file.tar.gz | wc -l)
zcat file.tar.gz | tail -n + $numlines*(98/100)
The problem with this approach is that my file is too big and I can't afford to run the zcat command twice. Is there some way I could do it, maybe by piping the number of lines, or some other way?
EDIT :
The output of zcat file.tar.gz | tar -xO | dd 2>&1 | tail -n 1 is
16942224047 bytes (17 GB, 16 GiB) copied, 109.154 s, 155 MB/s
Any help would be greatly appreciated.

Read the content into a variable. I assume that there is enough RAM available.
content=$(zcat file.tar.gz | tar -xO)
lines=$(wc -l <<<"$content")
lastlines=$(( lines - lines*98/100 ))
tail -n "$lastlines" <<<"$content"
This only works if the file contains at least 100 lines.

The following awk program will keep only the last n% of your file in memory. The percentage is taken floor-wise; that is to say, if n% of the file represents 134.56 lines, it will print 134 lines.
awk -v n=2 '{a[FNR]=$0; min=FNR-int(FNR*n/100)}
{i=min; while(i in a) delete a[i--]}
END{for(i=min+1;i<=FNR;++i) print a[i]}' - < <(zcat file)
You can verify this by replacing zcat file with seq 100.
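For example, feeding it seq 100 with n=2 prints only the last two lines:
awk -v n=2 '{a[FNR]=$0; min=FNR-int(FNR*n/100)}
{i=min; while(i in a) delete a[i--]}
END{for(i=min+1;i<=FNR;++i) print a[i]}' - < <(seq 100)
99
100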

Related

Get size of image in bash

I want to get the size of an image. The image is in a folder, named encodedImage.jpc.
a="$(ls -s encodedImage.jpc | cut -d " " -f 1)"
temp="$(( $a*1024 * 8))"
echo "$a"
The output is not correct. How do I get the size? Thank you.
Rather than parsing the ls output, the proper way is to use the stat command, like this:
stat -c '%s' file
Check
man stat | less +/'total size, in bytes'
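If the * 8 in the question was meant to give the size in bits, a small sketch building on stat (the variable names here are just illustrative) would be:
bytes=$(stat -c '%s' encodedImage.jpc)
bits=$(( bytes * 8 ))
echo "$bytes bytes = $bits bits"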
If by size you mean bytes or pretty-printed bytes, you can just use
ls -lh
-h When used with the -l option, use unit suffixes: Byte, Kilobyte, Megabyte, Gigabyte, Terabyte and Petabyte in order to reduce the number of digits to three or less using base 2 for sizes.
I guess the more complete answer, if you're just trying to tear off the file size alone, is the following (I added the file name as well; you can remove ,$9 to drop it):
ls -lh | awk '{print $5,$9}'
You can use this command:
du -sh your_file

Check record length for fixed width files

In a Unix environment, I occasionally have some fixed-width files for which I'd like to check the record lengths. For each file I'd like to catch any records that are not an appropriate length, for further investigation; the appropriate size is known a priori.
If I want to check if all record lengths are the same, I simply run
zcat <gzipped file> | awk '{print length}' | sort -u
If there is more than one record length in the above command, then I run
zcat <gzipped file> | awk '{print length}' | nl -n rz -s "," > recordLengths.csv
which stores the record length for each row of the original file.
What: Is this an efficient method, or is there a better way of checking record length for a file?
Why: The reason I ask is that some of these files can be a few GB in size while gzipped. So this process can take a while.
With pure awk:
zcat <gzipped file> | awk '{printf "%0.6d,%s\n", NR, length}' > recordLengths.csv
This way you will save one extra subprocess.
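Since the appropriate length is known a priori, a variant of the same idea (just a sketch; the expected width of 80 and the output name badRecords.csv are placeholders) is to report only the offending rows, which keeps the output small even for multi-GB files:
zcat <gzipped file> | awk -v len=80 'length != len {printf "%d,%d\n", NR, length}' > badRecords.csv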

Retrieve last 100 lines logs

I need to retrieve last 100 lines of logs from the log file.
I tried the sed command
sed -n -e '100,$p' logfilename
Please let me know how I can change this command to retrieve only the last 100 lines.
You can use the tail command as follows:
tail -100 <log file> > newLogfile
Now the last 100 lines will be present in newLogfile.
EDIT:
More recent versions of tail, as mentioned by twalberg, use the command:
tail -n 100 <log file> > newLogfile
"tail" is command to display the last part of a file, using proper available switches helps us to get more specific output. the most used switch for me is -n and -f
SYNOPSIS
tail [-F | -f | -r] [-q] [-b number | -c number | -n number] [file ...]
Here
-n number : The location is number lines.
-f : The -f option causes tail to not stop when end of file is reached, but rather to wait for additional data to be appended to the input. The -f option is ignored if the standard input is a pipe, but not if it is a FIFO.
Retrieve last 100 lines logs
To get the last 100 lines (static):
tail -n 100 <file path>
To follow the last 100 lines in real time:
tail -f -n 100 <file path>
You can simply use the following command:
tail -NUMBER_OF_LINES FILE_NAME
e.g. tail -100 test.log
will fetch the last 100 lines from test.log.
If you want the output of the above in a separate file, you can use redirection as follows:
tail -NUMBER_OF_LINES FILE_NAME > OUTPUT_FILE_NAME
e.g. tail -100 test.log > output.log
will fetch the last 100 lines from test.log and store them in a new file, output.log.
The sed script that prints the last 100 lines of a file can be found in the sed documentation (https://www.gnu.org/software/sed/manual/sed.html#tail):
$ cat sed.cmd
1! {; H; g; }
1,100 !s/[^\n]*\n//
$p
h
$ sed -nf sed.cmd logfilename
For me this is way more difficult than your script, so
tail -n 100 logfilename
is much, much simpler. And it is quite efficient: it will not read the whole file if that is not necessary. See my answer with an strace report for tail ./huge-file: https://unix.stackexchange.com/questions/102905/does-tail-read-the-whole-file/102910#102910
I know this is very old, but, for whoever it may help:
less +F my_log_file.log
That's just the basics; with less you can do much more powerful things. Once you start viewing logs you can search, go to a line number, search for a pattern, and much more, plus it is faster for large files. It's like vim for logs (totally my opinion).
Original less documentation: https://linux.die.net/man/1/less
less cheatsheet : https://gist.github.com/glnds/8862214
len=$(wc -l < filename)
start=$(( len - 99 ))
sed -n "${start},${len}p" filename
The first line gets the total number of lines in the file.
We want to fetch the last 100 records, so the start line is the total minus 99.
Then just put the variables into the sed command to fetch the last 100 lines from the file.
I hope this will help you.
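As a quick sanity check of the arithmetic, on generated data:
seq 200 > /tmp/sample.log
len=$(wc -l < /tmp/sample.log)
start=$(( len - 99 ))
sed -n "${start},${len}p" /tmp/sample.log | wc -l
This prints 100, and the first line printed by the sed command itself is 101, i.e. the last 100 of the 200 lines.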

Linux: How to list information about files or directories (size, permissions, number of files by type) in total

Suppose I am in the current directory and I want to list all the files: the total number, as well as their size and permissions, and also the number of files by type.
Here is a sample of the output:
Print information about "/home/user/poker"
total number of file : 83
pdf files : 5
html files : 9
text files : 15
unknown : 5
NB: any file without an extension can be considered unknown.
I hope to use some simple commands like ls, cut, sort, uniq (just examples) to put each different extension in a file and use wc -l to count the number of lines, or do I need to use grep, awk, or something else?
I hope to get everybody's advice. Thank you!
The best way is to use file to output only the MIME type and pass it to awk.
file * -ib | awk -F'[;/.]' '{print $(NF-1)}' | sort -n | uniq -c
On my home directory it produces this output.
35 directory
3 html
1 jpeg
1 octet-stream
1 pdf
32 plain
5 png
1 spreadsheet
7 symlink
1 text
1 x-c++
3 x-empty
1 xml
2 x-ms-asf
4 x-shellscript
1 x-shockwave-flash
If you think text/x-c++ and text/plain should be counted as the same, use this:
file * -ib | awk -F'[;/.]' '{print $1}' | sort -n | uniq -c
6 application
6 image
45 inode
40 text
2 video
Change the {print $1} part according to your need to get the appropriate output.
You need bash.
files=(*)
pdfs=(*.pdf)
echo "${#files[@]}"
echo "${#pdfs[@]}"
echo "$(( ${#files[@]} - ${#pdfs[@]} ))"
find . -type f | xargs -n1 basename | fgrep . | sed 's/.*\.//' | sort | uniq -c | sort -n
That gives you a recursive list of file extensions. If you want only the current directory add a -maxdepth 1 to the find command.
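If you want output shaped like the sample in the question, a minimal non-recursive bash sketch along the same lines (counting only regular files in the current directory and treating extensionless files as unknown) could look like this:
total=0; unknown=0
declare -A counts
for f in *; do
    [ -f "$f" ] || continue                      # skip directories and other non-regular files
    total=$(( total + 1 ))
    case "$f" in
        *.*) ext=${f##*.}; counts[$ext]=$(( ${counts[$ext]:-0} + 1 )) ;;
        *)   unknown=$(( unknown + 1 )) ;;       # no extension -> unknown
    esac
done
echo "total number of file : $total"
for ext in "${!counts[@]}"; do echo "$ext files : ${counts[$ext]}"; done
echo "unknown : $unknown"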

Total size of the contents of all the files in a directory [closed]

When I use ls or du, I get the amount of disk space each file is occupying.
I need the sum total of all the data in files and subdirectories I would get if I opened each file and counted the bytes. Bonus points if I can get this without opening each file and counting.
If you want the 'apparent size' (that is, the number of bytes in each file), not the size taken up by the files on disk, use the -b or --bytes option (if you have a Linux system with GNU coreutils):
% du -sbh <directory>
Use du -sb:
du -sb DIR
Optionally, add the h option for more user-friendly output:
du -sbh DIR
cd to directory, then:
du -sh
ftw!
Originally wrote about it here:
https://ao.ms/get-the-total-size-of-all-the-files-in-a-directory/
Just an alternative:
ls -lAR | grep -v '^d' | awk '{total += $5} END {print "Total:", total}'
grep -v '^d' will exclude the directories.
stat's "%s" format gives you the actual number of bytes in a file.
find . -type f |
xargs stat --format=%s |
awk '{s+=$1} END {print s}'
Feel free to substitute your favourite method for summing numbers.
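For example, to turn the total into a human-readable figure, and assuming GNU coreutils' numfmt is available (my addition, not part of the original answer), you can append it to the pipeline:
find . -type f | xargs stat --format=%s | awk '{s+=$1} END {print s}' | numfmt --to=iec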
If you use BusyBox's du on an embedded system, you cannot get exact byte counts with du, only kilobytes.
BusyBox v1.4.1 (2007-11-30 20:37:49 EST) multi-call binary
Usage: du [-aHLdclsxhmk] [FILE]...
Summarize disk space used for each FILE and/or directory.
Disk space is printed in units of 1024 bytes.
Options:
-a Show sizes of files in addition to directories
-H Follow symbolic links that are FILE command line args
-L Follow all symbolic links encountered
-d N Limit output to directories (and files with -a) of depth < N
-c Output a grand total
-l Count sizes many times if hard linked
-s Display only a total for each argument
-x Skip directories on different filesystems
-h Print sizes in human readable format (e.g., 1K 243M 2G )
-m Print sizes in megabytes
-k Print sizes in kilobytes(default)
For Win32 DOS, you can:
c:> dir /s c:\directory\you\want
and the penultimate line will tell you how many bytes the files take up.
I know this reads all files and directories, but works faster in some situations.
When a folder is created, many Linux filesystems allocate 4096 bytes to store some metadata about the directory itself.
This space is increased by a multiple of 4096 bytes as the directory grows.
The du command (with or without the -b option) takes this space into account, as you can see by typing:
mkdir test && du -b test
You will get a result of 4096 bytes for an empty dir.
So, if you put 2 files of 10000 bytes each inside the dir, the total amount given by du -sb would be 24096 bytes.
If you read the question carefully, this is not what was asked. The questioner asked for:
the sum total of all the data in files and subdirectories I would get if I opened each file and counted the bytes
which in the example above should be 20000 bytes, not 24096.
So, the correct answer IMHO could be a blend of Nelson's answer and hlovdal's suggestion to handle filenames containing spaces:
find . -type f -print0 | xargs -0 stat --format=%s | awk '{s+=$1} END {print s}'
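A quick demonstration of the difference described above (the 4096-byte directory overhead depends on the filesystem, so treat the exact numbers as illustrative):
mkdir test && cd test
head -c 10000 /dev/zero > a
head -c 10000 /dev/zero > b
du -sb .
find . -type f -print0 | xargs -0 stat --format=%s | awk '{s+=$1} END {print s}'
Here du -sb typically reports 24096 (20000 bytes of file data plus 4096 for the directory itself), while the find pipeline reports 20000.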
There are at least three ways to get the "sum total of all the data in files and subdirectories" in bytes that work in both Linux/Unix and Git Bash for Windows, listed below in order from fastest to slowest on average. For your reference, they were executed at the root of a fairly deep file system (docroot in a Magento 2 Enterprise installation comprising 71,158 files in 30,027 directories).
1.
$ time find -type f -printf '%s\n' | awk '{ total += $1 }; END { print total" bytes" }'
748660546 bytes
real 0m0.221s
user 0m0.068s
sys 0m0.160s
2.
$ time echo `find -type f -print0 | xargs -0 stat --format=%s | awk '{total+=$1} END {print total}'` bytes
748660546 bytes
real 0m0.256s
user 0m0.164s
sys 0m0.196s
3.
$ time echo `find -type f -exec du -bc {} + | grep -P "\ttotal$" | cut -f1 | awk '{ total += $1 }; END { print total }'` bytes
748660546 bytes
real 0m0.553s
user 0m0.308s
sys 0m0.416s
These two also work, but they rely on commands that don't exist on Git Bash for Windows:
1.
$ time echo `find -type f -printf "%s + " | dc -e0 -f- -ep` bytes
748660546 bytes
real 0m0.233s
user 0m0.116s
sys 0m0.176s
2.
$ time echo `find -type f -printf '%s\n' | paste -sd+ | bc` bytes
748660546 bytes
real 0m0.242s
user 0m0.104s
sys 0m0.152s
If you only want the total for the current directory, then add -maxdepth 1 to find.
Note that some of the suggested solutions don't return accurate results, so I would stick with the solutions above instead.
$ du -sbh
832M .
$ ls -lR | grep -v '^d' | awk '{total += $5} END {print "Total:", total}'
Total: 583772525
$ find . -type f | xargs stat --format=%s | awk '{s+=$1} END {print s}'
xargs: unmatched single quote; by default quotes are special to xargs unless you use the -0 option
4390471
$ ls -l| grep -v '^d'| awk '{total = total + $5} END {print "Total" , total}'
Total 968133
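As for the -maxdepth 1 tip mentioned above, the fastest variant restricted to the current directory only would be:
find -maxdepth 1 -type f -printf '%s\n' | awk '{ total += $1 }; END { print total" bytes" }'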
du is handy, but find is useful if you want to calculate the size of only some files (for example, filtered by extension). Also note that find itself can print the size of each file in bytes. To calculate a total size, we can feed those sizes to the dc command in the following manner:
find . -type f -printf "%s + " | dc -e0 -f- -ep
Here find generates a sequence of commands for dc like 123 + 456 + 11 +.
However, the complete program should look like 0 123 + 456 + 11 + p (remember postfix notation).
So, to get the complete program we need to put 0 on the stack before executing the sequence from stdin, and print the top number after executing it (the p command at the end).
We achieve this via the dc options:
-e0 is just a shortcut for -e '0', which puts 0 on the stack,
-f- reads and executes the commands from stdin (generated by find here),
-ep prints the result (-e 'p').
To print the size in MiB, like 284.06 MiB, we can use -e '2 k 1024 / 1024 / n [ MiB] p' in place of the third option (most spaces are optional).
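Putting the pieces together, the MiB-printing variant of the full command would be:
find . -type f -printf "%s + " | dc -e0 -f- -e '2 k 1024 / 1024 / n [ MiB] p'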
Use:
$ du -ckx <DIR> | grep total | awk '{print $1}'
Where <DIR> is the directory you want to inspect.
The -c option gives you a grand total, which is extracted using the grep total portion of the command, and the count in kilobytes is extracted with the awk command.
The only caveat here is that if you have a subdirectory whose name contains the text "total", it will get spit out as well.
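One way around that caveat (my own variation, relying on the fact that du -c always prints the grand total as its last line) would be:
du -ckx <DIR> | tail -n 1 | awk '{print $1}'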
This may help:
ls -l| grep -v '^d'| awk '{total = total + $5} END {print "Total" , total}'
The above command will sum the sizes of all the files, leaving out the directories.
