Why does Python find different file sizes to Windows? - python-3.x

I'm creating a basic GUI as a college project. It scans a user selected hard drive from their PC and gives them information about it such as the number of files on it etc...
There's a part of my scanning function that, for each file on the drive, takes the size of said file in bytes, and adds it to a running total. At the end of this, after comparing the number to the Windows total, I always find that my Python script finds less data than Windows says is on the drive.
Below is the code...
import os
overall_space_used = 0
def Scan (drive):
global overall_space_used
for path, subdirs, files in os.walk (r"" + drive + "\\"):
for file in files:
overall_space_used = overall_space_used + os.path.getsize(os.path.join(path,file))
print (overall_space_used)
When this is executed on one my HDDs, Python says there is 23,328,445,304 bytes of data in total (21.7 GB). However, when I go into the drive in Windows, it says that there is 23,536,922,624 bytes of data (21.9 GB). Why is there this difference?
I calculated it by hand, and using the same formula that Windows used to convert from bytes to gibibytes (gibibytes = bytes / 1024**3), I still arrived .2 GB short. Why is Python finding less data?

With os.path.getsize(...) you get the actual size of the file.
But NTFS, FAT32,... filesystems use cluster to store data in them, so they aren't filled up fully.
You can see this difference, when you go to the properties of a file, there is a difference between 'size' and 'size on the disk'. Now when you check the file size of the disk, it gives you the size of the used up clusters and not the size of the files added up.
Here some more detailed information:
Why is There a Big Difference Between ‘Size’ and ‘Size on Disk’?

Related

What is a quick way to check if file contents are null?

I have a rather large file (32 GB) which is an image of an SD card, created using dd.
I suspected that the file is empty (i.e. filled with the null byte \x00) starting from a certain point.
I checked this using python in the following way (where f is an open file handle with the cursor at the last position I could find data at):
for i in xrange(512):
if set(f.read(64*1048576))!=set(['\x00']):
print i
break
This worked well (in fact it revealed some data at the very end of the image), but took >9 minutes.
Has anyone got a better way to do this? There must be a much faster way, I'm sure, but cannot think of one.
Looking at a guide about memory buffers in python here I suspected that the comparator itself was the issue. In most non-typed languages memory copies are not very obvious despite being a killer for performance.
In this case, as Oded R. established, creating a buffer from read and comparing the result with a previously prepared nul filled one is much more efficient.
size = 512
data = bytearray(size)
cmp = bytearray(size)
And when reading:
f = open(FILENAME, 'rb')
f.readinto(data)
Two things that need to be taken into account is:
The size of the compared buffers should be equal, but comparing bigger buffers should be faster until some point (I would expect memory fragmentation to be the main limit)
The last buffer may not be the same size, reading the file into the prepared buffer will keep the tailing zeroes where we want them.
Here the comparison of the two buffers will be quick and there will be no attempts of casting the bytes to string (which we don't need) and since we reuse the same memory all the time, the garbage collector won't have much work either... :)

Squashfs check compressed file size

is there any way to check the final size of a specific file after compression in a squashfs filesystem?
I'm looking through mksquashfs/unsquashfs command line options but I can't find anything.
Using the -info option in mksquashfs only prints the size before the compression.
Thanks
This isn't feasible to do with much granularity, because compression is done at block level, not file level.
A file may be marked at starting 50kb into the size of the buffer created by decompressing block 50, and continuing to end 50 bytes into the decompressed block 52 (ignoring fragments here, which are a separate concern) -- but that doesn't let you map back to the position inside the compressed copy of block-50 where that file starts. (You can easily determine the compression ratio for block 51, but you can't easily figure out the ratios for the parts of the file contained in 50 and 52 in our example, because they're shared with other contents).
So the information isn't exposed because it isn't easily available. This actually makes storage of numerous (similar) small files significantly more efficient, because a single compression context is used for all of them (and decompressing a block to retrieve one file may mean that you've got files next to it already decompressed in memory)... but without potentially-unfounded assumptions (such as assuming that all contents within a block share that block's average ratio) it doesn't help with trying to backtrace how well each individual item compressed, because the items aren't compressed individually in the first place.

How to convert fixed size dimension to unlimited in a netcdf file

I'm downloading daily 600MB netcdf-4 files that have this structure:
netcdf myfile {
dimensions:
time_counter = 18 ;
depth = 50 ;
latitude = 361 ;
longitude = 601 ;
variables:
salinity
temp, etc
I'm looking for a better way to convert the time_counter dimension from a fixed size (18) to an unlimited dimension.
I found a way of doing it with the netcdf commands and sed. Like this:
ncdump myfile.nc | sed -e "s#^.time_counter = 18 ;#time_counter = UNLIMITED ; // (currently 18)#" | ncgen -o myfileunlimited.nc
which worked for me for small files, but when dumping a 600 MB netcdf files, takes to much memory and time.
Somebody knows another method for accomplishing this?
Your answers are very insightful. I'm not really looking a way to improve this ncdump-sed-ncgen method, I know that dumping a netcdf file that is 600MB uses almost 5 times more space in a text file (CDL representation). To then modify some header text and generate the netcdf file again, doesn't feels very efficient.
I read the latest NCO commands documentation, and found a option specific to ncks "--mk_rec_dmn". Ncks mainly extracts and writes or appends data to a new netcdf file, then this seems the better approach, extract all the data of myfile.nc and write it with a new record dimension (unlimited dimension) which the "--mk_rec_dmn" does, then replace the old file.
ncks --mk_rec_dmn time_counter myfile.nc -o myfileunlimited.nc ; mv myfileunlimited.nc myfile.nc
To do the opposite operation (record dimension to fixed-size) would be.
ncks --fix_rec_dmn time_counter myfile.nc -o myfilefixedsize.nc ; mv myfilefixedsize.nc myfile.nc
The shell pipeline can only be marginally improved by making the sed step only modify the beginning of the file and pass everything else through, but the expression you have is very cheap to process and will not make a dent in the time spent.
The core problem is likely that you're spending a lot of time in ncdump formatting the file information into textual data, and in ncgen parsing textual data into a NetCDF file format again.
As the route through dump+gen is about as slow as it is shown, that leaves using NetCDF functionality to do the conversion of your data files.
If you're lucky, there may be tools that operate directly on your data files to do changes or conversions. If not, you may have to write them yourself with the NetCDF libraries.
If you're extremely unlucky, NetCDF-4 files are HDF5 files with some extra metadata. In particular, the length of the dimensions is stored in the _netcdf_dim_info dataset in group _netCDF (or so the documentation tells me).
It may be possible to modify the information there to turn the current length of the time_counter dimension into the value for UNLIMITED (which is the number 0), but if you do this, you really need to verify the integrity of the resulting file, as the documentation neatly puts it:
"Note that modifying these files with HDF5 will almost certainly make them unreadable to netCDF-4."
As a side note, if this process is important to your group, it may be worth looking into what hardware could do the task faster. On my Bulldozer system, the process of converting a 78 megabyte file takes 20 seconds, using around 500 MB memory for ncgen working set (1 GB virtual) and 12 MB memory for ncdump working set (111 MB virtual), each task taking up the better part of a core.
Any decent disk should read/sink your files in 10 seconds or so, memory doesn't matter as long as you don't swap, so CPU is probably your primary concern if you take the dump+gen route.
If concurrent memory use is a big concern, you can trade some bytes for space by saving the intermediary result from sed onto disk, which will likely take up to 1.5 gigabytes or so.
You can use the xarray python package's xr.to_netdf() method, then optimise memory usage via using Dask.
You just need to pass names of the dimensions to make unlimited to the unlimited_dims argument and use the chunks to split the data. For instance:
import xarray as xr
ds = xr.open_dataset('myfile.nc', chunks={'time_counter': 18})
ds.to_netcdf('myfileunlimited.nc', unlimited_dims={'time_counter':True})
There is a nice summary of combining Dask and xarray linked here.

Zipping a folder into equal size parts

I've been using 7Zip for a few years now and always liked that I could zip a folder into several parts of a specific size. For example, the website BOX only allows uploads under 100MB so anything I wanted to put into BOX, I just split the zip file into 95MB files. However, recently I've needed to do something similar except instead of breaking into a certain size, I need to split them up into a specific number of files but all equaling the same size. Right now, 7zip breaks them into the max size you allow and the last file is any remaining data ranging from 1KB up to the limit specified.
For example, say I have a 826MB file, I want it to zip up 5 files that are all the same size. Is there any program out there that will do this?
Thanks in advanced!
I don't know of any program that does this, but if this is something that you're doing regularly, you could write a script that:
Finds out the size of the file
Calculates the maximum piece size to use if you want to split it into n pieces.
Constructs a corresponding 7zip command

Purpose of ibs/obs/bs in dd

I have a script that creates file system in a file on a linux machine. I see that to create the file system, it uses 'dd' with bs=x option, reads from /dev/zero and writes to a file. I think usually specifying ibs/obs/bs is useful to read from real hardware devices as one has specific block size constraints. In this case however, as it is reading from virtual device and writing to a file, I don't see any point behind using 'bs=x bytes' option. Is my understanding wrong here?
(Just in case if it helps, this file system is later on used to boot a qemu vm)
To understand block sizes, you have to be familiar with tape drives. If you're not interested in tape drives - for example, you don't think you're ever going to use one - then you can go back to sleep now.
Remember the tape drives from films in the 60s, 70s, maybe even 80s? The ones where the reel went spinning around, and so on? Not your Exabyte or even QIC - quarter-inch cartridge - tapes; your good old fashioned reel-to-reel half-inch tape drives? On those, block size mattered.
The data on a tape was written in blocks. Each block was separated from the next by an inter-record gap.
----+-------+-----+-------+-----+----
... | block | IRG | block | IRG | ...
----+-------+-----+-------+-----+----
Depending on the tape drive hardware and software, there were a variety of problems that could happen. For example, if the tape was written with a block size of 5120 bytes and you read the tape with a block size of 512 bytes, then the tape drive might read the first block, return you 512 bytes of it, and then discard the remaining data; the next read would start on the next block. Conversely, if the tape was written with a block size of 512 bytes and you requested blocks of 5120 bytes, you would get short reads; each read would return just 512 bytes, and if your software wasn't paying attention, you'd be reading garbage. There was also the issue that the tape drive had to get up to speed to read the block, and then slow down. The ASCII art suggests that the IRG was smaller than the data blocks; that was not necessarily the case. And it took time to read one block, overshoot the IRG, rewind backwards to get to the next block, and start forwards again. And if the tape drive didn't have the memory to buffer data - the cheaper ones did not - then you could seriously affect your tape drive performance.
War story: work prepared on newer machine with a slightly more modern tape drive. I wrote a tape using tar without a sensible block size (so it defaulted to 512 bytes). It was a large bit of software - must have been, oh, less than 100 MB in total (a long time ago, in other words). The tape wrote nicely because the machine was modern enough, and it took just a few seconds to do so. But, I had to get the material off the tape on a machine with an older tape drive, one that did not have any on-board buffer. So, it read the material, 512 bytes at a time, and the reel rocked forward, reading one block, and then rocked back all but maybe half an inch, and then read forwards to get to the next block, and then rocked back, and ... well, you could see it doing this, and since it took appreciable portions of a second to read each 512 byte block, the total time taken was horrendous. My flight was due to leave...and I needed to get that data across too. (It was long enough ago, and in a land far enough away, that last minute changes to flights weren't much of an option either.) To cut a long story short, it did get read - but if I'd used a sensible block size (such as 5120 bytes instead of the default of 512), I would have been done much, much quicker and with much less danger of missing the plane (but I did actually catch the plane, with maybe 20 minutes to spare, despite a taxi ride across Paris in the rush hour).
With more modern tape drives, there was enough memory on the drive to do buffering and getting a tape drive to stream - write continuously without reversing - was feasible. It used to be that I'd use a block size like 256 KB to get QIC tapes to stream. I've not done much with tape drives recently - let's see, not this millennium and not much for a few years before that, either; certainly not much since CD and DVD became the software distribution mechanisms of choice (when electronic download wasn't used).
But the block size really did matter in the old days. And dd provided good support for it. You could even transfer data from a tape drive that was written with, say, 4 KB block to another that you wanted to write with, say, 16 KB blocks, by specifying the ibs (input block size) separately from the obs (output block size). Darned useful!
Also, the count parameter is in terms of the (input) block size. It was useful to say 'dd bs=1024 count=1024 if=/dev/zero of=/my/file/of/zeroes' to copy 1 MB of zeroes around. Or to copy 1 MB of a file.
The importance of dd is vastly diminished; it was an essential part of the armoury for anybody who worked with tape drives a decade or more ago.
The block size is the number of bytes that are read and written at a time. Presumably there is a count= option, and that is specified in units of the block size. If there is a skip= or seek= option, those will also be in block size units. However if you are reading and writing a regular file, and there are no disk errors, then the block size doesn't really matter as long as you can scale those parameters accordingly and they are still integers. However certain sizes may be more efficient than others.
For reading from /dev/zero, it doesn't matter. ibs/obs/bs specify how many bytes will be read at a time. It's helpful to choose a number based on the way bytes are read/written in the operating system. For instance, Linux usually reads from a hard drive in 4096 byte chunks. If you have at least some idea about how the underlying hardware reads/writes, it might be a good idea to specify ibs/obs/bs. By the way, if you specify bs, it will override whatever you specify for ibs and obs.
In addition to the great answer by Jonathan Leffler, keep in mind that the bs= option isn't always a substitute for using both ibs= and obs=, in particular for the old [ugly] days of tape drives.
The bs= option reserves the right for dd to write the data as soon as it's read. This can cause you to no longer have identically sized blocks on the output. Here is GNU's take on this, but the behavior dates back as far as I can remember (80's):
(bs=) Set both input and output block sizes to bytes. This makes dd read and write bytes per block, overriding any ‘ibs’ and ‘obs’ settings. In addition, if no data-transforming conv option is specified, input is copied to the output as soon as it’s read, even if it is smaller than the block size.
For instance, back in the QIC days on an old Sun system, if you did this:
tar cvf /dev/rst0c /bla
It would work, but cause an enormous amount of back and forth thrashing while the drive wrote a small block, and tried to backup and read to reposition itself properly for the next write.
If you swapped this with:
tar cvf - /bla | dd ibs=16K obs=16K of=/dev/rst0c
You'd get the QIC drive writing much larger chunks and not thrashing quite so much.
However, if you made the mistake of this:
tar cvf - /bla | dd bs=16K of=/dev/rst0c
You'd run the risk of having precisely the same thrashing you had before depending upon how much data was available at the time of each read.
Specifying both ibs= and obs= precludes this from happening.

Resources