I just "move" image, and its metadata changes... - python-3.x

I simply copied my image and saved it to a temp folder in the current directory. Nothing was modified, yet the image takes up far more "disk space" than its "byte size".
And when I did so, I lost most of my image's metadata, such as location data, device model and F-number; only the color space, alpha channel and dimensions are preserved.
Here is the code I use:
from PIL import Image
import os

image_path = "/Users/moomoochen/Desktop/XXXXX.jpg"
img = Image.open(image_path)
pathname, filename = os.path.split(image_path)
new_pathname = pathname + "/temp"
if not os.path.exists(new_pathname):
    os.makedirs(new_pathname)
img.save(os.path.join(new_pathname, filename))
# If I want to adjust the quality, I do this:
img.save(os.path.join(new_pathname, filename), quality=80)
So my questions are:
1) Why is the byte size different from the disk space?
2) How can I adjust my code so that it preserves all of the image's metadata?

Two things...
You are not actually "simply copying" your file. You are opening it in an image processor, expanding it out to a processable matrix of pixels and then resaving the image to disk - minus anything that your image processor wasn't interested in :-)
If you want to copy the complete file including EXIF data, use shutil like this:
#!/usr/local/bin/python3
from shutil import copyfile
copyfile('source.jpg', 'destination.jpg')
Check the result in Finder: the copy is byte-for-byte identical, so all the metadata is intact.
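If you do need to re-encode with Pillow (for example to change the JPEG quality) but still want to keep the EXIF block, you can hand the original EXIF bytes back to save(). A minimal sketch, assuming the source JPEG actually carries EXIF data in img.info (other metadata such as XMP is not covered by this):

from PIL import Image
import os

image_path = "/Users/moomoochen/Desktop/XXXXX.jpg"
img = Image.open(image_path)

new_pathname = os.path.join(os.path.dirname(image_path), "temp")
os.makedirs(new_pathname, exist_ok=True)

# Pass along the raw EXIF bytes (and ICC profile, if any) read from the source file.
save_kwargs = {"quality": 80}
if "exif" in img.info:
    save_kwargs["exif"] = img.info["exif"]
if "icc_profile" in img.info:
    save_kwargs["icc_profile"] = img.info["icc_profile"]

img.save(os.path.join(new_pathname, os.path.basename(image_path)), **save_kwargs)

Note that this only carries over the blocks you explicitly pass; a byte-for-byte copy with shutil is still the safest way to keep everything.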
Secondly, all "on-disk" filesystems have a minimum allocation unit which means that if your file grows, it will grow by a whole unit, even if you just need 1 more byte of space. Most disks use a 4,096 byte allocation unit, so a 33 byte file will take up 4,096 bytes of space. I must say yours is rather higher than 4k of slack so maybe you are running on a fat RAID that works in big blocks to increase performance?
As an example, you can do this in Terminal:
# Write a file with 1 logical byte
echo -n "1" > file
# Look at file on disk
ls -ls file
8 -rw-r--r-- 1 mark staff 1 15 Nov 08:10 file
Look carefully, the 1 after staff means the logical size is 1 byte and that's what programs get if they read the whole file. But the first 8 on the left is the number of blocks on disk. Each block is 512 bytes, so this 1-byte file takes 8 blocks of 512 bytes, i.e. 4kB on disk!
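You can see the same two numbers from Python with os.stat(): st_size is the logical size, while st_blocks counts the blocks actually allocated. A small sketch, assuming Linux or macOS, where st_blocks is expressed in 512-byte units:

import os

st = os.stat("file")
print("logical size:", st.st_size)          # 1 byte
print("on disk:     ", st.st_blocks * 512)  # 4096 bytes (8 blocks of 512)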

Related

Why does Python find different file sizes to Windows?

I'm creating a basic GUI as a college project. It scans a user-selected hard drive on their PC and gives them information about it, such as the number of files on it, etc...
There's a part of my scanning function that, for each file on the drive, takes the size of said file in bytes, and adds it to a running total. At the end of this, after comparing the number to the Windows total, I always find that my Python script finds less data than Windows says is on the drive.
Below is the code...
import os

overall_space_used = 0

def Scan(drive):
    global overall_space_used
    for path, subdirs, files in os.walk(r"" + drive + "\\"):
        for file in files:
            overall_space_used = overall_space_used + os.path.getsize(os.path.join(path, file))
    print(overall_space_used)
When this is executed on one of my HDDs, Python says there are 23,328,445,304 bytes of data in total (21.7 GB). However, when I look at the drive in Windows, it says there are 23,536,922,624 bytes of data (21.9 GB). Why is there this difference?
I calculated it by hand using the same formula Windows uses to convert bytes to gibibytes (gibibytes = bytes / 1024**3), and I still come up about 0.2 GB short. Why does Python find less data?
With os.path.getsize(...) you get the actual size of the file.
But filesystems such as NTFS and FAT32 store data in clusters, and a file's last cluster is usually not filled completely.
You can see this difference when you open a file's properties: there is a difference between 'size' and 'size on disk'. When you check the used space of the whole drive, Windows reports the total of the allocated clusters, not the sum of the file sizes.
Here is some more detailed information:
Why is There a Big Difference Between ‘Size’ and ‘Size on Disk’?
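One way to get closer to the number Windows reports is to round every file up to a whole number of clusters before adding it to the total. A rough sketch, assuming a 4096-byte allocation unit (the real value depends on how the volume was formatted):

import os

CLUSTER_SIZE = 4096  # assumed; check the volume's actual allocation unit size

def size_on_disk(path, cluster=CLUSTER_SIZE):
    # Round the logical size up to a whole number of clusters.
    size = os.path.getsize(path)
    return ((size + cluster - 1) // cluster) * cluster

def scan(drive):
    total = 0
    for path, subdirs, files in os.walk(drive + "\\"):
        for name in files:
            total += size_on_disk(os.path.join(path, name))
    return total

Even this will not match Windows exactly, since compressed, sparse, and very small files (stored directly in the NTFS MFT) are handled differently.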

DD Image larger than source

I created an image file using dd on my disk /dev/sda, which fdisk says is 500107862016 bytes in size. The resulting image file is 500108886016 bytes, which is exactly 1024000 bytes larger.
Why is the image file 1MB larger than my source disk? Is there something related to the fact that I specified bs=1M in my dd command?
When I restore the image file onto another identical disk, I get "dd: error writing ‘/dev/sda’: No space left on device" error. Is this a problem? Will my new disk be corrupted?
conv=noerror makes dd(1) continue after a read error, which is not what you want here. And conv=sync pads incomplete blocks (mainly the last one) with zeros up to a full block, so those zeros appended to the last block are most likely what makes your image file larger than the actual disk.
You don't need any of the conv options you used. No conversion is going to be made, and dd(1) will simply write a shorter last block if the image doesn't end on a full block boundary (which is the case here).
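A quick arithmetic check backs this up, assuming bs=1M means 1,048,576-byte blocks (as it does for GNU dd):

disk_size  = 500107862016   # size of /dev/sda according to fdisk
image_size = 500108886016   # size of the dd image file
block_size = 1024 * 1024    # bs=1M

partial = disk_size % block_size          # 24576 bytes in the last, partial block
padding = block_size - partial            # 1024000 bytes of zeros added by conv=sync
print(padding == image_size - disk_size)  # True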
Just retry your command with:
dd if=/dev/sda of=yourfile.img
and then
dd if=yourfile.img of=/dev/sdb
If you plan to use a larger buffer size (not needed, as you are reading a block device and the kernel doesn't impose a block size for that), just use a multiple of the sector size that evenly divides the whole disk size. Something like one full track would be absurd today, as modern disks' tracks are purely logical and have no relationship to the actual disk geometry.

when user types size command in linux/unix, what does the result mean?

I've been wondering about the sizes of the bss, data and text segments of my program, so I ran the size command.
The result is
text data bss dec hex filename
5461 580 24 ....
What do the numbers mean? Is the unit bits, bytes, kilobytes or megabytes?
In addition, how can I reduce the size of the bss, data and text sections of the file? (Not by using the strip command.)
That command lists the sections found in an object file and their sizes. The unit is decimal bytes, unless a different output format was requested. There is most likely a man page for the size command too.
"reduce the size" - modify source code. Take things out.
As for the part about reducing segment size, you have some leeway in moving parts from data to bss by not initializing them. This is only an option if the program initializes the data in another way.
You can reduce data or bss by replacing arrays with dynamically allocated memory, using malloc and friends.
Note that the bss takes no space in the executable and reducing it just for the sake of having smaller numbers reported by size is probably not a good idea.

du and size report different values for object files

While compiling a project I notice that du and size command outputs don't add up:
> du -sh X.o
490K X.o
> size X.o
text data bss dec hex filename
2128 0 12 2140 85c X.o
Why is the disk space taken by the object file different from the sum of the text, data and bss segments of the file? What am I missing here?
The size command shows you how much the code and data will take up during execution. The object file consists of much more than that of course.
It begins with the overhead of the file format itself, which would have to contain at least the information that size uses to find out how big each part will be in memory. Then there's symbol tables, debugging information and who knows what (depends on compiler and object file format).
You can get more comprehensive information with objdump -h (or objdump -x to see just how many relocation records there are) which still doesn't cover the overhead, but shows how much actual content there is.
du displays the space the file occupies on the filesystem, whereas its apparent size is the actual number of bytes in the file.
The reason du reports a larger number is that filesystems allocate space in fixed-size blocks, and files rarely fill their last block exactly. For example, if a file is 4096 bytes, both numbers are 4096; but if the file is 5000 bytes, its size is 5000 bytes while du reports 8192.
This is referred to as slack space.
Note: the above assumes the filesystem allocates in units of 4096 bytes.

why is the output of `du` often so different from `du -b`

Why is the output of du often so different from du -b? -b is shorthand for --apparent-size --block-size=1. Using only --apparent-size gives me the same result most of the time, but adding --block-size=1 seems to do the trick. I wonder whether that output is even correct, and which numbers are the ones I want (i.e. the actual file size, if copied to another storage device).
Apparent size is the number of bytes your applications think are in the file. It's the amount of data that would be transferred over the network (not counting protocol headers) if you decided to send the file over FTP or HTTP. It's also the result of cat theFile | wc -c, and the amount of address space that the file would take up if you loaded the whole thing using mmap.
Disk usage is the amount of space that can't be used for something else because your file is occupying that space.
In most cases, the apparent size is smaller than the disk usage because the disk usage counts the full size of the last (partial) block of the file, and apparent size only counts the data that's in that last block. However, apparent size is larger when you have a sparse file (sparse files are created when you seek somewhere past the end of the file, and then write something there -- the OS doesn't bother to create lots of blocks filled with zeros -- it only creates a block for the part of the file you decided to write to).
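You can reproduce the sparse-file case from Python too: seek far past the end of a new file, write one byte, and compare the apparent size with the allocated blocks. A small sketch for Linux/macOS (where st_blocks is in 512-byte units); the filename f_sparse is just an example:

import os

# Create a sparse file: seek 1 GiB into an empty file, then write one byte.
with open("f_sparse", "wb") as f:
    f.seek(1024 ** 3)
    f.write(b"x")

st = os.stat("f_sparse")
print("apparent size:", st.st_size)          # 1073741825 bytes
print("disk usage:   ", st.st_blocks * 512)  # only a few blocks actually allocated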
Minimal block granularity example
Let's play a bit to see what is going on.
mount tells me I'm on an ext4 partition mounted at /.
I find its block size with:
stat -fc %s .
which gives:
4096
Now let's create some files with sizes 1 4095 4096 4097:
#!/usr/bin/env bash
for size in 1 4095 4096 4097; do
    dd if=/dev/zero of=f bs=1 count="${size}" status=none
    echo "size ${size}"
    echo "real $(du --block-size=1 f)"
    echo "apparent $(du --block-size=1 --apparent-size f)"
    echo
done
and the results are:
size 1
real 4096 f
apparent 1 f
size 4095
real 4096 f
apparent 4095 f
size 4096
real 4096 f
apparent 4096 f
size 4097
real 8192 f
apparent 4097 f
So we see that anything below or equal to 4096 takes up 4096 bytes in fact.
Then, as soon as we cross 4096 (at size 4097), it goes up to 8192, which is 2 * 4096.
It is clear then that the disk always stores data at a block boundary of 4096 bytes.
What happens to sparse files?
I haven't investigated what the exact on-disk representation is, but it is clear that --apparent-size does take it into account.
This can lead to apparent sizes being larger than the actual disk usage.
For example:
dd seek=1G if=/dev/zero of=f bs=1 count=1 status=none
du --block-size=1 f
du --block-size=1 --apparent f
gives:
8192 f
1073741825 f
Related: How to test if sparse file is supported
What to do if I want to store a bunch of small files?
Some possibilities are:
use a database instead of filesystem: Database vs File system storage
use a filesystem that supports block suballocation
Bibliography:
https://serverfault.com/questions/565966/which-block-sizes-for-millions-of-small-files
https://askubuntu.com/questions/641900/how-file-system-block-size-works
Tested in Ubuntu 16.04.
Compare (for example) du -bm to du -m.
The -b sets --apparent-size --block-size=1,
but then the m overrides the block-size to be 1M.
Similar for -bh versus -h:
the -bh means --apparent-size --block-size=1 --human-readable, and again the h overrides that block-size.
Files and folders have their real size and their size on disk.
--apparent-size reports the real size of a file or folder.
Size on disk is the number of bytes the file or folder actually takes up on disk, which is what plain du reports.
If you find that the apparent size is almost always several magnitudes higher than the disk usage, it means you have a lot of (`sparse') files, or files with internal fragmentation or indirect blocks.
Because by default du gives disk usage, which is the same as or larger than the file size. As the man page says under --apparent-size:
print apparent sizes, rather than disk usage; although the apparent size is usually smaller, it may be
larger due to holes in (`sparse') files, internal fragmentation, indirect blocks, and the like