Deleting original files as you go along adding files to a TAR file - python-3.x

I've written a small server function which is intended to tar together a bunch of locally downloaded files, then delete the originals. It looks something like this:
with tarfile.open(archive_filename, "w:gz") as tar:
    for pb in designated_objects:
        bucket.download_file(pb.key, pb.key)
        tar.add(pb.key)
        os.remove(pb.key)
My expectation is that this will generate a tarfile with all of my desired data and an otherwise empty directory. The idea here is that I would like to minimize my disc usage as much as possible. However, I'm unsure if deleting a file before the tarfile is finished being generated (as done here) is allowed.
Will this expression work as expected?
If it will not, is there something akin to an append mode that will?

As expected, the original files are populated, then deleted. However, the behavior of the archive is unusual. When this code block is run, no archive is generated. In fact, this code block will do nothing at all (except delete your files).
I find this behavior particularly unusual and surprising given the fact that taking a pass inside the with statement (as in the code that follows) will actually write an empty archive to disc. So in a sense, the given code block does even less than nothing!
with tarfile.open('archive_filename.xy.gz', "w:gz") as tar:
    pass
For reference, this behavior is what I get with Python 3.6. Behavior with other versions of Python may differ.
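If you want to check the behavior on your own Python version, a self-contained sketch like the following may help. It swaps the bucket download for locally created files (an assumption, since the bucket setup is not shown), and the names `archive_and_delete` and `demo` are hypothetical. `tarfile.add` reads a regular file's contents at the time of the call, so the sketch removes each original immediately afterwards and then reopens the archive to see what ended up in it.

```python
import os
import tarfile
import tempfile


def archive_and_delete(paths, archive_filename):
    """Add each file to a gzip-compressed tar, removing the original right after."""
    with tarfile.open(archive_filename, "w:gz") as tar:
        for path in paths:
            # arcname stores only the base name inside the archive
            tar.add(path, arcname=os.path.basename(path))
            os.remove(path)


def demo():
    """Create three small files, archive them, and report what is left."""
    workdir = tempfile.mkdtemp()
    paths = []
    for i in range(3):
        p = os.path.join(workdir, "file%d.txt" % i)
        with open(p, "w") as fh:
            fh.write("payload %d\n" % i)
        paths.append(p)
    archive = os.path.join(workdir, "bundle.tar.gz")
    archive_and_delete(paths, archive)
    # Reopen the finished archive and list its members
    with tarfile.open(archive, "r:gz") as tar:
        names = sorted(tar.getnames())
    leftovers = sorted(os.listdir(workdir))
    return names, leftovers
```

Calling `demo()` returns the member names found in the archive and the files still present in the working directory, which you can compare against what you observe with the original code.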

How do I get the filename of an open std::fs::File in Rust?

I have an open std::fs::File, and I want to get its filename, e.g. as a PathBuf. How do I do that?
The simple solution would be to just save the path used in the call to File::open. Unfortunately, this does not work for me. I am trying to write a program that reads log files, and the program that writes the logs keeps changing the filenames as part of its log rotation. So the file may very well have been renamed since it was opened. This is on Linux, so renaming open files is possible.
How do I get around this issue, and get the current filename of an open file?
On a typical Unix filesystem, a file may have multiple filenames at once, or even none at all. The file metadata is stored in an inode, which has a unique inode number, and this inode number can be linked from any number of directory entries. However, there are no reverse links from the inode back to the directory entries.
Given an open File object in Rust, you can get the inode number by calling metadata() on it and then ino() on the result (from the std::os::unix::fs::MetadataExt trait). If you know the directory the log file is in, you can use std::fs::read_dir() to iterate over all entries in that directory; on Unix each entry also has an ino() method (via std::os::unix::fs::DirEntryExt), so you can find the one(s) matching your open file. Of course this approach is subject to race conditions: the directory entry may already be gone again once you try to do anything with it.
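The inode-matching idea can be sketched as follows, written in Python rather than Rust purely for illustration. The function name `names_for_open_file` is hypothetical, and the sketch assumes a Unix filesystem and that you already know which directory to search. It also compares the device number, since inode numbers are only unique within one filesystem.

```python
import os


def names_for_open_file(fileobj, directory):
    """Return the names in `directory` that refer to the same inode as `fileobj`."""
    target = os.fstat(fileobj.fileno())
    matches = []
    with os.scandir(directory) as entries:
        for entry in entries:
            try:
                st = entry.stat(follow_symlinks=False)
            except FileNotFoundError:
                # The entry disappeared between listing and stat (the race noted above)
                continue
            if st.st_ino == target.st_ino and st.st_dev == target.st_dev:
                matches.append(entry.name)
    return sorted(matches)
```

The result is a list because a file can have several hard links in the same directory, or none at all if it has been unlinked.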
On Linux, file handles held by the current process can be found under /proc/self/fd. These look and act like symlinks to the original files (though I think they may technically be something else - perhaps someone who knows more can chip in).
You can therefore recover the (possibly changed) file name by constructing the correct path in /proc/self/fd using your file descriptor, and then following the symlink back to the filesystem.
This snippet shows the steps:
use std::fs::read_link;
use std::os::unix::io::AsRawFd;
use std::path::PathBuf;
// if f is your std::fs::File
// first construct the path to the symlink under /proc
let path_in_proc = PathBuf::from(format!("/proc/self/fd/{}", f.as_raw_fd()));
// ...and follow it back to the original file
let new_file_name = read_link(path_in_proc).unwrap();
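For comparison, the same lookup can be sketched in Python. The helper name `current_path_of` is hypothetical, and the approach is Linux-only because it relies on /proc/self/fd being available.

```python
import os


def current_path_of(fileobj):
    """Linux-only: resolve the /proc/self/fd entry for an open file object."""
    return os.readlink("/proc/self/fd/%d" % fileobj.fileno())
```

If the file has been deleted while still open, the link target carries a " (deleted)" suffix, so the returned string should not be treated as a usable path without checking.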

LD_PRELOAD with file functions

I have a rather peculiar file format to work with:
Every line begins with the checksum of its content, followed by a new-line-character.
It looks like this:
[CHECKSUM OF LINE_1][LINE_1]\n
[CHECKSUM OF LINE_2][LINE_2]\n
[CHECKSUM OF LINE_3][LINE_3]\n
...
My goal: To allow any application to work with these files like they would work with any other text file - unaware of the additional checksums at the beginning of each line.
Since I work on a Linux machine with Debian wheezy (kernel 3.18.26), I want to use the LD_PRELOAD mechanism to override the relevant file functions.
I have seen something like this with zlibc on https://zlibc.linux.lu/index.html - with an explanation of how it works ( https://zlibc.linux.lu/zlibc.html#SEC8 ).
But I don't get it. They only replace the file-opening functions. No read. No write. No fseek. Nothing. So how does it work?
Or - which functions would I have to intercept to handle every read or write operation on this file and handle them accordingly?
I didn't exactly check how it works but the reason seems to be quite simple.
Possible implementation:
zlibc open:
  - uncompress the file you wanted to open to some temporary file
  - open this temporary file instead of yours
zlibc close:
  - compress the temporary file
  - overwrite the original file
In this case you don't need to override read/write/etc because you can use original ones.
In your case you have two possible solutions:
1. An open that makes a copy of your file with the checksums stripped, and a close that recalculates the checksums and overwrites the original file.
2. A read and write that are able to skip/calculate the checksums.
Ad 2.
From What is the difference between read() and fread()?:
fread() is part of the C library, and provides buffered reads. It is
usually implemented by calling read() in order to fill its buffer
In this case I believe that overriding open and close will be less error prone because you can safely reuse original read, write, fread, fseek etc.
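The open/close approach can be sketched in Python as a context manager. This only illustrates the copy-strip-rewrite logic, not the LD_PRELOAD interception itself. The checksum format is not specified in the question, so the sketch assumes an 8-character hexadecimal CRC32 prefix on each line; `CHECKSUM_LEN`, `checksum`, and `open_plain` are all hypothetical names.

```python
import os
import tempfile
import zlib
from contextlib import contextmanager

CHECKSUM_LEN = 8  # assumption: each line starts with 8 hex characters of CRC32


def checksum(body):
    """Return the assumed checksum prefix for one line body (bytes, no newline)."""
    return format(zlib.crc32(body), "08x").encode("ascii")


@contextmanager
def open_plain(path):
    """Yield the path of a temporary copy without checksums.

    On normal exit the temporary copy is written back to `path`
    with freshly computed checksums, then removed.
    """
    fd, tmp_path = tempfile.mkstemp()
    try:
        # "open": strip the checksum prefix from every line
        with os.fdopen(fd, "wb") as tmp, open(path, "rb") as src:
            for raw in src:
                tmp.write(raw[CHECKSUM_LEN:])
        yield tmp_path
        # "close": recompute checksums and overwrite the original
        with open(tmp_path, "rb") as tmp, open(path, "wb") as dst:
            for raw in tmp:
                body = raw.rstrip(b"\n")
                dst.write(checksum(body) + body + b"\n")
    finally:
        os.remove(tmp_path)
```

Inside the `with open_plain(path) as plain:` block, any ordinary file operation on `plain` works unchanged, which is the point of intercepting only open and close. If the block raises, the original file is left untouched.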

When I create a Temporary File/Directory, when will it be removed?

Julia contains a number of methods for making temporary files and directories.
I'm making fairly heavy use of them (and /dev/shm), to interface with libraries that really want to work with actual files (JLD/HDF5, and OpenStack Swift).
I had been assuming they would be deleted when the finalisers on the pointer to their name were called.
But then after exiting julia it seemed like they were all still there.
Will linux delete them?
If the app didn't clean up after itself, the OS will delete the files eventually. When temp files are deleted depends on system settings. For example, it can happen on boot, nightly (via a cron job), or some other way.
See this answer, for example: How is the /tmp directory cleaned up?
What you are likely looking for, given your surprise that they were not removed on going out of scope, are the do-block versions of mktemp and mktempdir. They are described in the very documentation you linked:
mktemp(f::Function[, parent=tempdir()])
Apply the function f to the result of mktemp(parent) and remove the temporary file upon completion.
mktempdir(f::Function[, parent=tempdir()])
Apply the function f to the result of mktempdir(parent) and remove the temporary directory upon completion.
Which you can use like:
mktempdir("/dev/shm") do tdir
    fname = joinpath(tdir, name)
    # Do some things with your new temp filename `fname` in your tempdir `tdir`
end
# the directory referenced by `tdir`, and `fname`, have now been deleted.
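For readers who know Python better than Julia, the same scoped-cleanup pattern can be sketched with `tempfile.TemporaryDirectory`. This is a comparison only, not Julia code; the function name `work_in_tempdir` and the file name `scratch.dat` are hypothetical.

```python
import os
import tempfile


def work_in_tempdir(parent=None):
    """Create a file in a scoped temporary directory; report its path and whether it existed."""
    with tempfile.TemporaryDirectory(dir=parent) as tdir:
        fname = os.path.join(tdir, "scratch.dat")  # hypothetical file name
        with open(fname, "w") as fh:
            fh.write("intermediate data")
        existed_inside = os.path.exists(fname)
    # Leaving the with-block removes the directory and everything in it
    return tdir, existed_inside
```

As with the do-block form, cleanup is tied to leaving the block rather than to garbage collection or finalisers.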

Cannot zip files with the same name?

I could not believe this: it seems that the zip specification does not allow two different files with the same file name going into one zip file.
In my case I use an external file to specify all the files I wanna zip.
This could look like this:
../Website1/favicon.ico
../Website2/favicon.ico
and there we are, that's not possible, despite keeping the directory structure. You would expect the stored name to be <../Website1/favicon.ico>, path included, but that does not seem to be the case; I get:
"Invalid ZIP request (cannot repeat names in Zip file)"
with WinZip. I tried the same with 7Zip - same result.
Strangely, googling did not turn up many hits that really fit, but those I found seem to confirm my findings. That's hard to believe, since this limitation is very severe. I actually struggle to understand why this did not hit me a couple of decades earlier.
Am I overlooking something very basic here?
To be precise:
Adding these two files:
C:\Temp\Website1\FavIcon
C:\Temp\Website2\FavIcon
results in a single file; the last Add wins...
This however:
Website1\FavIcon
Website2\FavIcon
results in a zip file that contains both files.
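The relative-path variant can be sketched with Python's `zipfile` module, where the stored name is set explicitly through `arcname`. This illustrates the idea only and says nothing about how WinZip or 7-Zip derive names from a list file; `zip_with_relative_names` and `demo` are hypothetical names, and the demo creates its own placeholder files.

```python
import os
import tempfile
import zipfile


def zip_with_relative_names(base_dir, relative_paths, zip_path):
    """Store each file under its path relative to base_dir."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        for rel in relative_paths:
            zf.write(os.path.join(base_dir, rel), arcname=rel)


def demo():
    """Zip two same-named files from different folders and list the stored names."""
    base = tempfile.mkdtemp()
    rels = []
    for site in ("Website1", "Website2"):
        os.makedirs(os.path.join(base, site))
        rel = os.path.join(site, "favicon.ico")
        with open(os.path.join(base, rel), "wb") as fh:
            fh.write(site.encode("ascii"))  # placeholder content
        rels.append(rel)
    zip_path = os.path.join(base, "sites.zip")
    zip_with_relative_names(base, rels, zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()
```

Because the stored names include the folder component, the two entries no longer collide.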

save MATLAB code file along with results in one folder?

I'm processing a data set and running into a problem - although I xlswrite all the relevant output variables to a big Excel file that is timestamped, I don't save the code that actually generated that result. So if I try to recreate a certain set of results, I can't do it without relying on memory (which is obviously not a good plan). I'd like to know if there's a command(s) that will help me save the m-files used to generate the output Excel file, as well as the Excel file itself, in a folder I can name and timestamp so I don't have to do this manually.
In my perfect world I would run the master code file that calls 4 or 5 other function m-files, then all those m-files would be saved along with the Excel output to a folder named results_YYYYMMDDTIME. Does this functionality exist? I can't seem to find it.
There's no such functionality built in.
You could build a dependency tree of your main function by using depfun with mfilename.
depfun(mfilename()) will return a list of all functions/m-files that are called by the currently executing m-file.
This will include all files that ship with MATLAB as builtins; you might want to remove those (and only record the MATLAB version in your Excel sheet).
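As a sketch (your_folder is a placeholder for the results folder):
% get all files:
% (depfun was removed in later MATLAB releases; matlab.codetools.requiredFilesAndProducts replaces it)
dependencies = depfun(mfilename());
for k = 1:numel(dependencies)
    % skip files that ship with MATLAB itself
    if isempty(strfind(dependencies{k}, matlabroot))
        copyfile(dependencies{k}, your_folder);
    end
end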
As a "long term" solution you might want to check if using a version control system like subversion, mercurial (or one of many others) would be applicable in your case.
In larger projects this is the preferred way to record the version of the source code used to produce a certain result.
