I'm designing a system to process plain text files. One of its features will be to move files to an archive server once they've been completely processed through the system. What I want to do is tag a text file once it's been completely processed by the system, i.e. give it a system seal of approval or marker. The reason for this is that I want the same system to be able to analyze the text file later, search for this hidden marker, and identify the file as having been processed in the past. At the same time, I want this marker to be ignored by any other system that might be handling the file.
I was thinking of having a unique key that only this system uses and has access to, then creating a procedure for hashing and salting that key and placing the result within the text file before it gets moved to its final destination. I'm curious about any other techniques for creating a hidden seal or marker. So to summarize:
Can I create a set or string of encoded bits and place them in a text file?
Can these bits be hidden within the text file such that they are ignored by any other system that might handle this text file?
I'd appreciate any insight or feedback.
Personally, I would avoid modifying the original content; to my knowledge, an ASCII text file can't be signed in a way that prevents other applications from seeing the signature.
Instead, I would take the MD5 of the file and maintain the "processed" hashes separately from those that have not yet been processed.
A Map<MD5, FileName> is a structure to consider. You should be able to write code that retrieves entries by either MD5 or file name.
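A minimal sketch of that structure in Java (the names are illustrative, and persisting the map is left out):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class ProcessedRegistry {
    // Map<MD5, FileName>; a second map keyed by file name would give
    // lookups in the other direction. Persistence is left out here.
    private final Map<String, String> processedByHash = new HashMap<>();

    // Hex-encoded MD5 of the file's contents.
    static String md5Of(Path file) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(Files.readAllBytes(file));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    void markProcessed(Path file) throws Exception {
        processedByHash.put(md5Of(file), file.getFileName().toString());
    }

    boolean isProcessed(Path file) throws Exception {
        return processedByHash.containsKey(md5Of(file));
    }
}
```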
Hope it helps.
Hiding data inside another file is called Steganography. It can be done with ASCII files, but it is usually more easily done with data or image files.
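For illustration only, one classic trick for ASCII files encodes bits as trailing whitespace at line ends (space = 0, tab = 1). This sketch is my own, and fragile, since many editors and tools strip trailing whitespace:

```java
public class WhitespaceStego {
    // Encode one bit per line as trailing whitespace: space = 0, tab = 1.
    // Fragile: many editors and formatters strip trailing whitespace.
    static String embed(String text, boolean[] bits) {
        String[] lines = text.split("\n", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            out.append(lines[i]);
            if (i < bits.length) out.append(bits[i] ? '\t' : ' ');
            if (i < lines.length - 1) out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String tagged = embed("line one\nline two\n", new boolean[]{true, false});
        // Visualize the hidden whitespace for the demo.
        System.out.println(tagged.replace("\t", "<TAB>").replace(" \n", "<SP>\n"));
    }
}
```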
In your particular case, a parallel register, or metadata store, of processed files would seem to be a better fit. Using a good hash, MD5 or better, is fine as long as you do not expect deliberate malicious attacks; a malicious attacker can easily calculate the correct hash value for an altered file. If you do have to resist such attacks, you need a keyed construction such as HMAC-MD5 or HMAC-SHA-256.
I work with a tool that contains everything within XML inside the database.
Some reports stored in the database are loaded with a third-party tool, which stores the main data that configures the 'report' definition in a format that is not human-readable.
I'd post it here, but it's some 130,000 bytes.
I have attempted to decode it using popular methods that I assumed it might have been encoded with, such as base64, base32, etc., but none of them have been able to decode the string.
Is there a way to identify what encoding a given string has, using a tool available online?
I don't have the benefit of access to the developer that built this functionality, the source code generating this string, or any documentation on it.
To give some context around what I'm trying to do - I need to reverse-engineer how a specific definition in a system is generated, so that it can be modified slightly (manually) in a text editor to support an operation that would otherwise require manually re-creating the report.
I apologize if this is the wrong Stack Exchange site for this question - I realize it's not specific to a 'programming' issue and I haven't tried to solve it using a programming language. If so, please redirect me to the appropriate place and I'll be happy to ask there instead.
Update: The text consists of strictly A-Z, 0-9 characters.
You can check against known encoding formats with this tool, but only if you are sure the data is not encrypted.
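If you'd rather rule candidates out programmatically, a small elimination probe is easy to write. This sketch uses only the JDK, which ships a base64 decoder but no base32 decoder, so it can only narrow things down:

```java
import java.util.Base64;

public class EncodingProbe {
    public static void main(String[] args) {
        String s = args[0];
        try {
            byte[] b = Base64.getDecoder().decode(s);
            System.out.println("valid base64 (" + b.length + " bytes) - a candidate, not proof");
        } catch (IllegalArgumentException e) {
            System.out.println("not valid base64");
        }
        // Strictly A-Z / 0-9 is also consistent with base32 (A-Z, 2-7) or
        // base36; the JDK has no base32 decoder, so that check needs a library.
        if (s.matches("[0-9A-Fa-f]+") && s.length() % 2 == 0) {
            System.out.println("could be hex");
        }
    }
}
```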
I've been trying to figure out the best way to encrypt big (several GB) files on the file system for later access.
I've been experimenting with several modes of AES (particularly CBC and GCM) and there are some pros and cons I've found on each approach.
After researching and asking around, I came to the conclusion that, at least at this moment, using AES+GCM is not feasible for me, mostly because of the issues it has in Java and the fact that I can't use BouncyCastle.
So I am writing this to talk about the protocol I'm going to be implementing to complete the task. Please provide feedback as you see fit.
Encryption
Using AES/CBC/PKCS5Padding with 256 bit keys
The file will be encrypted using a custom CipherOutputStream. This output stream will take care of writing a custom header at the beginning of the file, which will consist of at least the following (a sketch of such a header follows below):
A few magic bytes to easily tell that the file is encrypted
IV
Algorithm, mode and padding used
Size of the key
The length of the header itself
While the file is being encrypted, it will also be digested to calculate its authentication tag.
When the encryption ends, the tag will be appended at the end of the file. The tag is of a known size, so it is easy to recover later.
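As a rough sketch of what writing such a header could look like (the magic value, field order, and class name here are my own assumptions, not an established format):

```java
import java.io.DataOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Illustrative header layout; every field name and the magic value
// are assumptions for the sketch, not an established format.
class EncryptedFileHeader {
    static final byte[] MAGIC = {'E', 'N', 'C', '1'};

    static void write(OutputStream out, String transformation,
                      int keyBits, byte[] iv) throws Exception {
        byte[] algo = transformation.getBytes(StandardCharsets.UTF_8);
        // magic + headerLen + algoLen + algo + keyBits + ivLen + iv
        int headerLen = MAGIC.length + 4 + 4 + algo.length + 4 + 4 + iv.length;
        DataOutputStream dos = new DataOutputStream(out);
        dos.write(MAGIC);          // "is this file encrypted?" marker
        dos.writeInt(headerLen);   // length of the header itself
        dos.writeInt(algo.length);
        dos.write(algo);           // e.g. "AES/CBC/PKCS5Padding"
        dos.writeInt(keyBits);     // size of the key
        dos.writeInt(iv.length);
        dos.write(iv);             // the IV
        dos.flush();
    }
}
```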
Decryption
A custom CipherInputStream will be used. This stream knows how to read the header.
It will then read the authentication tag, and will digest the whole file (without decrypting it) to validate that it has not been tampered with (I haven't actually measured how this will perform; however, it's the only way I can think of to safely start decryption without the risk of learning too late that the file should not have been decrypted in the first place).
If the validation of the tag is ok, then the header will provide all the information needed to initialize the cipher and make the input stream decrypt the file. Otherwise it will fail.
Does this seem OK to you as a way to handle encryption/decryption of big files?
Some points:
A) Hashing of the encrypted data, with the hash not encrypted itself.
One of the possible things a malicious user M could do without any hash: overwrite the encrypted file with something else. M doesn't know the key, the plaintext before, or the plaintext after this action, but he can still change the plaintext to something different (usually it becomes garbage data). Destruction is a perfectly valid goal for some attackers.
The "good" user with the key can still decrypt the file without errors, but the result won't be the original plaintext. Garbage output is harmless if (and only if) you know for sure what should be inside, i.e. how to recognize that it is unchanged. But do you know that in every case? And there's a small chance that the "garbage" data actually makes sense while still not being the real content.
So, to recognize whether the file was changed, you add a SHA hash of the encrypted data.
And if the evil user M overwrites the encrypted file part, what will he do with the hash? Right, he can recalculate it so that it matches the new encrypted data. Once again, you can't recognize changes.
If the plaintext is hashed and then everything is encrypted, getting it right is pretty much impossible. Remember, M doesn't know the key or anything. M can change the plaintext inside to "something", but can't change the hash to the correct value for that something.
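A minimal sketch of that order of operations, assuming `cipher` is an already-initialized AES/CBC/PKCS5Padding Cipher (the class and method names are illustrative):

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.security.MessageDigest;
import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;

class InnerHashEncryptor {
    // Streams the plaintext through the cipher while digesting it, then
    // appends the plaintext hash *inside* the ciphertext, where an
    // attacker without the key cannot fix it up after tampering.
    static void encrypt(InputStream plain, OutputStream dest, Cipher cipher)
            throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        try (CipherOutputStream cos = new CipherOutputStream(dest, cipher)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = plain.read(buf)) != -1) {
                sha.update(buf, 0, n);   // digest the plaintext...
                cos.write(buf, 0, n);    // ...while encrypting it
            }
            cos.write(sha.digest());     // encrypted trailer: SHA-256 of plaintext
        }
    }
}
```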
B) CBC
CBC is fine if you decrypt the whole file or nothing every time.
If you want to access parts of it without decrypting the unused parts, look at XTS.
C) Processing twice
It will then read the authentication tag, and will digest the whole file (without decrypting it) to validate that it has not been tampered with (I haven't actually measured how this will perform; however, it's the only way I can think of to safely start decryption without the risk of learning too late that the file should not have been decrypted in the first place).
Depending on how the files are used, this is indeed necessary, especially if you want to start consuming the data during the decryption pass, before the whole file has been processed.
I don't know the details of the Java CipherOutputStream, but besides that and the points mentioned, it looks fine to me.
I am making an application that will save information about certain files, and I was wondering what the best way to keep track of those files is. I was thinking of using the absolute path for a file, but that could change if the file is renamed. I found that if you run ls -i, each file has an id beside it that is unique(?). Is it ok to use that as a unique file id?
The inode is unique per device, but I would not recommend using it: imagine your box crashes and you move all the files to a new file system; now all your files have new ids.
It really depends on your language of choice, but almost all of them include a library for generating UUIDs. While collisions are theoretically possible, it's a veritable non-issue. Generate the UUID, prepend it to the front of your file, and you are in business. As your implementation grows, it will also allow you to build a hash-table index of your files for quick lookups later.
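A minimal sketch of the tag-and-index idea in Java (prepending the UUID as a first line is my own illustrative choice):

```java
import java.nio.file.*;
import java.util.UUID;

public class UuidTagger {
    // Prepend a freshly generated UUID as the file's first line.
    public static String tag(Path file) throws Exception {
        String id = UUID.randomUUID().toString();
        byte[] original = Files.readAllBytes(file);
        Files.write(file, (id + System.lineSeparator()).getBytes());
        Files.write(file, original, StandardOpenOption.APPEND);
        return id;   // use as the key in your hash-table index
    }
}
```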
The question is, "unique across what?"
If you need something unique on a given machine at a given point in time, then yes, the inode number + device number is nearly always unique. These can be obtained from stat() or similar in C, or os.stat() in Python. However, if you delete a file and create another, the inode number may be reused. Also, two different hosts may have completely different ideas of what the (device, inode) pairs are.
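If you happen to be on the JVM, the closest portable equivalent is NIO's file key, which on Unix-like systems wraps the (device, inode) pair; a small sketch:

```java
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

public class FileKeyDemo {
    public static void main(String[] args) throws Exception {
        BasicFileAttributes attrs =
                Files.readAttributes(Path.of(args[0]), BasicFileAttributes.class);
        // On Unix this wraps (device, inode); it may be null on some platforms.
        System.out.println("file key: " + attrs.fileKey());
    }
}
```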
If you need something to describe the file's content (so two files with the same content have the same id), you might look into one of the SHA or RIPEMD functions. This will be pretty unique - the odds of an accidental collision are astronomically low.
If you need some other form of uniqueness, please elaborate.
I am trying to store a plist and several binary files (let's say images) as part of a UIManagedDocument. The names of the binary files are an attribute in Core Data, and I don't need to enumerate them, just access the right one when showing the related entity.
The file structure that I want to have is:
- <File yyyyMMdd-HHmmss>.extdoc
- StoreContent
- persistentStore
- AdditionalContent
- ListStatus.plist (used to store per document defaults)
- Images
- uuid1.png
- uuid2.png
- ...
- uuidn.png
So far, I have successfully followed the instructions in How do I save additional content into my UIManagedDocument file packages?, but when I try to add the binary files there are some things that I don't know how to do.
Should I treat the URL /the/path/File yyyyMMdd-HHmmss.extdoc/AdditionalContent (the default one provided by readAdditionalContentFromURL:error:) as an NSFileWrapper? Are there any advantages/disadvantages versus just using the URLs? I find the file wrapper more complicated to use, since the plist has to be read using the file wrapper accessors and NSCoder (I guess), and for the files I have to store the file wrapper for the Images directory and then obtain the corresponding node with objectForKey (I assume). But Apple's Document-Based Apps Programming Guide for iOS, regarding custom formats instead of NSData or NSFileWrapper, states: "Keep in mind that your code will have to duplicate what UIDocument does for you, and so you must deal with greater complexity and a greater possibility of error." Am I misunderstanding this?
Per-document defaults are declared as properties: the setter modifies the NSDictionary that maps to the plist and marks the document as updated, and the getter accesses the dictionary with the proper key. How do I expose the ability to read/write the binary files? Should I add methods like - (void)writeImage:(NSString *)uuid; and - (UIImage *)readImage:(NSString *)uuid; to my subclass of UIManagedDocument? And should I keep this data in memory until the document is saved? How?
Assuming that NSFileWrapper is the way to go, if I plan to use this document with iCloud should I use file coordinators with the file wrapper? If so, how?
Any source code for each question will be greatly appreciated. Thank you.
P.S.: I know that I could save some binary data inside Core Data, but I don't feel comfortable with that solution. Among other reasons, I'd rather store the PNG data for image files than a serialized version of UIImage that won't be compatible with NSImage if I want to create a desktop app.
I'd like to say that, in general, I rather like UIManagedDocument. It has a few advantages over raw Core Data. For example, it sets up the entire Core Data stack for you automatically. It also sets up nested managed object contexts for you, so you get free background saving. None of that is particularly earth-shattering, but it's a lot of functionality from a tiny amount of code.
I haven't played around with saving additional information...but here are my thoughts.
First, you shouldn't need to treat the new URL as a file wrapper. You should just be able to do regular file operations on the provided URL. Just make sure you have everything implemented properly in additionalContentForURL:error:, writeAdditionalContent:toURL:originalContentsURL:error:, and readAdditionalContentFromURL:error:. The read and write operations need to be symmetric. And you should probably snapshot your data in additionalContentForURL:error: so that everything will be saved in a known, good state (since the save operations are asynchronous).
As an alternative, have you considered using the Store in External Record File flag in your data model instead of saving the data manually? Depending on the size of the binary data, this lets Core Data store it externally and automatically. I looked at the release notes and didn't see anything saying you couldn't use this feature with iCloud. That might be the easiest fix.
Attacking a side point for the moment (as I have not had ANY good experience with UIManagedDocument).
You can save the binary data inside Core Data for an iOS 5.0+ application using the external file reference. You can then save the PNG of the image to Core Data directly and not need to worry about a UIManagedDocument or about bloating the sqlite file.
There is nothing stopping you from storing the PNG instead of a UIImage.
One other thought: you may need to use an NSFileCoordinator for the read and write operations. Technically, any read or write operation in the iCloud container needs to use a file coordinator (to coordinate with the iCloud sync service; this prevents accidentally corrupting a file by reading it while another process is writing to it).
I know that UIDocument wraps most of its input and output methods in file coordination automatically. I'd guess that these methods are similarly wrapped (since they give you a URL to use); however, the docs aren't very clear.
I need to delete my input files securely once I have finished with them. At the moment I'm overwriting all the data with zeros, but this is messy: my temp folder becomes full of old files, and the names of the files are a security issue in themselves.
Rather than just moving them to the recycle bin, I would like them to skip it and simply disappear, in conjunction with being wiped byte-wise, since data recovery software can recover items from beyond the recycle bin. As the name is also sensitive, I need to rename the files before I delete them.
This is a progressive problem. What is "secure" for one application is insecure for another. If security is really important and you find yourself asking these kinds of questions on Stack Overflow, then you most likely need to contract with an external security consultant. Examples of really important data include financial information, medical records, or anything else where a law or contract requires securing the data. I don't say this to be mean or to imply that you are incapable of solving the problem, but to point out that this is a rather complex and evolving problem.
Basically, to accomplish what you want (a sketch of the whole sequence follows below):
Once your overwrite code finishes, truncate the file to zero length: this makes recovery more difficult because the original file size is lost.
Then rename the file (RenameFile) to a different name.
Finally, delete the file using DeleteFile, which does not move the file to the recycle bin.
Make sure you maintain an exclusive handle on the files the whole time they are on the disk too, or they can just be copied before they are deleted.
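For illustration, here is that overwrite / truncate / rename / delete sequence sketched in Java (my own sketch; the RenameFile and DeleteFile calls mentioned above are Win32 APIs, and the wear-leveling caveat raised later still applies):

```java
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.SecureRandom;

public class SecureDelete {
    // Overwrite -> truncate -> rename -> delete. Per the advice above, a
    // production version would also hold an exclusive lock (FileLock) on
    // the file for its entire lifetime on disk.
    public static void wipe(Path file) throws Exception {
        SecureRandom rng = new SecureRandom();
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rws")) {
            byte[] junk = new byte[8192];
            long remaining = raf.length();
            while (remaining > 0) {
                rng.nextBytes(junk);
                int n = (int) Math.min(junk.length, remaining);
                raf.write(junk, 0, n);   // overwrite contents with random bytes
                remaining -= n;
            }
            raf.setLength(0);            // hide the original file size
        }
        Path renamed = file.resolveSibling("x");   // illustrative throwaway name
        Files.move(file, renamed, StandardCopyOption.REPLACE_EXISTING);
        Files.delete(renamed);           // deletes outright, no recycle bin
    }
}
```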
As I said, this is a progressive problem. This is a really basic solution and is subject to a number of vulnerabilities. So depending on the level of security needed, you might consider never letting the file be written to disk, or using multiple-pass overwrites. If security is really important, then actually burning the hard drive platter at a high temperature and then smashing it is the only way to be sure.
Edit: It appears you removed your code sample.
There are third-party utilities to do this kind of thing from the command line - I found that PGP Command Line has this feature, and if you search around you can probably find a free app that will do it. You could then just call the command from your app in order to securely delete the file.
I would say that if you insist on writing your own code to do this, then instead of using all zeros, write random bytes to the disk. And don't use the built-in C++ rand function; use a more secure random number generator.
As Jim McKeeth said, this is not something you want to do yourself if there are serious legal repercussions for getting it wrong.
Jim has described well the issues with solving your problem in code. The problem is indeed progressive, and any solution you implement will only approximate complete security without ever attaining it. So one thing to do is to decide exactly what you need to protect the file against (snooping family members? co-workers? corporate espionage? totalitarian governments?), then design your solution accordingly and document its limitations.
I have a sort of an orthogonal suggestion though. Instead of - or in addition to - implementing secure wiping in code, you can require cooperation from users. For example, you can suggest (or require) that input files be stored on an encrypted volume. In corporate environments PGP Disk might be preferred, since it's a recognizable brand, while home users would be well served by the free and well-tested TrueCrypt. Both products support creating virtual encrypted volumes as well as encrypting whole partitions. This would go a long way toward keeping the names and contents of input files secure, even before you write a single line of code.
Deleting a file can be a touchy subject...
Depending on your customer's needs, I would like to point to the data remanence phenomenon: residual data left behind after a simple overwrite. Data erasure is a method of destroying that residual data.
There are a few standards on how to erase residual data; DoD 5220.22-M is the one most often referred to by "secure file delete" applications, but apparently the rules have changed.
As of the June 2007 edition of the DSS C&SM, overwriting is no longer acceptable for sanitization of magnetic media; only degaussing or physical destruction is acceptable.
So what I'm saying is, try to get the rules which your customer has to follow.
Beware of "wear leveling" algorithms used with flash storage. To promote even wear, files are moved around on the drive, and it's invisible to your app, and even the operating system. So you can "secure delete" the file all you want, and you will only affect the most recent copy of the file. But prior copies are recoverable/discoverable with recovery software. So the only way to solve that, is to encrypt the file contents.