My embedded Linux system gets its data files from an external source (an SD card). As this medium is easily detachable, I'd like to protect it in some way.
The first idea that comes to mind is encryption, but I'm afraid this would take too much processing power. My files are not deeply sensitive, but I don't want people to be able to put the card into their desktop and see or copy my files. I assume these people know how to mount a standard ext4 drive.
Content is initially loaded on to the disk via a desktop linux box, so the process should be
I wouldn't care too much if the solution is not hack-proof. Basically, I want to avoid having my content copied by the casual copycat.
I'm not looking for a turn-key solution, but I'd like some pointers in the right direction.
A simple XOR cipher requires very little processing. Its security is limited in the sense that if someone has both the encrypted and the plain-text data, XOR'ing the two reveals the encryption key. However, so long as you can avoid someone knowingly being in possession of both, and the key itself remains confidential, it may meet your requirements of simplicity and security.
Obviously you need a longer key than the simple 8-bit one in the example in the link. The key itself can be arbitrarily long with no impact on performance.
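As a rough illustration of the above, here is a minimal Python sketch of a repeating-key XOR. The function name `xor_bytes`, the 64-byte key length, and the sample payload are assumptions made for this example, not anything from the original question. The last lines demonstrate the known-plaintext weakness described above.

```python
import os
from itertools import cycle


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR every byte of data with the key, repeating the key as needed."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


# Hypothetical 64-byte key; on the real device it would be stored in the firmware.
key = os.urandom(64)
plain = b"example payload stored on the SD card"

scrambled = xor_bytes(plain, key)     # what would be written to the card
restored = xor_bytes(scrambled, key)  # the same call undoes the scrambling

# Known-plaintext weakness: XOR'ing ciphertext with plaintext yields the key bytes.
leaked = xor_bytes(scrambled, plain)
```

Since the same function both scrambles and restores, the desktop box that prepares the card and the embedded reader can share one routine.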
I am in the process of upgrading an existing application that was written in Flash to play mp3 files of phone calls. The purpose of the application is to train employees on how to work with customers. Some of the calls are "negative" calls, and those are used to train employees on what NOT to do.
The reason I don't want to expose the location of the mp3s is that if someone were to become disgruntled, leave the company, and decide to take some of the negative calls with them, that would be bad. I don't ever like to underestimate the intelligence of our users, so I'm sure some could figure out a way to get them regardless.
The current implementation, as I said, was written in Flash, and it loads all of the mp3s as the swf file loads on the client, so it never needs to make a call up to the server to grab a new mp3 file. None of these mp3s are huge in file size because they're all only about 30-second phone call clips.
Are there any ways to prevent a direct download of an mp3 from an IIS server? Could I serve them up with C# as an aspx file that requires a specific hash or salt in order to play?
I really don't want to have them all brought into a swf like the current implementation if I can avoid it.
any suggestions welcome.
TIA
Honestly, if a user is that determined to get the data, they will. I believe the balance here is at what point will said hypothetical employee feel the gain to be had by obtaining the data is not worth the effort to get it. And how much effort you have to go through vs. what it is worth to the company.
If the audio will always be played back on your application, one simple layer of security would be to encrypt the files. Keeping it simple, you can use a symmetric key, store it in the application, and decrypt the file in memory before it is played (this way it's not stored in a temporary file the user could just grab). Sure a user with 3/4 of a brain could probably fish the key out of the executable, but frankly the sound is playing on their speakers and I'm sure they have a smartphone. They could just as easily record the output with Sound Recorder as it plays too.
Simply speaking, I believe a very minimum layer of technological security mixed with a binding confidentiality agreement should give you enough recourse. The security will keep the would-be-honest honest and deter the lazy, as well as giving you a leg up in proving the employee obtained the audio through nefarious means (i.e. it wasn't just "available for the taking").
I've seen a lot of companies offering exe wrappers, but is there any way to secure the PDF itself programmatically?
Well, you can encrypt the PDF. You can also use a custom encryption handler and thus make your file unreadable with stock Acrobat or Reader (one will need to install your decryption plugin into Acrobat or Reader to make them understand your encryption). The problem is that Acrobat's DRM SDK (the one that allows you to create encryption plugins) once had an enormous cost (something like $25K to start). I don't know if this is still the case, though.
Another not-so-bad option is to render everything to graphics - this makes text copying harder (though one can print everything and OCR it back).
The short answer is no. When you give someone the ciphertext, key, and cipher they will always be able to reproduce the plaintext. DRM fails universally for just this reason.
The long answer is that you can sometimes try little gimmicky tricks to prevent copying in some circumstances which may "work" if your audience doesn't try breaking it, but not in the general case. You can't really call something secure which is "safe as long as nobody tries to break it".
The PDF format itself has an "owner password" which allows the author to disallow readers from printing the document, modifying it, etc... Of course there's not actually any mechanism for preventing anyone from doing so. If you are trying to prevent the guys in the marketing department from printing it off or something, then maybe. But if you're releasing it out into the Internet, just assume that it can and will be copied however users see fit.
How does one combine several resources for an application (images, sounds, scripts, xmls, etc.) into a single/multiple binary file so that they're protected from user's hands? What are the typical steps (organizing, loading, encryption, etc...)?
This is particularly common in game development, yet a lot of the game frameworks and engines out there don't provide an easy way to do this, nor describe a general approach. I've been meaning to learn how to do it, but I don't know where to begin. Could anyone point me in the right direction?
There are lots of ways to do this. m_pGladiator has some good ideas, especially with serialization. I would like to make a few other comments.
First, if you are going to pack a bunch of resources into a single file (I call these packfiles), then I think that you should work to avoid loading the whole file and then deserializing out of that file into memory. The simple reason is that it uses more memory. That's really not a problem on PCs I guess, but it's good practice, and it's essential when working on a console. While we don't (currently) serialize objects as m_pGladiator has suggested, we are moving towards that.
There are two types of packfiles that you might have. One would be a file where you want arbitrary access to the contents of the files. A second type might be a collection of files where you need all of those files when loading a level. A basic example might be:
An audio packfile might contain all the audio for your game. You might only need to load certain kinds of audio for the menus or interface screens and different sets of audio for the levels. This might fall into the first category above.
A type that falls into the second category might be all models/textures/etc. for a level. You basically want to load the entire contents of this file into the game at load time because you will (likely) need all of its contents while a player is playing that level or section.
Many of the packfiles that we build fall into the second category. We basically package up the level contents and then compress them with something like zlib. When we load one of these at game time, we read a small amount of the file, uncompress what we've read into a memory buffer, and then repeat until the full file has been read into memory. The buffer we read into is relatively small, while the final destination buffer is large enough to hold the largest set of uncompressed data that we need. This method is tricky, but again, it saves on RAM, it's an interesting exercise to get working, and you feel all nice and warm inside because you are being a good steward of system resources. Once the packfile has been completely uncompressed into its destination buffer, we run a final pass on the buffer to fix up pointer locations, etc. This method only works when you write out your packfile as structures that the game knows. In other words, our packfile writing tools share structs (or classes) with the game code. We are basically writing out and compressing exact representations of data structures.
If you simply want to cut down on the number of files that you are shipping and installing on a user's machine, you can make do with something like the first kind of packfile that I describe. Maybe you have 1000s of textures and would simply like to cut down on the sheer number of files that you have to zip up and package. You can write a small utility that reads the files that you want to package together, writes a header containing the filenames and their offsets in the packfile, and then writes the contents of each file, one after the other, into your large binary file. At game time, you can simply load the header of this packfile and store the filenames and offsets in a hash. When you need to read a file, you can hash the filename and see if it exists in your packfile, and if so, you can read the contents directly from the packfile by seeking to the offset and then reading from that location in the packfile. Again, this method is basically a way to pack data together without regard for encryption, etc. It's simply an organizational method.
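To make the header-plus-offsets idea concrete, here is a minimal Python sketch. The on-disk layout (a 4-byte entry count, then a length-prefixed name, offset, and size per entry, followed by the raw data) and the names `write_packfile`, `read_index`, and `read_entry` are all assumptions invented for this example, not a format used by any particular engine.

```python
import io
import struct


def write_packfile(out, files):
    """Write a header of (name, offset, size) entries, then the file contents."""
    names = list(files)
    encoded = [name.encode("utf-8") for name in names]
    # 4-byte count, then per entry: 2-byte name length, name, 8-byte offset, 8-byte size.
    header_size = 4 + sum(2 + len(enc) + 16 for enc in encoded)
    out.write(struct.pack("<I", len(names)))
    offset = header_size
    for name, enc in zip(names, encoded):
        out.write(struct.pack("<H", len(enc)))
        out.write(enc)
        out.write(struct.pack("<QQ", offset, len(files[name])))
        offset += len(files[name])
    for name in names:
        out.write(files[name])


def read_index(f):
    """Load only the header and return {name: (offset, size)}."""
    f.seek(0)
    (count,) = struct.unpack("<I", f.read(4))
    index = {}
    for _ in range(count):
        (name_len,) = struct.unpack("<H", f.read(2))
        name = f.read(name_len).decode("utf-8")
        index[name] = struct.unpack("<QQ", f.read(16))
    return index


def read_entry(f, index, name):
    """Seek straight to one entry and read just its bytes."""
    offset, size = index[name]
    f.seek(offset)
    return f.read(size)


# Usage with an in-memory file standing in for the packfile on disk.
pack = io.BytesIO()
write_packfile(pack, {"grass.tex": b"GRASSDATA", "menu.wav": b"RIFF-placeholder"})
index = read_index(pack)
menu_audio = read_entry(pack, index, "menu.wav")
```

Only the header is read up front; each lookup afterwards is a dictionary access plus one seek and one read.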
But again, I do want to stress that if you are going a route like I or m_pGladiator suggest, I would work hard to not have to pull the whole file into RAM and then deserialize to another location in RAM. That's a waste of resources (that you perhaps have plenty of). I would say that you can do this to get it working, and then once it's working, you can work on a method that only reads part of the file at a time and then decompresses to your destination buffer. You must use a compression scheme that will work like this though. zlib and lzw both do (I believe). I'm not sure about an MD5 algorithm.
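The read-a-little, decompress-a-little loop can be sketched with Python's standard `zlib` module, which supports incremental decompression through `zlib.decompressobj()`. The function name `decompress_stream` and the chunk size are assumptions for this sketch, and a growing `bytearray` stands in for the preallocated destination buffer described above.

```python
import io
import zlib


def decompress_stream(src, chunk_size=4096):
    """Decompress a zlib stream while holding only one small compressed chunk at a time."""
    decompressor = zlib.decompressobj()
    out = bytearray()  # stand-in for the game's destination buffer
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        out += decompressor.decompress(chunk)
    out += decompressor.flush()
    return bytes(out)


# Usage: an in-memory stream stands in for the packfile on disk.
level_data = b"level geometry and textures " * 10000
compressed = zlib.compress(level_data)
loaded = decompress_stream(io.BytesIO(compressed), chunk_size=64)
```

The compressed input is never fully resident in memory; only the small read buffer and the destination buffer are.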
Hope that this helps.
Do as Java does: pack it all in a zip, and use a filesystem-like API to read directly from there.
Personally, I never used the already available tools to do that. If you want to prevent your game from being hacked easily, then you have to develop your own resource manipulation engine.
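For comparison, the zip approach looks roughly like this with Python's standard `zipfile` module (Java offers the analogous `java.util.zip` classes). The archive member names and contents below are placeholders invented for the example.

```python
import io
import zipfile

# Build a small archive in memory; a real game would ship this as a file.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("textures/grass.txt", "placeholder texture data")
    zf.writestr("scripts/init.txt", "placeholder script")

# Read one member directly from the archive without extracting the rest.
with zipfile.ZipFile(archive) as zf:
    names = sorted(zf.namelist())
    with zf.open("scripts/init.txt") as member:
        script = member.read()
```

Note that a plain zip only reduces the file count and size; anyone can still open it with a standard archive tool.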
First of all read about serializing objects. When you load a resource from file (graphic, sound or whatever), it is stored in some object instance in the memory. A game usually uses dozens of graphical and sound objects. You have to make a tool, which loads them all and stores them in collections in the memory. Then serialize those collections into a binary file and you have every resource there.
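Then you can encrypt this file with an encryption algorithm of your choice. Note that MD5 is a one-way hash, not a cipher: it can verify that the file has not been altered, but it cannot be decrypted.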
Also, you can use zlib or other compression library to make this big binary file a bit smaller.
In the game, you should load the encrypted binary file and unpack it. Then decrypt it. Then deserialize the object collections and you have all resources back in memory.
Of course you can make this more comprehensive by storing in different binary files the resources for different levels and so on - there are plenty of variants, depending on what you want. Also you can first zip, then encrypt, or make other combinations of the steps.
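One possible arrangement of these steps, sketched in Python under several assumptions: `pickle` stands in for the serialization layer, compression happens before encryption (one of the orderings mentioned above, chosen here because encrypted data compresses poorly), and `xor_placeholder` is only a stand-in for a real cipher, since the Python standard library does not ship one. All function names and the key are hypothetical.

```python
import pickle
import zlib


def xor_placeholder(data, key=b"hypothetical-key"):
    """Stand-in for a real cipher; swap in a proper encryption library in practice."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def pack(resources, encrypt):
    """Serialize the resource collection, compress it, then encrypt it."""
    serialized = pickle.dumps(resources)
    return encrypt(zlib.compress(serialized))


def unpack(blob, decrypt):
    """Reverse the steps: decrypt, decompress, then deserialize."""
    # pickle.loads must only be used on data you trust, since it can run code.
    return pickle.loads(zlib.decompress(decrypt(blob)))


# Usage with placeholder resources.
resources = {"sounds": {"jump": b"\x00\x01\x02"}, "images": {"hero": b"PNG-placeholder"}}
blob = pack(resources, xor_placeholder)
loaded = unpack(blob, xor_placeholder)
```

Passing the cipher in as a function keeps the packing order in one place while leaving the choice of algorithm open.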
Short answer: yes.
In Mac OS 6, 7, and 8 there was a substantial API devoted to this exact task. Look up the "Resource Manager" if you are interested. Edit: The ROOT physics analysis package has one too.
Not that I know of a good tool right now. What platform(s) do you want it to work on?
Edited to add: All of the two or three tools of this sort that I am aware of share a similar structure:
The file starts with a header and index
There are a series of blocks, some of which may have their own headers and indices, some of which are leaves
Each leaf is a simple serialization of the data to be stored.
The whole file (or sometimes individual blocks) may be compressed.
Not terribly hard to implement your own, but I'd look for a good existing one that meets your needs first.
For future people, like me, who are wondering about this same topic, check out the two following links:
http://www.sfml-dev.org/wiki/en/tutorials/formatdat
http://archive.gamedev.net/reference/programming/features/pak/
How and why do 7- and 35-pass erases work?
Shouldn't a simple rewrite with all zeroes be enough?
A single pass with zeros doesn't completely erase magnetic artifacts from a disk. It's still possible to recover the data from the drive. A 7-pass erasure using random data will do a pretty complete job to prevent reconstruction of the data on the drive.
Wikipedia has a number of different articles relating to this topic.
http://en.wikipedia.org/wiki/Data_remanence
http://en.wikipedia.org/wiki/Computer_forensics
http://en.wikipedia.org/wiki/Data_erasure
I'd never heard of the 35-pass erase: http://en.wikipedia.org/wiki/Gutmann_method
The Gutmann method is an algorithm for
securely erasing the contents of
computer hard drives, such as files.
Devised by Peter Gutmann and Colin
Plumb, it does so by writing a series
of 35 patterns over the region to be
erased. The selection of patterns
assumes that the user doesn't know the
encoding mechanism used by the drive,
and so includes patterns designed
specifically for three different types
of drives. A user who knows which type
of encoding the drive uses can choose
only those patterns intended for their
drive. A drive with a different
encoding mechanism would need
different patterns. Most of the
patterns in the Gutmann method were
designed for older MFM/RLL encoded
disks. Relatively modern drives no
longer use the older encoding
techniques, making many of the
patterns specified by Gutmann
superfluous.[1]
Also interesting:
One standard way to recover data that
has been overwritten on a hard drive
is to capture the analog signal which
is read by the drive head prior to
being decoded. This analog signal will
be close to an ideal digital signal,
but the differences are what is
important. By calculating the ideal
digital signal and then subtracting it
from the actual analog signal it is
possible to ignore that last
information written, amplify the
remaining signal and see what was
written before.
As mentioned before, magnetic artifacts are present from the previous data on the platter.
In a recent issue of MaximumPC they put this to the test. They took a drive, ran it through a pass of all zeros, and hired a data recovery firm to try and recover what they could. Answer: Not one bit was recovered. Their analysis was that unless you expect the NSA to try, a zero pass is probably enough.
Personally, I'd run an alternating pattern or two across it.
one random pass is enough for plausible deniability, as the lost data will have to be mostly "reconstructed" with a margin of error that grows with the length of the data being recovered, and with whether or not the data is contiguous (in most cases, it's not).
for the insanely paranoid, three passes is good. 0xAA (10101010), 0x55 (01010101), and then random. the first two will grey out residual bits, the last random pass will obliterate any "residual residual" bits.
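The three-pass sequence above could be sketched in Python as follows. The function name `overwrite_file`, the chunk size, and the temporary demo file are assumptions for this example. It overwrites a regular file in place, and on flash media or some filesystems an in-place write may not reach the same physical blocks, so treat it as an illustration of the pass order rather than a reliable shredder.

```python
import os
import tempfile

# Pass order from the text: 0xAA, then 0x55, then random (None marks the random pass).
PASSES = (b"\xaa", b"\x55", None)


def overwrite_file(path, chunk_size=1 << 20):
    """Overwrite a file in place with each pattern in turn, syncing after every pass."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for pattern in PASSES:
            f.seek(0)
            remaining = size
            while remaining > 0:
                n = min(chunk_size, remaining)
                f.write(os.urandom(n) if pattern is None else pattern * n)
                remaining -= n
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push this pass to the device


# Usage on a throwaway temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"secret" * 1000)
    demo_path = tmp.name
overwrite_file(demo_path)
with open(demo_path, "rb") as f:
    wiped = f.read()
os.remove(demo_path)
```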
never do passes with zeros. under magnetic microscopy the data is still there, it's just "faded".
never trust "single file shredding", especially on solid state mediums like flash drives. if you need to "shred" a file, well, "delete" it and fill your drive with random data files until it runs out of space. then next time think twice about housing shred-worthy data on the same medium as "low-clearance" stuff.
the gutmann method is based on tin-foil hat speculation. it does various things to get drives to degauss themselves, which is admirable in an artistic sense, but pragmatically it's overkill. no private organisation to date has successfully recovered data from even a single random pass. and as for big brother, if the DoD considers it gone then you know it's gone. the military industrial complex gets all the big bucks to try and do exactly what gutmann claims they can do, and believe you me, if they had the tech to do so it would already have been leaked to the private sector, since they're all in bed with each other. however, if you want to use gutmann in spite of this, check out the secure-delete package for linux.
7 pass and 35 pass would take forever to finish. HIPAA only requires DOD 3-pass overwrite,
and I am not certain why DOD even has a 7 pass overwrite as it seems they just simply
shred the disks before disposing of machines anyway. Theoretically, you could recover
data off of the outer edges of each track (using a scanning electron microscope or
microscopic magnetic probe), but in practice you would need the resources of a disk
drive maker or one of the three letter government organizations to do this.
The reason to perform multipass writes is to take advantage of the slight errors in positioning to overwrite the edges of the track also, making recovery far less likely.
Most drive recovery companies can't recover a drive that has had its data overwritten
even once. They are typically taking advantage of the fact that Windows doesn't zero out the data blocks, just changes the directory to mark the space free. They simply 'undelete'
the file and make it visible again.
If you don't believe me, call them up and ask them if they can recover a disk
that has been dd'ed over... they will typically tell you no, and if they do agree to try, it will be serious $$$ to get it back...
DOD 3 pass followed by a zero overwrite should be more than sufficient for most (i.e.
non-TOP SECRET) folks.
DBAN (and its commercially supported descendant, EBAN) do this all cleanly... I would
recommend these.
See: Secure Deletion of Data from Magnetic and Solid-State Memory
Advanced recovery tools can recover single-pass deleted files easily. And they are expensive too (e.g. http://accessdata.com/).
A visual GUI for Gutmann passes from http://sourceforge.net/projects/gutmannmethod/ shows it has 8 semi-random passes. I have never seen proof that files deleted with the Gutmann method have been recovered.
Overkill, maybe, but still far better than Windows' soft delete.
Regarding the second part of the question, some of the answers here actually contradict real research on that exact topic. According to the "Number of overwrites needed" section of the Data erasure article on Wikipedia, on modern drives, erasing with more than one pass is redundant:
"ATA disk drives manufactured after 2001 (over 15 GB) clearing by
overwriting the media once is adequate to protect the media from both
keyboard and laboratory attack." (citation)
Also, infosec did a nice article entitled "The Urban Legend of Multipass Hard Disk Overwrite" on the entire subject, covering the old USA Government erasure standards, among others, and how the multi-pass myth established itself in the industry.
"Fortunately, several security researchers presented a paper [WRIG08]
at the Fourth International Conference on Information Systems Security
(ICISS 2008) that declares the “great wiping controversy” about how
many passes of overwriting with various data values to be settled:
their research demonstrates that a single overwrite using an arbitrary
data value will render the original data irretrievable even if MFM and
STM techniques are employed.
The researchers found that the probability of recovering a single bit
from a previously used HDD was only slightly better than a coin toss,
and that the probability of recovering more bits decreases
exponentially so that it quickly becomes close to zero.
Therefore, a single pass overwrite with any arbitrary value (randomly
chosen or not) is sufficient to render the original HDD data
effectively irretrievable."
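To see why "slightly better than a coin toss" per bit collapses so quickly, here is a small Python calculation. The per-bit probability of 0.55 is an assumed illustrative value, not a figure from the quoted article, and the sketch assumes each bit is recovered independently.

```python
def recovery_probability(p_bit, n_bits):
    """Chance of recovering n_bits correctly if each bit succeeds independently with p_bit."""
    return p_bit ** n_bits


# Assumed per-bit success rate, chosen only for illustration.
P_BIT = 0.55

for n_bits in (1, 8, 32, 256):
    print(n_bits, recovery_probability(P_BIT, n_bits))
```

Even a single 32-bit word comes out at well under one chance in a hundred million under these assumptions, which matches the exponential decay the quote describes.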
There's a lot of misinformation around this, though most of the answers I see on this page are correct. I've worked in the data recovery industry for 25 years and have addressed this exact question an enormous number of times.
The "residual magnetism" hypothesis never worked in real life. And back then, tolerances were millions of times looser.
If you still doubt this, remember that a rotational hard drive uses the same storage principle as an audio tape - moving magnetic substrate storage - and the audio tape that was recorded over a single time in the Watergate case has still not been recovered.
A single zero-pass wipe renders all the data on a HDD unrecoverable unless some malfunction or mistake causes the overwrite to be incomplete. This was true even back in the days when Peter Gutmann released his paper (which was like a tsunami in the erasure industry.) Gutmann's paper was pure hypothesis, it never panned out in reality. Even in the days of MFM/RLL drives, nobody could recover from a single-pass overwrite. It should be noted that Gutmann patented the algorithm that his paper said would be required to ensure complete erasure. Presumably, every time erasure was sold with his algorithm, he got paid. I am not saying there was intentional deception on his part, just pointing out that his algorithm, though there was never any evidence it erased better than a single overwrite, was patented and sold.
Please note that SSDs are different. SSDs can (and often do) use a pool of sectors that are rotated in and out of use, so if data is written to an SSD and then "deleted" and the drive rotates the sectors on which the deleted file is on out of the pool, an erasure might not be able to reach those sectors because the firmware in the SSD has control that software can't override. One way around this is to continuously overwrite until all sectors have been rotated into use.
The reason multiple passes exist is because hardware can malfunction. If the drive somehow malfunctions during one pass, it's possible that not all sectors will be erased - however, most good erasure software offers a full verification, which basically reads every bit on the drive to make sure the erasure didn't malfunction. With that, multi-pass overwrites are overkill.
And sometimes, data is so sensitive, it makes sense to go overboard in making sure it's destroyed. For example, I heard about a drive that was erased by the military with a 7-pass zero-fill, then the drive was run over by a tank, and then the remains were buried in a secret location in a highly secured area. Practically, the recoverability is about the same as a single-pass overwrite, but if lives could be lost as a result of the data falling into the wrong hands, then why not go for the overkill?
I had the idea of a search engine that would index web items like other search engines do now but would only store the file's title, url and a hash of the contents.
This way it would be easy to find items on the web if you already had them and didn't know where they came from or wanted to know all the places that something appeared.
More useful for non textual items like images, executables and archives.
I was wondering if there is already something similar?
Check out the Wikipedia page on locality-sensitive hashing. There's also a good page hosted by a researcher at MIT.
In general, there are several flavors available: hashes for strings (such as simhash), sets or 0/1 features (such as min-wise hashes), and for real vectors.
The main trick for numerical hashes is basically dimension reduction, so far. For strings, the idea is to come up with a representation that's robust in the face of minor edits.
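As a rough sketch of one of these flavors, here is a toy min-wise hash (MinHash) for sets in Python. The helper names, the use of character 3-grams, the 64 hash functions, and the salted SHA-1 construction are all assumptions chosen for illustration; production libraries use faster hash families.

```python
import hashlib


def shingles(text, k=3):
    """Break a string into its set of overlapping k-character pieces."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(items, num_hashes=64):
    """For each salted hash function, keep the smallest hash value seen over the set."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(hashlib.sha1(salt + item.encode("utf-8")).digest()[:8], "big")
            for item in items
        ))
    return signature


def estimated_similarity(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity of the sets."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Usage: a small edit changes only a few shingles, so most minima should still agree.
original = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
edited = minhash_signature(shingles("the quick brown fox jumped over the lazy dog"))
similarity = estimated_similarity(original, edited)
```

The signatures have a fixed length regardless of document size, which is what makes them practical to store in an index.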
I'm also doing a little research in this field, although I guess stackoverflow might not be the right place for nascent work.
The question seems to focus on exact match hashes, which we understand better than nearest-neighbor approaches, and are indeed worthwhile, especially if people can share tags and other metadata that way.
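A minimal Python sketch of such an exact-match index, storing only a title, a URL, and a content hash per item. The class name `HashIndex`, the choice of SHA-256, and the example URLs are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


class HashIndex:
    """Maps a content hash to every (title, url) where that exact content was seen."""

    def __init__(self):
        self._by_hash = defaultdict(list)

    @staticmethod
    def _digest(content):
        return hashlib.sha256(content).hexdigest()

    def add(self, title, url, content):
        """Record where the content was found; the content itself is not stored."""
        self._by_hash[self._digest(content)].append((title, url))

    def lookup(self, content):
        """Return every known location of a file you already have."""
        return list(self._by_hash.get(self._digest(content), []))


# Usage with placeholder content and hypothetical URLs.
index = HashIndex()
index.add("Logo", "http://example.com/logo.png", b"fake image bytes")
index.add("Logo copy", "http://example.org/mirror/logo.png", b"fake image bytes")
locations = index.lookup(b"fake image bytes")
```

Because only exact bytes match, a re-encoded or resized copy of the same image would hash differently and not be found.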
As #rjmunro notes, hash-based searching is a popular idea in the P2P world, and Bitzi did pretty much this. They have since shut down and their Bitpedia (Digital Media Encyclopedia) isn't hosted there any more, though at least some of it is still available at Archive.org.
Bitzi also produced software like Bitcollider (SourceForge.net),
and the Magnet URI scheme, which allows for specifying a file by hash and is thus a content-based identifier. Various applications support searching at various databases via Magnet URIs as described at that Wikipedia page.
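For illustration, a Magnet URI that identifies a file by its SHA-1 hash can be built in a few lines of Python. The function name `magnet_uri` and the sample file name are assumptions for this sketch; it uses the `urn:sha1` form with a Base32-encoded digest and the optional `dn` (display name) parameter.

```python
import base64
import hashlib
from urllib.parse import quote


def magnet_uri(content, display_name=None):
    """Build a magnet link that names the content by its Base32-encoded SHA-1 hash."""
    digest = hashlib.sha1(content).digest()
    encoded = base64.b32encode(digest).decode("ascii")
    uri = f"magnet:?xt=urn:sha1:{encoded}"
    if display_name:
        uri += "&dn=" + quote(display_name)
    return uri


# Usage with placeholder content and a hypothetical file name.
link = magnet_uri(b"placeholder file contents", display_name="clip one.mp3")
```

The link carries no location at all, so any service that indexes files by the same hash can resolve it.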
The same idea is popular in the password-cracking scene - see e.g. findmyhash - Python script to crack hashes using online services etc.
Going a step further, I think it would be great if there were databases and online repositories identifying content by hash and providing tags and other metadata about the content from various perspectives. Then I could leave my music collection in its pristine state (no wasted backup space and time), but still tag them myself and add other metadata, via external tag databases. If my applications knew how to grab the tags, it would seem much better than the current system where we modify and copy around big files just to move tags from e.g. my desktop to my phone.
See a related idea at Metadata Independent Hashing for Media Identification & P2P Transfer Optimisation (pdf).
Well, for images, there's http://tineye.com, which will one-up that, and find you similar images too.
It's not a bad idea. Sometimes I find myself stumbling upon some file and trying to figure out where it came from :) But how are you going to track items' sources? Content can be obtained by various means: web browser, download manager, or simply by copying from a network share.
If I understand your proposal right, http://bitzi.com/ has done this for a while.