Do you use "kibibyte" as a unit of measurement in your programs? [closed] - naming

Closed 10 years ago.
For decades, in the field of computing (except disk manufacturers), a KB (kilobyte) was understood to mean 1024 bytes. In the past few years, there has been a movement to use KiB ("kibibyte") to mean 1024 bytes, and change the meaning of kilobyte to be 1000 bytes, dooming us to many more years of confusion. On the other hand, the movement seems to be confined to Gnome, and some overzealous wikipedia editing.
Will you be converting your programs to use KiB? If you have ever displayed a filesize in KB, did you divide by 1000 or 1024?

KB is 1024 bytes, damnit.

I did this once before in an app. While internally it used kibbi's and mebbi's (KiB, MiB, etc), it would still display in what users (in this case IT folks) were used to. The underlying field was just a long that was in bytes IIRC.
It was forward compatible, and would at least allow you to enter 4 GB as well as 4GiB. It also understood shorthand entry like 4.5G and properly rounded back to the real number of bytes - not forcing poor user to have to enter it that way and prevent their mistakes. Updating to use IEC notation is 1 line of code.
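For anyone wondering what that kind of forgiving entry looks like, here is a minimal sketch in Python. It is not the code from the app described above; the names `parse_size`, `_MULTIPLIERS` and `_SIZE_RE` are made up for illustration, and it assumes that "G", "GB" and "GiB" should all be read as binary multiples.

```python
import re

# Hypothetical unit table. Assumption: every prefix is binary, so "4 GB"
# and "4GiB" both mean 4 * 1024**3 bytes.
_MULTIPLIERS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

# A number, then an optional prefix (optionally followed by "i"), then an optional "B".
_SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:([KMGT])i?)?B?\s*$", re.IGNORECASE)


def parse_size(text: str) -> int:
    """Parse strings such as '4 GB', '4GiB' or '4.5G' into a whole number of bytes."""
    match = _SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    prefix = (match.group(2) or "").upper()
    # Round to the nearest byte so fractional input still yields an integer.
    return round(float(match.group(1)) * _MULTIPLIERS[prefix])
```

Storing only the resulting integer keeps the internal field unambiguous, whatever suffix the user typed.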
kilo's are 1000 and 98% of the world uses metric. We need to get over it already.
I see a lot of anger in many of these responses which baffles me. SI prefixes are SI prefixes, and programmers have no right to alter them for no better reason than convenience and custom. It's odd that those in Computer Science, a highly technical field, are the ones clamoring to go back to the days of cubits, furlongs and rods. wtf?
We all know what we mean, but sticking to custom alienates and confuses users. Just because in the early pioneer days some guys, when talking about computer memory, decided to reuse SI notation doesn't mean they were correct to do so.

In my opinion, 1 Kilobyte equals to 1000 bytes is something drivemakers want you to believe, so that your drive looks more spacious than it really is. ;)

Since I spent a few years learning to be a mechanical engineer before switching majors, I have to admit that "kilo" always means 10^3 to me. From that standpoint, KiB makes sense. However, try saying "kibibyte" out loud a few times, and think about how dumb you sound.
Therefore, kilogram is 1000 grams, kilobyte is 1024 bytes.
Addendum: In addition, I agree with those who have been saying that we shouldn't change what is already established if it works. 1024 is simply a nicer number in binary. Also, "kibibyte" still sounds like something a dog eats.

It's not changing the meaning of "kilobyte". Kilo means 1000. Some people were using it incorrectly to refer to units of 1024 bytes.
I never display file sizes in kibibytes, because users don't care about 1000 vs 1024. Instead, I always use "XXX KB/MB/GB", where XXX is the number of bytes divided by 1 thousand / 1 million / etc.

There are 2 ways to think about this:
Use what the operating system you're running on uses. That way users have a consistent experience.
Use what is correct.
If you use KiB always though there will be no confusion. If you use KB there will be confusion. So if you chose option #2 then you're better off actually using 1024 and using the KiB suffix. Working with powers of 2 is more efficient anyway.
It's up to you but my rule of thumb would be that if you have a technical audience, then use KiB and avoid all confusion. If you have a large user base of non technical users, then use what your operating system uses. By the way Windows uses KB to mean 1024 bytes.

Areas of speciality have always used terms in ways that are understood by that specialisation. For example, a mechanical engineer building a bridge uses the term "stress" to mean something completely different from, say, a lawyer who finds out his star witness has been lying on the first day in court. Should we mandate that the engineer use the same definition for "stress" as the lawyer just because that definition is more widely used? If we do, I'm not driving across that bridge!
Kilobytes = 1024 bytes. It's an industry accepted specialisation of the term.

I use KiB.
Do you really want to hurt everyone by refusing to use well-established standards just like IE?

I've always displayed file size in 1000-byte Kilobytes. It hardly ever matters to the people who can't tell the difference, and often relieves confusion when they see the actual number. 65323 bytes = 65 KB when rounded, and the "normal" people like that.
I probably won't ever display "KiB", since that's never what my customers want.
The arrogance of deciding not to follow the standard created by more than just the computer community (see... it isn't "new" that Kilo actually means 1000) is staggering.

Only if the situation called for it. In almost all cases, 1,000-based units are more appropriate.
The only exceptions I know of are memory, since it naturally comes in multiples of a power of two, and CD size, since it's measured in multiples of 2^20 bytes by the manufacturers. Everything else, including hard drives, DVDs, flash drives, bandwidths, processor speeds, memory buses, etc. is currently measured in 1000s, and file sizes should be, too. (Or, at least, me and Steve Jobs think so. Windows will probably continue measuring file sizes in 1024s for years...)
To avoid confusing the user, use k- = 1,000, and Ki- = 1,024.
The sloppy usage of "k" to mean 1024 is an unholy abomination that should be killed with fire.

Mac OS X doesn't use KiB, MiB, GiB. On the other hand, when it uses the metric ones, it at least does the maths correctly:
Personally I prefer to get this stuff right so that users who are currently in the dark would learn from it. Waiting for users to change first is just foolish. Users didn't suddenly wake up some day and think that a kilobyte is 1024 bytes - it was software which made them think that, so shouldn't it be software's job to correct the mistake?

I've worked in the storage industry for a decade. Arguments over the size of a TB can vary the size of a solution by 10%. In short: programmers and the storage industry use different measurements. Neither are right all the time.
The Storage Networking Industry Association (SNIA) dictionary defines kilobyte as:
Kilobyte (KB)
[General] 1,000 (10^3) bytes.
The SNIA uses the 10^3 convention commonly found in storage and data transfer-related literature rather than the 1,024 (2^10) convention common in computer system random access memory and software literature.
My rule of thumb is:
Measure memory, files, file systems, and data on a network in 1024^n byte blocks.
Measure raw disk space — and only raw disk space — in 1000^n byte blocks.
Tell the customer which unit you're using. Repeat yourself often.
By and large, that keeps me out of trouble.
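To put a number on that 10% figure, here is a small back-of-the-envelope sketch in Python. The function name `binary_excess` is mine, not anything from SNIA; it just compares 1024^n with 1000^n for each prefix.

```python
# Relative gap between binary (1024^n) and decimal (1000^n) units.
PREFIXES = ["K", "M", "G", "T"]


def binary_excess(n: int) -> float:
    """Return how much larger 1024**n is than 1000**n, as a fraction."""
    return 1024**n / 1000**n - 1


for n, prefix in enumerate(PREFIXES, start=1):
    print(f"{prefix}: {binary_excess(n):.1%}")
```

The gap grows with every prefix: roughly 2.4% at the kilo level and just under 10% at the tera level, which is where the argument over the size of a TB comes from.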

One program I'm working on uses "KiB" by default, but has a user preference as to which unit of measurement to use (1024 B in a KiB, 1024 B in a KB, or 1000 B in a KB).
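A rough sketch of how such a preference could be wired up, in Python. This is not the program mentioned above; the mode names (`iec`, `binary_kb`, `decimal_kb`) and the `format_size` helper are invented for the example.

```python
# Hypothetical preference values; the names are illustrative only.
MODES = {
    "iec": (1024, ["B", "KiB", "MiB", "GiB", "TiB"]),     # 1024 B in a KiB
    "binary_kb": (1024, ["B", "KB", "MB", "GB", "TB"]),   # 1024 B in a KB
    "decimal_kb": (1000, ["B", "KB", "MB", "GB", "TB"]),  # 1000 B in a KB
}


def format_size(num_bytes: int, mode: str = "iec") -> str:
    """Format a byte count using the unit convention selected by `mode`."""
    base, units = MODES[mode]
    value = float(num_bytes)
    index = 0
    # Step up a unit while the value still fills the next one.
    while value >= base and index < len(units) - 1:
        value /= base
        index += 1
    if index == 0:
        return f"{num_bytes} B"
    return f"{value:.1f} {units[index]}"
```

With this, `format_size(65323, "decimal_kb")` gives "65.3 KB" while `format_size(65323, "iec")` gives "63.8 KiB", so the same file shows up with a different number depending on the preference.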

No. 1024 bytes is a kilobyte, regardless of whether that makes sense.
The usage of the "kilo-" prefix for units of 1024 bytes back in the day was probably a mistake. But it's now the standard. Trying to change it now only adds to the confusion.
We don't deal with the world as it should be; we deal with the world as it is.

Technically KiB is correct, but I have seen it only in a few applications (mainly linux console apps). Users are either used to working with 1024 for both KB and KiB (IT people) or they don't really care and will think that "KiB" is misspelled (non-IT people).
However: I have been used to working with "Kilobytes = 1024 bytes" for over 20 years now and, even though I know that it is scientifically wrong, will go on using it.
If you need to provide KiB to allow your soul to rest, make it available as an option, but don't confuse poor users with yet another definition - especially if they work with an OS, that uses the non-scientific approach and defines KB as 1024.
(BTW: Kibibytes always reminds me of Tinky Winky and his friends... ;) )

I tried to start using these terms when teaching my students, but I've sort of given up now.
I've taught an introductory computer course ("and this is a disk drive") a few times, and it can be confusing for the students that the prefixes mean different things in different contexts. Kilo means 1024 when you have a kilobyte or a kilobit of data, except if you store it on disk when it is 1000, and if you send a kilobit per second over a network then it is 1000, and a kilohertz is of course 1000 too. And one kilometer of fiber cable is 1000 meters! But it turns out that it really isn't that much of a problem. The engineering and computer science students need to know the difference, and they will get used to it anyway. When I meet them again in database courses or in the compiler course, there is never any confusion about the different kinds of kilos, megas and teras. And students from other areas (media design and so on) don't really care.
And after I did an informal poll among the other computer science people in my corridor at the university, and found out that most of them had never heard of these new prefixes, I definitely gave up.

A KB is 1024 bytes
A kB is 1000 bytes
Unfortunately, "kilobyte" spelled out is ambiguous. I always use 1024.
Knuth refers to MB as KKBytes or kkBytes to differentiate between 1024*1024 and 1000*1000

I have honestly never heard of this & I doubt it's going to gain much traction in mainstream usage. I can't imagine why I would want to start doing this. The current definition of kilobyte is accurate & sufficient. I would much rather see hard drive manufacturers start using accurate terminology rather than further dumb-down technical terminology. Why can't manufacturers either build drives that are exactly xGB in size or simply say what they really are?

Other than rants about how the terminology needs to change, I have never heard those expressions used. It is not going to catch on.

I'm still going by measurements of 2^(10*n) until computers are based on decimal...

Kilo means 10^3 when you're working in the decimal number system.
Kilo means 2^10 when you're working in the binary number system.
I mean, just look at it... they're both quite arbitrary. It seems to me that anything else is equally arbitrary - so we have 40-year entrenched arbitrary versus brand-new arbitrary. Which should win? For now, I vote for the entrenched method, simply because it will cause less total confusion.
At some point our technology is bound to change - think quantum/genetic computers - that point will be a good opportunity to sanitize our measuring system.
Also, some users will always be confused - should we remove confusion for them at the risk of confusing the community that makes it all happen (us and the hardware guys)? I think not.

For me, this is a bit like the 'hacker' arguments we had, back in the day.
Depending on how old and stubborn you are, 'hacker' may mean a different thing to you. For a while in the media (and probably still today, partly) people consider hacking to be the act of breaking into machines illegally. However, in the industry now, the feeling people get is that it is someone who enjoys tinkering with things.
For a while the security community wasn't sure if this would take off, and we actually tried to use 'cracker' to refer to the bad guys. I don't think cracker has really taken off like we'd like, but we have reclaimed 'hacker' as a legitimate term, to quite a reasonable degree of success.
So to me this is the same: just because the media has tried to consider a KB as 1,000, I will never back down, and always stand up for the rights of the remaining 24 bits.
24bFL

Drivemaker/denary Kilobytes can burn in hell. Binary units for binary machines.

Related

Why does my linux entropy have such a low upper limit?

I noticed I was getting poor performance when running cryptographic operations.
I ran cat /proc/sys/kernel/random/entropy_avail.
After some testing using watch and my own observations I can see that my entropy levels never surpass 200!
Even when I generate entropy using mouse movements etc. (when my computer is completely idle) it briefly surpasses 200 then suddenly dips back down below it for no reason.
Why is this and how do I fix it?
Perhaps the entropy-accumulating system has only about 200 bits of state, and simply cannot get more "unknown" than that. The people most concerned about having enough entropy tend to be cryptologists, and 200 bits of entropy is plenty for most (maybe all?) cryptographic applications.
You can substantially improve the available entropy with haveged. It may be already included in your distribution. Centos/Redhat users can install it from the epel repository.
Haveged was created to remedy low-entropy conditions in the Linux
random device that can occur under some workloads, especially on
headless servers.
Don't worry about it. 200 bits of entropy is more than enough.
Here's a quote from RFC 4086 (Randomness Requirements for Security):
3.1. Volume Required
How much unpredictability is needed? Is it possible to quantify the
requirement in terms of, say, number of random bits per second?
The answer is that not very much is needed. For AES, the key can be
128 bits, and, as we show in an example in Section 8, even the
highest security system is unlikely to require strong keying material
of much over 200 bits.
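If you would rather watch the value from a script than with cat and watch, here is a minimal sketch in Python. The helper name `read_entropy_bits` is made up; the only thing taken from the question is the /proc path, and the sketch assumes the file holds a single integer.

```python
from pathlib import Path

# Path used in the question; present on Linux only.
ENTROPY_PATH = Path("/proc/sys/kernel/random/entropy_avail")


def read_entropy_bits(path: Path = ENTROPY_PATH) -> int:
    """Return the kernel's current entropy estimate, in bits."""
    return int(path.read_text().strip())
```

Calling `read_entropy_bits()` in a loop with a short sleep gives the same picture as watch, and makes it easy to log the readings before and after starting something like haveged.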

Unknown events in nodejs/v8 flamegraph using perf_events

I am trying to do some nodejs profiling using Linux perf_events, as described by Brendan Gregg here.
The workflow is as follows:
Run node >0.11.13 with --perf-basic-prof, which creates a /tmp/perf-(PID).map file where the JavaScript symbol mappings are written.
Capture stacks using perf record -F 99 -p `pgrep -n node` -g -- sleep 30
Fold stacks using the stackcollapse-perf.pl script from this repository
Generate an svg flame graph using the flamegraph.pl script
I get the following result (which looks really nice at first):
The problem is that there are a lot of [unknown] elements, which I suppose should be my nodejs function calls. I assume that the whole process fails somewhere at point 3, where the perf data should be folded using the mappings generated by node/v8 executed with --perf-basic-prof. The /tmp/perf-PID.map file is created and some mappings are written to it during node execution.
How can I solve this problem?
I am using CentOS 6.5 x64, and have already tried this with node 0.11.13 and 0.11.14 (both prebuilt and compiled) with no success.
First of all, what "[unknown]" means is the sampler couldn't figure out the name of the function, because it's a system or library function.
If so, that's OK - you don't care, because you're looking for things responsible for time in your code, not system code.
Actually, I'm suggesting this is one of those XY questions.
Even if you get a direct answer to what you asked, it is likely to be of little use.
Here are the reasons why:
1. CPU Profiling is of little use in an I/O bound program
The two towers on the left in your flame graph are doing I/O, so they probably take a lot more wall-time than the big pile on the right.
If this flame graph were derived from wall-time samples, rather than CPU-time samples, it could look more like the second graph below, which tells you where time actually goes:
What was a big juicy-looking pile on the right has shrunk, so it is nowhere near as significant.
On the other hand, the I/O towers are very wide.
Any one of those wide orange stripes, if it's in your code, represents a chance to save a lot of time, if some of the I/O could be avoided.
2. Whether the program is CPU- or I/O-bound, speedup opportunities can easily hide from flame graphs
Suppose there is some function Foo that really is doing something wasteful, that if you knew about it, you could fix.
Suppose in the flame graph, it is a dark red color.
Suppose it is called from numerous places in the code, so it's not all collected in one spot in the flame graph.
Rather it appears in multiple small places shown here by black outlines:
Notice, if all those rectangles were collected, you could see that it accounts for 11% of time, meaning it is worth looking at.
If you could cut its time in half, you could save 5.5% overall.
If what it's doing could actually be avoided entirely, you could save 11% overall.
Each of those little rectangles would shrink down to nothing, and pull the rest of the graph, to its right, with it.
Now I'll show you the method I use. I take a moderate number of random stack samples and examine each one for routines that might be speeded up.
That corresponds to taking samples in the flame graph like so:
The slender vertical lines represent twenty random-time stack samples.
As you can see, three of them are marked with an X.
Those are the ones that go through Foo.
That's about the right number, because 11% times 20 is 2.2.
(Confused? OK, here's a little probability for you. If you flip a coin 20 times, and it has a 11% chance of coming up heads, how many heads would you get? Technically it's a binomial distribution. The most likely number you would get is 2, the next most likely numbers are 1 and 3. (If you only get 1 you keep going until you get 2.) Here's the distribution:)
(The average number of samples you have to take to see Foo twice is 2/0.11 = 18.2 samples.)
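If you want to reproduce those numbers yourself, here is a short sketch in Python. The names `binomial_pmf`, `N_SAMPLES` and `FRACTION` are just for this example; the 20 samples and the 11% come from the text above.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k hits in n independent samples."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


N_SAMPLES = 20   # stack samples taken
FRACTION = 0.11  # share of time spent in Foo

for k in range(6):
    print(f"P({k} samples hit Foo) = {binomial_pmf(k, N_SAMPLES, FRACTION):.3f}")

# Average number of samples needed to see Foo twice.
print(f"expected samples for two hits: {2 / FRACTION:.1f}")
```

The most likely count comes out as 2, followed by 1 and then 3, which matches the description above.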
Looking at those 20 samples might seem a bit daunting, because they run between 20 and 50 levels deep.
However, you can basically ignore all the code that isn't yours.
Just examine them for your code.
You'll see precisely how you are spending time,
and you'll have a very rough measurement of how much.
Deep stacks are both bad news and good news -
they mean the code may well have lots of room for speedups, and they show you what those are.
Anything you see that you could speed up, if you see it on more than one sample, will give you a healthy speedup, guaranteed.
The reason you need to see it on more than one sample is, if you only see it on one sample, you only know its time isn't zero. If you see it on more than one sample, you still don't know how much time it takes, but you do know it's not small.
Here are the statistics.
Generally speaking it is a bad idea to disagree with a subject matter expert but (with the greatest respect) here we go!
SO urges the answerer to do the following:
"Please be sure to answer the question. Provide details and share your research!"
So the question was, at least my interpretation of it is, why are there [unknown] frames in the perf script output (and how do I turn these [unknown] frames in to meaningful names)?
This question could be about "how to improve the performance of my system?" but I don't see it that way in this particular case. There is a genuine problem here about how the perf record data has been post processed.
The answer to the question is that although the prerequisite set up is correct: the correct node version, the correct argument was present to generate the function names (--perf-basic-prof), the generated perf map file must be owned by root for perf script to produce the expected output.
That's it!
Writing some new scripts today I hit upon this, which directed me to this SO question.
Here's a couple of additional references:
https://yunong.io/2015/11/23/generating-node-js-flame-graphs/
https://github.com/jrudolph/perf-map-agent/blob/d8bb58676d3d15eeaaf3ab3f201067e321c77560/bin/create-java-perf-map.sh#L22
[ non-root files can sometimes be forced ] http://www.spinics.net/lists/linux-perf-users/msg02588.html
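To check the ownership condition described in this answer before running perf script, something like the following Python sketch could help. The helper names `perf_map_path` and `owned_by_root` are hypothetical; the path pattern is the /tmp/perf-PID.map one from the question.

```python
import os


def perf_map_path(pid: int) -> str:
    """Path of the symbol map that node writes with --perf-basic-prof."""
    return f"/tmp/perf-{pid}.map"


def owned_by_root(path: str) -> bool:
    """Return True if the file at `path` is owned by uid 0."""
    return os.stat(path).st_uid == 0
```

If `owned_by_root(perf_map_path(pid))` returns False, changing the owner (for example with sudo chown root on that file) before post-processing is the fix this answer points to.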

/dev/zero or /dev/random - what is more secure and why?

Can anyone tell me why /dev/random is preferred for security while wiping data from a hard drive?
Simple answer, /dev/random is not preferred. Both are equally secure. Use /dev/zero for easier verification. Also less CPU usage and possibly faster.
More complete answer. For modern hard drives platter density is such that it's impossible to obtain signals from incompletely overwritten sectors of the drive, that people such as Gutmann wrote about many, many years ago. As far as modern hard drives are concerned (I'd place this as any hard drive whose capacity can be measured in Gigabytes or better), if it's overwritten it's gone. End of story. So it doesn't matter what you change the data to. Just that you change the data.
To add onto this, even if you wipe a hard drive completely, there may still be data left on the drive in sectors that were remapped by the hard drive's firmware but these are relatively rare, and only a very small amount of data would be contained within, not to mention that you'd need very specialized equipment to retrieve that data (you'd have to edit the G-List within the System Area of the drive to get at it), not to mention that the reason why those sectors were remapped in the first place is because they were failing.
So to sum up, DoD wipes are stupid, Gutmann wipes are stupider, use /dev/zero, it's good in nearly 100% of all cases. And if it's an edge case then you need to have very specialized know how to get at the data and also remove the data.
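The "easier verification" point can be shown with a short Python sketch. The function name `is_all_zero` and the 1 MiB chunk size are my own choices; it assumes you have read permission on the device or image file you pass in.

```python
def is_all_zero(path: str, chunk_size: int = 1024 * 1024) -> bool:
    """Return True if every byte readable from `path` is zero."""
    with open(path, "rb") as device:
        while True:
            chunk = device.read(chunk_size)
            if not chunk:
                # Reached the end without finding a non-zero byte.
                return True
            # Any non-zero byte means the overwrite did not cover this region.
            if chunk.count(0) != len(chunk):
                return False
```

A random fill has no equally simple check, since there is no known pattern to compare against. Note that this only sees user-addressable sectors, so the remapped sectors mentioned above stay out of reach.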
"thanks! so, what about usb stick?"
USB stick is a different animal altogether, you'd need to bypass the flash controller in order to clean it out, even a Gutmann wipe won't completely remove the data because of wear leveling algorithms. But just like a hard drive, if you overwrite the data once, it's gone, the trick is forcing the device to actually overwrite the data.
That being said, if you have a cheap USB stick without a controller which does wear leveling then a single pass 0-fill should be sufficient to remove the data within. Otherwise, you're looking at custom hardware and soldering work.
SSDs should be considered USB sticks with a controller that performs wear leveling. SSDs will always do wear leveling, I do not know of any exceptions to this rule. Many USB sticks do not.
How do you tell if a USB stick does wear leveling? You need to take it apart and inspect the controller chip and look up a datasheet on it.
"Would you give a source for the statement that it is "impossible to obtain signals from incompletely overwritten sectors of the drive" ? I am not talking about tests from computer magazines concerning data recovery stores, I am talking of the worst case scenario: a well-equipped government laboratory. So I really would like to know how can you guarantee that statement, preferably a scientific paper."
I'll give some justification and information regarding the analog storage of digital data on magnetic media. The following is mostly things that I was taught while on the job at a data recovery company, and may be partially inaccurate in places. If so, let me know, I will correct it. But this is my best understanding of the material.
After a hard drive is manufactured the first thing that happens is it receives servo labels from a servo label writing machine. This is a separate machine whose sole job is to take a completely blank hard drive and bootstrap it. (This is why hard drives have holes in them covered with aluminum tape, that's where the servo labeling machine places its write heads.) If you've ever had a drive that when you powered it on it just generated "click click click" it is because it could not read the servo labels. When a hard drive is powered on the first thing it tries to do is fling its read heads somewhere onto the platter and acquire a track. Servo labels define tracks. If it can't see a servo label it reaches the middle, makes a clack, pulls the arm back and tries again.
The reason why I mention this, is that is pretty much the only instance that an external device reads and writes to the hard drive and it describes approximately the limit that hardware outside of that drive's own read heads can work with the data on a platter. If it were possible to make servo labels smaller and more space efficient hard drive manufacturers would. Servo labels are comparatively space inefficient for two reasons.
It is absolutely critical that they do not fail. If a servo label fails then every time the head goes over that particular servo label it will lose track, this pragmatically means that the entire track is unusable.
It gives some idea of how much better hard drive hardware is at dealing with information on platters than external machinery.
A ring of servo labels defines a track. There are some things you must know about tracks.
They are not necessarily circular. They are imperfect and can contain warps. This is because the servo label machine is not accurate.
They are not necessarily concentric. They can and do cross. This means that certain sectors or whole tracks can be unusable just because the servo label machine is inaccurate.
After servo labels are written, then comes the low level format. An actual low level 1980s format of a drive, except more complicated. Because platters are circular but hard drive speeds are constant the amount of area passing under the read head is a variable function of the distance to the middle of the platter. So, in an effort to squeeze every last drop of storage out of a platter the density of the platter is variable and defined in zones. On a typical 3.5" hard drive there will be several dozen zones with different platter densities.
One of which is special and extra low density called the System Area. The System Area is where all of the firmware and configuration settings are stored on the drive. It has an extra low density because that information is more important. The lower the density the less chance there is that something will randomly screw up. It happens all of the time of course, but less often than something in the user area.
After the drive is low level formatted the firmware is written to the System Area. The firmware is different for every drive. In order to optimize the drive for the ridiculously fine requirements of the platters, each drive must be tuned. (This actually takes place before the low level format, of course, because you have to know how good the equipment is in order to decide how dense to make the platters.) This data is known as adaptives and is saved in the System Area. Information in the adaptives area is stuff like "how much voltage should I use to correct myself when the servo labels tell me I'm drifting off track", and other information required to make the hard drive actually work. If the adaptives are off slightly it might be impossible to access the user area. The system area is easier to access, so only very few adaptives are required to be stored on the PCB CMOS.
Take aways from this paragraph:
Lower density means easier to read.
The higher the density the more likely it is for things to randomly screw up.
The user area has as high a density as the hard drive manufacturer can possibly make it.
If this seems slapdash and slipshod, that's because it really is. Hard drive manufacturers compete and win on price per GB. Hard drive design isn't really about making very carefully manufactured pieces of equipment and putting them together very carefully, because that simply isn't enough anymore. Sure, they still do do that, but they also have to make the pieces work together with each other in software because the hardware tolerances are too broad to be competitive anymore.
So. Because the user area has such a high density, it actually is very (very (very very)) likely to get screwed up bits in the normal course of things. This can be caused by many, many factors including very slight timing issues and platter degradation. A good percentage of sectors of your hard drive actually contain screwed up bits. (You can verify this yourself by issuing an ATA28 READLONG command to your drive (only valid for the first 127 GB or so. There is no ATA48 equivalent; it was dropped!) several times on many sectors and comparing the output. You'll find that it isn't a rare occurrence that certain bits will misbehave and act stuck on or off or even flip randomly.) It's a fact of life. Which is why we have ECC.
ECC is a checksum contained after the 512 (or 4096 in newer drives) bytes of data that will correct that data if it has few enough incorrect bits. The exact number depends on firmware and manufacturer, but all drives have it and all drives need it (and it's surprisingly higher than you'd expect, something like 48-60 bytes that can detect and correct up to 6-8 error bytes. Crazy math going on.) This is because the density of the platters is too high for even the highly specialized and tuned internal hard drive equipment.
Finally, I want to talk about the preamp chip. It's located on the arm of the hard drive and acts as a megaphone. Because the signals are being generated from very small magnetic fields, acting on very small heads they have a very small potential. So you cannot use the hard drive head for the Gutmann method, because you cannot get an accurate enough reading from it to make Gutmann's technique worthwhile.
But let's posit that the NSA has a piece of magic equipment, and they can get a very accurate read (accurate enough to calculate the potential and derive the previously written data) of any particular bit in 1 ms. What do they need first?
First, they need the System Area. Because that's where the Translator is stored (the translator is the thing that turns an LBA address into a PCHS address (Physical Cylinder Head Sector as opposed to the logical CHS address which is fake and only around for legacy reasons)). The size of the System Area varies, and you can get it without resorting to magic tools. Normally, it's only around 50-100MB. The layout of the translator is firmware specific, so you have to reverse it (but it's been done before, no big deal.)
So first problem, signal to noise. As mentioned, platter density is tuned way higher than is strictly safe. Gutmann's method requires a very low variance in normal read/write activity to calculate previous states of the bits with any accuracy. If the variance in signal is significant then it can screw over these attempts. And the variance is significant enough to completely screw you over (that's why ECC is so crazy in modern drives.) An analogy would be like trying to perfectly hear someone whispering to you while someone is talking to you in the middle of a noisy room.
Second problem, time. Even if the electron microscope is very fast and accurate (1ms per bit! That's lightning for an electron microscope. It's also slower than a 1200 baud modem), there is a LOT of data on a hard drive and a full image will take a very long time. (WA says 126 years for an entire 500GB hard drive, and that's NOT including ECC data (which you need). There's also lots of other metadata associated with hard drive sectors that I didn't mention, like ID fields, and Address markers, but these don't get overwritten, perhaps you can come up with a faster way to image them normally? Doubtless there are ways to speed up this process (such as selectively imaging portions of the drive) but even that will take you months of 24/7 around the clock work just to get the $MFT file on a standard hard drive (typically around 50-300MB on a drive with Windows installed)).
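The full-drive figure is easy to sanity check. Here is a quick sketch in Python, assuming the 1 ms per bit read time posited above, a decimal 500 GB drive, and user data only (no ECC or other metadata). The name `imaging_years` is mine.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def imaging_years(num_bytes: int, seconds_per_bit: float = 1e-3) -> float:
    """Years needed to read every bit once at the given per-bit read time."""
    return num_bytes * 8 * seconds_per_bit / SECONDS_PER_YEAR


print(f"500 GB drive: {imaging_years(500 * 10**9):.1f} years")
```

That works out to a little under 127 years, in line with the figure quoted above.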
Third problem, admissibility. If the government is after you they're after you for only a few reasons, they want to know something that you know, or they want to arrest you and put you in prison. There are easier ways to get the former (rubber hose cryptography), and the latter will require regular evidence procedures. Going back to the analogy, if someone testified that someone told them something in a whisper, while someone else was talking to them in the middle of a crowded and noisy room, there is a lot of room for doubt there. It would never be the sort of strong evidence that anyone would want to spend lots of time and money on.
You're asking the wrong question. Attempting to securely erase a drive by writing to user-visible blocks completely ignores the fact that there could be user data in sectors marked as bad (but which still contain readable sensitive data).
Of course it is possible to work around that by issuing ATA commands, but then a single ATA secure erase command will do everything you want in the first place. See https://ata.wiki.kernel.org/index.php/ATA_Secure_Erase for details on how to use hdparm to issue the Secure Erase command with the --security-erase option.
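For reference, the sequence described on that wiki page is roughly: inspect the drive, set a temporary password, then issue the erase. The sketch below only builds and prints the hdparm command lines, it does not run anything. The device path `/dev/sdX` and the password `p` are placeholders, and the helper name is made up for this example.

```python
import shlex

def secure_erase_commands(device, password="p"):
    """Return the hdparm invocations for an ATA Secure Erase, without running them."""
    return [
        # 1. Inspect the drive; the Security section must report "not frozen".
        ["hdparm", "-I", device],
        # 2. Set a temporary user password, which enables the security feature set.
        ["hdparm", "--user-master", "u", "--security-set-pass", password, device],
        # 3. Issue the erase; the drive firmware performs the overwrite itself.
        ["hdparm", "--user-master", "u", "--security-erase", password, device],
    ]

for argv in secure_erase_commands("/dev/sdX"):
    print(shlex.join(argv))
```

Read the wiki page before running any of these for real. The erase is irreversible, and a drive in the "frozen" state will refuse the command.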

is 1024 bit rsa secure

Is 1024-bit RSA secure, or is it crackable now? Is it safe for my program to use 1024-bit RSA? I read at http://pcworld.about.com/od/privacysecurity1/Researcher-RSA-1024-bit-encry.htm that 1024-bit encryption is insecure, but I find 2048-bit slower, and I also see that various https sites (even PayPal) use 1024-bit encryption. Is 1024-bit encryption secure enough?
Last time I checked, NIST recommends 2048-bit RSA and predicts that it will remain secure until 2030. Page 67 of this PDF has the table.
Edit: They actually predict 1024-bit is OK until 2010, then 2048-bit until 2030, then 3072-bit after that. And it's NIST, not the NSA. Been too long since I did my thesis, LOL.
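If you want that timeline in code, it is just a lookup. This is a sketch using the cut-off years exactly as recalled in this answer (2010 and 2030); they have not been checked against the NIST table, and the function name is made up.

```python
def minimum_rsa_bits(protect_until_year):
    """Smallest RSA modulus size recommended for data that must stay safe until the given year."""
    if protect_until_year <= 2010:
        return 1024
    if protect_until_year <= 2030:
        return 2048
    return 3072

for year in (2009, 2015, 2030, 2040):
    print(year, minimum_rsa_bits(year))
```

The point of keying on the year is that the right size depends on how long the data has to stay secret, not on when you encrypt it.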
What are you trying to protect? If you are encrypting something that is not terribly vital, then 1024 may be fine, but, if you are protecting something that is very vital, such as someone's medical or financial info then 4096 bits would be better.
The size of the key really depends on what you are protecting, and how long you expect the encryption to hold. If your timeframe is that the info is only valid for 10 mins then 1024 works fine, for 10 years of protection it isn't.
So, what are you protecting?
There is no easy answer to the question "is size n secure ?" because it depends on the resources of an expected attacker. This has two parts:
Resources that the attacker is willing to invest heavily depend on the situation: defeating your grandmother, a bored computer-science student, or the full secret service of some big, rich country, does not involve the same attack power. It also depends on the perceived value of the protected data.
When designing the system, you want some margin of security, which means that you will make some prophecies on how computing power will evolve in the future, and this raises the difficult question of the notion of cost.
So there are several estimates which have been proposed by various researchers and government institutes. This site offers a survey of such methods, with online calculators so that you may play a bit with some of the input parameters.
Short answer is that if you want short-term security (i.e. security is not relevant beyond, say, year 2015) and 1024 bits are not enough for you, then your enemies must be very powerful indeed. Scarily so. To the point that you should have other, more urgent trouble on your hands.
It is necessary to define the meaning of secure to get a useful answer.
Is your house secure? Mostly we make it "good enough." For example, making it harder to break in than the neighbors is often adequate. That way the thieves spend time trying to break into next door rather than your place.
It might be secure if it requires X hours to break in and the valuable content is worth Y. Converting time to money is tricky, but if it takes a cracker 100 hours of his time to break in, and the contents of your information is worth, say $100, then your data is probably secure enough.
Nothing is going to be totally secure forever. If you're that worried about it, just use 2048-bit and sacrifice speed for better security.
Besides, as the article states:
But determining the prime numbers that make up a huge integer is nearly impossible without lots of computers and lots of time.
It all depends on whether or not you think people will actually try that hard to get at whatever information you're trying to protect.
Found a recent paper addressing exactly this question:
On the Security of 1024-bit RSA and
160-bit Elliptic Curve Cryptography
version 2.1, September 1, 2009
http://eprint.iacr.org/2009/389.pdf
It is said that 1024-bit numbers cannot currently be factored, but 1024-bit RSA (a modulus of roughly 309 decimal digits) is not considered secure enough. It is advisable to use RSA with 2048 bits or more if one needs long-term security. There are many well-funded research organizations working on this, and there is a chance that they do not share everything they find. So I think we can say it is not secure at all. I mean, if one day I had to encrypt important data, I would prefer 2048 bits or more, considering long-term security and unknown developments in the field.
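The decimal size of a 1024-bit number can be checked directly, since Python integers are arbitrary precision. A quick sketch, counting the digits of the largest 1024-bit value and comparing with the log10 estimate:

```python
import math

bits = 1024
largest = 2**bits - 1                 # biggest value that fits in 1024 bits
digits = len(str(largest))            # exact count of decimal digits
estimate = math.floor(bits * math.log10(2)) + 1

print(digits, estimate)
```

Both give 309, so "about 310 digits" is the right ballpark. A 2048-bit modulus is about twice as long in decimal.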

How does a 7- or 35-pass erase work? Why would one use these methods?

How and why do 7- and 35-pass erases work?
Shouldn't a simple rewrite with all zeroes be enough?
A single pass with zeros doesn't completely erase magnetic artifacts from a disk. It's still possible to recover the data from the drive. A 7-pass erasure using random data will do a pretty complete job to prevent reconstruction of the data on the drive.
Wikipedia has a number of different articles relating to this topic.
http://en.wikipedia.org/wiki/Data_remanence
http://en.wikipedia.org/wiki/Computer_forensics
http://en.wikipedia.org/wiki/Data_erasure
I'd never heard of the 35-pass erase: http://en.wikipedia.org/wiki/Gutmann_method
The Gutmann method is an algorithm for
securely erasing the contents of
computer hard drives, such as files.
Devised by Peter Gutmann and Colin
Plumb, it does so by writing a series
of 35 patterns over the region to be
erased. The selection of patterns
assumes that the user doesn't know the
encoding mechanism used by the drive,
and so includes patterns designed
specifically for three different types
of drives. A user who knows which type
of encoding the drive uses can choose
only those patterns intended for their
drive. A drive with a different
encoding mechanism would need
different patterns. Most of the
patterns in the Gutmann method were
designed for older MFM/RLL encoded
disks. Relatively modern drives no
longer use the older encoding
techniques, making many of the
patterns specified by Gutmann
superfluous.[1]
Also interesting:
One standard way to recover data that
has been overwritten on a hard drive
is to capture the analog signal which
is read by the drive head prior to
being decoded. This analog signal will
be close to an ideal digital signal,
but the differences are what is
important. By calculating the ideal
digital signal and then subtracting it
from the actual analog signal it is
possible to ignore that last
information written, amplify the
remaining signal and see what was
written before.
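To make the subtract-and-amplify idea concrete, here is a toy model of it. Everything in it is an assumption for illustration: bits are stored as levels of -1.0 and +1.0, the old data leaves a 5% trace on top of the new data, and there is no noise at all. Real read heads are nowhere near this clean.

```python
def to_levels(bits):
    """Map bits to ideal signal levels: 0 -> -1.0, 1 -> +1.0."""
    return [1.0 if b else -1.0 for b in bits]

def read_head(new_bits, old_bits, residue=0.05):
    """Toy analog read: the new data plus a faint trace of the old data."""
    new_levels = to_levels(new_bits)
    old_levels = to_levels(old_bits)
    return [n + residue * o for n, o in zip(new_levels, old_levels)]

def recover_previous(analog, gain=20.0):
    """Subtract the ideal signal, amplify what is left, and read it as bits."""
    current_bits = [1 if s > 0 else 0 for s in analog]   # normal decode
    ideal = to_levels(current_bits)
    remainder = [gain * (s - i) for s, i in zip(analog, ideal)]
    return [1 if r > 0 else 0 for r in remainder]

old_bits = [1, 0, 1, 1, 0, 0, 1, 0]
new_bits = [0, 1, 1, 0, 1, 0, 0, 1]
recovered = recover_previous(read_head(new_bits, old_bits))
print(recovered == old_bits)
```

In this noise-free model the old bits come back perfectly. Once random noise larger than the residue is added to the signal, the sign of the remainder is no longer reliable.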
As mentioned before, magnetic artifacts are present from the previous data on the platter.
In a recent issue of MaximumPC they put this to the test. They took a drive, ran it through a pass of all zeros, and hired a data recovery firm to try and recover what they could. Answer: Not one bit was recovered. Their analysis was that unless you expect the NSA to try, a zero pass is probably enough.
Personally, I'd run an alternating pattern or two across it.
one random pass is enough for plausible deniability, as the lost data will have to be mostly "reconstructed" with a margin of error that grows with the length of the data being recovered, as well as with whether or not the data is contiguous (in most cases, it's not).
for the insanely paranoid, three passes is good. 0xAA (10101010), 0x55 (01010101), and then random. the first two will grey out residual bits, the last random pass will obliterate any "residual residual" bits.
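A minimal sketch of that three-pass sequence, run against an in-memory buffer standing in for the drive. The helper names and the 4096-byte chunk size are made up for the example. A real wipe would open the raw device instead, and would still miss remapped sectors.

```python
import io
import os

def fill_aa(n):
    return b"\xAA" * n      # 10101010

def fill_55(n):
    return b"\x55" * n      # 01010101

# Third pass uses os.urandom(n), which returns n random bytes.
PASSES = [fill_aa, fill_55, os.urandom]

def overwrite_pass(target, size, make_chunk, chunk_size=4096):
    """Overwrite the first `size` bytes of a seekable binary file object once."""
    target.seek(0)
    remaining = size
    while remaining > 0:
        n = min(chunk_size, remaining)
        target.write(make_chunk(n))
        remaining -= n
    target.flush()

disk = io.BytesIO(b"secret data" * 1000)   # stand-in for a block device
size = len(disk.getvalue())
for make_chunk in PASSES:
    overwrite_pass(disk, size, make_chunk)
```

The two fixed patterns are bitwise complements of each other, so between them every bit position gets written as both a 0 and a 1 before the random pass.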
never do passes with zeros. under magnetic microscopy the data is still there, it's just "faded".
never trust "single file shredding", especially on solid state mediums like flash drives. if you need to "shred" a file, well, "delete" it and fill your drive with random data files until it runs out of space. then next time think twice about housing shred-worthy data on the same medium as "low-clearance" stuff.
the gutmann method is based on tin-foil hat speculation, it does various things to get drives to degauss themselves, which is admirable in an artistic sense, but pragmatically it's overkill. no private organisation to-date has successfully recovered data from even a single random pass. and as for big brother, if the DoD considers it gone then you know it's gone, the military industrial complex gets all the big bucks to try and do exactly what gutmann claims they can do, and believe you me if they had the tech to do so it would already have been leaked to the private sector since they're all in bed with each other. however if you want to use gutmann in spite of this, check out the secure-delete package for linux.
7 pass and 35 pass would take forever to finish. HIPAA only requires DOD 3-pass overwrite, and I am not certain why DOD even has a 7 pass overwrite, as it seems they simply shred the disks before disposing of machines anyway. Theoretically, you could recover data off of the outer edges of each track (using a scanning electron microscope or microscopic magnetic probe), but in practice you would need the resources of a disk drive maker or one of the three-letter government organizations to do this.
The reason to perform multipass writes is to take advantage of the slight errors in positioning to overwrite the edges of the track also, making recovery far less likely.
Most drive recovery companies can't recover a drive that has had its data overwritten even once. They are typically taking advantage of the fact that Windows doesn't zero out the data blocks, just changes the directory to mark the space free. They simply 'undelete' the file and make it visible again.
If you don't believe me, call them up and ask them if they can recover a disk that has been dd'ed over... they will typically tell you no, and if they do agree to try, it will be serious $$$ to get it back...
DOD 3 pass followed by a zero overwrite should be more than sufficient for most (i.e. non-TOP SECRET) folks.
DBAN (and its commercially supported descendant, EBAN) do this all cleanly... I would recommend these.
See: Secure Deletion of Data from Magnetic and Solid-State Memory
Advanced recovery tools can recover single pass deleted files easily. And they are expensive too (e.g. http://accessdata.com/).
A visual GUI for Gutmann passes from http://sourceforge.net/projects/gutmannmethod/ shows it has 8 semi-random passes. I have never seen proof that files deleted with the Gutmann method have been recovered.
Overkill, maybe, but still far better than Windows soft delete.
Regarding the second part of the question, some of the answers here actually contradict real research on that exact topic. According to the "Number of overwrites needed" section of the Data erasure article on Wikipedia, on modern drives, erasing with more than one pass is redundant:
"ATA disk drives manufactured after 2001 (over 15 GB) clearing by
overwriting the media once is adequate to protect the media from both
keyboard and laboratory attack." (citation)
Also, infosec did a nice article on the entire subject, entitled "The Urban Legend of Multipass Hard Disk Overwrite", covering the old US Government erasure standards, among others, and how the multi-pass myth established itself in the industry.
"Fortunately, several security researchers presented a paper [WRIG08]
at the Fourth International Conference on Information Systems Security
(ICISS 2008) that declares the “great wiping controversy” about how
many passes of overwriting with various data values to be settled:
their research demonstrates that a single overwrite using an arbitrary
data value will render the original data irretrievable even if MFM and
STM techniques are employed.
The researchers found that the probability of recovering a single bit
from a previously used HDD was only slightly better than a coin toss,
and that the probability of recovering more bits decreases
exponentially so that it quickly becomes close to zero.
Therefore, a single pass overwrite with any arbitrary value (randomly
chosen or not) is sufficient to render the original HDD data
effectively irretrievable."
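The "decreases exponentially" part is worth seeing in numbers. A minimal sketch, assuming each bit is recovered independently and taking 0.56 as a made-up stand-in for "slightly better than a coin toss" (the quote above does not give the actual per-bit figure):

```python
def recovery_probability(n_bits, per_bit=0.56):
    """Chance of getting n_bits all correct when each bit is an independent guess."""
    return per_bit ** n_bits

for n in (1, 8, 32, 1024):
    print(n, recovery_probability(n))
```

Under those assumptions, a single byte already comes back correct less than 1% of the time, and anything the size of a password or a key is hopeless.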
There's a lot of misinformation around this, though most of the answers I see on this page are correct. I've worked in the data recovery industry for 25 years and have addressed this exact question an enormous number of times.
The "residual magnetism" hypothesis never worked in real life. And back then, tolerances were millions of times looser.
If you still doubt this, remember that a rotational hard drive uses the same storage principle as an audio tape - moving magnetic substrate storage - and the audio tape that was recorded over a single time in the Watergate case has still not been recovered.
A single zero-pass wipe renders all the data on a HDD unrecoverable unless some malfunction or mistake causes the overwrite to be incomplete. This was true even back in the days when Peter Gutmann released his paper (which was like a tsunami in the erasure industry.) Gutmann's paper was pure hypothesis, it never panned out in reality. Even in the days of MFM/RLL drives, nobody could recover from a single-pass overwrite. It should be noted that Gutmann patented the algorithm that his paper said would be required to ensure complete erasure. Presumably, every time erasure was sold with his algorithm, he got paid. I am not saying there was intentional deception on his part, just pointing out that his algorithm, though there was never any evidence it erased better than a single overwrite, was patented and sold.
Please note that SSDs are different. SSDs can (and often do) use a pool of sectors that are rotated in and out of use, so if data is written to an SSD and then "deleted", and the drive rotates the sectors holding the deleted file out of the pool, an erasure might not be able to reach those sectors, because the firmware in the SSD has control that software can't override. One way around this is to continuously overwrite until all sectors have been rotated into use.
The reason multiple passes exist is because hardware can malfunction. If the drive somehow malfunctions during one pass, it's possible that not all sectors will be erased - however, most good erasure software offers a full verification, which basically reads every bit on the drive to make sure the erasure didn't malfunction. With that, multi-pass overwrites are overkill.
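A verification pass is simple in principle: read everything back and check it against what the wipe was supposed to write. This is a rough sketch for a fixed-pattern wipe (zeros by default), using an in-memory buffer in place of a drive. The function name and chunk size are made up, and it is not taken from any particular erasure product.

```python
import io

def verify_erased(target, size, expected_byte=0x00, chunk_size=4096):
    """Return the offset of the first byte that was not overwritten, or None if all is clean."""
    target.seek(0)
    offset = 0
    while offset < size:
        data = target.read(min(chunk_size, size - offset))
        if not data:
            return offset                      # short read: the rest was never checked
        if data != bytes([expected_byte]) * len(data):
            bad = next(i for i, b in enumerate(data) if b != expected_byte)
            return offset + bad
        offset += len(data)
    return None

clean = io.BytesIO(bytes(10000))
print(verify_erased(clean, 10000))
```

For a random-data pass you would have to regenerate the same stream (for example from a saved seed) to compare against, which is one practical reason a fixed pattern plus verification is attractive.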
And sometimes, data is so sensitive, it makes sense to go overboard in making sure it's destroyed. For example, I heard about a drive that was erased by the military with a 7-pass zero-fill, then the drive was run over by a tank, and then the remains were buried in a secret location in a highly secured area. Practically, the recoverability is about the same as a single-pass overwrite, but if lives could be lost as a result of the data falling into the wrong hands, then why not go for the overkill?

Resources