Get reasonably long common subsequences in multiple files

I have 3 files, each about 6 MB in size, and I need to know which blocks are common to those files. By blocks I mean chunks of binary data 128 bytes long. Unfortunately, they have no alignment and a block might start anywhere, and I need to know where in the files those blocks occur. I've read some text about the Longest Common Subsequence problem, but my problem is a bit different: I don't need the longest match in 2 files, but all reasonably long matches in 3 files. I'm pretty sure this is quite complex, but perhaps someone knows a smart solution to the problem, preferably with already existing tools.
I have a vague idea of how to code the most stupid version (compare everything with everything with everything) but I need a result in this century ;)
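Not an existing tool, but a minimal sketch of the hash-index idea (rather than comparing everything with everything): it assumes the files fit in memory, and the class and method names are mine.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Index every 128-byte window of file A by a cheap hash, then slide a window
// over file B and verify candidate hits byte-for-byte. Running A-vs-B and
// A-vs-C and intersecting the matching offsets in A covers the three-file case.
public class CommonBlocks {
    static final int BLOCK = 128;

    static int windowHash(byte[] data, int off) {
        int h = 1;
        for (int i = 0; i < BLOCK; i++) h = 31 * h + data[off + i];
        return h;
    }

    static boolean sameBlock(byte[] a, int i, byte[] b, int j) {
        for (int k = 0; k < BLOCK; k++) if (a[i + k] != b[j + k]) return false;
        return true;
    }

    public static void main(String[] args) throws Exception {
        byte[] a = Files.readAllBytes(Paths.get(args[0]));
        byte[] b = Files.readAllBytes(Paths.get(args[1]));

        Map<Integer, List<Integer>> index = new HashMap<>();   // hash -> offsets in A
        for (int i = 0; i + BLOCK <= a.length; i++)
            index.computeIfAbsent(windowHash(a, i), k -> new ArrayList<>()).add(i);

        for (int j = 0; j + BLOCK <= b.length; j++) {
            List<Integer> hits = index.get(windowHash(b, j));
            if (hits == null) continue;
            for (int i : hits)
                if (sameBlock(a, i, b, j))
                    System.out.println("128-byte match: A@" + i + "  B@" + j);
        }
    }
}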

Related

A new idea on how to beat the 32,767 text limit in Excel

So, as many others have asked in the past: is there a way to beat the 32k-per-cell limit in Excel?
I have found ways to do it by splitting the workload into two different .txt files and then merging them, but it is a giant PITA, and more often than not I end up just using Excel up to its limit, because I no longer have time to validate the data after the .txt file merges. It is a long and tedious process, IMO.
I suspect the limitation exists because it was hard-coded when Microsoft developed Excel, and they have yet to raise it (in the 2013 version the limit is still the same, so upgrading would do no good).
I also know that many will say that if you need information of that length in a single cell, you should use Access. Well, I have no idea how to use Access, or how to import a tab-delimited file into Access the way you would into Excel. Even if I could figure that out, I would still have to learn all the new commands and their Excel equivalents, if there even is such a thing.
So I was browsing some blog posts the other day on how to beat limitations by software and I read something about reverse engineering.
Would it be possible to load excel into a hex editor, go in and change every instance of 32767 to something greater?
While 32767 may seem like an arbitrary number, it's actually the upper limit of a 16-bit signed integer (called a short in C). The range of a short goes from -32768 to 32767.
A 16-bit integer can also be unsigned, in which case its range is 0 to 65535.
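Just to make those numbers concrete, a tiny illustrative Java check (short is Java's 16-bit signed type):

public class SixteenBitLimits {
    public static void main(String[] args) {
        System.out.println(Short.MIN_VALUE); // -32768
        System.out.println(Short.MAX_VALUE); //  32767
        System.out.println(0xFFFF);          //  65535, the unsigned 16-bit maximum
    }
}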
Since it's impossible for a cell to have a negative number of characters, it seems odd that Microsoft would limit a cell's length based on a signed rather than unsigned 16-bit integer. When they wrote the original program, they probably couldn't imagine anyone storing so much information in a single cell. Using shorts may have simplified the code. (My first computer had only 4K of memory, so it's still amazing to me that Excel can store 8 times that much information in a single cell.)
Microsoft may have kept the 32767 limit to maintain backward compatibility with previous versions of Excel. However, that doesn't really make sense, because the row and column counts greatly increased in recent versions of Excel, making large spreadsheets incompatible with previous versions.
Now to your question of reverse-engineering Excel. It would be a gargantuan task, but not impossible. In the early '90s, I reverse-engineered and wrote vaccines for a few small computer viruses (several hundred bytes). In the '80s, I reverse-engineered an 8KB computer chess program.
When reverse-engineering an executable, you'll need a good disassembler or decompiler. Depending on what you use, you may get assembly-language or C code as the output. But note that this will not be commented code, and you will not see meaningful variable or function names. You'll have to read every line of code to determine what it does. And you'll quickly discover that the executable is the least of your worries. Excel's executable links in a number of DLL files, which would also need reverse-engineering.
To be successful, you will need an extensive knowledge of Windows programming in addition to C or Intel assembly code – not to mention a large amount of patience. Learning Access would be a much simpler task.
I'd be interested in why 32767 is insufficient for your needs. A database may make more sense, and it wouldn't necessarily need to duplicate the functionality of Excel. I store information in a database for output to Web pages, in which case I use HTML+JavaScript for anything that needs to be interactive.
In case anyone is still having this issue:
I had the same problem when generating a pipe-separated file of longitudinal research data. The header row exceeded the 32767 limit. It's not an issue unless the end user opens the file in Excel. The workaround is to have the end user open the file in Google Sheets, perform the text-to-columns transformation, then download the file and open it in Excel.
https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Length-limit-of-cell-contents-in-Excel-when-opening-exported-bibliographic-data?language=en_US
Jack Straw from Wichita (https://stackoverflow.com/users/10327211/jack-straw-from-wichita) surely you can import a pipe-separated file directly into Excel, using Data > Get Data? For me it finds the pipe and treats the piped file the same way as a CSV. Even if it did not for you, the import gives you an option to specify the separator used in your text file.
Kind regards
Sefton Hall

Protocol buffers handling very large String message?

I was finally able to write protocol buffers code over REST and did some comparison with XStream, which we currently use.
Everything seems great; I only stumbled over one thing.
We have very large messages in one particular attribute, something like this:
message Data {
    optional string datavalue = 1;
}
The datavalue above holds extremely large text messages, ranging from 512 KB to 5 MB in size.
Protocol buffers deserialize them just fine, with superb performance compared to XStream.
However, I notice that when I send this message over the wire (via REST), it takes longer to get a response. Always twice as long as XStream.
I am thinking this might come from serialization time.
Google's documentation says Protocol Buffers is not designed to handle very large messages, although it can handle very large data sets.
I was wondering if anyone has an opinion on, or maybe a solution for, my case above?
Thanks
I was benchmarking different serialization tools a while ago and noticed that the Protobuf Java library took about 1.7x as long to serialize strings as java.io.DataOutputStream did. When I looked into it, it seemed to have to do with a weird artifact of how the JVM optimizes certain code paths. However, in my benchmarking, XStream was always slower, even with really long strings.
One quick thing to try is the format-compatible Protostuff library in place of Google's Protobuf library.
I remember reading somewhere (I'm trying to locate the article) that protobuf is very good if you have a mix of binary and textual data types. When you are working purely with textual data, you could get better performance and size by compressing it.
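To make the compression suggestion concrete, a minimal sketch in Java. It assumes you switch the field to bytes (e.g. optional bytes datavalue = 1;) and gzip the text before setting it; the class and message/field names here are hypothetical.

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class TextCompressor {
    // Gzip a large text value before placing it in a protobuf bytes field.
    static byte[] compress(String text) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }
}

// usage (hypothetical message/field names):
//   Data data = Data.newBuilder()
//       .setDatavalue(ByteString.copyFrom(TextCompressor.compress(hugeText)))
//       .build();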

Protocol buffers logging

In our business, we are required to log every request/response coming to our server.
At the moment, we use XML as the standard implementation.
Log files are used when we need to debug or trace an error.
I am curious: if we switch to protocol buffers, which are binary, what would be the best way to log requests/responses to a file?
For example:
FileOutputStream output = new FileOutputStream("\\files\\log.txt");
request.build().writeTo(output);
For anyone who has used protocol buffers in an application: how do you log your requests/responses, in case you need them for debugging purposes?
TL;DR: write debugging logs in text, write long-term logs in binary.
There are at least two ways you can do this logging (and maybe, in fact, you should do both):
Writing your logs in text format. This is good for debugging and quickly checking for problems with your eyes.
Writing your logs in binary format - this will make future analysis much quicker, since you can load the data using the same protocol buffers code and do all kinds of things with it.
Quite honestly, this is more or less the way this is done at the place this technology came from.
We use the ShortDebugString() method on the C++ object to write down a human-readable version of all incoming and outgoing messages to a text-file. ShortDebugString() returns a one-line version of the same string returned by the toString() method in Java. Not sure how easy it is to accomplish the same thing in Java.
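In Java, a rough equivalent exists in com.google.protobuf.TextFormat; a sketch, assuming the standard protobuf-java runtime (the wrapper class name is mine):

import com.google.protobuf.Message;
import com.google.protobuf.TextFormat;

class RequestTextLogger {
    // One-line, human-readable rendering, roughly matching C++ ShortDebugString().
    static String oneLine(Message request) {
        return TextFormat.shortDebugString(request);
    }
}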
If you have competing needs for logging and performance then I suppose you could dump your binary data to the file as-is, with perhaps each record preceded by a tag containing a timestamp and a length value so you'll know where this particular bit of data ends. But I hasten to admit this is very ugly. You will need to write a utility to read and analyze this file, and will be helpless without that utility.
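A sketch of that binary-record idea using protobuf-java's own delimited helpers, which write a varint length prefix before each record; the timestamp would have to live inside the message or be written separately, and MyRequest is a placeholder name:

import java.io.FileInputStream;
import java.io.FileOutputStream;

class BinaryRequestLog {
    static void append(MyRequest request) throws Exception {
        try (FileOutputStream out = new FileOutputStream("requests.bin", true)) { // append mode
            request.writeDelimitedTo(out);   // varint length prefix + message bytes
        }
    }

    static void readAll() throws Exception {
        try (FileInputStream in = new FileInputStream("requests.bin")) {
            MyRequest msg;
            while ((msg = MyRequest.parseDelimitedFrom(in)) != null) {   // null at end of file
                System.out.println(msg);
            }
        }
    }
}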
A more reasonable solution would be to dump your binary data in text form. I'm thinking of "lines" of text, again starting with whatever tagging information you find relevant, followed by some length information in decimal or hex, followed by as many hex bytes as needed to dump your buffer - thus you could end up with some fairly long lines. But since the file is line structured, you can use text-oriented tools (an editor in the simplest case) to work with it. Hex dumping essentially means you are using two bytes in the log to represent one byte of data (plus a bit of overhead). Heh, disk space is cheap these days.
If those binary buffers have a fairly consistent structure, you could even break out and label fields (or something like that) so your data becomes a little more human readable and, more importantly, better searchable. Of course it's up to you how much effort you want to sink into making your log records look pretty; but the time spent here may well pay off a little later in analysis.
If you have non-ASCII character strings in your messages, simply logging them via an implicit or explicit call to toString will escape the characters.
"오늘은 무슨 요일입니까?" becomes "\354\230\244\353\212\230\354\235\200 \353\254\264\354\212\250 \354\232\224\354\235\274\354\236\205\353\213\210\352\271\214?"
If you want to retain the non-ASCII characters, use TextFormat.printer().escapingNonAscii(false).printToString(message).
See this answer for more details.
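For completeness, a sketch of the two printer calls mentioned above, side by side; it assumes a recent protobuf-java where TextFormat.printer() is available, and the wrapper class name is mine:

import com.google.protobuf.Message;
import com.google.protobuf.TextFormat;

class NonAsciiLogging {
    static String escaped(Message message) {
        // non-ASCII characters come out as octal escapes, as shown above
        return TextFormat.printer().printToString(message);
    }

    static String readable(Message message) {
        // keeps the original characters intact
        return TextFormat.printer().escapingNonAscii(false).printToString(message);
    }
}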

Do you use "kibibyte" as a unit of measurement in your programs? [closed]

For decades, in the field of computing (except among disk manufacturers), a KB (kilobyte) was understood to mean 1024 bytes. In the past few years, there has been a movement to use KiB ("kibibyte") to mean 1024 bytes and to change the meaning of kilobyte to 1000 bytes, dooming us to many more years of confusion. On the other hand, the movement seems to be confined to GNOME and some overzealous Wikipedia editing.
Will you be converting your programs to use KiB? If you have ever displayed a filesize in KB, did you divide by 1000 or 1024?
KB is 1024 bytes, damnit.
I did this once before in an app. While internally it used kibbi's and mebbi's (KiB, MiB, etc), it would still display in what users (in this case IT folks) were used to. The underlying field was just a long that was in bytes IIRC.
It was forward compatible, and would at least allow you to enter 4 GB as well as 4GiB. It also understood shorthand entry like 4.5G and properly rounded back to the real number of bytes - not forcing poor user to have to enter it that way and prevent their mistakes. Updating to use IEC notation is 1 line of code.
Kilos are 1000, and 98% of the world uses metric. We need to get over it already.
I see a lot of anger in many of these responses, which baffles me. SI prefixes are SI prefixes, and programmers have no right to alter them for no better reason than convenience and custom. It's odd that those in computer science, a highly technical field, are the ones clamoring to go back to the days of cubits, furlongs, and rods. WTF?
We all know what we mean, but sticking to custom alienates and confuses users. Just because in the early pioneer days some guys, when talking about computer memory, decided to reuse SI notation doesn't mean they were correct to do so.
In my opinion, "1 kilobyte equals 1000 bytes" is something drive makers want you to believe, so that your drive looks more spacious than it really is. ;)
Since I spent a few years studying to be a mechanical engineer before switching majors, I have to admit that "kilo" always means 10^3 to me. From that standpoint, KiB makes sense. However, try saying "kibibyte" out loud a few times, and think about how dumb you sound.
Therefore, kilogram is 1000 grams, kilobyte is 1024 bytes.
Addendum: In addition, I agree with those who have been saying that we shouldn't change what is already established if it works. 1024 is simply a nicer number in binary. Also, "kibibyte" still sounds like something a dog eats.
It's not changing the meaning of "kilobyte". Kilo means 1000. Some people were using it incorrectly to refer to units of 1024 bytes.
I never display file sizes in kibibytes, because users don't care about 1000 vs 1024. Instead, I always use "XXX KB/MB/GB", where XXX is the number of bytes divided by 1 thousand / 1 million / etc.
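As a sketch of that display logic (1000-based units; the class name, method name, and whole-unit rounding are my own choices):

class FileSizeText {
    static String humanSize(long bytes) {
        if (bytes < 1000) return bytes + " B";
        String[] units = {"KB", "MB", "GB", "TB"};
        double value = bytes;
        int i = -1;
        while (value >= 1000 && i < units.length - 1) {
            value /= 1000;
            i++;
        }
        return String.format("%.0f %s", value, units[i]);
    }
}
// FileSizeText.humanSize(65323) returns "65 KB"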
There are 2 ways to think about this:
Use what the operating system you're running on uses. That way users have a consistent experience.
Use what is correct.
If you always use KiB, though, there will be no confusion. If you use KB, there will be confusion. So if you choose option #2, you're better off actually using 1024 and the KiB suffix. Working with powers of 2 is more efficient anyway.
It's up to you but my rule of thumb would be that if you have a technical audience, then use KiB and avoid all confusion. If you have a large user base of non technical users, then use what your operating system uses. By the way Windows uses KB to mean 1024 bytes.
Areas of speciality have always used terms in ways that are understood by that specialisation. For example, a mechanical engineer building a bridge uses the term "stress" to mean something completely different from, say, a lawyer who finds out his star witness has been lying on the first day in court. Should we mandate that the engineer use the same definition for "stress" as the lawyer just because that definition is more widely used? If we do, I'm not driving across that bridge!
Kilobytes = 1024 bytes. It's an industry-accepted specialisation of the term.
I use KiB.
Do you really want to hurt everyone by refusing to use well-established standards just like IE?
I've always displayed file sizes in 1000-byte kilobytes. It hardly ever matters to the people who can't tell the difference, and it often relieves confusion when they see the actual number. 65323 bytes = 65 KB when rounded, and "normal" people like that.
I probably won't ever display "KiB", since that's never what my customers want.
The arrogance of deciding not to follow the standard created by more than just the computer community (see... it isn't "new" that Kilo actually means 1000) is staggering.
Only if the situation called for it. In almost all cases, 1,000-based units are more appropriate.
The only exceptions I know of are memory, since it naturally comes in multiples of a power of two, and CD size, since it's measured in multiples of 2^20 bytes by the manufacturers. Everything else, including hard drives, DVDs, flash drives, bandwidth, processor speeds, and memory buses, is currently measured in 1000s, and file sizes should be, too. (Or at least, Steve Jobs and I think so. Windows will probably continue measuring file sizes in 1024s for years...)
To avoid confusing the user, use k- = 1,000, and Ki- = 1,024.
The sloppy usage of "k" to mean 1024 is an unholy abomination that should be killed with fire.
Mac OS X doesn't use KiB, MiB, or GiB. On the other hand, when it uses the metric units, it at least does the maths correctly.
Personally I prefer to get this stuff right so that users who are currently in the dark would learn from it. Waiting for users to change first is just foolish. Users didn't suddenly wake up some day and think that a kilobyte is 1024 bytes - it was software which made them think that, so shouldn't it be software's job to correct the mistake?
I've worked in the storage industry for a decade. Arguments over the size of a TB can vary the size of a solution by 10%. In short: programmers and the storage industry use different measurements. Neither are right all the time.
The Storage Networking Industry Association (SNIA) dictionary defines kilobyte as:
Kilobyte (KB)
[General] 1,000 (10^3) bytes.
The SNIA uses the 10^3 convention commonly found in storage and data transfer-related literature rather than the 1,024 (2^10) convention common in computer system random access memory and software literature.
My rule of thumb is:
Measure memory, files, file systems, and data on a network in 1024^n byte blocks.
Measure raw disk space — and only raw disk space — in 1000^n byte blocks.
Tell the customer which unit you're using. Repeat yourself often.
By and large, that keeps me out of trouble.
One program I'm working on uses "KiB" by default, but has a user preference for which unit of measurement to use (1024 B in a KiB, 1024 B in a KB, or 1000 B in a KB).
No. 1024 bytes is a kilobyte, regardless of whether that makes sense.
The usage of the "kilo-" prefix for units of 1024 bytes back in the day was probably a mistake. But it's now the standard. Trying to change it now only adds to the confusion.
We don't deal with the world as it should be; we deal with the world as it is.
Technically KiB is correct, but I have seen it in only a few applications (mainly Linux console apps). Users are either used to working with 1024 for both KB and KiB (IT people), or they don't really care and will think that "KiB" is misspelled (non-IT people).
However: I have been working with "kilobytes = 1024 bytes" for over 20 years now, and even though I know it is scientifically wrong, I will go on using it.
If you need to provide KiB to allow your soul to rest, make it available as an option, but don't confuse poor users with yet another definition - especially if they work with an OS that uses the non-scientific approach and defines KB as 1024.
(BTW: Kibibytes always reminds me of Tinky Winky and his friends... ;) )
I tried to start using these terms when teaching my students, but I've sort of given up now.
I've taught an introductory computer course ("and this is a disk drive") a few times, and it can be confusing for the students that the prefixes mean different things in different contexts. Kilo means 1024 when you have a kilobyte or a kilobit of data, except if you store it on disk when it is 1000, and if you send a kilobit per second over a network then it is 1000, and a kilohertz is of course 1000 too. And one kilometer of fiber cable is 1000 meters! But it turns out that it really isn't that much of a problem. The engineering and computer science students need to know the difference, and they will get used to it anyway. When I meet them again in database courses or in the compiler course, there is never any confusion about the different kinds of kilos, megas and teras. And students from other areas (media design and so on) don't really care.
And after I did an informal poll among the other computer science people in my corridor at the university, and found out that most of them had never heard of these new prefixes, I definitely gave up.
A KB is 1024 bytes
A kB is 1000 bytes
Unfortunately, spelled out it is ambiguous. I always use 1024.
Knuth refers to MB as KKBytes or kkBytes to differentiate between 1024*1024 and 1000*1000.
I have honestly never heard of this, and I doubt it's going to gain much traction in mainstream usage. I can't imagine why I would want to start doing this. The current definition of kilobyte is accurate and sufficient. I would much rather see hard drive manufacturers start using accurate terminology than further dumb down technical terminology. Why can't manufacturers either build drives that are exactly x GB in size or simply say what they really are?
Other than rants about how the terminology needs to change, I have never heard those expressions used. It is not going to catch on.
I'm still going by measurements of 2^(10*n) until computers are based on decimal...
Kilo means 10^3 when you're working in the decimal number system.
Kilo means 2^10 when you're working in the binary number system.
I mean, just look at it... they're both quite arbitrary. It seems to me that anything else is equally arbitrary - so we have 40-year entrenched arbitrary versus brand-new arbitrary. Which should win? For now, I vote for the entrenched method, simply because it will cause less total confusion.
At some point our technology is bound to change - think quantum/genetic computers - that point will be a good opportunity to sanitize our measuring system.
Also, some users will always be confused - should we remove confusion for them at the risk of confusing the community that makes it all happen (us and the hardware guys)? I think not.
For me, this is a bit like the 'hacker' arguments we had, back in the day.
Depending on how old and stubborn you are, 'hacker' may mean a different thing to you. For a while in the media (and probably still today, in part), people considered hacking to be the act of breaking into machines illegally. In the industry now, however, the sense is that a hacker is someone who enjoys tinkering with things.
For a while the security community wasn't sure if this would take off, and we actually tried to use 'cracker' to refer to the bad guys. I don't think cracker has really taken off like we'd like, but we have reclaimed 'hacker' as a legitimate term, to quite a reasonable degree of success.
So to me this is the same: just because the media has tried to consider a KB as 1,000, I will never back down, and always stand up for the rights of the remaining 24 bits.
24bFL
Drivemaker/denary Kilobytes can burn in hell. Binary units for binary machines.

Will random data appended to a JPG make it unusable?

So, to simplify my life, I want to be able to append from 1 to 7 additional bytes to the end of some JPG images my program is processing*. These are dummy padding (fillers, etc. - probably all 0x00), just to make the file size a multiple of 8 bytes for block encryption.
Having tried this out with a few programs, it appears they are fine with the additional bytes, which occur after the FF D9 marker that specifies the end of the image - so it appears that the file format is well-defined enough that the 'corruption' I'm adding at the end shouldn't matter.
I can always post process the files later if needed, but my preference is to do the simplest thing possible - which is to let them remain (I'm decrypting other file types and they won't mind, so having a special case is annoying).
I figure with all the steganography hullabaloo years ago, someone has some input here...
*(Encryption processes 8-byte blocks; I don't want to store the pre-encryption file size, so I append 0x00 bytes to the input data and leave them there after decryption.)
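A sketch of the padding step described above (the class and method names are mine; Arrays.copyOf zero-fills the new tail, which matches the 0x00 filler):

import java.util.Arrays;

class BlockPadding {
    // Pad to the next multiple of blockSize by appending 0x00 bytes.
    static byte[] padToBlock(byte[] data, int blockSize) {
        int remainder = data.length % blockSize;
        if (remainder == 0) return data;                                   // already aligned
        return Arrays.copyOf(data, data.length + (blockSize - remainder)); // copyOf zero-fills
    }
}
// e.g. BlockPadding.padToBlock(jpegBytes, 8) before encrypting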
No, you can add bytes to the end of a JPG file without making it unusable. The header of the JPG file tells the reader how to parse it, so a program reading it will stop at the end of the JPG data.
In fact, people have hidden zip files inside jpg files by appending the zip data to the end of the jpg data. Because of the way these formats are structured, the resulting file is valid in either format.
You can .. but the results may be unpredictable.
Even though there is enough information in the format to tell the client to ignore the extra data it is likely not a case the programmer tested for.
A paranoid program might look at the size, notice the discrepancy and decide it won't process your file because clearly it doesn't fully understand it. This is particularly likely when reading data from the web when random bytes in a file could be considered a security risk.
You can embed your data in the XMP tag within a JPEG (or EXIF or IPTC fields for that matter).
XMP is XML, so you have a fair bit of flexibility there to do your own custom stuff.
It's probably not the simplest thing possible but putting your data here will maintain the integrity of the JPEG and require no "post processing".
Your data will then show up in other imaging software such as Photoshop, which may not be ideal.
As others have stated, you have no control how programs process image files and therefore some programs may find the images valid others may not.
However, there is a bigger issue here. Judging by your question, I'm deducing you're practicing "security through obscurity." It's widely considered a very bad practice. Use Google to find a plethora of articles about the topic.

Resources