According to the documentation, ID3 tags have an unsynchronization flag. As I understand it, it should only be applied to ID3 frames (not the header or footer).
But how exactly should I process the frames before parsing (for reading, not writing)? Should I just replace all '11111111 111xxxxx' sequences with '11111111 00000000 111xxxxx'?
No, that's what you do when WRITING the tag (and don't forget, in this case you also need to replace any "0xff,0x00" with "0xff,0x00,0x00", as stated in the spec).
When you are READING the tag, you can simply replace any "0xff,0x00" sequence with "0xff". It's easiest to do this at once while you're reading the file, by keeping track of the last byte read and discarding any single 0x00 byte which follows an 0xff.
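Here's a minimal sketch of that read-side step in Java (the class and method names are just illustrative, not from any library):

import java.io.ByteArrayOutputStream;

public class Unsync {
    // Remove ID3v2 unsynchronisation from a tag body read from disk:
    // every 0xFF 0x00 pair collapses back to a single 0xFF.
    static byte[] removeUnsynchronisation(byte[] tagBody) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(tagBody.length);
        boolean lastWasFF = false;
        for (byte b : tagBody) {
            if (lastWasFF && b == 0x00) {
                lastWasFF = false; // discard the single inserted 0x00
                continue;
            }
            out.write(b);
            lastWasFF = (b == (byte) 0xFF);
        }
        return out.toByteArray();
    }
}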
It's not easy to figure this out, because the spec only describes the unsynchronisation to apply when you're writing the tag, not what to do when reading it.
The basic question is: how does Notepad (or any other basic text editor) store data? I ran into this because I was trying to compare the file sizes produced by different compression techniques, and realized something wasn't quite right.
To elaborate:
If I save a text file with the following contents:
a
The file is 1 byte. This character happens to be 97, or 0x61.
I create a text file with the following contents:
!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
Which is all the characters from 0-255, or 0x00 to 0xFF.
This file is 256 bytes. 1 byte for each character. This makes sense to me.
Then I append the following character to the end of the above string.
†
A character not contained in the above string. All 8-bit characters were already used. This character is 8224, or 0x2020: a 2-byte character.
And yet, the file size has only changed from 256 to 257 bytes. In fact, the above character saved by itself only shows 1 byte.
What am I missing?
Edit: Please note that in the second text block, many of the characters do not show up here.
In ANSI encoding (an 8-bit, Microsoft-specific encoding), each character is saved in one byte (8 bits).
ANSI is also called Windows-1252, or Windows Latin-1.
You should have a look at the ANSI table in the ANSI Character Codes Chart or Windows-1252.
So for the † character, its code is 134, byte 0x86.
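You can verify this with a short Java sketch (nothing assumed beyond the standard JDK, which ships the windows-1252 charset):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String dagger = "\u2020"; // the † character, code point 8224

        byte[] ansi = dagger.getBytes(Charset.forName("windows-1252"));
        byte[] utf8 = dagger.getBytes(StandardCharsets.UTF_8);

        // windows-1252 maps † to the single byte 0x86.
        System.out.printf("windows-1252: %d byte(s), value 0x%02X%n", ansi.length, ansi[0]);
        // UTF-8 needs three bytes for the same character: 0xE2 0x80 0xA0.
        System.out.printf("UTF-8: %d byte(s)%n", utf8.length);
    }
}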
Using one byte to encode a character only makes sense on the surface. It works okay if you speak English; it is a fair disaster if you speak Chinese or Japanese. Unicode today has definitions for 110,187 typographic symbols, with room to grow up to 1.1 million. A byte is not a good way to store a Unicode symbol, since it can encode only 256 distinct values.
Accordingly, text editors must always encode text when they store it to a file. Encoding is required to map 110,187 values onto a byte-oriented storage medium. Inevitably that takes more than 1 byte per character if you speak Chinese.
There have been lots and lots of encoding schemes in common use. Popular in the previous century were code pages: language-specific mappings that try as hard as they can to need only 1 byte of storage per character, by picking the 256 characters most likely to be needed in that language. Japanese, Korean and Chinese used a multi-byte mapping because they had to; other languages got by with 1 byte.
Code pages have been an enormous disaster: a program cannot properly read a text file that was encoded in another language's code page. That worked while text files stayed close to the machine that created them; the Internet in particular broke that usage. Japanese was particularly prone to this disaster, since it had more than one code page in common use. The result is called mojibake: the user sees gibberish in the text editor. Unicode came around in 1992 to try to solve this disaster. One new standard to replace all the other ones tends to invoke another kind of disaster, though.
You are subjected to that kind of disaster, particularly if you use Notepad, a program that tries to be compatible with text files created in the past 30 years. Google "bush hid the facts" for a hilarious story about that. Note the dialog you get when you use File > Save As: it has an extra combobox titled "Encoding". The default is ANSI, a broken name from the previous century that really means "code page". As you found out, that character indeed needed only 1 byte in your machine's default code page. Which code page that is depends on where you live; it is 1252 in Western Europe and the Americas. You'd see 0x86 if you looked at the file with a hex viewer.
Given that the dialog gives you a choice, you should not favor ANSI's mojibake anymore; always favor UTF-8 instead. Maybe they'll update Notepad some day so it uses a better default, but that is very hard to do.
How can I read the pixel values of a JPEG image using C/C++, without using any library?
I read about how compression takes place in JPEG in my course; I want the header information.
For the syntax of the file format you can check Wikipedia.
Each segment has its own marker. The variable-length segments have a two-byte field for their length. So far this is not really a problem, as you are able to extract all segments using this information (or at least it seems so at first glance).
The more problematic part is to actually do something useful with the data inside the segments. The Wikipedia page provides information on this topic, but it will require quite some mathematical knowledge to actually decode and grab the pixels.
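As a starting point, here is a rough sketch of walking the segment markers (in Java for brevity, but the marker logic translates directly to C/C++; it is simplified and stops at the start-of-scan segment, because the entropy-coded data after it follows different rules, and actually decoding pixels is a much bigger job):

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class JpegSegments {
    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
            if (in.readUnsignedShort() != 0xFFD8) { // SOI marker
                throw new IOException("not a JPEG file");
            }
            while (true) {
                int marker = in.readUnsignedShort(); // 0xFFxx
                if (marker == 0xFFD9) break;         // EOI: end of image
                int length = in.readUnsignedShort(); // includes these 2 bytes
                System.out.printf("marker 0x%04X, length %d%n", marker, length);
                if (marker == 0xFFDA) break;         // SOS: compressed data follows
                in.skipBytes(length - 2);            // skip this segment's payload
            }
        }
    }
}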
Finally found some really helpful links:
link 1
link 2
Thanks for the help and support.
In our business, we are required to log every request/response coming to our server.
At the moment, we are using XML as the standard implementation.
Log files are used if we need to debug/trace some error.
I am kind of curious: if we switch to protocol buffers, since it is binary, what will be the best way to log the requests/responses to a file?
For example:
FileOutputStream output = new FileOutputStream("\\files\\log.txt");
request.build().writeTo(output);
For anyone who has used protocol buffers in their application: how do you log your requests/responses, just in case you need them for debugging purposes?
TL;DR: write debugging logs in text, write long-term logs in binary.
There are at least two ways you can do this logging (and maybe, in fact, you should do both):
Writing your logs in text format. This is good for debugging and quickly checking for problems with your eyes.
Writing your logs in binary format - this will make future analysis much quicker, since you can load the data using the same protocol buffers code and do all kinds of things with them.
Quite honestly, this is more or less the way this is done at the place this technology came from.
We use the ShortDebugString() method on the C++ object to write down a human-readable version of all incoming and outgoing messages to a text-file. ShortDebugString() returns a one-line version of the same string returned by the toString() method in Java. Not sure how easy it is to accomplish the same thing in Java.
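That said, protobuf-java does ship TextFormat.shortDebugString(), which looks like the counterpart; a minimal sketch (TextLogger and its methods are illustrative names, not a library class):

import com.google.protobuf.Message;
import com.google.protobuf.TextFormat;

import java.io.IOException;
import java.io.Writer;

// Append a one-line, human-readable form of each message to a text log.
public final class TextLogger {
    private final Writer out;

    public TextLogger(Writer out) {
        this.out = out;
    }

    public void log(String direction, Message message) throws IOException {
        out.write(System.currentTimeMillis() + " " + direction + " "
                + TextFormat.shortDebugString(message) + "\n");
        out.flush();
    }
}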
If you have competing needs for logging and performance then I suppose you could dump your binary data to the file as-is, with perhaps each record preceded by a tag containing a timestamp and a length value so you'll know where this particular bit of data ends. But I hasten to admit this is very ugly. You will need to write a utility to read and analyze this file, and will be helpless without that utility.
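For the length part at least, protobuf's Java API already gives you a framing: writeDelimitedTo() prefixes each message with its varint-encoded size, so records can be read back one at a time with the generated type's parseDelimitedFrom(). A sketch (the timestamp tag mentioned above is left out; the class name is illustrative):

import com.google.protobuf.Message;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Append length-delimited binary records to a log file.
public final class BinaryLog implements AutoCloseable {
    private final OutputStream out;

    public BinaryLog(String path) throws IOException {
        this.out = new FileOutputStream(path, true); // append mode
    }

    public void log(Message message) throws IOException {
        message.writeDelimitedTo(out); // varint size prefix, then the bytes
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}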
A more reasonable solution would be to dump your binary data in text form. I'm thinking of "lines" of text, again starting with whatever tagging information you find relevant, followed by some length information in decimal or hex, followed by as many hex bytes as needed to dump your buffer - thus you could end up with some fairly long lines. But since the file is line structured, you can use text-oriented tools (an editor in the simplest case) to work with it. Hex dumping essentially means you are using two bytes in the log to represent one byte of data (plus a bit of overhead). Heh, disk space is cheap these days.
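A sketch of that kind of line format (purely illustrative):

import java.io.PrintWriter;

public class HexLines {
    // One log line per buffer: a tag, the decimal length, then hex bytes.
    // Two log characters per data byte, but the file stays greppable.
    static void logLine(PrintWriter out, String tag, byte[] buffer) {
        StringBuilder sb = new StringBuilder(tag).append(' ').append(buffer.length).append(' ');
        for (byte b : buffer) {
            sb.append(String.format("%02x", b));
        }
        out.println(sb.toString());
    }
}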
If those binary buffers have a fairly consistent structure, you could even break out and label fields (or something like that) so your data becomes a little more human readable and, more importantly, better searchable. Of course it's up to you how much effort you want to sink into making your log records look pretty; but the time spent here may well pay off a little later in analysis.
If you have non-ASCII character strings in your messages, simply logging them via an implicit or explicit call to toString() will escape the characters.
"오늘은 무슨 요일입니까?" becomes "\354\230\244\353\212\230\354\235\200 \353\254\264\354\212\250 \354\232\224\354\235\274\354\236\205\353\213\210\352\271\214?"
If you want to retain the non-ASCII characters, use TextFormat.printer().escapingNonAscii(false).printToString(message).
See this answer for more details.
So, to simplify my life I want to be able to append from 1 to 7 additional bytes to the end of some JPEG images my program is processing*. These are dummy padding (fillers, etc. - probably all 0x00), just to make the file size a multiple of 8 bytes for block encryption.
Having tried this out with a few programs, it appears they are fine with the additional characters, which occur after the FF D9 that specifies the end of the image - so it appears that the file format is well defined enough that the 'corruption' I'm adding at the end shouldn't matter.
I can always post process the files later if needed, but my preference is to do the simplest thing possible - which is to let them remain (I'm decrypting other file types and they won't mind, so having a special case is annoying).
I figure with all the steganography hullabaloo years ago, someone has some input here...
(*encryption processes 8-byte blocks; I don't want to store the pre-encryption file size, so I append 0x00 bytes to the input data and leave them there after decryption)
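In other words, roughly this (a sketch; the method name is made up):

import java.util.Arrays;

public class Pad {
    // Pad data with 0x00 bytes so its length is a multiple of 8, ready
    // for an 8-byte block cipher. After decryption the zero bytes simply
    // remain at the end of the file, after the JPEG's FF D9 marker.
    static byte[] padToBlock(byte[] data) {
        int remainder = data.length % 8;
        if (remainder == 0) return data;
        return Arrays.copyOf(data, data.length + (8 - remainder)); // copyOf zero-fills
    }
}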
No, you can add bytes to the end of a JPEG file without making it unusable. The header of the JPEG file tells how to read it, so the program reading it will stop at the end of the JPEG data.
In fact, people have hidden zip files inside jpg files by appending the zip data to the end of the jpg data. Because of the way these formats are structured, the resulting file is valid in either format.
You can... but the results may be unpredictable.
Even though there is enough information in the format to tell the client to ignore the extra data, it is likely not a case the programmer tested for.
A paranoid program might look at the size, notice the discrepancy and decide it won't process your file because clearly it doesn't fully understand it. This is particularly likely when reading data from the web when random bytes in a file could be considered a security risk.
You can embed your data in the XMP tag within a JPEG (or EXIF or IPTC fields for that matter).
XMP is XML, so you have a fair bit of flexibility there to do your own custom stuff.
It's probably not the simplest thing possible but putting your data here will maintain the integrity of the JPEG and require no "post processing".
Your data will then show up in other imaging software such as Photoshop, which may not be ideal.
As others have stated, you have no control over how programs process image files, and therefore some programs may find the images valid while others may not.
However, there is a bigger issue here. Judging by your question, I'm deducing you're practicing "security through obscurity." It's widely considered a very bad practice. Use Google to find a plethora of articles about the topic.
I've got many, many mp3 files that I would like to merge into a single file. I've used the command line method
copy /b 1.mp3+2.mp3 3.mp3
but it's a pain when there are a lot of them and their names are inconsistent. The playing time never seems to come out right either.
David's answer is correct that just concatenating the files will leave ID3 tags scattered inside (although this doesn't normally affect playback, so you can do "copy /b" or on UNIX "cat a.mp3 b.mp3 > combined.mp3" in a pinch).
However, mp3wrap isn't exactly the right tool to just combine multiple MP3s into one "clean" file. Rather than using ID3, it actually inserts its own custom data format in amongst the MP3 frames (the "wrap" part), which causes issues with playback, particularly on iTunes and iPods. Although the file will play back fine if you just let it run from start to finish (because players will skip these as arbitrary non-MPEG bytes), the file duration and bitrate will be reported incorrectly, which breaks seeking. Also, mp3wrap will wipe out all your ID3 metadata, including cover art, and fail to update the VBR header with the correct file length.
mp3cat on its own will produce a good concatenated data file (so, better than mp3wrap), but it also strips ID3 tags and fails to update the VBR header with the correct length of the joined file.
Here's a good explanation of these issues, and of methods (two, actually) to combine MP3 files and produce a "clean" final result with the original metadata intact -- it's command-line, so it works on Mac/Linux/BSD etc. It uses:
mp3cat to combine the MPEG data frames only into a continuous file, then
id3cp to copy all metadata over to the combined file, and finally
VBRFix to update the VBR header.
For a Windows GUI tool, take a look at Merge MP3 -- it takes care of everything. (VBRFix also comes in GUI form, but it doesn't do the joining.)
As Thomas Owens pointed out, simply concatenating the files will leave multiple ID3 headers scattered throughout the resulting concatenated file - so the time/bitrate info will be wildly wrong.
You're going to need to use a tool which can combine the audio data for you.
mp3wrap would be ideal for this - it's designed to join together MP3 files, without needing to decode + re-encode the data (which would result in a loss of audio quality) and will also deal with the ID3 tags intelligently.
The resulting file can also be split back into its component parts using the mp3splt tool - mp3wrap adds information to the ID3 comment to allow this.
Use ffmpeg or a similar tool to convert all of your MP3s into a consistent format, e.g.
ffmpeg -i originalA.mp3 -f mp3 -ab 128k -ar 44100 -ac 2 intermediateA.mp3
ffmpeg -i originalB.mp3 -f mp3 -ab 128k -ar 44100 -ac 2 intermediateB.mp3
Then, at runtime, concatenate your files together:
cat intermediateA.mp3 intermediateB.mp3 > output.mp3
Finally, run them through the tool MP3Val to fix any stream errors without forcing a full re-encode:
mp3val output.mp3 -f -nb
The time problem has to do with the ID3 headers of the MP3 files, which is something your method isn't taking into account as the entire file is copied.
Do you have a language of choice that you want to use or doesn't it matter? That will affect what libraries are available that support the operations you want.
MP3 files have headers you need to respect.
You could either use a library like the Open Source Audio Library Project and write a tool around it.
Or you can use a tool that understands mp3 files like Audacity.
What I really wanted was a GUI to reorder them and output them as one file
Playlist Producer does exactly that, decoding and reencoding them into a combined MP3. It's designed for creating mix tapes or simple podcasts, but you might find it useful.
(Disclosure: I wrote the software, and I profit if you buy the Pro Edition. The Lite edition is a free version with a few limitations).
As David says, mp3wrap is the way to go. However, I found that it didn't fix the audio length header, so iTunes refused to play the whole file even though all the data was there. (I merged three 7-minute files, but it only saw up to the first 7 minutes.)
I dug up this blog post, which explains how to fix this and also how to copy the ID3 tags over from the original files (on its own, mp3wrap deletes your ID3 tags). Or to just copy the tags (using id3cp from id3lib), do:
id3cp original.mp3 new.mp3
I would use Winamp to do this. Create a playlist of the files you want to merge into one, select the Disk Writer output plugin, choose a filename, and you're done. The file you get will be a correct MP3 file, and you can set the bitrate etc.
I'd not heard of mp3wrap before. Looks great. I'm guessing someone's made it into a GUI as well somewhere. But, just to respond to the original post, I've written a GUI that does the COPY /b method. So, under the covers, nothing new under the sun, but the program is all about making the process less painful if you have a lot of files to merge... AND you don't want to re-encode AND each set of files to merge is the same bitrate. If that fits (and you're on Windows), check out Mp3Merge at http://www.leighweb.com/david/mp3merge and see if that's what you're looking for.
If you want something free with a simple user interface that makes a completely clean MP3, I recommend MP3 Joiner.
Features:
Strips ID3 data (both ID3v1 and ID3v2.x) and doesn't add its own (unlike mp3wrap)
Lossless joining (doesn't decode and re-encode the .mp3s). No codecs required.
Simple UI (see below)
Low memory usage (uses streams)
Very fast (compared to mp3wrap)
I wrote it :) - so you can request features and I'll add them.
Links:
MP3 Joiner website: Here
Latest installer: Here
Personally, I would use something like mplayer with the audio pass-through option, e.g. -oac copy
Instead of using the command line to do
copy /b 1.mp3+2.mp3 3.mp3
you could instead use "The Rename" to rename all the MP3 fragments into a series of names that are in order based on some kind of counter. Then you could just use the same command line format but change it a little to:
copy /b *.mp3 output_name.mp3
That is assuming you ripped all of these fragment MP3s at the same time and they have the same audio settings. It worked great for me when I was converting an audiobook I had in .aa format to a single .mp3. I had to burn all the .aa files to 9 CDs, then rip all 9 CDs, and then I was left with about 90 MP3s. Really a pain in the a55.