Why are there different STRING formats? - string

Some time ago I developed a script to query network interfaces via snmpwalk and IF-MIB::ifDescr, and the output format was like STRING: eth0.
The OS was SLES11 using net-snmp (it still works in SLES12 using net-snmp-5.7.3).
However on a different OS (still Linux) the interface strings are represented as STRING: "port1" (note the surrounding double-quotes).
Now the question is: who is responsible for the extra double quotes? A different version of net-snmp, or a different SNMP agent? Or is one of the agents implementing the result incorrectly?
As far as I understand SNMP the double quotes are not necessary for the protocol as strings are always transmitted with their length.

While it would be unusual (and undesirable) for an SNMP agent to return a quoted string in response to a query for ifDescr (or for anything else!), since quotes are indeed not part of the contract of a string at that level, the SNMP world is rife with oddities, variations and specification deviations, so this is not something you can assume will never happen.
Meanwhile, the format of the command-line output of a tool like Net-SNMP is effectively arbitrary: the developers can choose whether or not to quote strings, and as long as they document their choice, the end result is the same. So you cannot make any assumptions here either.
You should examine the actual data. You can do this by sniffing the SNMP packets with a tool like tcpdump and loading them into a UI like Wireshark (previously Ethereal). Then you can observe the actual contents of the datagram without the Net-SNMP formatting. If it contains quotes, it's the agent's fault; if it doesn't, the app is adding them for display.
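To illustrate what you would be looking at: on the wire, an SNMP OCTET STRING is just a BER type/length/value triple, with no quote characters anywhere. A minimal sketch of decoding one (assuming a short-form, single-byte length, which covers interface names):

// Hedged sketch: decode a BER OCTET STRING as it appears inside an SNMP response.
public class OctetStringDemo {
    static String decodeOctetString(byte[] tlv) {
        if (tlv[0] != 0x04) {                                      // 0x04 = OCTET STRING tag
            throw new IllegalArgumentException("not an OCTET STRING");
        }
        int length = tlv[1] & 0xFF;                                // single length octet, then the raw bytes
        return new String(tlv, 2, length, java.nio.charset.StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] wire = {0x04, 0x04, 'e', 't', 'h', '0'};            // tag, length, then 'e' 't' 'h' '0' -- no quotes
        System.out.println(decodeOctetString(wire));               // prints: eth0
    }
}

If the value bytes you see in Wireshark contain 0x22 (the ASCII double quote), the agent put it there; otherwise the quoting is purely a display decision.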
(There's probably a Net-SNMP flag to make it display, in hex form, the bytes making up the string, which would be an easier way to gather this evidence if I remembered what the flag was.)
As an editorial note, if you'd told us what the "different" Linux OS actually was, and what version of Net-SNMP you were using on it, we could have confirmed (or ruled out) option two for you.
(For what it's worth, I'm not aware of any Net-SNMP change that added or removed quotation marks from the command-line output, so this is probably an oddity of the agent on that "different" system.)

Related

Delphi/Windows and Linux/Lazarus sharing characters above #127

I am maintaining a project where data has to be shared between Windows and Linux machines.
The program was developed in Delphi (Windows) in 2003, so there are a lot of legacy data files that must (at least probably) be readable by both systems in the future.
I have ported the program to Lazarus and it runs on Linux quite well.
But the data (in a proprietary format) stores strings as general 8-bit characters from #0-#255. Reading the data on a Linux machine leads to a lot of '?' symbols instead of 'ñ, ä, ö, ü, ß ...' etc.
What I tried to solve the problem:
1.) I read the data on a Windows machine, as usual.
2.) I saved the data with a modified version that encodes all strings with URLEncode() on saving.
3.) I also modified the routine reading the data to use URLDecode().
4.) I saved the data with the modified version.
5.) I compiled the modified version on Linux and copied the data from the Windows machine.
6.) I opened the data in question ... and got question marks (?) instead of 'ñ, ä, ö, ü, ß ...' etc.
Well, the actual question is: how can I share the data maintained by both systems while preserving those characters when the data is edited (on either side)?
Thanks in advance
8-bit ANSI values between 128 and 255 are charset-specific. Whatever charset is used to save the data on Windows (assuming you are relying on Windows' default encoding, which depends on the user's locale), you have to use that same charset when loading the data on Linux, and vice versa. There are dozens, if not hundreds, of charsets in use in the world, which makes portability of ANSI data difficult. This is exactly the kind of problem that Unicode was designed to address. You are best off saving your data in a portable charset, such as UTF-8, and then performing conversions to/from the system charset when loading/saving the data.
Consider using UTF-8 for all your text storage.
Or, if you are sure that your data will always have the same code page, you can use conversion from the original Windows code page to UTF-8, which is the default Linux/Lazarus encoding.
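If you go the conversion route, the one-off migration is small. A rough sketch, expressed in Java for illustration (it assumes the legacy files were written with the Windows-1252 "ANSI" code page; substitute whatever code page your users' locale actually used, and note the file names here are only placeholders):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class AnsiToUtf8 {
    public static void main(String[] args) throws Exception {
        byte[] legacy = Files.readAllBytes(Paths.get("data-windows.txt"));   // old 8-bit file
        String text = new String(legacy, Charset.forName("windows-1252"));   // interpret the ANSI bytes
        Files.write(Paths.get("data-utf8.txt"), text.getBytes(StandardCharsets.UTF_8));
    }
}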
You had better not rely on any proprietary binary layout for your application file format if you want it to be cross-platform. You have just discovered the character-encoding problem, but there are potentially other issues, like binary endianness. SQLite3 is a very good application file format: it is fast, reliable, cross-platform, stable and atomic.
Note that Lazarus always expects UTF-8 strings for the GUI, so even on Windows this probably wouldn't work without proper UTF-8 sanitation.

Printing in color from printf(9)

Is it possible to print to console in color from the kernel's version of printf? Can I see the same escape codes that I can in userland? Does the kernel understand the console well enough to be able to provide termcap style APIs and constants for specific color? If so, which header are they defined in?
You can certainly print arbitrary escape sequences from the kernel; it will happily put whatever bytes you give it out on a terminal. Whether those bytes will be interpreted as color is something the kernel, generally speaking, has no idea about.
So, it is possible to print them, and you will see the same escape codes once you read the kernel messages (i.e. if the kernel prints XTERM-style colors and you happen to look at them over a serial port with a terminal program that either is XTERM or emulates XTERM escape sequences itself).
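As a userland illustration of that point: the colour lives entirely in the bytes plus whatever interprets them, not in the printing API. For example (assuming an ANSI/XTERM-compatible terminal on the other end):

public class ColorDemo {
    public static void main(String[] args) {
        String red = "\u001B[31m";    // ANSI/XTERM "set foreground to red"
        String reset = "\u001B[0m";   // back to default attributes
        // The program only emits bytes; whether they render as colour is up to the terminal.
        System.out.println(red + "red only if the terminal interprets the escape" + reset);
    }
}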
As for whether the kernel knows much about your terminal type and is able to use termcap info, the answer is, in general, no.
In userland, the terminal type is a matter of convention. The login script tries to figure out what kind of terminal you may be connected to and then sets TERM to the appropriate type in the shell's environment. Forked processes inherit it and use the type to figure out how to do certain things on that specific terminal. Usually this involves some sort of curses library.
The kernel, on the other hand, is a fairly minimalistic beast that does not really give much of a damn what's on the other end of whatever happens to be its console -- a serial port, FireWire or a video card. For all practical purposes, the console may not even be connected to anything at all.
Practically, you will need to solve two problems:
Have a way to configure the terminal type for the particular TTY device you want to use.
Provide the kernel with some sort of termcap/terminfo data for that terminal type and an API to produce the appropriate escape sequences for output on a specific TTY. In other words, an in-kernel curses library.

Different UTF-8 signature for same diacritics (umlauts) - 2 binary ways to write umlauts

I have quite a big problem for which I can't find any help on the web:
I moved a page of a website from OS X to Linux (both systems are running de_DE.UTF-8) and ran into a rather obscure problem:
Some of the files were not found anymore, but obviously existed on the hard drive with (visibly) the same name. All those files contained German umlauts.
I took one sample image, copied the original request URI from the web page and called it directly - same error. After rewriting the file name it worked. And yes, I did not mistype it!
This surprised me, so I took a look into the Apache log, where I found these entries:
192.168.56.10 - - [27/Aug/2012:20:03:21 +0200] "GET /images/Sch%C3%B6ne-Lau-150x150.jpg HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.1"
192.168.56.10 - - [27/Aug/2012:20:03:57 +0200] "GET /images/Scho%CC%88ne-Lau-150x150.jpg HTTP/1.1" 404 4205 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.1"
That was something for me to investigate ... Here's what I found in the UTF-8 character table at http://www.utf8-chartable.de/:
ö c3 b6 LATIN SMALL LETTER O WITH DIAERESIS
¨ cc 88 COMBINING DIAERESIS
I think you've already heard of dead-keys: http://en.wikipedia.org/wiki/Dead_key If not, read the article. It's quite interesting ;)
Does that mean that OS X saves all diacritics separately from the letter? Does that really mean that OS X saves the character ö as o and ¨ instead of using the real character that results from the combination?
If yes, do you know of a good script that I could use to rename these files? This won't be the first page I move from OSX to Linux ...
It's not quite the same thing as dead keys, but it's related. As you've worked out, U+00F6 and U+006F followed by U+0308 have the same visual result.
There are in fact Unicode rules for knowing when to treat them as the same, based on decompositions. There's a decomposition table in the character database that tells us that U+00F6 canonically decomposes to U+006F followed by U+0308.
As well as canonical decomposition, there are compatibility decompositions. These lose some information, for example ² ends up being decomposed to 2. This is clearly a destructive change, but it is useful for searching when you want to be a bit fuzzy (it is how Google knows a search for fiſh should return results about fish).
If there is more than one combining character after a non-combining character, then we can re-order them as long as we don't re-order those of the same class. This becomes clear when we consider that it doesn't matter whether we put a cedilla on something and then an acute accent, or the acute and then the cedilla, but if we put both an acute and an umlaut on a letter it clearly matters which way around they go.
From this, we have 4 normalisation forms. Put strings into an appropriate normalisation form before doing comparisons, and you don't get tripped up.
NFD: Break everything apart by canonically decomposing it as much as possible. Reorder combining characters in order of their combining class, but keep any with the same class in the same order relative to each other.
NFC: First put everything into NFD. Then look at the combining characters in order; if one isn't blocked by an earlier combining character of the same class and there is an equivalent precomposed single character, replace the pair with it, and re-do the scan looking to compose further.
NFKD: Like NFD, but using compatibility decomposition (damaging change, but useful for comparisons as explained above).
NFKC: Do NFKD, then re-combine (canonical compositions only) as per NFC.
There are also some re-combinations banned from use in NFC so that text that was valid NFC in one version of Unicode doesn't cease to be NFC if Unicode has more characters added to it.
Of NFD and NFC, NFC is clearly the more concise. It's not the most concise possible, but it is one that is very concise and can be tested for and/or created in a very efficient streaming manner.
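Most runtimes expose these forms directly. A quick sketch using Java's java.text.Normalizer, showing the two spellings from the Apache log above collapsing to the same string:

import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        String composed = "Sch\u00F6ne";       // ö as a single precomposed code point (NFC style)
        String decomposed = "Scho\u0308ne";    // o + COMBINING DIAERESIS (NFD style, as OS X stores it)

        System.out.println(composed.equals(decomposed));                                             // false: different code points
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC).equals(composed));  // true
        System.out.println(Normalizer.normalize(composed, Normalizer.Form.NFD).equals(decomposed));  // true
    }
}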
Mac OSX uses NFD for file names. Because they're weirdos. (Okay, there are better arguments than that, they just didn't convince me!)
The Web Character Model uses NFC.* As such, you should use NFC on web stuff as much as possible. There can, though, be security considerations in blindly converting stuff to NFC. But if it starts with you, it should start in NFC.
Any programming language that deals with text should have a nice way of normalising text into any of these forms. If yours doesn't, complain (or if yours is open source, contribute!).
See http://unicode.org/faq/normalization.html for more, or http://unicode.org/reports/tr15/ for the full gory details.
*For extra fun, if you inserted something beginning with a combining long solidus overlay (U+0338) at the start of an XML or HTML element's content, it would turn the > of the tag into ≯, turning well-formed XML into gibberish. For this reason the web character model insists that each entity must itself be NFC and not start with a combining character.
Thanks, Jon Hanna, for a lot of background information here! It was important for getting to the full answer: a way to convert from one normalisation form to the other.
As my changes are in the filesystem (because of file uploads), which is linked from the database, I now have to update my database dump. The files were already renamed during the move (maybe by the FTP client ...).
Command line tools to convert charsets on Linux are:
iconv - converting the content of a stream (maybe a file)
convmv - converting the filenames in a directory
The charset utf-8-mac (as described at http://loopkid.net/articles/2011/03/19/groking-hfs-character-encoding), which I could use in iconv, seems to exist only on OS X systems, so I would have to move my SQL dump to my Mac, convert it and move it back. Another option would be to rename the files back to NFD using convmv, but I think this would hinder more than help in the future.
The tool convmv has a built-in (OS-independent) option for enforcing NFC- or NFD-compatible filenames: http://www.j3e.de/linux/convmv/man/
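If you would rather script the rename yourself instead of using convmv, the same normalisation call can drive it. A rough sketch (the root path is only a placeholder; try it on a copy of the tree first):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.text.Normalizer;
import java.util.stream.Stream;

public class RenameToNfc {
    public static void main(String[] args) throws IOException {
        Path root = Paths.get("/var/www/images");                  // placeholder directory
        try (Stream<Path> paths = Files.walk(root)) {
            // Rename the deepest entries first so parent directories keep their old names until their children are done.
            paths.sorted((a, b) -> b.getNameCount() - a.getNameCount())
                 .forEach(RenameToNfc::renameIfDecomposed);
        }
    }

    static void renameIfDecomposed(Path p) {
        String name = p.getFileName().toString();
        String nfc = Normalizer.normalize(name, Normalizer.Form.NFC);
        if (!nfc.equals(name)) {
            try {
                Files.move(p, p.resolveSibling(nfc));               // rename in place
            } catch (IOException e) {
                System.err.println("could not rename " + p + ": " + e.getMessage());
            }
        }
    }
}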
PHP itself (the language my system, WordPress, is based on) supports a compatibility layer here:
In PHP, how do I deal with the difference in encoded filenames on HFS+ vs. elsewhere? After I have fixed this issue for myself, I will write some tests and may also file a bug report with WordPress and the other systems I work with ;)
Linux distros treat filenames as binary strings, meaning no encoding is assumed - though the graphical shell (Gnome, KDE, etc) might make some assumptions based on environment variables, locale, etc.
OS X, on the other hand, requires or enforces (I forget which) its own version of UTF-8 with Unicode normalization that expands all diacritics into combining characters.
On Linux, when people do use Unicode in filenames, they tend to prefer UTF-8 with precomposed characters when it comes to diacritics.

Tool to discard packet payload?

I'm trying to anonymize packets from a pcap file that I have. I need to discard all the packet payloads/content (leaving only header information) and was wondering if there is a tool I could use for this (on Linux)? I have thought about using tcpdump and specifying the snaplen, but with the header length changing, I don't think that would work.
If there isn't a tool that can accomplish this, a pointer toward which library would be best (easiest) to code against would work as well. I'd rather not take that route, since I have virtually no experience in network programming.
Any help is much appreciated.
You don't need any network programming experience to anonymize the packets. The format of the output file is well documented in the pcap-savefile(5) manpage. You will need to look up the layouts of the various protocols you'll be handling in order to identify which fields need to be anonymized. You should also look at the link-layer header types documentation at tcpdump.org to help you get started.
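To give a feel for how little is involved, here is a rough sketch that truncates every record in a classic pcap savefile down to the Ethernet header, the IPv4 header and at most 20 bytes of transport header. It assumes a little-endian savefile carrying plain IPv4 over Ethernet (no VLAN tags), it does not rewrite any addresses, and per the manpage note quoted below you would normally prefer libpcap for anything serious:

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class StripPayloads {
    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream("in.pcap"));
             DataOutputStream out = new DataOutputStream(new FileOutputStream("headers-only.pcap"))) {

            byte[] global = new byte[24];                      // pcap global header: magic, version, snaplen, linktype
            in.readFully(global);
            out.write(global);

            byte[] rec = new byte[16];                         // per-record header: ts_sec, ts_usec, incl_len, orig_len
            while (true) {
                try {
                    in.readFully(rec);
                } catch (EOFException end) {
                    break;
                }
                ByteBuffer hdr = ByteBuffer.wrap(rec).order(ByteOrder.LITTLE_ENDIAN);
                int tsSec = hdr.getInt(), tsUsec = hdr.getInt();
                int inclLen = hdr.getInt(), origLen = hdr.getInt();

                byte[] pkt = new byte[inclLen];
                in.readFully(pkt);

                int ihl = (pkt[14] & 0x0F) * 4;                // IPv4 header length, right after the 14-byte Ethernet header
                int keep = Math.min(inclLen, 14 + ihl + 20);   // crude cut-off: at most 20 transport-header bytes

                ByteBuffer newHdr = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
                newHdr.putInt(tsSec).putInt(tsUsec).putInt(keep).putInt(origLen);
                out.write(newHdr.array());
                out.write(pkt, 0, keep);
            }
        }
    }
}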
EDIT: Also look at libpcap itself. According to the pcap-savefile manpage:
NOTE: applications and libraries should, if possible, use libpcap to read savefiles, rather than having their own code to read savefiles. If, in the future, a new file format is supported by libpcap, applications and libraries using libpcap to read savefiles will be able to read the new format of savefiles, but applications and libraries using their own code to read savefiles will have to be changed to support the new file format.

Protocol buffers logging

In our business, we are required to log every request/response that comes to our server.
At the moment, we are using XML as the standard implementation.
Log files are used when we need to debug/trace an error.
I am kind of curious: if we switch to protocol buffers, since it is binary, what would be the best way to log requests/responses to a file?
For example:
FileOutputStream output = new FileOutputStream("\\files\\log.txt");
request.build().writeTo(output);
For anyone who has used protocol buffers in their application: how do you log your requests/responses, in case you need them for debugging purposes?
TL;DR: write debugging logs in text, write long-term logs in binary.
There are at least two ways you can do this logging (and maybe, in fact, you should do both):
Writing your logs in text format. This is good for debugging and quickly checking for problems with your eyes.
Writing your logs in binary format - this will make future analysis much quicker, since you can load the data using the same protocol buffers code and do all kinds of things with it.
Quite honestly, this is more or less the way this is done at the place this technology came from.
We use the ShortDebugString() method on the C++ object to write down a human-readable version of all incoming and outgoing messages to a text-file. ShortDebugString() returns a one-line version of the same string returned by the toString() method in Java. Not sure how easy it is to accomplish the same thing in Java.
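For the Java side, a similar one-line text form appears to be available as a static helper on TextFormat (treat the exact call as an assumption and check it against your protobuf-java version):

import com.google.protobuf.MessageOrBuilder;
import com.google.protobuf.TextFormat;

public final class TextLog {
    // One human-readable line per message, analogous to the C++ ShortDebugString().
    public static String oneLine(MessageOrBuilder message) {
        return TextFormat.shortDebugString(message);
    }
}

So the request from the question could be logged as TextLog.oneLine(request) before (or instead of) writing the binary form.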
If you have competing needs for logging and performance then I suppose you could dump your binary data to the file as-is, with perhaps each record preceded by a tag containing a timestamp and a length value so you'll know where this particular bit of data ends. But I hasten to admit this is very ugly. You will need to write a utility to read and analyze this file, and will be helpless without that utility.
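For the length-prefix part, the Java protobuf API already provides a delimited framing that writes a varint length before each message, so a sketch of such a binary log (a hypothetical wrapper; adding a timestamp tag as described above would still be up to you) could look like this:

import com.google.protobuf.Message;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public final class BinaryLog implements AutoCloseable {
    private final OutputStream out;

    public BinaryLog(String path) throws IOException {
        this.out = new FileOutputStream(path, true);    // append to the log file
    }

    public synchronized void append(Message message) throws IOException {
        message.writeDelimitedTo(out);                  // varint length prefix + raw message bytes
        out.flush();
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}

Reading the records back later is the mirror image: the generated message class has a parseDelimitedFrom(InputStream) method that consumes one length-prefixed record at a time.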
A more reasonable solution would be to dump your binary data in text form. I'm thinking of "lines" of text, again starting with whatever tagging information you find relevant, followed by some length information in decimal or hex, followed by as many hex bytes as needed to dump your buffer - thus you could end up with some fairly long lines. But since the file is line structured, you can use text-oriented tools (an editor in the simplest case) to work with it. Hex dumping essentially means you are using two bytes in the log to represent one byte of data (plus a bit of overhead). Heh, disk space is cheap these days.
If those binary buffers have a fairly consistent structure, you could even break out and label fields (or something like that) so your data becomes a little more human readable and, more importantly, better searchable. Of course it's up to you how much effort you want to sink into making your log records look pretty; but the time spent here may well pay off a little later in analysis.
If you have non-ASCII character strings in your messages, simply logging them via an implicit or explicit call to toString() will escape the characters.
"오늘은 무슨 요일입니까?" becomes "\354\230\244\353\212\230\354\235\200 \353\254\264\354\212\250 \354\232\224\354\235\274\354\236\205\353\213\210\352\271\214?"
If you want to retain the non-ASCII characters, use TextFormat.printer().escapingNonAscii(false).printToString(message).
See this answer for more details.
