We've added a TXT record for DKIM validation (copy-pasted the DKIM string), but there seems to be a weird character in the record that:
doesn't appear at all in the DNS manager
doesn't appear at all in the DKIM Core validator
does appear as empty quotes in the mail-tester.com validator
does appear as whitespace between quotes in dig output on Linux
This character makes the DKIM invalid, so my questions are: What is it, why isn't it detected and how do I remove it?
DKIM Core:
mail-tester.com:
Dig output:
dkim._domainkey.example.com. 3600 IN TXT "v=DKIM1\; k=rsa\; g=*\; s=email\; h=sha1\; t=s\; p=MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDKCyTnwDTY7yp1Xd/ApOgq7rzfSB8N2s+cX0sHzpwAt/I60KGGLV/qq/Wx462PX7LiL9O9UngvjoH6VILDJAnS3xGVHkVXIC9lzPcgTREV56AisCfIXa9t6ZELvXDAHJY1YfghPOUlh0KnXzL37W2hwTj4J3tJt1iEeKNgYnEwxQ" "IDAQAB\;"
Quick answer
The DKIM specification (RFC 6376 Sec. 3.6.2.2) dictates that
Strings in a TXT RR MUST be concatenated together before use with no intervening whitespace.
In other words: The whitespace in between the strings does not have any relevance. The strings inside the quotes are simply processed as one.
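As a rough illustration of that rule, the sketch below takes the presentation form printed by dig, pulls out each quoted string and joins them with nothing in between. The function name and the shortened sample value are made up for this example; the sample is a placeholder, not a real key.

```python
import re

def join_txt_strings(presentation: str) -> str:
    """Join the quoted character strings of a TXT record as printed by dig."""
    # A chunk is everything between two double quotes; backslash escapes
    # such as \; or \" are kept together so an escaped quote does not end it.
    chunks = re.findall(r'"((?:[^"\\]|\\.)*)"', presentation)
    joined = "".join(chunks)  # no intervening whitespace
    # dig escapes some characters (for example ';') with a backslash.
    # Dropping the backslash before any non-digit restores the raw value.
    return re.sub(r'\\(\D)', r'\1', joined)

# Shortened, hypothetical record value in the same shape as the dig output.
sample = '"v=DKIM1\\; k=rsa\\; p=ABC" "DEF\\;"'
print(join_txt_strings(sample))
```

The space between the two quoted strings never reaches the verifier, because it only exists in the textual display of the record.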
Technical background
TXT records consist of what's called character strings. Each of these consists of up to 256 bytes, where the first byte ("octet") carries the length of the string (see RFC1035 Sec. 3.3). When displayed, character strings are typically bounded by quotes " on either side (RFC1035 Sec. 5.1). The quotes are not stored and don't count towards the length.
That means that if the value is no more than 255 bytes long (plus quotes), one character string suffices. If it is longer, the TXT record will contain multiple character strings (RFC1035 Sec. 3.3.14).
The interpretation of the multiple character strings depends on the specific scenario, and for DKIM it is specified as described above. So, what you are seeing is a technical artifact, and not an erroneous space.
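To make the length-octet layout concrete, here is a minimal sketch of how a long value could be packed into TXT RDATA and unpacked again. Splitting exactly at 255 bytes is an assumption made for simplicity; a DNS server or manager is free to split at other points, and the function names are invented for this example.

```python
from typing import List

def encode_txt_rdata(value: bytes) -> bytes:
    """Pack value into character strings: one length octet, then up to 255 bytes."""
    if not value:
        return b"\x00"  # a single empty character string
    out = bytearray()
    for start in range(0, len(value), 255):
        chunk = value[start:start + 255]
        out.append(len(chunk))  # length octet, not part of the text itself
        out += chunk
    return bytes(out)

def decode_txt_rdata(rdata: bytes) -> List[bytes]:
    """Unpack TXT RDATA into its individual character strings."""
    strings = []
    pos = 0
    while pos < len(rdata):
        length = rdata[pos]
        strings.append(rdata[pos + 1:pos + 1 + length])
        pos += 1 + length
    return strings

value = b"a" * 300                     # longer than one character string allows
parts = decode_txt_rdata(encode_txt_rdata(value))
print([len(p) for p in parts])         # lengths of the individual strings
print(b"".join(parts) == value)        # concatenation restores the value
```

A tool like dig prints each element of `parts` in its own pair of quotes, which is where the apparent gap comes from.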
Related
RFC 4566 is the controlling RFC for SDP syntax.
It states in Section 5 - SDP Specification that:
An SDP session description consists of a number of lines of text of
the form:
<type>=<value>
where <type> MUST be exactly one case-significant character and
<value> is structured text whose format depends on <type>. In
general, <value> is either a number of fields delimited by a single
space character or a free format string, and is case-significant
unless a specific field defines otherwise. Whitespace MUST NOT be
used on either side of the "=" sign.
However, nowhere is it clear whether there can be whitespace before the case-significant character.
Section 9, which provides the BNF grammar, is also ambiguous on this issue. All SDP entries I have seen start their lines in the first column, but the question is whether whitespace is allowed at the start of an SDP line.
The answer provided to a somewhat similar but definitely different question I had asked earlier sheds some light but is not definitive on this particular issue.
Spaces before the case-significant character are not allowed. The BNF/ABNF does not show that you can add spaces before the lines defined in session-description. It even states explicitly which letter each line has to start with, like v=....
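A strict reading of that rule can be sketched as a simple line check. This is not a full ABNF validator: the pattern and function name are invented for illustration, and restricting the type to a lowercase ASCII letter is an assumption based on the type letters the RFC defines.

```python
import re

# One type letter in the first column, "=", then a non-empty value.
# The value itself may begin with a space (the RFC suggests "s= " for a
# session without a meaningful name), so only the part before "=" is strict.
SDP_LINE = re.compile(r'[a-z]=.+')

def looks_like_sdp_line(line: str) -> bool:
    """Return True if line has the <type>=<value> shape with no leading whitespace."""
    return SDP_LINE.fullmatch(line) is not None

for candidate in ["v=0", " v=0", "v =0", "s= "]:
    print(repr(candidate), looks_like_sdp_line(candidate))
```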
While reading RFC 1035 Section 5.1 in order to write a master file parser, I stumbled across the following statement:
5.1. Format
The format of these files is a sequence of entries. Entries are
predominantly line-oriented, though parentheses can be used to continue
a list of items across a line boundary, and text literals can contain
CRLF within the text. Any combination of tabs and spaces act as a
delimiter between the separate items that make up an entry. The end of
any line in the master file can end with a comment. The comment starts
with a ";" (semicolon).
What do the authors mean by "text literals can contain CRLF within the text"? I am aware that the entry below is valid, as outlined in Section 5.3, but I cannot find either an example of the statement or a proper definition of "text literal". I have also searched the companion RFC 1034 for any mention of the above statement, without success.
@ IN SOA VENERA Action\.domains (
20 ; SERIAL
7200 ; REFRESH
600 ; RETRY
3600000; EXPIRE
60) ; MINIMUM
I would assume a text literal could be delimited by parentheses. Would any of the following comments be valid per RFC 1035 and in what different ways would a CRLF be valid in the file?
@ IN SOA VENERA Action\.domains (
20 ; Some example of a multi-line comment
inside parentheses
7200
600
3600000
60) ; (Some example of parentheses
inside a multi-line comment)
It means that this is supposed to be valid:
example.com. IN TXT "hello,
world"
The RFC authors probably expect it to be equivalent to:
example.com. IN TXT "hello,\013\010world"
Due to the ambiguity of line-ending encodings in this situation (if the platform uses LF as the line terminator, do you still get CRLF in the TXT record?), I doubt this is widely implemented.
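For a parser, the escaped form is the easier one to handle. The sketch below decodes the backslash escapes of a master-file character string, where a backslash followed by three digits is a decimal octet value and a backslash followed by any other character stands for that character. The function name is made up, and values above 255 are not handled here.

```python
import re

# Either backslash + three decimal digits, or backslash + any single character.
_ESCAPE = re.compile(r'\\(?:(\d{3})|(.))', re.DOTALL)

def unescape_master_text(text: str) -> bytes:
    """Decode master-file escapes in the content of a quoted character string."""
    def replace(match):
        if match.group(1) is not None:
            return chr(int(match.group(1)))  # decimal octet, e.g. 013 is CR
        return match.group(2)                # escaped literal, e.g. a quote
    return _ESCAPE.sub(replace, text).encode("latin-1")

print(unescape_master_text(r'hello,\013\010world'))
```

A literal line break inside the quotes would pass through this function unchanged, so whether it ends up as CRLF or as a bare LF depends on how the file was written, which is exactly the ambiguity mentioned above.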
I am trying to search a text log for multiple strings that appear together in the following pattern:
s(n)KEY: some data
s(n)Measurement: some data
s(n)Units: some data
Here s(n) is a number of spaces that varies. KEY will change at every iteration of the loop, as it comes from the .ini file. As an example, see the following snippet of the log:
WHITE On Axis Lum_010 OPTICAL_TEST_01 some.seq
WHITE On Axis Lum_010 Failed
Bezel1 Luminance-Light Source: Passed
Measurement: 148.41
Units: fc
WHITE On Axis Lum_010: Failed
Measurement: 197.5
Units: fL
In this case, I only want to detect when the key (WHITE On Axis Lum_010) appears along with Measurement and I don't want to detect if it appears anywhere else in the log. My ultimate goal is to get the measurement and unit data from file.
Any help will be greatly appreciated. Thank you, Rav.
I'd do it similar to Salome, using regular expressions. Since those are a little tricky, I have a test VI for them:
The RegEx is:
^\s{2}(.*?):\s*(\S*)\n\s*Measurement:\s*(\S*)\n\s*Units:\s*(\S*)
and means:
^ Find the beginning of a line
\s{2} followed by exactly two whitespace characters
(.*?) followed by as few characters as possible (non-greedy), captured as group 1
: followed by a ':'
\s* followed by any amount of whitespace (possibly none)
(\S*) followed by any number of non-whitespace characters, captured as group 2
\n followed by a newline
\s* followed by any amount of whitespace (possibly none)
Measurement: followed by this literal string
\s* followed by any amount of whitespace (possibly none)
(\S*) followed by any number of non-whitespace characters, captured as group 3
\n followed by a newline
... and the same for 'Units'
The parentheses denote groups, and allow to easily collect the interesting parts of the string.
The RegEx string might need more tuning if the format of the data is not as expected, but this is a starting point.
To find more data in your string, put this in a while loop and use a shift register to feed the offset past match into the offset of the next iteration, and stop if it's =-1.
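For readers without LabVIEW at hand, the same loop can be sketched in Python. The regular expression is the one given above; the indentation in the sample log is an assumption (two spaces before each key line, four before Measurement and Units), and the function name is invented for this example.

```python
import re

PATTERN = re.compile(
    r'^\s{2}(.*?):\s*(\S*)\n\s*Measurement:\s*(\S*)\n\s*Units:\s*(\S*)',
    re.MULTILINE,  # let ^ match at the start of every line
)

def find_measurements(log: str):
    """Collect (key, status, measurement, units) for every block in the log."""
    results = []
    offset = 0
    while True:
        match = PATTERN.search(log, offset)
        if match is None:          # plays the role of the offset being -1
            break
        results.append(match.groups())
        offset = match.end()       # feed the offset past the match into the next pass
    return results

# Sample log with assumed indentation.
LOG = (
    "WHITE On Axis Lum_010 Failed\n"
    "  Bezel1 Luminance-Light Source: Passed\n"
    "    Measurement: 148.41\n"
    "    Units: fc\n"
    "  WHITE On Axis Lum_010: Failed\n"
    "    Measurement: 197.5\n"
    "    Units: fL\n"
)

for row in find_measurements(LOG):
    print(row)
```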
It's easier to search through and to implement.
LabVIEW also has VIs to create and manage JSONs.
Alternatively, you could use Regular Expressions in a while-loop to look if it exists in your log, maybe something like this:
WHITE On Axis Lum_010:(\s)*((Failed)|(Passed))\n(\s)+Measurement:(\s)*[0-9]*((\.)[0-9]*){0,1}\n(\s)*Units:\s*\w*
Then you can split the string or pick lines and take the information.
But I would not recommend that, as it is impractical to change and not useful if you want to use the code for other keys.
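One way to soften that drawback is to build the pattern from the key at run time. The sketch below does this in Python rather than LabVIEW; the function name is hypothetical, and accepting "Failed", "Pass" or "Passed" as the status is an assumption based on the sample log.

```python
import re

def measurement_for_key(log: str, key: str):
    """Return (measurement, units) for the block headed by 'key:', or None."""
    pattern = re.compile(
        re.escape(key)                       # the key is matched literally
        + r':\s*(?:Failed|Pass(?:ed)?)\n'    # status line ends right after the status
        + r'\s+Measurement:\s*([0-9]+(?:\.[0-9]*)?)\n'
        + r'\s*Units:\s*(\w*)'
    )
    match = pattern.search(log)
    return match.groups() if match else None

# Sample log with assumed indentation.
LOG = (
    "WHITE On Axis Lum_010 Failed\n"
    "  Bezel1 Luminance-Light Source: Passed\n"
    "    Measurement: 148.41\n"
    "    Units: fc\n"
    "  WHITE On Axis Lum_010: Failed\n"
    "    Measurement: 197.5\n"
    "    Units: fL\n"
)

print(measurement_for_key(LOG, "WHITE On Axis Lum_010"))
```

Because the pattern requires the colon directly after the key, the earlier line where the key appears without a colon is skipped.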
I hope it helps you :)
I am working on a processor that splits text into blocks with marks:
LOREM IPSUM SED AMED
will be parsed like:
{word:1}LOREM{/word:1}{space:2}
{word:3}IPSUM{/word:3}{space:4}
{word:5}SED{/word:5}{space:6}
{word:7}AMED{/word:7}
But I don't want to use "{word}" etc., because it breaks the processor: the marker is itself a string again. I need to mark the text like this:
\E002\0001 LOREM \E003\0001 \E004\0002
\E002\0003 IPSUM \E003\0004 \E004\0005
\E002\0006 SED \E003\0006 \E004\0007
\E002\0008 AMED \E003\0008
The first value, \E002, is the element type number; its last bit marks the element's close, so element type numbers increment by 2.
The second value, \0001, is the element index for stacking.
I just picked \E002 arbitrarily for this example.
But \0001 is also a code point in the Unicode range, and this leads me back to where I started...
So which Unicode range can I use? \ff0000? Or how else can I solve this?
Thanks!
The Unicode Consortium thought of this. There is a range of Unicode code points that are meant to never represent a displayable character, but meta-codes instead:
Noncharacters are code points that are permanently reserved and will never have characters
assigned to them.
...
Tag characters were intended to support a general scheme for the internal tagging of text
streams in the absence of other mechanisms, such as markup languages. The use of tag
characters for language tagging is deprecated.
(http://www.unicode.org/versions/Unicode9.0.0/ch23.pdf)
You should be able to use regular control characters as "private" tags, because these should never occur in proper strings. This would be the range from U+0000 to U+001F, excluding tab (U+0009), the common "returns" (U+000A and U+000D), and, for safety, U+0000 itself (some libraries do not like Null characters in the middle of strings).
Non-characters
Noncharacters are code points that are permanently reserved in the Unicode Standard for
internal use. They are not recommended for use in open interchange of Unicode text data.
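You can use U+FFFE and U+FFFF, which are officially defined as noncharacters. (U+FEFF, by contrast, is the byte order mark, an assigned character, so it is not a safe choice.) There are several more noncharacters defined (U+FDD0–U+FDEF and the last two code points of every plane), and you can be fairly sure they will not occur in regular text strings.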
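If you want to check whether a given code point belongs to the noncharacter set, a small test like the following should do. It assumes the definition in the Unicode Standard: U+FDD0 through U+FDEF, plus the last two code points of each of the 17 planes. The function name is made up for this sketch.

```python
def is_noncharacter(code_point: int) -> bool:
    """Return True if code_point is one of the Unicode noncharacters."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Contiguous block of 32 noncharacters in the BMP.
    if 0xFDD0 <= code_point <= 0xFDEF:
        return True
    # Last two code points of every plane: xxFFFE and xxFFFF.
    return (code_point & 0xFFFE) == 0xFFFE

for cp in (0xFFFE, 0xFFFF, 0xFDD0, 0x1FFFF, 0x0041):
    print(f"U+{cp:04X}", is_noncharacter(cp))
```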
A few random sequences with predefined definitions, and so highly unlikely to occur in plain text strings are:
Specials: U+FFF0–U+FFF8
The nine unassigned Unicode code points in the range U+FFF0..U+FFF8 are reserved for
special character definitions.
Annotation Characters: U+FFF9–U+FFFB
An interlinear annotation consists of annotating text that is related to a sequence of annotated
characters. For all regular editing and text-processing algorithms, the annotated characters
are treated as part of the text stream. The annotating text is also part of the content,
but for all or some text processing, it does not form part of the main text stream.
Tag Characters: U+E0000–U+E007F
This block encodes a set of 95 special-use tag characters to enable the spelling out of ASCII-based
string tags using characters that can be strictly separated from ordinary text content
characters in Unicode.
(all quotations from the chapter as above)
Staying within conventions, you can also use U+2028 (line separator) and/or U+2029 (paragraph separator).
Technically, your use of U+E000–U+F8FF (the "Private Use Area") is okay-ish, because these code points can only define an unambiguous character in combination with a certain font. However, it is possible these codes may pop up if you get your plain text from a source where the font was included.
As for how to encode this into your strings: it doesn't really matter if the numerical code immediately following your private tag marker is a valid Unicode character or not. If you see one of your own tag markers, then the value immediately following is always your own private sequence number.
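A minimal sketch of that idea follows. The marker values U+E002 to U+E004 are taken from the question's example, while the function names and the event format are invented here. The index is stored as a single character via chr(), which is an assumption that only works while indices stay outside the surrogate range.

```python
# Marker code points from the question's example, mapped to element names.
MARKERS = {"\uE002": "word", "\uE003": "/word", "\uE004": "space"}

def encode(words):
    """Wrap each word in open/close markers and put a space marker between words."""
    out = []
    index = 1
    for position, word in enumerate(words):
        out.append("\uE002" + chr(index) + word + "\uE003" + chr(index))
        index += 1
        if position < len(words) - 1:
            out.append("\uE004" + chr(index))
            index += 1
    return "".join(out)

def decode(marked):
    """Turn a marked string back into a list of (kind, value) events."""
    events, text, i = [], [], 0
    while i < len(marked):
        ch = marked[i]
        if ch in MARKERS:
            if text:
                events.append(("text", "".join(text)))
                text = []
            # The character right after a marker is always the index,
            # whatever code point it happens to be.
            events.append((MARKERS[ch], ord(marked[i + 1])))
            i += 2
        else:
            text.append(ch)
            i += 1
    if text:
        events.append(("text", "".join(text)))
    return events

print(decode(encode(["LOREM", "IPSUM"])))
```

Because the index is read positionally, it does not matter that a value such as U+0001 is itself a valid code point.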
As you see, there are lots of possibilities. I guess the most important criterion is whether you want to use other functions on these strings. If you create a string that is technically invalid Unicode (for instance, because it includes not-a-character values), some external functions may choose to fail to work on them, or silently remove the bad values. In such a case, you'd need to rigorously stick to a system in which you only use 'valid' code points.
I know Yahoo and Gmail do not accept it. But I want to know whether it's possible for a person to create an email address with a double @ in it, and whether they can receive email at that address.
For example: info@stackoverflow.com@example.com.
I do not want to use this non-standard format, but I want to know whether a hacker can do it.
According to the wiki, it is allowed.
Space and special characters "(),:;<>@[\] are allowed with restrictions (they are only allowed inside a quoted string, as described in the paragraph below, and in addition, a backslash or double-quote must be preceded by a backslash).
No, it is not allowed. See the RFC, Section 3.4.1:
An addr-spec is a specific Internet identifier that contains a
locally interpreted string followed by the at-sign character ("@",
ASCII value 64) followed by an Internet domain. The locally
interpreted string is either a quoted-string or a dot-atom. If the
string can be represented as a dot-atom (that is, it contains no
characters other than atext characters or "." surrounded by atext
characters), then the dot-atom form SHOULD be used and the quoted-
string form SHOULD NOT be used. Comments and folding white space
SHOULD NOT be used around the "@" in the addr-spec.
As this question's answer says:
The local-part of the e-mail address may use any of these ASCII characters:
Uppercase and lowercase English letters (a-z, A-Z)
Digits 0 to 9
Characters ! # $ % & ' * + - / = ? ^ _ ` { | }
Character . (dot, period, full stop) provided that it is not the first or last
character, and provided also that it does not appear two or more times
consecutively.
so it is usually not allowed :)
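The character list above translates fairly directly into a check for an unquoted (dot-atom) local part. This is only a sketch: it ignores quoted strings, length limits and whatever extra rules a particular mail provider applies, the function name is made up, and the tilde in the character class comes from the RFC's atext definition rather than from the list above.

```python
import re

# Characters allowed in an unquoted local part (atext in the RFC).
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
# One or more atext runs separated by single dots, no leading or trailing dot.
DOT_ATOM = re.compile(ATEXT + r"+(?:\." + ATEXT + r"+)*")

def is_dot_atom_local_part(local: str) -> bool:
    """Return True if local is acceptable as an unquoted local part."""
    return DOT_ATOM.fullmatch(local) is not None

for candidate in ["info", "first.last", "info@stackoverflow.com", "a..b"]:
    print(candidate, is_dot_atom_local_part(candidate))
```

Since the at-sign is not in the allowed set, a local part containing a second at-sign can only be written in the quoted-string form.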