Microsoft Excel will write a CSV file containing fields with multiple lines. The newlines are 0A (UNIX-style) instead of 0D0A.
However, it will not correctly read the .csv file it just wrote: the fields that contain 0A newlines become new rows. How can this be overcome?
This Excel spreadsheet is saved both as t-xl.xlsx and as a CSV file, t-xl.csv.
PS H:\r> Format-Hex .\t-xl.csv
Path: H:\r\t-xl.csv
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 66 31 2C 66 32 0D 0A 31 2C 66 6F 72 0D 0A 32 2C f1,f2..1,for..2,
00000010 22 6E 6F 77 0A 69 73 0A 74 68 65 22 0D 0A 33 2C "now.is.the"..3,
00000020 74 69 6D 65 0D 0A time..
When the t-xl.csv is loaded, Excel seems to remember and handle the newlines correctly (as they were in t-xl.xlsx).
However, when using Data > From Text, it will not handle the newlines correctly.
At least one CSV reference describes support for field-embedded newlines: http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm Is there any reason Microsoft Excel does not support this?
The Excel legacy Text Import Wizard does not respect quoted line-breaks; it splits the field at them instead of keeping them together.
Opening the file directly with Excel, as you have seen, will respect the quoted line-breaks.
If you have Excel 2010+, you can use Power Query to Get & Transform from Text/CSV. There is an option to enable this (I believe it is enabled by default).
The only work-around for the legacy wizard of which I am aware would be to pre-process the file, replacing the quoted line-breaks with something else, and then to process it again after import to replace "something else" with the line-breaks.
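A rough sketch of that pre-processing step in PowerShell 5+ (not a definitive recipe; it assumes "-quoted fields with embedded LF breaks as in the dump above, does not handle doubled "" quotes inside fields, and uses the arbitrary placeholder <LF>):
$raw = Get-Content .\t-xl.csv -Raw
# Replace line breaks that occur inside quoted fields with the placeholder.
$fixed = [regex]::Replace($raw, '"[^"]*"', { param($m) $m.Value -replace "`r?`n", '<LF>' })
Set-Content .\t-xl-prepped.csv $fixed -NoNewline
After importing t-xl-prepped.csv with the wizard, turn the placeholder back into a real line break, e.g. with SUBSTITUTE(A1,"<LF>",CHAR(10)) in the sheet.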
Related
I copied a piece of text from a website. This piece of text contains a space. I later try to manipulate this string in C#, but my code doesn't recognize the space.
I started digging deeper, so I tried the following PowerShell command to convert the string to hexadecimal to see what's going on:
"2+1 53" | Format-Hex
It shows that the result is:
32 2B 31 3F 35 33
which converted back to normal text is
2+1?53
Notice that the question mark wasn't present in my original string. What is going on? How can a question mark be present but not show up? Or where did it come from if it was not present in my original string?
Update:
Perhaps I should stress that I need to figure out what that "space" character is, so that I can later get rid of it using the "replace" method.
Most likely, there's another character in that text, which is not a space. You can check it by putting the text into a file and then using
Get-Content C:\temp\file.txt | Format-Hex
To reproduce, I used this text:
Get-Service –Name BITS
# ^ it's not a normal dash, check it at http://asciivalue.com/index.php
This is what happens if I paste it into the console window:
31.88 ms | C:\> "Get-Service -Name BITS" | Format-Hex
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 47 65 74 2D 53 65 72 76 69 63 65 20 3F 4E 61 6D Get-Service ?Nam
00000010 65 20 42 49 54 53 e BITS
And this is what I get when reading it from the script:
60.02 ms | C:\> Get-Content C:\temp\script.ps1 | Format-Hex
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 47 65 74 2D 53 65 72 76 69 63 65 20 3F 3F 3F 4E Get-Service ???N
00000010 61 6D 65 20 42 49 54 53 ame BITS
As you can see, that character is converted to a question mark (3F in the hex output), or to a triple question mark (3F 3F 3F) when the content comes from the file.
It depends on the version of PowerShell. I have 5.1; there the default encoding of Format-Hex is ASCII, which replaces every non-ASCII character (like your space) with a question mark.
Specify a different encoding to prevent non-ASCII characters from being replaced. Example:
PS> "⇆" | Format-Hex -Encoding Unicode
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 C6 21 Æ!
Here, the code point is U+21C6. Google that and you will find out what it represents.
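If you do not want to byte-swap C6 21 by eye, one way (just a convenience, not the only one) to get the code point directly is:
PS> '{0:X4}' -f [int][char]'⇆'
21C6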
A regular space is 0x20. There are many Unicode spaces: http://jkorpela.fi/chars/spaces.html How did you make the string? Here's an example with EN SPACE (nut), U+2002; you should be able to copy and paste this yourself. (Hmm, in PowerShell 7 for Windows, the special space doesn't paste.)
[int[]][char[]]'foo bar' | % tostring x
66
6f
6f
2002
62
61
72
'foo bar' | Format-Hex -Encoding BigEndianUnicode
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 00 66 00 6F 00 6F 20 02 00 62 00 61 00 72 .f.o.o ..b.a.r
Format-Hex will translate it to ASCII by default in PowerShell 5. Non-ASCII characters will be replaced by 3F, i.e. ?.
'foo bar' | format-hex -Encoding ascii
Label: String (System.String) <72F012A4>
Offset Bytes Ascii
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
------ ----------------------------------------------- -----
0000000000000 66 6F 6F 3F 62 61 72 foo?bar
UTF-8 encodes a character in one to four bytes, depending on how high the code point is (how many bits it takes to encode); for a BMP character like this one it is one to three. Three bytes in this case (U+2002): a three-byte UTF-8 sequence always starts with a byte whose first hex digit is E, and its second hex digit is the first hex digit of the code point, here '2' (hence E2).
' ' | format-hex -Encoding utf8
# "`u{2002}" | format-hex # powershell 7
Label: String (System.String) <03ACFE1C>
Offset Bytes Ascii
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
------ ----------------------------------------------- -----
0000000000000 E2 80 82 �
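To double-check those three bytes without Format-Hex, you can ask the UTF-8 encoder directly (a small sketch; the cast chain is just one way to build the string):
PS> [System.Text.Encoding]::UTF8.GetBytes([string][char]0x2002) | ForEach-Object { '{0:X2}' -f $_ }
E2
80
82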
I found the answer to my question here: String Comparison, .NET and non breaking space
The "space" that was present in the string was the "non-breaking space" and getting rid of it in C# was as easy as:
using System.Text.RegularExpressions;

string cellText = "String with non breaking spaces.";
cellText = Regex.Replace(cellText, @"\u00A0", " ");
I'm attempting to copy/paste ASCII characters from a hex editor into a Sublime Text 3 Plain Text document, but the NUL characters are not displayed and the string is truncated:
Hexadecimal:
48 65 6C 6C 6F 2C 20 57 6F 72 6C 64 21 00 66 6F
6F 62 61 72 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00
ASCII:
Hello, World!�foobar�������������������������
Sublime Text: Truncates copied string and does not show NUL characters
TextMate: Shows NUL via "Show Invisibles"
I've tried the suggestion mentioned here by adding "draw_white_space": "all" to my preferences — still no luck! Is this possible with Sublime Text 3?
You're not alone in having this problem - others have posted bug reports about this behaviour: https://github.com/SublimeTextIssues/Core/issues/393
However, it's not consistent: the behaviour seems to depend on the file and on where the NUL characters are. There is a similar issue with the console: https://github.com/SublimeTextIssues/Core/issues/1939
I need to test whether a program that I'm writing parses the gzip header correctly, and that includes reading the FEXTRA, FNAME, and FCOMMENT fields. Yet it seems that gzip doesn't support creating archives with the FEXTRA and FCOMMENT fields -- only FNAME. Are there any existing tools which can do all three of these?
The Perl module IO::Compress::Gzip optionally lets you set the three fields you are interested in. (Fair disclosure: I am the author of the module.)
Here is some sample code that sets FNAME to "filename", FCOMMENT to "This is a comment" and creates an FEXTRA field with a single subfield with ID "ab" and value "cde".
use IO::Compress::Gzip qw(gzip $GzipError);
gzip \"payload" => "/tmp/test.gz",
Name => "filename",
Comment => "This is a comment",
ExtraField => [ "ab" => "cde"]
or die "Cannot create gzip file: $GzipError" ;
And here is a hexdump of the file it created.
00000000 1f 8b 08 1c cb 3b 3a 5a 00 03 07 00 61 62 03 00 |.....;:Z....ab..|
00000010 63 64 65 66 69 6c 65 6e 61 6d 65 00 54 68 69 73 |cdefilename.This|
00000020 20 69 73 20 61 20 63 6f 6d 6d 65 6e 74 00 2b 48 | is a comment.+H|
00000030 ac cc c9 4f 4c 01 00 15 6a 2c 42 07 00 00 00 |...OL...j,B....|
0000003f
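For reference, here is a rough PowerShell sketch (the file name test.gz and the ASCII decoding of the strings are assumptions; RFC 1952 actually allows Latin-1) of how a parser can walk that header: the fixed 10 bytes first, then FEXTRA, FNAME and FCOMMENT in that order, as signalled by the FLG byte (1C above = FEXTRA + FNAME + FCOMMENT):
$b   = [System.IO.File]::ReadAllBytes('test.gz')
$flg = $b[3]                         # 1C = FEXTRA(4) + FNAME(8) + FCOMMENT(16)
$i   = 10                            # skip ID1 ID2 CM FLG MTIME(4) XFL OS
if ($flg -band 4) {                  # FEXTRA: 2-byte little-endian XLEN, then XLEN bytes of subfields
    $xlen = $b[$i] + 256 * $b[$i + 1]
    $i += 2 + $xlen
}
if ($flg -band 8) {                  # FNAME: zero-terminated string
    $end = $i; while ($b[$end] -ne 0) { $end++ }
    $name = [System.Text.Encoding]::ASCII.GetString($b, $i, $end - $i)
    $i = $end + 1
}
if ($flg -band 16) {                 # FCOMMENT: zero-terminated string
    $end = $i; while ($b[$end] -ne 0) { $end++ }
    $comment = [System.Text.Encoding]::ASCII.GetString($b, $i, $end - $i)
    $i = $end + 1
}
"FNAME=$name FCOMMENT=$comment"
Run against the hexdump above, this prints FNAME=filename FCOMMENT=This is a comment, with $i left pointing at the start of the deflate data.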
I am unable to unzip a file on Linux CentOS. I am getting the following error:
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
As you are mentioning jar in your comments, we can consider this a programming question ;-)
First of all, you should try to validate your file. If available, you can even compare the checksum provided for this file and/or the file size with those from the location you downloaded it from.
To verify the zip file on a low level you can use this command:
hexdump -C -n 100 file.zip
This will show you the first 100 bytes of the zip's structure, which will look similar to this:
00000000 50 4b 03 04 0a 00 00 00 00 00 88 43 65 47 11 7a |PK.........CeG.z|
00000010 39 1e 15 00 00 00 15 00 00 00 0e 00 1c 00 66 69 |9.............fi|
00000020 6c 65 31 69 6e 7a 69 70 2e 74 78 74 55 54 09 00 |le1inzip.txtUT..|
00000030 03 0f 05 3b 56 2f 05 3b 56 75 78 0b 00 01 04 e8 |...;V/.;Vux.....|
00000040 03 00 00 04 e8 03 00 00 54 68 69 73 20 69 73 20 |........This is |
00000050 61 20 66 69 6c 65 0a 1b 5b 31 37 7e 0a 50 4b 03 |a file..[17~.PK.|
00000060 04 0a 00 00 |....|
The first two bytes of the file have to be PK; if not, the file is invalid. Some bytes later you will find the name of the first file stored; in this example it is file1inzip.txt.
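On Windows the same quick look can be had without hexdump; a minimal sketch (file.zip is the placeholder name from above):
PS> [System.IO.File]::ReadAllBytes('file.zip')[0..3] | ForEach-Object { '{0:X2}' -f $_ }
50
4B
03
04
50 4B is "PK", and 50 4B 03 04 is the local file header signature that a healthy zip normally starts with.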
I've been using RFC 1035.4.1.3 as a reference for DNS RR format:
http://www.freesoft.org/CIE/RFC/1035/42.htm
The RFC says that RDLENGTH is "an unsigned 16 bit integer that specifies the length in octets of the RDATA field", but in the datagrams I'm getting, RDLENGTH is sometimes 2 less than it should be. I've checked with Wireshark to ensure that I'm receiving the datagram correctly. Here's a CNAME record I got while looking up google:
C0 0C 00 05 00 01 00 03 95 FC 00 10 03 77 77 77
01 6C 06 67 6F 6F 67 6C 65 03 63 6F 6D 00
So that's the name: C0 0C (a pointer to www.google.com earlier in the dgram)
Then the type: 00 05 (CNAME)
Then the class: 00 01 (IN)
Then the TTL: 00 03 95 FC (whatever)
Then RDLENGTH: 00 10 (that's 16 bytes, yes?)
Then RDATA:
03 77 77 77 01 6C 06 67 6F 6F 67 6C 65 03 63 6F 6D 00 (www.l.google.com - format is correct)
As you can see, the RDATA is 18 bytes in length. 18 bytes is 0x12, not 0x10.
The type A records that come after that correctly report RDLENGTH 4 for the address data. Am I missing something here? I'd dismiss it as an error, but I get this from every DNS server and for every domain.
I guess what I'm really asking is why the RDATA is longer than RDLENGTH, and what rules I should follow to adapt to it so I can parse any type of record. (Specifically, can I expect this kind of thing from other RR types?)
Thank you in advance to anyone who gives advice. :)
The response data appears to be messed up - either RDLENGTH should be 18 (0x00 0x12), or RDATA should be different.
I just ran a few google lookups from here, and I do not see that problem.
I get an RDLENGTH of 7 and RDATA to match (a compressed name).
Is something messing with your packet data?
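In case it helps with the parser itself, here is a rough PowerShell sketch (the variables $msg and $pos are assumptions: the whole DNS message as a byte array and the offset of this record); every multi-byte field is big-endian, and whatever the record type, the parser should then skip exactly RDLENGTH bytes:
$nameLen  = 2                                    # C0 0C is a compression pointer (always 2 bytes);
                                                 # an uncompressed name must be walked label by label
$off      = $pos + $nameLen
$type     = ($msg[$off]     -shl 8)  -bor $msg[$off + 1]   # 00 05 = CNAME
$class    = ($msg[$off + 2] -shl 8)  -bor $msg[$off + 3]   # 00 01 = IN
$ttl      = ($msg[$off + 4] -shl 24) -bor ($msg[$off + 5] -shl 16) -bor ($msg[$off + 6] -shl 8) -bor $msg[$off + 7]
$rdlength = ($msg[$off + 8] -shl 8)  -bor $msg[$off + 9]
$next     = $off + 10 + $rdlength                # start of the next resource record
Applied to the dump in the question, $rdlength comes out as 0x10 = 16 while the name in RDATA needs 18 bytes, which is consistent with the data (or the capture) being off rather than with a different way of reading RDLENGTH.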