Encoding of "ä", "ö", ... doesn't work properly - f#-data

I'm using the CsvProvider on a csv File generated out of an Excel. The csv file contains some specific German characters ('ä', 'ö', ...) that are not read correctly.
I tryied with the Encoding parameter, used a stream with UTF8 Encoding, saved the Excel file as ".csv", ".csv" in MS-DOS Format, ... But, the result was always the same.
Here a reduced code sample:
open FSharp.Data
type MyCsvProvider = CsvProvider<"Mappe1.csv",Separators=";">
[<EntryPoint>]
let main argv =
let csvDatas = MyCsvProvider.Load("..\..\Mappe1.csv")
for i in csvDatas.Rows do
printfn "Read: Beispiel='%s' and Änderungsjahr='%d" i.Beispiel i.``�nderungsjahr``
0
Here the corresponding CsvFile:
Beispiel;Änderungsjahr
Data1;2000
Überlegung;2010
And here the result after execution:
Read: Beispiel='Data1' and Änderungsjahr='2000
Read: Beispiel='?berlegung' and Änderungsjahr='2010

I'm not into F#, but I reckon could be a locale settings for the console. Set a breakpoint and check the actual bytes value with the debugger for the string.
For instance Überlegung starts with Ü which has 0xDC as ASCII Code if you get this value from the debugger then it's only the console locale must be set.
Try having a look at this so question about setting the locale even if it is for c++ should be something you can adapt to your environment.

OK, I found the problem: Using CSV in Excel generates more or less ASCII, but no UTF. The Format to use is "Unicode (Text)", which generates real unicode, with '\t' as separator instead of ';' or ','. Works for me... Thus I close the question... Thanks to all!

Related

Parsing a non-Unicode string with Flask-RESTful

I have a webhook developed with Flask-RESTful which gets several parameters with POST.
One of the parameters is a non-Unicode string, encoded in cp1251.
Can't find a way to correctly parse this argument using reqparse.
Here is the fragment of my code:
parser = reqparse.RequestParser()
parser.add_argument('text')
msg = parser.parse_args()
Then, I write msg to a text file, and it looks like this:
{"text": "\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd !\n\n\ufffd\ufffd\ufffd\ufffd\ufffd\n\n-- \n\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd."}
As you can see, Flask somehow replaces all Cyrillic characters with \ufffd. At the same time, non-Cyrillic characters, like ! or \n are processed correctly.
Anything I can do to advise RequestParser with the string encoding?
Here is my code for writing the text to disk:
f = open('log_msg.txt', 'w+')
f.write(json.dumps(msg))
f.close()
I tried f = open('log_msg.txt', 'w+', encoding='cp1251') with the same result.
Then, I tried
f = open('log_msg_ascii.txt', 'w+')
f.write(ascii(json.dumps(msg)))
Also, no difference.
So, I'm pretty sure it's RequestParser() tries to be too smart and can't understand the non-Unicode input.
Thanks!
Okay, I finally found a workaround. Thanks to #lenz for helping me with this issue. It seems that reqparse wrongly assumes that every string parameter comes as UTF-8. So when it sees a non-Unicode input field (among other Unicode fields!), it tries to load it as Unicode and fails. As a result, all characters are U+FFFD (replacement character).
So, to access that non-Unicode field, I did the following trick.
First, I load raw data using get_data(), decode it using cp1251 and parse with a simple regexp.
raw_data = request.get_data()
contents = raw_data.decode('windows-1251')
match = re.search(r'(?P<delim>--\w+\r?\n)Content-Disposition: form-data; name=\"text\"\r?\n(.*?)(?P=delim)', contents, re.MULTILINE | re.DOTALL)
text = match.group(2)
Not the most beautiful solution, but it works.

Writing to files in ASCII with Python3, not UTF8

I have a program that I created with two sections.
The first one copies a text file with an integer in the middle of the file name in this format.
file = "Filename" + "str(int)" + ".txt"
the user can create as many copies of the file that they would like.
The second part of the program is what I am having the problem with. There is an integer at the very bottom of the file that is to correspond with the integer in the file name. After the first part is done, I open each file one at a time in "r+" read/write format. So I can file.seek(1000) to about where the integer is in the file.
Now in my opinion the next part should be easy. I should just simply have to write str(int) into the file right here. But it wasn't that easy. It worked just fine doing it like that in Linux at home, but at work on Windows it proved difficult. What I ended up having to do after file.seek(1000) is write to the file using Unicode UTF-8. I accomplished this with this code snippet of the rest of the program. I will document it so that it is able to be understood what is going on. Instead of having to write this in Unicode, I would love to be able to write this in good old regular English ASCII characters. Eventually this program will be expanded to include a lot more data at the bottom of each file. Having to write the data in Unicode is going to make things extremely difficult. If I just write the data without turning it into Unicode this is the result. This string is supposed to say #2 =1534, instead it says #2 =ㄠ㌵433.
If someone can show me what I am doing wrong that would be great. I would love to just use something like file.write('1534') to write the data to the file instead of having to do it in Unicode UTF-8.
while a1 < d1 :
file = "file" + str(a1) + ".par"
f = open(file, "r+")
f.seek(1011)
data = f.read() #reads the data from that point in the file into a variable.
numList= list(str(a1)) # "a1" is the integer in the file name. I had to turn the integer into a list to accomplish the next task.
replaceData = '\x00' + numList[0] + '\x00' + numList[1] + '\x00' + numList[2] + '\x00' + numList[3] + '\x00' #This line turns the integer into Utf 8 Unicode. I am by no means a Unicode expert.
currentData = data #probably didn't need to be done now that I'm looking at this.
data = data.replace(currentData, replaceData) #replaces the Utf 8 string in the "data" variable with the new Utf 8 string in "replaceData."
f.seek(1011) # Return to where I need to be in the file to write the data.
f.write(data) # Write the new Unicode data to the file
f.close() #close the file
f.close() #make sure the file is closed (sometimes it seems that this fails in Windows.)
a1 += 1 #advances the integer, and then return to the top of the loop
This is an example of writing to a file in ASCII. You need to open the file in byte mode, and using the .encode method for strings is a convenient way to get the end result you want.
s = '12345'
ascii = s.encode('ascii')
with open('somefile', 'wb') as f:
f.write(ascii)
You can obviously also open in rb+ (read and write byte mode) in your case if the file already exists.
with open('somefile', 'rb+') as f:
existing = f.read()
f.write(b'ascii without encoding!')
You can also just pass string literals with the b prefix, and they will be encoded with ascii as shown in the second example.

How can I write a string to binary file using Python3?

I would like to write strings to a bin file as header.
However, I can only write type 'bytes' to the binary file.
Here is my code:
header1 = str.encode("1\n")
header1 = str.encode("2\n")
print (type(header))
with open("abc.bin",'wb') as f_test:
f_test.write(header1)
f_test.write(header2)
Here are my questions:
1, when I open the abc.bin file using notepad, I can see "1" and "2" but they are not at the separated line. Why is it seems that \n is not functional?
2, in the .bin file, what are the format of "1" and "2". are they strings?
3, I tried pickle and marshal too. However, when I open .bin file, I found something in front of "1" and "2"(like when I used marshal.dump(header1,f_test), it gave me: ?1?2). What are these'?' and where do they come frome?
This is not originally from me, but I get the solution from the comment on this post:
https://pythonconquerstheuniverse.wordpress.com/2011/05/08/newline-conversion-in-python-3/
To Sum up, the newline need to be converted to a byte. i.e. b"\n"
if you try the following, it will print a new line:
header1 = str.encode("1")
header1 = str.encode("2")
print (type(header))
with open("abc.bin",'wb') as f_test:
f_test.write(header1+b"n")
f_test.write(header2+b"n")

Encode to Unicode UTF-8 not working

My Code -
var utf8 = require('utf8');
var y = utf8.encode('एस एम एस गपशप');
console.log(y);
Input -
एस एम एस गपशप
Expecting Output - \xE0\xA4\x8F\xE0\xA4\xB8\x20\xE0\xA4\x8F\xE0\xA4\xAE\x20\xE0\xA4\x8F\xE0\xA4\xB8\x20\xE0\xA4\x97\xE0\xA4\xAA\xE0\xA4\xB6\xE0\xA4\xAA
Example Encoding using utf8.js
Output -
à¤à¤¸ à¤à¤® à¤à¤¸ à¤à¤ªà¤¶à¤ª
What am I doing wrong? Please help!
That code appears to be working. That output looks like UTF-8 bytes interpreted as some 8-bit character set, most likely ISO-8859-1, which is easily recognisable by the repeating patterns.
That example output is just how you would represent that string in source code.
Try this:
var utf8 = require('utf8');
var y = utf8.encode('एस');
console.log(y);
console.log('\xE0\xA4\x8F\xE0\xA4\xB8');
You will probably see the same output twice.
You can easily write some code to get that hexadecimal forms back using a lookup table and the charCodeAt function, but it is a rather unusual way to represent a string in JavaScript. JSON for example either just uses the literal characters, or '\uXXXX' escapes.

Node Buffers, from utf8 to binary

I'm receiving data as utf8 from a source and this data was originally in binary form (it was a Buffer). I have to convert back this data to a Buffer. I'm having a hard time figuring how to do this.
Here's a small sample that shows my problem:
var hexString = 'e61b08020304e61c09020304e61d0a020304e61e65';
var buffer1 = new Buffer(hexString, 'hex');
var str = buffer1.toString('utf8');
var buffer2 = new Buffer(str, 'utf8');
console.log('original content:', hexString);
console.log('buffer1 contains:', buffer1.toString('hex'));
console.log('buffer2 contains:', buffer2.toString('hex'));
prints
original content: e61b08020304e61c09020304e61d0a020304e61e65
buffer1 contains: e61b08020304e61c09020304e61d0a020304e61e65
buffer2 contains: efbfbd1b08020304efbfbd1c09020304efbfbd1d0a020304efbfbd1e65
Here, I would like buffer2 to be the exact same thing as buffer1.
How can I convert an utf8 string to its original binary Buffer?
You cannot expect binary data converted to utf8 and back again to be the same as the original binary data because of the way utf8 works (especially when invalid utf8 characters are replaced with \ufffd).
You have to use another format that correctly preserves the data. This could be 'hex', 'base64', 'binary', or some other binary-safe format provided by a third-party module. Obviously you should probably keep it as a Buffer if you can.
The accepted answer is misleading. Your main problem is that you're dealing with invalid UTF-8. If the data were valid, the conversion would not cause issues.
Specifically, take the first two bytes: e61b.
In binary, that's: 11100110, 00011011. This is invalid. Take a look at this diagram from the utf-8 wikipedia page.
This says that if a byte starts with 1110, the next byte must start with two bytes starting with 10 after it. This is not the case here.
Whenever js hits an invalid character, it replaces it with �, the unicode replacement character. The codepoint for that is U+FFFD, and the utf-8 encoding of that code point is efbfbd. Notice that this shows up in your output a few times.

Resources