I'm having issues with the C++ SDK of Azure Cognitive Services Speech to Text in Spanish, related to accented characters.
I'm seeing the following error:
'sÃ' instead of 'Si' or 'Sí', which would be the correct transcription.
I'm guessing this is due to the API response encoding. Is there any way to set headers to request the response in UTF-8 or any encoding with full Spanish support?
The return is UTF-8 encoded; if you redirect the output to a file and load it into a UTF-8-capable editor, you will see the text is actually correct. The problem is UTF-8 output in the Windows cmd console.
There are several Stack Overflow discussions about this. Perhaps something like this helps: How to convert UTF-8 to ASCII in C++?
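To see why the text is correct in a file but garbled in the console, you can reproduce the mojibake by decoding UTF-8 bytes with a Windows ANSI code page. A minimal sketch in Python for illustration (cp1252 is an assumption about the console's code page):

# 'í' encoded as UTF-8 is the two-byte sequence C3 AD.
utf8_bytes = 'Sí'.encode('utf-8')
print(utf8_bytes)                   # b'S\xc3\xad'

# A console using Windows-1252 interprets each byte separately,
# turning C3 into 'Ã' and AD into an (often invisible) soft hyphen,
# which matches the 'sÃ' seen in the question.
print(utf8_bytes.decode('cp1252'))  # SÃ­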
I have recently learned more in depth about ASCII, Unicode, UTF-8, UTF-16, etc. in Python 3, but I am struggling to understand when one would run into issues while reading from or writing to files.
So if I open a file:
with open(myfile, 'a') as f:
    f.write(stuff)
where stuff = 'Hello World!'
I have no issues writing to a file.
If I have something like:
non_latin = '娜', I can still write to the file with no problems.
So when does one run into issues regarding encodings? When does one use encode() and decode()?
You run into issues if the default encoding for your OS doesn't support the characters being written. In your case the default (obtained from locale.getpreferredencoding(False)) is probably UTF-8. On Windows, the default is an ANSI encoding like cp1252, which wouldn't support Chinese. It's best to be explicit and use open(myfile, 'w', encoding='utf8'), for example.
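As a concrete sketch of both the failure and the fix (assuming a Windows machine whose preferred encoding is cp1252; the file name is hypothetical):

non_latin = '娜'

# With cp1252 as the implicit default, this raises UnicodeEncodeError,
# because cp1252 has no mapping for '娜':
#   open('myfile.txt', 'w').write(non_latin)

# Being explicit works regardless of the OS default:
with open('myfile.txt', 'w', encoding='utf8') as f:
    f.write(non_latin)

# encode()/decode() are the same conversion done by hand:
data = non_latin.encode('utf8')  # str -> bytes
print(data.decode('utf8'))       # bytes -> str, prints 娜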
I am trying to translate a code file, and I see the issues below with Microsoft Translator.
Translating the text below:
// 行番号削除正常終
gives the output:
Line number deletion successful end
where it removes the // (double slash).
Translating the text below:
System.out.println("【○】COBOL
gives the output:
System.out.println_○○ COBOL
where the special characters are removed or replaced with others.
I am reading my file using CP932 encoding and writing using UTF-8.
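For reference, the read/write pipeline described here would look roughly like this in Python (a sketch; the file names are hypothetical):

# Read the source as CP932 (a Shift-JIS variant), write it back as UTF-8.
with open('input.java', encoding='cp932') as src:
    text = src.read()
with open('output.java', 'w', encoding='utf-8') as dst:
    dst.write(text)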
Please let me know if there are any ideas on how to resolve this issue.
I have tried Google Translate, and it works well with the same encodings.
I have been reading quite a few posts, including this one:
Javascript export CSV encoding utf-8 issue
I know many mentioned that it's because of Microsoft Excel, and that using something like this should work:
https://superuser.com/questions/280603/how-to-set-character-encoding-when-opening-excel
I have tried it on Ubuntu (which didn't have any issue at all), on Windows 10 (where I had to use the steps from the second post to import), and on Mac, which has the biggest problem because it does not import and does not read the Unicode at all.
Is there any way I can enforce, in code, that Excel opens the export as UTF-8? Or is there some other workaround I might be able to try?
Thanks in advance for any help and suggestions.
Many Windows applications, including Excel, assume the localized ANSI encoding (Windows-1252 on US Windows) when opening a file, unless the file starts with a byte-order mark (BOM). While UTF-8 doesn't need a BOM, a UTF-8-encoded BOM at the start of a file clues Excel in that the file is UTF-8. The byte sequence is EF BB BF, and the equivalent Unicode code point is U+FEFF.
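A minimal sketch of writing such a BOM when exporting (the file name and contents are made up for illustration); Python's utf-8-sig codec emits the EF BB BF sequence automatically:

# 'utf-8-sig' writes the BOM first, which is the cue Excel
# uses to detect that the file is UTF-8.
with open('export.csv', 'w', encoding='utf-8-sig', newline='') as f:
    f.write('name,city\n娜,Beijing\n')

Writing '\ufeff' as the first character of a plain UTF-8 file has the same effect.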
If I try to download any text, srt, or exe program, it downloads in ANSI encoding instead of UTF-8 BOM encoding. I'm a Turkish user. Some Turkish characters like ı, ş, İ, ğ are being shown as ý, þ, ð.
For example, Internet Download Manager uses lang files as text. Some Turkish characters are broken.
The Windows language is entirely Turkish; I bought the laptop in the USA.
How can I fix this problem? Please help me.
I fixed this issue by editing the administrator language settings.
I'm trying to create a webpage in Chinese, and I realized that while the text looks fine when I run it in browsers, once I change the character encoding, the text becomes gibberish. Here's what's happening:
I create my HTML file in Emacs, encoded in UTF-8.
I upload it to the server and view it in my browsers (FF, IE, Chrome, Opera) - no problem.
I try to view the page in other encodings via FF > View > Character Encoding > all those different Chinese encoding systems, e.g. Chinese Simplified (HZ).
In every encoding apart from UTF-8, the text becomes gibberish.
I'm assuming this isn't a problem - i.e. browsers are smart enough to know which encoding the page is in and parse the content accurately. What I'm wondering is why I can't read the Chinese text anymore once I change the encoding - is it because I don't have Chinese fonts installed on my OS? Should I stick to UTF-8 if my audience is Chinese, or should I choose one of their many encoding systems?
Thanks in advance for your help/opinions.
UTF-8 isn't a 'catch-all' encoding. It's designed to contain the character symbols of international languages for ease of use, but it is still an encoding, just like the others you've selected. You would have to re-encode the text in each encoding to make it appear correctly when viewed with that encoding.
The viewer's encoding MUST match that of the file being read. Viewing UTF-8 as something else makes about as much sense as renaming .txt to .exe and trying to run it.
You should specify the correct encoding in the HTML. The option you're using in the web browser exists only for those rare occasions when a web developer screwed up and declared a different encoding than the one actually used, OR mixed up two different encodings on one page.
Of course changing the encoding in your browser will "break" the text! The browser takes the stream of UTF-8 bytes and tries to force another encoding onto the raw data. Needless to say, the result isn't pretty. Changing the encoding in the browser is NOT the equivalent of converting.
As you surmised, modern browsers usually guess correctly -- but not always. As Agent_L said, make sure to declare the encoding in the headers.
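For instance, a minimal sketch of declaring the encoding both in the HTTP header and in the markup, using Python's standard http.server (the page content and port are made up for illustration):

from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = ('<!DOCTYPE html><html><head><meta charset="utf-8">'
        '<title>测试</title></head><body>你好，世界</body></html>')

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGE.encode('utf-8')
        self.send_response(200)
        # The charset parameter here is the 'declare it in the headers'
        # advice from the answer above.
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(('localhost', 8000), Handler).serve_forever()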