Unicode problem with CStdioFile in VC++ - visual-c++

I am trying to read in a URL such as http://google.com
The URL opens fine, but as soon as I read the first line in the while(...) loop, instead of sensible characters representing HTML, I get weird Chinese characters in sCurlLine, which is a CString. I think I am missing a Unicode encoding/decoding step.
The following is the simple code that reads the URL. The while loop reads the file line by line, and the text is then shown in a text box.
Thanks for the help
void CInetSessionDlg::OnBnClickedBurl()
{
    CStdioFile * fpUrlFile;
    CString sCurlLine;
    UpdateData(TRUE);
    LPCTSTR url = m_sURL;
    fpUrlFile = m_misSession.OpenURL(url);
    if(fpUrlFile)
    {
        while(fpUrlFile->ReadString(sCurlLine))
        {
            m_sResult += sCurlLine;
            UpdateData(FALSE);
        }
    }
}

Check that your build uses the correct character set in the project configuration.
The setting is found under: Project Properties|General|Project Defaults|Character Set
You may have the wrong value selected ("Not Set" or "Use Unicode").
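
One plausible explanation for the symptom, sketched in Python rather than MFC: if single-byte HTML bytes are treated as 16-bit characters (as can happen in a Unicode build), each pair of ASCII bytes lands in the CJK range. The sample bytes and variable names below are illustrative assumptions, not taken from the question.

```python
# Assumption: the server sends single-byte (ASCII/UTF-8) HTML.
raw = b"<html><head>"

# Misreading: every two bytes are treated as one little-endian UTF-16 code unit.
misread = raw.decode("utf-16-le")

# Correct reading: decode the bytes with the encoding they were sent in.
correct = raw.decode("utf-8")

# Show the code points produced by the misreading, then the correct text.
print([hex(ord(ch)) for ch in misread])
print(correct)
```

If this is the cause, the fix on the C++ side would be to read the response as bytes and convert it to wide characters explicitly, rather than relying on the project character set alone.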

Related

Convert a text file to UTF8 in D

I'm attempting to use the Phobos standard library functions to read in any valid UTF file (UTF-8, UTF-16, or UTF-32) and get it back as a UTF-8 string (aka D's string). After looking through the docs, the most concise function I could think of to do so is
import std.file, std.utf;
string readToUTF8(in string filename)
{
    try {
        return readText(filename);
    }
    catch (UTFException e) {
        try {
            return toUTF8(readText!wstring(filename));
        }
        catch (UTFException e2) {
            return toUTF8(readText!dstring(filename));
        }
    }
}
However, catching a cascading series of exceptions seems extremely hackish. Is there a "cleaner" way to go about it without relying on catching a series of exceptions?
Additionally, the above function seems to return a one-byte BOM in the resulting string if the source file was UTF-16 or UTF-32, which I would like to omit given that it's UTF-8. Is there a way to omit that besides explicitly stripping it?
One of your questions answers the other: the BOM allows you to identify the exact UTF encoding used in the file.
Ideally, readText would do this for you. Currently, it doesn't, so you'd have to implement it yourself.
I'd recommend using std.file.read, casting the returned void[] to a ubyte[], then looking at the first few bytes to see if they start with a BOM, then cast the result to the appropriate string type and convert it to a string (using toUTF8 or to!string).
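
The BOM-sniffing approach described above can be sketched as follows. This is a Python illustration of the idea, not Phobos code; the function name decode_with_bom and the fallback to UTF-8 when no BOM is present are assumptions.

```python
import codecs

# UTF-32 BOMs are tested before UTF-16 ones because the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_bom(data: bytes) -> str:
    """Decode raw file bytes, picking the encoding from a leading BOM."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # Strip the BOM so it does not end up in the resulting string.
            return data[len(bom):].decode(encoding)
    # Assumption: a file without a BOM is treated as UTF-8.
    return data.decode("utf-8")


sample = codecs.BOM_UTF16_LE + "héllo".encode("utf-16-le")
print(decode_with_bom(sample))
```

Stripping the BOM before decoding also answers the second part of the question, since the marker never reaches the resulting string.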

Separated text line in Apache POI XWPFRun object

I'm trying to fill in a template DOCX document with Apache POI by using the XWPFDocument class. I have tags in the doc and a JSON file from which to read the replacement data. My problem is that a line of text appears to be split in a certain way inside the DOCX, as I can see when I change its extension to ZIP and open document.xml. For example, the text [MEMBER_CONTACT_INFO] becomes [MEMBER_CONTACT_INFO and ] separately. POI reads it the same way, since that is how the original DOCX stores it. This creates 2 XWPFRun objects in the paragraph, which hold the text as [MEMBER_CONTACT_INFO and ] separately.
My question is: is there a way to force POI to behave like Word by merging related runs, or something like that? Or how else can I solve this problem? I'm matching run texts while replacing, and I can't find my tag because it is split across 2 different run objects.
Best
This wasted so much of my time once...
Basically, an XWPFParagraph is composed of multiple XWPFRuns, and an XWPFRun is a contiguous stretch of text with a single, uniform style.
So when you type something like "[PLACEHOLDER_NAME]" in MS Word, it creates a single XWPFRun. But if you add a few more things and then go back and change "[PLACEHOLDER_NAME]" to something else, it is not guaranteed to remain a single XWPFRun; it may well be split into two runs. AFAIK this is how MS Word works.
How to avoid splitting of runs in such cases?
Solution: There are two solutions that I know of:
1. Copy the text "[PLACEHOLDER_NAME]" to Notepad or a similar editor. Make the necessary modification, then copy it back and paste it over "[PLACEHOLDER_NAME]" in your Word file. This way the whole "[PLACEHOLDER_NAME]" is replaced with new text, which avoids splitting the XWPFRun.
2. Select "[PLACEHOLDER_NAME]", then use the MS Word "Replace" option and replace it with "[Your-new-edited-placeholder]". This guarantees that your new placeholder occupies a single XWPFRun.
If you have to change your new placeholder again, follow step 1 or 2.
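Here is the Java code to fix that split-text issue. It also handles the multi-format string replacement.
import java.util.ArrayList;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;

public static void replaceString(XWPFDocument doc, String search, String replace) throws Exception {
    for (XWPFParagraph p : doc.getParagraphs()) {
        List<XWPFRun> runs = p.getRuns();
        // Indices of the runs that together spell out `search`.
        List<Integer> group = new ArrayList<Integer>();
        if (runs != null) {
            String groupText = search; // part of `search` still to be matched
            for (int i = 0; i < runs.size(); i++) {
                XWPFRun r = runs.get(i);
                String text = r.getText(0);
                if (text != null) {
                    if (text.contains(search)) {
                        // The whole placeholder sits in one run: replace it literally.
                        r.setText(text.replace(search, replace), 0);
                    } else if (groupText.startsWith(text)) {
                        group.add(i);
                        groupText = groupText.substring(text.length());
                        if (groupText.isEmpty()) {
                            // Placeholder complete: write the replacement into the first run...
                            runs.get(group.get(0)).setText(replace, 0);
                            // ...and remove the other runs, highest index first,
                            // so that the remaining indices stay valid.
                            for (int j = group.size() - 1; j >= 1; j--) {
                                p.removeRun(group.get(j));
                            }
                            i = group.get(0); // continue with the run after the merged one
                            group.clear();
                            groupText = search;
                        }
                    } else {
                        group.clear();
                        groupText = search;
                    }
                }
            }
        }
    }
    // Table cells: only placeholders contained in a single run are handled here.
    for (XWPFTable tbl : doc.getTables()) {
        for (XWPFTableRow row : tbl.getRows()) {
            for (XWPFTableCell cell : row.getTableCells()) {
                for (XWPFParagraph p : cell.getParagraphs()) {
                    for (XWPFRun r : p.getRuns()) {
                        String text = r.getText(0);
                        // getText(0) may return null; setText(text, 0) overwrites instead of appending.
                        if (text != null && text.contains(search)) {
                            r.setText(text.replace(search, replace), 0);
                        }
                    }
                }
            }
        }
    }
}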
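
The run-merging idea in the Java code above can also be looked at without POI. The following is a minimal Python model in which a paragraph is just a list of run texts; the function name replace_across_runs and the sample strings are hypothetical, and only the first occurrence of the placeholder is replaced.

```python
def replace_across_runs(runs, search, replace):
    """Return new run texts with the first `search` replaced, even if it spans runs."""
    joined = "".join(runs)
    start = joined.find(search)
    if start == -1:
        return list(runs)
    end = start + len(search)

    result = []
    offset = 0       # position of the current run inside `joined`
    placed = False   # whether the replacement has been written yet
    for text in runs:
        run_start, run_end = offset, offset + len(text)
        offset = run_end
        if run_end <= start or run_start >= end:
            result.append(text)  # run lies entirely outside the match
            continue
        before = text[:max(start - run_start, 0)]
        after = text[max(end - run_start, 0):]
        if not placed:
            # The first overlapping run receives the replacement text.
            result.append(before + replace + after)
            placed = True
        elif after:
            # A later run keeps only what follows the match.
            result.append(after)
        # Runs fully consumed by the match are dropped.
    return result


print(replace_across_runs(["Dear ", "[MEMBER_CONTACT_INFO", "]", " thanks"],
                          "[MEMBER_CONTACT_INFO]", "Jane"))
```

Working on the concatenated text avoids depending on where Word happened to split the runs; the cost is that the replacement takes the formatting of the first run it touches.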
For me it didn't always work as I expected. In my case I used "${PLACEHOLDER}" in the text. First we need to look at how Apache POI recognizes each paragraph whose runs we want to iterate over. If you dig into how a docx file is constructed, you will see that one run is a sequence of characters with the same font style, font size, colour, bold, italic, etc. Because of that, the placeholder was sometimes split into parts, and sometimes the whole paragraph was recognized as one run, so it was impossible to iterate through individual words. What I did was make the placeholder name bold in the template document. Then, when iterating through the runs, I got the whole placeholder name ${PLACEHOLDER} as a single run. I replaced that value with
for (XWPFRun r : p.getRuns()) {
    String text = r.getText(0);
    if (text != null && text.contains("originalText")) {
        text = text.replace("originalText", "newText");
        r.setText(text, 0);
    }
}
I just added r.setBold(false); after setText.
That way the placeholder is recognized as a separate run, so I'm able to replace the specific placeholder, and in the processed document there is no bolding, just plain text. An additional advantage for me was that I can visually find placeholders in the text faster.
So finally the above loop looks like this:
for (XWPFRun r : p.getRuns()) {
    String text = r.getText(0);
    if (text != null && text.contains("originalText")) {
        text = text.replace("originalText", "newText");
        r.setText(text, 0);
        r.setBold(false); // remove the bold formatting used to mark the placeholder
    }
}
I hope this helps someone, as I spent too much time on it :)
I also had this issue a few days ago and couldn't find any solution. I chose to use PLACEHOLDER_NAME instead of [PLACEHOLDER_NAME]. This works fine for me, and it is seen as a single XWPFRun object.
To be sure that a word will be treated as a single XWPFRun, you can use a merge field as the variable in Word, like this:
Place cursor on the word you want to be a single run.
Press CTRL and F9 together and { } in gray will appear.
Right-click on the { } field and select Edit Field.
In pop-up box, select Mail Merge from Categories and then MergeField from Field Names.
Click OK.

CSV field with newline character in a cell to import to excel

I've got a problem importing data from a CSV file into Excel.
I parsed the data from a website, and it contains the line-break tag <br>. I convert this tag into "\n" and write the data to a CSV file. However, when I import this CSV file into Excel, the line break is displayed incorrectly: it starts a new row instead of a new line within the same cell.
Has anyone faced this problem before? I would really appreciate your suggestions.
Thanks!
Edit: Here is a sample that demonstrates my situation
static void TestLine()
{
    string sampleData = "日찬양 까페에 올린 충격적인 <br>글코리아타임스";
    string formattedData = sampleData.Replace("<br>", "\n");
    using (StreamWriter writer = new StreamWriter(@"C:\SampleData.csv", false, Encoding.Unicode))
    {
        writer.WriteLine(formattedData);
    }
}
I want sampleData to be displayed in one cell; however, the result ends up in 2 cells.
It looks to me like a malformed CSV file. To handle this situation correctly, the field containing the line break must be enclosed in quotation marks.
UPDATE after the sample:
This works fine:
static void Main(string[] args)
{
    // The field is wrapped in quotation marks so the embedded line break stays inside one cell.
    string sampleData = "\"日찬양 까페에 올린 충격적인 <br>글코리아타임스\"";
    string formattedData = sampleData.Replace("<br>", "\n");
    using (StreamWriter writer = new StreamWriter(@"C:\SampleData.csv", false, Encoding.Unicode))
    {
        writer.WriteLine(formattedData);
    }
}
You need to wrap the field in quotation marks (")
Please take a look at CSV-1203 File Format Specification, in particular the sections on "End-of-Record Marker" and "Field Payload Protection". Hopefully this should give clear guidance on the inner workings of the CSV file format.
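
The quoting rule can be checked with a small sketch. It uses Python's csv module rather than C#, and the sample cell text is a hypothetical stand-in for the data in the question.

```python
import csv
import io

# Hypothetical cell text containing a line break.
cell = "first line\nsecond line"

buffer = io.StringIO()
writer = csv.writer(buffer)  # the default dialect quotes fields that contain line breaks
writer.writerow(["id-1", cell])
csv_text = buffer.getvalue()

# Reading the text back yields one row whose second field still contains the line break.
rows = list(csv.reader(io.StringIO(csv_text, newline="")))

print(repr(csv_text))
print(rows)
```

A CSV library applies the quoting automatically, which also covers fields that themselves contain quotation marks or commas.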

c# uploading data error -> return "�" for space

I am using C# with an HTTP helper and a StreamReader to read text. But when I upload a text file containing this text
"Look  exactly what I found on # eBay! Willy Lee LifeLike  Chatting Butler Prop Motion Sen"
each space is replaced by "�" when the text is used in the code.
The code for reading the text is:
List<string> list = new List<string>();
StreamReader reader = new StreamReader(filepath);
string text = "";
while ((text = reader.ReadLine()) != null)
{
    if (!string.IsNullOrEmpty(text))
    {
        list.Add(text);
    }
}
reader.Close();
return list;
list contains this data-
"Look��exactly�what�I�found�on�#�eBay!�Willy�Lee�LifeLike��Chatting�Butler�Prop�Motion�Sen"
This looks like an encoding problem. I have seen such text problems when multibyte-encoded text is shown in a non-Unicode web page, such as one using Windows-1252 or another CP-125x code page.
The same seems to apply here: the text looks UTF-8 encoded and is displayed in ANSI mode. The spaces are "special" spaces, like the ones MS Word sometimes inserts, while the English characters are single-byte in UTF-8 (as are all characters below ASCII code 128), which makes them compatible with the ANSI code table and correctly visible.
Option 2: if the text is written to a file and saved like that, without a BOM at the beginning, the text editor may not recognize that the content is Unicode and may open it in ANSI (regular ASCII) mode.
If you give more details about where the data is read from and where it is saved and opened, I can give more concrete advice.
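
One reading that is consistent with the output shown in the question is sketched below. It is an assumption, since the actual encoding of the file is not stated: the "special" spaces are non-breaking spaces stored as the single byte 0xA0 in an ANSI code page, and the reader decodes the file as UTF-8 (the default for a StreamReader created from a path). The sketch is in Python and the sample text is shortened.

```python
# Assumption: the file stores non-breaking spaces (U+00A0) as byte 0xA0 (Windows-1252).
raw = "Look\u00a0exactly\u00a0what".encode("cp1252")

# Decoding as UTF-8: 0xA0 is not a valid UTF-8 sequence on its own,
# so each one becomes the replacement character U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")

# Decoding with the encoding the file was written in keeps the spaces intact.
as_cp1252 = raw.decode("cp1252")

print(as_utf8)
print(as_cp1252.replace("\u00a0", " "))
```

Under that assumption, passing the matching encoding to the reader and then normalizing the non-breaking spaces to ordinary spaces would remove the "�" characters.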

How to get the file name from the full path of the file, in vc++?

I need to get the name of a file from its full path, in vc++. How can I get this? I need only the file name. Can I use Split method to get this? If not how can I get the file name from the full path of the file?
String^ fileName = "C:\\mydir\\myfile.ext";
String^ path = "C:\\mydir\\";
String^ result;
result = Path::GetFileName( fileName );
Console::WriteLine( "GetFileName('{0}') returns '{1}'", fileName, result );
See Path::GetFileName Method
Find the last \ or / [1] using one of the standard library string/char * search methods. Then extract the text that follows. Remember to special-case a path where the / or \ is the last character.
[1] The Windows API, for most purposes [2], supports both.
[2] The exception is when using long paths starting with \\?\ to break the 260-character limit on paths.
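
The search for the last separator can be sketched as follows. This is a Python illustration rather than VC++, and the function name file_name_from_path is hypothetical; a path that ends in a separator yields an empty name here, which is one way to handle the special case mentioned above.

```python
def file_name_from_path(path: str) -> str:
    """Return the text after the last '\\' or '/' in `path`."""
    # rfind returns -1 when the separator is absent, so `cut + 1` is then 0
    # and the whole string is returned.
    cut = max(path.rfind("\\"), path.rfind("/"))
    return path[cut + 1:]


print(file_name_from_path("C:\\mydir\\myfile.ext"))
```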
Directory::GetFiles Method (String, String)
Returns the names of files (including their paths) that match the specified search pattern in the specified directory.

Resources