How to use MS Word to create HTML that displays correctly on Windows and Linux servers? - linux

When I create a document with MS Word and upload it to a web server, it is displayed correctly when it is a Windows server, but not when it is a Linux server.
I tried this with both IE and Firefox.
The meta tag in the source says charset=windows-1252
Displaying the source code in the browser shows exactly the same source as I uploaded, so the server is not changing that. Nevertheless, characters like accented e are displayed as garbage characters when served from the Linux server.
So something in the TCP/HTTP/??? data that the server sends to the browser makes the browser interpret the characters differently from what is meant.
What could that be?
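The place to look is the HTTP Content-Type response header: if the Linux server announces a charset there (Apache's AddDefaultCharset directive, for instance, does this), browsers honor it over the meta tag inside the file. A minimal sketch of the parsing a browser effectively does; the header values below are illustrative:

```python
from email.message import Message

def charset_from_content_type(value):
    # Parse a Content-Type header value, e.g. "text/html; charset=ISO-8859-1",
    # and return the charset parameter (lowercased), or None if absent.
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()

# If the server sends this header, the browser decodes the page as
# ISO-8859-1 and ignores the meta tag saying charset=windows-1252:
print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # iso-8859-1
# If the server sends no charset parameter, the meta tag wins:
print(charset_from_content_type("text/html"))  # None
```

Use a browser's network tab (or curl -I) to see what the Linux server actually sends, then fix its configuration rather than the file.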

When you create a document in MS Word, there are a lot of characters you can't see that are actually in the file, such as end-of-line markers, page breaks, etc., which you will not notice until after you upload the file to the server.
You should always use a plain text editor such as Notepad++, or even Bluefish, to create these files. Sometimes you can get MS Word to do the trick if you make sure to save the file as a web document (htm or html), but the special characters will usually begin to cause problems depending on your goal.

Related

Edit contents of RTF file with Powershell - Hyperlink/Mailto

Trying to update the mailto value of an RTF document with a PowerShell script, but it seems the issue is RTF-related rather than PowerShell-related, because I can't get it to work when doing it by hand either.
I have previously made small find-and-replace changes to RTF files by finding the bit of text and changing it within the file, not using any RTF library or cmdlet, just plain string processing. With more recent RTF files, updating the mailto: value in the raw file text does not seem to change the address a new message is created for; the previous value is still used for the new message's TO field, even though by that point it no longer appears anywhere in the file as plain text, yet is somehow known once the mailto link is clicked.
I am aware that RTF files have changed over time and not all of the content is purely ASCII and formatting control markup and I presume that the mailto: target is held somewhere that's not plaintext. I need to know where this other instance of the data is held in the file and a way to edit it.
(Screenshots: the mailto link still opening with the old address that no longer appears in the file's plain text; the mailto value updated in the file.)
Thanks for any thoughts or suggestions for things to try next!
Kind regards,
Oscar
I am able to update html and txt files just fine, but more recently RTF files are not showing the updated values, because they seem to store the hyperlink target in a second place in the file that is not human-readable. Updating other elements just by changing the human-readable instances of them in the file works fine; only the hyperlink 'mailto:' section fails. Updating the link in a word processor causes the human-readable 'mailto:' section to change when viewed in a text editor, but updating the value in a text editor and then opening the file in the word processor does not show the update. So, as far as I've established, the value is stored in multiple places and the plain-text version is ignored when they differ.
Perhaps there is an RTF cmdlet or library that lets you edit the binary portion (or whatever alternate location it's held in) of the RTF file, or it would be easier to create the RTF version of the file from the properly updated HTML version. Open to ideas!
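For what it's worth, RTF hyperlinks are stored as a field: the real target lives in a \fldinst HYPERLINK group, while \fldrslt holds only the displayed text, so a replace that only touches the visible address can miss the actual target. Below is a sketch of patching the field with plain string processing; the snippet and addresses are made up, and note that newer Word output may additionally hex-escape characters as \'xx sequences or split the string across control words, in which case a plain regex will not find it:

```python
import re

# A minimal hand-written RTF snippet with a hyperlink field. The
# \fldinst group holds the real mailto: target; \fldrslt holds the
# text the user sees, which may differ.
rtf = (r'{\rtf1 {\field{\*\fldinst HYPERLINK "mailto:old@example.com"}'
       r'{\fldrslt old@example.com}}}')

def replace_mailto(rtf_text, new_addr):
    # Rewrite the quoted target inside every HYPERLINK "mailto:..."
    # field instruction, leaving the rest of the document untouched.
    return re.sub(r'(HYPERLINK\s+"mailto:)[^"]+(")',
                  r'\g<1>' + new_addr + r'\g<2>',
                  rtf_text)

updated = replace_mailto(rtf, "new@example.com")
```

If the address in your files survives a raw-text replace but still opens the old value, that is a strong sign the target is hex-escaped or stored in an embedded object, and a byte-level look at the file around the \field group is the next step.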

Web pages served by local IIS showing black diamonds with question marks

I'm having an issue in a .NET application where pages served by local IIS display random characters (mostly black diamonds with white question marks in them). This happens in Chrome, Firefox, and Edge. IE displays the pages correctly for some reason.
The same pages in production and in lower pre-prod environments work in all my browsers. This is strictly a local issue.
Here's what I've tried:
Deleted code and re-cloned (also tried switching branches)
Disabled all browser extensions
Ran in incognito mode
Rebooted (you never know)
Deleted temporary ASP.NET files
Looked for corrupt fonts on machine but didn't find any
Other Information:
Running IIS 10.0.17134.1
.NET MVC application with Knockout
I realize there are several other posts regarding black diamonds with question marks, but none of them seem to address my issue.
Please let me know if you need more information.
Thanks for your help!
You are in luck. The explicit purpose of � is to indicate that character encodings are being misused. When users see that, they'll know that we've messed up and lost some of their text data, and we'll know that, at one or more points, our processing and/or configuration is wrong.
(Fonts are not at issue [unless there is no font available to render �]. When there is no font available for a character, it's usually rendered as a white-filled rectangle.)
Character encoding fundamentals are simple: use a sufficient character set (say Unicode), pick an appropriate encoding (say UTF-8), encode text with it to obtain bytes, tell every program and person that gets the bytes that they represent text and which encoding is used. The encoding might be understood from a standard, convention, or specification.
Your editor does the actual encoding.
If the file is part of a project or similar system, a project file might store the intended encoding for all or each text file in the project. If your editor is an IDE, it should understand how the project does that.
Your compiler needs to know the encoding of each text file you give it. A project system would communicate what it knows.
HTML provides an optional way to communicate the encoding. Example: <meta charset="utf-8">. An HTML-aware editor should not allow this indicator to be different than the encoding it uses when saving the file. An HTML-aware editor might discover this indicator when opening the file and use the specified encoding to read the file.
HTTP uses another optional way: Content-Type response header. The web server emits this either statically or in conjunction with code that it runs, such as ASP.NET.
Web browsers use the HTTP way if given.
XHR (AJAX, etc) uses HTTP along with JavaScript processing. If needed the JavaScript processing should apply the HTTP and HTML rules, as appropriate. Note: If the content is JSON, the current RFC requires the encoding to be UTF-8.
No one or thing should have to guess.
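The failure modes above are easy to reproduce in a few lines; this sketch shows exactly what happens when windows-1252 and UTF-8 get crossed:

```python
text = "café"
utf8_bytes = text.encode("utf-8")           # b'caf\xc3\xa9'
cp1252_bytes = text.encode("windows-1252")  # b'caf\xe9'

# Decoding UTF-8 bytes as windows-1252 never fails, but produces
# mojibake: each UTF-8 continuation byte becomes its own character.
mojibake = utf8_bytes.decode("windows-1252")  # 'cafÃ©'

# Decoding windows-1252 bytes as UTF-8 hits an invalid sequence;
# with errors="replace", each bad byte becomes U+FFFD, the black
# diamond with a question mark the original poster is seeing.
replaced = cp1252_bytes.decode("utf-8", errors="replace")  # 'caf\ufffd'
```

Which symptom you see (Ã© garbage vs. �) tells you which direction the mismatch goes.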
Diagnostics
Which character encoding did you intend to use? This century, UTF-8 is so much the norm that if you choose to use a different one, you should have a good reason and document it (for others and your future self).
Compare the bytes in the file with the text you expect it to represent. Does it use the intended encoding? Use an editor or tool that shows bytes in hex.
As suggested by #snakecharmerb, what does the server send? Use a web browser's F12 network tab.
What does the HTTP response header say, if anything?
What does the HTML meta tag say, if anything?
What is the HTML doctype, if any?
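Inspecting the bytes (the second diagnostic above) needs no special tooling; a minimal hex-view sketch:

```python
def hexdump(data, width=16):
    # Render bytes as offset, hex pairs, and printable-ASCII columns,
    # the same layout most hex editors use.
    lines = []
    for i in range(0, len(data), width):
        chunk = data[i:i + width]
        hexpart = " ".join(f"{b:02x}" for b in chunk)
        textpart = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{i:08x}  {hexpart:<{width * 3}} {textpart}")
    return "\n".join(lines)

# "é" shows up as "c3 a9" if the file is UTF-8, or "e9" if windows-1252.
print(hexdump("café".encode("utf-8")))
```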

Has anyone figured out yet how to prevent Chrome from stripping line breaks while saving files, or at least load output into Excel

Looks like the problem has persisted since 2012 based on users' agony.
I'm trying to spool CSV data into client Chrome browsers, which turn it into one long line, stripping all LF and CR characters.
Tried all combos: \n, \r\n, \\n - strips consistently regardless of content-type or content-disposition.
So it's more of a two-part question:
a) does anyone know how to prevent/trick Chrome browsers from doing this?
b) is there a way to break the record in CSV without using the ASCII(10) character
I'm streaming data from the application web server to clients. All browsers are fine saving CSV or loading the doc straight into Excel, except Chrome.
The only solution that finally worked was the generation of a chr(10)-delimited CSV file on the server and execution of the following JS expression upon pressing the "export" button:
location.href="http://$server_name/$CSV_name";
In that case, records rendered correctly and loaded into Excel as expected. Any attempt to generate an inline or attached document via manipulation of server headers or document types/extensions ended in '\n' (or '\r\n') stripping in Chrome, as opposed to other browsers.
Hope this helps.
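For reference, RFC 4180 asks for CRLF record separators, and Python's csv module emits them by default; a sketch of building such a payload (the headers dict shows typical download headers, framework-agnostic, not any particular server API):

```python
import csv
import io

rows = [["id", "name"], ["1", "Ada"], ["2", "Grace"]]

buf = io.StringIO()
writer = csv.writer(buf)   # csv.writer's default line terminator is \r\n
writer.writerows(rows)
payload = buf.getvalue().encode("utf-8")

# Illustrative response headers for serving the payload as a download:
headers = {
    "Content-Type": "text/csv; charset=utf-8",
    "Content-Disposition": 'attachment; filename="export.csv"',
}
```

If real CRLF bytes in the body still arrive mangled, saving the file server-side and redirecting to the static URL (as above) sidesteps whatever the inline path is doing.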

Can I link to a .txt file in a way that prevents browsers from treating it like html?

For a while now, when I open a plain text file with long lines, the lines wrap.
See this example: https://oeis.org/A195665/a195665_4.txt
In Firebug I can see that the text is in <pre> tags inside an html structure.
To avoid the line breaks, I have to click on "View Page Source".
Is there any server side way to prevent that?
I do not think the browser is at fault. I believe it is the web server's job to serve it up differently. You should google how to do that for your particular web server.
This works in Firefox and Chrome:
view-source:http://oeis.org/A195665/a195665_4.txt
But not in IE. However, IE doesn't break the lines in the first place.
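On the server side, the relevant knob is the Content-Type header. Below is a minimal WSGI sketch (the file contents and path are illustrative) that serves the text with an explicit text/plain type; if the browser's built-in text viewer still wraps lines, the reliable server-side escape is adding a Content-Disposition: attachment header so the file is downloaded instead of rendered:

```python
def app(environ, start_response):
    # In a real handler you would read the requested .txt file from
    # disk; a fixed body keeps this sketch self-contained.
    body = b"1 3 5 7\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Uncomment to force a download rather than in-browser viewing:
        # ("Content-Disposition", 'attachment; filename="a.txt"'),
    ])
    return [body]
```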

Getting data from a browser by screen-scraping

I have gone through several relevant-looking questions but they did not contain the answer I am looking for. So, here is my question:
I have several web applications at my workplace, written in different frameworks, and the authors are long gone, so there is no one to ask for feature updates. Hence I have to go through the same grueling sequence of actions every day to get data that amounts to a few kilobytes.
I tried parsing the page source, but the programming techniques of the authors were all over the place. Some even intentionally obscure the code so the data does not show as text, and there is no reason for this, as the code they wrote is a company asset. Long story short, I realized that if I can copy and paste the textual content of these pages, I can process that data much more easily than by parsing the page source to get the text (which is sometimes totally impossible).
So, I am now looking for a browser plug-in (in windows or linux environments) or equivalent text based tools on windows or linux, which will load these pages and save the text on the screen to file(s) when invoked.
Despite how hard I tried, I am coming up empty handed.
I do not want to utilize the services of a third-party screen-scraping web site, as the data is company-confidential and not accessible to outside parties. Everything has to happen on the client end, as I do not have access to the servers these apps are running on (mostly IIS on a Windows front end and an Oracle DB at the back end). The middle tier, as I have explained before, is anyone's wild guess, ranging from native Oracle apps to WebLogic to Tomcat and to some in-house developed Java/JavaScript stuff.
Thanks for all the help in advance
After searching for an answer for well over a year, I came to realize that, as long as I use Windows (a modern version of it, that is), AutoHotkey is my savior.
I open the web page, maximize it, place my cursor (mousemove, x, y) then left click (mouseclick, L) then send ctrl-A followed by ctrl-C.
Voila! Everything is in the clipboard. Then I activate my Unix session (WinActivate PuTTY) and send the appropriate key presses to launch the editor of my choice (which is vi), and finally send Shift-Insert to paste the clipboard into my document. Then save and exit, of course.
As an added bonus, right after my document is saved, I can invoke the script of my choice to parse this file and give me back the portion(s) I am interested in.
I know it is not bullet proof, but for my purpose, it helps to a great extent. As a matter of fact, I can do whatever I want with this method.
What about something like this: http://www.nirsoft.net/utils/htmlastext.html
Freeware that converts an HTML page to text
Any of links, lynx or w3m will do what you want, they are text browsers and you can dump text from a webpage with, for example:
w3m -dump http://www.google.com > g.txt
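If the pages can be fetched directly (no interactive login dance), the w3m -dump approach can also be reproduced with the standard library, keeping everything on the client. A rough sketch using html.parser; the tag handling is deliberately minimal:

```python
from html.parser import HTMLParser

class TextDumper(HTMLParser):
    # Collect visible text, skipping the contents of script/style tags.
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def dump_text(html):
    parser = TextDumper()
    parser.feed(html)
    return "\n".join(parser.parts)
```

Feed it the HTML fetched with urllib and write the result to a file; pages that build their content entirely in JavaScript will still need the clipboard approach, since no static parser sees that text.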
