Moving data from MS Word to MS Excel - excel

I have transcripts of data in MS Word that I want to read into a stats program called R. The problem is that these documents contain special characters (not plain text). My process for dealing with them has been: sub them out in MS Word, save as a .txt document, read into MS Excel (the import wizard makes one column for people and one for dialogue), convert to .csv, and read into R. This process works but is time-consuming. I found out how to read text with special characters straight into R (R generally wants plain text), but this requires the data to be in an Excel document. This is desirable because once the special characters are in R, it's rather simple to sub them all out at once. The problem is that I can't get the MS Word document into Excel directly. I have to save it as a text file first (which I don't mind doing) and then read it in, and this turns the special characters into boxes and question marks. I need to get the MS Word doc into Excel as a data frame with 2 columns (person, dialogue) without destroying the special characters (“, ”, —, ’, ‘, …, etc.).
I can do this by subbing them out in Word with Replace, but if I could get the data into Excel, doing this in R would be much easier.
Here is a sample MS Word doc showing what my data looks like (tab-separated columns):
https://dl.dropbox.com/u/61803503/TEST.doc
Excel and Word versions 2010 on a Win 7 machine.

One way: use Edit->Copy in Word and Edit->Paste in Excel. A simple tabular structure should be preserved if you do that, with preservation of Unicode characters. Not so sure about non-Unicode stuff such as Wingdings. Haven't tried VBA-ing that, either.
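Once the text has arrived somewhere with its Unicode characters intact, substituting them all at once is a small job. The following is only a sketch in Python (not the R workflow from the question), assuming the transcript is available as tab-separated "person, dialogue" lines. The mapping and the names `normalize_punctuation` and `parse_transcript` are hypothetical, chosen for illustration.

```python
# Hypothetical mapping from typographic characters to plain-text equivalents.
REPLACEMENTS = {
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
}
_TABLE = str.maketrans(REPLACEMENTS)


def normalize_punctuation(text: str) -> str:
    """Replace every mapped special character in a single pass."""
    return text.translate(_TABLE)


def parse_transcript(text: str):
    """Split tab-separated lines into (person, dialogue) pairs.

    Lines without a tab are skipped; dialogue is normalized.
    """
    rows = []
    for line in text.splitlines():
        if "\t" not in line:
            continue
        person, dialogue = line.split("\t", 1)
        rows.append((person.strip(), normalize_punctuation(dialogue.strip())))
    return rows
```

Extending the mapping is a matter of adding one entry per character that turns up in the transcripts.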

Related

How does ms word vba detect the end of a paragraph

I converted an html text into a docx document using several different online converters.
Then I analysed the number of paragraphs using an Excel vba macro which opens the document and examines it. Supplied with an original docx document (ie one not converted from another format) this macro always gives the correct number of paragraphs.
Only one converter yielded a docx from which the number of paragraphs could be determined. All the others simply said there was a single paragraph with hundreds of words in it.
Somehow the html to docx converters are missing something. What is missing ? Can I dob it in ?
Turn on the display of formatting marks (Tools / Options / View).
Then examine the characters that Word uses to delimit paragraphs in the original docx and in the docx converted from HTML.
I suspect that "paragraphs" in the translated html might be manual line breaks. If so, that would account for the fact that the paragraph count in the translated html is incorrect.
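The same check can be made outside Word. A .docx file is a zip archive whose main text lives in `word/document.xml`, where paragraphs are `w:p` elements and manual breaks are `w:br` elements. The sketch below is in Python rather than the VBA used in the question, and the function name `count_paragraphs_and_breaks` and the sample XML are made up for illustration. It counts both kinds of element in a document.xml string.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_paragraphs_and_breaks(document_xml: str):
    """Return (number of w:p elements, number of w:br elements).

    Note: w:br also covers page and column breaks, so a high count is
    only a hint that line breaks are standing in for paragraphs.
    """
    root = ET.fromstring(document_xml)
    paragraphs = sum(1 for _ in root.iter(f"{{{W_NS}}}p"))
    breaks = sum(1 for _ in root.iter(f"{{{W_NS}}}br"))
    return paragraphs, breaks


# Made-up sample: one paragraph whose "lines" are separated by breaks.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>first</w:t><w:br/><w:t>second</w:t>"
    "<w:br/><w:t>third</w:t></w:r></w:p>"
    "</w:body></w:document>"
)

print(count_paragraphs_and_breaks(SAMPLE))
```

A result with one paragraph and many breaks would be consistent with the suspicion above. For a real file, the XML can be read with `zipfile.ZipFile(path).read("word/document.xml")`.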

How to properly display special characters in excel 2003 file

I have an Excel 2003 spreadsheet containing special characters like á, é, í, ó, ú, ü, ñ. The problem is that they are displayed as their HTML entity codes.
So instead of
Almócita
Gádor
I see
Alm&oacute;cita
G&aacute;dor
and so on.
How can I save or re-encode the file so that it displays the special characters properly?
CSV/Open Office/Excel file formats are all acceptable as long as the characters are properly displayed.
UPDATE: The excel file is uploaded for reference HERE
OK, so after a lot of searching and considering options like importing the sheet into MySQL and fixing the values via PHP, I finally decided to do it the old-fashioned, dumb way: I made a filter to show only cells that contain & and then used Find/Replace to replace all HTML entity codes with their respective characters. The whole operation took about 15 minutes, which is nothing compared to the time I spent searching for an alternative solution.
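If the sheet can be exported to text first (CSV, for instance), the Find/Replace pass can also be scripted. This is a minimal sketch in Python, assuming the cell values are available as a list of strings. The function name `decode_entities` is hypothetical, while `html.unescape` is part of the standard library and handles both named and numeric entities.

```python
import html


def decode_entities(cells):
    """Decode HTML entities such as &oacute; or &#243; in each cell."""
    return [html.unescape(cell) for cell in cells]


print(decode_entities(["Alm&oacute;cita", "G&aacute;dor"]))
```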

Create csv immediately recognizable by Excel (both US and EU)

In many EU countries a comma ',' is used as the decimal separator, whereas in the US a dot '.' is used.
CSV (Comma Separated Values) files are supposed to use the comma to separate cell values. However, often a tab '\t' or other characters are used instead.
Interestingly, if you save a .csv file using Microsoft Excel in an EU country that uses the comma as the decimal separator, the character it uses to separate cell values is not an escaped comma but a semicolon ';'. From what I have read online, Excel in the US saves .csv files using a proper comma (I can't verify this).
I'm trying to find a way to create a csv file that can be recognized by Excel without any user action, both in the EU and the US.
Here's an example using Excel with an Italian locale
The above, saved as .csv (MS-DOS), translates to
foo;foo bar;
foo'bar;"foo""bar";
foo,bar;foo.bar;
foo:bar;"foo;bar";
foo/bar;foo\bar;
"foo
bar";foo|bar;
foo;bar;foobar
this is to make the empty line appear
It may be possible that, depending on the local "list separator", this may not be recognized correctly.
I've read that the new Excel 2013 needs sep=; to be set as the first line in order to work correctly. This is an ugly hack, but it seems to also be working for Excel 2010 (except it gets overwritten on save)...
Does the above text work for you, if you save it as a csv?
Is there a less hacky way to tell Excel which character is the cell separator, without requiring the user to set anything up?
Thanks.
Time to head back to a time before visual anything and grab a command from the past. You will have to write the file out manually with VBA, but it meets the criteria you expect: the Write # statement.
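' Write columns A to C of rows 1 to 100 to a CSV file with the Write # statement
Dim i As Long
Open "c:\tmp\myfile.csv" For Output As #1
For i = 1 To 100
    Write #1, Range("A" & i).Value, Range("B" & i).Value, Range("C" & i).Value
Next i
Close #1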
You will have to do a little manual work - it doesn't translate a single quote into a double quote, but the rest is as desired:
The Write # statement inserts commas between items and quotation marks around strings as they are written to the file.
Numeric data is always written using the period as the decimal separator.
Dates are written as #yyyy-mm-dd hh:mm:ss#
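For comparison, a similar locale-independent output can be produced outside VBA. The sketch below is an analogous technique in Python, not the Write # statement itself: the standard `csv` module with `QUOTE_NONNUMERIC` separates fields with commas, quotes strings, and writes floats with a period. The function name `write_rows` and the sample row are hypothetical.

```python
import csv
import io


def write_rows(rows):
    """Serialize rows as comma-separated text with quoted strings.

    Numbers are left unquoted and use '.' as the decimal separator,
    regardless of the machine's locale.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC)
    writer.writerows(rows)
    return buffer.getvalue()


print(write_rows([["foo,bar", 1.5, 'say "hi"']]))
```

The `csv` module doubles embedded double quotes itself, so no manual quote handling is needed in this version.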

How do I stop MS Word from auto-left-aligning new paragraphs generated from linked Excel objects?

I have created a form letter using an Excel spreadsheet as a forming tool connected to a database, using paste-link to connect the results to an MS Word document.
Each section of the document draws from a single cell, which uses a formula to assemble its text from several other cells, based on logic that depends on the data returned by the database queries.
All of this functions perfectly well.
The problem arises when the generated blocks of text from Excel include two carriage-returns in a row, creating what MS Word thinks is a new paragraph (and technically it is). The rest of the letter is justified, and I have attempted to set justified text as the default alignment. But no matter what I try, any newly formed paragraphs generated inside of linked text from Excel will be left-aligned.
For this form letter to function properly it must have justified text throughout. Inconsistent formatting won't be accepted by management.
To be clear, I have attempted to modify the settings of the "Normal" style of the document in Word, as well as creating a new style based on Normal called "Justified" and setting that as the default by selecting it and clicking "Change Styles" -> "Set as Default".
The first paragraph of any given block will always remain justified-aligned, it is only subsequent, newly-created (as far as MS Word knows) paragraphs that aren't. So I suspect I am just not setting the default properly or...I don't know, something.
I tried linking as unformatted text but that, for some maddening reason, includes QUOTATION MARKS bookending the text! I'm baffled and frustrated.
Please help. I don't like to look the fool at work.
While I still do not know how to make Word insert new paragraphs into linked blocks of text without left-aligning them, I have a working solution to my particular problem.
By forcing my spreadsheet to create blocks of text with the maximum number of paragraphs, then forcibly justifying the output in MS Word, I was able to ensure that, as long as I close the document between updates, the text blocks will only shrink in size rather than grow. This way, Word does not treat the updated text as a "new" paragraph, as there was already a paragraph in that block.
I saved the Word document with this overabundance of paragraphs, and put the Excel spreadsheet back the way it was.

CSV for Excel, Including Both Leading Zeros and Commas

I want to generate a CSV file for users to open in Excel.
If I want to escape the comma in values, I can write it as "640,480".
If I want to keep the leading zeros, I can use ="001234".
But if I want to keep both the comma and the leading zeros in the value, writing it as ="001,002" causes it to be split into two columns. There seems to be no way to express the data correctly.
Is there any way to express 001, 002 in CSV for Excel?
Kent Fredric's answer contains the solution:
"=""001,002"""
(I'm bothering to post this as a separate answer because it's not clear from Kent's answer that it is a valid Excel solution.)
Put a prefix String on your data:
"N001,002","N002,003"
(As long as that prefix is not an E.)
The notation above (in OpenOffice at least) parses as a total of 2 columns with the N001,002 bytes correctly stored.
The CSV specification says that , is permitted inside quoted strings.
Also, a warning from experience: make sure you do this with phone numbers too. Excel will otherwise interpret phone numbers as floating point numbers and save them in scientific notation :/ , and 1.800E10 is not a really good phone number.
In OpenOffice, this RawCSV chunk also decodes as expected:
"=""001,002""","=""002,004"""
ie:
$rawdata = '001,002';
$equation = "=\"$rawdata\"";
$escaped = str_replace('"','""',$equation);
$csv_chunk = "\"$escaped\"" ;
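The same three steps carry over to other languages. As an illustration only, here is a Python sketch of the PHP snippet above. The function name `excel_text_formula_field` is hypothetical.

```python
def excel_text_formula_field(raw: str) -> str:
    """Build a CSV field that Excel reads as the formula ="raw"."""
    equation = f'="{raw}"'                  # wrap the value in ="..."
    escaped = equation.replace('"', '""')   # double every quote for CSV
    return f'"{escaped}"'                   # quote the whole field


print(",".join(excel_text_formula_field(v) for v in ["001,002", "002,004"]))
```

The printed line should match the raw CSV chunk shown earlier in this answer.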
Do
"""001,002"""
I found this out by typing "001,002" and then doing save-as CSV in Excel. If this isn't exactly what you want (you don't want quotes), this might be a good way for you to find what you want.
Another option might be use tab-delimited text, if this is an option for you.
A reader of my blog found a solution, ="001" & CHAR(44) & "002", it seems workable on my machine!
Pretty old thread, but why don't you just add whitespace after your value? It will then be treated as a string and no leading zeros will be stripped.
"001,002"." "
Since no one has mentioned it yet, I figured it was worth adding to this old post.
If you add a horizontal tab character \t before the number, then MS Excel will also show the leading zeros. The tab character doesn't show in the Excel sheet, even if it's surrounded by double quotes. (For example, \"\t001,002\")
It also looks nicer in Notepad++, compared to putting a \0 aka NULL before such number.
Looking more closely at the Excel spreadsheet, it looks like what you want can't be done using CSV.
This site http://office.microsoft.com/en-us/excel/HP052002731033.aspx says "If cells display formulas instead of formula values, the formulas are converted as text. All formatting, graphics, objects, and other worksheet contents are lost. The euro symbol will be converted to a question mark."
However, you can change how you load it to get the result you want. See this web page:
Microsoft import a text file.
The key thing is to choose Import External Data-Import Data-Text Files, go Next, Next, and then tick "Text" under column data format. This will prevent it being interpreted as a number, and losing formatting.
I was fiddling around with CSV to Excel (I use PHP to create the CSV, but I guess this solution works for any language). When you spot that leading characters (such as +, - or 0) are disappearing, create the CSV with chr(13) as a prefix. This is a non-printable character and it works wonders for my Excel Office 2010 version. I tried other non-printable characters, but with no luck.
So I use the Chirp Internet solution, but tweaked with my prefix:
if (preg_match("/^0/", $str) || preg_match("/^\+?\d{8,}$/", $str) || preg_match("/^\d{4}.\d{1,2}.\d{1,2}/", $str)) {
$str = chr(13)."$str";
}
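For readers generating the CSV outside PHP, the check translates directly. The following Python sketch uses the same three patterns and the same chr(13) prefix (a carriage return, `"\r"`). The function name `protect_value` is hypothetical, and whether the prefix has the desired effect in a given Excel version is the answer's observation, not something this code verifies.

```python
import re

# Same patterns as the PHP snippet: a leading zero, a long (8+ digit)
# number with an optional plus sign, or a date-like yyyy?mm?dd prefix.
_PATTERNS = [
    re.compile(r"^0"),
    re.compile(r"^\+?\d{8,}$"),
    re.compile(r"^\d{4}.\d{1,2}.\d{1,2}"),
]


def protect_value(value: str) -> str:
    """Prefix the value with a carriage return if any pattern matches."""
    if any(pattern.search(value) for pattern in _PATTERNS):
        return "\r" + value
    return value
```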
If you are using "Content-Disposition" and exporting from ASP to Excel using HTML tags, then you have to add "style='mso-number-format:\#;'" to that tag so that it accepts only text values; this prevents leading zeros from being dropped. If the backslash "\" needs to be escaped, use a double backslash "\\".
All the suggested answers don't seem to work for me right now ("=""blahblah""" and others) in all current Excel versions or Numbers app on OS X.
The only solution I found to be working by fiddling around is to add an escaped null character at the beginning of the string (which is \0 in PHP or C based languages). Everything ends up treated as is without being calculated or processed by the software when opening the calc sheet.
echo "\0" . $data;
Excel uses a default formatting for CSV columns depending on the content. So if you have 001 in a csv, excel will automatically turn it to 1.
The only way to keep the leading zeros in excel from a csv file is by changing the extension of the csv file to .txt, then just open excel, click on open, select the txt file, and you'll see the Text Import Wizard. Select your csv format (separated by commas), then just make sure you select "Text" as the format.
And that's it, now you can export that previous csv data to any other while keeping the leading zeros.
This is straightforward using Excel's Power Query functionality that allows you to perform step-by-step transformations.
Original File:
Add a Custom Column:
