Versionable data format to work with Excel 2010 - excel

I need to work on tabular data with some people who will edit it using Excel 2010, likely set up with , as decimal delimiter and ; as list separator.
The data needs to be under version control, and it must be easy to fork/share, merge, and compare different states of the data set. This includes low-barrier ways to contribute for people outside our work group who do not have me around to help them set the system up.
What setup will allow my co-workers to easily open the file using Excel and edit cells, and commit and compare with very few clicks and maybe entering a commit message?
In particular, I require that if I open the file in Excel, do a null-change and save it again, and then do whatever is the commit step for this setup, the commit will be empty.
The data contains non-ASCII characters, in particular IPA-symbols.
I expected that I would just be able to use git with csv files. But while for ASCII data, .csv with comma as separator seems to fulfill these conditions, for more complete encodings I cannot seem to get Excel to keep the data the same on round trips, not even in appearance, let alone byte for byte: it either loses Unicode characters on saving or does not recognize the format on reading.
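One thing that may be worth trying on the writing side is a byte-order mark. The sketch below, in Python, writes a semicolon-delimited CSV encoded as UTF-8 with a BOM (Python's `utf-8-sig` codec), which Excel generally takes as a hint to read the file as UTF-8. The file name and the sample rows are made up for illustration. This only covers opening the file; whether Excel 2010 writes the same bytes back on save is a separate matter that would still need to be checked.

```python
import csv
import os
import tempfile

# Hypothetical sample data containing IPA symbols and decimal commas.
rows = [
    ["word", "ipa", "length"],
    ["thing", "θɪŋ", "0,5"],
    ["ship", "ʃɪp", "1,25"],
]

# Hypothetical file name, placed in the system temp directory.
path = os.path.join(tempfile.gettempdir(), "ipa_sample.csv")

# utf-8-sig prepends a BOM; newline="" lets the csv module control line endings.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f, delimiter=";", lineterminator="\r\n")
    writer.writerows(rows)

# Read the file back to confirm nothing was lost on the Python side.
with open(path, encoding="utf-8-sig", newline="") as f:
    read_back = list(csv.reader(f, delimiter=";"))

# Confirm the file really starts with the UTF-8 BOM bytes.
with open(path, "rb") as f:
    has_bom = f.read(3) == b"\xef\xbb\xbf"
```

Comparing the bytes of the file before and after an Excel open-and-save would then show directly whether the "null change gives an empty commit" requirement holds.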

Related

Can we specify format rules directly in a file loaded into Excel?

I'm generating a csv file that looks like:
column1,column2,column3
hello,02,some comments
hello,AF,some comments
hello,15,some comments
hello,08,some comments
hello,FF,some comments
When opened in Excel, the second column is converted automatically: 02 becomes 2 and 08 becomes 8.
The recommended workaround is to tweak settings in Excel or to specify the format of each column while loading the file (time-consuming, not user-friendly).
Another workaround, in the file itself, would be to put quotes around the affected values so they are treated as strings.
(For some reasons I can't do this, as the file I generate must reflect the original source file.)
But... is there another way?
Like adding some kind of comment line, containing hints for the Excel parser, directly in the CSV?
Take Oracle's SQL query hints as an example, which let us change how the query optimizer works.
I'm not sure this exists for Excel; maybe it does, but it is obscure and not well documented?
I can't figure out why they would not add such a useful feature to "preconfigure" the format instead of making the user navigate Excel settings.
Thanks
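This does not give Excel any hints, but a generator could at least report which cells are at risk before the file goes out. Below is a minimal Python sketch using the sample file above. The rule it applies (the value is all digits and its text changes when parsed as an integer) is an assumption that approximates the behaviour described here, not Excel's actual logic, and the function name is made up.

```python
import csv
import io

# The sample file from the question.
data = """column1,column2,column3
hello,02,some comments
hello,AF,some comments
hello,15,some comments
hello,08,some comments
hello,FF,some comments
"""


def altered_by_numeric_coercion(value):
    """Return True if reading the value as an integer would change its text."""
    return value.isascii() and value.isdigit() and str(int(value)) != value


reader = csv.DictReader(io.StringIO(data))

# Collect (line number, column name, value) for every cell that would change.
at_risk = [
    (reader.line_num, name, value)
    for row in reader
    for name, value in row.items()
    if altered_by_numeric_coercion(value)
]

for line, name, value in at_risk:
    print(f"line {line}, {name}: {value!r} would lose its leading zero")
```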

Save versions of Excel file on Git to reconcile differences manually later

I will spend one month updating Excel files. These files are in a language other than English. I thought I could use Git to manage what I want to do.
The situation (the initial commit)
I have an Excel file that is written in the other language.
I have to perform some work and fill an Excel file with data from that.
My plan
After an initial commit, create a branch called toEnglish. Then translate some text on the Excel files to English so that I feel more comfortable. Once I do this I will commit.
Then, the one-month work will start and I will fill the data in the Excel file. I will commit periodically.
After the month is over, I will commit, and so I will have the data filled in on an Excel sheet where some labels are in English.
However, the output of that one month of work has to have those labels in the original language.
So I have an original branch with the original-language labels but no data
and the toEnglish branch with the data but English labels.
The question
I can not merge (fast-forward merge) the branches since that will eliminate the original language labels, so how can I merge in order to produce conflicts (the labels in two different languages) that I will solve one by one so that the final merge will have both the data and the labels in the original language?
There is an even bigger problem with versioning Excel files in Git, which is that Excel files (xls and xlsx) are binary. Git doesn't generally handle binary very well. Each commit you make on an Excel file will likely record the entire file as the diff. In addition, comparing Excel files from two different commits/branches won't give you much insight.
One workaround which comes to mind would be to version plain text CSV versions of your Excel worksheets. Such CSV files would likely version well with Git. Of course, if the worksheets have lots of rich content on top of the data, then this option might not work as well.
There is an open-source Git extension that makes Excel workbook files diff- and mergeable: https://github.com/ZoomerAnalytics/git-xltrail (disclaimer, I'm one of the authors)
It installs a custom differ and merger for xls* types and configures Git accordingly so that it behaves the same way as if it were a text file.
For docs and a short video, have a look at https://www.xltrail.com/client
Excel is a bit useless in Git: it does not matter whether the file is binary (xlsb) or not (xlsx), Git will just copy the file and leave it as it is. Thus, it is a bit of a challenge to set up working source control for VBA developers. In general it is accepted that it does not exist and cannot be done (this is what I usually hear), but there are some workarounds, e.g. if you follow MVC and do not put any business logic in the worksheets.
What you can do is simply to save the worksheets to a csv and proceed working as if it is normal plain text. At the end, even some "manual" merge with formulas is possible, based on the different worksheets (this is the bonus excel gives).
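To illustrate why CSV exports version well, here is a minimal Python sketch that diffs two states of the same exported worksheet line by line, much as Git would. The two CSV snippets and the file labels are invented for the example.

```python
import difflib

# Two hypothetical states of the same worksheet, exported as CSV text.
old = "label,value\nalpha,1\nbeta,2\n"
new = "label,value\nalpha,1\nbeta,3\ngamma,4\n"

# unified_diff yields header lines, hunk markers, and +/- prefixed rows.
diff = list(
    difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile="sheet.csv (before)",
        tofile="sheet.csv (after)",
    )
)

print("".join(diff))
```

Each changed row shows up as one removed and one added line, which is exactly what an xls or xlsx file cannot offer without an extra tool.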

Difference between Excel .csv and plain .csv?

I am running Windows 7 and have MS Office installed. Any time I download a .csv file the "file type" line in the "save as..." dialog defaults to "Microsoft Office Excel comma separated values file". Is there actually a Microsoft specific format that is distinct from "plain" .csv?
Googling the relevant terms returns various incredibly uninformative pages such as this one. Is any information lost or gained, or anything encoded differently by using this format than by just treating a file as a .csv, conforming to the general standards?
Yes, there are almost certainly differences.
Off the top of my head: English Excel uses "," as a separator. The German locale uses ";" as a separator, requiring an additional importing step if you want to import a csv with a comma separator. This is not unique to German locales; roughly 1/4 to 1/3 of the world uses ";".
Also, there might be differences in how complicated strings are escaped (; and " in texts), which probably differ from program to program.
This is not Excel's fault, since the csv "format" is not really standardised and there are countless programs rolling their own csv parser, which leads to all sorts of problems because they forgot to handle corner cases.
I once read the comment that csv is the plague of data exchange formats because it is so difficult to do right. I could not agree more; I have to deal with them on a daily basis and they are extremely annoying to work with.
Open source fans will hate me for this, but I think csv is a poor choice for data exchange, even xlsx is better because it has rules which are well defined.
There are two things going on. The abbreviation (and suffix) "CSV" can mean character-separated values or it can mean comma-separated values. "Microsoft Office Excel comma separated values file" is a disambiguation, and means that you have a number of values in a record, with the field values separated by a comma.
The values themselves, in comma-separated value files, may contain commas if they are properly stropped (quoted). Usually, the stropping is putting a double quote around some or all of the field.
MS Excel also supports newlines in the middle of fields, again being properly stropped.
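As an illustration of that quoting, here is a small Python sketch using the standard `csv` module, whose default dialect is modelled on Excel's. The record is made up. It contains a comma, double quotes, and a newline inside fields, and all three survive a write-and-read round trip.

```python
import csv
import io

# Hypothetical record with a comma, embedded quotes, and an embedded newline.
record = ["plain", "has, comma", 'has "quotes"', "line one\nline two"]

# Write one row; fields that need it are wrapped in double quotes,
# and embedded double quotes are doubled.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerow(record)
encoded = buffer.getvalue()

# Parse the text again; the quoted newline stays inside its field.
parsed = next(csv.reader(io.StringIO(encoded, newline="")))

print(encoded)
```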

Excel - Variable number of leading zeros in variable length numbers?

The format of our member numbers has changed several times over the years, such that 00008, 9538, 746, 0746, 00746, 100125, and various other permutations are valid, unique and need to be retained. Exporting from our database into the custom Excel template needed for a mass update strips the leading zeros, such that 00746 and 0746 are all truncated to 746.
Inserting the apostrophe trick, or formatting as text, does not work in our case, since the data seems to be already altered by the time we open it in Excel. Formatting as zip won't work since we have valid numbers less than five digits in length that cannot have zeros added to them. And I am not having any luck with "custom" formatting as that seems to require either adding the same number of leading zeros to a number, or adding enough zeros to every number to make them all the same length.
Any clues? I wish there was some way to set Excel to just take what it's given and leave it alone, but that does not seem to be the case! I would appreciate any suggestions or advice. Thank you all very much in advance!
UPDATE - thanks everybody for your help! Here are some more specifics. We are using a 3rd party membership management app -- we cannot access the database directly, we need to use their "query builder" tool to get the data we want to mass update. Then we export using their "template" format, which is called XLSX but there must be something going on behind the scenes, because if we try to import a regular old Excel, we get an error. Only their template works.
The data is formatted okay in the database, because all of the numbers show correctly in the web-based management tool. Also, if I export to CSV, save it as a .txt and import it into Excel, the numbers show fine.
What I have done is similar to ooo's explanation below -- I exported the template with the incorrect numbers, then exported as CSV/txt, and copied / pasted THOSE numbers into the template and re-imported. I did not get an error, which is something I guess, but I will not be able to find out if it was successful until after midnight! :-(
Assuming the data is not corrupt in the database, then try and export from the database to a csv or text file.
The following can then be done to ensure the import is formatted correctly
Text file with comma delimiter:
In Excel, choose Data / From Text, select Delimited, then click Next.
In step 3 of the import wizard, for each column/field you want as text, highlight the column and select Text.
The data should then be placed as text and retain leading zeros.
Again, all of this assumes the database contains non-corrupt data and you are able to export a simple text or csv file. It also assumes you have Excel 2010 but it can be done with minor variation across all versions.
Hopefully, @ooo's answer works for you. I'm providing another answer mainly for informational purposes, and don't feel like dealing with the constraints on comments.
One thing to understand is that Excel is very aggressive about treating "numeric-looking" data as actual numbers. If you were to open the CSV by double-clicking and letting Excel do its thing (rather than using ooo's careful procedure), those numbers would still have come up as numbers (no leading zeros). As you've found, one way to counteract this is to append clearly nonnumeric characters onto your data (before Excel gets its grubby hands on it), to really convince Excel that what it's dealing with is text.
Now, if the thing that uploads to their software is a file ending in .xlsx, then most likely it is the current Excel format (a compressed XML document, used by Excel 2007 and later). I suppose by "regular old Excel" you mean .xls (which still works with the newer Excels in "compatibility mode").
So in case what you've tried so far doesn't work, there are still avenues to explore before resorting to appending characters to the end of your data. (I'll update this answer as needed.)
You're on the right track with the apostrophe.
You'll need to store your numbers in excel as text at the time they are added to the file.
What are you using to create the original excel file / export from database?
This will likely be where your focus needs to be regarding your export.
For example one approach is that you could potentially modify the database export to include the ' symbol prefix before the numbers so that excel will know to display them as text.
I use the formula =text(cell,"# of zeros of the field") to add preceding zeros.
Example, Cell C2 has 12345 and I need it to be 10 characters long. I would put =text(c2,"0000000000").
The result will be 0000012345.
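For comparison, the same padding in Python, plus a check of why fixed-width padding cannot recover the member numbers from the question once the zeros have been stripped. This is only a sketch: the width of 6 is an arbitrary choice for the demonstration, and `int()` stands in for Excel's number conversion.

```python
# Member numbers listed in the question; all are valid and distinct.
member_numbers = ["00008", "9538", "746", "0746", "00746", "100125"]

# Equivalent of =TEXT(C2,"0000000000") for the example value 12345.
padded_example = str(12345).zfill(10)

# Simulate the leading zeros being stripped, then re-pad to a fixed width.
stripped = [str(int(n)) for n in member_numbers]
repadded = [s.zfill(6) for s in stripped]

# Count how many distinct identifiers exist before and after.
distinct_before = len(set(member_numbers))
distinct_after = len(set(repadded))

print(padded_example, distinct_before, distinct_after)
```

746, 0746 and 00746 all end up as the same value after re-padding, so the formula only helps when every number is meant to have the same length.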

CSV Exporting: Preserving leading zeros

I'm working on a .NET application which exports CSV files to open in Excel and I'm having a problem with preserving leading zeros when the file is opened in Excel. I've used the method mentioned at http://creativyst.com/Doc/Articles/CSV/CSV01.htm#CSVAndExcel
This works great until the user decides to save the CSV file within Excel. If the file is opened again in Excel then the leading zeros are lost.
Is there anything I can do when generating the CSV file to prevent this from happening?
This is not a CSV issue.
This is Excel loving to play with CSV files.
Change the extension to something else.
As @GSerg mentions, this is not a CSV issue.
If your users must edit/save in Excel, then after opening the csv file they need to select the entire worksheet, right-click, choose "Format Cells", and select "Text" from the Category list. This will preserve the leading zeros, since the numbers will be treated as simple text.
Alternatively, you could use Open XML SDK 2.0, or some other Excel library, to create an xlsx file from your csv data and programmatically set the cell type to Text in order to take the end users out of the equation...
I found a nice way around this, if you add a space anywhere along the phone number, the cell is then not treated as number and is treated as a text cell in both Excel and Apple's iWork Numbers.
It's the only solution I've found so far that plays nice with Numbers.
Yes, I realise the number then has a space, but this is easy to process out of large chunks of data: you just have to select a column and remove all spaces.
Also, if this is web related, most web type things are ok with users entering a space in the number field. E.g. you can tap-to-call on mobiles.
The challenge is to get the space in there in the first place.
In use:
01202123456 = 1202123456
but
01202 123456 = 01202 123456
Ok, new discovery.
Using Quick Preview on Mac to view a CSV file the telephone column will display perfectly, but opening the file fully with Numbers or Excel will ruin that column.
On some level Mac OS X is capable of handling that column correctly with no user meddling.
I am now working on the best/easiest way to make a website output a universally accepted CSV with telephone numbers preserved.
But maybe with that info someone else has an idea on how to make Numbers handle the file in the same way that Quick Preview does?

Resources