I have a series of Excel files that I send out to customers. They fill them out and send them back with their info. How do I ensure that the Excel files coming back are the same ones I sent out and don't just share the same title and row/column names?
The data could be falsified under the same title and row/column names. Ideally I need some kind of fingerprint, artifact, or key attached to each Excel file that proves it came from my original data source.
I used to add white characters to headings as one simple trick.
Or I would put in cells odd names combined with dates in rows way below or columns far to the right.
I even inserted a name using Insert > Name. You can also define names with VBA, and sometimes deleting them does not completely remove them; I used that trick to hide passwords...
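A sturdier version of these tricks is to embed a keyed hash (HMAC) of the file's identifying data in a hidden cell or defined name before sending it out, then recompute it on the returned file. The sketch below is a minimal illustration; the secret key and cell values are hypothetical, and in practice you would read the values out of the workbook with VBA or a library.

```python
import hashlib
import hmac

# Hypothetical secret; keep it out of the workbook you send.
SECRET_KEY = b"replace-with-your-own-secret"

def fingerprint(values):
    """Return a short keyed hash (HMAC-SHA256) of identifying cell values.

    Store the result in a hidden cell or defined name before sending the
    workbook out; recompute it on the returned file to verify it really
    came from your original data source.
    """
    payload = "\x1f".join(values).encode("utf-8")
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()[:16]

def verify(values, stored_tag):
    """True if the returned file's identifying values still match the tag."""
    return hmac.compare_digest(fingerprint(values), stored_tag)
```

Unlike white text or far-off cells, a keyed hash cannot be reproduced by someone who merely copies your headers, because they don't have the key.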
My customer has an issue with certain .csv files: Excel auto-detects data types and alters the data when the files are opened. The current workaround is to open an instance of Excel, open the file, and go through the many-step process of choosing data types.
There is no standard format for which data elements will be in each csv file, so I've been thinking up methods to write code that is fairly flexible. To keep this short, basically, I think I've got a good idea of how to make something flexible to support the customer's needs that involves running an append query in Access to dynamically alter/create specifications, but I cannot figure out how to obtain values for the "Start" and "Width" fields in the MSysIMEXColumns table.
Is there a function in vba that can help me read a csv file, and gather the column names along with the "Start" and "Width" values? Bonus if you can also help me plug those values into an Access table. Thanks for your help!!
First of all... there is NO "easy csv to Excel" conversion when your customer has:
"...no standard format for which data elements will be in each csv file."
I used to work for a data processor where we combined thousands of different customer files, trying to shoehorn them into a structured database. The ways customers can mangle data are endless. And just when you think you've figured them out, they find new ways of mangling data.
I once had a customer who had the brilliant idea of storing their "Dead Beat" flag IN their full-name field. And then didn't tell us they did so. And then when we mailed the list out to their customers, they tried to blame us for not catching that. Can you imagine someone waking up one morning and getting junk mail addressed to "Dear, Dead Beat"?
But that's only one way "no standard format" customers can make it impossible to catch their errors. They can be notorious for mixing in text with number fields. They can be notorious for including invisible escape characters in text fields that make printers crash. And don't even get started on how many different ways abbreviations can cause data to be inconsistent.
Anyway... to answer your question:
If you are using CSV files, they are comma delimited. You don't need "Start" and "Width".
"Start" and "Width" are for Fixed Width files. And if your customer is giving you a fixed width file, they NEED to give you a "standard format". If they don't then you are just trying to mind read what they did. And while you can probably guess correctly most of the time, inevitably, you are going to guess wrong and your customer is going to try to blame you for the error.
Other than that, sometimes you just have to go through the long slog of having a human visually inspect things to make sure the convert went as planned. I'd also suggest lots of counts and groupings on your data afterwards to make sure they didn't do something unexpected.
Trying to convert undocumented files is a very difficult and time consuming task. It's why they are paying you big bucks to do it.
So to answer your question again, "Start" and "Width" are for Fixed Width files. If they are sending you Fixed Width files, they need to send specifications.
If it's a csv file, you don't need "Start" and "Width". The delimiter (usually a comma) is what separates your fields.
** EDIT **
Ok... thinking through this some more... I'll take a guess at what you are doing:
1) You create and save a generic spec in Access for delimited files.
2) You open your CSV file through vba and read the new delimited header record with all the column header names.
3) You try to modify the MSysIMEXColumns table to include new fields and modify old ones.
4) You now run your import based on the new spec you created.
If that is the case, you need to do a couple of things:
First, understand that this is a dangerous thing to do. Access uses wizards to create its system tables. If you muck with these, you don't know how it might affect the wizards when they try to access those tables again. You are best off creating a new spec for each new file type, using the Access wizards.
Second, once you come to the conclusion you are smarter than Microsoft (which is probably a good bet anyway), you can try to make a dynamic spec file.
Third, you NEED to make sure your spec record in MSysIMEXSpecs is correct. That means you need to have it set as a delimited file and have the correct delimiter in there. Plus you need to have the FileType correct. You need to know if it's Unicode or any number of other file types that your customer could be sending you.
And by "correct delimiter" I mean... try to get your customer to send you "pipe delimited" | files. If they send you "comma delimited" files, you run the risk of them sending you text fields with comments or addresses that include a comma in the data. Say they cut and paste a street address that has a comma in it... that has the fantastic effect of splitting that address into two fields and pushing ALL of your subsequent columns off by one. It's lots of work to figure out if your data is corrupted this way. Pipes are much less likely to be populated in your data by accident.
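The off-by-one risk described above is easy to demonstrate. A quick sketch (the company and address are made up) showing how an unquoted comma splits a field, while a pipe-delimited round trip keeps it intact:

```python
import csv
import io

# Made-up record: the address field contains a comma.
row = ["Acme Corp", "12 Main St, Suite 4", "Springfield"]

# An unquoted comma-delimited line splits the address in two,
# pushing every later column off by one.
naive = "Acme Corp,12 Main St, Suite 4,Springfield".split(",")
assert len(naive) == 4   # one field too many

# The same record pipe-delimited round-trips cleanly.
buf = io.StringIO()
csv.writer(buf, delimiter="|").writerow(row)
parsed = next(csv.reader(io.StringIO(buf.getvalue()), delimiter="|"))
assert parsed == row
```

Proper CSV quoting also solves this, but only if the sender quotes correctly, which is exactly what you can't count on from "no standard format" customers.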
Fourth, assuming your MSysIMEXSpecs is correct, you can then modify your MSysIMEXColumns table. You will need to extract your column headers from your csv file. You'll need to know how many fields there are and their order. You can then modify your current records to have the new field names and add any new records for new fields, or delete records if there are less fields than before.
I'd suggest saving them all as text fields DataType=10 in a staging table. That way you can go back and do analysis on each field to see if they mixed text into numeric fields or any other kind of craziness that customers love to do.
And since you know your spec in MSysIMEXSpecs is a delimited field file, you can give each record a "Start" field equal to the sequence your Header record calls for.
Attributes  DataType  FieldName  IndexType  SkipColumn  SpecID  Start  Width
0           10        Rule       0          0           3       1      1
0           10        From Bin   0          0           3       2      1
0           10        toBin      0          0           3       3      1
0           10        zone       0          0           3       4      1
0           10        binType    0          0           3       5      1
0           10        Warehouse  0          0           3       6      1
0           10        comment    0          0           3       7      1
Thus, the first field will have a "Start" of 1. The second field will have a "Start" of 2. etc.
Then your "Width" fields will all have a length of 1. Since your file is a delimited file, Access will figure out the correct width when it does the import.
And as long as your SpecID is pointing to the correct delimited spec, you should be able to import any csv file.
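The header-to-spec step above can be sketched in a few lines (Python here purely for illustration; in Access you would append these rows to MSysIMEXColumns with a VBA recordset, and the SpecID value is whatever your delimited spec's ID happens to be):

```python
import csv
import io

def build_spec_rows(csv_text, spec_id):
    """Read the header record of a delimited file and return one
    MSysIMEXColumns-style row per column: everything lands as text
    (DataType=10), Start is the column sequence, Width is a dummy 1.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    return [
        {"Attributes": 0, "DataType": 10, "FieldName": name.strip(),
         "IndexType": 0, "SkipColumn": 0, "SpecID": spec_id,
         "Start": position, "Width": 1}
        for position, name in enumerate(header, start=1)
    ]
```

The output mirrors the table above: Start counts 1, 2, 3... in header order, and Width stays 1 because Access works out the real widths for delimited files at import time.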
Fifth, after the data is in your staging table, you should then do an analysis of each field to make sure you know what type of data you really have and that your suspected data type doesn't violate any data type rules. At that point you can transfer them to a second staging table where you convert to the correct data types. You can do this all through VBA. You'll probably need to create quite a few home grown functions to validate your data. (This is the NOT so easy part about "easy csv to Excel" coding.)
Sixth, after you are done all of your data massaging, you can now transfer the good data to your live database or Excel spreadsheet. Inevitably you'll always have some records that don't fit the rules and you'll have to have someone eyeball the data and figure out what to do with it.
You are at the very tip of the iceberg in data conversion management. You will soon learn about all kinds of amazingly horrible stuff. And it can take years to write automated procedures that catch all the craziness that data processors / customers send to you. There are a gazillion data mediums they can send the data to you in. There are a gazillion different data types the data can be represented in. Along with a gazillion different formats that the data resides in. And that's all before you even get to think about data integrity.
You'll probably become very acquainted with GIGO (Garbage In, Garbage Out). If your customer tries to slip a new format past you (and they will) without telling you the "standard format for which data elements will be", you will be left trying to guess what the data is. And if it's garbage... best of luck trying to create an automated system for that.
Anyway, I hope the MSysIMEXColumns info helps. And if they ever give you Fixed Length files, just know you'll have to write a whole new system to get those into your database.
Best of luck :)
Our database needs to be filled with the zip codes for every state in our country. We are provided with a catalog of zip codes in an xls file, and we have to import this file into a table in a database hosted on Windows Azure.
I don't know if Stack Overflow allows me to post a link to our xls, but I'll describe the structure of the file:
Every sheet holds the zip code information for a whole state. Inside every sheet we have fifteen columns with information such as zip code, type of terrain, type of area, locality, state, city, etc. Every sheet has the same columns, and the information inside the cells may contain special characters (e.g. á, é, ó, ú) normal to the Spanish language; these special characters need to be preserved. Also, some cells may be empty, and blank spaces are likely to appear in the contents of the cells (e.g. Villa de Montenegro).
We are looking for a way to import every sheet into our table without losing special characters or skipping empty cells. We have no prior experience doing this kind of task and wanted to know what is the best way to import it.
We tried a suggestion of converting the xls to CSV files and then importing those CSVs into our database. We tried some variations of the macro recommended here, but the CSVs are generated with many errors (macros aren't our forte).
In short, what is the best way to import our xls to an Azure database table without losing empty cells, special characters nor failing when blank spaces are inside a cell?
I recently had to migrate some data in a similar way. I used the SQL Server 2014 Import and Export Data Wizard. I initially tried with a .csv, but it was finicky about quoted commas and such. When I saved it as a .xlsx file, I was able to upload it without a problem. It's pretty straightforward to use: just select your xls file as the source, configure the connection to your Azure database, next-next-next, and hopefully you get the happy path. I wrote about it on my blog, step by step with screenshots.
We found an easy, although slow, way to copy the contents from an xls using Visual Studio, the version we used was 2012 but it works with 2008 and 2013 too.
Open the Server Explorer.
Add a new connection, the url for the database is required, the credentials are the same as the ones you use to access the database on Azure. Test the connection if you like, if the credentials are correct then you're good to go.
After the connection has been made, expand the Tables section and select the table into which you wish to dump your data.
Right click and select view table data.
If the table is empty or already has some data, the workflow is the same. The last record will be empty; select it.
Go to your xls file. For this to work, the number and order of the columns must be the same as in the table you will be dumping the data into. Select the desired rows and copy them.
Return to Visual Studio and, while the last empty row is selected, paste the data. The data will start to copy directly into your Azure database.
Depending on your internet connection and the amount of data you're copying, this might take a long time.
This is an easy solution, although not optimal. This works if you don't own SQL Server with all of its tools. Still gotta check if this works on the express edition, will update when I test.
The format of our member numbers has changed several times over the years, such that 00008, 9538, 746, 0746, 00746, 100125, and various other permutations are valid, unique and need to be retained. Exporting from our database into the custom Excel template needed for a mass update strips the leading zeros, such that 00746 and 0746 are all truncated to 746.
Inserting the apostrophe trick, or formatting as text, does not work in our case, since the data seems to be already altered by the time we open it in Excel. Formatting as zip won't work since we have valid numbers less than five digits in length that cannot have zeros added to them. And I am not having any luck with "custom" formatting as that seems to require either adding the same number of leading zeros to a number, or adding enough zeros to every number to make them all the same length.
Any clues? I wish there was some way to set Excel to just take what it's given and leave it alone, but that does not seem to be the case! I would appreciate any suggestions or advice. Thank you all very much in advance!
UPDATE - thanks everybody for your help! Here are some more specifics. We are using a 3rd party membership management app -- we cannot access the database directly, we need to use their "query builder" tool to get the data we want to mass update. Then we export using their "template" format, which is called XLSX but there must be something going on behind the scenes, because if we try to import a regular old Excel, we get an error. Only their template works.
The data is formatted okay in the database, because all of the numbers show correctly in the web-based management tool. Also, if I export to CSV, save it as a .txt and import it into Excel, the numbers show fine.
What I have done is similar to ooo's explanation below -- I exported the template with the incorrect numbers, then exported as CSV/txt, and copied / pasted THOSE numbers into the template and re-imported. I did not get an error, which is something I guess, but I will not be able to find out if it was successful until after midnight! :-(
Assuming the data is not corrupt in the database, then try and export from the database to a csv or text file.
The following can then be done to ensure the import is formatted correctly
Text file with comma delimiter:
In Excel, go to Data > From Text and select Delimited, then Next.
In step 3 of the import wizard, for each column/field you want as text, highlight the column and select Text.
The data should then be placed as text and retain leading zeros.
Again, all of this assumes the database contains non-corrupt data and you are able to export a simple text or csv file. It also assumes you have Excel 2010 but it can be done with minor variation across all versions.
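The underlying point is worth seeing directly: the zeros are intact in the text file and only disappear when something coerces the field to a number. A small sketch (the member numbers are made up to match the mixed lengths described in the question):

```python
import csv
import io

# Made-up member numbers of mixed lengths, as they sit in the csv/txt file.
data = "member_no\n00008\n0746\n00746\n746\n"

# csv hands every field back as a string, so the leading zeros survive;
# they are only lost when something later coerces the field to a number.
members = [row["member_no"] for row in csv.DictReader(io.StringIO(data))]
assert members == ["00008", "0746", "00746", "746"]

# This is the coercion that collapses distinct member numbers into one.
assert int(members[1]) == int(members[2])
```

Selecting "Text" in step 3 of the wizard is what prevents Excel from performing that same coercion on open.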
Hopefully, #ooo's answer works for you. I'm providing another answer mainly for informational purposes, and don't feel like dealing with the constraints on comments.
One thing to understand is that Excel is very aggressive about treating "numeric-looking" data as actual numbers. If you were to open the CSV by double-clicking and letting Excel do its thing (rather than using ooo's careful procedure), those numbers would still have come up as numbers (no leading zeros). As you've found, one way to counteract this is to append clearly nonnumeric characters onto your data (before Excel gets its grubby hands on it), to really convince Excel that what it's dealing with is text.
Now, if the thing that uploads to their software is a file ending in .xlsx, then most likely it is the current Excel format (a compressed XML document, used by Excel 2007 and later). I suppose by "regular old Excel" you mean .xls (which still works with the newer Excels in "compatibility mode").
So in case what you've tried so far doesn't work, there are still avenues to explore before resorting to appending characters to the end of your data. (I'll update this answer as needed.)
You're on the right track with the apostrophe.
You'll need to store your numbers in excel as text at the time they are added to the file.
What are you using to create the original excel file / export from database?
This will likely be where your focus needs to be regarding your export.
For example one approach is that you could potentially modify the database export to include the ' symbol prefix before the numbers so that excel will know to display them as text.
I use the formula =TEXT(cell, format), where the format is a string of zeros as long as the desired field width, to add leading zeros.
For example, cell C2 has 12345 and I need it to be 10 characters long. I would put =TEXT(C2, "0000000000").
The result will be 0000012345.
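For comparison, the same padding outside Excel is a one-liner (a sketch; note this only helps when every value shares one target width, which is not the case for the mixed-length member numbers in the question above):

```python
def pad_member(value, width=10):
    """Equivalent of Excel's =TEXT(cell, "0000000000"): left-pad a value
    with zeros to a fixed width. Only useful when all values share one
    target width.
    """
    return str(value).zfill(width)
```

For example, `pad_member(12345)` returns `"0000012345"`.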
This question is long winded because I have been updating the question over a very long time trying to get SSIS to properly export Excel data. I managed to solve this issue, although not correctly. Aside from someone providing a correct answer, the solution listed in this question is not terrible.
The only answer I found was to create a single row named range wide enough for my columns. In the named range put sample data and hide it. SSIS appends the data and reads metadata from the single row (that is close enough for it to drop stuff in it). The data takes the format of the hidden single row. This allows headers, etc.
WOW what a pain in the butt. It will take over 450 days of exports to recover the time lost. However, I still love SSIS and will continue to use it because it is still way better than Filemaker LOL. My next attempt will be doing the same thing in the report server.
Original question notes:
If you are in the SQL Server Integration Services designer and want to export data to an Excel file starting on something other than the first line, let's say the fourth line, how do you specify this?
I tried going into the Excel Destination of the Data Flow and changing the AccessMode to "OpenRowSet from Variable", then set the variable to "YPlatters$A4:I20000". This fails, saying it cannot find the sheet. The sheet is called YPlatters.
I thought you could specify (Sheet$)(Starting Cell):(Ending Cell)?
Update
Apparently in Excel you can select a set of cells and name them with the name box. This allows you to select the name instead of the sheet without the $ dollar sign. Oddly enough, whatever the range you specify, it appends the data to the next row after the range. Oddly, as you add data, it increases the named selection's row count.
Another odd thing is the data takes the format of the last line of the range specified. My header rows are bold. If I specify a range that ends with the header row, the data appends to the row below and makes all the entries bold. If you specify one row lower, it puts a blank line between the header row and the data, but the data is not bold.
Another update
No matter what I try, SSIS samples the "first row" of the file and sets the metadata according to what it finds. However, if you have sample data that has a value of zero but is formatted as the first row, it treats that column as text and inserts numeric values with a single quote in front ('123.34). I also tried headers that do not reflect the data types of the columns. I tried changing the metadata of the Excel destination, but it always changes it back when I run the project, then fails saying it will truncate data. If I tell it to ignore errors, it imports everything except that column.
Several days of several hours apiece later...
Another update
I tried every combination. A mostly working example is to create the named range starting with the column headers. Format your column headers as you want the data to look as the data takes on this format. In my example, these exist from A4 to E4, which is my defined range. SSIS appends to the row after the defined range, so defining A4 to E68 appends the rows starting at A69. You define the Connection as having the first row contains the field names. It takes on the metadata of the header row, oddly, not the second row, and it guesses at the data type, not the formatted data type of the column, i.e., headers are text, so all my metadata is text. If your headers are bold, so is all of your data.
I even tried making a sample data row without success... I don't think anyone actually uses Excel with the default MS SSIS export.
If you could define the "insert range" (A5 to E5) with no header row and format those columns (currency, not bold, etc.) without it skipping a row in Excel, this would be very helpful. From what I gather, no one uses SSIS to export to Excel without a third-party connection manager.
Any ideas on how to set this up properly so that data is formatted correctly, i.e., the metadata read from Excel is proper to the real data, and formatting inherits from the first row of data, not the headers in Excel?
One last update (July 17, 2009)
I got this to work very well. One thing I added to Excel was the IMEX=1 in the Excel connection string: "Excel 8.0;HDR=Yes;IMEX=1". This forces Excel (I think) to look at all rows to see what kind of data is in it. Generally, this does not drop information, say for instance if you have a zip code then about 9 rows down you have a zip+4, Excel without this blanks that field entirely without error. With IMEX=1, it recognizes that Zip is actually a character field instead of numeric.
And of course, one more update (August 27, 2009)
The IMEX=1 will succeed importing data with missing contents in the first 8 rows, but it will fail exporting data where no data exists. So, have it on your import connection string, but not your export Excel connection string.
I have to say, after so much fiddling, it works pretty well.
P.S. If you are using a x64 bit version, make sure you call the DTExec from C:\Program Files\Microsoft SQL Server\90\DTS.x86\Binn. It will load the 32 bit Excel driver and work fine.
Would it be easier to create the Excel Workbook in a script task, then just pick it up later in the flow?
The engine part of SSIS is good but the integration with Excel is awful
"Using SSIS in conjunction with Excel is like having hot tar funnelled up your iHole in a road cone"
Dr. Zim, I believe you were the one that originally brought up this question. I totally feel your pain. I love SSIS overall, but I absolutely hate the limited tools that come standard for Excel. All I want to do is bold the heading or Row 1 record in Excel, and not bold the following records. I have not found a great way to do that; granted, I am approaching this with no script tasks or custom extensions, but you would think something this simple would be a standard option. Looks like I may be forced to research and program up something fancy for a task that should be so fundamental. I've already spent a ridiculous amount of time on this myself. Does anyone know if you can use Excel XML with Excel versions 2000/XP/2003? Thanks.
This is an old thread but what about using a flat file connection and writing the data out as a formatted html document. Set the mime type in the page header to "application/excel". When you send the document as an attachment and the recipient opens the attachment, it will open a browser session but should pop Excel up over the top of it with the data formatted according to the style (CSS) specified in the page.
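A minimal sketch of that HTML approach (Python for illustration; the MIME type is commonly given as application/vnd.ms-excel, the bold-header CSS stands in for whatever styling you need, and note that recent Excel versions may warn about the extension/content mismatch):

```python
import html

def rows_to_excel_html(headers, rows):
    """Render tabular data as a minimal HTML document. Served with an
    Excel MIME type (or saved with an .xls extension), Excel will open
    it as a sheet and honor the CSS formatting in the header.
    """
    head_cells = "".join(f"<th>{html.escape(h)}</th>" for h in headers)
    body_rows = "".join(
        "<tr>" + "".join(f"<td>{html.escape(str(c))}</td>" for c in r) + "</tr>"
        for r in rows
    )
    return (
        "<html><head><style>th { font-weight: bold }</style></head>"
        f"<body><table><tr>{head_cells}</tr>{body_rows}</table></body></html>"
    )
```

This sidesteps the SSIS Excel destination entirely: a flat file destination (or script task) writes the HTML, and Excel does the rendering on the recipient's side.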
Can you have SSIS write the data to an Excel sheet starting at A1, then create another sheet, formatted as you like, that refers to the other sheet at A1, but displays it as A4? That is, on the "pretty" sheet, A4 would refer to A1 on the SSIS sheet.
This would allow SSIS to do what it's good for (manipulate table-based data), but allow the Excel to be formatted or manipulated however you'd like.
When Excel is the destination in SSIS, or the target export type in SSRS, you do not have much control over formatting and specifying how you want the final file to look. I once wrote a custom Excel rendering engine for SSRS, as my client was so strict about the format of the final Excel report generated. I used 'Excel XML' to get the job done inside my custom renderer. Maybe you can use XML output and convert it to Excel XML using XSLT.
I understand you would rather not use a script component so perhaps you could create your own custom task using the code that a script contains so that others can use this in the future. Check here for an example.
If this seems feasible, the solution I used was the CarlosAg Excel Xml Writer Library. With this you can write code similar to using the Interop library, but it produces Excel in XML format. This avoids using the Interop object, which can sometimes lead to Excel processes hanging around.
Instead of using a roundabout way of trying to write data to particular cell(s), then format and style them (which is indeed a very tedious effort considering the support SSIS has for Excel), we could go the "template" way.
Assume we need to write data into such-and-such a cell with all the custom formatting done on it. Keep all the formatting in one sheet, say "SheetActual", where the cells that will hold the data actually have lookups/references/formulas referring to the original data that SSIS exports into a hidden sheet, say "SheetMasterHidden", of the same Excel connection. This "SheetMasterHidden" sheet essentially holds the master data in the default format that SSIS writes to Excel. This way you need not worry about formatting the data at runtime.
Formatting the Excel file is a one-time job IF the formatting doesn't change very often. If the format changes and is decided at runtime, this solution may not go very well.
The answer is in the question: over time, it turned into a progress log. However, there is SSRS, which will create Excel files if you create TABLE presentations. It works pretty well too.