My customer has an issue with certain .csv files: when they are opened in Excel, data types are auto-detected and the data gets altered. The current workaround is to open an instance of Excel, open the file, and go through the many-step process of choosing data types.
There is no standard format for which data elements will be in each csv file, so I've been thinking up methods to write code that is fairly flexible. To keep this short, basically, I think I've got a good idea of how to make something flexible to support the customer's needs that involves running an append query in Access to dynamically alter/create specifications, but I cannot figure out how to obtain values for the "Start" and "Width" fields in the MSysIMEXColumns table.
Is there a function in VBA that can help me read a csv file and gather the column names along with the "Start" and "Width" values? Bonus if you can also help me plug those values into an Access table. Thanks for your help!!
First of all... there is NO "easy csv to Excel" conversion when your customer has:
"...no standard format for which data elements will be in each csv file."
I used to work for a data processor where we combined thousands of different customer files, trying to shoehorn them into a structured database. The ways customers can mangle data are endless. And just when you think you've figured them out, they find new ways of mangling data.
I once had a customer who had the brilliant idea of storing their "Dead Beat" flag IN the full name field. And they didn't tell us they did so. Then, when we mailed the list out to their customers, they tried to blame us for not catching it. Can you imagine someone waking up one morning and getting junk mail addressed to "Dear, Dead Beat"?
But that's only one way "no standard format" customers can make it impossible to catch their errors. They can be notorious for mixing text into number fields. They can be notorious for including invisible escape characters in text fields that make printers crash. And don't even get me started on how many different ways abbreviations can make data inconsistent.
Anyway... to answer your question:
If you are using CSV files, they are comma delimited. You don't need "Start" and "Width".
"Start" and "Width" are for Fixed Width files. And if your customer is giving you a fixed width file, they NEED to give you a "standard format". If they don't then you are just trying to mind read what they did. And while you can probably guess correctly most of the time, inevitably, you are going to guess wrong and your customer is going to try to blame you for the error.
Other than that, sometimes you just have to go through the long slog of having a human visually inspect things to make sure the conversion went as planned. I'd also suggest running lots of counts and groupings on your data afterwards to make sure they didn't do something unexpected.
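For example, a quick value-frequency check run from VBA against whatever table you land the data in can surface surprises. This is only a sketch; the table and field names are placeholders.

    ' Rough sketch: print each distinct value in a column and how often it
    ' occurs, so unexpected values stand out after an import.
    ' Table and field names are placeholders.
    Public Sub ProfileColumn(ByVal tableName As String, ByVal fieldName As String)
        Dim rs As DAO.Recordset
        Set rs = CurrentDb.OpenRecordset( _
            "SELECT [" & fieldName & "] AS ColValue, Count(*) AS RecCount " & _
            "FROM [" & tableName & "] " & _
            "GROUP BY [" & fieldName & "] ORDER BY Count(*) DESC")

        Do While Not rs.EOF
            Debug.Print rs!ColValue, rs!RecCount
            rs.MoveNext
        Loop
        rs.Close
    End Sub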
Trying to convert undocumented files is a very difficult and time consuming task. It's why they are paying you big bucks to do it.
So to answer your question again, "Start" and "Width" are for Fixed Width files. If they are sending you Fixed Width files, they need to send specifications.
If it's a csv file, you don't need "Start" and "Width". The delimiter (usually a comma) is what separates your fields.
** EDIT **
Ok... thinking through this some more... I'll take a guess at what you are doing:
1) You create and save a generic spec in Access for delimited files.
2) You open your CSV file through vba and read the new delimited header record with all the column header names.
3) You try to modify the MSysIMEXColumns table to include new fields and modify old ones.
4) You now run your import based on the new spec you created.
If that is the case, you need to do a couple of things:
First, understand that this is a dangerous thing to do. Access uses wizards to create its system tables. If you muck with these, you don't know how it might affect the wizards when they try to access these tables again. You are best off creating a new spec for each new file type, using the Access wizards.
Second, once you come to the conclusion that you are smarter than Microsoft (which is probably a good bet anyway), you can try to make a dynamic spec file.
Third, you NEED to make sure your spec record in MSysIMEXSpecs is correct. That means you need to have it set as a delimited file and have the correct delimiter in there. Plus you need to have the FileType correct. You need to know if it's Unicode or any number of other file types that your customer could be sending you.
And by "correct delimiter" I mean... try to get your customer to send you "pipe delimited" | files. If they send you "comma delimited" files, you run the risk of them sending you text fields with comments or addresses that include a comma in the data. Say they cut and paste a street address that has a comma in it... that has the fantastic effect of splitting that address into two fields and pushing ALL of your subsequent columns off by one. It's lots of work to figure out if your data is corrupted this way. Pipes are much less likely to be populated in your data by accident.
Fourth, assuming your MSysIMEXSpecs is correct, you can then modify your MSysIMEXColumns table. You will need to extract your column headers from your csv file. You'll need to know how many fields there are and their order. You can then modify your current records to have the new field names and add new records for new fields, or delete records if there are fewer fields than before.
I'd suggest saving them all as text fields (DataType=10) in a staging table. That way you can go back and do analysis on each field to see if they mixed text into numeric fields or any other kind of craziness that customers love to do.
And since you know your spec in MSysIMEXSpecs is a delimited-file spec, you can give each record a "Start" value equal to its sequence in the header record.
Attributes  DataType  FieldName  IndexType  SkipColumn  SpecID  Start  Width
0           10        Rule       0          0           3       1      1
0           10        From Bin   0          0           3       2      1
0           10        toBin      0          0           3       3      1
0           10        zone       0          0           3       4      1
0           10        binType    0          0           3       5      1
0           10        Warehouse  0          0           3       6      1
0           10        comment    0          0           3       7      1
Thus, the first field will have a "Start" of 1. The second field will have a "Start" of 2. etc.
Then your "Width" fields will all have a length of 1. Since your file is a delimited file, Access will figure out the correct width when it does the import.
And as long as your SpecID is pointing to the correct delimited spec, you should be able to import any csv file.
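Putting those pieces together, here is a rough VBA sketch of the approach, not a finished implementation. It assumes a delimited spec already saved in MSysIMEXSpecs (looked up by name), reads the CSV header record, rewrites MSysIMEXColumns with "Start" equal to the column sequence and "Width" of 1 as in the table above, and then imports into an all-text staging table. The spec name, file path and table name are placeholders.

    ' Sketch only. Rebuilds MSysIMEXColumns for an existing delimited spec
    ' based on the header row of the CSV, then imports to a staging table.
    ' Remember the warning above about editing Access system tables.
    Public Sub RebuildSpecAndImport(ByVal specName As String, ByVal csvPath As String)
        Dim db As DAO.Database
        Dim specId As Variant
        Dim headerLine As String
        Dim cols() As String
        Dim colName As String
        Dim i As Long
        Dim f As Integer

        Set db = CurrentDb

        ' Find the saved delimited spec. It must already exist and be set up
        ' with the right delimiter and file type in MSysIMEXSpecs.
        specId = DLookup("SpecID", "MSysIMEXSpecs", "SpecName='" & specName & "'")
        If IsNull(specId) Then
            MsgBox "Spec '" & specName & "' not found in MSysIMEXSpecs."
            Exit Sub
        End If

        ' Read the header record from the CSV file.
        f = FreeFile
        Open csvPath For Input As #f
        Line Input #f, headerLine
        Close #f
        cols = Split(headerLine, ",")   ' use "|" here if the file is pipe delimited

        ' Replace the old column definitions for this spec.
        db.Execute "DELETE FROM MSysIMEXColumns WHERE SpecID=" & specId, dbFailOnError
        For i = LBound(cols) To UBound(cols)
            colName = Trim(Replace(cols(i), """", ""))   ' drop any surrounding quotes
            db.Execute "INSERT INTO MSysIMEXColumns " & _
                "(Attributes, DataType, FieldName, IndexType, SkipColumn, SpecID, Start, Width) " & _
                "VALUES (0, 10, '" & Replace(colName, "'", "''") & "', 0, 0, " & _
                specId & ", " & (i + 1) & ", 1)", dbFailOnError
        Next i

        ' Import everything as text into a staging table using the spec.
        DoCmd.TransferText acImportDelim, specName, "tblStaging", csvPath, True
    End Sub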
Fifth, after the data is in your staging table, you should then do an analysis of each field to make sure you know what type of data you really have and that your suspected data type doesn't violate any data type rules. At that point you can transfer them to a second staging table where you convert to the correct data types. You can do this all through VBA. You'll probably need to create quite a few home grown functions to validate your data. (This is the NOT so easy part about "easy csv to Excel" coding.)
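As one small example of the sort of home-grown check you end up writing (a sketch only, with placeholder names): a function that decides whether every non-blank value in a staging column can safely be treated as numeric before you attempt the conversion.

    ' Sketch: returns True only if every non-blank value in the column can be
    ' treated as a number, i.e. it is safe to convert that field.
    Public Function ColumnIsNumeric(ByVal tableName As String, ByVal fieldName As String) As Boolean
        Dim rs As DAO.Recordset
        Dim v As Variant

        ColumnIsNumeric = True
        Set rs = CurrentDb.OpenRecordset( _
            "SELECT [" & fieldName & "] FROM [" & tableName & "]", dbOpenSnapshot)

        Do While Not rs.EOF
            v = rs.Fields(0).Value
            If Not IsNull(v) Then
                If Len(Trim(v)) > 0 And Not IsNumeric(v) Then
                    ColumnIsNumeric = False   ' text mixed into a "numeric" field
                    Exit Do
                End If
            End If
            rs.MoveNext
        Loop
        rs.Close
    End Function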
Sixth, after you are done with all of your data massaging, you can transfer the good data to your live database or Excel spreadsheet. Inevitably you'll always have some records that don't fit the rules, and you'll have to have someone eyeball the data and figure out what to do with it.
You are at the very tip of the iceberg in data conversion management. You will soon learn about all kinds of amazingly horrible stuff. And it can take years to write automated procedures that catch all the craziness that data processors / customers send to you. There are a gazillion data media they can send the data to you in. There are a gazillion different data types the data can be represented in. Along with a gazillion different formats that the data resides in. And that's all before you even get to think about data integrity.
You'll probably become very acquainted with GIGO (Garbage In, Garbage Out). If your customer tries to slip a new format past you (and they will) without telling you the "standard format for which data elements will be", you will be left trying to guess what the data is. And if it's garbage... best of luck trying to create an automated system for that.
Anyway, I hope the MSysIMEXColumns info helps. And if they ever give you Fixed Length files, just know you'll have to write a whole new system to get those into your database.
Best of luck :)
Related
We designed an SSIS package for our client to export data from their databases to flat files that need to be further processed by us.
In short, one of the flat file destinations in the DFT (Data Flow Task) does not have a text qualifier specified. Recently we received files with a lot of data that contains the same symbol as the column delimiter inside the text. And its appearance is unpredictable, meaning it could show up in any of the columns.
Before sending the updated package to the client, is there any other way (no hard-coded update from the back end, such as updating each column with the column to its right) to know where each original column ends?
The easiest way is to ask them to export the files again with a text qualifier to escape the symbol, but for business reasons it might not be our top pick. Has anyone experienced this before? Any advice or suggestions?
I'm finishing up a program that I built to import an Excel file into a database, do some manipulations/edits, and then spit back out the edited Excel file. Except my problem is that the file size just ballooned to a huge amount, from approx. 3 MB to ~19 MB.
It has the same record count, ~20k. It has ~3 more columns (out of 40+ columns total) - but that shouldn't make the file size 6x, should it? Below is the code I use for the output:
DoCmd.OutputTo acOutputQuery, "Q_Export", acFormatXLS, txtFilePath & txtFileName
Any ideas on how I can get that file size down a bit more? Or does anyone have a possible indication of what is causing it, at least?
There are three possible reasons that spring immediately to mind:
1) You are importing more records than you think you are. Check the table after the Excel file has been imported. Make sure the table has EXACTLY as many records as there are rows of data in Excel. Often the import process will bring in many empty records, and that data is then exported as empty strings. To you it looks the same, but to Excel it's information that must be stored, which takes up space.
2) Excel handles NULL values differently than Access does. If your data has a lot of missing information, it's going to be stored differently when it's imported into Access. This actually brings us back to Reason #1.
3) When you import data, sometimes it comes in with trailing spaces. Make sure you TRIM() your data before exporting it, so no storage space is wasted on padding.
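A minimal sketch of cleanups 1 and 3, run before the DoCmd.OutputTo call; tblImported and the field names are placeholders for whatever your import table actually uses.

    ' Sketch: drop the empty rows the Excel import dragged in, then trim
    ' stray spaces from the text columns, before exporting Q_Export.
    Public Sub CleanBeforeExport()
        ' 1) Remove records where every imported column is blank.
        CurrentDb.Execute _
            "DELETE FROM tblImported " & _
            "WHERE (FieldA Is Null OR FieldA='') AND (FieldB Is Null OR FieldB='')", dbFailOnError

        ' 3) Trim leading/trailing spaces from the text columns.
        CurrentDb.Execute _
            "UPDATE tblImported SET FieldA = Trim(FieldA), FieldB = Trim(FieldB)", dbFailOnError
    End Sub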
The format of our member numbers has changed several times over the years, such that 00008, 9538, 746, 0746, 00746, 100125, and various other permutations are valid, unique and need to be retained. Exporting from our database into the custom Excel template needed for a mass update strips the leading zeros, such that 00746 and 0746 are all truncated to 746.
Inserting the apostrophe trick, or formatting as text, does not work in our case, since the data seems to be already altered by the time we open it in Excel. Formatting as zip won't work since we have valid numbers less than five digits in length that cannot have zeros added to them. And I am not having any luck with "custom" formatting as that seems to require either adding the same number of leading zeros to a number, or adding enough zeros to every number to make them all the same length.
Any clues? I wish there was some way to set Excel to just take what it's given and leave it alone, but that does not seem to be the case! I would appreciate any suggestions or advice. Thank you all very much in advance!
UPDATE - thanks everybody for your help! Here are some more specifics. We are using a 3rd party membership management app -- we cannot access the database directly, we need to use their "query builder" tool to get the data we want to mass update. Then we export using their "template" format, which is called XLSX, but there must be something going on behind the scenes, because if we try to import a regular old Excel file, we get an error. Only their template works.
The data is formatted okay in the database, because all of the numbers show correctly in the web-based management tool. Also, if I export to CSV, save it as a .txt and import it into Excel, the numbers show fine.
What I have done is similar to ooo's explanation below -- I exported the template with the incorrect numbers, then exported as CSV/txt, and copied / pasted THOSE numbers into the template and re-imported. I did not get an error, which is something I guess, but I will not be able to find out if it was successful until after midnight! :-(
Assuming the data is not corrupt in the database, try exporting from the database to a csv or text file.
The following can then be done to ensure the import is formatted correctly
Text file with comma delimiter:
In Excel, go to Data / From Text, select Delimited, then Next.
In step 3 of the import wizard, for each column/field you want as text, highlight the column and select Text.
The data should then be placed as text and retain leading zeros.
Again, all of this assumes the database contains non-corrupt data and you are able to export a simple text or csv file. It also assumes you have Excel 2010 but it can be done with minor variation across all versions.
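If this needs to happen often, the same wizard choices can be scripted from Excel VBA. This is a sketch only: the file path is a placeholder and it assumes the member number is the first column of the exported text file.

    ' Sketch: open the exported text file with column 1 forced to Text so the
    ' leading zeros survive. Path and column positions are assumptions.
    Sub OpenMemberFileAsText()
        Workbooks.OpenText _
            Filename:="C:\exports\members.txt", _
            DataType:=xlDelimited, _
            Comma:=True, _
            FieldInfo:=Array(Array(1, xlTextFormat), Array(2, xlGeneralFormat))
    End Sub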
Hopefully, #ooo's answer works for you. I'm providing another answer mainly for informational purposes, and don't feel like dealing with the constraints on comments.
One thing to understand is that Excel is very aggressive about treating "numeric-looking" data as actual numbers. If you were to open the CSV by double-clicking and letting Excel do its thing (rather than using ooo's careful procedure), those numbers would still have come up as numbers (no leading zeros). As you've found, one way to counteract this is to append clearly nonnumeric characters onto your data (before Excel gets its grubby hands on it), to really convince Excel that what it's dealing with is text.
Now, if the thing that uploads to their software is a file ending in .xlsx, then most likely it is the current Excel format (a compressed XML document, used by Excel 2007 and later). I suppose by "regular old Excel" you mean .xls (which still works with the newer Excels in "compatibility mode").
So in case what you've tried so far doesn't work, there are still avenues to explore before resorting to appending characters to the end of your data. (I'll update this answer as needed.)
You're on the right track with the apostrophe.
You'll need to store your numbers in Excel as text at the time they are added to the file.
What are you using to create the original excel file / export from database?
This will likely be where your focus needs to be regarding your export.
For example, one approach is to modify the database export to prefix the numbers with the ' symbol so that Excel will know to treat them as text.
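If that export is driven from VBA, a sketch of what that could look like is below; the worksheet, column and variable names here are only illustrative.

    ' Sketch: write the member number into the cell as text so Excel leaves
    ' the leading zeros alone. Names are placeholders.
    Sub WriteMemberNumber(ws As Worksheet, r As Long, memberNo As String)
        ws.Cells(r, 1).NumberFormat = "@"     ' format the cell as Text first
        ws.Cells(r, 1).Value = memberNo       ' 00746 stays 00746
        ' Or, equivalently, prefix the value with an apostrophe:
        ' ws.Cells(r, 1).Value = "'" & memberNo
    End Sub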
I use the formula =TEXT(cell, "format") to add leading zeros, where the format string contains as many zeros as the field should be wide.
For example, cell C2 has 12345 and I need it to be 10 characters long. I would put =TEXT(C2,"0000000000").
The result will be 0000012345.
I have a csv file with commas inside fields that are not enclosed in quotes. Unfortunately I must parse this file and cannot get it replaced with a properly formatted one.
I really don't even know where to begin.
OK. What I'm seeing is the following: You have about 8,000 rows that essentially have a CSV syntax error in them. You can manually figure out which they are, but manually fixing 8,000 entries is a bit much.
The obvious first approach would be to try to see how it is that you can manually figure out which columns have this issue. If it is something you can define rules for, you are in business. If it's simple enough, you can write a small text editor macro to go through the file and do it for you. If your text editor doesn't support macros, use awk. If you are on Windows and don't have awk, then go get it.
If it is too complicated for that, fix your real problem. Go fix whatever generated this CSV file to generate it right. If it was someone else's code you don't have access to, tell them to fix it. "You are generating 8,000 unparsable entries" seems like a pretty good argument in my book. Sooner or later they will probably generate a new revision of this file for you to process, so this is really the Right Thing to do.
There's probably nothing you can do with it short of analyzing the records manually in a text editor. The comma delimiters are essentially useless if there is no discernible way to distinguish them from valid commas in the data.
If you can get a cleaner file from whoever created the bad one, that's probably far less trouble than trying to fix up the one you've got.
You could run an Excel macro to change the commas to some other character (let's say $, something not in your file) for the time being; then, once you've parsed the file, you could run the results through some code to turn that character back into the original commas.
EDIT: I am assuming that you have access to the original file seeing as you've tagged excel here?
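A minimal sketch of that kind of macro, assuming you can get at the data before the CSV is generated and that the free text lives in a known column; the sheet and column names are placeholders.

    ' Sketch: swap commas in the free-text column for a stand-in character
    ' before the CSV is produced, so they can't collide with the delimiter.
    Sub ReplaceCommasInTextColumn()
        Worksheets("Sheet1").Range("C:C").Replace _
            What:=",", Replacement:="$", LookAt:=xlPart
    End Sub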
I think the best you can hope for is 80% automatic, which means you'll be doing over 1,000 manually best case. You just need to be clever about the data that's there. Read each line in and count the commas. If it's the right amount, write it out to a new file. If it's too many, send it to the exception handler.
Start with what you absolutely know about the data. Is the first column a TimeStamp? If you know that, you can go from "20 commas when there should be 18" to "19 commas when there should be 17". I know that doesn't exactly lift your spirits but it's progress. Is there a location, like a plant name, somewhere in there? Maybe you can develop a list from the good data and search for it in the bad data. If column 7 should be the plant name, go through your list of plant names and see if one of them exists. If so, count the commas between that and the start and between that and the end (or another good comma location that you've established).
If you have some unique data, you can use a regex to find its location in the string and, again, count commas before and after to see if it's where it should be. For example, a Lat/Long reading or a part number in the format 99A99-999.
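A rough sketch of that first "count the commas" pass is below; the file paths and the expected comma count are placeholders. Each line is read, the commas are counted, and the row is routed to a clean file or an exceptions file.

    ' Sketch: rows with the expected number of delimiters go to a clean file,
    ' everything else goes to an exceptions file for manual review.
    Sub SplitGoodAndBadRows()
        Const EXPECTED_COMMAS As Long = 17        ' i.e. 18 columns expected
        Dim fIn As Integer, fGood As Integer, fBad As Integer
        Dim rowText As String, commaCount As Long

        fIn = FreeFile: Open "C:\data\raw.csv" For Input As #fIn
        fGood = FreeFile: Open "C:\data\good.csv" For Output As #fGood
        fBad = FreeFile: Open "C:\data\exceptions.csv" For Output As #fBad

        Do While Not EOF(fIn)
            Line Input #fIn, rowText
            commaCount = Len(rowText) - Len(Replace(rowText, ",", ""))
            If commaCount = EXPECTED_COMMAS Then
                Print #fGood, rowText
            Else
                Print #fBad, rowText
            End If
        Loop

        Close #fIn: Close #fGood: Close #fBad
    End Sub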
If you can post five or ten rows of good data, maybe someone can suggest more specific ways to identify columns and their locations.
Good luck.
I need to import sheets which look like the following:
March Orders
***Empty Row
Week  Order #  Date    Cust #
3.1   271356   3/3/10  010572
3.1   280353   3/5/10  022114
3.1   290822   3/5/10  010275
3.1   291436   3/2/10  010155
3.1   291627   3/5/10  011840
The column headers are actually in row 3. I can use an Excel Source to import them, but I don't know how to specify that the information starts at row 3.
I Googled the problem, but came up empty.
Have a look at these threads. The links have more details, but I've included some text from the pages (just in case the links go dead).
http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/97144bb2-9bb9-4cb8-b069-45c29690dfeb
Q:
While we are loading the text file to SQL Server via SSIS, we have the
provision to skip any number of leading rows from the source and load
the data to SQL server. Is there any provision to do the same for
Excel file.
The source Excel file for me has some description in the leading 5
rows, I want to skip it and start the data load from the row 6. Please
provide your thoughts on this.
A:
Easiest would be to give each row a number (a bit like an identity in
SQL Server) and then use a conditional split to filter out everything
where the number <=5
http://social.msdn.microsoft.com/Forums/en/sqlintegrationservices/thread/947fa27e-e31f-4108-a889-18acebce9217
Q:
Is it possible during import data from Excel to DB table skip first 6 rows for example?
Also Excel data divided by sections with headers. Is it possible for example to skip every 12th row?
A:
YES YOU CAN. Actually, you can do this very easily if you know the number of columns that will be imported from your Excel file. In
your Data Flow task, you will need to set the "OpenRowset" Custom
Property of your Excel Connection (right-click your Excel connection >
Properties; in the Properties window, look for OpenRowset under Custom
Properties). To ignore the first 5 rows in Sheet1, and import columns
A-M, you would enter the following value for OpenRowset: Sheet1$A6:M
(notice, I did not specify a row number for column M. You can enter a
row number if you like, but in my case the number of rows can vary
from one iteration to the next)
AGAIN, YES YOU CAN. You can import the data using a conditional split. You'd configure the conditional split to look for something in
each row that uniquely identifies it as a header row; skip the rows
that match this 'header logic'. Another option would be to import all
the rows and then remove the header rows using a SQL script in the
database...like a cursor that deletes every 12th row. Or you could
add an identity field with seed/increment of 1/1 and then delete all
rows with row numbers that divide perfectly by 12. Something like
that...
http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/847c4b9e-b2d7-4cdf-a193-e4ce14986ee2
Q:
I have an SSIS package that imports from an Excel file with data
beginning in the 7th row.
Unlike the same operation with a csv file ('Header Rows to Skip' in
Connection Manager Editor), I can't seem to find a way to ignore the
first 6 rows of an Excel file connection.
I'm guessing the answer might be in one of the Data Flow
Transformation objects, but I'm not very familiar with them.
A:
rbhro, actually there were
2 fields in the upper 5 rows that had some data that I think prevented
the importer from ignoring those rows completely.
Anyway, I did find a solution to my problem.
In my Excel source object, I used 'SQL Command' as the 'Data Access
Mode' (it's drop down when you double-click the Excel Source object).
From there I was able to build a query ('Build Query' button) that only grabbed records I needed. Something like this:
SELECT F4, F5, F6 FROM [Spreadsheet$] WHERE (F4 IS NOT NULL) AND (F4 <> 'TheHeaderFieldName')
Note: I initially tried an ISNUMERIC instead of 'IS NOT NULL', but
that wasn't supported for some reason.
In my particular case, I was only interested in rows where F4 wasn't
NULL (and fortunately F4 didn't contain any junk in the first 5
rows). I could skip the whole header row (row 6) with the 2nd WHERE
clause.
So that cleaned up my data source perfectly. All I needed to do now
was add a Data Conversion object in between the source and destination
(everything needed to be converted from unicode in the spreadsheet),
and it worked.
My first suggestion is not to accept a file in that format. Excel files to be imported should always start with column header rows. Send it back to whoever provides it to you and tell them to fix their format. This works most of the time.
We provide guidance to our customers and vendors about how files must be formatted before we can process them, and it is up to them to meet the guidelines as much as possible. People often aren't aware that files like that create a problem in processing (next month it might have six lines before the data starts), and they need to be educated that Excel files must start with the column headers, have no blank lines in the middle of the data, never repeat the headers multiple times and, most important of all, have the same columns with the same column titles in the same order every time. If they can't provide that, then you probably don't have something that will work for automated import, as you will get the file in a different format every time, depending on the mood of the person who maintains the Excel spreadsheet.
Incidentally, we push really hard to never receive any data from Excel (this only works some of the time, but if they have the data in a database, they can usually accommodate us). They also must know that any changes they make to the spreadsheet format will result in a change to the import package, and that they will be charged for those development changes (assuming these are outside clients and not internal ones). These changes must be communicated in advance and developer time scheduled; a file with the wrong format will fail and be returned to them to fix if it is not.
If that doesn't work, may I suggest that you open the file, delete the first two rows and save it as a text file. Then write a data flow that will process the text file. SSIS does a lousy job of supporting Excel, and anything you can do to get the file into a different format will make life easier in the long run.
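A sketch of that pre-processing step in Excel VBA (the paths and the sheet index are assumptions): open the workbook, delete the rows above the header row, and save the sheet as CSV for a flat-file connection to pick up.

    ' Sketch: strip the title/blank rows above the headers and save as CSV.
    Sub StripLeadingRowsAndSaveCsv()
        Dim wb As Workbook
        Set wb = Workbooks.Open("C:\inbound\MarchOrders.xlsx")
        wb.Worksheets(1).Rows("1:2").Delete        ' rows above the header row
        Application.DisplayAlerts = False          ' suppress the file-format prompt
        wb.SaveAs Filename:="C:\inbound\MarchOrders.csv", FileFormat:=xlCSV
        Application.DisplayAlerts = True
        wb.Close SaveChanges:=False
    End Sub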
My first suggestion is not to accept a file in that format. Excel files to be imported should always start with column header rows. Send it back to whoever provides it to you and tell them to fix their format. This works most of the time.
Not entirely correct.
SSIS forces you to use the format, and quite often it does not work correctly with Excel.
If you can't change the format, consider using our Advanced ETL Processor.
You can skip rows or fields and you can validate the data the way you want.
http://www.dbsoftlab.com/etl-tools/advanced-etl-processor/overview.html
Sky is the limit
You can just use the OpenRowset property, which you can find in the Excel Source properties.
Take a look here for details:
SSIS: Read and Export Excel data from nth Row
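For the sheet in the question, for example, an OpenRowset value along the lines of Sheet1$A3:D (assuming the tab is called Sheet1 and the four columns sit in A through D) would make the Excel Source start reading at the header row on row 3.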
Regards.