I have a CSV file with commas inside fields that are not enclosed in quotes. Unfortunately, I must parse this file and cannot get it replaced with a properly formatted one.
I really don't even know where to begin.
OK. What I'm seeing is the following: You have about 8,000 rows that essentially have a CSV syntax error in them. You can manually figure out which they are, but manually fixing 8,000 entries is a bit much.
The obvious first approach would be to see how it is that you can manually figure out which columns have this issue. If it is something you can define rules for, you are in business. If it's simple enough, you can write a small text editor macro to go through the file and do it for you. If your text editor doesn't support macros, use awk. If you are on Windows and don't have awk, then go get it.
If it is too complicated for that, fix your real problem. Go fix whatever generated this CSV file to generate it right. If it was someone else's code you don't have access to, tell them to fix it. "You are generating 8,000 unparsable entries" seems like a pretty good argument in my book. Sooner or later they will probably generate a new revision of this file for you to process, so this is really the Right Thing to do.
There's probably nothing you can do with it short of analyzing the records manually in a text editor. The comma delimiters are essentially useless if there is no discernible way to distinguish them from valid commas in the data.
If you can get a cleaner file from whoever created the bad one, that's probably far less trouble than trying to fix up the one you've got.
You could run an Excel macro to change the commas to some other character (let's say $, something not in your file) for the time being; then, once you've parsed the file, you could run the results through some code to change that character back into the original commas.
EDIT: I am assuming that you have access to the original file seeing as you've tagged excel here?
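If you do have the source workbook open in Excel, a macro along these lines would do the temporary swap. This is only a minimal sketch: it assumes the data sits on the active sheet and that $ never occurs in your data.

    Sub SwapCommasForPlaceholder()
        ' Replace literal commas in text cells with a stand-in character ($)
        ' before the sheet is saved out as CSV; reverse the swap after parsing.
        Dim cell As Range
        For Each cell In ActiveSheet.UsedRange
            If VarType(cell.Value) = vbString Then
                cell.Value = Replace(cell.Value, ",", "$")
            End If
        Next cell
    End Sub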
I think the best you can hope for is 80% automatic, which means you'll be doing over 1,000 manually in the best case. You just need to be clever about the data that's there. Read each line in and count the commas. If it's the right number, write the line out to a new file. If it's too many, send it to the exception handler.
Start with what you absolutely know about the data. Is the first column a TimeStamp? If you know that, you can go from "20 commas when there should be 18" to "19 commas when there should be 17". I know that doesn't exactly lift your spirits but it's progress. Is there a location, like a plant name, somewhere in there? Maybe you can develop a list from the good data and search for it in the bad data. If column 7 should be the plant name, go through your list of plant names and see if one of them exists. If so, count the commas between that and the start and between that and the end (or another good comma location that you've established).
If you have some unique data, you can use a regex to find its location in the string and, again, count the commas before and after to see if it's where it should be. For example, a Lat/Long reading or a part number in the format 99A99-999.
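To illustrate the first triage pass, here is a rough VBA sketch of the comma-count split described above. The file paths and the expected comma count are placeholders for your own data.

    Sub SplitGoodAndBadRows()
        ' Expected number of commas per well-formed row -- adjust for your layout.
        Const EXPECTED_COMMAS As Long = 17
        Dim inFile As Integer, goodFile As Integer, badFile As Integer
        Dim textLine As String
        inFile = FreeFile: Open "C:\data\input.csv" For Input As #inFile
        goodFile = FreeFile: Open "C:\data\good.csv" For Output As #goodFile
        badFile = FreeFile: Open "C:\data\exceptions.csv" For Output As #badFile
        Do While Not EOF(inFile)
            Line Input #inFile, textLine
            ' Count commas by comparing the length with and without them.
            If Len(textLine) - Len(Replace(textLine, ",", "")) = EXPECTED_COMMAS Then
                Print #goodFile, textLine
            Else
                Print #badFile, textLine
            End If
        Loop
        Close #inFile: Close #goodFile: Close #badFile
    End Sub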
If you can post five or ten rows of good data, maybe someone can suggest more specific ways to identify columns and their locations.
Good luck.
My customer has an issue with certain .csv files: Excel auto-detects data types and alters the data when the files are opened. The current workaround is to open an instance of Excel, open the file, and go through the many-step process of choosing data types.
There is no standard format for which data elements will be in each csv file, so I've been thinking up methods to write code that is fairly flexible. To keep this short, basically, I think I've got a good idea of how to make something flexible to support the customer's needs that involves running an append query in Access to dynamically alter/create specifications, but I cannot figure out how to obtain values for the "Start" and "Width" fields in the MSysIMEXColumns table.
Is there a function in VBA that can help me read a csv file and gather the column names along with the "Start" and "Width" values? Bonus if you can also help me plug those values into an Access table. Thanks for your help!
First of all... there is NO "easy csv to Excel" conversion when your customer has:
"...no standard format for which data elements will be in each csv file."
I used to work for a data processor where we combined thousands of different customer files, trying to shoehorn them into a structured database. The ways customers can mangle data are endless. And just when you think you've figured them out, they find new ways of mangling data.
I once had a customer who had the brilliant idea of storing their "Dead Beat" flag IN their Full Name field. And then they didn't tell us they did so. And then when we mailed the list out to their customers, they tried to blame us for not catching that. Can you imagine someone waking up one morning and getting junk mail addressed to "Dear, Dead Beat"?
But that's only one way "no standard format" customers can make it impossible to catch their errors. They can be notorious for mixing in text with number fields. They can be notorious for including invisible escape characters in text fields that make printers crash. And don't even get started on how many different ways abbreviations can cause data to be inconsistent.
Anyway... to answer your question:
If you are using CSV files, they are comma delimited. You don't need "Start" and "Width".
"Start" and "Width" are for Fixed Width files. And if your customer is giving you a fixed width file, they NEED to give you a "standard format". If they don't then you are just trying to mind read what they did. And while you can probably guess correctly most of the time, inevitably, you are going to guess wrong and your customer is going to try to blame you for the error.
Other than that, sometimes you just have to go through the long slog of having a human visually inspect things to make sure the convert went as planned. I'd also suggest lots of counts and groupings on your data afterwards to make sure they didn't do something unexpected.
Trying to convert undocumented files is a very difficult and time consuming task. It's why they are paying you big bucks to do it.
So to answer your question again, "Start" and "Width" are for Fixed Width files. If they are sending you Fixed Width files, they need to send specifications.
If it's a csv file, you don't need "Start" and "Width". The delimiter (usually a comma) is what separates your fields.
** EDIT **
Ok... thinking through this some more... I'll take a guess at what you are doing:
1) You create and save a generic spec in Access for delimited files.
2) You open your CSV file through VBA and read the new delimited header record with all the column header names.
3) You try to modify the MSysIMEXColumns table to include new fields and modify old ones.
4) You now run your import based on the new spec you created.
If that is the case, you need to do a couple of things:
First, understand that this is a dangerous thing to do. Access uses wizards to create its system tables. If you muck with these, you don't know how it might affect the wizards when they try to access those tables again. You are best off creating a new spec for each new file type, using the Access wizards.
Second, once you come to the conclusion that you are smarter than Microsoft (which is probably a good bet anyway), you can try to make a dynamic spec file.
Third, you NEED to make sure your spec record in MSysIMEXSpecs is correct. That means you need to have it set as a delimited file and have the correct delimiter in there. Plus you need to have the FileType correct. You need to know if it's Unicode or any number of other file types that your customer could be sending you.
And by "correct delimiter" I mean... try to get your customer to send you "pipe delimited" | files. If they send you "comma delimited" files, you run the risk of them sending you text fields with comments or addresses that include a comma in the data. Say they cut and paste a street address that has a comma in it... that has the fantastic effect of splitting that address into two fields and pushing ALL of your subsequent columns off by one. It's lots of work to figure out if your data is corrupted this way. Pipes are much less likely to be populated in your data by accident.
Fourth, assuming your MSysIMEXSpecs is correct, you can then modify your MSysIMEXColumns table. You will need to extract your column headers from your csv file. You'll need to know how many fields there are and their order. You can then modify your current records to have the new field names and add any new records for new fields, or delete records if there are fewer fields than before.
I'd suggest saving them all as text fields (DataType = 10) in a staging table. That way you can go back and do analysis on each field to see if they mixed text into numeric fields or any other kind of craziness that customers love to do.
And since you know your spec in MSysIMEXSpecs is a delimited field file, you can give each record a "Start" field equal to the sequence your Header record calls for.
Attributes  DataType  FieldName  IndexType  SkipColumn  SpecID  Start  Width
0           10        Rule       0          0           3       1      1
0           10        From Bin   0          0           3       2      1
0           10        toBin      0          0           3       3      1
0           10        zone       0          0           3       4      1
0           10        binType    0          0           3       5      1
0           10        Warehouse  0          0           3       6      1
0           10        comment    0          0           3       7      1
Thus, the first field will have a "Start" of 1. The second field will have a "Start" of 2. etc.
Then your "Width" fields will all have a length of 1. Since your file is a delimited file, Access will figure out the correct width when it does the import.
And as long as your SpecID is pointing to the correct delimited spec, you should be able to import any csv file.
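For what it's worth, here is a rough Access VBA sketch of steps 2 through 4. The file path, the spec name "MyDelimitedSpec", the staging table name, and SpecID = 3 (taken from the sample table above) are all assumptions you would replace with your own, and the caveat about mucking with the system tables still applies.

    Sub RebuildImportColumns()
        Const SPEC_ID As Long = 3          ' SpecID of your existing delimited spec
        Dim db As DAO.Database
        Dim f As Integer, i As Long
        Dim headerLine As String, fields() As String
        Set db = CurrentDb
        ' Read the header record from the csv file.
        f = FreeFile
        Open "C:\data\import.csv" For Input As #f
        Line Input #f, headerLine
        Close #f
        fields = Split(headerLine, ",")
        ' Throw away the old column definitions for this spec...
        db.Execute "DELETE FROM MSysIMEXColumns WHERE SpecID = " & SPEC_ID, dbFailOnError
        ' ...and add one text column (DataType 10) per header: Start = ordinal, Width = 1.
        For i = LBound(fields) To UBound(fields)
            db.Execute "INSERT INTO MSysIMEXColumns " & _
                "(Attributes, DataType, FieldName, IndexType, SkipColumn, SpecID, Start, Width) " & _
                "VALUES (0, 10, '" & Replace(Trim(fields(i)), "'", "''") & "', 0, 0, " & _
                SPEC_ID & ", " & (i + 1) & ", 1)", dbFailOnError
        Next i
        ' Then run the import against the spec name tied to SPEC_ID in MSysIMEXSpecs.
        DoCmd.TransferText acImportDelim, "MyDelimitedSpec", "tblStaging", "C:\data\import.csv", True
    End Sub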
Fifth, after the data is in your staging table, you should then do an analysis of each field to make sure you know what type of data you really have and that your suspected data type doesn't violate any data type rules. At that point you can transfer them to a second staging table where you convert to the correct data types. You can do this all through VBA. You'll probably need to create quite a few home grown functions to validate your data. (This is the NOT so easy part about "easy csv to Excel" coding.)
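As one hedged example of that kind of field-level check (the table and column names here are made up), you could flag staging rows whose supposedly numeric column will not convert cleanly:

    Sub FindNonNumericRows()
        Dim rs As DAO.Recordset
        ' List rows where the Quantity column is populated but is not numeric.
        Set rs = CurrentDb.OpenRecordset( _
            "SELECT * FROM tblStaging " & _
            "WHERE Len(Nz([Quantity], '')) > 0 AND Not IsNumeric([Quantity])")
        Do While Not rs.EOF
            Debug.Print rs!Quantity
            rs.MoveNext
        Loop
        rs.Close
    End Sub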
Sixth, after you are done all of your data massaging, you can now transfer the good data to your live database or Excel spreadsheet. Inevitably you'll always have some records that don't fit the rules and you'll have to have someone eyeball the data and figure out what to do with it.
You are at the very tip of the iceberg in data conversion management. You will soon learn about all kinds of amazingly horrible stuff. And it can take years to write automated procedures that catch all the craziness that data processors / customers send to you. There are a gazillion data mediums that they can send the data to you in. There are a gazillion different data types the data can be represented in. Along with a gazillion different formats that the data resides in. And that's all before you even get to think about data integrity.
You'll probably become very acquainted with GIGO (Garbage In, Garbage Out). If your customer tries to slip a new format past you (and they will) without telling you the "standard format for which data elements will be", you will be left trying to guess what the data is. And if it's garbage... best of luck trying to create an automated system for that.
Anyway, I hope the MSysIMEXColumns info helps. And if they ever give you Fixed Length files, just know you'll have to write a whole new system to get those into your database.
Best of luck :)
I have some data in an Excel file. Now I need to find its significance value, which is not possible with Excel; it is only possible with PSPP. But when I import my Excel file (after converting it to a csv file) into PSPP, it causes a lot of problems, especially with the variable names.
Could anyone please tell me some easy solutions?
Excel to PSPP: make sure your Excel file is prepared.
Red boxes in the import process of the csv file indicate problems in the rows, like words added in what otherwise looks like a variable with numeric values.
The variable names must not be too long, so fix that first; it works like a very old version of SPSS in this respect.
Then look carefully at the steps when importing.
Choose the second row to be the first (as the first row is variable names, and not a case).
Then click the box for choosing the top row as variable names.
It is less smooth than SPSS in the initial procedure, but it is fantastic to have a free alternative to SPSS.
When the dataset is ready, I found it worked well to do the same analyses as in SPSS.
I suppose your problem is solved a long time ago, but maybe someone else may benefit from this.
A site for sharing information about the PSPP would be great...
One should first convert the Excel file into CSV (perhaps through the Apple Mac software Numbers) and then import the converted file into PSPP... easy.
I have the following problem:
Part number: 625009E11
Excel rep: 6.25009E+16
I want to recover the original information. Is this possible? Or, does Excel automatically dump the original data if it can format things as a number? (I also have another, similar, problem with leading zeroes.)
You might be able to recover part numbers with some clever scripting, but for leading zeroes you're pretty much screwed. Excel "helpfully" tries to cast strings it recognizes as other data types and does not keep a copy of the original. This has been a known problem for almost a decade, especially in bio research: http://www.biomedcentral.com/1471-2105/5/80
Don't use Excel as a database, kids.
The format of our member numbers has changed several times over the years, such that 00008, 9538, 746, 0746, 00746, 100125, and various other permutations are valid, unique and need to be retained. Exporting from our database into the custom Excel template needed for a mass update strips the leading zeros, such that 00746 and 0746 are all truncated to 746.
Inserting the apostrophe trick, or formatting as text, does not work in our case, since the data seems to be already altered by the time we open it in Excel. Formatting as zip won't work since we have valid numbers less than five digits in length that cannot have zeros added to them. And I am not having any luck with "custom" formatting as that seems to require either adding the same number of leading zeros to a number, or adding enough zeros to every number to make them all the same length.
Any clues? I wish there was some way to set Excel to just take what it's given and leave it alone, but that does not seem to be the case! I would appreciate any suggestions or advice. Thank you all very much in advance!
UPDATE - thanks everybody for your help! Here are some more specifics. We are using a 3rd party membership management app -- we cannot access the database directly, we need to use their "query builder" tool to get the data we want to mass update. Then we export using their "template" format, which is called XLSX but there must be something going on behind the scenes, because if we try to import a regular old Excel, we get an error. Only their template works.
The data is formatted okay in the database, because all of the numbers show correctly in the web-based management tool. Also, if I export to CSV, save it as a .txt and import it into Excel, the numbers show fine.
What I have done is similar to ooo's explanation below -- I exported the template with the incorrect numbers, then exported as CSV/txt, and copied / pasted THOSE numbers into the template and re-imported. I did not get an error, which is something I guess, but I will not be able to find out if it was successful until after midnight! :-(
Assuming the data is not corrupt in the database, then try and export from the database to a csv or text file.
The following can then be done to ensure the import is formatted correctly
Text file with comma delimiter:
In Excel, choose Data / From Text and select Delimited, then Next.
In step 3 of the import wizard, for each column/field you want as text, highlight the column and select Text.
The data should then be placed as text and retain its leading zeros.
Again, all of this assumes the database contains non-corrupt data and you are able to export a simple text or csv file. It also assumes you have Excel 2010 but it can be done with minor variation across all versions.
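If you end up doing this often, the same wizard settings can be scripted from Excel VBA. A minimal sketch (the file path and the five-column layout are assumptions) that forces every column to Text so the leading zeros survive:

    Sub ImportMembersAsText()
        ' One Array(columnIndex, xlTextFormat) pair per column in the file.
        Workbooks.OpenText Filename:="C:\data\members.txt", _
            DataType:=xlDelimited, Comma:=True, _
            FieldInfo:=Array(Array(1, xlTextFormat), Array(2, xlTextFormat), _
                             Array(3, xlTextFormat), Array(4, xlTextFormat), _
                             Array(5, xlTextFormat))
    End Sub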
Hopefully, #ooo's answer works for you. I'm providing another answer mainly for informational purposes, and don't feel like dealing with the constraints on comments.
One thing to understand is that Excel is very aggressive about treating "numeric-looking" data as actual numbers. If you were to open the CSV by double-clicking and letting Excel do its thing (rather than using ooo's careful procedure), those numbers would still have come up as numbers (no leading zeros). As you've found, one way to counteract this is to append clearly nonnumeric characters onto your data (before Excel gets its grubby hands on it), to really convince Excel that what it's dealing with is text.
Now, if the thing that uploads to their software is a file ending in .xlsx, then most likely it is the current Excel format (a compressed XML document, used by Excel 2007 and later). I suppose by "regular old Excel" you mean .xls (which still works with the newer Excels in "compatibility mode").
So in case what you've tried so far doesn't work, there are still avenues to explore before resorting to appending characters to the end of your data. (I'll update this answer as needed.)
You're on the right track with the apostrophe.
You'll need to store your numbers in Excel as text at the time they are added to the file.
What are you using to create the original Excel file / export from the database?
This will likely be where your focus needs to be regarding your export.
For example, one approach is that you could potentially modify the database export to include the ' symbol prefix before the numbers so that Excel will know to display them as text.
I use the formula =TEXT(cell, format), where the format is a string of zeros as long as the desired field width, to add leading zeros.
Example: cell C2 has 12345 and I need it to be 10 characters long. I would put =TEXT(C2,"0000000000").
The result will be 0000012345.
I'm trying to import an Excel file into a SQL Server 2000 database using DTS. This is nothing fancy, just a straight import. Where I work, we do this 1,000 times a day. This procedure usually works without an issue, but something must have changed in the file.
I'm getting the below error:
Screenshot of the error: http://www.understandingguitar.org/wp-content/uploads/2008/12/packageerror-screenshot-20081212.jpg
I've checked to ensure that the column "AssignmentID" is stored as "text" in the Excel sheet. I've also tried changing it to General. I get the exact same error regardless of the setting. The field does contain numbers... I appreciate everyone's help on this!
Regards, Frank
Try opening the Excel file and looking at the column content.
Are any of the row values in that column right-aligned (as numbers generally are)?
I am guessing that such a row could be the problem.
It may be obvious, but is the destination string long enough to hold the string representation of the float? I'm not sure if Excel is rounding what it displays to you, so it may be worth trying with a wider column.
The answer has something to do with the fact that the procedure is expecting text, but even if you set the properties (in the Format dialog) to "text", Excel may not handle the data as text. And hence, SQL Server (or the libraries) won't receive it as text.
When the procedure tries to import it, the system decides that it is converting from a number to text and expects that data may be lost (even though no data will be lost), and the error is raised.
I figured out that I can get around this error by placing a ' (apostrophe) before each listed number [i.e. '124321]. This forces Excel to treat the number as text.
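For anyone who wants to script that workaround, here is a quick sketch (the sheet name and the range holding AssignmentID are assumptions) that prefixes each value with an apostrophe so Excel stores it as text:

    Sub PrefixAssignmentIDs()
        Dim cell As Range
        For Each cell In Worksheets("Sheet1").Range("A2:A5000")
            If Len(cell.Value) > 0 Then
                ' The leading apostrophe is not stored in the value; it just
                ' tells Excel to keep the entry as text.
                cell.Value = "'" & CStr(cell.Value)
            End If
        Next cell
    End Sub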
Hopefully this will save others the headache I now have from this. :-)
Regards, Frank