I am trying to remove multiple special characters from a CSV file that I am copying into a table I created in PostgreSQL. I have about 4 CSV files like this, each with 100,000 rows and 10 columns. I am getting errors every 50-100 rows, and I don't know what all the special characters are, as this is a large data set. Is there any way I can just delete these, or create something in Excel/CSV to delete these? I am afraid that I will be deleting important data.
What would be the best code?
Thanks!
Brook
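For what it's worth, below is a minimal sketch of one way to clean the files in Python before running COPY, assuming the failures come from control characters or bytes that are not valid UTF-8. The file pattern and output names are placeholders, and the cleaned data is written to new files so nothing in the originals is lost:

```python
# Sketch: strip ASCII control characters (except tab and line endings) from each
# CSV and drop bytes that are not valid UTF-8, writing cleaned copies alongside
# the originals. Adjust the glob pattern to match your four files.
import glob
import re

# Control characters other than tab (\x09), LF (\x0a) and CR (\x0d).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

for path in glob.glob("*.csv"):
    with open(path, "r", encoding="utf-8", errors="ignore", newline="") as src:
        text = src.read()
    cleaned = CONTROL_CHARS.sub("", text)
    out_path = path.replace(".csv", "_clean.csv")
    with open(out_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(cleaned)
    print(f"wrote {out_path}")
```

Since the originals are kept, you can diff a cleaned file against its source if you are worried that something important was removed.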
I am currently trying to push content from a set of Excel tables into a database. To do that, I first try to convert the tables to .csv for further processing in Python. However, the tables contain cells with multiple line breaks, which apparently causes the resulting .csv file to be garbled (newlines that do not contain any separators). Manually removing all line breaks prior to exporting to CSV solves the problem. However, since there are 100+ files to be processed, I would like to automate this with some kind of VBA/VBS script (which I failed miserably to write for several hours).
How can I batch-remove all line breaks from a bunch of Excel files?
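Since Python is already part of the pipeline, one option is to do both steps at once: read each workbook, replace line breaks with spaces, and write each sheet straight to CSV. This is only a sketch; the folder layout, output naming, and the choice of openpyxl are assumptions rather than part of your current setup:

```python
# Sketch: for every .xlsx file in a folder, replace line breaks inside cells
# with spaces and export each worksheet as a CSV. Requires openpyxl
# (pip install openpyxl). Paths and naming are placeholders.
import csv
import glob
import openpyxl

for path in glob.glob("input/*.xlsx"):
    wb = openpyxl.load_workbook(path, data_only=True)
    for ws in wb.worksheets:
        out_name = f"{path[:-5]}_{ws.title}.csv"
        with open(out_name, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            for row in ws.iter_rows(values_only=True):
                writer.writerow([
                    "" if v is None else str(v).replace("\r", " ").replace("\n", " ")
                    for v in row
                ])
        print(f"wrote {out_name}")
```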
I am trying to set up a query that will simply combine data from CSVs into a table as new files get added to a specific folder, where each row contains the data from a separate file. While doing tests with CSVs that I created in Excel, this was very simple: after expanding the content column, I would see an individual row of data for each file.
In practice, however, where I am trying to use CSVs produced by a proprietary Android app, expanding the content column leads to a single row, with data from all files placed end to end.
Does this have something to do with there not being an "end of line" character in the CSVs the app is producing? If so, is there an easy way to remedy this without changing the app? If not, is there something simple and direct I can ask the developer to change which would prevent this behavior?
Thanks for any insight!
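Missing end-of-line characters would explain it: a CSV parser can only split records where it finds line terminators, so a file without any collapses into a single row. Before asking for a change, it may be worth confirming with a quick look at the raw bytes; the file name below is a placeholder:

```python
# Quick diagnostic sketch: count the end-of-line bytes in one of the app's CSVs.
# If both counts are zero, every record really is on one physical line.
with open("export_from_app.csv", "rb") as f:
    data = f.read()

print("CR (\\r) count:", data.count(b"\r"))
print("LF (\\n) count:", data.count(b"\n"))
print("ends with a line terminator:", data.endswith((b"\n", b"\r")))
```

If that turns out to be the case, asking the developer to terminate each record with CRLF (\r\n), which is what RFC 4180 specifies for CSV, would be a simple and standard fix.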
I have 5 folders, and each folder contains around 20 Excel sheets.
These Excel sheets contain duplicates within them. It is becoming very tedious to open every file and remove the duplicates.
Is there any other way to remove duplicates from all these files at once?
All the files contain different sets of duplicates, and no common columns are present.
I really do understand your situation, but I think the solution will be one of two things:
1- Write a program in any programming language you can use, and have it load the files one by one to do what you want (see the sketch after this list).
2- (The easier one) Find a good converter to convert all your files to SQL tables, then come back to this site and ask how to delete duplicated rows from different SQL tables. After doing that, convert the SQL tables back to Excel files and it will be done.
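If you go with option 1 and Python is available, a pandas sketch along these lines could work; the folder pattern and output names are assumptions, and because there are no common key columns, duplicates are judged across all columns of each sheet:

```python
# Sketch: drop duplicate rows from every sheet of every Excel file found by the
# glob pattern, writing a "_dedup" copy so the originals are left untouched.
# Requires pandas and openpyxl (pip install pandas openpyxl).
import glob
import pandas as pd

for path in glob.glob("folder*/*.xlsx"):           # placeholder folder layout
    sheets = pd.read_excel(path, sheet_name=None)  # dict of {sheet name: DataFrame}
    out_path = path.replace(".xlsx", "_dedup.xlsx")
    with pd.ExcelWriter(out_path) as writer:
        for name, df in sheets.items():
            df.drop_duplicates().to_excel(writer, sheet_name=name, index=False)
    print(f"wrote {out_path}")
```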
I see that the number of rows in a worksheet is limited to 1,048,576.
Is this just an Excel thing? For example, can I create a CSV file that has more rows, say 5 million rows? I understand I can't open it with Excel, but can I still have the file and access it some other way (say, with C++)?
I assume this is feasible, as CSV is not necessarily an Excel thing, right?
Thanks in advance.
A CSV file is simply a text file formatted in a certain way. The row limit is an artificial limitation of Excel; there is no such limit on the size of a CSV file.
Excel is most certainly not the only program that can open or create a CSV file. If you create a CSV file with something other than Excel, you can have as many rows or fields as you wish.
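For example, here is a small Python sketch (any language, including C++, would do the same) that writes a 5-million-row CSV and reads it back; the file name and contents are just placeholders:

```python
# Sketch: write a CSV far past Excel's 1,048,576-row limit, then count the rows
# on the way back in, to show the format itself imposes no limit.
import csv

with open("big.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "value"])
    for i in range(5_000_000):
        writer.writerow([i, i * 2])

with open("big.csv", newline="") as f:
    row_count = sum(1 for _ in csv.reader(f))

print("rows including header:", row_count)  # 5,000,001
```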
I have a data flow task set up in SSIS.
The source is an Excel source, not a SQL database.
The problem I seem to have is that the package is importing empty rows.
My data occupies 555,200 rows; however, when importing, the SSIS package brings in over 900,000 rows. The extra rows are imported even though they are empty.
When I then download this table into Excel, there are empty rows in between the data.
Is there any way I can avoid this?
Thanks
Gerard
The best thing to do, if you can, is to export the data to a flat file, CSV or tab-delimited, and then read that in. The problem is that even though those rows are blank, they are not really empty, so when you hop across that ODBC-Excel bridge you get those rows back as blanks.
You could possibly adjust the way the spreadsheet is generated to eliminate this problem, or manually delete the rows, but those solutions are not scalable or maintainable over the long term, and you will still be stuck with that rickety ODBC bridge. The best long-term solution is to avoid using the ODBC-Excel bridge entirely. By dumping the data to a flat file you have total control over how to read, validate, and interpret the data, and you will not be at the mercy of a translation layer that is to this day riddled with bugs and, at the best of times, "quirky".
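As a small illustration of the control a flat file gives you, filtering out the blank-but-not-empty rows before the load becomes a few lines in whatever language your process uses; in this Python sketch the file name, delimiter, and the rule for what counts as blank are assumptions:

```python
# Sketch: read a tab-delimited export and keep only rows where at least one
# field contains non-whitespace data.
import csv

with open("export.txt", newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")
    rows = [row for row in reader if any(cell.strip() for cell in row)]

print(f"kept {len(rows)} non-blank rows")
```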
You can also add a Conditional Split component to your data flow task, between the source and the destination. In it, check whether some column is null or empty, using a column that is consistent, meaning that for every valid row it has some data and for every invalid row it is empty or null.
Then discard the output for that condition and send the rest of the rows to the destination. You should then get only the rows with valid data from Excel.