Totaling figures in .csv files using Excel - excel

I have 12 .csv files produced by another program. The .csv files contain numeric data, separated by commas.
I need an easy way of totaling the values in certain columns in each of the files and comparing the totals across the various files e.g. compare the total from file 1 to the total from file 5.
The format of each file is the same i.e. 5 values in each record, separated by commas. Each of the 12 .csv files is about 50 Mb in size. Each file has a different number of records.
The environment I work in is 'secure' and I cant run any programs other than what I have installed on the PC I use. I have Excel installed and assume I can write VBA code/macros and I have access to the Command line. I can't (for example) load anything from a USB key and can not install any scripting language e.g. Python.
I have thought of doing this manually e.g. open each .csv file in Excel and total the columns using Excel functions i.e. SUM()
My challenge I need to do this many times of the next few weeks as new versions of the .csv files are produced i.e. I now have the first version, there will be many versions of the 12 files produced as I conduct testing on the other system. For each new version I need to sum the data and compare across files.
Last thing to say is, I cant change the system that produces the .csv files e.g. to create a set of totals
I'm looking for a programming solution that I can use, given my limited resources (ability to use any tools other than what is already on the PC)

You should be able to do this easily using an excel VBA macro but it might take quite some time if it needs to load and convert a 50MB csv file.
JScript (a microsoft form of JavaScript) is generally available on all machines and runs under the windows scripting host. Just create a file with a .js extension and try to run with a double click. Or you can use vbscript with a .vbs extension.
I think your easiest solution would be to write an excel macro (as you will have the IDE for excel vba as limited as it is).

Powershell or a batch script? A CSV is nothing more than a text file split with commas. Should be fairly easy to knock something up.

ADO can work on CSV files and you could then use SQL statements to sum the appropriate values - see this MSDN article for full details.
If you go to the Visual Basic Editor in Excel then try to add a reference via the Tools menu you should have several for Microsoft ActiveX Data Objects (2.8 being the most recent one.) Adding that reference lets you use ADO.

Related

Pentaho Data Inegration - Multiple Excel File Inputs Loading

I've been using Spoon as a tool to complete a project. One of the requirements is to load multiple Excel files, that have the same format (sheets), in order to output it to a Table Output.
However the number of Excel Files has to be variable (requirement) but they are located on the same folder. Which step(s) allows to load all the Excel files that are on a folder?
Thanks.
The Microsoft Excel input step support reading all files in a folder, or some based on regular expressions. You can also read all files including subfolders.

Excel Files and Visual Basic

I have never used Visual Basic before but could do with a pointer on where to begin.
I have 750 excel spreadsheets that contains various amounts of data of different types. The columns are always the same, but the number of data rows vary per spreadsheet. I need to extract data and put it into two new spreadsheets.
Obviously to do this 750 times manually would be a nightmare. I just want to run a script that can do it for me and thus thought of Visual Basic although i've never used it before.
My specific questions are:
What type of command should i research that would allow me to copy data where the row number to start at varies (as data above varies in no of rows). There is a title before this new data - how can i get it to search for this title and then choose the row below?
Would all my spreadsheets have to be in one folder so that the script goes through them all, or can i have some kind of folder structure in that folder too?
Anyone recommend any good resources for me to get to grips with visual basic and grasp what i need to do?
thanks
Tom
So the compilation task got easier with the introduction of MS PowerQuery. If you are using MS Excel 2013, you already have this. If no, you should download it and use the extension from MS.
The following guide outlines how to Using Power Query to Combine Data from Multiple Excel Files into One Table. This means that with Power Query (PQ), MS has taken and enabled easy aggregation using a few simple button clicks. PQ is a lightweight alternative to a lot of tasks that used to require VBA.
In this example, you will use PQ to point to an entire folder (750 should be no problem) worth of commonly formatted Excel files. The only limitation is that each data file should have a similarly named tab.
I won't repeat the details of the guide for how to do it, as it is in-depth and visual. But if you run into issues, get in touch.

Excel CSV. file with more than 1,048,576 rows of data

I have been given a CSV file with more than the MAX Excel can handle, and I really need to be able to see all the data. I understand and have tried the method of "splitting" it, but it doesnt work.
Some background: The CSV file is an Excel CSV file, and the person who gave the file has said there are about 2m rows of data.
When I import it into Excel, I get data up to row 1,048,576, then re-import it in a new tab starting at row 1,048,577 in the data, but it only gives me one row, and I know for a fact that there should be more (not only because of the fact that "the person" said there are more than 2 million, but because of the information in the last few sets of rows)
I thought that maybe the reason for this happening is because I have been provided the CSV file as an Excel CSV file, and so all the information past 1,048,576 is lost (?).
DO I need to ask for a file in an SQL database format?
You should try delimit it can open up to 2 billion rows and 2 million columns very quickly has a free 15 day trial too. Does the job for me!
I would suggest to load the .CSV file in MS-Access.
With MS-Excel you can then create a data connection to this source (without actual loading the records in a worksheet) and create a connected pivot table. You then can have virtually unlimited number of lines in your table (depending on processor and memory: I have now 15 mln lines with 3 Gb Memory).
Additional advantage is that you can now create an aggregate view in MS-Access. In this way you can create overviews from hundreds of millions of lines and then view them in MS-Excel (beware of the 2Gb limitation of NTFS files in 32 bits OS).
Excel 2007+ is limited to somewhat over 1 million rows ( 2^20 to be precise), so it will never load your 2M line file. I think that the technique you refer to as splitting is the built-in thing Excel has, but afaik that only works for width problems, not for length problems.
The really easiest way I see right away is to use some file splitting tool - there's tons of 'em and use that to load the resulting partial csv files into multiple worksheets.
ps: "excel csv files" don't exist, there are only files produced by Excel that use one of the formats commonly referred to as csv files...
You can use PowerPivot to work with files of up to 2GB, which will be enough for your needs.
First you want to change the file format from csv to txt. That is simple to do, just edit the file name and change csv to txt. (Windows will give you warning about possibly corrupting the data, but it is fine, just click ok). Then make a copy of the txt file so that now you have two files both with 2 millions rows of data. Then open up the first txt file and delete the second million rows and save the file. Then open the second txt file and delete the first million rows and save the file. Now change the two files back to csv the same way you changed them to txt originally.
I'm surprised no one mentioned Microsoft Query. You can simply request data from the large CSV file as you need it by querying only that which you need. (Querying is setup like how you filter a table in Excel)
Better yet, if one is open to installing the Power Query add-in, it's super simple and quick. Note: Power Query is an add-in for 2010 and 2013 but comes with 2016.
If you have Matlab, you can open large CSV (or TXT) files via its import facility. The tool gives you various import format options including tables, column vectors, numeric matrix, etc. However, with Matlab being an interpreter package, it does take its own time to import such a large file and I was able to import one with more than 2 million rows in about 10 minutes.
The tool is accessible via Matlab's Home tab by clicking on the "Import Data" button. An example image of a large file upload is shown below:
Once imported, the data appears on the right-hand-side Workspace, which can then be double-clicked in an Excel-like format and even be plotted in different formats.
I was able to edit a large 17GB csv file in Sublime Text without issue (line numbering makes it a lot easier to keep track of manual splitting), and then dump it into Excel in chunks smaller than 1,048,576 lines. Simple and quite quick - less faffy than researching into, installing and learning bespoke solutions. Quick and dirty, but it works.
Try PowerPivot from Microsoft. Here you can find a step by step tutorial. It worked for my 4M+ rows!
"DO I need to ask for a file in an SQL database format?" YES!!!
Use a database, is the best option for this problem.
Excel 2010 specifications .
Use MS Access. I have a file of 2,673,404 records. It will not open in notepad++ and excel will not load more than 1,048,576 records. It is tab delimited since I exported the data from a mysql database and I need it in csv format. So I imported it into Access. Change the file extension to .txt so MS Access will take you through the import wizard.
MS Access will link to your file so for the database to stay intact keep the csv file
The best way to handle this (with ease and no additional software) is with Excel - but using Powerpivot (which has MSFT Power Query embedded). Simply create a new Power Pivot data model that attaches to your large csv or text file. You will then be able to import multi-million rows into memory using the embedded X-Velocity (in-memory compression) engine. The Excel sheet limit is not applicable - as the X-Velocity engine puts everything up in RAM in compressed form. I have loaded 15 million rows and filtered at will using this technique. Hope this helps someone... - Jaycee
I found this subject researching.
There is a way to copy all this data to an Excel Datasheet.
(I have this problem before with a 50 million line CSV file)
If there is any format, additional code could be included.
Try this.
Sub ReadCSVFiles()
Dim i, j As Double
Dim UserFileName As String
Dim strTextLine As String
Dim iFile As Integer: iFile = FreeFile
UserFileName = Application.GetOpenFilename
Open UserFileName For Input As #iFile
i = 1
j = 1
Check = False
Do Until EOF(1)
Line Input #1, strTextLine
If i >= 1048576 Then
i = 1
j = j + 1
Else
Sheets(1).Cells(i, j) = strTextLine
i = i + 1
End If
Loop
Close #iFile
End Sub
You can try to download and install TheGun Text Editor. Which can help you to open large csv file easily.
You can check detailed article here https://developingdaily.com/article/how-to/what-is-csv-file-and-how-to-open-a-large-csv-file/82
Split the CSV into two files in Notepad. It's a pain, but you can just edit each of them individually in Excel after that.

writing to excel file from a two dimentional array VB.net

I have a two dimentional array formed by iterating a data reader. Earlier i was using automation to write to excel and using range, i was able to write the contents of two dimentional array to excel in one shot. This improves the performance a lot because of only one interaction with excel. but came across a problem that my server does not have office installed, so am trying a different alternative using openxml(as i justneed to install only one dll in this case).
Online i saw few example of using the openxml, but i am not sure if there is a way to directly transfter the contents of two dimentional array to the worksheet. i don't want to iterate the datareader and update each cell by cell as i have 65 columns and almost 90000 rows.
So does the SDK offer any inbuild command to do this?
you should't fear the iteration because there's no longer an "interaction with excel" dcom penalty. The open xml is just writing to a stream, which you can buffer to save flushing to disk.
fyi i've personally used closed xml (nuget # http://nuget.org/packages/ClosedXML ) to create Excel files and found it much better than working with the raw Open XML standard.
finally, even if you had excel on the server you should never use excel as a dcom server in a no UI environment.

Tool for transforming Excel files? (swapping columns, basic string manipulation etc)

I need to import tabular data into my database. The data is supplied via spreadsheets (mostly Excel files) from multiple parties. The format of each of these files is similar but not the same and various transformations will be necessary to massage the data into the final format suitable for import. Furthermore the input formats are likely to change in the future. I am looking for a tool that can be run and administered by regular users to transform the input files.
Now let me list some of the transformations I am looking to do:
swap columns:
Input is:
|Name|Category|Price|
|data|data |data |
Output is
|Name|Price|Category|
|data|data |data |
rename columns
Input is:
|PRODUCTNAME|CAT |PRICE|
|data |data|data |
Output is
|Name|Category|Price|
|data|data |data |
map columns according to a lookup table, like in the above examples:
replace every occurrence of the string "Car" by "automobile" in the column Category
basic maths:
multiply the price column by some factor
basic string manipulations
Lets say that the format of the Price column is "3 x $45", I would want to split that into two columns of amount and price
filtering of rows by value: exclude all rows containing the word "expensive"
etc.
I have the following requirements:
it can run on any of these platform: Windows, Mac, Linux
Open Source, Freeware, Shareware or commercial
the transformations need to be editable via a GUI
if the tool requires end user training to use that is not an issue
it can handle on the order of 1000-50000 rows
Basically I am looking for a graphical tool that will help the users normalize the data so it can be imported, without me having to write a bunch of adapters.
What tools do you use to solve this?
The simplest solution IMHO would be to use Excel itself - you'll get all the Excel built-in functions and macros for free.Have your transformation code in a macro that gets called via Excel controls (for the GUI aspect) on a spreadsheet. Find a way to insert that spreadsheet and macro in your client's Excel files. That way you don't need to worry about platform compatibility (it's their file, so they must be able to open it) and all the rest. The other requirements are met as well. The only training would be to show them how to enable macros.
The Mule Data Integrator will do all of this from a csv file. So you can export your spreadsheet to a CSV file, and load the CSV file ito the MDI. It can even load the data directly to the database. And the user can specify all of the transformations you requested. The MDI will work fine in non-Mule environments. You can find it here mulesoft.com (disclaimer, my company developed the transformation technology that this product is based on).
You didn't say which database you're importing into, or what tool you use. If you were using SQL Server, then I'd recommend using SQL Server Integration Services (SSIS) to manipulate the spreadsheets during the import process.
I tend to use MS Access as a pipeline between multiple data sources and destinations - but you're looking for something a little more automated. You can use macros and VB script with Access to help through a lot of the basics.
However, you're always going to have data consistency problems with users mis-interpreting how to normalize their information. Good luck!

Resources