For analysis of test data - excel

I have about 2000 text files, each with 20000+ lines and 8 columns. Each file has to be opened, read for a calculation, and closed, and the result of the calculation has to be stored in Excel.

Your question is so vague that I do not understand what sort of answer you expect.
For example:
are all 2000 files in the same folder,
do they have particular names,
do they have to be processed in a particular sequence,
how are the rows within the files divided into columns,
what calculations are required,
how are results to be stored?
Do not come back with answers to these questions. This site does not provide a free coding service. Instead split your requirement into little steps and look for help with each step.
Step 1 will be to find the 2000 files. The last related question (see on the right) is "How can I extract data from multiple files in a folder of excel", which shows how to find every file in a folder and open it as a workbook. That code is probably the outline structure for your macro. You will have to look up GetFolder to find the exact syntax, but that is not too difficult.
Try to develop the macro you require. Come here with specific problems and code that does not function as you wish and you will find people who will help.
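To make the "little steps" concrete, here is a minimal sketch of the outline in Python rather than VBA. Everything specific in it is an assumption, since the question does not say: the files are taken to be whitespace-delimited `.txt` files in one folder, the "calculation" is a placeholder (the mean of the first column), and the results go to a CSV file that Excel can open. The names `summarize_file` and `summarize_folder` are made up for illustration.

```python
import csv
from pathlib import Path


def summarize_file(path):
    """Return (row_count, mean of first column) for one whitespace-delimited file."""
    total = 0.0
    count = 0
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 8:
                continue  # skip blank or malformed lines (8 columns assumed)
            try:
                value = float(fields[0])
            except ValueError:
                continue  # skip header lines or non-numeric rows
            total += value
            count += 1
    mean = total / count if count else 0.0
    return count, mean


def summarize_folder(folder, pattern="*.txt", out_name="results.csv"):
    """Process every matching file in folder and write one result row per file."""
    folder = Path(folder)
    out_path = folder / out_name
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["file", "rows", "mean_col1"])
        for path in sorted(folder.glob(pattern)):
            count, mean = summarize_file(path)
            writer.writerow([path.name, count, mean])
    return out_path
```

Each file is opened, read line by line, and closed before the next one is touched, so memory use stays flat however many files there are. Replace the body of `summarize_file` with whatever calculation you actually need.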

Related

Extracting specific data from a pdf into excel

I need to extract select data from a pdf form into excel. Eventually, the data gathered will be used in another step (excel table) as part of an additional calculation.
I am hoping to find a way to automate this process, so I tried importing the PDF file into Excel using Power Query. Unfortunately, each time I loaded the PDF, I got a message (Page is blank).
After some initial searching, I found out that this may be due to the way the PDF file was originally built (not as a table converted to a PDF).
I went back and converted the PDF file into a spreadsheet, and now I can actually see the data that I need to extract in Excel, but it needs a lot of cell formatting and rearranging.
I would really like to know if there is an alternative way of solving this problem. More importantly, I'd very much appreciate any bright ideas or recommendations on how best to tackle this task, since I have to repeat the same process 30+ times.
Also, I don't have a lot of coding experience; my knowledge is very minimal.
Thank you so much

How can I separate an Excel document into different files?

I have an Excel document that has this structure:
[Customer name] [Customer street] [Customer post code]
It has a couple of thousand rows, and I need to separate them into their own files.
I have tried to find someone else that has made the same thing or similar, but without success.
I want to create a script that asks me "Which post code interval do you wanna export?" and answer to that might be "100-300". Then the script exports those rows into a new txt/csv-file that is tab or comma delimited.
Is this even possible in Excel? I am a developer, but not an Excel developer :D haha.
Would be so grateful to receive some examples on how to do this.
You can use the filter option in your Excel file. It has several options; for example, you can set a filter so that the post code must be greater than 100 and less than 300.
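If you would rather script it than filter by hand, the logic is short. The following is only a sketch in Python, not Excel VBA, and it assumes the rows have already been read into `(name, street, post_code)` tuples and that post codes are numeric; `parse_interval` and `export_postcode_range` are names made up for this example.

```python
import csv


def parse_interval(text):
    """Turn an answer such as "100-300" into the pair (100, 300)."""
    low, high = (int(part) for part in text.split("-", 1))
    return low, high


def export_postcode_range(rows, low, high, out_path, delimiter="\t"):
    """Write rows whose post code lies in [low, high] to a delimited text file."""
    kept = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter=delimiter)
        for name, street, post_code in rows:
            try:
                code = int(str(post_code).replace(" ", ""))
            except ValueError:
                continue  # skip rows whose post code is not numeric
            if low <= code <= high:
                writer.writerow([name, street, post_code])
                kept += 1
    return kept
```

Pass `delimiter=","` for a comma-separated file instead of a tab-separated one. The interval here is inclusive at both ends; change the comparison if you want "greater than 100 and less than 300" strictly.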

Excel Files and Visual Basic

I have never used Visual Basic before but could do with a pointer on where to begin.
I have 750 Excel spreadsheets that contain various amounts of data of different types. The columns are always the same, but the number of data rows varies per spreadsheet. I need to extract data and put it into two new spreadsheets.
Obviously, doing this 750 times manually would be a nightmare. I just want to run a script that can do it for me and so thought of Visual Basic, although I've never used it before.
My specific questions are:
What type of command should I research that would allow me to copy data where the starting row number varies (as the data above it varies in number of rows)? There is a title before this new data. How can I get it to search for this title and then choose the row below?
Would all my spreadsheets have to be in one folder so that the script goes through them all, or can I have some kind of folder structure in that folder too?
Can anyone recommend any good resources for me to get to grips with Visual Basic and grasp what I need to do?
thanks
Tom
The compilation task got easier with the introduction of MS Power Query. If you are using MS Excel 2013, you already have this. If not, you should download the extension from MS and use it.
The guide "Using Power Query to Combine Data from Multiple Excel Files into One Table" outlines how to do this. With Power Query (PQ), MS has enabled easy aggregation using a few simple button clicks. PQ is a lightweight alternative for a lot of tasks that used to require VBA.
In this example, you will use PQ to point to an entire folder (750 files should be no problem) of commonly formatted Excel files. The only limitation is that each data file should have a similarly named tab.
I won't repeat the details of the guide for how to do it, as it is in-depth and visual. But if you run into issues, get in touch.
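On the two specific questions (finding the title row, and subfolders), the logic is the same whatever tool you use. Here it is sketched in Python rather than Visual Basic, purely as an illustration. It assumes each sheet has already been loaded as a list of rows and that the title sits in the first column; `rows_below_title` and `find_workbooks` are hypothetical helper names.

```python
from pathlib import Path


def rows_below_title(rows, title):
    """Return the rows that follow the first row whose first cell equals title."""
    for index, row in enumerate(rows):
        if row and str(row[0]).strip() == title:
            return rows[index + 1:]
    return []  # title not found


def find_workbooks(root, pattern="*.xlsx"):
    """List matching files in root and in all of its subfolders."""
    return sorted(Path(root).rglob(pattern))
```

So no, the files do not all have to sit in one flat folder: a recursive search such as `rglob` walks the whole tree under the starting folder.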

Google Script for spreadsheet to extract data from a PDF page linked on the sheet

I have a series of PDF files uploaded to Google Drive (and also stored on my computer here) in different rows of a Google or Excel spreadsheet. Each row has a distinct PDF file linked to it. What I want to figure out is a way to extract 5 rows of data (not a table) from the PDF and add them to certain columns on the sheet:
Here's a sample pdf:
https://www.dropbox.com/s/2j7pqeja38jxmzc/Sample.pdf?dl=0
The sheet looks like this:
https://www.dropbox.com/s/40u1n7umacd74kw/Sample%20sheet.xlsx?dl=0
So the process would be: Excel opens the linked file in row 1, extracts the data needed, then adds the data to certain columns in the Excel/Google spreadsheet.
I was just wondering if this is possible. The PDF has lots of pages, but I only need data from a single page in it.
If this doesn't work in Excel/Google spreadsheet, any suggestion how I can automate this process?
PS: I'm not asking for the exact way to do it, because I know that's a violation here, just wanted to know if this is something possible and can be done in Excel or Google spreadsheet. If not, any suggestion will greatly help. Thanks!
Yes, it's possible, but it depends a lot on the PDF, which I imagine would be the biggest hurdle. You'll probably find this answer is at least relevant, if not exactly what you're looking for.
Otherwise, if everything is stored in Drive, it's just a question of:
1) Looping through the sheet and opening the doc you want.
2) Getting the content of the PDF (Probably as a string).
3) Finding a consistent way to cut the relevant data from the PDF (This depends a lot on the content of the PDFs).
4) Pasting the data to the sheet.
Number 3 might be your biggest challenge, but once you get started you might find it to be a lot easier than you'd think.
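To give an idea of what step 3 can look like once you have the PDF content as a string, here is a rough sketch in Python (the same idea carries over to regular expressions in Apps Script). It assumes the wanted values appear on lines of the form `Label: value`, which may well not match your PDF; the function name and the labels are invented for the example.

```python
import re


def extract_fields(text, labels):
    """Return {label: value} for each 'Label: value' line found in text."""
    found = {}
    for label in labels:
        # Match the label at the start of a line, then capture the rest of that line.
        pattern = r"^[ \t]*" + re.escape(label) + r"[ \t]*:[ \t]*(.+?)[ \t]*$"
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match:
            found[label] = match.group(1)
    return found
```

Labels that are not found are simply left out of the result, so you can check for missing keys before pasting anything into the sheet.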

Working with Office "open" XML - just how hard is it?

I'm considering replacing a (very) large body of Office-automation code with something that works with the Office XML format directly. I'm just starting out, but already I'm worried that it's too big a task.
I'll be dealing with Word, Excel and PowerPoint. So far I've only looked at Word and Excel. It looks like Word documents should be reasonably easy to manipulate, but Excel workbooks look like a nightmare. For example...
In Word, it looks like you could delete a paragraph simply by deleting the corresponding "w:p" tag. However, the supplied code snippet for deleting a row in Excel takes about 150 lines of code(!).
The reason the Excel code is so big is that deleting a row means updating the row indexes of all the subsequent rows, fixing up the "shared strings" table, etc. According to a comment at the top, the code snippet is not even complete, in that it won't deal with a workbook that has tables in it (I can live with that).
What I'm not clear on is whether that's the only restriction that the sample code has. For example, would there also be a problem if the workbook contained a Pivot Table? Or a chart that references data from the same sheet? Or some named ranges? Wouldn't you also have to update the formulae for any cells (etc.) that referenced a row whose row index had changed?
[That's not to mention the "calc chain", which (thankfully) I think you can simply delete since it's only a cache that can be rebuilt.]
And that's my question, woolly though it is. Just how hard do you have to work to do something as simple as deleting a row properly? Is it an insurmountable task?
Also, if there are other, similar issues either with Excel or with Word or PowerPoint, I'd love to hear about them now, before I waste too much time going down a blind alley. Thanks.
Having worked with the Open XML SDK 2.0 for almost two years now I can say that doing seemingly trivial tasks can take many hours and sometimes days to figure out how to do it properly. For example, deleting an Excel row should be fairly straightforward and easy to do right? Nope because not only do you need code to delete your row, but then you have to update all the row indices, update any merged cell references, update hyperlink references, etc. Our internal delete method is close to 500 lines of code to just delete a row and I'm sure we don't have all the cases accounted for either.
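To show where the line count comes from, here is a deliberately incomplete sketch of just the first part (remove the row, then renumber the rows and cell references below it). It is in Python with the standard library rather than the Open XML SDK, it is not our internal method, and it assumes every `row` and `c` element carries an `r` attribute. It does nothing about formulas, merged cells, hyperlinks, named ranges, tables, or the calc chain.

```python
import re
import xml.etree.ElementTree as ET

NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
ET.register_namespace("", NS)  # keep the default namespace when serializing


def delete_row(sheet_xml, row_number):
    """Remove one row from worksheet XML and renumber the rows below it."""
    root = ET.fromstring(sheet_xml)
    sheet_data = root.find(f"{{{NS}}}sheetData")
    for row in list(sheet_data):
        index = int(row.get("r"))
        if index == row_number:
            sheet_data.remove(row)
        elif index > row_number:
            row.set("r", str(index - 1))
            for cell in row.findall(f"{{{NS}}}c"):
                # Cell references look like "B7": keep the letters, shift the number.
                column = re.match(r"[A-Z]+", cell.get("r")).group(0)
                cell.set("r", f"{column}{index - 1}")
    return ET.tostring(root, encoding="unicode")
```

Every item missing from that list is another pass over a different part of the package, which is how a "simple" delete grows to hundreds of lines.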
The biggest complaint I have is the lack of documentation on how to do the most common tasks. The MSDN section on the Open XML SDK is very limited and whenever you need to do anything complicated you are really on your own. I've had to read the Open XML standard a lot to figure out what certain elements mean and how they should be implemented since I could find very little online.
The other challenging part is if you insert an element in a spot where it doesn't belong or put an invalid attribute on an element you will get a corrupt file when you try and open it. Most of the time you will not get any information on what caused the error and you will have to look at the Open XML standard spec to see what you did wrong.
If you need a fast turnaround time on converting that Office automation code into Open XML and what you are doing is not really basic, then I would say pass. If you have time and the patience to read up on the Word, Excel and PowerPoint XML structures and get familiar with how they relate then I say go for it. In my opinion it is really the only way to have very fine control over these office documents, but there will be a great learning curve when you start.
Oh and just for fun here is how much code is needed to add a comment to an Excel cell.
Just for completeness, here are some libraries I found for working with Excel XML:
www.extremexml.com - a layer on top of the Open XML SDK classes; focusses on injecting data into an existing spreadsheet; handles many of the cross-reference problems I identified in my question. Open source but GPL2 not LGPL. Code looks nice, and documentation is excellent. Does not appear terribly active on codeplex though.
Closed XML - another layer on top of the Open XML SDK - again open source, but with a less restrictive license (MIT). Looks nice, and looks more "active" than the above.
SpreadsheetLight - from what I can tell, a closed-source library sitting atop the Open XML SDK classes. Targeted more at those looking to create a spreadsheet from scratch rather than making changes to existing spreadsheets.
Here is another third party library dedicated to working with OpenXML:
http://www.officewriter.com
In the example cited by amurra above of deleting Excel spreadsheet rows, this is a single method call with this tool. It updates formulas and all the other references that would otherwise seem to require 500 lines of code.
The OpenXML SDK itself is a great tool for very simple things, but you still have to concern yourself with a lot of the internals of the file format and packaging structure to get things really right.
Here are some additional libraries that can manipulate OOXML formats:
- GemBox.Spreadsheet (XLSX)
- GemBox.Document (DOCX)
GemBox has also published some articles that demonstrate how to manipulate the OOXML file formats with pure .NET (without using any library). I think you'll find them interesting:
www.codeproject.com/Articles/15593/Read-and-write-Open-XML-files-MS-Office
(Introduction to SpreadsheetML format and an explanation on how we can read and write worksheet's cell content)
www.codeproject.com/Articles/649064/Show-Word-File-in-WPF
(Introduction to WordprocessingML format and demonstration on how we can read document's text)
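For a feel of what "without any library" means in practice, reading cell content comes down to unzipping the package and walking two XML parts. The sketch below is in Python rather than .NET and is not taken from the articles. It assumes the worksheet part is named `xl/worksheets/sheet1.xml` (strictly, the name should be looked up through the workbook relationships), and it ignores inline strings, dates, and styles. `read_cells` is a made-up name.

```python
import zipfile
import xml.etree.ElementTree as ET

MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS = {"m": MAIN}


def read_cells(xlsx, sheet_part="xl/worksheets/sheet1.xml"):
    """Return {cell reference: text} for one worksheet part of an .xlsx package."""
    with zipfile.ZipFile(xlsx) as package:
        shared = []
        if "xl/sharedStrings.xml" in package.namelist():
            strings_root = ET.fromstring(package.read("xl/sharedStrings.xml"))
            for item in strings_root.findall("m:si", NS):
                # A shared string may be split into several <t> runs.
                shared.append("".join(t.text or "" for t in item.iter(f"{{{MAIN}}}t")))
        sheet_root = ET.fromstring(package.read(sheet_part))
        cells = {}
        for cell in sheet_root.iter(f"{{{MAIN}}}c"):
            value = cell.find("m:v", NS)
            if value is None:
                continue  # empty cell or inline string; not handled here
            text = value.text or ""
            if cell.get("t") == "s":
                text = shared[int(text)]  # value is an index into sharedStrings
            cells[cell.get("r")] = text
        return cells
```

`xlsx` can be a file path or an open binary file object. Numbers come back as the raw text stored in the file, so converting them is up to the caller.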

Resources