I have a large .xlsx file where each row contains a person's name and various other information. Some rows have duplicate entries throughout the file. I'd like to create a Node.js script that parses the file and deletes the rows with duplicate entries. What is the easiest way to go about this?
I have found SheetJS to be the easiest way to interact with Excel files in Node. They publish the xlsx node module: https://www.npmjs.com/package/xlsx.
The documentation can be a bit confusing, however. If you have specific issues during your implementation, feel free to edit your question with code or ask a new question!
Concerning your specific scenario, the xlsx module comes with some nifty ways to convert spreadsheets to and from arrays of arrays, as well as arrays of objects. You say you have "a large .xlsx file". If it is truly massive, you might consider something like a streaming read of the spreadsheet that records duplicate keys in an array as you go. Then stream the original spreadsheet again into a new document, omitting the entries recorded in the duplicates array.
However, the array-of-arrays and array-of-objects helpers might be an easier route. I have done in-memory processing of CSVs with nearly 100,000 rows (~50MB). It's a bit slow, but definitely possible.
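For the in-memory route, a rough sketch of the whole script might look like this (the file names and the "Name" column header are assumptions; sheet_to_json keys each row object by the sheet's header row):

    const XLSX = require("xlsx");

    // Read the workbook and convert the first sheet to an array of row objects.
    const workbook = XLSX.readFile("people.xlsx");
    const sheetName = workbook.SheetNames[0];
    const rows = XLSX.utils.sheet_to_json(workbook.Sheets[sheetName]);

    // Keep only the first occurrence of each name (assumes a "Name" header).
    const seen = new Set();
    const deduped = rows.filter((row) => {
      if (seen.has(row.Name)) return false; // duplicate row: drop it
      seen.add(row.Name);
      return true;
    });

    // Write the surviving rows to a new workbook.
    const outBook = XLSX.utils.book_new();
    XLSX.utils.book_append_sheet(outBook, XLSX.utils.json_to_sheet(deduped), sheetName);
    XLSX.writeFile(outBook, "people-deduped.xlsx");

If "duplicate" means the entire row rather than just the name, JSON.stringify(row) makes a serviceable dedup key in place of row.Name.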
Hope that helps
https://docs.sheetjs.com
It sounds very easy, but I looked for a similar question and couldn't find a suitable one. Most deal with issues slightly different from mine.
Every month I receive one big Excel file with several sheets, but on one sheet there are three different data ranges (not formatted tables). I stress ranges, not tables, because some "smart" colleagues decided to overwrite the file with new data by simply expanding the range (it grows horizontally), so it stayed a range and never became a table. I know Power Query needs table format.
So my issue is to somehow consolidate those three ranges on that one sheet into a single query, without disrupting the original Excel file, and of course to keep it dynamic as new files arrive.
I am comfortable with Power Query, but I haven't handled anything like this, where several ranges have to be cleaned, edited and appended into one query. The positive thing is that the column names are the same; only the content differs.
As you can see, the data range comes in so-called "blocks" of data that run horizontally.
This is basically what I would like to end up with.
If this question already exists, please link it!
Here is my test file if you want to check it out:
https://docs.google.com/spreadsheets/d/1RDAoZqxKPk1NdhtcYec8nG_31PFwQ7Lj/edit?usp=sharing&ouid=101738555398870704584&rtpof=true&sd=true
I solved it by building three separate queries and then appending them into one bigger table.
Also, Import From Folder is the best way to import, rather than going directly from an Excel workbook; it gives me more room to add a filter, for instance on "Date Created", so you can always have the newest file on top.
Thanks anyway for your input, guys.
I have a series of PDF files uploaded to Google Drive (and also stored on my computer here), linked in different rows of a Google or Excel spreadsheet. Each row has a distinct PDF file linked to it. What I want to figure out is a way to extract 5 rows of data (not a table) from the PDF and add them to certain columns on the sheet:
Here's a sample pdf:
https://www.dropbox.com/s/2j7pqeja38jxmzc/Sample.pdf?dl=0
The sheet looks like this:
https://www.dropbox.com/s/40u1n7umacd74kw/Sample%20sheet.xlsx?dl=0
So the process would be: Excel opens the file linked in row 1, extracts the data needed, then adds the data to certain columns in the Excel/Google spreadsheet.
I was just wondering if this is possible. The PDF has lots of pages, but I only need data from a single page in it.
If this doesn't work in Excel/Google Sheets, any suggestion on how I can automate this process?
PS: I'm not asking for the exact way to do it, because I know that's a violation here; I just wanted to know whether this is possible and can be done in Excel or a Google spreadsheet. If not, any suggestion would help greatly. Thanks!
Yes, it's possible, but it depends a lot on the PDF, which I imagine would be the biggest hurdle. You'll probably find this answer is at least relevant, if not exactly what you're looking for.
Otherwise, if everything is stored in Drive, it's just a question of:
1) Looping through the sheet and opening the doc you want.
2) Getting the content of the PDF (probably as a string).
3) Finding a consistent way to cut the relevant data from the PDF (this depends a lot on the content of the PDFs).
4) Pasting the data to the sheet.
Number 3 might be your biggest challenge, but once you get started you might find it to be a lot easier than you'd think.
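To make the shape of it concrete, here's a rough Apps Script skeleton for those four steps. The column layout, the OCR detour for step 2 (which needs the "Drive" advanced service, v2, enabled for the script), and the extractRelevantData() parser for step 3 are all assumptions you'd adapt to your own files:

    // Assumes column A of the active sheet holds each row's Drive file ID.
    function fillSheetFromPdfs() {
      const sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
      const rows = sheet.getDataRange().getValues();
      for (let i = 1; i < rows.length; i++) { // skip the header row
        const pdfText = getPdfText(rows[i][0]);
        const fields = extractRelevantData(pdfText);
        // Paste the extracted values into columns B onward of the same row.
        sheet.getRange(i + 1, 2, 1, fields.length).setValues([fields]);
      }
    }

    // Step 2: let Drive OCR the PDF into a temporary Google Doc, then read its text.
    function getPdfText(fileId) {
      const blob = DriveApp.getFileById(fileId).getBlob();
      const doc = Drive.Files.insert({ title: "tmp-ocr" }, blob, { ocr: true });
      const text = DocumentApp.openById(doc.id).getBody().getText();
      DriveApp.getFileById(doc.id).setTrashed(true); // clean up the temp doc
      return text;
    }

    // Step 3 (hypothetical parser): cut the values you need out of the raw text.
    function extractRelevantData(text) {
      const match = text.match(/Invoice No:\s*(\S+)/); // made-up label; depends entirely on your PDFs
      return [match ? match[1] : ""];
    }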
I need your help badly. I am dealing with a workbook which has 7,000 rows x 5,000 columns of data in one sheet. Each of these data points has to be manipulated and pasted into another sheet. The manipulations are relatively simple; each one takes fewer than 10 lines of code (simple multiplications and divisions with a couple of Ifs). However, the file crashes every now and then, and I get various types of errors. The problem is the file size. To overcome this, I am trying a few approaches:
a) Separate the data and output into different files. Keep both files open, take the data chunk by chunk (typically 200 rows x 5,000 columns), manipulate each chunk, and paste it into the output file. However, if both files are open, I am not sure this remedies the problem, since the memory consumed will be the same either way: instead of one file consuming a lot of memory, two files together would consume the same amount.
b) Separate the data and output into different files. Access the data in the data file while it is still closed, by inserting links into the output file through a macro; manipulate the data and paste it into the output. This can be done chunk by chunk.
c) Separate the data and output into different files. Run a macro to open the data file, load a chunk of data (say 200 rows) into an array in memory, and close it. Process the array, then open the output file and paste in the results.
Which of the three approaches is better? I am sure there are other methods which are more efficient. Kindly suggest.
I am not familiar with Access, but I tried to import the raw data into it, and the import failed because Access allows only 255 columns.
Is there a way to keep the file open but swap it in and out of memory? Then slight variations on (a) and (c) above could be tried. (I am afraid repeated opening and closing will crash the file.)
I look forward to your suggestions.
If you don't want to leave Excel, one trick you can use is to save the base Excel file as a binary ".xlsb". This will clean out a lot of potential rubbish that might be in the file (it all depends on where the file first came from).
I just shrank a load of web data by 99.5% - from 300MB to 1.5MB - by doing this, and now the various manipulations in Excel work like a dream.
The other trick (from the '80s :) ), if you are using a lot of in-cell formulae rather than a macro to iterate through, is to:
1) Turn calculation off.
2) Copy your formulae.
3) Turn calculation on, or just run a calculation manually.
4) Copy and paste-special-values the formula outputs.
My suggestion is to use a scripting language of your choice and work with decomposition/composition of the spreadsheets in it.
I was composing and decomposing spreadsheets back in the day (in PHP, oh the shame) and it worked like a charm. I wasn't even using any libraries.
Just grab yourself the xlutils library for Python and get your hands dirty.
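To make the decomposition idea concrete in another scripting language, here is roughly how approach (c)'s chunked processing might look in Node.js with the SheetJS xlsx npm module. The file names, chunk size and the actual manipulation are all placeholders, and note this sketch still reads the whole data file into memory once rather than truly streaming it:

    const XLSX = require("xlsx");

    // Flatten the data sheet into an array of arrays (one array per row).
    const dataBook = XLSX.readFile("data.xlsx");
    const rows = XLSX.utils.sheet_to_json(dataBook.Sheets[dataBook.SheetNames[0]], { header: 1 });

    const CHUNK = 200; // process 200 rows at a time, as in approach (c)
    const output = [];
    for (let start = 0; start < rows.length; start += CHUNK) {
      const chunk = rows.slice(start, start + CHUNK);
      // Placeholder manipulation: a simple multiplication/division with an If.
      for (const row of chunk) {
        output.push(row.map((cell) =>
          typeof cell === "number" ? (cell > 100 ? cell / 2 : cell * 3) : cell
        ));
      }
    }

    // Write the results to a separate output workbook.
    const outBook = XLSX.utils.book_new();
    XLSX.utils.book_append_sheet(outBook, XLSX.utils.aoa_to_sheet(output), "Output");
    XLSX.writeFile(outBook, "output.xlsx");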
I'm considering replacing a (very) large body of Office-automation code with something that works with the Office XML format directly. I'm just starting out, but already I'm worried that it's too big a task.
I'll be dealing with Word, Excel and PowerPoint. So far I've only looked at Word and Excel. It looks like Word documents should be reasonably easy to manipulate, but Excel workbooks look like a nightmare. For example...
In Word, it looks like you could delete a paragraph simply by deleting the corresponding "w:p" tag. However, the supplied code snippet for deleting a row in Excel takes about 150 lines of code(!).
The reason the Excel code is so big is that deleting a row means updating the row indexes of all the subsequent rows, fixing up the "shared strings" table, etc. According to a comment at the top, the code snippet is not even complete, in that it won't deal with a workbook that has tables in it (I can live with that).
What I'm not clear on is whether that's the only restriction that the sample code has. For example, would there also be a problem if the workbook contained a Pivot Table? Or a chart that references data from the same sheet? Or some named ranges? Wouldn't you also have to update the formulae for any cells (etc.) that referenced a row whose row index had changed?
[That's not to mention the "calc chain", which (thankfully) I think you can simply delete, since it's only a cache that can be re-built.]
And that's my question, woolly though it is. Just how hard do you have to work to do something as simple as deleting a row properly? Is it an insurmountable task?
Also, if there are other, similar issues either with Excel or with Word or PowerPoint, I'd love to hear about them now, before I waste too much time going down a blind alley. Thanks.
Having worked with the Open XML SDK 2.0 for almost two years now, I can say that doing seemingly trivial tasks can take many hours and sometimes days to figure out how to do properly. For example, deleting an Excel row should be fairly straightforward and easy to do, right? Nope, because not only do you need code to delete the row, but you then have to update all the row indices, update any merged cell references, update hyperlink references, etc. Our internal delete method is close to 500 lines of code just to delete a row, and I'm sure we don't have all the cases accounted for either.
The biggest complaint I have is the lack of documentation on how to do the most common tasks. The MSDN section on the Open XML SDK is very limited and whenever you need to do anything complicated you are really on your own. I've had to read the Open XML standard a lot to figure out what certain elements mean and how they should be implemented since I could find very little online.
The other challenging part is if you insert an element in a spot where it doesn't belong or put an invalid attribute on an element you will get a corrupt file when you try and open it. Most of the time you will not get any information on what caused the error and you will have to look at the Open XML standard spec to see what you did wrong.
If you need a fast turnaround time on converting that Office automation code into Open XML and what you are doing is not really basic, then I would say pass. If you have time and the patience to read up on the Word, Excel and PowerPoint XML structures and get familiar with how they relate then I say go for it. In my opinion it is really the only way to have very fine control over these office documents, but there will be a great learning curve when you start.
Oh, and just for fun, here is how much code is needed to add a comment to an Excel cell.
Just for completeness, here are some libraries I found for working with Excel XML:
www.extremexml.com - a layer on top of the Open XML SDK classes; focuses on injecting data into an existing spreadsheet, and handles many of the cross-reference problems I identified in my question. Open source, but GPL2, not LGPL. The code looks nice, and the documentation is excellent. It does not appear terribly active on CodePlex, though.
ClosedXML - another layer on top of the Open XML SDK; again open source, but with a less restrictive license (MIT). Looks nice, and looks more "active" than the above.
SpreadsheetLight - from what I can tell, a closed-source library sitting atop the Open XML SDK classes. It is targeted more at those looking to create a spreadsheet from scratch than at those making changes to existing spreadsheets.
Here is another third party library dedicated to working with OpenXML:
http://www.officewriter.com
In the example of deleting Excel spreadsheet rows cited by amurra above, this is a single method call with this tool. It updates formulas and all the other references for which it seems 500 lines of code would otherwise be required.
The OpenXML SDK itself is a great tool for very simple things, but you still have to concern yourself with a lot of the internals of the file format and packaging structure to get things really right.
Here are some additional libraries that can manipulate OOXML formats:
- GemBox.Spreadsheet (XLSX)
- GemBox.Document (DOCX)
GemBox has also published some articles that demonstrate how to manipulate OOXML file formats with pure .NET (without the use of any library); I think you'll find these interesting:
www.codeproject.com/Articles/15593/Read-and-write-Open-XML-files-MS-Office
(an introduction to the SpreadsheetML format and an explanation of how to read and write a worksheet's cell content)
www.codeproject.com/Articles/649064/Show-Word-File-in-WPF
(an introduction to the WordprocessingML format and a demonstration of how to read a document's text)
My current employer (to remain nameless) has a collection of incredibly sophisticated Microsoft Excel 2003 worksheets (developed by contractors, also to remain nameless).
The employer is replacing the Excel-based solution with a Salesforce-based solution (developed by other contractors, likewise to remain unnamed). The Salesforce solution is also very complex, using dozens of related objects and "Dynamic SOQL" to contain the data and formulas which were previously contained in the Excel-based solution.
The employer's problem, which has become my problem, is that the data from the Excel spreadsheets needs to be meticulously and tediously recreated in .CSV files so it can be imported into SalesForce.
While I've recently learned I can use CTRL-` to review formulas in Excel, this doesn't solve the problem that variables in Excel have cryptic names like $O$15. If I'm lucky, when I investigate $O$15, I'll find some metadata explaining it n cells up and/or some other data m cells to the left, and/or (in rare instances) there may be a comment on the cell.
Patterns within the Excel spreadsheets are very limited, rarely lasting more than 6 consecutive rows or columns, and no two sheets which need to be imported have much similarity.
Documentation of all the systems is very limited.
Without my revealing any confidential data, does anyone have any good ideas how I might optimize my workflow?
It's not clear exactly what you need to do: here are 3 possible scenarios, requiring increasing knowledge of Excel.
1. If all you want is to convert the Excel spreadsheets into CSV format, then just save the worksheets as CSVs.
2. If you just want the data and not the formulae, then it would be simple (using VBA) to output anything that isn't a formula (the cell's .Formula won't start with "="); see the sketch after this list for the same idea outside Excel.
3. If you need to create a linkage of Excel --> CSV --> existing Salesforce objects/SOQL, then you will need to understand both the Excel spreadsheets and the Salesforce objects/SOQL that have been created. This will be difficult unless you have good knowledge and experience of Excel and also understand what the Salesforce app requires.
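If you'd rather script scenario 2 than write VBA, the same non-formula filter is straightforward in Node.js with the SheetJS xlsx module. This is just a sketch; the file name is a placeholder, and it relies on SheetJS keeping a cell's formula (if any) in cell.f:

    const XLSX = require("xlsx");

    const wb = XLSX.readFile("workbook.xlsx");
    const ws = wb.Sheets[wb.SheetNames[0]];
    const range = XLSX.utils.decode_range(ws["!ref"]); // the sheet's used range

    // Keep only plain-data cells; anything with a formula becomes a blank.
    const data = [];
    for (let r = range.s.r; r <= range.e.r; r++) {
      const row = [];
      for (let c = range.s.c; c <= range.e.c; c++) {
        const cell = ws[XLSX.utils.encode_cell({ r: r, c: c })];
        row.push(cell && !cell.f ? cell.v : null);
      }
      data.push(row);
    }

    // Emit the base data as CSV, ready to become a Salesforce import file.
    console.log(XLSX.utils.sheet_to_csv(XLSX.utils.aoa_to_sheet(data)));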
Brian, if you're still working on this, here's one way to approach the problem. I use this kind of process often for updating data between SFDC and marketing automation apps.
1) Analyze the formulae that you're re-creating in Salesforce.com to determine what base data fields you need (stuff that doesn't have to be calculated from something else).
2) Find those columns/rows in your spreadsheets and use Paste Special -> Values in a new spreadsheet to create an upload file, with values instead of formulae, for each data area (leads, prospects, accounts, etc.).
3) If you have to associate the info with leads, contacts or accounts, and you have already uploaded or created those records in Salesforce.com, be sure to export them with their ID numbers. That makes it easy to use the VLOOKUP formula in Excel (e.g. =VLOOKUP(A2,Contacts!A:B,2,FALSE), with the ranges adjusted to your sheet) to match up fields that you need to add, and then re-upload the data into Salesforce.
Like data cleaning, this can be a tedious process. But if you take it step by step it shouldn't be too hard. Good luck.