How to replace INDEX/MATCH with a connection - Excel

To view customer data in an Excel sheet I have used the INDEX/MATCH functions to retrieve data. However, due to the large number of customers the file has become very large - it is currently 13 MB. This file is regularly sent by email, so it is a real headache having to open it every time.
Is there a way to replace INDEX/MATCH with something else in order to reduce the file size? Transforming the source file into an SQL file? Adding a connection to the source file?
Thanks.

Related

How to handle mass data in Excel (e.g. Power Query)

Please don't hurt me for this question... I know it is not the ideal way to handle mass data, but I need to try...
I have a folder with 10 CSV files, each of which is under the XLSX row limit of 1,048,576 rows.
Now I am trying to combine those files into one workbook. Combined, the files exceed 1,048,576 rows, so with the import dialog I always get an error saying it is not possible to load all the data.
I found a way to load the data only into the Power Query data model and not directly into the sheet, but I cannot find any way to split the data across different sheets.
An ideal split would be, for example:
Sheet 1: Files 1-3
Sheet 2: Files 4-8
Sheet 3: Files 9-10
Is there a way to create a separate query for each file and then append those queries onto the sheets? I would like to end up with 10 queries, which I can append in the way mentioned above.
Thank you for your input!
You can load each CSV file separately as its own query, saving each one via File > Close and Load as Connection Only. Then create separate queries that use Table.Combine() to put together the combinations you need (Data > Get Data > Combine Queries > Append), and load those queries back onto the sheets as either tables or pivot reports.
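As a rough sketch of what one of those append queries looks like in the Power Query editor (the names File1, File2 and File3 are placeholders for whatever the Connection Only queries are actually called):

    // Hypothetical append query for Sheet 1 (Files 1-3); File1..File3 are
    // the Connection Only queries created from the individual CSV files.
    let
        Sheet1Data = Table.Combine({File1, File2, File3})
    in
        Sheet1Data

Repeating the same pattern for Files 4-8 and Files 9-10, and loading each combined query to its own sheet, gives the split described above.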

Pentaho Data Integration - Dump Excel into table

I'm very new to this tool and I want to do a simple operation:
Dump data from an Excel file into tables.
I have an Excel file that has around 10-12 sheets, and almost every sheet corresponds to a table.
With the first Excel Input step there is no problem.
The only problem is that, for reasons I don't understand, when I try to edit a second Excel Input step (show the list of sheets, or get the list of columns), the software just hangs, and when it responds it only opens a warning with an error.
This is an image of the actual diagram that I'm trying to use:
This is a typical out-of-memory problem: PDI is not able to read the file because it needs more memory to process the Excel workbook. You need to give PDI more memory to work with your Excel file - try increasing Spoon's memory (see "Increase Spoon memory").
Alternatively, try to replicate your Excel file with only a few rows of data, keeping the structure of the file as it is (i.e. a test file). You can use that test file to generate the necessary sheet names and columns in the Excel Input step. Once you are done, point the step at the original file and execute the job.
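As an illustration only (the variable name and default values differ between PDI versions), increasing Spoon's memory comes down to editing the JVM options in spoon.sh or Spoon.bat, along these lines:

    # Hypothetical excerpt from spoon.sh: raise the JVM heap so the
    # Excel Input step has enough memory to parse large workbooks.
    export PENTAHO_DI_JAVA_OPTIONS="-Xms1024m -Xmx4096m"

After changing the options, restart Spoon so the new heap size takes effect.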

How to select specific rows to load into an Excel Workbook from another at run-time

I have two Excel files. One has data ("source.xlsx") and one has macros ("work.xlsm"). I can load the data from "source.xlsx" into "work.xlsm" using Excel's built-in load or using Application.GetOpenFilename. However, I don't want all the data in source.xlsx; I only want to select specific rows, the criteria for which will be determined at run time.
Think of this as a SELECT from a database with parameters. I need to do this to limit the time and processing of the data handled by "work.xlsm".
Is there a way to do that?
I tried using a parameterized query from Excel --> [Data] --> [From Other Sources], but it complained about not finding a table (same with ODBC). This makes sense, because the source has no table defined, but I am restricted from touching the source.
So, in short, I need to filter the data before bringing it into the target sheet, without touching the source file. I want to do this either interactively or via a VBA macro.
Note: I am using Excel 2003.
Any help or pointers will be appreciated. Thx.
I used a macro to convert the source file from .xlsx to .csv format and then loaded the CSV file using a loop that applied the desired filter during the load.
This approach may not be the best; nevertheless, no other suggestion was offered and this one works!
The other approach is to abandon the idea of pre-filtering, accept the load-time delay, and perform the filtering and removal of unwanted rows in the "work.xlsm" file. Performance and memory size are the major factors in that case, assuming code complexity is not the issue.
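A minimal VBA sketch of that load-with-filter loop is below. The file path, the column index and the filter value are all hypothetical; the real criterion would be whatever is determined at run time. It also assumes simple comma-separated fields with no embedded commas or quotes.

    ' Load rows from a CSV (produced from source.xlsx) into work.xlsm,
    ' keeping only the rows that match a run-time criterion.
    Sub LoadFilteredCsv()
        Dim f As Integer, lineText As String, fields() As String
        Dim ws As Worksheet, r As Long

        Set ws = ThisWorkbook.Worksheets("Data")   ' target sheet in work.xlsm
        r = 1
        f = FreeFile

        Open "C:\temp\source.csv" For Input As #f  ' hypothetical path
        Do While Not EOF(f)
            Line Input #f, lineText
            fields = Split(lineText, ",")
            ' Hypothetical filter: keep rows whose third field is "EMEA"
            If UBound(fields) >= 2 Then
                If fields(2) = "EMEA" Then
                    ws.Cells(r, 1).Resize(1, UBound(fields) + 1).Value = fields
                    r = r + 1
                End If
            End If
        Loop
        Close #f
    End Sub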

Power Query Excel File Missing Data Issue

I'm new to Power Query and running into a strange issue that I haven't been able to resolve.
I'm creating a query to extract data from roughly 300 Excel files. Each file has one sheet, 115 columns, and around 100 rows. However, the query is only returning the data from the first two columns and rows, and I'm not sure why it won't return all of the data on the sheet.
Ex:
Header 1 Header 2
Data Data
I converted one file to a .csv file, and the query returns all of the data from that file. I've scoured Google and haven't been able to find anything that seems to relate to this issue. Is there an Excel file limitation that I'm not aware of?
I'm assisting someone who is not technically savvy, so I would like to avoid VB code and Access if possible. Also, I can't really provide a file I'm working with because the data contains PHI.
Thank you in advance!

SSIS Data Flow Task Excel Source

I have a data flow task set up in SSIS.
The source is an Excel file, not a SQL database.
The problem I seem to get is that the package imports empty rows.
My data occupies 555,200 rows, but when importing, the SSIS package imports over 900,000 rows. The extra rows are imported even though they are empty.
When I then download this table into Excel, there are empty rows in between the data.
Is there any way I can avoid this?
Thanks
Gerard
The best thing to do, if you can, is export the data to a flat file (CSV or tab-delimited) and then read that in. The problem is that even though those rows are blank, they are not really empty, so when you hop across that ODBC-Excel bridge you get those rows as blanks.
You could possibly adjust the way the spreadsheet is generated to eliminate this problem, or manually delete the rows, but those solutions are not scalable or maintainable over the long term, and you will still be stuck with that rickety ODBC bridge. The best long-term solution is to avoid using the ODBC-Excel bridge entirely. By dumping the data to a flat file you have total control over how to read, validate, and interpret the data, and you will not be at the mercy of a translation layer that is to this day riddled with bugs and is, at the best of times, "quirky".
You can also add a Conditional Split component to your Data Flow task, between the source and the destination. In it, check whether some column is null or empty - pick a column that is consistent, meaning it has data in every valid row and is empty or null in every invalid row.
Then discard the output for that condition and send the rest of the rows to the destination. You should then get only the rows with valid data from Excel.
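For example, assuming a text column named SomeColumn that is always populated on valid rows (the column name is hypothetical), the Conditional Split condition for the rows to discard could be:

    ISNULL([SomeColumn]) || [SomeColumn] == ""

Rows matching that condition go to an output you simply leave unconnected; the default output carries the remaining rows on to the destination.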
