I am trying to combine (or perhaps append is a better term) a group of 10 Excel files with identical columns into one master file.
I have tried a very simple process using a foreach loop in the control flow and simply doing an Excel Source to an Excel Destination. The process was not only slow (about one record pasted per second) but it also died after about 50k records.
It looks like:
Foreach Loop Container --> Data Flow task
where the Data Flow Task is Excel Source --> Excel Destination
At the end, I'd want to see one master file with all the files appended. I recognize there are other tools that can do this, like Power Query directly in Excel, but I'm trying to better understand SSIS, and I have a lot of processing that would be better done in SQL Server.
Is there a better way to do this? I searched high and low online but couldn't find an example of this in SSIS.
This is very simple. The one thing I would suggest is to load to a flat file in CSV format, which opens easily in Excel.
1. Use a Foreach Loop Container with a Foreach File enumerator, looping on the file name.
2. In the Foreach GUI, set:
   - the folder path of the Excel files
   - the file name pattern (e.g. myfiles*.xls)
3. Go to Variable Mappings and map the fully qualified file name to a variable.
4. Create an Excel connection to any one of the files.
5. In the Excel connection manager's properties, open Expressions and set ExcelFilePath to the variable from step 3.
6. Also in the properties, set DelayValidation to True.
7. Add a Data Flow Task to the Foreach Loop container.
8. Go to the data flow.
9. Use the Source Assistant to read from the Excel source.
10. Use the Destination Assistant to load to a flat file (make sure not to overwrite the destination, or you will only get the last workbook).
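If you want to sanity-check which workbooks the enumerator will pick up before building the package, a small stand-alone C# sketch like the one below lists them. The folder and mask here are placeholders for whatever you enter in step 2; this is just an illustration, not part of the package itself.

using System;
using System.IO;

class ListExcelFiles
{
    static void Main()
    {
        // Placeholder folder and file mask -- use the same values as in the Foreach File enumerator.
        string folder = @"C:\IncomingExcel";
        string mask = "myfiles*.xls";

        // Each fully qualified name printed here is what the loop maps to your variable (step 3).
        foreach (string file in Directory.GetFiles(folder, mask))
        {
            Console.WriteLine(file);
        }
    }
}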
I recently modernized all my Excel files and started using the magic of Power Query and Power Pivot.
Background: I have two files:
- The first is a "master" with all sales and production logs; everything works inside that Excel file with Power Query queries against tables stored in that same file.
- The second is mostly a different set of data about continuous improvement, but I'd like to start linking it with the master file by having charts that compare efficiency to production, etc.
As it is now, I am using links by entering a direct reference to the cells/ranges in the master file (e.g. [Master.xlsm]!$A1:B2). However, with every new version of the Master file I have to update the links, and that won't scale if I have more documents in the future.
Options:
- Is it possible to store all the queries or data from the Master file in a separate file in the same folder and "call" it when needed, either in my Sales/Production master file or in the Manufacturing file? That could be a database or a connection file that holds the queries against the data stored in the master file.
- If not, what is the best way to connect my Manufacturing file to my Master file without entering the file name explicitly?
My fear is that as soon as the Master file name changes (date, version), I will have to go into the queries and fix all the links again. Additionally, I want to make this future-proof early on, as I plan to gather large amounts of data and add more measurements.
Thanks for your help!
Once you have a data model built, you can create a connection to it from other Excel files. If you are looking for a visible way to control the source path of the connected file, you can add a named range to the Excel file that is connecting to the data model, and in the named range, enter the file path. In Power Query, add a new query that returns your named range (the file path), and swap out the static file path in your queries with the new named range query.
Here is sample M code that gets the contents of a named range. This query is named "folderPath_filesToBeAudited".
let
    Source = Excel.CurrentWorkbook(){[Name="folderPath_filesToBeAudited"]}[Content]{0}[Column1]
in
    Source
Here is an example of M code showing how to use the new query to reference the file path.
Folder.Files(folderPath_filesToBeAudited)
Here is a step-by-step article.
https://accessanalytic.com.au/powerquery_namedcells_parameters/
I'm very new to this tool and I want to do a simple operation:
Dump data from an Excel file to tables.
I have an Excel file that has around 10-12 sheets, and almost every sheet corresponds to a table.
With the first Excel Input step there is no problem.
The only problem is that, I don't know why, when I try to edit a second Excel Input (show the list of sheets, or get the list of columns), the software just hangs, and when it responds it just opens a warning with an error.
This is an image of the actual diagram that I'm trying to use:
This is a typical out-of-memory problem: PDI is not able to read the file and needs more memory to process the Excel workbook. You need to give PDI more memory to work with your Excel file. Try increasing Spoon's memory, typically by raising the Java -Xmx value in Spoon.bat / spoon.sh (see "Increase Spoon memory").
Alternatively, replicate your Excel file with only a few rows of data while keeping the structure of the file as it is, i.e. a test file. You can use that test file to generate the necessary sheet names and columns in the Excel Input step. Once you are done, point the step back at the original file and execute the job.
I have set up an Execute SQL task that loads the full result set of names into an object variable, and I have it connected to a Foreach Loop that scans the object row by row. I'm unsure about the next steps, though. If I could create a Data Flow task and somehow set the destination variable equal to the Foreach Loop mapping variable, that would be nice. Any tips?
Based on what you described, all you need to do is the following:
1. An Execute SQL Task returns the list of Excel file names, which you already have.
2. Connect its output to a Foreach Loop Container and start iterating over each name.
3. Inside the container, the first task you need is a Script Task, which is used to create each Excel file. I assume the Excel format is the same for all the files you need to populate; create a new template file with the desired column header names specified.
4. For that Script Task, take the variable mapped in the container as a read-only variable, and create another variable, set as read/write (suppose it is named A), to store the dynamic Excel file path for each iteration; then edit the script.
5. If you are familiar with C#, it will be easy to copy the template for each iterated name.
Code will be like this:
using System.IO;
...
...
...
// Copy the empty template workbook to a new file named after the current iteration's value.
string source = "C:\\template.xlsx";   // needs to be a full path
string target = "C:\\" + Dts.Variables["that read only variable"].Value.ToString() + ".xlsx";
File.Copy(source, target);
// Hand the new path back to the package so the dynamic Excel connection can use it.
Dts.Variables["A"].Value = target;   // important!
After the Script Task, you need a Data Flow Task linked to it with a precedence constraint. Inside it you need an Excel destination. The tricky part (1) is that you have to set a dynamic ExcelFilePath expression in the properties of that Excel connection manager. I suggest you first point it at an existing Excel file so the column mapping is cached, and then, for the dynamic connection part, select A, the read/write variable from the Script Task.
To populate the data into Excel, you need to convert all varchar columns to nvarchar; this can be done with either a Derived Column or a Data Conversion transformation.
Last but not least, set DelayValidation to TRUE for the connection manager, the Excel destination, and the entire Data Flow Task; this is very important for a dynamic process.
All of the above is only a brief explanation, but that is the main idea.
PS: (1) Excel is very picky in SSIS; if you do not have a data access engine installed, it might not populate the data successfully. Excel needs the JET (older) or ACE (newer) OLE DB provider.
(2) If your header row is not simply the first row, you might also need to think about the OPENROWSET properties.
I am trying to load multiple Excel files into a database. I have tried this link:
How to loop through Excel files and load them into a database using SSIS package?
but it keeps looping through the files and never ends.
Can anyone help?
This is not likely, given that you have a small number of files, which you should have when testing.
You need to log the file names inside the Foreach Loop and see whether the values are actually changing.
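For example, a throwaway Script Task placed inside the loop can surface the current value on each iteration. This is only a sketch and assumes the Foreach Loop maps the file name to a hypothetical variable User::FileName (added to the task's ReadOnlyVariables); adjust the name to match your package.

public void Main()
{
    // Fire an information event so the current file name shows up in the
    // Progress tab / package log on every iteration of the Foreach Loop.
    bool fireAgain = true;
    Dts.Events.FireInformation(0, "Foreach debug",
        "Current file: " + Dts.Variables["User::FileName"].Value.ToString(),
        string.Empty, 0, ref fireAgain);

    Dts.TaskResult = (int)ScriptResults.Success;
}

If the logged value never changes between iterations, the variable mapping in the Foreach Loop is the first thing to check.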
Dynamic sheet names may also have stability problems, e.g. some characters may not be picked up by the OLE DB driver.
In general, processing dynamic data like this is not a recommended practice.
I'm using SSIS to perform data migration.
I'm using an Excel destination file to output everything that goes wrong.
In this Excel file, I want to output the two error columns (error number and error column) and also all the columns from my input component.
This nearly works, except when I have string columns with more than 255 characters. When I set up my Excel destination, I create a new table.
The CREATE TABLE statement properly defines LongText as the data type:
CREATE TABLE `My data` (
`ErrorCode` Long,
`ErrorColumn` Long,
`ID` Long,
`MyStringColumn` LongText
)
This works the first time. Then I remove all data from the Excel file, because I want to clean up the Excel file before outputting errors.
When I return to the package designer, my column definitions are messed up. Every text column is handled as nvarchar(255) instead of ntext. That breaks my component, as my data exceeds 255 characters.
How can I properly manage Excel destinations?
thx
[Edit] As I'm not sure of my interpretation, here are the error messages when I run the task:
Error: 0xC0202009 at MyDataTask, To Errors file [294]: SSIS Error Code DTS_E_OLEDBERROR. An OLE DB error has occurred. Error code: 0x80040E21.
Error: 0xC0202025 at MyDataTask, To Errors file [294]: Cannot create an OLE DB accessor. Verify that the column metadata is valid.
Error: 0xC004701A at MyDataTask, SSIS.Pipeline: component "To Errors file" (294) failed the pre-execute phase and returned error code 0xC0202025.
In SSIS packages that involve an Excel destination, I have used an Excel template file strategy to overcome the error that you are encountering.
Here is an example that first shows how to simulate your error message and then shows how to fix it. The example uses SSIS 2008 R2 with Excel 97-2003.
Simulation
Created a simple table with two fields, Id and Description, and populated it with a couple of records.
Created an SSIS package with a single Data Flow Task, configured as shown below. It basically reads the data from the above-mentioned SQL Server table and then tries to convert the Description column to Unicode text with the character length set to 20.
Since the table has two rows whose Description values exceed 20 characters in length, the default error configuration on the Data Conversion transformation would fail the package. However, we need to redirect all the error rows, so the error configuration on the Data Conversion transformation has to be changed as shown below to redirect them.
Then I redirected the error output to an Excel destination configured to save the output to the file C:\temp\Errors.xls. The first execution of the package succeeds because the Excel file is empty to begin with.
The file will contain both rows from the table, because both encounter the truncation error and are therefore redirected to the error output.
If we delete the contents of the Excel file without changing the column header and then execute the package again, it fails.
The cause of the failure is shown in the error messages below.
That completes the simulation of the error mentioned in the question. Here is one possible way the issue can be fixed.
Possible Solution
Delete the existing Excel destination to which the error output is redirected. Create a new Excel connection manager with the path C:\temp\Template.xls. Place a new Excel destination, point it to the new Excel connection manager, and create the sheet within the new Excel file using the New button on the Excel destination.
Create two package variables named TemplatePath and ActualPath. TemplatePath should have the value C:\temp\Template.xls and ActualPath should have the value C:\temp\Errors.xls; the actual path is the path where you would like the file to be created.
Right-click the Excel connection manager, set the DelayValidation property to True, and set the ServerName expression to the variable @[User::ActualPath]. DelayValidation makes sure that the package doesn't throw errors at design time if the file C:\temp\Errors.xls doesn't exist. Setting the ServerName expression ensures that the package uses the file path stored in the variable ActualPath to generate the file.
On the Control Flow tab, place a File System Task above the Data Flow task.
Configure the File System Task as shown below, so that it copies the template file C:\temp\Template.xls to create a new destination file C:\temp\Errors.xls every time the package runs. If the file C:\temp\Errors.xls already exists, the File System Task will simply overwrite it, provided its OverwriteDestination property is set to True.
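Since the screenshot is not reproduced here, note that the File System Task configured this way does the equivalent of the following C# one-liner (same paths as above); it is shown only to make the copy step explicit.

using System.IO;

// Equivalent of the File System Task (Operation: Copy File, OverwriteDestination: True):
// start every run from a clean copy of the empty template.
File.Copy(@"C:\temp\Template.xls", @"C:\temp\Errors.xls", true);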
Now you can run the package any number of times. The package will not fail, and you will have only the error output from the last execution, without having to manually clear the Excel file's contents.
Hope that helps.
[Edit] Added by Steve B. to provide a bit more detail directly in the post, because it's too long for a comment.
In my solution, I have two Excel files in my SSIS project: Errors_Design_Template.xls and Errors_Template.xls. The former contains my sheets with the headers and one line of data (using formulas like =REPT("A",1024) for input columns with a maximum length of 1024); the latter is exactly the same but without the first line of data.
Both files are copied at the start of the package from my source directory to a temp directory. I use two files because I want to keep the design-time validation, and the Excel connection points to the copy of the template file. I also duplicate the template file because I often execute a single data flow task of my package, and I want to populate a temp file, not the template file in my project (which has to remain empty except for the headers and the first dummy line of data).
I also use two variables: one used in the Excel connection expression, and one for the actual output file. I had to write a script that takes my two variables as input; ActualFilePath is read/write. At run time, the script copies the value of ActualFilePath into the ErrorFilePath variable. (I don't have the source code right now, but I can paste it next week if it helps.)
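Since the script source isn't included, here is a minimal sketch of what such a Script Task body might look like. It assumes ErrorFilePath is listed as a read/write variable and ActualFilePath is readable in the task; the variable names and access are taken from the description above, so adjust them to match your package.

public void Main()
{
    // Copy the run-time output path into the variable that the Excel
    // connection manager's expression reads.
    Dts.Variables["User::ErrorFilePath"].Value =
        Dts.Variables["User::ActualFilePath"].Value.ToString();

    Dts.TaskResult = (int)ScriptResults.Success;
}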
Using these components together allows me to have the Excel connection pointing to the design file while designing, and pointing to the actual error file at run time, without having to set DelayValidation to true.
It is better to use an Execute SQL Task in the control flow. In the Execute SQL Task, specify the connection as the Excel connection manager. In the SQL statement, drop the Excel table that was created during sheet creation in the Excel destination, and after the drop, create the same table again. That way, the data will be inserted into a fresh Excel table on the next run.
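The Execute SQL Task would run the same two statements (a DROP TABLE followed by the CREATE TABLE from the question) against the Excel connection manager. If you ever need to do the same thing from a Script Task instead, a rough C# sketch could look like the following; the provider, file path, and table name are assumptions to adapt to your package.

using System.Data.OleDb;

// Hypothetical path; in a real package this would come from a variable or expression.
string connStr = @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\temp\Errors.xls;" +
                 "Extended Properties='Excel 8.0;HDR=YES'";   // Jet 4.0 also works for .xls

using (OleDbConnection conn = new OleDbConnection(connStr))
{
    conn.Open();

    // Drop the worksheet-backed table created by the Excel destination...
    using (OleDbCommand drop = new OleDbCommand("DROP TABLE `My data`", conn))
    {
        drop.ExecuteNonQuery();
    }

    // ...and recreate it empty, with the same column layout as in the question,
    // so the next run can insert into a fresh table.
    using (OleDbCommand create = new OleDbCommand(
        "CREATE TABLE `My data` (`ErrorCode` Long, `ErrorColumn` Long, `ID` Long, `MyStringColumn` LongText)",
        conn))
    {
        create.ExecuteNonQuery();
    }
}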