Outputting a single Excel file with multiple worksheets - excel

Is there a component in Talend Open Studio for Data Integration to be able to output a single Excel file but with 2 separate sheets in it?
I want to separate some columns in the original file into another sheet and another set of columns to the second sheet.

You'll need to send your data to two separate tFileOutputExcel components, with the second one set to append the data to the file as a different sheet.
As a quick example, take some name and age data held against a unique id that needs to be split across two sheets: id and name on one sheet, id and age on the other.
I'm generating this data with a tRowGenerator component configured to produce a sequence for the id, random first names, and random ages between 18 and 75.
I then split this data using a tMap component.
The first flow of data goes to the first tFileOutputExcel component, which creates the file with a "Names" sheet.
Unfortunately, we can't write the second sheet to the same file straight away, because Talend needs a write lock on the Excel file. Instead, we stash the data in memory using a tBufferOutput component (a tHashOutput component would also work, or the data could be stashed on disk in a temporary file or a database if it is likely to exceed available memory).
Once the first subjob has finished writing the names data to the Names sheet of the target file, we can read the age data out of the buffer and into the second tFileOutputExcel, which is configured to append its sheet to the target file.
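The flow described above can be sketched outside Talend in plain Python. This is only an illustration of the order of operations (generate, split, write the first sheet, then append the buffered second sheet), not Talend code. The function names, the sample first names, the "Ages" sheet name, and the dict that stands in for the Excel file are all assumptions made for the example.

```python
import random

# Hypothetical sample values standing in for the random first names.
FIRST_NAMES = ["Alice", "Bob", "Carol", "Dave"]


def generate_rows(count, seed=0):
    """Stand-in for the row generator: sequential id, random name, age 18-75."""
    rng = random.Random(seed)
    return [
        {"id": i, "name": rng.choice(FIRST_NAMES), "age": rng.randint(18, 75)}
        for i in range(1, count + 1)
    ]


def split_rows(rows):
    """Stand-in for the tMap: one flow with id+name, one with id+age."""
    names = [{"id": r["id"], "name": r["name"]} for r in rows]
    ages = [{"id": r["id"], "age": r["age"]} for r in rows]
    return names, ages


def write_workbook(rows):
    """Write the first sheet, keep the second flow buffered, then append it."""
    workbook = {}  # sheet name -> rows; stands in for the Excel file
    names, buffered_ages = split_rows(rows)
    workbook["Names"] = names         # first subjob creates the file
    workbook["Ages"] = buffered_ages  # second subjob appends from the buffer
    return workbook
```

Calling `write_workbook(generate_rows(5))` returns a dict with a "Names" sheet followed by an "Ages" sheet, each holding five rows keyed by the same ids.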

Related

Create Connection to one raw data file for multiple excel files

I currently have one excel file with four worksheets with data (Name: target value 2022.xlsx). This data is used in multiple excel files to make calculations and to show the values using VLookUp. Until now I copy-paste the values from this one file with four worksheets into all the other files with those four worksheets (and more) when one value changes throughout the year. It also seems to be problematic when a new year begins and a new "target value 2023.xlsx" is required. I tested a lot of ways to make a connection, but nothing seems to be the perfect way:
copy-paste each table via VBA (current way, but I don't want to open every file just because one value changed and click the "Refresh" button)
external reference Cell A1: =[target value 2022.xlsx]Table1!E3 (if one column is deleted, the connection shows #REF!)
Data > New Query > From File > From Workbook (if one column is deleted, the power query doesn't work anymore)
Data > From Text (only works if all four worksheets are in four separate CSV files, not optimal)
Data > From Access (seems to be the best way to get the data from four different tables within the database???)
What's the best way to do this, if multiple people use it? The values in "target value 2022.xlsx" change multiple times a year and many users need different files where the data is required. Thank you!

How to load Excel raw data into power query without converting into data table?

I am trying to find the name and full path of the current excel file where the power query is run.
I don't need the filename as such; I just want access to a sheet that does not have any data table and holds raw data instead.
When I try Excel.CurrentWorkbook(), it only gives a list of the tables in the current workbook. But when I access the file by its name and full path using File.Contents(), all the sheet objects are returned, including the sheets that contain raw data (not converted into a data table).
So my plan is: if I could get the file name and path of the current workbook, I could use it to access the sheet. I can't hardcode the file name, as it changes every day with the date as a suffix.
Is there any other way around it?
I don't think this is currently possible using Excel.CurrentWorkbook().
It's possible to use a substring of CELL("filename") as a named range to read in the current path and workbook name into Power Query to use File.Contents but at that point, it's probably easier just to convert the sheet to a named range instead (only a few keys/clicks: select all data and hit the From Table button in the Data tab Get & Transform ribbon section).
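For reference, CELL("filename") returns the folder path, then the workbook name in square brackets, then the sheet name, for example `C:\Reports\[Sales.xlsx]Raw`. The substring step can be sketched as below. This is a Python illustration of the string handling only, not Power Query code; the function name and the sample path are made up for the example.

```python
def split_cell_filename(cell_value):
    """Split a CELL("filename") result into (workbook_path, sheet_name)."""
    open_idx = cell_value.rfind("[")
    close_idx = cell_value.find("]", open_idx)
    if open_idx == -1 or close_idx == -1:
        # An unsaved workbook yields an empty string, which lands here too.
        raise ValueError("expected a value like C:\\dir\\[Book.xlsx]Sheet1")
    folder = cell_value[:open_idx]               # keeps the trailing separator
    workbook = cell_value[open_idx + 1:close_idx]
    sheet = cell_value[close_idx + 1:]
    return folder + workbook, sheet
```

The first element of the result is the full path that would be handed to File.Contents; the second is the sheet the formula sits on.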

Compounding multiple Excel sheets in the same workbook into one sheet

I have an Excel workbook with 36 sheets that I receive every 2 weeks. The sheets share some common headers across all tabs and also have unique headers that differ on each tab. Each record carries a unique ID, and one ID can have several records.
What I'm trying to do is strip the unique IDs from all of the sheets then pull the data through from each of them onto one sheet with all of the common headers as well as all of the unique headers.
I was considering using the code from the post below to import the sheets into Access, connect the tables, and export the result back into one sheet in Excel. But the code doesn't work: I get the run-time error 'field "F1" does not exist in destination table', and I can't see how they fixed that issue.
Importing multiple sheets from an excel file into multiple tables by sheet name
I'm not sure that's the best way to achieve what I'm trying to do.
Don't import the sheets, link them.
Then create a straight select query using the linked table(s) as source and where you rename (alias) fields like F1 to something meaningful. Also apply simple filtering for invalid records and conversion of field values as needed.
Then use this/these query/queries for your further processing.
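The kind of select query meant here can be sketched with an in-memory SQLite table standing in for the linked sheet. This is an assumption-heavy illustration: Access has its own SQL dialect and conversion functions, and the table name, the sample rows, and the aliases record_id, surname, and amount are invented for the example.

```python
import sqlite3


def query_linked_sheet():
    conn = sqlite3.connect(":memory:")
    # Stand-in for a linked sheet whose columns came through as F1, F2, F3.
    conn.execute("CREATE TABLE Sheet1 (F1 TEXT, F2 TEXT, F3 TEXT)")
    conn.executemany(
        "INSERT INTO Sheet1 VALUES (?, ?, ?)",
        [("A001", "Smith", "12.50"), (None, "junk row", None), ("A002", "Jones", "7")],
    )
    # Alias the generic field names, filter invalid records, convert values.
    rows = conn.execute(
        """
        SELECT F1 AS record_id, F2 AS surname, CAST(F3 AS REAL) AS amount
        FROM Sheet1
        WHERE F1 IS NOT NULL
        ORDER BY F1
        """
    ).fetchall()
    conn.close()
    return rows
```

Downstream queries then refer to record_id, surname, and amount rather than F1, F2, and F3.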

Combining CSVs in Power Query returning 1 row of data

I am trying to set up a query that will simply combine data from CSVs into a table as new files get added to a specific folder, where each row contains the data from a separate file. While doing tests with CSVs that I created in excel, this was very simple. After expanding the content column, I would see an individual row of data for each file.
In practice however, where I am trying to use CSVs that are put out from a proprietary android app, expanding the content column leads to 1 single row, with data from all files placed end to end.
Does this have something to do with there not being an "end of line" character in the CSVs the app is producing? If so, is there an easy way to remedy this without changing the app? If not, is there something simple and direct I can ask the developer to change which would prevent this behavior?
Thanks for any insight!
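One way to test the end-of-line idea before contacting the developer is to count the line terminators in the raw bytes of one of the app's files. The sketch below is a diagnostic only and does not by itself confirm the cause; the function name is hypothetical.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF, and bare CR terminators in raw file bytes."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "LF": data.count(b"\n") - crlf,  # LF not preceded by CR
        "CR": data.count(b"\r") - crlf,  # CR not followed by LF
    }
```

Read a file with `open(path, "rb").read()` and pass the bytes in. All zeros would mean the file has no line terminators at all, while a nonzero "CR" count with nothing else points at bare carriage returns.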

Skipping rows when importing Excel into SQL using SSIS 2008

I need to import sheets which look like the following:
March Orders
***Empty Row
Week Order # Date Cust #
3.1 271356 3/3/10 010572
3.1 280353 3/5/10 022114
3.1 290822 3/5/10 010275
3.1 291436 3/2/10 010155
3.1 291627 3/5/10 011840
The column headers are actually in row 3. I can use an Excel Source to import them, but I don't know how to specify that the information starts at row 3.
I Googled the problem, but came up empty.
Have a look at these threads.
The links have more details, but I've included some text from the pages (just in case the links go dead):
http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/97144bb2-9bb9-4cb8-b069-45c29690dfeb
Q:
While we are loading the text file to SQL Server via SSIS, we have the
provision to skip any number of leading rows from the source and load
the data to SQL server. Is there any provision to do the same for
Excel file.
The source Excel file for me has some description in the leading 5
rows, I want to skip it and start the data load from the row 6. Please
provide your thoughts on this.
A:
Easiest would be to give each row a number (a bit like an identity in
SQL Server) and then use a conditional split to filter out everything
where the number <=5
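The numbering-plus-filter idea in that answer looks roughly like this in plain Python. It is a sketch of the logic rather than an SSIS package, and the function name is made up for the example.

```python
def skip_leading_rows(rows, rows_to_skip=5):
    """Number each row from 1 and keep only rows numbered above rows_to_skip."""
    numbered = enumerate(rows, start=1)
    return [row for number, row in numbered if number > rows_to_skip]
```

With seven input rows and the default of 5, only the sixth and seventh rows are kept.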
http://social.msdn.microsoft.com/Forums/en/sqlintegrationservices/thread/947fa27e-e31f-4108-a889-18acebce9217
Q:
Is it possible during import data from Excel to DB table skip first 6 rows for example?
Also Excel data divided by sections with headers. Is it possible for example to skip every 12th row?
A:
YES YOU CAN. Actually, you can do this very easily if you know the number of columns that will be imported from your Excel file. In
your Data Flow task, you will need to set the "OpenRowset" Custom
Property of your Excel Connection (right-click your Excel connection >
Properties; in the Properties window, look for OpenRowset under Custom
Properties). To ignore the first 5 rows in Sheet1, and import columns
A-M, you would enter the following value for OpenRowset: Sheet1$A6:M
(notice, I did not specify a row number for column M. You can enter a
row number if you like, but in my case the number of rows can vary
from one iteration to the next)
AGAIN, YES YOU CAN. You can import the data using a conditional split. You'd configure the conditional split to look for something in
each row that uniquely identifies it as a header row; skip the rows
that match this 'header logic'. Another option would be to import all
the rows and then remove the header rows using a SQL script in the
database...like a cursor that deletes every 12th row. Or you could
add an identity field with seed/increment of 1/1 and then delete all
rows with row numbers that divide perfectly by 12. Something like
that...
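Both ideas from that answer can be sketched in a few lines of Python: building the range string in the Sheet1$A6:M form quoted above, and dropping every 12th row by row number. The helper names are hypothetical, and the second function only illustrates the filtering logic, not the conditional split or SQL script itself.

```python
def open_rowset_range(sheet, first_col, last_col, first_row, last_row=None):
    """Build a range string such as Sheet1$A6:M (open-ended when last_row is None)."""
    end = f"{last_col}{last_row}" if last_row is not None else last_col
    return f"{sheet}${first_col}{first_row}:{end}"


def drop_every_nth(rows, n=12):
    """Number rows from 1 and drop those whose number divides evenly by n."""
    return [row for number, row in enumerate(rows, start=1) if number % n != 0]
```

For example, `open_rowset_range("Sheet1", "A", "M", 6)` gives `Sheet1$A6:M`, matching the value in the answer.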
http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/847c4b9e-b2d7-4cdf-a193-e4ce14986ee2
Q:
I have an SSIS package that imports from an Excel file with data
beginning in the 7th row.
Unlike the same operation with a csv file ('Header Rows to Skip' in
Connection Manager Editor), I can't seem to find a way to ignore the
first 6 rows of an Excel file connection.
I'm guessing the answer might be in one of the Data Flow
Transformation objects, but I'm not very familiar with them.
A:
rbhro, actually there were
2 fields in the upper 5 rows that had some data that I think prevented
the importer from ignoring those rows completely.
Anyway, I did find a solution to my problem.
In my Excel source object, I used 'SQL Command' as the 'Data Access
Mode' (it's drop down when you double-click the Excel Source object).
From there I was able to build a query ('Build Query' button) that
only grabbed records I needed. Something like this: SELECT F4,
F5, F6 FROM [Spreadsheet$] WHERE (F4 IS NOT NULL) AND (F4
<> 'TheHeaderFieldName')
Note: I initially tried an ISNUMERIC instead of 'IS NOT NULL', but
that wasn't supported for some reason.
In my particular case, I was only interested in rows where F4 wasn't
NULL (and fortunately F4 didn't contain any junk in the first 5
rows). I could skip the whole header row (row 6) with the 2nd WHERE
clause.
So that cleaned up my data source perfectly. All I needed to do now
was add a Data Conversion object in between the source and destination
(everything needed to be converted from unicode in the spreadsheet),
and it worked.
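To see what that WHERE clause keeps and drops, the same query can be run against an in-memory SQLite table. SQLite is only a stand-in here for the Excel provider that SSIS uses, and the function name and sample rows are assumptions for the example.

```python
import sqlite3


def load_filtered_rows(raw_rows):
    """Run the quoted query against rows loaded into a stand-in table."""
    conn = sqlite3.connect(":memory:")
    # Untyped columns, so blank cells stay NULL and text stays text.
    conn.execute('CREATE TABLE "Spreadsheet$" (F4, F5, F6)')
    conn.executemany('INSERT INTO "Spreadsheet$" VALUES (?, ?, ?)', raw_rows)
    rows = conn.execute(
        "SELECT F4, F5, F6 FROM [Spreadsheet$] "
        "WHERE (F4 IS NOT NULL) AND (F4 <> 'TheHeaderFieldName')"
    ).fetchall()
    conn.close()
    return rows
```

Rows whose F4 is empty (the descriptive lines at the top) and the row whose F4 holds the header text are both filtered out, leaving only data rows.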
My first suggestion is not to accept a file in that format. Excel files to be imported should always start with column header rows. Send it back to whoever provides it to you and tell them to fix their format. This works most of the time.
We provide guidance to our customers and vendors about how files must be formatted before we can process them, and it is up to them to meet the guidelines as much as possible. People often aren't aware that files like that create a problem in processing (next month it might have six lines before the data starts). They need to be educated that Excel files must start with the column headers, have no blank lines in the middle of the data, and not repeat the headers multiple times. Most important of all, they must have the same columns with the same column titles in the same order every time. If they can't provide that, then you probably don't have something that will work for automated import, as you will get the file in a different format every time depending on the mood of the person who maintains the Excel spreadsheet. Incidentally, we push really hard to never receive any data from Excel (this only works some of the time, but if they have the data in a database, they can usually accommodate). They also must know that any changes they make to the spreadsheet format will result in a change to the import package and that they will be charged for those development changes (assuming that these are outside clients and not internal ones). These changes must be communicated in advance and developer time scheduled; otherwise a file with the wrong format will fail and be returned to them to fix.
If that doesn't work, may I suggest that you open the file, delete the first two rows and save it as a text file in a data flow. Then write a data flow that will process the text file. SSIS did a lousy job of supporting Excel, and anything you can do to get the file in a different format will make life easier in the long run.
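The "drop the first two rows and save as text" step might look like this in Python, assuming the sheet has already been read into a list of rows. Reading the Excel file itself is left out, and the function name is hypothetical.

```python
import csv
import io


def strip_preamble(rows, preamble_rows=2):
    """Drop the leading title/blank rows and return the rest as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(rows[preamble_rows:])
    return out.getvalue()
```

With the sample sheet from the question (a "March Orders" title row and an empty row), the output starts at the Week, Order #, Date, Cust # header line, which is the layout a flat-file import expects.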
My first suggestion is not to accept a file in that format. Excel files to be imported should always start with column header rows. Send it back to whoever provides it to you and tell them to fix their format. This works most of the time.
Not entirely correct.
SSIS forces you to use the format, and quite often it does not work correctly with Excel.
If you can't change the format, consider using our Advanced ETL Processor.
You can skip rows or fields and you can validate the data the way you want.
http://www.dbsoftlab.com/etl-tools/advanced-etl-processor/overview.html
Sky is the limit
You can just use the OpenRowset property you can find in the Excel Source properties.
Take a look here for details:
SSIS: Read and Export Excel data from nth Row
Regards.
