Script task in SSIS to import Excel spreadsheet

I have reviewed the questions that may have had my answer and unfortunately they don't seem to apply. Here is my situation. I have to import worksheets from my client. In columns A, C, D, and AA the client has the information I need. The balance of the columns have what to me is worthless information. The column headers are consistent in the four columns I need, but are very inconsistent in the columns that don't matter. For example cell A1 contains Division. This is true across all of the spreadsheets. Cell B1 can contain anything from sleeve length to overall length to fit. What I need to do is to import only the columns I need and map them to an SQL 2008 R2 table. I have defined the table in a stored procedure which is currently calling an SSIS function.
The problem is that when I try to import a spreadsheet that has different column names, the SSIS package fails and I have to go back in and run it manually to get the fields set up right.
I cannot imagine that what I am trying to do has not been done before. Just so the magnitude is not lost, I have 170 users who have over 120 different spreadsheet templates.
I am desperate for a workable solution. I can do everything after getting the file into my table in SQL. I have even written the code to move the files back to the FTP server.

I put together a post describing how I've used a Script task to parse Excel. It has allowed me to import decidedly non-tabular data into a data flow.
The core concept is that you use the JET or ACE provider and simply query the data out of an Excel worksheet or named range. Once you have that, you have a dataset you can walk through row by row and perform whatever logic you need. In your case, you can skip row 1 for the header and then import only columns A, C, D and AA.
That logic would go in the ExcelParser class. So, the Foreach loop on line 71 would probably be distilled down to something like this (code approximate):
// This gets the value of column A
current = dr[0].ToString();
// this assigns the value of current into our output row at column 0
newRow[0] = current;
// This gets the value of column C
current = dr[2].ToString();
// this assigns the value of current into our output row at column 1
newRow[1] = current;
// This gets the value of column D
current = dr[3].ToString();
// this assigns the value of current into our output row at column 2
newRow[2] = current;
// This gets the value of column AA
current = dr[26].ToString();
// this assigns the value of current into our output row at column 3
newRow[3] = current;
You might need to do type conversions and such here, but that's the core of the parsing logic.
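The index arithmetic behind dr[0], dr[2], dr[3] and dr[26] can be sketched as follows. This is a Python illustration only, not Script task code: the helper names column_index and pick_columns are hypothetical, and each row is assumed to be a plain list with at least 27 cells.

```python
def column_index(letters: str) -> int:
    """Convert an Excel column label such as 'A' or 'AA' to a zero-based index."""
    index = 0
    for ch in letters.upper():
        # Column labels are a base-26 number with digits A..Z worth 1..26.
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


# The four columns the question says are consistent across templates.
WANTED = ["A", "C", "D", "AA"]


def pick_columns(rows, wanted=WANTED, skip_header=True):
    """Return only the wanted columns from each row, optionally skipping row 1."""
    indexes = [column_index(label) for label in wanted]
    data_rows = rows[1:] if skip_header else rows
    return [[row[i] for i in indexes] for row in data_rows]
```

Because the columns are selected by position rather than by header text, the inconsistent headers in the other columns never come into play.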

Related

Excel Online Workbook Links - Linking full row range

I am using Excel Online in the browser, have setup a workbook link to my main file from a source. In my main file I have table headers and additional columns with formula. I just need from A2 to AC down. The issue is that the source file changes daily. There might be more rows the next day or fewer. I need to be able to reference set columns and then detect how many rows are in the data source and update the main file
So far, I have something like this
='https://sharepoint.com/personal/myFolder/Documents/[data_source.xlsx]in'!A2
With the equivalent formulas in B2 and C2, this loads the first row. I can select a range from the source data so that it loads all of it, but if there are more rows the next day it won't load them, and if there are fewer, the missing rows display as blanks.
How can I tell the formula to select Columns A2 to C2 and extend down, or refresh the data like it does in Excel desktop when using data connections?
For example, when the source data has extra rows on day 2, those rows won't be loaded in my main file.
You can use Power Automate and two Office Scripts to link the two workbooks together.
You'd start with a recurrence trigger, so you'd pick how often you'd like the flow to run (weekly, daily, etc.).
After you set the recurrence, you have to write an Office Script that works with the table data. You can get the table's data body range with the table's getRangeBetweenHeaderAndTotal() method, and once you have that, you can resize the range to get the data you need. Next, you read the values with the getValues() method. getValues() returns a 2D array, which you can't return from a Power Automate Run script step. Since you can return a string, you get around that by converting the 2D array to a JSON string. You can see the code below:
function main(workbook: ExcelScript.Workbook): string {
    let sh: ExcelScript.Worksheet = workbook.getActiveWorksheet();
    //get table
    let tbl: ExcelScript.Table = sh.getTable("Table1");
    //get table's column count
    let tblColumnCount: number = tbl.getColumns().length;
    //set number of columns to keep
    let columnsToKeep: number = 3;
    //set the number of rows to remove
    let rowsToRemove: number = 0;
    //resize the table range
    let tblRange: ExcelScript.Range = tbl.getRangeBetweenHeaderAndTotal().getResizedRange(rowsToRemove, columnsToKeep - tblColumnCount);
    //get the table values
    let tblRangeValues: string[][] = tblRange.getValues() as string[][];
    //create a JSON string
    let result: string = JSON.stringify(tblRangeValues);
    //return JSON string
    return result;
}
Once you've created your script, consider naming it something you'll remember when you call it in Power Automate (I called mine getTableValues). Next, after the recurrence in Power Automate, add a Run script step, fill out the values, and select the script.
Next, you have to create the script which takes the input returned from the previous script and completes the final steps. So the script has to have a parameter that takes the string returned from the previous script (I called it tableValues in mine). In the script, you have to parse the json string array to create a 2d array, resize the initial range, and then set the values of the resized range. You can see a script that does that below:
function main(workbook: ExcelScript.Workbook, tableValues: string)
{
    let sh: ExcelScript.Worksheet = workbook.getWorksheet("Sheet1");
    //parses the JSON string to create array
    let tableValuesArray: string[][] = JSON.parse(tableValues);
    //gets row count from the array
    let valuesRowCount: number = tableValuesArray.length - 1;
    //gets column count from the array
    let valuesColumnCount: number = tableValuesArray[0].length - 1;
    //resizes the range
    let rang: ExcelScript.Range = sh.getRange("A1").getResizedRange(valuesRowCount, valuesColumnCount);
    //sets the value of the resized range to the array
    rang.setValues(tableValuesArray);
}
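The hand-off between the two scripts, and the reason for the "- 1" in the second one, can be sketched in Python. This is an illustration of the logic only, not Office Scripts code; serialize_values and resize_deltas are hypothetical names. getResizedRange takes deltas relative to the current range, and the starting range A1 is already one row by one column, so the deltas are one less than the array's dimensions.

```python
import json


def serialize_values(values):
    """Mimic the first script: turn a 2D list into a JSON string."""
    return json.dumps(values)


def resize_deltas(table_values_json):
    """Mimic the second script: parse the string and compute the resize deltas."""
    values = json.loads(table_values_json)
    # A1 is already 1 x 1, so grow it by (rows - 1) and (columns - 1).
    delta_rows = len(values) - 1
    delta_cols = len(values[0]) - 1
    return values, delta_rows, delta_cols
```

Like the script above, this sketch assumes the array has at least one row; an empty table would need a separate check before indexing the first row.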
In Power Automate, you have to create a second Run script step. After you've selected the script, you should be prompted for a value (it is called tableValues in my step). In the tableValues input, enter the dynamic content Result value from the first step. Once this is done, you can save the flow and test it.
One thing to note is that the second script doesn't delete old range values from previous runs. This can be done in a number of different ways. But the preferred way may depend on how the workbook is structured. So I'd recommend writing code to clear the range in the second script somewhere in the beginning. Or better yet, add the output of the first script into an Excel table. And just empty out the table every time you run the second script.
If you'd like to see how you might do that, you can take a look at this post here

Pyspark conditionally replace value in column with value from another column

I am working with some weather data that is missing some values (indicated via value code). For example, if SLP data is missing, it is assigned code 99999. I was able to use a window function to calculate a 7 day average and save it as a new column. A significantly reduced example of a single row is shown below:
SLP_ORIGIN | SLP_ORIGIN_7DAY_AVG
99999 | 11945.823516044207
I'm trying to write code such that when SLP_ORIGIN has the missing code it gets replaced using the SLP_ORIGIN_7DAY_AVG value. However, most code explains how to replace a column value based on a conditional with a constant value, not the column value. I tried using the following:
from pyspark.sql.functions import when
train_impute = train.withColumn("SLP_ORIGIN",
    when(train["SLP_ORIGIN"] == 99999, train["SLP_ORIGIN_7DAY_AVG"]).otherwise(train["SLP_ORIGIN"]))
where the dataframe is called train.
When I perform a count on the SLP_ORIGIN column using train.where("SLP_ORIGIN = 99999").count() I get the same count from before I attempted replacing the value in that column. I have already checked and my SLP_ORIGIN_7DAY_AVG does not have any values that match the missing code.
So how do I actually replace the 99999 values in the SLP_ORIGIN column with the associated SLP_ORIGIN_7DAY_AVG value?
EVEN BETTER, is there a way to do this replacement and window calculation without making a 7 day average column (I have other variables I need to do the same thing with so I'm hoping there is a more efficient way to do this).
Make sure to double check which dataframe you are verifying on.
I was using train.where("SLP_ORIGIN = 99999").count() when I should have been using train_impute.where("SLP_ORIGIN = 99999").count()
Additionally, instead of making a whole new column to store the imputed 7 day average, one can calculate the average only when the missing value code is present (assuming f is pyspark.sql.functions and w is the window already defined for the 7 day average):
train = train.withColumn("SLP_ORIGIN", when(train["SLP_ORIGIN"] == 99999, f.avg('SLP_ORIGIN').over(w)).otherwise(train["SLP_ORIGIN"]))
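One caveat worth checking: an average taken over the raw column also includes any 99999 codes that fall inside the window, which inflates the imputed value. The plain-Python sketch below shows the intended logic with the codes excluded. It is not Spark code; the function name is hypothetical, and it assumes a trailing window of 7 rows that includes the current row, which may differ from the window actually used.

```python
MISSING = 99999


def impute_with_trailing_mean(values, window=7, missing=MISSING):
    """Replace each missing code with the mean of the valid readings in the trailing window."""
    result = []
    for i, value in enumerate(values):
        if value != missing:
            result.append(value)
            continue
        start = max(0, i - window + 1)
        # Average only the original readings that are not the missing code.
        valid = [v for v in values[start:i + 1] if v != missing]
        # Leave the code in place when the window holds no valid reading.
        result.append(sum(valid) / len(valid) if valid else missing)
    return result
```

In PySpark the same effect can probably be had by averaging a conditional column, for example f.avg(f.when(f.col("SLP_ORIGIN") != 99999, f.col("SLP_ORIGIN"))).over(w), since when() without otherwise() yields null and avg skips nulls.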

How to permanently +1 a variable outside of a for loop

I have a Python script that pulls cryptocurrency data from Yahoo Finance. It scrapes this data and then inputs it into a Google Sheet. What I want is for this script to run once per day, which I will schedule through task scheduler, and each day pull the data from Yahoo Finance and then input into the sheet.
I want this spreadsheet to store data historically. For example, first column = today's data, second column = tomorrow's data (which the script will collect when it runs tomorrow) which means the column value needs to +1 after the script runs every time. I've put the column number into a variable, so what I'm essentially trying to do is this;
today, c
tomorrow, c + 1
day after, c + 2
and so on, so that the sheet gets filled out and the data doesn't get overwritten. I doubt there's a way to permanently change the value of the variable other than going into the script and changing the value of c manually before it runs. Is there any other way to go about this?
Not sure how to tackle this - c = c + 1 at the end of the function was my best guess but I know this won't work. The only other thing I can think of is to write out this function hundreds of times, first time c = 1, second time c = 2 and so on with "time.sleep(86400)" in between functions and have the script running constantly but as you can imagine I'd rather not do that.
def google_sheet_import(var1):
    r = 2
    c = 3
    for w in (var1):
        cc_worksheet.update_cell(r, c, w.text)
        r = r + 1

google_sheet_import(intraday_price)
All that happens is the script overwrites the data in c because I can't get it to permanently change the value of c after the script runs.
You have several options:
Store the c value somewhere in your sheet. Before inserting a new column, read it and use it as c's value; after inserting, increment it.
Store the c variable in a file (see Python's open function) and increment it accordingly.
Compute the number of columns you have in your sheet when the Python script is executed, and then assign c's value based on that. This is probably the most sensible option and the one I will provide code for:
def google_sheet_import(var1):
    r = 2
    # first empty column: one past the last populated cell in the first row
    c = len(cc_worksheet.get_all_values()[0]) + 1
    for w in (var1):
        cc_worksheet.update_cell(r, c, w.text)
        r = r + 1

google_sheet_import(intraday_price)
The only change is in the line that sets c. It finds the last populated cell in the first row and writes to the column after it, so earlier data is not overwritten (if your first row is empty or can be empty, I recommend that you use another row that you know will be populated).
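For the file-based option mentioned above, a minimal sketch could look like this. The function name next_column and the counter file are hypothetical; the default starting column of 3 is taken from the question's code.

```python
from pathlib import Path


def next_column(counter_file: Path, first_column: int = 3) -> int:
    """Return the column to write to on this run and persist the one for the next run."""
    if counter_file.exists():
        column = int(counter_file.read_text().strip())
    else:
        # First run: no counter has been saved yet.
        column = first_column
    counter_file.write_text(str(column + 1))
    return column
```

Calling c = next_column(Path("column_counter.txt")) at the start of google_sheet_import would then give 3 on the first run, 4 on the next, and so on. The drawback compared with counting the sheet's columns is that the file and the sheet can drift apart if a run fails halfway.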

Making a vector out of excel columns using python

everyone...
I just started on python a couple of days ago because I require to handle some excel data in order to automatically update the data of certain cells from one file into another.
However, I'm kind of stuck since I have barely programmed before, and it's my first time using python as well, but my job required me to find a solution and I'm trying to make it work even though it's not my field of expertise.
I used the "xlrd library", imported my file and managed to print the columns I'm needing... However, I can't find a way to put those columns into a matrix in order to handle the data like this:
Matrix =[DataColumnA DataColumnG DataColumnH] in the size [nrows x 3]
As for now, I have 3 different outputs for the 3 different columns I need, but I'm trying to join them together into one big matrix.
So far my code looks like this:
import xlrd

workbook = xlrd.open_workbook("190219_serviciosWRAmanualV5.xls")
worksheet = workbook.sheet_by_name("ServiciosDWDM")
workbook2 = xlrd.open_workbook("Potencia2.xlsx")
worksheet2 = workbook2.sheet_by_name("Hoja1")
filas = worksheet.nrows
filas2 = worksheet2.nrows
columnas = worksheet.ncols
for row in range(2, filas):
    Equipo_A = worksheet.cell(row, 12).value
    Client_A = worksheet.cell(row, 13).value
    Line_A = worksheet.cell(row, 14).value
    print(Equipo_A, Line_A, Client_A)
So I have only gotten, as mentioned above, the data in the columns which is what I'm printing which you can see.
What I'm trying to do, or the main thing I need to do is to read the cell of the first row in Column A and look for it in the other excel file... if the names match, I would have to validate that for the same row (in file 1) the data in both the ColumnG and ColumnH is the same as the data in the second file.
If they match I would have to update Column J in the first file with the data from the second file.
My other approach is to retrieve the value of the cell in ColumnA and look for it in the column A of the second file, then I would make an if conditional to see if ColumnsG and H are equal to Column C of 2nd file and so on...
The thing here is, I have no idea how to pin point the position of the cell and extract the data to make the conditional for this second approach.
I'm not sure if by making that matrix my approach is okay or if the second way is better, so any suggestion would be absolutely appreciated.
Thank you in advance!
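The second approach described above (look the name up in the other file, compare two columns, then update) can be sketched with a dictionary instead of a matrix. This is only an illustration under assumptions: rows are plain lists, for example built with worksheet.row_values(r), and the column positions for the second file are hypothetical because the question does not fully specify its layout.

```python
# Column positions are assumptions for illustration (zero-based).
KEY_1, CHECK_1, TARGET_1 = 0, (6, 7), 9   # file 1: A, (G, H), J
KEY_2, CHECK_2, SOURCE_2 = 0, (1, 2), 3   # file 2: A, (B, C), D (hypothetical layout)


def update_matches(rows1, rows2):
    """Copy the file-2 value into column J of file 1 where the key and both check columns agree."""
    # Index file 2 by name so each lookup is a single dictionary access.
    lookup = {row[KEY_2]: row for row in rows2}
    updated = 0
    for row in rows1:
        match = lookup.get(row[KEY_1])
        if match is None:
            continue  # name not present in the second file
        if all(row[a] == match[b] for a, b in zip(CHECK_1, CHECK_2)):
            row[TARGET_1] = match[SOURCE_2]
            updated += 1
    return updated
```

The function changes the lists in memory only. xlrd is a read-only library, so writing the updated rows back to a workbook would need a different library.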

How do I tell Matlab that some of the data which is being imported is in Hex?

I am trying to import the data from an excel sheet in which there are 3 columns (Time; Id; and Data).
The first column contains time (in seconds) while the next column has numbers (Id). The third column (Data), however, has hexadecimal numbers, therefore Matlab is unable to process that information and shows 'NaN' in the command window/structure whenever I try to retrieve any information from the third column.
I want to create a structure for each Id and display the respective information inside that Id.
Is there any way to obtain the information from the third column without getting NaN?
Here's my code:
[dat1, dat2, dat3] = xlsread('1');
flds = dat3(1,:);
bus = cell2struct(dat3(2:end,:), flds, 2);
for k = 1:length(bus)
    if bus(k).Id == 150
        i = i + 1;
        can_bus(k,:)
    end
end
You can get the raw text and do the conversion yourself (for example with hex2dec):
To get the text, you have to get other output parameters from xlsread. For example:
[num,txt,raw] = xlsread('file.xls');
will return the numeric values in num, the text values in txt, and the raw cell data in raw.
Source
