I have data which users submit in columns. I am trying to convert them from columnar into stacked rows so a database can read them as single columns, rather than having to pull in X columns to capture the data.
You can see in the Columnar Example screenshot where I am presently, and in the Stacked Example where I want to be.
I can do this in either Google Sheets using App Script - load the data as columnar then convert to stacked and move to BigQuery - OR - use Google Cloud Storage to load the columnar data then use Node to convert to stacked when moving to BigQuery.
Either way, this needs to get into BigQuery as the Stacked Example.
Any ideas on how to do this?
The function below does the following:
Get all data from the sheet where your original data is located (called Source, please change accordingly) with getDataRange.
Remove the headers from the retrieved array and append them to your destination sheet (called Target) with shift, slice and appendRow.
Iterate through the rest of the rows with a forEach and, for each row, iterate through each successive group of four columns with a for loop, appending a new row to your destination sheet that contains (1) the first four columns of the row and (2) the current group of four columns.
function myFunction() {
  var ss = SpreadsheetApp.getActive();
  var sourceSheet = ss.getSheetByName('Source'); // Change accordingly
  var targetSheet = ss.getSheetByName('Target'); // Change accordingly
  targetSheet.clearContents(); // Removes all old content from target sheet before appending new data (remove this if you don't want that)
  var sourceValues = sourceSheet.getDataRange().getValues();
  var numCols = 4; // Number of desired columns in destination spreadsheet
  var headers = sourceValues.shift().slice(0, numCols * 2);
  targetSheet.appendRow(headers); // Remove if the destination sheet already has headers
  sourceValues.forEach(function(row) { // Iterate through each row in source sheet
    for (var i = numCols; i < row.length; i += numCols) { // Iterate through each group of four columns (excluding first four) in each row
      var part1 = row.slice(0, numCols); // First four columns
      var part2 = row.slice(i, i + numCols); // Each successive group of four
      var targetRow = part1.concat(part2); // Concatenate the first four columns with each group of four
      targetSheet.appendRow(targetRow); // Append row (8 columns)
    }
  });
}
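The reshaping itself is independent of the Sheets API. As a minimal sketch of the same loop in plain Python (sample data invented for illustration), each trailing group of four columns is paired with the first four key columns and emitted as its own row:

```python
def stack_rows(rows, num_cols=4):
    """Repeat the first num_cols key columns once per trailing group of num_cols."""
    stacked = []
    for row in rows:
        key = row[:num_cols]
        # Walk the remaining columns in steps of num_cols, one output row per group.
        for i in range(num_cols, len(row), num_cols):
            stacked.append(key + row[i:i + num_cols])
    return stacked

# One wide source row holding two groups of four becomes two stacked rows.
wide = [["id1", "a", "b", "c", "g1w", "g1x", "g1y", "g1z", "g2w", "g2x", "g2y", "g2z"]]
print(stack_rows(wide))
```

The same output list could then be written to a sheet in one call (or loaded straight into BigQuery) instead of appending row by row.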
I am looking for a way to get specific columns by name from several tables. My data comes in several sheets with different numbers of columns (up to 38), so I cannot use getColumnById. I only need 7 columns from each.
First I convert all sheet ranges to tables, then I get all the tables.
What I want is to get specific columns by name and merge them all into one table on a new sheet.
I followed an example from the docs, but I am stuck at getting the column name for each table.
I know my header values; they are shown in the example below.
function main(workbook: ExcelScript.Workbook) {
  let sheets = workbook.getWorksheets();
  for (let sheet of sheets) {
    sheet.getTables()[0].convertToRange();
    sheet.addTable(sheet.getRange('A1').getUsedRange().getAddress(), true)
  }
  workbook.getWorksheet('Combined')?.delete();
  const newSheet = workbook.addWorksheet('Combined');
  const tables = workbook.getTables();
  const headerValues = [['Column1', 'Column6', 'Column8', 'Column9', 'Column11', 'Column16', 'Column18', 'Column19']];
  const targetRange = newSheet.getRange('A1').getResizedRange(headerValues.length - 1, headerValues[0].length - 1);
  targetRange.setValues(headerValues);
  const combinedTable = newSheet.addTable(targetRange.getAddress(), true);
  for (let table of tables) {
    let dataValues = table.getColumnByName( /* this is where I am stuck */ ).getRangeBetweenHeaderAndTotal().getTexts();
    let rowCount = table.getRowCount();
    // If the table is not empty, add its rows to the combined table.
    if (rowCount > 0) {
      combinedTable.addRows(-1, dataValues);
    }
  }
}
Thanks for your help.
George
A few things:
In most circumstances for this scenario, I'd recommend iterating through a specific set of table objects. Unfortunately, that's difficult to do here. Every time you unlink and recreate a table, Excel may give it a new name, which makes the table hard to work with. You can get around this in your code by capturing the table name before you unlink it, unlinking the table, recreating it, and then setting the table name back to the one you captured. If you go that route, you can work with the table names reliably.
Because table names in this scenario can be a bit tricky, I'm going to use the sheet names instead, so that I can work with the sheets that contain the underlying tables. This lets us get data from the tables regardless of what they're named in the sheets.
Please see my code below:
function main(workbook: ExcelScript.Workbook) {
  //JSON object called sheetAndColumnNames. On the left hand side is the sheet name.
  //On the right hand side is an array with the column names for the table in the sheet.
  //NOTE: replace the sheet and column names with your own values
  let columnNames: string[] = ["ColA", "ColB", "ColC"]
  const sheetAndColumnNames = {
    "Sheet1": columnNames,
    "Sheet2": columnNames
  }
  //JSON object called columnNamesAndCellValues. On the left hand side is the column name.
  //On the right hand side is an array that will hold the values for the column in the table.
  //NOTE: replace these column names with your own values
  const columnNamesAndCellValues = {
    "ColA": [],
    "ColB": [],
    "ColC": []
  }
  //Iterate through the sheetAndColumnNames object
  for (let sheetName in sheetAndColumnNames) {
    //Use sheet name from JSON object to get sheet
    let sheet: ExcelScript.Worksheet = workbook.getWorksheet(sheetName)
    //Get table from the previously assigned sheet
    let table: ExcelScript.Table = sheet.getTables()[0]
    //Get array of column names to be iterated on the sheet
    let tableColumnNames: string[] = sheetAndColumnNames[sheetName]
    //Iterate the array of table column names
    tableColumnNames.forEach(columnName => {
      //Get the dataBodyRange of the tableColumn
      let tableColumn: ExcelScript.Range = table.getColumn(columnName).getRangeBetweenHeaderAndTotal()
      //Iterate through all of the values in the table column and add them to the columnNamesAndCellValues array for that column name
      tableColumn.getValues().forEach(value => {
        columnNamesAndCellValues[columnName].push(value)
      })
    })
  }
  //Delete previous worksheet named Combined
  workbook.getWorksheet("Combined")?.delete()
  //Add new worksheet named Combined and assign to combinedSheet variable
  let combinedSheet: ExcelScript.Worksheet = workbook.addWorksheet("Combined")
  //Activate the combined sheet
  combinedSheet.activate()
  //Get the header range for the table
  let headerRange: ExcelScript.Range = combinedSheet.getRangeByIndexes(0, 0, 1, columnNames.length)
  //Set the header range to the column headers
  headerRange.setValues([columnNames])
  //Iterate through the arrays held by the columnNamesAndCellValues object to write to the Combined sheet
  columnNames.forEach((column, index) => {
    combinedSheet.getRangeByIndexes(1, index, columnNamesAndCellValues[column].length, 1).setValues(columnNamesAndCellValues[column])
  })
  //Get the address for the current region of the data written from the columnNamesAndCellValues object to the sheet
  let combinedTableAddress: string = combinedSheet.getRange("A1").getSurroundingRegion().getAddress()
  //Add the table to the sheet using the address and setting the hasHeaders boolean value to true
  combinedSheet.addTable(combinedTableAddress, true)
}
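Stripped of the ExcelScript API, the core move in that script is: accumulate each wanted column into a per-name list across all sheets, then lay the lists out side by side under a header row. A rough Python analogue (sheet and column names invented for illustration):

```python
# Toy stand-ins for the per-sheet tables; replace with your own data source.
sheets = {
    "Sheet1": {"ColA": [1, 2], "ColB": ["x", "y"], "Ignored": [0, 0]},
    "Sheet2": {"ColA": [3], "ColB": ["z"], "Other": [9]},
}
column_names = ["ColA", "ColB"]

# Accumulate values per column name across every sheet.
combined = {name: [] for name in column_names}
for table in sheets.values():
    for name in column_names:
        combined[name].extend(table[name])

# Transpose into rows, header first, as they would land on the Combined sheet.
rows = [column_names] + [list(r) for r in zip(*(combined[n] for n in column_names))]
print(rows)
```

This mirrors why the answer writes column by column: each column's list can have been collected independently, and only at the end is the rectangular region turned into a table.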
I have a data frame that contains a huge number of records. In that DF a record can be repeated multiple times, and every time it gets updated, the last-updated field holds the date on which the record was modified.
We have a group of columns on which we want to compare rows with the same id. During this comparison we want to capture which fields/columns changed from the previous record to the current record, and store that in an "updated_columns" column of the updated record. Then compare the second record to the third, identify the updated columns, capture them in the "updated_columns" field of the third record, and continue the same way until the last record for that id; do the same thing for every id that has more than one entry.
Initially we grouped the columns, created a hash out of that group of columns, and compared it against the hash value of the next row. This helps me identify records that have updates, but I also want the columns that got updated.
Here I am sharing some data showing the expected outcome, i.e. how the final data should look after adding the updated columns (here, use columns Col1, Col2, Col3, Col4 and Col5 for the comparison between two rows):
I want to do this in an efficient way. Has anyone tried something like this?
Looking for help!
~Krish.
A window can be used.
The idea is to group the data by ID, sort it by LAST-UPDATED, copy the values of the previous row (if it exists) into the current row and then compare the copied data with the current values.
val data = ... //the dataframe has the columns ID,Col1,Col2,Col3,Col4,Col5,LAST_UPDATED,IS_DELETED
val fieldNames = data.schema.fieldNames.dropRight(1) //1
val columns = fieldNames.map(f => col(f))
val windowspec = Window.partitionBy("ID").orderBy("LAST_UPDATED") //2
def compareArrayUdf() = ... //3
val result = data
  .withColumn("cur", array(columns: _*)) //4
  .withColumn("prev", lag($"cur", 1).over(windowspec)) //5
  .withColumn("updated_columns", compareArrayUdf()($"cur", $"prev")) //6
  .drop("cur", "prev") //7
  .orderBy("LAST_UPDATED")
Remarks:
1. Create the list of fields to compare; dropRight(1) drops the last field of the schema.
2. Create a window that is partitioned by ID, with each partition sorted by LAST_UPDATED.
3. Create a udf that compares two arrays and maps the discovered differences to the field names; code see below.
4. Create a new column that contains all values that should be compared.
5. Create a new column that contains all values of the previous row that should be compared, using the lag function. The previous row is the row with the same ID and the biggest LAST_UPDATED that is smaller than the current one; this field can be null.
6. Compare the two new columns and put the result into updated_columns.
7. Drop the two intermediate columns created in steps 4 and 5.
The compareArrayUdf is
def compareArray(cur: mutable.WrappedArray[String], prev: mutable.WrappedArray[String]): String = {
  if (prev == null || cur == null) return ""
  val res = new StringBuilder
  for (i <- cur.indices) {
    if (!cur(i).contentEquals(prev(i))) {
      if (res.nonEmpty) res.append(",")
      res.append(fieldNames(i))
    }
  }
  res.toString()
}

def compareArrayUdf() = udf[String, mutable.WrappedArray[String], mutable.WrappedArray[String]](compareArray)
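The diff logic inside that udf is ordinary positional array comparison. The same idea in plain Python, with invented field names, for readers who want to sanity-check it outside Spark:

```python
# Hypothetical field names in the same order as the value arrays.
field_names = ["ID", "Col1", "Col2"]

def compare_arrays(cur, prev):
    """Return a comma-separated list of field names whose values differ."""
    if cur is None or prev is None:
        return ""  # first row of a partition has no predecessor
    return ",".join(name for name, c, p in zip(field_names, cur, prev) if c != p)

print(compare_arrays(["1", "a", "b"], ["1", "a", "c"]))
```

As in the Scala version, a null previous row (the first record of an id) yields an empty string rather than an error.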
You can join your DataFrame or Dataset to itself, joining the rows where the id is the same in both rows and where the version of the left row is i and the version of the right row is i+1. Here's an example:
case class T(id: String, version: Int, data: String)
val data = Seq(T("1", 1, "d1-1"), T("1", 2, "d1-2"), T("2", 1, "d2-1"), T("2", 2, "d2-2"), T("2", 3, "d2-3"), T("3", 1, "d3-1"))
data: Seq[T] = List(T(1,1,d1-1), T(1,2,d1-2), T(2,1,d2-1), T(2,2,d2-2), T(2,3,d2-3), T(3,1,d3-1))
val ds = data.toDS
val joined = ds.as("ds1").join(ds.as("ds2"), $"ds1.id" === $"ds2.id" && (($"ds1.version"+1) === $"ds2.version"))
And then you can reference the columns in the new DataFrame/Dataset like $"ds1.data" and $"ds2.data", etc.
To find the rows where the data changed from one version to another, you can do
joined.filter($"ds1.data" !== $"ds2.data")
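Outside Spark, the same self-join can be sketched by pairing each (id, version) record with the record at (id, version+1) and keeping the pairs whose data differs. A small Python illustration with made-up records:

```python
records = [
    {"id": "1", "version": 1, "data": "d1-1"},
    {"id": "1", "version": 2, "data": "d1-2"},
    {"id": "2", "version": 1, "data": "d2-1"},
    {"id": "2", "version": 2, "data": "d2-1"},  # unchanged between versions
]

# Index rows by (id, version) so each row can find its successor.
by_key = {(r["id"], r["version"]): r for r in records}
joined = [
    (r, by_key[(r["id"], r["version"] + 1)])
    for r in records
    if (r["id"], r["version"] + 1) in by_key
]

# Keep only the pairs where data actually changed (the !== filter above).
changed = [(a["id"], a["version"], b["version"]) for a, b in joined if a["data"] != b["data"]]
print(changed)
```

The join condition is exactly the dictionary lookup: same id, version shifted by one.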
I am trying to use U-SQL to remove duplicate, null, '', and NaN cells in a specific column of a csv file called "Function". I also want to keep the Product column correctly aligned with the Function column after the blank rows are removed, so I want to remove the same rows in the Product column as in the Function column to keep them properly aligned. Of any set of duplicate Function rows I want to keep only one occurrence, in this case the very first one. The Product column has no empty cells and has all unique values. Any help is greatly appreciated.

I know this could be done in a much easier way, but I want code that automates the process as the data in the Data Lake changes over time. I think I am somewhat close with the code I currently have. The actual data set is a very large file, and I am fairly certain there are at least 4 duplicate values in the Function column that aren't simply empty cells. I need to eliminate both the duplicate values and the empty cells in the Function column, because empty cells are being recognized as duplicates as well. I want to be able to use the Function values as a primary key in the next step of my school project, which won't include the Product column.
DECLARE @inputfile string = "/input/Function.csv";
//DECLARE @OutputUserFile string = "/output/Test_Function/UniqueFunction.csv";

@RawData =
    EXTRACT Function string,
            Product string
    FROM @inputfile
    USING Extractors.Csv(encoding: Encoding.[ASCII]);

// Query from Function data
// Set ROW_NUMBER() of each row within the window partitioned by Function field
@RawDataDuplicates =
    SELECT ROW_NUMBER() OVER (PARTITION BY Function) AS RowNum, Function AS function
    FROM @RawData;

// ORDER BY Function to see duplicate rows next to one another
@RawDataDuplicates2 =
    SELECT *
    FROM @RawDataDuplicates
    ORDER BY function
    OFFSET 0 ROWS;

// Write to file
//OUTPUT @RawDataDuplicates2
//TO "/output/Test_Function/FunctionOver-Dups.csv"
//USING Outputters.Csv();

// GROUP BY and count # of duplicates per Function
@groupBy =
    SELECT Function, COUNT(Function) AS FunctionCount
    FROM @RawData
    GROUP BY Function
    ORDER BY Function
    OFFSET 0 ROWS;

// Write to file
//OUTPUT @groupBy
//TO "/output/Test_Function/FunctionGroupBy-Dups.csv"
//USING Outputters.Csv();

@RawDataDuplicates3 =
    SELECT *
    FROM @RawDataDuplicates2
    WHERE RowNum == 1;

OUTPUT @RawDataDuplicates3
TO "/output/Test_Function/FunctionUniqueEmail.csv"
USING Outputters.Csv(outputHeader: true);

//OUTPUT @RawData
//TO @OutputUserFile
//USING Outputters.Csv(outputHeader: true);
I have also commented out some code that I don't necessarily need. When I run the code as it is, I currently get this error: E_CSC_USER_REDUNDANTSTATEMENTINSCRIPT, Error Message: "This statement is dead code." It does not give a line number, but it is likely the "Function AS function" line?
Here is a sample file that is a small slice of the full spreadsheet and only includes data in the 2 relevant columns. The full spreadsheet has data in all columns.
https://www.dropbox.com/s/auu2aco4b037xn7/Function.csv?dl=0
Here is a screenshot of the output I get when I follow wBob's advice.
You can apply a series of transformations to your data, using string functions like .Length and ranking functions like ROW_NUMBER, to remove the records you want, for example:
@input =
    EXTRACT CompanyID string,
            division string,
            store_location string,
            International_Id string,
            Function string,
            office_location string,
            address string,
            Product string,
            Revenue string,
            sales_goal string,
            Manager string,
            Country string
    FROM "/input/input142.csv"
    USING Extractors.Csv(skipFirstNRows : 1);

// Remove rows with an empty Function
@working =
    SELECT *
    FROM @input
    WHERE Function.Length > 0;

// Rank the rows by Function and keep only the first one
@working =
    SELECT CompanyID,
           division,
           store_location,
           International_Id,
           Function,
           office_location,
           address,
           Product,
           Revenue,
           sales_goal,
           Manager,
           Country
    FROM
    (
        SELECT *,
               ROW_NUMBER() OVER(PARTITION BY Function ORDER BY Product) AS rn
        FROM @working
    ) AS x
    WHERE rn == 1;

@output = SELECT * FROM @working;

OUTPUT @output TO "/output/output.csv"
USING Outputters.Csv(quoting : false);
My results:
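Outside U-SQL, that filter-then-rank pipeline amounts to dropping rows with an empty key and keeping only the first row per key. A small Python sketch with invented rows (note: it keeps the first occurrence in input order, whereas the U-SQL version above breaks ties by ordering on Product):

```python
rows = [
    {"Function": "F1", "Product": "P1"},
    {"Function": "",   "Product": "P2"},  # empty key: dropped
    {"Function": "F1", "Product": "P3"},  # duplicate key: dropped
    {"Function": "F2", "Product": "P4"},
]

seen = set()
deduped = []
for row in rows:
    key = row["Function"]
    # Mirrors WHERE Function.Length > 0 followed by WHERE rn == 1.
    if len(key) > 0 and key not in seen:
        seen.add(key)
        deduped.append(row)
print(deduped)
```

Because Product travels inside the same row object, it stays aligned with Function automatically, which is the alignment the question asks for.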
I have an Excel file where I want to group the columns based on multiple columns in C#. I am using EPPlus and have followed this link (Selecting grouped max and min in a Epplus worksheet with Linq), but it shows a group by based on only one column. I am stuck on extending it to group by multiple columns, let's say here Group and Date (Column A and Column B).
So, the following worked for me where I have grouped by based on first three columns of the Excel file:
var rowcellgroups = skus.Cells["A:G"].GroupBy(c => c.Start.Row);
var group = rowcellgroups.Skip(2).GroupBy(x => new { FirstKey = x.ElementAt(0).Value, SecondKey = x.ElementAt(1).Value, ThirdKey = x.ElementAt(2).Value }); //starts with 0
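The composite key in that lambda is simply a tuple of the leading columns' values; rows whose first columns all match land in the same group. The same idea in plain Python, with toy rows standing in for the spreadsheet cells:

```python
from itertools import groupby

rows = [
    ("G1", "2021-01-01", "SKU1", 10),
    ("G1", "2021-01-01", "SKU2", 20),
    ("G1", "2021-01-02", "SKU3", 30),
]

# Group by the first two columns together (the multi-column key).
key = lambda r: (r[0], r[1])
groups = {k: list(g) for k, g in groupby(sorted(rows, key=key), key=key)}
print(groups[("G1", "2021-01-01")])
```

Extending the key to three columns, as in the C# snippet, is just adding another element to the tuple.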
In the control properties, the new row, as recorded, has Name equal to "Volume Row 62" and Row Index equal to 61, and values were added in a few columns.
When I insert a new row with external data and fill the respective columns, it tries to overwrite the column data in the row that was used during the recording, not the one being inserted.
If I remove the 62 and 61 from the row properties and make the search general, it goes to the first row in the grid and tries to edit that.
What properties should be changed or searched for so that the columns are populated correctly in the new row (whose name and row index are not known)?
Any help is appreciated.
Thanks.
To solve this in our application, we exported the UI map method and built a temp list using var. I don't know if this will work for you or not, but it solved several of our issues around dynamic grid rows and columns:
public void DoubleClickLaunch_UOW()
{
    var temp = this.UIWindow.UIUnitGridTable.GetChildren().ToList();
    temp.RemoveAt(0);
    var rows = temp.Select(t => t.GetChildren().Select(s => s.GetValue()).ToList()).ToList();
    var tractLetters = rows.Select(s => s[1]).ToList();
    var index = tractLetters.IndexOf(DoubleClickLaunch_UOWParams.UITESTUNIT_TPText);
    if (index >= 0)
    {
        var textbox = temp[index].GetChildren()[1].GetChildren()[0];
        Mouse.DoubleClick(textbox);
    }
    else
    {
        Mouse.DoubleClick(this.UIWindow.UIUnitGridTable.UIItemRow.UIUnitNameCell.UITESTUNIT_TPText);
    }
}