Replace all error values of all columns after importing data (while keeping the rows) - Excel

An Excel table used as a data source may contain error values (#N/A, #DIV/0!), which can disturb later steps in the transformation process in Power Query.
Depending on the following steps, we may get no output at all, just an error. So how can these cases be handled?
I found two standard steps in Power Query to catch them:
Remove errors (UI: Home/Remove Rows/Remove Errors) -> all rows containing an error are removed
Replace error values (UI: Transform/Replace Errors) -> the columns first have to be selected to perform this operation.
The first possibility is no solution for me, since I want to keep the rows and just replace the error values.
In my case, the data table will change over time, meaning column names may change (e.g. years) or new columns may appear. So the second possibility is too static, since I do not want to adjust the script each time.
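For reference, the step the UI generates hard-codes every selected column; a minimal sketch of what Transform/Replace Errors produces (the year columns "2019" and "2020" are made-up examples):

let
    Source = Excel.CurrentWorkbook(){[Name="Tabelle1"]}[Content],
    // Column names are fixed here, so the script breaks whenever they change
    Replaced = Table.ReplaceErrorValues(Source, {{"2019", null}, {"2020", null}})
in
    Replaced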
So I've tried to find a dynamic way to clean all columns, independent of the column names (and the number of columns). It replaces each error with an empty string (a null value would work just as well):
let
    Source = Excel.CurrentWorkbook(){[Name="Tabelle1"]}[Content],
    // Remove errors from all columns of the data source; the column names don't play any role
    Cols = Table.ColumnNames(Source),
    ColumnListWithParameter = Table.FromColumns({Cols, List.Repeat({""}, List.Count(Cols))}, {"ColName" as text, "ErrorHandling" as text}),
    ParameterList = Table.ToRows(ColumnListWithParameter),
    ReplaceErrorSource = Table.ReplaceErrorValues(Source, ParameterList)
in
    ReplaceErrorSource
I re-ran the queries after adding two new columns (with errors) to the source; the dynamic version keeps working without any script change.
If anybody has another solution for this kind of data cleaning, please post it here.

let
    src = Excel.CurrentWorkbook(){[Name="Tabelle1"]}[Content],
    cols = Table.ColumnNames(src),
    replace = Table.ReplaceErrorValues(src, List.Transform(cols, each {_, "!"}))
in
    replace

Just for novices like me in Power Query:
"!" could be any string to substitute for error values. I initially thought it was a wildcard.
List.Transform(cols, each {_, "!"}) generates the per-column error-handling list for the main function:
Table.ReplaceErrorValues(table_with_errors, {{col1, error_str1}, {col2, error_str2}, ..., {coln, error_strn}})
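For example, a hypothetical call with explicit columns (names made up) shows that each column can get its own replacement value:

Table.ReplaceErrorValues(Source, {{"Year", null}, {"Amount", 0}})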
Nice, elegant solution, Sergei.

Related

Error: Splitting rows into separate rows on all columns in Power Query

I had a problem splitting data into rows and used the solution provided by horseyride in the following link:
Splitting rows into separate rows on all columns in Power Query.
Basically, I am looking to separate a row into as many rows as it has line breaks.
Many thanks @horseyride. The solution works on a similar problem. However, it pops up the following error:
Expression.Error: We cannot convert a value of type Table to type Text.
Details:
Value=[Table]
Type=[Type]
My query is this one:
let
    Source = Pdf.Tables(File.Contents("C:\Users\gmall\OneDrive\EF personales\EF\Temporales\IBK_Sueldo_PEN.pdf"), [Implementation="1.3"]),
    Table002 = Source{[Id="Table002"]}[Data],
    TableTransform = Table.Combine(List.Transform(List.Transform(Table.ToRecords(Source),
        (x) => List.Transform(Record.ToList(x), each Text.Split(_, "#(lf)"))),
        each Table.FromColumns(_, Table.ColumnNames(Source))))
in
    TableTransform
Please let me know how to solve this issue.
You need to use Table002 in the third step, since that is the prior step name here; Source was the prior step name in my other answer:
let
    Source = Pdf.Tables(File.Contents("C:\Users\gmall\OneDrive\EF personales\EF\Temporales\IBK_Sueldo_PEN.pdf"), [Implementation="1.3"]),
    Table002 = Source{[Id="Table002"]}[Data],
    TableTransform = Table.Combine(List.Transform(List.Transform(Table.ToRecords(Table002),
        (x) => List.Transform(Record.ToList(x), each Text.Split(_, "#(lf)"))),
        each Table.FromColumns(_, Table.ColumnNames(Table002))))
in
    TableTransform

Text.Contains for multiple values in Power Query

I am attempting to create the following query. The idea is to check whether each row in the source query contains any of the keywords in the Search list, and to return the found word if one is present.
Importantly, I need this to be dynamic, i.e. the search list could be a single word or could be 100+ words. Stitching together a bunch of Text.Contains calls with or statements is therefore not workable.
In effect, I want to create something like:
if Text.Contains([Column1], {any value in the search list}) then FoundWord else null
Data:
Physical hazards Flam. Liq. 3 - H226 Eliminate all sources of ignition.
Health hazards STOT SE 3 - H336. Avoid inhalation of vapours and contact with skin and eyes.
Environmental hazards Not Classified. Avoid the spillage or runoff entering drains, sewers or watercourses.
Personal precautions Keep unnecessary and unprotected personnel away from the spillage.
clothing as described in Section 8 of this safety data sheet. Provide adequate ventilation.
Search List:
Hazards
Eliminate
ventilation
Avoid
Try this code for query Table2, after creating the query lookfor:
let
    Source = Excel.CurrentWorkbook(){[Name="Table2"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source, {{"Column1", type text}}),
    // Build on #"Changed Type" (not Source), otherwise the type change is lost
    Findmatch = Table.AddColumn(#"Changed Type", "Found", (x) => Text.Combine(Table.SelectRows(lookfor, each Text.Contains(x[Column1], [Column1], Comparer.OrdinalIgnoreCase))[Column1], ", "))
in
    Findmatch
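The lookfor query itself is not shown above; a minimal sketch, assuming the search list sits in an Excel table named Search with a single text column Column1:

let
    // Load the keyword list; the table name "Search" is an assumption
    Source = Excel.CurrentWorkbook(){[Name="Search"]}[Content],
    Typed = Table.TransformColumnTypes(Source, {{"Column1", type text}})
in
    Typed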

Reordering data by manipulating it column-wise in Python

I have data in a csv file as follows:
60,27702,1938470,13935,18513,8
60,32424,1933740,16103,15082,11
60,20080,1946092,9335,14970,2
60,28236,1937936,13799,16871,6
60,22717,1943455,10809,16726,4
120,37702,2938470,23935,28513,8
120,42424,2933740,26103,25082,11
120,30080,2946092,2335,24970,2
120,38236,2937936,23799,26871,6
120,32717,2943455,20809,26726,4
180,47702,3938470,33935,8513,8
180,52424,3933740,36103,5082,11
180,40080,3946092,3335,4970,2
180,48236,3937936,33799,6871,6
180,42717,3943455,30809,6726,4
I then used the following code to insert column headings:
import pandas as pd

df = pd.read_csv("contikiMAC_new_out.csv", names=['Energest', 'CPU', 'LPM', 'Transmit', 'Listen', 'ID'])
I used df.groupby(['ID']) to see the data grouped by the column 'ID'.
The problem is that the data in column 'LPM' gets reset after some time, so I would like to add the previous value to the new value whenever the new value in the 'LPM' column is smaller, for each specific 'ID'.
I tried doing:
for x in df.groupby(['ID']):
    for i in df.ID:
        if (df.loc[i, 'LPM'] < df.loc[i - 1, 'LPM']):
            df.loc[i, 'LPM'] = df.loc[i, 'LPM'] + df.loc[i - 1, 'LPM']
But this does not give the result I want, because it mixes the 'LPM' values of different 'ID's and the process takes a long time. Can anyone please suggest a way to write the data group-wise to a csv file based on 'ID', after performing the sum operation?
The data structure I would like to see is as follows:
60,27702,1938470,13935,18513,8
120,37702,2938470,23935,28513,8
180,47702,3938470,33935,37026,8
60,32424,1933740,16103,15082,11
120,42424,2933740,26103,25082,11
180,52424,3933740,36103,30164,11
60,20080,1946092,9335,14970,2
120,30080,2946092,2335,24970,2
180,40080,3946092,3335,29940,2
60,28236,1937936,13799,16871,6
120,38236,2937936,23799,26871,6
180,48236,3937936,33799,33742,6
60,22717,1943455,10809,16726,4
120,32717,2943455,20809,26726,4
180,42717,3943455,30809,33452,4
If I understood your problem correctly, DataFrame.shift is what you're looking for.
Something like:
df['LPM_prev'] = df.groupby(['ID'])['LPM'].shift(1)
And then you can work with that column.
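For instance, a minimal sketch of the rule described above (it assumes one reset per ID, as in the sample data; the output file name is made up):

import pandas as pd

df = pd.read_csv("contikiMAC_new_out.csv",
                 names=['Energest', 'CPU', 'LPM', 'Transmit', 'Listen', 'ID'])

# Previous LPM value within each ID group (NaN for the first row of a group)
prev = df.groupby('ID')['LPM'].shift(1)

# Where LPM dropped below its predecessor, add the previous value back in
reset = df['LPM'] < prev
df.loc[reset, 'LPM'] = df.loc[reset, 'LPM'] + prev[reset]
df['LPM'] = df['LPM'].astype(int)  # restore integer dtype after the float shift

# Regroup the rows by ID, keeping IDs in their order of first appearance
out = pd.concat(g for _, g in df.groupby('ID', sort=False))
out.to_csv("contikiMAC_grouped.csv", index=False, header=False)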

How do I subtract two arrays of cells in MATLAB

I am trying to get some variables and numbers out of an Excel table using MATLAB.
The variables below, named "diffZ_trial1" to "diffZ_trial4", should be calculated as the difference between two columns (between "start" and "finish"). However, I get the error:
Undefined operator '-' for input arguments of type 'cell'.
I have read somewhere that it could be related to the fact that I get {} output instead of [], and that maybe I need to use cell2mat or convert the output somehow. But I must have done that wrongly, as it did not work!
Question: How can I calculate the difference between two columns below?
clear all, close all
[num,txt,raw] = xlsread('test.xlsx');
start = find(strcmp(raw,'HNO'));
finish = find(strcmp(raw,'End Trial: '));
%%% TIMELINE EACH TRIAL
time_trial1 = raw(start(1):finish(1),8);
time_trial2 = raw(start(2):finish(2),8);
time_trial3 = raw(start(3):finish(3),8);
time_trial4 = raw(start(4):finish(4),8);
%%%MOVEMENT EACH TRIAL
diffZ_trial1 = raw(start(1):finish(1),17)-raw(start(1):finish(1),11);
diffZ_trial2 = raw(start(2):finish(2),17)-raw(start(2):finish(2),11);
diffZ_trial3 = raw(start(3):finish(3),17)-raw(start(3):finish(3),11);
diffZ_trial4 = raw(start(4):finish(4),17)-raw(start(4):finish(4),11);
You are right: raw contains data of all types, including text (http://uk.mathworks.com/help/matlab/ref/xlsread.html#outputarg_raw). You should use num, which is a numeric matrix.
Alternatively, if you have a recent version of MATLAB, you can try readtable (https://uk.mathworks.com/help/matlab/ref/readtable.html), which I think is more flexible. It creates a table from an Excel file, containing both text and numbers.
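If you prefer to keep indexing into raw, a minimal sketch using cell2mat, assuming the addressed cells all contain numbers (indices taken from the question):

% Convert the cell slices to numeric vectors before subtracting
colFinish = cell2mat(raw(start(1):finish(1),17));
colStart  = cell2mat(raw(start(1):finish(1),11));
diffZ_trial1 = colFinish - colStart;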

Replace empty strings with null values

I am rolling up a huge table by counts into a new table, where I want to change all the empty strings to NULL and typecast some columns as well. I read through some of the posts, but could not find a query that would let me do this across all the columns in a single statement.
Let me know if it is possible to iterate across all columns and replace cells containing empty strings with null.
Ref: How to convert empty spaces into null values, using SQL Server?
To my knowledge there is no built-in function to replace empty strings across all columns of a table. You can write a plpgsql function to take care of that.
The following function replaces empty strings in all basic character-type columns of a given table with NULL. You can then cast to integer if the remaining strings are valid number literals.
CREATE OR REPLACE FUNCTION f_empty2null(_tbl regclass, OUT updated_rows int)
  LANGUAGE plpgsql AS
$func$
DECLARE
   _typ CONSTANT regtype[] := '{text, bpchar, varchar}';  -- ARRAY of all basic character types
   _sql text;
BEGIN
   SELECT INTO _sql                      -- build SQL command
          'UPDATE ' || _tbl
          || E'\nSET    ' || string_agg(format('%1$s = NULLIF(%1$s, '''')', col), E'\n     , ')
          || E'\nWHERE  ' || string_agg(col || ' = ''''', ' OR ')
   FROM  (
      SELECT quote_ident(attname) AS col
      FROM   pg_attribute
      WHERE  attrelid = _tbl             -- valid, visible, legal table name
      AND    attnum >= 1                 -- exclude tableoid & friends
      AND    NOT attisdropped            -- exclude dropped columns
      AND    NOT attnotnull              -- exclude columns defined NOT NULL!
      AND    atttypid = ANY(_typ)        -- only character types
      ORDER  BY attnum
      ) sub;

   -- RAISE NOTICE '%', _sql;  -- test?

   -- Execute
   IF _sql IS NULL THEN
      updated_rows := 0;                 -- nothing to update
   ELSE
      EXECUTE _sql;
      GET DIAGNOSTICS updated_rows = ROW_COUNT;  -- report number of affected rows
   END IF;
END
$func$;
Call:
SELECT f_empty2null('mytable');
SELECT f_empty2null('myschema.mytable');
To also see the count under its column name updated_rows:
SELECT * FROM f_empty2null('mytable');
Major points
Table name has to be valid and visible and the calling user must have all necessary privileges. If any of these conditions are not met, the function will do nothing - i.e. nothing can be destroyed, either. I cast to the object identifier type regclass to make sure of it.
The table name can be supplied as is ('mytable'), then the search_path decides. Or schema-qualified to pick a certain schema ('myschema.mytable').
Query the system catalog to get all (character-type) columns of the table. The function covers these basic character types: text, bpchar, varchar. Only relevant columns are processed.
Use quote_ident() or format() to sanitize column names and safeguard against SQLi.
The updated version uses the basic SQL aggregate function string_agg() to build the command string without looping, which is simpler and faster. And more elegant. :)
Has to use dynamic SQL with EXECUTE.
The updated version excludes columns defined NOT NULL and only updates each row once in a single statement, which is much faster for tables with multiple character-type columns.
Should work with any modern version of PostgreSQL. Tested with Postgres 9.1, 9.3, 9.5 and 13.
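For the integer cast mentioned above, a hypothetical follow-up (table and column names are placeholders):

ALTER TABLE mytable
   ALTER COLUMN mycol TYPE integer USING mycol::integer;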
