Creating Expectations for all columns of a certain type in Palantir Foundry - apache-spark

I use expectations and a Check to determine whether a column of decimal type can be transformed into int or long type. A column can be safely transformed if it contains integers, or decimals whose decimal part contains only zeros. I check this with the regex function rlike, as I couldn't find any other way to do it with expectations.
The question is: can I perform such a check for all columns of type decimal without explicitly listing the column names? df.columns is not yet available, as we are not yet inside my_compute_function.
from transforms.api import transform_df, Input, Output, Check
from transforms import expectations as E

@transform_df(
    Output("ri.foundry.main.dataset.1e35801c-3d35-4e28-9945-006ec74c0fde"),
    inp=Input(
        "ri.foundry.main.dataset.79d9fa9c-4b61-488e-9a95-0db75fc39950",
        checks=Check(
            E.col('DSK').rlike(r'^(\d*(\.0+)?)|(0E-10)$'),
            'Decimal col DSK can be converted to int/long.',
            on_error='WARN'
        )
    ),
)
def my_compute_function(inp):
    return inp

You are right that df.columns is not available before my_compute_function's scope is entered. There is also no way to add expectations at runtime, so with this method hard-coding the column names and generating the expectations from them is necessary, as sketched below.
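A minimal sketch of that hard-coding approach, assuming the list of decimal columns is known ahead of time (the dataset RIDs and DECIMAL_COLS below are placeholders, and this relies on checks accepting a list of Check objects):

from transforms.api import transform_df, Input, Output, Check
from transforms import expectations as E

# Placeholder: these names must be hard-coded, since the schema
# cannot be inspected at decorator-evaluation time.
DECIMAL_COLS = ['DSK', 'COL_B', 'COL_C']

@transform_df(
    Output("ri.foundry.main.dataset.output-rid-here"),
    inp=Input(
        "ri.foundry.main.dataset.input-rid-here",
        # One Check per hard-coded column, generated in a comprehension.
        checks=[
            Check(
                E.col(c).rlike(r'^(\d*(\.0+)?)|(0E-10)$'),
                f'Decimal col {c} can be converted to int/long.',
                on_error='WARN',
            )
            for c in DECIMAL_COLS
        ],
    ),
)
def my_compute_function(inp):
    return inp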
To touch on the first part of your question: in an alternative approach you could attempt the decimal -> int/long conversion in an upstream transform, store the result in a separate column, and then use E.col('col_a').equals_col('converted_col_a').
This way you could simplify your expectation condition while also implicitly handling the cases in which the conversion would under/over-flow, since DecimalType can hold values too large or too small to fit into an int or long (https://spark.apache.org/docs/latest/sql-ref-datatypes.html).
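A sketch of such an upstream transform, assuming Spark's default (non-ANSI) cast behaviour, where an overflowing cast yields null rather than an error (the dataset RIDs and the convert_decimals name are placeholders):

from transforms.api import transform_df, Input, Output
from pyspark.sql import functions as F

@transform_df(
    Output("ri.foundry.main.dataset.converted-rid-here"),
    inp=Input("ri.foundry.main.dataset.input-rid-here"),
)
def convert_decimals(inp):
    # cast('long') truncates fractional parts and, in non-ANSI mode, returns
    # null on overflow, so a downstream expectation comparing the two columns
    # (E.col('DSK').equals_col('converted_DSK')) flags both failure modes.
    return inp.withColumn('converted_DSK', F.col('DSK').cast('long'))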

Related

Excel: How to add two numbers that have unit prefixes?

I'm trying to add numbers that have unit prefixes appended at the end (milli, micro, nano, pico, etc). For example, we have these two columns below:
Obviously doing something like =A2+A3+A4 and =B2+B3+B4 would not work. How would you resolve this? Thanks
Assuming you don't have Excel version constraints, per the tags listed in your question: put all the suffixes as delimiters inside {}, separated by commas, in TEXTSPLIT, then define the conversion rules in XLOOKUP. We use SUBSTITUTE(col, nums, "") as the input of XLOOKUP to extract the unit of measure.
=BYCOL(A2:B4, LAMBDA(col, LET(nums, 1*TEXTSPLIT(col,{"ms","us"},,1),
units, XLOOKUP(SUBSTITUTE(col, nums, ""), {"us";"ms"},{1;1000}),
SUM(nums * units))))
The above formula converts the result to a common unit of microseconds (us), i.e. to the lower unit, so milliseconds get converted by multiplying by 1000. If the unit of measure is not found, it returns #N/A; this can be customized by adding a fourth parameter to XLOOKUP. If you want the result in milliseconds, then replace {1;1000} with {0.001;1} or VSTACK(10^-3;1), for example.
If you would like to have everything in seconds, you can use the trick of combining a power with the XMATCH index position to generate the multiplier. I took the idea from this question: How to convert K, M, B formatted strings to plain numbers? Check the answer from @pgSystemTester (for Google Sheets, but it can be adapted to Excel). I included nanoseconds too.
=BYCOL(A2:B4,LAMBDA(col,LET(nums,1*TEXTSPLIT(col,{"ms","us"},,1),
units, 1000^(-IFERROR(XMATCH(RIGHT(col,2), {"ms";"us";"ns"}),0)),
SUM(nums * units))))
Under this approach, seconds is the output unit: because it is not part of the XMATCH lookup_array input argument, the multiplier falls back to 1 (as a result of 1000^0 via the IFERROR), so values with no unit and values in seconds (s) are treated the same way.
Notes:
In my initial version I used INDEX, but as @P.b pointed out in the comments, it is not really necessary to remove the second empty column; instead, we can use the ignore_empty input argument of TEXTSPLIT. Thanks.
You can use TEXTBEFORE instead of TEXTSPLIT, as follows: TEXTBEFORE(A2:A4,{"ms","us"})

PySpark SQL TRY_CAST?

I have data in a DataFrame, all columns as strings. Some of the data in a column is numeric, so I could cast it to float. Other rows actually contain strings which I do not want to cast.
So I was looking for something like a try_cast, and already tried building something on .when().otherwise(), but haven't succeeded so far.
casted = data.select(when(col("Value").cast("float").isNotNull(), col("Value").cast("float")).otherwise(col("Value")))
This does not work; in the end the column is never cast.
Is something like this generally possible (in a performant manner, without UDFs etc.)?
You can't have a column with two types in Spark: it is either float or string. That's why your column always has string type (because it must be able to contain both strings and floats).
What your code does is: if the value in the Value column can be cast to float, it is cast to float and then back to string (try a number with more than 6 decimal places to see the effect). As far as I know, TRY_CAST converts to the value or NULL (at least in SQL Server), and that is exactly what Spark's cast already does.
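A quick way to see both behaviours (a minimal sketch; the column name Value matches the question, the sample values are made up):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("1.23456789",), ("abc",)], ["Value"])

# cast() already behaves like TRY_CAST: unparseable strings become null.
df.select(F.col("Value").cast("float").alias("as_float")).show()

# when/otherwise coerces both branches back to a common type (string),
# so "1.23456789" comes back as the re-stringified float "1.2345679"
# and the cast appears to have never happened.
df.select(
    F.when(F.col("Value").cast("float").isNotNull(),
           F.col("Value").cast("float"))
     .otherwise(F.col("Value"))
     .alias("casted")
).show()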

converting strings to formula objects in Julia

I have a dataframe in Julia with fewer than 10 columns. I want to generate a list of all possible formulas that could be fed into a linear model (e.g., [Y~X1+X2+X3, Y~X1+X2, ...]). I can accomplish this easily with combinations() and string versions of the column names. However, when I try to convert the strings into Formula objects, it breaks down. Looking at the DataFrames.jl documentation, it seems one can only construct Formulas from "expressions", and I can indeed make a list of individual column names as expressions. Is there any way I can programmatically join together a bunch of different expressions using the "+" operator, such that the resulting composite expression can then be passed into the RHS of the Formula constructor? My impulse is to search for some function that converts an arbitrary string into the equivalent expression, but I'm not sure if that is correct.
The function parse takes a string, parses it, and returns an expression. I see nothing wrong with using it for what you're talking about.
Here is some actual working code, because I have been struggling to get a similar problem to work. Please note this is Julia version 1.3.1, so parse is now Meta.parse, and instead of combinations I used IterTools.subsets.
using RDatasets, DataFrames, IterTools, GLM

airquality = rename(dataset("datasets", "airquality"), "Solar.R" => "Solar_R")
predictors = setdiff(names(airquality), [:Temp])
for combination in subsets(predictors)
    formula = FormulaTerm(Term(:Temp), Tuple(Term.(combination)))
    # skip the empty subset, which has no predictors to fit
    if length(combination) > 0
        @show lm(formula, airquality)
    end
end

Stata tab over entire dataset

In Stata, is there any way to tabulate over the entire dataset, as opposed to just one variable/column? This would give the tabulation over all the columns.
Related: is there a way to find particular values in Stata if one does not know which column they occur in? The output would be which column and row they are located in, or at least which column.
Stata does not use row and column terminology except with reference to matrices and vectors. It uses the terminology of observations and variables.
You could stack or reshape the entire dataset into one variable if and only if all variables are numeric or all are string. If that assumption is incorrect, then you would need to convert numeric variables to string, at least temporarily, before you could do that. I guess wildly that you are only interested in blocks of variables that are all either numeric or string.
When you say "tabulate" you may mean the tabulate command. That has limits on the number of rows and/or columns it can show, which might bite, but with a small amount of work list could be used for a simple table with many more values.
tabm from tab_chi on SSC may be what you seek.
For searching across several variables, you could automate a loop.
I'd say that if this is a felt need, it is quite probable that you have the wrong data structure for at least some of what you want to do and should reshape. But further details might explode that.

Loading a dataset containing both strings and numbers

I'm trying to load the following dataset:
Afghanistan,5,1,648,16,10,2,0,3,5,1,1,0,1,1,1,0,green,0,0,0,0,1,0,0,1,0,0,black,green
Albania,3,1,29,3,6,6,0,0,3,1,0,0,1,0,1,0,red,0,0,0,0,1,0,0,0,1,0,red,red
Algeria,4,1,2388,20,8,2,2,0,3,1,1,0,0,1,0,0,green,0,0,0,0,1,1,0,0,0,0,green,white
...
The problem is that it contains both integers and strings.
I found some information on how to extract only the integers.
But I haven't been able to see whether there's any way to get all the data.
My question is: is that possible?
If it is not possible, is there any way to find the numbers on each line and throw everything else away, without having to choose the columns?
I ask specifically because it seems I cannot use str2num on a whole line at a time.
Almost anything is possible; you just have to define your goal accurately.
Assuming that your database is stored as a text file, you can parse it line by line using textread, and then apply regexp to filter only the numerical fields (this does not require having prior knowledge about the columns):
C = textread('database.txt', '%s', 'delimiter', '\n');
C = cellfun(@(x)regexp(x, '\d+', 'match'), C, 'Uniform', false);
The result here is a cell array of cell array of strings, where each string corresponds to a numerical field in a specific line.
Since the numbers are still stored as strings, you'd probably need to convert them to actual numerical values. There's a multitude of ways to do that, but you can use str2num in a tricky way: it can convert delimited strings into an array of numbers. This means that if you concatenate all strings in a specific line back into one string, and put spaces in between, you can apply str2num on all of them at once, like so:
C = cellfun(@(x)str2num(sprintf('%s ', x{:})), C, 'Uniform', false);
The resulting C is a cell array of vectors, each vector containing the values of all numerical fields in the corresponding line. To access a specific vector, you can use curly braces ({}). For instance, to access the numbers of the second line, you would use C{2}.
All the non-numerical fields are discarded in the process of parsing, of course. If you want to keep them as well, you should use a different regular expression with regexp.
Good luck!
