Use a list to define SELECT columns in a query - apache-spark

I need to query a parquet file whose column names are completely inconsistent. To remedy this and ensure that my model gets exactly the data it expects, I need to 'prefetch' the column list and then apply some regex patterns to decide which columns to retrieve. In pseudocode:
PrefetchList = sqlContext.read.parquet(my_parquet_file).schema.fields
# Insert variable statements to check/qualify the columns against rules here
dfQualified = SELECT [PrefetchList] from parquet;
I've searched around to see if this is achievable but have not had any success. If this is syntactically correct (or close), or if someone has other suggestions, I am open to them.
Thanks

You can use the schema method, but you can also use the .columns method.
Notice that the select method in Spark is a little odd: it's defined as def select(col: String, cols: String*), so you can't call select(fields:_*) and would have to use df.select(fields.head, fields.tail:_*), which is kind of ugly. Luckily there's selectExpr(exprs: String*) as an alternative, so the code below will work. It takes only columns that begin with 'user':
val fields = df.columns.filter(_.matches("^user.+")) // BYO regexp
df.selectExpr(fields:_*)
This of course assumes that df contains your dataframe, loaded with sqlContext.read.parquet().
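For PySpark users, the same prefetch-and-filter step can be sketched in plain Python. This is a minimal sketch, not the answer's own code: the helper name qualify_columns and the sample column names are hypothetical, and the list stands in for df.columns so the filtering can be tried without a Spark session.

```python
import re


def qualify_columns(columns, pattern):
    """Return the column names that fully match the regular expression."""
    rule = re.compile(pattern)
    # fullmatch mirrors Scala's String.matches, which tests the whole name.
    return [name for name in columns if rule.fullmatch(name)]


# Hypothetical column names standing in for df.columns.
all_columns = ["user_id", "userName", "id", "created_at", "user"]
selected = qualify_columns(all_columns, r"^user.+")
print(selected)  # ['user_id', 'userName']

# With a real DataFrame, the list would be unpacked into select:
#   df.select(*selected)
```

Note that "user" on its own is not selected, because the pattern requires at least one character after the prefix.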


How to modify dynamic complex data type fields in azure data factory data flow

I have a complex data type (fraudData) that undesirably has hyphen characters in its field names. I need to remove the hyphens or change them to some other character.
The input schema of the complex object looks like:
I have tried using the "Select" and "Derive Column" data flow functions and adding a custom mapping. It seems both functions have the same mapping interface. My current attempt with Select is:
This gets me close to the desired results. I can use the replace expression to convert hyphens to underscores.
The problem here is that this mapping creates new root level columns outside of the fraudData structure. I would like to preserve the hierarchy of the fraudData structure and modify the column names in place.
If I am unable to modify fraudData in place, is there any way I can take the new columns and merge them into another complex data type?
Update: I do not know the fields of the complex data type in advance. This is a schema drift problem, which is why I have tried the pattern matching solution. I will not be able to hardcode known sub-column names.
You can rename the sub-columns of a complex data type using a derived column transformation and build them into a complex data type again. I tried this with sample data; the approach is below.
A sample complex data type column with two subfields is used, as in the image below.
img:1 source data preview
In the derived column transformation, the expression for the column fraudData is given as
#(fraudData_1_chn=fraudData.{fraudData-1-chn},
fraudData_2_chn=fraudData.{fraudData-2-chn})
img:2 Derived column settings
This expression renames the subfields and nests them under the parent column fraudData.
img:3 Transformed data- Fields are renamed.
Update: To rename sub-columns dynamically
You can use the expression below to rename all the fields under the root column fraudData.
#(each(fraudData, match(true()), replace($$,'-','_') = $$))
This will replace - with _ in every field name that contains it.
You can also use pattern match in the expression.
#(each(fraudData, patternMatch(`fraudData-.+` ), replace($$,'-','_') = $$))
This expression will take fields with pattern fraudData-.+ and replace - with _ in those fields only.
Reference:
Microsoft document on script for hierarchical definition in data flow.
Microsoft document on building schemas using derived column transformation.
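To make the intended result concrete, here is a plain Python analogue of the rename-all expression. It is only an illustration of the transformation on a nested record, not data flow syntax; the helper name rename_subfields and the sample values are hypothetical.

```python
def rename_subfields(record, parent, old="-", new="_"):
    """Return a copy of record with the sub-field names under `parent` rewritten.

    The sub-fields stay nested under `parent`, so the hierarchy is preserved
    and no new root-level columns are created.
    """
    renamed = {name.replace(old, new): value for name, value in record[parent].items()}
    result = dict(record)
    result[parent] = renamed
    return result


# Hypothetical row; the sub-field names are not known in advance.
row = {"id": 1, "fraudData": {"fraudData-1-chn": "a", "fraudData-2-chn": "b"}}
print(rename_subfields(row, "fraudData"))
```

Because the function iterates over whatever keys are present, it copes with schema drift in the same spirit as the pattern-based expression.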

Writing custom condition inside .withColumn in PySpark

I have to add a customized condition involving many columns inside .withColumn.
My scenario is somewhat like this. I have to check many columns row-wise for null values and add the names of the null columns to a new column. My code looks somewhat like this:
from pyspark.sql.functions import array, col, lit, when
# Each when() yields the column name if that column is null, otherwise null.
df = df.withColumn("MissingColumns", array(
    when(col("firstName").isNull(), lit("firstName")),
    when(col("salary").isNull(), lit("salary"))))
The problem is that I have many columns to add to the condition. So I tried to build the condition using loops and f-strings and pass that in:
df = df.withColumn("MissingColumns",condition)
But this condition is not working, maybe because the condition I have written is of data type String.
Is there any efficient way to do this?
You need to unpack your list inside the array as follows:
columns = ["firstName","salary"]
condition = array(*[when(col(c).isNull(),lit(c)) for c in columns])
df = df.withColumn("MissingColumns", condition)
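It may help to see what this expression produces for a single row. The sketch below is a plain Python analogue, not PySpark: the row is modelled as a dict, and the function names are hypothetical. It relies on the fact that when() without otherwise() yields null when the condition is false, so the resulting array keeps one slot per checked column.

```python
def missing_columns(row, columns):
    """Mimic array(when(col(c).isNull(), lit(c)) ...) for one row held as a dict."""
    # A non-null column contributes None, just as when() without otherwise() does.
    return [c if row[c] is None else None for c in columns]


def missing_columns_compact(row, columns):
    """Variant that keeps only the names of the null columns."""
    return [c for c in columns if row[c] is None]


columns = ["firstName", "salary"]
row = {"firstName": "Ann", "salary": None}  # hypothetical sample row

print(missing_columns(row, columns))          # [None, 'salary']
print(missing_columns_compact(row, columns))  # ['salary']
```

If the null placeholders are unwanted in the Spark result, they would need to be filtered out of the array in a further step.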

Separating values that are combined in one string

I would like to solve this either in Excel or in SPSS:
I have categorical data (each number representing a medical diagnosis) that are combined into single cells. In other words, a row (patient) has multiple diagnoses. However, I would like to know the frequencies of each diagnosis. What is the best way to go about this? (See picture for reference)
For SPSS:
First, create some sample data to demonstrate on:
data list free/e_cerv_dis_state (a20).
begin data
"{1/2/3/6}" "{1/2/4}" "{2/4/5}" "{1/5/6}" "{4}" "{4/5/6}" "{1/2/3/4/5/6}"
end data.
Now the following code will create a separate variable for each possible diagnosis, and will put a 1 in it if the diagnosis exists in the original variable.
do repeat vr=diag1 to diag9/vl=1 to 9.
compute vr=char.index(e_cerv_dis_state, string(vl, f1) ) > 0.
end repeat.
freq diag1 to diag6.
Note this will only work for up to 9 diagnoses. If you have more than that the solution will have to be adapted to multiple digits.
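For readers who can work outside SPSS, the counting step can also be sketched in Python. This is a minimal sketch under the assumption that every cell has the "{a/b/c}" layout of the sample data; the function name diagnosis_frequencies is hypothetical. Because it splits on "/" rather than searching for a single character, it also handles multi-digit codes.

```python
from collections import Counter


def diagnosis_frequencies(cells):
    """Count how many patients (cells) carry each diagnosis code."""
    counts = Counter()
    for cell in cells:
        # Drop the surrounding braces, then split the codes on "/".
        codes = {code.strip() for code in cell.strip("{} ").split("/")}
        # A set is used so a code repeated within one cell counts once per patient.
        counts.update(code for code in codes if code)
    return counts


# Same sample values as in the SPSS syntax above.
cells = ["{1/2/3/6}", "{1/2/4}", "{2/4/5}", "{1/5/6}", "{4}", "{4/5/6}", "{1/2/3/4/5/6}"]
freq = diagnosis_frequencies(cells)
for code in sorted(freq, key=int):
    print(code, freq[code])
```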
Assuming that the number of columns is fairly regular, I would suggest using Text to Columns and then using COUNTIF on the cells to count the value wanted. However, there is a more robust and reproducible solution that involves SQL. You can download the free version of SQL Server Express here: https://www.microsoft.com/en-gb/sql-server/sql-server-downloads
Then you can import your table of data; here's how to do that: How to import an Excel file into SQL Server?
Then you could use the more friendly SQL database to get the answers you want. For example, you can use a select statement like:
SELECT count(e_cerv_dis_state)
FROM diagnoses -- placeholder table name; use the name of your imported table
WHERE e_cerv_dis_state = '6'
It would also be possible to use a CASE WHEN statement to add in the names of the diagnoses.

How do I conditionally add a new line and tabs in Excel?

I have a list of values id, name, category, description and a variable number of keyword values, between 0 and 18 for each row. I want to create a list of those values in the form of:
(id, 'keyword')
, (id, 'keyword')
Where the list only increments if there is a keyword to go with the identifier. This is meant to be an easy manual list for a SQL INSERT statement.
I realize that I can use &CHAR(9) for inserting tabs and &CHAR(10) for inserting new lines, and thus my sequence for proper tabulation is &CHAR(10)&CHAR(9)&CHAR(9) for each new entry.
=IF(G2<>"",CONCATENATE("(",A2,", '",UPPER(G2),"')"),"")
&CHAR(10)&CHAR(9)&CHAR(9)&
IF(H2<>"",CONCATENATE("(",A2,", '",UPPER(H2),"')"),"")
I've tried several different combinations such as:
=IF(G2<>"",CONCATENATE("(",A2,", '",UPPER(G2),"')"),"")+
IF(H2<>"",CHAR(10)&CHAR(9)&CHAR(9)&CONCATENATE("(",A2,", '",UPPER(H2),"')"),"")
and
=IF(G2<>"",CONCATENATE("(",A2,", '",UPPER(G2),"')",CHAR(10),CHAR(9),CHAR(9)),"")+
IF(H2<>"",CONCATENATE("(",A2,", '",UPPER(H2),"')"),"")
and
=IF(G2<>"",CONCATENATE("(",A2,", '",UPPER(G2),"')"),"")+
IF(H2<>"",CONCATENATE(CHAR(10),CHAR(9),CHAR(9),"(",A2,", '",UPPER(H2),"')"),"")
which all give errors in calculation. Has anyone else been dying to know how to do this and had this kind of frustration? Does anyone have a solution to this?
I actually figured this out within an hour of posting, but neglected to post the solution here in case anyone else wanted to know.
You have to CONCATENATE the whole series of IF blocks (the + operator is arithmetic in Excel, so it returns an error on text) and add the CHAR(10) and CHAR(9)s to the inner CONCATENATE blocks, like so:
=CONCATENATE(IF(G2<>"",CONCATENATE("(",A2,", '",UPPER(G2),"')"),""),
IF(H2<>"",CONCATENATE(CHAR(10),CHAR(9),CHAR(9),"(",A2,", '",UPPER(H2),"')"),""))
Thanks,
-C§
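If the rows can be exported, the same list can be built outside Excel. The following is a minimal Python sketch, not a translation of the formula: the function name insert_values is hypothetical, and the default separator (newline, two tabs, then a leading comma) is an assumption based on the layout shown in the question. Empty keywords are skipped, which is what the IF(G2<>"", ...) checks do.

```python
def insert_values(row_id, keywords, separator="\n\t\t, "):
    """Build the (id, 'KEYWORD') tuples for one row, skipping empty keywords."""
    # Note: keywords containing a single quote would need escaping before use in SQL.
    tuples = [f"({row_id}, '{kw.upper()}')" for kw in keywords if kw]
    return separator.join(tuples)


# Hypothetical row: id 7 with keyword cells "red", "" (blank) and "blue".
print(insert_values(7, ["red", "", "blue"]))
```

A row with no keywords yields an empty string, so nothing is added to the list for that identifier.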

More efficient way than nested ifs

In Excel, I want to use something other than nested IF statements to execute a task. Is there a cleaner way of handling cases besides nested IF statements? Is there a case statement in Excel? For example, given an ordered tuple of ones and zeros (e.g. (1,1,0)), I want the value of a cell to be something. Can I specify the ordered tuples in advance using something besides nested IF statements?
If you already know the ordered tuples and what you want the final value to be, why not create a reference table somewhere else on your sheet with Col1 = tuple and Col2 = wanted output?
Then just use a VLOOKUP() formula on that table.
Hope this makes sense / does what you want....
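The reference-table idea is the same one a dictionary lookup expresses in code. As a rough illustration in Python (the tuples, the output values, and the names OUTPUT_BY_TUPLE and lookup are all hypothetical), each known tuple maps directly to its wanted output, so no nested conditions are needed:

```python
# Hypothetical reference table: ordered tuple -> wanted output.
OUTPUT_BY_TUPLE = {
    (1, 1, 0): "A",
    (1, 0, 0): "B",
    (0, 0, 0): "C",
}


def lookup(flags, default=None):
    """Return the output for a tuple of ones and zeros, or default if it is not listed."""
    return OUTPUT_BY_TUPLE.get(tuple(flags), default)


print(lookup((1, 1, 0)))         # A
print(lookup([0, 1, 1], "n/a"))  # n/a
```

Adding a new case means adding one row to the table rather than another level of nesting.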
