I have a complex data type (fraudData) whose field names contain hyphen characters, which is undesirable. I need to remove the hyphens or change them to some other character.
The input schema of the complex object looks like:
I have tried using the "Select" and "Derived Column" data flow transformations and adding a custom mapping. Both seem to have the same mapping interface. My current attempt with Select is:
This gets me close to the desired result. I can use the replace expression to convert hyphens to underscores.
The problem here is that this mapping creates new root level columns outside of the fraudData structure. I would like to preserve the hierarchy of the fraudData structure and modify the column names in place.
If I am unable to modify fraudData in place, is there any way I can take the new columns and merge them into another complex data type?
Update: I do not know the fields of the complex data type in advance. This is a schema drift problem, which is why I have tried the pattern matching solution. I will not be able to hardcode known sub-column names.
You can rename the sub-columns of a complex data type with a derived column transformation that rebuilds them as a complex data type. I tried this with sample data; the approach is below.
The sample is a complex data type column with two subfields, as shown in the image below.
img:1 source data preview
In the derived column transformation, the expression for the column fraudData is:
#(fraudData_1_chn=fraudData.{fraudData-1-chn},
fraudData_2_chn=fraudData.{fraudData-2-chn})
img:2 Derived column settings
This expression renames the subfields and nests them under the parent column fraudData.
img:3 Transformed data- Fields are renamed.
Update: To rename sub columns dynamically
You can use the expression below to rename all the fields under the root column fraudData.
#(each(fraudData, match(true()), replace($$,'-','_') = $$))
This replaces - with _ in every field name that contains it.
You can also use pattern match in the expression.
#(each(fraudData, patternMatch(`fraudData-.+` ), replace($$,'-','_') = $$))
This expression will take fields with pattern fraudData-.+ and replace - with _ in those fields only.
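As an illustration of what the two expressions above are meant to do, here is a minimal sketch in plain Python rather than the data flow expression language. The function name rename_subfields, the sample values and the extra field other-field are hypothetical. The sketch also assumes that subfields not matching the pattern are kept unchanged, which may differ from what the data flow actually emits.

```python
import re

def rename_subfields(record, pattern=None):
    """Return a copy of `record` with '-' replaced by '_' in matching keys.

    `pattern` is an optional regular expression; when it is None every key
    is renamed, which mirrors the match(true()) variant.
    """
    renamed = {}
    for name, value in record.items():
        if pattern is None or re.fullmatch(pattern, name):
            renamed[name.replace("-", "_")] = value
        else:
            # Assumption: non-matching subfields are passed through as-is.
            renamed[name] = value
    return renamed

# Hypothetical sample row with a nested fraudData structure.
row = {"fraudData": {"fraudData-1-chn": "a", "fraudData-2-chn": "b", "other-field": "c"}}

# Rename every subfield while keeping them nested under fraudData.
row_all = {**row, "fraudData": rename_subfields(row["fraudData"])}

# Rename only the subfields whose name matches the pattern.
row_some = {**row, "fraudData": rename_subfields(row["fraudData"], r"fraudData-.+")}
```

The point of the sketch is that the renaming happens on the keys inside the nested structure, so nothing is promoted to a root-level column.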
Reference:
Microsoft document on script for hierarchical definition in data flow.
Microsoft document on building schemas using derived column transformation .
I have to add a customized condition, which has many columns in .withColumn.
My scenario is somewhat like this. I have to check many columns row wise if they have Null values, and add those column names to a new column. My code looks somewhat like this:
df= df.withColumn("MissingColumns",\
array(\
when(col("firstName").isNull(),lit("firstName")),\
when(col("salary").isNull(),lit("salary"))))
The problem is that I have many columns to add to the condition, so I tried to build the condition with loops and f-strings and then pass it in:
df = df.withColumn("MissingColumns",condition)
But this condition is not working, maybe because the condition I have written is of data type String.
Is there any efficient way to do this?
You need to unpack your list inside the array as follows:
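from pyspark.sql.functions import array, col, lit, when
columns = ["firstName", "salary"]
condition = array(*[when(col(c).isNull(), lit(c)) for c in columns])
df = df.withColumn("MissingColumns", condition)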
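To see what this expression produces per row, here is a minimal sketch in plain Python with no Spark involved. The rows and the helper names missing_columns and missing_columns_compact are hypothetical. Because when() without otherwise() yields null, the array keeps one slot per checked column, and the slots for non-missing columns hold null. The second helper shows the compacted result you may prefer instead.

```python
def missing_columns(row, columns):
    # Mirrors array(when(col(c).isNull(), lit(c)), ...): one slot per column,
    # holding the column name if the value is missing and None otherwise.
    return [c if row.get(c) is None else None for c in columns]

def missing_columns_compact(row, columns):
    # Same check, but keeps only the names of the missing columns.
    return [c for c in columns if row.get(c) is None]

# Hypothetical sample rows; None stands in for a SQL NULL.
rows = [
    {"firstName": "Ann", "salary": None},
    {"firstName": None, "salary": None},
    {"firstName": "Bob", "salary": 100},
]
columns = ["firstName", "salary"]

with_placeholders = [missing_columns(r, columns) for r in rows]
compact = [missing_columns_compact(r, columns) for r in rows]
```

The f-string attempt fails for a related reason: it builds a Python string, whereas withColumn expects a Column expression, which is what the list comprehension unpacked into array() provides.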
I would like to solve this either in Excel or in SPSS:
I have categorical data (each number representing a medical diagnosis) that are combined into single cells. In other words, a row (patient) has multiple diagnoses. However, I would like to know the frequencies of each diagnosis. What is the best way to go about this? (See picture for reference)
For SPSS:
First just creating some sample data to demonstrate on:
data list free/e_cerv_dis_state (a20).
begin data
"{1/2/3/6}" "{1/2/4}" "{2/4/5}" "{1/5/6}" "{4}" "{4/5/6}" "{1/2/3/4/5/6}"
end data.
Now the following code will create a separate variable for each possible diagnosis, and will put a 1 in it if the diagnosis exists in the original variable.
do repeat vr=diag1 to diag9/vl=1 to 9.
compute vr=char.index(e_cerv_dis_state, string(vl, f1) ) > 0.
end repeat.
freq diag1 to diag6.
Note that this only works for up to 9 diagnoses, because each code is matched as a single character. If you have more than that, the solution has to be adapted to multi-digit codes.
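For comparison, here is a minimal sketch of the same frequency count in Python rather than SPSS. It splits each cell on "/" instead of searching for single characters, so multi-digit codes are handled as well. The function name diagnosis_frequencies is hypothetical, and the cell format "{a/b/c}" is assumed from the sample data above.

```python
from collections import Counter

def diagnosis_frequencies(cells):
    """Count how many patients (cells) contain each diagnosis code."""
    counts = Counter()
    for cell in cells:
        # Drop the surrounding braces, then split the codes on "/".
        codes = [c for c in cell.strip().strip("{}").split("/") if c]
        # A set ensures a diagnosis is counted once per patient.
        counts.update(set(codes))
    return counts

cells = ["{1/2/3/6}", "{1/2/4}", "{2/4/5}", "{1/5/6}", "{4}", "{4/5/6}", "{1/2/3/4/5/6}"]
freq = diagnosis_frequencies(cells)
```

Codes are kept as strings here; they can be mapped to diagnosis names afterwards if needed.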
Assuming that the number of columns is fairly regular, I would suggest using Text to Columns and then using COUNTIF to count the cells that hold the value you want. However, there is a more robust and reproducible solution that involves SQL. You can download the free version of SQL Express here: https://www.microsoft.com/en-gb/sql-server/sql-server-downloads
Then you can import your table of data. Here's how to do that: How to import an Excel file into SQL Server?
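You can then query the data in SQL to get the answers you want. For example, assuming each row holds a single diagnosis, a SELECT statement like the following counts one diagnosis (your_table is a placeholder for the name of the imported table):
SELECT count(e_cerv_dis_state)
FROM your_table
WHERE e_cerv_dis_state = '6'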
It would also be possible to use a CASE WHEN expression to add the names of the diagnoses.
I have a list of values id, name, category, description and a variable amount of keyword values; between 0 and 18 for each row. I want to create a list of those values in the form of:
(id, 'keyword')
, (id, 'keyword')
Where the list only increments if there is a keyword to go with the identifier. This is meant to be an easy manual list for a SQL INSERT statement.
I realize that I can use &CHAR(9) for inserting tabs and &CHAR(10) for inserting new lines, and thus my sequence for proper tabulation is &CHAR(10)&CHAR(9)&CHAR(9) for each new entry.
=IF(G2<>"",CONCATENATE("(",A2,", '",UPPER(G2),"')"),"")
&CHAR(10)&CHAR(9)&CHAR(9)&
IF(H2<>"",CONCATENATE("(",A2,", '",UPPER(H2),"')"),"")
I've tried several different combinations such as:
=IF(G2<>"",CONCATENATE("(",A2,", '",UPPER(G2),"')"),"")+
IF(H2<>"",CHAR(10)&CHAR(9)&CHAR(9)&CONCATENATE("(",A2,", '",UPPER(H2),"')"),"")
and
=IF(G2<>"",CONCATENATE("(",A2,", '",UPPER(G2),"')",CHAR(10),CHAR(9),CHAR(9)),"")+
IF(H2<>"",CONCATENATE("(",A2,", '",UPPER(H2),"')"),"")
and
=IF(G2<>"",CONCATENATE("(",A2,", '",UPPER(G2),"')"),"")+
IF(H2<>"",CONCATENATE(CHAR(10),CHAR(9),CHAR(9),"(",A2,", '",UPPER(H2),"')"),"")
which all give errors in calculation. Has anyone else been dying to know how to do this and had this kind of frustration? Does anyone have a solution to this?
I actually figured this out within an hour of posting, but neglected to post the solution here. In case anyone else wants to know: the attempts above fail because + is an arithmetic operator in Excel, so applying it to text produces a calculation error.
You have to CONCATENATE the whole series of IF blocks and add the CHAR(10) and CHAR(9)s to the inner CONCATENATE blocks, like so:
=CONCATENATE(IF(G2<>"",CONCATENATE("(",A2,", '",UPPER(G2),"')"),""),
IF(H2<>"",CONCATENATE(CHAR(10),CHAR(9),CHAR(9),"(",A2,", '",UPPER(H2),"')"),""))
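The same idea, sketched outside Excel in Python: build one entry per non-empty keyword and join the entries with the newline-and-tabs separator. The function name keyword_rows and the sample values are hypothetical. Unlike the formula, joining never emits a leading separator when the first keyword is empty, and quotes inside keywords are not escaped here.

```python
def keyword_rows(row_id, keywords, separator="\n\t\t"):
    """Build "(id, 'KEYWORD')" entries for the non-empty keywords of one row."""
    entries = [f"({row_id}, '{kw.upper()}')" for kw in keywords if kw]
    # Joining places the separator only between entries.
    return separator.join(entries)

# Hypothetical row: id 7 with keywords in the first and third keyword cells.
text = keyword_rows(7, ["red", "", "blue"])

# Variant that also puts the comma of the VALUES list in front of each later entry.
text_with_commas = keyword_rows(7, ["red", "", "blue"], separator="\n\t\t, ")
```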
In Excel, I want to use something other than nested IF statements to execute a task. Is there a cleaner way of handling cases? Is there a case statement in Excel? For example, given an ordered tuple of ones and zeros (e.g. (1,1,0)), I want the value of a cell to be something. Can I specify the ordered tuples in advance and use something besides nested IF statements?
If you already know the ordered tuples and what you want the final value to be, why not create a reference table somewhere else on your sheet with Col1 = tuple ; Col2 = Wanted output?
Then just use a Vlookup() statement on that table...
Hope this makes sense / does what you want....
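The reference-table idea can be sketched in Python as a dictionary keyed by the tuple, which plays the role of the two-column table and the lookup. The table contents, the name outcome and the default value are hypothetical.

```python
# Hypothetical reference table: tuple -> wanted output.
LOOKUP = {
    (1, 1, 0): "case A",
    (1, 0, 1): "case B",
    (0, 0, 0): "case C",
}

def outcome(flags, default="no match"):
    """Look up the output for an ordered tuple of ones and zeros."""
    return LOOKUP.get(tuple(flags), default)

result = outcome([1, 1, 0])
fallback = outcome([0, 1, 1])
```

Adding a new case then means adding one row to the table rather than another level of nesting.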