Using contains function in Azure Data Factory Dataflow expression builder

I am using a data flow in Azure Data Factory and want to split my file into two based on a condition. I am attaching an image with 2 lines; the first one works, but I want a more programmatic approach to achieve the same output:
I have a column named indicator in my dataset and want to use contains-style functionality to split the data: one file for rows where the string in the indicator column contains the substring Weekly, and another for rows where it does not.
Similar to what I would use in pandas:
df1 = df[df.indicator.str.contains('Weekly')]
df2 = df[~df.indicator.str.contains('Weekly')]

You can also try the expression below in the Conditional split.
contains() expects an array, so first split the column content into an array and pass that array to contains().
contains(split(indicator, ' '),#item=='weekly')
This is my sample data.
Conditional split:
Weekly data in the output:
Remaining data:

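If you are looking for the existence of a value inside a scalar string column, use instr(). It returns the 1-based position of the substring, or 0 if it is not found, so a condition such as instr(indicator, 'Weekly') > 0 selects the matching rows.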
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-expressions-usage#instr
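To make the difference between the two suggestions concrete, here is a minimal Python sketch (not data flow code). It assumes the split-and-contains expression matches whole space-separated tokens, while instr() matches a substring anywhere in the value. The function names and sample values are made up for illustration.

```python
def token_match(value: str, word: str) -> bool:
    """Mimic contains(split(col, ' '), #item == word): exact match on a whole token."""
    return any(item == word for item in value.split(" "))


def substring_match(value: str, word: str) -> bool:
    """Mimic instr(col, word) > 0: the word may appear anywhere in the string."""
    return word in value


# Made-up indicator values, for illustration only
rows = ["Weekly sales", "Bi-Weekly sales", "Monthly sales"]

weekly_tokens = [r for r in rows if token_match(r, "Weekly")]
weekly_substrings = [r for r in rows if substring_match(r, "Weekly")]
remaining = [r for r in rows if not substring_match(r, "Weekly")]
```

The substring version is the closer analogue of pandas str.contains('Weekly'); the token version skips values such as "Bi-Weekly sales" because no single token equals the search word.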

How to modify dynamic complex data type fields in azure data factory data flow

I have a complex data type (fraudData) whose field names undesirably contain hyphen characters. I need to remove the hyphens or change them to some other character.
The input schema of the complex object looks like:
I have tried using the "Select" and "Derive Column" data flow functions and adding a custom mapping. It seems both functions have the same mapping interface. My current attempt with Select is:
This gets me close to the desired results. I can use the replace expression to convert hyphens to underscores.
The problem here is that this mapping creates new root level columns outside of the fraudData structure. I would like to preserve the hierarchy of the fraudData structure and modify the column names in place.
If I am unable to modify fraudData in place, is there any way I can take the new columns and merge them into another complex data type?
Update: I do not know the fields of the complex data type in advance. This is a schema drift problem, which is why I have tried the pattern matching solution. I will not be able to hardcode known sub-column names.
You can rename the sub-columns of a complex data type using a Derived column transformation and then rebuild them as a complex data type. I tried this with sample data; the approach is below.
A sample complex data type column with two sub-fields is used, as in the image below.
img:1 source data preview
In the Derived column transformation, the expression for the column fraudData is:
@(fraudData_1_chn=fraudData.{fraudData-1-chn},
fraudData_2_chn=fraudData.{fraudData-2-chn})
img:2 Derived column settings
This expression renames the subfields and nests them under the parent column fraudData.
img:3 Transformed data- Fields are renamed.
Update: To rename sub-columns dynamically
You can use the expression below to rename all the fields under the root column fraudData.
@(each(fraudData, match(true()), replace($$,'-','_') = $$))
This replaces - with _ in every field name that contains it.
You can also use a pattern match in the expression.
@(each(fraudData, patternMatch(`fraudData-.+` ), replace($$,'-','_') = $$))
This expression takes only the fields whose names match the pattern fraudData-.+ and replaces - with _ in those names.
Reference:
Microsoft document on script for hierarchical definition in data flow.
Microsoft document on building schemas using derived column transformation .
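As an illustration of the renaming logic only (not of the data flow engine), here is a small Python sketch that renames the keys of a nested dictionary while keeping them under the same parent. The function name rename_subfields and the sample row are hypothetical. Note that this sketch keeps non-matching keys unchanged; that is a choice made here and not a claim about what the patternMatch expression does with such fields.

```python
import re
from typing import Optional


def rename_subfields(record: dict, parent: str, pattern: Optional[str] = None) -> dict:
    """Return a copy of record where keys under record[parent] have '-' replaced by '_'.

    If pattern is given, only keys that fully match the regex are renamed;
    other keys are kept as they are (a choice made for this sketch).
    """
    renamed = {}
    for name, value in record[parent].items():
        if pattern is None or re.fullmatch(pattern, name):
            name = name.replace("-", "_")
        renamed[name] = value
    # Put the renamed struct back under the same parent key
    return {**record, parent: renamed}


# Hypothetical row; the real sub-field names are not known in advance
row = {
    "id": 1,
    "fraudData": {"fraudData-1-chn": "a", "fraudData-2-chn": "b", "other-field": "c"},
}

all_renamed = rename_subfields(row, "fraudData")
only_matching = rename_subfields(row, "fraudData", r"fraudData-.+")
```

Because the keys are discovered at run time, nothing has to be hardcoded, which is the same idea as using each() with a match condition for drifted schemas.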

Excel convert multiple columns to dataset based on unique timestamp

I want to convert an Excel sheet from one layout to another (with a formula or any other method) so that I can upload it as a CSV to Google Maps and plot the data on a map.
Example:
Original CSV:
Expected output for mymaps API:
Also note that these coordinates are not constant; they change across the city or state.
Attempt 1) Manual, but the dataset is too large
Attempt 2) Text to Columns, but that only supports splitting on delimiters
F2 =UNIQUE($B$2:$B$20)
G2 =FILTER($C$2:$C$20;($B$2:$B$20=$F2)*($A$2:$A$20=G$1))
H2 =FILTER($C$2:$C$20;($B$2:$B$20=$F2)*($A$2:$A$20=H$1))
With O365 you can put the following in E1 to get the entire result, including the header:
=LET(id, A2:A5, time, B2:B5, str, C2:C5, idUx, SORT(UNIQUE(id)),
timeUx, UNIQUE(time),GET, LAMBDA(tt,ii, XLOOKUP(tt&"|"&ii, time&"|"&id, str)),
REDUCE(HSTACK("ref_time", TOROW(idUx)), timeUx, LAMBDA(ac,t,
VSTACK(ac, HSTACK(t, GET(t,INDEX(idUx,1)), GET(t, INDEX(idUx,2)))))))
Here is the output:
See the following question for how the REDUCE/VSTACK pattern generates each row: how to transform a table in Excel from vertical to horizontal but with different length. The user-defined LAMBDA function GET avoids repeating the same calculation with different inputs (tt, ii). Just update the input ranges (id, time, str) for your real problem. The "|" delimiter is added when concatenating the lookup values so that searching on more than one value does not produce false positives. See @JvdV's answer and its comments for more detail. The concatenation can be avoided by using MMULT, but that produces a more verbose formula. Given the nature of your data, I don't think it is necessary; a delimiter will be enough.
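For readers who want to check the logic outside Excel, here is a rough Python sketch of the same long-to-wide reshaping. It assumes each input row is an (id, time, value) triple, matching the id, time and str names in the formula. The function name and the sample rows are made up, since the original data is not shown.

```python
def pivot_long_to_wide(rows):
    """rows: list of (id, time, value). Returns a header row plus one row per unique time."""
    ids = sorted({r[0] for r in rows})                # similar to SORT(UNIQUE(id))
    times = list(dict.fromkeys(r[1] for r in rows))   # unique times in first-seen order
    lookup = {(t, i): v for i, t, v in rows}          # composite key instead of time & "|" & id
    table = [["ref_time", *ids]]
    for t in times:
        # A missing (time, id) pair yields None in that cell
        table.append([t, *(lookup.get((t, i)) for i in ids)])
    return table


# Made-up sample rows, for illustration only
sample = [
    ("lat", "10:00", "12.97"),
    ("lon", "10:00", "77.59"),
    ("lat", "10:05", "12.98"),
    ("lon", "10:05", "77.60"),
]
wide = pivot_long_to_wide(sample)
```

Using a tuple as the dictionary key plays the role of the "|" delimiter: two different (time, id) pairs can never collide the way naively concatenated strings could.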

Change data in Pandas dataframe by column

I have some data I imported from an Excel spreadsheet as a CSV. I created a dataframe using Pandas and want to change a specific column. The column contains strings such as "5.15.1.0.0". I want to change these strings to floats like "5.15100".
So far I've tried using the method "replace" to change every instance in that column:
df['Fix versions'].replace("5.15.1.0.0", 5.15100)
This, however, does not work. When I reprint the dataframe after calling replace, it shows the same dataframe with no changes made. Is it not possible to change a string to a float using replace? If not, does anyone know another way to do this?
I could parse each string and remove the "." but I'd prefer not to do it this way as some of the strings represent numbers of different lengths and decimal place values.
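By default replace returns a new Series and leaves the original untouched, which is why the dataframe looks unchanged. Add the "inplace" parameter, whose default is False. Setting it to True changes the data in place, and the result can then be type cast.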
df['Fix versions'].replace(to_replace="5.15.1.0.0", value="5.15100", inplace=True)
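If the column holds many different version strings, replacing them one value at a time does not scale. Below is a minimal sketch of one possible reading of the desired mapping: keep the first "." as the decimal point and drop the remaining dots. The helper name version_to_float is hypothetical, and the rule itself is an assumption based on the single example "5.15.1.0.0" to 5.15100.

```python
def version_to_float(version: str) -> float:
    """Keep the first '.' as the decimal point and drop the other dots."""
    whole, _, rest = version.partition(".")
    digits = rest.replace(".", "")
    # A string without any '.' is converted as a whole number
    return float(f"{whole}.{digits}") if digits else float(whole)


converted = [version_to_float(v) for v in ["5.15.1.0.0", "4.2.0", "7"]]
```

With pandas this could be applied as df['Fix versions'] = df['Fix versions'].map(version_to_float), which assigns the result back to the column instead of relying on inplace.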

Splitting the comma separated values in a column into multiple columns in Snaplogic

I have table data like below.
OTDATA
"ABC,CDE,EDF,123,10/20/2020"
"WDE,RED,ERT,231,09/22/2020"
"ERT,WED,TGY,453,08/10/2020"
I am trying to split into below through snaplogic.
OTDATA,OTDATA,OTDATA,OTDATA,OTDATA
ABC,CDE,EDF,123,10/20/2020
WDE,RED,ERT,231,09/22/2020
ERT,WED,TGY,453,08/10/2020
I have used a Mapper with $OTDATA.split(',') but I am not getting the desired output. Can you please give me a way to do it?
You can use two Mappers one after the other: the first splits the string, and the second maps the elements of the resulting array to their corresponding fields.
Please note that you can't have fields with the same name.
Please refer to the following screenshots.
#1 Mapper that splits the string
#2 Mapper that maps the array elements to corresponding fields
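Since the screenshots are not reproduced here, the two steps can be sketched in Python as follows. This is only an illustration of the logic, not SnapLogic expression syntax, and the output field names OTDATA_1 to OTDATA_5 are hypothetical (they are numbered because fields cannot share a name).

```python
def split_to_fields(record: dict, source: str = "OTDATA") -> dict:
    """Split a comma-separated string and map each element to its own field."""
    parts = record[source].split(",")  # step 1: split the string into an array
    # Step 2: map each array element to a uniquely named field
    return {f"{source}_{n}": part for n, part in enumerate(parts, start=1)}


rows = [
    {"OTDATA": "ABC,CDE,EDF,123,10/20/2020"},
    {"OTDATA": "WDE,RED,ERT,231,09/22/2020"},
    {"OTDATA": "ERT,WED,TGY,453,08/10/2020"},
]
result = [split_to_fields(r) for r in rows]
```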

Separating values that are combined in one string

I would like to solve this either in Excel or in SPSS:
I have categorical data (each number representing a medical diagnosis) that are combined into single cells. In other words, a row (patient) has multiple diagnoses. However, I would like to know the frequencies of each diagnosis. What is the best way to go about this? (See picture for reference)
For SPSS:
First just creating some sample data to demonstrate on:
data list free/e_cerv_dis_state (a20).
begin data
"{1/2/3/6}" "{1/2/4}" "{2/4/5}" "{1/5/6}" "{4}" "{4/5/6}" "{1/2/3/4/5/6}"
end data.
Now the following code will create a separate variable for each possible diagnosis, and will put a 1 in it if the diagnosis exists in the original variable.
do repeat vr=diag1 to diag9/vl=1 to 9.
compute vr=char.index(e_cerv_dis_state, string(vl, f1) ) > 0.
end repeat.
freq diag1 to diag6.
Note this will only work for up to 9 diagnoses. If you have more than that the solution will have to be adapted to multiple digits.
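For comparison, here is a short Python sketch of the same frequency count that splits on "/" instead of searching for single characters, so it would also handle codes with more than one digit. It assumes the cells look like "{1/2/3/6}" as in the sample data above; the function name is made up.

```python
from collections import Counter


def diagnosis_frequencies(cells):
    """Count how many patients (cells) carry each diagnosis code."""
    counts = Counter()
    for cell in cells:
        codes = cell.strip("{}").split("/")
        # Use a set so a code repeated within one cell is counted once per patient
        counts.update({code for code in codes if code})
    return counts


cells = ["{1/2/3/6}", "{1/2/4}", "{2/4/5}", "{1/5/6}", "{4}", "{4/5/6}", "{1/2/3/4/5/6}"]
freq = diagnosis_frequencies(cells)
```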
Assuming that the number of columns is fairly regular, I would suggest using Text to Columns and then COUNTIF to count the cells that hold the value you want. A more robust and reproducible solution is to use SQL. You can download the free SQL Server Express edition here: https://www.microsoft.com/en-gb/sql-server/sql-server-downloads
Then import your table of data; this question explains how: How to import an Excel file into SQL Server?
You can then query the data in SQL to get the answers you want. For example, a select statement could be:
-- diagnoses is a placeholder; use the name of your imported table
SELECT count(e_cerv_dis_state)
FROM diagnoses
WHERE e_cerv_dis_state LIKE '%6%'
It would also be possible to use a CASE WHEN statement to add-in the names of the diagnoses.
