Azure Data Flows isolate-find substrings from text field - azure

I am trying to isolate several substrings from a specific column of a parquet file that contains text (string). The substrings (words) are all in an array, and I want to keep only those rows that contain one or more of them, while adding a new column with the substrings that were found in the text.
I am currently using the following transformations:
source: the parquet file I use
derived column: where I create a new column (words) containing an array of the words (substrings) found in the text, using the following expression:
intersect(split(text_column, ' '), ['array','of','words'])
filter: where I want to filter on the derived column created in the previous transformation and exclude rows that are either null or contain an empty array
sink
I am currently stuck at the third transformation: I cannot filter out the rows where the two arrays do not intersect. I think that when intersect doesn't find any common element it returns an empty string array, and I have not found the right condition to filter it out.
I have tried:
1. not(isNull(words))
2. word != array('')
3. not(isNull(word[1]))
but none of them worked.
Any suggestions regarding the whole process or the filtering of the empty string array would be great.
Thank you in advance.
I was expecting to get back only the rows that contain at least one of the substrings, but I get all the rows regardless of whether they contain one of the substrings.

You can check the size of the array and remove the rows where the array size is 0. In the filter transformation, filter on size(words)!=0.
I reproduced this with sample inputs.
The derived column transformation uses the same expression:
intersect(split(text_column, ' '), ['array','of','words'])
Then, in the filter transformation, the condition is size(words)!=0.
This way, the rows with an empty array are removed.
Reference: MS document on size expression.
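For readers who want to see the logic outside of Data Flows, here is a minimal Python sketch of the same two steps: intersect the split text with the word list, then keep only the rows where the result is non-empty. The sample rows and the names rows, keywords and find_words are made up for illustration; this is an analogy, not Data Flow expression syntax.

```python
# Hypothetical sample data standing in for the parquet text column.
rows = [
    "an array of things",
    "nothing relevant here",
    "some words",
]
keywords = ["array", "of", "words"]


def find_words(text, keywords):
    """Return the keywords that appear as whole space-separated tokens in text."""
    tokens = set(text.split(" "))
    return [k for k in keywords if k in tokens]


# Derived column step: attach the matched words to every row.
derived = [(text, find_words(text, keywords)) for text in rows]

# Filter step: the same idea as size(words) != 0.
kept = [(text, words) for text, words in derived if len(words) != 0]

for text, words in kept:
    print(text, words)
```

Checking the length of the result avoids having to guess how an empty array compares with null or with an array holding an empty string.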

Related

Five random items from a list into a single cell separated by a comma

I have n unique values in n cells in Column A. (For example: EDN12, EDN122, EDN991, ....)
I want to return any five unique values, without repetition and in random order, from Column A into a single cell, separated by commas, and to do this n times. For example: (EDN12, EDN112, EDN991, EDN881, EDN12)
How do I achieve this?
I have tried this formula provided here (Return a random order of a list into a single cell):
=TEXTJOIN(",",,INDEX($A$1:$A$5,UNIQUE(RANDARRAY(1000,1,1,5,TRUE))))
But it only draws values from the first five cells in column A; the rest are omitted.
Assuming values in column A are unique on their own, try:
=LET(x,TOCOL(A:A,3),TEXTJOIN(", ",,TAKE(SORTBY(x,RANDARRAY(COUNTA(x))),5)))
Otherwise just nest 'x' in UNIQUE():
=LET(x,UNIQUE(TOCOL(A:A,3)),TEXTJOIN(", ",,TAKE(SORTBY(x,RANDARRAY(COUNTA(x))),5)))
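The shuffle-and-take idea behind these formulas can be sketched in Python as follows. The values list and the function name pick_five are hypothetical; the comments map each step to the spreadsheet function it loosely mirrors.

```python
import random

# Hypothetical stand-in for the values in column A.
values = ["EDN12", "EDN122", "EDN991", "EDN881", "EDN7", "EDN45", "EDN300"]


def pick_five(values, k=5, rng=random):
    """Deduplicate, give each value a random sort key, sort, and take the first k."""
    unique = list(dict.fromkeys(values))         # like UNIQUE(), keeps first-seen order
    keyed = [(rng.random(), v) for v in unique]  # like RANDARRAY(COUNTA(x))
    keyed.sort(key=lambda pair: pair[0])         # like SORTBY(x, ...)
    return ", ".join(v for _, v in keyed[:k])    # like TAKE(..., 5) and TEXTJOIN


print(pick_five(values))
```

Because every value gets exactly one sort key, a value cannot be picked twice as long as the input itself has no duplicates after the deduplication step.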
This is an alternative formula that gets the required result without using LET, although I prefer the solution using the LET function.
=TEXTJOIN(", ",TRUE,INDEX(A3:A22,INDEX(UNIQUE(RANDARRAY(COUNTA(A3:A22),1,1,COUNTA(A3:A22),TRUE)),SEQUENCE(5))))
Breaking it down:
Get an array of random numbers based on the number of data rows.
=RANDARRAY(COUNTA(A3:A22),1,1,COUNTA(A3:A22),TRUE)
Extract the unique values from the array of random numbers.
=UNIQUE(C3#)
Extract the first five unique values.
=INDEX(D3#,SEQUENCE(5))
Use the extracted values to extract matching rows from the source data.
=INDEX(A3:A22,E3#)
Finally join the values into a single cell.
=TEXTJOIN(", ",TRUE,F3#)
If your list of data is very short, this approach may not return five unique values.
However, your example appears to have at least 1000 data rows, so that should not be a problem.
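To see how the random-index approach behaves with short lists, here is a rough Python sketch of the breakdown above. The function names and sample data are hypothetical. It draws n random integers between 1 and n, removes duplicates, and uses the first five as 1-based positions.

```python
import random


def random_unique_indices(n, rng=random):
    """Draw n random integers in 1..n, then drop duplicates (first-seen order)."""
    draws = [rng.randint(1, n) for _ in range(n)]  # like RANDARRAY(n,1,1,n,TRUE)
    return list(dict.fromkeys(draws))              # like UNIQUE(...)


def pick(values, k=5, rng=random):
    """Pick up to k values using the first k unique random indices."""
    idx = random_unique_indices(len(values), rng)[:k]
    return [values[i - 1] for i in idx]            # 1-based, like INDEX


# Hypothetical data: 20 values, matching the size of the A3:A22 range.
values = [f"EDN{i}" for i in range(1, 21)]
print(", ".join(pick(values)))
```

With only a handful of rows, the deduplicated draw can contain fewer than five indices, in which case this sketch simply returns fewer items.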

I want to find multiple outputs based on 2 given conditions in Excel?

I have dummy data for origins and destinations. I want to get, as text, the list of cities that are 4 hrs. from the destination, and the list should update when I change the origin. Can this be done easily with lookup and match functions?
The value to look up is in A8, and it is matched against the first column (A9):
=XMATCH(A8,A2:A6)
Then return this row from the numerical data using INDEX (A10):
=INDEX(B2:F6,A9,)
The last comma is needed to return the entire row.
FILTER this to return only the values greater than or equal to B8 (the criterion) (A11):
=FILTER(A10#,A10#>=B8)
Then multi-match using XLOOKUP (A12):
=XLOOKUP(A11#,A10#,B1:F1)
Put it all together using LET to save space (A14):
=LET(A,INDEX(B2:F6,A9,),XLOOKUP(FILTER(A,A>=B8),A,B1:F1))
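The same lookup chain can be sketched in Python. The city names and travel times below are invented for illustration, and destinations is a hypothetical helper: it picks the origin's row, keeps the entries that meet the threshold, and returns the matching column headers.

```python
# Hypothetical travel-time table in hours; names and numbers are made up.
cities = ["Delhi", "Mumbai", "Pune", "Jaipur", "Agra"]
hours = {
    "Delhi": [0, 20, 22, 5, 4],
    "Mumbai": [20, 0, 3, 17, 19],
}


def destinations(origin, threshold, cities, hours):
    """Return the cities whose travel time from origin is >= threshold."""
    row = hours[origin]  # like XMATCH + INDEX: pick the origin's row
    # Like FILTER + XLOOKUP: keep qualifying values and report their headers.
    return [city for city, h in zip(cities, row) if h >= threshold]


print(destinations("Delhi", 5, cities, hours))
```

Pairing each value with its header directly means two cities with the same travel time are both listed, without a second lookup by value.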

Replace consecutive identical values from excel column using python dataframes

As seen in the first table, units are mentioned beneath the fact values, and the consecutive identical "Numbered" values should be replaced with blanks while preserving the text and boolean values in the column.
The required output would look something like the following:
Try using mask and shift:
print(df.mask(df.shift() == df).fillna(''))
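The one-liner above blanks any value that equals the one directly above it, including repeated text or booleans. If only repeated numbers should be blanked, a plain-Python sketch for a single column could look like this. The function name and the sample column are hypothetical, and treating "Numbered" as int or float values is an assumption.

```python
def blank_repeated_numbers(column):
    """Replace a number with '' when it equals the value directly above it.

    Text and boolean entries are always kept (an assumption about the
    desired output).
    """
    result = []
    previous = object()  # sentinel that compares unequal to every value
    for value in column:
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number and not isinstance(previous, bool) and value == previous:
            result.append("")
        else:
            result.append(value)
        previous = value  # compare against the original value, not the blank
    return result


# Hypothetical column mixing numbers, unit text, and booleans.
column = [10, 10, "kg", "kg", 12, 12, 12, True, True]
print(blank_repeated_numbers(column))
```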

EXCEL - Array formula match in list

I am trying to make a formula that returns into a single cell (array formula) a vector of True/False based on whether each element in an array matches (identically) or not any of the elements in another array.
Example:
Array to be matched (array_compare): [A;A;B;C;D]
Array with elements (array_elements): [A;B;D]
The formula should return something like:
={formula(array_compare;array_elements)} ==> [TRUE;TRUE;TRUE;FALSE;TRUE]
I need this intermediate function so that later on I can add rows or columns based on criteria, or tell how many matching items there are in array_compare.
For example (for later use):
={sum(--formula(array_compare;array_elements))} ==> 4 (in the example)
THANKS!
You can use ISNUMBER(MATCH()):
=SUMPRODUCT(--ISNUMBER(MATCH({"A","A","B","C","D"},{"A","B","D"},0)))
This will return 4, as it iterates over the larger array.
If you want to iterate over the smaller array, reverse the two arrays and it will return 3:
=SUMPRODUCT(--ISNUMBER(MATCH({"A","B","D"},{"A","A","B","C","D"},0)))
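As a cross-check of the two results, the ISNUMBER(MATCH()) logic can be sketched in Python. The helper name match_flags is hypothetical; it returns one True/False flag per element of the first list.

```python
array_compare = ["A", "A", "B", "C", "D"]
array_elements = ["A", "B", "D"]


def match_flags(to_check, allowed):
    """Return a True/False flag for each item of to_check found in allowed."""
    allowed_set = set(allowed)
    return [item in allowed_set for item in to_check]


flags = match_flags(array_compare, array_elements)
print(flags)       # [True, True, True, False, True]
print(sum(flags))  # 4

# Swapping the arguments iterates over the smaller list instead.
print(sum(match_flags(array_elements, array_compare)))  # 3
```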

delete particular string values in the column

I have a column which contains 4000 unique values (rows). I want to delete values such as 'I__ND_LD(1)', 'I__ND_LD(2)', 'P__ND_LN(1)', 'I__XF_XF(4)'. These values differ only in the number in the brackets; for example, 'I__ND_LD' starts at 'I__ND_LD(1)' and ends at 'I__ND_LD(70)'.
With the code below, I can remove only one value. I want to remove all the values mentioned above.
eda[~eda.Devices.str.contains("^I__ND_LD(1)")]
Is there any other technique through which I can remove all these values? There are also different numbers of 'I__ND_LD' and 'P__ND_LN(1)' entries. I want to implement this in a function so that I can just pass the values and it deletes all of them from the column.
to_remove = [r'abc\(\d+\)', 'bca']  # regex patterns; raw string keeps the escaped parentheses literal
eda[~eda.devices.str.contains('|'.join(to_remove), regex=True)]
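To show how such a pattern could be built from a list of prefixes, here is a small sketch using only Python's re module on a plain list. The device names, build_pattern and drop_devices are hypothetical, and the assumption is that each unwanted value is a prefix followed by a number in parentheses.

```python
import re

# Hypothetical device names modelled on the ones in the question.
devices = [
    "I__ND_LD(1)", "I__ND_LD(70)", "P__ND_LN(1)",
    "I__XF_XF(4)", "KEEP_ME(3)", "OTHER",
]


def build_pattern(prefixes):
    """Build one regex matching any prefix followed by '(digits)' as the whole value."""
    parts = [re.escape(p) + r"\(\d+\)" for p in prefixes]
    return re.compile("^(?:" + "|".join(parts) + ")$")


def drop_devices(values, prefixes):
    """Return the values that do not match any prefix(number) pattern."""
    pattern = build_pattern(prefixes)
    return [v for v in values if not pattern.search(v)]


print(drop_devices(devices, ["I__ND_LD", "P__ND_LN", "I__XF_XF"]))
```

Escaping the prefixes with re.escape and writing the parentheses as \( and \) keeps them literal, so the pattern matches the bracketed number rather than opening a regex group.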
