Combining fields in an index - search

This is about reducing index space without reducing search flexibility. I have fields A and B. I want to search fields A, B, or a block of A and B with a query of terms X Y. To do this I have the following fields in my index:
field A: A
field B: B
field AB: A + B
The resulting index duplicates data and wastes space. Is there a way to have only fields A and B in my index, but still allow searching of field AB? This is different from a search of field A or field B (hit is returned when A contains X Y or B contains X Y) and it is different from a search of field A and field B (hit is returned when A contains X Y and B contains X Y). I want to catch the situation where A contains X and B contains Y, for example. Please advise.

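The multi_match query has several settings that can achieve this. In particular, its cross_fields type treats the listed fields as if they were one combined field, so with the operator set to and, each term must appear in at least one of the fields (for example, X in A and Y in B). It works best when the fields share the same analyzer.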
See here - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html
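As a rough sketch of what such a request body could look like, the snippet below builds a multi_match query that uses the cross_fields type with operator and. The field names A and B and the query text "X Y" come from the question; the helper name build_cross_fields_query is made up for illustration, and whether cross_fields fits your mapping (it assumes comparable analyzers on both fields) should be checked against the documentation linked above.

```python
import json


def build_cross_fields_query(text, fields):
    """Build a multi_match body that searches several fields as one combined field.

    Hypothetical helper: it only assembles the JSON body and does not talk
    to a cluster.
    """
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "cross_fields",   # treat A and B like a single big field
                "fields": list(fields),
                "operator": "and",        # every term must match in some field
            }
        }
    }


body = build_cross_fields_query("X Y", ["A", "B"])
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a search request against an index that holds only fields A and B, which avoids storing a third combined field.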

Related

Textjoin values of column B if duplicates are present in column A

I want to consolidate the data in column B into a single cell ONLY IF the index (i.e., column A) is duplicated.
For example:
Currently, I'm doing this manually for each duplicated index with the following formula:
=TEXTJOIN(", ",TRUE,B4:B6)
Is there a better way to do this all at once?
Any help is appreciated.
There may be an easier way, but you can try this formula:
=BYROW(A2:A17,LAMBDA(p,IF(INDEX(MAP(A2:A17,LAMBDA(x,SUM(--(A2:INDEX(A2:A17,ROW(x)-1)=x)))),ROW(p)-1,1)=1,TEXTJOIN(", ",1,FILTER(B2:B17,A2:A17=p)),"")))
Using REDUCE might be possible for a more succinct solution, though try this for now:
=BYROW(A2:A17,LAMBDA(ζ,LET(α,A2:A17,IF((COUNTIF(α,ζ)>1)*(COUNTIF(INDEX(α,1):ζ,ζ)=1),TEXTJOIN(", ",,FILTER(B2:B17,α=ζ)),""))))
Here are some alternative ways to solve it.
Using XMATCH/UNIQUE:
=LET(A, A2:A17, ux, UNIQUE(A),idx, FILTER(XMATCH(ux, A), COUNTIF(A, ux)>1),
MAP(SEQUENCE(ROWS(A)), LAMBDA(s, IF(ISNA(XMATCH(s, idx)), "", TEXTJOIN(",",,
FILTER(B2:B17, A=INDEX(A,s)))))))
or using SMALL/INDEX to identify the first element of the repetition:
=LET(A, A2:A17, n, ROWS(A), s, SEQUENCE(n),
MAP(A, s, LAMBDA(aa,ss, LET(f, FILTER(B2:B17, A=aa), IF((ROWS(f)>1)
* (INDEX(s, SMALL(IF(A=aa, s, n+1),1))=ss), TEXTJOIN(",",, f), "")))))
Here is the output:
Explanation
XMATCH and UNIQUE
The main idea here is to identify the unique elements of column A via ux, and find the index position of their first occurrence in A via XMATCH(ux, A). The result is an array of the same size as ux. Then COUNTIF(A, ux)>1 returns an array of the same size as the XMATCH output, indicating where we have a repetition.
Here is the intermediate result:
XMATCH(ux, A)   COUNTIF(A, ux)>1
1 FALSE
2 FALSE
3 TRUE
6 FALSE
7 TRUE
9 TRUE
11 FALSE
12 TRUE
15 FALSE
16 FALSE
So FILTER takes only the rows from the first column where the second column is TRUE, i.e., the index positions (idx) where a repetition starts. For our sample it will be: {3;7;9;12}.
Now we iterate over the sequence of index positions (s) via MAP. If s is found in idx via XMATCH (XLOOKUP(s, idx, TRUE, FALSE) can also be used for the same purpose), then we join the values of column B filtered by column A equal to INDEX(A,s).
SMALL and INDEX
This approach is more flexible: if we want the concatenated result to appear at a different position within the repetition, we only need to change the order passed to SMALL, and the rest of the formula stays the same.
We iterate via MAP through elements of column A and index position (s). The name f has the filtered values from column B where column A is equal to a given value of the iteration aa. We need to identify only filtered rows with repetition, so the first condition ROWS(f) > 1 ensures it.
The second condition identifies only the first element of the repetition:
INDEX(s, SMALL(IF(A=aa, s, n+1),1))=ss
The second argument of SMALL indicates we want the first smallest value, but it could be the second, third, etc.
Where A is equal to aa, IF assigns the corresponding value of the sequence (remember that IF works as an array formula); otherwise it assigns a value that will never be the smallest one, for example n+1, where n is the number of rows in column A. SMALL returns the smallest index position. If the current index position ss is not the smallest one, the condition is FALSE.
Finally, we do a TEXTJOIN only when both conditions are met (we multiply them to ensure an AND condition).
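For readers who find the nested formulas hard to follow, the same two conditions (the key is repeated, and the current row is its first occurrence) can be sketched outside the spreadsheet. This is only an illustration of the logic, not a replacement for the formulas; the function name and the sample keys and values are made up.

```python
from collections import defaultdict


def join_on_first_duplicate(keys, values, sep=", "):
    """Return one string per row: joined values on the first row of each
    repeated key, and an empty string everywhere else."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)          # like FILTER(B, A=key)

    seen = set()
    out = []
    for key in keys:
        is_repeated = len(groups[key]) > 1   # like COUNTIF(A, key) > 1
        is_first = key not in seen           # first occurrence of this key
        if is_repeated and is_first:
            out.append(sep.join(str(v) for v in groups[key]))
        else:
            out.append("")
        seen.add(key)
    return out


# Hypothetical sample data: only "k2" is duplicated.
keys = ["k1", "k2", "k2", "k3", "k2"]
values = ["a", "b", "c", "d", "e"]
result = join_on_first_duplicate(keys, values)
print(result)
```

Only the row holding the first "k2" receives the joined text; every other row stays empty, which mirrors multiplying the two conditions in the formulas.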

How to sort list by another list?

I have an Excel list of names (Datalist) and I need to sort it into exactly the same order as a similar list (Patternlist).
How can I sort Datalist so that it has the same order as Patternlist?
Patternlist (each letter is the first cell in a row):
X
Y
Z
Q
Datalist (each letter is the first cell in a row):
Q
X
Y
Z
Doing it manually:
1. Label each row in Patternlist with 1, 2, 3, ...
2. Use INDEX/MATCH to generate a 'sequence' list from the Datalist:
=index([Datalist_QXYZ],match([1st_named_cell],[Patternlist_XYZQ],0))
3. Copy the generated sequence list, paste it as values, then sort.
(3b) If you need to actively generate a new list, use rank() to sort it manually.
Hope it helps..
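The idea behind the steps above (give every pattern entry a position number, look that number up for each data entry, then sort by it) can be sketched as follows. The two lists are the ones from the question; the function name is made up, and placing names that are missing from the pattern at the end is an assumption, not something the answer specifies.

```python
def sort_by_pattern(data, pattern):
    """Sort data into the order in which its items appear in pattern."""
    # Step 1: label each pattern entry with its position (0, 1, 2, ...).
    rank = {name: position for position, name in enumerate(pattern)}
    # Steps 2-3: look up each data entry's label and sort by it.
    # Assumption: entries not found in the pattern go to the end.
    return sorted(data, key=lambda name: rank.get(name, len(pattern)))


pattern_list = ["X", "Y", "Z", "Q"]
data_list = ["Q", "X", "Y", "Z"]
sorted_data = sort_by_pattern(data_list, pattern_list)
print(sorted_data)
```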

How to match the ordering and sorting of multiple columns in Excel

I have data that look like this (going on for many more rows):
What I want to do is:
Match the relationship of C and G to the relationship of I and J.
For example, I:Q1652 matches up with J:Q1662; therefore, C:Q1652 should also match up with G:Q1662.
At the same time, A & B and E & F should maintain their relationships with C and G, respectively
For example, when C:Q1652 and G:Q1662 are being matched, they should carry with them their respective rows/values from columns A & B and E & F.
Please let me know if there's anything more I can clarify! Thanks!
Please see K1:N1 cells in the below graph.
K1: =INDEX(A:A,MATCH($I1,$C:$C,0))
L1: =INDEX(B:B,MATCH($I1,$C:$C,0))
M1: =INDEX(E:E,MATCH($J1,$G:$G,0))
N1: =INDEX(F:F,MATCH($J1,$G:$G,0))
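Each of these formulas is a plain lookup: find the row where the key column equals the wanted value, then return the cell from another column in that same row. A minimal sketch of that lookup is shown below; the column contents are invented for illustration (only the identifiers Q1652 and Q1662 appear in the question), and returning None stands in for the #N/A error Excel would show.

```python
def index_match(lookup_value, match_column, return_column):
    """Mimic INDEX(return_column, MATCH(lookup_value, match_column, 0))."""
    try:
        position = match_column.index(lookup_value)  # exact match, first hit
    except ValueError:
        return None  # stand-in for Excel's #N/A
    return return_column[position]


# Hypothetical data: column A travels with column C, and column I gives
# the desired order of the C values.
col_a = ["a1", "a2", "a3"]
col_c = ["Q1662", "Q1652", "Q1700"]
col_i = ["Q1652", "Q1700"]

col_k = [index_match(value, col_c, col_a) for value in col_i]
print(col_k)
```

Filling the formulas down columns K to N does the same thing row by row, so the values from A, B, E and F follow the order given by I and J.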

Mutually dependent variables in a spreadsheet

Assume I have an input variable x and three parameters a,b,c such that:
Given b we have c = f(x,a,b) for some (known) function f
Given c we have b = g(x,a,c) for some (known, different) function g.
I want to model this in a spreadsheet (Excel for instance). More precisely, if the user provides x,a and b then c will be evaluated and if c is given then b will be evaluated. It seems like this cannot be achieved directly, since a cell can hold either a value or a formula.
Is there a canonical way to do this? If not, what would be a best-practice workaround (probably some VBA magic)?
You can separate the input fields from the calculated values and add some validation so that only one of the mutually exclusive fields is used, e.g.:
In my example, I used the following conditional formatting rule to highlight invalid input:
=AND($B$4<>"", $B$5<>"")
and I used the following formulas for the calculated values:
=B2
=B3
=IF(AND($B$4<>"", $B$5<>""), "#ERROR: only 1 value can be specified",
IF($B$4<>"", $B$4, $B$5-1))
=IF(AND($B$4<>"", $B$5<>""), "#ERROR: only 1 value can be specified",
IF($B$5<>"", $B$5, $B$4+1))
more generally:
=if(error_condition, error_message, if(b_is_not_empty, b, g(x,a,c)))
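The same pattern (separate inputs, validate that only one of b and c is given, derive the other) can be sketched as a small function. Here f and g are placeholders that copy the example formulas (c = b + 1 and b = c - 1); the real f and g would also use x and a. Raising an error when neither value is supplied is an assumption that goes beyond the spreadsheet shown.

```python
def f(x, a, b):
    """Placeholder for the known function c = f(x, a, b)."""
    return b + 1


def g(x, a, c):
    """Placeholder for the known function b = g(x, a, c)."""
    return c - 1


def resolve(x, a, b=None, c=None):
    """Return (b, c), computing whichever of the two was not supplied."""
    if b is not None and c is not None:
        raise ValueError("only 1 value can be specified")
    if b is None and c is None:
        # Assumption: the spreadsheet does not handle this case explicitly.
        raise ValueError("either b or c must be specified")
    if b is not None:
        return b, f(x, a, b)
    return g(x, a, c), c


from_b = resolve(x=1, a=2, b=3)
from_c = resolve(x=1, a=2, c=10)
print(from_b, from_c)
```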

String matching on two columns in [R]

I am looking to match multiple string criteria and then subset the matching rows in R, using grepl to find the matches. I found a nice solution in another post (the code is specific to that post, but you get the idea): subset(GEMA_EO5, grepl(paste(l, collapse="|"),GEMA_EO5$RefSeq_ID))
I am wondering whether it is possible to search in two columns instead of just RefSeq_ID as in the example above, either with grepl or via any other method. In other words, I would like to look for the options in l not just in one column, but in two (or however many). Is this possible?
E.g., with 3 columns a, b and c, I would like criteria such that T (rows 3 and 4) is selected, despite the format "T I" in (3,b). It should identify both (4,a) and (3,b), hence the link to the previous question. I want it to look in column a AND column b, not one or the other.
a b c
A A C P L
V V B W E E
W T I P J G
T W P J
Here's some demo data to show how this works:
set.seed(1234)
dat <- data.frame(A = sample(letters[1:3],10,TRUE),
B = sample(letters[1:3],10,TRUE))
Using [ to subset makes this a lot more clear in my opinion - we can use grepl to give a logical vector based on a match, and use | to combine two tests (on multiple columns). If you wanted a subset of all the rows that contained an 'a' in either column:
dat.a <- dat[with(dat, grepl("a", A)|grepl("a", B)),]
A B
1 b a
2 b a
3 a c
5 a a
9 a a
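For comparison, here is a minimal sketch of the same idea outside R: join the search options into one alternation pattern (as paste(l, collapse="|") does) and keep a row if the pattern matches in any of the chosen columns. The rows below are invented and are not the output of the seeded sample above; the function name is hypothetical.

```python
import re


def rows_matching_any_column(rows, patterns, columns):
    """Keep rows where any listed column matches any of the patterns."""
    # Patterns are treated as regular expressions, as grepl does by default.
    regex = re.compile("|".join(patterns))
    return [
        row for row in rows
        if any(regex.search(row[column]) for column in columns)
    ]


# Hypothetical data with two columns, A and B.
rows = [
    {"A": "b", "B": "a"},
    {"A": "c", "B": "c"},
    {"A": "a", "B": "b"},
    {"A": "W", "B": "T I"},
]
hits = rows_matching_any_column(rows, ["a", "T"], ["A", "B"])
print(hits)
```

Because the match is a substring search, a cell such as "T I" still matches the option "T", which is the behaviour asked for in the question.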
