Lucene BooleanQuery - Must be present in one of two columns

I'm not sure how to format this query in Lucene. The scenario is that the search term must be present in at least one of two columns (either one is fine).
boolQuery.Add(query1, Occur.MUST) 'this one is fine
boolQuery.Add(query2, Occur.SHOULD)
boolQuery.Add(query3, Occur.SHOULD)
This brings up results even when the search term is not present at all in column 2 or column 3.
boolQuery.Add(query2, Occur.MUST)
boolQuery.Add(query3, Occur.SHOULD)
This does not bring up results when the search term is present in column 3 but not in column 2.
How do I format the query so that I get the equivalent of this:
where column 1= val1 and (column 2 = val2 or column 3 = val2)

MUST, as the name suggests, makes the clause mandatory. SHOULD means optional. Your first boolean query will basically match only documents hit by the first clause, but any of them that are also hit by the second or third clause will score higher. To get results that match your desired LINQ statement (I assume that's what it is), nest the two optional clauses in a sub-query and make the sub-query mandatory. This should work (using Java):
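import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
// q1, q2, q3 are the per-column queries. This is the older mutable API; newer Lucene versions use BooleanQuery.Builder.
BooleanQuery q = new BooleanQuery();
BooleanQuery subQuery = new BooleanQuery();
subQuery.add(q2, Occur.SHOULD); // column 2 = val2
subQuery.add(q3, Occur.SHOULD); // column 3 = val2
q.add(q1, Occur.MUST); // column 1 = val1
q.add(subQuery, Occur.MUST); // at least one SHOULD clause of the sub-query must match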
Your confusion probably stems from the fact that the query API implements MUST and SHOULD as unary operators attached to each clause, while in traditional programming languages AND and OR are binary operators.
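To make the MUST/SHOULD behaviour described above concrete, here is a toy model in Python. It is not the Lucene API: the Term and BoolQuery classes, the sample documents and the exact-match rule are all made up for illustration, and scoring is ignored. It only mimics the matching rule that SHOULD clauses become optional as soon as a MUST clause is present.

```python
from dataclasses import dataclass

MUST, SHOULD = "MUST", "SHOULD"


@dataclass
class Term:
    """Hypothetical exact-match query on a single field."""
    field: str
    value: str

    def matches(self, doc):
        return doc.get(self.field) == self.value


class BoolQuery:
    """Toy stand-in for a boolean query (matching only, no scoring)."""

    def __init__(self):
        self.clauses = []

    def add(self, query, occur):
        self.clauses.append((query, occur))
        return self

    def matches(self, doc):
        musts = [q for q, o in self.clauses if o == MUST]
        shoulds = [q for q, o in self.clauses if o == SHOULD]
        if not all(q.matches(doc) for q in musts):
            return False
        if musts:
            # With a MUST clause present, SHOULD clauses are optional.
            return True
        # With only SHOULD clauses, at least one has to match.
        return any(q.matches(doc) for q in shoulds)


docs = [
    {"col1": "val1", "col2": "val2", "col3": "x"},
    {"col1": "val1", "col2": "x", "col3": "val2"},
    {"col1": "val1", "col2": "x", "col3": "x"},
    {"col1": "other", "col2": "val2", "col3": "val2"},
]
q1, q2, q3 = Term("col1", "val1"), Term("col2", "val2"), Term("col3", "val2")

# Flat query from the question: MUST q1, SHOULD q2, SHOULD q3.
flat = BoolQuery().add(q1, MUST).add(q2, SHOULD).add(q3, SHOULD)
# Nested query from the answer: MUST q1, MUST (SHOULD q2, SHOULD q3).
sub = BoolQuery().add(q2, SHOULD).add(q3, SHOULD)
nested = BoolQuery().add(q1, MUST).add(sub, MUST)

flat_hits = [i for i, d in enumerate(docs) if flat.matches(d)]
nested_hits = [i for i, d in enumerate(docs) if nested.matches(d)]
print(flat_hits, nested_hits)
```

In this model the flat query also returns the third document, which has val1 in col1 but val2 in neither of the other columns. The nested query drops it.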

I solved a similar issue using query syntax:
+(col1:{query} OR col2:{query})
This will return the documents having the value {query} in at least one of the fields.
(Note: I am using the classes Query and MultiFieldQueryParser.)

Reduce results to first match for each pattern with spark sql

I have a spark sql query, where I have to search for multiple identifiers:
SELECT * FROM my_table WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')
Now I get hundreds of results for each of these matches, where I am only interested in the first match for each identifier, i.e. one row with identifier == 'abc', one where identifier == 'cde' and so on.
What is the best way to reduce my result to only the first row for each match?

The best approach certainly depends a bit on your data and also on what you mean by first. Is that any random row that happens to be returned first? Or first by some particular sort order?
A general flexible approach is using window functions. row_number() allows you to easily filter for the first row by window.
SELECT * FROM (
SELECT *, row_number() OVER (PARTITION BY identifier ORDER BY ???) as row_num
FROM my_table
WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')) tmp
WHERE
row_num = 1
That said, aggregations like first or max_by are often more efficient, but they quickly become inconvenient when dealing with lots of columns.
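If you want to try the row_number() pattern without a Spark cluster, here is a small sketch that runs the same shape of query against an in-memory SQLite database from Python. The ts and payload columns and the sample rows are assumptions made up for the example, with ts standing in for whatever sort order defines "first" in your data. It needs SQLite 3.25 or newer for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: ts defines which row counts as "first".
conn.execute("CREATE TABLE my_table (identifier TEXT, ts INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [
        ("abc", 2, "abc-late"),
        ("abc", 1, "abc-early"),
        ("cde", 5, "cde-only"),
        ("efg", 3, "efg-early"),
        ("efg", 9, "efg-late"),
        ("zzz", 1, "not-requested"),
    ],
)

first_rows = conn.execute(
    """
    SELECT identifier, ts, payload FROM (
        SELECT *, row_number() OVER (PARTITION BY identifier ORDER BY ts) AS row_num
        FROM my_table
        WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')
    ) tmp
    WHERE row_num = 1
    ORDER BY identifier
    """
).fetchall()

for row in first_rows:
    print(row)
```

Identifiers with no rows at all ('ghi' here) simply do not appear in the result.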

You can use the first() aggregation function (after grouping by identifier) to only get the first row in each group.
But I don't think you'll be able to select * with this approach. Instead, you can list every individual column you want to get:
SELECT identifier, first(col1), first(col2), first(col3), ...
FROM my_table
WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')
GROUP BY identifier
Another approach would be to fire a query for each identifier value with a limit of 1 and then union all the results.
With the DataFrame API, you can use your original query and then use .dropDuplicates(["identifier"]) on the result to only keep a single row for each identifier value.

"Does Not Contain" text filter filters too much

By using Power Query, I have created an address list from address fields in approx. 15000 individual excel files.
I now have a list with 15143 rows but I have run into problems with the "Does Not Contain" Text filter.
I want to keep rows that do not contain the search term "foo" in a specific column.
When I first use the "Contains" "foo" Text Filter, it returns a list of 150 rows.
But when I use the "Does Not Contain" "foo" Text Filter instead, the list is shortened to only 3218 rows.
A bit unexpected result...
If I recall my maths lessons correctly 15143-150=14993, not 3218.
This is driving me nuts!
Do I do something wrong or is it the Almighty Microsoft Bug that has hit me, once again?

This behavior follows the expected SQL logic for null: if a row field is null, it doesn't contain "foo", but it also doesn't not contain "foo". Put differently, a WHERE filter skips rows whose condition evaluates to null, and not null is also null.
You can see this in Power Query:
let
Source = Table.FromColumns({{null, "foo", "bar"}}),
FilteredRows = Table.SelectRows(Source, each
not Text.Contains([Column1], "foo") or Text.Contains([Column1], "foo"))
in
FilteredRows
... only returns the last two rows.
In Power Query, if you want to avoid this bizarre kind of logic, you can replace null with an empty string, and then you get nicer behavior:
= Table.ReplaceValue(Source,null,"",Replacer.ReplaceValue,{"Column1"})
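The same three-valued logic can be seen in plain SQL. The sketch below uses Python's sqlite3 module with a made-up one-column table t. It is only an analogy for what the Power Query filter does, not Power Query itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col TEXT)")  # hypothetical table for the demo
conn.executemany("INSERT INTO t VALUES (?)", [(None,), ("foo",), ("bar",)])


def count(where):
    """Count the rows for which the WHERE condition evaluates to true."""
    return conn.execute(f"SELECT count(*) FROM t WHERE {where}").fetchone()[0]


total = count("1 = 1")
contains = count("col LIKE '%foo%'")
not_contains = count("col NOT LIKE '%foo%'")
# For the null row both sides are null, so the OR is null and the row is skipped.
either = count("col LIKE '%foo%' OR col NOT LIKE '%foo%'")
# Replacing null with an empty string first keeps the row.
fixed = count("coalesce(col, '') NOT LIKE '%foo%'")

print(total, contains, not_contains, either, fixed)
```

Here "contains" plus "does not contain" adds up to 2 of the 3 rows. The null row falls through both filters until it is replaced with an empty string.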

Filter based on existence in one table and non-existence in another

I have the following data model:
Record: Id, ..., CreateDate
FactA: RecordId, CreateDate
FactB: RecordId, CreateDate
Relationships exist from FactA to Record and FactB to Record.
I've written measures on Records such as this with no issues:
FactA's:=CALCULATE(DISTINCTCOUNT(Records[Id]), FactA)
FactB's:=CALCULATE(DISTINCTCOUNT(Records[Id]), FactB)
Now I'd like a count of Records with FactA but no FactB. In SQL I'd do a LEFT JOIN ... WHERE FactB.RecordId IS NULL, but I can't figure out how to do something similar in DAX. I've tried:
-- this returns blank, presumably because when there is a FactB then RecordId isn't blank, and when there is no FactB then RecordId is a NULL, which isn't blank either
FactA_No_FactB:=CALCULATE(DISTINCTCOUNT(Records[Id]), FactA, FILTER(FactB, ISBLANK([RecordId])))
-- this returns the long "The value for columns "RecordId" in table "FactB" cannot be determined in the current context" error.
FactA_No_FactB:=CALCULATE(DISTINCTCOUNT(Records[Id]), FILTER(FactA, ISBLANK(FactB[RecordId])))
I've also tried various ways of using RELATED and RELATEDTABLE but I don't really understand enough about DAX and context to know what I'm doing.
Can someone explain how I can write the calculated measure to count Records with FactA but no FactB?
Thanks in advance.
Edit - Workaround
I've come up with this. It looks correct so far, but I'm not sure if it is the generally correct way to do this:
-- Take the count with FactA and subtract the count of (FactA and FactB)
FactA_No_FactB:=CALCULATE(DISTINCTCOUNT(Records[Id]), FactA) - CALCULATE(DISTINCTCOUNT(Records[Id]), FactA, FactB)

Here's an alternative, which still might not be the best way of doing it:
FactA_No_FactB:=CALCULATE(DISTINCTCOUNT(Records[ID]), FILTER(Records,CONTAINS(FactA, FactA[RecordID],Records[ID]) && NOT(CONTAINS(FactB,FactB[RecordID],Records[ID]))))
The difference between my version and yours is that mine returns a value of 1 for those items in A but not B and BLANK for everything else. Your version returns 1 for those items in A but not B, 0 for those in both A and B, and BLANK for everything else. Depending on your use case, one outcome may be preferable over the other.
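For the overall total, the subtraction workaround and the CONTAINS filter amount to the same set logic. Here is a small Python sketch with made-up ids that checks this. It models the counting only, not DAX filter context or the BLANK versus 0 difference described above.

```python
# Hypothetical data: Record ids, and the RecordId column of each fact table
# (duplicates allowed, as in a real fact table).
record_ids = {1, 2, 3, 4, 5}
fact_a = [1, 1, 2, 3]
fact_b = [2, 2, 5]

a_ids = set(fact_a)
b_ids = set(fact_b)

# Workaround from the question: count(A) - count(A and B).
with_a = record_ids & a_ids
with_a_and_b = with_a & b_ids
a_no_b_by_subtraction = len(with_a) - len(with_a_and_b)

# Alternative from the answer: keep records that are in A and not in B.
a_no_b_by_filter = len({r for r in record_ids if r in a_ids and r not in b_ids})

print(a_no_b_by_subtraction, a_no_b_by_filter)
```

Both counts agree here because every record with both facts is also a record with FactA, so subtracting the overlap leaves exactly the records with FactA only.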

Excel Power Query -- Select value in column specified in related table -- INDEX+MATCH alternative

Problem
I have two queries, one contains product data (data_query), the other (recode_query) contains product names from within the data_query and assigns them specific id_tags. id_tags are also column names within the data_query.
What I need to achieve and fail at
I need the data_query to look at the id_tag of the specific product name within the data_query, as parsed from the recode_query (this is already working and in place) and input the retrieved value within the specific custom column cell. In Excel, I would be using INDEX/MATCH combo:
{=INDEX(data_query[#Data];; MATCH(data_query[#id_tag]; data_query[#Headers]; 0))}
I have searched near and far, but I probably wouldn't recognize the solution even if I came across it, as I am not that deep into data manipulation and Power Query myself.

Is this what you're wanting?
let
DataQuery = Table.FromColumns({{1,2,3}, {"Boxed", "Bagged", "Rubberbanded"}}, {"ID","Pkg"}),
RecodeQuery = Table.FromColumns({{"Squirt Gun", "Coffee Maker", "Trenching Tool"}, {1,2,3}}, {"Prod Name", "ID2"}),
Rzlt = Table.Join(DataQuery, "ID", RecodeQuery, "ID2", JoinKind.Inner)
in
Rzlt
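In case the join is not quite it: as I read the INDEX/MATCH formula, each row should pick its value from the column whose name is stored in that row's id_tag. Here is that lookup sketched in Python. The rows and the column names (pkg, color, picked) are invented for the example and are not taken from your queries. In Power Query itself, Record.Field reads a field by name and may be the closest counterpart, though I have not tried it against your data.

```python
# Hypothetical rows: id_tag holds the NAME of the column to read for that row.
rows = [
    {"product": "Squirt Gun", "id_tag": "pkg", "pkg": "Boxed", "color": "Red"},
    {"product": "Coffee Maker", "id_tag": "color", "pkg": "Bagged", "color": "Black"},
    {"product": "Trenching Tool", "id_tag": "weight", "pkg": "Rubberbanded", "color": "Green"},
]


def pick(row):
    """Return the value from the column named by this row's id_tag.

    Returns None when no such column exists (like a failed MATCH).
    """
    return row.get(row["id_tag"])


# Add the looked-up value as a new "custom column" on every row.
result = [dict(row, picked=pick(row)) for row in rows]

for row in result:
    print(row["product"], "->", row["picked"])
```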

Complicated condition

I have predefined item combinations (for example brand1|brand2|brand3, etc.) in a table.
I'd like to collect brands and check them against the predefined table data.
For example, if I collected brand1|brand2|brand3, then I can get some value from that predefined table (it meets the condition).
How can I check this?
The number of brands is unlimited. Also, if brand1|brand2|brand3 or brand1|brand2| exists, it should return true.

Okay, taking a wild guess at what you're asking: you have a delimited field with brands in it, separated by a | character. You want to return any row that has the right combination of brands in there, but you don't want to return rows with, for example, the brand "testify" in them when you search for "test".
You have four search conditions (looking for brand3):
the brand exists by itself: "brand3"
the brand starts the delimited field: "brand3|brand4|brand6"
the brand is in the middle of the field: "brand1|brand3|brand6"
the brand is at the end of the field: "brand1|brand2|brand3"
So, in SQL:
SELECT *
FROM MyTable
WHERE BrandField = 'brand3'
OR BrandField LIKE 'brand3|%'
OR BrandField LIKE '%|brand3|%'
OR BrandField LIKE '%|brand3'
Repeat as required for multiple brands.
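Here is a quick way to check the four conditions, sketched with Python's sqlite3 module. The sample rows are made up, including two near-misses (brand30 and brand13) that a plain LIKE '%brand3%' would wrongly return. The find helper is hypothetical and assumes brand names contain no % or _ characters, since LIKE treats those as wildcards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (BrandField TEXT)")
sample = [
    "brand3",                 # brand by itself
    "brand3|brand4|brand6",   # brand at the start
    "brand1|brand3|brand6",   # brand in the middle
    "brand1|brand2|brand3",   # brand at the end
    "brand1|brand30|brand6",  # near-miss: brand30
    "brand13",                # near-miss: brand13
]
conn.executemany("INSERT INTO MyTable VALUES (?)", [(s,) for s in sample])


def find(brand):
    """Return rows whose delimited BrandField contains exactly this brand."""
    sql = """
        SELECT BrandField FROM MyTable
        WHERE BrandField = ?
           OR BrandField LIKE ?
           OR BrandField LIKE ?
           OR BrandField LIKE ?
        ORDER BY rowid
    """
    params = (brand, f"{brand}|%", f"%|{brand}|%", f"%|{brand}")
    return [row[0] for row in conn.execute(sql, params)]


matches = find("brand3")
print(matches)
```

Passing the patterns as parameters also avoids building the SQL string by hand for each brand.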
