How do I filter an array against a list of strings in Slick

I've got a column containing an array of varchar, and a list of search strings that I want to match against the column. If any of the search strings match any substring in the column strings, I want to return the row.
So for example if the column contains:
row 1: ['butter', 'water', 'eggs']
row 2: ['apples', 'oranges']
row 3: ['chubby', 'skinny']
And my search strings are:
Set("ter", "hub")
I want my filtered results to include row 1 and row 3, but not row 2.
If I were writing this in plain Scala I'd do something like:
val rows = [the rows containing my column]
val search = Set("ter", "hub")
rows.filter(r => search.exists(se => r.myColumn.exists(s => s.contains(se))))
Is there some way of doing this in Slick so the filtering gets done on the DB side before returning the results? Some combination of LIKE and ANY, maybe? I'm a little fuzzy on the mechanics of filtering an array against another array in SQL in the first place.

While I'm not convinced that this is the best way to do it, I've put together a solution that uses Regex. First, I concatenate the search terms into a simple regular expression:
val matcher = search.mkString(".*(","|",").*") // i.e. '.*(ter|hub).*'
Then I concatenate the array in the table column using an implicit SimpleExpression:
implicit class StringConcat(s: Rep[List[String]]) {
  def stringConcat: Rep[String] = {
    // Renders as: array_to_string(<column>, ',')
    val expr = SimpleExpression.unary[List[String], String] { (s, qb) =>
      qb.sqlBuilder += "array_to_string("
      qb.expr(s)
      qb.sqlBuilder += ", ',')"
    }
    expr.apply(s)
  }
}
Finally, I build a regex query using another implicit SimpleExpression:
implicit class RegexQuery(s: Rep[String]) {
  def regexQ(p: Rep[String]): Rep[Boolean] = {
    // Renders as: <s> ~* <p> (PostgreSQL case-insensitive regex match)
    val expr = SimpleExpression.binary[String, String, Boolean] { (s, p, qb) =>
      qb.expr(s)
      qb.sqlBuilder += " ~* "
      qb.expr(p)
    }
    expr.apply(s, p)
  }
}
And I can then perform my match like:
myTable.filter(row => row.myColumn.stringConcat.regexQ(matcher))
Hope that helps someone out, and if you have a better way of doing it let me know.
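To make it easier to see what the generated predicate computes, here is a rough plain-Scala sketch that mimics it in memory, with no database involved. The object name RegexFilterSketch and the helper matches are made up for illustration; the rows are the sample data from the question. It joins each array with ',' (like array_to_string) and then applies a case-insensitive regex (like ~*).

```scala
import java.util.regex.Pattern

// In-memory illustration only; names here are hypothetical, not part of Slick.
object RegexFilterSketch {
  // Sample rows from the question, keyed by row number.
  val rows: Map[Int, List[String]] = Map(
    1 -> List("butter", "water", "eggs"),
    2 -> List("apples", "oranges"),
    3 -> List("chubby", "skinny")
  )
  val search = Set("ter", "hub")

  // Same construction as in the answer: .*(term1|term2).*
  val matcher: String = search.mkString(".*(", "|", ").*")
  private val pattern = Pattern.compile(matcher, Pattern.CASE_INSENSITIVE)

  // Mimics array_to_string(column, ',') ~* matcher
  def matches(column: List[String]): Boolean =
    pattern.matcher(column.mkString(",")).find()

  // Row numbers whose array contains at least one search term as a substring.
  val matchedIds: List[Int] =
    rows.toList.collect { case (id, col) if matches(col) => id }.sorted
}
```

Two caveats with this approach, as far as I can tell: search terms are interpolated into the pattern unescaped, so regex metacharacters in a term change its meaning, and because the elements are joined with ',' a term that itself contains a comma could match across two neighbouring elements. Also, ~* is an unanchored match, so the leading and trailing .* should not be strictly necessary.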
Edit to add:
If you're looking for exact matches, and not partial matches, you can use the array overlap operator, like:
myColumn && '{"water","oranges"}'
In Slick (with the slick-pg extension, which provides the array operators) this is the @& operator, like:
.filter(table => table.myColumn @& myMatchList)
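For comparison, this is roughly what the overlap test means, sketched in plain Scala against the sample rows from the question. OverlapSketch and overlaps are hypothetical names used only for this illustration; the point is that elements are compared for exact equality, so a partial term like "ter" matches nothing.

```scala
// In-memory illustration of array overlap semantics; not Slick code.
object OverlapSketch {
  val rows: Map[Int, List[String]] = Map(
    1 -> List("butter", "water", "eggs"),
    2 -> List("apples", "oranges"),
    3 -> List("chubby", "skinny")
  )
  val myMatchList = List("water", "oranges")

  // True when the two lists share at least one element (exact equality).
  def overlaps(column: List[String], wanted: List[String]): Boolean = {
    val wantedSet = wanted.toSet
    column.exists(wantedSet)
  }

  // Row numbers whose array shares an element with myMatchList.
  val matchedIds: List[Int] =
    rows.toList.collect { case (id, col) if overlaps(col, myMatchList) => id }.sorted
}
```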

Related

Groovy: Find duplicates in Multiple arrays

I have three arrays as below and need to find the matching/duplicate values across all of them:
def Ids_AS = ['04-04350', '21-005676', 'REGU-132644681']
def Ids_AO = ['04-04350', '04-04356', 'REGU-132644681']
def Ids_AV = ['04-04350', 'AB-132644681', 'REGU-132644681']
println(Ids_AS.intersect(Ids_AV))
I used intersect, but it only applies to two arrays/lists at a time.
Another case: I need to handle an empty array like the one below. It should be ignored so that the remaining arrays are still matched, instead of returning null or an error:
def Ids_AS = ['04-04350', '21-005676', 'REGU-132644681']
def Ids_AO = ['04-04350', '04-04356', 'REGU-132644681']
def Ids_AV = []
Is there a way to find duplicates across multiple arrays? Please help.
Just intersect with the third array as well:
def duplicates = Ids_AS.intersect(Ids_AO).intersect(Ids_AV)
If you want to get clever, and you have many, you can make a list of your lists, and then use inject (fold) to intersect them all against each other
def all = [Ids_AS, Ids_AO, Ids_AV]
def duplicates = all.inject { a, b -> a.intersect(b) }
Both methods will result in
['04-04350', 'REGU-132644681']
For the second question, sort the list of lists so the longest one is first, and then ignore empty lists
def duplicates = all.sort { -it.size() }.inject { a, b -> b.empty ? a : a.intersect(b) }

Return multiple values from map in Groovy?

Let's say I have a map like this:
def map = [name: 'mrhaki', country: 'The Netherlands', blog: true, languages: ['Groovy', 'Java']]
Now I can return "submap" with only "name" and "blog" like this:
def keys = ['name', 'blog']
map.subMap(keys)
// Will return a map with entries name=mrhaki and blog=true
But is there a way to easily return multiple values instead of a list of entries?
Update:
I'd like to do something like this (which doesn't work):
def values = map.{'name','blog'}
which would yield for example values = ['mrhaki', true] (a list or tuple or some other datastructure).
map.subMap(keys)*.value
The Spread Operator (*.) is used to invoke an action on all items of an aggregate object. It is equivalent to calling the action on each item and collecting the result into a list.
You can iterate over the submap and collect the values:
def values = map.subMap(keys).collect {it.value}
// Result: [mrhaki, true]
Or, iterate over the list of keys, returning the map value for that key:
def values = keys.collect {map[it]}
I would guess the latter is more efficient, not having to create the submap.
A more long-winded way to iterate over the map
def values = map.inject([]) {values, key, value ->
if (keys.contains(key)) {values << value}
values
}
For completeness I'll add another way of accomplishing this using Map.findResults:
map.findResults { k, v -> k in keys ? v : null }
This is flexible, but more long-winded than some of the previous answers.

How to figure out if DStream is empty

I have two inputs: the first (say input1) is a stream and the second (say input2) is a batch.
I want to figure out whether the keys in the first input match a single row or more than one row in the second input.
The further transformations/logic depend on the number of matching rows: whether a single row matches or multiple rows match (for at least one key in the first input).
if(single row matches){
// do something
}else{
// do something
}
Code that I tried so far:
val input1Pair = streamData.map(x => (x._1, x))
val input2Pair = input2.map(x => (x._1, x))
val joinData = input1Pair.transform { x => input2Pair.leftOuterJoin(x) }
val result = joinData.mapValues {
  case (v, Some(a)) => 1L
  case (v, None) => 0L
}.reduceByKey(_ + _).filter(_._2 > 1)
When I call result.print, it prints nothing if every key matches only one row in input2.
Given that a DStream may contain multiple RDDs, I'm not sure how to figure out whether the DStream is empty. If that is possible, I can do an if check.
There's no function to determine if a DStream is empty, as a DStream represents a collection over time. From a conceptual perspective, an empty DStream would be a stream that never has data and that would not be very useful.
What can be done is to check whether a given microbatch has data or not:
dstream.foreachRDD{ rdd => if (rdd.isEmpty) {...} }
Please note that each batch interval produces exactly one RDD, so at any given point in time there's only one RDD to check.
I think that the actual question is how to check the number of matches between the reference RDD and the data in the DStream. Probably the easiest way would be to intersect both collections and check the intersection size:
val intersectionDStream = streamData.transform{rdd => rdd.intersection(input2)}
intersectionDStream.foreachRDD { rdd =>
  if (rdd.count > 1) {
    // do stuff with the matches
  } else {
    // do otherwise
  }
}
We could also place the RDD-centric transformations within the foreachRDD operation:
streamData.foreachRDD { rdd =>
  val matches = rdd.intersection(input2)
  if (matches.count > 1) {
    // do stuff with the matches
  } else {
    // do otherwise
  }
}
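One thing to keep in mind: intersection compares whole elements, so it only applies when the stream and the reference data have the same element type. If what you need is the number of reference rows matched per key, as in the question, the per-batch logic could look roughly like the following. This is a plain Scala collections sketch, not Spark code; the sample data and the names batch, reference, refCounts, matchesPerKey and anyMultiMatch are all made up for illustration.

```scala
// Plain-collections illustration of per-key match counting; no Spark involved.
object MatchCountSketch {
  // Hypothetical (key, payload) pairs standing in for one micro-batch of input1.
  val batch: Seq[(String, String)] = Seq("k1" -> "s1", "k2" -> "s2", "k3" -> "s3")
  // Hypothetical (key, payload) pairs standing in for input2.
  val reference: Seq[(String, String)] = Seq("k1" -> "r1", "k1" -> "r2", "k2" -> "r3")

  // Number of reference rows per key.
  val refCounts: Map[String, Int] =
    reference.groupBy(_._1).map { case (k, rows) => k -> rows.size }

  // For each key in the batch, how many reference rows it matches (0 if none).
  val matchesPerKey: Map[String, Int] =
    batch.map { case (k, _) => k -> refCounts.getOrElse(k, 0) }.toMap

  // True if at least one key in the batch matches more than one reference row.
  val anyMultiMatch: Boolean = matchesPerKey.values.exists(_ > 1)
}
```

The if/else from the question would then branch on anyMultiMatch inside the per-batch processing.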

Scala String Similarity

I have Scala code that computes the similarity between a set of strings and returns all the unique strings.
val filtered = z.reverse.foldLeft((List.empty[String], z.reverse)) {
  case ((acc, zt), zz) =>
    if (zt.tail.exists(tt => similarity(tt, zz) < threshold)) (acc, zt.tail)
    else (zz :: acc, zt.tail)
}._1
I'll try to explain what is going on here:
This uses a fold over the reversed input data, starting from an empty list of Strings (to accumulate results) and the (reverse of the) remaining input data (to compare against - I labeled it zt for "z-tail").
The fold then cycles through the data, checking each entry against the tail of the remaining data (so it doesn't get compared to itself or any earlier entry)
If there is a match, just the existing accumulator (labelled acc) will be allowed through, otherwise, add the current entry (zz) to the accumulator. This updated accumulator is paired with the tail of the "remaining" Strings (zt.tail), to ensure a reducing set to compare against.
Finally, we end up with a pair of lists: the required remaining Strings, and an empty list (no Strings left to compare against), so we take the first of these as our result.
The problem is that if, for example, the 1st, 4th and 8th strings are similar, I get only the 1st string. Instead, I should get the set (1st, 4th, 8th); then, if the 2nd, 5th, 14th and 21st strings are similar, I should get the set (2nd, 5th, 14th, 21st).
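To illustrate the behaviour described, here is a small self-contained sketch of that fold. The similarity function, threshold and input list are the test values used in the answer further down this thread (strings count as similar when they start with the same character); the wrapper object FoldDedupSketch is just a made-up name for the example.

```scala
// Runnable illustration of the fold described above, on small test data.
object FoldDedupSketch {
  // Test similarity: 0 when both strings start with the same character.
  def similarity(s1: String, s2: String): Int = if (s1.head == s2.head) 0 else 100
  val threshold = 1
  val z = List("aa", "ab", "c", "a", "e", "fa", "fb")

  // Keeps an entry only if no earlier entry in z is similar to it.
  val filtered: List[String] =
    z.reverse.foldLeft((List.empty[String], z.reverse)) {
      case ((acc, zt), zz) =>
        if (zt.tail.exists(tt => similarity(tt, zz) < threshold)) (acc, zt.tail)
        else (zz :: acc, zt.tail)
    }._1
}
```

With this data, filtered comes out as List(aa, c, e, fa): one representative per group, with "ab", "a" and "fb" dropped rather than grouped.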
If I understand you correctly - you want the result to be of type List[List[String]] and not the List[String] you are getting now - where each item is a list of similar Strings (right?).
If so - I can't see a trivial change to your implementation that would achieve this, as the similar values are lost (when you enter the if(true) branch and just return the acc - you skip an item and you'll never "see" it again).
Two possible solutions I can think of:
Based on your idea, but using a 3-Tuple of the form (acc, zt, scanned) as the foldLeft result type, where the added scanned is the list of already-scanned items. This way we can refer back to them when we find an element that doesn't have preceding similar elements:
val filtered = z.reverse.foldLeft((List.empty[List[String]], z.reverse, List.empty[String])) {
  case ((acc, zt, scanned), zz) =>
    val hasSimilarPreceeding = zt.tail.exists { tt => similarity(tt, zz) < threshold }
    val similarFollowing = scanned.collect { case tt if similarity(tt, zz) < threshold => tt }
    (if (hasSimilarPreceeding) acc else (zz :: similarFollowing) :: acc, zt.tail, zz :: scanned)
}._1
A probably-slower but much simpler solution would be to just groupBy the group of similar strings:
val alternative = z.groupBy(s => z.collect {
case other if similarity(s, other) < threshold => other
}.toSet ).values.toList
All of this assumes that the function:
def f(a: String, b: String): Boolean = similarity(a, b) < threshold
is symmetric (commutative) and transitive, i.e.:
f(a, b) && f(a, c) means that f(b, c)
f(a, b) if and only if f(b, a)
To test both implementations I used:
// strings are similar if they start with the same character
def similarity(s1: String, s2: String) = if (s1.head == s2.head) 0 else 100
val threshold = 1
val z = List("aa", "ab", "c", "a", "e", "fa", "fb")
And both options produce the same groups (the groupBy version may return them in a different order, since it goes through an unordered Map):
List(List(aa, ab, a), List(c), List(e), List(fa, fb))

Get any consecutive sequence containing the given number of elements that matches a fixed condition from a list C#

I have a list defined as below inside a method
List<string> testList = new List<string>() {"A","A","B","A","A","A","B"};
Here my condition is fixed, say, that the element equals "A". Based on my input, for instance 2, the logic should identify two consecutive A's in the list and return them. In the above list it should return the first and second elements. If my input is 3, it should return the fourth, fifth and sixth elements, where three A's are consecutive. If it is 4, it should not return anything. Is there a simple way of implementing this in C#?
This is one way to do what you want:
List<string> testList = new List<string>() {"A","A","B","A","A","A","B"};
string inputText = <your input text>;
int inputCount = <your input count>;
// Enumerable.Range(start, count): count must be testList.Count so every element gets an index
var zipped = testList.Zip(Enumerable.Range(0, testList.Count), (txt, idx) => new {txt, idx});
var result = zipped
    .Where(combined => combined.idx <= testList.Count - inputCount)
    .Where(combined => combined.txt == inputText
        && testList.GetRange(combined.idx, inputCount).All(z => z == inputText));
We zip with the ordered range [0..list count - 1] to provide indexes for your elements. Then we check if current element in the collection equals the input text, and if the first n elements starting from current one all equal input text, where n is the input count.
We can then use a foreach loop to iterate through result like so:
foreach(var r in result)
{
testList.GetRange(r.idx,inputCount).ForEach(x => Console.WriteLine(x));
}
For each pattern that is found, we use GetRange to get the corresponding values.
Note that we are creating another IEnumerable object when we call Zip, and also calling GetRange within the Where clause, so I don't think this method is very efficient. Nevertheless, it will solve the issue.
