Comparing Lists of Strings in Scala - string

I know lists are immutable but I'm still confused on how I would go about this. I have two lists of strings - For example:
var list1: List[String] = List("M", "XW1", "HJ", "K")
var list2: List[String] = List("M", "XW4", "K", "YN")
I want to loop through these lists and see if the elements match. If it doesn't, the program would immediately return false. If it is a match, it will continue to iterate until it finds an element that begins with X. If it is indeed an X, I want to return true regardless of whether the number is the same or not.
Problem I'm having is that currently I have a conditional stating that if the two elements do not match, return false immediately. This is a problem because obviously XW1 and XW4 are not the same and it will return false. How can I bypass this and determine that it is a match to my eyes regardless of the number?
I also have a counter a two length variables to account for the fact the lists may be of differing length. My counter goes up to the shortest list: for (x <- 0 to (c-1)) (c being the counter).

You want to use zipAll & forall.
def compareLists(l1: List[String], l2: List[String]): Boolean =
l1.zipAll(l2, "", "").forall {
case (x, y) =>
(x == y) || (x.startsWith("X") && y.startsWith("X"))
}
Note that I am assuming an empty string will always be different than any other element.

If I understand your requirement correctly, to be considered a match, 1) each element in the same position of the two lists being simultaneously iterated must be the same except when both start with X (in which case it should return true without comparing any further), and 2) both lists must be of the same size.
If that's correct, I would recommend using a simple recursive function like below:
def compareLists(ls1: List[String], ls2: List[String]): Boolean = (ls1, ls2) match {
case (Nil, Nil) =>
true
case (h1 :: t1, h2 :: t2) =>
if (h1.startsWith("X") && h2.startsWith("X"))
true // short-circuiting
else
if (h1 != h2)
false
else
compareLists(t1, t2)
case _ =>
false
}

Based on your comment that, result should be true for lists given in question, you could do something like this:
val list1: List[String] = List("M", "XW1", "HJ", "K")
val list2: List[String] = List("M", "XW4", "K", "YN")
val (matched, unmatched) = list1.zipAll(list2, "", "").partition { case (x, y) => x == y }
val result = unmatched match {
case Nil => true
case (x, y) :: _ => (x.startsWith("X") && y.startsWith("X"))
}

You could also use cats foldM to iterate through the lists and terminate early if there is either (a) a mismatch, or (b) two elements that begin with 'X':
import cats.implicits._
val list1: List[String] = List("M", "XW1", "HJ", "K")
val list2: List[String] = List("M", "XW4", "K", "YN")
list1.zip(list2).foldM(()){
case (_, (s1, s2)) if s1 == s2 => ().asRight
case (_, (s1, s2)) if s1.startsWith("X") && s2.startsWith("X") => true.asLeft
case _ => false.asLeft
}.left.getOrElse(false)

Related

Scala String Similarity

I have a Scala code that computes similarity between a set of strings and give all the unique strings.
val filtered = z.reverse.foldLeft((List.empty[String],z.reverse)) {
case ((acc, zt), zz) =>
if (zt.tail.exists(tt => similarity(tt, zz) < threshold)) acc
else zz :: acc, zt.tail
}._1
I'll try to explain what is going on here :
This uses a fold over the reversed input data, starting from the empty String (to accumulate results) and the (reverse of the) remaining input data (to compare against - I labeled it zt for "z-tail").
The fold then cycles through the data, checking each entry against the tail of the remaining data (so it doesn't get compared to itself or any earlier entry)
If there is a match, just the existing accumulator (labelled acc) will be allowed through, otherwise, add the current entry (zz) to the accumulator. This updated accumulator is paired with the tail of the "remaining" Strings (zt.tail), to ensure a reducing set to compare against.
Finally, we end up with a pair of lists: the required remaining Strings, and an empty list (no Strings left to compare against), so we take the first of these as our result.
The problem is like in first iteration, if 1st, 4th and 8th strings are similar, I am getting only the 1st string. Instead of it, I should get a set of (1st,4th,8th), then if 2nd,5th,14th and 21st strings are similar, I should get a set of (2nd,5th,14th,21st).
If I understand you correctly - you want the result to be of type List[List[String]] and not the List[String] you are getting now - where each item is a list of similar Strings (right?).
If so - I can't see a trivial change to your implementation that would achieve this, as the similar values are lost (when you enter the if(true) branch and just return the acc - you skip an item and you'll never "see" it again).
Two possible solutions I can think of:
Based on your idea, but using a 3-Tuple of the form (acc, zt, scanned) as the foldLeft result type, where the added scanned is the list of already-scanned items. This way we can refer back to them when we find an element that doesn't have preceeding similar elements:
val filtered = z.reverse.foldLeft((List.empty[List[String]],z.reverse,List.empty[String])) {
case ((acc, zt, scanned), zz) =>
val hasSimilarPreceeding = zt.tail.exists { tt => similarity(tt, zz) < threshold }
val similarFollowing = scanned.collect { case tt if similarity(tt, zz) < threshold => tt }
(if (hasSimilarPreceeding) acc else (zz :: similarFollowing) :: acc, zt.tail, zz :: scanned)
}._1
A probably-slower but much simpler solution would be to just groupBy the group of similar strings:
val alternative = z.groupBy(s => z.collect {
case other if similarity(s, other) < threshold => other
}.toSet ).values.toList
All of this assumes that the function:
f(a: String, b: String): Boolean = similarity(a, b) < threshold
Is commutative and transitive, i.e.:
f(a, b) && f(a. c) means that f(b, c)
f(a, b) if and only if f(b, a)
To test both implementations I used:
// strings are similar if they start with the same character
def similarity(s1: String, s2: String) = if (s1.head == s2.head) 0 else 100
val threshold = 1
val z = List("aa", "ab", "c", "a", "e", "fa", "fb")
And both options produce the same results:
List(List(aa, ab, a), List(c), List(e), List(fa, fb))

Scala - modify strings in a list based on their number of occurences

Another Scala newbie question since I am not getting how to achieve this in a functional way (mostly coming from a scripting language background):
I have a list of strings:
val food-list = List("banana-name", "orange-name", "orange-num", "orange-name", "orange-num", "grape-name")
and where they are duplicated, I'd like to add an incrementing number into the string and get that in a list similar to the input list, like so:
List("banana-name", "orange1-name", "orange1-num", "orange2-name", "orange2-num", "grape-name")
I've grouped them up to get counts for them with:
val freqs = list.groupBy(identity).mapValues(v => List.range(1, v.length + 1))
Which gives me:
Map(orange-num -> List(1, 2), banana-name -> List(1), grape-name -> List(1), orange-name -> List(1, 2))
The order of the list is important (it should be in the original order of food-list) so I know it's problematic for me to use a Map at this point. The closest I feel I have gotten to a solution is:
food-list.map{l =>
if (freqs(l).length > 1){
freqs(l).map(n =>
l.split("-")(0) + n.toString + "-" + l.split("-")(1))
} else {
l
}
}
This of course gives me a wonky output since I am mapping the list of frequencies from the words value in freqs
List(banana-name, List(orange1-name, orange2-name), List(orange1-num, orange2-num), List(orange1-name, orange2-name), List(orange1-num, orange2-num), grape-name)
How is this done in a Scala fp way without resorting to clumsy for loops and counters?
If the indices are important, sometimes it's best to keep track of them explicitly using zipWithIndex (very similar to Python's enumerate):
food-list.zipWithIndex.groupBy(_._1).values.toList.flatMap{
//if only one entry in this group, don't change the values
//x is actually a tuple, could write case (str, idx) :: Nil => (str, idx) :: Nil
case x :: Nil => x :: Nil
//case where there are duplicate strings
case xs => xs.zipWithIndex.map {
//idx is index in the original list, n is index in the new list i.e. count
case ((str, idx), n) =>
//destructuring assignment, like python's (fruit, suffix) = ...
val Array(fruit, suffix) = str.split("-")
//string interpolation, returning a tuple
(s"$fruit${n+1}-$suffix", idx)
}
//We now have our list of (string, index) pairs;
//sort them and map to a list of just strings
}.sortBy(_._2).map(_._1)
Efficient and simple:
val food = List("banana-name", "orange-name", "orange-num",
"orange-name", "orange-num", "grape-name")
def replaceName(s: String, n: Int) = {
val tokens = s.split("-")
tokens(0) + n + "-" + tokens(1)
}
val indicesMap = scala.collection.mutable.HashMap.empty[String, Int]
val res = food.map { name =>
{
val n = indicesMap.getOrElse(name, 1)
indicesMap += (name -> (n + 1))
replaceName(name, n)
}
}
Here is an attempt to provide what you expected with foldLeft:
foodList.foldLeft((List[String](), Map[String, Int]()))//initial value
((a/*accumulator, list, map*/, v/*value from the list*/)=>
if (a._2.isDefinedAt(v))//already seen
(s"$v+${a._2(v)}" :: a._1, a._2.updated(v, a._2(v) + 1))
else
(v::a._1, a._2.updated(v, 1)))
._1/*select the list*/.reverse/*because we created in the opposite order*/

Detecting the index in a string that is not printable character with Scala

I have a method that detects the index in a string that is not printable as follows.
def isPrintable(v:Char) = v >= 0x20 && v <= 0x7E
val ba = List[Byte](33,33,0,0,0)
ba.zipWithIndex.filter { v => !isPrintable(v._1.toChar) } map {v => v._2}
> res115: List[Int] = List(2, 3, 4)
The first element of the result list is the index, but I wonder if there is a simpler way to do this.
If you want an Option[Int] of the first non-printable character (if one exists), you can do:
ba.zipWithIndex.collectFirst{
case (char, index) if (!isPrintable(char.toChar)) => index
}
> res4: Option[Int] = Some(2)
If you want all the indices like in your example, just use collect instead of collectFirst and you'll get back a List.
For getting only the first index that meets the given condition:
ba.indexWhere(v => !isPrintable(v.toChar))
(it returns -1 if nothing is found)
You can use directly regexp to found unprintable characters by unicode code points.
Resource: Regexp page
In such way you can directly filter your string with such pattern, for instance:
val text = "this is \n sparta\t\r\n!!!"
text.zipWithIndex.filter(_._1.matches("\\p{C}")).map(_._2)
> res3: Vector(8, 16, 17, 18)
As result you'll get Vector with indices of all unprintable characters in String. Check it out
If desired only the first occurrence of non printable char
Method span applied on a List delivers two sublists, the first where all the elements hold a condition, the second starts with an element that falsified the condition. In this case consider,
val (l,r) = ba.span(b => isPrintable(b.toChar))
l: List(33, 33)
r: List(0, 0, 0)
To get the index of the first non printable char,
l.size
res: Int = 2
If desired all the occurrences of non printable chars
Consider partition of a given List for a criteria. For instance, for
val ba2 = List[Byte](33,33,0,33,33)
val (l,r) = ba2.zipWithIndex.partition(b => isPrintable(b._1.toChar))
l: List((33,0), (33,1), (33,3), (33,4))
r: List((0,2))
where r includes tuples with non printable chars and their position in the original List.
I am not sure whether list of indexes or tuples is needed and I am not sure whether 'ba' needs to be an list of bytes or starts off as a string.
for { i <- 0 until ba.length if !isPrintable(ba(i).toChar) } yield i
here, because people need performance :)
def getNonPrintable(ba:List[Byte]):List[Int] = {
import scala.collection.mutable.ListBuffer
var buffer = ListBuffer[Int]()
#tailrec
def go(xs: List[Byte], cur: Int): ListBuffer[Int] = {
xs match {
case Nil => buffer
case y :: ys => {
if (!isPrintable(y.toChar)) buffer += cur
go(ys, cur + 1)
}
}
}
go(ba, 0)
buffer.toList
}

How to split a string given a list of positions in Scala

How would you write a funcitonal implementation for split(positions:List[Int], str:String):List[String], which is similar to splitAt but splits a given string into a list of strings by a given list of positions?
For example
split(List(1, 2), "abc") returns List("a", "b", "c")
split(List(1), "abc") returns List("a", "bc")
split(List(), "abc") returns List("abc")
def lsplit(pos: List[Int], str: String): List[String] = {
val (rest, result) = pos.foldRight((str, List[String]())) {
case (curr, (s, res)) =>
val (rest, split) = s.splitAt(curr)
(rest, split :: res)
}
rest :: result
}
Something like this:
def lsplit(pos: List[Int], s: String): List[String] = pos match {
case x :: rest => s.substring(0,x) :: lsplit(rest.map(_ - x), s.substring(x))
case Nil => List(s)
}
(Fair warning: not tail recursive so will blow the stack for large lists; not efficient due to repeated remapping of indices and chains of substrings. You can solve these things by adding additional arguments and/or an internal method that does the recursion.)
How about ....
def lSplit( indices : List[Int], s : String) = (indices zip (indices.tail)) map { case (a,b) => s.substring(a,b) }
scala> lSplit( List(0,4,6,8), "20131103")
List[String] = List(2013, 11, 03)

Merging through fuzzy matching of variables in R

I have two dataframes (x & y) where the IDs are student_name, father_name and mother_name. Because of typographical errors ("n" instead of "m", random white spaces, etc.), I have about 60% of values which are not aligning, though I can eyeball the data and see they should. Is there a way to reduce the level of non-match somehow so that manually editing because at least feasible? The dataframes are have about 700K observations.
R would be best. I know a little bit of python, and some basic unix tools. P.S. I read up on agrep(), but don't understand how that can work on actual datasets, especially when the match is over more than one variable.
update (data for posted bounty):
Here are two example data frames, sites_a and sites_b. They could be matched on the numeric columns lat and lon as well as on the sitename column. It would be useful to know how this could be done on a) just lat + lon, b) sitename or c) both.
you can source the file test_sites.R which is posted as a gist.
Ideally the answer would end with
merge(sites_a, sites_b, by = **magic**)
The agrep function (part of base R), which does approximate string matching using the Levenshtein edit distance is probably worth trying. Without knowing what your data looks like, I can't really suggest a working solution. But this is a suggestion... It records matches in a separate list (if there are multiple equally good matches, then these are recorded as well). Let's say that your data.frame is called df:
l <- vector('list',nrow(df))
matches <- list(mother = l,father = l)
for(i in 1:nrow(df)){
father_id <- with(df,which(student_name[i] == father_name))
if(length(father_id) == 1){
matches[['father']][[i]] <- father_id
} else {
old_father_id <- NULL
## try to find the total
for(m in 10:1){ ## m is the maximum distance
father_id <- with(df,agrep(student_name[i],father_name,max.dist = m))
if(length(father_id) == 1 || m == 1){
## if we find a unique match or if we are in our last round, then stop
matches[['father']][[i]] <- father_id
break
} else if(length(father_id) == 0 && length(old_father_id) > 0) {
## if we can't do better than multiple matches, then record them anyway
matches[['father']][[i]] <- old_father_id
break
} else if(length(father_id) == 0 && length(old_father_id) == 0) {
## if the nearest match is more than 10 different from the current pattern, then stop
break
}
}
}
}
The code for the mother_name would be basically the same. You could even put them together in a loop, but this example is just for the purpose of illustration.
This takes a list of common column names, matches based on agrep of all those columns combined, and then if all.x or all.y equals TRUE it appends non-matching records filling in missing columns with NA. Unlike merge, the column names to match on need to be the same in each data frame. The challenge would seem to be setting the agrep options correctly to avoid spurious matches.
agrepMerge <- function(df1, df2, by, all.x = FALSE, all.y = FALSE,
ignore.case = FALSE, value = FALSE, max.distance = 0.1, useBytes = FALSE) {
df1$index <- apply(df1[,by, drop = FALSE], 1, paste, sep = "", collapse = "")
df2$index <- apply(df2[,by, drop = FALSE], 1, paste, sep = "", collapse = "")
matches <- lapply(seq_along(df1$index), function(i, ...) {
agrep(df1$index[i], df2$index, ignore.case = ignore.case, value = value,
max.distance = max.distance, useBytes = useBytes)
})
df1_match <- rep(1:nrow(df1), sapply(matches, length))
df2_match <- unlist(matches)
df1_hits <- df1[df1_match,]
df2_hits <- df2[df2_match,]
df1_miss <- df1[setdiff(seq_along(df1$index), df1_match),]
df2_miss <- df2[setdiff(seq_along(df2$index), df2_match),]
remove_cols <- colnames(df2_hits) %in% colnames(df1_hits)
df_out <- cbind(df1_hits, df2_hits[,!remove_cols])
if(all.x) {
missing_cols <- setdiff(colnames(df_out), colnames(df1_miss))
df1_miss[missing_cols] <- NA
df_out <- rbind(df_out, df1_miss)
}
if(all.x) {
missing_cols <- setdiff(colnames(df_out), colnames(df2_miss))
df2_miss[missing_cols] <- NA
df_out <- rbind(df_out, df2_miss)
}
df_out[,setdiff(colnames(df_out), "index")]
}

Resources