How do I avoid an 'unbound variable' error when the query is unknown at compile time but known when the function is run? - ecto

I currently have a query going on the following journey:
# Context
def list_customers(params) do
  items =
    from(i in MyAppItems)
    |> MyAppItems.filter_by_params(params |> Enum.to_list())

  MyAppCustomer
  |> join(:left, [p], i in ^items, on: [customer_id: p.id])
  |> join(:left, [_, i], pr in assoc(i, :provider))
  |> join(:left, [_, i, _], t in assoc(i, :type))
  |> join(:left, [_, i, _, _], s in assoc(i, :status))
  |> join(:left, [_, i, _, _, _], a in assoc(i, :action))
  |> join(:left, [_, i, _, _, _, _], n in assoc(i, :note))
  |> preload([_, i, pr, t, s, a, n],
    items: {i, provider: pr, type: t, status: s, action: a, note: n}
  )
  |> group_by([p, _, _, _, _, _, _], p.id)
  |> Repo.all()
end
# MyAppItems
def filter_by_params(query, params) do
  Enum.reduce(params, query, fn
    {"list_date", list_date}, query ->
      filter_by_list_date(query, list_date)

    _, query ->
      query
  end)
end

defp filter_by_list_date(query, list_date) do
  {:ok, date} = Date.from_iso8601(list_date)

  query
  |> where(fragment("date(inserted_at) = ?", ^date))
end
As is, when this runs I get an ambiguous column warning with regard to inserted_at.
I tried to fix this by changing the fragment as follows:
|> where(fragment("date(?) = ?", i.inserted_at, ^date))
However, I can't shake the unbound variable errors surrounding i. I know that when the query runs, i will be bound in the query being passed to the fragment, but I can't get to that point because of the compile error.

You can alias your joins (with as:) and reference them later in the pipe chain.
For example:
MyAppCustomer
|> join(:left, [p], i in ^items, on: [customer_id: p.id], as: :item)
|> where([item: item], fragment("date(?) = ?", item.inserted_at, ^date))
Alternatively, since you know the positions of your hardcoded joins, you can keep using positional bindings the same way you were:
MyAppCustomer
|> join(:left, [p], i in ^items, on: [customer_id: p.id], as: :item)
|> where([_, i], fragment("date(?) = ?", i.inserted_at, ^date))

Related

Regular expression to stop at first match immediately

So this is a given string where you can see in some places there are missing values.
We have to fill those missing values in a certain specified way.
s = "_, _, 30, _, _, _, 50, _, _ "
My concern for the first bit of the problem is to extract the " _, _, 30 " part from the string (so that I can take it apart, modify it, and replace the modified bit in the original string). I tried to do it using:
import re
res = re.findall("_.*[0-9]",s)
print(res)
The output I am getting is:
_, _, 30, _, _, _, 50
whereas the desired result is:
_, _, 30
How can I do it using the re module?
Your problem comes from the fact that by default regex operators are greedy, which means they return the longest match. There are two ways to solve your problem:
(1) Move from the greedy to the non-greedy operator:
>>> re.findall("_.*?[0-9]+",s)
['_, _, 30', '_, _, _, 50']
(2) Replace "any" with non-numeric:
>>> re.findall(r"[^0-9]*[0-9]+", s)
['_, _, 30', ', _, _, _, 50']

Comparing Lists of Strings in Scala

I know lists are immutable but I'm still confused on how I would go about this. I have two lists of strings - For example:
var list1: List[String] = List("M", "XW1", "HJ", "K")
var list2: List[String] = List("M", "XW4", "K", "YN")
I want to loop through these lists and see if the elements match. If they don't, the program should immediately return false. If they do match, it will continue to iterate until it finds an element that begins with X. If it does begin with X, I want to return true regardless of whether the number is the same or not.
The problem I'm having is that I currently have a conditional stating that if the two elements do not match, return false immediately. This is a problem because obviously XW1 and XW4 are not the same, so it returns false. How can I bypass this and treat them as a match in my eyes regardless of the number?
I also have a counter and two length variables to account for the fact that the lists may be of differing lengths. My counter goes up to the shortest list: for (x <- 0 to (c-1)) (c being the counter).
You want to use zipAll & forall.
def compareLists(l1: List[String], l2: List[String]): Boolean =
  l1.zipAll(l2, "", "").forall {
    case (x, y) =>
      (x == y) || (x.startsWith("X") && y.startsWith("X"))
  }
Note that I am assuming an empty string will always be different from any other element.
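For the two lists in the question this would return false, because the pair ("HJ", "K") neither matches nor starts with "X":
compareLists(list1, list2) // false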
If I understand your requirement correctly, to be considered a match, 1) each element in the same position of the two lists being simultaneously iterated must be the same except when both start with X (in which case it should return true without comparing any further), and 2) both lists must be of the same size.
If that's correct, I would recommend using a simple recursive function like below:
def compareLists(ls1: List[String], ls2: List[String]): Boolean = (ls1, ls2) match {
  case (Nil, Nil) =>
    true
  case (h1 :: t1, h2 :: t2) =>
    if (h1.startsWith("X") && h2.startsWith("X"))
      true // short-circuiting
    else if (h1 != h2)
      false
    else
      compareLists(t1, t2)
  case _ =>
    false
}
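With the lists from the question this version returns true, since the recursion short-circuits at ("XW1", "XW4") before ever reaching the later mismatching pair:
compareLists(list1, list2) // true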
Based on your comment that the result should be true for the lists given in the question, you could do something like this:
val list1: List[String] = List("M", "XW1", "HJ", "K")
val list2: List[String] = List("M", "XW4", "K", "YN")
val (matched, unmatched) = list1.zipAll(list2, "", "").partition { case (x, y) => x == y }
val result = unmatched match {
  case Nil => true
  case (x, y) :: _ => x.startsWith("X") && y.startsWith("X")
}
You could also use cats foldM to iterate through the lists and terminate early if there is either (a) a mismatch, or (b) two elements that begin with 'X':
import cats.implicits._
val list1: List[String] = List("M", "XW1", "HJ", "K")
val list2: List[String] = List("M", "XW4", "K", "YN")
list1.zip(list2).foldM(()) {
  case (_, (s1, s2)) if s1 == s2 => ().asRight
  case (_, (s1, s2)) if s1.startsWith("X") && s2.startsWith("X") => true.asLeft
  case _ => false.asLeft
}.left.getOrElse(false)

flatMap() function returns RDD[Char] instead RDD[String]

I am trying to understand how map and flatMap work but got stuck on the piece of code below. The flatMap() function returns an RDD[Char], but I was expecting an RDD[String] instead.
Can someone explain why it yields an RDD[Char]?
scala> val inputRDD = sc.parallelize(Array(Array("This is Spark"), Array("It is a processing language"),Array("Very fast"),Array("Memory operations")))
scala> val mapRDD = inputRDD.map(x => x(0))
mapRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[28] at map at <console>:26
scala> mapRDD.collect
res27: Array[String] = Array(This is Spark, It is a processing language, Very fast, Memory operations)
scala> val mapRDD = inputRDD.flatMap(x => x(0))
mapRDD: org.apache.spark.rdd.RDD[Char] = MapPartitionsRDD[29] at flatMap at <console>:26
scala> mapRDD.collect
res28: Array[Char] = Array(T, h, i, s, , i, s, , S, p, a, r, k, I, t, , i, s, , a, , p, r, o, c, e, s, s, i, n, g, , l, a, n, g, u, a, g, e, V, e, r, y, , f, a, s, t, M, e, m, o, r, y, , o, p, e, r, a, t, i, o, n, s)
Take a look at this answer: https://stackoverflow.com/a/22510434/1547734
Basically flatMap transforms an RDD of N elements into (logically) an RDD of N collections, and then flattens it into an RDD of all the elements of those internal collections.
So when you do inputRDD.flatMap(x => x(0)), you convert each element into a string. A string is a collection of characters, so the "flattening" portion turns the entire RDD into an RDD of the resulting characters.
Since RDDs are based on Scala collections, the following http://www.brunton-spall.co.uk/post/2011/12/02/map-map-and-flatmap-in-scala/ might help in understanding it further.
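You can reproduce the same flattening on a plain Scala collection, without Spark (a minimal sketch of the behaviour described above):
Seq(Array("This is Spark"), Array("Very fast")).flatMap(x => x(0))
// each String is treated as a collection of Char, so this yields a Seq[Char]:
// List(T, h, i, s,  , i, s,  , S, p, a, r, k, V, e, r, y,  , f, a, s, t)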
The goal of flatMap is to convert a single item into multiple items (i.e. a one-to-many relationship). For example, for an RDD[Order], where each order is likely to have multiple items, I can use flatMap to get an RDD[Item] (rather than an RDD[Seq[Item]]).
In your case, a String is effectively a Seq[Char]. It therefore assumes that what you want to do is take that one string and break it up into its constituent characters.
Now, if what you want is to use flatMap to get all of the raw Strings in your RDD, your flatMap function should probably look like this: x => x.
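In other words, something along these lines should give you back the raw strings (a sketch reusing inputRDD from above; stringRDD is just an illustrative name):
val stringRDD = inputRDD.flatMap(x => x) // each Array[String] contributes its String elements
// stringRDD: org.apache.spark.rdd.RDD[String]
stringRDD.collect
// Array(This is Spark, It is a processing language, Very fast, Memory operations)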

Prevent more IO with multiple pipelines on the same RDD

E.g., if I run over the same RDD of numbers, one flow filters for the even numbers and averages them, and the other filters for the odd ones and sums them. If I write this as two pipelines over the same RDD, this will create two executions that scan the RDD twice, which can be expensive in terms of IO.
How can this IO be reduced to only read the data once, without rewriting the logic to be in one pipeline? A framework that takes two pipelines and merges them into one is OK, of course, as long as developers can continue to work on each pipeline independently (in the real case, these pipelines are loaded from separate modules).
The point is to achieve this without using cache().
Since your question is rather vague, let's think about general strategies that can be used to approach this problem.
A standard solution here would be caching, but since you explicitly want to avoid it, I assume there are some additional limitations. It suggests that similar solutions, like
- in-memory data storage (like Ignite, suggested by heenenee), or
- accelerated storage like Alluxio,
are not acceptable either. It means you have to find some way to manipulate the pipeline itself.
Although multiple transformations can be squashed together, every transformation creates a new RDD. This, combined with your statement about caching, sets relatively strong constraints on possible solutions.
Let's start with the simplest possible case where all pipelines can be expressed as single-stage jobs. This restricts our choices to map-only jobs and simple map-reduce jobs (like the one described in your question). Pipelines like this can easily be expressed as a sequence of operations on local iterators. So the following
import org.apache.spark.util.StatCounter

def isEven(x: Long) = x % 2 == 0
def isOdd(x: Long) = !isEven(x)

def p1(rdd: RDD[Long]) = {
  rdd
    .filter(isEven _)
    .aggregate(StatCounter())(_ merge _, _ merge _)
    .mean
}

def p2(rdd: RDD[Long]) = {
  rdd
    .filter(isOdd _)
    .reduce(_ + _)
}
could be expressed as:
def p1(rdd: RDD[Long]) = {
  rdd
    .mapPartitions(iter =>
      Iterator(iter.filter(isEven _).foldLeft(StatCounter())(_ merge _)))
    .collect
    .reduce(_ merge _)
    .mean
}

def p2(rdd: RDD[Long]) = {
  rdd
    .mapPartitions(iter =>
      Iterator(iter.filter(isOdd _).foldLeft(0L)(_ + _)))
    .collect
    .reduce(_ + _)
    // identity _
}
At this point we can rewrite separate jobs as follows:
def mapPartitions2[T, U, V](rdd: RDD[T])(f: Iterator[T] => U, g: Iterator[T] => V) = {
  rdd.mapPartitions(iter => {
    val items = iter.toList
    Iterator((f(items.iterator), g(items.iterator)))
  })
}

def reduceLocally2[U, V](rdd: RDD[(U, V)])(f: (U, U) => U, g: (V, V) => V) = {
  rdd.collect.reduce((x, y) => (f(x._1, y._1), g(x._2, y._2)))
}

def evaluate[U, V, X, Z](pair: (U, V))(f: U => X, g: V => Z) = (f(pair._1), g(pair._2))

val rdd = sc.range(0L, 100L)

def f(iter: Iterator[Long]) = iter.filter(isEven _).foldLeft(StatCounter())(_ merge _)
def g(iter: Iterator[Long]) = iter.filter(isOdd _).foldLeft(0L)(_ + _)

evaluate(reduceLocally2(mapPartitions2(rdd)(f, g))(_ merge _, _ + _))(_.mean, identity)
The biggest issue here is that we have to eagerly evaluate each partition to be able to apply individual pipelines. It means that overall memory requirements can be significantly higher compared to the same logic applied separately. Without caching* it is also useless in case of multistage jobs.
An alternative solution is to process data element-wise but treat each item as a tuple of seqs:
def map2[T, U, V, X](rdd: RDD[(Seq[T], Seq[U])])(f: T => V, g: U => X) = {
  rdd.map { case (ts, us) => (ts.map(f), us.map(g)) }
}

def filter2[T, U](rdd: RDD[(Seq[T], Seq[U])])(
    f: T => Boolean, g: U => Boolean) = {
  rdd.map { case (ts, us) => (ts.filter(f), us.filter(g)) }
}

def aggregate2[T, U, V, X](rdd: RDD[(Seq[T], Seq[U])])(zt: V, zu: X)
    (s1: (V, T) => V, s2: (X, U) => X, m1: (V, V) => V, m2: (X, X) => X) = {
  rdd.mapPartitions(iter => {
    var accT = zt
    var accU = zu
    iter.foreach { case (ts, us) => {
      accT = ts.foldLeft(accT)(s1)
      accU = us.foldLeft(accU)(s2)
    }}
    Iterator((accT, accU))
  }).reduce { case ((v1, x1), (v2, x2)) => ((m1(v1, v2), m2(x1, x2))) }
}
With API like this we can express initial pipelines as:
val rddSeq = rdd.map(x => (Seq(x), Seq(x)))

aggregate2(filter2(rddSeq)(isEven, isOdd))(StatCounter(), 0L)(
  _ merge _, _ + _, _ merge _, _ + _
)
This approach is slightly more powerful than the former one (you can easily implement some subset of byKey methods if needed), and memory requirements in typical pipelines should be comparable to the core API, but it is also significantly more intrusive.
* You can check an answer provided by eje for multiplexing examples.

How to split a string given a list of positions in Scala

How would you write a functional implementation for split(positions: List[Int], str: String): List[String], which is similar to splitAt but splits a given string into a list of strings by a given list of positions?
For example
split(List(1, 2), "abc") returns List("a", "b", "c")
split(List(1), "abc") returns List("a", "bc")
split(List(), "abc") returns List("abc")
def lsplit(pos: List[Int], str: String): List[String] = {
  val (rest, result) = pos.foldRight((str, List[String]())) {
    case (curr, (s, res)) =>
      val (rest, split) = s.splitAt(curr)
      (rest, split :: res)
  }
  rest :: result
}
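A quick check against the examples from the question:
lsplit(List(1, 2), "abc") // List("a", "b", "c")
lsplit(List(1), "abc")    // List("a", "bc")
lsplit(List(), "abc")     // List("abc")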
Something like this:
def lsplit(pos: List[Int], s: String): List[String] = pos match {
  case x :: rest => s.substring(0, x) :: lsplit(rest.map(_ - x), s.substring(x))
  case Nil => List(s)
}
(Fair warning: not tail recursive so will blow the stack for large lists; not efficient due to repeated remapping of indices and chains of substrings. You can solve these things by adding additional arguments and/or an internal method that does the recursion.)
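For instance, a tail-recursive sketch along those lines, using an inner helper that keeps the positions absolute so no index remapping is needed:
def lsplit(pos: List[Int], str: String): List[String] = {
  @annotation.tailrec
  def loop(rest: List[Int], offset: Int, acc: List[String]): List[String] = rest match {
    case x :: xs => loop(xs, x, str.substring(offset, x) :: acc) // cut from offset up to the next position
    case Nil     => (str.substring(offset) :: acc).reverse       // keep whatever remains after the last position
  }
  loop(pos, 0, Nil)
}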
How about ....
def lSplit(indices: List[Int], s: String) = (indices zip indices.tail) map { case (a, b) => s.substring(a, b) }
scala> lSplit( List(0,4,6,8), "20131103")
List[String] = List(2013, 11, 03)
