Spark HOF: reduce not identified - apache-spark

While experimenting with Spark's higher-order functions, I see transform, filter, exists, etc. working fine, but reduce throws an error.
The code is:
spark.sql("SELECT" +
  " celsius," +
  " reduce(celsius, 0, (t, acc) -> t + acc, acc -> (acc div size(celsius) * 9 div 5) + 32) as avgFahrenheit" +
  " FROM celsiusView")
Error:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Undefined function: 'reduce'. This function is neither a registered
temporary function nor a permanent function registered in the database
'default'.;
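As far as I know, reduce only became available in Spark SQL as an alias of aggregate in Spark 3.0; on Spark 2.4 the same query should work if you rename the call to aggregate. The accumulate-then-finish shape of the two lambdas can be sanity-checked outside Spark (the sample temperatures below are my own assumption, and the integer division mirrors the SQL div operator):

```python
from functools import reduce

# Mirrors aggregate(celsius, 0, (t, acc) -> t + acc, finish) from the query,
# with // standing in for SQL's integer `div` (sample data assumed).
celsius = [35, 36, 32, 30, 40]
acc = reduce(lambda acc, t: t + acc, celsius, 0)    # merge lambda
avg_fahrenheit = acc // len(celsius) * 9 // 5 + 32  # finish lambda
print(avg_fahrenheit)  # 93
```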

Related

How can I reproduce the indeterminacy exception in Spark?

Here's the indeterminacy I'm looking to reproduce: ERROR org.apache.spark.deploy.yarn.Client: Application diagnostics message: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: A shuffle map stage with indeterminate output was failed and retried. However, Spark cannot rollback the ShuffleMapStage 401 to re-process the input data, and has to fail this job. Please eliminate the indeterminacy by checkpointing the RDD before repartition and try again.
As I understand it, this happens because part of the logic is non-deterministic (e.g. using LocalTime.now() or scala.util.Random). I wrote a piece of code that was retried and confirmed to be non-deterministic, but it still didn't fail with the indeterminacy exception. In fact, it succeeded.
import spark.implicits._

val data = spark
  .range(0, 100, 1)
  .map(identity)
  .repartition(7)
  .map { x =>
    if ((x % 10) == 0) {
      Thread.sleep(1000)
    }
    val result = Entry("key" + x, x + RichDate.now.timestamp)
    println(x + " " + result)
    result
  }
  .map { item =>
    if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId > 0 && TaskContext.get.partitionId() < 3) {
      println(
        "Throw " + TaskContext.get.attemptNumber() + " " + TaskContext.get.partitionId() + " " + TaskContext.get.stageAttemptNumber()
      )
      throw new Exception("pkill -f -n java".!!)
    }
    Output(item.toString)
  }
// write to S3
The println(x + " " + result) output confirms that this specific part of the code is retried and produces a different result on the second attempt.
What did I do wrong? And does anyone have sample code that can reproduce the indeterminacy exception?

Problem rewriting scala method post upgrade to scala 2.12 (from 2.11)

I have the following small but complex method in my project:
private def indent(s: String) =
  s.lines.toStream match {
    case h +: t =>
      (("- " + h) +: t.map { "| " + _ }) mkString "\n"
    case _ => "- "
  }
After upgrading my project from Scala 2.11 to 2.12, it would no longer compile. Error:
CaseClassString.scala:14: value toStream is not a member of java.util.stream.Stream[String]
I tried rewriting it like this:
private def indent(s: String) =
  Stream(s.lines) match {
    case h +: t =>
      (("- " + h) +: t.map { "| " + _ }) mkString "\n"
    case _ => "- "
  }
But it does not work: Stream(s.lines) just wraps the whole java.util.stream.Stream in a one-element Scala Stream rather than converting its contents.
This method was found in the following project:
https://github.com/nikita-volkov/sext
The function would transform a string like:
metricResult: column: value: city
function: density
value: metricValue: 0.1
to:
- metricResult: column: value: city
| - function: density
| - value: metricValue: 0.1
Anyone have other ideas about how to rewrite this method for Scala 2.12?
I suspect that you've also upgraded your JVM from Java 8 to Java 11 (or are running code written for Java 8 on Java 11). Java 11 added a lines method to String which returns a java.util.stream.Stream[String]. On Java 8 there is no lines method on String, which means the Scala compiler can implicitly convert the String to scala.collection.immutable.StringOps, whose lines method returns an Iterator[String].
The trick here is to be explicit that you want to use StringOps.lines instead of String.lines, so something like
val lines = (s: scala.collection.immutable.StringOps).lines
lines.toStream match {
  // etc.
}
It seems that you not only upgraded Scala but also your Java version. The error is easy to understand if you look at the changes to String in both Java and Scala.
import scala.collection.immutable.StringOps

def indent(s: String): String =
  (s: StringOps).lines.toStream match {
    case h +: t =>
      (("- " + h) +: t.map { "| " + _ }) mkString "\n"
    case _ => "- "
  }
Or, if you are working with Java 11 and don't really need a Stream:

def indent(s: String): String =
  s.lines.toArray.toList match {
    case h :: t =>
      val indented = ("- " + h) :: t.map { "| " + _ }
      indented.mkString("\n")
    case _ => "- "
  }
Very similar to Andrey's solution, but working:
def indent(s: String) = s.linesIterator.mkString("- ", "\n| - ", "")
The code can be run at Scastie.
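For a quick standalone check of what the one-liner does, here is the same transformation ported to Python (my own port, not from the post): mkString("- ", "\n| - ", "") is just a join with a prefix.

```python
def indent(s: str) -> str:
    # Scala: s.linesIterator.mkString("- ", "\n| - ", "")
    return "- " + "\n| - ".join(s.splitlines())

print(indent("function: density\nvalue: metricValue: 0.1"))
# - function: density
# | - value: metricValue: 0.1
```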

How spark reduce works here

How does spark reduce work for this example?
val num = sc.parallelize(List(1,2,3))
val result = num.reduce((x, y) => x + y)
res: Int = 6
val result = num.reduce((x, y) => x + (y * 10))
res: Int = 321
I understand the 1st result (1 + 2 + 3 = 6). For the 2nd result, I thought the result would be 60 but it's not. Can someone explain?
That would be the result of a sequential left fold with a zero seed:
Step 1: 0 + (1 * 10) = 10
Step 2: 10 + (2 * 10) = 30
Step 3: 30 + (3 * 10) = 60
But RDD.reduce takes no seed value and guarantees no particular grouping.
Update:
As per Spark documentation:
The function should be commutative and associative so that it can be
computed correctly in parallel.
https://spark.apache.org/docs/latest/rdd-programming-guide.html
The partial results here happened to be merged in a different order:
(2, 3)  -> 2 + 3 * 10 = 32
(1, 32) -> 1 + 32 * 10 = 321
A reducer (in general, not just in Spark) takes a pair of elements, applies the reduce function, then combines the result with another element, and so on until all elements have been consumed. The order is implementation-specific (and effectively random when run in parallel), but as a rule it should not affect the end result; that is exactly why the function must be commutative and associative.
Check also this https://stackoverflow.com/a/31660532/290036
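The grouping sensitivity is easy to reproduce without Spark at all. Since x + y * 10 is neither commutative nor associative, the merge order alone changes the answer (plain Python sketch of the two groupings):

```python
from functools import reduce

f = lambda x, y: x + y * 10

sequential = reduce(f, [1, 2, 3])  # ((1,2),3): f(1,2)=21, f(21,3)=51
partition_merge = f(1, f(2, 3))    # (1,(2,3)): f(2,3)=32, f(1,32)=321

print(sequential)       # 51
print(partition_merge)  # 321
```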

Python Bit Summation Algorithm

I am trying to implement a function that will be used to judge whether a generator's output is continuous. The approach I am gravitating towards is to iterate through the generator: for each value, I right-justify the bits of the value (disregarding the 0b prefix), count the number of ones, and accumulate the counts.
#!/usr/bin/python3
from typing import Tuple

def find_bit_sum(top: int, pad_length: int) -> int:
    """."""
    return pad_length * (top + 1)

def find_pad_length(top: int) -> int:
    """."""
    return len(bin(top)) - 2  # -"0b"

def guess_certain(top: int, pad_length: int) -> Tuple[int, int, int]:
    """."""
    _both: int = find_bit_sum(top, pad_length)
    _ones: int = sum(sum(int(_i_in) for _i_in in bin(_i_out)[2:]) for _i_out in range(1, top + 1))
    return _both - _ones, _ones, _both  # zeros, ones, sum

def guess(top: int, pad_length: int) -> Tuple[int, int, int]:  # zeros then ones then sum
    """."""
    _bit_sum: int = find_bit_sum(top, pad_length)  # number of bits in total
    _zeros: int = _bit_sum  # ones are deducted
    _ones: int = 0  # _bit_sum - _zeros
    # detect ones
    for _indexed in range(pad_length):
        _ones_found: int = int(top // (2 ** (_indexed + 1)))  # HELP!!!
        _zeros -= _ones_found
        _ones += _ones_found
    #
    return _zeros, _ones, _bit_sum

def test_the_guess(max_value: int) -> bool:  # the range is int [0, max_value + 1)
    pad: int = find_pad_length(max_value)
    _zeros0, _ones0, _total0 = guess_certain(max_value, pad)
    _zeros1, _ones1, _total1 = guess(max_value, pad)
    return all((
        _zeros0 == _zeros1,
        _ones0 == _ones1,
        _total0 == _total1
    ))

if __name__ == '__main__':  # should produce a lot of True
    for x in range(3000):
        print(test_the_guess(x))
For the life of me, I cannot make guess() agree with guess_certain(). The time complexity of guess_certain() is my problem: it works for small ranges [0, top], but forget about 256-bit numbers (tops). The find_bit_sum() function works perfectly, and so does find_pad_length(). The problem is this estimate inside guess():
top // (2 ** (_indexed + 1))
I've tried 40 or 50 variations of the guess() function. It has thoroughly frustrated me. The guess() function is probabilistic. In its finished state: if it returns False, then the Generator definitely isn't producing every value in range(top + 1); however, if it returns True, then the Generator could be. We already know that the generator range(top + 1) is continuous because it does produce each number between 0 and top inclusively; so, test_the_guess() should be returning True.
I sincerely apologise for the chaotic explanation. If you have any questions, please don't hesitate to ask.
I adjusted your _ones_found assignment to account for the number of complete power-of-two cycles counted by int(top // (2 ** (_indexed + 1))), as well as the additional "rollover" ones that occur before the next power of two. Here is the resulting statement:
_ones_found: int = int(top // (2 ** (_indexed + 1))) * (2 ** (_indexed)) + max(0, (top % (2 ** (_indexed + 1))) - (2 ** _indexed) + 1)
I also took the liberty of converting the statement to bitwise operators for both clarity and speed, as shown below:
_ones_found: int = ((top >> _indexed + 1) << _indexed) + max(0, (top & (1 << _indexed + 1) - 1) - (1 << _indexed) + 1)
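The corrected statement can be cross-checked against a brute-force popcount over the whole range (this harness is my own, not from the post):

```python
def ones_at_bit(top: int, i: int) -> int:
    # Full cycles of bit i within 1..top, plus the partial "rollover" run
    # just before the next power of two (the answer's bitwise form).
    return ((top >> i + 1) << i) + max(0, (top & (1 << i + 1) - 1) - (1 << i) + 1)

def brute_force_ones(top: int) -> int:
    # Ground truth: count set bits of every value in [1, top].
    return sum(bin(n).count("1") for n in range(1, top + 1))

for top in range(1, 1000):
    pad = len(bin(top)) - 2
    assert sum(ones_at_bit(top, i) for i in range(pad)) == brute_force_ones(top)
print("all good")
```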

Overloaded method call has alternatives: String.format

I wrote the Scala code below to handle a String I pass in: it formats the String, appends it to a StringBuilder, and returns the formatted String with escaped Unicode back to the caller for further processing.
The Scala compiler complains on each line with a String.format call, with the following error:
overloaded method value format with alternatives:
  (x$1: java.util.Locale, x$2: String, x$3: Object*)String <and>
  (x$1: String, x$2: Object*)String
cannot be applied to (String, Int)
class TestClass {
  private def escapeUnicodeStuff(input: String): String = {
    //type StringBuilder = scala.collection.mutable.StringBuilder
    val sb = new StringBuilder()
    val cPtArray = toCodePointArray(input) // this method call returns an Array[Int]
    val len = cPtArray.length
    for (i <- 0 until len) {
      if (cPtArray(i) > 65535) {
        val hi = (cPtArray(i) - 0x10000) / 0x400 + 0xD800
        val lo = (cPtArray(i) - 0x10000) % 0x400 + 0xDC00
        sb.append(String.format("\\u%04x\\u%04x", hi, lo)) // **complains here**
      } else if (cPtArray(i) > 127) {
        sb.append(String.format("\\u%04x", cPtArray(i))) // **complains here**
      } else {
        sb.append(String.format("%c", cPtArray(i))) // **complains here**
      }
    }
    sb.toString
  }
}
How do I address this problem, and how can I clean up the code to accomplish my purpose of formatting a String? Thanks in advance to the Scala experts here.
The String.format method in Java expects Objects as its arguments. The Object type in Java is equivalent to the AnyRef type in Scala. The primitive types in Scala extend AnyVal – not AnyRef. Read more about the differences between AnyVal, AnyRef, and Any in the docs or in this answer. The most obvious fix is to use the Integer wrapper class from Java to get an Object representation of your Ints:
String.format("\\u%04x\\u%04x", new Integer(hi), new Integer(lo))
Using those wrapper classes is almost emblematic of unidiomatic Scala code; they should only be used for interoperability with Java when there is no better option. The more natural way to do this in Scala is to use the equivalent format method from StringOps:
"\\u%04x\\u%04x".format(hi, lo)
You can also use the f interpolator for a more concise syntax:
f"\\u$hi%04x\\u$lo%04x"
Also, using a for loop like you have here is unidiomatic in Scala. You're better off using one of the functional list methods like map, foldLeft, or even foreach together with a partial function using the match syntax. For example, you might try something like:
toCodePointArray(input).foreach {
  case x if x > 65535 =>
    val hi = (x - 0x10000) / 0x400 + 0xD800
    val lo = (x - 0x10000) % 0x400 + 0xDC00
    sb.append(f"\\u$hi%04x\\u$lo%04x")
  case x if x > 127 => sb.append(f"\\u$x%04x")
  case x => sb.append(f"$x%c")
}
Or, if you don't have to use StringBuilder, which really only needs to be used in cases where you are appending many strings, you can replace your whole method body with foldLeft:
def escapeUnicodeStuff(input: String) = toCodePointArray(input).foldLeft("") {
  case (acc, x) if x > 65535 =>
    val hi = (x - 0x10000) / 0x400 + 0xD800
    val lo = (x - 0x10000) % 0x400 + 0xDC00
    acc + f"\\u$hi%04x\\u$lo%04x"
  case (acc, x) if x > 127 => acc + f"\\u$x%04x"
  case (acc, x) => acc + f"$x%c"
}
Or even a map followed by mkString:
def escapeUnicodeStuff(input: String) = toCodePointArray(input).map {
  case x if x > 65535 =>
    val hi = (x - 0x10000) / 0x400 + 0xD800
    val lo = (x - 0x10000) % 0x400 + 0xDC00
    f"\\u$hi%04x\\u$lo%04x"
  case x if x > 127 => f"\\u$x%04x"
  case x => f"$x%c"
}.mkString
I couldn't figure out exactly what causes the overload clash, but note the code below:
scala> "\\u%04x\\u%04x".format(10, 20)
res12: String = \u000a\u0014
Using the format method provided by StringOps works.
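As an aside, the surrogate-pair arithmetic in the method is language-independent and easy to verify on its own; here is a Python check (the example code point is my own) that the hi/lo math produces the expected UTF-16 pair:

```python
# Split a supplementary code point into high/low surrogates, then
# format as \uXXXX escapes, exactly as the Scala method does.
cp = 0x1F600  # assumed example above 65535 (a common emoji code point)
hi = (cp - 0x10000) // 0x400 + 0xD800
lo = (cp - 0x10000) % 0x400 + 0xDC00
escaped = "\\u%04x\\u%04x" % (hi, lo)
print(escaped)  # \ud83d\ude00
```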
