Supplying context for Spark UDF execution - apache-spark

I am working in the Scala programming language. I want to hash an entire DataFrame column with SHA-2 and a salt. I have implemented the following UDF, which should take a MessageDigest and the input string to be hashed.
val md = MessageDigest.getInstance("SHA-256")
val random = new SecureRandom();
val salt: Array[Byte] = new Array[Byte](16)
random.nextBytes(salt)
md.update(salt)
dataFrame.withColumn("ColumnName", Sqlfunc(md, col("ColumnName")))
....some other code....
val HashValue: ((MessageDigest, String) => String) = (md: MessageDigest, input: String) => {
  val hashedPassword: Array[Byte] = md.digest(input.getBytes(StandardCharsets.UTF_8))
  val sb: StringBuilder = new StringBuilder
  for (b <- hashedPassword) { sb.append(String.format("%02x", Byte.box(b))) }
  sb.toString()
}
val Sqlfunc = udf(HashValue)
However, the above code does not work because I don't know how to pass the MessageDigest to this function, so I am running into the following error:
<<< ERROR!
java.lang.ClassCastException: com...................$$anonfun$9 cannot be cast to scala.Function1
Can someone tell me what I am doing wrong?
Also, I am a novice at cryptography, so feel free to suggest anything. We have to use SHA-2 and a salt.
What do you think about the performance here?
Thanks

The MessageDigest is not in your data. It's just context for the UDF evaluation. This type of context is provided via closures.
There are many ways to achieve the desired effect. The following is a useful pattern that uses function currying:
object X extends Serializable {
  import org.apache.spark.sql.expressions.UserDefinedFunction
  import org.apache.spark.sql.functions.udf

  def foo(context: String)(arg1: Int, arg2: Int): String =
    context.slice(arg1, arg2)

  def udfFoo(context: String): UserDefinedFunction =
    udf(foo(context) _)
}
Trying it out:
spark.range(1).toDF
  .select(X.udfFoo("Hello, there!")('id, 'id + 5))
  .show(false)
generates
+-----------------+
|UDF(id, (id + 5))|
+-----------------+
|Hello            |
+-----------------+
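Applied to the original question, the same currying pattern can be used by passing the salt (plain bytes) rather than the MessageDigest itself. The sketch below is not from the answer above and is untested; names like Hashing and udfSha256 are illustrative:

import java.nio.charset.StandardCharsets
import java.security.{MessageDigest, SecureRandom}
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}

object Hashing extends Serializable {
  // Curry on the salt: raw bytes serialize cleanly to the executors,
  // while MessageDigest is stateful and not Serializable, so it is
  // created inside the function instead of being captured.
  def sha256WithSalt(salt: Array[Byte])(input: String): String = {
    val md = MessageDigest.getInstance("SHA-256")
    md.update(salt)
    md.digest(input.getBytes(StandardCharsets.UTF_8)).map(b => f"$b%02x").mkString
  }

  def udfSha256(salt: Array[Byte]): UserDefinedFunction =
    udf(sha256WithSalt(salt) _)
}

// Usage (column name is illustrative):
// val salt = new Array[Byte](16)
// new SecureRandom().nextBytes(salt)
// dataFrame.withColumn("ColumnName", Hashing.udfSha256(salt)(col("ColumnName")))

Creating the digest inside the function also avoids sharing one mutable MessageDigest across rows and tasks.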

Related

How to format a string as a phone number in Kotlin

I get a string containing a phone number as input. This string looks like this:
79998886666
But I need to convert it to this form:
+7 999 888-66-66
I tried several ways; the most recent one looks like this:
private fun formatPhone(phone: String): String {
    val prefix = "+"
    val countryCode = phone.first()
    val regionCode = phone.dropLast(7).drop(1)
    val firstSub = phone.dropLast(4).drop(4)
    val secondSub = phone.dropLast(2).drop(7)
    val thirdSub = phone.drop(9)
    return "$prefix$countryCode $regionCode $firstSub-$secondSub-$thirdSub"
}
But this method seems rather clumsy to me and does not look very efficient.
How can this problem be solved differently?
You could use a regex replacement here:
val regex = """(\d)(\d{3})(\d{3})(\d{2})(\d{2})""".toRegex()
val number = "79998886666"
val output = regex.replace(number, "+$1 $2 $3-$4-$5")
println(output) // +7 999 888-66-66
You could create a helper function that returns chunks of the String at a time:
private fun formatPhone(phone: String): String {
    fun CharIterator.next(count: Int) = buildString { repeat(count) { append(next()) } }
    return with(phone.iterator()) {
        "+${next(1)} ${next(3)} ${next(3)}-${next(2)}-${next(2)}"
    }
}
Your original code could be simplified and made more performant by using substring instead of drop/dropLast.

SML how to change String after isSubstring comparison

I am currently learning SML/NJ because of a program that mostly uses a GUI for basic input but takes SML input for advanced options.
I want to compare whether one string is a substring of another.
If the condition is true then the full string should "just" be returned or assigned to a new variable.
For testing purposes I used an online compiler, because I get almost zero feedback from the other program.
Relevant Code Snippet:
fun SString(sub:string, str:string):string =
if isSubstring(sub, str) = TRUE then str
(* str should be returned , no errors*)
else val p2:string="nope";
(* no return or adjustable(fixed)return /without data*)
val p1 = "sender,time,data"
val p2 = "sender"
print(SString(p2,p1))
So far I am stuck.
My main questions are:
Can I actually create a new variable in a function?
What is the best practice in this case?
In some online docs I read that it isn't possible to assign a new value to a variable once assigned.
Should my function rather have the following form, with inner bindings (let decl in expr end)?
fun newstr:string(sub:string,str:string) =
let val n = isSubstring(sub,str)
in
end *sub
Thanks in advance
Relevant results of the compilation:
Standard ML of New Jersey v110.78 [built: Thu Aug 31 03:45:42 2017]
- stdIn:4.33-4.44 Error: syntax error: deleting ELSE VAL ID
- stdIn:4.52-4.61 Error: syntax error: deleting EQUALOP STRING SEMICOLON
P.S. I added fitting tags, feel free to remove/adjust them
A Gentle Introduction to ML gave me a great understanding of SML, except for some cryptic code and type conversions.
val p1:string = "weather,maps,translate";
val p2:string = "maps";

fun SString(sub:string, str:string) =
  let in
    case (String.isSubstring sub str) of
        (true) => str ^ ",pie"
      | (_)    => "nope"
  end;

val a = SString(p2,p1);
val b = SString(p1,p2);
The result of executing the above code:
Standard ML of New Jersey v110.78 [built: Thu Aug 31 03:45:42 2017]
- val p1 = "weather,maps,translate" : string
val p2 = "maps" : string
[autoloading]
[library $SMLNJ-BASIS/basis.cm is stable]
[autoloading done]
val SString = fn : string * string -> string
val a = "weather,maps,translate,pie" : string
val b = "nope" : string

Trying to create a Map[String, String] from lines in a text file, keep getting errors [duplicate]

This question already has an answer here:
Most efficient way to create a Scala Map from a file of strings?
(1 answer)
Closed 4 years ago.
Hi, so I'm trying to create a Map[String, String] from a text file. The file contains arbitrary lines that begin with ";;;", which the function should ignore; the remaining lines are the key -> value pairs, separated by two spaces.
Whenever I run my code I get an error saying the result doesn't match the required type Map[String,String], even though my conversions seem correct.
def createMap(filename: String): Map[String,String] = {
  for (line <- Source.fromFile(filename).getLines) {
    if (line.nonEmpty && !line.startsWith(";;;")) {
      val string: String = line.toString
      val splits: Array[String] = string.split(" ")
      splits.map(arr => arr(0) -> arr(1)).toMap
    }
  }
}
I expect it to return a Map[String, String], but instead I get a bunch of errors. How would I fix this?
Your if statement sits inside a for-loop that is a statement, not an expression, so nothing is returned. Use the if as a filter (guard) in a for-comprehension and yield your results; the comprehension then produces the key/value pairs, which you can convert to a Map.
import scala.io.Source
def createMap(filename: String): Map[String,String] = {
  val keyValuePairs =
    for (line <- Source.fromFile(filename).getLines; if line.nonEmpty && !line.startsWith(";;;")) yield {
      val string = line.toString
      val splits: Array[String] = string.split(" ")
      splits(0) -> splits(1)
    }
  keyValuePairs.toMap
}
Okay, so I took a second look. It looks like the file has some corrupt encodings. You can try the following as a solution; it worked in my Scala REPL:
import java.nio.charset.CodingErrorAction
import scala.io.{Codec, Source}
def createMap(filename: String): Map[String,String] = {
  val decoder = Codec.UTF8.decoder.onMalformedInput(CodingErrorAction.IGNORE)
  Source.fromFile(filename)(decoder).getLines()
    .filter(line => line.nonEmpty && !line.startsWith(";;;"))
    .flatMap(line => {
      val arr = line.split("\\s+")
      arr match {
        case Array(key, value)       => Some(key -> value)
        case Array(key, values @ _*) => Some(key -> values.mkString(" "))
        case _                       => None
      }
    }).toMap
}
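For reference, a call might look like the following; the file path and contents are hypothetical:

// /tmp/pairs.txt (hypothetical contents):
//   ;;; this comment line is ignored
//   alpha  first
//   beta  second value
val dictionary: Map[String, String] = createMap("/tmp/pairs.txt")
// dictionary: Map(alpha -> first, beta -> second value)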

String replacement issue with `replaceAllIn` method in Scala

I have this template formatting code in Scala from Better String formatting in Scala
def getTemplateString(template: String, replacement: Map[String, String]) = {
  replacement.foldLeft(template)((s: String, x: (String, String)) =>
    ("#\\{" + x._1 + "\\}").r.replaceAllIn(s, x._2))
}
The issue is that when a mapped value contains a '$' character, I get a java.lang.IllegalArgumentException: Illegal group reference error.
val template = "#{a}"
val map = Map[String, String]("a" -> "$bp")
val res = getTemplateString(template, map)
println(res)
How to fix this issue?
Try escaping the $ symbol:
val map = Map[String, String]("a" -> "\\$bp")
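If you would rather not escape every value by hand, one alternative (my suggestion, not part of the original answer) is to quote the replacement inside getTemplateString itself using scala.util.matching.Regex.quoteReplacement, which escapes $ and \ so the value is inserted literally:

import scala.util.matching.Regex

def getTemplateString(template: String, replacement: Map[String, String]): String =
  replacement.foldLeft(template) { case (s, (key, value)) =>
    // quoteReplacement makes "$bp" and similar values safe to use as replacements
    ("#\\{" + key + "\\}").r.replaceAllIn(s, Regex.quoteReplacement(value))
  }

// getTemplateString("#{a}", Map("a" -> "$bp")) now returns "$bp" without throwing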

MLlib to Breeze vectors/matrices are private to org.apache.spark.mllib scope?

I have read somewhere that MLlib local vectors/matrices currently wrap the Breeze implementation, but the methods converting MLlib vectors/matrices to Breeze are private to the org.apache.spark.mllib scope. The suggested workaround is to write your code in an org.apache.spark.mllib.something package.
Is there a better way to do this? Can you cite some relevant examples?
Thanks and regards,
I went with the same solution @dlwh suggested. Here is the code that does it:
package org.apache.spark.mllib.linalg

object VectorPub {
  implicit class VectorPublications(val vector: Vector) extends AnyVal {
    def toBreeze: breeze.linalg.Vector[scala.Double] = vector.toBreeze
  }

  implicit class BreezeVectorPublications(val breezeVector: breeze.linalg.Vector[Double]) extends AnyVal {
    def fromBreeze: Vector = Vectors.fromBreeze(breezeVector)
  }
}
Notice that the implicit classes extend AnyVal to prevent allocation of a new object when calling those methods.
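Usage from outside the Spark packages would then look roughly like this (a sketch; the vector values are arbitrary):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.VectorPub._

val bv = Vectors.dense(1.0, 2.0, 3.0).toBreeze   // via VectorPublications
val back = (bv + bv).fromBreeze                  // via BreezeVectorPublications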
My solution is kind of a hybrid of those of @barclar and @lev, above. You don't need to put your code in the org.apache.spark.mllib.linalg package if you don't make use of the spark-ml implicit conversions. You can define your own implicit conversions in your own package, like:
package your.package

import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.ml.linalg.Vector
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}

object BreezeConverters {
  implicit def toBreeze(dv: DenseVector): BDV[Double] =
    new BDV[Double](dv.values)

  implicit def toBreeze(sv: SparseVector): BSV[Double] =
    new BSV[Double](sv.indices, sv.values, sv.size)

  implicit def toBreeze(v: Vector): BV[Double] =
    v match {
      case dv: DenseVector  => toBreeze(dv)
      case sv: SparseVector => toBreeze(sv)
    }

  implicit def fromBreeze(dv: BDV[Double]): DenseVector =
    new DenseVector(dv.toArray)

  implicit def fromBreeze(sv: BSV[Double]): SparseVector =
    new SparseVector(sv.length, sv.index, sv.data)

  implicit def fromBreeze(bv: BV[Double]): Vector =
    bv match {
      case dv: BDV[Double] => fromBreeze(dv)
      case sv: BSV[Double] => fromBreeze(sv)
    }
}
Then you can import these implicits into your code with:
import your.package.BreezeConverters._
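With that import in scope, an ml Vector can be used where a Breeze vector is expected; a rough sketch (it reuses the placeholder package name from the answer):

import org.apache.spark.ml.linalg.Vectors
import breeze.linalg.{Vector => BV}
import your.package.BreezeConverters._

val mlVec = Vectors.dense(1.0, 2.0, 3.0)
val breezeVec: BV[Double] = mlVec                              // implicit toBreeze
val roundTrip: org.apache.spark.ml.linalg.Vector = breezeVec   // implicit fromBreeze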
As I understand it, the Spark people do not want to expose third party APIs (including Breeze) so that it's easier to change if they decide to move away from them.
You could always put just a simple implicit conversion class in that package and write the rest of your code in your own package. Not much better than just putting everything in there, but it makes it a little more obvious why you're doing it.
Here is the best I have so far. Note to @dlwh: please do provide any improvements you might have to this.
The solution I could come up with - one that does not put code inside the mllib.linalg package - is to convert each Vector to a new Breeze DenseVector.
val v1 = Vectors.dense(1.0, 2.0, 3.0)
val v2 = Vectors.dense(4.0, 5.0, 6.0)
val bv1 = new DenseVector(v1.toArray)
val bv2 = new DenseVector(v2.toArray)
val vectout = Vectors.dense((bv1 + bv2).toArray)
vectout: org.apache.spark.mllib.linalg.Vector = [5.0,7.0,9.0]
This solution avoids putting code into Spark's packages and avoids converting sparse to dense vectors:
def toBreeze(vector: Vector): breeze.linalg.Vector[scala.Double] = vector match {
  case sv: SparseVector => new breeze.linalg.SparseVector[Double](sv.indices, sv.values, sv.size)
  case dv: DenseVector  => new breeze.linalg.DenseVector[Double](dv.values)
}
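A quick usage sketch (the vectors are arbitrary examples):

import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}

val d = toBreeze(Vectors.dense(1.0, 0.0, 3.0))                      // becomes a Breeze DenseVector
val s = toBreeze(Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))   // stays a Breeze SparseVector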
This is a method I wrote to convert an MLlib DenseMatrix to a Breeze matrix; maybe it helps!
import breeze.linalg._
import org.apache.spark.mllib.linalg.Matrix

def toBreez(X: org.apache.spark.mllib.linalg.Matrix): breeze.linalg.DenseMatrix[Double] = {
  val m = breeze.linalg.DenseMatrix.zeros[Double](X.numRows, X.numCols)
  for (i <- 0 until X.numRows; j <- 0 until X.numCols) {
    m(i, j) = X.apply(i, j)
  }
  m
}
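As an aside, both Matrix.toArray in MLlib and Breeze's DenseMatrix constructor use column-major storage, so the element-by-element loop can probably be replaced by a single constructor call; a sketch worth verifying against your Spark and Breeze versions:

def toBreezeMatrix(X: org.apache.spark.mllib.linalg.Matrix): breeze.linalg.DenseMatrix[Double] =
  // X.toArray returns the values in column-major order, which matches Breeze's layout
  new breeze.linalg.DenseMatrix[Double](X.numRows, X.numCols, X.toArray)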
