Impala vs SparkSQL: built-in function translation: fnv_hash - apache-spark

I am using fnv_hash in Impala to translate some string values into numbers. Now that I am migrating to Spark SQL, is there a similar function in Spark SQL that I can use? An almost 1-to-1 function mapping a string value to a number would work. Thanks!

Unfortunately Spark doesn't provide a direct replacement. The built-in o.a.s.sql.functions.hash / pyspark.sql.functions.hash uses MurmurHash 3, which should have comparable properties for the same hash size, but Spark produces 32-bit hashes (compared to the 64-bit fnv_hash in Impala). If this is acceptable, just import hash and you're good to go:
from pyspark.sql.functions import hash as hash_
df = sc.parallelize([("foo", ), ("bar", )]).toDF(["foo"])
df.select(hash_("foo"))
DataFrame[hash(foo): int]
If you need a larger hash you can take a look at XXH64. It is not directly exposed as a SQL function, but the Catalyst expression is public, so all you need is a simple wrapper. Roughly something like this:
package com.example.spark.sql

import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.XxHash64

object functions {
  def xxhash64(cols: Column*): Column = new Column(
    new XxHash64(cols.map(_.expr))
  )
}
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column, _to_seq

def xxhash64(*cols):
    sc = SparkContext._active_spark_context
    jc = sc._jvm.com.example.spark.sql.functions.xxhash64(
        _to_seq(sc, cols, _to_java_column)
    )
    return Column(jc)
df.select(xxhash64("foo"))
DataFrame[xxHash(foo): bigint]
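As a side note, more recent Spark releases (3.0 and later, if I remember correctly) expose this hash directly as a built-in, both as org.apache.spark.sql.functions.xxhash64 and pyspark.sql.functions.xxhash64, so the wrapper above is no longer needed there. A minimal Scala sketch:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.xxhash64

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// the built-in returns a 64-bit hash as a bigint / long column
val df = Seq("foo", "bar").toDF("foo")
df.select(xxhash64($"foo"))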

Related

How to convert GUID into integer in pyspark

Hi Stackoverflow fams:
I am new to pyspark and trying to learn as much as I can. For now, I want to convert GUIDs into integers in pyspark. I can currently run the following statement in SQL to convert a GUID into an int.
CHECKSUM(HASHBYTES('sha2_512',GUID)) AS int_value_wanted
I wanted to do the same thing in pyspark, so I tried creating a temporary table out of a Spark dataframe and adding the above statement to the SQL query. But the code keeps throwing "Undefined function: 'CHECKSUM'". Is there a way I can add the "CHECKSUM" function to pyspark, or do the same thing another pyspark way?
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import SQLContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark_session = glueContext.spark_session
sqlContext = SQLContext(spark_session.sparkContext, spark_session)
spark_df = spark_session.createDataFrame(
    [("2540f487-7a29-400a-98a0-c03902e67f73", "1386172469"),
     ("0b32389a-ce01-4e6a-855c-15940cc91e9e", "-2013240275")],
    ("GUDI", "int_value_wanted")
)
spark_df.show(truncate=False)
spark_df.registerTempTable('temp')
new_df = sqlContext.sql("SELECT *, CHECKSUM(HASHBYTES('sha2_512', GUDI)) AS detail_id FROM temp")
new_df.show(truncate=False)
+------------------------------------+----------------+
|GUDI |int_value_wanted|
+------------------------------------+----------------+
|2540f487-7a29-400a-98a0-c03902e67f73|1386172469 |
|0b32389a-ce01-4e6a-855c-15940cc91e9e|-2013240275 |
+------------------------------------+----------------+
Thanks
There is a built-in sha2 function, which returns the checksum for the SHA-2 family of hash functions as a hex string. SHA-512 is supported as well.
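For example, here is a rough Scala sketch using only built-in functions (the same sha2, substring and conv helpers exist in pyspark.sql.functions). Note that this will not reproduce SQL Server's CHECKSUM values; it merely derives a deterministic integer from the hash, and the column names are the ones from the question:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, conv, sha2, substring}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val guids = Seq(
  "2540f487-7a29-400a-98a0-c03902e67f73",
  "0b32389a-ce01-4e6a-855c-15940cc91e9e"
).toDF("GUDI")

val withId = guids
  .withColumn("sha", sha2(col("GUDI"), 512))                   // 128-character hex digest
  .withColumn("detail_id",
    conv(substring(col("sha"), 1, 15), 16, 10).cast("long"))   // first 60 bits as a long
withId.show(truncate = false)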

Convert String expression to actual working instance expression

I am trying to convert an expression in Scala that is saved in a database as a String back into working code.
I have tried Reflect Toolbox, Groovy, etc. But I can't seem to achieve what I require.
Here's what I tried:
import scala.reflect.runtime.universe._
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox
val toolbox = currentMirror.mkToolBox()
val code1 = q"""StructType(StructField(id,IntegerType,true), StructField(name,StringType,true), StructField(tstamp,TimestampType,true), StructField(date,DateType,true))"""
val sType = toolbox.compile(code1)().asInstanceOf[StructType]
I need to use the sType instance to pass a custom schema to the csv reader for DataFrame creation, but it seems to fail.
Is there any way I can get the string expression of the StructType to convert into actual StructType instance? Any help would be appreciated.
If StructType is from Spark and you just want to convert a String to a StructType, you don't need reflection. You can try this:
import org.apache.spark.sql.catalyst.parser.LegacyTypeStringParser
import org.apache.spark.sql.types.{DataType, StructType}
import scala.util.Try

def fromString(raw: String): StructType =
  Try(DataType.fromJson(raw)).getOrElse(LegacyTypeStringParser.parse(raw)) match {
    case t: StructType => t
    case _ => throw new RuntimeException(s"Failed parsing: $raw")
  }
val code1 =
"""StructType(Array(StructField(id,IntegerType,true), StructField(name,StringType,true), StructField(tstamp,TimestampType,true), StructField(date,DateType,true)))"""
fromString(code1) // res0: org.apache.spark.sql.types.StructType
The code is taken from the org.apache.spark.sql.types.StructType companion object in Spark. You cannot use it directly as it is package-private. Moreover, it uses LegacyTypeStringParser, so I'm not sure whether it is good enough for production code.
Your code inside the quasiquotes needs to be valid Scala syntax, so you need to put quotes around the strings. You also need to provide all the necessary imports. This works:
val toolbox = currentMirror.mkToolBox()
val code1 =
  q"""
    // we need to import all sql types
    import org.apache.spark.sql.types._
    StructType(
      // StructType needs a list of fields
      List(
        // name arguments need to be in proper quotes
        StructField("id", IntegerType, true),
        StructField("name", StringType, true),
        StructField("tstamp", TimestampType, true),
        StructField("date", DateType, true)
      )
    )
  """
val sType = toolbox.compile(code1)().asInstanceOf[StructType]
println(sType)
But maybe, instead of trying to recompile the code, you should consider other alternatives, such as serializing the struct type (perhaps to JSON?).
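For what it's worth, StructType can already round-trip through JSON out of the box, which avoids re-parsing Scala source entirely. A minimal sketch:
import org.apache.spark.sql.types.{DataType, IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true)
))

// store this string in the database instead of Scala source code...
val asJson: String = schema.json

// ...and rebuild the StructType from it later
val restored = DataType.fromJson(asJson).asInstanceOf[StructType]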

Spark sampling options in JSON reader ignored?

In the following two examples, the number of tasks run and the corresponding run time imply that the sampling options have no effect, as they are similar to jobs run without any sampling options:
val df = spark.read.option("samplingRatio", 0.001).json("s3a://test/*.json.bz2")
val df = spark.read.option("sampleSize",100).json("s3a://test/*.json.bz2")
I know that explicit schemas are best for performance, but in convenience cases sampling is useful.
New to Spark, am I using these options incorrectly? I attempted the same approach in PySpark, with the same results:
df = spark.read.options(samplingRatio=0.1).json("s3a://test/*.json.bz2")
df = spark.read.options(samplingRatio=None).json("s3a://test/*.json.bz2")
TL;DR None of the options you use will have a significant impact on the execution time:
sampleSize is not among valid JSONOptions or JSONOptionsInRead so it will be ignored.
samplingRatio is a valid option, but internally it uses PartitionwiseSampledRDD, so the process is linear in terms of the number of records. Therefore sampling can only reduce inference cost, not the IO, which is likely the bottleneck here.
Setting samplingRatio to None is equivalent to no sampling. PySpark's OptionUtils simply discards None options, and samplingRatio defaults to 1.0.
You can try to sample the data explicitly. In Python:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

def infer_json_schema(path: str, sample_size: int, **kwargs: str) -> StructType:
    spark = SparkSession.builder.getOrCreate()
    sample = spark.read.text(path).limit(sample_size).rdd.flatMap(lambda x: x)
    return spark.read.options(**kwargs).json(sample).schema
In Scala:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

def inferJsonSchema(
    path: String, sampleSize: Int, options: Map[String, String]): StructType = {
  val spark = SparkSession.builder.getOrCreate()
  import spark.implicits._  // needed for .as[String]
  val sample = spark.read.text(path).limit(sampleSize).as[String]
  spark.read.options(options).json(sample).schema
}
Please keep in mind that, to work well, the sample size should be at most equal to the expected size of a partition. Limits in Spark escalate quickly (see for example my answer to Spark count vs take and length) and you can easily end up scanning the whole input.
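Once the schema has been inferred from a sample, it can be passed back to the full read so the complete dataset is scanned only once. A usage sketch against the Scala helper above, reusing the hypothetical path from the question:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
val schema = inferJsonSchema("s3a://test/*.json.bz2", 1000, Map.empty[String, String])

// reuse the sampled schema; no inference pass over the full input
val df = spark.read.schema(schema).json("s3a://test/*.json.bz2")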

How should I convert an RDD of org.apache.spark.ml.linalg.Vector to Dataset?

I'm struggling to understand how the conversion among RDDs, Datasets and DataFrames works.
I'm pretty new to Spark, and I get stuck every time I need to pass from one data model to another (especially from RDDs to Datasets and DataFrames).
Could anyone explain to me the right way to do it?
As an example, I now have an RDD[org.apache.spark.ml.linalg.Vector] and I need to pass it to my machine learning algorithm, for example a KMeans (Spark Dataset MLlib). So I need to convert it to a Dataset with a single column named "features" which should contain Vector-typed rows. How should I do this?
All you need is an Encoder. Imports
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.ml.linalg
RDD:
val rdd = sc.parallelize(Seq(
linalg.Vectors.dense(1.0, 2.0), linalg.Vectors.sparse(2, Array(), Array())
))
Conversion:
val ds = spark.createDataset(rdd)(ExpressionEncoder(): Encoder[linalg.Vector])
  .toDF("features")
ds.show
// +---------+
// | features|
// +---------+
// |[1.0,2.0]|
// |(2,[],[])|
// +---------+
ds.printSchema
// root
// |-- features: vector (nullable = true)
To convert an RDD to a DataFrame, the easiest way is to use toDF() in Scala. To use this function, it is necessary to import the implicits, which is done using the SparkSession object. It can be done as follows:
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val df = rdd.toDF("features")
toDF() takes an RDD of tuples. When the RDD is built up of common Scala objects they will be implicitly converted, i.e. there is no need to do anything, and when the RDD has multiple columns there is no need to do anything either, since the RDD already contains a tuple. However, in this special case you need to first convert RDD[org.apache.spark.ml.linalg.Vector] to RDD[Tuple1[org.apache.spark.ml.linalg.Vector]]. Therefore, it is necessary to do a conversion to a tuple as follows:
val df = rdd.map(Tuple1(_)).toDF("features")
The above will convert the RDD to a dataframe with a single column called features.
To convert to a Dataset, the easiest way is to use a case class. Make sure the case class is defined outside the Main object. First convert the RDD to a DataFrame, then do the following:
case class A(features: org.apache.spark.ml.linalg.Vector)
val ds = df.as[A]
To cover all possible conversions: accessing the underlying RDD from a DataFrame or Dataset can be done using .rdd:
val rdd = df.rdd
Instead of converting back and forth between RDDs and DataFrames/Datasets, it's usually easier to do all the computations using the DataFrame API. If there is no suitable function to do what you want, it's usually possible to define a UDF (user-defined function). See for example here: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-udfs.html
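For example, a small hypothetical UDF that computes the L2 norm of each feature vector directly on the DataFrame column (assuming the df with a "features" column built above):
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// hypothetical UDF: the L2 norm of each ml Vector
val vectorNorm = udf((v: Vector) => math.sqrt(v.toArray.map(x => x * x).sum))

df.withColumn("norm", vectorNorm(col("features"))).show()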

Need some inputs in feature extraction in Apache Spark

I am new to Apache Spark and we are trying to use the MLlib utility to do some analysis. I collated some code to convert my data into features and then apply a linear regression algorithm to that. I am facing some issues. Please help, and excuse me if it's a silly question.
My person data looks like
1,1000.00,36
2,2000.00,35
3,2345.50,37
4,3323.00,45
Just a simple example to get the code working
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
case class Person(rating: String, income: Double, age: Int)
val persondata = sc.textFile("D:/spark/mydata/persondata.txt").map(_.split(",")).map(p => Person(p(0), p(1).toDouble, p(2).toInt))
def prepareFeatures(people: Seq[Person]): Seq[org.apache.spark.mllib.linalg.Vector] = {
  val maxIncome = people.map(_.income).max
  val maxAge = people.map(_.age).max
  people.map(p =>
    Vectors.dense(
      if (p.rating == "A") 0.7 else if (p.rating == "B") 0.5 else 0.3,
      p.income / maxIncome,
      p.age.toDouble / maxAge))
}
def prepareFeaturesWithLabels(features: Seq[org.apache.spark.mllib.linalg.Vector]): Seq[LabeledPoint] =
  (0d to 1 by (1d / features.length)) zip(features) map(l => LabeledPoint(l._1, l._2))
--- It's working up to here.
--- It breaks in the code below:
scala> val data = sc.parallelize(prepareFeaturesWithLabels(prepareFeatures(people)))
<console>:36: error: not found: value people
Error occurred in an application involving default arguments.
val data = sc.parallelize(prepareFeaturesWithLabels(prepareFeatures(people)))
^
Please advise
You seem to be going in roughly the right direction, but there are a few minor problems. First off, you are trying to reference a value (people) that you haven't defined. More generally, you are writing your code to work with sequences, when instead you should modify it to work with RDDs (or DataFrames). You also seem to be using parallelize to try to parallelize your operation, but parallelize is a helper method for taking a local collection and making it available as a distributed RDD. I'd recommend looking at the programming guides or some additional documentation to get a better understanding of the Spark APIs. A rough sketch of how the feature-preparation code could be rewritten against RDDs is shown below. Best of luck with your adventures with Spark.
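The sketch below assumes the Person case class and the persondata RDD defined in the question:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def prepareFeatures(people: RDD[Person]): RDD[Vector] = {
  // max() is an action; the two scalars are computed once on the driver
  val maxIncome = people.map(_.income).max()
  val maxAge = people.map(_.age).max()
  people.map(p =>
    Vectors.dense(
      if (p.rating == "A") 0.7 else if (p.rating == "B") 0.5 else 0.3,
      p.income / maxIncome,
      p.age.toDouble / maxAge))
}

// zipWithIndex supplies a stable ordinal, scaled into [0, 1] as a stand-in label
def prepareFeaturesWithLabels(features: RDD[Vector]): RDD[LabeledPoint] = {
  val n = features.count().toDouble
  features.zipWithIndex.map { case (v, i) => LabeledPoint(i / n, v) }
}

// the result is already an RDD, so no sc.parallelize is needed
val data = prepareFeaturesWithLabels(prepareFeatures(persondata))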
