unfold vector column into normal columns in DataFrame [duplicate] - apache-spark

This question already has answers here:
How to split Vector into columns - using PySpark [duplicate]
(4 answers)
Closed 3 years ago.
I want to unfold a vector column into normal columns in a DataFrame. .transform creates the individual columns, but something is wrong with the data types or with ‘nullable’, and I get an error when I call .show (see the example code below). How can I fix the problem?
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import udf
spark = SparkSession\
    .builder\
    .config("spark.driver.maxResultSize", "40g") \
    .config('spark.sql.shuffle.partitions', '2001') \
    .getOrCreate()
data = [(0.2, 53.3, 0.2, 53.3),
(1.1, 43.3, 0.3, 51.3),
(2.6, 22.4, 0.4, 43.3),
(3.7, 25.6, 0.2, 23.4)]
df = spark.createDataFrame(data, ['A','B','C','D'])
df.show(3)
df.printSchema()
vecAssembler = VectorAssembler(inputCols=['C','D'], outputCol="features")
new_df = vecAssembler.transform(df)
new_df.printSchema()
new_df.show(3)
split1_udf = udf(lambda value: value[0], DoubleType())
split2_udf = udf(lambda value: value[1], DoubleType())
new_df = new_df.withColumn('c1', split1_udf('features')).withColumn('c2', split2_udf('features'))
new_df.printSchema()
new_df.show(3)

I don't know what the problem with the UDFs is, but I found another solution, shown below.
data = [(0.2, 53.3, 0.2, 53.3),
(1.1, 43.3, 0.3, 51.3),
(2.6, 22.4, 0.4, 43.3),
(3.7, 25.6, 0.2, 23.4)]
df = spark.createDataFrame(data, ['A','B','C','D'])
vecAssembler = VectorAssembler(inputCols=['C','D'], outputCol="features")
new_df = vecAssembler.transform(df)
def extract(row):
    return (row.A, row.B, row.C, row.D,) + tuple(row.features.toArray().tolist())
extracted_df = new_df.rdd.map(extract).toDF(['A','B','C','D', 'col1', 'col2'])
extracted_df.show()

The features column contains values of type pyspark.ml.linalg.DenseVector, and the vector elements are of type numpy.float64.
To convert NumPy scalars to native Python types, use value.item():
split1_udf = udf(lambda value: value[0].item(), DoubleType())
split2_udf = udf(lambda value: value[1].item(), DoubleType())
Using this fix results in the following output:
+---+----+---+----+----------+---+----+
| A| B| C| D| features| c1| c2|
+---+----+---+----+----------+---+----+
|0.2|53.3|0.2|53.3|[0.2,53.3]|0.2|53.3|
|1.1|43.3|0.3|51.3|[0.3,51.3]|0.3|51.3|
|2.6|22.4|0.4|43.3|[0.4,43.3]|0.4|43.3|
|3.7|25.6|0.2|23.4|[0.2,23.4]|0.2|23.4|
+---+----+---+----+----------+---+----+

Related

Create column of decimal type

I would like to provide numbers when creating a Spark dataframe. I have issues providing decimal type numbers.
This way the number gets truncated:
from pyspark.sql import functions as F
df = spark.createDataFrame([(10234567891023456789.5, )], ["numb"])
df = df.withColumn("numb_dec", F.col("numb").cast("decimal(30,1)"))
df.show(truncate=False)
#+---------------------+----------------------+
#|numb |numb_dec |
#+---------------------+----------------------+
#|1.0234567891023456E19|10234567891023456000.0|
#+---------------------+----------------------+
This fails:
df = spark.createDataFrame([(10234567891023456789.5, )], "numb decimal(30,1)")
df.show(truncate=False)
TypeError: field numb: DecimalType(30,1) can not accept object 1.0234567891023456e+19 in type <class 'float'>
How to correctly provide big decimal numbers so that they wouldn't get truncated?
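The precision is lost before Spark ever sees the value: a Python float is a 64-bit double, which holds only about 15 to 17 significant decimal digits, so the literal 10234567891023456789.5 is already rounded. You can try passing string values when creating the DataFrame instead: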
df = spark.createDataFrame([("10234567891023456789.5", )], ["numb"])
df = df.withColumn("numb_dec", F.col("numb").cast("decimal(30,1)"))
df.show(truncate=False)
#+----------------------+----------------------+
#|numb |numb_dec |
#+----------------------+----------------------+
#|10234567891023456789.5|10234567891023456789.5|
#+----------------------+----------------------+
Alternatively, pass a decimal.Decimal value together with an explicit DecimalType schema:
from pyspark.sql.types import *
from decimal import *
schema = StructType([StructField('numb', DecimalType(30,1))])
data = [( Context(prec=30, Emax=999, clamp=1).create_decimal('10234567891023456789.5'), )]
df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)
+----------------------+
|numb |
+----------------------+
|10234567891023456789.5|
+----------------------+
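The truncation in the question can be reproduced without Spark. A minimal sketch, assuming only Python's standard decimal module: a float literal is rounded to the nearest 64-bit double before any DataFrame is built, while a Decimal constructed from a string keeps every digit.

```python
from decimal import Decimal

literal = "10234567891023456789.5"

# A Python float is a 64-bit double, so the value is rounded on parsing.
as_float = float(literal)

# A Decimal built from a string keeps all 21 significant digits.
as_decimal = Decimal(literal)

# Converting the float to Decimal exposes the value that was actually stored.
float_exact = Decimal(as_float)

print(as_decimal)                 # 10234567891023456789.5
print(float_exact == as_decimal)  # False
```

This is why both answers above start from a string rather than from a float literal.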

PySpark UDF for converting UTM error expected zero arguments for construction of ClassDict (for numpy.dtype)

I am trying to create a UDF in PySpark for converting UTM to longitude and latitude.
Error
Caused by: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
I have tried different data types, but without any luck.
PySpark code
import pyspark.sql.functions as F
from pyspark.sql.types import *
import utm
df2 = spark.createDataFrame([(531086, 6224626), (531086, 6224626)], ["C1", "C2"])
df2.printSchema()
utm_udf_x = F.udf(lambda x,y: utm.to_latlon(x,y, 32, 'U')[0], ArrayType(FloatType()))
utm_udf_y = F.udf(lambda x,y: utm.to_latlon(x,y, 32, 'U')[1], ArrayType(FloatType()))
df2 = df2.withColumn('lat',utm_udf_x(F.col('C1'), F.col('C2')))
df2 = df2.withColumn('lon',utm_udf_y(F.col('C1'), F.col('C2')))
display(df2)
Thanks
The main issue was converting the NumPy value returned by utm.to_latlon to a Python float.
This works:
import pyspark.sql.functions as F
from pyspark.sql.types import *
import utm
df2 = spark.createDataFrame([(340000.0, 5710000.0), (573014.00000135, 6221529.99974406)], ["C1", "C2"])
df2.printSchema()
utm_udf_x = F.udf(lambda x,y: float(utm.to_latlon(x,y, 32, 'U')[0]), FloatType())
utm_udf_y = F.udf(lambda x,y: float(utm.to_latlon(x,y, 32, 'U')[1]), FloatType())
df2 = df2.withColumn('lat',utm_udf_x(F.col('C1'), F.col('C2')))
df2 = df2.withColumn('lon',utm_udf_y(F.col('C1'), F.col('C2')))
display(df2)

Join in Spark returns duplicates implicit data types don't match

I am getting duplicates when joining on two dataframes where one key is a decimal and the other is a string. It seems that Spark is converting the decimal to a string which results in a scientific notation expression, but then shows the original result in decimal form just fine. I found a workaround by converting to string directly, but this seems dangerous as duplicates are created without warning.
Is this a bug? How can I detect when this is happening?
Here's a demo in PySpark on Spark 2.4:
>>> from pyspark.sql.functions import *
>>> from pyspark.sql.types import *
>>> df1 = spark.createDataFrame([('a', 9223372034559809871), ('b', 9223372034559809771)], ['group', 'id_int'])
>>> df1=df1.withColumn('id',col('id_int').cast(DecimalType(38,0)))
>>>
>>> df1.show()
+-----+-------------------+-------------------+
|group| id_int| id|
+-----+-------------------+-------------------+
| a|9223372034559809871|9223372034559809871|
| b|9223372034559809771|9223372034559809771|
+-----+-------------------+-------------------+
>>>
>>> df2= spark.createDataFrame([(1, '9223372034559809871'), (2, '9223372034559809771')], ['value', 'id'])
>>> df2.show()
+-----+-------------------+
|value| id|
+-----+-------------------+
| 1|9223372034559809871|
| 2|9223372034559809771|
+-----+-------------------+
>>>
>>> df1.join(df2, ["id"]).show()
+-------------------+-----+-------------------+-----+
| id|group| id_int|value|
+-------------------+-----+-------------------+-----+
|9223372034559809871| a|9223372034559809871| 1|
|9223372034559809871| a|9223372034559809871| 2|
|9223372034559809771| b|9223372034559809771| 1|
|9223372034559809771| b|9223372034559809771| 2|
+-------------------+-----+-------------------+-----+
>>> df1.dtypes
[('group', 'string'), ('id_int', 'bigint'), ('id', 'decimal(38,0)')]
This happens because the values in the join keys are very large.
I tweaked the values in the join condition, and it now gives the proper results:
from pyspark.sql.functions import col
from pyspark.sql.types import *
df1 = spark.createDataFrame([('a', 9223372034559809871), ('b', 9123372034559809771)],
                            ['group', 'id_int'])
df1 = df1.withColumn('id', col('id_int').cast(DecimalType(38,0)))
df2 = spark.createDataFrame([(1, '9223372034559809871'), (2, '9123372034559809771')],
                            ['value', 'id'])
df1.join(df2, df1["id"]==df2["id"],"inner").show()
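One possible explanation for the duplicates, assuming Spark compares a decimal key with a string key by casting both sides to double (an assumption about implicit type coercion in Spark 2.4, not something this thread confirms): the two original ids differ by only 100, which is less than the gap between adjacent doubles at this magnitude, so both collapse to the same value. The sketch below reproduces the effect with plain Python floats, which are also 64-bit doubles.

```python
# Ids from the question, plus the tweaked id used in the answer.
id_a = 9223372034559809871
id_b = 9223372034559809771
id_b_tweaked = 9123372034559809771

# Between 2**62 and 2**63 adjacent doubles are 1024 apart,
# so ids that differ by 100 can round to the same double.
print(float(id_a) == float(id_b))          # True
print(float(id_a) == float(id_b_tweaked))  # False

# Compared as exact types (int or string), the ids stay distinct.
print(id_a == id_b)                        # False
print(str(id_a) == str(id_b))              # False
```

If this is the cause, explicitly casting both join keys to the same exact type (for example both to string, or both to decimal(38,0)) before joining avoids relying on implicit coercion.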

Find median in spark SQL for multiple double datatype columns

I need to find the median for multiple columns of double data type. Please suggest the correct approach.
Below is my sample dataset with one column. I am expecting the median value to be returned as 1 for my sample.
scala> sqlContext.sql("select num from test").show();
+---+
|num|
+---+
|0.0|
|0.0|
|1.0|
|1.0|
|1.0|
|1.0|
+---+
I tried the following options
1) Hive UDAF percentile, it worked only for BigInt.
2) Hive UDAF percentile_approx, but it does not work as expected (returns 0.25 vs 1).
sqlContext.sql("select percentile_approx(num,0.5) from test").show();
+----+
| _c0|
+----+
|0.25|
+----+
3) Spark window function percent_rank: the way I see it, to find the median I would look at all percent_rank values above 0.5 and pick the num value that corresponds to the maximum percent_rank. But this does not work in all cases, especially when I have an even number of records, where the median is the average of the two middle values in the sorted distribution.
Also, because I have to find the median for multiple columns, with percent_rank I would have to calculate it in separate DataFrames, which seems a rather complex method to me. Please correct me if my understanding is wrong.
+---+------------+
|num|percent_rank|
+---+------------+
|0.0|         0.0|
|0.0|         0.0|
|1.0|         0.4|
|1.0|         0.4|
|1.0|         0.4|
|1.0|         0.4|
+---+------------+
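For reference, the percent_rank values in the table above follow from the usual definition (rank - 1) / (n - 1), where tied rows share the lowest rank. The sketch below recomputes them in plain Python; `percent_rank` here is a hypothetical helper written for illustration, not Spark's implementation.

```python
def percent_rank(values):
    """Return (value, percent_rank) pairs, with ties sharing the lowest rank."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 1:
        return [(ordered[0], 0.0)]
    result = []
    for v in ordered:
        rank = ordered.index(v) + 1  # 1-based rank of the first occurrence
        result.append((v, (rank - 1) / (n - 1)))
    return result

nums = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
ranks = percent_rank(nums)
print(ranks)
```

Every 1.0 row gets 0.4 and no row lands on 0.5, which illustrates why selecting a row by percent_rank does not directly yield the median for this sample.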
Out of curiosity, which version of Apache Spark are you using? Apache Spark 2.0+ included some fixes that changed approxQuantile.
If I run the PySpark code snippet below:
rdd = sc.parallelize([[1, 0.0], [1, 0.0], [1, 1.0], [1, 1.0], [1, 1.0], [1, 1.0]])
df = rdd.toDF(['id', 'num'])
df.createOrReplaceTempView("df")
with the median calculation using approxQuantile as:
df.approxQuantile("num", [0.5], 0.25)
or
spark.sql("select percentile_approx(num, 0.5) from df").show()
the results are:
Spark 2.0.0: 0.25
Spark 2.0.1: 1.0
Spark 2.1.0: 1.0
Note that these are approximate values (via approxQuantile), though in general this should work well. If you need the exact median, one approach is to use numpy.median. The code snippet below is adapted to this df example from gench's SO response to How to find the median in Apache Spark with Python Dataframe API?:
from pyspark.sql.types import *
import pyspark.sql.functions as F
import numpy as np
def find_median(values):
    try:
        median = np.median(values)  # median of the list of values in each row
        return round(float(median), 2)
    except Exception:
        return None  # if there is anything wrong with the given values
median_finder = F.udf(find_median,FloatType())
df2 = df.groupBy("id").agg(F.collect_list("num").alias("nums"))
df2 = df2.withColumn("median", median_finder("nums"))
# print out
df2.show()
with the output of:
+---+--------------------+------+
| id| nums|median|
+---+--------------------+------+
| 1|[0.0, 0.0, 1.0, 1...| 1.0|
+---+--------------------+------+
Updated: Spark 1.6 Scala version using RDDs
If you are using Spark 1.6, you can calculate the median in Scala, following Eugene Zhulenev's response to How can I calculate the exact median with Apache Spark. Below is the modified code that works with our example.
import org.apache.spark.SparkContext._
val rdd: RDD[Double] = sc.parallelize(Seq((0.0), (0.0), (1.0), (1.0), (1.0), (1.0)))
val sorted = rdd.sortBy(identity).zipWithIndex().map {
  case (v, idx) => (idx, v)
}
val count = sorted.count()
val median: Double = if (count % 2 == 0) {
  val l = count / 2 - 1
  val r = l + 1
  (sorted.lookup(l).head + sorted.lookup(r).head).toDouble / 2
} else sorted.lookup(count / 2).head.toDouble
with the output of:
// output
import org.apache.spark.SparkContext._
rdd: org.apache.spark.rdd.RDD[Double] = ParallelCollectionRDD[227] at parallelize at <console>:34
sorted: org.apache.spark.rdd.RDD[(Long, Double)] = MapPartitionsRDD[234] at map at <console>:36
count: Long = 6
median: Double = 1.0
Note, this is calculating the exact median using RDDs - i.e. you will need to convert the DataFrame column into an RDD to perform this calculation.
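The same even/odd logic can be checked locally on a small sample. A minimal sketch in plain Python, assuming the values fit in memory; `exact_median` is a hypothetical helper, and the standard library's statistics.median gives the same result.

```python
import statistics

def exact_median(values):
    """Exact median: the middle value, or the mean of the two middle values."""
    ordered = sorted(values)
    count = len(ordered)
    if count == 0:
        return None
    if count % 2 == 0:
        left = count // 2 - 1
        right = left + 1
        return (ordered[left] + ordered[right]) / 2
    return float(ordered[count // 2])

nums = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
print(exact_median(nums))       # 1.0
print(statistics.median(nums))  # 1.0
```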

How to access element of a VectorUDT column in a Spark DataFrame?

I have a dataframe df with a VectorUDT column named features. How do I get an element of the column, say first element?
I've tried doing the following
from pyspark.sql.functions import udf
first_elem_udf = udf(lambda row: row.values[0])
df.select(first_elem_udf(df.features)).show()
but I get a net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype) error. I get the same error if I use first_elem_udf = udf(lambda row: row.toArray()[0]) instead.
I also tried explode() but I get an error because it requires an array or map type.
This should be a common operation, I think.
Convert output to float:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import lit, udf
def ith_(v, i):
    try:
        return float(v[i])
    except ValueError:
        return None
ith = udf(ith_, DoubleType())
Example usage:
from pyspark.ml.linalg import Vectors
df = sc.parallelize([
(1, Vectors.dense([1, 2, 3])),
(2, Vectors.sparse(3, [1], [9]))
]).toDF(["id", "features"])
df.select(ith("features", lit(1))).show()
## +-----------------+
## |ith_(features, 1)|
## +-----------------+
## | 2.0|
## | 9.0|
## +-----------------+
Explanation:
Output values have to be reserialized to equivalent Java objects. If you want to access values (beware of SparseVectors), you should use the item method:
v.values.item(0)
which returns a standard Python scalar. Similarly, if you want to access all values as a dense structure:
v.toArray().tolist()
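The SparseVector caveat is easy to miss: values holds only the stored (non-zero) entries, so position 0 of values is not necessarily element 0 of the vector. The sketch below models a sparse vector with plain Python lists to show the difference; `sparse_get` and `sparse_to_list` are hypothetical helpers for illustration, not part of pyspark.

```python
# Plain-Python model of Vectors.sparse(3, [1], [9]):
# size 3, one stored entry at index 1 with value 9.0.
size = 3
indices = [1]
values = [9.0]

def sparse_get(size, indices, values, i):
    """Element i of the logical vector; entries that are not stored are 0.0."""
    if not 0 <= i < size:
        raise IndexError(i)
    return values[indices.index(i)] if i in indices else 0.0

def sparse_to_list(size, indices, values):
    """Dense list equivalent of the sparse vector."""
    return [sparse_get(size, indices, values, i) for i in range(size)]

first_stored = values[0]                              # first stored entry
first_element = sparse_get(size, indices, values, 0)  # logical element 0

print(first_stored)                           # 9.0
print(first_element)                          # 0.0
print(sparse_to_list(size, indices, values))  # [0.0, 9.0, 0.0]
```

For this reason, converting with toArray() first is the safer choice when the column may contain sparse vectors.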
If you prefer using spark.sql, you can use the following custom function 'to_array' to convert the vector to an array. Then you can manipulate it as an array.
from pyspark.sql.types import ArrayType, DoubleType
def to_array_(v):
    return v.toArray().tolist()
from pyspark.sql import SQLContext
sqlContext=SQLContext(spark.sparkContext, sparkSession=spark, jsqlContext=None)
sqlContext.udf.register("to_array",to_array_, ArrayType(DoubleType()))
example
from pyspark.ml.linalg import Vectors
df = sc.parallelize([
(1, Vectors.dense([1, 2, 3])),
(2, Vectors.sparse(3, [1], [9]))
]).toDF(["id", "features"])
df.createOrReplaceTempView("tb")
spark.sql("""select * , to_array(features)[1] Second from tb """).toPandas()
output
   id         features  Second
0   1  [1.0, 2.0, 3.0]     2.0
1   2  (0.0, 9.0, 0.0)     9.0
I ran into the same problem of not being able to use explode(). One thing you can do is use VectorSlicer from the pyspark.ml.feature library, like so:
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors
from pyspark.sql.types import Row
slicer = VectorSlicer(inputCol="features", outputCol="features_one", indices=[0])
output = slicer.transform(df)
output.select("features", "features_one").show()
For anyone trying to split the probability column generated by a trained PySpark ML model into usable columns: this approach uses neither a UDF nor NumPy, and it only works for binary classification. Here lr_pred is the DataFrame that holds the predictions from the logistic regression model.
from pyspark.sql.functions import split, regexp_replace
from pyspark.sql.types import DoubleType
prob_df1 = lr_pred.withColumn("probability", lr_pred["probability"].cast("String"))
prob_df = prob_df1.withColumn('probabilityre', split(regexp_replace("probability", r"^\[|\]", ""), ",")[1].cast(DoubleType()))
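To see what the string manipulation does, the same steps can be traced on a single value with Python's re module. This is only a sketch: it assumes the probability vector is rendered as a string such as "[0.3,0.7]" after the cast, and the sample value is made up.

```python
import re

# Assumed string form of a two-class probability vector after the cast.
probability = "[0.3,0.7]"

# Strip the leading "[" and any "]" using the same pattern as above.
stripped = re.sub(r"^\[|\]", "", probability)

# Split on commas and take the second entry (the probability of class 1).
positive_class = float(stripped.split(",")[1])

print(stripped)        # 0.3,0.7
print(positive_class)  # 0.7
```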
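Since Spark 3.0.0 this can be done without using a UDF:
from pyspark.ml.functions import vector_to_array
df = df.withColumn("first", vector_to_array("features")[0])  # "first" is an illustrative column name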
https://discuss.dizzycoding.com/how-to-split-vector-into-columns-using-pyspark/
Why is Vector[Double] used in the results? That's not a very nice data type.