Create column of decimal type - apache-spark

I would like to provide numbers when creating a Spark dataframe, but I have issues providing decimal-type numbers.
This way the number gets truncated:
from pyspark.sql import functions as F

df = spark.createDataFrame([(10234567891023456789.5, )], ["numb"])
df = df.withColumn("numb_dec", F.col("numb").cast("decimal(30,1)"))
df.show(truncate=False)
#+---------------------+----------------------+
#|numb                 |numb_dec              |
#+---------------------+----------------------+
#|1.0234567891023456E19|10234567891023456000.0|
#+---------------------+----------------------+
This fails:
df = spark.createDataFrame([(10234567891023456789.5, )], "numb decimal(30,1)")
df.show(truncate=False)
TypeError: field numb: DecimalType(30,1) can not accept object 1.0234567891023456e+19 in type <class 'float'>
How can I correctly provide big decimal numbers so that they don't get truncated?

This is likely related to floating-point representation: a Python float is a 64-bit double, so the value loses precision before Spark ever sees it. You can try passing string values when creating the dataframe instead:
df = spark.createDataFrame([("10234567891023456789.5", )], ["numb"])
df = df.withColumn("numb_dec", F.col("numb").cast("decimal(30,1)"))
df.show(truncate=False)
#+----------------------+----------------------+
#|numb                  |numb_dec              |
#+----------------------+----------------------+
#|10234567891023456789.5|10234567891023456789.5|
#+----------------------+----------------------+
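To confirm that the precision is lost on the Python side before Spark ever sees the value, here is a quick check (a minimal sketch, independent of Spark):
from decimal import Decimal

x = 10234567891023456789.5     # parsed as a Python float (64-bit double)
print(x)                       # roughly 1.0234567891023456e+19; the .5 is long gone
print(Decimal(x))              # the exact value the double actually stores

# Building the value from a string keeps every digit:
print(Decimal("10234567891023456789.5"))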

Try something like this:
from pyspark.sql.types import StructType, StructField, DecimalType
from decimal import Context

schema = StructType([StructField('numb', DecimalType(30, 1))])
# create_decimal parses the string directly, so no precision is lost to a float
data = [(Context(prec=30, Emax=999, clamp=1).create_decimal('10234567891023456789.5'), )]
df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)
+----------------------+
|numb                  |
+----------------------+
|10234567891023456789.5|
+----------------------+
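For what it's worth, a decimal.Decimal built directly from the string also works without setting up a Context explicitly; a minimal sketch, assuming the same spark session:
from decimal import Decimal
from pyspark.sql.types import StructType, StructField, DecimalType

schema = StructType([StructField("numb", DecimalType(30, 1))])
# Decimal("...") is exact because it is parsed from the string, not from a float
df = spark.createDataFrame([(Decimal("10234567891023456789.5"), )], schema)
df.show(truncate=False)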

Related

How to convert string column to udt Vector with float values pyspark

I have a pyspark dataframe with a column "features" that looks like:
[[-0.65467646, 0.578577578, 0.577757775], [-0.65467646, 0.578577578, 0.577757775, 0.65477645, 0.5887563773], [-0.65467646, 0.578577578, 0.577757775]]
I would like to apply k-means to it, but I get a type error saying that the column is a string and should be converted to a Vector.
I tried the conversion with a udf function, something like:
Udf_vector = udf(lambda v: Vector(v), UDTVector())
but now it says that the values [-0.65467646, 0.578577578, 0.577757775 ... are not floats.
So I tried again with another udf:
Udf1 = udf(lambda x: [float(y) for y in x])
df = df.withColumns(col("features", udf(col("features"))
But this didn't work either. Can someone help me with this? I would be very thankful; it's my last step before applying the k-means model.
Your features column isn't an array type, so you'll first have to convert it.
You can strip the outer square brackets and split the string to get an array:
df = df.withColumn("features", split(expr("rtrim(']', ltrim('[', features))"), ","))
Now you have an array of strings where each element is itself array-like, so you'll need to convert each one to an array too. For Spark 2.4+, you can do this with the transform function:
df = df.withColumn("features", expr("transform(features, x -> split(rtrim(']', ltrim('[', x)), ','))"))
Finally, flatten the inner arrays, cast the string elements to floats, and convert to a vector:
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import col, flatten, udf

array_to_vector = udf(lambda a: Vectors.dense(a), VectorUDT())
df = df.withColumn("features", flatten(col("features")).cast("array<float>")) \
       .withColumn("features", array_to_vector(col("features")))
df.printSchema()
#root
# |-- id: long (nullable = true)
# |-- features: vector (nullable = true)
Putting it all together:
df = df.withColumn("features", split(expr("rtrim(']', ltrim('[', features))"), ",")) \
       .withColumn("features", expr("""transform(features, x -> split(rtrim(']', ltrim('[', x)), ","))""")) \
       .withColumn("features", flatten(col("features")).cast("array<float>")) \
       .withColumn("features", array_to_vector(col("features")))

Spark LightGBM Predict dataframe datatype different from printSchema of output datatype

I want to transform one of the columns in my dataframe to a string using a UDF.
When I printSchema on my dataframe, that column indeed shows the vector datatype. However, when I use my UDF to transform the vector to a string, I get this error:
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(probability)'
due to data type mismatch: argument 1 requires vector type, however, '`probability`' is of
struct<type:tinyint,size:int,indices:array<int>,values:array<double>> type.;;
imports
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import com.microsoft.ml.spark.{LightGBMClassifier,LightGBMClassificationModel}
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
UDF
val vecToString = udf( (xs: Vector) => xs.toArray.mkString(";"))
DataFrame (printSchema)
val inputData = spark.read.parquet(inputDataPath)
val pipelineModel = PipelineModel.load(modelPath)
val predictions = pipelineModel.transform(inputData)
# Selecting only 2 columns from predictions DF:
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = false)
+-----------------------------------------+----------+
|probability                              |prediction|
+-----------------------------------------+----------+
|[0.2554504562575961,0.7445495437424039]  |1.0       |
|[0.7763149003135102,0.22368509968648975] |0.0       |
Convert probability column to string using my UDF
val tmp = predictions
.withColumn("probabilityStr" , vecToString($"probability"))
And this is where the above error occurs.
Also tried:
val vecToString = udf( (xs: Array[Double]) => xs.mkString(";"))
AnalysisException: cannot resolve 'UDF(probability)' due to data type mismatch: argument 1 requires array<double> type, however, '`probability`' is of struct<type:tinyint,size:int,indices:array<int>,values:array<double>> type.;;
When I used a different model (not LightGBM), this works fine. Could it be due to the type of model used?

Join in Spark returns duplicates when implicit data types don't match

I am getting duplicates when joining two dataframes where one key is a decimal and the other is a string. It seems that Spark is converting the decimal to a string, which results in a scientific-notation expression, but it then shows the original result in decimal form just fine. I found a workaround by converting to string directly, but this seems dangerous, as duplicates are created without warning.
Is this a bug? How can I detect when this is happening?
Here's a demo in pyspark on Spark 2.4:
>>> from pyspark.sql.functions import *
>>> from pyspark.sql.types import *
>>> df1 = spark.createDataFrame([('a', 9223372034559809871), ('b', 9223372034559809771)], ['group', 'id_int'])
>>> df1=df1.withColumn('id',col('id_int').cast(DecimalType(38,0)))
>>>
>>> df1.show()
+-----+-------------------+-------------------+
|group|             id_int|                 id|
+-----+-------------------+-------------------+
|    a|9223372034559809871|9223372034559809871|
|    b|9223372034559809771|9223372034559809771|
+-----+-------------------+-------------------+
>>>
>>> df2= spark.createDataFrame([(1, '9223372034559809871'), (2, '9223372034559809771')], ['value', 'id'])
>>> df2.show()
+-----+-------------------+
|value|                 id|
+-----+-------------------+
|    1|9223372034559809871|
|    2|9223372034559809771|
+-----+-------------------+
>>>
>>> df1.join(df2, ["id"]).show()
+-------------------+-----+-------------------+-----+
|                 id|group|             id_int|value|
+-------------------+-----+-------------------+-----+
|9223372034559809871|    a|9223372034559809871|    1|
|9223372034559809871|    a|9223372034559809871|    2|
|9223372034559809771|    b|9223372034559809771|    1|
|9223372034559809771|    b|9223372034559809771|    2|
+-------------------+-----+-------------------+-----+
>>> df1.dtypes
[('group', 'string'), ('id_int', 'bigint'), ('id', 'decimal(38,0)')]
It's happening because of the very large values in the join keys: since one key is a decimal and the other a string, Spark casts both sides to double for the comparison, and at this magnitude the values lose precision and compare as equal.
I tweaked the values in the join keys and it gives the proper results:
from pyspark.sql.functions import col
from pyspark.sql.types import *

df1 = spark.createDataFrame([('a', 9223372034559809871), ('b', 9123372034559809771)],
                            ['group', 'id_int'])
df1 = df1.withColumn('id', col('id_int').cast(DecimalType(38, 0)))
df2 = spark.createDataFrame([(1, '9223372034559809871'), (2, '9123372034559809771')],
                            ['value', 'id'])
df1.join(df2, df1["id"] == df2["id"], "inner").show()

unfold vector column into normal columns in DataFrame [duplicate]

This question already has answers here:
How to split Vector into columns - using PySpark [duplicate]
(4 answers)
Closed 3 years ago.
I want to unfold a vector column into normal columns in a dataframe. The code below creates the individual columns, but there is something wrong with the datatypes or 'nullable' that gives an error when I try to .show – see the example code below. How do I fix the problem?
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import udf

spark = SparkSession\
    .builder\
    .config("spark.driver.maxResultSize", "40g") \
    .config('spark.sql.shuffle.partitions', '2001') \
    .getOrCreate()

data = [(0.2, 53.3, 0.2, 53.3),
        (1.1, 43.3, 0.3, 51.3),
        (2.6, 22.4, 0.4, 43.3),
        (3.7, 25.6, 0.2, 23.4)]
df = spark.createDataFrame(data, ['A','B','C','D'])
df.show(3)
df.printSchema()
vecAssembler = VectorAssembler(inputCols=['C','D'], outputCol="features")
new_df = vecAssembler.transform(df)
new_df.printSchema()
new_df.show(3)
split1_udf = udf(lambda value: value[0], DoubleType())
split2_udf = udf(lambda value: value[1], DoubleType())
new_df = new_df.withColumn('c1', split1_udf('features')).withColumn('c2', split2_udf('features'))
new_df.printSchema()
new_df.show(3)
I don't know what the problem with the UDFs is, but I found another solution, shown below.
data = [(0.2, 53.3, 0.2, 53.3),
        (1.1, 43.3, 0.3, 51.3),
        (2.6, 22.4, 0.4, 43.3),
        (3.7, 25.6, 0.2, 23.4)]
df = spark.createDataFrame(data, ['A','B','C','D'])
vecAssembler = VectorAssembler(inputCols=['C','D'], outputCol="features")
new_df = vecAssembler.transform(df)
def extract(row):
    return (row.A, row.B, row.C, row.D,) + tuple(row.features.toArray().tolist())
extracted_df = new_df.rdd.map(extract).toDF(['A','B','C','D', 'col1', 'col2'])
extracted_df.show()
The features column contains values of type pyspark.ml.linalg.DenseVector, and the vector elements are of type numpy.float64 rather than native Python floats.
To convert the numpy dtypes to native Python types, call value.item():
split1_udf = udf(lambda value: value[0].item(), DoubleType())
split2_udf = udf(lambda value: value[1].item(), DoubleType())
Using this fix produces the following output:
+---+----+---+----+----------+---+----+
|  A|   B|  C|   D|  features| c1|  c2|
+---+----+---+----+----------+---+----+
|0.2|53.3|0.2|53.3|[0.2,53.3]|0.2|53.3|
|1.1|43.3|0.3|51.3|[0.3,51.3]|0.3|51.3|
|2.6|22.4|0.4|43.3|[0.4,43.3]|0.4|43.3|
|3.7|25.6|0.2|23.4|[0.2,23.4]|0.2|23.4|
+---+----+---+----+----------+---+----+
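For what it's worth, on Spark 3.0+ the UDFs can be avoided entirely with pyspark.ml.functions.vector_to_array; a minimal sketch using the same vecAssembler and df as above:
from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

# Explode the vector into a plain array, then pick its elements by index.
new_df = (vecAssembler.transform(df)
          .withColumn("arr", vector_to_array("features"))
          .withColumn("c1", col("arr")[0])
          .withColumn("c2", col("arr")[1])
          .drop("arr"))
new_df.show(3)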

Get first non-null values in group by (Spark 1.6)

How can I get the first non-null values from a group by? I tried using first with coalesce, F.first(F.coalesce("code")), but I don't get the desired behavior (I seem to get the first row).
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F
sc = SparkContext("local")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([
("a", None, None),
("a", "code1", None),
("a", "code2", "name2"),
], ["id", "code", "name"])
I tried:
(df
.groupby("id")
.agg(F.first(F.coalesce("code")),
F.first(F.coalesce("name")))
.collect())
DESIRED OUTPUT
[Row(id='a', code='code1', name='name2')]
For Spark 1.3 - 1.5, this could do the trick:
from pyspark.sql import functions as F
df.groupBy(df['id']).agg(F.first(df['code']), F.first(df['name'])).show()
+---+-----------+-----------+
| id|FIRST(code)|FIRST(name)|
+---+-----------+-----------+
|  a|      code1|      name2|
+---+-----------+-----------+
Edit
Apparently, in version 1.6 they changed the way the first aggregate function is processed. Now the underlying class First should be constructed with a second ignoreNullsExpr argument, which is not yet used by the first aggregate function (as can be seen here). However, in Spark 2.0 it will be possible to call agg(F.first(col, True)) to ignore nulls (as can be checked here).
Therefore, for Spark 1.6 the approach must be different and, unfortunately, a little less efficient. One idea is the following:
from pyspark.sql import functions as F
df1 = df.select('id', 'code').filter(df['code'].isNotNull()).groupBy(df['id']).agg(F.first(df['code']))
df2 = df.select('id', 'name').filter(df['name'].isNotNull()).groupBy(df['id']).agg(F.first(df['name']))
result = df1.join(df2, 'id')
result.show()
+---+-------------+-------------+
| id|first(code)()|first(name)()|
+---+-------------+-------------+
|  a|        code1|        name2|
+---+-------------+-------------+
Maybe there is a better option. I'll edit the answer if I find one.
Because I only had one non-null value for every grouping, using min / max in 1.6 worked for my purposes:
(df
.groupby("id")
.agg(F.min("code"),
F.min("name"))
.show())
+---+---------+---------+
| id|min(code)|min(name)|
+---+---------+---------+
|  a|    code1|    name2|
+---+---------+---------+
The first function accepts an ignorenulls argument that can be set to true.
Python:
from pyspark.sql.functions import col, first
df.groupby("id").agg(first(col("code"), ignorenulls=True).alias("code"))
Scala:
df.groupBy("id").agg(first(col("code"), ignoreNulls = true).alias("code"))
