Converting a string to double in a dataframe - apache-spark

I have built a dataframe using concat which produces a string.
import sqlContext.implicits._
import org.apache.spark.sql.functions.{concat, lit}
val df = sc.parallelize(Seq((1.0, 2.0), (3.0, 4.0))).toDF("k", "v")
df.registerTempTable("df")
val dfConcat = df.select(concat($"k", lit(","), $"v").as("test"))
dfConcat: org.apache.spark.sql.DataFrame = [test: string]
+-------------+
| test|
+-------------+
| 1.0,2.0|
| 3.0,4.0|
+-------------+
How can I convert it back to double?
I have tried casting to DoubleType, but I get null:
import org.apache.spark.sql.types._
val testDouble = dfConcat.select( dfConcat("test").cast(DoubleType).as("test"))
+----+
|test|
+----+
|null|
|null|
+----+
A UDF throws a NumberFormatException at run time:
import org.apache.spark.sql.functions._
val toDbl = udf[Double, String]( _.toDouble)
val testDouble = dfConcat
  .withColumn("test", toDbl(dfConcat("test")))
  .select("test")

You cannot convert it to a double because a string such as "1.0,2.0" is simply not a valid double representation. If you want an array, just use the array function:
import org.apache.spark.sql.functions.array
df.select(array($"k", $"v").as("test"))
You can also try to split and convert but it is far from optimal:
import org.apache.spark.sql.types.{ArrayType, DoubleType}
import org.apache.spark.sql.functions.split
dfConcat.select(split($"test", ",").cast(ArrayType(DoubleType)))
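To see why the cast yields null and the UDF throws, it may help to reproduce the parsing outside Spark. The sketch below is plain Python with no Spark involved, and parse_pair is a hypothetical helper name introduced only for illustration. It shows that the whole string is not a number, while splitting first and converting each piece works:

```python
def parse_pair(text, sep=","):
    """Split a delimited string and convert each piece to a float."""
    return [float(part) for part in text.split(sep)]


# Parsing the whole string as one number fails; this mirrors the
# NumberFormatException raised by the UDF in the question.
try:
    float("1.0,2.0")
    whole_string_parses = True
except ValueError:
    whole_string_parses = False

# Splitting first and converting each part succeeds.
pairs = [parse_pair(s) for s in ["1.0,2.0", "3.0,4.0"]]

print(whole_string_parses)  # False
print(pairs)                # [[1.0, 2.0], [3.0, 4.0]]
```

This is the same idea as the split-and-cast answer above, only expressed on ordinary strings.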

Related

Spark HiveContext get the same format as hive client select

When a Hive table has values like maps or arrays, if you select it in the Hive client they are shown as JSON, e.g.: {"a":1,"b":1} or [1,2,2].
When you select those in Spark, they are map/array objects in the DataFrame. If you stringify each row they are Map("a" -> 1, "b" -> 1) or WrappedArray(1, 2, 2).
I want to have the same format as the Hive client when using Spark's HiveContext.
How can I do this?
Spark has its own functions to convert complex objects into their JSON representation.
Here is the documentation for the org.apache.spark.sql.functions package, which also comes with the to_json function that does the following:
Converts a column containing a StructType, ArrayType of StructTypes, a MapType or ArrayType of MapTypes into a JSON string with the specified schema. Throws an exception, in the case of an unsupported type.
Here is a short example, as run in the spark-shell:
scala> val df = spark.createDataFrame(
| Seq(("hello", Map("a" -> 1)), ("world", Map("b" -> 2)))
| ).toDF("name", "map")
df: org.apache.spark.sql.DataFrame = [name: string, map: map<string,int>]
scala> df.show
+-----+-----------+
| name| map|
+-----+-----------+
|hello|Map(a -> 1)|
|world|Map(b -> 2)|
+-----+-----------+
scala> df.select($"name", to_json(struct($"map")) as "json").show
+-----+---------------+
| name| json|
+-----+---------------+
|hello|{"map":{"a":1}}|
|world|{"map":{"b":2}}|
+-----+---------------+
Here is a similar example, with arrays instead of maps:
scala> val df = spark.createDataFrame(
| Seq(("hello", Seq("a", "b")), ("world", Seq("c", "d")))
| ).toDF("name", "array")
df: org.apache.spark.sql.DataFrame = [name: string, array: array<string>]
scala> df.select($"name", to_json(struct($"array")) as "json").show
+-----+-------------------+
| name| json|
+-----+-------------------+
|hello|{"array":["a","b"]}|
|world|{"array":["c","d"]}|
+-----+-------------------+
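Note that wrapping the column in struct adds an outer key ("map" or "array") that the Hive client output does not have. As a rough illustration of the target format, here is a plain-Python sketch using only the standard json module. The helper names compact and unwrap are assumptions made up for this example, not part of any Spark API:

```python
import json


def compact(value):
    """Serialize to JSON without spaces, similar to the Hive client output."""
    return json.dumps(value, separators=(",", ":"))


def unwrap(json_text, key):
    """Drop the outer struct key from a JSON string such as {"map":{...}}."""
    return compact(json.loads(json_text)[key])


print(compact({"a": 1, "b": 1}))               # {"a":1,"b":1}
print(compact([1, 2, 2]))                      # [1,2,2]
print(unwrap('{"map":{"a":1}}', "map"))        # {"a":1}
print(unwrap('{"array":["a","b"]}', "array"))  # ["a","b"]
```

The same unwrapping would have to happen after collecting the rows, or be avoided by calling to_json on the map column directly where the Spark version in use supports that.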

Apache Spark - Window function FIRST_VALUE does not work

I have a problem with the Spark window function API.
My question is similar to this one: How to drop duplicates using conditions
I have a dataset:
+---+----------+---------+
| ID| VALUEE| OTHER|
+---+----------+---------+
| 1| null|something|
| 1|[1.0, 0.0]|something|
| 1|[1.0, 0.0]|something|
| 1|[0.0, 2.0]|something|
| 1|[3.0, 5.0]|something|
| 2|[3.0, 5.0]|something|
| 1|[3.0, 5.0]|something|
| 2| null|something|
| 3|[3.0, 5.0]|something|
| 4| null|something|
+---+----------+---------+
I want to keep only one row per ID (no duplicates). I don't care which VALUEE is kept, but I prefer a non-null value.
Expected result:
+---+----------+---------+
| ID| VALUEE| OTHER|
+---+----------+---------+
| 1|[0.0, 2.0]|something|
| 3|[3.0, 5.0]|something|
| 4| null|something|
| 2|[3.0, 5.0]|something|
+---+----------+---------+
A window function with the aggregate function first() does not work,
whereas row_number() does,
but I don't understand why first() does not work:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.*;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.spark_project.guava.collect.ImmutableList;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import static org.apache.spark.sql.types.DataTypes.IntegerType;
import static org.apache.spark.sql.types.DataTypes.StringType;
import static org.apache.spark.sql.types.DataTypes.createStructField;
public class TestSOF {
    public static void main(String[] args) {
        StructType schema = new StructType(
                new StructField[]{
                        createStructField("ID", IntegerType, false),
                        createStructField("VALUEE", DataTypes.createArrayType(DataTypes.DoubleType), true),
                        createStructField("OTHER", StringType, true),
                });
        double[] a = new double[]{1.0, 0.0};
        double[] b = new double[]{3.0, 5.0};
        double[] c = new double[]{0.0, 2.0};
        List<Row> listOfdata = new ArrayList<>();
        listOfdata.add(RowFactory.create(1, null, "something"));
        listOfdata.add(RowFactory.create(1, a, "something"));
        listOfdata.add(RowFactory.create(1, a, "something"));
        listOfdata.add(RowFactory.create(1, c, "something"));
        listOfdata.add(RowFactory.create(1, b, "something"));
        listOfdata.add(RowFactory.create(2, b, "something"));
        listOfdata.add(RowFactory.create(1, b, "something"));
        listOfdata.add(RowFactory.create(2, null, "something"));
        listOfdata.add(RowFactory.create(3, b, "something"));
        listOfdata.add(RowFactory.create(4, null, "something"));
        List<Row> rowList = ImmutableList.copyOf(listOfdata);
        SparkSession sparkSession = new SparkSession.Builder().config("spark.master", "local[*]").getOrCreate();
        sparkSession.sparkContext().setLogLevel("ERROR");
        Dataset<Row> dataset = sparkSession.createDataFrame(rowList, schema);
        dataset.show();
        WindowSpec windowSpec = Window.partitionBy(dataset.col("ID")).orderBy(dataset.col("VALUEE").asc_nulls_last());
        // wind solution
        // lost information
        Dataset<Row> dataset0 = dataset.groupBy("ID").agg(functions.first(dataset.col("VALUEE"), true));
        Dataset<Row> dataset1 = dataset.withColumn("new", functions.row_number().over(windowSpec)).where("new = 1").drop("new");
        //do not work
        Dataset<Row> dataset2 = dataset.withColumn("new", functions.first("VALUEE", true).over(windowSpec)).drop("new");
        JavaRDD<Row> rdd =
                dataset.toJavaRDD()
                        .groupBy(row -> row.getAs("ID"))
                        .map(g -> {
                            Iterator<Row> iter = g._2.iterator();
                            Row rst = null;
                            Row tmp;
                            while (iter.hasNext()) {
                                tmp = iter.next();
                                if (tmp.getAs("VALUEE") != null) {
                                    rst = tmp;
                                    break;
                                }
                                if (rst == null) {
                                    rst = tmp;
                                }
                            }
                            return rst;
                        });
        Dataset<Row> dataset3 = sparkSession.createDataFrame(rdd, schema);
        dataset0.show();
        dataset1.show();
        dataset2.show();
        dataset3.show();
    }
}
first is not a window function in Spark 2.3; it is only an aggregate function.
firstValue is not present in the DataFrame API.
You can use a solution equivalent to the one you posted. In your case, the null values sort first in ascending order, so:
val df: DataFrame = ???
import df.sparkSession.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last}
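val id_cols = "ID"
// Explicit frame: with the default frame, last() only sees rows up to the current one
val windowSpec = Window.partitionBy(id_cols).orderBy($"VALUEE".asc).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
val list_cols = Seq("VALUEE", "OTHER")
val df_dd = df.select(col(id_cols) +: list_cols.map(x => last(col(x)).over(windowSpec).alias(x)): _*).distinct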
For the example data you provided, here is a shorter alternative to your dataset1 solution:
dataset.groupBy("ID").agg(functions.first(dataset.col("VALUEE"), true)).show();
To understand window functions, and how their performance compares with groupBy in Spark, I strongly recommend these presentations by Jacek Laskowski:
https://databricks.com/session/from-basic-to-advanced-aggregate-operators-in-apache-spark-sql-2-2-by-examples-and-their-catalyst-optimizations
https://databricks.com/session/from-basic-to-advanced-aggregate-operators-in-apache-spark-sql-2-2-by-examples-and-their-catalyst-optimizations-continues
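The rule being asked for (one row per ID, preferring a non-null VALUEE) can also be written down outside Spark, which may make it easier to check what any of the solutions above should return. This is only a plain-Python model of the logic, assuming rows are (ID, VALUEE, OTHER) tuples. keep_one_per_id is a hypothetical name, and nothing here uses the Spark API:

```python
def keep_one_per_id(rows):
    """Keep one row per ID, replacing a null-valued row if a non-null one shows up."""
    chosen = {}
    for row in rows:
        row_id, value, _other = row
        current = chosen.get(row_id)
        # Take the first row seen, or upgrade from a null value to a non-null one.
        if current is None or (current[1] is None and value is not None):
            chosen[row_id] = row
    return list(chosen.values())


rows = [
    (1, None, "something"),
    (1, [1.0, 0.0], "something"),
    (1, [1.0, 0.0], "something"),
    (1, [0.0, 2.0], "something"),
    (1, [3.0, 5.0], "something"),
    (2, [3.0, 5.0], "something"),
    (1, [3.0, 5.0], "something"),
    (2, None, "something"),
    (3, [3.0, 5.0], "something"),
    (4, None, "something"),
]

result = keep_one_per_id(rows)
for row in result:
    print(row)
```

For ID 1 this keeps [1.0, 0.0] rather than the [0.0, 2.0] shown in the expected result. The question says any non-null value is acceptable, so both are valid outcomes.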

how to cast all columns of dataframe to string

I have a mixed-type dataframe.
I am reading this dataframe from a Hive table using the
spark.sql('select a,b,c from table') command.
Some columns are int, bigint or double, and others are string. There are 32 columns in total.
Is there any way in PySpark to convert all columns in the dataframe to string type?
Just:
from pyspark.sql.functions import col
table = spark.table("table")
table.select([col(c).cast("string") for c in table.columns])
Here's a one line solution in Scala :
df.select(df.columns.map(c => col(c).cast(StringType)) : _*)
Let's see an example here :
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val data = Seq(
  Row(1, "a"),
  Row(5, "z")
)
val schema = StructType(
  List(
    StructField("num", IntegerType, true),
    StructField("letter", StringType, true)
  )
)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)
df.printSchema
//root
//|-- num: integer (nullable = true)
//|-- letter: string (nullable = true)
val newDf = df.select(df.columns.map(c => col(c).cast(StringType)) : _*)
newDf.printSchema
//root
//|-- num: string (nullable = true)
//|-- letter: string (nullable = true)
I hope it helps
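from pyspark.sql.types import StringType
# Cast each column in turn; the loop variable is named c to avoid shadowing pyspark's col
for c in df_data.columns:
    df_data = df_data.withColumn(c, df_data[c].cast(StringType()))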
For Scala, spark version > 2.0
case class Row(id: Int, value: Double)
import spark.implicits._
import org.apache.spark.sql.functions._
val r1 = Seq(Row(1, 1.0), Row(2, 2.0), Row(3, 3.0)).toDF()
r1.show
+---+-----+
| id|value|
+---+-----+
| 1| 1.0|
| 2| 2.0|
| 3| 3.0|
+---+-----+
val castedDF = r1.columns.foldLeft(r1)((current, c) => current.withColumn(c, col(c).cast("String")))
castedDF.printSchema
root
|-- id: string (nullable = false)
|-- value: string (nullable = false)
You can cast a single column like this:
import pyspark.sql.functions as F
import pyspark.sql.types as T
df = df.withColumn("id", F.col("new_id").cast(T.StringType()))
The same cast can be applied to every column in turn.
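Another way to think about casting every column is to generate one SQL cast expression per column name. The sketch below is plain Python that only builds the expression strings. The helper name cast_all_to_string_exprs is an assumption for illustration, and passing the result to DataFrame.selectExpr is left as a comment because it needs a live Spark session:

```python
def cast_all_to_string_exprs(columns):
    """Build one CAST(... AS STRING) expression per column name."""
    exprs = []
    for name in columns:
        # Backticks inside a column name are escaped by doubling them.
        quoted = "`" + name.replace("`", "``") + "`"
        exprs.append(f"CAST({quoted} AS STRING) AS {quoted}")
    return exprs


exprs = cast_all_to_string_exprs(["a", "b", "c"])
for e in exprs:
    print(e)

# With a PySpark DataFrame `df` available, this could be used as:
#   df.selectExpr(*cast_all_to_string_exprs(df.columns))
```

Quoting each name with backticks keeps the expressions valid for column names that contain spaces or dots.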

Add a new column to a Dataframe. New column i want it to be a UUID generator

I want to add a new column to a Dataframe, a UUID generator.
UUID value will look something like 21534cf7-cff9-482a-a3a8-9e7244240da7
My Research:
I've tried the withColumn method in Spark:
val DF2 = DF1.withColumn("newcolname", DF1("existingcolname") + 1)
So DF2 will have an additional column, newcolname, holding existingcolname plus 1 in every row.
But my requirement is a new column that holds a generated UUID.
You can utilize built-in Spark SQL uuid function:
.withColumn("uuid", expr("uuid()"))
A full example in Scala:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
object CreateDf extends App {
  val spark = SparkSession.builder
    .master("local[*]")
    .appName("spark_local")
    .getOrCreate()
  import spark.implicits._
  Seq(1, 2, 3).toDF("col1")
    .withColumn("uuid", expr("uuid()"))
    .show(false)
}
Output:
+----+------------------------------------+
|col1|uuid |
+----+------------------------------------+
|1 |24181c68-51b7-42ea-a9fd-f88dcfa10062|
|2 |7cd21b25-017e-4567-bdd3-f33b001ee497|
|3 |1df7cfa8-af8a-4421-834f-5359dc3ae417|
+----+------------------------------------+
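You should try something like this:
import java.util.UUID
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf
val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// The UDF body runs for each row, so each row gets its own UUID
val generateUUID = udf(() => UUID.randomUUID().toString)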
val df1 = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
val df2 = df1.withColumn("UUID", generateUUID())
df1.show()
df2.show()
Output will be:
+---+-----+
| id|value|
+---+-----+
|id1| 1|
|id2| 4|
|id3| 5|
+---+-----+
+---+-----+--------------------+
| id|value| UUID|
+---+-----+--------------------+
|id1| 1|f0cfd0e2-fbbe-40f...|
|id2| 4|ec8db8b9-70db-46f...|
|id3| 5|e0e91292-1d90-45a...|
+---+-----+--------------------+
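This is how we did it in Java: we had a date column and wanted to add another column with the month.
Dataset<Row> newData = data.withColumn("month", month((unix_timestamp(col("date"), "MM/dd/yyyy")).cast("timestamp")));
You can use a similar technique to add any column. Note that the lit(...) call below is evaluated once on the driver, so every row receives the same UUID; for a distinct value per row, use the uuid() expression or a UDF as shown in the other answers.
Dataset<Row> newData1 = newData.withColumn("uuid", lit(UUID.randomUUID().toString()));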
Cheers !
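The difference between a value computed once and a value computed per row is easy to miss, so here is a small plain-Python illustration. It uses only the standard uuid module and ordinary lists in place of a DataFrame, and the variable names are made up for the example:

```python
import uuid

rows = ["id1", "id2", "id3"]

# Evaluated once and reused: every row shares the same value,
# which is what happens when a UUID is wrapped in a literal column.
shared = str(uuid.uuid4())
with_literal = [(r, shared) for r in rows]

# Evaluated per row: each row gets its own value,
# which is what a per-row function or expression gives you.
with_per_row = [(r, str(uuid.uuid4())) for r in rows]

distinct_literal = len({u for _, u in with_literal})
distinct_per_row = len({u for _, u in with_per_row})

print(distinct_literal)  # 1
print(distinct_per_row)  # 3
```

If the new column is meant to identify rows, only the per-row variant is useful.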

Convert null values to empty array in Spark DataFrame

I have a Spark data frame where one column is an array of integers. The column is nullable because it is coming from a left outer join. I want to convert all null values to an empty array so I don't have to deal with nulls later.
I thought I could do it like so:
val myCol = df("myCol")
df.withColumn( "myCol", when(myCol.isNull, Array[Int]()).otherwise(myCol) )
However, this results in the following exception:
java.lang.RuntimeException: Unsupported literal type class [I [I@5ed25612
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:49)
at org.apache.spark.sql.functions$.lit(functions.scala:89)
at org.apache.spark.sql.functions$.when(functions.scala:778)
Apparently array types are not supported by the when function. Is there some other easy way to convert the null values?
In case it is relevant, here is the schema for this column:
|-- myCol: array (nullable = true)
| |-- element: integer (containsNull = false)
You can use a UDF:
import org.apache.spark.sql.functions.{coalesce, udf, when}
val array_ = udf(() => Array.empty[Int])
combined with when or coalesce:
df.withColumn("myCol", when(myCol.isNull, array_()).otherwise(myCol))
df.withColumn("myCol", coalesce(myCol, array_())).show
In recent versions you can use the array function:
import org.apache.spark.sql.functions.{array, coalesce, when}
df.withColumn("myCol", when(myCol.isNull, array().cast("array<integer>")).otherwise(myCol))
df.withColumn("myCol", coalesce(myCol, array().cast("array<integer>"))).show
Please note that this works only if conversion from string to the desired type is allowed.
The same thing can of course be done in PySpark as well. For the legacy solutions you can define a udf:
from pyspark.sql.functions import coalesce, udf
from pyspark.sql.types import ArrayType, IntegerType
def empty_array(t):
    # t is a type class such as IntegerType; it is instantiated here
    return udf(lambda: [], ArrayType(t()))()
coalesce(myCol, empty_array(IntegerType))
and in recent versions just use array:
from pyspark.sql.functions import array, coalesce
coalesce(myCol, array().cast("array<integer>"))
With a slight modification to zero323's approach, I was able to do this without using a udf in Spark 2.3.1.
val df = Seq("a" -> Array(1,2,3), "b" -> null, "c" -> Array(7,8,9)).toDF("id","numbers")
df.show
+---+---------+
| id| numbers|
+---+---------+
| a|[1, 2, 3]|
| b| null|
| c|[7, 8, 9]|
+---+---------+
val df2 = df.withColumn("numbers", coalesce($"numbers", array()))
df2.show
+---+---------+
| id| numbers|
+---+---------+
| a|[1, 2, 3]|
| b| []|
| c|[7, 8, 9]|
+---+---------+
A UDF-free alternative, for when the element type you want cannot be cast from StringType, is the following:
import pyspark.sql.types as T
import pyspark.sql.functions as F
df.withColumn(
    "myCol",
    F.coalesce(
        F.col("myCol"),
        F.from_json(F.lit("[]"), T.ArrayType(T.IntegerType()))
    )
)
You can replace IntegerType() with any other data type, including complex ones.
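All of the answers above rely on the same coalesce idea: take the first non-null argument, with an empty array as the fallback. As a rough model of that behaviour, here is a plain-Python sketch. The coalesce function below is a hypothetical stand-in written for this example, not the Spark function, and json.loads("[]") plays the role that from_json plays in the last answer:

```python
import json


def coalesce(*values):
    """Return the first argument that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None


numbers = [[1, 2, 3], None, [7, 8, 9]]

# Parsing "[]" yields an empty list, much like from_json on the literal "[]".
cleaned = [coalesce(n, json.loads("[]")) for n in numbers]

print(cleaned)  # [[1, 2, 3], [], [7, 8, 9]]
```

Rows that already hold an array are returned unchanged; only the null entry is replaced.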

Resources