Apache Spark - Window function FIRST_VALUE does not work

I have a problem with the Spark window function API.
My question is similar to this one: How to drop duplicates using conditions
I have a dataset:
+---+----------+---------+
| ID| VALUEE| OTHER|
+---+----------+---------+
| 1| null|something|
| 1|[1.0, 0.0]|something|
| 1|[1.0, 0.0]|something|
| 1|[0.0, 2.0]|something|
| 1|[3.0, 5.0]|something|
| 2|[3.0, 5.0]|something|
| 1|[3.0, 5.0]|something|
| 2| null|something|
| 3|[3.0, 5.0]|something|
| 4| null|something|
+---+----------+---------+
I want to keep only one row per ID (no duplicates). I don't care which VALUEE is kept, but I prefer a non-null value.
Expected result:
+---+----------+---------+
| ID| VALUEE| OTHER|
+---+----------+---------+
| 1|[0.0, 2.0]|something|
| 3|[3.0, 5.0]|something|
| 4| null|something|
| 2|[3.0, 5.0]|something|
+---+----------+---------+
A window function with the aggregate function first() does not work, whereas with row_number() it works, and I don't understand why first() does not work.
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.*;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.spark_project.guava.collect.ImmutableList;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import static org.apache.spark.sql.types.DataTypes.IntegerType;
import static org.apache.spark.sql.types.DataTypes.StringType;
import static org.apache.spark.sql.types.DataTypes.createStructField;

public class TestSOF {

    public static void main(String[] args) {

        StructType schema = new StructType(
                new StructField[]{
                        createStructField("ID", IntegerType, false),
                        createStructField("VALUEE", DataTypes.createArrayType(DataTypes.DoubleType), true),
                        createStructField("OTHER", StringType, true),
                });

        double[] a = new double[]{1.0, 0.0};
        double[] b = new double[]{3.0, 5.0};
        double[] c = new double[]{0.0, 2.0};

        List<Row> listOfdata = new ArrayList<>();
        listOfdata.add(RowFactory.create(1, null, "something"));
        listOfdata.add(RowFactory.create(1, a, "something"));
        listOfdata.add(RowFactory.create(1, a, "something"));
        listOfdata.add(RowFactory.create(1, c, "something"));
        listOfdata.add(RowFactory.create(1, b, "something"));
        listOfdata.add(RowFactory.create(2, b, "something"));
        listOfdata.add(RowFactory.create(1, b, "something"));
        listOfdata.add(RowFactory.create(2, null, "something"));
        listOfdata.add(RowFactory.create(3, b, "something"));
        listOfdata.add(RowFactory.create(4, null, "something"));
        List<Row> rowList = ImmutableList.copyOf(listOfdata);

        SparkSession sparkSession = new SparkSession.Builder().config("spark.master", "local[*]").getOrCreate();
        sparkSession.sparkContext().setLogLevel("ERROR");
        Dataset<Row> dataset = sparkSession.createDataFrame(rowList, schema);
        dataset.show();

        WindowSpec windowSpec = Window.partitionBy(dataset.col("ID")).orderBy(dataset.col("VALUEE").asc_nulls_last());

        // groupBy + first solution: works, but loses the OTHER column
        Dataset<Row> dataset0 = dataset.groupBy("ID").agg(functions.first(dataset.col("VALUEE"), true));

        // window + row_number solution: works
        Dataset<Row> dataset1 = dataset.withColumn("new", functions.row_number().over(windowSpec)).where("new = 1").drop("new");

        // window + first solution: does not work
        Dataset<Row> dataset2 = dataset.withColumn("new", functions.first("VALUEE", true).over(windowSpec)).drop("new");

        // RDD solution: keep the first non-null VALUEE row per ID, or any row if all are null
        JavaRDD<Row> rdd =
                dataset.toJavaRDD()
                        .groupBy(row -> row.getAs("ID"))
                        .map(g -> {
                            Iterator<Row> iter = g._2.iterator();
                            Row rst = null;
                            Row tmp;
                            while (iter.hasNext()) {
                                tmp = iter.next();
                                if (tmp.getAs("VALUEE") != null) {
                                    rst = tmp;
                                    break;
                                }
                                if (rst == null) {
                                    rst = tmp;
                                }
                            }
                            return rst;
                        });
        Dataset<Row> dataset3 = sparkSession.createDataFrame(rdd, schema);

        dataset0.show();
        dataset1.show();
        dataset2.show();
        dataset3.show();
    }
}

first is not a window function in Spark 2.3; it is only an aggregate function, and first_value is not present in the DataFrame API.
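Given that, here is a minimal Scala sketch of the row_number alternative that the question already found to work (w, rn and deduplicated are just illustrative names):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}
// Sort each ID partition so that non-null VALUEE rows come first,
// then keep only the top-ranked row per ID; all columns are preserved.
val w = Window.partitionBy(col("ID")).orderBy(col("VALUEE").asc_nulls_last)
val deduplicated = dataset
  .withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)
  .drop("rn")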

You can use a solution equivalent to the one you posted. In your case, the null values will appear first in the ordering, so:
val df: DataFrame = ???
import df.sparkSession.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last}
val id_cols = "ID"
val windowSpec = Window.partitionBy(id_cols).orderBy($"VALUEE".asc)
val list_cols = Seq("VALUEE", "OTHER")
val df_dd = df.select(col(id_cols) +: list_cols.map(x => last(col(x)).over(windowSpec).alias(x)):_*).distinct

For the example data you provided, here is a short version of the dataset1 solution you posted:
dataset.groupBy("ID").agg(functions.first(dataset.col("VALUEE"), true)).show();
For an understanding of window functions, and of the performance of window functions vs. groupBy in Spark, I strongly recommend these presentations by Jacek Laskowski:
https://databricks.com/session/from-basic-to-advanced-aggregate-operators-in-apache-spark-sql-2-2-by-examples-and-their-catalyst-optimizations
https://databricks.com/session/from-basic-to-advanced-aggregate-operators-in-apache-spark-sql-2-2-by-examples-and-their-catalyst-optimizations-continues

Related

How to group row values into a column based on an identifier?

Please see this example. I am trying to achieve this using Spark SQL / Spark Scala, but did not find any direct solution. Please let me know if it's not possible with Spark SQL / Spark Scala; in that case I can write a Java/Python program by writing out a file of the as-is data.
github: https://github.com/mvasyliv/LearningSpark/blob/master/src/main/scala/spark/GroupListValueToColumn.scala
source code
package spark

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object GroupListValueToColumn extends App {

  val spark = SparkSession.builder()
    .master("local")
    .appName("Mapper")
    .getOrCreate()

  case class Customer(
    cust_id: Int,
    addresstype: String
  )

  import spark.implicits._

  val source = Seq(
    Customer(300312008, "credit_card"),
    Customer(300312008, "to"),
    Customer(300312008, "from"),
    Customer(300312009, "to"),
    Customer(300312009, "from"),
    Customer(300312010, "to"),
    Customer(300312010, "credit_card"),
    Customer(300312010, "from")
  ).toDF()

  val res = source.groupBy("cust_id").agg(collect_list("addresstype"))
  res.show(false)
  // +---------+-------------------------+
  // |cust_id  |collect_list(addresstype)|
  // +---------+-------------------------+
  // |300312010|[to, credit_card, from]  |
  // |300312008|[credit_card, to, from]  |
  // |300312009|[to, from]               |
  // +---------+-------------------------+

  val res1 = source.groupBy("cust_id").agg(collect_set("addresstype"))
  res1.show(false)
  // +---------+------------------------+
  // |cust_id  |collect_set(addresstype)|
  // +---------+------------------------+
  // |300312010|[from, to, credit_card] |
  // |300312008|[from, to, credit_card] |
  // |300312009|[from, to]              |
  // +---------+------------------------+
}
Since answers are being given as opposed to good googling:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
(1, "a"),
(1, "c"),
(2, "e")
).toDF("k", "v")
val df1 = df.groupBy("k").agg(collect_list("v"))
df1.show

Spark DataFrame: How to specify schema when writing as Avro

I want to write a DataFrame in Avro format using a provided Avro schema rather than Spark's auto-generated schema. How can I tell Spark to use my custom schema on write?
After applying the patch in https://github.com/databricks/spark-avro/pull/222/, I was able to specify a schema on write as follows:
df.write.option("forceSchema", myCustomSchemaString).avro("/path/to/outputDir")
Hope the method below helps.
import org.apache.spark.sql.types._
val schema = StructType( StructField("title", StringType, true) ::StructField("averageRating", DoubleType, false) ::StructField("numVotes", IntegerType, false) :: Nil)
titleMappedDF.write.option("avroSchema", schema.toString).avro("/home/cloudera/workspace/movies/avrowithschema")
Example:
Download the data from the site below: https://datasets.imdbws.com/
Download the movies data file title.ratings.tsv.gz
Copy it to the location below: /home/cloudera/workspace/movies/title.ratings.tsv.gz
Start spark-shell and type the commands below.
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val title = sqlContext.read.text("file:///home/cloudera/Downloads/movies/title.ratings.tsv.gz")
scala> title.limit(5).show
+--------------------+
| value|
+--------------------+
|tconst averageRat...|
| tt0000001 5.8 1350|
| tt0000002 6.5 157|
| tt0000003 6.6 933|
| tt0000004 6.4 93|
+--------------------+
val titlerdd = title.rdd
case class Title(titleId:String, averageRating:Float, numVotes:Int)
val titlefirst = titlerdd.first
val titleMapped = titlerdd.filter(e => e != titlefirst).map(e => {
  val rowStr = e.getString(0)
  val splitted = rowStr.split("\t")
  val titleId = splitted(0).trim
  val averageRating = scala.util.Try(splitted(1).trim.toFloat) getOrElse(0.0f)
  val numVotes = scala.util.Try(splitted(2).trim.toInt) getOrElse(0)
  Title(titleId, averageRating, numVotes)
})
val titleMappedDF = titleMapped.toDF
scala> titleMappedDF.limit(2).show
+---------+-------------+--------+
| titleId|averageRating|numVotes|
+---------+-------------+--------+
|tt0000001| 5.8| 1350|
|tt0000002| 6.5| 157|
+---------+-------------+--------+
import org.apache.spark.sql.types._
val schema = StructType( StructField("title", StringType, true) ::StructField("averageRating", DoubleType, false) ::StructField("numVotes", IntegerType, false) :: Nil)
titleMappedDF.write.option("avroSchema", schema.toString).avro("/home/cloudera/workspace/movies/avrowithschema")

Spark SQL: Get month from week number and year

I have a dataframe with "Week" and "Year" columns and need to calculate the month for them, as below:
Input:
+----+----+
|Week|Year|
+----+----+
|  50|2012|
|  50|2012|
|  50|2012|
+----+----+
Expected output:
+----+----+-----+
|Week|Year|Month|
+----+----+-----+
|  50|2012|   12|
|  50|2012|   12|
|  50|2012|   12|
+----+----+-----+
Any help would be appreciated. Thanks
Thanks to @zero323, who pointed me to the sqlContext.sql query, I converted the query into the following:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import static org.apache.spark.sql.functions.*;

public class MonthFromWeekSparkSQL {

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setAppName("MonthFromWeekSparkSQL").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

        List<Row> myList = Arrays.asList(RowFactory.create(50, 2012), RowFactory.create(50, 2012), RowFactory.create(50, 2012));
        JavaRDD<Row> myRDD = sc.parallelize(myList);

        List<StructField> structFields = new ArrayList<StructField>();

        // Create StructFields
        StructField structField1 = DataTypes.createStructField("week", DataTypes.IntegerType, true);
        StructField structField2 = DataTypes.createStructField("year", DataTypes.IntegerType, true);

        // Add StructFields into the list
        structFields.add(structField1);
        structFields.add(structField2);

        // Create a StructType from the StructFields. This will be used to create the DataFrame
        StructType schema = DataTypes.createStructType(structFields);

        DataFrame df = sqlContext.createDataFrame(myRDD, schema);
        DataFrame df2 = df.withColumn("yearAndWeek", concat(col("year"), lit(" "), col("week")))
                .withColumn("month", month(unix_timestamp(col("yearAndWeek"), "yyyy w").cast("timestamp")))
                .drop("yearAndWeek");

        df2.show();
    }
}
You actually create a new column with year and week formatted as "yyyy w", then convert it using unix_timestamp, from which you can pull the month as shown.
PS: It seems that the cast behavior was incorrect in Spark 1.5 (https://issues.apache.org/jira/browse/SPARK-11724), so in that case it is more general to do .cast("double").cast("timestamp").
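A compact Scala sketch of the same idea (assuming a DataFrame df with integer week and year columns; withMonth and the yearAndWeek intermediate name are just for illustration):
import org.apache.spark.sql.functions.{col, concat, lit, month, unix_timestamp}
// Build a "yyyy w" string, parse it with unix_timestamp, cast to timestamp,
// and extract the month; the double cast follows the SPARK-11724 workaround above.
val withMonth = df
  .withColumn("yearAndWeek", concat(col("year"), lit(" "), col("week")))
  .withColumn("month",
    month(unix_timestamp(col("yearAndWeek"), "yyyy w").cast("double").cast("timestamp")))
  .drop("yearAndWeek")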

Add a new column to a DataFrame; I want the new column to be a UUID generator

I want to add a new column, a UUID generator, to a DataFrame.
The UUID value will look something like 21534cf7-cff9-482a-a3a8-9e7244240da7.
My research:
I've tried the withColumn method in Spark:
val DF2 = DF1.withColumn("newcolname", DF1("existingcolname") + 1)
So DF2 will have an additional column newcolname with 1 added to the existing column in all rows.
But my requirement is to have a new column that can generate the UUID.
You can use the built-in Spark SQL uuid function:
.withColumn("uuid", expr("uuid()"))
A full example in Scala:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
object CreateDf extends App {

  val spark = SparkSession.builder
    .master("local[*]")
    .appName("spark_local")
    .getOrCreate()

  import spark.implicits._

  Seq(1, 2, 3).toDF("col1")
    .withColumn("uuid", expr("uuid()"))
    .show(false)
}
Output:
+----+------------------------------------+
|col1|uuid |
+----+------------------------------------+
|1 |24181c68-51b7-42ea-a9fd-f88dcfa10062|
|2 |7cd21b25-017e-4567-bdd3-f33b001ee497|
|3 |1df7cfa8-af8a-4421-834f-5359dc3ae417|
+----+------------------------------------+
You should try something like this:
import java.util.UUID
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val generateUUID = udf(() => UUID.randomUUID().toString)
val df1 = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
val df2 = df1.withColumn("UUID", generateUUID())
df1.show()
df2.show()
Output will be:
+---+-----+
| id|value|
+---+-----+
|id1| 1|
|id2| 4|
|id3| 5|
+---+-----+
+---+-----+--------------------+
| id|value| UUID|
+---+-----+--------------------+
|id1| 1|f0cfd0e2-fbbe-40f...|
|id2| 4|ec8db8b9-70db-46f...|
|id3| 5|e0e91292-1d90-45a...|
+---+-----+--------------------+
This is how we did it in Java: we had a date column and wanted to add another column with the month.
Dataset<Row> newData = data.withColumn("month", month((unix_timestamp(col("date"), "MM/dd/yyyy")).cast("timestamp")));
You can use a similar technique to add any column.
Dataset<Row> newData1 = newData.withColumn("uuid", lit(UUID.randomUUID().toString()));
Cheers!
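Note that lit(UUID.randomUUID().toString()) is evaluated once on the driver, so every row receives the same value. If each row should get its own UUID, the built-in uuid() expression from the first answer also works here; a minimal Scala sketch (newData stands for the DataFrame built above, withUuid is just an illustrative name):
import org.apache.spark.sql.functions.expr
// uuid() is evaluated per row, so each row receives a distinct identifier.
val withUuid = newData.withColumn("uuid", expr("uuid()"))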

Converting a string to double in a dataframe

I have built a dataframe using concat which produces a string.
import sqlContext.implicits._
val df = sc.parallelize(Seq((1.0, 2.0), (3.0, 4.0))).toDF("k", "v")
df.registerTempTable("df")
val dfConcat = df.select(concat($"k", lit(","), $"v").as("test"))
dfConcat: org.apache.spark.sql.DataFrame = [test: string]
+-------------+
| test|
+-------------+
| 1.0,2.0|
| 3.0,4.0|
+-------------+
How can I convert it back to double?
I have tried casting to DoubleType but I get null
import org.apache.spark.sql.types._
val testDouble = dfConcat.select( dfConcat("test").cast(DoubleType).as("test"))
+----+
|test|
+----+
|null|
|null|
+----+
and a UDF throws a NumberFormatException at run time:
import org.apache.spark.sql.functions._
val toDbl = udf[Double, String]( _.toDouble)
val testDouble = dfConcat
.withColumn("test", toDbl(dfConcat("test")))
.select("test")
You cannot convert it to double because it is simply not a valid double representation. If you want an array just use array function:
import org.apache.spark.sql.functions.array
df.select(array($"k", $"v").as("test"))
You can also try to split and convert but it is far from optimal:
import org.apache.spark.sql.types.{ArrayType, DoubleType}
import org.apache.spark.sql.functions.split
dfConcat.select(split($"test", ",").cast(ArrayType(DoubleType)))
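If the goal is to recover the individual doubles rather than an array, here is a minimal sketch built on the same split (the parts, backToDoubles, k and v names are just for illustration, mirroring the original columns):
import org.apache.spark.sql.functions.split
import org.apache.spark.sql.types.DoubleType
// Split the concatenated string and cast each piece back to a double column.
val parts = split(dfConcat("test"), ",")
val backToDoubles = dfConcat.select(
  parts.getItem(0).cast(DoubleType).as("k"),
  parts.getItem(1).cast(DoubleType).as("v")
)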
