How to group row values into a column based on an identifier? - apache-spark

Please see this example; I am trying to achieve this using Spark SQL / Spark Scala, but did not find any direct solution. Please let me know if it is not possible with Spark SQL / Spark Scala; in that case I can write a Java/Python program that writes a file out of the as-is data.

github: https://github.com/mvasyliv/LearningSpark/blob/master/src/main/scala/spark/GroupListValueToColumn.scala
source code
package spark

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object GroupListValueToColumn extends App {

  val spark = SparkSession.builder()
    .master("local")
    .appName("Mapper")
    .getOrCreate()

  case class Customer(
    cust_id: Int,
    addresstype: String
  )

  import spark.implicits._

  val source = Seq(
    Customer(300312008, "credit_card"),
    Customer(300312008, "to"),
    Customer(300312008, "from"),
    Customer(300312009, "to"),
    Customer(300312009, "from"),
    Customer(300312010, "to"),
    Customer(300312010, "credit_card"),
    Customer(300312010, "from")
  ).toDF()

  val res = source.groupBy("cust_id").agg(collect_list("addresstype"))
  res.show(false)
  // +---------+-------------------------+
  // |cust_id  |collect_list(addresstype)|
  // +---------+-------------------------+
  // |300312010|[to, credit_card, from]  |
  // |300312008|[credit_card, to, from]  |
  // |300312009|[to, from]               |
  // +---------+-------------------------+

  val res1 = source.groupBy("cust_id").agg(collect_set("addresstype"))
  res1.show(false)
  // +---------+------------------------+
  // |cust_id  |collect_set(addresstype)|
  // +---------+------------------------+
  // |300312010|[from, to, credit_card] |
  // |300312008|[from, to, credit_card] |
  // |300312009|[from, to]              |
  // +---------+------------------------+
}

Since answers are being given as opposed to good googling:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, "a"),
  (1, "c"),
  (2, "e")
).toDF("k", "v")

val df1 = df.groupBy("k").agg(collect_list("v"))
df1.show
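Since the question also mentions Spark SQL, here is a minimal sketch of the same grouping expressed in SQL, assuming the source DataFrame and spark session from the code above are in scope. The concat_ws variant is included in case a single delimited string column is preferred over an array.

// Register the DataFrame so it can be queried with Spark SQL.
source.createOrReplaceTempView("customer")

// Array column, same result as collect_list above.
val resSql = spark.sql(
  """SELECT cust_id, collect_list(addresstype) AS addresstypes
    |FROM customer
    |GROUP BY cust_id""".stripMargin)
resSql.show(false)

// Single comma-separated string column, using collect_set to drop duplicates.
val resStr = spark.sql(
  """SELECT cust_id, concat_ws(',', collect_set(addresstype)) AS addresstypes
    |FROM customer
    |GROUP BY cust_id""".stripMargin)
resStr.show(false)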

Related

How to extract urls from HYPERLINKS in excel file when reading into scala spark dataframe

I have an Excel file with Column A containing HYPERLINKS like this:
=HYPERLINK("https://google.com","View Link")
I can load the Excel file into a Scala Spark DataFrame using the com.crealytics.spark.excel library, but only the 'View Link' display text comes through, which does NOT contain the URL:
import org.apache.spark.sql._
import org.apache.spark.sql.types._

object Tut {
  def main(args: Array[String]): Unit = {
    println("started")

    val spark = SparkSession
      .builder()
      .appName("MySpark")
      .config("spark.master", "local")
      .getOrCreate()

    val customSchema = StructType(Array(
      StructField("A", StringType, nullable = false),
      StructField("B", IntegerType, nullable = false)))

    val df = spark.read.format("com.crealytics.spark.excel")
      .option("useHeader", "true").schema(customSchema)
      .option("dataAddress", "A1")
      .load("/MY_PATH/src/main/resources/SampFile.xlsx")

    df.printSchema()
    df.show()
  }
}
My goal is to load the entire content of the HYPERLINK as a string:
=HYPERLINK("https://google.com","View Link")
and then extract the URL:
https://google.com
Do you know if there is a way to do this using the com.crealytics.spark.excel library or any other Spark library? Thanks in advance!
Regarding the other question you linked in the comments: there they try to read the column as BinaryType and cast it directly to StringType. That is not possible out of the box (even in Scala itself), because you need to know how to interpret the bytes to produce a human-readable string, for instance which encoding was used.
So we need to define a custom approach. I used a sample in-code DataFrame, and this approach worked:
scala> import spark.implicits._
import spark.implicits._
scala> val df = Seq(
| ("ddd".getBytes, 1)
| ).toDF("A", "B")
df: org.apache.spark.sql.DataFrame = [A: binary, B: int]
scala> val btos: Array[Byte] => String = bytes => new String(bytes) // short for bytes to string
btos: Array[Byte] => String = $Lambda$2322/665683021@738f6e44
scala> spark.udf.register("btos", btos)
res0: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$2322/665683021@738f6e44,StringType,List(Some(class[value[0]: binary])),Some(btos),true,true)
scala> df.withColumn("C", expr("btos(A)")).show
+----------+---+---+
| A| B| C|
+----------+---+---+
|[64 64 64]| 1|ddd|
+----------+---+---+
Hope this works for you.
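A possible follow-up for the URL-extraction goal itself, not from the original answer: once the raw formula text is available, regexp_extract can pull out the quoted URL, and the built-in decode function can replace the UDF for the binary-to-string step. A rough sketch, where the UTF-8 charset is an assumption about the file's encoding:

import org.apache.spark.sql.functions.{col, decode, regexp_extract}

// If column A arrives as binary, decode turns it into a string without a UDF.
val asText = df.withColumn("A_str", decode(col("A"), "UTF-8"))

// Pull the first quoted http(s) URL out of the =HYPERLINK(...) formula text.
val withUrl = asText.withColumn("url",
  regexp_extract(col("A_str"), "\"(https?://[^\"]+)\"", 1))
withUrl.show(false)

Of course, this only helps if the full =HYPERLINK(...) text can be loaded in the first place, which is the part the question is really about.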

Apache spark - Window Function , FIRST_VALUE do not work

I have a problem with the window function Spark API.
My question is similar to this one: How to drop duplicates using conditions
I have a dataset :
+---+----------+---------+
| ID| VALUEE| OTHER|
+---+----------+---------+
| 1| null|something|
| 1|[1.0, 0.0]|something|
| 1|[1.0, 0.0]|something|
| 1|[0.0, 2.0]|something|
| 1|[3.0, 5.0]|something|
| 2|[3.0, 5.0]|something|
| 1|[3.0, 5.0]|something|
| 2| null|something|
| 3|[3.0, 5.0]|something|
| 4| null|something|
+---+----------+---------+
I want to keep only one row per ID (no duplicates). I don't care which VALUEE is kept, but I prefer a non-null value.
Expected result:
+---+----------+---------+
| ID| VALUEE| OTHER|
+---+----------+---------+
| 1|[0.0, 2.0]|something|
| 3|[3.0, 5.0]|something|
| 4| null|something|
| 2|[3.0, 5.0]|something|
+---+----------+---------+
A window function with the aggregate function first() does not work, whereas with row_number() it works, but I don't understand why first() does not work.
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.*;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.spark_project.guava.collect.ImmutableList;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import static org.apache.spark.sql.types.DataTypes.IntegerType;
import static org.apache.spark.sql.types.DataTypes.StringType;
import static org.apache.spark.sql.types.DataTypes.createStructField;

public class TestSOF {

    public static void main(String[] args) {

        StructType schema = new StructType(
            new StructField[]{
                createStructField("ID", IntegerType, false),
                createStructField("VALUEE", DataTypes.createArrayType(DataTypes.DoubleType), true),
                createStructField("OTHER", StringType, true),
            });

        double[] a = new double[]{1.0, 0.0};
        double[] b = new double[]{3.0, 5.0};
        double[] c = new double[]{0.0, 2.0};

        List<Row> listOfdata = new ArrayList<>();
        listOfdata.add(RowFactory.create(1, null, "something"));
        listOfdata.add(RowFactory.create(1, a, "something"));
        listOfdata.add(RowFactory.create(1, a, "something"));
        listOfdata.add(RowFactory.create(1, c, "something"));
        listOfdata.add(RowFactory.create(1, b, "something"));
        listOfdata.add(RowFactory.create(2, b, "something"));
        listOfdata.add(RowFactory.create(1, b, "something"));
        listOfdata.add(RowFactory.create(2, null, "something"));
        listOfdata.add(RowFactory.create(3, b, "something"));
        listOfdata.add(RowFactory.create(4, null, "something"));

        List<Row> rowList = ImmutableList.copyOf(listOfdata);

        SparkSession sparkSession = new SparkSession.Builder().config("spark.master", "local[*]").getOrCreate();
        sparkSession.sparkContext().setLogLevel("ERROR");

        Dataset<Row> dataset = sparkSession.createDataFrame(rowList, schema);
        dataset.show();

        WindowSpec windowSpec = Window.partitionBy(dataset.col("ID")).orderBy(dataset.col("VALUEE").asc_nulls_last());

        // groupBy solution - works, but loses information (the OTHER column)
        Dataset<Row> dataset0 = dataset.groupBy("ID").agg(functions.first(dataset.col("VALUEE"), true));

        // window solution with row_number() - works
        Dataset<Row> dataset1 = dataset.withColumn("new", functions.row_number().over(windowSpec)).where("new = 1").drop("new");

        // window solution with first() - does not work
        Dataset<Row> dataset2 = dataset.withColumn("new", functions.first("VALUEE", true).over(windowSpec)).drop("new");

        JavaRDD<Row> rdd =
            dataset.toJavaRDD()
                .groupBy(row -> row.getAs("ID"))
                .map(g -> {
                    Iterator<Row> iter = g._2.iterator();
                    Row rst = null;
                    Row tmp;
                    while (iter.hasNext()) {
                        tmp = iter.next();
                        if (tmp.getAs("VALUEE") != null) {
                            rst = tmp;
                            break;
                        }
                        if (rst == null) {
                            rst = tmp;
                        }
                    }
                    return rst;
                });

        Dataset<Row> dataset3 = sparkSession.createDataFrame(rdd, schema);

        dataset0.show();
        dataset1.show();
        dataset2.show();
        dataset3.show();
    }
}
first is not a window function in Spark 2.3; it is only an aggregate function.
firstValue is not present in the DataFrame API.
You can use an equivalent solution to the one you posted. In your case, the null values will sort first, so:
val df: DataFrame = ???
import df.sparkSession.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last}

val id_cols = "ID"
val windowSpec = Window.partitionBy(id_cols).orderBy($"VALUEE".asc)
val list_cols = Seq("VALUEE", "OTHER")
val df_dd = df.select(col(id_cols) +: list_cols.map(x => last(col(x)).over(windowSpec).alias(x)): _*).distinct
For the example data you provided, the short version of the groupBy solution you already have is:
dataset.groupBy("ID").agg(functions.first(dataset.col("VALUEE"), true)).show();
For an understanding of window functions and of the relative performance of window functions vs. groupBy in Spark, I strongly recommend these presentations by Jacek Laskowski:
https://databricks.com/session/from-basic-to-advanced-aggregate-operators-in-apache-spark-sql-2-2-by-examples-and-their-catalyst-optimizations
https://databricks.com/session/from-basic-to-advanced-aggregate-operators-in-apache-spark-sql-2-2-by-examples-and-their-catalyst-optimizations-continues
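As a small addition, not from the original answer: the groupBy shortcut above keeps only VALUEE and drops OTHER. Here is a sketch in Scala (assuming a DataFrame named dataset with the same columns) that keeps the remaining column by aggregating it with first(..., ignoreNulls = true) as well; this matches the expected output here because OTHER is constant per ID.

import org.apache.spark.sql.functions.first

// Keep one row per ID, preferring non-null values, column by column.
val deduped = dataset
  .groupBy("ID")
  .agg(
    first("VALUEE", ignoreNulls = true).as("VALUEE"),
    first("OTHER", ignoreNulls = true).as("OTHER"))
deduped.show()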

Spark DataFrame: How to specify schema when writing as Avro

I want to write a DataFrame in Avro format using a provided Avro schema rather than Spark's auto-generated schema. How can I tell Spark to use my custom schema on write?
After applying the patch in https://github.com/databricks/spark-avro/pull/222/, I was able to specify a schema on write as follows:
df.write.option("forceSchema", myCustomSchemaString).avro("/path/to/outputDir")
Hope the method below helps.
import org.apache.spark.sql.types._
val schema = StructType( StructField("title", StringType, true) ::StructField("averageRating", DoubleType, false) ::StructField("numVotes", IntegerType, false) :: Nil)
titleMappedDF.write.option("avroSchema", schema.toString).avro("/home/cloudera/workspace/movies/avrowithschema")
Example:
Download the data from https://datasets.imdbws.com/ (the movies ratings file title.ratings.tsv.gz).
Copy it to /home/cloudera/workspace/movies/title.ratings.tsv.gz.
Start spark-shell and run the commands below.
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val title = sqlContext.read.text("file:///home/cloudera/Downloads/movies/title.ratings.tsv.gz")
scala> title.limit(5).show
+--------------------+
| value|
+--------------------+
|tconst averageRat...|
| tt0000001 5.8 1350|
| tt0000002 6.5 157|
| tt0000003 6.6 933|
| tt0000004 6.4 93|
+--------------------+
val titlerdd = title.rdd
case class Title(titleId:String, averageRating:Float, numVotes:Int)
val titlefirst = titlerdd.first
val titleMapped = titlerdd.filter(e => e != titlefirst).map(e => {
  val rowStr = e.getString(0)
  val splitted = rowStr.split("\t")
  val titleId = splitted(0).trim
  val averageRating = scala.util.Try(splitted(1).trim.toFloat) getOrElse(0.0f)
  val numVotes = scala.util.Try(splitted(2).trim.toInt) getOrElse(0)
  Title(titleId, averageRating, numVotes)
})
val titleMappedDF = titleMapped.toDF
scala> titleMappedDF.limit(2).show
+---------+-------------+--------+
| titleId|averageRating|numVotes|
+---------+-------------+--------+
|tt0000001| 5.8| 1350|
|tt0000002| 6.5| 157|
+---------+-------------+--------+
import org.apache.spark.sql.types._
val schema = StructType( StructField("title", StringType, true) ::StructField("averageRating", DoubleType, false) ::StructField("numVotes", IntegerType, false) :: Nil)
titleMappedDF.write.option("avroSchema", schema.toString).avro("/home/cloudera/workspace/movies/avrowithschema")
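One caveat worth hedging: with the built-in Avro data source in Spark 2.4+ (the spark-avro package must be on the classpath), the avroSchema option expects an Avro JSON schema string rather than a Spark StructType's toString. A sketch under that assumption, with field names matching the titleMappedDF columns:

// Avro JSON schema matching the titleId / averageRating / numVotes columns.
val avroSchemaJson =
  """{
    |  "type": "record",
    |  "name": "TitleRating",
    |  "fields": [
    |    {"name": "titleId", "type": "string"},
    |    {"name": "averageRating", "type": "float"},
    |    {"name": "numVotes", "type": "int"}
    |  ]
    |}""".stripMargin

titleMappedDF.write
  .format("avro")
  .option("avroSchema", avroSchemaJson)
  .save("/home/cloudera/workspace/movies/avrowithschema")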

an rdd char is to be converted into a dataframe

The RDD data is to be converted into a DataFrame, but I am unable to do so: toDF is not working. I also tried converting an array RDD to a DataFrame. Kindly advise me. This program parses a sample Excel file using Scala and Spark.
import java.io.{File, FileInputStream}
import org.apache.poi.xssf.usermodel.XSSFCell
import org.apache.poi.xssf.usermodel.{XSSFSheet, XSSFWorkbook}
import org.apache.poi.ss.usermodel.Cell._
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

object excel {
  def main(args: Array[String]) = {
    val sc = new SparkContext(new SparkConf().setAppName("Excel Parsing").setMaster("local[*]"))
    val file = new FileInputStream(new File("test.xlsx"))
    val wb = new XSSFWorkbook(file)
    val sheet = wb.getSheetAt(0)
    val rowIterator = sheet.iterator()
    val builder = StringBuilder.newBuilder
    var column = ""
    while (rowIterator.hasNext()) {
      val row = rowIterator.next();
      val cellIterator = row.cellIterator();
      while (cellIterator.hasNext()) {
        val cell = cellIterator.next();
        cell.getCellType match {
          case CELL_TYPE_NUMERIC ⇒ builder.append(cell.getNumericCellValue + ",")
          case CELL_TYPE_BOOLEAN ⇒ builder.append(cell.getBooleanCellValue + ",")
          case CELL_TYPE_STRING  ⇒ builder.append(cell.getStringCellValue + ",")
          case CELL_TYPE_BLANK   ⇒ builder.append(",")
        }
      }
      column = builder.toString()
      println(column)
      builder.setLength(0)
    }
    val data = sc.parallelize(column)
    println(data)
  }
}
To convert a Spark RDD to a DataFrame, you have to create a SQLContext or SparkSession, depending on your Spark version, and then use:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
In case you are using Spark 2.0 or above, use SparkSession instead, as SQLContext is deprecated in the newer releases:
val spark = SparkSession.builder.config(conf).getOrCreate()
import spark.implicits._
This will allow you to use toDF on the RDD.
This might solve your problem!
Note: to use the SQLContext you have to include spark-sql as a dependency.
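To connect this to the question's code, here is a rough sketch (the column names and arity are illustrative, not taken from the original program) that accumulates each parsed row into a buffer instead of overwriting a single string, and then uses toDF via the SparkSession implicits:

import org.apache.spark.sql.SparkSession
import scala.collection.mutable.ArrayBuffer

val spark = SparkSession.builder
  .appName("Excel Parsing")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Collect one comma-separated line per Excel row inside the while loop,
// e.g. rows += builder.toString() instead of column = builder.toString().
val rows = ArrayBuffer[String]()

val df = spark.sparkContext
  .parallelize(rows)
  .map(_.split(",", -1))
  .map(a => (a(0), a(1)))   // adjust to the real number of columns
  .toDF("col1", "col2")     // illustrative column names
df.show()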

Spark group by - Pig conversion

I am trying to achieve something like this in spark. The following code snippet is from Pig Latin. Is there anyway I can do the same thing with Spark?
A = load 'student' AS (name:chararray, age:int, gpa:float);
DESCRIBE A;
A: {name: chararray,age: int,gpa: float}
DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
B = GROUP A BY age;
Result:
(18,{(John,18,4.0F),(Joe,18,3.8F)})
(19,{(Mary,19,3.8F)})
(20,{(Bill,20,3.9F)})
Thanks.
It's easy to get a list of names per age. I believe the Spark API doesn't let you collect complete rows into a row list in quite the same direct way.
// Input data
val df = {
  import org.apache.spark.sql._
  import org.apache.spark.sql.types._
  import scala.collection.JavaConverters._

  val simpleSchema = StructType(
    StructField("name", StringType) ::
    StructField("age", IntegerType) ::
    StructField("gpa", FloatType) :: Nil)

  val data = List(
    Row("John", 18, 4.0f),
    Row("Mary", 19, 3.8f),
    Row("Bill", 20, 3.9f),
    Row("Joe", 18, 3.8f)
  )
  spark.createDataFrame(data.asJava, simpleSchema)
}
df.show()

// Needed for col and collect_list when not running in spark-shell.
import org.apache.spark.sql.functions.{col, collect_list}
val df2 = df.groupBy(col("age")).agg(collect_list(col("name")))
df2.show()
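If the goal is the Pig-style output with complete tuples per group, one option (my own addition, not part of the answer above) is to collect a struct of all the columns per age:

import org.apache.spark.sql.functions.{col, collect_list, struct}

// Collect the complete rows per age, similar to Pig's bag of tuples.
val df3 = df
  .groupBy(col("age"))
  .agg(collect_list(struct(col("name"), col("age"), col("gpa"))).as("rows"))
df3.show(false)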
