How to join on array field? - apache-spark

I have 2 dataset namely Distance and Customer, want to find out id in Customer dataset is present in id_5 of Distance dataset where the id_5 is Array of id's. Your help is greatly appreciated.
case class Distance(zip: String, id_5: Array[Int])
val dist = Seq(Distance("72712",Array(72713,72714,72715)))
val distDS=dist.toDS()
case class Customer (cust_id: Int, id: String)
val c = Seq(Customer(1,"72713"),Customer(2,"72714"),Customer(3,"72720"))
val custDS = c.toDS()
val res = distDS.joinWith(custDS,distDS.col("id_5"(??????)) === custDS.col("id"))`

Use array_contains:
import org.apache.spark.sql.functions.expr
distDS.joinWith(custDS, expr("array_contains(id_5, cust_id)"))

Related

Fetching columns dynamically from dataframe , column name would come in variable

I am not able to fetch values for given dynamic columns. Any help ?
var dynamicColumns = "col(\"one\"),col(\"two\"),col(\"three\")"
dataFrame.select(dynamicColumns)
Just use names alone:
val dynamicColumns = Seq("one", "two", "three")
dataFrame.select(dynamicColumns map col: _*)
and if you don't have control over the format, use regexp to extract names first
val dynamicColumns = "col(\"one\"),col(\"two\"),col(\"three\")"
val p = """(?<=col\(").+?(?="\))""".r
dataFrame.select(p.findAllIn(dynamicColumns) map col toSeq: _*)

How to Flatten spark dataframe Row to multiple Dataframe Rows

Hi I have a spark data frame which prints like this (single row)
[abc,WrappedArray(11918,1233),WrappedArray(46734,1234),1487530800317]
So inside a row i have wrapped array, I want to flatten it and create a dataframe which has single value for each array for example above row should transform something like this
[abc,11918,46734,1487530800317]
[abc,1233,1234,1487530800317]
So i got dataframe with 2 Rows instead of 1, So each corresponding element from wrapped array should go in new row.
Edit 1 after 1st answer:
What if i have 3 arrays in my input
WrappedArray(46734,1234,[abc,WrappedArray(11918,1233),WrappedArray(46734,1234),WrappedArray(1,2),1487530800317]
my output should be
[abc,11918,46734,1,1487530800317]
[abc,1233,1234,2,1487530800317]
Definitely not the best solution, but this would work:
case class TestFormat(a: String, b: Seq[String], c: Seq[String], d: String)
val data = Seq(TestFormat("abc", Seq("11918","1233"),
Seq("46734","1234"), "1487530800317")).toDS
val zipThem: (Seq[String], Seq[String]) => Seq[(String, String)] = _.zip(_)
val udfZip = udf(zipThem)
data.select($"a", explode(udfZip($"b", $"c")) as "tmp", $"d")
.select($"a", $"tmp._1" as "b", $"tmp._2" as "c", $"d")
.show
The problem is that by default you cannot be sure that both Sequences are of equal length.
The probably better solution would be to reformat the whole data frame into a structure that models the data, e.g.
root
-- a
-- d
-- records
---- b
---- c
Thanks for answering #swebbo, you answer helped me getting this done:
I did this:
import org.apache.spark.sql.functions.{explode, udf}
import sqlContext.implicits._
val zipColumns = udf((x: Seq[Long], y: Seq[Long], z: Seq[Long]) => (x.zip(y).zip(z)) map {
case ((a,b),c) => (a,b,c)
})
val flattened = subDf.withColumn("columns", explode(zipColumns($"col3", $"col4", $"col5"))).select(
$"col1", $"col2",
$"columns._1".alias("col3"), $"columns._2".alias("col4"), $"columns._3".alias("col5"))
flattened.show
Hope that is understandable :)

Spark Dataframe groupBy and sort results into a list

I have a Spark Dataframe and I would like to group the elements by a key and have the results as a sorted list
Currently I am using:
df.groupBy("columnA").agg(collect_list("columnB"))
How do I make the items in the list sorted ascending order?
You could try the function sort_array available in the functions package:
import org.apache.spark.sql.functions._
df.groupBy("columnA").agg(sort_array(collect_list("columnB")))
Just wanted to add another hint to the answer of Daniel de Paula regarding sort_array solution.
If you want to sort elements according to a different column, you can form a struct of two fields:
the sort by field
the result field
Since structs are sorted field by field, you'll get the order you want, all you need is to get rid of the sort by column in each element of the resulting list.
The same approach can be applied with several sort by columns when needed.
Here's an example that can be run in local spark-shell (use :paste mode):
import org.apache.spark.sql.Row
import spark.implicits._
case class Employee(name: String, department: String, salary: Double)
val employees = Seq(
Employee("JSMITH", "A", 20.0),
Employee("AJOHNSON", "A", 650.0),
Employee("CBAKER", "A", 650.2),
Employee("TGREEN", "A", 13.0),
Employee("CHORTON", "B", 111.0),
Employee("AIVANOV", "B", 233.0),
Employee("VSMIRNOV", "B", 11.0)
)
val employeesDF = spark.createDataFrame(employees)
val getNames = udf { salaryNames: Seq[Row] =>
salaryNames.map { case Row(_: Double, name: String) => name }
}
employeesDF
.groupBy($"department")
.agg(collect_list(struct($"salary", $"name")).as("salaryNames"))
.withColumn("namesSortedBySalary", getNames(sort_array($"salaryNames", asc = false)))
.show(truncate = false)
The result:
+----------+--------------------------------------------------------------------+----------------------------------+
|department|salaryNames |namesSortedBySalary |
+----------+--------------------------------------------------------------------+----------------------------------+
|B |[[111.0, CHORTON], [233.0, AIVANOV], [11.0, VSMIRNOV]] |[AIVANOV, CHORTON, VSMIRNOV] |
|A |[[20.0, JSMITH], [650.0, AJOHNSON], [650.2, CBAKER], [13.0, TGREEN]]|[CBAKER, AJOHNSON, JSMITH, TGREEN]|
+----------+--------------------------------------------------------------------+----------------------------------+

composing single insert statement in slick 3

This is the case class representing the entire row:
case class CustomerRow(id: Long, name: String, 20 other fields ...)
I have a shape case class that only 'exposes' a subset of columns and it is used when user creates/updates a customer:
case class CustomerForm(name: String, subset of all fields ...)
I can use CustomerForm for updates. However I can't use it for inserts. There are some columns not in CustomerForm that are required (not null) and can only be provided by the server. What I do now is that I create CustomerRow from CustomerForm:
def form2row(form: CustomerForm, id: Long, serverOnlyValue: Long, etc...) = CustomerRow(
id = id,
serverOnlyColumn = serverOnlyValue,
name = form.name.
// and so on for 20 more tedious lines of code
)
and use it for insert.
Is there a way to compose insert in slick so I can remove that tedious form2row function?
Something like:
(customers.map(formShape) += form) andAlsoOnTheSameRow .map(c => (c.id, c.serverOnlyColumn)) += (id, someValue)
?
Yes, You can do this like:
case class Person(name: String, email: String, address: String, id: Option[Int] = None)
case class NameAndAddress(name: String,address: String)
class PersonTable(tag: Tag) extends Table[Person](tag, "person") {
val id = column[Int]("id", O.PrimaryKey, O.AutoInc)
val name = column[String]("name")
val email = column[String]("email")
val address = column[String]("address")
//for partial insert
def nameWithAddress = (name, address)<>(NameAndAddress.tupled, NameAndAddress.unapply)
def * = (name, email, address, id.?) <> (Person.tupled, Person.unapply)
}
val personTableQuery = TableQuery[PersonTable]
// insert partial fields
personTableQuery.map(_.nameWithAddress) += NameAndAddress("abc", "xyz")
Make sure, You are aware of nullable fields they should be in form of Option[T] where T is filed type.In my example case, email should be Option[String] instead of String.

Accessing a global lookup Apache Spark

I have a list of csv files each with a bunch of category names as header columns. Each row is a list of users with a boolean value (0, 1) whether they are part of that category or not. Each of the csv files does not have the same set of header categories.
I want to create a composite csv across all the files which has the following output:
Header is a union of all the headers
Each row is a unique user with a boolean value corresponding to the category column
The way I wanted to tackle this is to create a tuple of a user_id and a unique category_id for each cell with a '1'. Then reduce all these columns for each user to get the final output.
How do I create the tuple to begin with? Can I have a global lookup for all the categories?
Example Data:
File 1
user_id,cat1,cat2,cat3
21321,,,1,
21322,1,1,1,
21323,1,,,
File 2
user_id,cat4,cat5
21321,1,,,
21323,,1,,
Output
user_id,cat1,cat2,cat3,cat4,cat5
21321,,1,1,,,
21322,1,1,1,,,
21323,1,1,,,,
Probably the title of the question is misleading in the sense that conveys a certain implementation choice as there's no need for a global lookup in order to solve the problem at hand.
In big data, there's a basic principle guiding most solutions: divide and conquer. In this case, the input CSV files could be divided in tuples of (user,category).
Any number of CSV files containing an arbitrary number of categories can be transformed to this simple format. The resulting CSV results of the union of the previous step, extraction of the total nr of categories present and some data transformation to get it in the desired format.
In code this algorithm would look like this:
import org.apache.spark.SparkContext._
val file1 = """user_id,cat1,cat2,cat3|21321,,,1|21322,1,1,1|21323,1,,""".split("\\|")
val file2 = """user_id,cat4,cat5|21321,1,|21323,,1""".split("\\|")
val csv1 = sparkContext.parallelize(file1)
val csv2 = sparkContext.parallelize(file2)
import org.apache.spark.rdd.RDD
def toTuples(csv:RDD[String]):RDD[(String, String)] = {
val headerLine = csv.first
val header = headerLine.split(",")
val data = csv.filter(_ != headerLine).map(line => line.split(","))
data.flatMap{elem =>
val merged = elem.zip(header)
val id = elem.head
merged.tail.collect{case (v,cat) if v == "1" => (id, cat)}
}
}
val data1 = toTuples(csv1)
val data2 = toTuples(csv2)
val union = data1.union(data2)
val categories = union.map{case (id, cat) => cat}.distinct.collect.sorted //sorted category names
val categoriesByUser = union.groupByKey.mapValues(v=>v.toSet)
val numericCategoriesByUser = categoriesByUser.mapValues{catSet => categories.map(cat=> if (catSet(cat)) "1" else "")}
val asCsv = numericCategoriesByUser.collect.map{case (id, cats)=> id + "," + cats.mkString(",")}
Results in:
21321,,,1,1,
21322,1,1,1,,
21323,1,,,,1
(Generating the header is simple and left as an exercise for the reader)
You dont need to do this as a 2 step process if all you need is the resulting values.
A possible design:
1/ Parse your csv. You dont mention whether your data is on a distributed FS, so i'll assume it is not.
2/ Enter your (K,V) pairs into a mutable parallelized (to take advantage of Spark) map.
pseudo-code:
val directory = ..
mutable.ParHashMap map = new mutable.ParHashMap()
while (files[i] != null)
{
val file = directory.spark.textFile("/myfile...")
val cols = file.map(_.split(","))
map.put(col[0], col[i++])
}
and then you can access your (K/V) tuples by way of an iterator on the map.

Resources