How to Flatten spark dataframe Row to multiple Dataframe Rows - apache-spark

Hi I have a spark data frame which prints like this (single row)
[abc,WrappedArray(11918,1233),WrappedArray(46734,1234),1487530800317]
So inside a row i have wrapped array, I want to flatten it and create a dataframe which has single value for each array for example above row should transform something like this
[abc,11918,46734,1487530800317]
[abc,1233,1234,1487530800317]
So i got dataframe with 2 Rows instead of 1, So each corresponding element from wrapped array should go in new row.
Edit 1 after 1st answer:
What if i have 3 arrays in my input
WrappedArray(46734,1234,[abc,WrappedArray(11918,1233),WrappedArray(46734,1234),WrappedArray(1,2),1487530800317]
my output should be
[abc,11918,46734,1,1487530800317]
[abc,1233,1234,2,1487530800317]

Definitely not the best solution, but this would work:
case class TestFormat(a: String, b: Seq[String], c: Seq[String], d: String)
val data = Seq(TestFormat("abc", Seq("11918","1233"),
Seq("46734","1234"), "1487530800317")).toDS
val zipThem: (Seq[String], Seq[String]) => Seq[(String, String)] = _.zip(_)
val udfZip = udf(zipThem)
data.select($"a", explode(udfZip($"b", $"c")) as "tmp", $"d")
.select($"a", $"tmp._1" as "b", $"tmp._2" as "c", $"d")
.show
The problem is that by default you cannot be sure that both Sequences are of equal length.
The probably better solution would be to reformat the whole data frame into a structure that models the data, e.g.
root
-- a
-- d
-- records
---- b
---- c

Thanks for answering #swebbo, you answer helped me getting this done:
I did this:
import org.apache.spark.sql.functions.{explode, udf}
import sqlContext.implicits._
val zipColumns = udf((x: Seq[Long], y: Seq[Long], z: Seq[Long]) => (x.zip(y).zip(z)) map {
case ((a,b),c) => (a,b,c)
})
val flattened = subDf.withColumn("columns", explode(zipColumns($"col3", $"col4", $"col5"))).select(
$"col1", $"col2",
$"columns._1".alias("col3"), $"columns._2".alias("col4"), $"columns._3".alias("col5"))
flattened.show
Hope that is understandable :)

Related

how to efficiently parse dataframe object into a map of key-value pairs

i'm working with a dataframe with the columns basketID and itemID. is there a way to efficiently parse through the dataset and generate a map where the keys are basketID and the value is a set of all the itemID contained within each basket?
my current implementation uses a for loop over the data frame which isn't very scalable. is it possible to do this more efficiently? any help would be appreciated thanks!
screen shot of sample data
the goal is to obtain basket = Map("b1" -> Set("i1", "i2", "i3"), "b2" -> Set("i2", "i4"), "b3" -> Set("i3", "i5"), "b4" -> Set("i6")). heres the implementation I have using a for loop
// create empty container
val basket = scala.collection.mutable.Map[String, Set[String]]()
// loop over all numerical indexes for baskets (b<i>)
for (i <- 1 to 4) {
basket("b" + i.toString) = Set();
}
// loop over every row in df and store the items to the set
df.collect().foreach(row =>
basket(row(0).toString) += row(1).toString
)
You can simply do aggregateByKey operation then collectItAsMap will directly give you the desired result. It is much more efficient than simple groupBy.
import scala.collection.mutable
case class Items(basketID: String,itemID: String)
import spark.implicits._
val result = output.as[Items].rdd.map(x => (x.basketID,x.itemID))
.aggregateByKey[mutable.Buffer[String]](new mutable.ArrayBuffer[String]())
((l: mutable.Buffer[String], p: String) => l += p ,
(l1: mutable.Buffer[String], l2: mutable.Buffer[String]) => (l1 ++ l2).distinct)
.collectAsMap();
you can check other aggregation api's like reduceBy and groupBy over here.
please also check aggregateByKey vs groupByKey vs ReduceByKey differences.
This is efficient assuming your dataset is small enough to fit into the driver's memory. .collect will give you an array of rows on which you are iterating which is fine. If you want scalability then instead of Map[String, Set[String]] (this will reside in driver memory) you can use PairRDD[String, Set[String]] (this will be distributed).
//NOT TESTED
//Assuming df is dataframe with 2 columns, first is your basketId and second is itemId
df.rdd.map(row => (row.getAs[String](0), row.getAs[String](1)).groupByKey().mapValues(x => x.toSet)

Scala How to Find All Unique Values from a Specific Column in a CSV?

I am using Scala to read from a csv file. The file is formatted to have 3 columns each separated by a \t character. The first 2 columns are unimportant and the third column contains a list of comma separated identifiers stored as as strings. Below is a sample of what the input csv would look like:
0002ba73 US 6o7,6on,6qc,6qj,6nw,6ov,6oj,6oi,15me,6pb,6p9
002f50e4 US 6om,6pb,6p8,15m9,6ok,6ov,6qc,6oo,15me
004b5edc US 6oj,6nz,6on,6om,6qc,6ql,6p6,15me
005cc990 US 6pb,6qf,15me,6og,6nx,6qc,6om,6ok
005fe1ea US 15me,6p0,6ql,6ok,6ox,6ol,6o5,6qj
00777555 US 6pb,15me,6nw,6rk,6qc,6ov,6qj,6o0,6oj,6ok,6on,6p6,6nx,15m9
00cbcc7d US 6oj,6qc,6qg,6pb,6ol,6p6,6ov,15me
010254a6 US 6qc,6pb,6nw,6nx,15me,6o0,6ok,6p8
011b905c US 6oj,6nw,6ov,15me,6qc,6ow,6ql,6on,6qi,6qe
011fffa6 US 15me,6ok,6oj,6p6,6pb,6on,6qc,6ov,6oo,6nw,6oc
I want to read in the csv, get rid of the first two columns, and create a List that contains one instance of each unique identifier code found in the third column, so running the code on the above data should return the result List(6on, 6qc, 6qj, 6nw, 6ov, 6oj, 6oi, 15me, 6pb, 6p9, 6p8, 15m9, 6ok, 6oo, 6nz, 6om, 6ql, 6p6, 6qf, 6og, 6nx, 6p0, 6ox, 6ol, 6o5, 6rk, 6o0, 6qg, 6ow, 6qi, 6qe, 6oc)
I have the following code which returns a List containing every distinct value found anywhere in the csv file:
val in_file = new File("input_file.csv")
val source = scala.io.Source.fromFile(in_file, "utf-8")
val labels = try source.getLines.mkString("\t") finally source.close()
val labelsList: List[String] = labels.split("[,\t]").map(_.trim).toList.distinct
Using the above input, my code returns labelsList with a value of List(0002ba73-e60c-4ffb-9131-c1612b904658, US, 6o7, 6on, 6qc, 6qj, 6nw, 6ov, 6oj, 6oi, 15me, 6pb, 6p9, 002f50e4-48cc-4b14-bb80-0502068b6161, 6om, 6p8, 15m9, 6ok, 6oo, 004b5edc-c0cc-4ffd-bef3-980bd92b92e6, 6nz, 6ql, 6p6, 005cc990-83dc-4e63-a4b6-58f38241e8fd, 6qf, 6og, 6nx, 005fe1ea-b918-48a3-a495-1f8ac12935ba, 6p0, 6ox, 6ol, 6o5, 00777555-83d4-401e-861b-5892f3aa3e1c, 6rk, 6o0, 00cbcc7d-1b48-4c5c-8141-8fc8f62b7b07, 6qg, 010254a6-2ef0-4a24-aa4d-3cc6656a55de, 011b905c-fbf3-441a-8912-a94cc0fe8a1d, 6ow, 6qi, 6qe, 011fffa6-0b9f-4d88-8ced-ce1cc864984f, 6oc)
How can I get my code to run properly and ignore anything contained within the first 2 columns of the csv?
You can ignore the first two columns and then split the third by the comma.
Finally a toSet will get rid of the duplicate identifiers.
val f = Source.fromFile("input_file.csv")
val lastColumns = f.getLines().map(_.split("\t")(2))
val uniques = lastColumns.flatMap(_.split(",")).toSet
uniques foreach println
Using Scala 2.13 resource management.
util.Using(io.Source.fromFile("input_file.csv")){
_.getLines()
.foldLeft(Array.empty[String]){
_ ++ _.split("\t")(2).split(",")
}.distinct.toList
}
//res0: scala.util.Try[List[String]] =
// Success(List(6o7, 6on, 6qc, 6qj, 6nw, 6ov, 6oj, 6oi, 15me, 6pb, 6p9, 6om, 6p8, 15m9, 6ok, 6oo, 6nz, 6ql, 6p6, 6qf, 6og, 6nx, 6p0, 6ox, 6ol, 6o5, 6rk, 6o0, 6qg, 6ow, 6qi, 6qe, 6oc))
The .toList can be dropped if an Array result is acceptable.
This is what you can do , Am doing on a sample DF, you can replace with yours
val Df = Seq(("7369", "SMITH" , "2010-12-17", "800.00"), ("7499", "ALLEN","2011-02-20", "1600.00")).toDF("empno", "ename","hire_date", "sal")
val reqCols = Seq(2)
val finalDf = Df.select(reqCols map Df.columns map col: _*)
finalDf.show
Note : This is 0-based index, so pass 2 to get third column.
If you want distinct values from your desired column.you can use distinct along with mkstring
val Df = Seq(("7369", "SMITH" , "2010-12-17", "800.00"), ("7499", "ALLEN","2011-02-20", "1600.00"), ("7499", "ALLEN","2011-02-20", "1600.00")).toDF("empno", "ename","hire_date", "sal")
val reqCols = Seq(2)
val distinctValues = Df.select(reqCols map Df.columns map col: _*).distinct.collect.mkString(",").filterNot("[]".toSet)
println(distinctValues)
Dates are duplicate , above code is removing duplicates.
Another method using regex
val data = scala.io.Source.fromFile("source.txt").getLines()
data.toList.flatMap {
line => """\S+\s+\S+\s+(\S+)""".r.findAllMatchIn(line).map( x => x.group(1).split(",").toList)
}.flatten.distinct
// res0: List[String] = List(6o7, 6on, 6qc, 6qj, 6nw, 6ov, 6oj, 6oi, 15me, 6pb, 6p9, 6om, 6p8, 15m9, 6ok, 6oo, 6nz, 6ql, 6p6, 6qf, 6og, 6nx, 6p0, 6ox, 6ol, 6o5, 6rk, 6o0, 6qg, 6ow, 6qi, 6qe, 6oc)

Replace a substring of a string in pyspark dataframe

How to replace substrings of a string. For example, I created a data frame based on the following json format.
line1:{"F":{"P3":"1:0.01","P8":"3:0.03,4:0.04", ...},"I":"blah"}
line2:{"F":{"P4":"2:0.01,3:0.02","P10":"5:0.02", ...},"I":"blah"}
I need to replace the substrings "1:", "2:", "3:", with "a:", "b:", "c:", and etc. So the result will be:
line1:{"F":{"P3":"a:0.01","P8":"c:0.03,d:0.04", ...},"I":"blah"}
line2:{"F":{"P4":"b:0.01,c:0.02","P10":"e:0.02", ...},"I":"blah"}
Please consider that this is just an example the real replacement is substring replacement not character replacement.
Any guidance either in Scala or Pyspark is helpful.
from pyspark.sql.functions import *
newDf = df.withColumn('col_name', regexp_replace('col_name', '1:', 'a:'))
Details here:
Pyspark replace strings in Spark dataframe column
Let's say you have a collection of strings for possible modification (simplified for this example).
val data = Seq("1:0.01"
,"3:0.03,4:0.04"
,"2:0.01,3:0.02"
,"5:0.02")
And you have a dictionary of required conversions.
val num2name = Map("1" -> "A"
,"2" -> "Bo"
,"3" -> "Cy"
,"4" -> "Dee")
From here you can use replaceSomeIn() to make the substitutions.
data.map("(\\d+):".r //note: Map key is only part of the match pattern
.replaceSomeIn(_, m => num2name.get(m group 1) //get replacement
.map(_ + ":"))) //restore ":"
//res0: Seq[String] = List(A:0.01
// ,Cy:0.03,Dee:0.04
// ,Bo:0.01,Cy:0.02
// ,5:0.02)
As you can see, "5:" is a match for the regex pattern but since the 5 part is not defined in num2name, the string is left unchanged.
This is the way I solved it in PySpark:
def _blahblah_name_replacement(val, ordered_mapping):
for key, value in ordered_mapping.items():
val = val.replace(key, value)
return val
mapping = {"1:":"aaa:", "2:":"bbb:", ..., "24:":"xxx:", "25:":"yyy:", ....}
ordered_mapping = OrderedDict(reversed(sorted(mapping.items(), key=lambda t: int(t[0][:-1]))))
replacing = udf(lambda x: _blahblah_name_replacement(x, ordered_mapping))
new_df = df.withColumn("F", replacing(col("F")))

Can a DataFrame be converted to Dataset of a case class if a column name contains a space?

I have a Spark DataFrame where a column name contains a space. Is it possible to convert these rows into case classes?
For example, if I do this:
val data = Seq(1, 2, 3).toDF("a number")
case class Record(`a number`: Int)
data.as[Record]
I get this exception:
org.apache.spark.sql.AnalysisException: cannot resolve '`a$u0020number`' given input columns: [a number];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
...
Is there any way to do this?
(Of course I can work around this by renaming the column before converting to a case class. I was hoping to have the case class match the input schema exactly.)
Can you try this solution , this worked for me without changing the column name.
import sqlContext.implicits._
case class Record(`a number`: Int)
val data = Seq(1, 2, 3)
val recDF = data.map(x => Record(x)).toDF()
recDF.collect().foreach(println)
[1]
[2]
[3]
I'm using Spark 1.6.0. The only part of your code that doesn't work for me is the part where you're setting up your test data. I have to use a sequence of tuples instead of a sequence of integers:
case class Record(`a number`:Int)
val data = Seq(Tuple1(1),Tuple1(2),Tuple1(3)).toDF("a number")
data.as[Record]
// returns org.apache.spark.sql.Data[Record] = [a$u0020number: int]
If you need a Dataframe instead of a Dataset you can always use another toDF:
data.as[Record].toDF

Accessing a global lookup Apache Spark

I have a list of csv files each with a bunch of category names as header columns. Each row is a list of users with a boolean value (0, 1) whether they are part of that category or not. Each of the csv files does not have the same set of header categories.
I want to create a composite csv across all the files which has the following output:
Header is a union of all the headers
Each row is a unique user with a boolean value corresponding to the category column
The way I wanted to tackle this is to create a tuple of a user_id and a unique category_id for each cell with a '1'. Then reduce all these columns for each user to get the final output.
How do I create the tuple to begin with? Can I have a global lookup for all the categories?
Example Data:
File 1
user_id,cat1,cat2,cat3
21321,,,1,
21322,1,1,1,
21323,1,,,
File 2
user_id,cat4,cat5
21321,1,,,
21323,,1,,
Output
user_id,cat1,cat2,cat3,cat4,cat5
21321,,1,1,,,
21322,1,1,1,,,
21323,1,1,,,,
Probably the title of the question is misleading in the sense that conveys a certain implementation choice as there's no need for a global lookup in order to solve the problem at hand.
In big data, there's a basic principle guiding most solutions: divide and conquer. In this case, the input CSV files could be divided in tuples of (user,category).
Any number of CSV files containing an arbitrary number of categories can be transformed to this simple format. The resulting CSV results of the union of the previous step, extraction of the total nr of categories present and some data transformation to get it in the desired format.
In code this algorithm would look like this:
import org.apache.spark.SparkContext._
val file1 = """user_id,cat1,cat2,cat3|21321,,,1|21322,1,1,1|21323,1,,""".split("\\|")
val file2 = """user_id,cat4,cat5|21321,1,|21323,,1""".split("\\|")
val csv1 = sparkContext.parallelize(file1)
val csv2 = sparkContext.parallelize(file2)
import org.apache.spark.rdd.RDD
def toTuples(csv:RDD[String]):RDD[(String, String)] = {
val headerLine = csv.first
val header = headerLine.split(",")
val data = csv.filter(_ != headerLine).map(line => line.split(","))
data.flatMap{elem =>
val merged = elem.zip(header)
val id = elem.head
merged.tail.collect{case (v,cat) if v == "1" => (id, cat)}
}
}
val data1 = toTuples(csv1)
val data2 = toTuples(csv2)
val union = data1.union(data2)
val categories = union.map{case (id, cat) => cat}.distinct.collect.sorted //sorted category names
val categoriesByUser = union.groupByKey.mapValues(v=>v.toSet)
val numericCategoriesByUser = categoriesByUser.mapValues{catSet => categories.map(cat=> if (catSet(cat)) "1" else "")}
val asCsv = numericCategoriesByUser.collect.map{case (id, cats)=> id + "," + cats.mkString(",")}
Results in:
21321,,,1,1,
21322,1,1,1,,
21323,1,,,,1
(Generating the header is simple and left as an exercise for the reader)
You dont need to do this as a 2 step process if all you need is the resulting values.
A possible design:
1/ Parse your csv. You dont mention whether your data is on a distributed FS, so i'll assume it is not.
2/ Enter your (K,V) pairs into a mutable parallelized (to take advantage of Spark) map.
pseudo-code:
val directory = ..
mutable.ParHashMap map = new mutable.ParHashMap()
while (files[i] != null)
{
val file = directory.spark.textFile("/myfile...")
val cols = file.map(_.split(","))
map.put(col[0], col[i++])
}
and then you can access your (K/V) tuples by way of an iterator on the map.

Resources