opposite of spark dataframe `withColumn` method? - apache-spark

I'd like to be able to chain a transformation on my DataFrame that drops a column, rather than assigning the DataFrame to a variable (i.e. df.drop()). If I wanted to add a column, I could simply call df.withColumn(). What is the way to drop a column in an in-line chain of transformations?
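For context, drop itself returns a new DataFrame just like withColumn, so it can be chained inline; a minimal sketch (df and the column names are illustrative):
import org.apache.spark.sql.functions.col
val result = df
  .withColumn("doubled", col("value") * 2)
  .drop("value")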

For the entire example, use this as the baseline:
val testVariable = 10
var finalDF = spark.sql("select 'test' as test_column")
val iDF = spark.sql("select 'John Smith' as Name, cast('10' as integer) as Age, 'Illinois' as State")
val iDF2 = spark.sql("select 'Jane Doe' as Name, cast('40' as integer) as Age, 'Iowa' as State")
val iDF3 = spark.sql("select 'Blobby' as Name, cast('150' as integer) as Age, 'Non-US' as State")
val nameDF = iDF.unionAll(iDF2).unionAll(iDF3)
1 Conditional Drop
If you only want to drop the column for certain known outputs, you can build a conditional check to decide whether the column should be dropped or not. In this case, if the test variable is 5 or greater the Name column is dropped; otherwise a new column is added.
finalDF = if (testVariable >= 5) {
  nameDF.drop("Name")
} else {
  nameDF.withColumn("Cooler_Name", lit("Cool_Name"))
}
finalDF.printSchema
2 Programmatically Build the Select Statement
The selectExpr method takes independent strings and builds them into expressions that Spark can evaluate. In the case below we know we have a test for dropping, but we do not know in advance which columns might be dropped. If a column gets a test value that does not equal 1, we do not include it in our command array. When we run the command array through selectExpr on the table, those columns are dropped.
val columnNames = nameDF.columns
val arrayTestOutput = Array(1, 0, 1)
var iteratorArray = 1
var commandArray = Array.empty[String]
while (iteratorArray <= columnNames.length) {
  if (arrayTestOutput(iteratorArray - 1) == 1) {
    commandArray = commandArray :+ columnNames(iteratorArray - 1)
  }
  iteratorArray = iteratorArray + 1
}
finalDF = nameDF.selectExpr(commandArray: _*)
finalDF.printSchema
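As a follow-up, the same keep-or-drop decision can be written more concisely by zipping the column names with the test output; a minimal sketch under the same baseline (columnNames and arrayTestOutput as above):
val keepColumns = columnNames.zip(arrayTestOutput).collect { case (name, 1) => name }
finalDF = nameDF.selectExpr(keepColumns: _*)
finalDF.printSchema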

Related

Need to add a new column to a Dataset/Row in Spark, based on all existing columns

I have this (simplified) Spark dataset with these columns:
"col1", "col2", "col3", "col4"
And I would like to add a new column: "result".
The value of "result" is the return value of a function that takes all the other columns ("col1", "col2", ...) values as parameters.
map/foreach can't change the iterated row, and UDF functions don't take a whole row as a parameter, so I will have to collect all the column names as input, and I will also have to specify each column type in the UDF registration part.
Notes:
The dataset doesn't have a lot of rows, so I don't mind a low-performance solution.
The dataset does have a lot of columns with different types, so specifying all the columns in the UDF registration part doesn't seem like the most elegant solution.
The project is written in Java, so I'm using the Java API to interact with Spark.
How can I achieve that behavior?
You actually could add a new column with a map.
df.map { row =>
  val col1 = row.getAs[String]("col1")
  val col2 = row.getAs[String]("col2")
  // etc, extract all your columns
  ....
  // do what you need to do to obtain the value for the new column
  val newColumn = col1 + col2
  (col1, col2, ..., newColumn)
}.toDF("col1", "col2", ..., "new")
In terms of the Java API this will be much the same, with some adjustments:
data.map((MapFunction<Row, Tuple3<String, String, String>>) row -> {
  String col1 = row.getAs("col1");
  String col2 = row.getAs("col2");
  // whatever you need
  String newColumns = col1 + col2;
  return new Tuple3<>(col1, col2, newColumns);
}, Encoders.tuple(Encoders.STRING(), Encoders.STRING(), Encoders.STRING()))
.toDF("col1", "col2", ..., "new")
Alternatively, you could collect all your columns into an array column and then process that array in your UDF.
val transformer = udf { arr: Seq[Any] =>
  // do your stuff, but beware of types
}
data.withColumn("array", array($"col1", $"col2", ..., $"colN"))
  .select($"col1", $"col2", ..., transformer($"array") as "newCol")
I've found a solution for my question:
String[] allColumnsAsStrings = dataset.columns();
final Column[] allColumns = Arrays.stream(allColumnsAsStrings)
    .map(functions::col)
    .toArray(Column[]::new);

UserDefinedFunction addColumnUdf = udf((Row row) -> {
    double score = 0.0;
    // Calculate stuff based on the row values
    // ...
    return score;
}, DataTypes.DoubleType);

dataset = dataset.withColumn("score", addColumnUdf.apply(functions.struct(allColumns)));

Compare Two dataframes add mis matched values as a new column in Spark

Difference between two records is:
df1.except(df2)
It's getting results like this.
How do I compare two dataframes to find what changed and in which column, and add that information as a new column? Expected output like this.
Join the two dataframes on the primary key, then use withColumn with a UDF: pass in both column values (old and new), compare them in the UDF, and return a value indicating whether they match.
val check = udf((old_val: String, new_val: String) => if (old_val == new_val) new_val else "")
val df_check = df
  .withColumn("Check_Name", check(df.col("name"), df.col("new_name")))
  .withColumn("Check_Namelast", check(df.col("lastname"), df.col("new_lastname")))
Or, as a function:
def fn(old_df: DataFrame, new_df: DataFrame): DataFrame = {
  val spark = old_df.sparkSession
  import spark.implicits._
  val columnNames = old_df.columns
  val oldRows = old_df.collect() // make df an array to loop through
  val newRows = new_df.collect() // assumes both dataframes share the same row order
  val remarks = oldRows.zip(newRows).map { case (oldRow, newRow) =>
    // collect a remark for every column whose value changed in this row
    val valueChanges = columnNames.indices.collect {
      case j if oldRow.get(j) != newRow.get(j) => s"${columnNames(j)} has value changed"
    }
    (oldRow.get(0).toString, valueChanges.mkString(", ")) // primary key, remarks column
  }
  remarks.toSeq.toDF("primary_key", "remarks") // convert the array back to a df
}

spark spelling correction via udf

I need to correct some spellings using spark.
Unfortunately a naive approach like
val misspellings3 = misspellings1
  .withColumn("A", when('A === "error1", "replacement1").otherwise('A))
  .withColumn("A", when('A === "error2", "replacement2").otherwise('A))
  .withColumn("B", when(('B === "conditionC") and ('D === condition3), "replacementC").otherwise('B))
does not work well with Spark; see How to add new columns based on conditions (without facing JaninoRuntimeException or OutOfMemoryError)?
The simple cases (the first 2 examples) can nicely be handled via
val spellingMistakes = Map(
  "error1" -> "fix1"
)
val spellingNameCorrection: (String => String) = (t: String) => {
  spellingMistakes.get(t) match {
    case Some(tt) => tt // correct spelling
    case None => t // keep original
  }
}
val spellingUDF = udf(spellingNameCorrection)
val misspellings1 = hiddenSeasonalities
  .withColumn("A", spellingUDF('A))
But I am unsure how to handle the more complex / chained conditional replacements in a UDF in a nice and generalizable manner.
If it is only a rather small list of spellings (< 50), would you suggest hard-coding them within a UDF?
You can make the UDF receive more than one column:
val spellingCorrection2 = udf((x: String, y: String) => if (x == "conditionC" && y == "conditionD") "replacementC" else x)
val misspellings3 = misspellings1.withColumn("B", spellingCorrection2($"B", $"C"))
To make this more general, you can use a map from a tuple of the two conditions to a string, the same as you did for the first case.
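For instance, a minimal sketch of that map-driven version (the condition and replacement values are placeholders, and the same imports as above are assumed):
val corrections: Map[(String, String), String] = Map(
  ("conditionC", "conditionD") -> "replacementC"
)
val spellingCorrection3 = udf((x: String, y: String) => corrections.getOrElse((x, y), x))
val misspellings4 = misspellings1.withColumn("B", spellingCorrection3($"B", $"C"))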
If you want to generalize it even more, you can use dataset mapping: create a case class with the relevant columns, use as to convert the dataframe to a dataset of that case class, then use the dataset's map and pattern match on the input data to generate the relevant corrections, and finally convert back to a dataframe.
This should be easier to write but would have a performance cost.
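A hedged sketch of that dataset-mapping variant, assuming the relevant columns are A, B and D, that they are all strings, and that spark.implicits._ is in scope:
case class Record(A: String, B: String, D: String)

val corrected = misspellings1.as[Record].map {
  case r @ Record("error1", _, _) => r.copy(A = "replacement1")
  case r @ Record(_, "conditionC", "conditionD") => r.copy(B = "replacementC")
  case r => r // cases are tried in order; a full version would apply every applicable correction
}.toDF()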
For now I will go with the following which seems to work just fine and is more understandable: https://gist.github.com/rchukh/84ac39310b384abedb89c299b24b9306
Suppose spellingMap is the map containing the correct spellings and df is the dataframe:
val df: DataFrame = ??? // your dataframe
val spellingMap = Map.empty[String, String] // fill it up yourself
val columnsWithSpellingMistakes = List("abc", "def")
Write a UDF like this
def spellingCorrectionUDF(spellingMap: Map[String, String]) =
  udf[String, Row]((value: Row) => {
    val cellValue = value.getString(0)
    if (spellingMap.contains(cellValue)) spellingMap(cellValue)
    else cellValue
  })
And finally, you can call them as
val newColumns = df.columns.map { columnName =>
  if (columnsWithSpellingMistakes.contains(columnName))
    spellingCorrectionUDF(spellingMap)(struct(col(columnName))).as(columnName)
  else col(columnName)
}
df.select(newColumns: _*)

How to dynamically create the list of the columns to include in select?

I tried to "generate" a spark query in this way
def stdizedOperationmode(sqLContext: SQLContext, withrul: DataFrame): DataFrame = {
  // see http://spark.apache.org/docs/latest/sql-programming-guide.html
  import sqLContext.implicits._
  val AZ: Column = lit(0.00000001)

  def opMode(id: Int): Column = {
    (column("s" + id) - coalesce(column("a" + id) / column("sd" + id), column("a" + id) / lit(AZ))).as("std" + id)
  }

  // add the 21 std<i> columns based on s<i> - (a<id>/sd<id>)
  val columns: IndexedSeq[Column] = 1 to 21 map (id => opMode(id))
  val withStd = withrul.select(columns: _*)
  withStd
}
Question: how do I also include "all other columns" (*)? Idea: something like withrul.select('* :+ columns: _*)
You can try the following:
// add the 21 std<i> columns based on s<i> - (a<id>/sd<id>)
val columns: IndexedSeq[Column] = 1 to 21 map (id => opMode(id))

val selectAll: Array[Column] = (for {
  i <- withrul.columns
} yield withrul(i)) union columns.toSeq

val withStd = withrul.select(selectAll: _*)
The for-comprehension yields all the columns from withrul and then appends the generated columns as a Seq[Column].
You are not obliged to create a value just to return it afterwards; you can replace the last two lines with:
withrul.select(selectAll: _*)
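As a side note, a shorter variant should produce the same "all existing columns plus the generated ones" selection (a sketch, assuming org.apache.spark.sql.functions.col is imported):
withrul.select((col("*") +: columns): _*)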

Accessing a global lookup Apache Spark

I have a list of csv files each with a bunch of category names as header columns. Each row is a list of users with a boolean value (0, 1) whether they are part of that category or not. Each of the csv files does not have the same set of header categories.
I want to create a composite csv across all the files which has the following output:
Header is a union of all the headers
Each row is a unique user with a boolean value corresponding to the category column
The way I wanted to tackle this is to create a tuple of a user_id and a unique category_id for each cell with a '1'. Then reduce all these columns for each user to get the final output.
How do I create the tuple to begin with? Can I have a global lookup for all the categories?
Example Data:
File 1
user_id,cat1,cat2,cat3
21321,,,1
21322,1,1,1
21323,1,,
File 2
user_id,cat4,cat5
21321,1,
21323,,1
Output
user_id,cat1,cat2,cat3,cat4,cat5
21321,,,1,1,
21322,1,1,1,,
21323,1,,,,1
The title of the question is probably misleading in the sense that it conveys a certain implementation choice: there's no need for a global lookup in order to solve the problem at hand.
In big data, there's a basic principle guiding most solutions: divide and conquer. In this case, the input CSV files could be divided in tuples of (user,category).
Any number of CSV files containing an arbitrary number of categories can be transformed into this simple format. The resulting CSV is obtained by taking the union of the previous step, extracting the total number of categories present, and applying some data transformation to get it into the desired format.
In code this algorithm would look like this:
import org.apache.spark.SparkContext._
val file1 = """user_id,cat1,cat2,cat3|21321,,,1|21322,1,1,1|21323,1,,""".split("\\|")
val file2 = """user_id,cat4,cat5|21321,1,|21323,,1""".split("\\|")
val csv1 = sparkContext.parallelize(file1)
val csv2 = sparkContext.parallelize(file2)
import org.apache.spark.rdd.RDD
def toTuples(csv: RDD[String]): RDD[(String, String)] = {
  val headerLine = csv.first
  val header = headerLine.split(",")
  val data = csv.filter(_ != headerLine).map(line => line.split(","))
  data.flatMap { elem =>
    val merged = elem.zip(header)
    val id = elem.head
    merged.tail.collect { case (v, cat) if v == "1" => (id, cat) }
  }
}
val data1 = toTuples(csv1)
val data2 = toTuples(csv2)
val union = data1.union(data2)
val categories = union.map{case (id, cat) => cat}.distinct.collect.sorted //sorted category names
val categoriesByUser = union.groupByKey.mapValues(v=>v.toSet)
val numericCategoriesByUser = categoriesByUser.mapValues{catSet => categories.map(cat=> if (catSet(cat)) "1" else "")}
val asCsv = numericCategoriesByUser.collect.map{case (id, cats)=> id + "," + cats.mkString(",")}
Results in:
21321,,,1,1,
21322,1,1,1,,
21323,1,,,,1
(Generating the header is simple and left as an exercise for the reader)
You don't need to do this as a two-step process if all you need are the resulting values.
A possible design:
1/ Parse your csv files. You don't mention whether your data is on a distributed FS, so I'll assume it is not.
2/ Enter your (K,V) pairs into a mutable parallelized map (to take advantage of Spark).
pseudo-code:
// hypothetical list of csv paths, standing in for the unspecified directory in the original sketch
val files: Seq[String] = Seq("/myfile1.csv", "/myfile2.csv")
val map = new scala.collection.parallel.mutable.ParHashMap[String, Set[String]]() // user_id -> categories flagged "1"
files.foreach { path =>
  val lines = sparkContext.textFile(path)
  val headerLine = lines.first
  val header = headerLine.split(",")
  lines.filter(_ != headerLine).map(_.split(",")).collect.foreach { cols =>
    val cats = header.tail.zip(cols.tail).collect { case (cat, "1") => cat }.toSet
    map.put(cols(0), map.getOrElse(cols(0), Set.empty[String]) ++ cats)
  }
}
and then you can access your (K/V) tuples by way of an iterator on the map.
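For example, a minimal sketch of iterating the resulting map (using .seq for a plain sequential traversal):
map.seq.foreach { case (userId, cats) =>
  println(userId + " -> " + cats.toSeq.sorted.mkString(","))
}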
