Custom output file format write with Spark - apache-spark

I have a requirement to write the following output format.
primary_key_value^attribute1:value1;attribute2:value2;attribute3:value3;attribute4:value4
The output will be written to a file. I can concatenate the values manually and make a string out of it. Are there any best practices I can follow to get Spark to write this output?

You could prepend the name of each column with concat or concat_ws and use semicolons as separators. In Scala, it would look like this:
import org.apache.spark.sql.functions.{col, concat_ws, lit}

val df = Seq((0, "val1", "val2", "val3")).toDF("id", "col1", "col2", "col3")
val res = df
  .select(df.columns.map(c => concat_ws(":", lit(c), col(c)).alias(c)): _*)
res.show()
+----+---------+---------+---------+
| id| col1| col2| col3|
+----+---------+---------+---------+
|id:0|col1:val1|col2:val2|col3:val3|
+----+---------+---------+---------+
And then:
res.write.option("sep", ";").csv("...")
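If you need the exact layout from the question, with the primary key followed by ^ and then the ;-joined attribute:value pairs, a minimal sketch could build the whole line and write it as plain text. This assumes "id" is the primary key column:

import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}

// Sketch only: "id" is assumed to be the primary key column
val attrs = df.columns.filter(_ != "id")
val line = concat(
  col("id").cast("string"), lit("^"),
  concat_ws(";", attrs.map(c => concat_ws(":", lit(c), col(c))): _*)
)
df.select(line.as("value")).write.text("...")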

In PySpark, you can use the concat function to concatenate each column name with its value, and apply all of this in a select.
Then write the result with the csv function:
from pyspark.sql import functions as f

df.select(*[f.concat(f.lit(c), f.lit(":"), f.col(c)) for c in df.columns]) \
    .write.option("header", "false").option("delimiter", ";").csv("../path")

Related

How can I use reduceByKey for RDD?

I have an RDD:
[{'date': '27/07/2022', 'user': 'User_83031', 'number_of_emails': 96},
{'date': '27/07/2022', 'user': 'User_45839', 'number_of_emails': 110},
{'date': '14/12/2022', 'user': 'User_15817', 'number_of_emails': 49}]
The code is:
from pyspark import SparkContext

sc = SparkContext(appName="app-name")
raw_data = sc.textFile("emails.txt")

def formatEmail(row):
    return {
        "date": row.split(',')[0],
        "user": row.split(',')[1],
        "number_of_emails": int(row.split(',')[2])
    }

emailsRDD = raw_data.map(lambda r: formatEmail(r))
emailsRDD.take(3)
I run into a problem when I try to use reduceByKey.
test = emailsRDD.map(lambda x: (x.get("date"), 1)) \
    .reduceByKey(lambda x, y: x + y)
test.first()
The output gives me an error:
ValueError: RDD is empty
Does anybody know why this error occurs?
I am expecting to get a paired RDD with the date as the key and the number of key occurrences as the value, like below:
('27/07/2022', 2)
It's very inefficient to use RDDs with Python. Really, in 2023 you should use the DataFrame API, which is more efficient. Plus you get things like loading the data as a CSV file instead of manually parsing your lines.
With the DataFrame API the code will look as follows:
import pyspark.sql.functions as F
df = spark.read.csv("emails.txt", schema="date string, user string, num int")
df2 = df.groupBy("date").agg(F.sum("num"))
df2.show()
will give you as expected:
+----------+--------+
| date|sum(num)|
+----------+--------+
|27/04/2021| 106|
|17/08/2022| 54|
|14/12/2022| 49|
|27/07/2022| 206|
+----------+--------+
In this case you work with high-level constructs, like:
loading the data as CSV using spark.read.csv
summarizing your data for each date
Such code is much easier to read, plus it's more efficient because Spark won't need to serialize/deserialize data between the JVM and Python.
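Note that the question's reduceByKey counts rows per date rather than summing emails. If that per-date count is what you want, the aggregation is a plain count; a sketch in Scala, and the PySpark call chain reads the same:

// Count the number of records per date instead of summing num
df.groupBy("date").count().show()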

Combine multiple columns into single column in SPARK

I have flattened incoming data in the below format in my parquet file:
I want to convert it into the below format, un-flattening the structure:
I tried the following:
Dataset<Row> rows = df.select(col("id"), col("country_cd"),
    explode(array("fullname_1", "fullname_2")).as("fullname"),
    explode(array("firstname_1", "firstname_2")).as("firstname"));
But it gives the below error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Only one generator allowed per select clause but found 2: explode(array(fullname_1, fullname_2)), explode(array(firstname_1, firstname_2));
I understand it is because you cannot use more than 1 explode in a query.
I am looking for options to do the above in Spark Java.
This type of problem is most easily solved with a .flatMap(). A .flatMap() is like a .map() except that it allows you to output n records for each input record, as opposed to a 1:1 ratio.
val df = Seq(
  (1, "USA", "Lee M", "Lee", "Dan A White", "Dan"),
  (2, "CAN", "Pate Poland", "Pate", "Don Derheim", "Don")
).toDF("id", "country_code", "fullname_1", "firstname_1", "fullname_2", "firstname_2")

df.flatMap(row => {
  val id = row.getAs[Int]("id")
  val cc = row.getAs[String]("country_code")
  Seq(
    (id, cc, row.getAs[String]("fullname_1"), row.getAs[String]("firstname_1")),
    (id, cc, row.getAs[String]("fullname_2"), row.getAs[String]("firstname_2"))
  )
}).toDF("id", "country_code", "fullname", "firstname").show()
This results in the following:
+---+------------+-----------+---------+
| id|country_code| fullname|firstname|
+---+------------+-----------+---------+
| 1| USA| Lee M| Lee|
| 1| USA| Lee M| Lee|
| 2| CAN|Pate Poland| Pate|
| 2| CAN|Pate Poland| Pate|
+---+------------+-----------+---------+
You need to wrap the first names and full names into an array of structs, which you then explode:
Dataset<Row> rows = df.select(col("id"), col("country_cd"),
    explode(
        array(
            struct(col("firstname_1").as("firstname"), col("fullname_1").as("fullname")),
            struct(col("firstname_2").as("firstname"), col("fullname_2").as("fullname"))
        )
    ));
This way you'll get a fast narrow transformation, keep Scala/Python/R portability, and it should run quicker than the df.flatMap solution, which turns the DataFrame into an RDD that the query optimizer cannot improve. There might be additional pressure on the Java garbage collector because of copying from unsafe byte arrays to Java objects.
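For reference, a Scala sketch of the same explode-of-structs idea, reusing the sample DataFrame from the flatMap answer above and flattening the exploded struct back into top-level columns; the struct alias "name" is just an illustrative choice:

import org.apache.spark.sql.functions.{array, col, explode, struct}

val unpivoted = df
  .select(
    col("id"), col("country_code"),
    explode(array(
      struct(col("fullname_1").as("fullname"), col("firstname_1").as("firstname")),
      struct(col("fullname_2").as("fullname"), col("firstname_2").as("firstname"))
    )).as("name")
  )
  .select("id", "country_code", "name.*")  // flatten the struct back into columns

unpivoted.show()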
As a database person, I like to use set-based operations for things like this, e.g. union:
val df = Seq(
  ("1", "USA", "Lee M", "Lee", "Dan A White", "Dan"),
  ("2", "CAN", "Pate Poland", "Pate", "Don Derheim", "Don")
).toDF("id", "country_code", "fullname_1", "firstname_1", "fullname_2", "firstname_2")

val df_new = df
  .select("id", "country_code", "fullname_1", "firstname_1")
  .union(df.select("id", "country_code", "fullname_2", "firstname_2"))
  .orderBy("id")

df_new.show
df.createOrReplaceTempView("tmp")
Or the equivalent SQL:
%sql
SELECT id, country_code, fullname_1 AS fullname, firstname_1 AS firstname
FROM tmp
UNION
SELECT id, country_code, fullname_2, firstname_2
FROM tmp
My results:
I suppose one advantage over the flatMap technique is that you don't have to specify the datatypes, and it appears simpler on the face of it. It's up to you, of course.
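Outside a notebook cell with the %sql magic, the same statement can be run through spark.sql against the "tmp" view registered above (a sketch):

val df_sql = spark.sql("""
  SELECT id, country_code, fullname_1 AS fullname, firstname_1 AS firstname FROM tmp
  UNION
  SELECT id, country_code, fullname_2, firstname_2 FROM tmp
""")
df_sql.show()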

How to query nested Array type of a json file using Spark?

How can I query a nested array type using joins with a Spark Dataset?
Currently I'm exploding the array type and doing a join on the dataset where I need to remove the matched data. But is there a way to query it directly without exploding?
{
  "id": 525,
  "arrayRecords": [
    {
      "field1": 525,
      "field2": 0
    },
    {
      "field1": 537,
      "field2": 1
    }
  ]
}
The code:
val df = sqlContext.read.json("jsonfile")
val someDF = Seq(("1"), ("525"), ("3")).toDF("FIELDIDS")
val withSRCRec = df.select($"*", explode($"arrayRecords").as("exploded_arrayRecords"))
val fieldIdMatchedDF = withSRCRec.as("table1")
  .join(someDF.as("table2"), $"table1.exploded_arrayRecords.field1" === $"table2.FIELDIDS")
  .select($"table1.exploded_arrayRecords.field1")
val finalDf = df.as("table1")
  .join(fieldIdMatchedDF.as("table2"), $"table1.id" === $"table2.id", "leftanti")
Records having matching field ids need to be removed.
You could use array_except instead:
array_except(col1: Column, col2: Column): Column Returns an array of the elements in the first array but not in the second array, without duplicates. The order of elements in the result is not determined
A solution could be as follows:
val input = spark.read.option("multiLine", true).json("input.json")
scala> input.show(false)
+--------------------+---+
|arrayRecords |id |
+--------------------+---+
|[[525, 0], [537, 1]]|525|
+--------------------+---+
// Since field1 is of type int, let's convert the ids to ints
// You could do this in Scala directly or in Spark SQL's select
val fieldIds = Seq("1", "525", "3").toDF("FIELDIDS").select($"FIELDIDS" cast "int")
// Collect the ids for array_except
val ids = fieldIds.select(collect_set("FIELDIDS") as "ids")
// The trick is to crossJoin (it is cheap given 1-row ids dataset)
val solution = input
  .crossJoin(ids)
  .select(array_except($"arrayRecords.field1", $"ids") as "unmatched")
scala> solution.show
+---------+
|unmatched|
+---------+
| [537]|
+---------+
You can register a temporary table based on your dataset and query it with SQL. It would be something like this:
someDs.registerTempTable("sometable");
sql("SELECT array['field'] FROM sometable");

Spark-Scala Try Select Statement

I'm trying to incorporate a Try().getOrElse() statement in my select statement for a Spark DataFrame. The project I'm working on is going to be applied to multiple environments. However, each environment is a little different in terms of the naming of the raw data for ONLY one field. I do not want to write several different functions to handle each different field. Is there an elegant way to handle exceptions, like the one below, in a DataFrame select statement?
val dfFilter = dfRaw
  .select(
    Try($"some.field.nameOption1").getOrElse($"some.field.nameOption2"),
    $"some.field.abc",
    $"some.field.def"
  )
dfFilter.show(33, false)
However, I keep getting the following error, which makes sense because the field does not exist in this environment's raw data, but I'd expect the getOrElse statement to catch that exception.
org.apache.spark.sql.AnalysisException: No such struct field nameOption1 in...
Is there a good way to handle exceptions in Scala Spark for select statements? Or will I need to code up different functions for each case?
val selectedColumns = if (dfRaw.columns.contains("some.field.nameOption1")) $"some.field.nameOption1" else $"some.field.nameOption2"

val dfFilter = dfRaw
  .select(selectedColumns, ...)
So I'm revisiting this question after a year. I believe this solution is much more elegant to implement. Please share your thoughts:
import scala.util.Try

// Generate a fake DataFrame
val df = Seq(
  ("1234", "A", "AAA"),
  ("1134", "B", "BBB"),
  ("2353", "C", "CCC")
).toDF("id", "name", "nameAlt")

// Extract the column names
val columns = df.columns

// Add a "new" column name that is NOT present in the above DataFrame
val columnsAdd = columns ++ Array("someNewColumn")

// Let's then "try" to select all of the columns
df.select(columnsAdd.flatMap(c => Try(df(c)).toOption): _*).show(false)

// Let's reduce the DF again...should yield the same results
val dfNew = df.select("id", "name")
dfNew.select(columnsAdd.flatMap(c => Try(dfNew(c)).toOption): _*).show(false)
// Results
columns: Array[String] = Array(id, name, nameAlt)
columnsAdd: Array[String] = Array(id, name, nameAlt, someNewColumn)
+----+----+-------+
|id |name|nameAlt|
+----+----+-------+
|1234|A |AAA |
|1134|B |BBB |
|2353|C |CCC |
+----+----+-------+
dfNew: org.apache.spark.sql.DataFrame = [id: string, name: string]
+----+----+
|id |name|
+----+----+
|1234|A |
|1134|B |
|2353|C |
+----+----+
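Building on the same idea, a small hypothetical helper (not part of the answer above) can resolve the first of several candidate column names that actually exists, which maps directly onto the "one field differs per environment" situation:

import scala.util.Try
import org.apache.spark.sql.{Column, DataFrame}

// Hypothetical helper: returns the first candidate column name that resolves against df
def firstExistingColumn(df: DataFrame, candidates: String*): Option[Column] =
  candidates.flatMap(name => Try(df(name)).toOption).headOption

// Usage with the fake DataFrame above: "someNewColumn" does not exist, so "nameAlt" is picked
val picked = firstExistingColumn(df, "someNewColumn", "nameAlt").get
df.select(df("id"), picked).show(false)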

How to create a dataframe from a string key=value delimited by ";"

I have a Hive table with the structure:
I need to read the string field, break out the keys, and turn them into Hive table columns; the final table should look like this:
Very important: the number of keys in the string is dynamic, and the names of the keys are also dynamic.
An attempt would be to read the string with Spark SQL, create a dataframe with a schema based on all the strings, and use the saveAsTable() function to turn the dataframe into the final Hive table, but I do not know how to do this.
Any suggestions?
A naive solution (assuming unique (code, date) combinations and no embedded = or ; in the string) can look like this:
import org.apache.spark.sql.functions.{explode, first, split}

val df = Seq(
  (1, 1, "key1=value11;key2=value12;key3=value13;key4=value14"),
  (1, 2, "key1=value21;key2=value22;key3=value23;key4=value24"),
  (2, 4, "key3=value33;key4=value34;key5=value35")
).toDF("code", "date", "string")

val bits = split($"string", ";")
val kv = split($"pair", "=")

df
  .withColumn("bits", bits)              // Split column by `;`
  .withColumn("pair", explode($"bits"))  // Explode into multiple rows
  .withColumn("key", kv(0))              // Extract key
  .withColumn("val", kv(1))              // Extract value
  // Pivot to wide format
  .groupBy("code", "date")
  .pivot("key")
  .agg(first("val"))
  .show()
// +----+----+-------+-------+-------+-------+-------+
// |code|date| key1| key2| key3| key4| key5|
// +----+----+-------+-------+-------+-------+-------+
// | 1| 2|value21|value22|value23|value24| null|
// | 1| 1|value11|value12|value13|value14| null|
// | 2| 4| null| null|value33|value34|value35|
// +----+----+-------+-------+-------+-------+-------+
This can be easily adjusted to handle the case when (code, date) are not unique, and you can process more complex string patterns using a UDF.
Depending on the language you use and the number of columns, you may be better off using an RDD or Dataset. It is also worth considering dropping the full explode / pivot in favor of a UDF:
import org.apache.spark.sql.functions.udf

// Parse "k1=v1;k2=v2;..." strings into Map[String, String]
val parse = udf((text: String) => text.split(";").map(_.split("=")).collect {
  case Array(k, v) => (k, v)
}.toMap)
val extractKeys = udf((pairs: Map[String, String]) => pairs.keys.toList)

// Parse strings to Map[String, String]
val withKVs = df.withColumn("kvs", parse($"string"))

val keys = withKVs
  .select(explode(extractKeys($"kvs")))  // Get unique keys
  .distinct
  .as[String]
  .collect.sorted.toList                 // Collect and sort

// Build a list of expressions for the subsequent select
val exprs = keys.map(key => $"kvs".getItem(key).alias(key))

withKVs.select($"code" :: $"date" :: exprs: _*)
In Spark 1.5 you can try:
val keys = withKVs.select($"kvs").rdd
  .flatMap(_.getAs[Map[String, String]]("kvs").keys)
  .distinct
  .collect.sorted.toList
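On more recent Spark versions, the built-in str_to_map function can replace the custom parse UDF entirely (a sketch, assuming no escaped ; or = inside keys or values):

import org.apache.spark.sql.functions.expr

// Split on ';' into pairs and on '=' into key/value, producing a map column
val withKVs = df.withColumn("kvs", expr("str_to_map(string, ';', '=')"))
// The key collection and the final select of kvs.getItem(key) stay the same as above.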
