update nested struct with null values - apache-spark

I have a dataframe with a column which is a nested StructType. The StructType is deeply nested and may comprise other Structs. Now I want to update this column at the lowest level.
I tried withField, but it doesn't work if any of the top-level structs is null. I would appreciate any help with this.
The example schema is:
val schema = new StructType()
  .add("key", StringType)
  .add(
    "cells",
    ArrayType(
      new StructType()
        .add("family", StringType)
        .add("qualifier", StringType)
        .add("timestamp", LongType)
        .add("nestStruct", new StructType()
          .add("id1", LongType)
          .add("id2", StringType)
          .add("id3", new StructType()
            .add("id31", LongType)
            .add("id32", StringType)))
    )
  )
val data = Seq(
  Row(
    "1235321863",
    Array(
      Row("a", "b", 1L, null)
    )
  )
)
val df_test = spark
  .createDataFrame(spark.sparkContext.parallelize(data), schema)

val result = df_test.withColumn(
  "cell1",
  transform($"cells", cell => {
    cell.withField("nestStruct.id3.id31", lit(40)) /* This line doesn't do anything if nestStruct is null. */
  }))

result.show(false)
result.printSchema
result.explain() /* The physical plan shows that if a field is null it will just return null. */

You can use the solution suggested for this question: How do I add a column to a nested struct in a PySpark dataframe?
Or you can try the following: write your current dataframe to a JSON file, read the JSON file into a string, use a regular expression to add the field you want to the JSON string, write the modified string to a new file, and read that file back into a dataframe.
For example, using the data provided above:
import json, re

# the with-block closes the file automatically
with open('./pyspark_sandbox_sample.json') as input_file:
    string_data = str(json.load(input_file))

# add the new field right after id32 via a regular expression
string_data = re.sub(r"'id32': '(.*?)'", r"'id32': '\1', 'id33': 40", string_data)

with open('./pyspark_sandbox_sample.json', 'w') as output_file:
    json.dump(eval(string_data), output_file)
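To complete the round trip described above, the patched file can be read back into a dataframe. A minimal sketch, assuming the rewritten file holds a single pretty-printed JSON object (hence the multiLine option) and keeping the same file name as above:
spark.read
  .option("multiLine", "true") // needed if the file is one multi-line JSON object rather than JSON Lines
  .json("./pyspark_sandbox_sample.json")
  .printSchema()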

Related

Writing Spark Dataframe to ORC gives the wrong timezone

Whenever I write a Dataframe into ORC, the timezone of Timestamp fields is not correct.
Here's my code:
// setting the timezone
val schema = List(
  StructField("name", StringType),
  StructField("date", TimestampType)
)
val data = Seq(
  Row("test", java.sql.Timestamp.valueOf("2021-03-15 10:10:10.0"))
)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)
// changing the timezone
spark.conf.set("spark.sql.session.timeZone", "MDT")
// value of the df has changed accordingly
df.show // prints 2021-03-15 08:10:10
// writing to orc
df.write.mode(SaveMode.Overwrite).format("orc").save("/tmp/dateTest.orc/")
The value in the ORC file will be 2021-03-15 10:10:10.0.
Is there any way to control the writer's timezone? Am I missing something here?
Thanks in advance!
So after much investigation, this is something that's not supported (at the moment) for ORC. It is supported for CSV, though.
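For completeness, a minimal sketch of the CSV route; it assumes Spark 2.4+ and an illustrative output path. The CSV (and JSON) writers accept a timeZone option, which the ORC writer does not:
df.write
  .mode(SaveMode.Overwrite)
  .option("timeZone", "America/Denver") // a full IANA zone id is safer than an abbreviation like "MDT"
  .format("csv")
  .save("/tmp/dateTest.csv/")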

How to parse RDD to Dataframe

I'm trying to parse an RDD[Seq[String]] into a Dataframe.
Although it's a Seq of Strings, the values could have a more specific type such as Int, Boolean, Double, String and so on.
For example, a line could be:
"hello", "1", "bye", "1.1"
"hello1", "11", "bye1", "2.1"
...
Another execution could have a different number of columns.
The first column is always going to be a String, the second an Int, and so on, and it's always going to be that way. On the other hand, one execution could have a Seq of five elements and another execution could have 2000, so it depends on the execution. In each execution the names and types of the columns are defined.
To do it, I could have something like this:
// I could have a parameter to generate the StructType dynamically.
def getSchema(): StructType = {
  var schemaArray = scala.collection.mutable.ArrayBuffer[StructField]()
  schemaArray += StructField("col1", IntegerType, true)
  schemaArray += StructField("col2", StringType, true)
  schemaArray += StructField("col3", DoubleType, true)
  StructType(schemaArray)
}
// A Seq of Any?? It doesn't seem the best option!!
val l1: Seq[Any] = Seq(1, "2", 1.1)
val rdd1 = sc.parallelize(Seq(l1)).map(Row.fromSeq)
val schema = getSchema()
val df = sqlContext.createDataFrame(rdd1, schema)
df.show()
df.schema
I don't like having a Seq of Any at all, but it's really what I have. Is there another option?
On the other hand, I was thinking that since I have something similar to a CSV, I could create one. Spark has a library to read a CSV and return a dataframe where the types are inferred. Is it possible to call it if I already have an RDD[String]?
Since the number of columns changes for each execution, I would suggest going with the CSV option, with the delimiter set to a space or something else. This way Spark will figure out the column types for you.
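A rough sketch of that idea, assuming Spark 2.2+ (where spark.read.csv accepts a Dataset[String]) and a tab delimiter; rdd stands for the RDD[Seq[String]] from the question:
import spark.implicits._

// join every Seq[String] into one delimited line
val lines = rdd.map(_.mkString("\t")).toDS()

// let Spark infer the column types from the text
val inferredDf = spark.read
  .option("delimiter", "\t")
  .option("inferSchema", "true")
  .csv(lines)
inferredDf.printSchema()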
Update:
Since you mentioned that you read data from HBase, one way to go is to convert each HBase row to JSON or CSV and then convert the RDD to a dataframe:
// assumes org.apache.hadoop.hbase.util.Bytes and a JSON library such as org.json.JSONObject are on the classpath
val jsons = hbaseContext.hbaseRDD(tableName, scan).map { case (_, r) =>
  val currentJson = new JSONObject
  val cScanner = r.cellScanner
  while (cScanner.advance) {
    currentJson.put(
      Bytes.toString(cScanner.current.getQualifierArray, cScanner.current.getQualifierOffset, cScanner.current.getQualifierLength),
      Bytes.toString(cScanner.current.getValueArray, cScanner.current.getValueOffset, cScanner.current.getValueLength))
  }
  currentJson.toString
}
val df = spark.read.json(spark.createDataset(jsons))
Similar thing can be done for CSV.
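A sketch of that CSV variant, reusing the hbaseContext and scan from the snippet above and emitting one tab-delimited line of cell values per row (the delimiter choice is an assumption):
val csvLines = hbaseContext.hbaseRDD(tableName, scan).map { case (_, r) =>
  val cScanner = r.cellScanner
  val values = scala.collection.mutable.ArrayBuffer[String]()
  while (cScanner.advance) {
    values += Bytes.toString(cScanner.current.getValueArray, cScanner.current.getValueOffset, cScanner.current.getValueLength)
  }
  values.mkString("\t")
}
// requires import spark.implicits._ for the String encoder
val csvDf = spark.read
  .option("delimiter", "\t")
  .option("inferSchema", "true")
  .csv(spark.createDataset(csvLines))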

Spark load files collection in batch and find the line from each file with additional info from file level

I have a collection of files specified with a comma separator, like:
hdfs://user/cloudera/date=2018-01-15,hdfs://user/cloudera/date=2018-01-16,hdfs://user/cloudera/date=2018-01-17,hdfs://user/cloudera/date=2018-01-18,hdfs://user/cloudera/date=2018-01-19,hdfs://user/cloudera/date=2018-01-20,hdfs://user/cloudera/date=2018-01-21,hdfs://user/cloudera/date=2018-01-22
and I'm loading the files with Apache Spark, all at once, with:
val input = sc.textFile(files)
Also, I have additional information associated with each file - the unique ID, for example:
File                                 | ID
-------------------------------------+---------
hdfs://user/cloudera/date=2018-01-15 | 12345
hdfs://user/cloudera/date=2018-01-16 | 09245
hdfs://user/cloudera/date=2018-01-17 | 345hqw4
and so on
As output, I need to receive a DataFrame in which each row carries the ID of the file the line was read from.
Is it possible to pass this information to Spark in some way, so that the lines can be associated with it?
Spark SQL approach with a UDF (you can achieve the same thing with a join if you represent the File -> ID mapping as a Dataframe; a sketch of that variant follows the example output below):
import org.apache.spark.sql.functions

val inputDf = sparkSession.read.text(".../src/test/resources/test")
  .withColumn("fileName", functions.input_file_name())

def withId(mapping: Map[String, String]) = functions.udf(
  (file: String) => mapping.get(file)
)

val mapping = Map(
  "file:///.../src/test/resources/test/test1.txt" -> "id1",
  "file:///.../src/test/resources/test/test2.txt" -> "id2"
)

val resultDf = inputDf.withColumn("id", withId(mapping)(inputDf("fileName")))
resultDf.show(false)
Result:
+-----+---------------------------------------------+---+
|value|fileName |id |
+-----+---------------------------------------------+---+
|row1 |file:///.../src/test/resources/test/test1.txt|id1|
|row11|file:///.../src/test/resources/test/test1.txt|id1|
|row2 |file:///.../src/test/resources/test/test2.txt|id2|
|row22|file:///.../src/test/resources/test/test2.txt|id2|
+-----+---------------------------------------------+---+
test1.txt:
row1
row11
test2.txt:
row2
row22
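A sketch of the join-based variant mentioned above, reusing the same sparkSession and paths from the snippet; the mapping becomes a dataframe and the join key is the file name:
import org.apache.spark.sql.functions.input_file_name
import sparkSession.implicits._

val mappingDf = Seq(
  "file:///.../src/test/resources/test/test1.txt" -> "id1",
  "file:///.../src/test/resources/test/test2.txt" -> "id2"
).toDF("fileName", "id")

val joinedDf = sparkSession.read.text(".../src/test/resources/test")
  .withColumn("fileName", input_file_name())
  .join(mappingDf, Seq("fileName"), "left")
joinedDf.show(false)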
This could help (not tested)
// read single text file into DataFrame and add 'id' column
def readOneFile(filePath: String, fileId: String)(implicit spark: SparkSession): DataFrame = {
  val dfOriginal: DataFrame = spark.read.text(filePath)
  val dfWithIdColumn: DataFrame = dfOriginal.withColumn("id", lit(fileId))
  dfWithIdColumn
}

// read all text files into DataFrame
def readAllFiles(filePathIdsSeq: Seq[(String, String)])(implicit spark: SparkSession): DataFrame = {
  // create empty DataFrame with expected schema
  val emptyDfSchema: StructType = StructType(List(
    StructField("value", StringType, false),
    StructField("id", StringType, false)
  ))
  val emptyDf: DataFrame = spark.createDataFrame(
    rowRDD = spark.sparkContext.emptyRDD[Row],
    schema = emptyDfSchema
  )

  val unionDf: DataFrame = filePathIdsSeq.foldLeft(emptyDf) { (intermediateDf: DataFrame, filePathIdTuple: (String, String)) =>
    intermediateDf.union(readOneFile(filePathIdTuple._1, filePathIdTuple._2))
  }
  unionDf
}
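A hypothetical usage of the helpers above, with the paths and IDs taken from the question:
import org.apache.spark.sql.SparkSession

implicit val spark: SparkSession = SparkSession.builder().getOrCreate()

val filePathIds: Seq[(String, String)] = Seq(
  "hdfs://user/cloudera/date=2018-01-15" -> "12345",
  "hdfs://user/cloudera/date=2018-01-16" -> "09245"
)
val allFilesDf = readAllFiles(filePathIds)
allFilesDf.show(false)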
References
spark.read.text(..) method
Create empty DataFrame

Taking value from one dataframe and passing that value into loop of SqlContext

I'm looking to do something like this:
I have a dataframe with one column of IDs called ID_LIST. I would like to loop through ID_LIST with foreach, pass each ID into a Spark SQL call, and return the results to another dataframe.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val id_list = sqlContext.sql("select distinct id from item_orc")
id_list.registerTempTable("ID_LIST")
id_list.foreach(i => println(i))
id_list println output:
[123]
[234]
[345]
[456]
Now, trying to loop through ID_LIST and run a Spark SQL call for each:
id_list.foreach(i => {
  val items = sqlContext.sql("select * from another_items_orc where id = " + i)
  items.foreach(println)
})
First: I'm not sure how to pull the individual value out; I'm getting this error:
org.apache.spark.sql.AnalysisException: cannot recognize input near '[' '123' ']' in expression specification; line 1 pos 61
Second: how can I alter my code to output the result to a dataframe I can use later?
Thanks, any help is appreciated!
Answer To First Question
When you perform the foreach, Spark converts the dataframe into an RDD of type Row. Then when you println on the RDD it prints each Row, the first one being "[123]". The brackets [] are just the Row wrapping its elements, and the elements in a row are accessed by position. If you want to print just 123, 234, etc., try:
id_list.foreach(i => println(i(0)))
Or you can use native primitive access
id_list.foreach(i => println(i.getString(0))) //For Strings
Seriously, read the documentation I have linked about Row in Spark. This will transform your code into:
id_list.foreach(i => {
  val items = sqlContext.sql("select * from another_items_orc where id = " + i.getString(0))
  items.foreach(i => println(i.getString(0)))
})
Answer to Second Question
I have a sneaking suspicion about what you actually are trying to do but I'll answer your question as I have interpreted it.
Let's create an empty dataframe and, in a loop over the distinct items from the first dataframe, union everything into it.
import org.apache.spark.sql.types.{StructType, StringType}
import org.apache.spark.sql.Row

// Create the empty dataframe. The schema should reflect the columns
// of the dataframe that you will be adding to it.
val schema = new StructType()
  .add("col1", StringType, true)
var df = ss.createDataFrame(ss.sparkContext.emptyRDD[Row], schema)

// Loop over, select, and union to the empty df
id_list.foreach { i =>
  val items = sqlContext.sql("select * from another_items_orc where id = " + i.getString(0))
  df = df.union(items)
}
df.show()
You now have the dataframe df that you can use later.
NOTE: An easier thing to do would probably be to join the two dataframes on the matching columns.
import sqlContext.implicits.StringToColumn
val bar = id_list.join(another_items_orc, $"distinct_id" === $"id", "inner").select("id")
bar.show()

How can I add a timestamp as an extra column to my dataframe

Hi all,
I have an easy question for you all.
I have an RDD, created from Kafka streaming using the createStream method.
Now I want to add a timestamp as a value to this RDD before converting it into a dataframe.
I have tried adding a value to the dataframe using withColumn(), but it returns this error:
val topicMaps = Map("topic" -> 1)
val now = java.util.Calendar.getInstance().getTime()

val messages = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaConf, topicMaps, StorageLevel.MEMORY_ONLY_SER)

messages.foreachRDD(rdd => {
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.implicits._
  val dataframe = sqlContext.read.json(rdd.map(_._2))
  val d = dataframe.withColumn("timeStamp_column", dataframe.col("now"))
})
org.apache.spark.sql.AnalysisException: Cannot resolve column name "now" among (action, device_os_ver, device_type, event_name,
item_name, lat, lon, memberid, productUpccd, tenantid);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:15
As I came to know, DataFrames cannot be altered as they are immutable, but RDDs are immutable as well.
Then what is the best way to do it?
How do I add a value to the RDD (adding a timestamp to an RDD dynamically)?
Try the current_timestamp function.
import org.apache.spark.sql.functions.current_timestamp
df.withColumn("time_stamp", current_timestamp())
To add a new column with a constant like a timestamp, you can use the lit function:
import org.apache.spark.sql.functions._
val newDF = oldDF.withColumn("timeStamp_column", lit(System.currentTimeMillis))
This works for me. I usually perform a write after this.
val d = dataframe.withColumn("SparkLoadedAt", current_timestamp())
In Scala/Databricks:
import org.apache.spark.sql.functions._
val newDF = oldDF.withColumn("Timestamp",current_timestamp())
I see in the comments that some folks are having trouble converting the timestamp to a string. Here is a way to do that using the Spark 3 datetime format:
import org.apache.spark.sql.functions._
val d = dataframe
  .withColumn("timeStamp_column", date_format(current_timestamp(), "y-M-d'T'H:m:sX"))
