How to parse a JSON containing string property representing JSON - apache-spark

I have many JSONs with structure as followed.
{
"p1":"v1",
"p2":"v2",
"p3":"v3",
"modules": "{ \"nest11\":\"n1v1\", \"nest12\":\"n1v2\", \"nest13\": { \"nest21\": \"n2v1\" } }"
}
How to parse it to this?
v1, v2, v3, n1v1, n1v2, n2v1
It is not a problem to extract "v1, v2, v3", but how to access "n1v1, n1v2, n2v1" With Spark Data Frame API

One approach is to use the DataFrameFlattener implicit class found in the official databricks site.
First you will need to define the JSON schema for the modules column then you flatten the dataframe as shown below. Here I assume that the file test_json.txt
will have the next content:
{
"p1":"v1",
"p2":"v2",
"p3":"v3",
"modules": "{ \"nest11\":\"n1v1\", \"nest12\":\"n1v2\", \"nest13\": { \"nest21\": \"n2v1\" } }"
}
Here is the code:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.types.{DataType, StructType, StringType}
implicit class DataFrameFlattener(df: DataFrame) {
def flattenSchema: DataFrame = {
df.select(flatten(Nil, df.schema): _*)
}
protected def flatten(path: Seq[String], schema: DataType): Seq[Column] = schema match {
case s: StructType => s.fields.flatMap(f => flatten(path :+ f.name, f.dataType))
case other => col(path.map(n => s"`$n`").mkString(".")).as(path.mkString(".")) :: Nil
}
}
val schema = (new StructType)
.add("nest11", StringType)
.add("nest12", StringType)
.add("nest13", (new StructType).add("nest21", StringType, false))
val df = spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("C:\\temp\\test_json.txt")
df.withColumn("modules", from_json($"modules", schema))
.select($"*")
.flattenSchema
And this should be the output:
+--------------+--------------+---------------------+---+---+---+
|modules.nest11|modules.nest12|modules.nest13.nest21|p1 |p2 |p3 |
+--------------+--------------+---------------------+---+---+---+
|n1v1 |n1v2 |n2v1 |v1 |v2 |v3 |
+--------------+--------------+---------------------+---+---+---+
Please let me know if you need further clarification.

All you need to do is parse the JSON string to actual javascript object
const originalJSON = {
"p1":"v1",
"p2":"v2",
"p3":"v3",
"modules": "{ \"nest11\":\"n1v1\", \"nest12\":\"n1v2\", \"nest13\": { \"nest21\": \"n2v1\" } }"
}
const { modules, ...rest } = originalJSON
const result = {
...rest,
modules: JSON.parse(modules)
}
console.log(result)
console.log(result.modules.nest11)

When you retrieve the "modules" element, you are actually retrieving a string. You have to instantiate this string as a new JSON object. I don't know what language you're using, but you generally do something like:
String modules_str = orginalJSON.get("modules");
JSON modulesJSON = new JSON(modules_str);
String nest11_str = modulesJSON.get("nest11");

Related

BSON to Play JSON support for Long values

I've started using the play-json/play-json-compat libraries with reactivemongo 0.20.11.
So I can use JSON Play reads/writes while importing the 'reactivemongo.play.json._' package and then easily fetch data from a JSONCollection instead of a BSONCollection.
For most cases, this works great but for Long fields, it doesn't :(
For example:
case class TestClass(name: String, age: Long)
object TestClass {
implicit val reads = Json.reads[TestClass]
}
If I try querying using the following func:
def getData: Map[String, TestClass] = {
val res = collection.find(emptyDoc)
.cursor[TestClass]()
.collect[List](-1, Cursor.ContOnError[List[TestClass]] { case (_, t) =>
failureLogger.error(s"Failed deserializing TestClass from Mongo", t)
})
.map { items =>
items map { item =>
item.name -> item.age
} toMap
}
Await.result(res, 10 seconds)
}
Then I get the following error:
play.api.libs.json.JsResultException: JsResultException(errors:List((/age,List(ValidationError(List(error.expected.jsnumber),WrappedArray())))))
I've debugged the reading of the document and noticed that when it first converts the BSON to a JsObject, then the long field is as following:
"age": {"$long": 1526389200000}
I found a way to make this work but I really don't like it:
case class MyBSONLong(`$long`: Long)
object MyBSONLong {
implicit val longReads = Json.reads[MyBSONLong]
}
case class TestClass(name: String, age: Long)
object TestClass {
implicit val reads = (
(__ \ "name").read[String] and
(__ \ "age").read[MyBSONLong].map(_.`$long`)
) (apply _)
}
So this works, but it's a very ugly solution.
Is there a better way to do this?
Thanks in advance :)

Transform object in another one, manually

I have the next method:
fun get(browsePlayerContext: BrowsePlayerContext): Single<List<Conference>>
Which returns a Single> with the next structure for the Conference object:
data class Conference(
val label: String,
val uid: UID?,
val action: BrowsePlayerAction?,
val image: String
)
But I need to transfor this response in a:
Single<List<EntityBrowse>>
The entity browse has the same structure I mean:
data class EntityBrowse(
val label: String,
val uid: UID?,
val action: BrowsePlayerAction?,
val image: String
)
I am doing the transformation manually, but I need a more sophisticated way, because I am gona get different kind of objects and I will have to do the same transformation to EntityBrowse.
Any ideas?
You could use the .map function to transform the Conference objects into EntityBrowse objects
val conferences: List<Conference> = getConferences()
val entities: List<Entities> = conferences.map {conference ->
EntityBrowse(conference.label, conference.uid, conference.action, conference.image)
}
You can use map function on Single object to transform Single<List<Conference>> to Single<List<EntityBrowse>>:
val result: Single<List<EntityBrowse>> = get(context).map { conferences: List<Conference> ->
// transform List<Conference> to List<EntityBrowse> using `conferences` variable
conferences.map { EntityBrowse(it.label, it.uid, it.action, it.image) }
}

Convert RDD[Array[Row]] to RDD[Row]

How to convert RDD[Array[Row]] to RDD[Row]?
Details:
I have some use case where my parsing function returns type Array[Row] for some data and Row for some data. How will I convert both of these to RDD[Row] for further use?
CODE SAMPLE
private def getRows(rdd: RDD[String], parser: Parser): RDD[Row] = {
var processedLines = rdd.map { line =>
map(p => parser.processBeacon(line) }
val rddOfRowsList = processedLines.map { x =>
x match {
case Right(obj) => obj.map { p =>
MyRow.getValue(p)
}//I can use flatmap here
case Left(obj) =>
MyRow.getValue(obj)
}//Cant use flatmap here
}
// Here I have to convert rddOfRowsList to RDD[Row]
//?????
val rowsRdd =?????
//
rowsRdd
}
def processLine(logMap: Map[String, String]):Either[Map[String, Object], Array[Map[String, Object]]] =
{
//process
}
Use flatMap;
rdd.flatMap(identity)
You ca use flatmap to get new rdd, and then use union to compose them.
use flatMap to flattern the contents of RDD

spark read wholeTextFiles with non UTF-8 encoding

I want to read whole text files in non UTF-8 encoding via
val df = spark.sparkContext.wholeTextFiles(path, 12).toDF
into spark. How can I change the encoding?
I would want to read ISO-8859 encoded text, but it is not CSV, it is something similar to xml:SGML.
edit
maybe a custom Hadoop file input format should be used?
https://dzone.com/articles/implementing-hadoops-input-format-and-output-forma
http://henning.kropponline.de/2016/10/23/custom-matlab-inputformat-for-apache-spark/
You can read the files using SparkContext.binaryFiles() instead and build the String for the contents specifying the charset you need. E.g:
val df = spark.sparkContext.binaryFiles(path, 12)
.mapValues(content => new String(content.toArray(), StandardCharsets.ISO_8859_1))
.toDF
It's Simple.
Here is the source code,
import java.nio.charset.Charset
import org.apache.hadoop.io.{Text, LongWritable}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
object TextFile {
val DEFAULT_CHARSET = Charset.forName("UTF-8")
def withCharset(context: SparkContext, location: String, charset: String): RDD[String] = {
if (Charset.forName(charset) == DEFAULT_CHARSET) {
context.textFile(location)
} else {
// can't pass a Charset object here cause its not serializable
// TODO: maybe use mapPartitions instead?
context.hadoopFile[LongWritable, Text, TextInputFormat](location).map(
pair => new String(pair._2.getBytes, 0, pair._2.getLength, charset)
)
}
}
}
From here it's copied.
https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/util/TextFile.scala
To Use it.
https://github.com/databricks/spark-csv/blob/master/src/test/scala/com/databricks/spark/csv/util/TextFileSuite.scala
Edit:
If you need wholetext file,
Here is the actual source of the implementation.
def wholeTextFiles(
path: String,
minPartitions: Int = defaultMinPartitions): RDD[(String, String)] = withScope {
assertNotStopped()
val job = NewHadoopJob.getInstance(hadoopConfiguration)
// Use setInputPaths so that wholeTextFiles aligns with hadoopFile/textFile in taking
// comma separated files as input. (see SPARK-7155)
NewFileInputFormat.setInputPaths(job, path)
val updateConf = job.getConfiguration
new WholeTextFileRDD(
this,
classOf[WholeTextFileInputFormat],
classOf[Text],
classOf[Text],
updateConf,
minPartitions).map(record => (record._1.toString, record._2.toString)).setName(path)
}
Try changing :
.map(record => (record._1.toString, record._2.toString))
to(probably):
.map(record => (record._1.toString, new String(record._2.getBytes, 0, record._2.getLength, "myCustomCharset")))

How can I retrieve the alias for a DataFrame in Spark

I'm using Spark 2.0.2. I have a DataFrame that has an alias on it, and I'd like to be able to retrieve that. A simplified example of why I'd want that is below.
def check(ds: DataFrame) = {
assert(ds.count > 0, s"${df.getAlias} has zero rows!")
}
The above code of course fails because DataFrame has no getAlias function. Is there a way to do this?
You can try something like this but I wouldn't go so far to claim it is supported:
Spark < 2.1:
import org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias
import org.apache.spark.sql.Dataset
def getAlias(ds: Dataset[_]) = ds.queryExecution.analyzed match {
case SubqueryAlias(alias, _) => Some(alias)
case _ => None
}
Spark 2.1+:
def getAlias(ds: Dataset[_]) = ds.queryExecution.analyzed match {
case SubqueryAlias(alias, _, _) => Some(alias)
case _ => None
}
Example usage:
val plain = Seq((1, "foo")).toDF
getAlias(plain)
Option[String] = None
val aliased = plain.alias("a dataset")
getAlias(aliased)
Option[String] = Some(a dataset)
Disclaimer: as stated above, this code relies on undocumented APIs subject to change. It works as of Spark 2.3.
After much digging into mostly undocumented Spark methods, here is the full code to pull the list of fields, along with the table alias for a dataframe in PySpark:
def schema_from_plan(df):
plan = df._jdf.queryExecution().analyzed()
all_fields = _schema_from_plan(plan)
iterator = plan.output().iterator()
output_fields = {}
while iterator.hasNext():
field = iterator.next()
queryfield = all_fields.get(field.exprId().id(),{})
if not queryfield=={}:
tablealias = queryfield["tablealias"]
else:
tablealias = ""
output_fields[field.exprId().id()] = {
"tablealias": tablealias,
"dataType": field.dataType().typeName(),
"name": field.name()
}
return list(output_fields.values())
def _schema_from_plan(root,tablealias=None,fields={}):
iterator = root.children().iterator()
while iterator.hasNext():
node = iterator.next()
nodeClass = node.getClass().getSimpleName()
if (nodeClass=="SubqueryAlias"):
# get the alias and process the subnodes with this alias
_schema_from_plan(node,node.alias(),fields)
else:
if tablealias:
# add all the fields, along with the unique IDs, and a new tablealias field
iterator = node.output().iterator()
while iterator.hasNext():
field = iterator.next()
fields[field.exprId().id()] = {
"tablealias": tablealias,
"dataType": field.dataType().typeName(),
"name": field.name()
}
_schema_from_plan(node,tablealias,fields)
return fields
# example: fields = schema_from_plan(df)
For Java:
As #veinhorn mentioned, it is also possible to get the alias in Java. Here is a utility method example:
public static <T> Optional<String> getAlias(Dataset<T> dataset){
final LogicalPlan analyzed = dataset.queryExecution().analyzed();
if(analyzed instanceof SubqueryAlias) {
SubqueryAlias subqueryAlias = (SubqueryAlias) analyzed;
return Optional.of(subqueryAlias.alias());
}
return Optional.empty();
}

Resources