How to convert RDD[Array[Row]] to RDD[Row]?
Details:
I have a use case where my parsing function returns Array[Row] for some data and Row for other data. How can I convert both of these into a single RDD[Row] for further use?
CODE SAMPLE
private def getRows(rdd: RDD[String], parser: Parser): RDD[Row] = {
  val processedLines = rdd.map { line => parser.processBeacon(line) }
  val rddOfRowsList = processedLines.map {
    case Right(objs) =>
      objs.map { p => MyRow.getValue(p) } // I can use flatMap here
    case Left(obj) =>
      MyRow.getValue(obj) // but can't use flatMap here
  }
  // Here I have to convert rddOfRowsList to RDD[Row]
  // ?????
  val rowsRdd = ?????
  rowsRdd
}
def processLine(logMap: Map[String, String]): Either[Map[String, Object], Array[Map[String, Object]]] = {
  // process
}
Use flatMap:
rdd.flatMap(identity)
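For the mixed Row / Array[Row] case in the question, a hedged sketch (reusing the question's parser.processBeacon and MyRow.getValue, whose exact signatures are assumed) is to wrap the single-Row branch in a Seq so both branches can be flattened in one pass:
private def getRows(rdd: RDD[String], parser: Parser): RDD[Row] = {
  rdd.flatMap { line =>
    parser.processBeacon(line) match {
      case Right(objs) => objs.toSeq.map(MyRow.getValue) // many rows for this line
      case Left(obj)   => Seq(MyRow.getValue(obj))       // a single row, wrapped so it flattens too
    }
  }
}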
You can use flatMap to get a new RDD, and then use union to compose them.
Use flatMap to flatten the contents of the RDD.
Related
I've started using the play-json/play-json-compat libraries with reactivemongo 0.20.11.
So I can use JSON Play reads/writes while importing the 'reactivemongo.play.json._' package and then easily fetch data from a JSONCollection instead of a BSONCollection.
For most cases, this works great but for Long fields, it doesn't :(
For example:
case class TestClass(name: String, age: Long)
object TestClass {
implicit val reads = Json.reads[TestClass]
}
If I try querying using the following function:
def getData: Map[String, TestClass] = {
val res = collection.find(emptyDoc)
.cursor[TestClass]()
.collect[List](-1, Cursor.ContOnError[List[TestClass]] { case (_, t) =>
failureLogger.error(s"Failed deserializing TestClass from Mongo", t)
})
.map { items =>
items map { item =>
item.name -> item
} toMap
}
Await.result(res, 10 seconds)
}
Then I get the following error:
play.api.libs.json.JsResultException: JsResultException(errors:List((/age,List(ValidationError(List(error.expected.jsnumber),WrappedArray())))))
I've debugged the reading of the document and noticed that when it first converts the BSON to a JsObject, the long field looks like this:
"age": {"$long": 1526389200000}
I found a way to make this work but I really don't like it:
case class MyBSONLong(`$long`: Long)
object MyBSONLong {
implicit val longReads = Json.reads[MyBSONLong]
}
case class TestClass(name: String, age: Long)
object TestClass {
implicit val reads = (
(__ \ "name").read[String] and
(__ \ "age").read[MyBSONLong].map(_.`$long`)
) (apply _)
}
So this works, but it's a very ugly solution.
Is there a better way to do this?
Thanks in advance :)
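A hedged sketch of one possible cleanup (my own illustration, not from the thread; it assumes play-json's standard Reads API): define a Reads[Long] that accepts either a plain number or the {"$long": ...} wrapper, so the case-class Reads can stay macro-generated:
import play.api.libs.json._

// Falls back to the "$long" field when the value is not a plain JSON number.
implicit val bsonAwareLongReads: Reads[Long] = Reads[Long] { json =>
  Reads.LongReads.reads(json) orElse (json \ "$long").validate(Reads.LongReads)
}

case class TestClass(name: String, age: Long)
object TestClass {
  implicit val reads: Reads[TestClass] = Json.reads[TestClass] // picks up bsonAwareLongReads for age
}
Note that the custom implicit has to be visible in the scope where Json.reads is expanded.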
I have many JSONs with the following structure:
{
"p1":"v1",
"p2":"v2",
"p3":"v3",
"modules": "{ \"nest11\":\"n1v1\", \"nest12\":\"n1v2\", \"nest13\": { \"nest21\": \"n2v1\" } }"
}
How to parse it to this?
v1, v2, v3, n1v1, n1v2, n2v1
It is not a problem to extract "v1, v2, v3", but how do I access "n1v1, n1v2, n2v1" with the Spark DataFrame API?
One approach is to use the DataFrameFlattener implicit class found on the official Databricks site.
First you will need to define the JSON schema for the modules column, and then flatten the dataframe as shown below. Here I assume that the file test_json.txt has the following content:
{
"p1":"v1",
"p2":"v2",
"p3":"v3",
"modules": "{ \"nest11\":\"n1v1\", \"nest12\":\"n1v2\", \"nest13\": { \"nest21\": \"n2v1\" } }"
}
Here is the code:
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.types.{DataType, StructType, StringType}
import spark.implicits._ // for the $"column" syntax
implicit class DataFrameFlattener(df: DataFrame) {
def flattenSchema: DataFrame = {
df.select(flatten(Nil, df.schema): _*)
}
protected def flatten(path: Seq[String], schema: DataType): Seq[Column] = schema match {
case s: StructType => s.fields.flatMap(f => flatten(path :+ f.name, f.dataType))
case other => col(path.map(n => s"`$n`").mkString(".")).as(path.mkString(".")) :: Nil
}
}
val schema = (new StructType)
.add("nest11", StringType)
.add("nest12", StringType)
.add("nest13", (new StructType).add("nest21", StringType, false))
val df = spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("C:\\temp\\test_json.txt")
df.withColumn("modules", from_json($"modules", schema))
.select($"*")
.flattenSchema
And this should be the output:
+--------------+--------------+---------------------+---+---+---+
|modules.nest11|modules.nest12|modules.nest13.nest21|p1 |p2 |p3 |
+--------------+--------------+---------------------+---+---+---+
|n1v1 |n1v2 |n2v1 |v1 |v2 |v3 |
+--------------+--------------+---------------------+---+---+---+
Please let me know if you need further clarification.
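If only a handful of leaf values are needed, another option (a hedged sketch, reusing the same df and column names as above) is get_json_object, which pulls values straight out of the JSON string column without defining a schema:
import org.apache.spark.sql.functions.get_json_object

df.select(
  $"p1", $"p2", $"p3",
  get_json_object($"modules", "$.nest11").as("nest11"),
  get_json_object($"modules", "$.nest12").as("nest12"),
  get_json_object($"modules", "$.nest13.nest21").as("nest21")
).show(false)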
All you need to do is parse the JSON string into an actual JavaScript object:
const originalJSON = {
"p1":"v1",
"p2":"v2",
"p3":"v3",
"modules": "{ \"nest11\":\"n1v1\", \"nest12\":\"n1v2\", \"nest13\": { \"nest21\": \"n2v1\" } }"
}
const { modules, ...rest } = originalJSON
const result = {
...rest,
modules: JSON.parse(modules)
}
console.log(result)
console.log(result.modules.nest11)
When you retrieve the "modules" element, you are actually retrieving a string. You have to instantiate this string as a new JSON object. I don't know what language you're using, but you generally do something like:
String modules_str = originalJSON.get("modules");
JSON modulesJSON = new JSON(modules_str);
String nest11_str = modulesJSON.get("nest11");
I'm working on an internal website where the URL contains the source reference from other systems. This is a business requirement and cannot be changed.
i.e. "http://localhost:9000/source.address.com/7808/project/repo"
"http://localhost:9000/build.address.com/17808/project/repo"
I need to remove these source strings from the path so that only "project/repo" remains, using a trait so this can be used natively from multiple services. I also want to be able to add more sources to this list (which already exists) without modifying the method.
"def normalizePath" is the method accessed by the services; I have two non-ideal but reasonable attempts so far. I'm getting stuck on using foldLeft, which I would like some help with, or a simpler way of doing what is described. Code samples below.
First attempt, using if-else (not ideal, as I'd need to add more if/else statements down the line, and it's less readable than a pattern match):
trait NormalizePath {
def normalizePath(path: String): String = {
if (path.startsWith("build.address.com/17808")) {
path.substring("build.address.com/17808".length, path.length)
} else {
path
}
}
}
And the second attempt (not ideal, as more patterns will likely get added, and it generates more bytecode than if/else):
trait NormalizePath {
val pattern = "build.address.com/17808/"
val pattern2 = "source.address.com/7808/"
def normalizePath(path: String) = path match {
case s if s.startsWith(pattern) => s.substring(pattern.length, s.length)
case s if s.startsWith(pattern2) => s.substring(pattern2.length, s.length)
case _ => path
}
}
The last attempt is to use an address list (it already exists elsewhere, but is defined here as an MWE) to remove occurrences from the path string, and it doesn't work:
trait NormalizePath {
val replacements = ( // note: this is a Tuple2, not a collection, so it has no foldLeft
  "build.address.com/17808",
  "source.address.com/7808/")
private def remove(path: String, string: String) = {
  path-string // does not compile: String has no `-` operator
}
def normalizePath(path: String): String = {
replacements.foldLeft(path)(remove)
}
}
Appreciate any help on this!
If you are just stripping out those strings:
val replacements = Seq(
"build.address.com/17808",
"source.address.com/7808/")
replacements.foldLeft("http://localhost:9000/source.address.com/7808/project/repo"){
case(path, toReplace) => path.replaceAll(toReplace, "")
}
// http://localhost:9000/project/repo
If you are replacing those strings with something else:
val replacementsMap = Seq(
"build.address.com/17808" -> "one",
"source.address.com/7808/" -> "two/")
replacementsMap.foldLeft("http://localhost:9000/source.address.com/7808/project/repo"){
case(path, (toReplace, replacement)) => path.replaceAll(toReplace, replacement)
}
// http://localhost:9000/two/project/repo
The replacements collection can come from elsewhere in the code, so the method will not need to be modified and redeployed when new sources are added.
// method replacing by empty string
def normalizePath(path: String) = {
replacements.foldLeft(path){
case(startingPoint, toReplace) => startingPoint.replaceAll(toReplace, "")
}
}
normalizePath("foobar/build.address.com/17808/project/repo")
// foobar/project/repo
normalizePath("whateverPath")
// whateverPath
normalizePath("build.address.com/17808build.address.com/17808/project/repo")
// /project/repo
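One caveat worth noting: replaceAll treats its first argument as a regular expression, so the dots in the prefixes match any character. If that matters, the literals can be quoted, as in this sketch:
import java.util.regex.Pattern

def normalizePath(path: String): String =
  replacements.foldLeft(path) {
    case (current, toReplace) => current.replaceAll(Pattern.quote(toReplace), "")
  }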
A very simple replacement could be made as follows:
val replacements = Seq(
"build.address.com/17808",
"source.address.com/7808/")
def normalizePath(path: String): String = {
replacements.find(path.startsWith(_)) // find the first occurrence
.map(prefix => path.substring(prefix.length)) // remove the prefix
.getOrElse(path) // if not found, return the original string
}
Since the expected replacements are very similar, have you tried to generalize them and use regex matching?
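For example, the regex idea might look like the following sketch (my own illustration; the pattern is an assumption based on the two prefixes shown):
// Strips the first "<name>.address.com/<digits>/" segment, wherever it appears in the path.
val AddressPrefix = """[\w.]+\.address\.com/\d+/?""".r

def normalizePath(path: String): String =
  AddressPrefix.replaceFirstIn(path, "")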
There are a million and one ways to extract /project/repo from a String in Scala. Here are a few I came up with:
val list = List("build.address.com/17808", "source.address.com/7808") //etc
def normalizePath(path: String) = {
path.stripPrefix(list.find(x => path.contains(x)).getOrElse(""))
}
Output:
scala> normalizePath("build.address.com/17808/project/repo")
res0: String = /project/repo
val list = List("build.address.com/17808", "source.address.com/7808") //etc
def normalizePath(path: String) = {
list.map(x => if (path.contains(x)) {
path.takeRight(path.length - x.length)
}).filter(y => y != ()).head
}
Output:
scala> normalizePath("build.address.com/17808/project/repo")
res0: Any = /project/repo
val list = List("build.address.com/17808", "source.address.com/7808") //etc
def normalizePath(path: String) = {
list.foldLeft(path)((a, b) => a.replace(b, ""))
}
Output:
scala> normalizePath("build.address.com/17808/project/repo")
res0: String = /project/repo
Depends how complicated you want your code to look (or how silly you want to be), really. Note that the second example has return type Any, which might not be ideal for your scenario. Also, these examples aren't meant to be able to just take the String out of the middle of your path... they can be fairly easily modified if you want to do that though. Let me know if you want me to add some examples just stripping things like build.address.com/17808 out of a String - I'd be happy to do so.
I want to read whole text files in a non-UTF-8 encoding via
val df = spark.sparkContext.wholeTextFiles(path, 12).toDF
into Spark. How can I change the encoding?
I want to read ISO-8859 encoded text, but it is not CSV; it is something similar to XML (SGML).
Edit: maybe a custom Hadoop file input format should be used?
https://dzone.com/articles/implementing-hadoops-input-format-and-output-forma
http://henning.kropponline.de/2016/10/23/custom-matlab-inputformat-for-apache-spark/
You can read the files using SparkContext.binaryFiles() instead and build the String for the contents, specifying the charset you need. E.g.:
import java.nio.charset.StandardCharsets
import spark.implicits._ // for .toDF

val df = spark.sparkContext.binaryFiles(path, 12)
  .mapValues(content => new String(content.toArray(), StandardCharsets.ISO_8859_1))
  .toDF
It's simple. Here is the source code:
import java.nio.charset.Charset
import org.apache.hadoop.io.{Text, LongWritable}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
object TextFile {
val DEFAULT_CHARSET = Charset.forName("UTF-8")
def withCharset(context: SparkContext, location: String, charset: String): RDD[String] = {
if (Charset.forName(charset) == DEFAULT_CHARSET) {
context.textFile(location)
} else {
// can't pass a Charset object here cause its not serializable
// TODO: maybe use mapPartitions instead?
context.hadoopFile[LongWritable, Text, TextInputFormat](location).map(
pair => new String(pair._2.getBytes, 0, pair._2.getLength, charset)
)
}
}
}
It is copied from here:
https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/util/TextFile.scala
To use it, see:
https://github.com/databricks/spark-csv/blob/master/src/test/scala/com/databricks/spark/csv/util/TextFileSuite.scala
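Usage of the helper above might look like this (a sketch; the path and charset are placeholders):
val lines = TextFile.withCharset(spark.sparkContext, "/path/to/legacy-files", "ISO-8859-1")
lines.take(5).foreach(println)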
Edit: if you need whole text files, here is the actual source of the implementation:
def wholeTextFiles(
path: String,
minPartitions: Int = defaultMinPartitions): RDD[(String, String)] = withScope {
assertNotStopped()
val job = NewHadoopJob.getInstance(hadoopConfiguration)
// Use setInputPaths so that wholeTextFiles aligns with hadoopFile/textFile in taking
// comma separated files as input. (see SPARK-7155)
NewFileInputFormat.setInputPaths(job, path)
val updateConf = job.getConfiguration
new WholeTextFileRDD(
this,
classOf[WholeTextFileInputFormat],
classOf[Text],
classOf[Text],
updateConf,
minPartitions).map(record => (record._1.toString, record._2.toString)).setName(path)
}
Try changing:
.map(record => (record._1.toString, record._2.toString))
to (probably):
.map(record => (record._1.toString, new String(record._2.getBytes, 0, record._2.getLength, "myCustomCharset")))
I'm using Spark 2.0.2. I have a DataFrame that has an alias on it, and I'd like to be able to retrieve that. A simplified example of why I'd want that is below.
def check(ds: DataFrame) = {
  assert(ds.count > 0, s"${ds.getAlias} has zero rows!")
}
The above code of course fails because DataFrame has no getAlias function. Is there a way to do this?
You can try something like this, but I wouldn't go so far as to claim it is supported:
Spark < 2.1:
import org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias
import org.apache.spark.sql.Dataset
def getAlias(ds: Dataset[_]) = ds.queryExecution.analyzed match {
case SubqueryAlias(alias, _) => Some(alias)
case _ => None
}
Spark 2.1+:
def getAlias(ds: Dataset[_]) = ds.queryExecution.analyzed match {
case SubqueryAlias(alias, _, _) => Some(alias)
case _ => None
}
Example usage:
val plain = Seq((1, "foo")).toDF
getAlias(plain)
Option[String] = None
val aliased = plain.alias("a dataset")
getAlias(aliased)
Option[String] = Some(a dataset)
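Tying this back to the check function from the question (a hedged sketch using the getAlias helper above):
def check(ds: Dataset[_]): Unit = {
  val name = getAlias(ds).getOrElse("unnamed dataset")
  assert(ds.count > 0, s"$name has zero rows!")
}

check(aliased) // passes as long as the aliased dataset is non-empty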
Disclaimer: as stated above, this code relies on undocumented APIs subject to change. It works as of Spark 2.3.
After much digging into mostly undocumented Spark methods, here is the full code to pull the list of fields, along with the table alias for a dataframe in PySpark:
def schema_from_plan(df):
    plan = df._jdf.queryExecution().analyzed()
    all_fields = _schema_from_plan(plan)
    iterator = plan.output().iterator()
    output_fields = {}
    while iterator.hasNext():
        field = iterator.next()
        queryfield = all_fields.get(field.exprId().id(), {})
        if queryfield != {}:
            tablealias = queryfield["tablealias"]
        else:
            tablealias = ""
        output_fields[field.exprId().id()] = {
            "tablealias": tablealias,
            "dataType": field.dataType().typeName(),
            "name": field.name()
        }
    return list(output_fields.values())

def _schema_from_plan(root, tablealias=None, fields=None):
    # avoid a mutable default argument so repeated calls don't share state
    if fields is None:
        fields = {}
    iterator = root.children().iterator()
    while iterator.hasNext():
        node = iterator.next()
        nodeClass = node.getClass().getSimpleName()
        if nodeClass == "SubqueryAlias":
            # get the alias and process the subnodes with this alias
            _schema_from_plan(node, node.alias(), fields)
        else:
            if tablealias:
                # add all the fields, along with the unique IDs, and a new tablealias field
                # (use a separate iterator so the outer loop's iterator is not exhausted)
                field_iterator = node.output().iterator()
                while field_iterator.hasNext():
                    field = field_iterator.next()
                    fields[field.exprId().id()] = {
                        "tablealias": tablealias,
                        "dataType": field.dataType().typeName(),
                        "name": field.name()
                    }
            _schema_from_plan(node, tablealias, fields)
    return fields

# example: fields = schema_from_plan(df)
For Java:
As #veinhorn mentioned, it is also possible to get the alias in Java. Here is a utility method example:
public static <T> Optional<String> getAlias(Dataset<T> dataset){
final LogicalPlan analyzed = dataset.queryExecution().analyzed();
if(analyzed instanceof SubqueryAlias) {
SubqueryAlias subqueryAlias = (SubqueryAlias) analyzed;
return Optional.of(subqueryAlias.alias());
}
return Optional.empty();
}