Parse SQL DDL into a JSON schema file - python-3.x

Can a SQL DDL statement be parsed into a simple JSON schema file, as shown below, without using any tools, only Scala/Python/shell scripting?
CREATE TABLE TEMP (
ID INT,
NAME STRING)
[
  {
    "tableName": "temp",
    "columns": [
      {
        "columnname": "id",
        "datatype": "int"
      },
      {
        "columnname": "name",
        "datatype": "string"
      }
    ]
  }
]
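Since the question also asks about Python, here is a minimal Python 3 sketch of the same idea, using only the standard library (re and json); it assumes a single, simple CREATE TABLE statement like the one above and is not a general DDL parser:
import json
import re

ddl = "CREATE TABLE TEMP (ID INT, NAME STRING)"

# Pull out the table name and the column list between the parentheses
# (assumes the statement matches; no error handling in this sketch)
match = re.match(r"\s*create\s+table\s+(\S+)\s*\((.+)\)", ddl, re.IGNORECASE | re.DOTALL)
table_name, column_block = match.group(1), match.group(2)

columns = []
for column_def in column_block.split(","):
    name, datatype = column_def.split()[:2]
    columns.append({"columnname": name.lower(), "datatype": datatype.lower()})

schema = [{"tableName": table_name.lower(), "columns": columns}]
print(json.dumps(schema, indent=2))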

You can build a JSON string from your DDL using the logic below (Scala code). Once the string is built, it is converted into a DataFrame, which is then saved to HDFS/Amazon S3 as a JSON file using the DataFrame's built-in write.json API.
import spark.implicits._

val createSql = "CREATE TABLE TEMP (ID INT, NAME STRING)"

// start the JSON string with the table name (third token of the statement)
var jsonString = """[{"tableName":"""" + createSql.split(" ")(2).toLowerCase + "\"," + "\"columns\":["

// take the column definitions between the parentheses and append one JSON object per column
createSql.split(s"\\(")(1).split(s"\\)")(0).split(",").foreach(r => {
  jsonString += "{" + "\"columnname\": " + "\"" + r.trim.split(" ")(0).toLowerCase + "\"," +
    "\"datatype\": " + "\"" + r.trim.split(" ")(1).toLowerCase + "\"},"
})

// drop the trailing comma and close the JSON structure
jsonString = jsonString.patch(jsonString.lastIndexOf(','), "", 1) + "]}]"

// read the JSON string into a DataFrame and write it out as a single JSON file
val df = spark.read.json(sc.parallelize(Array(jsonString)))
df.coalesce(1).write.json("<targetlocation>")
Please let me know if you have any questions.

With some pattern matching, as described for example in "How to pattern match using regular expression in Scala?", the code could look like the following, assuming your initial expression is passed in as a sequence of lines (note that JSONObject as used below is deprecated, so replace it with some alternative).
import scala.collection.immutable
import scala.util.parsing.json.{JSONArray, JSONObject}

object Parser {

  // "r" string interpolator so regexes can be used directly in pattern matches
  implicit class Regex(sc: StringContext) {
    def r = new util.matching.Regex(sc.parts.mkString, sc.parts.tail.map(_ => "x"): _*)
  }

  def toJson(tablename: String, columns: Seq[(String, String)]): String = {
    val columnList: List[JSONObject] = columns.toStream.map(x => JSONObject(Map("columnname" -> x._1, "datatype" -> x._2))).toList
    JSONArray(List(JSONObject(Map("tableName" -> tablename, "columns" -> JSONArray(columnList))))).toString()
  }

  def parse(lines: Seq[String]): (String, Seq[(String, String)]) = {
    lines.mkString("").toLowerCase match {
      case r"create\s+table\s+(\S+)${tablename}\s+\((.+)${columns}\).*" =>
        val columnWithType: immutable.Seq[(String, String)] = columns.split(",").toStream
          .map(x => x.split("\\s+"))
          .map(x => (x.head.toLowerCase, x(1).toLowerCase))
        (tablename, columnWithType)
      case _ => ("", Seq.empty)
    }
  }
}
To test that with your test string:
val data: (String, Seq[(String, String)]) = Parser.parse(Seq("CREATE TABLE TEMP (", "ID INT,", "NAME STRING)"))
println(Parser.toJson(data._1, data._2))

With the scala.util.parsing.combinator package, you could define your own lexer and grammar parser for the DDL. For example, this is how a JSON grammar looks:
import scala.util.parsing.combinator._

class JSON extends JavaTokenParsers {
  def value: Parser[Any] = obj | arr | stringLiteral | floatingPointNumber | "null" | "true" | "false"
  def obj: Parser[Any] = "{"~repsep(member, ",")~"}"
  def arr: Parser[Any] = "["~repsep(value, ",")~"]"
  def member: Parser[Any] = stringLiteral~":"~value
}
The code above parses a JSON string into a token stream for further processing. Read the documentation and you will be able to define your SQL DDL parser in the same way.
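If you would rather do the grammar-based approach in Python (per the question's python-3.x tag), a rough sketch could use the pyparsing library instead; this is only an illustration of the idea for the single statement in the question, not a complete DDL grammar, and pyparsing is an assumption on my part:
import json
from pyparsing import CaselessKeyword, Group, Suppress, Word, alphanums, alphas, delimitedList

# hypothetical minimal grammar: CREATE TABLE <name> ( <col> <type> [, <col> <type>]* )
identifier = Word(alphas, alphanums + "_")
column_def = Group(identifier("name") + identifier("type"))
create_stmt = (
    CaselessKeyword("CREATE") + CaselessKeyword("TABLE") + identifier("table")
    + Suppress("(") + Group(delimitedList(column_def))("columns") + Suppress(")")
)

parsed = create_stmt.parseString("CREATE TABLE TEMP (ID INT, NAME STRING)")
schema = [{
    "tableName": parsed["table"].lower(),
    "columns": [{"columnname": c["name"].lower(), "datatype": c["type"].lower()}
                for c in parsed["columns"]],
}]
print(json.dumps(schema, indent=2))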

We just published this package at https://github.com/deepstartup/jsonutils. Maybe you will find it useful. If you need us to update something, open up a JIRA.
Try:
pip install DDLJ
from DDLj import genddl
genddl(*param1,param2,*param3,*param4)
Where:
param1 = JSON schema file
param2 = database (default Oracle)
param3 = glossary file
param4 = DDL output script

Related

Groovy append to json file

I have a Jenkins pipeline script where I am collecting data in JSON format. I want all the data to be accumulated in a string and finally written to a JSON file. But in my script, the value is overwritten.
for(String item: jsonObj.devices) {
    os = item.os
    if(!os.contains("\\.")) {
        osV = os + ".0"
    } else {
        osV = os
    }
    def osValues = osV.split('\\.')
    if(values[0].toInteger() <= osValues[0].toInteger()) {
        name = item.name
        def data = [
            name: "$name",
            os: "$os",
        ]
        def json_str = JsonOutput.toJson(data)
        def json_beauty = JsonOutput.prettyPrint(json_str)
        writeJSON(file: 'message1.json', json: json_beauty)
    }
}
I want all the data to be collected and finally written to sample.json. Please let me know where I am wrong.

convert string output of for loop into a single array in Groovy

I need to take my output from the for loop below and add it to a single array. The list object can either be ["envName-inactive-1", "active-1", "inactive-1", "envName-active-1"] or ["envName-inactive-2", "", "", "envName-active-2"]
My code:
if (appendVersion) {
    for (elements in list) {
        test = (elements + "-" + branch)
        println(test)
    }
} else {
    println(list)
}
output:
envName-inactive-1-v2
active-1-v2
inactive-1-v2
envName-active-1-v2
and
envName-inactive-2-v2
-v2
-v2
envName-active-2-v2
desired output:
["envName-inactive-1-v2", "active-1-v2", "inactive-1-v2", "envName-active-1-v2"]
and
["envName-inactive-2-v2", "", "", "envName-active-2-v2"]
Your desired format seems to be JSON. In Jenkins you have the option to use writeJSON to convert the list to JSON format.
def branch = "v2"
def list = ["envName-inactive-1", "active-1", "inactive-1", "envName-active-1"]
def versionedList = list.collect{it ? it+'-'+branch : ''}
def json = writeJSON returnText: true, json: versionedList
println json
The same in plain Groovy:
def branch = "v2"
def list = ["envName-inactive-1", "active-1", "inactive-1", "envName-active-1"]
def versionedList = list.collect{it ? it+'-'+branch : ''}
def json = new groovy.json.JsonBuilder(versionedList).toString()
println json
result:
["envName-inactive-1-v2","active-1-v2","inactive-1-v2","envName-active-1-v2"]

python3 nested dictionary unpack for format string

I am trying to pass a dictionary (loaded from a JSON file) to a format string. While single key-value unpacking works correctly, I am not sure how I can access the nested keys (children) with a format string.
Or is there any better way to pass JSON to string formatting?
config = {
    "TEST": "TEST",
    "TEST1": "TEST1",
    "TEST2": {
        "TEST21": "TEST21"
    }
}
query_1 = """
{TEST} {TEST1}
"""
query_2 = """
{TEST} {TEST1}
{TEST2.TEST21}
"""
print(query_1.format(**config))  # WORKING
print(query_2.format(**config))  # NOT WORKING
Using an f-string:
config = {
    "TEST": "TEST",
    "TEST1": "TEST1",
    "TEST2": {
        "TEST21": "TEST21"
    }
}
query_2 = f"""
{config['TEST']} {config['TEST1']}
{config['TEST2']['TEST21']}
"""
print(query_2)
Note: if the query is a SQL query, there is probably a better way to do what you are doing than string formatting.
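For example, a minimal sketch with the standard library's sqlite3 module (just an illustration; your actual driver and queries will differ) passes the values as query parameters instead of formatting them into the string:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temp (id INTEGER, name TEXT)")
conn.execute("INSERT INTO temp (id, name) VALUES (?, ?)", (1, "TEST"))

# The placeholders are filled in by the driver, which avoids quoting/injection issues
for row in conn.execute("SELECT id, name FROM temp WHERE name = ?", ("TEST",)):
    print(row)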
In your query_2, change {TEST2.TEST21} to {TEST2[TEST21]} and it will work.
Example:
query_2 = """
{TEST} {TEST1}
{TEST2[TEST21]}
"""
print(query_2.format(**config))
Output
TEST TEST1
TEST21

Get file name using hadoopFile

I'm using Spark 2.2 along with Scala 2.11 to parse a directory and transform data inside.
To handle the ISO charset, I'm using hadoopFile like this:
val inputDirPath = "myDirectory"
sc.hadoopFile[LongWritable, Text, TextInputFormat](inputDirPath).map(pair => new String(pair._2.getBytes, 0, pair._2.getLength, "iso-8859-1")).map(ProcessFunction(_)).toDF
How can I get the file name of each row inside ProcessFunction?
ProcessFunction takes a String as a parameter and returns an object.
Thank you for your time
This answer includes your function ProcessFunction:
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.HadoopRDD

val inputDirPath = "dataset.txt"
val textRdd = sc.hadoopFile[LongWritable, Text, TextInputFormat](inputDirPath)
// cast to HadoopRDD to get access to the input split (and thus the file name)
val linesWithFileNames = textRdd.asInstanceOf[HadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit((inputSplit, iterator) => {
    val file = inputSplit.asInstanceOf[FileSplit]
    iterator.map(tuple => (file.getPath, new String(tuple._2.getBytes, 0, tuple._2.getLength, "iso-8859-1")))
  }).map { case (path, line) => (path, ProcessFunction(line)) }
And the same without applying ProcessFunction, just pairing each line with its file path:
val textRdd = sc.hadoopFile[LongWritable, Text, TextInputFormat](inputDirPath)
// cast to HadoopRDD
val linesWithFileNames = textRdd.asInstanceOf[HadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit((inputSplit, iterator) => {
    val file = inputSplit.asInstanceOf[FileSplit]
    iterator.map(tuple => (file.getPath, tuple._2))
  })
linesWithFileNames.foreach(println)

How to get a multi-line JSON file into a single record as an RDD

rdd=sc.textFile(json or xml)
rdd.collect()
[u'{', u' "glossary": {', u' "title": "example glossary",', u'\t\t"GlossDiv": {', u' "title": "S",', u'\t\t\t"GlossList": {', u' "GlossEntry": {', u' "ID": "SGML",', u'\t\t\t\t\t"SortAs": "SGML",', u'\t\t\t\t\t"GlossTerm": "Standard Generalized Markup Language",', u'\t\t\t\t\t"Acronym": "SGML",', u'\t\t\t\t\t"Abbrev": "ISO 8879:1986",', u'\t\t\t\t\t"GlossDef": {', u' "para": "A meta-markup language, used to create markup languages such as DocBook.",', u'\t\t\t\t\t\t"GlossSeeAlso": ["GML", "XML"]', u' },', u'\t\t\t\t\t"GlossSee": "markup"', u' }', u' }', u' }', u' }', u'}', u'']
But my output should be everything on one line:
{"glossary": {"title": "example glossary","GlossDiv": {"title": "S","GlossList":.....}}
I'd recommend using Spark SQL's JSON support and then calling toJSON when saving (see https://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets ).
val input = sqlContext.jsonFile(path)
val output = input...
output.toJSON.saveAsTextFile(outputPath)
However, if your JSON records can't be parsed by Spark SQL because of the multi-line issue or some other issue, we can take one of the examples from the Learning Spark book (slightly biased as a co-author of course) and modify it to use wholeTextFiles.
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

case class Person(name: String, lovesPandas: Boolean)

// Read the input and throw away the file names
val input = sc.wholeTextFiles(inputFile).map(_._2)
// Parse it into a specific case class. We use mapPartitions because:
// (a) ObjectMapper is not serializable, so we would either have to create a singleton object
//     encapsulating ObjectMapper on the driver and send data back to the driver to go through it,
//     or let each node create its own ObjectMapper, which is expensive inside a plain map.
// (b) Creating one ObjectMapper per partition with mapPartitions solves both the serialization
//     issue and the object-creation performance hit.
val result = input.mapPartitions(records => {
  // mapper object created on each executor node
  val mapper = new ObjectMapper with ScalaObjectMapper
  mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
  mapper.registerModule(DefaultScalaModule)
  // We use flatMap to handle errors: return an empty list (None) if we encounter an issue,
  // and a list with one element (Some(_)) if everything is ok.
  records.flatMap(record => {
    try {
      Some(mapper.readValue(record, classOf[Person]))
    } catch {
      case e: Exception => None
    }
  })
}, true)
result.filter(_.lovesPandas)
  .mapPartitions { people =>
    // a mapper is again created per partition to serialize back to JSON
    val mapper = new ObjectMapper with ScalaObjectMapper
    mapper.registerModule(DefaultScalaModule)
    people.map(mapper.writeValueAsString(_))
  }
  .saveAsTextFile(outputFile)
And in Python:
from pyspark import SparkContext
import json
import sys

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Error usage: LoadJson [sparkmaster] [inputfile] [outputfile]")
        sys.exit(-1)
    master = sys.argv[1]
    inputFile = sys.argv[2]
    outputFile = sys.argv[3]
    sc = SparkContext(master, "LoadJson")
    # keep only the file contents, dropping the file names
    input = sc.wholeTextFiles(inputFile).map(lambda x: x[1])
    # each whole file is expected to hold a JSON array of records
    data = input.flatMap(lambda x: json.loads(x))
    data.filter(lambda x: 'lovesPandas' in x and x['lovesPandas']).map(
        lambda x: json.dumps(x)).saveAsTextFile(outputFile)
    sc.stop()
    print("Done!")
Use sc.wholeTextFiles() instead.
Also have a look at sqlContext.jsonFile:
https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#json-datasets
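As a side note, on Spark 2.2+ a minimal PySpark sketch (the paths here are placeholders) can let the DataFrame JSON reader handle multi-line documents directly via the multiLine option, and toJSON then emits one compact line per record:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MultiLineJson").getOrCreate()

# each file is treated as a single JSON document (or a JSON array of documents)
df = spark.read.option("multiLine", "true").json("path/to/input")

# toJSON yields one compact, single-line JSON string per record
df.toJSON().saveAsTextFile("path/to/output")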
