Spark: Access a Row inside a UDF - apache-spark
I have the following UDF, used to convert a time stored as a string into a timestamp.
import java.sql.Timestamp
import org.joda.time.format.DateTimeFormat
import org.apache.spark.sql.functions.udf

val hmsToTimeStampUdf = udf((dt: String) => {
  if (dt == null) null else {
    val formatter = DateTimeFormat.forPattern("HH:mm:ss")
    try {
      new Timestamp(formatter.parseDateTime(dt).getMillis)
    } catch {
      case t: Throwable => throw new RuntimeException("hmsToTimeStampUdf,dt=" + dt, t)
    }
  }
})
This UDF is used to convert a String value into a Timestamp:
outputDf.withColumn(schemaColumn.name, hmsToTimeStampUdf(col(schemaColumn.name)))
But some CSV files have an invalid value for this column, causing a RuntimeException. I want to find which rows contain these broken records. Is it possible to access row information inside the UDF?
Instead of throwing a RuntimeException that kills your CSV parsing, a better approach may be to have the UDF return a (well-formed, corrupted) tuple. You can then easily segregate the good and bad rows by selecting the is null / is not null subsets, as shown after the example output below.
def safeConvert(dt: String): (Timestamp, String) = {
  if (dt == null)
    (null, null)
  else {
    val formatter = DateTimeFormat.forPattern("HH:mm:ss")
    try {
      (new Timestamp(formatter.parseDateTime(dt).getMillis), null)
    } catch {
      case e: Exception =>
        (null, dt)
    }
  }
}
val safeConvertUDF = udf(safeConvert(_:String))
val df = Seq(("00:01:02"),("03:04:05"),("67:89:10")).toDF("dt")
df.withColumn("temp",safeConvertUDF($"dt"))
.withColumn("goodData",$"temp".getItem("_1"))
.withColumn("badData",$"temp".getItem("_2"))
.drop($"temp").show(false)
+--------+-------------------+--------+
|dt |goodData |badData |
+--------+-------------------+--------+
|00:01:02|1970-01-01 00:01:02|null |
|03:04:05|1970-01-01 03:04:05|null |
|67:89:10|null |67:89:10|
+--------+-------------------+--------+
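With the two helper columns in place, segregating the good and bad rows is just a filter on badData. A minimal sketch reusing the names from the example above; the converted, goodRows and badRows names are only illustrative:
val converted = df.withColumn("temp", safeConvertUDF($"dt"))
  .withColumn("goodData", $"temp".getItem("_1"))
  .withColumn("badData", $"temp".getItem("_2"))
  .drop($"temp")

// rows whose time string parsed correctly
val goodRows = converted.filter($"badData".isNull)
// rows carrying a broken time string such as "67:89:10"
val badRows = converted.filter($"badData".isNotNull)
badRows.show(false)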
You can add the row as a second input parameter to the UDF:
import org.apache.spark.sql.Row

val hmsToTimeStampUdf = udf((dt: String, r: Row) => {
  if (dt == null) null else {
    val formatter = DateTimeFormat.forPattern("HH:mm:ss")
    try {
      new Timestamp(formatter.parseDateTime(dt).getMillis)
    } catch {
      case t: Throwable =>
        println(r) // do some error handling
        null
    }
  }
})
When calling the UDF, pass a struct containing all columns of the dataframe as the second parameter (thanks to this answer):
df.withColumn("dt", hmsToTimeStampUdf(col("dt"), struct(df.columns.map(df(_)) : _*)))