How can I show data from a DataTable grouped by the column "RefPI" in C#? - c#-4.0

I have a DataTable in C# and I want to show its data grouped by the column "RefPI" (ReferenceOfThePI). How can I do that? I have written some code, but it is not working. Please help me.
//Below is the current data in the datatable:
RefPI   Date1       Date2     Status1  Status2  Status3  Status4
C2-EP1  28/04/2020  5/5/2022
C2-EP2  28/06/2022  1/5/2019
C2-EP3  15/6/2019   6/5/2019
C2-EP1                        OK       OK
C2-EP2                        ok       ok       data1
C2-EP2                        yes      yes      yes
//Expected result
RefPI   Date1       Date2     Status1  Status2  Status3  Status4
C2-EP1  28/04/20    5/5/22    OK       OK
C2-EP2  28/06/22    1/5/19    ok       ok       data1
                              yes      yes      yes
C2-EP3  15/6/2019   6/5/2019
//My code is as below, but it is not working.
dtHist.Merge(dtMainPIHistoryRepeat);
int iRow = 0;
DataTable dtGrpby = null;
dtGrpby = dtHist.AsEnumerable()
    .GroupBy(r => new
    {
        Col1 = r["RefPI"],
        Col2 = r["Date1"],
        Col3 = r["Date2"],
        Col4 = r["Status1"],
        Col5 = r["Status2"],
        Col6 = r["Status3"],
        Col7 = r["Status4"],
        Col8 = r["Status5"]
    })
    .Select(g =>
    {
        var row = dtHist.NewRow();
        row["RefPI"] = g.Key.Col1;
        row["Date1"] = g.Key.Col2;
        row["Date2"] = g.Key.Col3;
        row["Status1"] = g.Key.Col4;
        row["Status2"] = g.Key.Col5;
        row["Status3"] = g.Key.Col6;
        row["Status4"] = g.Key.Col7;
        row["Status5"] = g.Key.Col8;
        return row;
    }).CopyToDataTable();
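A likely reason the rows never merge is that the GroupBy key includes every column, so two rows only fall into the same group when all of their values already match. Below is a minimal sketch of one possible fix (names such as dtGrouped and valueColumns are only illustrative): it groups on RefPI alone and keeps the first non-empty value per column. If several status rows per RefPI must be preserved, as in the expected result above, the aggregation step would need to collect them instead of collapsing to one row.
var valueColumns = new[] { "Date1", "Date2", "Status1", "Status2", "Status3", "Status4" };

DataTable dtGrouped = dtHist.Clone();   // same schema as dtHist, no rows yet
foreach (var group in dtHist.AsEnumerable().GroupBy(r => r.Field<string>("RefPI")))
{
    DataRow row = dtGrouped.NewRow();
    row["RefPI"] = group.Key;
    foreach (string colName in valueColumns)
    {
        // take the first non-empty value for this column within the RefPI group
        var value = group.Select(r => r[colName]?.ToString())
                         .FirstOrDefault(v => !string.IsNullOrWhiteSpace(v));
        row[colName] = value ?? string.Empty;
    }
    dtGrouped.Rows.Add(row);
}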

Related

Get the columns and their values after comparing two Spark DataFrames where the values are different or rows/columns are new

I have two Spark DataFrames as below.
DF1 (previous day snapshot):
The primary key is id and is unique. The rest of the columns can have duplicate or unique values; it does not matter.
id  name    pincode  landmark  city
1   Vijay1  411021   Zoo1      Pune1
2   Vijay2  411022   Zoo2      null
3   Vijay3  411023   Zoo3      Pune3
4   Vijay4  null     Zoo4      Pune4
5   Vijay5  411025   Zoo5      Pune5
DF2 (new delta):
The primary key is id and is unique. The rest of the columns can have duplicate or unique values; it does not matter.
id  name        pincode     landmark  city       petname
1   Vijay1      411021      Zoo1      Pune1      null
2   Vijay2_New  null        Zoo2      Pune2_New  null
3   Vijay3      411023      Zoo3      Pune3      VJ3
4   Vijay4      411024_New  Zoo4      Pune4      null
5   Vijay5      411025_New  Zoo5      Pune5_New  null
6   Vijay6      411026      null      Pune6      VJ6
If you observe carefully:
a. New column(s) can be added in DF2, here it is petname.
b. New row(s) can be inserted in DF2, here row with id=6 is inserted.
c. Existing column values can be updated in DF2; here many column values are updated for the same id. Some change from null to other values, and vice versa, compared to DF1.
I need help with a Spark code snippet that will give me the differences in the columns and their values as below. I also need the current date_time_column combination.
OutputDF:
id  date_time_column            column   old_value  new_value   operation_type
2   20220423_205226516_name     name     Vijay2     Vijay2_New  update
2   20220423_205226516_pincode  pincode  411022     null        update
2   20220423_205226516_city     city     null       Pune2_New   update
3   20220423_205226516_petname  petname  null       VJ3         update
4   20220423_205226516_pincode  pincode  null       411024_New  update
5   20220423_205226516_pincode  pincode  411025     411025_New  update
5   20220423_205226516_city     city     Pune5      Pune5_New   update
6   20220423_205226516_name     name     null       Vijay6      insert
6   20220423_205226516_pincode  pincode  null       411026      insert
6   20220423_205226516_city     city     null       Pune6       insert
6   20220423_205226516_petname  petname  null       VJ6         insert
Basically, I am trying to fetch all the columns and their old and new values that differ between the two Spark DataFrames. One DataFrame is the previous day's snapshot and the other is the current day's delta. New rows can be inserted, and new columns can also be added (handled by schema evolution). In the case of newly inserted rows, only columns with non-null values are added to the final output DataFrame.
Once found, I am going to write the final DataFrame to DynamoDB for change-audit purposes. The id will be the partition key and date_time_column will be the sort key in DynamoDB.
I hope the question is clear. Let me know if any additional info is required. Thanks for the help in advance.
Update 1:
Below is the code that I have written to get the new rows/insert part.
//Read the snapshot
val df1 = spark.read.option("header", "true").csv("file:///home/notroot/lab/data/snapshot.csv")
//Read the delta
val df2 = spark.read.option("header", "true").csv("file:///home/notroot/lab/data/delta.csv")
//Find new columns in the delta
val newColumnsInDf2 = df2.schema.fieldNames.diff(df1.schema.fieldNames)
//Add the new columns from the delta to the snapshot (as nulls)
val df1WithNewColumnsFromDf2 = newColumnsInDf2.foldLeft(df1)((df, currCol) => df.withColumn(currCol, lit(null)))
//Find new/inserted rows in the delta
val newInsertsDF = df2.as("df2Table").join(
  df1WithNewColumnsFromDf2.as("df1WithNewColumnsFromDf2Table"),
  $"df1WithNewColumnsFromDf2Table.id" === $"df2Table.id",
  "LEFT_ANTI")
//Convert the new/inserted rows into the desired format
val skipColumn = "id"
var columnCount = newInsertsDF.schema.size - 1
var columnsStr = ""
var counter = 0
for (col <- newInsertsDF.columns) {
  counter = counter + 1
  if (col != skipColumn) {
    if (counter == newInsertsDF.schema.size) {
      columnsStr = columnsStr + s"'$col', $col"
    } else {
      columnsStr = columnsStr + s"'$col', $col,"
    }
  }
}
val newInsertsUnpivotedDF = newInsertsDF
  .select($"id", expr(s"stack($columnCount, $columnsStr) as (column, new_value)"))
  .filter($"new_value".isNotNull)
  .withColumn("operation_type", lit("insert"))
  .withColumn("old_value", lit(null))
  .withColumn("date_time", date_format(current_timestamp(), "yyyyMMdd_HHmmssSSS"))
  .withColumn("date_time_column", concat(col("date_time"), lit("_"), col("column")))
  .select("id", "date_time_column", "column", "old_value", "new_value", "operation_type")
Update 2:
I was able to solve this problem; I am posting the full code as the answer below rather than repeating it here. Let me know how we can further optimize this.
I was able to solve this using the code below:
//Read the snapshot
val df1 = spark.read.option("header", "true").csv("file:///home/notroot/lab/data/snapshot.csv")
//Read the delta
val df2 = spark.read.option("header", "true").csv("file:///home/notroot/lab/data/delta.csv")
//Find new columns in the delta
val newColumnsInDf2 = df2.schema.fieldNames.diff(df1.schema.fieldNames)
//Add the new columns from the delta to the snapshot (as nulls)
val df1WithNewColumnsFromDf2 = newColumnsInDf2.foldLeft(df1)((df, currCol) => df.withColumn(currCol, lit(null)))
//Find new/inserted rows in the delta
val newInsertsDF = df2.as("df2Table").join(
  df1WithNewColumnsFromDf2.as("df1WithNewColumnsFromDf2Table"),
  $"df1WithNewColumnsFromDf2Table.id" === $"df2Table.id",
  "LEFT_ANTI")
//Convert the new/inserted rows into the desired format
val skipColumn = "id"
var columnCount = newInsertsDF.schema.size - 1
var columnsStr = ""
var counter = 0
for (col <- newInsertsDF.columns) {
  counter = counter + 1
  if (col != skipColumn) {
    if (counter == newInsertsDF.schema.size) {
      columnsStr = columnsStr + s"'$col', $col"
    } else {
      columnsStr = columnsStr + s"'$col', $col,"
    }
  }
}
val newInsertsUnpivotedDF = newInsertsDF
  .select($"id", expr(s"stack($columnCount, $columnsStr) as (column, new_value)"))
  .filter($"new_value".isNotNull)
  .withColumn("operation_type", lit("insert"))
  .withColumn("old_value", lit(null))
  .withColumn("date_time", date_format(current_timestamp(), "yyyyMMdd_HHmmssSSS"))
  .withColumn("date_time_column", concat(col("date_time"), lit("_"), col("column")))
  .select("id", "date_time_column", "column", "old_value", "new_value", "operation_type")
//Find updated rows in the delta
val updatesInDf1Unpivoted = df1WithNewColumnsFromDf2.except(df2)
  .select($"id", expr(s"stack($columnCount, $columnsStr) as (column_old, old_value)"))
  .withColumnRenamed("id", "id_old")
val updatesInDf2Unpivoted = df2.except(df1WithNewColumnsFromDf2).except(newInsertsDF)
  .select($"id", expr(s"stack($columnCount, $columnsStr) as (column, new_value)"))
val df1MinusDf2 = updatesInDf1Unpivoted.except(updatesInDf2Unpivoted)
val df2MinusDf1 = updatesInDf2Unpivoted.except(updatesInDf1Unpivoted)
val joinedUpdatesDF = df1MinusDf2.join(df2MinusDf1,
    df1MinusDf2("id_old") === df2MinusDf1("id") && df1MinusDf2("column_old") === df2MinusDf1("column"))
  .withColumn("date_time", date_format(current_timestamp(), "yyyyMMdd_HHmmssSSS"))
  .withColumn("date_time_column", concat(col("date_time"), lit("_"), col("column")))
  .withColumn("operation_type", lit("update"))
  .select("id", "date_time_column", "column", "old_value", "new_value", "operation_type")
//Final output DF after combining inserts and updates
val finalOutputDF = newInsertsUnpivotedDF.union(joinedUpdatesDF)
//Display the results
finalOutputDF.show(false)
snapshot.csv
id,name,pincode,landmark,city
1,Vijay1,411021,Zoo1,Pune1
2,Vijay2,411022,Zoo2,
3,Vijay3,411023,Zoo3,Pune3
4,Vijay4,,Zoo4,Pune4
5,Vijay5,411025,Zoo5,Pune5
delta.csv
id,name,pincode,landmark,city,petname
1,Vijay1,411021,Zoo1,Pune1,
2,Vijay2_New,,Zoo2,Pune2_New,
3,Vijay3,411023,Zoo3,Pune3,VJ3
4,Vijay4,411024_New,Zoo4,Pune4,
5,Vijay5,411025_New,Zoo5,Pune5_New,
6,Vijay6,411026,,Pune6,VJ6
Result as below:
+---+--------------------------+-------+---------+----------+--------------+
|id |date_time_column |column |old_value|new_value |operation_type|
+---+--------------------------+-------+---------+----------+--------------+
|6 |20220423_210923191_name |name |null |Vijay6 |insert |
|6 |20220423_210923191_pincode|pincode|null |411026 |insert |
|6 |20220423_210923191_city |city |null |Pune6 |insert |
|6 |20220423_210923191_petname|petname|null |VJ6 |insert |
|3 |20220423_210923191_petname|petname|null |VJ3 |update |
|4 |20220423_210923191_pincode|pincode|null |411024_New|update |
|2 |20220423_210923191_name |name |Vijay2 |Vijay2_New|update |
|2 |20220423_210923191_city |city |null |Pune2_New |update |
|5 |20220423_210923191_pincode|pincode|411025 |411025_New|update |
|5 |20220423_210923191_city |city |Pune5 |Pune5_New |update |
|2 |20220423_210923191_pincode|pincode|411022 |null |update |
+---+--------------------------+-------+---------+----------+--------------+

User-defined hash partitioning in an RDD with a key

Hello, I would like to make my own hash partitioning function with the key being the column "arrival delay".
The code that I have at the moment is:
# this is for the flights
from pyspark.sql import Row

def import_parse_rdd(data):
    # create rdd
    rdd = sc.textFile(data)
    # remove the header
    header = rdd.first()
    rdd = rdd.filter(lambda row: row != header)  # filter out header
    # split by comma
    split_rdd = rdd.map(lambda line: line.split(','))
    row_rdd = split_rdd.map(lambda line: Row(
        YEAR=int(line[0]), MONTH=int(line[1]), DAY=int(line[2]), DAY_OF_WEEK=int(line[3]),
        AIRLINE=line[4], FLIGHT_NUMBER=int(line[5]),
        TAIL_NUMBER=line[6], ORIGIN_AIRPORT=line[7], DESTINATION_AIRPORT=line[8],
        SCHEDULED_DEPARTURE=line[9], DEPARTURE_TIME=line[10],
        DEPARTURE_DELAY=0 if "".__eq__(line[11]) else float(line[11]),
        TAXI_OUT=0 if "".__eq__(line[12]) else float(line[12]),
        WHEELS_OFF=line[13], SCHEDULED_TIME=line[14],
        ELAPSED_TIME=0 if "".__eq__(line[15]) else float(line[15]),
        AIR_TIME=0 if "".__eq__(line[16]) else float(line[16]),
        DISTANCE=0 if "".__eq__(line[17]) else float(line[17]),
        WHEELS_ON=line[18],
        TAXI_IN=0 if "".__eq__(line[19]) else float(line[19]),
        SCHEDULED_ARRIVAL=line[20], ARRIVAL_TIME=line[21],
        ARRIVAL_DELAY=0 if "".__eq__(line[22]) else float(line[22]),
        DIVERTED=line[23], CANCELLED=line[24], CANCELLATION_REASON=line[25],
        AIR_SYSTEM_DELAY=line[26], SECURITY_DELAY=line[27], AIRLINE_DELAY=line[28],
        LATE_AIRCRAFT_DELAY=line[29], WEATHER_DELAY=line[30])
    )
    return row_rdd
If I run flight_rdd.take(1), the output is:
[Row(YEAR=2015, MONTH=6, DAY=26, DAY_OF_WEEK=5, AIRLINE='EV', FLIGHT_NUMBER=4951, TAIL_NUMBER='N707EV', ORIGIN_AIRPORT='BHM', DESTINATION_AIRPORT='LGA', SCHEDULED_DEPARTURE='630', DEPARTURE_TIME='629', DEPARTURE_DELAY=-1.0, TAXI_OUT=13.0, WHEELS_OFF='642', SCHEDULED_TIME='155', ELAPSED_TIME=141.0, AIR_TIME=113.0, DISTANCE=866.0, WHEELS_ON='935', TAXI_IN=15.0, SCHEDULED_ARRIVAL='1005', ARRIVAL_TIME='950', ARRIVAL_DELAY=-15.0, DIVERTED='0', CANCELLED='0', CANCELLATION_REASON='', AIR_SYSTEM_DELAY='', SECURITY_DELAY='', AIRLINE_DELAY='', LATE_AIRCRAFT_DELAY='', WEATHER_DELAY='')]
I would like to make a user-defined hash partitioning function with the key being the ARRIVAL_DELAY column.
If possible, I would also like the min and max values of the ARRIVAL_DELAY column to be used to determine how the keys are distributed across the partitions.
The furthest I have gotten is understanding that the call looks like
flight_rdd.partitionBy(number of partitions, partition function)
A sketch of how this might be applied follows the sample data below.
YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
2015,2,6,5,OO,6271,N937SW,FAR,DEN,1712,1701,-11,15,1716,123,117,95,627,1751,7,1815,1758,-17,0,0,,,,,,
2015,1,19,1,AA,1605,N496AA,DFW,ONT,1740,1744,4,15,1759,193,198,175,1188,1854,8,1853,1902,9,0,0,,,,,,
2015,3,8,7,NK,1068,N519NK,LAS,CLE,2220,2210,-10,12,2222,238,229,208,1824,450,9,518,459,-19,0,0,,,,,,
2015,9,21,1,AA,1094,N3EDAA,DFW,BOS,1155,1155,0,12,1207,223,206,190,1562,1617,4,1638,1621,-17,0,0,,,,,,
This is the unprocessed version of the data set.
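For reference, partitionBy in PySpark operates on a pair RDD of (key, value) tuples and accepts an optional partition function that maps a key to a partition index. Below is a minimal sketch, not a tested solution, of a custom partitioner keyed on ARRIVAL_DELAY that uses the column's min and max to spread the keys; it assumes flight_rdd is the RDD returned by import_parse_rdd above, and num_partitions is a value of your choosing.
# Sketch only: key the RDD by ARRIVAL_DELAY, then partition with a custom function.
num_partitions = 8  # illustrative value

# partitionBy needs a (key, value) RDD, so key each Row by its ARRIVAL_DELAY
keyed_rdd = flight_rdd.map(lambda row: (row.ARRIVAL_DELAY, row))

# use the global min/max delay to spread the keys across partitions
min_delay = keyed_rdd.keys().min()
max_delay = keyed_rdd.keys().max()
span = (max_delay - min_delay) or 1.0

def delay_partitioner(delay):
    # map a delay value to a partition index between 0 and num_partitions - 1
    bucket = int((delay - min_delay) / span * num_partitions)
    return min(bucket, num_partitions - 1)

partitioned_rdd = keyed_rdd.partitionBy(num_partitions, delay_partitioner)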

How to create update query with QSqlQuery

I'm trying to create an update query in Python3/PyQt5.10/Sqlite. A select/insert query made the same way runs fine. The fields and the corresponding record exist.
def updateRecords():
    theDict = {
        "Loc": "PyQt121",
        "BoekNr": "dfdf",
        "BoekTitel": "eeee",
        "BoekBedrag": 999
    }
    theFilter = " WHERE Loc = 'PyQt'"
    query = QSqlQuery()
    columns = ', '.join(pDict.keys())
    placeholders = ':' + ', :'.join(pDict.keys())
    sql = 'UPDATE %s SET (%s) VALUES (%s) %s' % (pTable, columns, placeholders, pFilter)
    query.prepare(sql)
    for key, value in pDict.items():
        query.bindValue(":" + key, value)
    print(sql)
    query.exec_()
    print(query.lastError().databaseText())
    return query.numRowsAffected()
The sql generated is UPDATE tempbooks SET (Loc, BoekNr, BoekTitel, BoekBedrag) VALUES (:Loc, :BoekNr, :BoekTitel, :BoekBedrag) WHERE Loc = 'PyQt'.
query.lastError().databaseText() gives me "No Query" and the number of updated rows is -1.
The correct syntax for an update query:
UPDATE tablename
set col1 = val1,
col2 = val2,
col3 = val3
WHERE condition
Probably query.prepare(sql) is returning False because of invalid syntax.
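A minimal sketch of how the statement could be assembled in the SET col1 = :col1, col2 = :col2, ... form, reusing the dictionary and filter from the question. The function name and the pTable/pDict/pFilter parameters are only illustrative, and an open QSqlDatabase connection is assumed.
from PyQt5.QtSql import QSqlQuery

def updateRecordsFixed(pTable, pDict, pFilter):
    # UPDATE tablename SET col1 = :col1, col2 = :col2, ... WHERE ...
    assignments = ', '.join('%s = :%s' % (key, key) for key in pDict)
    sql = 'UPDATE %s SET %s %s' % (pTable, assignments, pFilter)

    query = QSqlQuery()
    if not query.prepare(sql):   # prepare() returns False on invalid syntax
        print(query.lastError().text())
        return -1
    for key, value in pDict.items():
        query.bindValue(':' + key, value)
    if not query.exec_():
        print(query.lastError().databaseText())
        return -1
    return query.numRowsAffected()
Called as, for example, updateRecordsFixed('tempbooks', theDict, theFilter).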

How can I prevent sql injection with groovy?

I have an SQL query like:
String sql = """SELECT id, name, sex, age, bron_year, address, phone, state, comment, is_hbp, is_dm, is_cva, is_copd, is_chd, is_cancer, is_floating, is_poor, is_disability, is_mental
FROM statistics_stin WHERE 1=1
${p.team_num == 0 ? "" : "AND team_num = ${p.team_num}"}
${p.zone == 0 ? "" : "AND team_id = ${p.zone}"}
${p.is_hbp == 2 ? "" : "AND is_hbp = ${p.is_hbp}"}
${p.is_dm == 2 ? "" : "AND is_dm = ${p.is_dm}"}
${p.is_chd == 2 ? "" : "AND is_chd = ${p.is_chd}"}
${p.is_cva == 2 ? "" : "AND is_cva = ${p.is_cva}"}
${p.is_copd == 2 ? "" : "AND is_copd = ${p.is_copd}"}
${p.is_cancer == 2 ? "" : "AND is_cancer = ${p.is_cancer}"}
${p.is_floating == 2 ? "" : "AND is_floating = ${p.is_floating}"}
${p.is_poor == 2 ? "" : "AND is_poor = ${p.is_poor}"}
${p.is_disability == 2 ? "" : "AND is_disability = ${p.is_disability}"}
${p.is_mental == 2 ? "" : "AND is_mental = ${p.is_mental}"}
${p.is_aged == 2 ? "" : (p.is_aged == 1 ? " AND age >= 65" : " AND age < 65")}
${p.is_prep_aged == 2 ? "" : (p.is_prep_aged == 1 ? "AND (age BETWEEN 60 AND 64)" : "AND (age < 60 OR age > 64)")}
${p.is_young == 2 ? "" : (p.is_young == 1 ? " AND age < 60" : " AND age >= 60")}
ORDER BY team_id ASC, id ASC
LIMIT ${start}, ${page_size}
""";
Then I use:
def datasource = ctx.lookup("jdbc/mysql");
def executer = Sql.newInstance(datasource);
def rows = executer.rows(sql);
Here p is a JSON object like:
p = {is_aged=2, is_cancer=2, is_chd=1, is_copd=2, is_cva=2, is_disability=2, is_dm=2, is_floating=2, is_hbp=1, is_mental=2, is_poor=2, pn=1, team_num=0, zone=0}
This approach is open to SQL injection. I know I can use bind parameters like:
executer.rows('SELECT * from statistics_stin WHERE is_chd=:is_chd', [is_chd: 1]);
But this case has too many AND conditions, and whether each one is used is decided by the JSON object p.
How can I do this?
You have a problem of dynamic SQL binding, i.e. the number of bind parameters is not constant but depends on the input.
There is an elegant solution from Tom Kyte, which has an even more elegant implementation in Groovy.
The basic idea is simple: bind all variables. The variables that have an input value and should be used are processed normally, e.g.
col1 = :col1
The variables that have no input (and shall be ignored) are bound with a dummy construct:
(1=1 or :col2 is NULL)
i.e. they are effectively ignored thanks to short-circuit evaluation.
Here are two examples for three columns:
def p = ["col1" : 1, "col2" : 2, "col3" : 3]
This input leads to a full query
SELECT col1, col2, col3
FROM tab WHERE
col1 = :col1 AND
col2 = :col2 AND
col3 = :col3
ORDER by col1,col2,col3
For limited input
p = [ "col3" : 3]
you get this query
SELECT col1, col2, col3
FROM tab WHERE
(1=1 or :col1 is NULL) AND
(1=1 or :col2 is NULL) AND
col3 = :col3
ORDER by col1,col2,col3
Here is the Groovy creation of the SQL statement:
String sql = """SELECT col1, col2, col3
FROM tab WHERE
${(!p.col1) ? "(1=1 or :col1 is NULL)" : "col1 = :col1"} AND
${(!p.col2) ? "(1=1 or :col2 is NULL)" : "col2 = :col2"} AND
${(!p.col3) ? "(1=1 or :col3 is NULL)" : "col3 = :col3"}
ORDER by col1,col2,col3
"""
You can even get rid of the ugly 1=1 predicate;)
Another option is to build your bindings as you build the query, then execute the appropriate overload of rows:
def query = new StringBuilder( "SELECT id, name, sex, age, bron_year, address, phone, state, comment, is_hbp, is_dm, is_cva, is_copd, is_chd, is_cancer, is_floating, is_poor, is_disability, is_mental FROM statistics_stin WHERE 1=1" )
def binds = []
if ( p.team_num != 0 ) {
    query.append( ' AND team_num = ? ' )
    binds << p.team_num
}
if ( p.zone != 0 ) {
    query.append( ' AND team_id = ? ' )
    binds << p.zone
}
...
executer.rows(query.toString(), binds);

Spark: Process multiline input blob

I'm new to Hadoop/Spark and trying to process a multi-line input blob into a CSV or tab-delimited format for further processing.
Example Input
------------------------------------------------------------------------
AAA=someValueAAA1
BBB=someValueBBB1
CCC=someValueCCC1
DDD=someValueDDD1
EEE=someValueEEE1
FFF=someValueFFF1
ENDOFRECORD
------------------------------------------------------------------------
AAA=someValueAAA2
BBB=someValueBBB2
CCC=someValueCCC2
DDD=someValueDDD2
EEE=someValueEEE2
FFF=someValueFFF2
ENDOFRECORD
------------------------------------------------------------------------
AAA=someValueAAA3
BBB=someValueBBB3
CCC=someValueCCC3
DDD=someValueDDD3
EEE=someValueEEE3
FFF=someValueFFF3
GGG=someValueGGG3
HHH=someValueHHH3
ENDOFRECORD
------------------------------------------------------------------------
Needed output
someValueAAA1, someValueBBB1, someValueCCC1, someValueDDD1, someValueEEE1, someValueFFF1
someValueAAA2, someValueBBB2, someValueCCC2, someValueDDD2, someValueEEE2, someValueFFF2
someValueAAA3, someValueBBB3, someValueCCC3, someValueDDD3, someValueEEE3, someValueFFF3
Code I've tried so far:
//input RDD
val inputRDD = sc.textFile("/somePath/someFile.gz")
//transform
val singleRDD = inputRDD.map(x => x.split("ENDOFRECORD")).filter(x => x.trim.startsWith("AAA"))
val logData = singleRDD.map(x => {
  val rowData = x.split("\n")
  var AAA = ""
  var BBB = ""
  var CCC = ""
  var DDD = ""
  var EEE = ""
  var FFF = ""
  for (data <- rowData) {
    if (data.trim().startsWith("AAA")) {
      AAA = data.split("AAA=")(1)
    } else if (data.trim().startsWith("BBB")) {
      BBB = data.split("BBB=")(1)
    } else if (data.trim().startsWith("CCC=")) {
      CCC = data.split("CCC=")(1)
    } else if (data.trim().startsWith("DDD=")) {
      DDD = data.split("DDD=")(1)
    } else if (data.trim().startsWith("EEE=")) {
      EEE = data.split("EEE=")(1)
    } else if (data.trim().startsWith("FFF=")) {
      FFF = data.split("FFF=")(1)
    }
  }
  (AAA, BBB, CCC, DDD, EEE, FFF)
})
logData.take(10).foreach(println)
This does not seem to work, and I get output such as:
AAA,,,,,,
,BBB,,,,,
,,CCC,,,,
,,,DDD,,,
I can't seem to figure out what's wrong here. Do I have to write a custom input format to solve this?
To process the data as per your requirement:
Load the dataset with wholeTextFiles; this gives you the data as key, value pairs of (file name, file content). Note that sc.textFile reads the input line by line, so no single element ever contains a whole record to split on ENDOFRECORD, which is why the original approach fails.
flatMap over the file contents, splitting on ENDOFRECORD, to obtain individual blocks of text. For example:
AAA=someValueAAA1
BBB=someValueBBB1
CCC=someValueCCC1
DDD=someValueDDD1
EEE=someValueEEE1
FFF=someValueFFF1
ENDOFRECORD
Split each block into individual lines using \n.
Try the code below:
// load your data set
val data = sc.wholeTextFiles("file:///path/to/file")
val data1 = data.flatMap(x => x._2.split("ENDOFRECORD"))
val logData = data1.map(x => {
  val rowData = x.split("\n")
  var AAA = ""
  var BBB = ""
  var CCC = ""
  var DDD = ""
  var EEE = ""
  var FFF = ""
  for (data <- rowData) {
    if (data.trim().contains("AAA")) {
      AAA = data.split("AAA=")(1)
    } else if (data.trim().contains("BBB")) {
      BBB = data.split("BBB=")(1)
    } else if (data.trim().contains("CCC=")) {
      CCC = data.split("CCC=")(1)
    } else if (data.trim().contains("DDD=")) {
      DDD = data.split("DDD=")(1)
    } else if (data.trim().contains("EEE=")) {
      EEE = data.split("EEE=")(1)
    } else if (data.trim().contains("FFF=")) {
      FFF = data.split("FFF=")(1)
    }
  }
  (AAA, BBB, CCC, DDD, EEE, FFF)
})
logData.foreach(println)
OUTPUT:
(someValueAAA1,someValueBBB1,someValueCCC1,someValueDDD1,someValueEEE1,someValueFFF1)
(someValueAAA2,someValueBBB2,someValueCCC2,someValueDDD2,someValueEEE2,someValueFFF2)
(someValueAAA3,someValueBBB3,someValueCCC3,someValueDDD3,someValueEEE3,someValueFFF3)
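If a comma- or tab-delimited file is the end goal mentioned in the question, the resulting tuples can be flattened into delimited lines before writing them out. A small sketch; the output path is just a placeholder:
// turn each tuple into a comma-separated line and write it out
logData.map(_.productIterator.mkString(", "))
       .saveAsTextFile("file:///path/to/output")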
