Spark load to Hive - apache-spark

I am thinking of this logic:
// find the number of rows
val cnt = sc.textFile("/home/user/cust_acc.txt").map(line => line.split('|')).count.toInt
for (i <- 1 until cnt)
  // count the array elements (fields) of row i
  val v_arr1 = sc.textFile("/home/user/cust_acc.txt").map(line => line.split('|')).take(i).count
  case when v_arr1 = 7 then execute:
    case class Person(name: String, age: String, age1: String, age2: String, age3: String, age4: String, age5: String)
    val splitrdd = sc.textFile("/home/user/cust_acc.txt").map(line => line.split('|')).map(p => Person(p(0), p(1), p(2), p(3), p(4), p(5), p(6))).toDF()
    registerTempTable("df")
    process through sqlContext.sql("processing")
    then write (append) to HDFS
  case when v_arr1 = 6 then execute:
    case class Person(name: String, age: String, age1: String, age2: String, age3: String, age4: String)
    val splitrdd = sc.textFile("/home/user/cust_acc.txt").map(line => line.split('|')).map(p => Person(p(0), p(1), p(2), p(3), p(4), p(5))).toDF()
    registerTempTable("df")
    process through sqlContext.sql("processing")
    then write (append) to HDFS
  ......
  .....
Somehow the code I am working on does not work. Can anyone guide me here?
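One detail worth checking in logic like the above: String.split takes a regular expression, so a bare "|" pattern is read as an empty alternation rather than a literal pipe, while a Char argument or an escaped pattern splits on the pipe itself. The per-row loop can also be replaced by grouping rows by their field count in one pass. The following is a minimal sketch on plain Scala collections with made-up sample lines (sample, fields and byFieldCount are names assumed for illustration); RDDs offer map, filter and groupBy with the same shape.

```scala
// Hypothetical sample rows; in the question these would come from sc.textFile(...)
val sample = Seq(
  "a|b|c|d|e|f|g", // 7 fields
  "h|i|j|k|l|m",   // 6 fields
  "n|o|p|q|r|s|t"  // 7 fields
)

// An escaped pattern (or the Char overload, split('|')) treats the pipe as a literal.
// The -1 limit keeps trailing empty fields instead of dropping them.
def fields(line: String): Array[String] = line.split("\\|", -1)

// Group rows by how many fields they have, in a single pass.
val byFieldCount: Map[Int, Seq[Array[String]]] =
  sample.map(fields).groupBy(_.length)

// Each group could then be mapped to its own case class (7-field rows, 6-field rows, ...).
byFieldCount.toSeq.sortBy(_._1).foreach { case (n, rows) =>
  println(s"$n fields: ${rows.size} row(s)")
}
```

With the rows grouped this way, each field count is handled once instead of re-reading the file for every row.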

Related

Apache Spark - Performance with and without using Case Classes

I have two datasets, customers and orders, and I want to join them on the customer key.
I tried two approaches, one using case classes and one without.
Using case classes (takes forever to complete, almost 11 minutes):
case class Customer(custKey: Int, name: String, address: String, phone: String, acctBal: String, mktSegment: String, comment: String) extends Serializable
case class Order(orderKey: Int, custKey: Int, orderStatus: String, totalPrice: Double, orderDate: String, orderQty: String, clerk: String, shipPriority: String, comment: String) extends Serializable
val customers = sc.textFile("customersFile").map(row => row.split('|')).map(cust => (cust(0).toInt, Customer(cust(0).toInt, cust(1), cust(2), cust(3), cust(4), cust(5), cust(6))))
val orders = sc.textFile("ordersFile").map(row => row.split('|')).map(order => (order(1).toInt, Order(order(0).toInt, order(1).toInt, order(2), order(3).toDouble, order(4), order(5), order(6), order(7), order(8))))
orders.join(customers).take(1)
Without case classes (completes in a few seconds):
val customers = sc.textFile("customersFile").map(row => row.split('|'))
val orders = sc.textFile("ordersFile").map(row => row.split('|'))
val customersByCustKey = customers.map(row => (row(0), row)) // customer key is the first column in customers rdd, hence row(0)
val ordersByCustKey = orders.map(row => (row(1), row)) // customer key is the second column in orders rdd, hence row(1)
ordersByCustKey.join(customersByCustKey).take(1)
I want to know whether this is due to the time taken for serialization/deserialization when using case classes.
If so, in which cases is it recommended to use case classes?
Job details using case classes:
Job details without case classes:

How to improve DataFrame UDF Which is Connecting to Hbase For every Row

I have a DataFrame in which I need to create a column based on values from each row.
I iterate using a UDF that processes each row and connects to HBase to get the data.
The UDF creates a connection, returns the data, and closes the connection.
The process is slow because ZooKeeper hangs after a few reads. I want to pull the data with only one open connection.
I tried mapPartitions, but the connection cannot be passed in because it is not serializable.
UDF:
val lookUpUDF = udf((partyID: Int, brand: String, algorithm: String, bigPartyProductMappingTableName: String, env: String) => lookUpLogic.lkpBigPartyAccount(partyID, brand, algorithm, bigPartyProductMappingTableName, env))
How the DataFrame iterates:
ocisPreferencesDF
  .withColumn("deleteStatus", lookUpUDF(
    col(StagingBatchConstants.OcisPreferencesPartyId),
    col(StagingBatchConstants.OcisPreferencesBrand),
    lit(EnvironmentConstants.digest_algorithm),
    lit(bigPartyProductMappingTableName),
    lit(env)))
Main logic:
def lkpBigPartyAccount(partyID: Int,
                       brand: String,
                       algorithm: String,
                       bigPartyProductMappingTableName: String,
                       envVar: String,
                       hbaseInteraction: HbaseInteraction = new HbaseInteraction,
                       digestGenerator: DigestGenerator = new DigestGenerator): Array[(String, String)] = {
  AppInit.setEnvVar(envVar)
  val message = partyID.toString + "-" + brand
  val rowKey = Base64.getEncoder.encodeToString(message.getBytes())
  val hbaseAccountInfo = hbaseInteraction.hbaseReader(bigPartyProductMappingTableName, rowKey, "cf").asScala
  val convertMap: mutable.HashMap[String, String] = new mutable.HashMap[String, String]
  for ((key, value) <- hbaseAccountInfo) {
    convertMap.put(key.toString, value.toString)
  }
  convertMap.toArray
}
I expect to improve the code performance. What I'm hoping is to create a connection only once.
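A common way to get close to "one connection" is to open it once per partition, inside the function that processes the partition, so that nothing non-serializable has to be shipped from the driver. The sketch below uses plain Scala iterators to show the shape only; FakeConnection, ConnStats and processPartition are hypothetical stand-ins (not the HBase API), and in Spark a function shaped like processPartition would be passed to rdd.mapPartitions.

```scala
// Hypothetical counter so the sketch can show how many connections were opened.
object ConnStats { var opened = 0 }

// Hypothetical stand-in for an expensive client such as an HBase connection.
class FakeConnection {
  ConnStats.opened += 1
  def lookup(rowKey: String): String = s"value-for-$rowKey"
  def close(): Unit = ()
}

// Shape of a function that could be passed to rdd.mapPartitions:
// one connection per partition, reused for every row in that partition.
def processPartition(rows: Iterator[String]): Iterator[(String, String)] = {
  val conn = new FakeConnection
  try {
    // Materialize before closing: iterators are lazy, so returning
    // rows.map(...) directly would run the lookups after close().
    rows.map(key => key -> conn.lookup(key)).toList.iterator
  } finally conn.close()
}

// Two simulated partitions with three and two rows.
val partitions = Seq(Iterator("a", "b", "c"), Iterator("d", "e"))
val results = partitions.flatMap(p => processPartition(p).toList)

println(results)
println(s"connections opened: ${ConnStats.opened}")
```

The trade-off in this sketch is that each partition's results are held in memory before the connection is closed; for very large partitions a different cleanup strategy may be needed.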

spark spelling correction via udf

I need to correct some spellings using Spark.
Unfortunately a naive approach like
val misspellings3 = misspellings1
.withColumn("A", when('A === "error1", "replacement1").otherwise('A))
.withColumn("A", when('A === "error1", "replacement1").otherwise('A))
.withColumn("B", when(('B === "conditionC") and ('D === condition3), "replacementC").otherwise('B))
does not work with Spark; see "How to add new columns based on conditions (without facing JaninoRuntimeException or OutOfMemoryError)?"
The simple cases (the first 2 examples) can nicely be handled via
val spellingMistakes = Map(
"error1" -> "fix1"
)
val spellingNameCorrection: (String => String) = (t: String) => {
  spellingMistakes.get(t) match {
    case Some(tt) => tt // correct spelling
    case None => t // keep original
  }
}
val spellingUDF = udf(spellingNameCorrection)
val misspellings1 = hiddenSeasonalities
.withColumn("A", spellingUDF('A))
But I am unsure how to handle the more complex, chained conditional replacements in a UDF in a clean and generalizable manner.
If it is only a rather small list of spellings (fewer than 50), would you suggest hard-coding them within a UDF?
You can make the UDF receive more than one column:
val spellingCorrection2= udf((x: String, y: String) => if (x=="conditionC" && y=="conditionD") "replacementC" else x)
val misspellings3 = misspellings1.withColumn("B", spellingCorrection2($"B", $"C"))
To generalize this, you can use a map from a tuple of the two conditions to the replacement string, just as you did for the first case.
To generalize even further, you can use Dataset mapping: create a case class with the relevant columns, convert the DataFrame to a Dataset of that case class with as[...], apply the corrections in a map using pattern matching on the input, and convert back to a DataFrame.
This should be easier to write but would have a performance cost.
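The tuple-keyed map mentioned above could look roughly like this. It is a sketch in plain Scala: pairFixes and correctB are hypothetical names, and the entries reuse the placeholder values from the example. Wrapping the function as udf(correctB _) would give a two-column UDF along the lines of spellingCorrection2.

```scala
// Keys are (value of B, value of the condition column); values are the replacement for B.
val pairFixes: Map[(String, String), String] = Map(
  ("conditionC", "conditionD") -> "replacementC"
)

// Look up the pair; keep the original value of B when no rule matches.
def correctB(b: String, other: String): String =
  pairFixes.getOrElse((b, other), b)

println(correctB("conditionC", "conditionD"))    // replacementC
println(correctB("conditionC", "somethingElse")) // conditionC
```

Adding a new conditional correction then means adding a map entry rather than another branch in the UDF.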
For now I will go with the following which seems to work just fine and is more understandable: https://gist.github.com/rchukh/84ac39310b384abedb89c299b24b9306
Suppose spellingMap is the map containing the correct spellings and df is the DataFrame:
val df: DataFrame = ??? // your input DataFrame
val spellingMap = Map.empty[String, String] //fill it up yourself
val columnsWithSpellingMistakes = List("abc", "def")
Write a UDF like this:
def spellingCorrectionUDF(spellingMap: Map[String, String]) =
  udf[String, String]((value: String) =>
    // replace known misspellings, keep every other value as is
    spellingMap.getOrElse(value, value))
And finally, you can apply it like this (col comes from org.apache.spark.sql.functions):
val newColumns = df.columns.map { columnName =>
  if (columnsWithSpellingMistakes.contains(columnName)) spellingCorrectionUDF(spellingMap)(col(columnName)).as(columnName)
  else col(columnName)
}
df.select(newColumns: _*)

composing single insert statement in slick 3

This is the case class representing the entire row:
case class CustomerRow(id: Long, name: String, 20 other fields ...)
I have a shape case class that only 'exposes' a subset of columns and it is used when user creates/updates a customer:
case class CustomerForm(name: String, subset of all fields ...)
I can use CustomerForm for updates. However I can't use it for inserts. There are some columns not in CustomerForm that are required (not null) and can only be provided by the server. What I do now is that I create CustomerRow from CustomerForm:
def form2row(form: CustomerForm, id: Long, serverOnlyValue: Long, etc...) = CustomerRow(
  id = id,
  serverOnlyColumn = serverOnlyValue,
  name = form.name,
  // and so on for 20 more tedious lines of code
)
and use it for insert.
Is there a way to compose insert in slick so I can remove that tedious form2row function?
Something like:
(customers.map(formShape) += form) andAlsoOnTheSameRow .map(c => (c.id, c.serverOnlyColumn)) += (id, someValue)
?
Yes, you can do it like this:
case class Person(name: String, email: String, address: String, id: Option[Int] = None)
case class NameAndAddress(name: String,address: String)
class PersonTable(tag: Tag) extends Table[Person](tag, "person") {
  val id = column[Int]("id", O.PrimaryKey, O.AutoInc)
  val name = column[String]("name")
  val email = column[String]("email")
  val address = column[String]("address")
  // for partial insert
  def nameWithAddress = (name, address) <> (NameAndAddress.tupled, NameAndAddress.unapply)
  def * = (name, email, address, id.?) <> (Person.tupled, Person.unapply)
}
val personTableQuery = TableQuery[PersonTable]
// insert partial fields
personTableQuery.map(_.nameWithAddress) += NameAndAddress("abc", "xyz")
Make sure you account for nullable fields: they should be of the form Option[T], where T is the field type. In my example, email should be Option[String] instead of String.

Spark - How to handle error case in RDD.map() method correctly?

I am trying to do some text processing using Spark RDD.
The format of the input file is:
2015-05-20T18:30 <some_url>/?<key1>=<value1>&<key2>=<value2>&...&<keyn>=<valuen>
I want to extract some fields from the text and convert them into CSV format like:
<value1>,<value5>,<valuek>,<valuen>
The following code is how I do this:
val lines = sc.textFile(s"s3n://${MY_BUCKET}/${MY_FOLDER}/test/*.gz")
val records = lines.map { line =>
  val mp = line.split("&")
    .map(_.split("="))
    .filter(_.length >= 2)
    .map(t => (t(0), t(1))).toMap
  (mp.get("key1"), mp.get("key5"), mp.get("keyk"), mp.get("keyn"))
}
I would like to know what happens if some line of the input is malformed or invalid, so that the map() function cannot return a valid value. This should be very common in text processing; what is the best practice for dealing with it?
To manage these errors, you can use Scala's Try class within a flatMap operation. In code:
import scala.util.{Failure, Success, Try}

val lines = sc.textFile(s"s3n://${MY_BUCKET}/${MY_FOLDER}/test/*.gz")
val records = lines.flatMap(line =>
  Try {
    val mp = line.split("&")
      .map(_.split("="))
      .filter(_.length >= 2)
      .map(t => (t(0), t(1))).toMap
    (mp.get("key1"), mp.get("key5"), mp.get("keyk"), mp.get("keyn"))
  } match {
    case Success(map) => Seq(map) // keep the parsed record
    case _ => Seq() // drop lines that failed to parse
  })
With this you keep only the good records. If you want both the errors and the good records, I would recommend using a map function that returns a Scala Either and then a Spark filter. In code:
val lines = sc.textFile(s"s3n://${MY_BUCKET}/${MY_FOLDER}/test/*.gz")
val goodBadRecords = lines.map(line =>
  Try {
    val mp = line.split("&")
      .map(_.split("="))
      .filter(_.length >= 2)
      .map(t => (t(0), t(1))).toMap
    (mp.get("key1"), mp.get("key5"), mp.get("keyk"), mp.get("keyn"))
  } match {
    case Success(map) => Right(map)
    case Failure(e) => Left(e)
  })
val records = goodBadRecords.filter(_.isRight)
val errors = goodBadRecords.filter(_.isLeft)
I hope this will be useful
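The same Try/Either pattern can be tried out without a cluster. The sketch below runs on a plain Scala collection with made-up input lines; parse and the sample data are assumptions for illustration, and parse deliberately throws on a missing key so that there is a failure to capture.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical parser: extracts key1 from "k=v&k=v" text and throws if it is absent.
def parse(line: String): String = {
  val mp = line.split("&")
    .map(_.split("="))
    .filter(_.length >= 2)
    .map(t => (t(0), t(1)))
    .toMap
  mp("key1") // throws NoSuchElementException when key1 is missing
}

val lines = Seq("key1=a&key2=b", "garbage", "key1=c")

// Wrap each call in Try and convert to Either so errors travel with the data.
val goodBad: Seq[Either[Throwable, String]] = lines.map { line =>
  Try(parse(line)) match {
    case Success(v) => Right(v)
    case Failure(e) => Left(e)
  }
}

// Unwrap each side; collect both filters and extracts the payload.
val records = goodBad.collect { case Right(v) => v }
val errors  = goodBad.collect { case Left(e) => e }

println(records) // List(a, c)
println(s"${errors.size} bad line(s)")
```

Using collect instead of filter(_.isRight) yields the unwrapped values directly, which avoids a second step to extract them from the Either.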

Resources