Apache Spark - Performance with and without using Case Classes

I have two datasets, customers and orders, and I want to join them on the customer key.
I tried two approaches, one using case classes and one without.
Using case classes -- takes forever to complete, almost 11 minutes:
case class Customer(custKey: Int, name: String, address: String, phone: String, acctBal: String, mktSegment: String, comment: String) extends Serializable
case class Order(orderKey: Int, custKey: Int, orderStatus: String, totalPrice: Double, orderDate: String, orderQty: String, clerk: String, shipPriority: String, comment: String) extends Serializable
val customers = sc.textFile("customersFile").map(row => row.split('|')).map(cust => (cust(0).toInt, Customer(cust(0).toInt, cust(1), cust(2), cust(3), cust(4), cust(5), cust(6))))
val orders = sc.textFile("ordersFile").map(row => row.split('|')).map(order => (order(1).toInt, Order(order(0).toInt, order(1).toInt, order(2), order(3).toDouble, order(4), order(5), order(6), order(7), order(8))))
orders.join(customers).take(1)
Without case classes -- completes in a few seconds:
val customers = sc.textFile("customersFile").map(row => row.split('|'))
val orders = sc.textFile("ordersFile").map(row => row.split('|'))
val customersByCustKey = customers.map(row => (row(0), row)) // customer key is the first column in customers rdd, hence row(0)
val ordersByCustKey = orders.map(row => (row(1), row)) // customer key is the second column in orders rdd, hence row(1)
ordersByCustKey.join(customersByCustKey).take(1)
I want to know whether this is due to the time taken for serialization/deserialization when using case classes.
If yes, in which cases is it recommended to use case classes?
Job details using case classes: (Spark UI screenshot not reproduced here)
Job details without case classes: (Spark UI screenshot not reproduced here)
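One way to test the serialization hypothesis is to switch to Kryo and register the two case classes, then rerun the join. A minimal sketch, assuming the Customer and Order case classes defined above; in the spark-shell the equivalent settings can be passed with --conf instead of building a SparkContext by hand:
// Sketch only: checks whether serialization of the case class instances is the bottleneck.
// In the spark-shell, pass --conf spark.serializer=... and --conf spark.kryo.classesToRegister=... instead.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("case-class-join")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Customer], classOf[Order]))
val sc = new SparkContext(conf)
// build the customers and orders pair RDDs exactly as above, then:
// orders.join(customers).take(1)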

Related

How to improve a DataFrame UDF which is connecting to HBase for every row

I have a DataFrame where I need to create a column based on values from each row.
I iterate using a UDF which processes each row and connects to HBase to get data.
The UDF creates a connection, returns the data, and closes the connection.
The process is slow because ZooKeeper hangs after a few reads. I want to pull the data with only one open connection.
I tried mapPartitions, but the connection is not passed along because it's not serializable.
UDF:
val lookUpUDF = udf((partyID: Int, brand: String, algorithm: String, bigPartyProductMappingTableName: String, env: String) =>
  lookUpLogic.lkpBigPartyAccount(partyID, brand, algorithm, bigPartyProductMappingTableName, env))
How the DataFrame iterates:
ocisPreferencesDF
  .withColumn("deleteStatus", lookUpUDF(
    col(StagingBatchConstants.OcisPreferencesPartyId),
    col(StagingBatchConstants.OcisPreferencesBrand),
    lit(EnvironmentConstants.digest_algorithm),
    lit(bigPartyProductMappingTableName),
    lit(env)))
Main logic:
def lkpBigPartyAccount(partyID: Int,
                       brand: String,
                       algorithm: String,
                       bigPartyProductMappingTableName: String,
                       envVar: String,
                       hbaseInteraction: HbaseInteraction = new HbaseInteraction,
                       digestGenerator: DigestGenerator = new DigestGenerator): Array[(String, String)] = {
  AppInit.setEnvVar(envVar)
  val message = partyID.toString + "-" + brand
  val rowKey = Base64.getEncoder.encodeToString(message.getBytes())
  // reads from HBase on every invocation (once per row when called from the UDF)
  val hbaseAccountInfo = hbaseInteraction.hbaseReader(bigPartyProductMappingTableName, rowKey, "cf").asScala
  val convertMap: mutable.HashMap[String, String] = new mutable.HashMap[String, String]
  for ((key, value) <- hbaseAccountInfo) {
    convertMap.put(key.toString, value.toString)
  }
  convertMap.toArray
}
I expect to improve the code's performance. What I'm hoping to do is create a connection only once.
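A hedged sketch of the usual pattern for this: skip the UDF and use mapPartitions on the DataFrame's RDD, opening one HBase connection per partition on the executor so that nothing non-serializable has to be shipped from the driver. The column names, the table name, and the use of the plain HBase client below are assumptions, not the asker's HbaseInteraction API:
// Sketch only: adapt the column names and table name to StagingBatchConstants / your config.
import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes
import java.util.Base64

val lookedUp = ocisPreferencesDF
  .select("partyId", "brand")                          // placeholder column names
  .rdd
  .mapPartitions { rows =>
    // one connection per partition, created on the executor and never serialized
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("bigPartyProductMappingTable"))
    val out = rows.map { row =>
      val message = row.getInt(0).toString + "-" + row.getString(1)
      val rowKey = Base64.getEncoder.encodeToString(message.getBytes())
      val result = table.get(new Get(Bytes.toBytes(rowKey)))
      // convert the HBase Result into plain (qualifier -> value) pairs, as hbaseReader did
      val cells = result.rawCells().map(c =>
        Bytes.toString(CellUtil.cloneQualifier(c)) -> Bytes.toString(CellUtil.cloneValue(c))).toMap
      (rowKey, cells)
    }.toList                                           // materialize before closing the connection
    table.close()
    connection.close()
    out.iterator
  }
The result is an RDD of lookups keyed by row key, which can then be joined back to the original DataFrame instead of being produced column-by-column through a UDF.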

Spark load to Hive

I am thinking of this logic:
// find the number of rows
val cnt = sc.textFile("/home/user/cust_acc.txt").map(line => line.split("|")).count.toInt
for (i <- 1 until cnt)
  // count the array elements in row i
  val v_arr1 = sc.textFile("/home/user/cust_acc.txt").map(line => line.split("|")).take(i).count
  // case when v_arr1 = 7 then execute:
  case class Person(name: String, age: String, age1: String, age2: String, age3: String, age4: String, age5: String)
  val splitrdd = sc.textFile("/home/user/cust_acc.txt").map(line => line.split("|")).map(p => Person(p(0), p(1), p(2), p(3), p(4), p(5), p(6))).toDF()
  registerTempTable("df")
  // process through sqlContext.sql("processing"), then write/append to HDFS
  // case when v_arr1 = 6 then execute:
  case class Person(name: String, age: String, age1: String, age2: String, age3: String, age4: String)
  val splitrdd = sc.textFile("/home/user/cust_acc.txt").map(line => line.split("|")).map(p => Person(p(0), p(1), p(2), p(3), p(4), p(5))).toDF()
  registerTempTable("df")
  // process through sqlContext.sql("processing"), then write/append to HDFS
  ...
Somehow the code I am working on is not right. Can anyone guide me here?
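A hedged sketch of one way this logic could be expressed in a single pass. Note that String.split takes a regex, so the pipe must be escaped as "\\|" (an unescaped "|" splits between every character). The Person7/Person6 names and the final write step are illustrative assumptions, not the asker's code:
// Sketch only: read the file once, split on an escaped pipe,
// and branch on the number of fields per line instead of re-reading the file per row.
import sqlContext.implicits._

case class Person7(name: String, age: String, age1: String, age2: String, age3: String, age4: String, age5: String)
case class Person6(name: String, age: String, age1: String, age2: String, age3: String, age4: String)

val rows = sc.textFile("/home/user/cust_acc.txt").map(_.split("\\|")).cache()

val df7 = rows.filter(_.length == 7).map(p => Person7(p(0), p(1), p(2), p(3), p(4), p(5), p(6))).toDF()
val df6 = rows.filter(_.length == 6).map(p => Person6(p(0), p(1), p(2), p(3), p(4), p(5))).toDF()

df7.registerTempTable("df7") // then sqlContext.sql("...") and append the result to HDFS/Hive
df6.registerTempTable("df6")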

Filter a list of case class objects based on a list of strings

I have a case class User(id: String, name: String, address: String, password: String) and another case class Account(userId: String, accountId: String, roles: Set[String]). I need to filter a list of Account objects (List[Account]) based on a list of userIds, which I have as a List[String] in Scala. I have been struggling with this and tried a few things but couldn't get it to work. Any pointers on how I should do this would be really helpful.
Thanks!
I'm not sure I understand your question correctly, but if you're only trying to keep the Accounts whose userId is part of a separate collection that you have, you can do it like this:
val accounts: List[Account] = ???
val idsToKeep: Set[String] = ???
accounts.filter(a => idsToKeep.contains(a.userId))
For the record, if you use the contains method a lot, you are better off using a Set[String] than a List[String] to store the ids to keep.
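For illustration, a small self-contained example with made-up data:
// Hypothetical sample data, just to show the filter in action.
case class Account(userId: String, accountId: String, roles: Set[String])

val accounts = List(
  Account("u1", "a1", Set("admin")),
  Account("u2", "a2", Set("viewer")),
  Account("u3", "a3", Set("editor")))

val userIds = List("u1", "u3")
val idsToKeep = userIds.toSet // Set membership checks are O(1), vs O(n) for List
val kept = accounts.filter(a => idsToKeep.contains(a.userId))
// kept == List(Account("u1", "a1", Set("admin")), Account("u3", "a3", Set("editor")))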

Convert a Dataset with a single column to a multiple-column Dataset in Scala

I have a dataset which is a Dataset[String], and it has this data:
12348,5,233,234559,4
12348,5,233,234559,4
12349,6,233,234560,5
12350,7,233,234561,6
I want to split each single-column row and convert it into multiple columns: RegionId, PerilId, Date, EventId, ModelId. How do I achieve this?
You mean something like this:
case class NewSet(RegionId: String, PerilId: String, Date: String, EventId: String, ModelId: String)
val newDataset = oldDataset.map { s: String =>
  val strings = s.split(",")
  NewSet(strings(0), strings(1), strings(2), strings(3), strings(4))
}
Of course you should probably make the lambda function a little more robust...
If you have the data you specified in an RDD, then converting it to a dataframe is pretty easy:
case class MyClass(RegionId: String, PerilId: String, Date: String,
  EventId: String, ModelId: String)
// rdd must be an RDD[MyClass]; with import sqlContext.implicits._ you could equally call rdd.toDF()
val dataframe = sqlContext.createDataFrame(rdd)
This dataframe will have all the columns, with the column names corresponding to the fields of the case class MyClass.
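For completeness, a hedged end-to-end sketch assuming Spark 2.x (a SparkSession named spark) and an illustrative file path; with the older sqlContext API the equivalent goes through import sqlContext.implicits._:
// Sketch only: read the lines as a Dataset[String], split each one, and map into the case class.
import spark.implicits._

case class NewSet(RegionId: String, PerilId: String, Date: String, EventId: String, ModelId: String)

val lines = spark.read.textFile("/path/to/events.csv") // Dataset[String]
val newDataset = lines.map { s =>
  val strings = s.split(",")
  NewSet(strings(0), strings(1), strings(2), strings(3), strings(4))
}
newDataset.show()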

Composing a single insert statement in Slick 3

This is the case class representing the entire row:
case class CustomerRow(id: Long, name: String, 20 other fields ...)
I have a shape case class that only 'exposes' a subset of columns; it is used when a user creates/updates a customer:
case class CustomerForm(name: String, subset of all fields ...)
I can use CustomerForm for updates. However, I can't use it for inserts. There are some columns not in CustomerForm that are required (not null) and can only be provided by the server. What I do now is create a CustomerRow from the CustomerForm:
def form2row(form: CustomerForm, id: Long, serverOnlyValue: Long, etc...) = CustomerRow(
  id = id,
  serverOnlyColumn = serverOnlyValue,
  name = form.name,
  // and so on for 20 more tedious lines of code
)
and use it for insert.
Is there a way to compose the insert in Slick so I can remove that tedious form2row function?
Something like:
(customers.map(formShape) += form) andAlsoOnTheSameRow .map(c => (c.id, c.serverOnlyColumn)) += (id, someValue)
?
Yes, you can do it like this:
case class Person(name: String, email: String, address: String, id: Option[Int] = None)
case class NameAndAddress(name: String, address: String)

class PersonTable(tag: Tag) extends Table[Person](tag, "person") {
  val id = column[Int]("id", O.PrimaryKey, O.AutoInc)
  val name = column[String]("name")
  val email = column[String]("email")
  val address = column[String]("address")

  // projection for partial inserts
  def nameWithAddress = (name, address) <> (NameAndAddress.tupled, NameAndAddress.unapply)

  def * = (name, email, address, id.?) <> (Person.tupled, Person.unapply)
}

val personTableQuery = TableQuery[PersonTable]

// insert only the partial fields
personTableQuery.map(_.nameWithAddress) += NameAndAddress("abc", "xyz")
Make sure you are aware of nullable fields: they should be of type Option[T], where T is the field type. In my example, email should be Option[String] instead of String.
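For context, a hedged usage sketch of the partial insert above; the H2 profile import and the "mydb" config name are assumptions:
// Usage sketch only.
import slick.jdbc.H2Profile.api._
import scala.concurrent.Future

val db = Database.forConfig("mydb")

// Runs the partial insert; columns not covered by nameWithAddress (email, id) are left
// to their database defaults / NULL, which is why nullable columns should be Option[T].
val inserted: Future[Int] =
  db.run(personTableQuery.map(_.nameWithAddress) += NameAndAddress("abc", "xyz"))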
