I have a DataFrame where I need to create a column based on values from each row.
I iterate using a UDF that processes each row and connects to HBase to fetch data.
The UDF creates a connection, returns the data, and closes the connection.
The process is slow because ZooKeeper hangs after a few reads. I want to pull the data with only one open connection.
I tried mapPartitions, but the connection is not passed along because it isn't serializable.
UDF:
val lookUpUDF = udf((partyID: Int, brand: String, algorithm: String,
                     bigPartyProductMappingTableName: String, env: String) =>
  lookUpLogic.lkpBigPartyAccount(partyID, brand, algorithm,
    bigPartyProductMappingTableName, env))
How the DataFrame iterates:
ocisPreferencesDF
  .withColumn("deleteStatus", lookUpUDF(
    col(StagingBatchConstants.OcisPreferencesPartyId),
    col(StagingBatchConstants.OcisPreferencesBrand),
    lit(EnvironmentConstants.digest_algorithm),
    lit(bigPartyProductMappingTableName),
    lit(env)))
Main logic:
def lkpBigPartyAccount(partyID: Int,
                       brand: String,
                       algorithm: String,
                       bigPartyProductMappingTableName: String,
                       envVar: String,
                       hbaseInteraction: HbaseInteraction = new HbaseInteraction,
                       digestGenerator: DigestGenerator = new DigestGenerator): Array[(String, String)] = {
  AppInit.setEnvVar(envVar)
  val message = partyID.toString + "-" + brand
  val rowKey = Base64.getEncoder.encodeToString(message.getBytes())
  val hbaseAccountInfo =
    hbaseInteraction.hbaseReader(bigPartyProductMappingTableName, rowKey, "cf").asScala
  val convertMap: mutable.HashMap[String, String] = new mutable.HashMap[String, String]
  for ((key, value) <- hbaseAccountInfo) {
    convertMap.put(key.toString, value.toString)
  }
  convertMap.toArray
}
I expect to improve the code's performance; what I'm hoping for is to create the connection only once.
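The usual pattern for this is mapPartitions: open one connection per partition, reuse it for every row in the iterator, and close it lazily once the last row has been consumed. A minimal sketch of the pattern, using a hypothetical HBaseConnection stand-in (the real HbaseInteraction API isn't shown in the question):

```scala
// Hypothetical stand-in for the HBase client; the real class and its
// reader signature are assumptions, not part of the question's code.
class HBaseConnection {
  def lookup(key: String): String = s"value-for-$key" // pretend HBase read
  def close(): Unit = ()                              // pretend cleanup
}

// One connection per partition: opened once, shared by all rows,
// closed only after the mapped iterator is exhausted.
def enrichPartition(rows: Iterator[(Int, String)]): Iterator[(Int, String, String)] = {
  val conn = new HBaseConnection
  rows.map { case (partyId, brand) =>
    (partyId, brand, conn.lookup(s"$partyId-$brand"))
  } ++ {
    // the right-hand side of ++ is evaluated by name, i.e. only after the
    // mapped iterator has been fully consumed, so close() runs exactly once
    conn.close()
    Iterator.empty
  }
}
```

On the Spark side this would be wired up roughly as ocisPreferencesDF.rdd.mapPartitions(enrichPartition), converting back to a DataFrame afterwards; the connection object never leaves the executor, so nothing needs to be serialized.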
I'm working on Spark 2.3.0, using the Cost Based Optimizer (CBO) to compute statistics for queries on external tables.
I have created an external table in Spark:
CREATE EXTERNAL TABLE IF NOT EXISTS test (
eventID string,type string,exchange string,eventTimestamp bigint,sequenceNumber bigint
,optionID string,orderID string,side string,routingFirm string,routedOrderID string
,session string,price decimal(18,8),quantity bigint,timeInForce string,handlingInstructions string
,orderAttributes string,isGloballyUnique boolean,originalOrderID string,initiator string,leavesQty bigint
,symbol string,routedOriginalOrderID string,displayQty bigint,orderType string,coverage string
,result string,resultTimestamp bigint,nbbPrice decimal(18,8),nbbQty bigint,nboPrice decimal(18,8)
,nboQty bigint,reporter string,quoteID string,noteType string,definedNoteData string,undefinedNoteData string
,note string,desiredLeavesQty bigint,displayPrice decimal(18,8),workingPrice decimal(18,8),complexOrderID string
,complexOptionID string,cancelQty bigint,cancelReason string,openCloseIndicator string,exchOriginCode string
,executingFirm string,executingBroker string,cmtaFirm string,mktMkrSubAccount string,originalOrderDate string
,tradeID string,saleCondition string,executionCodes string,buyDetails_side string,buyDetails_leavesQty bigint
,buyDetails_openCloseIndicator string,buyDetails_quoteID string,buyDetails_orderID string,buyDetails_executingFirm string
,buyDetails_executingBroker string,buyDetails_cmtaFirm string,buyDetails_mktMkrSubAccount string,buyDetails_exchOriginCode string
,buyDetails_liquidityCode string,buyDetails_executionCodes string,sellDetails_side string,sellDetails_leavesQty bigint
,sellDetails_openCloseIndicator string,sellDetails_quoteID string,sellDetails_orderID string,sellDetails_executingFirm string
,sellDetails_executingBroker string,sellDetails_cmtaFirm string,sellDetails_mktMkrSubAccount string,sellDetails_exchOriginCode string
,sellDetails_liquidityCode string,sellDetails_executionCodes string,tradeDate int,reason string
,executionTimestamp bigint,capacity string,fillID string,clearingNumber string
,contraClearingNumber string,buyDetails_capacity string,buyDetails_clearingNumber string,sellDetails_capacity string
,sellDetails_clearingNumber string,receivingFirm string,marketMaker string,sentTimestamp bigint,onlyOneQuote boolean
,originalQuoteID string,bidPrice decimal(18,8),bidQty bigint,askPrice decimal(18,8),askQty bigint,declaredTimestamp bigint,revokedTimestamp bigint,awayExchange string,comments string,clearingFirm string )
PARTITIONED BY (date integer ,reporteIDs string ,version integer )
STORED AS PARQUET LOCATION '/home/test/'
I have computed statistics on the columns using the following command:
val df = spark.read.parquet("/home/test/")
val cols = df.columns.mkString(",")
val analyzeDDL = s"Analyze table events compute statistics for columns $cols"
spark.sql(analyzeDDL)
Now, when I try to get the statistics for the query:
val query = "Select * from test where date > 20180222"
it gives me only the size and not the rowCount:
scala> val exec = spark.sql(query).queryExecution
exec: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('date > 20180222)
+- 'UnresolvedRelation `test`
== Analyzed Logical Plan ==
eventID: string, type: string, exchange: string, eventTimestamp: bigint, sequenceNumber: bigint, optionID: string, orderID: string, side: string, routingFirm: string, routedOrderID: string, session: string, price: decimal(18,8), quantity: bigint, timeInForce: string, handlingInstructions: string, orderAttributes: string, isGloballyUnique: boolean, originalOrderID: string, initiator: string, leavesQty: bigint, symbol: string, routedOriginalOrderID: string, displayQty: bigint, orderType: string, ... 82 more fields
Project [eventID#797974, type#797975, exchange#797976, eventTimestamp#797977L, sequenceNumber#...
scala>
scala> val stats = exec.optimizedPlan.stats
stats: org.apache.spark.sql.catalyst.plans.logical.Statistics = Statistics(sizeInBytes=1.0 B, hints=none)
Am I missing any steps here? How can I get the rowCount for the query?
Spark-version : 2.3.0
Files in the table are in parquet format.
Update
I'm able to get the statistics for a CSV file, but not for a Parquet file.
The difference between the execution plans is that for CSV we get a HiveTableRelation, while for Parquet it's a Relation.
Any idea why that is?
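For what it's worth, in Spark 2.3 the optimizer only picks up catalog statistics when spark.sql.cbo.enabled is set and the query is resolved through the catalog table; a path-based spark.read.parquet(...) bypasses the metastore stats entirely. A sketch of what to check (assuming the table being queried is the same one that was analyzed):

```scala
spark.conf.set("spark.sql.cbo.enabled", "true")

// Table-level stats (this is what populates rowCount) plus per-column stats;
// the table name here must match the table used in the query.
spark.sql("ANALYZE TABLE test COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE test COMPUTE STATISTICS FOR COLUMNS eventID, type")

// Resolve through the catalog rather than spark.read.parquet(path):
val exec = spark.table("test").filter("date > 20180222").queryExecution
println(exec.optimizedPlan.stats) // rowCount appears here when catalog stats are available
```

This would also explain the HiveTableRelation vs Relation difference noted in the update: only the catalog-backed relation carries the metastore statistics.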
I am using the Cloudera VM. I have imported the products table from retail_db as a text file with '|' as the field separator (using Sqoop).
Following is the table schema:
mysql> describe products;
product_id: int(11)
product_category_id: int(11)
product_name: varchar(45)
product_description: varchar(255)
product_price: float
product_image: varchar(255)
I want to create a Dataframe from this data.
The following code ran without issue:
var products = sc.textFile("/user/cloudera/ex/products").map(r => {
  val p = r.split('|')
  (p(0).toInt, p(1).toInt, p(2), p(3), p(4).toFloat, p(5))
})
case class Products(productID: Int, productCategory: Int, productName: String, productDescription: String, productPrice: Float, productImage: String)
var productsDF = products.map(r => Products(r._1, r._2, r._3, r._4, r._5, r._6)).toDF()
productsDF.show()
But I get a NumberFormatException with the following code:
case class Products (product_id: Int, product_category_id: Int, product_name: String, product_description: String, product_price: Float, product_image: String)
val productsDF = sc.textFile("/user/cloudera/ex/products")
  .map(_.split("|"))
  .map(p => Products(p(0).trim.toInt, p(1).trim.toInt, p(2), p(3), p(4).trim.toFloat, p(5)))
  .toDF()
productsDF.show()
java.lang.NumberFormatException: For input string: ""
Why am I getting an exception in the second snippet even though it is essentially the same as the first?
The error is due to _.split("|") in the second snippet.
You need to use _.split('|'), _.split("\\|"), _.split("""\|"""), or Pattern.quote("|").
The String overload of split takes a regular expression, and | means "or" in a regex, so the pattern "|" matches the empty string at every position and splits the line into individual characters instead of fields; calling toInt or toFloat on one of those fragments then fails with a NumberFormatException.
Hope this helps!
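The difference is easy to see in a small standalone check:

```scala
val line = "12|34|ab"

// Char overload: splits on the literal '|' character.
line.split('|').toList   // List(12, 34, ab)

// String overload with the regex metacharacter escaped: same result.
line.split("\\|").toList // List(12, 34, ab)

// Unescaped "|" is a regex that matches the empty string at every
// position, so the line falls apart into single characters.
line.split("|").toList   // List(1, 2, |, 3, 4, |, a, b)
```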
This is the case class representing the entire row:
case class CustomerRow(id: Long, name: String, 20 other fields ...)
I have a "shape" case class that only exposes a subset of columns; it is used when a user creates or updates a customer:
case class CustomerForm(name: String, subset of all fields ...)
I can use CustomerForm for updates. However, I can't use it for inserts: some columns that are not in CustomerForm are required (not null) and can only be provided by the server. What I do now is create a CustomerRow from the CustomerForm:
def form2row(form: CustomerForm, id: Long, serverOnlyValue: Long, etc...) = CustomerRow(
  id = id,
  serverOnlyColumn = serverOnlyValue,
  name = form.name,
  // and so on for 20 more tedious lines of code
)
and use it for insert.
Is there a way to compose insert in slick so I can remove that tedious form2row function?
Something like:
(customers.map(formShape) += form) andAlsoOnTheSameRow .map(c => (c.id, c.serverOnlyColumn)) += (id, someValue)
?
Yes, you can do it like this:
case class Person(name: String, email: String, address: String, id: Option[Int] = None)
case class NameAndAddress(name: String, address: String)

class PersonTable(tag: Tag) extends Table[Person](tag, "person") {
  val id = column[Int]("id", O.PrimaryKey, O.AutoInc)
  val name = column[String]("name")
  val email = column[String]("email")
  val address = column[String]("address")

  // for partial inserts
  def nameWithAddress = (name, address) <> (NameAndAddress.tupled, NameAndAddress.unapply)

  def * = (name, email, address, id.?) <> (Person.tupled, Person.unapply)
}

val personTableQuery = TableQuery[PersonTable]

// insert only the projected fields
personTableQuery.map(_.nameWithAddress) += NameAndAddress("abc", "xyz")
Make sure you are aware of nullable fields: they should be declared as Option[T], where T is the field type. In my example, email should be Option[String] instead of String if it is nullable.
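If you also need the server-generated values back from a partial insert (as in the question's hypothetical andAlsoOnTheSameRow), Slick's returning can be combined with the projected insert. A sketch, untested here since it needs a live database:

```scala
// Insert only name and address; the database fills in the auto-increment id,
// and `returning` hands it back instead of the row count.
val insertAndReturnId =
  personTableQuery.map(_.nameWithAddress) returning personTableQuery.map(_.id)

val idAction = insertAndReturnId += NameAndAddress("abc", "xyz")
// db.run(idAction) yields the new row's id as a Future[Int]
```

Columns that must be set by the server but are not auto-generated (defaults, triggers) still have to be handled on the database side with this approach; Slick only inserts the columns in the projection.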
I am trying to use a Scala case class to map a Cassandra table.
Some of my column names happen to be reserved keywords in Scala. Is there an easy way to map them?
eg:
Cassandra Table
Create Table cars (
id_uuid uuid,
new boolean,
type text,
PRIMARY KEY ((id_uuid))
)
// This declaration will fail as "new" and "type" are reserved keywords
scala> case class Cars (idUuid : String, new : Boolean, type: String)
Try this:
case class Cars (idUuid:String, `new`:Boolean, `type`:String)
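Note that the backticks are needed wherever the field is referenced, not just in the declaration. A quick standalone check:

```scala
// Backticks let reserved words be used as ordinary identifiers.
case class Cars(idUuid: String, `new`: Boolean, `type`: String)

val car = Cars(idUuid = "1b4e28ba", `new` = true, `type` = "sedan")

// Accessing the fields also requires backticks:
println(car.`new`)  // true
println(car.`type`) // sedan
```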