Spark CBO not showing rowCount for queries with a partition column in the query - apache-spark

I'm working on Spark 2.3.0, using the Cost Based Optimizer (CBO) to compute statistics for queries on external tables.
I have created an external table in Spark:
CREATE EXTERNAL TABLE IF NOT EXISTS test (
eventID string,type string,exchange string,eventTimestamp bigint,sequenceNumber bigint
,optionID string,orderID string,side string,routingFirm string,routedOrderID string
,session string,price decimal(18,8),quantity bigint,timeInForce string,handlingInstructions string
,orderAttributes string,isGloballyUnique boolean,originalOrderID string,initiator string,leavesQty bigint
,symbol string,routedOriginalOrderID string,displayQty bigint,orderType string,coverage string
,result string,resultTimestamp bigint,nbbPrice decimal(18,8),nbbQty bigint,nboPrice decimal(18,8)
,nboQty bigint,reporter string,quoteID string,noteType string,definedNoteData string,undefinedNoteData string
,note string,desiredLeavesQty bigint,displayPrice decimal(18,8),workingPrice decimal(18,8),complexOrderID string
,complexOptionID string,cancelQty bigint,cancelReason string,openCloseIndicator string,exchOriginCode string
,executingFirm string,executingBroker string,cmtaFirm string,mktMkrSubAccount string,originalOrderDate string
,tradeID string,saleCondition string,executionCodes string,buyDetails_side string,buyDetails_leavesQty bigint
,buyDetails_openCloseIndicator string,buyDetails_quoteID string,buyDetails_orderID string,buyDetails_executingFirm string,buyDetails_executingBroker string,buyDetails_cmtaFirm string,buyDetails_mktMkrSubAccount string,buyDetails_exchOriginCode string,buyDetails_liquidityCode string,buyDetails_executionCodes string,sellDetails_side string,sellDetails_leavesQty bigint,sellDetails_openCloseIndicator string,sellDetails_quoteID string,sellDetails_orderID string,sellDetails_executingFirm string,sellDetails_executingBroker string,sellDetails_cmtaFirm string,sellDetails_mktMkrSubAccount string,sellDetails_exchOriginCode string,sellDetails_liquidityCode string,sellDetails_executionCodes string,tradeDate int,reason string,executionTimestamp bigint,capacity string,fillID string,clearingNumber string
,contraClearingNumber string,buyDetails_capacity string,buyDetails_clearingNumber string,sellDetails_capacity string
,sellDetails_clearingNumber string,receivingFirm string,marketMaker string,sentTimestamp bigint,onlyOneQuote boolean
,originalQuoteID string,bidPrice decimal(18,8),bidQty bigint,askPrice decimal(18,8),askQty bigint,declaredTimestamp bigint,revokedTimestamp bigint,awayExchange string,comments string,clearingFirm string )
PARTITIONED BY (date integer ,reporteIDs string ,version integer )
STORED AS PARQUET LOCATION '/home/test/'
I have computed statistics on the columns using the following command:
val df = spark.read.parquet("/home/test/")
val cols = df.columns.mkString(",")
val analyzeDDL = s"Analyze table events compute statistics for columns $cols"
spark.sql(analyzeDDL)
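For reference, a minimal sketch of the settings and table-level statistics that are usually also needed before rowCount shows up (this assumes the analyzed table is the same test table queried below; CBO is off by default in Spark 2.3):
// Rough sketch, not a confirmed fix: rowCount is recorded by table-level statistics,
// and CBO must be enabled, in addition to the column-level ANALYZE above.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.sql("ANALYZE TABLE test COMPUTE STATISTICS")       // table-level stats: rowCount + sizeInBytes
spark.sql("DESCRIBE EXTENDED test").show(100, false)     // inspect what the catalog recorded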
Now when I try to get the statistics for the query:
val query = "Select * from test where date > 20180222"
it gives me only the size and not the rowCount:
scala> val exec = spark.sql(query).queryExecution
exec: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('date > 20180222)
+- 'UnresolvedRelation `test`
== Analyzed Logical Plan ==
eventID: string, type: string, exchange: string, eventTimestamp: bigint, sequenceNumber: bigint, optionID: string, orderID: string, side: string, routingFirm: string, routedOrderID: string, session: string, price: decimal(18,8), quantity: bigint, timeInForce: string, handlingInstructions: string, orderAttributes: string, isGloballyUnique: boolean, originalOrderID: string, initiator: string, leavesQty: bigint, symbol: string, routedOriginalOrderID: string, displayQty: bigint, orderType: string, ... 82 more fields
Project [eventID#797974, type#797975, exchange#797976, eventTimestamp#797977L, sequenceNumber#...
scala>
scala> val stats = exec.optimizedPlan.stats
stats: org.apache.spark.sql.catalyst.plans.logical.Statistics = Statistics(sizeInBytes=1.0 B, hints=none)
Am I missing any steps here? How can I get the rowCount for the query?
Spark version: 2.3.0
The files in the table are in Parquet format.
Update
I'm able to get the statistics for a CSV file, but not for a Parquet file.
The difference between the execution plans for Parquet and CSV is that for CSV we get a HiveTableRelation, while for Parquet it's a Relation.
Any idea why that is?
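(A hedged aside rather than an answer: by default Spark converts metastore Parquet tables into its own data source relations, which would explain why the Parquet plan shows a Relation instead of a HiveTableRelation. Disabling that conversion makes the two plans comparable and shows whether the statistics come through on the Hive path.)
// Hypothetical experiment: keep the Hive relation for Parquet and re-inspect the stats.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
val exec2 = spark.sql("Select * from test where date > 20180222").queryExecution
exec2.optimizedPlan.stats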

Related

Hive, how to partition by a column with null values, putting all nulls in one partition

I am using Hive, and the IDE is Hue. I am trying different key combinations to choose for my partition key(s).
The definition of my original table is as follows:
CREATE External Table `my_hive_db`.`my_table`(
`col_id` bigint,
`result_section__col2` string,
`result_section_col3` string ,
`result_section_col4` string,
`result_section_col5` string,
`result_section_col6__label` string,
`result_section_col7__label_id` bigint ,
`result_section_text` string ,
`result_section_unit` string,
`result_section_col` string ,
`result_section_title` string,
`result_section_title_id` bigint,
`col13` string,
`timestamp` bigint,
`date_day` string
)
PARTITIONED BY (
`date_year` string,
`date_month` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3a://some/where/in/amazon/s3';
The above code works properly. But when I create a new table with date_day as the partition key, the table is empty and I need to run MSCK REPAIR TABLE. However, I then get the following error:
Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.ddl.DDLTask
When the partition keys were date_year and date_month, MSCK worked properly.
Table definition of the table I am getting the error for is as follows:
CREATE External Table `my_hive_db`.`my_table`(
`col_id` bigint,
`result_section__col2` string,
`result_section_col3` string ,
`result_section_col4` string,
`result_section_col5` string,
`result_section_col6__label` string,
`result_section_col7__label_id` bigint ,
`result_section_text` string ,
`result_section_unit` string,
`result_section_col` string ,
`result_section_title` string,
`result_section_title_id` bigint,
`col13` string,
`timestamp` bigint,
`date_year` string,
`date_month` string
)
PARTITIONED BY (
`date_day` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3a://some/where/in/amazon/s3';
After this, the following query returns no rows:
Select * From `my_hive_db`.`my_table` Limit 10;
I therefore ran the following command:
MSCK REPAIR TABLE `my_hive_db`.`my_table`;
And I get the error: Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.ddl.DDLTask
I checked this link, as it is exactly the error I am getting, but when I use the solution provided:
set hive.msck.path.validation=ignore;
MSCK REPAIR TABLE table_name;
I get a different error:
Error while processing statement: Cannot modify hive.msck.path.validation at runtime. It is not in list of params that are allowed to be modified at runtime.
I think the reason I am getting these errors is that there are more than 200 million records with a null value for date_day.
There are 31 distinct non-null date_day values. I would like to partition my table into 32 partitions: one for each distinct value of the date_day field, with all the null values going into a separate partition. Is there a way to do so (partitioning by a column with null values)?
If this can be achieved with Spark, I am also open to using it.
This is part of a bigger problem of changing partition keys by recreating a table as mentioned in this link in answer to my other question.
Thank you for your help.
You seem to misunderstand how Hive's partitioning works.
Hive stores data as files on HDFS (or S3, or some other distributed storage).
If you create a non-partitioned Parquet table called my_schema.my_table, you will see files in your distributed storage laid out like
hive/warehouse/my_schema.db/my_table/part_00001.parquet
hive/warehouse/my_schema.db/my_table/part_00002.parquet
...
If you create a table partitioned by a column p_col, the files will look like
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00002.parquet
...
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00002.parquet
...
The MSCK REPAIR TABLE command lets you automatically reload the partitions when you create an external table.
Let's say you have folders on s3 that look like this:
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value3/part_00001.parquet
You create an external table with
CREATE External Table my_schema.my_table(
... some columns ...
)
PARTITIONED BY (p_col STRING)
the table will be created but empty, because Hive hasn't detected the partitions yet. You run MSCK REPAIR TABLE my_schema.my_table, and Hive will recognize that your partition p_col matches the partitioning scheme on s3 (/p_col=value1/).
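For example (with the illustrative names used above):
MSCK REPAIR TABLE my_schema.my_table;
SHOW PARTITIONS my_schema.my_table;   -- should now list p_col=value1, p_col=value2, p_col=value3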
From what I could understand from your other question, you are trying to change the partitioning scheme of the table by doing
CREATE External Table my_schema.my_table(
... some columns ...
)
PARTITIONED BY (p_another_col STRING)
and you are getting an error message because p_another_col doesn't match the column used on S3, which was p_col.
And this error is perfectly normal, since what you are doing doesn't make sense.
As stated in the other question's answer, you need to create a copy of the first table, with a different partitioning scheme.
You should instead try something like this:
CREATE External Table my_hive_db.my_table_2(
`col_id` bigint,
`result_section__col2` string,
`result_section_col3` string ,
`result_section_col4` string,
`result_section_col5` string,
`result_section_col6__label` string,
`result_section_col7__label_id` bigint ,
`result_section_text` string ,
`result_section_unit` string,
`result_section_col` string ,
`result_section_title` string,
`result_section_title_id` bigint,
`col13` string,
`timestamp` bigint,
`date_year` string,
`date_month` string
)
PARTITIONED BY (`date_day` string)
and then populate your new table with dynamic partitioning
INSERT OVERWRITE TABLE my_hive_db.my_table_2 PARTITION(date_day)
SELECT
col_id,
result_section__col2,
result_section_col3,
result_section_col4,
result_section_col5,
result_section_col6__label,
result_section_col7__label_id,
result_section_text,
result_section_unit,
result_section_col,
result_section_title,
result_section_title_id,
col13,
timestamp,
date_year,
date_month,
date_day
FROM my_hive_db.my_table_1
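One hedged addition that the answer doesn't spell out: Hive normally rejects a fully dynamic INSERT like the one above unless dynamic partitioning is enabled in nonstrict mode first:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
With dynamic partitioning, rows whose date_day is NULL typically land in Hive's default partition (__HIVE_DEFAULT_PARTITION__), which is effectively the "all nulls in one partition" behaviour the question asks about.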

How to improve a DataFrame UDF that connects to HBase for every row

I have a DataFrame where I need to create a column based on values from each row.
I use a UDF which processes each row and connects to HBase to fetch data.
The UDF creates a connection, returns the data, and closes the connection.
The process is slow because ZooKeeper hangs after a few reads. I want to pull the data with only one open connection.
I tried mapPartitions, but the connection is not passed along because it's not serializable.
The UDF:
val lookUpUDF = udf((partyID: Int, brand: String, algorithm: String, bigPartyProductMappingTableName: String, env: String) => lookUpLogic.lkpBigPartyAccount(partyID, brand, algorithm, bigPartyProductMappingTableName, env))
How the DataFrame is iterated:
ocisPreferencesDF
  .withColumn("deleteStatus", lookUpUDF(
    col(StagingBatchConstants.OcisPreferencesPartyId),
    col(StagingBatchConstants.OcisPreferencesBrand),
    lit(EnvironmentConstants.digest_algorithm),
    lit(bigPartyProductMappingTableName),
    lit(env)))
The main logic:
def lkpBigPartyAccount(partyID: Int,
                       brand: String,
                       algorithm: String,
                       bigPartyProductMappingTableName: String,
                       envVar: String,
                       hbaseInteraction: HbaseInteraction = new HbaseInteraction,
                       digestGenerator: DigestGenerator = new DigestGenerator): Array[(String, String)] = {
  AppInit.setEnvVar(envVar)
  val message = partyID.toString + "-" + brand
  val rowKey = Base64.getEncoder.encodeToString(message.getBytes())
  val hbaseAccountInfo = hbaseInteraction.hbaseReader(bigPartyProductMappingTableName, rowKey, "cf").asScala
  val convertMap: mutable.HashMap[String, String] = new mutable.HashMap[String, String]
  for ((key, value) <- hbaseAccountInfo) {
    convertMap.put(key.toString, value.toString)
  }
  convertMap.toArray
}
I want to improve the performance of this code; specifically, I'm hoping to create the connection only once.
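A minimal sketch of the one-connection-per-partition approach (the input column names, the HBase table name, and the use of the stock HBase client in place of the custom HbaseInteraction class are assumptions, not the original code):
import java.util.Base64
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes
import spark.implicits._                                     // assumes a SparkSession named spark

case class PartyRow(partyID: Int, brand: String)             // hypothetical input shape

val withDeleteStatus = ocisPreferencesDF
  .select($"partyId".cast("int").as("partyID"), $"brand")    // column names are assumptions
  .as[PartyRow]
  .mapPartitions { rows =>
    // One HBase connection per partition, reused for every row in that partition.
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf(bigPartyProductMappingTableName))
    val results = rows.map { r =>
      val rowKey = Base64.getEncoder.encodeToString(s"${r.partyID}-${r.brand}".getBytes())
      val hbaseResult = table.get(new Get(Bytes.toBytes(rowKey)))
      (r.partyID, r.brand, !hbaseResult.isEmpty)              // stand-in for the real lookup logic
    }.toList                                                  // materialise before closing the connection
    table.close()
    connection.close()
    results.iterator
  }
  .toDF("partyID", "brand", "deleteStatus")
Carrying any extra columns through, or joining the result back onto the original DataFrame, is left out to keep the sketch short.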

java.lang.NumberFormatException: For input string: "" while creating Dataframe

I am using the Cloudera VM. I have imported the products table from retail_db as a text file with '|' as the field separator (using Sqoop).
Following is the table schema:
mysql> describe products;
product_id: int(11)
product_category_id: int(11)
product_name: varchar(45)
product_description: varchar(255)
product_price: float
product_image: varchar(255)
I want to create a Dataframe from this data.
I have no issues with the following code:
var products = sc.textFile("/user/cloudera/ex/products").map(r => {var p = r.split('|'); (p(0).toInt, p(1).toInt, p(2), p(3), p(4).toFloat, p(5))})
case class Products(productID: Int, productCategory: Int, productName: String, productDescription: String, productPrice: Float, productImage: String)
var productsDF = products.map(r => Products(r._1, r._2, r._3, r._4, r._5, r._6)).toDF()
productsDF.show()
But I get a NumberFormatException for the following code:
case class Products (product_id: Int, product_category_id: Int, product_name: String, product_description: String, product_price: Float, product_image: String)
val productsDF = sc.textFile("/user/cloudera/ex/products").map(_.split("|")).map(p => Products(p(0).trim.toInt, p(1).trim.toInt, p(2), p(3), p(4).trim.toFloat, p(5))).toDF()
productsDF.show()
java.lang.NumberFormatException: For input string: ""
Why am I getting an exception in the second snippet even though it is the same as the first?
The error is due to _.split("|") in the second snippet.
You need to use _.split('|'), _.split("\\|"), _.split("""\|"""), or Pattern.quote("|").
If you use "|", split treats it as a regular expression, and | means "or" in a regex. The pattern matches the empty string at every position, so the line is broken into single characters (and, on some Java versions, a leading empty string ""), and calling toInt on "" throws the NumberFormatException.
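A quick illustration of the difference (the value is made up, not taken from the products table):
val line = "1009|2|Gear Bag|A sturdy bag|59.99|http://example.com/img.jpg"

line.split("|")      // regex: "|" matches the empty string, so you get one-character tokens
line.split('|')      // character overload: Array(1009, 2, Gear Bag, A sturdy bag, 59.99, ...)
line.split("\\|")    // escaped regex: same result as the character overload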
Hope this helps!

Convert a Dataset with a single column to a multiple-column Dataset in Scala

I have a Dataset[String] containing the following data:
12348,5,233,234559,4
12348,5,233,234559,4
12349,6,233,234560,5
12350,7,233,234561,6
I want to split each row and turn it into multiple columns named RegionId, PerilId, Date, EventId, ModelId. How do I achieve this?
You mean something like this:
case class NewSet(RegionId: String, PerilId: String, Date: String, EventId: String, ModelId: String)
val newDataset = oldDataset.map { s: String =>
  val strings = s.split(",")
  NewSet(strings(0), strings(1), strings(2), strings(3), strings(4))
}
Of course you should probably make the lambda function a little more robust...
If you have the data you specified in an RDD, then converting it to a DataFrame is pretty easy:
case class MyClass(RegionId: String, PerilId: String, Date: String,
  EventId: String, ModelId: String)

val myClassRdd = rdd.map(_.split(",")).map(a => MyClass(a(0), a(1), a(2), a(3), a(4)))
val dataframe = sqlContext.createDataFrame(myClassRdd)
This DataFrame will have all the columns, with names corresponding to the fields of the case class MyClass.
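A hedged alternative for Spark 2.2+ (assuming the input is the Dataset[String] called oldDataset from the question): let the built-in CSV reader do the split and then name the columns, which avoids writing the parsing lambda at all:
val columns = Seq("RegionId", "PerilId", "Date", "EventId", "ModelId")

val newDF = spark.read
  .option("header", "false")
  .csv(oldDataset)          // DataFrameReader.csv(Dataset[String]) overload, available since Spark 2.2
  .toDF(columns: _*)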

Cassandra: insert system.schema_columns into another table with modifications

I have the following select:
select * from system.schema_columns where keyspace_name = 'automotive' and columnfamily_name = 'cars';
and I want to insert the data returned by it into another table, with some modifications:
- insert the data type of the column
- remove audit columns like created_at, created_by, etc.
In MySQL we can do this with:
insert into formUtil(table_name, column_name, ordinal_position, is_nullable, data_type)
SELECT
col.table_name,
col.column_name,
col.ordinal_position,
case when col.is_nullable = 'YES' then 1 else 0 end,
col.data_type
from
information_schema.COLUMNS col
where
col.table_schema = 'i2cwac' and
col.column_name not in ('id','modifiedAt','modifiedBy','createdAt','createdBy') and
col.table_name = 'users';
How can we do this in Cassandra?
You can achieve this by using Spark:
import java.nio.ByteBuffer
import com.datastax.spark.connector._

// `type` is a reserved word in Scala, so it has to be escaped with backticks
case class SchemaColumns(keyspaceName: String, tableName: String, clusteringOrder: String, columnNameBytes: ByteBuffer, kind: String, position: Int, `type`: String)
case class AnotherTable(keyspaceName: String, tableName: String, `type`: String)

sc.cassandraTable[SchemaColumns]("system", "schema_columns")
  .map(schemaColumns => AnotherTable(schemaColumns.keyspaceName, schemaColumns.tableName, schemaColumns.`type`))
  .saveToCassandra("my_keyspace", "another_table")
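One hedged follow-up, not part of the original answer: saveToCassandra writes into a table that must already exist in Cassandra. If my_keyspace.another_table hasn't been created yet, the connector can also derive a schema from the case class and create the table itself:
// Hypothetical alternative: create the target table from the AnotherTable case class.
sc.cassandraTable[SchemaColumns]("system", "schema_columns")
  .map(row => AnotherTable(row.keyspaceName, row.tableName, row.`type`))
  .saveAsCassandraTable("my_keyspace", "another_table")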
