JDBC to Spark Dataframe - How to ensure even partitioning? - apache-spark

I am new to Spark, and am working on creating a DataFrame from a Postgres database table via JDBC, using spark.read.jdbc.
I am a bit confused about the partitioning options, in particular partitionColumn, lowerBound, upperBound, and numPartitions.
The documentation seems to indicate that these fields are optional.
What happens if I don't provide them?
How does Spark know how to partition the queries? How efficient will that be?
If I DO specify these options, how do I ensure that the partition sizes are roughly even if the partitionColumn is not evenly distributed?
Let's say I'm going to have 20 executors, so I set my numPartitions to 20.
My partitionColumn is an auto-incremented ID field, and let's say the values range from 1 to 2,000,000
However, because the user selects to process some really old data, along with some really new data, with nothing in the middle, most of the data has ID values either under 100,000 or over 1,900,000.
Will my 1st and 20th executors get most of the work, while the other 18 executors sit there mostly idle?
If so, is there a way to prevent this?

I found a way to manually specify the partition boundaries, by using the jdbc constructor with the predicates parameter.
It allows you to explicitly specify individual conditions to be inserted in the "where" clause for each partition, which allows you to specify exactly which range of rows each partition will receive. So, if you don't have a uniformly distributed column to auto-partition on, you can customize your own partition strategy.
An example of how to use it can be found in the accepted answer to this question.

What are all these options : spark.read.jdbc refers to reading a table from RDBMS.
parallelism is power of spark, in order to achieve this you have to mention all these options.
Question[s] :-)
1) The documentation seems to indicate that these fields are optional. What happens if I don't provide them ?
Answer : default Parallelism or poor parallelism
Based on scenario developer has to take care about the performance tuning strategy. and to ensure data splits across the boundaries (aka partitions) which in turn will be tasks in parallel. By seeing this way.
2) How does Spark know how to partition the queries? How efficient will that be?
jdbc-reads -referring to databricks docs
You can provide split boundaries based on the dataset’s column values.
These options specify the parallelism on read.
These options must all be specified if any of them is specified.
Note
These options specify the parallelism of the table read. lowerBound and upperBound decide the partition stride, but do not
filter the rows in table. Therefore, Spark partitions and returns all
rows in the table.
Example 1:
You can split the table read across executors on the emp_no column using the partitionColumn, lowerBound, upperBound, and numPartitions parameters.
val df = spark.read.jdbc(url=jdbcUrl,
table="employees",
columnName="emp_no",
lowerBound=1L,
upperBound=100000L,
numPartitions=100,
connectionProperties=connectionProperties)
also numPartitions means number of parllel connections you are asking RDBMS to read the data. if you are providing numPartitions then you are limiting number of connections... with out exhausing the connections at RDBMS side..
Example 2 source : datastax presentation to load oracle data in cassandra :
val basePartitionedOracleData = sqlContext
.read
.format("jdbc")
.options(
Map[String, String](
"url" -> "jdbc:oracle:thin:username/password#//hostname:port/oracle_svc",
"dbtable" -> "ExampleTable",
"lowerBound" -> "1",
"upperBound" -> "10000",
"numPartitions" -> "10",
"partitionColumn" -> “KeyColumn"
)
)
.load()
The last four arguments in that map are there for the purpose of getting a partitioned dataset. If you pass any of them,
you have to pass all of them.
When you pass these additional arguments in, here’s what it does:
It builds a SQL statement template in the format
SELECT * FROM {tableName} WHERE {partitionColumn} >= ? AND
{partitionColumn} < ?
It sends {numPartitions} statements to the DB engine. If you suppled these values: {dbTable=ExampleTable,
lowerBound=1, upperBound=10,000, numPartitions=10, partitionColumn=KeyColumn}, it would create these ten
statements:
SELECT * FROM ExampleTable WHERE KeyColumn >= 1 AND KeyColumn < 1001
SELECT * FROM ExampleTable WHERE KeyColumn >= 1001 AND KeyColumn < 2000
SELECT * FROM ExampleTable WHERE KeyColumn >= 2001 AND KeyColumn < 3000
SELECT * FROM ExampleTable WHERE KeyColumn >= 3001 AND KeyColumn < 4000
SELECT * FROM ExampleTable WHERE KeyColumn >= 4001 AND KeyColumn < 5000
SELECT * FROM ExampleTable WHERE KeyColumn >= 5001 AND KeyColumn < 6000
SELECT * FROM ExampleTable WHERE KeyColumn >= 6001 AND KeyColumn < 7000
SELECT * FROM ExampleTable WHERE KeyColumn >= 7001 AND KeyColumn < 8000
SELECT * FROM ExampleTable WHERE KeyColumn >= 8001 AND KeyColumn < 9000
SELECT * FROM ExampleTable WHERE KeyColumn >= 9001 AND KeyColumn < 10000
And then it would put the results of each of those queries in its own partition in Spark.
Question[s] :-)
If I DO specify these options, how do I ensure that the partition
sizes are roughly even if the partitionColumn is not evenly
distributed?
Will my 1st and 20th executors get most of the work, while the other
18 executors sit there mostly idle?
If so, is there a way to prevent this?
All questions has one answer
Below is the way...
1) You need to understand how many number of records/rows per partition.... based on this you can repartition or coalesce
Snippet 1: Spark 1.6 >
spark 2.x provides facility to know how many records are there in the partition.
spark_partition_id() exists in org.apache.spark.sql.functions
import org.apache.spark.sql.functions._
val df = "<your dataframe read through rdbms.... using spark.read.jdbc>"
df.withColumn("partitionId", spark_partition_id()).groupBy("partitionId").count.show
Snippet 2 : for all version of spark
df
.rdd
.mapPartitionsWithIndex{case (i,rows) => Iterator((i,rows.size))}
.toDF("partition_number","NumberOfRecordsPerPartition")
.show
and then you need to incorporate your strategy again query tuning between ranges or repartitioning etc.... , you can use mappartitions or foreachpartitions
Conclusion : I prefer using given options which works on number columns since I have seen it was dividing data in to uniform across
bounderies/partitions.
Some time it may not be possible to use these option then manually
tuning the partitions/parllelism is required...
Update :
With the below we can achive uniform distribution...
Fetch the primary key of the table.
Find the key minimum and maximum values.
Execute Spark with those values.
def main(args: Array[String]){
// parsing input parameters ...
val primaryKey = executeQuery(url, user, password, s"SHOW KEYS FROM ${config("schema")}.${config("table")} WHERE Key_name = 'PRIMARY'").getString(5)
val result = executeQuery(url, user, password, s"select min(${primaryKey}), max(${primaryKey}) from ${config("schema")}.${config("table")}")
val min = result.getString(1).toInt
val max = result.getString(2).toInt
val numPartitions = (max - min) / 5000 + 1
val spark = SparkSession.builder().appName("Spark reading jdbc").getOrCreate()
var df = spark.read.format("jdbc").
option("url", s"${url}${config("schema")}").
option("driver", "com.mysql.jdbc.Driver").
option("lowerBound", min).
option("upperBound", max).
option("numPartitions", numPartitions).
option("partitionColumn", primaryKey).
option("dbtable", config("table")).
option("user", user).
option("password", password).load()
// some data manipulations here ...
df.repartition(10).write.mode(SaveMode.Overwrite).parquet(outputPath)
}

Related

Number of Task in Apache spark while writing into HDFS

I am trying to read csv file and then adding some columns . After that trying to save in orc format.
I could not understand how spark decided number of tasks for different stages.
Why number of task for CSV stage is 1 and for ORC stage it is 39?
val c1c8 = spark.read.option("header",true).csv("/user/DEEPAK_TEST/C1C6_NEW/")
val c1c8new = { c1c8.withColumnRenamed("c1c6_F","c1c8").withColumnRenamed("Network_Out","c1c8_network").withColumnRenamed("Access NE Out","c1c8_access_ne")
.withColumn("c1c8_signalling",when (col("signalling_Out") === "SIP Cl4" , "SIP CL4").when (col("signalling_Out") === "SIP cl4" , "SIP CL4").when (col("signalling_Out") === "Other" , "other").otherwise(col("signalling_Out")))
.withColumnRenamed("access type Out","c1c8_access_type").withColumnRenamed("Type_of_traffic_C","c1c8_typeoftraffic")
.withColumnRenamed("BOS traffic type Out","c1c8_bos_trafc_typ").withColumnRenamed("Scope_Out","c1c8_scope")
.withColumnRenamed("Join with UP-DWN SIP cl5 T1T7 Out","c1c8_join_indicator")
.select("c1c8","c1c8_network", "c1c8_access_ne", "c1c8_signalling", "c1c8_access_type", "c1c8_typeoftraffic",
"c1c8_bos_trafc_typ", "c1c8_scope","c1c8_join_indicator")
}
c1c8new.write.orc("/user/DEEPAK_TEST/C1C8_MAPPING_NEWT/")
Below is my understanding from looking at Spark 2.x source code.
Stage 0 is a file scan that creates FileScanRDD which is an RDD that scans a list of file partitions. This stage can have more than one task when you are reading from multiple partitioned directories, such as a partitioned Hive table.
The number of tasks in Stage 1 will be equals to the number of RDD partitions. In your case c1c8new.rdd.getNumPartitions will be 39. This number is calculated using:
config value spark.files.maxPartitionBytes (128MB by default)
sparkContext.defaultParallelism returned by task scheduler (equal to number of cores when running in local mode)
totalBytes
DataSourceScanExec.scala#L423
val defaultMaxSplitBytes =
fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
val bytesPerCore = totalBytes / defaultParallelism
val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
logInfo(s"Planning scan with bin packing, max size: $maxSplitBytes bytes, " +
s"open cost is considered as scanning $openCostInBytes bytes.")
You can see actual calculated values in the above log message if you set the log level to INFO - spark.sparkContext.setLogLevel("INFO")
In your case, I think the split size is 128 and so, number of tasks/partitions is roughly 4.6G/128MB
As a side note, you can change the number of partitions (and hence the number of tasks in the subsequent stage) by using repartition() or coalesce() on the dataframe. More importantly, the number of partitions after a shuffle is determined by spark.sql.shuffle.partitions (200 by default). If you have a shuffle, it is better to use this configuration to control the number of tasks because inserting repartition() or coalesce() between stages adds extra overhead.
For large spark SQL workloads, setting optimum values for spark.sql.shuffle.partitions in each stage was always a pain point. Spark 3.x has better support for this if Adaptive Query Execution is enabled, but I haven't tried it for any production workloads.

What is the best strategy to load huge datasets/data into Hive tables using Spark? [duplicate]

I am trying to move data from a table in PostgreSQL table to a Hive table on HDFS. To do that, I came up with the following code:
val conf = new SparkConf().setAppName("Spark-JDBC").set("spark.executor.heartbeatInterval","120s").set("spark.network.timeout","12000s").set("spark.sql.inMemoryColumnarStorage.compressed", "true").set("spark.sql.orc.filterPushdown","true").set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").set("spark.kryoserializer.buffer.max","512m").set("spark.serializer", classOf[org.apache.spark.serializer.KryoSerializer].getName).set("spark.streaming.stopGracefullyOnShutdown","true").set("spark.yarn.driver.memoryOverhead","7168").set("spark.yarn.executor.memoryOverhead","7168").set("spark.sql.shuffle.partitions", "61").set("spark.default.parallelism", "60").set("spark.memory.storageFraction","0.5").set("spark.memory.fraction","0.6").set("spark.memory.offHeap.enabled","true").set("spark.memory.offHeap.size","16g").set("spark.dynamicAllocation.enabled", "false").set("spark.dynamicAllocation.enabled","true").set("spark.shuffle.service.enabled","true")
val spark = SparkSession.builder().config(conf).master("yarn").enableHiveSupport().config("hive.exec.dynamic.partition", "true").config("hive.exec.dynamic.partition.mode", "nonstrict").getOrCreate()
def prepareFinalDF(splitColumns:List[String], textList: ListBuffer[String], allColumns:String, dataMapper:Map[String, String], partition_columns:Array[String], spark:SparkSession): DataFrame = {
val colList = allColumns.split(",").toList
val (partCols, npartCols) = colList.partition(p => partition_columns.contains(p.takeWhile(x => x != ' ')))
val queryCols = npartCols.mkString(",") + ", 0 as " + flagCol + "," + partCols.reverse.mkString(",")
val execQuery = s"select ${allColumns}, 0 as ${flagCol} from schema.tablename where period_year='2017' and period_num='12'"
val yearDF = spark.read.format("jdbc").option("url", connectionUrl).option("dbtable", s"(${execQuery}) as year2017")
.option("user", devUserName).option("password", devPassword)
.option("partitionColumn","cast_id")
.option("lowerBound", 1).option("upperBound", 100000)
.option("numPartitions",70).load()
val totalCols:List[String] = splitColumns ++ textList
val cdt = new ChangeDataTypes(totalCols, dataMapper)
hiveDataTypes = cdt.gpDetails()
val fc = prepareHiveTableSchema(hiveDataTypes, partition_columns)
val allColsOrdered = yearDF.columns.diff(partition_columns) ++ partition_columns
val allCols = allColsOrdered.map(colname => org.apache.spark.sql.functions.col(colname))
val resultDF = yearDF.select(allCols:_*)
val stringColumns = resultDF.schema.fields.filter(x => x.dataType == StringType).map(s => s.name)
val finalDF = stringColumns.foldLeft(resultDF) {
(tempDF, colName) => tempDF.withColumn(colName, regexp_replace(regexp_replace(col(colName), "[\r\n]+", " "), "[\t]+"," "))
}
finalDF
}
val dataDF = prepareFinalDF(splitColumns, textList, allColumns, dataMapper, partition_columns, spark)
val dataDFPart = dataDF.repartition(30)
dataDFPart.createOrReplaceTempView("preparedDF")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("set hive.exec.dynamic.partition=true")
spark.sql(s"INSERT OVERWRITE TABLE schema.hivetable PARTITION(${prtn_String_columns}) select * from preparedDF")
The data is inserted into the hive table dynamically partitioned based on prtn_String_columns: source_system_name, period_year, period_num
Spark-submit used:
SPARK_MAJOR_VERSION=2 spark-submit --conf spark.ui.port=4090 --driver-class-path /home/fdlhdpetl/jars/postgresql-42.1.4.jar --jars /home/fdlhdpetl/jars/postgresql-42.1.4.jar --num-executors 80 --executor-cores 5 --executor-memory 50G --driver-memory 20G --driver-cores 3 --class com.partition.source.YearPartition splinter_2.11-0.1.jar --master=yarn --deploy-mode=cluster --keytab /home/fdlhdpetl/fdlhdpetl.keytab --principal fdlhdpetl#FDLDEV.COM --files /usr/hdp/current/spark2-client/conf/hive-site.xml,testconnection.properties --name Splinter --conf spark.executor.extraClassPath=/home/fdlhdpetl/jars/postgresql-42.1.4.jar
The following error messages are generated in the executor logs:
Container exited with a non-zero exit code 143.
Killed by external signal
18/10/03 15:37:24 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[SIGTERM handler,9,system]
java.lang.OutOfMemoryError: Java heap space
at java.util.zip.InflaterInputStream.<init>(InflaterInputStream.java:88)
at java.util.zip.ZipFile$ZipFileInflaterInputStream.<init>(ZipFile.java:393)
at java.util.zip.ZipFile.getInputStream(ZipFile.java:374)
at java.util.jar.JarFile.getManifestFromReference(JarFile.java:199)
at java.util.jar.JarFile.getManifest(JarFile.java:180)
at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:944)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:450)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.util.SignalUtils$ActionHandler.handle(SignalUtils.scala:99)
at sun.misc.Signal$1.run(Signal.java:212)
at java.lang.Thread.run(Thread.java:745)
I see in the logs that the read is being executed properly with the given number of partitions as below:
Scan JDBCRelation((select column_names from schema.tablename where period_year='2017' and period_num='12') as year2017) [numPartitions=50]
Below is the state of executors in stages:
The data is not being partitioned properly. One partition is smaller while the other one becomes huge. There is a skew problem here.
While inserting the data into Hive table the job fails at the line:spark.sql(s"INSERT OVERWRITE TABLE schema.hivetable PARTITION(${prtn_String_columns}) select * from preparedDF") but I understand this is happening because of the data skew problem.
I tried to increase number of executors, increasing the executor memory, driver memory, tried to just save as csv file instead of saving the dataframe into a Hive table but nothing affects the execution from giving the exception:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Is there anything in the code that I need to correct ? Could anyone let me know how can I fix this problem ?
Determine how many partitions you need given the amount of input data and your cluster resources. As a rule of thumb it is better to keep partition input under 1GB unless strictly necessary. and strictly smaller than the block size limit.
You've previously stated that you migrate 1TB of data values you use in different posts (5 - 70) are likely way to low to ensure smooth process.
Try to use value which won't require further repartitioning.
Know your data.
Analyze the columns available in the the dataset to determine if there any columns with high cardinality and uniform distribution to be distributed among desired number of partitions. These are good candidates for an import process. Additionally you should determine an exact range of values.
Aggregations with different centrality and skewness measure as well as histograms and basic counts-by-key are good exploration tools. For this part it is better to analyze data directly in the database, instead of fetching it to Spark.
Depending on the RDBMS you might be able to use width_bucket (PostgreSQL, Oracle) or equivalent function to get a decent idea how data will be distributed in Spark after loading with partitionColumn, lowerBound, upperBound, numPartitons.
s"""(SELECT width_bucket($partitionColum, $lowerBound, $upperBound, $numPartitons) AS bucket, COUNT(*)
FROM t
GROUP BY bucket) as tmp)"""
If there are no columns which satisfy above criteria consider:
Creating a custom one and exposing it via. a view. Hashes over multiple independent columns are usually good candidates. Please consult your database manual to determine functions that can be used here (DBMS_CRYPTO in Oracle, pgcrypto in PostgreSQL)*.
Using a set of independent columns which taken together provide high enough cardinality.
Optionally, if you're going to write to a partitioned Hive table, you should consider including Hive partitioning columns. It might limit the number of files generated later.
Prepare partitioning arguments
If column selected or created in the previous steps is numeric (or date / timestamp in Spark >= 2.4) provide it directly as the partitionColumn and use range values determined before to fill lowerBound and upperBound.
If bound values don't reflect the properties of data (min(col) for lowerBound, max(col) for upperBound) it can result in a significant data skew so thread carefully. In the worst case scenario, when bounds don't cover the range of data, all records will be fetched by a single machine, making it no better than no partitioning at all.
If column selected in the previous steps is categorical or is a set of columns generate a list of mutually exclusive predicates that fully cover the data, in a form that can be used in a SQL where clause.
For example if you have a column A with values {a1, a2, a3} and column B with values {b1, b2, b3}:
val predicates = for {
a <- Seq("a1", "a2", "a3")
b <- Seq("b1", "b2", "b3")
} yield s"A = $a AND B = $b"
Double check that conditions don't overlap and all combinations are covered. If these conditions are not satisfied you end up with duplicates or missing records respectively.
Pass data as predicates argument to jdbc call. Note that the number of partitions will be equal exactly to the number of predicates.
Put database in a read-only mode (any ongoing writes can cause data inconsistency. If possible you should lock database before you start the whole process, but if might be not possible, in your organization).
If the number of partitions matches the desired output load data without repartition and dump directly to the sink, if not you can try to repartition following the same rules as in the step 1.
If you still experience any problems make sure that you've properly configured Spark memory and GC options.
If none of the above works:
Consider dumping your data to a network / distributes storage using tools like COPY TO and read it directly from there.
Note that or standard database utilities you will typically need a POSIX compliant file system, so HDFS usually won't do.
The advantage of this approach is that you don't need to worry about the column properties, and there is no need for putting data in a read-only mode, to ensure consistency.
Using dedicated bulk transfer tools, like Apache Sqoop, and reshaping data afterwards.
* Don't use pseudocolumns - Pseudocolumn in Spark JDBC.
In my experience there are 4 kinds of memory settings which make a difference:
A) [1] Memory for storing data for processing reasons VS [2] Heap Space for holding the program stack
B) [1] Driver VS [2] executor memory
Up to now, I was always able to get my Spark jobs running successfully by increasing the appropriate kind of memory:
A2-B1 would therefor be the memory available on the driver to hold the program stack. Etc.
The property names are as follows:
A1-B1) executor-memory
A1-B2) driver-memory
A2-B1) spark.yarn.executor.memoryOverhead
A2-B2) spark.yarn.driver.memoryOverhead
Keep in mind that the sum of all *-B1 must be less than the available memory on your workers and the sum of all *-B2 must be less than the memory on your driver node.
My bet would be, that the culprit is one of the boldly marked heap settings.
There was an another question of yours routed here as duplicate
'How to avoid data skewing while reading huge datasets or tables into spark?
The data is not being partitioned properly. One partition is smaller while the
other one becomes huge on read.
I observed that one of the partition has nearly 2million rows and
while inserting there is a skew in partition. '
if the problem is to deal with data that is partitioned in a dataframe after read, Have you played around increasing the "numPartitions" value ?
.option("numPartitions",50)
lowerBound, upperBound form partition strides for generated WHERE clause expressions and numpartitions determines the number of split.
say for example, sometable has column - ID (we choose that as partitionColumn) ; value range we see in table for column-ID is from 1 to 1000 and we want to get all the records by running select * from sometable,
so we going with lowerbound = 1 & upperbound = 1000 and numpartition = 4
this will produce a dataframe of 4 partition with result of each Query by building sql based on our feed (lowerbound = 1 & upperbound = 1000 and numpartition = 4)
select * from sometable where ID < 250
select * from sometable where ID >= 250 and ID < 500
select * from sometable where ID >= 500 and ID < 750
select * from sometable where ID >= 750
what if most of the records in our table fall within the range of ID(500,750). that's the situation you are in to.
when we increase numpartition , the split happens even further and that reduce the volume of records in the same partition but this
is not a fine shot.
Instead of spark splitting the partitioncolumn based on boundaries we provide, if you think of feeding the split by yourself so, data can be evenly
splitted. you need to switch over to another JDBC method where instead of (lowerbound,upperbound & numpartition) we can provide
predicates directly.
def jdbc(url: String, table: String, predicates: Array[String], connectionProperties: Properties): DataFrame
Link

Spark bucketing read performance

Spark version - 2.2.1.
I've created a bucketed table with 64 buckets, I'm executing an aggregation function select t1.ifa,count(*) from $tblName t1 where t1.date_ = '2018-01-01' group by ifa . I can see that 64 tasks in Spark UI, which utilize just 4 executors (each executor has 16 cores) out of 20. Is there a way I can scale out the number of tasks or that's how bucketed queries should run (number of running cores as the number of buckets)?
Here's the create table:
sql("""CREATE TABLE level_1 (
bundle string,
date_ date,
hour SMALLINT)
USING ORC
PARTITIONED BY (date_ , hour )
CLUSTERED BY (ifa)
SORTED BY (ifa)
INTO 64 BUCKETS
LOCATION 'XXX'""")
Here's the query:
sql(s"select t1.ifa,count(*) from $tblName t1 where t1.date_ = '2018-01-01' group by ifa").show
With bucketing, the number of tasks == number of buckets, so you should be aware of the number of cores/tasks that you need/want to use and then set it as the buckets number.
num of task = num of buckets is probably the most important and under-discussed aspect of bucketing in Spark. Buckets (by default) are historically solely useful for creating "pre-shuffled" dataframes which can optimize large joins. When you read a bucketed table all of the file or files for each bucket are read by a single spark executor (30 buckets = 30 spark tasks when reading the data) which would allow the table to be joined to another table bucketed on the same # of columns. I find this behavior annoying and like the user above mentioned problematic for tables that may grow.
You might be asking yourself now, why and when in the would I ever want to bucket and when will my real-world data grow exactly in the same way over time? (you probably partitioned your big data by date, be honest) In my experience you probably don't have a great use case to bucket tables in the default spark way. BUT ALL IS NOT LOST FOR BUCKETING!
Enter "bucket-pruning". Bucket pruning only works when you bucket ONE column but is potentially your greatest friend in Spark since the advent of SparkSQL and Dataframes. It allows Spark to determine which files in your table contain specific values based on some filter in your query, which can MASSIVELY reduce the number of files spark physically reads, resulting in hugely efficient and fast queries. (I've taken 2+hr queries down to 2 minutes and 1/100th of the Spark workers). But you probably don't care because of the # of buckets to tasks issue means your table will never "scale-up" if you have too many files per bucket, per partition.
Enter Spark 3.2.0. There is a new feature coming that will allow bucket pruning to stay active when you disable bucket-based reading, allowing you to distribute the spark reads with bucket-pruning/scan. I also have a trick for doing this with spark < 3.2 as follows.
(note the leaf-scan for files with vanilla spark.read on s3 is added overhead but if your table is big it doesn't matter, bc your bucket optimized table will be a distributed read across all your available spark workers and will now be scalable)
val table = "ex_db.ex_tbl"
val target_partition = "2021-01-01"
val bucket_target = "valuex"
val bucket_col = "bucket_col"
val partition_col = "date"
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.execution.FileSourceScanExec
import org.apache.spark.sql.execution.datasources.{FileScanRDD,FilePartition}
val df = spark.table(tablename).where((col(partition_col)===lit(target_partition)) && (col(bucket_col)===lit(bucket_target)))
val sparkplan = df.queryExecution.executedPlan
val scan = sparkplan.collectFirst { case exec: FileSourceScanExec => exec }.get
val rdd = scan.inputRDDs.head.asInstanceOf[FileScanRDD]
val bucket_files = for
{ FilePartition(bucketId, files) <- rdd.filePartitions f <- files }
yield s"$f".replaceAll("path: ", "").split(",")(0)
val format = bucket_files(0).split("
.").last
val result_df = spark.read.option("mergeSchema", "False").format(format).load(bucket_files:_*).where(col(bucket_col) === lit(bucket_target))

Data frame re partition is not happening as expected

I am running below code in spark to create table temp1 with number of parttion 200 . But while i am checking the actual number of partition by creating an rdd out of temp1 table its coming to be more than 200.
How is this possible , am i missing any thing .It would be really helpful if any one can tell me ,if i am missing any thing !! Thanks
val TransDataFrame = hiveContext.sql(
s""" SELECT *
FROM uacc.TRANS
WHERE PROD_SURRO_ID != 0
AND MONTH_ID >= 201401
AND MONTH_ID <= 201403
AND CRE_DT <= '2016-11-13'
""").repartition(200,$"NDC").registerTempTable("temp")
hiveContext.sql(
s"""
CREATE TABLE uacc.temp1
AS SELECT * FROM temp
""")
val df = hiveContext.sql("SELECT * FROM uacc.temp1")
df.rdd.getNumPartitions
1224
As you create the table uacc.temp1 you actually write your dataframe to hdfs, now as you load that table again, the number of partitions is controlled by the number of hdfs files (more specific: file splits), see How does partitioning work for data from files on HDFS?

How to stop load the whole table in spark?

The thing is, I have read right to one table,which is partition by year month and day.But I don't have right read the data from 2016/04/24.
when I execute in Hive command:
hive>select * from table where year="2016" and month="06" and day="01";
I CAN READ OTHER DAYS' DATA EXCEPT 2016/04/24
But,when I read in spark
sqlContext.sql.sql(select * from table where year="2016" and month="06" and day="01")
exceptition is throwable That I dont have the right to hdfs/.../2016/04/24
THIS SHOW SPARK SQL LOAD THE WHOLE TABLE ONCE AND THEN FILTER?
HOW CAN I AVOID LOAD THE WHOLE TABLE?
You can use JdbcRDDs directly. With it you can bypass spark sql engine therefore your queries will be directly sent to hive.
To use JdbcRDD you need to create hive driver and register it first (of course it is not registered already).
val driver = "org.apache.hive.jdbc.HiveDriver"
Class.forName(driver)
Then you can create a JdbcRDD;
val connUrl = "jdbc:hive2://..."
val query = """select * from table where year="2016" and month="06" and day="01" and ? = ?"""
val lowerBound = 0
val upperBound = 0
val numOfPartitions = 1
new JdbcRDD(
sc,
() => DriverManager.getConnection(connUrl),
query,
lowerBound,
upperBound,
numOfPartitions,
(r: ResultSet) => (r.getString(1) /** get data here or with a function**/)
)
JdbcRDD query must have two ? in order to create partition your data. So you should write a better query than me. This just creates one partition to demonstrate how it works.
However, before doing this I recommend you to check HiveContext. This supports HiveQL as well. Check this.

Resources