I am trying to read csv file and then adding some columns . After that trying to save in orc format.
I could not understand how spark decided number of tasks for different stages.
Why number of task for CSV stage is 1 and for ORC stage it is 39?
val c1c8 = spark.read.option("header",true).csv("/user/DEEPAK_TEST/C1C6_NEW/")
val c1c8new = { c1c8.withColumnRenamed("c1c6_F","c1c8").withColumnRenamed("Network_Out","c1c8_network").withColumnRenamed("Access NE Out","c1c8_access_ne")
.withColumn("c1c8_signalling",when (col("signalling_Out") === "SIP Cl4" , "SIP CL4").when (col("signalling_Out") === "SIP cl4" , "SIP CL4").when (col("signalling_Out") === "Other" , "other").otherwise(col("signalling_Out")))
.withColumnRenamed("access type Out","c1c8_access_type").withColumnRenamed("Type_of_traffic_C","c1c8_typeoftraffic")
.withColumnRenamed("BOS traffic type Out","c1c8_bos_trafc_typ").withColumnRenamed("Scope_Out","c1c8_scope")
.withColumnRenamed("Join with UP-DWN SIP cl5 T1T7 Out","c1c8_join_indicator")
.select("c1c8","c1c8_network", "c1c8_access_ne", "c1c8_signalling", "c1c8_access_type", "c1c8_typeoftraffic",
"c1c8_bos_trafc_typ", "c1c8_scope","c1c8_join_indicator")
}
c1c8new.write.orc("/user/DEEPAK_TEST/C1C8_MAPPING_NEWT/")
Below is my understanding from looking at Spark 2.x source code.
Stage 0 is a file scan that creates FileScanRDD which is an RDD that scans a list of file partitions. This stage can have more than one task when you are reading from multiple partitioned directories, such as a partitioned Hive table.
The number of tasks in Stage 1 will be equals to the number of RDD partitions. In your case c1c8new.rdd.getNumPartitions will be 39. This number is calculated using:
config value spark.files.maxPartitionBytes (128MB by default)
sparkContext.defaultParallelism returned by task scheduler (equal to number of cores when running in local mode)
totalBytes
DataSourceScanExec.scala#L423
val defaultMaxSplitBytes =
fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
val bytesPerCore = totalBytes / defaultParallelism
val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
logInfo(s"Planning scan with bin packing, max size: $maxSplitBytes bytes, " +
s"open cost is considered as scanning $openCostInBytes bytes.")
You can see actual calculated values in the above log message if you set the log level to INFO - spark.sparkContext.setLogLevel("INFO")
In your case, I think the split size is 128 and so, number of tasks/partitions is roughly 4.6G/128MB
As a side note, you can change the number of partitions (and hence the number of tasks in the subsequent stage) by using repartition() or coalesce() on the dataframe. More importantly, the number of partitions after a shuffle is determined by spark.sql.shuffle.partitions (200 by default). If you have a shuffle, it is better to use this configuration to control the number of tasks because inserting repartition() or coalesce() between stages adds extra overhead.
For large spark SQL workloads, setting optimum values for spark.sql.shuffle.partitions in each stage was always a pain point. Spark 3.x has better support for this if Adaptive Query Execution is enabled, but I haven't tried it for any production workloads.
I have a large dataset in parquet format (~1TB in size) that is partitioned into 2 hierarchies: CLASS and DATE
There are only 7 classes. But the Date is ever increasing from 2020-01-01 onwards.
My data is partitioned by CLASS first and then DATE
So something like:
CLASS1---DATE 1
---DATE 2
--- .
--- .
--- .
---DATE N
CLASS2---DATE 1
---DATE 2
--- .
--- .
--- .
---DATE N
I load my data by CLASS in a for-loop. If I load the entire parquet file, YARN kills the job since it overloads the memory instances. But I load all the days since I am doing a percentile calculation in my modeling. This method takes about 23hrs to complete.
However, if I repartition such that I only have the CLASS partition, the job takes about 10hrs.
Does having too many sub-partitions slow down the spark executor jobs?
I keep the partition hierarchy as CLASS -> DATE only because I need to append new data by DATE every day.
If having only 1 partition is more efficient, then I would have to repartition to just the CLASS partition every day after loading new data.
Could someone explain why having a single partition works faster? And if so, what would be the best method to partition the data on a daily basis by appending and without repartitioning the entire dataset?
Thank You
EDIT:
I use the for loop on the file structure to loop by CLASS partition like so:
fs = s3fs.S3FileSystem(anon=False)
inpath="s3://bucket/file.parquet/"
Dirs= fs.ls(inpath)
for paths in Dirs:
customPath='s3://' + uvapath + '/'
class=uvapath.split('=')[1]
df=spark.read.parquet(customPath)
outpath="s3://bucket/Output_" + class + ".parquet"
#Perform calculations
df.write.mode('overwrite').parquet(outpath)
The loaded df will have all the dates for CLASS=1. I then output the file as separate parquet files for each CLASS such that I have 7 parquet files:
Output_1.parquet
Output_2.parquet
Output_3.parquet
Output_4.parquet
Output_5.parquet
Output_6.parquet
Output_7.parquet
I then merge the 7 parquets into a single parquet is not a problem as the resulting parquet files are much smaller.
I have the partitioned data with three columns, year, month, and id. The folder path hierarchy is
year=2020/month=08/id=1/*.parquet
year=2020/month=08/id=2/*.parquet
year=2020/month=08/id=3/*.parquet
...
year=2020/month=09/id=1/*.parquet
year=2020/month=09/id=2/*.parquet
year=2020/month=09/id=3/*.parquet
and I can read the DataFrame by loading the root path.
val df = spark.read.parquet("s3://mybucket/")
Then, the partitioned column is automatically added to the DataFrame. Now, then you can filter your data for the partitioned column in a way that
val df_filtered = df.filter("year = '2020' and month = '09'")
and do something with df_filtered then the spark will use only the partitioned data!
For your repeated processing, you can use the fair scheduler of the spark. Add the fair.xml file into src/main/resources of your project with the below code,
<?xml version="1.0"?>
<allocations>
<pool name="fair">
<schedulingMode>FAIR</schedulingMode>
<weight>10</weight>
<minShare>0</minShare>
</pool>
</allocations>
and set the spark configuration after creating the spark session.
spark.sparkContext.setLocalProperty("spark.scheduler.mode", "FAIR")
spark.sparkContext.setLocalProperty("spark.scheduler.allocation.file", getClass.getResource("/fair.xml").getPath)
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "fair")
Then you can do your job in parallel. You may want to parallelize the job depends on the CLASS, so
val classes = (1 to 7).par
val date = '2020-09-25'
classes foreach { case i =>
val df_filtered = df.filter(s"CLASS == '$i' and DATE = '$date'")
// Do your job
}
the code will work at the same time with different CLASS values.
I am new to Spark distributed development. I'm attempting to optimize my existing Spark job which takes up to 1 hour to complete.
Infrastructure:
EMR [10 instances of r4.8xlarge (32 cores, 244GB)]
Source Data: 1000 .gz files in S3 (~30MB each)
Spark Execution Parameters [Executors: 300, Executor Memory: 6gb, Cores: 1]
In general, the Spark job performs the following:
private def processLines(lines: RDD[String]): DataFrame = {
val updatedLines = lines.mapPartitions(row => ...)
spark.createDataFrame(updatedLines, schema)
}
// Read S3 files and repartition() and cache()
val lines: RDD[String] = spark.sparkContext
.textFile(pathToFiles, numFiles)
.repartition(2 * numFiles) // double the parallelism
.cache()
val numRawLines = lines.count()
// Custom process each line and cache table
val convertedLines: DataFrame = processLines(lines)
convertedRows.createOrReplaceTempView("temp_tbl")
spark.sqlContext.cacheTable("temp_tbl")
val numRows = spark.sql("select count(*) from temp_tbl").collect().head().getLong(0)
// Select a subset of the data
val myDataFrame = spark.sql("select a, b, c from temp_tbl where field = 'xxx' ")
// Define # of parquet files to write using coalesce
val numParquetFiles = numRows / 1000000
var lessParts = myDataFrame.rdd.coalesce(numParquetFiles)
var lessPartsDataFrame = spark.sqlContext.createDataFrame(lessParts, myDataFrame.schema)
lessPartsDataFrame.createOrReplaceTempView('my_view')
// Insert data from view into Hive parquet table
spark.sql("insert overwrite destination_tbl
select * from my_view")
lines.unpersist()
The app reads all S3 files => repartitions to twice the amount of files => caches the RDD => custom processes each line => creates a temp view/cache table => counts the num rows => selects a subset of the data => decrease the amount of partitions => creates a view of the subset of data => inserts to hive destination table using the view => unpersist the RDD.
I am not sure why it takes a long time to execute. Are the spark execution parameters incorrectly set or is there something being incorrectly invoked here?
Before looking at the metrics, I would try the following change to your code.
private def processLines(lines: DataFrame): DataFrame = {
lines.mapPartitions(row => ...)
}
val convertedLinesDf = spark.read.text(pathToFiles)
.filter("field = 'xxx'")
.cache()
val numLines = convertedLinesDf.count() //dataset get in memory here, it takes time
// Select a subset of the data, but it will be fast if you have enough memory
// Just use Dataframe API
val myDataFrame = convertedLinesDf.transform(processLines).select("a","b","c")
//coalesce here without converting to RDD, experiment what best
myDataFrame.coalesce(<desired_output_files_number>)
.write.option(SaveMode.Overwrite)
.saveAsTable("destination_tbl")
Caching is useless if you don't count the number of rows. And it will take some memory and add some GC pressure
Caching table may consume more memory and add more GC pressure
Converting Dataframe to RDD is costly as it implies ser/deser operations
Not sure what you trying to do with : val numParquetFiles = numRows / 1000000 and repartition(2 * numFiles). With your setup, 1000 files of 30MB each will give you 1000 partitions. It could be fine like this. Calling repartition and coalesce may trigger a shuffling operation which is costly. (Coalesce may not trigger a shuffle)
Tell me if you get any improvements !
I am trying to move data from a table in PostgreSQL table to a Hive table on HDFS. To do that, I came up with the following code:
val conf = new SparkConf().setAppName("Spark-JDBC").set("spark.executor.heartbeatInterval","120s").set("spark.network.timeout","12000s").set("spark.sql.inMemoryColumnarStorage.compressed", "true").set("spark.sql.orc.filterPushdown","true").set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").set("spark.kryoserializer.buffer.max","512m").set("spark.serializer", classOf[org.apache.spark.serializer.KryoSerializer].getName).set("spark.streaming.stopGracefullyOnShutdown","true").set("spark.yarn.driver.memoryOverhead","7168").set("spark.yarn.executor.memoryOverhead","7168").set("spark.sql.shuffle.partitions", "61").set("spark.default.parallelism", "60").set("spark.memory.storageFraction","0.5").set("spark.memory.fraction","0.6").set("spark.memory.offHeap.enabled","true").set("spark.memory.offHeap.size","16g").set("spark.dynamicAllocation.enabled", "false").set("spark.dynamicAllocation.enabled","true").set("spark.shuffle.service.enabled","true")
val spark = SparkSession.builder().config(conf).master("yarn").enableHiveSupport().config("hive.exec.dynamic.partition", "true").config("hive.exec.dynamic.partition.mode", "nonstrict").getOrCreate()
def prepareFinalDF(splitColumns:List[String], textList: ListBuffer[String], allColumns:String, dataMapper:Map[String, String], partition_columns:Array[String], spark:SparkSession): DataFrame = {
val colList = allColumns.split(",").toList
val (partCols, npartCols) = colList.partition(p => partition_columns.contains(p.takeWhile(x => x != ' ')))
val queryCols = npartCols.mkString(",") + ", 0 as " + flagCol + "," + partCols.reverse.mkString(",")
val execQuery = s"select ${allColumns}, 0 as ${flagCol} from schema.tablename where period_year='2017' and period_num='12'"
val yearDF = spark.read.format("jdbc").option("url", connectionUrl).option("dbtable", s"(${execQuery}) as year2017")
.option("user", devUserName).option("password", devPassword)
.option("partitionColumn","cast_id")
.option("lowerBound", 1).option("upperBound", 100000)
.option("numPartitions",70).load()
val totalCols:List[String] = splitColumns ++ textList
val cdt = new ChangeDataTypes(totalCols, dataMapper)
hiveDataTypes = cdt.gpDetails()
val fc = prepareHiveTableSchema(hiveDataTypes, partition_columns)
val allColsOrdered = yearDF.columns.diff(partition_columns) ++ partition_columns
val allCols = allColsOrdered.map(colname => org.apache.spark.sql.functions.col(colname))
val resultDF = yearDF.select(allCols:_*)
val stringColumns = resultDF.schema.fields.filter(x => x.dataType == StringType).map(s => s.name)
val finalDF = stringColumns.foldLeft(resultDF) {
(tempDF, colName) => tempDF.withColumn(colName, regexp_replace(regexp_replace(col(colName), "[\r\n]+", " "), "[\t]+"," "))
}
finalDF
}
val dataDF = prepareFinalDF(splitColumns, textList, allColumns, dataMapper, partition_columns, spark)
val dataDFPart = dataDF.repartition(30)
dataDFPart.createOrReplaceTempView("preparedDF")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("set hive.exec.dynamic.partition=true")
spark.sql(s"INSERT OVERWRITE TABLE schema.hivetable PARTITION(${prtn_String_columns}) select * from preparedDF")
The data is inserted into the hive table dynamically partitioned based on prtn_String_columns: source_system_name, period_year, period_num
Spark-submit used:
SPARK_MAJOR_VERSION=2 spark-submit --conf spark.ui.port=4090 --driver-class-path /home/fdlhdpetl/jars/postgresql-42.1.4.jar --jars /home/fdlhdpetl/jars/postgresql-42.1.4.jar --num-executors 80 --executor-cores 5 --executor-memory 50G --driver-memory 20G --driver-cores 3 --class com.partition.source.YearPartition splinter_2.11-0.1.jar --master=yarn --deploy-mode=cluster --keytab /home/fdlhdpetl/fdlhdpetl.keytab --principal fdlhdpetl#FDLDEV.COM --files /usr/hdp/current/spark2-client/conf/hive-site.xml,testconnection.properties --name Splinter --conf spark.executor.extraClassPath=/home/fdlhdpetl/jars/postgresql-42.1.4.jar
The following error messages are generated in the executor logs:
Container exited with a non-zero exit code 143.
Killed by external signal
18/10/03 15:37:24 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[SIGTERM handler,9,system]
java.lang.OutOfMemoryError: Java heap space
at java.util.zip.InflaterInputStream.<init>(InflaterInputStream.java:88)
at java.util.zip.ZipFile$ZipFileInflaterInputStream.<init>(ZipFile.java:393)
at java.util.zip.ZipFile.getInputStream(ZipFile.java:374)
at java.util.jar.JarFile.getManifestFromReference(JarFile.java:199)
at java.util.jar.JarFile.getManifest(JarFile.java:180)
at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:944)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:450)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.util.SignalUtils$ActionHandler.handle(SignalUtils.scala:99)
at sun.misc.Signal$1.run(Signal.java:212)
at java.lang.Thread.run(Thread.java:745)
I see in the logs that the read is being executed properly with the given number of partitions as below:
Scan JDBCRelation((select column_names from schema.tablename where period_year='2017' and period_num='12') as year2017) [numPartitions=50]
Below is the state of executors in stages:
The data is not being partitioned properly. One partition is smaller while the other one becomes huge. There is a skew problem here.
While inserting the data into Hive table the job fails at the line:spark.sql(s"INSERT OVERWRITE TABLE schema.hivetable PARTITION(${prtn_String_columns}) select * from preparedDF") but I understand this is happening because of the data skew problem.
I tried to increase number of executors, increasing the executor memory, driver memory, tried to just save as csv file instead of saving the dataframe into a Hive table but nothing affects the execution from giving the exception:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Is there anything in the code that I need to correct ? Could anyone let me know how can I fix this problem ?
Determine how many partitions you need given the amount of input data and your cluster resources. As a rule of thumb it is better to keep partition input under 1GB unless strictly necessary. and strictly smaller than the block size limit.
You've previously stated that you migrate 1TB of data values you use in different posts (5 - 70) are likely way to low to ensure smooth process.
Try to use value which won't require further repartitioning.
Know your data.
Analyze the columns available in the the dataset to determine if there any columns with high cardinality and uniform distribution to be distributed among desired number of partitions. These are good candidates for an import process. Additionally you should determine an exact range of values.
Aggregations with different centrality and skewness measure as well as histograms and basic counts-by-key are good exploration tools. For this part it is better to analyze data directly in the database, instead of fetching it to Spark.
Depending on the RDBMS you might be able to use width_bucket (PostgreSQL, Oracle) or equivalent function to get a decent idea how data will be distributed in Spark after loading with partitionColumn, lowerBound, upperBound, numPartitons.
s"""(SELECT width_bucket($partitionColum, $lowerBound, $upperBound, $numPartitons) AS bucket, COUNT(*)
FROM t
GROUP BY bucket) as tmp)"""
If there are no columns which satisfy above criteria consider:
Creating a custom one and exposing it via. a view. Hashes over multiple independent columns are usually good candidates. Please consult your database manual to determine functions that can be used here (DBMS_CRYPTO in Oracle, pgcrypto in PostgreSQL)*.
Using a set of independent columns which taken together provide high enough cardinality.
Optionally, if you're going to write to a partitioned Hive table, you should consider including Hive partitioning columns. It might limit the number of files generated later.
Prepare partitioning arguments
If column selected or created in the previous steps is numeric (or date / timestamp in Spark >= 2.4) provide it directly as the partitionColumn and use range values determined before to fill lowerBound and upperBound.
If bound values don't reflect the properties of data (min(col) for lowerBound, max(col) for upperBound) it can result in a significant data skew so thread carefully. In the worst case scenario, when bounds don't cover the range of data, all records will be fetched by a single machine, making it no better than no partitioning at all.
If column selected in the previous steps is categorical or is a set of columns generate a list of mutually exclusive predicates that fully cover the data, in a form that can be used in a SQL where clause.
For example if you have a column A with values {a1, a2, a3} and column B with values {b1, b2, b3}:
val predicates = for {
a <- Seq("a1", "a2", "a3")
b <- Seq("b1", "b2", "b3")
} yield s"A = $a AND B = $b"
Double check that conditions don't overlap and all combinations are covered. If these conditions are not satisfied you end up with duplicates or missing records respectively.
Pass data as predicates argument to jdbc call. Note that the number of partitions will be equal exactly to the number of predicates.
Put database in a read-only mode (any ongoing writes can cause data inconsistency. If possible you should lock database before you start the whole process, but if might be not possible, in your organization).
If the number of partitions matches the desired output load data without repartition and dump directly to the sink, if not you can try to repartition following the same rules as in the step 1.
If you still experience any problems make sure that you've properly configured Spark memory and GC options.
If none of the above works:
Consider dumping your data to a network / distributes storage using tools like COPY TO and read it directly from there.
Note that or standard database utilities you will typically need a POSIX compliant file system, so HDFS usually won't do.
The advantage of this approach is that you don't need to worry about the column properties, and there is no need for putting data in a read-only mode, to ensure consistency.
Using dedicated bulk transfer tools, like Apache Sqoop, and reshaping data afterwards.
* Don't use pseudocolumns - Pseudocolumn in Spark JDBC.
In my experience there are 4 kinds of memory settings which make a difference:
A) [1] Memory for storing data for processing reasons VS [2] Heap Space for holding the program stack
B) [1] Driver VS [2] executor memory
Up to now, I was always able to get my Spark jobs running successfully by increasing the appropriate kind of memory:
A2-B1 would therefor be the memory available on the driver to hold the program stack. Etc.
The property names are as follows:
A1-B1) executor-memory
A1-B2) driver-memory
A2-B1) spark.yarn.executor.memoryOverhead
A2-B2) spark.yarn.driver.memoryOverhead
Keep in mind that the sum of all *-B1 must be less than the available memory on your workers and the sum of all *-B2 must be less than the memory on your driver node.
My bet would be, that the culprit is one of the boldly marked heap settings.
There was an another question of yours routed here as duplicate
'How to avoid data skewing while reading huge datasets or tables into spark?
The data is not being partitioned properly. One partition is smaller while the
other one becomes huge on read.
I observed that one of the partition has nearly 2million rows and
while inserting there is a skew in partition. '
if the problem is to deal with data that is partitioned in a dataframe after read, Have you played around increasing the "numPartitions" value ?
.option("numPartitions",50)
lowerBound, upperBound form partition strides for generated WHERE clause expressions and numpartitions determines the number of split.
say for example, sometable has column - ID (we choose that as partitionColumn) ; value range we see in table for column-ID is from 1 to 1000 and we want to get all the records by running select * from sometable,
so we going with lowerbound = 1 & upperbound = 1000 and numpartition = 4
this will produce a dataframe of 4 partition with result of each Query by building sql based on our feed (lowerbound = 1 & upperbound = 1000 and numpartition = 4)
select * from sometable where ID < 250
select * from sometable where ID >= 250 and ID < 500
select * from sometable where ID >= 500 and ID < 750
select * from sometable where ID >= 750
what if most of the records in our table fall within the range of ID(500,750). that's the situation you are in to.
when we increase numpartition , the split happens even further and that reduce the volume of records in the same partition but this
is not a fine shot.
Instead of spark splitting the partitioncolumn based on boundaries we provide, if you think of feeding the split by yourself so, data can be evenly
splitted. you need to switch over to another JDBC method where instead of (lowerbound,upperbound & numpartition) we can provide
predicates directly.
def jdbc(url: String, table: String, predicates: Array[String], connectionProperties: Properties): DataFrame
Link
I have a use case where in i create rdd from a hive table. I wrote a business logic that operates on every row in the hive table. My assumption was that when i create rdd and span a map process on it, it then utilises all my spark executors. But, what i see in my log is only one node process the rdd while rest of my 5 nodes sitting idle. Here is my code
val flow = hiveContext.sql("select * from humsdb.t_flow")
var x = flow.rdd.map { row =>
< do some computation on each row>
}
Any clue where i go wrong?
As specify here by #jaceklaskowski
By default, a partition is created for each HDFS partition, which by
default is 64MB (from Spark’s Programming Guide).
If your input data is less than 64MB (and you are using HDFS) then by default only one partition will be created.
Spark will use all nodes when using big data
Could there be a possibility that your data is skewed?
To rule out this possibility, do the following and rerun the code.
val flow = hiveContext.sql("select * from humsdb.t_flow").repartition(200)
var x = flow.rdd.map { row =>
< do some computation on each row>
}
Further if in your map logic you are dependent on a particular column you can do below
val flow = hiveContext.sql("select * from humsdb.t_flow").repartition(col("yourColumnName"))
var x = flow.rdd.map { row =>
< do some computation on each row>
}
A good partition column could be date column