I am trying to build for each of my users a vector containing the average number of records per hour of day. Hence the vector has to have 24 dimensions.
My original DataFrame has userID and hour columns, andI am starting by doing a groupBy and counting the number of record per user per hour as follow:
val hourFreqDF = df.groupBy("userID", "hour").agg(count("*") as "hfreq")
Now, in order to generate a vector per user I am doing the follow, based on the first suggestion in this answer.
val hours = (0 to 23 map { n => s"$n" } toArray)
val assembler = new VectorAssembler()
.setInputCols(hours)
.setOutputCol("hourlyConnections")
val exprs = hours.map(c => avg(when($"hour" === c, $"hfreq").otherwise(lit(0))).alias(c))
val transformed = assembler.transform(hourFreqDF.groupBy($"userID")
.agg(exprs.head, exprs.tail: _*))
When I run this example, I get the following warning:
Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
I presume this is because the expression is too long?
My question is: can I safely ignore this warning?
You can safely ignore it, if you are not interested in seeing the sql schema logs. Otherwise, you might want to set the property to a higher value, but it might affect the performance of your job:
spark.debug.maxToStringFields=100
Default value is: DEFAULT_MAX_TO_STRING_FIELDS = 25
The performance overhead of creating and logging strings
for wide schemas can be large. To limit the impact, we bound the
number of fields to include by default. This can be overridden by
setting the 'spark.debug.maxToStringFields' conf in SparkEnv.
Taken from: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L90
This config, along many others, has been moved to: SQLConf - sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
This can be set either in the config file or via command line in spark, using:
spark.conf.set("spark.sql.debug.maxToStringFields", 1000)
Related
First of all, I have to say that I've already tried everything I know or found on google (Including this Spark: How to use crossJoin which is exactly my problem).
I have to calculate the Cartesian product between two DataFrame - countries and units such that -
A.cache().count()
val units = A.groupBy("country")
.agg(sum("grade").as("grade"),
sum("point").as("point"))
.withColumn("AVR", $"grade" / $"point" * 1000)
.drop("point", "grade")
val countries = D.select("country").distinct()
val C = countries.crossJoin(units)
countries contains a countries name and its size bounded by 150. units is DataFrame with 3 rows - an aggregated result of other DataFrame. I checked 100 times the result and those are the sizes indeed - and it takes 5 hours to complete.
I know I missed something. I've tried caching, repartitioning, etc.
I would love to get some other ideas.
I have two suggestions for you:
Look at the explain plan and the spark properties, for the amount of data you have mentioned 5 hours is a really long time. My expectation is you have way too many shuffles, you can look at different properties like : spark.sql.shuffle.partitions
Instead of doing a cross join, you can maybe do a collect and explore broadcasts
https://sparkbyexamples.com/spark/spark-broadcast-variables/ but do this only on small amounts of data as this data is brought back to the driver.
What is the action you are doing afterwards with C?
Also, if these datasets are so small, consider collecting them to the driver, and doing these manupulation there, you can always spark.createDataFrame later again.
Update #1:
final case class Unit(country: String, AVR: Double)
val collectedUnits: Seq[Unit] = units.as[Unit].collect
val collectedCountries: Seq[String] = countries.collect
val pairs: Seq[(String, Unit)] = for {
unit <- units
country <- countries
} yield (country, unit)
I've finally understood the problem - Spark used too many excessive numbers of partitions, and thus the shuffle takes a lot of time.
The way to solve it is to change the default number -
sparkSession.conf.set("spark.sql.shuffle.partitions", 10)
And it works like magic.
When using Dataset.groupByKey(_.key).mapGroups or Dataset.groupByKey(_.key).cogroup in Spark, I've run into a problem when one of the groupings results in more than 2GB of data.
I need to normalize the data by group before I can start to reduce it, and I would like to split up the groups into smaller subgroups so they distribute better. For example, here's one way I've attempted to split the groups:
val groupedInputs = inputData.groupByKey(_.key).mapGroups {
case(key, inputSeries) => inputSeries.grouped(maxGroupSize).map(group => (key, group))
}
But unfortunately however I try to work around it, my jobs always die with an error like this: java.lang.UnsupportedOperationException: Cannot grow BufferHolder by size 23816 because the size after growing exceeds size limitation 2147483632. When using Kryo serialization I get a different Kryo serialization failed: Buffer overflow error recommending I increase spark.kryoserializer.buffer.max, but I've already increased it to the 2GB limit.
One solution that occurs to me is to add a random value to the keys before grouping them. This isn't ideal since it'll split up every group (not just the large ones), but I'm willing to sacrifice "ideal" for the sake of "working". That code would look something like this:
val splitInputs = inputData.map( record => (record, ThreadLocalRandom.current.nextInt(splitFactor)))
val groupedInputs = splitInputs.groupByKey{ case(record, split) => (record.key, split)).mapGroups {
case((key, _), inputSeries) => inputSeries.grouped(maxGroupSize).map(group => (key, group.map(_._1)))
}
Add a salt key and do groupBy on your key and the salt key and later
import scala.util.Random
val start = 1
val end = 5
val randUdf = udf({() => start + Random.nextInt((end - start) + 1)})
val saltGroupBy=skewDF.withColumn("salt_key", randUdf())
.groupBy(col("name"), col("salt_key"))
So your all the skew data doesn't go into one executor and cause the 2GB Limit.
But you have to develop a logic to aggregate the above result and finally remove the salt key at the end.
When you use groupBy all the records with the same key will reach one executor and bottle neck occur.
The above is one of the method to mitigate it.
For this case, where the dataset had a lot of skew and it was important to group the records into regularly-sized groups, I decided to process the dataset in two passes. First I used a window function to number the rows by key, and converted that to a "group index," based on a configurable "maxGroupSize":
// The "orderBy" doesn't seem necessary here,
// but the row_number function requires it.
val partitionByKey = Window.partitionBy(key).orderBy(key)
val indexedData = inputData.withColumn("groupIndex",
(row_number.over(partitionByKey) / maxGroupSize).cast(IntegerType))
.as[(Record, Int)]
Then I can group by key and index, and produce groups that are consistently sized--the keys with a lot of records get split up more, and the keys with few records may not be split up at all.
indexedData.groupByKey{ case (record, groupIndex) => (record.key, groupIndex) }
.mapGroups{ case((key, _), recordGroup) =>
// Remove the index values before returning the groups
(key, recordGroup.map(_._1))
}
I've 2 dataframes and I want to find the records with all columns equal except 2 (surrogate_key,current)
And then I want to save those records with new surrogate_key value.
Following is my code :
val seq = csvDataFrame.columns.toSeq
var exceptDF = csvDataFrame.except(csvDataFrame.as('a).join(table.as('b),seq).drop("surrogate_key","current"))
exceptDF.show()
exceptDF = exceptDF.withColumn("surrogate_key", makeSurrogate(csvDataFrame("name"), lit("ecc")))
exceptDF = exceptDF.withColumn("current", lit("Y"))
exceptDF.show()
exceptDF.write.option("driver","org.postgresql.Driver").mode(SaveMode.Append).jdbc(postgreSQLProp.getProperty("url"), tableName, postgreSQLProp)
This code gives correct results, but get stuck while writing those results to postgre.
Not sure what's the issue. Also is there any better approach for this??
Regards,
Sorabh
By Default spark-sql creates 200 partitions, which means when you are trying to save the datafrmae it will be saved in 200 parquet files. you can reduce the number of partitions for Dataframe using below techniques.
At application level. Set the parameter "spark.sql.shuffle.partitions" as follows :
sqlContext.setConf("spark.sql.shuffle.partitions", "10")
Reduce the number of partition for a particular DataFrame as follows :
df.coalesce(10).write.save(...)
Using the var for dataframe are not suggested, You should always use val and create a new Dataframe after performing some transformation in dataframe.
Please remove all the var and replace with val.
Hope this helps!
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.spark.rdd.PairRDDFunctions
def bulkWriteToHBase(sparkSession: SparkSession, sparkContext: SparkContext, jobContext: Map[String, String], sinkTableName: String, outRDD: RDD[(ImmutableBytesWritable, Put)]): Unit = {
val hConf = HBaseConfiguration.create()
hConf.set("hbase.zookeeper.quorum", jobContext("hbase.zookeeper.quorum"))
hConf.set("zookeeper.znode.parent", jobContext("zookeeper.znode.parent"))
hConf.set(TableInputFormat.INPUT_TABLE, sinkTableName)
val hJob = Job.getInstance(hConf)
hJob.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, sinkTableName)
hJob.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
outRDD.saveAsNewAPIHadoopDataset(hJob.getConfiguration())
}
what I have found by using this hbase bulk insertion is that, every time spark will only write into one single region server from hbase, which becomes the bottleneck.
however when I use almost the same approach but reading from hbase, it is using multiple executors to do parallel reading .
def bulkReadFromHBase(sparkSession: SparkSession, sparkContext: SparkContext, jobContext: Map[String, String], sourceTableName: String) = {
val hConf = HBaseConfiguration.create()
hConf.set("hbase.zookeeper.quorum", jobContext("hbase.zookeeper.quorum"))
hConf.set("zookeeper.znode.parent", jobContext("zookeeper.znode.parent"))
hConf.set(TableInputFormat.INPUT_TABLE, sourceTableName)
val inputRDD = sparkContext.newAPIHadoopRDD(hConf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
inputRDD
}
can anyone please explain why this could happen? or maybe I have
used the wrong way for spark-hbase bulk I/O ?
Question : I have used the wrong way for spark-hbase bulk I/O ?
No your way is right, although, you need to pre-split regions before hand & create table with presplit regions.
for example create 'test_table', 'f1', SPLITS=> ['1', '2', '3', '4', '5', '6', '7', '8', '9']
Above table occupies 9 regions..
design good rowkey with will starts with 1-9
you can use guava murmur hash like below.
import com.google.common.hash.HashCode;
import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;
/**
* getMurmurHash.
*
* #param content
* #return HashCode
*/
public static HashCode getMurmurHash(String content) {
final HashFunction hf = Hashing.murmur3_128();
final HashCode hc = hf.newHasher().putString(content, Charsets.UTF_8).hash();
return hc;
}
final long hash = getMurmur128Hash(Bytes.toString(yourrowkey as string)).asLong();
final int prefix = Math.abs((int) hash % 9);
now append this prefix to your rowkey
For example
1rowkey1 // will go in to first region
2rowkey2 // will go in to
second region
3rowkey3 // will go in to third region
...
9rowkey9 //
will go in to ninth region
If you are doing pre-splitting, and want to manually manage region splits, you can also disable region splits, by setting hbase.hregion.max.filesize to a high number and setting the split policy to ConstantSizeRegionSplitPolicy. However, you should use a safeguard value of like 100GB, so that regions does not grow beyond a region server’s capabilities. You can consider disabling automated splitting and rely on the initial set of regions from pre-splitting for example, if you are using uniform hashes for your key prefixes, and you can ensure that the read/write load to each region as well as its size is uniform across the regions in the table
1) please ensure that you can presplit the table before loading data in to hbase table 2) Design good rowkey as Explained below using murmurhash or some other hashing technique. to ensure uniform distribution across the regions.
Also look at http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/
Question : can anyone please explain why this could happen?
reason is quite obvious and simple HOT SPOTTING of data in to one specific reason becuase of poor rowkey for that table...
Consider a hashmap in java which has elements with hashcode 1234. then it will fill all the elements in one bucket isntit ? If hashmap elements are distributed across different good hashcode then it will put elements in different buckets. same is the case with hbase. here your hashcode is just like your rowkey...
Further more,
What happens if I already have a table and I want to split the regions
across...
The RegionSplitter class provides several utilities to help in the administration lifecycle for developers who choose to manually split regions instead of having HBase handle that automatically.
The most useful utilities are:
Create a table with a specified number of pre-split regions
Execute a rolling split of all regions on an existing table
Example :
$ hbase org.apache.hadoop.hbase.util.RegionSplitter test_table HexStringSplit -c 10 -f f1
where -c 10, specifies the requested number of regions as 10, and -f specifies the column families you want in the table, separated by “:”. The tool will create a table named “test_table” with 10 regions:
13/01/18 18:49:32 DEBUG hbase.HRegionInfo: Current INFO from scan results = {NAME => 'test_table,,1358563771069.acc1ad1b7962564fc3a43e5907e8db33.', STARTKEY => '', ENDKEY => '19999999', ENCODED => acc1ad1b7962564fc3a43e5907e8db33,}
13/01/18 18:49:32 DEBUG hbase.HRegionInfo: Current INFO from scan results = {NAME => 'test_table,19999999,1358563771096.37ec12df6bd0078f5573565af415c91b.', STARTKEY => '19999999', ENDKEY => '33333332', ENCODED => 37ec12df6bd0078f5573565af415c91b,}
...
as discussed in comment, you found that my final RDD right before writing into hbase only has 1 partition! which indicates that there
was only one executor holding the entire data... I am still trying to
find out why.
Also, Check
spark.default.parallelism defaults to the number of all cores on all
machines. The parallelize api has no parent RDD to determine the
number of partitions, so it uses the spark.default.parallelism.
So You can increase partitions by repartitioning.
NOTE : I observed that, In Mapreduce The number of partitions of the regions/input split = number of mappers launched.. Similarly in your case it may be the same situation where data loaded in to one particular region thats why one executor lauched. please verify that as well
Though you have not provided example data or enough explanation,this is mostly not due to your code or configuration.
It is happening so,due to non-optimal rowkey design.
The data you are writing is having keys(hbase rowkey) improperly structured(maybe monotonically increasing or something else).So, write to one of the regions is happening.You can prevent that thro' various ways(various recommended practices for rowkey design like salting,inverting,and other techniques).
For reference you can see http://hbase.apache.org/book.html#rowkey.design
In case,if you are wondering whether the write is done in parallel for all regions or one by one(not clear from question) look at this :
http://hbase.apache.org/book.html#_bulk_load.
I do some experimentation on a MacBook (i5, 2.6GHz, 8GB ram) with Zeppelin NB and Spark in standalone mode. spark.executor/driver.memory both get 2g. I have also set spark.serializer org.apache.spark.serializer.KryoSerializer in spark-defaults.conf, but that seems to be ignored by zeppelin
ALS model
I have trained a ALS model with ~400k (implicit) ratings and want to get recommendations with val allRecommendations = model.recommendProductsForUsers(1)
Sample set
Next I take a sample to play around with
val sampledRecommendations = allRecommendations.sample(false, 0.05, 1234567).cache
This contains 3600 recommendations.
Remove product recommendations that users own
Next I want to remove all ratings for products that a given user already owns, the list I hold in a RDD of the form (user_id, Set[product_ids]): RDD[(Long, scala.collection.mutable.HashSet[Int])]
val productRecommendations = (sampledRecommendations
// add user portfolio to the list, but convert the key from Long to Int first
.join(usersProductsFlat.map( up => (up._1.toInt, up._2) ))
.mapValues(
// (user, (ratings: Array[Rating], usersOwnedProducts: HashSet[Long]))
r => (r._1
.filter( rating => !r._2.contains(rating.product))
.filter( rating => rating.rating > 0.5)
.toList
)
)
// In case there is no recommendation (left), remove the entry
.filter(rating => !rating._2.isEmpty)
).cache
Question 1
Calling this (productRecommendations.count) on the cached sample set generates a stage that includes flatMap at MatrixFactorizationModel.scala:278 with 10,000 tasks, 263.6 MB of input data and 196.0 MB shuffle write. Shouldn't the tiny and cached RDD be used instead and what is going (wr)on(g) here? The execution of the count takes almost 5 minutes!
Question 2
Calling usersProductsFlat.count which is fully cached according to the "Storage" view in the application UI takes ~60 seconds each time. It's 23Mb in size – shouldn't that be a lot faster?
Map to readable form
Next I bring this in some readable form replacing IDs with names from a broadcasted lookup Map to put into a DF/table:
val readableRatings = (productRecommendations
.flatMapValues(x=>x)
.map( r => (r._1, userIdToMailBC.value(r._1), r._2.product.toInt, productIdToNameBC.value(r._2.product), r._2.rating))
).cache
val readableRatingsDF = readableRatings.toDF("user","email", "product_id", "product", "rating").cache
readableRatingsDF.registerTempTable("recommendations")
Select … with patience
The insane part starts here. Doing a SELECT takes several hours (I could never wait for one to finish):
%sql
SELECT COUNT(user) AS usr_cnt, product, AVG(rating) AS avg_rating
FROM recommendations
GROUP BY product
I don't know where to look to find the bottlenecks here, there is obviously some huge kerfuffle going on here! Where can I start looking?
Your number of partitions may be too large. I think you should use about 200 when running in local mode rather than 10000. You can set the number of partitions in different ways. I suggest you edit the spark.default.parallelism flag in the Spark configuration file.