run reduceByKey on huge data in spark - apache-spark

I'm running reduceByKey in spark. My program is the simplest example of spark:
val counts = textFile.flatMap(line => line.split(" ")).repartition(20000).
.map(word => (word, 1))
.reduceByKey(_ + _, 10000)
counts.saveAsTextFile("hdfs://...")
but it always run out of memory...
I 'm using 50 servers , 35 executors per server, 140GB memory per server.
the documents volume is :
8TB documents, 20 billion documents, 1000 billion words in total.
and the words after reduce will be about 100 million.
I wonder how to set the configuration of spark?
I wonder what value should these parameters be?
1. the number of the maps ? 20000 for example?
2. the number of the reduces ? 10000 for example?
3. others parameters?

It would be helpful if you posted the logs, but one option would be to specify a larger number of partitions when reading in the initial text file (e.g. sc.textFile(path, 200000)) rather than re-partitioning after reading . Another important thing is to make sure that your input file is splittable (some compression options make it not splittable, and in that case Spark may have to read it on a single machine causing OOMs).
Some other options are, since you aren't caching any of the data, would be reducing the amount of memory Spark is setting aside for caching (controlled with with spark.storage.memoryFraction), also since you are only working with tuples of strings I'd recommend using the org.apache.spark.serializer.
KryoSerializer serializer.

Did you try to use a partionner, it can help to reduce the number of key per node, if we suppose that keys word weights in average 1ko, it implies 100Go of memory exclusively for keys per node. With partitioning you can approximatively reduce number of key per node by the number of node, reducing accordingly the necessary amount of memery per node.
The spark.storage.memoryFraction option mentioned by #Holden is also an key factor.

Related

What exactly is a partition size in cassandra?

I am new to Cassandra, I have a cassandra cluster with 6 nodes. I am trying to find the partition size,
Tried to fetch it with this basic command
nodetool tablehistograms keyspace.tablename
Now, I am wondering how is it calculated and why the result has only 5 records other than min, max, while the number of nodes are 6. Does node size and number of partitions for a table has any relation?
Fundamentally, what I know is partition key is used to hash and distribute data to be persisted across various nodes
When exactly should we go for bucketing? I am assuming that Cassandra has got a partitioner that take care of distributed persistence across nodes.
The number of entries in this column is not related to the number of nodes. It shows the distribution of the values - you have min, max, and percentiles (50/75/95/98/99).
Most of the nodetool commands doesn't show anything about other nodes - they are tools for providing information about current node only.
P.S. This document would be useful in explaining how to interpret this information.
As the name of the command suggests, tablehistograms reports the distribution of metadata for the partitions held by a node.
To add to what Alex Ott has already stated, the percentiles (not percentages) provide an insight on the range of metadata values. For example:
50% of the partitions for the given table have a size of 74KB or less
95% are 263KB or less
98% are 455KB or less
These metadata don't have any correlation with the number of partitions or the number of nodes in your cluster.
You are correct in that the partition key gets hashed and the resulting value determines where the partition (and its associated rows) get stored (distributed among nodes in the cluster). If you're interested, I've explained in a bit more detail with some examples in this post -- https://community.datastax.com/questions/5944/.
As far as bucketing is concerned, you would typically do that to reduce the number of rows in a partition and therefore reducing its size. The general recommendation is to keep your partition sizes less than 100MB for optimal performance but it's not a hard rule -- you can have larger partitions as long as you are aware of the tradeoffs.
In your case, the larges partition is only 455KB so size is not a concern. Cheers!

Performance of hazelcast using executors on imap with million entries

We are applying few predicates on imap containing just 100,000 objects to filter data. These predicates will change per user. While doing POC on my local machine (16 GB) with two nodes(each node shows 50000) and 100,000 records, I am getting output in 30 sec which is way more than querying the database directly.
Will increasing number of nodes reduce the time, I even tried with PagingPredicate but it takes around 20 sec for each page
IMap objectMap = hazelcastInstance.getMap("myMap");
MultiMap resultMap = hazelcastInstance.getMap("myResultMap");
/*Option 1 : passing hazelcast predicate for imap.values*/
objectMap.values(predicate).parallelStream().forEach(entry -> resultMap(userId, entry));
/*Option 2: applying java predicate to entrySet OR localkeyset*/
objectMap.entrySet.parallelstream().filter(predicate).forEach(entry -> resultMap(userId, entry));
More nodes will help, but the improvement is difficult to quantify. It could be large, it could be small.
Part of the work in the code sample involves applying a predicate across 100,000 entries. If there is no index, the scan stage checks 50,000 entries per node if there are 2 nodes. Double up to 4 nodes, each has 25,000 entries to scan and so the scan time will half.
The scan time is part of the query time, the overall result set also has to be formed from the partial results from each node. So doubling the number of nodes might nearly half the run time as a best case, or it might not be a significant improvement.
Perhaps the bigger question here is what are you trying to achieve ?
objectMap.values(predicate) in the code sample retrieves the result set to a central point, which then has parallelStream() applied to try to merge the results in parallel into a MultiMap. So this looks like more of an ETL than a query.
Use of executors as per the title, and something like objectMap.localKeySet(predicate) might allow this to be parallelised out better, as there would be no central point holding intermediate results.

Write spark dataframe to single parquet file

I am trying to do something very simple and I'm having some very stupid struggles. I think it must have to do with a fundamental misunderstanding of what spark is doing. I would greatly appreciate any help or explanation.
I have a very large (~3 TB, ~300MM rows, 25k partitions) table, saved as parquet in s3, and I would like to give someone a tiny sample of it as a single parquet file. Unfortunately, this is taking forever to finish and I don't understand why. I have tried the following:
tiny = spark.sql("SELECT * FROM db.big_table LIMIT 500")
tiny.coalesce(1).write.saveAsTable("db.tiny_table")
and then when that didn't work I tried this, which I thought should be the same, but I wasn't sure. (I added the print's in an effort to debug.)
tiny = spark.table("db.big_table").limit(500).coalesce(1)
print(tiny.count())
print(tiny.show(10))
tiny.write.saveAsTable("db.tiny_table")
When I watch the Yarn UI, both print statements and the write are using 25k mappers. The count took 3 mins, the show took 25 mins, and the write took ~40 mins, although it finally did write the single file table I was looking for.
It seems to me like the first line should take the top 500 rows and coalesce them to a single partition, and then the other lines should happen extremely fast (on a single mapper/reducer). Can anyone see what I'm doing wrong here? I've been told maybe I should use sample instead of limit but as I understand it limit should be much faster. Is that right?
Thanks in advance for any thoughts!
I’ll approach the print functions issue first, as it’s something fundamental to understanding spark. Then limit vs sample. Then repartition vs coalesce.
The reasons the print functions take so long in this manner is because coalesce is a lazy transformation. Most transformations in spark are lazy and do not get evaluated until an action gets called.
Actions are things that do stuff and (mostly) dont return a new dataframe as a result. Like count, show. They return a number, and some data, whereas coalesce returns a dataframe with 1 partition (sort of, see below).
What is happening is that you are rerunning the sql query and the coalesce call each time you call an action on the tiny dataframe. That’s why they are using the 25k mappers for each call.
To save time, add the .cache() method to the first line (for your print code anyway).
Then the data frame transformations are actually executed on your first line and the result persisted in memory on your spark nodes.
This won’t have any impact on the initial query time for the first line, but at least you’re not running that query 2 more times because the result has been cached, and the actions can then use that cached result.
To remove it from memory, use the .unpersist() method.
Now for the actual query youre trying to do...
It really depends on how your data is partitioned. As in, is it partitioned on specific fields etc...
You mentioned it in your question, but sample might the right way to go.
Why is this?
limit has to search for 500 of the first rows. Unless your data is partitioned by row number (or some sort of incrementing id) then the first 500 rows could be stored in any of the the 25k partitions.
So spark has to go search through all of them until it finds all the correct values. Not only that, it has to perform an additional step of sorting the data to have the correct order.
sample just grabs 500 random values. Much easier to do as there’s no order/sorting of the data involved and it doesn’t have to search through specific partitions for specific rows.
While limit can be faster, it also has its, erm, limits. I usually only use it for very small subsets like 10/20 rows.
Now for partitioning....
The problem I think with coalesce is it virtually changes the partitioning. Now I’m not sure about this, so pinch of salt.
According to the pyspark docs:
this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.
So your 500 rows will actually still sit across your 25k physical partitions that are considered by spark to be 1 virtual partition.
Causing a shuffle (usually bad) and persisting in spark memory with .repartition(1).cache() is possibly a good idea here. Because instead of having the 25k mappers looking at the physical partitions when you write, it should only result in 1 mapper looking at what is in spark memory. Then write becomes easy. You’re also dealing with a small subset, so any shuffling should (hopefully) be manageable.
Obviously this is usually bad practice, and doesn’t change the fact spark will probably want to run 25k mappers when it performs the original sql query. Hopefully sample takes care of that.
edit to clarify shuffling, repartition and coalesce
You have 2 datasets in 16 partitions on a 4 node cluster. You want to join them and write as a new dataset in 16 partitions.
Row 1 for data 1 might be on node 1, and row 1 for data 2 on node 4.
In order to join these rows together, spark has to physically move one, or both of them, then write to a new partition.
That’s a shuffle, physically moving data around a cluster.
It doesn’t matter that everything is partitioned by 16, what matters is where the data is sitting on he cluster.
data.repartition(4) will physically move data from each 4 sets of partitions per node into 1 partition per node.
Spark might move all 4 partitions from node 1 over to the 3 other nodes, in a new single partition on those nodes, and vice versa.
I wouldn’t think it’d do this, but it’s an extreme case that demonstrates the point.
A coalesce(4) call though, doesn’t move the data, it’s much more clever. Instead, it recognises “I already have 4 partitions per node & 4 nodes in total... I’m just going to call all 4 of those partitions per node a single partition and then I’ll have 4 total partitions!”
So it doesn’t need to move any data because it just combines existing partitions into a joined partition.
Try this, in my empirical experience repartition works better for this kind of problems:
tiny = spark.sql("SELECT * FROM db.big_table LIMIT 500")
tiny.repartition(1).write.saveAsTable("db.tiny_table")
Even better if you are interested in the parquet you don't need to save it as a table:
tiny = spark.sql("SELECT * FROM db.big_table LIMIT 500")
tiny.repartition(1).write.parquet(your_hdfs_path+"db.tiny_table")

How to deal with strongly varying data sizes in spark

I'm wondering about the best practice in designing spark-jobs where the volume of data is not known in advance (or is strongly varying). In my case, the application should both handle initial loads and later on incremental data.
I wonder how I should set the number of partitions in my data (e.g. using repartition or setting parameters like spark.sql.shuffle.partitions in order to avoid OOM excpetion in the executors (giving fixed amount of allocated memory per executor). I could
define a very high number of partition to make sure that even on very high workloads, the job does not fail
Set number of partitions at runtime depending on the size of source-data
Introduce an iteration over independent chunks of data (i.e. looping)
In all option, I see issues:
1: I imagine this to be inefficient for small data sizes as taks get very small
2: Needs additional querys (e.g. count) and e.g. for setting spark.sql.shuffle.partitions, SparkContext needs to be restartet which I would like to avoid
3: Seems to contradict the spirit of Spark
So I wonder what the most efficient strategy is for strongly varying data volumes.
EDIT:
I was wrong about setting spark.sql.shuffle.partitions, this can be set at runtime woutout restarting spark context
Do not set a high number of partitions without knowing this is needed. You will absolutely kill the performance of your job.
Yes
As you said, don't loop!
As you mention, you introduce an extra step which is to count your data, which at first glance seems wrong. However, you shouldn't think of this as mis-spent computation. Usually, the time it takes to count your data is significantly less than the time it would take to do further processing if you partition the data badly. Think of the count operation as an investment, it's certainly worth it.
You do not need to set partitions through the config and restart Spark. Instead, do the following:
Note current number of partitions for RDD / Dataframe / Dataset
Count number of entries / rows in your data
Based on an estimate of average row size, compute the target number of partitions
If #targetPartitions << #actualPartitions Then coalesce
Else If #targetPartitions >> #actualPartitions Then repartition
Else #targetPartitions ~= #actualPartitions Then do nothing
The coalesce operation will re-partition your data without shuffling, and so is much more efficient when it is available.
Ideally you can estimate the number of rows your will generate, rather than count them. Also, you will need to think carefully about when it is appropriate to perform this operation. With a long RDD lineage you can kill performance, because you may inadvertently reduce the number of cores which can execute complex code, due to scala lazy execution. Look into checkpointing to mitigate this problem.

Cassandra batch size

I have 6 Cassandra nodes mainly used for writing (95%).
What's the best approach to inserting data - individual inserts or batches? reason says batches are to be used, while keeping the "batch size" under 5kb to avoid node instability:
https://issues.apache.org/jira/browse/CASSANDRA-6487
Do these 5kb concern the size of the queries, as in number of chars * bytes_per_char? are there any performance drawbacks to fully running individual inserts?
Batch will increase performance if used for a single partition. You are able to get more through put of data inserted.

Resources