There is a configuration item (max-split-size) that sets the maximum size of a single split. In other words, I can change this value to change the number of splits.
I understand that more splits will use more CPU at the same time, so the search should become faster.
If so, why does Presto set the default value to 32 MB instead of something like 1 MB?
There's overhead to each split that is created, so you don't want them to be too small. Also, some file formats like ORC can't be split smaller than the size of an ORC stripe, which tends to be tens to hundreds of megabytes.
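To make the overhead argument concrete, here is a rough sketch in plain Python. The table size and the fixed per-split cost are hypothetical numbers chosen for illustration, not Presto measurements, and `split_count` is an illustrative helper rather than a Presto API.

```python
MB = 1024 * 1024

def split_count(total_bytes, max_split_bytes):
    """Number of splits needed to cover total_bytes (integer ceiling division)."""
    return -(-total_bytes // max_split_bytes)

def total_overhead_seconds(total_bytes, max_split_bytes, per_split_overhead_s):
    """Fixed scheduling cost summed over all splits."""
    return split_count(total_bytes, max_split_bytes) * per_split_overhead_s

table_bytes = 100 * 1024 * MB   # hypothetical 100 GiB table
overhead_s = 0.01               # hypothetical fixed cost per split, in seconds

for size_mb in (1, 32):
    n = split_count(table_bytes, size_mb * MB)
    cost = total_overhead_seconds(table_bytes, size_mb * MB, overhead_s)
    print(f"{size_mb} MB splits: {n} splits, {cost:.0f} s of fixed overhead")
```

Under these assumptions, 1 MB splits produce 32 times as many splits as 32 MB splits, and the fixed cost scales with that count regardless of how many CPUs are available.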
I've been reading a few questions on this topic, as well as several forums, and all of them seem to say that each .parquet file written by Spark should be either 64MB or 1GB in size. I still can't wrap my head around which scenarios call for each of those file sizes, or the reasons behind them, apart from HDFS splitting files into 64MB blocks.
My current testing scenario is the following.
dataset
  .coalesce(n) // 'n' is 4 or 48; reasons explained below
  .write
  .mode(SaveMode.Append)
  .partitionBy(CONSTANTS)
  .option("basepath", outputPath)
  .parquet(outputPath)
I'm currently handling a total of 2.5GB to 3GB of daily data, which will be split and saved into daily buckets per year. The reason for 'n' being 4 or 48 is purely for testing: since I know the size of my test set in advance, I try to get file sizes as close to 64MB or 1GB as I can. I haven't implemented code to buffer the data until I reach the exact size I need prior to saving.
So my question here is...
Should I take the size that much into account if I'm not planning to use HDFS and merely store and retrieve data from S3?
And also, which should be the optimal size for daily datasets of around 10GB maximum if I'm planning to use HDFS to store my resulting .parquet files?
Any other optimization tip would be really appreciated!
You can control the split size of parquet files, provided you save them with a splittable compression like snappy. For the s3a connector, just set fs.s3a.block.size to a different number of bytes.
Smaller split size
More workers can work on a file simultaneously. Speedup if you have idle workers.
More startup overhead: scheduling work, starting processing, committing tasks.
Creates more files from the output, unless you repartition.
Small files vs large files
Small files:
you get that small split whether or not you want it.
even if you use unsplittable compression.
takes longer to list files. Listing directory trees on S3 is very slow.
impossible to ask for larger block sizes than the file length.
easier to save if your S3 client doesn't do incremental writes in blocks. (Hadoop 2.8+ does if you set spark.hadoop.fs.s3a.fast.upload true.)
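The trade-off between block size and file size can be sketched with a little arithmetic. This is plain Python, not a connector API: `plan_splits` is a hypothetical helper that assumes a splittable format and cuts each file into block-sized ranges.

```python
MB = 1024 * 1024

def plan_splits(file_sizes, block_bytes):
    """Return (file_index, offset, length) ranges, one per split.

    A split never spans two files, so a file smaller than the block
    size still produces exactly one (small) split.
    """
    splits = []
    for index, size in enumerate(file_sizes):
        for offset in range(0, size, block_bytes):
            splits.append((index, offset, min(block_bytes, size - offset)))
    return splits

one_large = [1024 * MB]       # a single 1 GiB file
many_small = [4 * MB] * 16    # sixteen 4 MiB files

for block_mb in (64, 256):
    large = plan_splits(one_large, block_mb * MB)
    small = plan_splits(many_small, block_mb * MB)
    print(f"block={block_mb} MB: large file -> {len(large)} splits, "
          f"small files -> {len(small)} splits")
```

Raising the block size reduces the split count for the large file, while the small files stay at one split each whatever block size is requested.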
Personally (this is opinion, partly benchmark-driven, but not with your queries):
Writing
save to larger files.
with snappy.
shallower+wider directory trees over deep and narrow
Reading
play with different block sizes; treat 32-64 MB as a minimum
On Hadoop 3.1, use the zero-rename committers. Otherwise, switch to v2.
if your FS connector supports this, make sure random IO is turned on (Hadoop 2.8+: spark.hadoop.fs.s3a.experimental.fadvise random).
save to larger files via .repartition().
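For picking the argument to .coalesce() or .repartition(), a back-of-the-envelope calculation may help. This is a sketch in plain Python, not Spark API: `partitions_for_target` is a hypothetical helper, and it assumes the on-disk size is known in advance, whereas real Parquet output size depends on encoding and compression.

```python
MB = 1024 * 1024
GB = 1024 * MB

def partitions_for_target(total_bytes, target_file_bytes):
    """Partition count that yields files of roughly target_file_bytes each."""
    return max(1, -(-total_bytes // target_file_bytes))  # ceiling division

for total_gb in (3, 10):
    for target in (64 * MB, 1 * GB):
        n = partitions_for_target(total_gb * GB, target)
        print(f"{total_gb} GB at {target // MB} MB per file -> {n} partitions")
```

For 3 GB of daily data and 64 MB files this gives 48. Note that with partitionBy, each in-memory partition writes one file per partition value it contains, so the final file count can be higher than this number.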
Keep an eye on how much data you are collecting, as it is very easy to run up large bills from storing lots of old data.
see also Improving Spark Performance with S3/ADLS/WASB
As my Spark program runs on more data, I think I am crashing because I'm picking up the default number of output partitions for aggregation, namely 200. I've learned how to control this, but ideally I would set the number of output partitions based on the amount of data I'm writing. Herein lies the conundrum: I need to first call count() on the dataframe, and then write it. That means I may read it from S3 twice. I could cache and then count, but I've seen Spark crash when I cache this data. Caching seems to use the most resources, whereas if I just write the data, Spark can do something more optimal.
So my questions are: do you think this is a decent approach, doing a count first (the count is a proxy for the size on disk), or should you just hard-code some numbers and change them when you need to? And if I am going to count first, is there some clever way to optimize things so that the count and the write share work, other than caching the whole dataframe?
Yes, the count approach is actually the correct way to go. Ideally you want your RDD partitions to be of a considerable size, around 50MB, before writing. Otherwise you will end up with the "small file problem".
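As a rough sketch of turning a row count into a partition number, in plain Python rather than Spark code: `avg_row_bytes` is a hypothetical figure you would have to estimate yourself (for example from a sample), and 50MB is the target mentioned above.

```python
MB = 1024 * 1024

def partitions_from_count(row_count, avg_row_bytes, target_partition_bytes=50 * MB):
    """Estimate the partition count from a row count used as a size proxy."""
    estimated_bytes = row_count * avg_row_bytes
    return max(1, -(-estimated_bytes // target_partition_bytes))  # ceiling division

# Hypothetical job: one billion rows at roughly 200 bytes each.
n = partitions_from_count(1_000_000_000, 200)
print(n)
```

The result would then be passed as the partition count for the shuffle or to repartition before the write.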
Now, if you have a large amount of data, caching it in memory could be hard. You could try MEMORY_AND_DISK, but then the data will spill to disk and cause a slowdown.
I have faced this predicament multiple times and every time I have chosen a "magic number" for the number of partitions. The number is parameterized so when I need to change I don't need to change the code, rather pass the different parameter.
If you know your data size is generally in a particular range, you could hard-code the partition number. It is not ideal, but it gets the job done.
You could also publish metrics such as the size of the data in S3, and raise an alarm if it breaches some threshold so that someone can change the partition number manually.
In general, if you keep the partition number moderately high, like 5000 for approximately 500GB of data, it works for a large range, i.e. from 300GB to 1.2TB of data. This means you probably don't need to change the partition number too often if you have a moderate inflow of data.
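The arithmetic behind that range can be checked quickly. This is a plain Python sketch using decimal units (1 GB = 1000 MB); `partition_size_mb` is an illustrative helper, not part of Spark.

```python
def partition_size_mb(total_gb, partitions):
    """Average partition size in MB for a given data size and partition count."""
    return total_gb * 1000 / partitions

PARTITIONS = 5000  # the "magic number" from the text, passed in as a parameter

for total_gb in (300, 500, 1200):
    size = partition_size_mb(total_gb, PARTITIONS)
    print(f"{total_gb} GB / {PARTITIONS} partitions = {size:.0f} MB per partition")
```

With 5000 partitions, the average partition goes from 60 MB at 300GB to 240 MB at 1.2TB, which stays above the 50MB guideline across the whole range.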
I'm running reduceByKey in spark. My program is the simplest example of spark:
val counts = textFile.flatMap(line => line.split(" ")).repartition(20000)
  .map(word => (word, 1))
  .reduceByKey(_ + _, 10000)
counts.saveAsTextFile("hdfs://...")
but it always runs out of memory...
I'm using 50 servers, with 35 executors per server and 140GB of memory per server.
The document volume is:
8TB of documents, 20 billion documents, 1000 billion words in total.
After the reduce, there will be about 100 million distinct words.
How should I configure Spark?
What values should these parameters have?
1. The number of maps? 20000, for example?
2. The number of reduces? 10000, for example?
3. Other parameters?
It would be helpful if you posted the logs, but one option would be to specify a larger number of partitions when reading in the initial text file (e.g. sc.textFile(path, 200000)) rather than repartitioning after reading. Another important thing is to make sure that your input file is splittable (some compression options make it non-splittable, in which case Spark may have to read it on a single machine, causing OOMs).
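To see what the partition counts mean at this scale, here is a quick sizing sketch in plain Python. It only divides the figures given in the question (decimal units), and it assumes data is spread evenly, which real jobs with skewed keys will not match.

```python
TB = 10**12
MB = 10**6

total_bytes = 8 * TB
total_words = 1000 * 10**9     # 1000 billion words
distinct_words = 100 * 10**6   # about 100 million keys after the reduce

def per_partition(total, partitions):
    """Even share of a total across a number of partitions."""
    return total // partitions

# Input partition size for the two candidate partition counts.
for parts in (20_000, 200_000):
    print(f"{parts} input partitions: {per_partition(total_bytes, parts) // MB} MB each")

# Words flowing through each map task and keys held by each reduce task.
print("words per map task:", per_partition(total_words, 20_000))
print("keys per reduce task:", per_partition(distinct_words, 10_000))

# Memory available to each executor on a 140GB server running 35 executors.
print("GB per executor:", 140 / 35)
```

At 20000 partitions each task handles about 400 MB of input with about 4 GB available to its executor, while 200000 partitions brings that down to about 40 MB per task.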
Since you aren't caching any of the data, another option would be to reduce the amount of memory Spark sets aside for caching (controlled with spark.storage.memoryFraction). Also, since you are only working with tuples of strings, I'd recommend using the org.apache.spark.serializer.KryoSerializer serializer.
Did you try using a partitioner? It can help reduce the number of keys per node. If we suppose that each key (word) weighs 1 KB on average, that implies 100 GB of memory per node exclusively for keys. With partitioning, you can reduce the number of keys per node by a factor of approximately the number of nodes, reducing the necessary amount of memory per node accordingly.
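The estimate above can be reproduced with a short sketch. This is plain Python with decimal units: the 1 KB key size is the assumption from the text, and `partition_for` is a hypothetical stand-in for a hash partitioner (using CRC32 for a stable hash), not Spark's own implementation.

```python
import zlib

KB = 10**3
GB = 10**9

distinct_keys = 100 * 10**6   # about 100 million words after the reduce
avg_key_bytes = 1 * KB        # assumed average key size from the text
nodes = 50

total_key_gb = distinct_keys * avg_key_bytes / GB
per_node_gb = total_key_gb / nodes
print(f"all keys: {total_key_gb:.0f} GB, per node when partitioned: {per_node_gb:.0f} GB")

def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash, so equal keys always co-locate."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Each word is routed to exactly one of the 50 nodes.
for word in ("spark", "hadoop", "spark"):
    print(word, "->", partition_for(word, nodes))
```

Because every occurrence of a word lands on the same partition, each node only needs to hold its own share of the keys rather than all of them.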
The spark.storage.memoryFraction option mentioned by @Holden is also a key factor.
I have 6 Cassandra nodes mainly used for writing (95%).
What's the best approach to inserting data: individual inserts or batches? Reason says batches should be used, while keeping the "batch size" under 5kb to avoid node instability:
https://issues.apache.org/jira/browse/CASSANDRA-6487
Do these 5kb refer to the size of the queries, as in number of chars * bytes_per_char? Are there any performance drawbacks to running only individual inserts?
A batch will increase performance if it is used for a single partition. You will get more throughput on the data inserted.
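One way to follow both constraints (single-partition batches, each under 5kb) is to group statements by partition key before sending them. The sketch below is plain Python with no Cassandra driver involved. `group_into_batches` is a hypothetical helper, and it approximates batch size by the UTF-8 length of the statement text, which is an assumption: the source does not say how the 5kb is measured.

```python
from collections import defaultdict

MAX_BATCH_BYTES = 5 * 1024  # the 5kb guideline from the question

def group_into_batches(rows, max_bytes=MAX_BATCH_BYTES):
    """Group (partition_key, statement) pairs into single-partition batches.

    Size is approximated by the UTF-8 length of each statement (assumption).
    A statement larger than max_bytes ends up in a batch of its own.
    """
    by_key = defaultdict(list)
    for key, statement in rows:
        by_key[key].append(statement)

    batches = []
    for key, statements in by_key.items():
        current, current_bytes = [], 0
        for statement in statements:
            size = len(statement.encode("utf-8"))
            if current and current_bytes + size > max_bytes:
                batches.append((key, current))
                current, current_bytes = [], 0
            current.append(statement)
            current_bytes += size
        if current:
            batches.append((key, current))
    return batches

# Hypothetical workload: 2000-byte statements for two partition keys.
rows = [("a", "x" * 2000), ("b", "y" * 2000), ("a", "x" * 2000), ("a", "x" * 2000)]
for key, statements in group_into_batches(rows):
    print(key, len(statements))
```

Each resulting group could then be sent as one batch with whichever client is in use, while statements for different partitions stay in separate batches.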
I am using Cassandra to store my parsed site logs. I have two column families with multiple secondary indices. The log data by itself is around 30 GB in size. However, the size of the Cassandra data dir is ~91 GB. Is there any way I can reduce the size of this store? Also, will having multiple secondary indices have a big impact on the datastore size?
Potentially, the secondary indices could have a big impact, but obviously it depends what you put in them! If most of your data entries appear in one or more indexes, then the indexes could form a significant proportion of your storage.
You can see how much space each column family is using with JConsole and/or 'nodetool cfstats'.
You can also look at the sizes of the disk data files to get some idea of usage.
It's also possible that data isn't being flushed to disk often enough - this can result in lots of commitlog files being left on disk for a long time, occupying extra space. This can happen if some of your column families are only lightly loaded. See http://wiki.apache.org/cassandra/MemtableThresholds for parameters to tune this.
If you have very large numbers of small columns, then the column names may use a significant proportion of the storage, so it may be worth shortening them where this makes sense (not if they are timestamps or other meaningful data!).
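A rough estimate shows how much column names can matter. This is a plain Python sketch with hypothetical byte counts: it ignores Cassandra's actual on-disk format (timestamps, compression, index files), so treat it only as an illustration of the proportion.

```python
def name_fraction(name_bytes, value_bytes, other_overhead_bytes=0):
    """Share of each stored column taken up by its name.

    other_overhead_bytes is a hypothetical placeholder for per-column
    metadata; the real figure depends on the Cassandra version and format.
    """
    return name_bytes / (name_bytes + value_bytes + other_overhead_bytes)

# Hypothetical column: an 8-byte value stored under a long or a short name.
for name in ("response_time_millis", "rt"):
    share = name_fraction(len(name), 8)
    print(f"{name!r}: {share:.0%} of the column is the name")
```

With small values, a 20-character name accounts for most of the stored column, while a 2-character name accounts for about a fifth.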