I am using the Flink batch API with Hadoop FileInputFormat to process a large number of input files (approx. 100k). I found that job preparation is extremely slow. The FileInputFormat.getSplits() method iterates over all input paths and gets block locations for every path, which I think sends about 100k requests to HDFS and leads to the problem. Is there any approach to speed up the split generation procedure? I think Spark and MapReduce may have a similar problem as well. Thank you very much!
Try increasing this parameter: mapreduce.input.fileinputformat.list-status.num-threads
Also, compacting those 100k files would definitely help.
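For reference, a minimal sketch of setting that property on the Hadoop Configuration that backs the input format (the thread count of 32 is just an illustration, not a recommendation):

import org.apache.hadoop.conf.Configuration

val hadoopConf = new Configuration()
// list input files and fetch block locations with multiple threads instead of one
hadoopConf.setInt("mapreduce.input.fileinputformat.list-status.num-threads", 32)
// pass hadoopConf to whatever wraps your Hadoop FileInputFormat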
Before I write a DataFrame to HDFS, I coalesce(1) to make it write only one file, so it is easy to handle things manually when copying them around, getting them from HDFS, etc.
I would write the output like this:
outputData.coalesce(1).write.parquet(outputPath)
(outputData is an org.apache.spark.sql.DataFrame)
I would like to ask if there is any impact on performance versus not coalescing:
outputData.write.parquet(outputPath)
Yes, it will write with 1 worker.
So, even though you give it 10 CPU cores, it will write with 1 worker (a single partition).
This is a problem if your file is very big (10 GB or more), but it is fine if you have a small file (around 100 MB).
I would not recommend doing that. The whole purpose of distributed computing is to have data and processing sitting on multiple machines and to capitalize on the CPU/memory of many machines (worker nodes).
In your case, you are trying to put everything in one place. Why do you need a distributed file system if you want to write into a single file with just one partition? Performance can be an issue, but it can only be assessed by comparing runs with and without the coalesce on a large amount of data spread across multiple nodes of the cluster.
Though really not suggested when dealing with huge data, using coalesce(1) can be handy when there are too many small partition files in _temporary and moving them into the proper directories is taking quite a bit of time.
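To summarize the trade-off in code, here is a minimal sketch reusing outputData and outputPath from the question (the partition count of 8 is an arbitrary illustration):

// small output (roughly 100 MB): one file, one writer is fine
outputData.coalesce(1).write.parquet(outputPath)

// large output (many GB): keep several partitions so several workers write in parallel
outputData.coalesce(8).write.parquet(outputPath)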
I have about 8m rows of data with about 500 columns.
When I try to write it with Spark as a single file using coalesce(1), it fails with an OutOfMemoryException.
I know this is a lot of data for one executor, but as far as I understand the write process of Parquet, it only holds the data for one row group in memory before flushing it to disk, and then continues with the next one.
My executor has 16 GB of memory and it cannot be increased any further. The data contains a lot of strings.
So what I am interested in are settings that let me tweak the process of writing big Parquet files for wide tables.
I know I can enable/disable the dictionary and increase/decrease the block and page sizes.
But what would be a good configuration for my needs?
I don't think that Parquet really contributes to the failure here, and tweaking its configuration probably won't help.
coalesce(1) is a drastic operation that affects all upstream code. As a result, all processing is done on a single node, and by your own account, your resources are already very limited.
You didn't provide any information about the rest of the pipeline, but if you want to stay with Spark, your best hope is replacing coalesce with repartition. If the OOM occurs in one of the preceding operations, this might help.
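As a minimal sketch of that swap, reusing the names from the earlier question (repartition(1) still produces one output file, but it adds a shuffle, so the upstream stages keep their parallelism and only the final write runs as a single task):

// coalesce(1) collapses upstream parallelism; repartition(1) shuffles first
outputData.repartition(1).write.parquet(outputPath)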
I have a Spark Streaming application that writes its output to HDFS.
What precautions and strategies can I take to ensure that this process does not generate too many small files and create memory pressure on the HDFS NameNode?
Does Apache Spark provide any pre-built solution to avoid small files in HDFS?
No, Spark does not provide any such solution.
What you can do:
Increase the batch interval - this does not guarantee anything, but there is a higher chance of larger files. The tradeoff is that the streaming job will have higher latency.
Manage it manually. For example, on each batch you could calculate the size of the RDD and keep accumulating RDDs until they satisfy your size requirement; then you just union the RDDs and write to disk. This will unpredictably increase latency, but it guarantees efficient space usage.
Another solution is to run a separate Spark application that re-aggregates (compacts) the small files every hour/day/week, etc.
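A minimal sketch of such a compaction job, assuming Parquet output; the paths, the SparkSession named spark, and the partition count are placeholders:

// read the many small files, shrink them to a few partitions, write them back compacted
val small = spark.read.parquet("/data/streaming-output")
small.repartition(4).write.mode("overwrite").parquet("/data/streaming-output-compacted")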
I know this question is old, but it may be useful for someone in the future.
Another option is to use coalesce with a smaller number of partitions. coalesce merges partitions together and creates larger partitions. This can increase the processing time of the streaming batch because of the reduced number of partitions during the write, but it will help in reducing the number of files.
This reduces parallelism, so having too few partitions can cause issues for the streaming job. You will have to test different values of partitions for coalesce to find which value works best in your case.
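For example, a sketch of applying coalesce inside foreachRDD before writing each micro-batch (the DStream named dstream, the partition count, and the output path are illustrative assumptions):

// dstream is assumed to be an existing DStream[String] in the application
// write each micro-batch with fewer, larger files
dstream.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    rdd.coalesce(4).saveAsTextFile(s"/data/out/batch-${time.milliseconds}")
  }
}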
You can reduce the number of part files.
By default, Spark SQL uses 200 shuffle partitions, so output written after a shuffle ends up in 200 part files. You can decrease that number.
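That default comes from the shuffle partition setting; a sketch of lowering it, assuming spark is your SparkSession (the value 20 is only an example):

// fewer shuffle partitions => fewer part files after shuffles/aggregations
spark.conf.set("spark.sql.shuffle.partitions", "20")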
I am developing an integration channel with Kafka and Spark, which will process both batches and streaming.
For batch processing, I feed in huge CSV files (4 GB).
I'm considering two solutions:
Send the whole file to the file system and send a message to Kafka with the file's address; the Spark job will read the file from the FS and run on it.
Cut the file into unit messages before Kafka (with Apache NiFi) and send them, so the batch is treated as streaming in the Spark job.
What do you think is the best solution?
Thanks
If you're writing code to place the file on the file system, you can use that same code to submit the Spark job to the job tracker. The job tracker becomes the task queue and processes your submitted files as Spark jobs.
This would be a simpler way of implementing #1, but it has drawbacks. The main drawback is that you have to tune resource allocation to make sure you don't under-allocate for cases where your data set is extremely large. If you over-allocate resources for the job, then your task queue potentially grows while tasks are waiting for resources. The advantage is that there aren't very many moving parts to maintain and troubleshoot.
Using NiFi to cut a large file down and having Spark handle the pieces as a stream would probably make it easier to utilize the cluster resources more effectively. If your cluster is servicing random jobs on top of this data ingestion, this might be the better way to go. The drawbacks here are that you need to do extra work to process all parts of a single file in one transactional context, and you may have to do a few extra things to make sure you aren't going to lose the data delivered by Kafka, etc.
If this is for a batch operation, method 2 might be considered overkill. The setup seems pretty complex for reading a CSV file, even if it is a potentially very large file. If you had a problem with the velocity of the CSV files, a number of ever-changing sources for the CSV, or a high error rate, then NiFi would make a lot of sense.
It's hard to suggest the best solution. If it were me, I'd start with the variation of #1 to make it work first. Then you can make it work better by introducing more system parts, depending on whether your approach handles anomalies in the input file with an acceptable level of accuracy. You may find that your biggest problem is trying to identify errors in input files during a large-scale ingestion.
I am trying to "train" a DecisionTreeClassifier using Apache Spark running in a cluster in Amazon EMR. Even though I can see that there are around 50 Executors added and that the features are created by querying a Postgres database using SparkSQL and stored in a DataFrame.
The DesisionTree fit method takes for many hours even though the Dataset is not that big (10.000 db entries with a couple of hundreds of bytes each row).I can see that there is only one task for this so I assume this is the reason that it's been so slow.
Where should I look for the reason that this is running in one task?
Is it the way that I retrieve the data?
I am sorry if this is a bit vague, but I don't know whether the code that retrieves the data is relevant, whether it is a parameter of the algorithm (although I didn't find anything online), or whether it is just a matter of Spark tuning.
I would appreciate any direction!
Thanks in advance.
Spark relies on data locality. It seems that all the data is located in a single place, hence Spark uses a single partition to process it. You could apply a repartition, or specify the number of partitions you would like to use at load time. I would also look into the decision tree API and see whether you can set the number of partitions for it specifically.
Basically, partitions are your level of parallelism.
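A hedged sketch of the load-then-repartition suggestion; the JDBC connection details, table name, and partition count are placeholders, not taken from the question:

// load from Postgres, then explicitly repartition so downstream stages run in parallel
val raw = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("dbtable", "features_table")
  .load()

val training = raw.repartition(50)   // roughly match the number of executors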