I have a huge file to process, loaded into an RDD, and performing some validations on its lines using map function.
I have a set of errors that are fatal for the whole file if encountered even on one line of the file. Thus, I would like to abort any other processing (all launched mappers across the whole cluster) as soon as that validation fails on a line (to save some time).
Is there a way to archive this?
Thank you.
PS: Using Spark 1.6, Java API
Well, after further search and understanding of Spark transformations laziness , I just have to do something like:
rdd.filter(checkFatal).take(1)
Then due to the laziness, the processing will just stop itself once one record matching that rule is found :)
Related
I have a databricks job that run many commands and at the end it tries to save the results to a folder. However, it is failed because it tried to write a file to folder but folder was not exists.
I simply created the folder.
However, how can I make it continue where it left without executing all the previous commands.
I assume that by Databricks job you refer to the way to run non-interactive code in a Databricks cluster.
I do not think that what you ask is possible, namely getting the output of a certain Spark task from a previous job run on Databricks. As pointed out in the other answer, "if job is finished, then all processed data is gone". This has to do with the way Spark works under the hood. If you are curious about this topic, I suggest you start reading this post about Transformations and Actions in Spark.
Although you can think of a few workarounds, for instance if you are curious about certain intermediate outputs of your job, you could decide to temporary write your DataFrame/Dataset to some external location. In this way you can easily resume a the job from your preferred point by reading one of your checkpoints as input. This approach is a bit clanky and I do not recommend it, but it's a quick and dirty solution you might want to choose if you are in the testing/designing phase.
A more robust solution would involve splitting your job in multiple sub-jobs and setting upstream & downstream dependencies among them. You can do that using Databricks natively (Task dependencies section) or an external scheduler that integrates with Databricks, like Airflow.
In this way, you can split your tasks and you will be able to have an higher control granularity on your Application. So, in case of again failures on the writing step, you will be able to re run only the writing easily.
If job is finished, then all processed data is gone, until you write some intermediate states (additional tables, etc.) from which you can continue processing. In most cases, Spark actually execute the code only when it's writing results of execution of provided transformations.
So right now you just need to rerun the job.
I'm having some issue with the spark I'm using
I'm trying to read some parquet file apply a filter and then run some python udf.
spark.read.parquet(...).where(
col('col_name_a').isNotNull()
).withColumn(
costly_python_udf("col_name_b").alias("col_b_transform")
)
With just those two steps, the code runs fast (in the sense that I'm action on a partial subset of data given the col_a!=null condition) as I expected. However if I take the dataframe and add more operations to it, processing slows down. It almost feels like the python udf operation and the null check operations got swapped.
I looked at the SQL section in the ApplicationMaster and I see in 'Optimized Logical Plan' that Filter step and BatchEvalPython got swapped.
In the same project, this swap impacted me where I was checking for null string and the loading json and the json load was failing due to the null load error, because guess again, the sequence of filter application and python udf was swapped. Few questions.
I'm currently managing this issue, by inserting cache() step after filter and before the python udf, which seems to be forcing Filter and BatchEvalPython run in the right sequence.
Why does the Filter gets pushed after the BatchEvalPython. I can't think of any scenario where that would be an optimization. If anything, it should be the other way around.
Is this is known issue? Is this fixed in future versions?
Is there a better command than cache to force pyspark to not optimize across before and after the cache call?
spark.version = u'2.1.0'
My problem is the following, I want to aggregate some data that is stored on S3. As initial input to my pipeline I use a text file that contains the path of all the S3 files that should be aggregated.
PCollection<String> readInputPipeline = p.apply("ReadLines", TextIO.read().from(options.getInputFile()));
readInputPipeline = readInputPipeline.apply(ParDo.of(new ReadFromS3Mapper()));
The input file has 346k lines. When I deploy this code to a Spark cluster reading from S3 looks like it happens only in 2 Spark Tasks even though many cores are available. Is there any way for me to increase the parallelism of this operation?
I am running this on EMR on Amazon with a master (m3.xlarge) and a core machine (R3.4xlarge) with the following options:
"spark-submit"
"--driver-java-options='-Dspark.yarn.app.container.log.dir=/mnt/var/log/hadoop'",
"--master", "yarn",
"--executor-cores","16",
"--executor-memory","6g"
PS: maybe the solution could be that I shouldn't do this kind of expensive IO operations in this context?
Spark decides how to split up an input, here it's decided to go through the entire file in one go, because it so small.
I've done something similar in a distcp application; this uses Spark's ParallelCollectionRDD class to explicitly tell spark to split the listing up one-by-one.
That class should be enough for you to do something similar -you may have to read the initial text file in locally to a list, then pass the list to the ParallelCollectionRDD constructor
A bit late reply, but I looked into what Beam does in the 2.16.0 release.
You're getting 2 tasks after the first TextIO.read() -- I suspect that your initial list of files of 346k lines is being split into two partitions. This behaviour is controlled by the desiredBundleSize inside TextIO, which is hard-coded to 64MB.
In Spark, your action ReadFromS3Mapper will be "fused" to the arriving records and you'll always stay at two partitions.
If you want to keep the same code, you can force a repartition between the two transformations:
PCollection<String> allContents = p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
.apply("Repartition", Reshuffle.viaRandomKey())
.apply(ParDo.of(new ReadFromS3Mapper()));
As an alternative, there's quite a few interesting patterns available in the TextIO and FileIO utilities. There's an example that matches yours almost exactly (implicitly including the reshuffle).
If nondeterministic code runs on Spark, this can cause a problem when recovery from failure of a node is necessary, because the new output may not be exactly the same as the old output. My interpretation is that the entire job might need to be rerun in this case, because otherwise the output data could be inconsistent with itself (as different data was produced at different times). At the very least any nodes that are downstream from the recovered node would probably need to be restarted from scratch, because they have processed data that may now change. That's my understanding of the situation anyway, please correct me if I am wrong.
My question is whether Spark can somehow automatically detect if code is nondeterministic (for example by comparing the old output to the new output) and adjust the failure recovery accordingly. If this were possible it would relieve application developers of the requirement to write nondeterministic code, which might sometimes be challenging and in any case this requirement can easily be forgotten.
No. Spark will not be able to handle non deterministic code in case of failures. The fundamental data structure of Spark, RDD is not only immutable but it
should also be determinstic function of it's input. This is necessary otherwise Spark framework will not be able to recompute the partial RDD (partition) in case of
failure. If the recomputed partition is not deterministic then it had to re-run the transformation again on full RDDs in lineage. I don't think that Spark is a right
framework for non-deterministic code.
If Spark has to be used for such use case, application developer has to take care of keeping the output consistent by writing code carefully. It can be done by using RDD only (no datframe or dataset) and persisting output after every transformation executing non-determinstic code. If performance is the concern, then the intermediate RDDs can be persisted on Alluxio.
A long term approach would be to open a feature request in apache spark jira. But I am not too positive about the acceptance of feature. A little hint in syntax to know wether code is deterministic or not and framework can switch to recover RDD partially or fully.
Non-deterministic results are not detected and accounted for in failure recovery (at least in spark 2.4.1, which I'm using).
I have encountered issues with this a few times on spark. For example, let's say I use a window function:
first_value(field_1) over (partition by field_2 order by field_3)
If field_3 is not unique, the result is non-deterministic and can differ each time that function is run. If a spark executor dies and restarts while calculating this window function, you can actually end up with two different first_value results output for the same field_2 partition.
I am working on parsing different types of files (text,xml,csv etc.) into a specific text file format using spark java API. This output file maintains the order of file header, start tag, data header, data and end tag. All of these element are extracted from input file at some point.
I tried to achieve this in below 2 ways:
Read file to RDD using sparks textFile and perform parsing by using map or mapPartions which returns new RDD.
Read file using sparks textFile , reduce to 1 partition using coalesce and perform parsing by using mapPartions which returns new RDD.
While I am not concerned about sequencing of actual data, with first approach I am not able to keep the required order of File Header, Start Tag, Data Header and End Tag.
The latter works for me, but I know it is not efficient way and may cause problem in case of BIG files.
Is there any efficient way to achieve this?
You are correct in you assumptions. The second choice simply cancels the distributional aspect of your application, so it's not scalable. For the order issue, as the concept is asynchronous, we cannot keep track of order when the data reside in different nodes. What you could do is some preprocessing that would cancel the need for order. Meaning, merge lines up to the point where the line order does not matter and only then distribute your file. Unless you can make assumptions about the file structure, such as number of lines that belong together, I would go with the above.