I'm learning how to use Kubeflow Pipelines for Apache Spark jobs and have a question. I'd appreciate it if you could share your thoughts!
It is my understanding that data cannot be shared between SparkSessions, and that each pipeline step/component needs to instantiate a new SparkSession (please correct me if I'm wrong). Does that mean that, in order to use the output of Spark jobs from previous pipeline steps, we need to save it somewhere? I suspect this will add disk read/write overhead and slow down the whole process. Can you please tell me how useful a pipeline is for Spark work in that case?
I'm imagining a potential use case where one would like to ingest data in PySpark, preprocess it, select features for an ML job, then try different ML models and select the best one. In a non-Spark situation, I would probably set up separate components for each step of "loading data", "preprocessing data", and "feature engineering". Given the issue above, however, would it be better to complete all of these within one pipeline step, save the output somewhere, and then dedicate a separate pipeline component to each model and train them in parallel?
Can you share any other potential use cases? Thanks a lot in advance!
Spark is, in general, an in-memory processing framework, so you should avoid unnecessary writing and reading of files. I believe it's better to run a whole Spark job in one task, so that you don't need to share a Spark session or intermediate results between tasks. The data produced by loading, pre-processing and feature engineering is better serialised and stored anyway, with or without Kubeflow (think bronze/silver/gold layers).
Let's say there is a collection "goods" in MongoDB like this:
{name:"A",attr:["location":"us"],"eventTime":"2018-01-01"}
{name:"B",attr:["brand":"nike"],"eventTime":"2018-01-01"}
In the past, I used Spark to flatten it and save it to Hive:
goodsDF.select($"name", explode($"attr"))
But now we need to handle incremental data.
For example, a new good appears as the third line the next day:
{name:"A",attr:["location":"us"],"eventTime":"2018-01-01"}
{name:"B",attr:["brand":"nike"],"eventTime":"2018-01-01"}
{name:"C",attr:["location":"uk"],"eventTime":"2018-02-01"}
Some of our team think Flink is better for streaming, because Flink supports event-driven applications, streaming pipelines and batch, whereas Spark is just micro-batch.
So we are switching to Flink, but a lot of code has already been written in Spark, for example the "explode" above. So my question is:
Is it possible to use Flink to fetch from the source and save to the sink, but use Spark in the middle to transform the dataset?
If that is not possible, how about saving to a temporary sink, say some JSON files, and then having Spark read the files, transform them and save to Hive? But I am afraid this makes no sense, because for Spark it is also incremental data. Using Flink and then Spark is the same as using Spark Structured Streaming directly.
No. Apache Spark code cannot be used in Flink without changes, because the two are different processing frameworks and their APIs and syntax differ. The choice of framework should be driven by the use case and not by generic statements like "Flink is better than Spark". A framework may work great for one use case and perform poorly for another. By the way, Spark is not just micro-batch: it has batch, streaming, graph, ML and other things. Since the complete use case is not described in the question, it is hard to suggest which one is better for this scenario. But if your use case can afford sub-second latency, then I would not waste my time moving to another framework.
Also, if things are dynamic and you anticipate that the processing framework may change in the future, it would be better to use something like Apache Beam, which provides an abstraction over most processing engines. Using the Beam processing APIs gives you the flexibility to change the underlying processing engine at any time. You can read more about Beam here: https://beam.apache.org/.
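To make the portability point a bit more concrete, here is a minimal sketch in Python of the flattening step written against Beam. It assumes that each record's attr field is a list of one-key dicts (the sample records in the question are not valid JSON, so this shape is a guess), and explode_attrs and build_pipeline are hypothetical names introduced for illustration. The Beam import is kept inside build_pipeline so that the plain transformation can be tried without Beam installed.

```python
def explode_attrs(record):
    """Yield one flat row per attribute, similar in spirit to Spark's explode.

    Assumes record["attr"] is a list of one-key dicts (hypothetical shape).
    """
    for attr in record.get("attr", []):
        for key, value in attr.items():
            yield {
                "name": record["name"],
                "key": key,
                "value": value,
                "eventTime": record["eventTime"],
            }


def build_pipeline(records, runner="DirectRunner"):
    """Wire the same function into a Beam pipeline (needs apache-beam installed).

    The runner name selects the engine, e.g. "FlinkRunner" or "SparkRunner".
    """
    import apache_beam as beam

    pipeline = beam.Pipeline(runner=runner)
    (
        pipeline
        | "Read" >> beam.Create(records)         # stand-in for a real source
        | "Explode" >> beam.FlatMap(explode_attrs)
        | "Print" >> beam.Map(print)             # stand-in for a real sink
    )
    return pipeline


# The transformation itself is plain Python and independent of the engine.
sample = [
    {"name": "A", "attr": [{"location": "us"}], "eventTime": "2018-01-01"},
    {"name": "B", "attr": [{"brand": "nike"}], "eventTime": "2018-01-01"},
]
flat_rows = [row for rec in sample for row in explode_attrs(rec)]
```

With Beam installed, build_pipeline(sample).run().wait_until_finish() would run this locally. In this sketch only the runner argument changes when moving to another engine, although a real Flink or Spark runner also needs its own cluster configuration.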
I have a data processing pipeline consisting of 3 methods (let's say A(), B(), C(), run sequentially) for an input text file. But I have to repeat this pipeline for 10000 different files. I have used ad hoc multithreading: create 10000 threads and add them to a thread pool... Now I am switching to Spark to achieve this parallelism. My questions are:
If Spark can do a better job, please guide me through the basic steps, because I'm new to Spark.
If I use ad hoc multithreading and deploy it on a cluster, how can I manage resources so that threads are allocated evenly among nodes? I'm new to HPC systems too.
I hope I am asking the right questions, thanks!
I am trying to "train" a DecisionTreeClassifier using Apache Spark running in a cluster on Amazon EMR. I can see that around 50 executors are added. The features are created by querying a Postgres database using SparkSQL and stored in a DataFrame.
The DecisionTree fit method takes many hours even though the dataset is not that big (10,000 database entries of a couple of hundred bytes each). I can see that there is only one task for this, so I assume this is the reason it's been so slow.
Where should I look for the reason that this is running in one task?
Is it the way that I retrieve the data?
I am sorry if this is a bit vague but I don't know if the code that retrieves the data is relevant, or is it a parameter in the algorithm (although I didn't find anything online), or is it just Spark tuning?
I would appreciate any direction!
Thanks in advance.
Spark relies on data locality. It seems that all the data is located in a single place, so Spark uses a single partition to process it. You could apply a repartition, or state the number of partitions you would like to use at load time. I would also look into the decision tree API and see if you can set the number of partitions for it specifically.
Basically, partitions are your level of parallelism.
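Both options can be sketched in PySpark roughly as follows. This is a minimal sketch, not the asker's code: the JDBC URL, the table name features, the column id and the bounds are all hypothetical, and jdbc_partition_options and load_features are helper names made up for illustration. The option keys partitionColumn, lowerBound, upperBound and numPartitions are the ones the Spark JDBC data source uses to split a read into parallel range queries.

```python
def jdbc_partition_options(url, table, column, lower, upper, num_partitions):
    """Build JDBC reader options that ask Spark for a range-partitioned read."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": column,        # must be numeric, date or timestamp
        "lowerBound": str(lower),         # bounds only shape the split ranges,
        "upperBound": str(upper),         # they do not filter rows
        "numPartitions": str(num_partitions),
    }


def load_features(spark, options, target_partitions=None):
    """Load the table with an existing SparkSession; optionally repartition."""
    df = spark.read.format("jdbc").options(**options).load()
    if target_partitions is not None:
        # Alternative when the load itself cannot be partitioned.
        df = df.repartition(target_partitions)
    print("partitions:", df.rdd.getNumPartitions())
    return df


# Hypothetical connection details, for illustration only.
opts = jdbc_partition_options(
    "jdbc:postgresql://db-host:5432/mydb", "features", "id", 1, 10000, 50
)
```

Checking df.rdd.getNumPartitions() before calling fit is a quick way to confirm whether the DataFrame really arrived as a single partition.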
On a Spark cluster, if the jobs are very small, I assume that the clustering will be inefficient since most of the time will be spent on communication between nodes, rather than utilizing the processors on the nodes.
Is there a way to monitor how much time out of a job submitted with spark-submit is wasted on communication, and how much on actual computation?
I could then monitor this ratio to check how efficient my file aggregation scheme or processing algorithm is in terms of distribution efficiency.
I looked through the Spark docs, and couldn't find anything relevant, though I'm sure I'm missing something. Ideas anyone?
You can see this information in the Spark UI, assuming you are running Spark 1.4.1 or higher (sorry, but I don't know how to do this for earlier versions of Spark).
Here is a sample image:
Here is the page that the image came from.
A brief summary: you can view a timeline of all the events happening in your Spark job within the Spark UI. From there, you can zoom in on each individual job and each individual task. Each task is divided into scheduler delay, serialization / deserialization, computation, shuffle, etc.
Now, this is obviously a very pretty UI but you might want something more robust so that you can check this info programmatically. It appears here that you can use the REST API to export the logging info in JSON.
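As a rough illustration of the programmatic route, here is a minimal Python sketch that reads a stage's task list from the REST API and estimates how much of the task time was computation. It assumes the taskList endpoint and the taskMetrics field names (executorRunTime, executorDeserializeTime, resultSerializationTime, shuffleReadMetrics.fetchWaitTime) as documented for recent Spark versions; the available fields vary between versions, and fetch_task_list and time_breakdown are hypothetical helper names. Subtracting the shuffle fetch wait from the run time is an approximation of the computing time, not an official metric.

```python
import json
from urllib.request import urlopen


def fetch_task_list(base_url, app_id, stage_id, attempt_id=0):
    """Fetch per-task data for one stage attempt, e.g. base_url="http://driver:4040"."""
    url = (f"{base_url}/api/v1/applications/{app_id}"
           f"/stages/{stage_id}/{attempt_id}/taskList")
    with urlopen(url) as response:
        return json.load(response)


def time_breakdown(tasks):
    """Sum task times (milliseconds) and return (totals, compute share)."""
    totals = {"compute": 0, "shuffle_wait": 0, "deserialize": 0, "serialize": 0}
    for task in tasks:
        metrics = task.get("taskMetrics") or {}
        shuffle_read = metrics.get("shuffleReadMetrics") or {}
        wait = shuffle_read.get("fetchWaitTime", 0)
        run = metrics.get("executorRunTime", 0)
        # Run time includes time blocked on shuffle fetches, so take it out.
        totals["compute"] += max(run - wait, 0)
        totals["shuffle_wait"] += wait
        totals["deserialize"] += metrics.get("executorDeserializeTime", 0)
        totals["serialize"] += metrics.get("resultSerializationTime", 0)
    overall = sum(totals.values())
    share = totals["compute"] / overall if overall else 0.0
    return totals, share


# Made-up task records in the assumed shape, instead of a live cluster.
sample_tasks = [
    {"taskMetrics": {"executorRunTime": 800, "executorDeserializeTime": 100,
                     "resultSerializationTime": 20,
                     "shuffleReadMetrics": {"fetchWaitTime": 80}}},
    {"taskMetrics": {"executorRunTime": 200, "executorDeserializeTime": 0,
                     "resultSerializationTime": 0}},
]
totals, compute_share = time_breakdown(sample_tasks)
```

Against a live application, time_breakdown(fetch_task_list(base_url, app_id, stage_id)) would give the same breakdown for a real stage. Scheduler delay is not covered here, because it is not exposed under the same name in every version.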
I want to check whether it is a good idea to invoke Spark code from a Storm bolt. We have a stream-based system in Storm. For each message we would like to do some ML, and we are thinking of using Spark for that. Are there any runtime issues we might encounter?
Thanks
ap
If you already have a system in place with Storm, then why do you want to use Spark?
IMHO Spark and Storm are different beasts. You may want to run them in parallel for the same or different use cases, but do not tightly integrate them with each other.
What do you mean by ML per message? ML on a single message doesn't make much sense. Do you mean ML on a stream? Sure, you can do it with Spark, but then you need to either use Spark Streaming (and you have two streaming architectures...) or save the data somewhere and do batch ML with Spark.
Why not use trident-ml instead?