I have a problem where I want to implement a recursive algorithm in Spark, and I'm looking to see if there are any recommendations for building this in Spark, or whether I should explore other data analytics frameworks that might be better suited.
E.g. the job needs to list a directory structure/tree recursively and process the nodes, combined with map/reduce patterns: map paths or groups of files into derived data, then group/merge such derived data recursively.
I'm trying to do this in a way that can leverage parallelizing the overall algorithm. It would be straightforward to build a solution that runs on a single node (e.g. the Spark master), but assume the directory structure is very large, with O(billion) leaf nodes.
Any suggestions for building recursive/iterative kinds of data pipelines in Spark or other frameworks/data processing technologies?
With Flink I would look at using the Stateful Functions API for this sort of use case.
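On the Spark side, one common shape for this is an iterative, breadth-first expansion of the tree, where each level of directories is listed in parallel. Below is a minimal sketch, not a full solution: the local-filesystem list_children helper and the /data/root starting path are placeholders for whatever storage API and root you actually have.

    import os
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    def list_children(path):
        # For illustration this walks the local filesystem; in practice you would
        # call your storage API here (HDFS, S3, ADLS, ...)
        dirs, files = [], []
        for entry in os.scandir(path):
            (dirs if entry.is_dir() else files).append(entry.path)
        return dirs, files

    # Breadth-first expansion: each iteration lists one level of the tree in parallel
    frontier = ["/data/root"]  # hypothetical starting directory
    all_files = sc.emptyRDD()

    while frontier:
        listed = sc.parallelize(frontier).map(list_children).collect()
        frontier = [d for dirs, _ in listed for d in dirs]
        all_files = all_files.union(
            sc.parallelize([f for _, files in listed for f in files]))

    # all_files now holds every leaf path and can feed the usual map/reduce stages
    # (derive per-file data, then group/merge it)

Note that the per-level collect() keeps the frontier on the driver, so for O(billion) leaves you would keep the frontier distributed (for example as a DataFrame you checkpoint between iterations) rather than collecting it; the sketch is only meant to show the iterative shape.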
I am working on something where I already have SQL code in place. Now we are migrating to Azure, so I created an Azure Databricks workspace for the transformation piece and reused the same SQL code with some minor changes.
I want to know: is there any recommended way or best practice for working with Azure Databricks?
Should we rewrite the code in PySpark for better performance?
Note: the end results from the previous SQL code have no bugs. It's just that we are migrating to Azure, so instead of spending time rewriting the code, I reused the same SQL code. Now I am looking for suggestions to understand the best practices and how they will make a difference.
Looking for your help.
Thanks!
Expecting: along with the migration from on-prem to Azure, I am looking for some best practices for better performance.
Under the hood, all of the code (SQL/Python/Scala, if written correctly) is executed by the same execution engine. You can always compare the execution plans of SQL & Python (EXPLAIN <query> for SQL, and dataframe.explain() for Python) and see that they are the same for the same operations.
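For example, a quick way to compare the two plans yourself (a minimal sketch; the sales table and amount column are made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # A made-up table, just so both APIs have something to query
    spark.createDataFrame([(1, 150.0), (2, 50.0)], ["id", "amount"]) \
        .createOrReplaceTempView("sales")

    # The same filter expressed in SQL and in the DataFrame API; compare the plans
    spark.sql("EXPLAIN SELECT * FROM sales WHERE amount > 100").show(truncate=False)
    spark.table("sales").filter("amount > 100").explain()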
So if your SQL code is working already you may continue to use it:
You can trigger SQL queries/dashboards/alerts from Databricks Workflows
You can use SQL operations in Delta Live Tables (DLT)
You can use DBT together with Databricks Workflows
But often you can get more flexibility or functionality when using Python. For example (this is not a full list):
You can programmatically generate DLT tables that perform the same transformations but on different tables (see the sketch after this list)
You can use streaming sources (SQL support for streaming isn't very broad yet)
When you need to integrate your code with some 3rd party libraries
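As an illustration of the first Python point above, here is a rough sketch of programmatically generating DLT tables. It only runs inside a Databricks Delta Live Tables pipeline (where the dlt module and spark are available), and the source table names are hypothetical:

    import dlt  # available only inside a Databricks Delta Live Tables pipeline
    from pyspark.sql import functions as F

    # Hypothetical source tables that should all get the same cleanup transformation
    SOURCE_TABLES = ["raw.orders", "raw.customers", "raw.payments"]

    def make_clean_table(source_name):
        target_name = source_name.split(".")[-1] + "_clean"

        @dlt.table(name=target_name)
        def _cleaned():
            # The same transformation, applied to each source table
            return (
                spark.read.table(source_name)
                .dropDuplicates()
                .withColumn("ingested_at", F.current_timestamp())
            )

    for src in SOURCE_TABLES:
        make_clean_table(src)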
But really, on Databricks you can usually mix & match SQL & Python code together; for example, you can expose Python code as a user-defined function and call it from SQL (there is a small example of a DLT pipeline doing exactly that), etc.
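A minimal sketch of that mix-and-match pattern, with a hypothetical mask_email function and customers table:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical Python logic we want to reuse from SQL
    def mask_email(email):
        user, _, domain = email.partition("@")
        return user[:1] + "***@" + domain

    spark.udf.register("mask_email", mask_email, StringType())

    spark.createDataFrame([("alice@example.com",)], ["email"]) \
        .createOrReplaceTempView("customers")
    spark.sql("SELECT mask_email(email) AS masked FROM customers").show()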
You asked a lot of questions there but I'll address the one you asked in the title:
Any benefits of using Pyspark code over SQL?
Yes.
PySpark is easier to test. For example, a transformation written in PySpark can be abstracted into a Python function which can then be executed in isolation within a test, so you can employ one of the myriad of Python testing frameworks (personally I'm a fan of pytest). This isn't as easy with SQL, where a transformation exists within the confines of the entire SQL statement and can't be abstracted without the use of views or user-defined functions, which are physical database objects that need to be created.
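To illustrate, a small pytest sketch of that kind of isolated transformation test (the add_total function and its columns are invented for the example):

    import pytest
    from pyspark.sql import SparkSession, functions as F

    def add_total(df):
        # The transformation under test: derive a total column
        return df.withColumn("total", F.col("price") * F.col("quantity"))

    @pytest.fixture(scope="session")
    def spark():
        return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

    def test_add_total(spark):
        df = spark.createDataFrame([(2.0, 3)], ["price", "quantity"])
        result = add_total(df).collect()
        assert result[0]["total"] == 6.0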
PySpark is more composable. One can pull together custom logic from different places (perhaps written by different people) to define an end-to-end ETL process.
PySpark's lazy evaluation is a beautiful thing. It allows you to compose an ETL process in an exploratory fashion, making changes as you go. It really is what makes PySpark (and Spark in general) a great thing, and the benefits of lazy evaluation can't really be explained; they have to be experienced.
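A tiny sketch of what that looks like in practice (the columns are made up): every transformation below just builds up the plan, and nothing runs until the action at the end.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    orders = spark.createDataFrame([(100.0, 12), (3.5, 2)], ["price", "quantity"])

    # Each line only extends the logical plan; nothing has executed yet
    enriched = orders.withColumn("total", F.col("price") * F.col("quantity"))
    big = enriched.filter(F.col("total") > 1000)

    # Only an action (count, show, write, ...) triggers the optimized execution
    big.count()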
Don't get me wrong, I love SQL, and for ad-hoc exploration it can't be beaten. There are good, justifiable reasons for using SQL over PySpark, but that wasn't your question.
These are just my opinions, others may beg to differ.
After getting help on the posted question and doing some research, I came up with the response below:
It does not matter which language you choose (SQL or Python). Both run on the Spark cluster, so Spark distributes the work across the cluster either way. Which to use depends on the specific use case.
Intermediate results for both SQL and PySpark DataFrames are held in memory.
In the same notebook we can use both languages, depending on the situation.
Use Python for heavy transformations (more complex data processing) or for analytical / machine learning purposes
Use SQL when dealing with a relational data source (focused on querying and manipulating structured data stored in a relational database)
Note: there are optimization techniques in both languages that can be used to improve performance.
Summary: choose the language based on the use case. Both get distributed processing because they run on the Spark cluster.
Thank you!
What is the purpose of Apache Arrow? It converts from one binary format to another, but why do I need that? If I have a Spark program, then Spark can read Parquet, so why do I need to convert it into another format midway through my processing?
Is it to pass that data in memory to another language like Python or Java without having to write it out to a text/JSON format?
Disclaimer: this question is broad and I am somewhat involved with the Apache Arrow project, so my answer may or may not be biased.
This question is broad in the same sense that a question like "When should I use NoSQL?" is broad. It depends. This answer is based on the assumption that you already have a Spark pipeline. This answer is not an attempt at Spark vs. Arrow (which is even broader, to the point I wouldn't touch it).
Many Apache Spark pipelines would never need to use Arrow. Spark, unlike Arrow-based pipelines, has its own in-memory dataframe format (https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/DataFrame.html) which, to my knowledge, cannot be zero-copied to Arrow. So converting from one format to the other is likely to introduce a performance hit of some kind and any benefit you achieve is going to have to be weighed against that.
You brought up one great example, which is switching to other languages / libraries. For example, Spark currently uses Arrow to apply a Pandas UDF (https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html). In this case, whenever you are going to a library that doesn't use Spark's in-memory format (which means any non-Java library and some Java libraries), you are going to have to do a translation between in-memory formats, so you will pay the performance hit anyway and you might as well switch to Arrow.
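For example, a minimal pandas UDF sketch (Spark 3.x type-hint style; the conversion itself is arbitrary). Arrow is what moves each batch of rows between the JVM and the Python worker here:

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.getOrCreate()

    # Arrow carries each batch of rows to the Python worker as pandas data,
    # avoiding per-row (de)serialization
    @pandas_udf("double")
    def celsius_to_fahrenheit(c: pd.Series) -> pd.Series:
        return c * 9.0 / 5.0 + 32.0

    df = spark.createDataFrame([(0.0,), (100.0,)], ["celsius"])
    df.select(celsius_to_fahrenheit("celsius").alias("fahrenheit")).show()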
There are some things that are faster with Arrow's format than Spark's format. I'm not going to try and list those here because, for the most part, the benefit isn't going to outweigh the cost of going Spark -> Arrow in the first place and I don't know that I have enough information to do so in any sort of comprehensive way. Instead, I'll provide one concrete example:
A common case for Arrow is when you need to transfer a table between processes that are on the same machine (or have a very fast I/O channel in between). In that case the cost of serializing to parquet and then deserializing back (Spark must do this to go Spark Dataframe -> Parquet -> Wire -> Parquet -> Spark Dataframe) is more expensive than the I/O saved (Parquet is more compact than Spark Dataframe so you will save some in transmission). If you have a lot of this type of communication it may be beneficial to leave Spark, do these transmissions in Arrow, and then return to Spark.
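For a feel of what that Arrow-side transfer looks like, here is a small pyarrow IPC sketch (the table contents are made up); the same columnar bytes written by one process can be read back by another with essentially no per-row work:

    import pyarrow as pa

    # A small table standing in for data you want to hand to another process
    table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

    # Write it in the Arrow IPC stream format: columnar bytes, no row-by-row encoding
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    buf = sink.getvalue()

    # The receiving side can read the same bytes back with essentially no copying
    received = pa.ipc.open_stream(buf).read_all()
    assert received.equals(table)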
I am working on a project where configurable pipelines and lineage tracking of alterations to Spark DataFrames are both essential. The endpoints of this pipeline are usually just modified DataFrames (think of it as an ETL task). What made the most sense to me was to leverage the already existing Spark ML Pipeline API to track these alterations. In particular, the alterations (adding columns based on others, etc.) are implemented as custom Spark ML Transformers.
However, we are now having an internal debate about whether or not this is the most idiomatic way of implementing this pipeline. The other option would be to implement these transformations as a series of UDFs and to build our own lineage tracking based on a DataFrame's schema history (or Spark's internal DataFrame lineage tracking). The argument for this side is that Spark's ML pipelines are not intended for ETL jobs alone, and should always be implemented with the goal of producing a column which can be fed to a Spark ML Evaluator. The argument against this side is that it requires a lot of work that mirrors already existing functionality.
Is there any problem with leveraging Spark's ML Pipelines strictly for ETL tasks? Tasks that only make use of Transformers and don't include Evaluators?
To me it seems like a great idea, especially if you can compose the different Pipelines generated into new ones, since a Pipeline can itself be made of different pipelines, given that Pipeline extends PipelineStage up the tree (source: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.Pipeline).
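For illustration, a minimal sketch of a Transformer-only ETL pipeline; built-in SQLTransformer stages are used here only for brevity, custom Transformers compose the same way, and the column names are invented:

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import SQLTransformer

    spark = SparkSession.builder.getOrCreate()

    input_df = spark.createDataFrame([(10.0, 3), (5.0, 0)], ["price", "quantity"])

    # Two purely ETL-style stages, no Estimator or Evaluator involved
    add_total = SQLTransformer(statement="SELECT *, price * quantity AS total FROM __THIS__")
    keep_valid = SQLTransformer(statement="SELECT * FROM __THIS__ WHERE total > 0")

    etl_pipeline = Pipeline(stages=[add_total, keep_valid])

    # fit() learns nothing when every stage is a Transformer, but the resulting
    # PipelineModel can be persisted and reused, which helps with lineage
    result = etl_pipeline.fit(input_df).transform(input_df)
    result.show()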
But keep in mind that you will probably be doing the same thing under the hood, as explained here (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-mllib/spark-mllib-transformers.html):
Internally, transform method uses Spark SQL’s udf to define a function (based on createTransformFunc function described above) that will create the new output column (with appropriate outputDataType). The UDF is later applied to the input column of the input DataFrame and the result becomes the output column (using DataFrame.withColumn method).
If you have decided on another approach or found a better way, please comment. It's nice to share knowledge about Spark.
Is it possible to save a Spark ML pipeline to a database (Cassandra for example)? From the documentation I can only see the save to path option:
myMLWritable.save(toPath);
Is there a way to somehow wrap or change the myMLWritable.write() MLWriter instance and redirect the output to the database?
It is not possible (or at least not supported) at this moment. The ML writer is not extendable and depends on Parquet files and a directory structure to represent models.
Technically speaking you can extract the individual components and use the internal/private API to recreate the models from scratch, but that is likely the only option.
Spark 2.0.0+
At first glance all Transformers and Estimators implement MLWritable. If you use Spark <= 1.6.0 and experience some issues with model saving, I would suggest switching versions.
Spark >= 1.6
Since Spark 1.6 it's possible to save your models using the save method, because almost every model implements the MLWritable interface. For example, LogisticRegressionModel has it, and therefore it's possible to save your model to the desired path using it.
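A minimal sketch of that save/load round trip (the tiny training set is made up):

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()

    # Tiny made-up training set (label, features)
    training_df = spark.createDataFrame(
        [(0.0, Vectors.dense([0.0, 1.1])), (1.0, Vectors.dense([2.0, 1.0]))],
        ["label", "features"])

    model = LogisticRegression(maxIter=10).fit(training_df)

    # Persist to a path (local, HDFS, S3, ...) and load it back later
    model.save("/tmp/lr_model")
    same_model = LogisticRegressionModel.load("/tmp/lr_model")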
Spark < 1.6
Some operations on DataFrames can be optimized, and that translates to improved performance compared to plain RDDs. DataFrames provide efficient caching, and the SQL-ish API is arguably easier to comprehend than the RDD API.
ML Pipelines are extremely useful, and tools like the cross-validator or the different evaluators are simply a must-have in any machine learning pipeline. Even if none of the above is particularly hard to implement on top of the low-level MLlib API, it is much better to have a ready-to-use, universal and relatively well-tested solution.
I believe that at the end of the day what you get by using ML over MLlib is a quite elegant, high-level API. One thing you can do is combine both to create a custom multi-step pipeline (see the sketch after this list):
use ML to load, clean and transform data,
extract the required data (see for example the extractLabeledPoints method) and pass it to an MLlib algorithm,
add custom cross-validation / evaluation
save the MLlib model using a method of your choice (Spark model or PMML)
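A rough sketch of that ML-to-MLlib hand-off (all column names and the toy data are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.mllib.linalg import Vectors as MLlibVectors
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical raw data: a categorical target plus two numeric features
    raw_df = spark.createDataFrame(
        [("yes", 1.0, 0.5), ("no", 0.2, 3.1), ("yes", 0.9, 0.7)],
        ["outcome", "f1", "f2"])

    # 1. ML side: clean and assemble the data with Pipeline stages
    prep = Pipeline(stages=[
        StringIndexer(inputCol="outcome", outputCol="label"),
        VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
    ]).fit(raw_df)
    prepared = prep.transform(raw_df)

    # 2. Hand the data to an MLlib algorithm as an RDD of LabeledPoint
    points = prepared.select("label", "features").rdd.map(
        lambda row: LabeledPoint(row.label, MLlibVectors.fromML(row.features)))
    mllib_model = LogisticRegressionWithLBFGS.train(points)

    # 3. Persist the MLlib model in Spark's own format
    mllib_model.save(spark.sparkContext, "/tmp/mllib_logreg_model")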
There is also a temporary solution provided in Jira: Temporary Solution.
Spark uses in-memory computing and caching to decrease latency on complex analytics; however, this is mainly for "iterative algorithms".
If I needed to perform a more basic analytic, say each element was a group of numbers and I wanted to find the elements with a standard deviation less than 'x', would Spark still decrease latency compared to regular cluster computing (without in-memory computing)? Assume I used the same commodity hardware in each case.
It tied for the top sorting framework using none of those extra mechanisms, so I would argue that is reason enough. But, you can also run streaming, graphing, or machine learning without having to switch gears. Then, you add in that you should use DataFrames wherever possible and you get query optimizations beyond any other framework that I know of. So, yes, Spark is the clear choice in almost every instance.
One good thing about Spark is its Data Source API; combining it with Spark SQL gives you the ability to query and join different data sources together. Spark SQL now includes a decent optimizer, Catalyst. As mentioned in one of the answers, along with the core (RDD) in Spark you can also work with streaming data, apply machine learning models and run graph algorithms. So yes.
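For concreteness, the analytic described in the question might look like this with the DataFrame API (a minimal sketch; the table, columns and threshold are hypothetical):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Made-up groups of numbers; keep the groups whose standard deviation is below x
    x = 5.0
    measurements = spark.createDataFrame(
        [("a", 1.0), ("a", 2.0), ("b", 1.0), ("b", 50.0)],
        ["group_id", "value"])

    low_variance_groups = (
        measurements.groupBy("group_id")
        .agg(F.stddev("value").alias("stddev_value"))
        .filter(F.col("stddev_value") < x))
    low_variance_groups.show()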