I am getting the error "AnalysisException: Cannot redefine dataset" in my DLT pipeline. I am using a for loop to trigger multiple flows, trying to load different sources into the same target using dlt.create_target_table and dlt.apply_changes. So my pipeline ends up trying to define the same target table for different inputs.
My inputs are [{Source: src_A, Target: tgt}, {Source: src_B, Target: tgt}]. As mentioned in the Databricks cookbook, is union the only choice for combining multiple sources into one target? Can anyone help with this?
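For reference, a minimal sketch of the union-based pattern the cookbook points at might look like the following. It assumes src_A and src_B are DLT datasets in the same pipeline; the key column id and sequence column ts are placeholders, not part of the question.

```python
import dlt
from pyspark.sql.functions import col

# Hypothetical source dataset names from the input list above.
sources = ["src_A", "src_B"]

# Union the sources in a single view so the target is only defined once.
@dlt.view(name="combined_src")
def combined_src():
    dfs = [dlt.read_stream(s) for s in sources]
    out = dfs[0]
    for df in dfs[1:]:
        out = out.unionByName(df)
    return out

# Define the target exactly once...
dlt.create_target_table(name="tgt")

# ...and run a single apply_changes flow against the combined view.
dlt.apply_changes(
    target="tgt",
    source="combined_src",
    keys=["id"],            # hypothetical primary-key column
    sequence_by=col("ts"),  # hypothetical sequencing column
)
```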
How can I combine multiple datasets into one using Azure Machine Learning Studio? (The experiment graph doesn't work; the screenshot is not included here.)
The same question was asked here: https://learn.microsoft.com/en-us/answers/questions/666021/unable-to-use-34join-data34-to-combine-multiple-da.html
As per the official documentation, the Join Data module does not support a right outer join, so if you want to ensure that rows from a particular dataset are included in the output, that dataset must be connected to the left-hand input.
For more information, follow this link: How to configure Join Data.
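The Join Data module itself is configured in the Studio UI, but the reasoning behind "put the keep-everything dataset on the left" is the usual join equivalence. Here is an analogous illustration in PySpark with made-up data: a right outer join of A to B returns the same rows as a left outer join of B to A.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data: 'a' and 'b' share the key column 'id'.
a = spark.createDataFrame([(1, "a1"), (2, "a2")], ["id", "a_val"])
b = spark.createDataFrame([(2, "b2"), (3, "b3")], ["id", "b_val"])

# A right outer join of a to b keeps every row of b...
right = a.join(b, "id", "right_outer")

# ...which is the same set of rows as a left outer join with b on the left.
left = b.join(a, "id", "left_outer")

right.show()
left.show()
```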
We want to create a Spark-based streaming data pipeline that consumes from a source (e.g. Kinesis), applies some basic transformations, and writes the data to a file-based sink (e.g. S3). We have thousands of different event types coming in, and the transformations would take place on a set of common fields. Once the events are transformed, they need to be split by writing them to different output locations according to the event type. This pipeline is described in the figure below.
Goals:
To infer the schema safely in order to apply transformations based on the merged schema. The assumption is that the event types are compatible with each other (i.e. their schema structures do not overlap), but the schema of any of them can change at unpredictable times. The pipeline should handle this dynamically.
To split the output after the transformations while keeping the original individual schema.
What we considered:
Schema inference seems to work fine on sample data. But is it safe for production use cases and for a large number of different event types?
Simply using partitionBy("type") while writing out is not enough because it would use the merged schema.
Doing the same here: casting everything to string, using marshmallow to validate, and then using from_json in a foreach, as in https://www.waitingforcode.com/apache-spark-structured-streaming/two-topics-two-schemas-one-subscription-apache-spark-structured-streaming/read. This seems the more reasonable approach (a rough sketch of the idea follows this list).
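As an illustration only, here is a minimal sketch of the split-and-parse idea using foreachBatch (the linked article uses foreach). The Kinesis connector options, the payload and type columns, the per-type schemas, and the S3 paths are all assumptions, not part of the original question.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, get_json_object
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("split-by-event-type").getOrCreate()

# Hypothetical per-type schemas; in practice these might come from a registry
# or be re-inferred periodically from sampled data.
schemas = {
    "click": StructType([StructField("user_id", StringType()), StructField("url", StringType())]),
    "purchase": StructType([StructField("user_id", StringType()), StructField("amount", StringType())]),
}

# Read raw events as strings and pull out the event type; the connector,
# option names, and the JSON field "$.type" are assumptions.
raw = (
    spark.readStream
    .format("kinesis")
    .option("streamName", "events")
    .load()
    .selectExpr("CAST(data AS STRING) AS payload")
    .withColumn("type", get_json_object(col("payload"), "$.type"))
)

def write_by_type(batch_df, batch_id):
    # Parse each event type with its own schema and write it to its own prefix,
    # so every output location keeps the original per-type schema.
    for event_type, schema in schemas.items():
        (
            batch_df.filter(col("type") == event_type)
            .withColumn("parsed", from_json(col("payload"), schema))
            .select("parsed.*")
            .write.mode("append")
            .parquet(f"s3://my-bucket/events/{event_type}/")  # hypothetical sink path
        )

query = raw.writeStream.foreachBatch(write_by_type).start()
```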
PMML, MLeap, and PFA currently only support row-based transformations. None of them support frame-based transformations like aggregates, group-bys, or joins. What is the recommended way to export a Spark pipeline consisting of these operations?
I see two options with respect to MLeap:
1) Implement DataFrame-based transformers and an MLeap equivalent of SQLTransformer. This solution seems conceptually the best (since you can always encapsulate such transformations in a pipeline element), but it is also a lot of work, to be honest. See https://github.com/combust/mleap/issues/126
2) Extend DefaultMleapFrame with the respective operations you want to perform, and then apply the required actions to the data handed to the REST server within a modified MleapServing subproject.
I actually went with option 2) and added implode, explode, and join as methods to DefaultMleapFrame, plus a HashIndexedMleapFrame that allows for fast joins. I did not implement groupBy and agg, but in Scala this is relatively easy to accomplish.
PMML and PFA are standards for representing machine learning models, not data processing pipelines. A machine learning model takes in a data record, performs some computation on it, and emits an output data record. So by definition, you are working with a single isolated data record, not a collection/frame/matrix of data records.
If you need to represent complete data processing pipelines (where the ML model is just one part of the workflow), then you need to look for other or combined standards. Perhaps SQL paired with PMML would be a good choice. The idea is that you want to perform data aggregation outside of the ML model, not inside it (e.g. a SQL database will be much better at it than any PMML or PFA runtime).
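To make the "aggregate outside the model" idea concrete, here is a small sketch. The column names are made up and a plain Python function stands in for a PMML/PFA scorer: the frame-based work happens in Spark, and only the resulting rows are handed to the row-based model.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy event data; column names are hypothetical.
events = spark.createDataFrame(
    [("u1", 3.0), ("u1", 5.0), ("u2", 1.0)],
    ["user_id", "amount"],
)

# Frame-based step: the aggregation is done in Spark (or a SQL database),
# not inside the PMML/PFA/MLeap runtime.
features = events.groupBy("user_id").agg(
    F.count("*").alias("n_events"),
    F.sum("amount").alias("total_amount"),
)

# Row-based step: each aggregated record is scored one at a time, which is
# all a row-based model runtime needs. The formula is a stand-in for a real model.
def score(row):
    return 0.1 * row.n_events + 0.01 * row.total_amount

for row in features.collect():
    print(row.user_id, score(row))
```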
In Azure Data Factory v2 I've created a number of pipelines. I noticed that for each pipeline I create, a source and a destination dataset are created.
According to the ADF documentation: A dataset is a named view of data that simply points or references the data you want to use in your activities as inputs and outputs.
These datasets are visible within my data factory. I'm curious: why would I care about these? They almost seem like 'under the hood' objects ADF creates to move data around. What value are they to me, and why would I care about them?
These datasets are entities that can be reused. For example, dataset A can be referenced by many pipelines if those pipelines need the same data (same table or same file).
Linked services can be reused too. I think that's why ADF has these concepts.
You may be seeing those show up in your Factory if you create pipelines via the Copy Wizard Tool. That will create Datasets for your Source & Sink. The Copy Activity is the primary consumer of Datasets in ADF Pipelines.
If you are using ADFv2 to transform data, no dataset is required. But if you are using the ADF copy activity to copy data, a dataset is used to let ADF know the path and name of the object to copy from/to. Once you have one dataset created, it can be used in many pipelines. Could you please help me understand why creating a dataset is a point of friction for you in your projects?
I am working on a project where configurable pipelines and lineage tracking of alterations to Spark DataFrames are both essential. The endpoints of this pipeline are usually just modified DataFrames (think of it as an ETL task). What made the most sense to me was to leverage the already existing Spark ML Pipeline API to track these alterations. In particular, the alterations (adding columns based on others, etc.) are implemented as custom Spark ML Transformers.
However, we are now having an internal debate about whether or not this is the most idiomatic way of implementing this pipeline. The other option would be to implement these transformations as a series of UDFs and to build our own lineage tracking based on a DataFrame's schema history (or Spark's internal DataFrame lineage tracking). The argument for this side is that Spark's ML pipelines are not intended for plain ETL jobs and should always be implemented with the goal of producing a column that can be fed to a Spark ML Evaluator. The argument against this side is that it requires a lot of work that mirrors already existing functionality.
Is there any problem with leveraging Spark's ML Pipelines strictly for ETL tasks? Tasks that only make use of Transformers and don't include Evaluators?
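For what it's worth, a pure-Transformer pipeline of the kind described is straightforward to write; here is a minimal sketch (the column names and the derived column are made up):

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col
from pyspark.ml import Pipeline, Transformer

class AddTotalPrice(Transformer):
    """ETL-style stage: derives 'total_price' from 'quantity' * 'unit_price'."""
    def _transform(self, df: DataFrame) -> DataFrame:
        return df.withColumn("total_price", col("quantity") * col("unit_price"))

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(2, 3.5), (5, 1.0)], ["quantity", "unit_price"])

# A Pipeline made only of Transformers: fit() is effectively a no-op and
# just returns a PipelineModel wrapping the same stages.
pipeline = Pipeline(stages=[AddTotalPrice()])
result = pipeline.fit(df).transform(df)
result.show()
```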
To me it seems like a great idea, especially if you can compose the different Pipelines generated into new ones, since a Pipeline can itself be made up of other Pipelines (a Pipeline extends PipelineStage up the tree; source: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.Pipeline).
But keep in mind that you will probably be doing the same thing under the hood, as explained here (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-mllib/spark-mllib-transformers.html):
Internally, transform method uses Spark SQL’s udf to define a function (based on createTransformFunc function described above) that will create the new output column (with appropriate outputDataType). The UDF is later applied to the input column of the input DataFrame and the result becomes the output column (using DataFrame.withColumn method).
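Spelled out directly, the quoted mechanism amounts to roughly the following (hypothetical column names; the lambda stands in for createTransformFunc):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (2.0,)], ["input_col"])

# The UDF plays the role of createTransformFunc; withColumn attaches the
# output column, which is what a Transformer's transform() ends up doing.
double_it = udf(lambda x: x * 2.0, DoubleType())
result = df.withColumn("output_col", double_it(col("input_col")))
result.show()
```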
If you have decided on another approach or found a better way, please comment. It's nice to share knowledge about Spark.