I am trying to merge between two streams in Perforce. For example, Stream A is the parent and Stream B is a child of Stream A, and I want to merge down from Stream A to Stream B. When I try to merge between these, the merge dialog box shows three options:
stream to stream merge
specify source and target files
use branch mapping
I have a question here: what is the difference between the 1st and the 3rd option?
These are all just different ways to specify a source and target mapping:
1. Here are my streams; figure out how they relate and merge one into the other.
2. Here are two sets of files; merge one into the other.
3. Here is a branch mapping defining two sets of files; merge one into the other.
The purpose of having streams is largely to make #2 and #3 obsolete -- when you merge two streams, a branch mapping is automatically generated internally to perform the merge. If you don't use streams, you define your own branch mappings (which gives you some flexibility but is also more work).
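For reference, here is a rough sketch of the command-line equivalents of the three options. The stream, path, and branch-spec names (//Streams/A, //Streams/B, my-branch-map) are hypothetical, and the default direction of a stream merge is worth confirming with p4 help merge on your server version.

    # 1: stream-to-stream merge -- Perforce generates the branch view from the stream spec
    p4 merge -S //Streams/B
    # 2: specify source and target files explicitly
    p4 integrate //Streams/A/... //Streams/B/...
    # 3: use a branch mapping you defined yourself (p4 branch my-branch-map)
    p4 integrate -b my-branch-map
    # in every case, finish with
    p4 resolve
    p4 submit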
I'm reading documentation about lakeFS and right now I don't clearly understand what a merge, or even a merge conflict, is in terms of lakeFS.
Let's say I use Apache Hudi for ACID support over a single table. I'd like to introduce multi-table ACID support and for this purpose would like to use lakeFS together with Hudi.
If I understand everything correctly, lakeFS is a data-agnostic solution and knows nothing about the data itself. lakeFS only establishes boundaries (version control) and somehow moderates the concurrent access to the data.
So the reasonable question is: if lakeFS is data agnostic, how does it support the merge operation? What does a merge itself mean in terms of lakeFS? And is it possible to have a merge conflict there?
You do understand everything correctly. As you can see on the branching model page, lakeFS is currently data agnostic and relies simply on the hierarchical directory structure. A conflict occurs when two branches update the same file.
This behavior fits most data engineers' CI/CD use cases.
In case you are working with Delta Lake and made changes to the same table from two different branches, there will still be a conflict because the two branches changed the log file. In order to resolve the conflict you would need to forgo one of the change sets.
Admittedly this is not the best user experience, and it's currently being worked on. You can read more about it in the roadmap documentation.
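To make the "same file" conflict concrete, here is a rough lakectl sketch. The repository and branch names are hypothetical, and exact flag spellings can differ between lakectl versions, so treat it as an illustration rather than exact syntax.

    lakectl branch create lakefs://my-repo/etl-a -s lakefs://my-repo/main
    lakectl branch create lakefs://my-repo/etl-b -s lakefs://my-repo/main

    # ... both branches now write a new version of the same object,
    # e.g. tables/t1/part-0001.parquet (via the S3 gateway or lakectl fs upload) ...

    lakectl merge lakefs://my-repo/etl-a lakefs://my-repo/main    # merges cleanly
    lakectl merge lakefs://my-repo/etl-b lakefs://my-repo/main    # conflict: the same path changed on both sides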
We want to create a Spark-based streaming data pipeline that consumes from a source (e.g. Kinesis), applies some basic transformations, and writes the data to a file-based sink (e.g. S3). We have thousands of different event types coming in, and the transformations would take place on a set of common fields. Once the events are transformed, they need to be split by writing them to different output locations according to the event type. This pipeline is described in the figure below.
Goals:
To infer schema safely in order to apply transformations based on the merged schema. The assumption is that the event types are compatible with each other (i.e. without overlapping schema structure) but the schema of any of them can change at unpredictable times. The pipeline should handle it dynamically.
To split the output after the transformations while keeping the original individual schema.
What we considered:
Schema inference seems to work fine on sample data, but is it safe for production use cases and for a large number of different event types?
Simply using partitionBy("type") while writing out is not enough because it would use the merged schema.
Doing the same as in https://www.waitingforcode.com/apache-spark-structured-streaming/two-topics-two-schemas-one-subscription-apache-spark-structured-streaming/read (casting everything to string, validating with marshmallow, and then using from_json in a foreach) seems the most reasonable approach.
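Here is a minimal PySpark sketch of that "keep the payload as a string, split by type, and apply each schema only at write time" idea, under a few assumptions not stated in the post: events are JSON strings with a top-level type field, the per-type schemas live in a registry dict, and SCHEMAS, OUTPUT_ROOT, and the Kinesis source options are hypothetical (the format/option names depend on which Kinesis connector you use).

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    spark = SparkSession.builder.appName("split-by-event-type").getOrCreate()

    OUTPUT_ROOT = "s3://my-bucket/events"              # hypothetical sink location
    SCHEMAS = {                                        # hypothetical per-type schemas
        "click": StructType([StructField("user_id", StringType()),
                             StructField("ts", LongType())]),
        "purchase": StructType([StructField("user_id", StringType()),
                                StructField("amount", StringType()),
                                StructField("ts", LongType())]),
    }

    # Read the raw payload as a string; no merged schema is ever inferred upstream.
    raw = (spark.readStream
           .format("kinesis")                          # assumes a Kinesis connector on the classpath
           .option("streamName", "events")
           .option("region", "us-east-1")
           .load()
           .select(F.col("data").cast("string").alias("json")))

    # Common transformations on shared fields can be applied here.
    parsed = raw.select(F.get_json_object("json", "$.type").alias("type"), "json")

    def write_per_type(batch_df, batch_id):
        # Apply each type's own schema only at write time, so every output
        # location keeps its original individual schema.
        batch_df.persist()
        for event_type, schema in SCHEMAS.items():
            (batch_df.filter(F.col("type") == event_type)
                     .select(F.from_json("json", schema).alias("event"))
                     .select("event.*")
                     .write.mode("append")
                     .parquet(f"{OUTPUT_ROOT}/type={event_type}"))
        batch_df.unpersist()

    query = (parsed.writeStream
             .foreachBatch(write_per_type)
             .option("checkpointLocation", f"{OUTPUT_ROOT}/_checkpoints")
             .start())
    query.awaitTermination()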
According to the documentation on specifying mappings,
https://www.perforce.com/manuals/v15.1/dvcs/_specify_mappings.html
it seems that I can only specify a one-to-one mapping. Is there any way I can specify a mapping from two sources into one destination?
For example:
//stream/main/... //depot/main/...
//stream/build/... //depot/main/...
Branch mappings are one-to-one. If you want to integrate multiple sources into one target, you need multiple branch mappings and multiple integrate commands. (I would recommend multiple submits as well; it is technically possible to squash multiple integrations into one submit but it multiplies the complexity of the conflict resolution process.)
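A sketch of what that looks like in practice, using hypothetical branch-spec names (main-to-depot, build-to-depot); each mapping gets its own integrate/resolve/submit cycle:

    Branch: main-to-depot
    View:
        //stream/main/... //depot/main/...

    Branch: build-to-depot
    View:
        //stream/build/... //depot/main/...

    p4 integrate -b main-to-depot
    p4 resolve
    p4 submit -d "Integrate //stream/main into //depot/main"

    p4 integrate -b build-to-depot
    p4 resolve
    p4 submit -d "Integrate //stream/build into //depot/main"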
YMMV, but I'm pretty sure that after 2004.1 you should be able to use the + syntax to append rules instead of overwriting, like:
//stream/main/... //depot/main/...
+//stream/build/... //depot/main/...
Here is the associated reference on Perforce views.
I have a PBI file in which I made a Power Query query with several steps (such as merging two files).
I would like to produce several output files originating from this query, but with specific steps for each of these files (I made a series of specific changes to the data in these queries).
When I refresh my PBI file, I would like all the queries, from the original one to the three queries originating from it, to be updated.
I would also like the three other queries to be impacted if I add a new step to the original query.
So far I have used Copy:
I took my original query, right-clicked it, and simply used the "Copy" option. However, this also duplicates the previously merged files used to create the query.
I see that there are also the "Duplicate" and "Reference" options in Power BI.
I tried doing some research and I read the following on "Duplicate":
"Duplicate copies a query with all the applied steps of it as a new query; an exact copy."
For me this seemed exactly the same as a "Copy", so I expected that I would get a copy of the previously merged files when I duplicated the query. But no: I tested it and only the selected query got duplicated.
When I did the test with "Reference", however, my query appeared, but only the result this time (not the data used to create it), and it had no steps. When I try to click on "Source", I cannot "see" the source.
I am thus puzzled as to the best way forward and, more broadly, the best practices to adopt.
Which option should I choose so that PBI applies the same steps each time I refresh my source, i.e. merging the two files and then applying a series of specific steps to three copies of my source?
I suspect you want to do the following:
Load Universe 1 and Universe 2
Merge them into a single table Merge1
Create three new queries that reference Merge1
Create specific steps for each of the new queries
This way, each of the new queries starts at exactly the same place without having to load Universe 1 and Universe 2 three separate times.
If you go to your query dependencies view, it should look like this:
Notice that I've disabled load (to model) for the top 3 tables since we probably only need them as staging tables.
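To make the "reference" step concrete, here is a rough Power Query (M) sketch of one of the three new queries; the query name Merge1 is taken from the steps above, while the column Category and the filter value are hypothetical:

    // e.g. a query named "Output1", created via right-click > Reference on Merge1
    let
        Source = Merge1,
        // specific steps for this output only
        FilteredRows = Table.SelectRows(Source, each [Category] = "A")
    in
        FilteredRows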
I am trying to merge the entire content from another branch in Perforce (using either source and target files or a branch mapping), but I don't want to merge the files that are supposed to be auto-generated based on other files.
Is there some way to define a set of files which should be skipped from merge when doing a branch merge? I would like to reuse the solution multiple times and the maintenance overhead should be low if possible.
You could set up a branch mapping; see http://www.perforce.com/perforce/doc.current/manuals/p4guide/06_codemgmt.html.
If your generated files have some pattern that makes them easy to identify with a P4 file specification, or if they live in a small subset of folders, then it should be pretty easy to exclude them from your mapping.
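For example, here is a hedged sketch of a branch mapping with exclusion lines (the branch-spec name, depot paths, and file patterns are all hypothetical); the leading minus removes the matching files from the view, so an integrate with this mapping never touches them:

    Branch: main-to-release
    View:
        //depot/main/... //depot/release/...
        -//depot/main/generated/... //depot/release/generated/...
        -//depot/main/.../*.gen.cpp //depot/release/.../*.gen.cpp

    p4 integrate -b main-to-release
    p4 resolve
    p4 submit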