Measuring distance of values / strings

Martin Fowler discusses event sourcing (https://martinfowler.com/eaaDev/EventSourcing.html), i.e. storing data as a set of events.
An example would be an account. You create an account with a balance of 0.
Then you deposit $10, withdraw $5, and deposit another $100. The balance is now $105, but you don't store $105. What you do store is
+10
-5
+100
as a series of events in the database.
Now if I want, I can say "undo the last 2 steps": I just remove the last 2 changes from the database -> the balance is 10.
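In code, a minimal sketch of that account example could look like this (Python, purely illustrative, not part of any real system):

events = [+10, -5, +100]       # the stored events: deposit 10, withdraw 5, deposit 100

def balance(events):
    return sum(events)         # replaying the events yields the current state

print(balance(events))         # 105
print(balance(events[:-2]))    # "undo the last 2 steps" -> 10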
Now: how can you do that with strings?
Say the account name is initially an empty string. Then it becomes "dirk dietmeier", then "hans hansenmann", then "foo bar". How can you capture this data as a set of changes, while keeping it reversible, i.e. each event needs to be able to reverse itself? You could just say "delete everything and then put 'foo bar'", but is there no better solution?
Is there an SVN- or git-like algorithm? Some encoding (hex, binary?)?

Now if I want I can say "undo the last 2 steps." then I just remove the last 2 changes in the database -> account is 10
Not if you want to preserve the history. In production event-sourced applications, I would issue a compensating event, i.e. a new event Y that undoes what event X did. The git analogue to this would be git revert.
Now: how can you do that with strings?
It depends on your application.
If you are tracking changes to code, it makes sense to do some research on how to express differences between two files, such that you can revert at a later time. In this sense, your event would be similar to a git commit. I suggest you look at the Linux diff command (http://linuxcommand.org/man_pages/diff1.html) and at its source code, or at how you could implement it yourself.
If your event is something like CustomerFirstNameChanged, doing a diff makes very little sense. You would always want to revert to a previous state such as John or Rick.
The second approach (reverting to a previous state) would also make sense with an event such as ArticleRedrafted, where you can go back to a previous version. Content editors don't see revisions the way we see git commits when we use git revert; they see them as points in time that can be returned to.
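As a small illustration of that second case (a sketch of my own; the class and field names are made up, not taken from any framework), an event that stores the previous state can produce its own compensating event:

from dataclasses import dataclass

@dataclass
class CustomerFirstNameChanged:
    old_value: str             # e.g. "John"
    new_value: str             # e.g. "Rick"

    def compensating_event(self):
        # The git-revert analogue: a new event that undoes this one
        # by swapping old and new, restoring the previous state.
        return CustomerFirstNameChanged(old_value=self.new_value,
                                        new_value=self.old_value)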

There are many ways you could represent the change from one string to another string as a reversible operation. It really would depend on requirements.
I would not expect the deltification in a source control system to necessarily meet the general needs you've outlined above. For example, git's purpose in deltifying files for storage (which it only does when generating a pack) is to save space. Older versions of an object may be stored as deltas from newer versions, but there is never a need to reconstruct the newer version via delta from the older version - it's stored in its entirety specifically so that it can be accessed quickly (without need to combine deltas). So storing enough information for the delta to be used as a two-way transformation would be wasteful from git's point of view.
Instead of looking at the pack deltas, you could think about how diff represents a change. This is more line-oriented than you seem to want. If you have an edited file, you might represent the edits as
d37
- Line 37 used to contain this text, but it was removed
a42
+ We added this text on line 42
c99
- Line 99 used to have this text
+ Line 99 now has this text
You could do a character-oriented version of this. Say you define operators for "add characters at offset", "remove characters at offset", and "change characters at offset". You either have to be careful with escaping delimiters, or use explicit lengths instead of simply delimiting strings. And you also should consider that each edit might change the offset of the subsequent edits. Technically pre-image offsets have all the information you need, but reversal of the patch is more intuitive if you also store the post-image offsets.
So given
well, here is the original string
you might have a reversible delta like
0/0d6:well, 14/8c12:the original1:a33/16a11: with edits
to yield
here is a string with edits
Of course, calculating the "most correct" patch may not be easy.
0/0c33:well, here is the original string27:here is a string with edits
would be equally valid but degenerates to the line-oriented approach.
But that's just one example; like I said, you could define any number of ways, depending on requirements.
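To make that concrete, here is a minimal Python sketch (my own, assuming it is acceptable to store both the pre-image and post-image text and offsets for every change) of a reversible character-level delta built from difflib opcodes, applied to the example above:

import difflib

def make_delta(old, new):
    # Keep every non-equal opcode together with the text it removes and adds,
    # so the delta works as a two-way transformation.
    sm = difflib.SequenceMatcher(a=old, b=new)
    return [(op, i1, old[i1:i2], j1, new[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]

def apply_delta(old, delta):
    # Forward transform: old -> new, walking the pre-image offsets.
    out, pos = [], 0
    for op, i1, old_text, j1, new_text in delta:
        out.append(old[pos:i1])        # unchanged run
        out.append(new_text)           # inserted/replacement text ('' for deletes)
        pos = i1 + len(old_text)
    out.append(old[pos:])
    return "".join(out)

def revert_delta(new, delta):
    # Reverse transform: new -> old, walking the post-image offsets.
    out, pos = [], 0
    for op, i1, old_text, j1, new_text in delta:
        out.append(new[pos:j1])
        out.append(old_text)
        pos = j1 + len(new_text)
    out.append(new[pos:])
    return "".join(out)

old = "well, here is the original string"
new = "here is a string with edits"
d = make_delta(old, new)
assert apply_delta(old, d) == new
assert revert_delta(new, d) == old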


VariesAcrossGroups lost when ReInsert_ing doc.ParameterBindings?

Our plugin maintains some instance parameter values across many elements, including those in groups.
Occasionally the end users will introduce data that activates an unused Category,
so we have to update the document parameter bindings to include those categories. However, when we call
doc.ParameterBindings.ReInsert()
our existing parameter values inside groups are lost, because our VariesAcrossGroups flag is toggled back to false.
How did Revit intend this to work - are we supposed to use this in a different way, to not trigger this problem?
ReInsert() expects a base Definition argument, and would usually get an ExternalDefinition supplied.
To learn, I instead tried to scan through the definition keys of the existing bindings and match those.
This way, I got the document's InternalDefinition, and tried calling ReInsert with that instead
(my hope was that, since the existing InternalDefinition DID include VariesAcrossGroups=true, this would help). Alas, ReInsert doesn't seem to care.
The problem, as you might guess, is that after VariesAcrossGroups=False, a lot of my instance parameters have collapsed into each other, so they all hold identical values. Given that they are IDs, this is less than ideal.
My current (intended) solution is to grab a backup of all existing parameter values BEFORE I update the bindings, then, after the binding update and after setting VariesAcrossGroups back to true, inspect all values and re-assign every parameter value that has been broken. But as you may surmise, this is less than ideal: it will be horribly slow for our users, and frankly it seems like something the Revit API should take care of, not the plugin developer.
Are we using this the wrong way?
One approach I have considered is to bind every possible category I can think of, up front and once only. But I'm not sure that is possible. Categories in themselves are also difficult to work with, as you can only create them indirectly, by using your project Document as a factory (i.e. you cannot create a category yourself, you can only indirectly ask the Document to - maybe! - create a category for you that you request). Because of this, I don't think you can bind all categories up front - some categories only become available in the document AFTER you have included a given family/type in your project.
To sum it up: First, I
doc.ParameterBindings.ReInsert()
my binding, with the updated categories. Then, I call
InternalDefinition.SetAllowVaryBetweenGroups()
(after having determined that IDEF.VariesAcrossGroups has reverted back to false).
I am interested to hear the best way to do this, without destroying the client's existing data.
Thank you very much in advance.
(I'm not sure I will accept my own answer).
My answer is just that you can survive/circumvent this problem
by scanning the entire Revit database for your existing parameter values before you update the document bindings.
Afterwards, you reset VariesAcrossGroups back to its lost value.
Then, you iterate through your collected parameters, and verify which ones have lost their original value, and reset them back to their intended value.
One trick that speeds this up a bit is that you can check Element.GroupId <> -1, i.e. restrict the scan to elements that are group members.
You only need to track elements which are group members, as it's precisely those that are affected by this Revit bug.
A further tip is that you should not only watch out for parameter values that have lost their original value; you must also watch out for parameter values that have accidentally GOTTEN a value but which should be left un-set.
I just use FilteredElementCollector with WhereElementIsNotElementType().
Performance-wise, it is of course horrible to do all this,
but given how Revit behaves, I see no other solution if you have to ship to your clients.
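For what it's worth, the backup/restore pattern stripped of the Revit specifics looks roughly like this (a Python sketch of my own; the accessor callbacks are stand-ins for the actual API calls such as the FilteredElementCollector, Element.GroupId and the parameter get/set methods, and are not real function names):

def snapshot_values(elements, get_group_id, get_id, get_value):
    # Back up the parameter value of every group member (GroupId <> -1).
    return {get_id(e): get_value(e) for e in elements if get_group_id(e) != -1}

def restore_values(elements, get_group_id, get_id, get_value, set_value, snapshot):
    # After the rebind and SetAllowVaryBetweenGroups(True): restore values that
    # were lost, and clear values that were accidentally propagated onto
    # elements that should stay un-set (snapshot value of None).
    for e in elements:
        if get_group_id(e) == -1:
            continue
        before = snapshot.get(get_id(e))
        if get_value(e) != before:
            set_value(e, before)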

NodaTime TimeZone Data files naming

It appears that the time zone database files used by Nodatime are named by year, with releases within the same year incrementing by a
letter - i.e., "tzdb2019a.nzd" is current as I write this, the next release
will be "tzdb2019b.nzd", and some of the previous versions may have been
"tzdb2018a.nzd", "tzdb2018b.nzd", "tzdb2018c.nzd", etc.
However, I have not been able to find this naming convention formally documented anywhere, and assumptions make me nervous.
I expect the time zone data to change more often than my application
is updated, so the application periodically checks for the latest data file at
https://nodatime.org/tzdb/latest.txt, and downloads a new file if the one in
use is different. Eventually there will be several files locally available.
I want to know that I can sort these by name and be assured that I can
identify the most recent from among those that have already been
downloaded.
That's what I anticipate, certainly. We use the versioning from the IANA time zone page, just with a tzdb prefix and a .nzd suffix. So far, that's been enough, and it has maintained the sort order.
It's possible that we might want to provide other files at some point, e.g. if there are no IANA changes for a long time (as if!) but the CLDR Windows mapping files change significantly. I don't have any concrete plans for what I'd do in that case, but I can imagine something like tzdb2019-2.nzd etc.
It's hard to suggest specific mitigations against this without knowing the exact reason for providing other files, but you could potentially only download files if they match a regex of tzdb\d{4}[a-z]+\.nzd.
I'd certainly communicate on the Noda Time discussion group before doing anything like this, so if you subscribe there you should get advance warning.
Another nasty possibility is that we might need more than 26 releases in a single calendar year... IANA says that would go 2020a...2020z, then 2020za...2020zz etc. The above regex handles that situation, and it stays sortable in the normal way.
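As a small illustration, here is a sketch of the check described above (Python, my own; the file names are just examples, and the regex is the one suggested, with the dot escaped):

import re

TZDB_NAME = re.compile(r"tzdb\d{4}[a-z]+\.nzd")

def latest_tzdb(filenames):
    # Ignore anything that doesn't match the expected naming scheme, then rely
    # on the IANA-style versioning (2019a, 2019b, ... 2020z, 2020za, ...)
    # staying lexicographically sortable.
    candidates = [f for f in filenames if TZDB_NAME.fullmatch(f)]
    return max(candidates, default=None)

print(latest_tzdb(["tzdb2018c.nzd", "tzdb2019a.nzd", "tzdb2018h.nzd"]))
# -> tzdb2019a.nzd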
Another option I could provide is an XML or JSON format for "all releases" - so just like there's https://nodatime.org/tzdb/index.txt that just lists the files, I could provide https://nodatime.org/tzdb/index.json that lists the files and release dates. If you kept hold of that file along with the data, you'd always have more information. Let me know if that's of interest to you and I'll look into implementing it.

What's wrong with my SORT function here?

First off, I'm a complete beginner to anything Mainframe-related.
I have a training assignment at work to find matching keys in two files using SORT. I submitted this code to my mentor, pseudo-coded here because I can't access the system from home yet and didn't think to copy it before leaving:
//STEP01 EXEC SORT
//SORTIN DD DSN=file1
// DD DSN=file2
//SORTXSUM DD DSN=output file
//SORTOUT DD SYSOUT=*        don't need this data anywhere specific so just tossing it at spool
//SYSIN DD *
SORT FIELDS=(1,22,CH,A)
SUM FIELDS=NONE,XSUM
/*
When I stick a couple of random sequential files in, the output is exactly what I expect it to be. However, my mentor says it doesn't work. His English is kinda bad and I rarely understand what he's saying the first few times he repeats it.
This combined with him mentioning JOINKEYS (before promptly leaving work, of course) makes me think he just wants (needs?) it done a different way and is doing a really poor job of expressing it.
Either way, could someone please tell me whether or not the code I wrote sucks and explain why it apparently falls short of a method using JOINKEYS?
Here's the requirement that your code would satisfy:
Take two unsorted datasets; match them on a 22-byte key; output all the data to one of two files. Where keys are duplicate, pick one record of the matched group (whichever is convenient to you, and the selection cannot be guaranteed to be repeated in a subsequent run) and write it to one output file; write all records not written to the first file to the second file instead.
If that is the requirement, you are on to a winner, as it will perform better than the equivalent JOINKEYS.
The solution can also be modified in a few ways. With OPTION EQUALS or EQUALS on the SORT statement, it will always be the first record of equal keys which will be retained.
For more flexibility on what is retained, DUPKEYS could be used instead of SUM.
If the requirement can be satisfied with SUM or DUPKEYS it is more efficient to use them than to use JOINKEYS.
If the data is already in sequence, but otherwise the requirement is the same, then a SORT is not a good way to do it. You can try a MERGE in place of the SORT, and have a SORTIN01 instead of your SORTIN.
If you had DFSORT instead of SyncSORT, you could use ICETOOL's SELECT operator to do all that XSUM and DUPKEYS can do (and more).
If you are doing something beyond what SUM and DUPKEYS can do, you'll need JOINKEYS.
For instance, if the data is already in sequence, you'd specify SORTED on the JOINKEYS for that input.
On the Mainframe, resources are paid for by the client, so we aim to avoid profligacy. If one way uses fewer resources, we choose that.
Without knowing your exact requirement, I can't tell if your solution is the best :-)

Dynamics CRM 2011 Import Data Duplication Rules

I have a requirement in which I need to import data from Excel (CSV) to Dynamics CRM regularly.
Instead of using simple Data Duplication Rules, I need to implement a point system to determine whether a record is considered a duplicate or not.
Let me give an example. For example these are the particular rules for Import:
First Name, exact match, 10 pts
Last Name, exact match, 15 pts
Email, exact match, 20 pts
Mobile Phone, exact match, 5 pts
And then the Threshold value => 19 pts
Now, if a record has its First Name and Last Name matching an old record in the entity, the score will be 25 pts, which is higher than the threshold (19 pts), and therefore the record is considered a duplicate.
If, for example, the record only has the same First Name and Mobile Phone, the score will be 15 pts, which is lower than the threshold, and it is thus considered a non-duplicate.
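To make the scoring concrete, here is a minimal sketch (Python, my own illustration; the field names and weights come from the example above, everything else is made up):

WEIGHTS = {"first_name": 10, "last_name": 15, "email": 20, "mobile_phone": 5}
THRESHOLD = 19

def duplicate_score(incoming, existing):
    # Sum the weights of the fields that match exactly (case-insensitive).
    score = 0
    for field, points in WEIGHTS.items():
        a, b = incoming.get(field), existing.get(field)
        if a and b and a.strip().lower() == b.strip().lower():
            score += points
    return score

def is_duplicate(incoming, existing_records):
    return any(duplicate_score(incoming, e) > THRESHOLD for e in existing_records)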
What is the best approach to achieve this requirement? Is it possible to utilize the default functionality of Import Data in the MS CRM? Is there any 3rd party Add-on that answer my requirement above?
Thank you for all the help.
Updated
Hi Konrad, thank you for your suggestions, let me elaborate here:
Excel. You could filter out the data using Excel and then, once you've obtained a unique list, import it.
Nice one, but I don't think it is really workable in my case: the data will be coming in regularly from the client in moderate numbers (hundreds to thousands), and typically the client won't check the data for duplicates.
Workflow. Run a process removing any instance calculated as a duplicate.
A workflow is a good idea; however, since it is processed asynchronously, my concern is that in some cases the user may already have updated or changed the inserted data before the workflow finishes working, creating data inconsistency or at the very least a confusing user experience.
Plugin. On every creation of a new record, you'd check if it's to be regarded as duplicate-ish and cancel it's creation (or mark for removal).
I like this approach. So I just import as usual (for example, into the contact entity), but I already have a plugin in place that gets triggered every time a record is created; the plugin checks whether the record is duplicate-ish or not and takes the necessary action.
I haven't been fiddling a lot with duplicate detection, but looking at your criteria you might be able to make rules that match them: pretty much three rules to cover your cases - full name match, last name plus mobile phone match, and email match.
If you want to do the points system, I haven't seen any out-of-the-box components that solve this; however, CRM Extensions have a product called Import Manager that might have that kind of duplicate detection. They claim to have customized duplicate checking. Might be worth asking them about this.
Otherwise it's custom coding that will solve this problem.
I can think of the following approaches to the task; depending on the number of records, the repetitiveness of the import, automation requirements etc., they may all be good in some way. Would you care to elaborate on the current conditions?
Excel. You could filter out the data using Excel and then, once you've obtained a unique list, import it.
Plugin. On every creation of a new record, you'd check if it's to be regarded as duplicate-ish and cancel it's creation (or mark for removal).
Workflow. Run a process removing any instance calculated as a duplicate.
You also need to consider the implications of such elimination of data. There's a mathematical issue. Suppose that the uniqueness radius (i.e. the threshold in this 1D case) is 3. Consider the following set of numbers (it's listed twice, just in different order).
1 3 5 7 -> 1 _ 5 _
3 1 5 7 -> _ 3 _ 7
Are you sure that's the intended result? Under some circumstances, you can even end up with sets of records of different sizes (depending only on the order). I'm a bit curious about why and how this setup came up.
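A tiny sketch (Python, my own) of that order-dependence, using the numbers above with a greedy de-duplication and radius 3:

def dedupe(values, radius=3):
    # Keep a value only if it is at least `radius` away from everything kept so far.
    kept = []
    for v in values:
        if all(abs(v - k) >= radius for k in kept):
            kept.append(v)
    return kept

print(dedupe([1, 3, 5, 7]))  # [1, 5]
print(dedupe([3, 1, 5, 7]))  # [3, 7]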
Personally, I'd go with the plugin, if the above is OK with you. If you need to make sure that some of the unique-ish elements never get omitted, you'd probably be best off applying a test algorithm to a backup of the data. However, that may defeat its purpose.
In fact, it sounds so interesting that I might create the solution for you (just to show it can be done) and blog about it. What's the deadline?

Canonizing diff-of-the-diffs

My question, in short, is the following. (I will explain the rationale later; it's not very relevant to the question itself.) Sometimes, git-diff produces different diffs for the same changeset. This happens, for instance, when a large chunk of code was replaced with a wholly different large chunk of code. Such a change could be expressed in diff terms as "add 2 lines, remove 4 lines, add 5 lines, etc." or as "remove 4 lines, add 7 lines, etc.". I would like it to be somehow consistent.
The question is how to make the diff canonical, i.e. how to make the same change always yield the same diff. Off the top of my head: make the diff "greedy" or something.
The reason it annoys me so much is the following. I make heavy use of a technique we call "diff-of-the-diffs" (bear with me if I'm explaining trivial things). After I resolve git rebase conflicts, I produce the unified diff of the old change (i.e. the specific commit in the branch before the rebase) and of the new change (i.e. the current index against HEAD), and then diff these two diffs against each other. Such a diff-of-the-diffs is amazing as a conflict-resolution sanity check: you can see right away that a specific conflict arose only because of a context change, etc.
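For context, a rough sketch of how the diff-of-the-diffs can be produced (a Python wrapper of my own; the git commands are standard, everything else is illustrative):

import subprocess, difflib, sys

def git(*args):
    return subprocess.run(["git", *args], capture_output=True, text=True, check=True).stdout

def diff_of_diffs(old_commit):
    old_change = git("diff", f"{old_commit}^", old_commit)  # the commit in the branch before the rebase
    new_change = git("diff", "--cached")                    # the current index against HEAD
    return "".join(difflib.unified_diff(
        old_change.splitlines(keepends=True),
        new_change.splitlines(keepends=True),
        fromfile="old-change.diff", tofile="new-change.diff"))

if __name__ == "__main__":
    sys.stdout.write(diff_of_diffs(sys.argv[1]))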
