Add a running total to a PowerQuery result - Excel

I am trying to create a count of dates that occur during particular weeks (the last five weeks, including this week), plus a running total count of dates that also includes records from before that five-week window.
I could do this fairly easily with standard Excel formulas, but I'm trying to learn PowerQuery and the M language.
My first query returns the relevant rows from the original source data (another query that pulls everything from the Excel table and forces the correct data type for each column).
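For reference, a Source_Data query of that shape might look roughly like this (the workbook table name and the column types here are just examples, not the actual query):
let
    // "Turnbacks" is a placeholder table name; the types below are assumptions.
    Source = Excel.CurrentWorkbook(){[Name = "Turnbacks"]}[Content],
    Typed = Table.TransformColumnTypes(
        Source,
        {
            {"VENDOR CODE", type text},
            {"Supplier Lookup", type text},
            {"Cancelled?", type text},
            {"Section A#(lf)Turnbacks", type date}
        }
    )
in
    Typed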
The dates I'm trying to count are contained in the Section A#(lf)Turnbacks column.
The query is called Triage_Turnbacks.
let
    Source = #"Source_Data",
    #"Filter_Rows" = Table.SelectRows(Source, each ([#"Cancelled?"] = null) and ([#"Section A#(lf)Turnbacks"] <> null)),
    #"Removed_Columns" = Table.SelectColumns(Filter_Rows, {"VENDOR CODE", "Supplier Lookup", "Section A#(lf)Turnbacks"}),
    #"Duplicate_Date_Column" = Table.DuplicateColumn(Removed_Columns, "Section A#(lf)Turnbacks", "Start_Of_Week"),
    #"Start_of_Week" = Table.TransformColumns(#"Duplicate_Date_Column", {{"Start_Of_Week", Date.StartOfWeek, type date}})
in
    Start_of_Week
| VENDOR CODE | Supplier Lookup | Section A Turnbacks | Start_Of_Week |
|-------------|-----------------|--------------------|---------------|
| A1 | Supplier A | 08/04/2019 | 08/04/2019 |
| A1 | Supplier A | 08/04/2019 | 08/04/2019 |
| A2 | Supplier B | 16/04/2019 | 15/04/2019 |
| A3 | Supplier C | 30/05/2019 | 27/05/2019 |
| A10 | Supplier D | 24/04/2019 | 22/04/2019 |
| A5 | Supplier E | 18/06/2019 | 17/06/2019 |
| A1 | Supplier A | 17/07/2019 | 15/07/2019 |
| A1 | Supplier A | 24/06/2019 | 24/06/2019 |
| A2 | Supplier B | 11/06/2019 | 10/06/2019 |
| A3 | Supplier C | 03/07/2019 | 01/07/2019 |
| A10 | Supplier D | 02/07/2019 | 01/07/2019 |
| A5 | Supplier E | 17/07/2019 | 15/07/2019 |
| A1 | Supplier A | 17/07/2019 | 15/07/2019 |
| A1 | Supplier A | 05/08/2019 | 05/08/2019 |
| A2 | Supplier B | 12/07/2019 | 08/07/2019 |
| A1 | Supplier A | 05/08/2019 | 05/08/2019 |
| A1 | Supplier A | 29/07/2019 | 29/07/2019 |
| A2 | Supplier B | 12/06/2019 | 10/06/2019 |
| A3 | Supplier C | 21/02/2019 | 18/02/2019 |
| A10 | Supplier D | 23/04/2019 | 22/04/2019 |
| A5 | Supplier E | 14/06/2019 | 10/06/2019 |
| A10 | Supplier D | 08/02/2019 | 04/02/2019 |
A second query returns the Monday (start of week) and Sunday (end of week) dates for the previous five weeks and the current week.
The query is called 5_Week_Calendar.
let
    Source = Table.FromList({-5..0}, each {_}),
    AddedStartOfWeek = Table.AddColumn(Source, "StartOfWeek", each Date.StartOfWeek(Date.AddDays(Date.From(DateTime.LocalNow()), [Column1] * 7)), type date),
    AddedEndOfWeek = Table.AddColumn(AddedStartOfWeek, "EndOfWeek", each Date.EndOfWeek([StartOfWeek]), type date),
    RemovedColumn = Table.RemoveColumns(AddedEndOfWeek, {"Column1"})
in
    RemovedColumn
| StartOfWeek | EndOfWeek |
|-------------|------------|
| 01/07/2019 | 07/07/2019 |
| 08/07/2019 | 14/07/2019 |
| 15/07/2019 | 21/07/2019 |
| 22/07/2019 | 28/07/2019 |
| 29/07/2019 | 04/08/2019 |
| 05/08/2019 | 11/08/2019 |
My third query returns the count per week (i.e. how many records in Triage_Turnbacks have a Start_Of_Week that matches 5_Week_Calendar.StartOfWeek).
let
    Source = Table.NestedJoin(#"5_Week_Calendar", {"StartOfWeek"}, Triage_Turnbacks, {"Start_Of_Week"}, "Triage_Turnbacks", JoinKind.LeftOuter),
    #"Removed Columns" = Table.RemoveColumns(Source, {"EndOfWeek"}),
    #"Aggregated Triage_Turnbacks" = Table.AggregateTableColumn(#"Removed Columns", "Triage_Turnbacks", {{"Start_Of_Week", List.Count, "Total Turnbacks"}})
in
    #"Aggregated Triage_Turnbacks"
| StartOfWeek | Total Turnbacks |
|-------------|-----------------|
| 01/07/2019 | 2 |
| 08/07/2019 | 1 |
| 15/07/2019 | 3 |
| 22/07/2019 | 1 |
| 29/07/2019 | 1 |
| 05/08/2019 | 2 |
The problem
How do I add the running total which includes records before the initial 5 week date of 01/07/2019?
| StartOfWeek | Total Turnbacks | Running Total |
|-------------|-----------------|---------------|
| 01/07/2019 | 2 | 15 |
| 08/07/2019 | 1 | 16 |
| 15/07/2019 | 3 | 19 |
| 22/07/2019 | 1 | 19 |
| 29/07/2019 | 1 | 20 |
| 05/08/2019 | 2 | 22 |

A running total is simple in concept but, as far as I know, M's standard library does not currently provide anything like List.CumulativeSum or Table.CumulativeSum. This makes things a bit awkward and a custom function will likely be needed.
A running total can be achieved with something like:
ListCumulativeSum = (listOfNumbers as list) as list => List.Accumulate(listOfNumbers, {}, (listState, currentItem) => listState & {List.Last(listState, 0) + currentItem}),
which, given a list of numbers, returns a list of their running totals.
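For example, a quick check with made-up numbers:
// Running totals over a small list:
Example = ListCumulativeSum({2, 1, 3, 0, 4})
// Example is {2, 3, 6, 6, 10}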
The equivalent for tables might be the function below. (I am sure there are many ways of doing this; some more performant/scalable than others.)
TableCumulativeSum = (someTable as table, columnToSum as text, newColumnName as text) =>
    let
        runningTotal = ListCumulativeSum(Table.Column(someTable, columnToSum)),
        rightTable = Table.FromColumns({{0..List.Count(runningTotal)-1}, runningTotal}, {"$rightJoinKey", newColumnName}),
        leftTable = Table.AddIndexColumn(someTable, "$leftJoinKey"),
        joined = Table.Join(leftTable, {"$leftJoinKey"}, rightTable, {"$rightJoinKey"}, JoinKind.FullOuter),
        dropped = Table.RemoveColumns(joined, {"$leftJoinKey", "$rightJoinKey"})
    in dropped,
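As a quick sanity check, calling it on a tiny made-up table might look like this (the #table literal is only illustrative):
ExampleRunningTotal = TableCumulativeSum(
    #table({"Week", "Total Turnbacks"}, {{1, 2}, {2, 1}, {3, 3}}),
    "Total Turnbacks",
    "Running Total"
)
// should yield Running Total values 2, 3, 6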
Going back to your question, I think the whole thing can be achieved with the code below:
let
    // This is just a function which returns a record. The record itself
    // contains two values:
    // • a list of dates (which will be the "left join column")
    // • a replacer function (which conditionally replaces dates)
    GetReplacerAndDates = (n as number, replaceWith as any) as record =>
        let
            startOfCurrentWeek = Date.StartOfWeek(DateTime.Date(DateTime.LocalNow())),
            nMondaysAgo = Date.AddWeeks(startOfCurrentWeek, -n),
            defaultAndLastNWeeks = {replaceWith} & List.Dates(nMondaysAgo, n + 1, #duration(7, 0, 0, 0)),
            conditionalReplacer = (someDate as date) =>
                let
                    startOfWeek = Date.StartOfWeek(someDate),
                    startOfWeekOrReplacement = if startOfWeek >= nMondaysAgo then startOfWeek else replaceWith
                in startOfWeekOrReplacement,
            toReturn = [replacer = conditionalReplacer, values = defaultAndLastNWeeks]
        in toReturn,
    rec = GetReplacerAndDates(5, "Prior period"), // Needs a better variable name.
    // You don't need this below, I only used it to give me a starting point
    // and so that I could verify my answer.
    triageTurnbacks = Table.FromColumns(
        {
            {"A1","A1","A2","A3","A10","A5","A1","A1","A2","A3","A10","A5","A1","A1","A2","A1","A1","A2","A3","A10","A5","A10"},
            {"Supplier A","Supplier A","Supplier B","Supplier C","Supplier D","Supplier E","Supplier A","Supplier A","Supplier B","Supplier C","Supplier D","Supplier E","Supplier A","Supplier A","Supplier B","Supplier A","Supplier A","Supplier B","Supplier C","Supplier D","Supplier E","Supplier D"},
            {#date(2019,4,8), #date(2019,4,8), #date(2019,4,16), #date(2019,5,30), #date(2019,4,24), #date(2019,6,18), #date(2019,7,17), #date(2019,6,24), #date(2019,6,11), #date(2019,7,3), #date(2019,7,2), #date(2019,7,17), #date(2019,7,17), #date(2019,8,5), #date(2019,7,12), #date(2019,8,5), #date(2019,7,29), #date(2019,6,12), #date(2019,2,21), #date(2019,4,23), #date(2019,6,14), #date(2019,2,8)}
        },
        type table [Vendor Code = text, Supplier Lookup = text, Section A Turnbacks = date]
    ),
    // I think the single step "startOfWeekColumn" below should probably replace the #"Duplicate_Date_Column"
    // and #"Start_of_Week" steps of your Triage_Turnbacks query.
    // You could also assign rec[replacer] to a variable,
    // just to make the code/call site easier to read.
    startOfWeekColumn = Table.AddColumn(triageTurnbacks, "startOfWeek", each rec[replacer]([Section A Turnbacks]), type any),
    datesToJoin = Table.FromColumns({rec[values]}, {"startOfWeek"}),
    joined = Table.NestedJoin(datesToJoin, "startOfWeek", startOfWeekColumn, "startOfWeek", "Total Turnbacks", JoinKind.LeftOuter),
    reduced = Table.TransformColumns(joined, {{"Total Turnbacks", Table.RowCount, type number}}),
    // Helper functions. I don't think cumulative summation can be achieved natively (as far as I know).
    ListCumulativeSum = (listOfNumbers as list) as list => List.Accumulate(listOfNumbers, {}, (listState, currentItem) => listState & {List.Last(listState, 0) + currentItem}),
    TableCumulativeSum = (someTable as table, columnToSum as text, newColumnName as text) as table =>
        let
            runningTotal = ListCumulativeSum(Table.Column(someTable, columnToSum)),
            rightTable = Table.FromColumns({{0..List.Count(runningTotal)-1}, runningTotal}, {"$rightJoinKey", newColumnName}),
            leftTable = Table.AddIndexColumn(someTable, "$leftJoinKey"),
            joined = Table.Join(leftTable, {"$leftJoinKey"}, rightTable, {"$rightJoinKey"}, JoinKind.FullOuter),
            dropped = Table.RemoveColumns(joined, {"$leftJoinKey", "$rightJoinKey"}),
            changedTypes = Table.TransformColumnTypes(dropped, {{newColumnName, type number}})
        in changedTypes,
    runningTotal = TableCumulativeSum(reduced, "Total Turnbacks", "Running Total")
in
    runningTotal
which gives me a result that I think matches your expected output, with the exception of the Total Turnbacks value for the week starting 22/07/2019 (where you have 1 and I have 0).
You could obviously filter/remove the "Prior period" row from the final table.
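For instance (assuming the startOfWeek column name used in the code above), something like:
withoutPriorPeriod = Table.SelectRows(runningTotal, each [startOfWeek] <> "Prior period")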
For further details, please see comments in the code. If they don't help, let me know.
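As an aside, the single-step replacement mentioned in the comments could look roughly like this inside your Triage_Turnbacks query (a sketch, untested; the step name is made up, and you would swap Date.StartOfWeek for rec[replacer] if you fold the prior-period logic into that query):
// Hypothetical single step replacing #"Duplicate_Date_Column" and #"Start_of_Week"
#"Start_Of_Week_Added" = Table.AddColumn(
    #"Removed_Columns",
    "Start_Of_Week",
    each Date.StartOfWeek([#"Section A#(lf)Turnbacks"]),
    type date
)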

Related

Reshaping table Excel PowerQuery

I have a large table in Excel, the output of a data-gathering tool, that looks more or less like this:
DateA | ValueA | DateB | ValueB | ... | DateZ | ValueZ
---------------------------------------------------------------------------
2019-01-01 | 3 | 2019-01-01 | 6 | ... | 2019-01-04 | 7
2019-01-02 | 1 | 2019-01-04 | 2 | ... | 2019-01-05 | 3
And I'd like to process it so it would look like this:
Date | Value | Type
-----------------------------
2019-01-01 | 3 | A
2019-01-02 | 1 | A
2019-01-01 | 6 | B
2019-01-04 | 2 | B
...
2019-01-04 | 7 | Z
2019-01-05 | 3 | Z
Because this is the format that is used in our SQL database.
How can I do this in the least tedious way, preferably using PowerQuery? I'd like to avoid brute-force copying and pasting with a VBA loop.
The number of columns is fixed (though it would be nice to have the option to add another pair later on); the number of rows, however, varies day to day around some value (like 20, 21, 20, 22, 19, 20).
Columns are harder to work with, so I'd first transform each column into a new row as a list.
ColumnsToRows =
    Table.FromColumns(
        {
            Table.ToColumns(Source),
            Table.ColumnNames(Source)
        },
        {"ColumnValues", "ColumnName"}
    )
This should give you a table as follows, where each list consists of the values in the corresponding column. For example, the top list is {1/1/2019, 1/2/2019}. (The second column passed to Table.FromColumns is there to add the ColumnName column.)
| ColumnValues | ColumnName |
|--------------|------------|
| [List] | DateA |
| [List] | ValueA |
| [List] | DateB |
| [List] | ValueB |
| [List] | DateZ |
| [List] | ValueZ |
We can then filter this based on the data type in each list. To get the date rows you can write:
DateRows =
    Table.SelectRows(
        ColumnsToRows,
        each Value.Type(List.First([ColumnValues])) = type date
    )
Which gets you the following filtered table:
| ColumnValues | ColumnName |
|--------------|------------|
| [List] | DateA |
| [List] | DateB |
| [List] | DateZ |
If you expand the first column with Table.ExpandListColumn(DateRows, "ColumnValues"), then you get
| ColumnValues | ColumnName |
|--------------|------------|
| 1/1/2019 | DateA |
| 1/2/2019 | DateA |
| 1/1/2019 | DateB |
| 1/4/2019 | DateB |
| 1/4/2019 | DateZ |
| 1/5/2019 | DateZ |
The logic to filter and expand the value rows is analogous.
ValueRows =
    Table.ExpandListColumn(
        Table.SelectRows(
            ColumnsToRows,
            each Value.Type(List.First([ColumnValues])) = type number
        ),
        "ColumnValues"
    )
Which gets you a similar looking table:
| ColumnValues | ColumnName |
|--------------|------------|
| 3 | ValueA |
| 1 | ValueA |
| 6 | ValueB |
| 2 | ValueB |
| 7 | ValueZ |
| 3 | ValueZ |
Now we just need to combine together the columns we want into a single table:
CombineColumns =
    Table.FromColumns(
        {
            DateRows[ColumnValues],
            ValueRows[ColumnValues],
            ValueRows[ColumnName]
        },
        {"Date", "Value", "Type"}
    )
and then extract the text following Value in the column names.
ExtractType =
    Table.TransformColumns(
        CombineColumns,
        {{"Type", each Text.AfterDelimiter(_, "Value"), type text}}
    )
The final table should be just as specified:
| Date | Value | Type |
|----------|-------|------|
| 1/1/2019 | 3 | A |
| 1/2/2019 | 1 | A |
| 1/1/2019 | 6 | B |
| 1/4/2019 | 2 | B |
| 1/4/2019 | 7 | Z |
| 1/5/2019 | 3 | Z |
All in a single query, the M code looks like this:
let
    Source = <Source Goes Here>,
    ColumnsToRows = Table.FromColumns({Table.ToColumns(Source), Table.ColumnNames(Source)}, {"ColumnValues", "ColumnName"}),
    DateRows = Table.ExpandListColumn(Table.SelectRows(ColumnsToRows, each Value.Type(List.First([ColumnValues])) = type date), "ColumnValues"),
    ValueRows = Table.ExpandListColumn(Table.SelectRows(ColumnsToRows, each Value.Type(List.First([ColumnValues])) = type number), "ColumnValues"),
    CombineColumns = Table.FromColumns({DateRows[ColumnValues], ValueRows[ColumnValues], ValueRows[ColumnName]}, {"Date", "Value", "Type"}),
    ExtractType = Table.TransformColumns(CombineColumns, {{"Type", each Text.AfterDelimiter(_, "Value"), type text}})
in
    ExtractType
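If you are pasting this into a blank query, the Source step would typically point at your table; for example (the table name "Table1" is just a placeholder):
Source = Excel.CurrentWorkbook(){[Name = "Table1"]}[Content]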

Filter a table of filter criteria and then apply it in a COUNTIF statement in Excel

I have a table of filter criteria like this:
+----------+----------+------+------+------+
| Category | SpecName | Spec | Pass | Fail |
+----------+----------+------+------+------+
| A | S1 | 3 | | |
| A | S2 | 4 | | |
| B | S1 | 5 | | |
| C | S1 | 2 | | |
+----------+----------+------+------+------+
I have a table I want to apply the filter criteria to like this:
+----------+----+----+
| Category | S1 | S2 |
+----------+----+----+
| A | 5 | 3 |
| B | 4 | |
| A | 5 | 5 |
| C | 2 | |
| A | 2 | 6 |
+----------+----+----+
I want to fill the Pass and Fail columns in the filter criteria table with a count of rows in the second table whose values are >= the corresponding spec, like so.
+----------+----------+------+------+------+
| Category | SpecName | Spec | Pass | Fail |
+----------+----------+------+------+------+
| A | S1 | 3 | 1 | 2 |
| A | S2 | 4 | 1 | 2 |
| B | S1 | 5 | 0 | 1 |
| C | S1 | 2 | 1 | 0 |
+----------+----------+------+------+------+
Here are steps for how I might do it in a scripting language:
Filter the first table to get all spec filter criteria for the Category on that row; for the first row, that gives:
+----------+----------+------+
| Category | SpecName | Spec |
+----------+----------+------+
| A | S1 | 3 |
| A | S2 | 4 |
+----------+----------+------+
Copy table 2 to a variable iTable
+----------+----+----+
| Category | S1 | S2 |
+----------+----+----+
| A | 5 | 3 |
| B | 4 | |
| A | 5 | 5 |
| C | 2 | |
| A | 2 | 6 |
+----------+----+----+
For each spec name:
Find column in iTable with spec name
Filter spec name column in iTable by spec
After all filters applied, we would have:
+----------+----+----+
| Category | S1 | S2 |
+----------+----+----+
| A | 5 | 5 |
+----------+----+----+
Then count the rows remaining in iTable and assign that count to the Pass cell of the criteria table.
Is this possible with Excel formulas?
If not, does anyone know how to do it with VBA?
Consider an alternative layout for your spec criteria; expand the columns to suit your needs.
With each spec criterion in its own column, life gets really easy: you just need to adjust the formula to match the number of criteria you have.
Based on the table at the end for layout, place the following formula in D3 and copy down as required.
=SUMPRODUCT(($G$2:$G$6=A3)*($H$2:$H$6>=B3)*($I$2:$I$6>=C3))
That will give you a count of rows passing all criteria. It's also a function that performs array-like calculations. It could be repeated in the next column, but to reduce the dependency on array calculation and potentially speed things up (depending on the amount of data to check), place the following at the top of the Fail column and copy down as required:
=COUNTIF($G$2:$G$6,A3)-D3
Basically it subtracts the passes from the total count. This assumes you can only have PASS and FAIL as options.
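Since this thread is otherwise about Power Query, the asker's step-by-step logic could also be expressed in M, roughly as sketched below. The query and step names are made up, this is untested, and it simply counts, per criteria row, the data rows of the same Category that meet every spec defined for that Category:
let
    // Sample data copied from the question.
    Criteria = #table(
        {"Category", "SpecName", "Spec"},
        {{"A", "S1", 3}, {"A", "S2", 4}, {"B", "S1", 5}, {"C", "S1", 2}}
    ),
    Data = #table(
        {"Category", "S1", "S2"},
        {{"A", 5, 3}, {"B", 4, null}, {"A", 5, 5}, {"C", 2, null}, {"A", 2, 6}}
    ),
    // Pass = data rows of the same Category that satisfy every spec for that Category.
    AddPass = Table.AddColumn(Criteria, "Pass", (c) =>
        let
            specs = Table.ToRecords(Table.SelectRows(Criteria, each [Category] = c[Category])),
            passing = Table.SelectRows(Data, (d) =>
                d[Category] = c[Category] and
                List.AllTrue(List.Transform(specs, (s) =>
                    let v = Record.Field(d, s[SpecName]) in v <> null and v >= s[Spec])))
        in
            Table.RowCount(passing), Int64.Type),
    // Fail = all data rows of that Category minus the passes.
    AddFail = Table.AddColumn(AddPass, "Fail", (c) =>
        Table.RowCount(Table.SelectRows(Data, each [Category] = c[Category])) - c[Pass], Int64.Type)
in
    AddFail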

Non-exact match VLOOKUP with specific criteria

I have two columns of data (Time A and Time B) and, for each value a in A, I would like to find out whether there is a value b in B that meets the criterion b - a = +/-0.007. I am trying to use VLOOKUP, but I cannot specify that criterion. Can I do this using VLOOKUP, or is there another way to do it in Excel? Many thanks in advance for any help!
The data example is shown below.
+----------------+------------------+
| Time A | Time B |
+----------------+------------------+
| 0.000 | 0.000 |
| 1.001 | 1.001 |
| 1.852 | 1.852 |
| 2.725 | 2.729 |
| 3.356 | 3.359 |
| 4.061 | 4.070 |
| 4.423 | 4.431 |
| 4.634 | 4.642 |
| 4.750 | 4.637 |
| 5.390 | 5.398 |
| 5.788 | 5.788 |
| 6.515 | 6.522 |
| 7.010 | 7.010 |
| 7.672 | 7.500 |
| 8.017 | 7.900 |
| 8.073 | 8.200 |
+----------------+------------------+
You could use this VBA solution:
Sub main()
    Dim i As Integer
    Dim j As Integer
    ' Compare each value in column A (rows 2-16) against each value in column B.
    For i = 2 To 16
        For j = 2 To 16
            ' Record the row number of a column B value within 0.007 of the column A value.
            If Abs(Cells(j, 2) - Cells(i, 1)) < 0.007 Then
                Cells(i, 3) = j
            End If
        Next j
    Next i
End Sub
In column C, it outputs the matching row index from column B.

Excel 2010: Moving cell right & up from original location (Offsetting it)

Program: Excel 2010
Experience: Basic
Issue:
I have a large table of data with some "split cells" (first/last name & currency); this is how it comes from the original data (copied & pasted from a webpage, with the data split across two rows). I need to make a clean table with all the data on one row, not two. I have some sample data below, and further down I have included how I want it to look.
The original format is an HTML table pulled from a database (which I do NOT have access to; I can generate a CSV, but that in itself is another question due to how it is set up).
Assume the data starts in A1 and that there are more values & columns than those listed. I will accept either formula or VBA answers. Lastly, ignore the blank rows; they were inserted to show the difference between the tables more clearly.
Original Data:
| Date | Transaction ID | Order Reference | Sender | Sender Email | Status | Payment Amount | Amount Paid |
|------------|----------------|-----------------|--------|--------------|--------|----------------|-------------|
| 17/04/2014 | transid | order | first | email | Paid | 5 | 5 |
| | | | last | | | AUD | AUD |
| | | | | | | | |
| 13/04/2014 | transid | order | first | email | Paid | 5 | 5 |
| | | | last | | | AUD | AUD |
| | | | | | | | |
| 13/04/2014 | transid | order | first | email | Paid | 5 | 5 |
| | | | last | | | AUD | AUD |
| | | | | | | | |
| 12/04/2014 | transid | order | first | email | Paid | 5 | 5 |
| | | | last | | | AUD | AUD |
Required Data: (notice that first/last are now on the same row, as is the currency)
| Date | Transaction ID | Order Reference | Sender | | Sender Email | Status | Payment Amount | | Amount Paid | |
|------------|----------------|-----------------|--------|------|--------------|--------|----------------|-----|-------------|-----|
| 17/04/2014 | transid | order | first | last | email | Paid | 5 | AUD | 5 | AUD |
| | | | | | | | | | | |
| 13/04/2014 | transid | order | first | last | email | Paid | 5 | AUD | 5 | AUD |
| | | | | | | | | | | |
| 13/04/2014 | transid | order | first | last | email | Paid | 5 | AUD | 5 | AUD |
| | | | | | | | | | | |
| 12/04/2014 | transid | order | first | last | email | Paid | 5 | AUD | 5 | AUD |
Thank you, have a great weekend.
[EDIT]
Note: none of these cells are merged; each cell is individual, and the "last" & "AUD" values need to be moved right & up.
This should work for you:
Public Sub ModData()
    Dim colDate As Long
    Dim colTrans As Long
    Dim colOrder As Long
    Dim colSender As Long
    Dim colSenderEmail As Long
    Dim colStatus As Long
    Dim colPmtAmt As Long
    Dim colPaid As Long
    Dim r As Long
    Dim ws As Worksheet

    colDate = 1
    colTrans = 2
    colOrder = 3
    colSender = 4
    ' col 5 reserved for inserted col
    colSenderEmail = 6
    colStatus = 7
    colPmtAmt = 8
    ' col 9 reserved for inserted col
    colPaid = 10

    Set ws = ActiveSheet
    Application.ScreenUpdating = False

    ' Add extra columns needed.
    ws.Columns(colSender + 1).Insert Shift:=xlToRight
    ws.Columns(colPmtAmt + 1).Insert Shift:=xlToRight

    ' Move data to same row.
    For r = 2 To 12 Step 2
        ws.Cells(r, colSender + 1).Value = ws.Cells(r + 1, colSender).Value
        ws.Cells(r, colPmtAmt + 1).Value = ws.Cells(r + 1, colPmtAmt).Value
        ws.Cells(r, colPaid + 1).Value = ws.Cells(r + 1, colPaid).Value
    Next r

    ' Delete unnecessary rows.
    r = 3
    While ws.Cells(r - 1, 1).Value <> ""
        ws.Cells(r, 1).EntireRow.Delete
        r = r + 1
    Wend

    Application.ScreenUpdating = True
End Sub

Python multivariate analysis? - Producing a unique ID from record groups of 3-10 in a 3.6M recordset

I am new to data mining and rough on stats and python.
We have a case_reports table from a certain government agency in a MySQL DB, where a case can appear with one or multiple ids. In theory they say the case is unique, but there can be multiple case : id rows with quite similar (but not identical) data in the other fields, like date_rcvd, age, dosage, wt, route of administration, submission_type, etc., all for the same patient.
Theoretically 'ideal' case report set for one patient:
| id = 55 | case = 1 | age = 20 | source = x | drug = az | date = p
| id = 12 | case = 1 | age = null | source = y | drug = az | date = q
| id = 11 | case = 1 | age = 20 | source = null | drug = null | date = g
| id = 15 | case = 1 | age = 20 | source = z | drug = z | date = m
The problem set example for one patient:
| id = 55 | case = 1 | age = 20 | source = x | drug = az | date = p
| id = 12 | case = 2 | age = 20 | source = y | drug = az | date = q
| id = 55 | case = 1 | age = 20 | source = null | drug = null | date = g
| id = 55 | case = 1 | age = null | source = x | drug = az | date = null
We just want to be able to GROUP BY a unique_id derived from some sort of clustering approach, so we end up with a single result instead of four when querying for a case id. There are about 12 other fields I can use for matching.
case = supposedly unique patient case report; they can be ongoing as the case develops and is recorded.
id = parameter_id linking to other tables pertaining to that case.
Again, in theory "they" say these multiple case records should have at least one different id number, but if they don't: "just take the id with the most recent process_date, use that, and DELETE id,case < recent; that will probably be the correct one, leaving the proper case" - not reassuring: that case could have come from a different reporter source, be a duplicate submission (an electronic submission along with a backup mailed report), or reflect a follow-up case report from a different reporter source for that id:case.
I was thinking of some sort of k-means clustering approach?
Any resources, links to code examples, or strategies to start with are appreciated; I can't expect the whole answer magically, I'm just looking for a methodology to start with.
I've formed my hypothesis that a case can be linked to several ids, defined the problem, and collected data supporting it (many unique ids link to some unique cases) - an admitted flaw of their outdated, underfunded processing/reporting system. The data is proprietary or I'd share more.
**UPDATED EXAMPLE DATA (@idris) - ignore the example above**
+---------+---------+----------+----------+----------+----------+----+-----+----------+
| id | case | date | rept_dt | mfr_dt | foll_seq | wt | age | gndr_cod |
+---------+---------+----------+----------+----------+----------+----+-----+----------+
| 4610164 | 3118662 | 20050314 | 20050311 | 20050301 | 2 | 0 | 51 | M |
| 4622120 | 3118662 | 20050322 | 20050321 | 20050314 | 3 | 0 | 0 | M |
| 4622120 | 3118662 | 20050329 | 20050325 | 20050314 | | 0 | 51 | M |
| 4802410 | 3118662 | 20051013 | 20051012 | 20051004 | 5 | 0 | 51 | M |
+---------+---------+----------+----------+----------+----------+----+-----+----------+
Best,
Colin
