Creating a unique identifier in Excel that links the same record across different deliveries - excel

I need to create a unique ID that will identify each business, so I can see in each delivery whether it was added, removed or updated. Some background below:
I have a client who will deliver businesses to be mapped, in CSV format.
Every quarter the client will deliver a new file, which probably will not contain exactly the same data; it may contain additions, removals, and updates.
I don't want to normalize the data, because I want to compare each raw delivery to see what has changed from one delivery to the next.
The files can contain thousands of records or just a few.
It will be difficult to create uniqueness from field concatenation alone, because a record can change a lot or just a little.
The order of the businesses can change between deliveries (the first one can end up in the middle or at the bottom).
I think there is no way to create uniqueness. Can someone help me here?
Example of data to be delivered:

Related

How to connect data in Excel Power Pivot data model with no unique identifier

I am trying to build an Excel Power Pivot data model using restaurant inspection data from my city, though I'm having trouble envisioning how to get this to work properly. I have three files I've imported into the data model but cannot figure out how to link:
business_lookup; each entry is unique with a business ID number, a business name and address.
inspection_lookup; each entry is a distinct inspection on a specific date for a specific restaurant but has no unique identifier. It does not distinguish how many violations were found on this visit, just that a visit occurred.
violations; a file full of each individual violation found on each of the inspection_lookup dates. It has the business ID, date and a description of the individual specific violation. For each inspection in the inspection_lookup, there are typically multiple corresponding violations in this table.
The problem, to my understanding, is that there's no unique field like "inspection_ID" that could link the inspection_lookup file to each of its many findings in the violations file to allow me to say on June 5, 2020, Jim's Fish House had three violations and they were X, Y and Z. I can connect both to the business_lookup file easily enough, but I can't figure out how to link these other two tables. How can I connect these two other files when all I know is that the unique business ID was inspected on a common date?
If inspection_lookup has the date and the specific restaurant (I assume it corresponds to the business ID or business name), you can create a unique key by concatenating these two columns (provided you cannot have more than one inspection on the same day in the same restaurant). You can create the same key in violations and connect the two tables. business_lookup has unique values, so you can connect it to violations or inspection_lookup, depending on your use case.
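As a rough illustration of the concatenated key, here is a minimal Python sketch. The column names (`business_id`, `date`, `description`) and the sample rows are assumptions made up for the example, not the real schema. In Power Pivot the same idea would presumably be a calculated column added to both tables.

```python
from collections import defaultdict

# Hypothetical sample rows; the column names are assumptions, not the real schema.
inspections = [
    {"business_id": "B001", "date": "2020-06-05"},
    {"business_id": "B002", "date": "2020-06-05"},
]
violations = [
    {"business_id": "B001", "date": "2020-06-05", "description": "X"},
    {"business_id": "B001", "date": "2020-06-05", "description": "Y"},
    {"business_id": "B001", "date": "2020-06-05", "description": "Z"},
    {"business_id": "B002", "date": "2020-06-05", "description": "W"},
]


def inspection_key(row):
    """Build the composite key: business ID and date joined by a separator."""
    return f"{row['business_id']}|{row['date']}"


# Group violation descriptions under the composite key.
violations_by_key = defaultdict(list)
for v in violations:
    violations_by_key[inspection_key(v)].append(v["description"])

# Attach the matching violations to each inspection.
linked = {
    inspection_key(i): violations_by_key.get(inspection_key(i), [])
    for i in inspections
}
```

The separator keeps keys unambiguous when IDs and dates are plain strings. As noted above, the key is only unique if a restaurant is never inspected twice on the same day.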

Excel: Order by date within multiple IDs

I have a huge epidemiological dataset containing registry data with pathology reports and clinical information. I have merged several files into one master file in order to get all information from one file. Every patient is assigned a unique ID number. Each patient can have several reports, and hence the same ID number can be repeated several times in the ID column. Each entry is a new row (a pathology or clinical report) with the date on which that sample or information was reported.
My goal is to be able to read all pathology/clinical info for a particular ID within one row.
Sorting by ID gives me a clear picture of how many times each ID has been entered. The problem arises when there are several reports, that is, multiple rows with an identical ID, because the dates within the rows for that one patient do not match. The dates come from pathology (sample date, answer date, clinical info date, etc.). The pathology and clinical dates for one patient do not have to match exactly to the day, but they should fall within a reasonable timeframe, e.g. within 1-2 months. This is best illustrated with an example.
I want to sort the columns so that dates from a particular row match together. I am sure there is a way to do that but I cannot figure it out.
Thanks in advance
The issue of mismatching records seems to arise once the two separate tables are merged into one. In order to fix this, there are several options you can take:
Redo the merge, but strengthen the way the tables are joined.
Instead of only merging based on ID, see if there is another field that could easily connect the records, perhaps a medical record #, case #, or event #, and merge the tables based on this new field AND ID. This would be the strongest solution, however it will only work if you can find said field to strengthen the link.
A separate solution would be to first sort the original tables by date so that they match up, and then re-merge them.
In theory this should solve your problem as I assume currently when matching up the two separate tables it is grabbing the first instance of patient X01 from both tables and matching them together. This can be confirmed by checking the merged query and looking to see if the mismatched records are in the same order as presented in the original tables. This is not perfect, as it relies on no clinical dates occurring between pathology dates for the record, so I would proceed with caution.
And to address your concern about losing track of IDs with multiple rows: this should not matter, because after the merge you can sort by ID. You can also add multiple sort levels by selecting the data and going to Data -> Sort -> Add Level, which lets you change the order in which the data is sorted (first by ID and then by Date).
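The two-level sort (first by ID, then by Date) can be sketched outside Excel as well. This is a minimal Python example; the field names and sample rows are hypothetical and only stand in for the merged table.

```python
from datetime import date

# Hypothetical merged rows; field names and values are assumptions.
rows = [
    {"id": "X02", "date": date(2020, 3, 1), "source": "pathology"},
    {"id": "X01", "date": date(2020, 5, 10), "source": "clinical"},
    {"id": "X01", "date": date(2020, 1, 15), "source": "pathology"},
    {"id": "X02", "date": date(2020, 2, 20), "source": "clinical"},
]

# A tuple key sorts by ID first and breaks ties by date,
# which mirrors adding a second level in Data -> Sort.
sorted_rows = sorted(rows, key=lambda r: (r["id"], r["date"]))

order = [(r["id"], r["date"].isoformat()) for r in sorted_rows]
```

After sorting, all rows for one patient sit together in chronological order, which makes it easier to spot pathology and clinical dates that fall within the same timeframe.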

Spotfire - Marking with a Y-Axis Custom expression

Currently, I'm trying to create a combination chart (graph) in Spotfire where a specific record could be bucketed into one or more categories. For example, I have a category for newly created invoices as well as a category for invoices that are in process. An invoice can easily be both newly created and in process, and I want to show both of these categories as separate series so that I can monitor each category independently (i.e. if I have a population of 600 invoices of which 500 are newly created and 300 are in process, I want to be able to drill down into the 500 newly created invoices or the 300 in process ones, regardless of overlap).
Currently, I can create the graph by using a CASE statement for the Y-Axis (i.e. if it's a "New Invoice" Then 1, ELSE 0) so I can get the graph to show the correct number of records. However, with this method, Spotfire doesn't know that only records that satisfy the case statement should be marked; therefore, if I try to mark these specific transactions, I get detail for all transactions.
Has anyone figured out a way to get around this? Obviously, if the criteria were independent, the marking could be done really easily, but since they aren't, I can't seem to crack what seems like a very simple problem.
I would suggest creating a calculated column that splits your data set accordingly. For example, use an if statement or case statements to have your calculated column return "New Invoice", "In-progress Invoice", or "New and In-progress Invoice" based on your specific evaluation criteria for determining which bin each invoice falls into. Then you could split the graph into series by that new column, and it should work when you mark the specific invoices.
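The logic of such a calculated column can be sketched in Python as follows. This is not Spotfire expression syntax; the two boolean flags, the sample data, and the "Other" fallback are assumptions made for illustration.

```python
from collections import Counter


def invoice_bucket(is_new, is_in_process):
    """Return the bin label for one invoice (hypothetical criteria)."""
    if is_new and is_in_process:
        return "New and In-progress Invoice"
    if is_new:
        return "New Invoice"
    if is_in_process:
        return "In-progress Invoice"
    return "Other"  # assumed fallback for invoices matching neither criterion


# Hypothetical invoices as (is_new, is_in_process) pairs.
invoices = [(True, False), (True, True), (False, True), (True, True)]

# Each invoice lands in exactly one bin, so marking a bin selects only its rows.
counts = Counter(invoice_bucket(n, p) for n, p in invoices)
```

Because the bins are mutually exclusive, the total for "newly created" would be the sum of the "New Invoice" and "New and In-progress Invoice" bins.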
Try creating a hierarchy such as [OLD or NEW] with [STATUS] (which may be Open, In Progress, or Closed) placed below the [OLD or NEW] column, and put this hierarchy on the X axis.
You would then see OLD broken down further into Open, In Progress, or Closed, and the same for NEW.

How do I create report-like data tables in Excel?

In the past I have created websites that extract data from a database and format it using tables.
Now, I am trying to do the same thing but with Excel, and I'm lost. I am used to using SQL commands to extract data from given fields and then sort/manipulate it.
Currently, I am able to print a report that provides me with an Excel spreadsheet full of raw data, but I would like to make my life easier and organize it into a report.
The column that I would like to reference contains duplicates, but the data in the adjacent columns is different.
To give an example, assume I had a spreadsheet of sales transactions. One column would be the Customer ID, and the adjacent columns would contain the quantity, the cost per unit, total cost, order ID, etc.
What I would want to do in this case would be to select all the transactions with the same Customer ID and add them together based on their Order ID. Then, I would want to print the result to a second sheet.
I realize that I can use built-in functions to accomplish this, but I would also like to format this report eventually using VBA. Also, since I will have a variable number of rows that differs from one report to the next, I haven't encountered a function that will allow you to add rows.
I'm assuming this must be done with VBA.
Well you can do it manually, but it would take ages. So VBA would be good, particularly as you would be able to generate future reports quickly.
My interpretation of what you're saying is that each row in your report will be the total for one customer ID. If it's something else, I imagine the below will still be mostly relevant.
I think it would be a bit much to give you the full answer, particularly as you haven't provided full detail but to take a stab at what you'd do:
Create your empty report page, whether it be a new worksheet or a new workbook
Loop through the table (probably using While next is not empty)
  a. Identifying if a row is for a customer ID you haven't covered yet
    i. If so then add a new entry in your report
    ii. Else add it to the existing customer ID record (loop through until you find it)
Format your report so it looks pretty, e.g.:
  a. Fill the background in white
  b. Throw in some filled bars
  c. Put in good titles and totals etc.
For part 1, it might be better building an array first and then dumping the contents into the report. It depends how process intensive it will be - if very intense, an array should shave off time.
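The loop described above, with the totals built in memory before being written out, might look like the sketch below. It is written in Python rather than VBA, and the field names and sample transactions are assumptions. A dictionary keyed by customer ID replaces the "loop through until you find it" step.

```python
# Hypothetical raw transactions; field names are assumptions.
transactions = [
    {"customer_id": "C1", "order_id": "O1", "quantity": 2, "unit_cost": 5.0},
    {"customer_id": "C2", "order_id": "O3", "quantity": 1, "unit_cost": 7.0},
    {"customer_id": "C1", "order_id": "O1", "quantity": 1, "unit_cost": 2.5},
    {"customer_id": "C1", "order_id": "O2", "quantity": 3, "unit_cost": 4.0},
]

report = {}
for t in transactions:
    cid = t["customer_id"]
    if cid not in report:
        # First time this customer ID is seen: add a new report entry.
        report[cid] = {"orders": set(), "total": 0.0}
    # Otherwise (and also for the new entry) accumulate into the record.
    report[cid]["orders"].add(t["order_id"])
    report[cid]["total"] += t["quantity"] * t["unit_cost"]

# Flatten into rows (customer ID, number of orders, total cost) for output.
report_rows = [
    (cid, len(entry["orders"]), entry["total"])
    for cid, entry in sorted(report.items())
]
```

The rows in `report_rows` would then be dumped into the report sheet in one pass, and the formatting step applied afterwards.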

Matching "fuzzy" data based on several inputs

I have a search and matching problem:
Inputs
In my database, I have thousands of names, in addition to some other matching characteristics: a few columns of numerical data, and a few columns of other text that helps identify this specific company.
A prospective client has about 500 company names, and then sparsely populated additional characteristics as mentioned above for each of the names.
Current Process
In the past, the process has been a manual one: try to match each name given by the client by searching through the database, find a name "like" the one reported to me, and then verify that the additional characteristics match up. However, the main issue is that the names reported are not the same; they can often contain abbreviations or only parts of the name stored in my database, and the additional characteristics may be incomplete or only partially matching as well.
Automation
I want to automate this process since it happens frequently. The optimal solution would input one company from the client list along with any of the additional characteristics they filled in for it, and then try to find the top 5 matches in my database.
I've never used Lucene or Sphinx, but they seem to be more document driven. Is there a way to format these inputs so those libraries work for this problem, or instead, what other software tools exist that would work?
To Lucene, a 'document' can easily be a row in a table and I think you will like the fuzzy~ search and search hit scoring capabilities.
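To get a feel for scored fuzzy matching before setting up Lucene, here is a minimal sketch using Python's standard `difflib` module as a stand-in. It is not Lucene and does not reproduce its scoring. The function name `top_matches`, the sample names, and the choice to compare names only (ignoring the additional characteristics) are assumptions for illustration.

```python
from difflib import SequenceMatcher


def top_matches(query, candidates, n=5):
    """Return up to n (score, name) pairs, best match first.

    The score is difflib's similarity ratio on lowercased names (0.0 to 1.0).
    """
    scored = [
        (SequenceMatcher(None, query.lower(), name.lower()).ratio(), name)
        for name in candidates
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:n]


# Hypothetical database names and one client-reported name.
database_names = [
    "ACME Corporation",
    "Apex Industries",
    "Acme Holdings",
    "Zenith LLC",
]
matches = top_matches("Acme Corp", database_names)
```

Each candidate in the top results could then be checked against the numerical and text characteristics, as in the manual process, before accepting the match.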
