How should I set up a table in a database to keep the history of records?

I'm trying to set up a SQL Server database for an ASP.NET MVC site that will store both the latest information and a history of all the changes to the data.
The information for my site comes from XML files uploaded by the user. The site parses the XML and writes the contained information into the site's database. Elements in successive uploads may actually represent the same thing, but some data may have changed. Yet, as I said before, I want to keep track of every version.
The table below shows one way I was thinking of approaching this. I would create a duplicate record for each item in each upload. New items that match those in previous uploads would be assigned the same ID, but each item would also be assigned a unique upload ID.
Upload 1: Erik and Sarah are added.
Upload 2: Erik is renamed to Eric; Bill is added.
Upload 3: Sarah grew 2" taller; Eric was removed.
[PERSONS TABLE]
PersonID  Name   Height  UploadID
1         Erik   71      1
1         Eric   71      2
2         Sarah  70      1
2         Sarah  70      2
2         Sarah  72      3
3         Bill   76      2
3         Bill   76      3
[UPLOADS TABLE]
UploadID  UploadTime
1         3/09/2011
2         4/01/2011
3         4/11/2011
However, this doesn't seem like the optimal solution to me because of how much information ends up being duplicated in the database. Is there a better way to approach this where only the changes are saved with each upload?

I think the problem is that your PERSONS table no longer contains just information about persons; it also contains information about the uploads. What I'm going to recommend probably won't decrease the size of your database, but it will make it a little easier to understand and work with.
PERSONS
PersonID  Name   Height
1         Eric   71
2         Sarah  72
3         Bill   76
UPLOAD
UploadID  UploadTime
1         3/09/2011
2         4/01/2011
3         4/11/2011
PERSONS_EDIT
PersonID, UploadID, ChangeSQL, ChangeDescription
1 1 "insert into PERSONS(PersonID, Name, Height) VALUES(1, 'Erik', 71)" "Erik added"
1 2 "update PERSONS set Name='Eric' where PersonID=1" "Changed Erik's name"
.... ... ...... ....
I don't think you can do much beyond this to make your tables simpler or your database smaller. As you can see, PERSONS_EDIT is going to be your largest table. The database you're using might provide mechanisms to do this automatically (some sort of transaction recording or change tracking), but I've never used anything like that, so I'll leave it to other people on Stack Overflow to suggest it if it exists. If the PERSONS_EDIT table gets too large, you can look at deleting entries that are over a week/month/year old; when to do that is up to you.
Another reason for making this change: in your first table, you had to use PersonID and UploadID together as the primary key of your PERSONS table. So to get the most recent version of a person in your application, you would have had to select the person by ID, order by UploadID, and take the row with the largest UploadID, every single time you do a transaction on one person.
Another benefit is that you don't have to write a bunch of fancy SQL to get your edit history; just do a SELECT * from the PERSONS_EDIT table.
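To make the three-table design above concrete, here is a minimal sketch using SQLite (the table and column names follow the answer; the helper function `apply_change` is an assumption about how you might wire it up in application code, not part of the original answer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Current-state table: one row per person, no upload info mixed in.
cur.execute("CREATE TABLE PERSONS (PersonID INTEGER PRIMARY KEY, Name TEXT, Height INTEGER)")
cur.execute("CREATE TABLE UPLOADS (UploadID INTEGER PRIMARY KEY, UploadTime TEXT)")
# Audit table: one row per change, referencing the upload that caused it.
cur.execute("""CREATE TABLE PERSONS_EDIT (
    PersonID INTEGER, UploadID INTEGER,
    ChangeSQL TEXT, ChangeDescription TEXT)""")

def apply_change(person_id, upload_id, change_sql, description):
    """Run a change against PERSONS and log it in PERSONS_EDIT."""
    cur.execute(change_sql)
    cur.execute("INSERT INTO PERSONS_EDIT VALUES (?, ?, ?, ?)",
                (person_id, upload_id, change_sql, description))

cur.execute("INSERT INTO UPLOADS VALUES (1, '2011-03-09')")
apply_change(1, 1, "INSERT INTO PERSONS VALUES (1, 'Erik', 71)", "Erik added")
cur.execute("INSERT INTO UPLOADS VALUES (2, '2011-04-01')")
apply_change(1, 2, "UPDATE PERSONS SET Name='Eric' WHERE PersonID=1", "Changed Erik's name")

# Latest state is a plain select; edit history needs no fancy SQL.
print(cur.execute("SELECT * FROM PERSONS").fetchall())   # [(1, 'Eric', 71)]
print(cur.execute("SELECT ChangeDescription FROM PERSONS_EDIT WHERE PersonID=1").fetchall())
```

Note that PERSONS always holds only the current row per person, while the full version history lives in PERSONS_EDIT.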

Related

organize a set of data based on open slots/sold-out slots

I am trying to analyze data based on the following scenario:
A group of places, each with its own ID, becomes available for visiting from time to time for a limited number of people; this number varies according to how well the last visit season performed. So far, visit seasons have opened three times.
Let's suppose that in those three seasons ID_01 had the following available-slots/sold-out-slots ratios: 25/24, 30/30, and 30/30; ID_02 had 25/15, 20/18, and 25/21; and ID_03 had 25/10, 15/15, and 20/13.
What would be the best way to design the database for such analysis on a single table?
So far I have used a separate table for each ID with all of its available-slot and sold-out amounts, but as the number of IDs grows, and the number of visit seasons too (well beyond three at this point), this has proven far from ideal: hard to keep track of and terrible to work with.
The best alternative I could come up with was putting all IDs in one column and adding two columns for each season (ID | 1_available | 1_soldout | 2_available | 2_soldout | ...).
The Wikipedia article on database normalization would be a good starting point.
Based on the information you provided in your question, you would create one table.
AvailableDate
-------------
AvailableDateID
LocationID
AvailableDate
AvailableSlots
SoldOutSlots
...
You may also have other columns you haven't mentioned. One possibility is SoldOutTimestamp.
The primary key is AvailableDateID. It's an auto-incrementing integer that has no meaning, other than to sort the rows in input order.
You also create a unique index on (LocationID, AvailableDate) and another unique index on (AvailableDate, LocationID). This allows you to retrieve the row by LocationID or by AvailableDate.
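Here is a minimal SQLite sketch of that single-table design. The table and column names follow the answer; the season dates and the subset of slot numbers are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One row per (location, season) instead of one table per location
# or a pair of columns per season.
cur.execute("""CREATE TABLE AvailableDate (
    AvailableDateID INTEGER PRIMARY KEY AUTOINCREMENT,
    LocationID TEXT,
    AvailableDate TEXT,
    AvailableSlots INTEGER,
    SoldOutSlots INTEGER)""")
cur.execute("CREATE UNIQUE INDEX ix_loc_date ON AvailableDate (LocationID, AvailableDate)")
cur.execute("CREATE UNIQUE INDEX ix_date_loc ON AvailableDate (AvailableDate, LocationID)")

rows = [("ID_01", "2023-01-01", 25, 24), ("ID_01", "2023-06-01", 30, 30),
        ("ID_02", "2023-01-01", 25, 15), ("ID_03", "2023-01-01", 25, 10)]
cur.executemany("""INSERT INTO AvailableDate
    (LocationID, AvailableDate, AvailableSlots, SoldOutSlots)
    VALUES (?, ?, ?, ?)""", rows)

# Analysis across all locations and seasons is now a single query.
for loc, ratio in cur.execute("""SELECT LocationID,
        ROUND(1.0 * SUM(SoldOutSlots) / SUM(AvailableSlots), 2)
    FROM AvailableDate GROUP BY LocationID"""):
    print(loc, ratio)
```

Adding a new season or a new location is just another row, so the schema never changes as the data grows.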

Inventory management - find employees with two OR more particular devices

I am managing the inventory of stock for an IT company.
Recently we've had to dole out lots of new iPhones, as the old iPhones assigned to employees were incompatible with a particular piece of software.
With most employees working from home and IT staff split across several different offices, it can be a little difficult to coordinate things and make sure that a staff member who received a replacement iPhone actually sent back the original one!
It would be great to have an easy means of checking for staff who have two (or possibly more) iPhones of any type, so that I can contact them and ask them to return the old device.
I can export the data from the SQL-based equipment database to Excel and analyse it but I don't have the experience to make things more automatic (and build a report).
Here is an example of the database (shown as an Excel file).
In this example, John Murphy has two iPhones, and he only needs one of them. Items with "Employee_Name" set to "IT Service" and the status set to 'With helpdesk' are in order and do not need to be included in the final report.
'Tammy Top' has only one iPhone and therefore must not appear in the report.
Thanks for your help!
UPDATE...
I've played around with Pivot Tables a little... it may be a start. Perhaps if someone is more experienced they could suggest a better way of setting up the values for the pivot table?
I am pretty sure, that there are better solutions.
But at the moment, I can offer you the following:
E2: =IF(AND(A2="IT Service", D2="With helpdesk"),0,COUNTIFS($A$2:$A2,A2,$D$2:$D2,D2))
G2: =FILTER(A2:D6,(E2:E6>1),"")
In column E I used a COUNTIFS formula to check how often an Employee_Name has occurred so far. I wrapped it in an IF statement that checks whether the combination of "IT Service" and "With helpdesk" occurred; in that case, it overrides the counter with 0.
In column G I used a FILTER formula to provide the relevant rows (A to D) of the source area in case the counter is higher than 1.
Seeing as the data is in SQL Server, you can just query the database instead of mucking around in Excel.
Something like:
SELECT Employee_Name FROM <table> WHERE Employee_Name <> 'IT Service' GROUP BY Employee_Name HAVING COUNT(*) > 1
should get a list of employees with more than one phone. To get the full rows, something like:
SELECT * FROM <table> WHERE Employee_Name IN (
SELECT Employee_Name FROM <table> WHERE Employee_Name <> 'IT Service' GROUP BY Employee_Name HAVING COUNT(*) > 1)
should get you the list you want.
Here, <table> is the name of the table/view/stored procedure that is generating the data you are importing into Excel.
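The key point in that query is that the aggregate condition goes in a HAVING clause, not WHERE. Here is a runnable SQLite sketch of the same query; the device and status values are made up, and only John Murphy and Tammy Top come from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE equipment (Employee_Name TEXT, Device TEXT, Status TEXT)")
cur.executemany("INSERT INTO equipment VALUES (?, ?, ?)", [
    ("John Murphy", "iPhone 8",  "In use"),
    ("John Murphy", "iPhone 13", "In use"),
    ("Tammy Top",   "iPhone 13", "In use"),
    ("IT Service",  "iPhone 8",  "With helpdesk"),
    ("IT Service",  "iPhone 8",  "With helpdesk"),
])

# Employees holding more than one device; HAVING (not WHERE) is
# required to filter on the aggregate COUNT(*).
dupes = cur.execute("""
    SELECT Employee_Name FROM equipment
    WHERE Employee_Name <> 'IT Service'
    GROUP BY Employee_Name
    HAVING COUNT(*) > 1""").fetchall()
print(dupes)  # [('John Murphy',)]
```

Tammy Top (one device) and the IT Service rows correctly drop out, leaving only the employees who still owe a phone.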

cassandra table modeling for EDI data

I am trying to create tables to insert EDI data into a Cassandra database. As you may know, there are 300+ fields in an EDI table, so what is the best way to create the tables? We will not need all 300+ fields, but we will definitely use about 100-120 of them, and the rest must be saved in case we need them in the future. Out of those 120+ fields, we might use 12-15 for search purposes. Let us assume these are some of the fields, with more of them bringing the total close to 300+:
Shipment ID Number, Date, Reference Number, Equipment Number
Shipper Name, Shipper ID, Shipper Addr1, Shipper Addr2, Shipper City, Shipper State, Shipper Zip
Consignee Name, Consignee ID, Consignee Addr, Consignee Addr2, Consignee City, Consignee State, Consignee Zip, Contact Name, Phone
Item No., Authorization No., Lading Desc., Lading Qty, Packaging Code, Weight (Pounds), Commodity Code
SSCC Application ID
Purchase Order Reference ID, Unit Quantity
A, B, C, D, ..., AA, AB, AC, ..., ZZ
Now suppose that out of these we need
ID Number, Reference Number, Equipment Number, A, AD, GG, ... (up to 15) as search fields.
Now what is the best way to model tables in this scenario.
Case 1) Keep all the fields in one table.
This sounds good, but the table is going to be huge, and searching would need too many secondary indexes. Please let me know if this is a good approach or if my thinking is wrong.
Case 2) Split the table into six or seven tables, separating the necessary fields from the for-information-only fields.
Say table one (assume it holds the most-used fields): Shipment ID Number, Date, Reference Number, Equipment Number, Consignee Name, Consignee ID, Consignee Addr, Consignee Addr2, Consignee City, Consignee State, Consignee Zip, Contact Name, Phone, A, D, FA, ... say up to 40 fields.
Say table two:
Item No., Authorization No., Lading Desc., Lading Qty, Packaging Code, Weight (Pounds), Commodity Code, ... and say up to 200 fields or so.
Assume there are up to six or seven such tables in all.
Questions regarding to Case2)
1) How do I achieve joins if I need them?
2) Most of the time I read suggestions to de-normalize the data and repeat the fields in both tables.
2a) If so, how do I insert the data into both tables? Do I write my code that way, or do I use other tools like Spark?
2b) What is the best way to achieve joins in these scenarios?
2c) Also, what is the best way to support the search scenarios with minimal secondary indexes?
I know the question is vague and needs a lot of assumptions, but I would still appreciate an answer; I am looking more for the direction my thinking should take. I have gone through many notes but did not find any solutions, only suggestions, mostly telling me to use de-normalized data. But how do we achieve that?
Please make as many assumptions as needed; I will try to understand, and I am open to using Spark and Cassandra wherever required. Please explain as clearly as possible, with an example if possible, so that it will be clear.
I appreciate the effort you will be taking in answering these questions.
Thanks
Tom
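The de-normalization advice the question keeps running into can be pictured outside Cassandra entirely. In a query-per-table model, the application writes the same record into one table per search field, and joins are avoided by duplicating data at write time. A minimal Python illustration of that idea follows; the plain dicts stand in for Cassandra tables, and the field names (`shipment_id`, `equipment_number`, etc.) are invented for the sketch:

```python
# Each "table" is keyed by the field its query needs; the same
# shipment row is written to every table (write-time denormalization).
shipments_by_id = {}          # query: look up by Shipment ID
shipments_by_equipment = {}   # query: look up by Equipment Number

def insert_shipment(row):
    """Write the full row to every query-specific table."""
    shipments_by_id[row["shipment_id"]] = row
    shipments_by_equipment.setdefault(row["equipment_number"], []).append(row)

insert_shipment({"shipment_id": "S1", "equipment_number": "EQ9",
                 "consignee_name": "Acme", "weight_lbs": 120})

# No join needed: each query hits the table partitioned for it.
print(shipments_by_id["S1"]["consignee_name"])   # Acme
print(len(shipments_by_equipment["EQ9"]))        # 1
```

In Cassandra the same pattern means one table per query, with the search field as the partition key, which is why the usual advice is to repeat fields rather than join.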

Tableau Calculated Field using FIXED

I have a database in Tableau from an Excel file. Every row in the database is one ticket (assigned to a customer ID) for a theme park, across two years.
The structure is like the following:
Every Id can buy tickets for different parks (or same park several times), also in different years.
What I am not able to do is flag those customers who have been to the same park in two different years (in the example, customer 004 went to park a in both 2016 and 2017).
How do I create this calculated field in Tableau?
(I managed to solve this in Excel with a SUMPRODUCT function, but the database has more than 500k rows and crashes after a while; besides, I want a calculated field so it keeps working if I update the Excel file with a new park or a new year.)
Ideally, the structure of the output I have in mind would be like the following (but I am open to different views, as long as I get to the result): flag with 1 those customers who have visited the same park in two different years.
Create a calculated field called customer_park_years =
{ fixed [Customerid], [Park] : countd([year]) }
You can use that on the filter shelf to only include data for customer_park_years >= 2
Then you will be able to visualize only the data related to those customers visiting specific parks that they visited in multiple years. If you also want to then look at their behavior at other parks, you'll have to adjust your approach instead of just simply filtering out the other data. Changes depend on the details of your question.
But to answer your specific question, this should be an easy way to go.
Note that COUNTD() can be slow for very large data sets, but it makes answering questions without reshaping your data easy, so it's often a good tradeoff.
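For readers more comfortable outside Tableau, the same FIXED-level-of-detail logic can be checked in pandas: count distinct years per (customer, park) and broadcast that count back to every row. The ticket rows below are made up to mirror the question's example (customer 004 visiting park a in 2016 and 2017):

```python
import pandas as pd

# Hypothetical ticket rows: one row per ticket, as in the question.
tickets = pd.DataFrame({
    "Customerid": ["001", "002", "004", "004", "004"],
    "Park":       ["a",   "b",   "a",   "a",   "b"],
    "Year":       [2016,  2016,  2016,  2017,  2016],
})

# Equivalent of { FIXED [Customerid], [Park] : COUNTD([Year]) }:
# the distinct-year count is broadcast back onto every row.
tickets["customer_park_years"] = (
    tickets.groupby(["Customerid", "Park"])["Year"].transform("nunique"))

# Flag rows where the customer visited that park in two different years.
tickets["flag"] = (tickets["customer_park_years"] >= 2).astype(int)
print(tickets[tickets["flag"] == 1])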
Try this!
IFNULL(str({fixed [Customerid],[Park]:IF sum(1)>1 then 1 ELSE 0 END}),'0')

New variable in Alteryx without join

So I've created a simple bit of analysis in Alteryx, and I'm looking for a bit of advice.
I have a dataset of merchants and shoppers. I pull out the total number of unique shoppers with a Summarize tool's CountDistinct(shoppers); let's say it's 100. Then I create a table of the number of unique shoppers within each merchant, and that table looks something like this:
Merchant | Unique Users
Merchant 1 | 76
Merchant 2 | 19
Merchant 3 | 97
Merchant 4 | 44
Merchant 5 | 55
...
I'd like to create a variable that will be [Number of Distinct Users]/countdistinct(shoppers).
I know that I could just append the value for countdistinct(shoppers) into a new third column in my table, but I'd prefer not to do that unless I have to. Is there a way to save the single number from countdistinct(shoppers) as a value and simply divide directly by that without having to append or join?
There is currently no way to do that in Alteryx. While it feels wrong or wasteful, including the calculated count-distinct value on every record lets Alteryx process each record individually and then pass it downstream.
I have heard talk of creating something like a global variable that could be accessed or updated to handle the situation you're describing, but the complexity of managing reads and writes from different parts of the workflow, as well as any resulting slowdown from thread coordination, always sidelines the discussion.
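The append-the-global-value pattern the answer describes is easy to see in plain Python: compute the overall distinct count once, then divide each merchant's distinct count by it. The (merchant, shopper) pairs below are invented to stand in for the Alteryx data stream:

```python
from collections import defaultdict

# Hypothetical (merchant, shopper) pairs; shoppers overlap merchants.
data = [
    ("Merchant 1", "s1"), ("Merchant 1", "s2"), ("Merchant 1", "s3"),
    ("Merchant 2", "s2"), ("Merchant 2", "s4"),
    ("Merchant 3", "s1"), ("Merchant 3", "s4"), ("Merchant 3", "s5"),
]

# Step 1: global distinct shopper count (the Summarize CountDistinct).
total_shoppers = len({shopper for _, shopper in data})

# Step 2: distinct shoppers per merchant, then the ratio. In Alteryx
# this is the Append Fields + Formula pattern: the single global value
# is attached to every merchant record before dividing.
per_merchant = defaultdict(set)
for merchant, shopper in data:
    per_merchant[merchant].add(shopper)

ratios = {m: len(s) / total_shoppers for m, s in per_merchant.items()}
print(total_shoppers)        # 5
print(ratios["Merchant 1"])  # 0.6
```

The "waste" of repeating `total_shoppers` on every record is what lets each record be processed independently, which is exactly the tradeoff the answer points out.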
