Matching two SPSS datasets [difficult] - statistics

I'm currently busy combining two datasets in SPSS, but it's not the usual problem. After some (crafty) manipulations I've managed to bring it down to:
-Dataset I: non-unique ID 'A'
-Dataset II: unique ID 'B'
I want to keep dataset I and, for each row where A matches B, add the data from dataset II to that row.
So: dataset I contains a person's ID and a disease in each row (multiple diseases possible, hence non-unique ID) & dataset II contains a person's ID and address line (unique). I want to merge those so that each ID + disease gets updated with the address if that is available.
In addition, I'd like to keep the rows from I where A has no matching B in II, and to add new cases for the rows from II where B did not match any A.
Would something like this be possible using SPSS?

See the MATCH FILES command and also this example.
Something like this should work (ensuring ID variables from each dataset have a common name, in this example simply "ID"):
DATASET ACTIVATE DS1.
SORT CASES BY ID.
DATASET ACTIVATE DS2.
SORT CASES BY ID.
DATASET ACTIVATE DS1.
MATCH FILES FILE=* /TABLE=DS2 /BY ID.
Note that /TABLE treats DS2 as a keyed lookup table: every case from dataset I is kept (with the address filled in where a match exists), but DS2 cases whose B matches no A are not added. Picking those up needs an extra step, e.g. flagging matches with /IN and appending the leftover DS2 cases afterwards.

Related

How to identify all columns that have different values in a Spark self-join

I have a Databricks Delta table of financial transactions that is essentially a running log of all changes that ever took place on each record. Each record is uniquely identified by 3 keys, but each record can have multiple instances (rows) in this table, each representing a historical entry of a change (across one or more columns of that record). Now if I wanted to find out cases where a specific column value changed, I can easily achieve that by doing something like this:
SELECT t1.Key1, t1.Key2, t1.Key3, t1.Col12 AS `Before`, t2.Col12 AS `After`
FROM table1 t1
INNER JOIN table1 t2
  ON t1.Key1 = t2.Key1 AND t1.Key2 = t2.Key2 AND t1.Key3 = t2.Key3
WHERE t1.Col12 != t2.Col12
However, these tables have a large number of columns. What I'm trying to achieve is a way to identify any columns that changed in a self-join like this: essentially a list of all column names that changed across all records. I don't care about the actual values, and it doesn't even have to be per row. The 3 keys will always be excluded, since they uniquely define a record.
Essentially I'm trying to find any columns that are susceptible to change, so that I can focus on them specifically for some other purpose.
Any suggestions would be really appreciated.
Databricks has change data feed (CDF / CDC) functionality that can simplify this type of use case: https://docs.databricks.com/delta/delta-change-data-feed.html
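If CDF is not available, a brute-force alternative is to count mismatches per column directly in PySpark. The sketch below assumes the table is readable as table1 and that the keys are named Key1/Key2/Key3 as in the question; it self-joins on the keys and counts, for every non-key column, how many joined rows disagree.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.table("table1")                      # the transactions log table
key_cols = ["Key1", "Key2", "Key3"]             # keys that identify a record
value_cols = [c for c in df.columns if c not in key_cols]

joined = df.alias("t1").join(
    df.alias("t2"),
    [F.col(f"t1.{k}") == F.col(f"t2.{k}") for k in key_cols],
    "inner",
)

# One pass over the joined data: count mismatches for every column at once.
mismatch_counts = joined.select([
    F.sum((~F.col(f"t1.{c}").eqNullSafe(F.col(f"t2.{c}"))).cast("int")).alias(c)
    for c in value_cols
]).collect()[0].asDict()

changed_columns = [c for c, n in mismatch_counts.items() if n]
print(changed_columns)   # names of columns whose value ever differed between versions
Compared with running one SQL query per column, this scans the join only once, but it is still a full self-join, so CDF remains the better option on large tables.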

Python3 Pandas dataframes: besides column names, are there also column labels?

Many database management systems, such as Oracle or SQL Server, and even statistical software like SAS, allow having, besides field names, also field labels.
E.g., in DBMS one may have a table called "Table1" with, among other fields, two fields called "income_A" and "income_B".
Now, in the DBMS logic, "income_A" and "income_B" are the field names.
Besides a name, those two fields can also have plain-English labels associated with them, which clarify the actual meaning of those two fields, such as "A - Income of households with dependent children where both parents work and they have a post-degree level of education" and "B - Income of empty-nester households where only one person works".
Is there anything like that in Python3 Pandas dataframes?
I mean, I know I can give a dataframe column a "label" (which is, seen from the above DBMS perspective, more like a "name", in the sense that you can use it to refer to the column itself).
But can I also associate a longer description to the column, something that I can choose to display instead of the column "label" in print-outs and reports or that I can save into dataframe exports, e.g., in MS Excel format? Or do I have to do it all using data dictionaries, instead?
It does not seem that there is a way to store such meta info other than in the column name itself. But the column name can be quite verbose; I tested up to 100 characters. Make sure to pass it as a collection (e.g. a list), as in the example below.
Such a long name could be annoying to use for indexing in the code. You could use loc/iloc or assign the name to a string for use in indexing.
In [10]: pd.DataFrame([1, 2, 3, 4], columns=['how long can this be i want to know please tell me'])
Out[10]:
   how long can this be i want to know please tell me
0                                                    1
1                                                    2
2                                                    3
3                                                    4
This page shows that the columns don't really have any attributes to play with other than the labels.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.columns.html
There is some more info you can get about a dataframe:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html
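One common workaround (a sketch of a convention, not a built-in pandas feature) is to keep the labels in a separate dictionary, work with the short names in code, and only apply the long labels when producing a report or an export. The label texts below are just placeholders.
import pandas as pd

df = pd.DataFrame({"income_A": [52000, 61000], "income_B": [34000, 38000]})

# data dictionary: short column name -> long human-readable label
labels = {
    "income_A": "A - Income of households with dependent children where both parents work",
    "income_B": "B - Income of empty-nester households where only one person works",
}

# use the short names for all the actual work...
high_a = df[df["income_A"] > 55000]

# ...and swap in the labels only for print-outs or exports
report = df.rename(columns=labels)
print(report)
report.to_excel("incomes_labelled.xlsx", index=False)   # needs openpyxl installed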

How to get a list of columns that will give me a unique record in Pyspark Dataframe

My intention is to write a python function that would take a pyspark DataFrame as input, and its output would be a list of columns (could be multiple lists) that give a unique record when combined together.
So, if you take a set of values for the columns in the list, you would always get just 1 record from the DataFrame.
Example:
Input Dataframe
Name Role id
--------------------
Tony Dev 130
Stark Qa 131
Steve Prod 132
Roger Dev 133
--------------------
Output:
Name,Role
Name,id
Name,id,Role
Why is the output what it is?
For any Name,Role combination I will always get just 1 record
And, for any Name, id combination I will always get just 1 record.
There are ways to define a function that will do exactly what you are asking for.
I will only show 1 possibility and it is a very naive solution. You can iterate through all the combinations of columns and check whether they form a unique entry in the table:
import itertools as it

def find_all_unique_columns_naive(df):
    cols = df.columns
    res = []
    for num_of_cols in range(1, len(cols) + 1):
        for comb in it.combinations(cols, num_of_cols):
            num_of_nonunique = df.groupBy(*comb).count().where("count > 1").count()
            if not num_of_nonunique:
                res.append(comb)
    return res
With a result for your example being:
[('Name',), ('id',), ('Name', 'Role'), ('Name', 'id'), ('Role', 'id'),
('Name', 'Role', 'id')]
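For reference, that result can be reproduced with the example data from the question like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Tony", "Dev", 130), ("Stark", "Qa", 131),
     ("Steve", "Prod", 132), ("Roger", "Dev", 133)],
    ["Name", "Role", "id"],
)
print(find_all_unique_columns_naive(df))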
There is obviously a performance issue, since this function's runtime grows exponentially with the number of columns, i.e. O(2^N). That means that for a table with just 20 columns it is already going to take quite a long time.
There are, however, some obvious ways to speed this up. For example, if you already know that column Name is unique, then any combination which includes it will remain unique, so you can immediately deduce that (Name, Role), (Name, id) and (Name, Role, id) are unique as well; this prunes the search space quite effectively. The worst case, however, remains the same: if the table has no unique combination of columns, you have to exhaust the entire search space to reach that conclusion.
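As a rough sketch of that pruning idea (only the minimal unique combinations are reported; any candidate that contains an already-confirmed unique set is skipped without touching the data):
import itertools as it

def find_minimal_unique_columns(df):
    cols = df.columns
    minimal = []
    for num_of_cols in range(1, len(cols) + 1):
        for comb in it.combinations(cols, num_of_cols):
            # a superset of a known unique combination is unique by definition
            if any(set(u).issubset(comb) for u in minimal):
                continue
            if df.groupBy(*comb).count().where("count > 1").count() == 0:
                minimal.append(comb)
    return minimal
For the example table this returns [('Name',), ('id',)]; every other unique combination listed above is a superset of one of these.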
As a conclusion, I'd suggest you think about why you want this function in the first place. There may be some specific use-cases for small tables, I agree, just to save some time, but honestly this is not how one should treat a table. If a table exists, there should be a purpose for it and a proper table design, i.e. a definition of how the data inside the table are really structured and updated, and that should be the starting point when looking for unique identifiers. Even if you find other unique identifiers now with this method, it may well be that the next update to the table design destroys them. I'd much rather suggest using the table's metadata and documentation, because then you can be sure that you are treating the table in the way it was designed for and, in case the table has a lot of columns, it is actually faster.

Excel Matching Customer Orders by Item and Quantity

Brief:
I have a large dataset, inside of which are individual customer orders by item and quantity. What I'm trying to do is get Excel to tell me which order numbers contain exact matches (in terms of items and quantities) to each other. Ideally, I'd like to have a tolerance of, say, 80% accuracy which I can flex to purpose, but I'll take anything to get me off the ground.
Existing Solution:
At the moment, I've used concatenation to pair item with quantity, pivoted, and then put the order references as columns and the concatenations as rows with quantity as data (sorted by quantity descending). I'm visually scrolling across/down to find matches and then manually stripping them in my main data where necessary. I have about 2,500 columns to check, so I was hoping to find a more suitable solution with Excel doing the legwork on identification.
INDEX/MATCH works for cross-referencing a match on the concatenation, but of course the order numbers (which are unique) are different, so it's not giving me matches ACROSS orders.
Fingers crossed!
EDIT:
Data set with outcomes
As you can see, the bottom order reference has no correlation to the orders above it, so it is not listed as a match to them, but 3 are identical and 1 has a slightly different item but MOSTLY matches.
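One way to sketch the exact-match part outside Excel is with pandas: fingerprint every order by its full set of (item, quantity) pairs and group orders that share a fingerprint. The column names below ("order", "item", "qty") are placeholders for the real layout.
import pandas as pd

orders = pd.DataFrame({
    "order": ["A1", "A1", "B7", "B7", "C3"],
    "item":  ["widget", "gasket", "widget", "gasket", "widget"],
    "qty":   [10, 2, 10, 2, 5],
})

# fingerprint each order as its sorted set of (item, qty) pairs
fingerprints = (
    orders.groupby("order")[["item", "qty"]]
          .apply(lambda g: tuple(sorted(zip(g["item"], g["qty"]))))
          .rename("fingerprint")
          .reset_index()
)

# orders sharing a fingerprint match exactly on items and quantities
matches = fingerprints.groupby("fingerprint")["order"].apply(list)
print(matches[matches.str.len() > 1])   # e.g. ['A1', 'B7']
The 80%-tolerance version is harder: it needs a pairwise similarity between fingerprints (e.g. the share of common (item, qty) pairs) rather than a simple group-by, which is not shown here.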

Excel Management for Inventory

Okay, hope this question will be clear enough that I can get an answer. Thanks for the help.
The situation is that I am downloading some information into two different spreadsheets which contain orders from two different stores.
The problem is that between these two stores the model numbers (SKU#) for a lot of items are different even though the product is the same. There is no changing that now. I do have a list of equivalencies. For example, I know that 00-XX-55 is the same in Store 1 as 22-FF-33. There isn't a logical equivalency so I would be setting them manually.
My question is whether there is any way I can combine data from the two sheets and set up manual equivalencies while doing this. Would Excel allow me to manage the data in a way that lets me join the two unequal SKUs?
You need a two-column translation table. Once you have this you can manage combined inventory because you can then determine the total inventory of a single item in both stores.
So in a solution do you want to translate all to the store 1 sku, the store 2 sku, or a third warehouse sku? I guess what I am driving at here is that there needs to be a superior synonym to sort of design around.
To build a translation table you would put the original sku (the sku that you will convert from, sort of the inferior number you do not want to go by for purposes of the summarization) into column A and the master sku into column B. We will call this sheet "converter".
You could either have:
A, B
00-XX-55, 22-FF-33
This could normalize everything to the 22- sku. Or you could do this:
A, B
00-XX-55, 123abc
22-FF-33, 123abc
This way you normalize to a third value rather than to either of the stores' values.
In your inventory page, column C is the SKU column, so in cell D2 put =IFERROR(VLOOKUP(C2, converter!A:B, 2, FALSE), C2) and fill that all the way down. Now in each row you have the original and the master SKU next to each other in C and D. If the SKU is not found in the converter table, the formula just keeps whatever value was in C. You can then build pivot tables using D to group on.
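The same lookup logic can also be expressed outside Excel, e.g. in pandas, in case the downloaded order data ever gets processed in a script; the column names ("sku", "master_sku", "qty") are illustrative.
import pandas as pd

# the converter sheet: original sku -> master sku
converter = pd.DataFrame({
    "sku":        ["00-XX-55", "22-FF-33"],
    "master_sku": ["123abc",   "123abc"],
})

# combined inventory rows downloaded from the two stores
inventory = pd.DataFrame({"sku": ["00-XX-55", "22-FF-33", "99-ZZ-11"],
                          "qty": [4, 7, 2]})

# equivalent of IFERROR(VLOOKUP(...), C2): fall back to the original sku when no match
lookup = converter.set_index("sku")["master_sku"]
inventory["master_sku"] = inventory["sku"].map(lookup).fillna(inventory["sku"])

print(inventory.groupby("master_sku")["qty"].sum())   # total inventory per master sku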
