How to get a list of columns that will give me a unique record in Pyspark Dataframe - apache-spark

My intention is to write a python function that would take a pyspark DataFrame as input, and its output would be a list of columns (could be multiple lists) that gives a unique record when combined together.
So, if you take a set of values for the columns in the list, you would always get just 1 record from the DataFrame.
Example:
Input DataFrame
Name   Role  id
--------------------
Tony   Dev   130
Stark  Qa    131
Steve  Prod  132
Roger  Dev   133
--------------------
Output:
Name,Role
Name,id
Name,id,Role
Why is the output what it is?
For any Name,Role combination I will always get just 1 record
And, for any Name, id combination I will always get just 1 record.

There are ways to define a function that will do exactly what you are asking for.
I will only show one possibility, and it is a very naive solution: iterate through all the combinations of columns and check whether each of them forms a unique entry in the table:
import itertools as it

def find_all_unique_columns_naive(df):
    cols = df.columns
    res = []
    for num_of_cols in range(1, len(cols) + 1):
        for comb in it.combinations(cols, num_of_cols):
            num_of_nonunique = df.groupBy(*comb).count().where("count > 1").count()
            if not num_of_nonunique:
                res.append(comb)
    return res
With a result for your example being:
[('Name',), ('id',), ('Name', 'Role'), ('Name', 'id'), ('Role', 'id'),
('Name', 'Role', 'id')]
There is obviously a performance issue, since this function's runtime grows exponentially with the number of columns, i.e. O(2^N). That means that for a table with just 20 columns it is already going to take quite a long time.
There are, however, some obvious ways to speed this up. For example, if you already know that the column Name is unique, then any combination that includes an already known unique combination must itself be unique, so you can deduce that (Name, Role), (Name, id) and (Name, Role, id) are unique without testing them, which reduces the search space quite efficiently. The worst case remains the same, though: if the table has no unique combination of columns at all, you will have to exhaust the entire search space to reach that conclusion.
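A minimal sketch of that pruning idea, assuming the same groupBy-based uniqueness test as above and keeping only the minimal unique combinations (the function name is my own):
import itertools as it

def find_minimal_unique_columns(df):
    """Sketch: return only the minimal unique column combinations.

    Any superset of a combination already known to be unique is skipped,
    since it is unique by definition and needs no extra groupBy job.
    """
    cols = df.columns
    minimal = []
    for num_of_cols in range(1, len(cols) + 1):
        for comb in it.combinations(cols, num_of_cols):
            # Prune: supersets of known unique combinations are unique anyway.
            if any(set(known).issubset(comb) for known in minimal):
                continue
            num_of_nonunique = df.groupBy(*comb).count().where("count > 1").count()
            if not num_of_nonunique:
                minimal.append(comb)
    return minimal
For the example table this would return only [('Name',), ('id',)]; every superset from the earlier output can be derived from those two without running any further Spark jobs.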
As a conclusion, I'd suggest that you think about why you want this function in the first place. There might be some specific use cases for small tables, I agree, just to save some time, but to be completely honest, this is not how one should treat a table. If a table exists, then there should be a purpose for it and a proper table design, i.e. a definition of how the data inside the table are really structured and updated, and that should be the starting point when looking for unique identifiers. Even though you can find other unique identifiers now with this method, it may very well be that the next update, under the actual table design, destroys them. I'd much rather suggest using the table's metadata and documentation, because then you can be sure that you are treating the table the way it was designed, and, in case the table has a lot of columns, it is actually faster.

Related

How to identify all columns that have different values in a Spark self-join

I have a Databricks delta table of financial transactions that is essentially a running log of all changes that ever took place on each record. Each record is uniquely identified by 3 keys. So given that uniqueness, each record can have multiple instances in this table, each representing a historical entry of a change (across one or more columns of that record). Now if I wanted to find out cases where a specific column value changed, I can easily achieve that by doing something like this:
SELECT t1.Key1, t1.Key2, t1.Key3, t1.Col12 AS "Before", t2.Col12 AS "After"
FROM table1 t1
INNER JOIN table1 t2
  ON t1.Key1 = t2.Key1 AND t1.Key2 = t2.Key2 AND t1.Key3 = t2.Key3
WHERE t1.Col12 != t2.Col12
However, these tables have a large amount of columns. What I'm trying to achieve is a way to identify any columns that changed in a self-join like this. Essentially a list of all columns that changed. I don't care about the actual value that changed. Just a list of column names that changed across all records. Doesn't even have to be per row. But the 3 keys will always be excluded, since they uniquely define a record.
Essentially I'm trying to find any columns that are susceptible to change. So that I can focus on them dedicatedly for some other purpose.
Any suggestions would be really appreciated.
Databricks has change data feed (CDF / CDC) functionality that can simplify these types of use cases. https://docs.databricks.com/delta/delta-change-data-feed.html
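For illustration, a rough PySpark sketch of reading the change feed and deriving the list of columns that ever changed might look like the following (run in a Databricks notebook where spark is predefined). The table name, starting version, and key column names are placeholders, and it assumes the delta.enableChangeDataFeed table property has already been set:
from pyspark.sql import functions as F

KEYS = ["Key1", "Key2", "Key3"]                                  # placeholder key columns
META = ["_change_type", "_commit_version", "_commit_timestamp"]  # CDF metadata columns

# Read the change data feed of the table (assumes CDF is enabled on it).
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 0)            # placeholder starting point
           .table("finance.transactions"))          # placeholder table name

# Pair up the before/after images of each update within the same commit.
pre = changes.where(F.col("_change_type") == "update_preimage").alias("b")
post = changes.where(F.col("_change_type") == "update_postimage").alias("a")
join_cond = [F.col(f"b.{k}") == F.col(f"a.{k}") for k in KEYS]
join_cond.append(F.col("b._commit_version") == F.col("a._commit_version"))
joined = pre.join(post, on=join_cond)

# A column "changed" if at least one before/after pair differs (null-safe).
value_cols = [c for c in changes.columns if c not in KEYS + META]
changed_cols = [c for c in value_cols
                if joined.where(~F.col(f"b.{c}").eqNullSafe(F.col(f"a.{c}")))
                         .limit(1).count() > 0]
print(changed_cols)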

Python3 Pandas dataframes: besides column names, are there also column labels?

Many database management systems, such as Oracle, SQL Server or even statistical software like SAS, allow having, beside field names, also field labels.
E.g., in DBMS one may have a table called "Table1" with, among other fields, two fields called "income_A" and "income_B".
Now, in the DBMS logic, "income_A" and "income_B" are the field names.
Besides a name, those two fields can also have plain-English labels associated with them, which clarify the actual meaning of those two fields, such as "A - Income of households with dependent children where both parents work and they have a post-degree level of education" and "B - Income of empty-nester households where only one works".
Is there anything like that in Python3 Pandas dataframes?
I mean, I know I can give a dataframe column a "label" (which is, seen from the above DBMS perspective, more like a "name", in the sense that you can use it to refer to the column itself).
But can I also associate a longer description to the column, something that I can choose to display instead of the column "label" in print-outs and reports or that I can save into dataframe exports, e.g., in MS Excel format? Or do I have to do it all using data dictionaries, instead?
It does not seem that there is a way to store such meta info other than in the column name itself. But the column name can be quite verbose; I tested up to 100 characters. Make sure to pass it as a collection.
Such a long name could be annoying to use for indexing in the code. You could use loc/iloc, or assign the name to a string variable and use that for indexing.
In[10]: pd.DataFrame([1, 2, 3, 4], columns=['how long can this be i want to know please tell me'])
Out[10]:
   how long can this be i want to know please tell me
0                                                    1
1                                                    2
2                                                    3
3                                                    4
This page shows that the columns don't really have any attributes to play with other than the labels.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.columns.html
There is some more info you can get about a dataframe:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html
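As a small addition: one common workaround is to keep the long descriptions in an ordinary dict keyed by the short column names and only swap them in with rename() for reports or exports, which keeps the code readable while still producing verbose headers. A minimal sketch, where the label texts and the file name are just placeholders:
import pandas as pd

df = pd.DataFrame({"income_A": [52000, 61000], "income_B": [34000, 30000]})

# Plain data dictionary: short column name -> long, human-readable label.
labels = {
    "income_A": "A - Income of households with dependent children, both parents working",
    "income_B": "B - Income of empty-nester households where only one parent works",
}

# Work with the short names in code...
high_a = df[df["income_A"] > 50000]

# ...and apply the verbose labels only when exporting or printing.
df.rename(columns=labels).to_excel("incomes.xlsx", index=False)  # requires openpyxl
print(df.rename(columns=labels).to_string(index=False))
Newer pandas versions also expose an experimental DataFrame.attrs dict for free-form metadata, but exporters such as to_excel do not use it.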

How to optimize Cassandra model while still supporting querying by contents of lists

I just switched from Oracle to using Cassandra 2.0 with the Datastax driver and I'm having difficulty structuring my model for this big data approach. I have a Persons table with UUID and serialized Persons. These Persons have lists of addresses, names, identifications, and DOBs. For each of these lists I have an additional table with a compound key on each value in the respective list plus a person_UUID column. This model feels too relational to me, but I don't know how else to structure it so that I have an index on (am able to search by) address, name, identification, and DOB. If Cassandra supported indexes on lists I would have just the one Persons table containing indexed lists for each of these.
In my application we receive transactions, which can contain within them 0 or more of each of those address, name, identification, and DOB. The persons are scored based on which person matched which criteria. A single person with the highest score is matched to a transaction. Any additional address, name, identification, and DOB data from the transaction that was matched is then added to that person.
The problem I'm having is that this matching is taking too long and the processing is falling far behind. This is caused by having to loop through result sets performing additional queries, since I can't make complex queries in Cassandra, and I don't have sufficient memory to just do a huge select-all and filter in Java. For instance, I would like to select all Persons having at least two names in common with the transaction (names can have their order scrambled, so there is no first, middle, last; that would just be three names), but this would require a 'group by' which Cassandra does not support, and if I just selected all having any of the names in common in order to filter in Java, the result set is too large and I run out of memory.
I'm currently searching by only Identifications and Addresses, which yield a smaller result set (although it could still be hundreds), and for each one in this result set I query to see if it also matches on names and/or DOB. Besides still being slow, this does not meet the project's requirements, as a match on Name and DOB alone would be sufficient to link a transaction to a person if no higher score is found.
I know in Cassandra you should model your tables by the queries you do, not by the relationships of the entities, but I don't know how to apply this while maintaining the ability to query individually by address, name, identification, and DOB.
Any help or advice would be greatly appreciated. I'm very impressed by Cassandra but I haven't quite figured out how to make it work for me.
Tables:
Persons
[UUID | serialized_Person]
addresses
[address | person_UUID]
names
[name | person_UUID]
identifications
[identification | person_UUID]
DOBs
[DOB | person_UUID]
I did a lot more reading, and I'm now thinking I should change these tables around to the following:
Persons
[UUID | serialized_Person]
addresses
[address | Set of person_UUID]
names
[name | Set of person_UUID]
identifications
[identification | Set of person_UUID]
DOBs
[DOB | Set of person_UUID]
But I'm afraid of going beyond the max storage for a set (65,536 UUIDs) for some names and DOBs. Instead I think I'll have to do a dynamic column family with the column names as the Person_UUIDs, or is a row with over 65k columns very problematic as well? Thoughts?
It looks like you can't have these dynamic column families in the new version of Cassandra; you have to alter the table to insert the new column with a specific name. I don't know how to store more than 64k values for a row then. With a perfect distribution I will run out of space for DOBs with 23 million persons, and I'm expecting to have over 200 million persons. Maybe I just have to have multiple set columns?
DOBs
[DOB | Set of person_UUID_A | Set of person_UUID_B | Set of person_UUID_C]
and I just check size and alter table if size = 64k? Anything better I can do?
I guess it's just CQL3 that enforces this and that if I really wanted I can still do dynamic columns with the Cassandra 2.0?
Ugh, this page from Datastax doc seems to say I had it right the first way...:
When to use a collection
This answer is not very specific, but I'll come back and add to it when I get a chance.
First thing - don't serialize your Persons into a single column. This complicates searching and updating any person info. OTOH, there are people who know what they're talking about who disagree with this view. ;)
Next, don't normalize your data. Disk space is cheap, so don't be afraid to write the same data to two places. Your code will need to make sure that the right thing is done.
Those items feed into this: If you want queries to be fast, consider what you need to make that query fast. That is, create a table just for that query. That may mean writing data to multiple tables for multiple queries. Pick a query, and build a table that holds exactly what you need for that query, indexed on whatever you have available for the lookup, such as an id.
So, if you need to query by address, build a table (really, a column family) indexed on address. If you need to support another query based on identification, index on that. Each table may contain duplicate data. This means that when you add a new user, you may be writing the same data to more than one table. This seems unnatural if relational databases are the only kind you've ever used, but you get benefits in return, namely horizontal scalability, thanks to the CAP Theorem.
Edit:
The two column families in that last example could just hold identifiers into another table. So, voilà, you have made an index. OTOH, that means each query takes two reads. But it will still be a performance improvement in many cases.
Edit:
Attempting to explain the previous edit:
Say you have a users table/column family:
CREATE TABLE users (
    id uuid PRIMARY KEY,
    display_name text,
    avatar text
);
And you want to find a user's avatar given a display name (a contrived example). Searching users will be slow. So, you could create a table/CF that serves as an index, let's call it users_by_name:
CREATE TABLE users_by_name (
    display_name text PRIMARY KEY,
    user_id uuid
);
The search on display_name is now done against users_by_name, and that gives you the user_id, which you use to issue a second query against users. In this case, user_id in users_by_name has the value of the primary key id in users. Both queries are fast.
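A rough sketch of that two-read pattern with the DataStax Python driver (the contact point, keyspace, and queried display name are placeholders):
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # placeholder contact point
session = cluster.connect("my_keyspace")  # placeholder keyspace

def avatar_for(display_name):
    # Read 1: the index-like table maps display_name -> user_id.
    row = session.execute(
        "SELECT user_id FROM users_by_name WHERE display_name = %s",
        (display_name,)).one()
    if row is None:
        return None
    # Read 2: the base table, keyed by id, holds the rest of the user data.
    user = session.execute(
        "SELECT avatar FROM users WHERE id = %s", (row.user_id,)).one()
    return user.avatar if user else None

print(avatar_for("some_display_name"))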
Or, you could put avatar in users_by_name, and accomplish the same thing with one query by using more disk space.
CREATE TABLE users_by_name (
    display_name text PRIMARY KEY,
    avatar text
);

Cassandra super column structure

I'm new to Cassandra, and I'm not familiar with super columns.
Consider this scenario: suppose we have some fields of a customer entity like
Name
Contact_no
address
and we can store all these values in normal columns. I want to arrange things so that when a person moves from one location to another (the representative field could store the longitude and latitude), those values are stored consecutively with respect to the customer's location. I think we can do this with super columns but I'm confused about how to design the schema to accomplish this.
Please help me create this schema and understand the concepts behind super columns.
Super columns are really not recommended anymore... they are still used, but more and more people have switched to composite columns. For example, playOrm uses this concept for indexing. If I am indexing an integer, an indexing row may look like this
rowkey = 10.pk56 10.pk39 11.pk50
where the column name type is a composite of integer and string in this case. These rows can have up to about 10 million columns, though I have only run experiments up to 1 million myself. For example, playOrm's queries use these types of indexes to do a query that took 60 ms on 1,000,000 rows.
With playOrm, you can do scalable relational models in noSQL... you just need to figure out how to partition your data correctly, as you can have as many partitions as you want in each table, but a partition should really not be over 10 million rows.
Back to the example though: if you have a table with columns numShares, price, username, age, you may want to index numShares, and the above row would be that index, so you could grab the index by key OR, better yet, grab all column names with numShares > 20 and numShares < 50.
Once you have those columns, you can then take the second half of each column name, which is the primary key. The reason the primary key is NOT a value is that, as in the example above, there are two rows pk56 and pk39 with the same 10, and you can't have two columns named 10, but you can have a 10.pk56 and a 10.pk39.
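If you are on CQL3 rather than Thrift-style wide rows, a hedged sketch of the same composite-name idea with the DataStax Python driver could look roughly like this; the table and column names are hypothetical, and the (num_shares, pk) clustering columns play the role of composite column names such as 10.pk56:
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # placeholder contact point
session = cluster.connect("my_keyspace")  # placeholder keyspace

# One wide "index" row per partition; the clustering columns (num_shares, pk)
# are the CQL3 counterpart of composite column names like 10.pk56.
session.execute("""
    CREATE TABLE IF NOT EXISTS num_shares_index (
        partition  text,
        num_shares int,
        pk         text,
        PRIMARY KEY (partition, num_shares, pk)
    )
""")

# Range-scan the composite names: all primary keys with 20 < numShares < 50.
rows = session.execute(
    "SELECT num_shares, pk FROM num_shares_index "
    "WHERE partition = %s AND num_shares > %s AND num_shares < %s",
    ("p0", 20, 50))
for r in rows:
    print(r.num_shares, r.pk)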
later,
Dean

UniData UniQuery - two WITH

Alright, I have little to no knowledge of the SQL language, and am wondering what the possible reasons are for the slowness of two WITHs vs one WITH in UniData.
Database has around ~1 million rows.
Ie/
SELECT somewhere WITH Column1 = "str" AND WITH Column2 = "Int"   (5+ minutes)
Compared to
SELECT somewhere WITH Column1 = "str"   (~1 second)
somewhere is indexed (from my knowledge)
so is there anything I'm doing wrong?
If more information is required just ask, not sure what to supply.
Also, what's the difference between WITH and WHERE?
This isn't SQL, it is UniQuery.
To clarify it for you, you can't index the file (somewhere, in this case), only the columns of the file. You might find Column1 is indexed and Column2 is not. Type in LIST.INDEX somewhere to find out what columns have been indexed.
For your question, you have only compared selecting on Column1 against selecting on Column1 & Column2 and assumed the vastly slower response is purely because you selected on 2 columns. Your next test should have been to select only on Column2 and see how slow that was.
There are many possible reasons for the difference in response, aside from indexing. In UniData, columns are defined as 'dictionary items', and there are different types of dictionary items. The most basic is a D-type dictionary item, which is just a direct reference to a field in the record. Another type is the I or V-type, which is a derived field. A derived field can be as simple as returning a constant or as complex as performing the equivalent of a JOIN with another file and/or some form of complex calculation. So it should be easy to see that different columns can take vastly different amounts of processing to handle.
Other reasons are how deep in the record the column is (first field references will be faster than fields later in the record) as well as potential query caching that can affect the timings of your SELECTs.
For more information, check out the database's manuals at Rocket Software.
A single column SELECT on an indexed field will not even require that any data file records are read. If you look under the hood, you'll see that the index file is a normal hash file, and the single column SELECT will simply mean that the index file record with the key "str" is read. This could return thousands and thousands of keys in less than a second.
Once you add the second column, you are probably forcing the system to read all of those thousands and thousands of records, EVEN IF THE SECOND COLUMN IS INDEXED. This is going to take a measurable amount of more time.
In general, an index on a field with a small number of unique values is of dubious use. If the second column contains data that has a large number of possible values, leading to a smaller number of records with each particular index value, then it would be best to arrange the SELECT such that the index used is on the second column. I'm not sure, but it might be possible to simply reverse the order of the columns in the SELECT statement to do this. Otherwise you might need to run two SELECT statements back to back.
As an example, assume that the file has 600,000 records with Column1 = "str", and 2,000 records with Column2 = "int":
>SELECT somewhere WITH Column2 = "int"
>>SELECT somewhere with Column1 = "str"
Will read 2,000 records and should return almost instantly.
If the combination of Column1 and Column2 is something that you'll be SELECTing on frequently, then you might want to create a new dictionary item that combines the two, and build an index on that.
That being said, it shouldn't take a U2 system 5 minutes to run through a file of a million records. There's a very good chance that the file has become badly overflowed, and needs to be resized with a larger modulo to improve performance.
