Oracle - Identifying Dominant Records - string

I have a table that contains two STRING values (all single words), each with a corresponding COUNT of its occurrences, e.g.:
ID STR_1 COUNT_1 STR_2 COUNT_2
1 ORANGES 2 APPLES 10
2 APPLES 10 ORANGES 2
3 ORANGES 2 BANANAS 1
4 BANANAS 1 APPLES 10
5 BANANAS 1 ORANGES 2
N.B. STR_1 is considered the ‘master’ value. Also, the COUNT for each individual STRING value is consistent between STR_1 and STR_2 and between rows (e.g. ORANGES will always have a COUNT of 2).
What I’m trying to achieve is to remove records for which an ‘enantiomer’ (a mirror-image record) exists. For example, in the above data ID 2 is an ‘enantiomer’ of ID 1 (ID 1.STR_1 = ID 2.STR_2 and ID 1.STR_2 = ID 2.STR_1). ID 2 would be considered the dominant record and ID 1 discarded, because the COUNT for APPLES is greater than the COUNT for ORANGES. The desired output would therefore be:
ID STR_1 COUNT_1 STR_2 COUNT_2
2 APPLES 10 ORANGES 2
3 ORANGES 2 BANANAS 1
4 BANANAS 1 APPLES 10
If the COUNT values of the two different STRINGS match, the record whose STR_1 is the longer STRING is considered dominant and retained, e.g.:
ID STR_1 COUNT_1 STR_2 COUNT_2
1 ORANGES 10 APPLES 10
2 APPLES 10 ORANGES 10
3 ORANGES 10 BANANAS 1
4 BANANAS 1 APPLES 10
5 BANANAS 1 ORANGES 10
With the desired output being;
ID STR_1 COUNT_1 STR_2 COUNT_2
1 ORANGES 10 APPLES 10
3 ORANGES 10 BANANAS 1
4 BANANAS 1 APPLES 10
Test Data;
WITH
TEST_DATA AS
(
SELECT 1 ID, 'ORANGES' STR_1, 2 COUNT_1, 'APPLES' STR_2, 10 COUNT_2 FROM DUAL
UNION
SELECT 2 ID, 'APPLES' STR_1, 10 COUNT_1, 'ORANGES' STR_2, 2 COUNT_2 FROM DUAL
UNION
SELECT 3 ID, 'ORANGES' STR_1, 2 COUNT_1, 'BANANAS' STR_2, 1 COUNT_2 FROM DUAL
UNION
SELECT 4 ID, 'BANANAS' STR_1, 1 COUNT_1, 'APPLES' STR_2, 10 COUNT_2 FROM DUAL
UNION
SELECT 5 ID, 'BANANAS' STR_1, 1 COUNT_1, 'ORANGES' STR_2, 2 COUNT_2 FROM DUAL
)
Any help finding a solution to the above would be much appreciated.
Many thanks in advance.

Use an anti-join (the NOT EXISTS operator):
select *
from test_data t
where not exists (
    select 1
    from test_data t1
    where t.str_1 = t1.str_2
      and t.str_2 = t1.str_1
      and (
          t.count_1 < t1.count_1
          or (t.count_1 = t1.count_1
              and length(t.str_1) < length(t1.str_1))
      )
)
order by id
If both the counts and the lengths are equal for a given pair of rows, the query keeps both rows.
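To make the rule easier to check by hand, here is a minimal Python sketch of the same NOT EXISTS logic. It is not Oracle code; the function names dominated and dominant_rows are made up for illustration, and the rows are copied from the TEST_DATA in the question.

```python
# Rows mirror TEST_DATA from the question: (id, str_1, count_1, str_2, count_2).
rows = [
    (1, "ORANGES", 2, "APPLES", 10),
    (2, "APPLES", 10, "ORANGES", 2),
    (3, "ORANGES", 2, "BANANAS", 1),
    (4, "BANANAS", 1, "APPLES", 10),
    (5, "BANANAS", 1, "ORANGES", 2),
]


def dominated(t, t1):
    """Return True when t1 is the mirror image of t and t1 wins the comparison."""
    _, s1, c1, s2, _ = t
    _, m1, mc1, m2, _ = t1
    is_mirror = s1 == m2 and s2 == m1
    # Higher count wins; on equal counts the longer STR_1 wins.
    wins = c1 < mc1 or (c1 == mc1 and len(s1) < len(m1))
    return is_mirror and wins


def dominant_rows(rows):
    # Keep a row only if no other row dominates it (the NOT EXISTS test).
    return sorted(t for t in rows if not any(dominated(t, t1) for t1 in rows))


kept_ids = [t[0] for t in dominant_rows(rows)]
print(kept_ids)  # [2, 3, 4]
```

As in the SQL, a pair of mirror rows with equal counts and equal lengths would both be kept, because neither row dominates the other.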

Related

On SQL with one-to-many merging and many as a narrowing condition

I am using SQLAlchemy.
Parent table
id name
1 sea bass
2 Tanaka
3 Mike
4 Louis
5 Jack
Child table
id user_id pname number
1 1 Apples 2
2 1 Banana 1
3 1 Grapes 3
4 2 Apples 2
5 2 Banana 2
6 2 Grapes 1
7 3 Strawberry 5
8 3 Banana 3
9 3 Grapes 1
I want to select the parents that have apples and sort them by their number of bananas, but when I search for "parents with apples", the result is narrowed to the apple rows and the bananas disappear. I have searched for a way to achieve this, but have not been able to find one.
Thank you in advance for your help.

How to find the total length of a column value that has multiple values in different rows for another column

Is there a way to find the IDs that have both Apple and Strawberry and count them? And likewise the IDs that have only Apple, and the IDs that have only Strawberry?
df:
ID Fruit
0 ABC Apple <-ABC has Apple and Strawberry
1 ABC Strawberry <-ABC has Apple and Strawberry
2 EFG Apple <-EFG has Apple only
3 XYZ Apple <-XYZ has Apple and Strawberry
4 XYZ Strawberry <-XYZ has Apple and Strawberry
5 CDF Strawberry <-CDF has Strawberry
6 AAA Apple <-AAA has Apple only
Desired output:
Length of IDs that has Apple and Strawberry: 2
Length of IDs that has Apple only: 2
Length of IDs that has Strawberry: 1
Thanks!
If the values in column Fruit are always only Apple or Strawberry, you can compare sets per group and then count the IDs by summing the True values:
v = ['Apple','Strawberry']
out = df.groupby('ID')['Fruit'].apply(lambda x: set(x) == set(v)).sum()
print (out)
2
EDIT: If there are many values:
s = df.groupby('ID')['Fruit'].agg(frozenset).value_counts()
print (s)
{Apple} 2
{Strawberry, Apple} 2
{Strawberry} 1
Name: Fruit, dtype: int64
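For reference, a self-contained sketch of the frozenset approach on the question's data. The readable labels ("Apple and Strawberry" and so on) are my own addition so that each count can be looked up by a plain string; they are not part of the original answer.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "ID": ["ABC", "ABC", "EFG", "XYZ", "XYZ", "CDF", "AAA"],
    "Fruit": ["Apple", "Strawberry", "Apple", "Apple",
              "Strawberry", "Strawberry", "Apple"],
})

# One frozenset of fruits per ID, as in the answer above.
fruit_sets = df.groupby("ID")["Fruit"].agg(frozenset)

# Turn each set into a sorted, readable label, then count the labels.
labels = fruit_sets.map(lambda s: " and ".join(sorted(s)))
label_counts = labels.value_counts()

print("IDs with Apple and Strawberry:", label_counts["Apple and Strawberry"])
print("IDs with Apple only:", label_counts["Apple"])
print("IDs with Strawberry only:", label_counts["Strawberry"])
```

Indexing label_counts with a label that never occurs raises a KeyError, so label_counts.get(label, 0) may be safer on other data.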
You can use pivot_table and DataFrame.value_counts (pandas 1.1.0 or later):
df.pivot_table(index='ID', columns='Fruit', aggfunc='size', fill_value=0)\
.value_counts()
Output:
Apple  Strawberry
1      1             2
       0             2
0      1             1
Alternatively you can use:
df.groupby(['ID', 'Fruit']).size().unstack('Fruit', fill_value=0)\
.value_counts()

Filter rows based on the count of unique values

I need to count how often each value occurs in column A and keep only the rows whose value in A occurs, say, 2 or more times.
A C
Apple 4
Orange 5
Apple 3
Mango 5
Orange 1
I have calculated the counts, but I am not able to figure out how to filter on them with df.value_counts().
I want to keep the values in column A that occur at least 2 times. Expected DataFrame:
A C
Apple 4
Orange 5
Apple 3
Orange 1
value_counts should be called on a Series (single column) rather than a DataFrame:
counts = df['A'].value_counts()
Giving:
A
Apple 2
Mango 1
Orange 2
dtype: int64
You can then filter this to only keep those >= 2 and use isin to filter your DataFrame:
filtered = counts[counts >= 2]
df[df['A'].isin(filtered.index)]
Giving:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
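Alternatively, since the threshold here is 2, use duplicated with the parameter keep=False, which flags every row whose value in A appears more than once: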
df[df.duplicated(['A'], keep=False)]
Output:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
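The duplicated approach only covers a threshold of 2. For an arbitrary threshold, one possible sketch maps each row to the count of its value in A and filters on that. The name min_count is an assumption introduced here for illustration.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "A": ["Apple", "Orange", "Apple", "Mango", "Orange"],
    "C": [4, 5, 3, 5, 1],
})

min_count = 2  # hypothetical threshold; adjust as needed

# For every row, look up how many times its value in A occurs overall.
occurrences = df["A"].map(df["A"].value_counts())

# Keep only the rows whose value occurs at least min_count times.
result = df[occurrences >= min_count]
print(result)
```

With min_count = 2 this keeps the Apple and Orange rows and drops Mango, the same rows as the two answers above.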

sort pandas value_counts() primarily by descending counts and secondarily by ascending values

When applying value_counts() to a Series in pandas, the counts are sorted in descending order by default; however, the values are not sorted within each count.
How can I have the values within each identical count sorted in ascending order?
apples 5
peaches 5
bananas 3
carrots 3
apricots 1
The output of value_counts is a series itself (just like the input), so you have available all of the standard sorting options as with any series. For example:
import pandas as pd
df = pd.DataFrame({'fruit': ['apples']*5 + ['peaches']*5 + ['bananas']*3 +
                            ['carrots']*3 + ['apricots']})
# Name both columns explicitly, because the default names differ between pandas versions.
counts = df.fruit.value_counts().rename_axis('fruit').reset_index(name='count')
counts.sort_values(['count', 'fruit'], ascending=[False, True])
      fruit  count
0    apples      5
1   peaches      5
2   bananas      3
3   carrots      3
4  apricots      1
I'm actually getting the same results by default, so here's a test with ascending=[False, False] to demonstrate that this works as suggested.
counts.sort_values(['count', 'fruit'], ascending=[False, False])
      fruit  count
1   peaches      5
0    apples      5
3   carrots      3
2   bananas      3
4  apricots      1
I'm actually a bit confused about exactly what the desired output is in terms of ascending vs. descending, but regardless, there are 4 possible combinations and you can get whichever you like by altering the ascending keyword argument.

Excel how to check if a row has any other values and return the index

I have this big table where the first column is the person. The second column is something they all have. And then there are different columns for possible other things they could have.
For example :
Person apples pears bananas oranges
Luc 7 0 0 0
Julia 10 0 0 2
Maria 8 0 0 0
Lena 15 0 3 0
Tina 2 1 0 1
I know for a fact that everybody eats apples, but I would like to know whether people eat other things as well, and which things those are.
The result should be:
Person apples pears bananas oranges result
Luc 7 0 0 0 0
Julia 10 0 0 2 oranges
Maria 8 0 0 0 0
Lena 15 0 3 0 bananas
Tina 2 1 0 1 pears, oranges
The last column doesn't have to be the name of the fruit; I would be happy with the column number. I tried HLOOKUP, but it doesn't work. Or maybe I'm not using the right lookup_value? I use > 0 as the lookup_value.
Can somebody please help me?
One solution: Create "dummy variables" (formatted with text instead of {0,1} values) that are only populated for fruits actually consumed (N>0), and then concatenate in the results column:
ID     Apples  Pears  Bananas  Oranges  d_apples  d_pears  d_bananas  d_oranges  result
Luc    7       0      0        0        Apples
Julia  10      0      0        2        Apples                        Oranges    Oranges
Maria  8       0      0        0        Apples
Lena   15      0      3        0        Apples              Bananas              Bananas
Tina   2       1      0        1        Apples    Pears               Oranges    Pears Oranges
In the d_apples column, I've entered =IF(B2>0, B$1, "") and filled this across the d_ columns.
In the results column, I've entered =TRIM(CONCATENATE(G2, " ", H2, " ", I2)).
The TRIM() function removes the extra blank spaces between words for fruits not eaten. Note of course that this doesn't add commas between words, and adding commas to the CONCATENATE() function will yield something like , Bananas,.
If you want the column numbers, you can change B$1 in the IF() function to COLUMN(B$1).
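The same logic can be sketched outside Excel. The Python below is only an illustration of the idea, not a spreadsheet formula: it collects the names of the columns with a value above 0, skips apples, and joins only the non-empty names, which avoids the stray-comma problem mentioned above. The function name other_fruits and its skip parameter are made up for this example.

```python
# Table from the question: person -> counts per fruit, in column order.
header = ["apples", "pears", "bananas", "oranges"]
table = {
    "Luc": [7, 0, 0, 0],
    "Julia": [10, 0, 0, 2],
    "Maria": [8, 0, 0, 0],
    "Lena": [15, 0, 3, 0],
    "Tina": [2, 1, 0, 1],
}


def other_fruits(counts, header, skip=("apples",)):
    """Join the names of all fruits with a count above 0, ignoring skipped columns."""
    eaten = [name for name, n in zip(header, counts) if n > 0 and name not in skip]
    return ", ".join(eaten)


results = {person: other_fruits(counts, header) for person, counts in table.items()}
for person, result in results.items():
    print(person, "->", result)
```

People who eat only apples get an empty string here rather than the 0 shown in the question's expected result.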

Resources