I have a table with the following structure:
col1 col2 col3 col4 category
300 200 100 20 1
200 100 30 300 2
400 100 100 70 1
100 30 200 100 1
Now I am trying to calculate, for col1, what % of total rows have value <= 100; for col2, what % of total rows have value <= 50; and so on. From the category column I only want to select category 1.
So the resulting table should look like:
col1(<=100) col2(<=50)
x% x%
I tried something like this, but I don't know how to write a subquery for it:
SELECT COUNT(*) AS Total, COUNT(value1)* 100 /Total) AS col1(<=100) FROM table1 WHERE Category=1 GROUP BY value1 HAVING value1 <=100
Looks like I need multiple SELECT queries. Please help.
You can try using a CASE expression, as below:
SELECT SUM(CASE WHEN col1 <= 100 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS pct_col1,
       SUM(CASE WHEN col2 <= 50 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS pct_col2
FROM table1
WHERE category = 1;
Multiplying by 100.0 keeps the arithmetic decimal, so integer division does not truncate the percentages to 0, and the WHERE clause restricts both counts to category 1.
Thanks, Arani
There are paired columns that I am comparing (col1 and col2, col3 and col4), each holding either a blank, '0', or '1'. I basically want to know how many intersect.
id col1 col2 col3 col4
id1 0 1
id2 1 1 0
id3 0 1 1
id4
id5 0
For this table I want a count of how many ids have a 0 or 1 (between col1 and col2). If I use COUNTA(B2:C4) I get 4, but I need to get 3, as only 3 ids are affected for each pair. Is there a formula that would actually give 3 for col1 and col2 and 3 for col3 and col4?
SUMPRODUCT(--(B$2:B$7+C$2:C$7=0))
fails here and provides 3 instead of 5
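For reference, the same "ids affected per pair" count is easy to sketch in pandas (a hypothetical reconstruction of the sheet; the placement of the blanks is my assumption, chosen to match the expected count of 3):
import numpy as np
import pandas as pd
# NaN marks a blank cell in the col1/col2 pair.
df = pd.DataFrame({'col1': [0, 1, 0, np.nan, np.nan],
                   'col2': [1, 1, np.nan, np.nan, np.nan]},
                  index=['id1', 'id2', 'id3', 'id4', 'id5'])
# An id is "affected" if at least one cell in the pair is filled.
print(df.notna().any(axis=1).sum())  # 3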
I have a dataframe (sample) as under:
col0 col1 col2 col3
0 101 3 5
1 102 6 2 1
2 103 2
3 104 4 6 4
4 105 8 3
5 106 1
6 107
Now I need two things as new columns in the same dataframe (col4 and col5):
To bring the latest value as per the priority col3 > col2 > col1 for each row:
If col3 has a value, col3; elif col2 has a value, col2; elif col1 has a value, col1; else "Invalid".
To know whether that row has 1/2/3 or no values in these columns:
If col3 has a value, 3; elif col2 has a value, 2; elif col1 has a value, 1; else 0.
I have done list comprehensions in the format [x1 if condition1 else x2 if condition2 else x3 for val in df['col']].
However, I do not understand how to loop through three columns in a single list comprehension.
Or is there some other way than a list comprehension to do this?
I tried this:
df['col4'] = [df['col3'] if df['col3'].notna() else df['col2'] if df['col2'].notna() else df['col1'] if df['col1'].notna() else "Invalid" for x in df['col0']]
df['col5'] = [3 if df['col3'].notna() else 2 if df['col2'].notna() else 1 if df['col1'].notna() else 0]
But they do not work: the conditions are whole Series, so evaluating them in the if raises "The truth value of a Series is ambiguous".
One solution that I tried was as under, but it requires four lines of code for each column:
df.loc[df['col1'].notna(),['col5']] = 1
df.loc[df['col2'].notna(),['col5']] = 2
df.loc[df['col3'].notna(),['col5']] = 3
df['col5'] = df['col5'].fillna(0)
Please suggest if any other means is possible.
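A minimal sketch of an alternative (not from the original post, assuming the blanks are NaN): forward-filling across the row makes the rightmost available value win, which matches the col3 > col2 > col1 priority, and numpy.select picks the first true condition for the flag:
import numpy as np
cols = ['col1', 'col2', 'col3']
# Rightmost non-missing value per row = col3 if present, else col2, else col1.
df['col4'] = df[cols].ffill(axis=1)['col3'].fillna('Invalid')
# Conditions are checked in priority order; np.select takes the first that is True.
conds = [df['col3'].notna(), df['col2'].notna(), df['col1'].notna()]
df['col5'] = np.select(conds, [3, 2, 1], default=0)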
I have a dataframe from which I generate another dataframe, using the following code:
df.groupby(['Cat','Ans']).agg({'col1':'count','col2':'sum'})
This gives me following result:
Cat Ans col1 col2
A Y 100 10000.00
N 40 15000.00
B Y 80 50000.00
N 40 10000.00
Now, I need the percentage of the group total for each group (level=0, i.e. "Cat") instead of the count or sum.
For getting a count percentage instead of the count value, I could do this:
df['Cat'].value_counts(normalize=True)
But here I have the sub-group "Ans" under the "Cat" group, and I need the percentage at each Cat group level, not over the whole total.
So, expectation is:
Cat Ans col1 .. col3
A Y 100 .. 71.43 #(100/(100+40))*100
N 40 .. 28.57
B Y 80 .. 66.67
N 40 .. 33.33
Similarly, col4 will be percentage of group-total for col2.
Is there a function or method available for this?
How do we do this in an efficient way for large data?
You can use the level argument of DataFrame.sum (to perform a groupby) and have pandas take care of the index alignment for the division.
df['col3'] = df['col1']/df['col1'].sum(level='Cat')*100
col1 col2 col3
Cat Ans
A Y 100 10000.0 71.428571
N 40 15000.0 28.571429
B Y 80 50000.0 66.666667
N 40 10000.0 33.333333
For multiple columns you can loop the above, or have pandas align those too. I add a suffix to distinguish the new columns from the original columns when joining back with concat.
df = pd.concat([df, (df/df.sum(level='Cat')*100).add_suffix('_pct')], axis=1)
col1 col2 col1_pct col2_pct
Cat Ans
A Y 100 10000.0 71.428571 40.000000
N 40 15000.0 28.571429 60.000000
B Y 80 50000.0 66.666667 83.333333
N 40 10000.0 33.333333 16.666667
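Note that the level argument of sum was deprecated in pandas 1.3 and removed in 2.0. On current versions, the same index alignment can be written with a groupby on the index level (my adaptation, not part of the original answer):
# Per-Cat totals are broadcast back to the (Cat, Ans) rows, so plain division aligns.
df['col3'] = df['col1'] / df.groupby(level='Cat')['col1'].transform('sum') * 100
# Or all columns at once:
df = pd.concat([df, (df / df.groupby(level='Cat').transform('sum') * 100).add_suffix('_pct')], axis=1)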
Hello, I need some help.
I have a dataframe such as :
table:
Col1 Col2 Col3 Sign
Loc1 1 60 -
Loc2 10 90 +
Loc3 40 100 +
Loc4 20 40 -
and from this table I want to create a Newcol from elements of the other columns, such as:
Col1 Col2 Col3 Sign Newcol
Loc1 1 60 - Loc1:1-60(-)
Loc2 10 90 + Loc2:11-90(+)
Loc3 40 100 + Loc3:41-100(+)
Loc4 20 40 - Loc4:20-40(-)
I tried:
table["Newcol"]=table['Col1']+":"+str(table['Col2'])+"-"+str(table['Col3'])+"("+table['Sign']+")"
But how can I take into account the fact that when I have a + sign, I have to add 1 to Col2 for the Newcol name?
Use Series.astype to convert to strings; to add the 1, compare Sign with '+', convert the boolean to integer, and add it with Series.add:
table["Newcol"] = (table['Col1']+":"+
(table['Col2'].add(table['Sign'].eq('+').astype(int))).astype(str)+"-"+
(table['Col3']).astype(str)+"("+
table['Sign']+")")
print (table)
Col1 Col2 Col3 Sign Newcol
0 Loc1 1 60 - Loc1:1-60(-)
1 Loc2 10 90 + Loc2:11-90(+)
2 Loc3 40 100 + Loc3:41-100(+)
3 Loc4 20 40 - Loc4:20-40(-)
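The key step is table['Sign'].eq('+').astype(int), which yields 1 for '+' rows and 0 elsewhere, so the addition only shifts the start coordinate of the '+' rows. A quick standalone check (my own toy data):
import pandas as pd
s = pd.Series(['-', '+', '+', '-'])
print(s.eq('+').astype(int).tolist())  # [0, 1, 1, 0]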
I have a DataFrame as below:
Col1 Col2 Col3 Col4
1 111 a Test
2 111 b Test
3 111 c Test
4 222 d Prod
5 333 e Prod
6 333 f Prod
7 444 g Test
8 555 h Prod
9 555 i Prod
Expected output :
Column 1 Column 2 Relationship Count
Col2 Col3 One-to-One 2
Col2 Col3 One-to-Many 3
Explanation:
I need to identify the relationship between Col2 & Col3, and also the value counts.
E.g. 111 (Col2) is repeated 3 times and has 3 different respective values a, b, c in Col3.
This means Col2 and Col3 have a one-to-many relationship - count_1: 1.
222 (Col2) is not repeated and has only one respective value d in Col3.
This means Col2 and Col3 have a one-to-one relationship - count_2: 1.
333 (Col2) is repeated twice and has 2 different respective values e, f in Col3.
This means Col2 and Col3 have a one-to-many relationship - count_1: 1+1 (increment this count for every one-to-many relationship).
Similarly, for the other column values, increment the respective counter and display the final results as the expected dataframe.
If you only need to check the relationship between col2 and col3, you can do:
(
df.groupby(by='Col2').Col3
.apply(lambda x: 'One-to-One' if len(x)==1 else 'One-to-Many')
.to_frame('Relationship')
.groupby('Relationship').Relationship
.count().to_frame('Count').reset_index()
.assign(**{'Column 1':'Col2', 'Column 2':'Col3'})
.reindex(columns=['Column 1', 'Column 2', 'Relationship', 'Count'])
)
Output:
Column 1 Column 2 Relationship Count
0 Col2 Col3 One-to-Many 3
1 Col2 Col3 One-to-One 2
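Note that len(x) == 1 keys the decision off group size, so a Col2 value repeated with duplicate Col3 values would be labelled one-to-many even though only one distinct Col3 value exists. If that case matters, a small variation (mine, not part of the original answer) compares distinct values instead:
rel = df.groupby('Col2')['Col3'].apply(lambda x: 'One-to-One' if x.nunique() == 1 else 'One-to-Many')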