How to use MAX and HAVING in Spark SQL? - subquery

I am learning Spark (1.6.1) and Cassandra, and I have been tasked with doing some analysis since I know a little SQL. I am retrieving data from Cassandra using Spark SQL. I am now trying to get data from a table registered as connectablesTempTable, which holds the following data:
userhandle  threadid  connid
1           1         Kooky
2           1         millicent
1           1         norseman
1           2         caribou
2           1         zoology
3           1         polypropylene
3           1         cyanided
3           2         eluted
1           2         insurrectionising
1           1         orontes
2           3         perpetualness
2           2         aphaeretic
2           1         bloodthirstiness
3           2         cingulectomy
3           3         unlimed
3           2         unprison
1           1         hereat
2           1         humidify
From the above table I need to answer the question "For every userhandle, find the threadid that has the maximum count of connid". The following is the result I expect:
userhandle  threadid  maxcount
1           1         4
2           1         4
3           2         3
In plain SQL, I can write the following query to get the values:
SELECT userhandle, threadid, COUNT(connid) AS maxcount
FROM connectablesTempTable
GROUP BY userhandle, threadid
HAVING COUNT(connid) = (
    SELECT MAX(mycount) FROM (
        SELECT userhandle, threadid, COUNT(connid) AS mycount
        FROM connectablesTempTable
        GROUP BY userhandle, threadid
    ) t
);
When I try to do the same in Spark SQL, I get an error along the lines of:
java.lang.RuntimeException: [1.47] failure: `)' expected
While browsing for a solution, I noticed that Spark SQL (as of 1.6) does not support subqueries in the WHERE or HAVING clause. Is there a way for me to get the answer to the above query?
Thanks,
Jollyguy
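
One workaround that stays within what Spark SQL 1.6 accepts is to drop the HAVING subquery entirely: register the per-(userhandle, threadid) counts as their own temporary table, then join that table against its per-user maxima (Spark 1.6 does allow subqueries in the FROM clause, just not in WHERE or HAVING). A minimal PySpark sketch, assuming an existing sqlContext and the connectablesTempTable registered above; the names threadCounts, cnt, and maxcnt are illustrative:
counts = sqlContext.sql("""
    SELECT userhandle, threadid, COUNT(connid) AS cnt
    FROM connectablesTempTable
    GROUP BY userhandle, threadid
""")
counts.registerTempTable("threadCounts")

# join each per-thread count against the per-user maximum; note this
# computes the max per userhandle, matching the expected output above
result = sqlContext.sql("""
    SELECT tc.userhandle, tc.threadid, tc.cnt AS maxcount
    FROM threadCounts tc
    JOIN (SELECT userhandle, MAX(cnt) AS maxcnt
          FROM threadCounts
          GROUP BY userhandle) m
      ON tc.userhandle = m.userhandle AND tc.cnt = m.maxcnt
""")
result.show()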

Related

Doubts pandas filtering data row by row

How can I solve this issue in pandas? I have a dataframe of the following shape:
datetime64ns         type(int)  datetime64ns(analysis)
2019-02-02T10:02:05  4
2019-02-02T10:02:01  3
2019-02-02T10:02:02  4          2019-02-02T10:02:02
2019-02-02T10:02:04  3          2019-02-02T10:02:04
The goal is the following:
# pseudocode
for each row:
    if datetime(analysis) exists and type == 4:
        set a new column type4 = 1 for this row
    elif datetime(analysis) exists and type == 2:
        set a new column type2 = 1 for this row
The idea is to use these flag columns afterwards for a groupby count. I'm sure it is possible because I managed to do it in the past, but I lost my .py file. Thanks for the attention.
Need this?
import pandas as pd

# one-hot encode type(int) only where an analysis datetime exists
df = pd.concat([df, pd.get_dummies(df['type(int)'].mask(
    df['datetime64ns(analysis)'].isna()).astype('Int64')).add_prefix('type')],
    axis=1)
OUTPUT:
          datetime64ns  type(int) datetime64ns(analysis)  type3  type4
0  2019-02-02T10:02:05          4                    NaN      0      0
1  2019-02-02T10:02:01          3                    NaN      0      0
2  2019-02-02T10:02:02          4    2019-02-02T10:02:02      0      1
3  2019-02-02T10:02:04          3    2019-02-02T10:02:04      1      0
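As a possible follow-up for the groupby count mentioned in the question (hypothetical, building on the frame produced above), the per-type counts reduce to a plain sum over the new flag columns:
# count how many rows fall into each type; sums the 0/1 dummy columns
print(df[['type3', 'type4']].sum())
# type3    1
# type4    1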

How to sort pandas DataFrame on multiple columns using separate sorting functions for each column

I am trying to find a generic way to sort a DataFrame on multiple columns, where each column is sorted by a different arbitrary sort function.
For example, for input I might have
df = pd.DataFrame([[2, "Basic", 6], [1, "Intermediate", 9], [2, "Intermediate", 6],
                   [0, "Advanced", 6], [0, "Basic", 2], [1, "Advanced", 6],
                   [0, "Basic", 3]], columns=['Hour', 'Level', 'Value'])
   Hour         Level  Value
0     2         Basic      6
1     1  Intermediate      9
2     2  Intermediate      6
3     0      Advanced      6
4     0         Basic      2
5     1      Advanced      6
6     0         Basic      3
and I want my output to be
   Hour         Level  Value
0     0      Advanced      6
1     0         Basic      3
2     0         Basic      2
3     1      Advanced      6
4     1  Intermediate      9
5     2  Intermediate      6
6     2         Basic      6
I might have a function map such as
lambdaMap = {
    "Hour": lambda x: x,
    "Level": lambda x: [['Advanced', 'Intermediate', 'Basic'].index(l) for l in x],
    "Value": lambda x: -x,
}
I can apply any one of the sorting functions individually:
sortValue="Hour"
df.sort_values(by=sortValue, key=lambdaMap[sortValue])
I could create a loop to apply each sort successively:
for (column, func) in lambdaMap.items():
    df = df.sort_values(by=column, key=func)
But none of these produces the output I'm looking for. Is this even possible? There are a lot of examples of how to achieve similar things in specific instances, but I'm curious whether there is a way to achieve this generically, for use in an API and/or general support libraries.
You can convert the column to an ordered categorical and sort:
df['Level'] = pd.Categorical(df['Level'], ['Advanced', 'Intermediate', 'Basic'],
                             ordered=True)
out = df.sort_values(['Hour', 'Level', 'Value'], ascending=[True, True, False])
print(out)
print(out)
   Hour         Level  Value
3     0      Advanced      6
6     0         Basic      3
4     0         Basic      2
5     1      Advanced      6
1     1  Intermediate      9
2     2  Intermediate      6
0     2         Basic      6
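For the fully generic version the question asks about, one possible sketch (assuming pandas >= 1.1.0, where sort_values accepts a key callable and applies it to each sort column separately) is to dispatch on the column name and reuse the lambdaMap from the question:
# the key receives one sort column at a time as a Series, so dispatch
# on its name and wrap the mapped values back in an aligned Series
out = df.sort_values(
    by=['Hour', 'Level', 'Value'],
    key=lambda col: pd.Series(lambdaMap[col.name](col), index=col.index),
)
print(out)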

Pandas DataFrame: how do we keep columns based on the index name?

I seem to be running into some Python or enumerate bugs that I am not quite sure how to fix (see here for more details). Long story short, I want my data sets to have only the columns named 0, 4, 6, 8, 10, 12, 14.
0 4 6 8 10 12
1 2 5 4 2 1
5 3 0 1 5 10
....
But my current data looks like the following
0 4 2 6 8 10 12
1 2 5 4 2 1
5 3 0 1 5 10
....
Therefore, I would like to add code that keeps only the columns labelled 0, 4, 6, 8, 10, 12.
Is there a pandas function that can help with this?
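A minimal sketch of one option, assuming the column labels are the integers shown (swap in strings if that is how they are actually stored): DataFrame.filter with items keeps only the listed labels and silently ignores any that are missing.
keep = [0, 4, 6, 8, 10, 12]
df = df.filter(items=keep)  # or: df.loc[:, df.columns.isin(keep)]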

Counting a group of columns on google spreadsheet

I have a couple of columns as shown below:
A B C D E
1 12 4 1
2 3 2 2
3 7
4 3 0 6
How would I be able to return a count for each column above, so that, for example, I receive this result:
A B C D E
1 12 4 1
2 3 2 2
3 7
4 3 0 6
5 count:3 4 2 1
for each of the columns. I'm looking for a formula that can do this in one cell (B5), returning a count for each column, while avoiding the fill handle since the data set is quite large.
It's pretty easy using Google Sheets functions:
=ArrayFormula(MMULT(TRANSPOSE(ROW(A1:A4)^0),--(LEN(A1:E4)>0)))
TRANSPOSE(ROW(A1:A4)^0) builds a 1x4 row of ones, and MMULT multiplies it by the 4x5 matrix of 0/1 "non-empty" flags from LEN(A1:E4)>0, which sums each column in a single formula. Or, if you want to join them all into one cell:
=JOIN(", ",ArrayFormula(MMULT(TRANSPOSE(ROW(A1:A4)^0),--(LEN(A1:E4)>0))))

Grouping with SubSonic 3

I have the following table "GroupPriority":
Id  Group  Priority
1   1      0
2   2      0
3   3      0
4   2      1
5   1      1
6   2      2
7   3      1
I would like to group these on "Group", order them by "Priority", and then get the row with the highest Priority in each "Group", using LINQ and SubSonic 3.
In this case the result would be:
Id  Group  Priority
5   1      1
6   2      2
7   3      1
The SQL would look like this:
SELECT *
FROM GroupPriority gp
WHERE gp.Priority = (SELECT MAX(Priority)
                     FROM GroupPriority
                     WHERE GroupPriority.Group = gp.Group);
Thanks
Got the solution:
var group_query = new Query<GroupPriority>(provider);
var items = from gp in group_query
            where gp.Priority ==
                  (from gp_sub in group_query
                   where gp_sub.Group == gp.Group
                   select gp_sub.Priority).Max()
            select gp;
