How to write a subquery in Hive

I am stuck on a query.
I have data like
Uid, resp-date,camp-type
1 201403 A
1 201406 A
1 201406 B
1 201406 B
1 201407 A
1 201407 B
2 201402 A
2 201406 A
2 201406 B
I want to create metrics such as the number of products offered in the last month.
The counting logic is:
If the product is A, count the number of products offered in the last n months.
If the product is B, count the number of products offered in the last n months, plus the number of type A products offered in the current month.
The expected output (last 1 month and last 2 months):
Uid, resp-date, prod-type, #offered-last1month, #offered-last2month
1 201403 A 0 0
1 201406 A 0 0
1 201406 B 1 1
1 201406 B 1 1
1 201407 A 3 3
1 201407 B 4 4
2 201402 A 0 0
2 201406 A 0 0
2 201406 B 1 1
My query so far:
SELECT m.uid, m.resp_date, m.prod_type,
       CASE WHEN m.prod_type = 'A' THEN ca.num_mails_1month
            WHEN m.prod_type = 'B' THEN cb.num_mails_1month
       END AS mails_last1month
FROM m
LEFT OUTER JOIN
  ( SELECT uid, COUNT(*) AS num_mails_1month FROM
    (
      -- this subquery would need to refer to m.resp_date
    )
  ) ca
  ON ca.uid = m.uid
LEFT OUTER JOIN
  ( SELECT uid, COUNT(*) AS num_mails_1month FROM
    (
      -- same problem for the B branch
    )
  ) cb
  ON cb.uid = m.uid
What is a workaround for writing a subquery that refers to m.resp_date?
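For what it is worth, here is a minimal sketch of one common workaround (not a tested final query): since the derived tables cannot reference m.resp_date, replace the correlated subqueries with a self-join on uid and push the month test into a conditional SUM. The table name mails and the assumption that resp_date is a yyyyMM integer are mine, not from the question.

SELECT m.uid,
       m.resp_date,
       m.prod_type,
       SUM(CASE
             -- anything offered in the previous month
             -- (naive yyyyMM subtraction; breaks across year boundaries)
             WHEN p.resp_date = m.resp_date - 1 THEN 1
             -- for product B, also count type A offers in the current month
             WHEN m.prod_type = 'B'
                  AND p.resp_date = m.resp_date
                  AND p.prod_type = 'A' THEN 1
             ELSE 0
           END) AS num_mails_1month
FROM mails m
JOIN mails p
  ON p.uid = m.uid   -- keep only the equi-condition here; the rest lives in the CASE
GROUP BY m.uid, m.resp_date, m.prod_type;

If duplicate (uid, resp_date, prod_type) rows have to stay separate, as in the sample output, you would need a surrogate row id to group on, and real month arithmetic should use Hive's date functions rather than plain subtraction.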

Related

Cumulative count using grouping, sorting, and condition

I want a cumulative count of the zeros in column c, grouped by column a and sorted by b; whenever any other number appears, the count resets to 1.
Here is a sample:
df = pd.DataFrame({'a':[1,1,1,1,2,2,2,2],
'b':[1,2,3,4,1,2,3,4],
'c':[10,0,0,5,1,0,1,0]}
)
I tried the code below, which works, but when zero appears more than once in a row the shift does not see the newly assigned value, so it has to be re-run as many times as the longest run of zeros:
df.loc[df.c == 0, 'n'] = df.n.shift(1) + 1
I also tried the loop below; it works on a small DataFrame, but on large data it takes a long time and never finishes:
for ind in df.index:
    if df.loc[ind, 'c'] == 0:
        df.loc[ind, 'new'] = df.loc[ind - 1, 'new'] + 1
    else:
        df.loc[ind, 'new'] = 1
The desired result
a b c n
0 1 1 10 1
1 1 2 0 2
2 1 3 0 3
3 1 4 5 1
4 2 1 1 1
5 2 2 0 2
6 2 3 1 1
7 2 4 0 2
Try using cumsum to create a group key and then groupby.cumcount to create the new column:
df.sort_values(['a', 'b'], inplace=True)
df['n'] = df['c'].groupby([df.a, df['c'].ne(0).cumsum()]).cumcount() + 1
df
a b c n
0 1 1 10 1
1 1 2 0 2
2 1 3 0 3
3 1 4 5 1
4 2 1 1 1
5 2 2 0 2
6 2 3 1 1
7 2 4 0 2
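A self-contained sketch of the same approach (restating the sample data): the key point is that ne(0).cumsum() gives every non-zero value its own group id, which the following zeros inherit, so cumcount restarts there.

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1, 2, 2, 2, 2],
                   'b': [1, 2, 3, 4, 1, 2, 3, 4],
                   'c': [10, 0, 0, 5, 1, 0, 1, 0]})

df.sort_values(['a', 'b'], inplace=True)

# each non-zero c starts a new group; the zeros after it keep that group id
key = df['c'].ne(0).cumsum()            # 1, 1, 1, 2, 3, 3, 4, 4

# positions inside each (a, key) group, counted from 1
df['n'] = df.groupby([df['a'], key]).cumcount() + 1
print(df)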

How to group a dataframe by multiple columns, sum and sort the totals in descending order?

Given the following dataframe:
user_id col1 col2
1 A 4
1 A 22
1 A 112
1 B -0.22222
1 B 9
1 C 0
2 A -1
2 A -5
2 K NA
I want to group by user_id and col1 and count, then sort the counts within each group in descending order.
Here is what I'm trying to do but I don't get the right output:
df[["user_id", "col1"]]. \
groupby(["user_id", "col1"]). \
agg(counts=("col1","count")). \
reset_index(). \
sort_values(["user_id", "col1", "counts"], ascending=False)
Please advise what I should change to make it work.
Expected output:
user_id col1 counts
1 A 3
B 2
C 1
2 A 2
K 1
Use GroupBy.size:
In [199]: df.groupby(['user_id', 'col1']).size()
Out[199]:
user_id col1
1 A 3
B 2
C 1
2 A 2
K 1
OR:
In [201]: df.groupby(['user_id', 'col1']).size().reset_index(name='counts')
Out[201]:
user_id col1 counts
0 1 A 3
1 1 B 2
2 1 C 1
3 2 A 2
4 2 K 1
EDIT:
In [206]: df.groupby(['user_id', 'col1']).agg({'col2': 'size'})
Out[206]:
col2
user_id col1
1 A 3
B 2
C 1
2 A 2
K 1
EDIT-2: For sorting, use:
In [213]: df.groupby(['user_id', 'col1'])['col2'].size().sort_values(ascending=False)
Out[213]:
user_id col1
1 A 3
2 A 2
1 B 2
2 K 1
1 C 1
Name: col2, dtype: int64
Using the main idea from Mayank's answer:
df.groupby(["user_id", "col1"]).size().reset_index(name="counts").sort_values(["user_id", "col1"], ascending=False)
This solved my issue.
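If the aim is exactly the expected output above, with counts descending inside each user_id, one possible adaptation of the size() approach (a sketch using the question's sample data) is:

import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 1, 1, 1, 1, 2, 2, 2],
                   'col1':    ['A', 'A', 'A', 'B', 'B', 'C', 'A', 'A', 'K'],
                   'col2':    [4, 22, 112, -0.22222, 9, 0, -1, -5, float('nan')]})

counts = (df.groupby(['user_id', 'col1'])
            .size()
            .reset_index(name='counts')
            # user_id ascending, but counts descending within each user
            .sort_values(['user_id', 'counts'], ascending=[True, False]))
print(counts)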

Find a subset of rows (N rows) in a Pandas data frame having the same values at a subset of columns

I have a df which contains customer data without a primary key. The same customer might show up multiple times.
I have a field (df2['campaign']) that is an int and reflects how many times the customer shows up in the df. There are also many customer attributes.
In my example, going from top to bottom, for each row (i.e. customer), I would like to find all n rows (i.e. all n customers) whose values of the education and default columns are the same. Remember n is the int contained in df2['campaign']
So as shown below, for row 0 and 1 I should search 1 row but find nothing because there are no matching values for education-default combinations.
For row 2 I should search 1 row (because campaign == 1) where education-default values match, and find 1 row in index 4.
df2.head()
job marital education default campaign housing loan contact
0 3 1 0 0 1 0 0 1
1 7 1 3 1 1 0 0 1
2 7 1 3 0 1 2 0 1
3 0 1 1 0 1 0 0 1
4 7 1 3 0 1 0 2 1
Use df2_sorted = df2.sort_values(['education', 'default'], ascending=[True, True]) (DataFrame.sort was removed in newer pandas).
Then if your data is not noisy, the rows should become neighbors.
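A rough sketch of that idea, using the df2.head() rows from the question. It only shows how rows with the same (education, default) end up next to each other; limiting the search to the number of rows given by campaign would still need extra logic:

import pandas as pd

df2 = pd.DataFrame({'job':       [3, 7, 7, 0, 7],
                    'marital':   [1, 1, 1, 1, 1],
                    'education': [0, 3, 3, 1, 3],
                    'default':   [0, 1, 0, 0, 0],
                    'campaign':  [1, 1, 1, 1, 1],
                    'housing':   [0, 0, 2, 0, 0],
                    'loan':      [0, 0, 0, 0, 2],
                    'contact':   [1, 1, 1, 1, 1]})

df2_sorted = df2.sort_values(['education', 'default'])

# after sorting, rows with the same (education, default) are adjacent,
# so each group lists the candidate matches for its members
for key, grp in df2_sorted.groupby(['education', 'default']):
    if len(grp) > 1:
        print(key, grp.index.tolist())   # (3, 0) [2, 4]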

How to apply a function to a dataframe column to create an iterated column

I have IDs with system event times. I grouped the event times by ID (individual systems) and made a new column whose value is 1 if eventtimes.diff() is greater than 1 day, else 0. Now that I have the flag, I am trying to write a function to apply with groupby('ID') so that the new column starts at 1 and keeps returning 1 for each row until the flag shows 1; then the new column goes up to 2 and keeps returning 2 until the flag shows 1 again.
I will apply this along with groupby('ID') since I need the new column to start over again at 1 for each ID.
I have tried the following:
def try(x):
    y = 1
    if row['flag'] == 0:
        y = y
    else:
        y += y + 1

df['NewCol'] = df.groupby('ID')['flag'].apply(try)
I have tried differing variations of the above to no avail. Thanks in advance for any help you may provide.
Also, feel free to let me know if I messed up posting the question. Not sure if my title is great either.
Use boolean indexing for filtering + cumcount + reindex, which is a much faster solution than a loopy apply.
I think you need to count only the rows with 1 per group, and where there is no 1 the output should be 1:
df = pd.DataFrame({
'ID': ['a','a','a','a','b','b','b','b','b'],
'flag': [0,0,1,1,0,0,1,1,1]
})
df['new'] = (df[df['flag'] == 1].groupby('ID')['flag']
                                .cumcount()
                                .add(1)
                                .reindex(df.index, fill_value=1))
print (df)
ID flag new
0 a 0 1
1 a 0 1
2 a 1 1
3 a 1 2
4 b 0 1
5 b 0 1
6 b 1 1
7 b 1 2
8 b 1 3
Detail:
#filter by condition
print (df[df['flag'] == 1])
ID flag
2 a 1
3 a 1
6 b 1
7 b 1
8 b 1
#count per group
print (df[df['flag'] == 1].groupby('ID')['flag'].cumcount())
2 0
3 1
6 0
7 1
8 2
dtype: int64
#add 1 for count from 1
print (df[df['flag'] == 1].groupby('ID')['flag'].cumcount().add(1))
2 1
3 2
6 1
7 2
8 3
dtype: int64
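The last step, reindex, stretches that count back onto the full index (continuing the detail above):

#reindex to the original index, filling the filtered-out rows with 1
print (df[df['flag'] == 1].groupby('ID')['flag']
         .cumcount()
         .add(1)
         .reindex(df.index, fill_value=1))
0    1
1    1
2    1
3    2
4    1
5    1
6    1
7    2
8    3
dtype: int64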
If you need to count the 0 values instead, with -1 filled in where there is no 0:
df['new'] = (df[df['flag'] == 0].groupby('ID')['flag']
                                .cumcount()
                                .add(1)
                                .reindex(df.index, fill_value=-1))
print (df)
ID flag new
0 a 0 1
1 a 0 2
2 a 1 -1
3 a 1 -1
4 b 0 1
5 b 0 2
6 b 1 -1
7 b 1 -1
8 b 1 -1
Another 2-step solution:
df['new'] = df[df['flag'] == 1].groupby('ID')['flag'].cumcount().add(1)
df['new'] = df['new'].fillna(1).astype(int)
print (df)
ID flag new
0 a 0 1
1 a 0 1
2 a 1 1
3 a 1 2
4 b 0 1
5 b 0 1
6 b 1 1
7 b 1 2
8 b 1 3

Count which combination of items are bought most frequently

What is an efficient method of doing this?
I have a column with the name of the buyer and a column with item names. Each item a person bought is on a new row.
For example:
Person 1 Item 1
Person 1 Item 2
Person 1 Item 5
Person 1 Item 7
Person 2 Item 1
Person 2 Item 2
Person 2 Item 11
Person 2 Item 15
Person 2 Item 20
Person 2 Item 21
Person 2 Item 17
Person 3 Item 1
Person 3 Item 2
Person 3 Item 6
Person 3 Item 11
Person 3 Item 15
Person 4 Item 1
Person 4 Item 2
Person 4 Item 5
Person 4 Item 7
There are about 1000000 rows in total and each person has an average of 30 items.
I want to count how often two specific items are bought by a person.
I am picturing it something like this
Item1 Item2 Item3 Item4 Item5 Item6
Item1 xxxxx 0% 0% 5% 10% 90%
Item2
Item3
Item4
Item5
Item6
I have tried using a pivot table with the items on the row labels and the person on the column labels, then counting items. Then I can use a lookup formula and multiply the results from the pivot table, but this doesn't work with such a large file. Is there a more efficient method?
I am open to all kinds of solutions.
You can use a helper 'table' to do this. First create a table of purchases by person. The formula in this table is:
=SUMPRODUCT(--($A$1:$A$20=E$2),--($B$1:$B$20=$D3))
Which gives a 1/0 result if a person ever bought that item. Example:
Then create the grid of products like in your post and enter this formula:
=SUMPRODUCT($E3:$H3,INDEX($E$3:$H$12,MATCH(K$2,$D$3:$D$12,0),0))
Which multiplies the instances of purchase of Item X and Item Y. Example:
Maybe I misunderstand you, but you are not interested in the person who buys, only in which items are bought by the same person? I do not think you can do this in a single step using only formulas (in VBA, of course, it is easier).
To do it without VBA you could:
Sort by Person and Item.
Build an accumulating string of all (different) items bought by one person (untested: something like =IF(A1=A2;B1;"")&B2).
Keep only the last string of each person (something like =IF(A2=A3;"";B2)).
After this you have something like
P I Items_a All_Items
1 A A
1 B AB
1 E ABE
1 G ABEG ABEG
2 A A
2 B AB
2 K ABK
2 O ABKO
2 Q ABKOQ
2 T ABKOQT
2 U ABKOQTU ABKOQTU
3 A A
3 B AB
3 F ABF
3 K ABFK
3 O ABFKO ABFKO
4 A A
4 B AB
4 E ABE
4 G ABEG ABEG
In the next step you could copy all the final strings to a new table, build every pair combination (in ascending order, because the items were sorted) as column headers, and mark 1 where the pair occurs in a string.
To keep it simple the items are named A, B, ... (corresponding to Item 1, Item 2, ... in your example).
The formula is something like
=IF(ISERROR(FIND(MID(B$1;1;1);$A2));0;1)*IF(ISERROR(FIND(MID(B$1;2;1);$A2));0;1)
In your case the possible combinations would be:
AB AE AF AG AK AO AQ AT AU BE BF BG BK BO BQ BT BU EF EG EK EO EQ ET EU FG FK FO FQ FT FU GK GO GQ GT GU KO KQ KT KU OQ OT OU QT QU TU
But in this example 66% exist, so I only show the beginning of the table:
XXXXX AB AE AF AG AK AO AQ AT AU BE BF
ABEG 1 1 0 1 0 0 0 0 0 1 0
ABEG 1 1 0 1 0 0 0 0 0 1 0
ABFKO 1 0 1 0 1 1 0 0 0 0 1
ABKOQTU 1 0 0 0 1 1 1 1 1 0 0
SUM ALL 4 2 1 2 2 2 1 1 1 2 1
And now you can count, whatever you want.
A simple HLOOKUP (WVERWEIS) function would help to get this:
A B E F G K O Q T U
A 0 4 2 1 2 2 2 1 1 1
B 0 0 2 1 2 2 2 1 1 1
E 0 0 0 0 2 0 0 0 0 0
F 0 0 0 0 0 1 1 0 0 0
G 0 0 0 0 0 0 0 0 0 0
K 0 0 0 0 0 0 2 1 1 1
O 0 0 0 0 0 0 0 1 1 1
Q 0 0 0 0 0 0 0 0 1 1
T 0 0 0 0 0 0 0 0 0 1
U 0 0 0 0 0 0 0 0 0 0
But in my opinion this is only manageable for maybe 10 items (the helper columns number n*(n-1)/2, so 10 items --> 45 columns, since AA, BB, ... are not evaluated).
For anything larger you should write a program.
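Since the question says any kind of solution is welcome, a pandas sketch of the same pair-counting idea (the column names person and item are made up for the example) could look like this:

from itertools import combinations
from collections import Counter

import pandas as pd

df = pd.DataFrame({'person': ['Person 1'] * 4 + ['Person 2'] * 7,
                   'item':   ['Item 1', 'Item 2', 'Item 5', 'Item 7',
                              'Item 1', 'Item 2', 'Item 11', 'Item 15',
                              'Item 20', 'Item 21', 'Item 17']})

pair_counts = Counter()
for _, items in df.groupby('person')['item']:
    # count every unordered pair of distinct items bought by this person once
    pair_counts.update(combinations(sorted(set(items)), 2))

print(pair_counts.most_common(5))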
