I'm doing association rule mining using the efficient_apriori package in Python, and I found a very useful answer on how to convert the output to a DataFrame.
However, I'm struggling with the itemset output and hoping someone can help me parse it correctly. I'm guessing the LHS is an index value, but I'm struggling with the decimal values in the RHS. Does anyone know how the encoding is done? I have tried the same with SKU descriptions and get the same output.
Input dataframe looks like this:
| SKU | Count | Percent |
|----------------------------------------------------------------------|-------|-------------|
| "('000000009100000749',)" | 110 | 0.029633621 |
| "('000000009100000749', '000000009100000776')" | 1 | 0.000269397 |
| "('000000009100000749', '000000009100000776', '000000009100002260')" | 1 | 0.000269397 |
| "('000000009100000749', '000000009100000777', '000000009100002260')" | 1 | 0.000269397 |
| "('000000009100000749', '000000009100000777', '000000009100002530')" | 1 | 0.000269397 |
Output looks like this:
| | lhs | rhs | count_full | count_lhs | count_rhs | num_transactions | confidence | support |
|---|-----------------------------|-----------------------------|------------|-----------|-----------|------------------|------------|-------------|
| 0 | "(1,)" | "(0.00026939655172413793,)" | 168 | 168 | 168 | 297 | 1 | 0.565656566 |
| 1 | "(0.00026939655172413793,)" | "(1,)" | 168 | 168 | 168 | 297 | 1 | 0.565656566 |
| 2 | "(2,)" | "(0.0005387931034482759,)" | 36 | 36 | 36 | 297 | 1 | 0.121212121 |
| 3 | "(0.0005387931034482759,)" | "(2,)" | 36 | 36 | 36 | 297 | 1 | 0.121212121 |
| 4 | "(3,)" | "(0.0008081896551724138,)" | 21 | 21 | 21 | 297 | 1 | 0.070707071 |
Could someone help me understand what is being output in the lhs and rhs columns, and how to join that back to the SKU? Ideally the output would show the SKU instead of whatever is currently showing up.
I have looked at the documentation and it is quite sparse.
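For context, here is a minimal sketch of the general efficient_apriori pattern I'm following (placeholder thresholds, a cut-down transaction list, and a simplified rule-to-DataFrame conversion rather than my exact code):
from efficient_apriori import apriori
import pandas as pd

# each transaction is a tuple of SKU strings
transactions = [
    ('000000009100000749',),
    ('000000009100000749', '000000009100000776'),
    ('000000009100000749', '000000009100000776', '000000009100002260'),
]

# placeholder thresholds
itemsets, rules = apriori(transactions, min_support=0.0001, min_confidence=0.5)

# convert the Rule objects into a DataFrame like the output shown above
rules_df = pd.DataFrame(
    [(r.lhs, r.rhs, r.confidence, r.support) for r in rules],
    columns=['lhs', 'rhs', 'confidence', 'support'],
)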
I have code that unpivots columns into rows. There are 19 categories of data, 15 of which have been unpivoted. However, my problem is that some of the unpivoted data is not showing up in the new rows. I am asking for anyone's expertise, as this will be helpful for me in future endeavors.
I have created a table. Bear in mind it is extremely wide: I believe it has 131 columns and only 7 rows. Below is the original data (it is make-believe data, of course, but the code will be used on real data in the future). The second table is how I want it to look, and the third table is how it actually looks. Under that is my code. I will gladly upvote anyone who helps. Thank you in advance.
Original data:
| usr | Company | Dept.# | Dept1 | Dept2 | Dept3 | Dept4 | Hr1 | Tr1 | F1 | A1 | HOH1 | M1 | R1 | SO1 | BIG1 | T1 | P1 | X1 | Y1 | Z1 | Tin1 | Hr1 | Tr1 | F1 | A1 | HOH1 | M1 | R1 | SO1 | BIG1 | T1 | P1 | X1 | Y1 | Z1 | Tin1 | Hr1 | Tr1 | F1 | A1 | HOH1 | M1 | R1 | SO1 | BIG1 | T1 | P1 | X1 | Y1 | Z1 | Tin1 | Hr1 | Tr1 | F1 | A1 | HOH1 | M1 | R1 | SO1 | BIG1 | T1 | P1 | X1 | Y1 | Z1 | Tin1 | Hr2 | Tr2 | F2 | A2 | HOH2 | M2 | R2 | SO2 | BIG2 | T2 | P2 | X2 | Y2 | Z2 | Tin2 | Hr2 | Tr2 | F2 | A2 | HOH2 | M2 | R2 | SO2 | BIG2 | T2 | P2 | X2 | Y2 | Z2 | Tin2 | Hr2 | Tr2 | F2 | A2 | HOH2 | M2 | R2 | SO2 | BIG2 | T2 | P2 | X2 | Y2 | Z2 | Tin2 | Hr3 | Tr3 | F3 | A3 | HOH3 | M3 | R3 | SO3 | BIG3 | T3 | P3 | X2 | Y2 | Z2 | Tin2 | Hr3 | Tr3 | F3 | A3 | HOH3 | M3 | R3 | SO3 | BIG3 | T3 | P3 | X3 | Y3 | Z3 | Tin3 | Hr4 | Tr4 | F4 | A4 | HOH4 | M4 | R4 | SO4 | BIG4 | T4 | P4 | X4 | Y4 | Z4 | Tin4 |
|------|---------|--------|-------|-------|-------|-------|-----|-----|-----|-----|------|----|----|-----|------|----|-----|-----|-----|----|------|-----|-----|----|-----|------|----|-----|-----|------|----|-----|-----|-----|----|------|-----|-----|----|-----|------|----|----|-----|------|-----|----|----|----|-----|------|-----|-----|----|-----|------|----|----|-----|------|----|----|-----|-----|----|------|-----|-----|-----|-----|------|----|----|-----|------|----|----|-----|----|-----|------|-----|-----|----|-----|------|----|----|-----|------|----|----|----|----|-----|------|-----|-----|-----|-----|------|----|----|-----|------|----|----|----|-----|----|------|-----|-----|-----|-----|------|----|-----|-----|------|----|----|-----|-----|----|------|-----|-----|-----|-----|------|----|----|-----|------|----|----|-----|-----|-----|------|-----|-----|-----|-----|------|----|----|-----|------|----|-----|-----|-----|-----|------|
| xxxx | OS | 1 | Train | | | | 20 | 89 | 355 | 123 | 435 | 90 | 5 | 55 | 676 | 34 | 43 | 984 | 345 | 74 | 846 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
| xxxx | OPC | 2 | Poxy1 | Poxy2 | | | | | | | | | | | | | | | | | | 45 | 546 | 68 | 345 | 903 | 70 | 345 | 23 | 54 | 32 | 234 | 23 | 567 | 69 | 64 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 38 | 67 | 235 | 789 | 7 | 40 | 99 | 98 | 87 | 89 | 34 | 312 | 42 | 756 | 23 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
| xxxx | Oxy R | 4 | H1 | H2 | H3 | H4 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 22 | 36 | 13 | 678 | 64 | 40 | 34 | 239 | 76 | 87 | 34 | 999 | 965 | 34 | 93 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 89 | 54 | 761 | 765 | 9 | 20 | 22 | 65 | 78 | 98 | 78 | 75 | 354 | 23 | 23 | | | | | | | | | | | | | | | | 36 | 80 | 123 | 543 | 17 | 20 | 11 | 908 | 988 | 7 | 86 | 245 | 546 | 763 | 324 | 25 | 90 | 111 | 432 | 84 | 25 | 63 | 784 | 98 | 78 | 854 | 754 | 234 | 865 | 43 |
| xxxx | HPK | 3 | Test1 | Test2 | Test3 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 99 | 456 | 39 | 567 | 223 | 50 | 5 | 32 | 549 | 435 | 34 | 87 | 64 | 348 | 942 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 52 | 21 | 47 | 876 | 1 | 30 | 46 | 92 | 78 | 12 | 34 | 12 | 12 | 421 | 23 | | | | | | | | | | | | | | | | 90 | 76 | 773 | 654 | 49 | 10 | 223 | 982 | 566 | 23 | 54 | 786 | 356 | 73 | 654 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
| xxxx | Mano | 1 | Porp | | | | 42 | 657 | 645 | 234 | 344 | 80 | 45 | 364 | 97 | 23 | 634 | 34 | 23 | 87 | 84 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
| xxxx | Macro | 2 | Otto1 | Otto2 | | | | | | | | | | | | | | | | | | 75 | 574 | 46 | 456 | 453 | 60 | 44 | 235 | 867 | 5 | 433 | 234 | 346 | 46 | 35 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 23 | 433 | 186 | 987 | 2 | 30 | 34 | 58 | 87 | 43 | 34 | 23 | 62 | 73 | 32 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
How I want it to look:
| usr | Company | Dept# | Dept | Hrs | Tr | F | A | HOH | M | R | SO | BIG | T | P | X | Y | Z | Tin |
|------|---------|-------|-------|-----|-----|-----|-----|-----|----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| xxxx | OS | 1 | Train | 20 | 89 | 355 | 123 | 435 | 90 | 5 | 55 | 676 | 34 | 43 | 984 | 345 | 74 | 846 |
| xxxx | OPC | 2 | Poxy1 | 45 | 546 | 68 | 345 | 903 | 70 | 345 | 23 | 54 | 32 | 234 | 23 | 567 | 69 | 64 |
| xxxx | OPC | 2 | Poxy2 | 38 | 67 | 235 | 789 | 7 | 40 | 99 | 98 | 87 | 89 | 34 | 312 | 42 | 756 | 23 |
| xxxx | Oxy R | 4 | H1 | 22 | 36 | 13 | 678 | 64 | 40 | 34 | 239 | 76 | 87 | 34 | 999 | 965 | 34 | 93 |
| xxxx | Oxy R | 4 | H2 | 89 | 54 | 761 | 765 | 9 | 20 | 22 | 65 | 78 | 98 | 78 | 75 | 354 | 23 | 23 |
| xxxx | Oxy R | 4 | H3 | 36 | 80 | 123 | 543 | 17 | 20 | 11 | 908 | 988 | 7 | 86 | 245 | 546 | 763 | 324 |
| xxxx | Oxy R | 4 | H4 | 25 | 90 | 111 | 432 | 84 | 25 | 63 | 784 | 98 | 78 | 854 | 754 | 234 | 865 | 43 |
| xxxx | HPK | 3 | Test1 | 99 | 456 | 39 | 567 | 223 | 50 | 5 | 32 | 549 | 435 | 34 | 87 | 64 | 348 | 942 |
| xxxx | HPK | 3 | Test2 | 52 | 21 | 47 | 876 | 1 | 30 | 46 | 92 | 78 | 12 | 34 | 12 | 12 | 421 | 23 |
| xxxx | HPK | 3 | Test3 | 90 | 76 | 773 | 654 | 49 | 10 | 223 | 982 | 566 | 23 | 54 | 786 | 356 | 73 | 654 |
| xxxx | Mano | 1 | Porp | 42 | 657 | 645 | 234 | 344 | 80 | 45 | 364 | 97 | 23 | 634 | 34 | 23 | 87 | 84 |
| xxxx | Macro | 2 | Otto1 | 73 | 574 | 46 | 456 | 453 | 60 | 44 | 235 | 867 | 5 | 433 | 234 | 346 | 46 | 35 |
| xxxx | Macro | 2 | Otto2 | 23 | 433 | 186 | 987 | 2 | 30 | 34 | 58 | 87 | 43 | 34 | 23 | 62 | 73 | 32 |
This is how it actually looks, which is wrong. As you can see, data is missing for some reason:
| usr | Company | Dept# | Dept | Hrs | Tr | F | A | HOH | M | R | SO | BIG | T | P | X | Y | Z | Tin |
|------|---------|-------|-------|-----|-----|-----|-----|-----|----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| xxxx | OS | 1 | Train | 20 | 89 | 355 | 123 | 435 | 90 | 5 | 55 | 676 | 34 | 43 | 984 | 345 | 74 | 846 |
| xxxx | OPC | 2 | Poxy1 | 45 | 546 | 68 | 345 | 903 | 70 | 345 | 23 | 54 | 32 | 234 | 23 | 567 | 69 | 64 |
| xxxx | OPC | 2 | Poxy2 | 38 | 67 | 235 | 789 | 7 | 40 | 99 | 98 | 87 | 89 | 34 | 312 | 42 | 756 | 23 |
| xxxx | Oxy R | 4 | H1 | 22 | 36 | 13 | 678 | 64 | 40 | 34 | 239 | 76 | 87 | 34 | 999 | 965 | 34 | 93 |
| xxxx | Oxy R | 4 | H2 | 89 | 54 | 761 | 765 | 9 | 20 | 22 | 65 | 78 | 98 | 78 | 75 | 354 | 23 | 23 |
| xxxx | Oxy R | 4 | H3 | | | | | | | | | | | | | | | |
| xxxx | Oxy R | 4 | H4 | | | | | | | | | | | | | | | |
| xxxx | HPK | 3 | Test1 | 99 | 456 | 39 | 567 | 223 | 50 | 5 | 32 | 549 | 435 | 34 | 87 | 64 | 348 | 942 |
| xxxx | HPK | 3 | Test2 | 52 | 21 | 47 | 876 | 1 | 30 | 46 | 92 | 78 | 12 | 34 | 12 | 12 | 421 | 23 |
| xxxx | HPK | 3 | Test3 | | | | | | | | | | | | | | | |
| xxxx | Mano | 1 | Porp | 42 | 657 | 645 | 234 | 344 | 80 | 45 | 364 | 97 | 23 | 634 | 34 | 23 | 87 | 84 |
| xxxx | Macro | 2 | Otto1 | 73 | 574 | 46 | 456 | 453 | 60 | 44 | 235 | 867 | 5 | 433 | 234 | 346 | 46 | 35 |
| xxxx | Macro | 2 | Otto2 | 23 | 433 | 186 | 987 | 2 | 30 | 34 | 58 | 87 | 43 | 34 | 23 | 62 | 73 | 32 |
Here is my code:
Sub buttonclick()
    Dim Ary As Variant, Nary As Variant, Cary As Variant
    Dim r As Long, c As Long, nr As Long, cc As Long
    ' Packed start/end columns for each Dept block: the first two digits are read
    ' as the start column and the last two digits as the end column of the scan
    Cary = Array("0853", 6898, 113128, 143143)
    With Sheets("Sheet1")
        Ary = .Range("A2:DM" & .Range("A" & Rows.Count).End(xlUp).Row).Value2
    End With
    ReDim Nary(1 To UBound(Ary) * 4, 1 To 19)
    For r = 1 To UBound(Ary)
        For c = 4 To 7                              ' the four Dept name columns
            If Ary(r, c) = "" Then Exit For
            nr = nr + 1
            Nary(nr, 1) = Ary(r, 1): Nary(nr, 2) = Ary(r, 2): Nary(nr, 3) = Ary(r, 3)
            Nary(nr, 4) = Ary(r, c)
            ' Walk this Dept's 15-column blocks and concatenate the values into columns 5-19
            For cc = Left(Cary(c - 4), 2) To Right(Cary(c - 4), 2) Step 15
                Nary(nr, 5) = Nary(nr, 5) & Ary(r, cc)
                Nary(nr, 6) = Nary(nr, 6) & Ary(r, cc + 1)
                Nary(nr, 7) = Nary(nr, 7) & Ary(r, cc + 2)
                Nary(nr, 8) = Nary(nr, 8) & Ary(r, cc + 3)
                Nary(nr, 9) = Nary(nr, 9) & Ary(r, cc + 4)
                Nary(nr, 10) = Nary(nr, 10) & Ary(r, cc + 5)
                Nary(nr, 11) = Nary(nr, 11) & Ary(r, cc + 6)
                Nary(nr, 12) = Nary(nr, 12) & Ary(r, cc + 7)
                Nary(nr, 13) = Nary(nr, 13) & Ary(r, cc + 8)
                Nary(nr, 14) = Nary(nr, 14) & Ary(r, cc + 9)
                Nary(nr, 15) = Nary(nr, 15) & Ary(r, cc + 10)
                Nary(nr, 16) = Nary(nr, 16) & Ary(r, cc + 11)
                Nary(nr, 17) = Nary(nr, 17) & Ary(r, cc + 12)
                Nary(nr, 18) = Nary(nr, 18) & Ary(r, cc + 13)
                Nary(nr, 19) = Nary(nr, 19) & Ary(r, cc + 14)
            Next cc
        Next c
    Next r
    With Sheets("Sheet2")
        .UsedRange.ClearContents
        .Range("A1").Resize(, 19).Value = Array("usr", "Company", "Dept.#", "Dept", "Hrs", _
            "Tr", "F", "A", "HOH", "M", "R", "SO", "BIG", _
            "T", "P", "X", "Y", "Z", "Tin")
        .Range("A2").Resize(nr, 19).Value = Nary
    End With
End Sub
I have a DataFrame which contains negative numbers in accountancy notation, i.e.:
df.select('sales').distinct().show()
+------------+
| sales |
+------------+
| 18 |
| 3 |
| 10 |
| (5)|
| 4 |
| 40 |
| 0 |
| 8 |
| 16 |
| (2)|
| 2 |
| (1)|
| 14 |
| (3)|
| 9 |
| 19 |
| (6)|
| 1 |
| (9)|
| (4)|
+------------+
only showing top 20 rows
The numbers wrapped in () are negative. How can I replace them with minus values instead, i.e. (5) becomes -5, and so on?
Here is what I have tried:
sales = (
    df
    .select('sales')
    .withColumn('sales_new',
        sf.when(sf.col('sales').substr(1, 1) == '(',
                sf.concat(sf.lit('-'), sf.col('sales').substr(2, 3)))
          .otherwise(sf.col('sales')))
)
sales.show(20, False)
+---------+---------+
|sales    |sales_new|
+---------+---------+
| 151 | 151 |
| 134 | 134 |
| 151 | 151 |
|(151) |-151 |
|(134) |-134 |
|(151) |-151 |
| 151 | 151 |
| 50 | 50 |
| 101 | 101 |
| 134 | 134 |
|(134) |-134 |
| 46 | 46 |
| 151 | 151 |
| 134 | 134 |
| 185 | 185 |
| 84 | 84 |
| 188 | 188 |
|(94) |-94) |
| 38 | 38 |
| 21 | 21 |
+---------+---------+
The issue is that the length of sales can vary, so hardcoding a value into substring() won't work in some cases.
I have also tried using regexp_replace, but I get the following error:
PatternSyntaxException: Unclosed group near index 1
sales = (
    df
    .select('sales')
    .withColumn('sales_new', regexp_replace(sf.col('sales'), '(', ''))
)
This can be solved with a case statement and regular expression together:
import pyspark.sql.functions as sf
from pyspark.sql.functions import regexp_replace

sales = (
    df
    .select('sales')
    .withColumn('sales_new',
        sf.when(sf.col('sales').substr(1, 1) == '(',
                sf.concat(sf.lit('-'), regexp_replace(sf.col('sales'), r'\(|\)', '')))
          .otherwise(sf.col('sales')))
)
sales.show(20, False)
+---------+---------+
|sales |sales_new|
+---------+---------+
|151 |151 |
|134 |134 |
|151 |151 |
|(151) |-151 |
|(134) |-134 |
|(151) |-151 |
|151 |151 |
|50 |50 |
|101 |101 |
|134 |134 |
|(134) |-134 |
|46 |46 |
|151 |151 |
|134 |134 |
|185 |185 |
|84 |84 |
|188 |188 |
|(94) |-94 |
|38 |38 |
|21 |21 |
+---------+---------+
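If sales_new needs to be numeric downstream, a cast along these lines should work afterwards (a sketch, assuming whole-number values):
# cast the cleaned string column to an integer type
sales = sales.withColumn('sales_new', sf.col('sales_new').cast('int'))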
You can slice the string from the second character to the second-to-last character, convert it to float, and negate it, for example:
def convert(number):
    try:
        number = float(number)
    except ValueError:
        # a value like '(5)': strip the parentheses and negate
        number = -float(number[1:-1])
    return number
You can then apply this function to every element; in Spark, that can be done by wrapping it in a UDF, as sketched below.
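A minimal sketch of that UDF wiring, assuming the convert function above and a DataFrame df with the sales column:
from pyspark.sql import functions as sf
from pyspark.sql.types import DoubleType

# wrap the plain-Python convert() as a Spark UDF returning a double
convert_udf = sf.udf(convert, DoubleType())

sales = df.withColumn('sales_new', convert_udf(sf.col('sales')))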
In order to enrich my stream data, I join it with a static dataset.
Actually, I join my input dataset twice with the same static dataset, to add information about both the seller and the buyer.
input:
+-----------+------+-----+------+
|transaction|seller|buyer|amount|
+-----------+------+-----+------+
| 1 | A | D | 100 |
| 2 | B | A | 10 |
| 3 | C | A | 20 |
+-----------+------+-----+------+
static dataset:
+------+-------+
|person|address|
+------+-------+
| A | #A |
| B | #B |
| C | #C |
| D | #D |
+------+-------+
Code:
iputDF.join(staticDS, iputDF("seller") <=> staticDS("person"))
      .join(staticDS, iputDF("buyer") <=> staticDS("person"))
output:
+-----------+------+-------+-----+------+------+
|transaction|seller|#seller|buyer|#buyer|amount|
+-----------+------+-------+-----+------+------+
| 1 | A | #A | D | #D | 100 |
| 2 | B | #B | A | #A | 10 |
| 3 | C | #C | A | #A | 20 |
+-----------+------+-------+-----+------+------+
Is there a more efficient way to do this?
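One option I am wondering about is adding a broadcast hint on the small static dataset so that it is not shuffled for either join. Roughly, in PySpark syntax (a sketch only; variable names are illustrative and nothing here is benchmarked):
from pyspark.sql.functions import broadcast

# rename the lookup columns once per role so the two joins don't collide
seller_addr = staticDS.withColumnRenamed('person', 'seller').withColumnRenamed('address', 'seller_address')
buyer_addr = staticDS.withColumnRenamed('person', 'buyer').withColumnRenamed('address', 'buyer_address')

enriched = (
    inputDF
    .join(broadcast(seller_addr), on='seller', how='left')
    .join(broadcast(buyer_addr), on='buyer', how='left')
)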
Let's say that I have a table like the one below:
| | Value 1 | Value 2 | Value 3 | |
|---|---------|---------|---------|---|
| A | 22 | 12 | 3 | |
| A | 5 | 6 | 12 | |
| A | 19 | 9 | 13 | |
| A | 22 | 43 | 31 | |
| B | 7 | 12 | 23 | |
| B | 5 | 5 | 8 | |
| B | 35 | 78 | 9 | |
| B | 45 | 1 | 8 | |
| C | 34 | 56 | 0 | |
| C | 22 | 1 | 14 | |
| C | 13 | 46 | 45 | |
and that I'd need to transform it into the below:
| | Value 1 | Value 2 | Value 3 | |
|---|---------|---------|---------|---|
| A | 22 | 12 | 3 | |
| A | 5 | 6 | 12 | |
| A | 19 | 9 | 13 | |
| A | 22 | 43 | 31 | |
| | 68 | 70 | 59 | |
| | | | | |
| B | 7 | 12 | 23 | |
| B | 5 | 5 | 8 | |
| B | 35 | 78 | 9 | |
| B | 45 | 1 | 8 | |
| | 92 | 96 | 48 | |
| | | | | |
| C | 34 | 56 | 0 | |
| C | 22 | 1 | 14 | |
| C | 13 | 46 | 45 | |
| | 69 | 103 | 59 | |
How could I obtain the desired effect automatically?
After each group there should be the sums of each column within the group, followed by n empty rows.
You can use the Subtotal feature of Excel to automatically add the totals between groupings; Subtotal is in the "Data" tab of the ribbon. I don't think it adds the blank rows, though. If you absolutely need the blank rows, I can generate some VBA that will work.