Spark generate occurrence matrix - apache-spark

I have input transactions as shown below:
apples,mangos,eggs
milk,oranges,eggs
milk, cereals
mango,apples
I have to generate a co-occurrence matrix as a Spark dataframe, like this:
        apple  mango  milk  cereals  eggs
apple       2      2     0        0     1
mango       2      2     0        0     1
milk        0      0     2        1     1
cereals     0      0     1        1     0
eggs        1      1     1        0     2
Apples and mangos are bought together twice, so matrix[apple][mango] = 2.
I am stuck on how to implement this; any suggestions would be a great help. I am using PySpark.

If the data looks like this:
df = spark.createDataFrame(
    ["apples,mangos,eggs", "milk,oranges,eggs", "milk,cereals", "mangos,apples"],
    "string"
).toDF("basket")
Import the required functions:
from pyspark.sql.functions import split, explode, monotonically_increasing_id
Split and explode:
long = (df
    .withColumn("id", monotonically_increasing_id())
    .select("id", explode(split("basket", ","))))
Self-join and crosstab:
long.withColumnRenamed("col", "col_").join(long, ["id"]).stat.crosstab("col_", "col").show()
# +--------+------+-------+----+------+----+-------+
# |col__col|apples|cereals|eggs|mangos|milk|oranges|
# +--------+------+-------+----+------+----+-------+
# | cereals| 0| 1| 0| 0| 1| 0|
# | eggs| 1| 0| 2| 1| 1| 1|
# | milk| 0| 1| 1| 0| 2| 1|
# | mangos| 2| 0| 1| 2| 0| 0|
# | apples| 2| 0| 1| 2| 0| 0|
# | oranges| 0| 0| 1| 0| 1| 1|
# +--------+------+-------+----+------+----+-------+
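One caveat worth noting: the sample input contains "milk, cereals" with a space after the comma. Since split takes a regular expression, you can absorb optional whitespace into the delimiter instead of trimming afterwards; a minimal tweak of the step above, assuming the same df:
long = (df
    .withColumn("id", monotonically_increasing_id())
    # r",\s*" splits on a comma plus any trailing whitespace, so " cereals" becomes "cereals"
    .select("id", explode(split("basket", r",\s*"))))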

Related

How to group by consecutive 1s in a column in pyspark and keep groups with a specific size

I want to split my PySpark dataframe into groups with a monotonically increasing trend and keep the groups with size greater than 10.
Here is the part of the code I tried:
from pyspark.sql import functions as F, Window
df = df1.withColumn(
    "FLAG_INCREASE",
    F.when(
        F.col("x") > F.lag("x").over(Window.partitionBy("x1").orderBy("time")),
        1,
    ).otherwise(0),
)
I don't know how to group by consecutive ones in PySpark; if anyone has a better solution for this, please share.
In pandas we can do the same thing like this:
df = df1.groupby((df1['x'].diff() < 0).cumsum())
How do I convert this code to PySpark?
example dataframe:
x
0 1
1 2
2 2
3 2
4 3
5 3
6 4
7 5
8 4
9 3
10 2
11 1
12 2
13 3
14 4
15 5
16 5
17 6
expected output
group1:
x
0 1
1 2
2 2
3 2
4 3
5 3
6 4
7 5
group2:
x
0 1
1 2
2 3
3 4
4 5
5 5
6 6
I'll map out all the steps (and keep all columns in the output) to replicate (df1['x'].diff() < 0).cumsum(), which is easy to calculate using a lag.
However, it is important that your data has an ID column that captures the order of the dataframe, because unlike pandas, Spark does not retain a dataframe's ordering (due to its distributed nature). For this example, I've assumed that your data has an ID column named idx, which is the index you printed in your example input.
# input data
data_sdf.show(5)
# +---+---+
# |idx|val|
# +---+---+
# | 0| 1|
# | 1| 2|
# | 2| 2|
# | 3| 2|
# | 4| 3|
# +---+---+
# only showing top 5 rows
# imports used below
import sys
from pyspark.sql import functions as func
from pyspark.sql.window import Window as wd

# calculating the group column
data_sdf. \
    withColumn('diff_with_prevval',
               func.col('val') - func.lag('val').over(wd.orderBy('idx'))
               ). \
    withColumn('diff_lt_0',
               func.coalesce((func.col('diff_with_prevval') < 0).cast('int'), func.lit(0))
               ). \
    withColumn('diff_lt_0_cumsum',
               func.sum('diff_lt_0').over(wd.orderBy('idx').rowsBetween(-sys.maxsize, 0))
               ). \
    show()
# +---+---+-----------------+---------+----------------+
# |idx|val|diff_with_prevval|diff_lt_0|diff_lt_0_cumsum|
# +---+---+-----------------+---------+----------------+
# | 0| 1| null| 0| 0|
# | 1| 2| 1| 0| 0|
# | 2| 2| 0| 0| 0|
# | 3| 2| 0| 0| 0|
# | 4| 3| 1| 0| 0|
# | 5| 3| 0| 0| 0|
# | 6| 4| 1| 0| 0|
# | 7| 5| 1| 0| 0|
# | 8| 4| -1| 1| 1|
# | 9| 3| -1| 1| 2|
# | 10| 2| -1| 1| 3|
# | 11| 1| -1| 1| 4|
# | 12| 2| 1| 0| 4|
# | 13| 3| 1| 0| 4|
# | 14| 4| 1| 0| 4|
# | 15| 5| 1| 0| 4|
# | 16| 5| 0| 0| 4|
# | 17| 6| 1| 0| 4|
# +---+---+-----------------+---------+----------------+
You can now use the diff_lt_0_cumsum column in your groupBy() for further use.
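If you also want to drop the groups that are too small (the question asks to keep groups with more than 10 rows), one option is a count over a window partitioned by the group column; a rough sketch, assuming grouped_sdf is the dataframe produced by the steps above:
min_size = 10  # size threshold from the question

grouped_sdf. \
    withColumn('group_size', func.count(func.lit(1)).over(wd.partitionBy('diff_lt_0_cumsum'))). \
    filter(func.col('group_size') > min_size). \
    drop('group_size'). \
    show()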

Is it possible to filter columns by the sum of their values in Spark?

I'm loading a sparse table using PySpark and I want to remove all columns where the sum of all values in the column is below a threshold.
For example, the sum of column values of the following table:
+---+---+---+---+---+---+
| a| b| c| d| e| f|
+---+---+---+---+---+---+
| 1| 0| 1| 1| 0| 0|
| 1| 1| 0| 0| 0| 0|
| 1| 0| 0| 1| 1| 1|
| 1| 0| 0| 1| 1| 1|
| 1| 1| 0| 0| 1| 0|
| 0| 0| 1| 0| 1| 0|
+---+---+---+---+---+---+
is 5, 2, 2, 3, 4 and 2. Filtering for all columns with sum >= 3 should output this table:
+---+---+---+
| a| d| e|
+---+---+---+
| 1| 1| 0|
| 1| 0| 0|
| 1| 1| 1|
| 1| 1| 1|
| 1| 0| 1|
| 0| 0| 1|
+---+---+---+
I tried many different solutions without success. df.groupBy().sum() gives me the sums of the column values, so I'm looking for a way to filter those against the threshold and select only the remaining columns from the original dataframe.
As there are not just 6 but a couple of thousand columns, I'm looking for a scalable solution where I don't have to type in every column name. Thanks for the help!
You can do this with a collect (or a first) step.
from pyspark.sql import functions as F
sum_result = df.groupBy().agg(*(F.sum(col).alias(col) for col in df.columns)).first()
filtered_df = df.select(
    *(col for col, value in sum_result.asDict().items() if value >= 3)
)
filtered_df.show()
+---+---+---+
| a| d| e|
+---+---+---+
| 1| 1| 0|
| 1| 0| 0|
| 1| 1| 1|
| 1| 1| 1|
| 1| 0| 1|
| 0| 0| 1|
+---+---+---+
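With a couple of thousand columns you may want to wrap this in a small helper so the threshold is a parameter; a convenience sketch over the same approach (the function name is just an example):
from pyspark.sql import functions as F

def filter_columns_by_sum(df, threshold):
    # one wide aggregation, collected as a single row of per-column sums
    sums = df.groupBy().agg(*(F.sum(c).alias(c) for c in df.columns)).first()
    keep = [c for c, v in sums.asDict().items() if v is not None and v >= threshold]
    return df.select(*keep)

filter_columns_by_sum(df, 3).show()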

Get all possible combinations recursively in an RDD in pyspark

I have made this algorithm, but with higher numbers it looks like it doesn't work or is very slow. It will run on a big data cluster (Cloudera), so I think I have to move the function into PySpark. Any tip on how to improve it, please?
import pandas as pd
import itertools as itts

number_list = [10953, 10423, 10053]

def reducer(nums):
    def ranges(n):
        print(n)
        return range(n, -1, -1)
    num_list = list(map(ranges, nums))
    return list(itts.product(*num_list))

data = pd.DataFrame(reducer(number_list))
print(data)
You can use crossJoin with DataFrames.
Here is a simple example that computes the cross-product of three arrays,
i.e. [1, 0], [2, 1, 0] and [3, 2, 1, 0]. Their cross-product should have 2*3*4 = 24 elements.
The code below shows how to achieve this.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('test').getOrCreate()
df1 = spark.createDataFrame([(1,),(0,)], ['v1'])
df2 = spark.createDataFrame([(2,), (1,),(0,)], ['v2'])
df3 = spark.createDataFrame([(3,), (2,),(1,),(0,)], ['v3'])
df1.show()
df2.show()
df3.show()
+---+
| v1|
+---+
| 1|
| 0|
+---+
+---+
| v2|
+---+
| 2|
| 1|
| 0|
+---+
+---+
| v3|
+---+
| 3|
| 2|
| 1|
| 0|
+---+
df = df1.crossJoin(df2).crossJoin(df3)
print('----------- Total rows: ', df.count())
df.show(30)
----------- Total rows: 24
+---+---+---+
| v1| v2| v3|
+---+---+---+
| 1| 2| 3|
| 1| 2| 2|
| 1| 2| 1|
| 1| 2| 0|
| 1| 1| 3|
| 1| 1| 2|
| 1| 1| 1|
| 1| 1| 0|
| 1| 0| 3|
| 1| 0| 2|
| 1| 0| 1|
| 1| 0| 0|
| 0| 2| 3|
| 0| 2| 2|
| 0| 2| 1|
| 0| 2| 0|
| 0| 1| 3|
| 0| 1| 2|
| 0| 1| 1|
| 0| 1| 0|
| 0| 0| 3|
| 0| 0| 2|
| 0| 0| 1|
| 0| 0| 0|
+---+---+---+
Your computation is pretty big:
(10953+1)*(10423+1)*(10053+1) = 1148010922784, about 1 trillion rows. I would suggest increasing the numbers slowly; Spark is not as fast as you might think when table joins are involved.
Also, try using broadcast on all your initial DataFrames, i.e. df1, df2, df3, and see if it helps; a sketch of the hint follows below.
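For the broadcast hint, something along these lines (just a sketch; whether it actually helps depends on the sizes involved):
from pyspark.sql.functions import broadcast

# ask Spark to broadcast the smaller DataFrames so the cross joins avoid shuffling them
df = df1.crossJoin(broadcast(df2)).crossJoin(broadcast(df3))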

How to convert PySpark pipeline rdd (tuple inside tuple) into Data Frame?

I have a PySpark pipeline RDD like below:
(1, ([1,2,3,4], [5,3,4,5]))
(2, ([1,2,4,5], [4,5,6,7]))
I want to generate a DataFrame like below:
Id sid cid
1 1 5
1 2 3
1 3 4
1 4 5
2 1 4
2 2 5
2 4 6
2 5 7
Please help me on this.
If you have an RDD like this one,
rdd = sc.parallelize([
(1, ([1,2,3,4], [5,3,4,5])),
(2, ([1,2,4,5], [4,5,6,7]))
])
I would just use RDDs:
rdd.flatMap(lambda rec:
((rec[0], sid, cid) for sid, cid in zip(rec[1][0], rec[1][1]))
).toDF(["id", "sid", "cid"]).show()
# +---+---+---+
# | id|sid|cid|
# +---+---+---+
# | 1| 1| 5|
# | 1| 2| 3|
# | 1| 3| 4|
# | 1| 4| 5|
# | 2| 1| 4|
# | 2| 2| 5|
# | 2| 4| 6|
# | 2| 5| 7|
# +---+---+---+
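If you would rather stay in the DataFrame API, an equivalent approach (Spark 2.4+) is to zip the two arrays and explode the result; a sketch, assuming the same rdd as above:
from pyspark.sql.functions import arrays_zip, explode, col

# flatten the nested tuple into (id, sids, cids) columns first
df = rdd.map(lambda rec: (rec[0], rec[1][0], rec[1][1])).toDF(["id", "sids", "cids"])
(df
    .select("id", explode(arrays_zip("sids", "cids")).alias("z"))
    .select("id", col("z.sids").alias("sid"), col("z.cids").alias("cid"))
    .show())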

Add a priority column in PySpark dataframe

I have a PySpark dataframe (input_dataframe) which looks like below:
 id  col1  col2  col3  col4  col_check
101     1     0     1     1         -1
102     0     1     1     0         -1
103     1     1     0     1         -1
104     0     0     1     1         -1
I want a PySpark function (update_col_check) which updates the column col_check of this dataframe. I will pass one column name as an argument to this function. The function should check whether the value of that column is 1 and, if so, set col_check to that column name. Let us say I am passing col2 as an argument to this function:
output_dataframe = update_col_check(input_dataframe, col2)
So, my output_dataframe should look like below:
 id  col1  col2  col3  col4  col_check
101     1     0     1     1         -1
102     0     1     1     0       col2
103     1     1     0     1       col2
104     0     0     1     1         -1
Can I achieve this using PySpark? Any help will be appreciated.
You can do this fairly straightforwardly with the functions when and otherwise:
from pyspark.sql.functions import when, lit
def update_col_check(df, col_name):
    return df.withColumn(
        'col_check',
        when(df[col_name] == 1, lit(col_name)).otherwise(df['col_check'])
    )
update_col_check(df, 'col1').show()
+---+----+----+----+----+---------+
| id|col1|col2|col3|col4|col_check|
+---+----+----+----+----+---------+
|101| 1| 0| 1| 1| col1|
|102| 0| 1| 1| 0| -1|
|103| 1| 1| 0| 1| col1|
|104| 0| 0| 1| 1| -1|
+---+----+----+----+----+---------+
update_col_check(df, 'col2').show()
+---+----+----+----+----+---------+
| id|col1|col2|col3|col4|col_check|
+---+----+----+----+----+---------+
|101| 1| 0| 1| 1| -1|
|102| 0| 1| 1| 0| col2|
|103| 1| 1| 0| 1| col2|
|104| 0| 0| 1| 1| -1|
+---+----+----+----+----+---------+
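If you later need to run the check for several columns, with later columns taking precedence when more than one matches, you could fold the same function over a list of names (just a convenience sketch on top of the answer above):
from functools import reduce

def update_col_checks(df, col_names):
    # applies update_col_check once per column name, left to right
    return reduce(update_col_check, col_names, df)

update_col_checks(df, ['col1', 'col2']).show()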
