Get all possible combinations recursively in an RDD in pyspark - python-3.x

I have made this algorithm, but with higher numbers it looks like it doesn't work or it is very slow. It will run on a big data cluster (Cloudera), so I think I have to move the function into PySpark. Any tips on how to improve it, please?
import pandas as pd
import itertools as itts

number_list = [10953, 10423, 10053]

def reducer(nums):
    def ranges(n):
        print(n)
        return range(n, -1, -1)
    num_list = list(map(ranges, nums))
    return list(itts.product(*num_list))

data = pd.DataFrame(reducer(number_list))
print(data)

You can use crossJoin with DataFrames.
Here is a simple example that computes the cross-product of three arrays,
i.e. [1, 0], [2, 1, 0] and [3, 2, 1, 0]. Their cross-product should have 2*3*4 = 24 elements.
The code below shows how to achieve this.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('test').getOrCreate()
df1 = spark.createDataFrame([(1,),(0,)], ['v1'])
df2 = spark.createDataFrame([(2,), (1,),(0,)], ['v2'])
df3 = spark.createDataFrame([(3,), (2,),(1,),(0,)], ['v3'])
df1.show()
df2.show()
df3.show()
+---+
| v1|
+---+
| 1|
| 0|
+---+
+---+
| v2|
+---+
| 2|
| 1|
| 0|
+---+
+---+
| v3|
+---+
| 3|
| 2|
| 1|
| 0|
+---+
df = df1.crossJoin(df2).crossJoin(df3)
print('----------- Total rows: ', df.count())
df.show(30)
----------- Total rows: 24
+---+---+---+
| v1| v2| v3|
+---+---+---+
| 1| 2| 3|
| 1| 2| 2|
| 1| 2| 1|
| 1| 2| 0|
| 1| 1| 3|
| 1| 1| 2|
| 1| 1| 1|
| 1| 1| 0|
| 1| 0| 3|
| 1| 0| 2|
| 1| 0| 1|
| 1| 0| 0|
| 0| 2| 3|
| 0| 2| 2|
| 0| 2| 1|
| 0| 2| 0|
| 0| 1| 3|
| 0| 1| 2|
| 0| 1| 1|
| 0| 1| 0|
| 0| 0| 3|
| 0| 0| 2|
| 0| 0| 1|
| 0| 0| 0|
+---+---+---+
Your computation is pretty big:
(10953+1)*(10423+1)*(10053+1) = 1,148,010,922,784, which is about 1 trillion rows. I would suggest increasing the numbers slowly; Spark is not as fast as you might think when table joins are involved.
Also, try using broadcast on all your initial DataFrames, i.e. df1, df2, df3, and see if it helps.
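A minimal sketch of that suggestion, assuming the standard broadcast hint from pyspark.sql.functions and using spark.range to build each 0..n column programmatically (dims and result are illustrative names, not from the original post):
from pyspark.sql.functions import broadcast

# Build one single-column DataFrame per number, holding the values 0..n.
number_list = [10953, 10423, 10053]
dims = [
    spark.range(n + 1).withColumnRenamed("id", f"v{i + 1}")
    for i, n in enumerate(number_list)
]

# Chain the cross joins, hinting that each input is small enough to broadcast.
result = broadcast(dims[0])
for d in dims[1:]:
    result = result.crossJoin(broadcast(d))

# With the original numbers this is ~1.15 trillion rows; start with small values.
print(result.count())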

Related

Check if a column is consecutive with groupby in pyspark

I have a pyspark dataframe that looks like this:
import pandas as pd
foo = pd.DataFrame({'group': ['a','a','a','b','b','c','c','c'], 'value': [1,2,3,4,5,2,4,5]})
I would like to create a new binary column is_consecutive that indicates if the values in the value column are consecutive by group.
The output should look like this:
foo = pd.DataFrame({'group': ['a','a','a','b','b','c','c','c'], 'value': [1,2,3,4,5,2,4,5],
'is_consecutive': [1,1,1,1,1,0,0,0]})
How could I do that in pyspark?
You can use lag to compare values with the previous row and check if they are consecutive, then use min to determine whether all rows are consecutive in a given group.
from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'consecutive',
    F.coalesce(
        (F.col('value') - F.lag('value').over(Window.partitionBy('group').orderBy('value'))) == 1,
        F.lit(True)
    ).cast('int')
).withColumn(
    'all_consecutive',
    F.min('consecutive').over(Window.partitionBy('group'))
)
df2.show()
+-----+-----+-----------+---------------+
|group|value|consecutive|all_consecutive|
+-----+-----+-----------+---------------+
| c| 2| 1| 0|
| c| 4| 0| 0|
| c| 5| 1| 0|
| b| 4| 1| 1|
| b| 5| 1| 1|
| a| 1| 1| 1|
| a| 2| 1| 1|
| a| 3| 1| 1|
+-----+-----+-----------+---------------+
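Note that the code above uses a Spark DataFrame named df; if you are starting from the pandas foo in the question, one way to get it (a sketch, assuming a SparkSession named spark already exists) is:
import pandas as pd

foo = pd.DataFrame({'group': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'],
                    'value': [1, 2, 3, 4, 5, 2, 4, 5]})

# Convert the pandas DataFrame from the question into a Spark DataFrame.
df = spark.createDataFrame(foo)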
You can use lead to get the next value, subtract the current value from it, and then take the max over the window; once that is done, return 0 if the max is greater than 1, otherwise return 1:
from pyspark.sql import functions as F, Window

w = Window.partitionBy("group").orderBy(F.monotonically_increasing_id())

(foo.withColumn("Diff", F.lead("value").over(w) - F.col("value"))
    .withColumn("is_consecutive", F.when(F.max("Diff").over(w) > 1, 0).otherwise(1))
    .drop("Diff")).show()
+-----+-----+--------------+
|group|value|is_consecutive|
+-----+-----+--------------+
| a| 1| 1|
| a| 2| 1|
| a| 3| 1|
| b| 4| 1|
| b| 5| 1|
| c| 2| 0|
| c| 4| 0|
| c| 5| 0|
+-----+-----+--------------+

Regarding the usage of F.count(F.col("some column").isNotNull()) in a window function

I am trying to test the usage of F.count(F.col().isNotNull()) in a window function. Please see the following code:
from pyspark.sql import functions as F
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = ([1, 5, 4],
        [1, 5, None],
        [1, 5, 1],
        [1, 5, 4],
        [2, 5, 1],
        [2, 5, 2],
        [2, 5, None],
        [2, 5, None],
        [2, 5, 4])
df = spark.createDataFrame(data, ['I_id', 'p_id', 'xyz'])
w= Window().partitionBy("I_id","p_id").orderBy(F.col("xyz").asc_nulls_first())
df.withColumn("xyz1",F.count(F.col("xyz").isNotNull()).over(w)).show()
In the result, for the first two rows my understanding is that F.count(F.col("xyz")) should count the non-null items from xyz = -infinity up to xyz = null, so how does isNotNull() come into this? Why does it return 2 for the first two rows in the xyz1 column?
If you count the Booleans, since they are either True or False, you will count all the rows in the specified window, regardless of whether xyz is null or not.
What you could do is to sum the isNotNull Boolean rather than counting them.
df.withColumn("xyz1",F.sum(F.col("xyz").isNotNull().cast('int')).over(w)).show()
+----+----+----+----+
|I_id|p_id| xyz|xyz1|
+----+----+----+----+
| 2| 5|null| 0|
| 2| 5|null| 0|
| 2| 5| 1| 1|
| 2| 5| 2| 2|
| 2| 5| 4| 3|
| 1| 5|null| 0|
| 1| 5| 1| 1|
| 1| 5| 4| 3|
| 1| 5| 4| 3|
+----+----+----+----+
Another way is to do a conditional count using when:
df.withColumn("xyz1",F.count(F.when(F.col("xyz").isNotNull(), 1)).over(w)).show()
+----+----+----+----+
|I_id|p_id| xyz|xyz1|
+----+----+----+----+
| 2| 5|null| 0|
| 2| 5|null| 0|
| 2| 5| 1| 1|
| 2| 5| 2| 2|
| 2| 5| 4| 3|
| 1| 5|null| 0|
| 1| 5| 1| 1|
| 1| 5| 4| 3|
| 1| 5| 4| 3|
+----+----+----+----+
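A further simplification, not from the original answers but relying on the standard SQL rule that COUNT(expr) ignores nulls, is to count the column itself rather than a Boolean expression; over the same window w this should produce the same xyz1 values:
# count() of a column skips nulls, so no isNotNull()/when() wrapper is needed.
df.withColumn("xyz1", F.count(F.col("xyz")).over(w)).show()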

Is it possible to filter columns by the sum of their values in Spark?

I'm loading a sparse table using PySpark and I want to remove all columns where the sum of the values in the column falls below a threshold.
For example, the sum of column values of the following table:
+---+---+---+---+---+---+
| a| b| c| d| e| f|
+---+---+---+---+---+---+
| 1| 0| 1| 1| 0| 0|
| 1| 1| 0| 0| 0| 0|
| 1| 0| 0| 1| 1| 1|
| 1| 0| 0| 1| 1| 1|
| 1| 1| 0| 0| 1| 0|
| 0| 0| 1| 0| 1| 0|
+---+---+---+---+---+---+
is 5, 2, 2, 3, 4 and 2, respectively. Filtering for all columns with sum >= 3 should output this table:
+---+---+---+
| a| d| e|
+---+---+---+
| 1| 1| 0|
| 1| 0| 0|
| 1| 1| 1|
| 1| 1| 1|
| 1| 0| 1|
| 0| 0| 1|
+---+---+---+
I tried many different solutions without success. df.groupBy().sum() gives me the sums of the column values, so I'm looking for a way to then filter by the threshold and keep only the remaining columns from the original dataframe.
As there are not just 6 but a couple of thousand columns, I'm looking for a scalable solution where I don't have to type in every column name. Thanks for the help!
You can do this with a collect (or a first) step.
from pyspark.sql import functions as F

sum_result = df.groupBy().agg(*(F.sum(col).alias(col) for col in df.columns)).first()

filtered_df = df.select(
    *(col for col, value in sum_result.asDict().items() if value >= 3)
)
filtered_df.show()
filtered_df.show()
+---+---+---+
| a| d| e|
+---+---+---+
| 1| 1| 0|
| 1| 0| 0|
| 1| 1| 1|
| 1| 1| 1|
| 1| 0| 1|
| 0| 0| 1|
+---+---+---+
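If this is something you need repeatedly, the same idea can be wrapped in a small helper with the threshold as a parameter (drop_cols_below_threshold is an illustrative name, not an existing API); a sketch:
from pyspark.sql import functions as F

def drop_cols_below_threshold(df, threshold):
    """Keep only the columns whose total sum is at least `threshold`."""
    # One pass over the data: sum every column, then collect the single result row.
    sums = df.agg(*(F.sum(c).alias(c) for c in df.columns)).first()
    keep = [c for c, v in sums.asDict().items() if v is not None and v >= threshold]
    return df.select(*keep)

drop_cols_below_threshold(df, 3).show()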

Simplify code and reduce join statements in pyspark data frames

I have a data frame in pyspark like below.
df.show()
+---+-------------+
| id| device|
+---+-------------+
| 3| mac pro|
| 1| iphone|
| 1|android phone|
| 1| windows pc|
| 1| spy camera|
| 2| spy camera|
| 2| iphone|
| 3| spy camera|
| 3| cctv|
+---+-------------+
phone_list = ['iphone', 'android phone', 'nokia']
pc_list = ['windows pc', 'mac pro']
security_list = ['spy camera', 'cctv']
from pyspark.sql.functions import col
phones_df = df.filter(col('device').isin(phone_list)).groupBy("id").count().selectExpr("id as id", "count as phones")
phones_df.show()
+---+------+
| id|phones|
+---+------+
| 1| 2|
| 2| 1|
+---+------+
pc_df = df.filter(col('device').isin(pc_list)).groupBy("id").count().selectExpr("id as id", "count as pc")
pc_df.show()
+---+---+
| id| pc|
+---+---+
| 1| 1|
| 3| 1|
+---+---+
security_df = df.filter(col('device').isin(security_list)).groupBy("id").count().selectExpr("id as id", "count as security")
security_df.show()
+---+--------+
| id|security|
+---+--------+
| 1| 1|
| 2| 1|
| 3| 2|
+---+--------+
Then I want to do a full outer join on all the three data frames. I have done like below.
import pyspark.sql.functions as f

full_df = phones_df.join(pc_df, phones_df.id == pc_df.id, 'full_outer').select(
    f.coalesce(phones_df.id, pc_df.id).alias('id'), phones_df.phones, pc_df.pc)
final_df = full_df.join(security_df, full_df.id == security_df.id, 'full_outer').select(
    f.coalesce(full_df.id, security_df.id).alias('id'), full_df.phones, full_df.pc, security_df.security)
final_df.show()
+---+------+----+--------+
| id|phones| pc|security|
+---+------+----+--------+
| 1| 2| 1| 1|
| 2| 1|null| 1|
| 3| null| 1| 2|
+---+------+----+--------+
I am able to get what I want, but I would like to simplify my code.
1) I want to create phones_df, pc_df and security_df in a better way, because I am repeating the same code for each of these data frames and want to reduce that.
2) I want to simplify the join statements into one statement.
How can I do this? Could anyone explain?
Here is one way: use when/otherwise to map the device column to categories, and then pivot to the desired output:
import pyspark.sql.functions as F

df.withColumn(
    'cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
        F.when(df.device.isin(pc_list), 'pc').otherwise(
            F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()
+---+----+------+--------+
| id| pc|phones|security|
+---+----+------+--------+
| 1| 1| 2| 1|
| 3| 1| null| 2|
| 2|null| 1| 1|
+---+----+------+--------+
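As a small follow-up, pivot also accepts an explicit list of pivot values; supplying the three category names up front lets Spark skip the extra job it otherwise runs to discover the distinct values of cat, which can help on larger data. The same transformation as above, sketched with that argument added:
import pyspark.sql.functions as F

# Listing the pivot values explicitly avoids a pass over the data to compute them.
df.withColumn(
    'cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
        F.when(df.device.isin(pc_list), 'pc').otherwise(
            F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat', ['phones', 'pc', 'security']).agg(F.count('cat')).show()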

Reducing a dataframe to the most frequent combinations of two columns

I have a json file which I import using the following code:
from pyspark import SparkConf
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName('GPS').config(conf=SparkConf()).getOrCreate()
df = spark.read.json("SensorData.json")
The result is a dataframe similar to this:
+---+---+
| A| B|
+---+---+
| 1| 3|
| 2| 1|
| 2| 3|
| 1| 2|
| 3| 1|
| 1| 2|
| 2| 1|
| 1| 3|
| 1| 2|
+---+---+
My task is to use PySpark to reduce the data to only the most frequent combinations of the two columns (A and B).
So the desired output is this:
+---+---+-----+
| A| B|count|
+---+---+-----+
| 1| 2| 3|
| 2| 1| 2|
+---+---+-----+
You can do that with a combination of groupBy and limit:
spark = SparkSession.builder.master("local").appName('GPS').config(conf=SparkConf()).getOrCreate()
df = spark.read.json("SensorData.json")

# Wrap the chained calls in parentheses so the expression can span multiple lines.
(df.groupBy("A", "B")
   .count()
   .sort("count", ascending=False)
   .limit(2)
   .show())
+---+---+-----+
| A| B|count|
+---+---+-----+
| 1| 2| 3|
| 2| 1| 2|
+---+---+-----+
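One caveat with this approach: limit(2) returns exactly two rows, so if several combinations tie on the count around the cutoff, which of them you get is arbitrary. If ties should be kept, a window rank is one alternative (a sketch, not part of the original answer):
from pyspark.sql import functions as F, Window

counts = df.groupBy("A", "B").count()

# dense_rank keeps every (A, B) pair that ties on the same count.
w = Window.orderBy(F.desc("count"))
(counts.withColumn("rank", F.dense_rank().over(w))
       .filter(F.col("rank") <= 2)
       .drop("rank")
       .show())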
