Regarding the usage of rangeBetween in the Window.partitionBy function - apache-spark

I ran the following script:
from pyspark.sql import Window
from pyspark.sql import functions as func
from pyspark.sql import SQLContext
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
tup = [(1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b")]
df = sqlContext.createDataFrame(tup, ["id", "category"])
df.show()
Then I apply the following window partition, and the result is shown below. I am confused about how this result was generated using rangeBetween. For instance, why is the fourth row of the sum column 4? How does rangeBetween(Window.currentRow, 1) produce that value of 4? Moreover, according to the Spark docs,
Window.currentRow is defined as 0, so why does the code not use 0 instead?
window = Window.partitionBy("category").orderBy("id").rangeBetween(Window.currentRow, 1)
df.withColumn("sum", func.sum("id").over(window)).show()

Window.currentRow and 0 should be equivalent; I guess it's just a matter of preference. As for why you're getting 4: the window spans the id values from the current row's value to that value plus one, i.e. from 1 (current row) to 2 (plus one). The three rows in that partition where id is 1 or 2 are included in the window, so the sum is 1 + 1 + 2 = 4.
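To make the frame concrete, here is a small breakdown (my own sketch, not part of the original answer) of which id values fall inside each row's range frame; the window definition is the one from the question:
# Each row's frame covers the id values in [current id, current id + 1]
# within that row's category.
# category "a": ids (ordered) = 1, 1, 2
#   rows with id=1 -> ids in [1, 2] = {1, 1, 2} -> sum = 4
#   row  with id=2 -> ids in [2, 3] = {2}       -> sum = 2
# category "b": ids (ordered) = 1, 2, 3
#   row with id=1 -> ids in [1, 2] = {1, 2} -> sum = 3
#   row with id=2 -> ids in [2, 3] = {2, 3} -> sum = 5
#   row with id=3 -> ids in [3, 4] = {3}    -> sum = 3
window = Window.partitionBy("category").orderBy("id").rangeBetween(Window.currentRow, 1)
df.withColumn("sum", func.sum("id").over(window)).show()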

Related

PySpark not null and not nan values function for all type of columns

I need to build a method that receives a pyspark.sql.Column 'c' and returns a new pyspark.sql.Column containing True/False depending on whether the values in the column are null/NaN.
PySpark has the Column method c.isNotNull(), which handles the not-null check. It also has pyspark.sql.functions.isnan, which receives a pyspark.sql.Column and works with NaNs (but does not work with datetime/bool columns).
I'm trying to build a function that looks like this:
from pyspark.sql import functions as F
def notnull(c):
    return c.isNotNull() & ~F.isnan(c)
And then I want to use that function with any column type in my DataFrame to determine whether there are not-null/not-NaN values within that column. But this fails when the provided column type is bool or datetime:
import datetime
import numpy as np
import pandas as pd
from pyspark import SparkConf
from pyspark.sql import SparkSession
# Building SparkSession 'spark'
conf = (SparkConf().setAppName("example")
        .setMaster("local[*]")
        .set("spark.sql.execution.arrow.enabled", "true"))
spark = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
# Data input and initializing pd_df
data = {
    'string_col': ['1', '1', '1', None],
    'bool_col': [True, True, True, False],
    'datetime_col': [
        datetime.datetime(2018, 12, 9),
        datetime.datetime(2018, 12, 9),
        datetime.datetime(2018, 12, 9),
        pd.NaT],
    'float_col': [1.0, 1.0, 1.0, np.nan]
}
pd_df = pd.DataFrame(data)
# Creating spark_df from pd_df
spark_df = spark.createDataFrame(pd_df)
# This should return a new dataframe with the column 'notnulls' added
# Note: This works fine with 'float_col' and 'string_col' but does not
# work with 'bool_col' or 'datetime_col'
spark_df.withColumn('notnulls', notnull(spark_df['datetime_col'])).collect()
Running this snippet (using 'datetime_col') will throw the following exception:
pyspark.sql.utils.AnalysisException: "cannot resolve 'isnan(`datetime_col`)'
due to data type mismatch: argument 1 requires (double or float) type, however,
'`datetime_col`' is of timestamp type.;;\n'Project [category#217,
float_col#218, string_col#219, bool_col#220, CASE WHEN isnan(datetime_col#221)
THEN NOT isnan(datetime_col#221) ELSE isnotnull(datetime_col#221) END AS
datetime_col#231]\n+- LogicalRDD [category#217, float_col#218, string_col#219,
bool_col#220, datetime_col#221], false\n"
I understand this is because the isnan function cannot be applied to 'datetime_col', since it is not of float/double type. Since 'c' is a pyspark.sql.Column object, I can't access its dtype to behave differently based on the column type. I want to avoid using a pandas_udf to solve this issue, but I haven't been able to find any other way to do it.
I'm using the following dependencies:
numpy==1.19.1
pandas==1.0.4
pyarrow==1.0.0
pyspark==2.4.5
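For what it's worth, one possible workaround (my own sketch, not from the original post) is to look up the column's data type in the DataFrame schema rather than on the Column object, and only apply isnan to float/double columns:
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, FloatType

# Sketch: take the DataFrame and a column name instead of a bare Column,
# so the data type can be looked up in the schema.
def notnull(df, col_name):
    c = df[col_name]
    if isinstance(df.schema[col_name].dataType, (DoubleType, FloatType)):
        return c.isNotNull() & ~F.isnan(c)
    return c.isNotNull()

spark_df.withColumn('notnulls', notnull(spark_df, 'datetime_col')).collect()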

Appending column name to column value using Spark

I have data in a comma-separated file, which I have loaded into a Spark DataFrame:
The data looks like:
A B C
1 2 3
4 5 6
7 8 9
I want to transform the above DataFrame in Spark using PySpark into:
A B C
A_1 B_2 C_3
A_4 B_5 C_6
--------------
Then convert it to a list of lists using PySpark:
[[ A_1 , B_2 , C_3],[A_4 , B_5 , C_6]]
And then run the FP-Growth algorithm on the above dataset using PySpark.
The code that I have tried is below:
from pyspark.sql.functions import col, size
from pyspark.sql.functions import *
import pyspark.sql.functions as func
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.ml.fpm import FPGrowth
from pyspark.sql import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/tables/data.csv")
names=df.schema.names
Then I thought of doing something inside a for loop:
for name in names:
-----
------
After this I will be using FPGrowth:
df = spark.createDataFrame([
    (0, ["A_1", "B_2", "C_3"]),
    (1, ["A_4", "B_5", "C_6"])], ["id", "items"])
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)
There are a number of concepts here for those who normally use Scala, showing how to do this with PySpark. It's somewhat different, but instructive for sure. I certainly learnt a point about zipWithIndex in PySpark myself. Anyway.
The first part is to get the data into the desired format; there are probably too many imports, but I'm leaving them as is:
from functools import reduce
from pyspark.sql.functions import lower, col, lit, concat, split
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql import functions as f
source_df = spark.createDataFrame(
    [
        (1, 11, 111),
        (2, 22, 222)
    ],
    ["colA", "colB", "colC"]
)
intermediate_df = (reduce(
    lambda df, col_name: df.withColumn(col_name, concat(lit(col_name), lit("_"), col(col_name))),
    source_df.columns,
    source_df
))
allCols = [x for x in intermediate_df.columns]
result_df = intermediate_df.select(f.concat_ws(',', *allCols).alias('CONCAT_COLS'))
result_df = result_df.select(split(col("CONCAT_COLS"), ",\s*").alias("ARRAY_COLS"))
# Add 0,1,2,3, ... with zipWithIndex; we add it at the back, but that does not matter, you can move it around.
# Get the new structure: the existing fields (one in this case, but done flexibly) plus the zipWithIndex value.
schema = StructType(result_df.schema.fields[:] + [StructField("index", LongType(), True)])
# Need this dict approach with pyspark, different to Scala.
rdd = result_df.rdd.zipWithIndex()
rdd1 = rdd.map(
    lambda row: tuple(row[0].asDict()[c] for c in schema.fieldNames()[:-1]) + (row[1],)
)
final_result_df = spark.createDataFrame(rdd1, schema)
final_result_df.show(truncate=False)
returns:
+---------------------------+-----+
|ARRAY_COLS |index|
+---------------------------+-----+
|[colA_1, colB_11, colC_111]|0 |
|[colA_2, colB_22, colC_222]|1 |
+---------------------------+-----+
The second part is the old zipWithIndex approach with PySpark, if you need 0, 1, ... indexes. It's painful compared to Scala.
In general this is easier to solve in Scala.
I'm not sure about performance; it's not a foldLeft, which is interesting. I think it is OK actually.
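To connect this back to the FPGrowth step from the question, a possible continuation (my own sketch, not part of the original answer, reusing final_result_df from above) is to rename the two columns and fit the model directly:
from pyspark.ml.fpm import FPGrowth

# Sketch: use the exploded array column as "items" and the zipWithIndex value as "id".
fp_input = final_result_df.selectExpr("index as id", "ARRAY_COLS as items")
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(fp_input)
model.freqItemsets.show(truncate=False)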

Do not discard keys with null values when converting to JSON in PySpark DataFrame

I am creating a column in a DataFrame from several other columns that I want to store as a JSON serialized string. When the serialization to JSON occurs, keys with null values are dropped. Is there a way to keep keys even if the value is null?
Sample program illustrating the issue:
from pyspark.sql import functions as F
df = sc.parallelize([
    (1, 10),
    (2, 20),
    (3, None),
    (4, 40),
]).toDF(['id', 'data'])
df.collect()
#[Row(id=1, data=10),
# Row(id=2, data=20),
# Row(id=3, data=None),
# Row(id=4, data=40)]
df_s = df.select(F.struct('data').alias('struct'))
df_s.collect()
#[Row(struct=Row(data=10)),
# Row(struct=Row(data=20)),
# Row(struct=Row(data=None)),
# Row(struct=Row(data=40))]
df_j = df.select(F.to_json(F.struct('data')).alias('json'))
df_j.collect()
#[Row(json=u'{"data":10}'),
# Row(json=u'{"data":20}'),
# Row(json=u'{}'), <= would like this to be u'{"data":null}'
# Row(json=u'{"data":40}')]
Running Spark 2.1.0
I could not find a Spark-specific solution, so I just wrote a UDF and used the Python json package:
import json
from pyspark.sql import types as T
def to_json(data):
    return json.dumps({'data': data})
to_json_udf = F.udf(to_json, T.StringType())
df.select(to_json_udf('data').alias('json')).collect()
# [Row(json=u'{"data": 10}'),
# Row(json=u'{"data": 20}'),
# Row(json=u'{"data": null}'),
# Row(json=u'{"data": 40}')]
Since PySpark 3, one can use the ignoreNullFields option when writing to a JSON file:
spark_dataframe.write.json(output_path, ignoreNullFields=False)
Pyspark docs:
https://spark.apache.org/docs/3.1.1/api/python/_modules/pyspark/sql/readwriter.html#DataFrameWriter.json
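If the goal is the column-level to_json from the question rather than writing a file, the same option can be passed through to_json's options argument (a sketch of mine, assuming Spark 3.x, where to_json accepts JSON datasource options; the value is given as a string for safety):
from pyspark.sql import functions as F

# Sketch: keep keys with null values in the serialized string column (Spark 3.x).
df_j = df.select(F.to_json(F.struct('data'), options={'ignoreNullFields': 'false'}).alias('json'))
df_j.collect()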

Using VectorAssembler in Spark

I have the following DataFrame (assume that it is already a DataFrame):
val df = sc.parallelize(Seq((1, 2, 10), (3, 4, 11), (5, 6, 12)))
  .toDF("a", "b", "c")
and I want to combine some of the columns (not all) into one column and turn it into an RDD of Array[Double]. I am doing the following:
import org.apache.spark.ml.feature.VectorAssembler
val colSelected = List("a","b")
val assembler = new VectorAssembler()
  .setInputCols(colSelected.toArray)
  .setOutputCol("features")
val output = assembler.transform(df).select("features").rdd
Up to here it is OK. Now output is an RDD[spark.sql.Row]. I am unable to transform this into an RDD[Array[Double]]. Is there any way to do it?
I have tried something like the following but with no success:
output.map { case Row(a: Vector[Double]) => a.getAs[Array[Double]]("features")}
The correct solution (this assumes Spark 2.0+, in 1.x use o.a.s.mllib.linalg.Vector):
import org.apache.spark.ml.linalg.Vector
output.map(_.getAs[Vector]("features").toArray)
ml / mllib Vector created by VectorAssembler is not the same as scala.collection.Vector.
Row.getAs should be used with expected type. It doesn't perform any type conversions and o.a.s.ml(lib).linalg.Vector is not an Array[Double].
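For readers working in PySpark rather than Scala, a rough equivalent (my own sketch, not part of the original answer, assuming df is a PySpark DataFrame with the same columns a, b, c) looks like this; DenseVector and SparseVector both expose toArray(), which returns a NumPy array:
from pyspark.ml.feature import VectorAssembler

# Sketch: assemble columns "a" and "b", then extract each row's vector as an array.
assembler = VectorAssembler(inputCols=["a", "b"], outputCol="features")
output = assembler.transform(df).select("features").rdd
arrays = output.map(lambda row: row["features"].toArray())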

How do I flatMap a row of arrays into multiple rows?

After parsing some JSON I have a one-column DataFrame of arrays
scala> val jj =sqlContext.jsonFile("/home/aahu/jj2.json")
res68: org.apache.spark.sql.DataFrame = [r: array<bigint>]
scala> jj.first()
res69: org.apache.spark.sql.Row = [List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)]
I'd like to explode each row out into several rows. How?
edit:
Original json file:
{"r": [0,1,2,3,4,5,6,7,8,9]}
{"r": [0,1,2,3,4,5,6,7,8,9]}
I want an RDD or a DataFrame with 20 rows.
I can't simply use flatMap here; I'm not sure what the appropriate command in Spark is:
scala> jj.flatMap(r => r)
<console>:22: error: type mismatch;
found : org.apache.spark.sql.Row
required: TraversableOnce[?]
jj.flatMap(r => r)
You can use DataFrame.explode to achieve what you desire. Below is what I tried in spark-shell with your sample json data.
import scala.collection.mutable.ArrayBuffer
val jj1 = jj.explode("r", "r1") {list : ArrayBuffer[Long] => list.toList }
val jj2 = jj1.select($"r1")
jj2.collect
You can refer to the API documentation to learn more about DataFrame.explode.
I've tested this with Spark 1.3.1
Or you can use Row.getAs function:
import scala.collection.mutable.ArrayBuffer
val elementsRdd = jj.select(jj("r")).map(t=>t.getAs[ArrayBuffer[Long]](0)).flatMap(x=>x)
elementsRdd.count()
>>>Long = 20
elementsRdd.take(5)
>>>Array[Long] = Array(0, 1, 2, 3, 4)
In Spark 1.3+ you can use explode function directly on the column of interest:
import org.apache.spark.sql.functions.explode
jj.select(explode($"r"))
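For completeness, the PySpark equivalent of that last answer (my own sketch, assuming jj is loaded as a PySpark DataFrame from the same JSON file) is essentially identical:
from pyspark.sql.functions import explode

# Sketch: one output row per element of the "r" array.
jj.select(explode(jj["r"]).alias("r"))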
