How do I flatMap a row of arrays into multiple rows? - apache-spark

After parsing some jsons I have a one-column DataFrame of arrays
scala> val jj = sqlContext.jsonFile("/home/aahu/jj2.json")
res68: org.apache.spark.sql.DataFrame = [r: array<bigint>]
scala> jj.first()
res69: org.apache.spark.sql.Row = [List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)]
I'd like to explode each row out into several rows. How?
edit:
Original json file:
{"r": [0,1,2,3,4,5,6,7,8,9]}
{"r": [0,1,2,3,4,5,6,7,8,9]}
I want an RDD or a DataFrame with 20 rows.
I can't simply use flatMap here, and I'm not sure what the appropriate Spark operation is:
scala> jj.flatMap(r => r)
<console>:22: error: type mismatch;
found : org.apache.spark.sql.Row
required: TraversableOnce[?]
jj.flatMap(r => r)

You can use DataFrame.explode to achieve what you desire. Below is what I tried in spark-shell with your sample json data.
import scala.collection.mutable.ArrayBuffer
val jj1 = jj.explode("r", "r1") { list: ArrayBuffer[Long] => list.toList }
val jj2 = jj1.select($"r1")
jj2.collect
You can refer to the API documentation of DataFrame.explode for more details.

I've tested this with Spark 1.3.1.
Alternatively, you can use the Row.getAs function:
import scala.collection.mutable.ArrayBuffer
val elementsRdd = jj.select(jj("r")).map(t=>t.getAs[ArrayBuffer[Long]](0)).flatMap(x=>x)
elementsRdd.count()
>>>Long = 20
elementsRdd.take(5)
>>>Array[Long] = Array(0, 1, 2, 3, 4)

In Spark 1.3+ you can use the explode function directly on the column of interest:
import org.apache.spark.sql.functions.explode
jj.select(explode($"r"))
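Given the two sample rows from the question, this should produce a single-column DataFrame (the generated column is named col by default) with 20 rows; a quick check:
jj.select(explode($"r")).count()
// Long = 20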

Related

Regarding the usage of rangeBetween in the Window.partitionBy function

I run the following script:
from pyspark.sql import Window
from pyspark.sql import functions as func
from pyspark.sql import SQLContext
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
tup = [(1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b")]
df = sqlContext.createDataFrame(tup, ["id", "category"])
df.show()
Then I apply the following window partition, and the result is shown below. I am confused about how this result was generated using rangeBetween. For instance, why is the fourth row of the sum column 4? How does rangeBetween(Window.currentRow, 1) work to produce this value of 4? Moreover, according to the Spark docs, Window.currentRow is defined as 0, so why doesn't the code use 0 instead?
window = Window.partitionBy("category").orderBy("id").rangeBetween(Window.currentRow, 1)
df.withColumn("sum", func.sum("id").over(window)).show()
Window.currentRow and 0 should be equivalent; using one or the other is just a matter of preference. As for why you're getting 4: the window spans the id values from the current row's value up to that value plus one, i.e. 1 (current row) and 2 (plus one). The three rows in that category where id is 1 or 2 are included in the window, so the sum gives 1 + 1 + 2 = 4.
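Working the frame out for every row makes this concrete (a sketch; the row order printed by show() may differ, but within each category the sum covers all rows whose id lies in [id, id + 1]):
+---+--------+---+
| id|category|sum|
+---+--------+---+
|  1|       b|  3|
|  2|       b|  5|
|  3|       b|  3|
|  1|       a|  4|
|  1|       a|  4|
|  2|       a|  2|
+---+--------+---+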

Pyspark: Unable to turn RDD into DataFrame due to data type str instead of StringType

I'm doing some complex operations in PySpark where the last operation is a flatMap that yields an object of type pyspark.rdd.PipelinedRDD whose content is simply a list of strings:
print(output_data.take(8))
> ['a', 'abc', 'a', 'aefgtr', 'bcde', 'bc', 'bhdsjfk', 'b']
I'm starting my Spark-Session like this (local session for testing):
spark = SparkSession.builder.appName("my_app")\
.config('spark.sql.shuffle.partitions', '2').master("local").getOrCreate()
My input data looks like this:
input_data = (('a', ('abc', [[('abc', 23)], 23, False, 3])),
('a', ('abcde', [[('abcde', 17)], 17, False, 5])),
('a', ('a', [[('a', 66)], 66, False, 1])),
('a', ('aefgtr', [[('aefgtr', 65)], 65, False, 6])),
('b', ('bc', [[('bc', 25)], 25, False, 2])),
('b', ('bcde', [[('bcde', 76)], 76, False, 4])),
('b', ('b', [[('b', 13)], 13, False, 1])),
('b', ('bhdsjfk', [[('bhdsjfk', 36)], 36, False, 7])))
input_data = sc.parallelize(input_data)
I want to turn that output RDD into a DataFrame with one column like this:
schema = StructType([StructField("term", StringType())])
df = spark.createDataFrame(output_data, schema=schema)
This doesn't work, I'm getting this error:
TypeError: StructType can not accept object 'a' in type <class 'str'>
So I tried it without schema and got this error:
TypeError: Can not infer schema for type: <class 'str'>
EDIT: The same error happens when trying toDF().
So for some reason I have a pyspark.rdd.PipelinedRDD whose elements are not StringType but standard Python str.
I'm relatively new to PySpark, so can someone enlighten me on why this might be happening?
I'm surprised PySpark isn't able to implicitly cast str to StringType.
I can't post the entire code; suffice it to say that I'm doing some complex work with strings, including string comparison and for-loops. I'm not explicitly typecasting anything, though.
One solution would be to convert your RDD of String into an RDD of Row as follows:
from pyspark.sql import Row
df = spark.createDataFrame(output_data.map(lambda x: Row(x)), schema=schema)
# or with a simple list of names as a schema
df = spark.createDataFrame(output_data.map(lambda x: Row(x)), schema=['term'])
# or even use `toDF`:
df = output_data.map(lambda x: Row(x)).toDF(['term'])
# or another variant
df = output_data.map(lambda x: Row(term=x)).toDF()
Interestingly, as you mention, specifying a schema for an RDD of a raw type like string does not work. Yet, if we only specify the type, it works, but you cannot specify the column name. Another approach would thus be to do just that and rename the column called value, like this:
from pyspark.sql import functions as F
df = spark.createDataFrame(output_data, StringType())\
.select(F.col('value').alias('term'))
# or similarly
df = spark.createDataFrame(output_data, "string")\
.select(F.col('value').alias('term'))

PySpark not null and not nan values function for all types of columns

I need to build a method that receives a pyspark.sql.Column 'c' and returns a new pyspark.sql.Column containing True/False depending on whether the values in the column are null/NaN.
PySpark has the column method c.isNotNull(), which handles the not-null case. It also has pyspark.sql.functions.isnan, which receives a pyspark.sql.Column and handles NaNs (but does not work with datetime/bool columns).
I'm trying to build a function that looks like this:
from pyspark.sql import functions as F
def notnull(c):
    return c.isNotNull() & ~F.isnan(c)
I then want to use that function with any column type in my DataFrame to determine whether that column contains not-null/not-NaN values. But this fails when the provided column is of bool or datetime type:
import datetime
import numpy as np
import pandas as pd
from pyspark import SparkConf
from pyspark.sql import SparkSession
# Building SparkSession 'spark'
conf = (SparkConf().setAppName("example")
.setMaster("local[*]")
.set("spark.sql.execution.arrow.enabled", "true"))
spark = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
# Data input and initializing pd_df
data = {
'string_col': ['1', '1', '1', None],
'bool_col': [True, True, True, False],
'datetime_col': [
datetime.datetime(2018, 12, 9),
datetime.datetime(2018, 12, 9),
datetime.datetime(2018, 12, 9),
pd.NaT],
'float_col': [1.0, 1.0, 1.0, np.nan]
}
pd_df = pd.DataFrame(data)
# Creating spark_df from pd_df
spark_df = spark.createDataFrame(pd_df)
# This should return a new dataframe with the column 'notnulls' added
# Note: This works fine with 'float_col' and 'string_col' but does not
# work with 'bool_col' or 'datetime_col'
spark_df.withColumn('notnulls', notnull(spark_df['datetime_col'])).collect()
Running this snippet (using 'datetime_col') will throw the following exception:
pyspark.sql.utils.AnalysisException: "cannot resolve 'isnan(`datetime_col`)'
due to data type mismatch: argument 1 requires (double or float) type, however,
'`datetime_col`' is of timestamp type.;;\n'Project [category#217,
float_col#218, string_col#219, bool_col#220, CASE WHEN isnan(datetime_col#221)
THEN NOT isnan(datetime_col#221) ELSE isnotnull(datetime_col#221) END AS
datetime_col#231]\n+- LogicalRDD [category#217, float_col#218, string_col#219,
bool_col#220, datetime_col#221], false\n"
I understand this is because the isnan function cannot be applied to 'datetime_col', since it's not of float/double type. Since 'c' is a pyspark.sql.Column object, I can't access its dtype to behave differently based on the column type. I want to avoid using a pandas_udf to solve this issue, but I haven't been able to find another way to do it.
I'm using the following dependencies:
numpy==1.19.1
pandas==1.0.4
pyarrow==1.0.0
pyspark==2.4.5
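One possible workaround (a minimal sketch, assuming the DataFrame itself can be passed in so the column's data type can be looked up in its schema; notnull_col is a hypothetical helper name) is to apply isnan only to float/double columns:
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, FloatType

def notnull_col(df, col_name):
    """Return a Column that is True where df[col_name] is neither null nor NaN."""
    c = df[col_name]
    if isinstance(df.schema[col_name].dataType, (DoubleType, FloatType)):
        # NaN can only occur in float/double columns
        return c.isNotNull() & ~F.isnan(c)
    # For bool, datetime, string, etc. a plain null check is sufficient
    return c.isNotNull()

spark_df.withColumn('notnulls', notnull_col(spark_df, 'datetime_col')).collect()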

Load Data for Machine Learning in Spark [duplicate]

I want to produce data in libsvm format, so I shaped my DataFrame into the desired form, but I do not know how to convert it to libsvm format. The desired libsvm format is user item:rating. This is my current situation:
import java.io.File

val ratings = sc.textFile(new File("/user/ubuntu/kang/0829/rawRatings.csv").toString).map { line =>
  val fields = line.split(",")
  (fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}
val user = ratings.map { case (user, product, rate) => (user, (product.toInt, rate.toDouble)) }
val usergroup = user.groupByKey
val data = usergroup.map { case (x, iter) => (x, iter.map(_._1).toArray, iter.map(_._2).toArray) }
val data_DF = data.toDF("user", "item", "rating")
I am using Spark 2.0.
The issue you are facing can be divided into the following:
Converting your ratings (I believe) into LabeledPoint data X.
Saving X in libsvm format.
1. Converting your ratings into LabeledPoint data X
Let's consider the following raw ratings:
val rawRatings: Seq[String] = Seq("0,1,1.0", "0,3,3.0", "1,1,1.0", "1,2,0.0", "1,3,3.0", "3,3,4.0", "10,3,4.5")
You can handle those raw ratings as a coordinate list matrix (COO).
Spark implements a distributed matrix backed by an RDD of its entries: CoordinateMatrix, where each entry is a tuple of (i: Long, j: Long, value: Double).
Note: A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse (which is usually the case for user/item ratings).
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD
val data: RDD[MatrixEntry] =
  sc.parallelize(rawRatings).map { line =>
    val fields = line.split(",")
    val i = fields(0).toLong
    val j = fields(1).toLong
    val value = fields(2).toDouble
    MatrixEntry(i, j, value)
  }
Now let's convert that RDD[MatrixEntry] to a CoordinateMatrix and extract the indexed rows:
val df = new CoordinateMatrix(data) // Convert the RDD to a CoordinateMatrix
  .toIndexedRowMatrix().rows        // Extract indexed rows
  .toDF("label", "features")        // Convert rows
2. Saving LabeledPoint data in libsvm format
Since Spark 2.0, you can do that using the DataFrameWriter. Let's create a small example with some dummy LabeledPoint data (you can also use the DataFrame we created earlier):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
val df = Seq(neg,pos).toDF("label","features")
Unfortunately we still can't use the DataFrameWriter directly: while most pipeline components support backward compatibility for loading, existing DataFrames and pipelines from Spark versions prior to 2.0 that contain vector or matrix columns may need to be migrated to the new spark.ml vector and matrix types.
Utilities for converting DataFrame columns from mllib.linalg to ml.linalg types (and vice versa) can be found in org.apache.spark.mllib.util.MLUtils. In our case we need to do the following (for both the dummy data and the DataFrame from step 1):
import org.apache.spark.mllib.util.MLUtils
// convert DataFrame columns
val convertedVecDF = MLUtils.convertVectorColumnsToML(df)
Now let's save the DataFrame:
convertedVecDF.write.format("libsvm").save("data/foo")
And we can check the file contents:
$ cat data/foo/part*
0.0 1:1.0 3:3.0
1.0 1:1.0 2:0.0 3:3.0
EDIT:
In the current version of Spark (2.1.0) there is no need to use the mllib package. You can simply save LabeledPoint data in libsvm format as below:
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature.LabeledPoint
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
val df = Seq(neg,pos).toDF("label","features")
df.write.format("libsvm").save("data/foo")
In order to convert an existing DataFrame to a typed Dataset, I suggest using the following case class:
case class LibSvmEntry(
  value: Double,
  features: org.apache.spark.ml.linalg.Vector)
Then you can use the map function to convert it to a LibSvmEntry like so:
df.map { r: Row => /* build a LibSvmEntry here */ }
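For instance, a minimal sketch, assuming the DataFrame has the columns label and features created above and that spark.implicits._ is in scope:
import org.apache.spark.sql.Row
import org.apache.spark.ml.linalg.Vector

// Build a typed Dataset[LibSvmEntry] from the untyped DataFrame
val libSvmDs = df.map { r: Row =>
  LibSvmEntry(r.getAs[Double]("label"), r.getAs[Vector]("features"))
}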
The features in libsvm format are a sparse vector; you can use pyspark.ml.linalg.SparseVector to solve the problem:
from pyspark.sql.functions import udf
from pyspark.ml.linalg import SparseVector, VectorUDT

a = SparseVector(4, [1, 3], [3.0, 4.0])

def sparsevecfuc(length, index, score):
    """
    args: length int, index array, score array
    """
    return SparseVector(length, index, score)

trans_sparse = udf(sparsevecfuc, VectorUDT())
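For example, assuming the DataFrame has columns named length, index and score (hypothetical names) holding the vector size, indices and values:
from pyspark.sql.functions import col
df_sparse = df.withColumn("features", trans_sparse(col("length"), col("index"), col("score")))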

Using VectorAssembler in Spark

I have the following DataFrame (it is assumed that it is already a DataFrame):
val df = sc.parallelize(Seq((1, 2, 10), (3, 4, 11), (5, 6, 12)))
.toDF("a", "b", "c")
and I want to combine some of the columns (not all) into one column and turn it into an RDD of Array[Double]. I am doing the following:
import org.apache.spark.ml.feature.VectorAssembler
val colSelected = List("a","b")
val assembler = new VectorAssembler()
.setInputCols(colSelected.toArray)
.setOutputCol("features")
val output = assembler.transform(df).select("features").rdd
Up to here it is OK. Now output is an RDD of Rows (RDD[org.apache.spark.sql.Row]). I am unable to transform this into an RDD[Array[Double]]. Is there any way?
I have tried something like the following but with no success:
output.map { case Row(a: Vector[Double]) => a.getAs[Array[Double]]("features")}
The correct solution (this assumes Spark 2.0+, in 1.x use o.a.s.mllib.linalg.Vector):
import org.apache.spark.ml.linalg.Vector
output.map(_.getAs[Vector]("features").toArray)
The ml / mllib Vector created by VectorAssembler is not the same as scala.collection.Vector.
Row.getAs should be used with the expected type. It doesn't perform any type conversions, and o.a.s.ml(lib).linalg.Vector is not an Array[Double].
