DataFrame: Convert array inside a column to RDD[Array[String]] - apache-spark

Given a dataframe:
+---+----------+
|key| value|
+---+----------+
|foo| bar|
|bar| one, two|
+---+----------+
I'd like to use the value column as input to FPGrowth, which expects an RDD[Array[String]]:
val transactions: RDD[Array[String]] = df.select("value").rdd.map(x => x.getList(0).toArray.map(_.toString))
import org.apache.spark.mllib.fpm.{FPGrowth, FPGrowthModel}
val fpg = new FPGrowth().setMinSupport(0.01)
val model = fpg.run(transactions)
I get this exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 141.0 failed 1 times, most recent failure: Lost task 7.0 in stage 141.0 (TID 2232, localhost): java.lang.ClassCastException: java.lang.String cannot be cast to scala.collection.Seq
Any suggestion is welcome!

Instead of
val transactions: RDD[Array[String]] = df.select("value").rdd.map(x => x.getList(0).toArray.map(_.toString))
try using
val transactions= df.select("value").rdd.map(_.toString.stripPrefix("[").stripSuffix("]").split(","))
It gives the desired output as expected, i.e. RDD[Array[String]]:
val transactions= df.select("value").rdd.map(_.toString.stripPrefix("[").stripSuffix("]").split(","))
transactions: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[10] at map at <console>:33
scala> transactions.take(2)
res21: Array[Array[String]] = Array(Array(bar), Array(one, two))
To remove the "[" and "]" ,one can use stripPrefix and stripSuffix function before split function.

Related

How to find a median value in pyspark

These are the values of my dataframe:
+-------+----------+
| ID| Date_Desc|
+-------+----------+
|8951354|2012-12-31|
|8951141|2012-12-31|
|8952745|2012-12-31|
|8952223|2012-12-31|
|8951608|2012-12-31|
|8950793|2012-12-31|
|8950760|2012-12-31|
|8951611|2012-12-31|
|8951802|2012-12-31|
|8950706|2012-12-31|
|8951585|2012-12-31|
|8951230|2012-12-31|
|8955530|2012-12-31|
|8950570|2012-12-31|
|8954231|2012-12-31|
|8950703|2012-12-31|
|8954418|2012-12-31|
|8951685|2012-12-31|
|8950586|2012-12-31|
|8951367|2012-12-31|
+-------+----------+
I tried to create a median value of a date column in pyspark:
df1 = df1.groupby('Date_Desc').agg(f.expr('percentile(ID, array(0.25))')[0].alias('%25'),
                                   f.expr('percentile(ID, array(0.50))')[0].alias('%50'),
                                   f.expr('percentile(ID, array(0.75))')[0].alias('%75'))
But I get this error:
Py4JJavaError: An error occurred while calling o198.showString. :
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 1 in stage 29.0 failed 1 times, most recent failure: Lost task
1.0 in stage 29.0 (TID 427, 5bddc801333f, executor driver): org.apache.spark.SparkUpgradeException: You may get a different result
due to the upgrading of Spark 3.0: Fail to parse '11/23/04 9:00' in
the new parser. You can set spark.sql.legacy.timeParserPolicy to
LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED
and treat it as an invalid datetime string.
With Spark ≥ 3.1.0:
from pyspark.sql.functions import percentile_approx
df1.groupBy("Date_Desc").agg(percentile_approx("ID", 0.5).alias("%50"))

How to have 100+ words of any size in like clause of spark sql?

I need to suppress a set of keywords coming from an API across all columns of my dataset. I currently have a like clause similar to the following.
Spark version: 2.1.0
where lower(c) not like '%gun%' and lower(c) not like '%torture%' and lower(c) not like '%tough-mudder%' and lower(c) not like '%health & beauty - first aid%' and lower(c) not like '%hippie-butler%'
I get the following error:
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage0" grows beyond 64 KB
To mitigate this I tried to break the problem into subproblems by applying 15 keywords at a time: I invoke a method with the current set of 15 keywords, and the result is passed back into the same method with the next 15 keywords, until all words are processed.
dataSet = getTransformedDataset(dataSet, word);
My query looks like this:
select uuid , c from cds where lower(c) not like '%gun%' and lower(c) not like '%kill%' and lower(c) not like '%murder%' ..
This works fine for smaller datasets, but for a larger dataset it takes more memory than we have configured, and the job fails:
Job aborted due to stage failure: Task 5 in stage 13.0 failed 4 times, most recent failure: Lost task 5.3 in stage 13.0 (TID 2093, executor 8): ExecutorLostFailure (executor 8 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 50.0 GB of 49 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I have added the code below. Any help will be very much appreciated.
private Dataset<Row> getTransformedRowDataset(final Dataset<Row> input, Dataset<Row> concatenatedColumnsDS,
                                              final List<String> regexToSuppress, final List<String> wordsToSuppress) {
    final String regexString = regexToSuppress.stream().collect(Collectors.joining("|", "(", ")"));
    final String likeString = wordsToSuppress
            .stream()
            .map(s -> " lower(c) not like '%" + s.toLowerCase() + "%' ")
            .collect(Collectors.joining(" and "));
    if (!likeString.isEmpty()) {
        concatenatedColumnsDS = concatenatedColumnsDS.where(likeString);
    }
    final Dataset<Row> joinedDs = input.join(concatenatedColumnsDS, "uuid");
    return "()".equals(regexString) || regexString.isEmpty() ? joinedDs.drop("c") :
            joinedDs.where(" c not rlike '" + regexString + "'").drop("c");
}
Try filter with Column expressions instead of building a long like chain:
package spark

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object filterWorld extends App {
  val spark = SparkSession.builder()
    .master("local")
    .appName("Mapper")
    .getOrCreate()

  import spark.implicits._

  case class Person(
    ID: Int,
    firstName: String,
    lastName: String,
    description: String,
    comment: String
  )

  val personDF = Seq(
    Person(1, "FN1", "LN1", "TEST", "scala"),
    Person(2, "FN2", "LN2", "develop", "spark"),
    Person(3, "FN3", "LN3", "test", "sql"),
    Person(4, "FN4", "LN4", "develop", "java"),
    Person(5, "FN5", "LN5", "test", "c#"),
    Person(6, "FN6", "LN6", "architect", "python"),
    Person(7, "FN7", "LN7", "test", "spark"),
    Person(8, "FN8", "LN8", "architect", "scala"),
    Person(9, "FN9", "LN9", "qa", "hql"),
    Person(10, "FN10", "LN10", "manager", "haskell")
  ).toDF()

  personDF.show(false)
  // +---+---------+--------+-----------+-------+
  // |ID |firstName|lastName|description|comment|
  // +---+---------+--------+-----------+-------+
  // |1  |FN1      |LN1     |TEST       |scala  |
  // |2  |FN2      |LN2     |develop    |spark  |
  // |3  |FN3      |LN3     |test       |sql    |
  // |4  |FN4      |LN4     |develop    |java   |
  // |5  |FN5      |LN5     |test       |c#     |
  // |6  |FN6      |LN6     |architect  |python |
  // |7  |FN7      |LN7     |test       |spark  |
  // |8  |FN8      |LN8     |architect  |scala  |
  // |9  |FN9      |LN9     |qa         |hql    |
  // |10 |FN10     |LN10    |manager    |haskell|
  // +---+---------+--------+-----------+-------+

  // Keep rows whose description does not contain "e" and whose comment does not contain "s".
  val fltr = !col("description").like("%e%") && !col("comment").like("%s%")
  val res = personDF.filter(fltr)
  res.show(false)
  // +---+---------+--------+-----------+-------+
  // |ID |firstName|lastName|description|comment|
  // +---+---------+--------+-----------+-------+
  // |9  |FN9      |LN9     |qa         |hql    |
  // +---+---------+--------+-----------+-------+
}
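The same filter approach can be generalized to the 100+ API-supplied keywords from the question by folding the keyword list into a single Column predicate instead of concatenating one long SQL string. A minimal sketch under the assumption that ds is the Dataset holding the concatenated column c (the literal keywords below are just placeholders taken from the question):
import org.apache.spark.sql.functions.{col, lower}
// Hypothetical keyword list; in the question it is supplied by an API.
val keywords = Seq("gun", "torture", "tough-mudder")
// One combined predicate: keep rows whose lower(c) contains none of the keywords.
val keepRow = keywords
  .map(k => !lower(col("c")).contains(k.toLowerCase))
  .reduce(_ && _)
val filtered = ds.filter(keepRow) // ds: the Dataset with the concatenated column c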

Spark error using a variable in map operation [duplicate]

This question already has answers here:
Resolving dependency problems in Apache Spark
(7 answers)
Closed 4 years ago.
I am trying to iterate over a DataFrame and apply a map operation over its rows.
import spark.implicits._
import org.apache.spark.sql.Row
case class SomeData(name:String, value: Int)
val input = Seq(SomeData("a",2), SomeData("b", 3)).toDF
val SOME_STRING = "some_string"
input.map(row =>
SOME_STRING
).show
The above code fails with the following exception:
ERROR TaskSetManager: Task 0 in stage 4.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 14, ip-xxxx, executor 4): java.lang.NoClassDefFoundError: L$iw;
at java.lang.Class.getDeclaredFields0(Native Method)
at java.lang.Class.privateGetDeclaredFields(Class.java:2583)
at java.lang.Class.getDeclaredField(Class.java:2068)
However, if the variable is replaced with string, the code works.
input.map(row =>
"some_string"
).show
+-----------+
| value|
+-----------+
|some_string|
|some_string|
+-----------+
Is there anything wrong with the above code? Is it possible to use variables and function calls inside a map operation?
This is expected: the variable is defined on the driver but used inside a worker (inside the closure passed to map), so the worker does not know the variable.
What you can do is define it inside the closure:
input.map(row => {
  val SOME_STRING = "some_string"
  SOME_STRING
}).show
You can also check broadcast variables: https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#broadcast-variables
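For reference, a minimal sketch of the broadcast-variable approach mentioned above, using the same input Dataset as in the question:
// Broadcast the driver-side value so each executor can read it via .value.
val someString = spark.sparkContext.broadcast("some_string")
input.map(row => someString.value).show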

Spark ClassNotFoundException in UDF

When I call a function directly it works, but when I call that function from a UDF it does not.
This is the full code.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.{col, udf}
import scala.collection.mutable.WrappedArray

val sparkConf = new SparkConf().setAppName("HiveFromSpark").set("spark.driver.allowMultipleContexts", "true")
val sc = new SparkContext(sparkConf)
val hive = new org.apache.spark.sql.hive.HiveContext(sc)

///////////// UDFS
def toDoubleArrayFun(vec: Any): Array[Double] = {
  vec.asInstanceOf[WrappedArray[Double]].toArray
}
def toDoubleArray = udf((vec: Any) => toDoubleArrayFun(vec))

//////////// PROCESS
var df = hive.sql("select vec from mst_wordvector_tapi_128dim where word='soccer'")
println("==== test get value then transform")
println(df.head().get(0))
println(toDoubleArrayFun(df.head().get(0)))
println("==== test transform by udf")
df.withColumn("word_v", toDoubleArray(col("vec"))).show(10)
Then this is the output:
sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext#6e9484ad
hive: org.apache.spark.sql.hive.HiveContext =
toDoubleArrayFun: (vec: Any)Array[Double]
toDoubleArray: org.apache.spark.sql.UserDefinedFunction
df: org.apache.spark.sql.DataFrame = [vec: array<double>]
==== test get value then transform
WrappedArray(-0.88675,, 0.0216657)
[D#4afcc447
==== test transform by udf
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5, xdad008.band.nhnsystem.com): java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$$iwC$$$$5ba2a895f25683dd48fe725fd825a71$$$$$$iwC$$anonfun$toDoubleArray$1
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
Full output here.
https://gist.github.com/jeesim2/efb52f12d6cd4c1b255fd0c917411370
As you can see "toDoubleArrayFun" function works well, but in udf it claims ClassNotFoundException.
I can not change the hive data structure, and need to convert vec to Array[Double] to make a Vector instance.
So what problem with code above?
Spark version is 1.6.1
Update 1
Hive table's 'vec' column type is "array<double>"
The code below also causes an error:
var df = hive.sql("select vec from mst_wordvector_tapi_128dim where word='hh'")
df.printSchema()
var word_vec = df.head().get(0)
println(word_vec)
println(Vectors.dense(word_vec))
Output:
df: org.apache.spark.sql.DataFrame = [vec: array<double>]
root
|-- vec: array (nullable = true)
| |-- element: double (containsNull = true)
==== test get value then transform
word_vec: Any = WrappedArray(-0.88675,...7)
<console>:288: error: overloaded method value dense with alternatives:
(values: Array[Double])org.apache.spark.mllib.linalg.Vector <and>
(firstValue: Double,otherValues:Double*)org.apache.spark.mllib.linalg.Vector
cannot be applied to (Any)
println(Vectors.dense(word_vec))
This means a Hive array<double> column cannot be cast to Array[Double] directly.
What I actually want is to calculate a distance (Double) from two array<double> columns.
How do I add a Vector column based on an array<double> column?
A typical method would be something like
Vectors.sqrt(Vectors.dense(Array<Double>, Array<Double>))
Since a udf function has to go through a serialization and deserialization process, an Any DataType will not work. You will have to define the exact DataType of the column you are passing to the udf function.
From the output in your question it seems that you have only one column in your dataframe, i.e. vec, which is of Array[Double] type:
df: org.apache.spark.sql.DataFrame = [vec: array<double>]
There is actually no need for that udf function, as your vec column is already of an array dataType, which is exactly what your udf does as well, i.e. cast the value to Array[Double].
Your other function call works,
println(toDoubleArrayFun(df.head().get(0)))
because there is no serialization and deserialization involved; it is just a plain Scala function call.
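If a udf is still wanted, here is a minimal sketch of what this answer suggests, declaring the concrete argument type instead of Any (an illustration, not code from the original answer):
import org.apache.spark.sql.functions.{col, udf}
// An array<double> column arrives in a udf as Seq[Double], so declare it that way.
val toDoubleArray = udf((vec: Seq[Double]) => vec.toArray)
df.withColumn("word_v", toDoubleArray(col("vec"))).show(10)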

Handle Null Values when using CustomSchema in apache spark

I am importing data based on a customSchema which I have defined in the following way:
import org.apache.spark.sql.types.{StructType, StructField, DoubleType, StringType}

val customSchema_train = StructType(Array(
  StructField("x53", DoubleType, true),
  StructField("x95", DoubleType, true),
  StructField("x88", DoubleType, true),
  StructField("x30", DoubleType, true),
  StructField("x42", DoubleType, true),
  StructField("x28", DoubleType, true)
))

val train_orig = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(customSchema_train)
  .option("nullValue", "null")
  .load("/....../train.csv")
  .cache
I know there are null values in my data, present as the literal string "null", and I have tried to handle that accordingly. The import happens without any error, but when I try to describe the data I get this error:
train_df.describe().show
SparkException: Job aborted due to stage failure: Task 0 in stage 46.0 failed 1 times, most recent failure: Lost task 0.0 in stage 46.0 (TID 56, localhost): java.text.ParseException: Unparseable number: "null"
How can I handle this error?
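The thread does not include an answer for this one. As a hedged workaround sketch (an assumption, not from the original post): drop the custom schema, read every column as a string, and cast to double afterwards, since Spark's default (non-ANSI) cast turns an unparseable string such as "null" into a real NULL instead of failing:
import org.apache.spark.sql.functions.col
// Read with no schema: every column comes in as a string.
val raw = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/....../train.csv") // path elided as in the question
// Cast each column to double; the string "null" becomes NULL.
val train_df = raw.columns.foldLeft(raw)((df, c) => df.withColumn(c, col(c).cast("double")))
train_df.describe().show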
