Spark MLlib Association Rules confidence is greater than 1.0 - apache-spark

I was using Spark 2.0.2 to extract some association rules from some data, but when I looked at the result I found some strange rules, such as the following:
[MUJI, ROEM, 西单科技广场] => [Bauhaus], 2.0
The "2.0" printed is the confidence of the rule. Isn't confidence the conditional probability of the consequent given the antecedent, and therefore at most 1.0?

KEY WORD: transactions != freqItemsets
SOLUTION: use spark.mllib.FPGrowth instead; it accepts an RDD of transactions and calculates the frequent itemsets automatically.
Hello, I found it. The reason for this is that my input FreqItemset data (freqItemsets) was wrong. Let's go into detail. I simply used three original transactions, ("a"), ("a","b","c"), ("a","b","d"); each of them has a frequency of 1.
At the beginning I thought Spark would automatically calculate the sub-itemset frequencies, and that the only thing I needed to do was to create freqItemsets the way the official example shows:
val freqItemsets = sc.parallelize(Seq(
  new FreqItemset(Array("a"), 1),
  new FreqItemset(Array("a", "b", "d"), 1),
  new FreqItemset(Array("a", "b", "c"), 1)
))
Here is where the mistake comes from: AssociationRules takes frequent itemsets (FreqItemset) as its input, not the transactions themselves, so I had confused those two definitions.
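For reference, AssociationRules computes the confidence of a rule X => Y directly from the supplied frequencies as confidence(X => Y) = freq(X ∪ Y) / freq(X). That ratio only stays at or below 1.0 if every subset's frequency is at least as large as that of the itemsets containing it; with an under-counted antecedent (say a recorded freq(X) = 1 while freq(X ∪ Y) = 2, numbers which are only illustrative here), the confidence comes out as 2.0.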
According to the three transactions, the freqItemsets should be:
new FreqItemset(Array("a"), 3),//because "a" appears three times in three transactions
new FreqItemset(Array("b"), 2),//"b" appears two times
new FreqItemset(Array("c"), 1),
new FreqItemset(Array("d"), 1),
new FreqItemset(Array("a","b"), 2),// "a" and "b" totally appears two times
new FreqItemset(Array("a","c"), 1),
new FreqItemset(Array("a","d"), 1),
new FreqItemset(Array("b","d"), 1),
new FreqItemset(Array("b","c"), 1)
new FreqItemset(Array("a","b","d"), 1),
new FreqItemset(Array("a", "b","c"), 1)
You can do this counting yourself with the following code (Json below is Play JSON's play.api.libs.json.Json, used only to turn each sorted itemset into a stable string key):
val transactions = sc.parallelize(Seq(
  Array("a"),
  Array("a", "b", "c"),
  Array("a", "b", "d")
))

val freqItemsets = transactions
  .map(arr => {
    (for (i <- 1 to arr.length) yield {
      arr.combinations(i).toArray    // all sub-itemsets of each transaction
    })
      .toArray
      .flatten
  })
  .flatMap(l => l)
  .map(a => (Json.toJson(a.sorted).toString(), 1)) // serialize each sorted itemset so it can be used as a key
  .reduceByKey(_ + _)                              // count how often each itemset occurs
  .map(m => new FreqItemset(Json.parse(m._1).as[Array[String]], m._2.toLong))
// then use freqItemsets like the example code
val ar = new AssociationRules()
  .setMinConfidence(0.8)
val results = ar.run(freqItemsets)
// ...
Alternatively, we can simply use FPGrowth instead of AssociationRules; it accepts an RDD of transactions.
val fpg = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(10)
val model = fpg.run(transactions) // transactions is defined in the previous code
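If you still want the rules themselves, the fitted FPGrowthModel can produce them directly with a minimum confidence (the 0.8 below just mirrors the earlier AssociationRules example):
model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}
model.generateAssociationRules(0.8).collect().foreach { rule =>
  println(rule.antecedent.mkString("[", ",", "]") + " => " +
    rule.consequent.mkString("[", ",", "]") + ", " + rule.confidence)
}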
That's all.

Related

Cross-Validation in apache-spark. How to create Parameter Grid?

I am trying to set up a ParamGrid to use with cross-validation later, but I could not find any explanation of the input arguments.
After creating a pipeline I am trying to create a parameter grid, but since I do not understand which entries are expected, I keep getting errors.
// creating my pipeline with indexer, oneHotEncoder, creating the feature vector and applying linear regression on it
val IndexedList = StringList.flatMap { name =>
  val indexer = new StringIndexer().setInputCol(name).setOutputCol(name + "Index")
  val encoder = new OneHotEncoderEstimator()
    .setInputCols(Array(name + "Index"))
    .setOutputCols(Array(name + "vec"))
  Array(indexer, encoder)
}
val features = new VectorAssembler()
  .setInputCols(Array("Modellvec", "KM", "Hubraum", "Fuelvec", "Farbevec", "Typevec", "F1", "F2", "F3", "F4", "F5", "F6", "F7", "F8"))
  .setOutputCol("Features2")
val linReg = new LinearRegression() //.setFeaturesCol(features2.getOutputCol).setLabelCol("Preis")
// creates the Array of stages
val IndexedList3 = (IndexedList :+ features :+ linReg).toArray[PipelineStage]
val pipeline2 = new Pipeline()
// This grid should be created in order to apply cross-validation
val pipeline_grid = new ParamGridBuilder()
  .baseOn(pipeline2.stages -> IndexedList3)
  .addGrid(linReg.regParam, Array(10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75))
  .build()
The first part works just fine when I run it separately.
The problem is that I do not understand what the Array in addGrid should look like (or how I should choose the values), and why it is a problem that linReg.regParam is of type DoubleParam, since addGrid IS defined for this type.
In most examples I have seen, this Array appears out of nowhere. So could someone explain to me where it comes from?
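One note that may help: the DoubleParam-specific overload of addGrid expects an Array[Double], while Array(10, 15, ...) is inferred as Array[Int] and Scala will not convert it for you, which is most likely why that overload does not apply. A minimal sketch of a grid that type-checks follows; the concrete values are placeholders, not recommendations:
val paramGrid = new ParamGridBuilder()
  .addGrid(linReg.regParam, Array(0.01, 0.1, 1.0, 10.0))  // Array[Double], as addGrid expects
  .addGrid(linReg.elasticNetParam, Array(0.0, 0.5, 1.0))  // optional second hyperparameter
  .build()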

Check Spark Dataframe row has ANY column meeting a condition and stop when first such column found

The following code can be used to filter rows that contain a value of 1. Imagine there are a lot of columns.
import org.apache.spark.sql.functions.{col, when}

val df = sc.parallelize(Seq(
  ("r1", 1, 1),
  ("r2", 6, 4),
  ("r3", 4, 1),
  ("r4", 1, 2)
)).toDF("ID", "a", "b")

val ones = df.schema.map(c => c.name).drop(1).map(x => when(col(x) === 1, 1).otherwise(0)).reduce(_ + _)
df.withColumn("ones", ones).where($"ones" === 0).show
The downside here is that it should ideally stop when the first such condition is met, i.e. at the first column found. OK, we all know that.
But I cannot find an elegant method to achieve this without presumably using a UDF or very specific logic; the map will process all columns.
Can a fold(Left) therefore be used that terminates when the first occurrence is found? Or some other approach? Maybe I am overlooking something.
My first idea was to use logical expressions and hope for short-circuiting, but it seems Spark does not do this:
df
.withColumn("ones", df.columns.tail.map(x => when(col(x) === 1, true)
.otherwise(false)).reduceLeft(_ or _))
.where(!$"ones")
.show()
But I'm not sure whether Spark supports short-circuiting here; I think it does not (https://issues.apache.org/jira/browse/SPARK-18712).
So alternatively you can apply a custom function to your rows, using the lazy exists on Scala's Seq:
df
.map{r => (r.getString(0),r.toSeq.tail.exists(c => c.asInstanceOf[Int]==1))}
.toDF("ID","ones")
.show()
This approach is similar to a UDF, so I'm not sure if that's what you will accept.

spark spelling correction via udf

I need to correct some spellings using spark.
Unfortunately a naive approach like
val misspellings3 = misspellings1
.withColumn("A", when('A === "error1", "replacement1").otherwise('A))
.withColumn("A", when('A === "error1", "replacement1").otherwise('A))
.withColumn("B", when(('B === "conditionC") and ('D === condition3), "replacementC").otherwise('B))
does not work with Spark; see "How to add new columns based on conditions (without facing JaninoRuntimeException or OutOfMemoryError)?".
The simple cases (the first 2 examples) can nicely be handled via
val spellingMistakes = Map(
  "error1" -> "fix1"
)

val spellingNameCorrection: (String => String) = (t: String) => {
  spellingMistakes.get(t) match {
    case Some(tt) => tt // correct spelling
    case None     => t  // keep original
  }
}

val spellingUDF = udf(spellingNameCorrection)

val misspellings1 = hiddenSeasonalities
  .withColumn("A", spellingUDF('A))
But I am unsure how to handle the more complex / chained conditional replacements in a UDF in a nice and generalizable manner.
If it is only a rather small list of spellings (< 50), would you suggest hard-coding them within a UDF?
You can make the UDF receive more than one column:
val spellingCorrection2 = udf((x: String, y: String) => if (x == "conditionC" && y == "conditionD") "replacementC" else x)
val misspellings3 = misspellings1.withColumn("B", spellingCorrection2($"B", $"C"))
To make this more generalized you can use a map from a tuple of the two conditions to a string, the same as you did for the first case.
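A minimal sketch of that idea (the map contents here are made up for illustration):
import org.apache.spark.sql.functions.udf

val pairCorrections: Map[(String, String), String] = Map(
  ("conditionC", "conditionD") -> "replacementC"
)

val spellingCorrection2 = udf { (x: String, y: String) =>
  pairCorrections.getOrElse((x, y), x)  // fall back to the original value when no rule matches
}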
If you want to generalize it even more, you can use dataset mapping. Basically, create a case class with the relevant columns, then use as to convert the dataframe to a dataset of that case class. Then use the dataset's map and, inside it, pattern match on the input data to generate the relevant corrections, and convert back to a dataframe.
This should be easier to write but would have a performance cost.
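A rough sketch of that dataset route, assuming a SparkSession named spark and that the relevant columns are A, B and C (the case class, column names and matched values are illustrative only; define the case class at top level so Spark can derive an encoder for it):
case class Record(A: String, B: String, C: String)

import spark.implicits._

val corrected = misspellings1
  .select("A", "B", "C")
  .as[Record]
  .map {
    case Record("error1", b, c)                => Record("replacement1", b, c)            // simple single-column fix
    case Record(a, "conditionC", "conditionD") => Record(a, "replacementC", "conditionD") // chained condition
    case other                                 => other                                   // leave everything else untouched
  }
  .toDF()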
For now I will go with the following which seems to work just fine and is more understandable: https://gist.github.com/rchukh/84ac39310b384abedb89c299b24b9306
Suppose spellingMap is the map containing correct spellings and df is the dataframe.
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, struct, udf}

val df: DataFrame = ??? // your dataframe
val spellingMap = Map.empty[String, String] // fill it up yourself
val columnsWithSpellingMistakes = List("abc", "def")
Write a UDF like this:
def spellingCorrectionUDF(spellingMap: Map[String, String]) =
  udf[String, Row]((value: Row) => {
    val cellValue = value.getString(0)
    if (spellingMap.contains(cellValue)) spellingMap(cellValue)
    else cellValue
  })
And finally, you can call it like this (the UDF takes a Row, so each affected column is wrapped in struct):
val newColumns = df.columns.map {
  case columnName =>
    if (columnsWithSpellingMistakes.contains(columnName))
      spellingCorrectionUDF(spellingMap)(struct(col(columnName))).as(columnName)
    else
      col(columnName)
}
df.select(newColumns: _*)

Save Spark org.apache.spark.mllib.linalg.Matrix to a file

The result of a correlation in Spark MLlib is of type org.apache.spark.mllib.linalg.Matrix (see http://spark.apache.org/docs/1.2.1/mllib-statistics.html#correlations).
val data: RDD[Vector] = ...
val correlMatrix: Matrix = Statistics.corr(data, "pearson")
I would like to save the result into a file. How can I do this?
Here is a simple and effective approach to save the Matrix to HDFS and specify the separator.
(The transpose is used since .toArray is in column-major order.)
val localMatrix: List[Array[Double]] = correlMatrix
.transpose // Transpose since .toArray is column major
.toArray
.grouped(correlMatrix.numCols)
.toList
val lines: List[String] = localMatrix
.map(line => line.mkString(" "))
sc.parallelize(lines)
.repartition(1)
.saveAsTextFile("hdfs:///home/user/spark/correlMatrix.txt")
As Matrix is Serializable, you can write it using normal Scala.
You can find an example here.
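The linked example is not preserved here, but a minimal sketch of that plain-Scala route might look like this: bring the matrix to the driver and write it with standard I/O (correlMatrix is the matrix from the question; the output path is just an example).
import java.io.PrintWriter

val pw = new PrintWriter("correlMatrix.txt")
try {
  correlMatrix
    .transpose                      // .toArray is column major
    .toArray
    .grouped(correlMatrix.numCols)  // one group per row
    .foreach(row => pw.println(row.mkString(" ")))
} finally {
  pw.close()
}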
The answer by Dylan Hogg was great; to enhance it slightly, add a line index. (In my use case, once I created the file and downloaded it, it was not sorted, due to the nature of the parallel processing, etc.)
ref: https://www.safaribooksonline.com/library/view/scala-cookbook/9781449340292/ch10s12.html
Substitute with this and it will put a sequence number on each line (starting with 0), making it easier to sort when you view it:
val lines: List[String] = localMatrix
.map(line => line.mkString(" "))
.zipWithIndex.map { case(line, count) => s"$count $line" }
Thank you for your suggestion. I came up with this solution. Thanks to Ignacio for his suggestions.
val vtsd = sd.map(x => Vectors.dense(x.toArray))
val corrMat = Statistics.corr(vtsd)
val arrayCor = corrMat.toArray.toList
val colLen = columnHeader.size
val toArr2 = sc.parallelize(arrayCor).zipWithIndex().map(
x => {
if ((x._2 + 1) % colLen == 0) {
(x._2, arrayCor.slice(x._2.toInt + 1 - colLen, x._2.toInt + 1).mkString(";"))
} else {
(x._2, "")
}
}).filter(_._2.nonEmpty).sortBy(x => x._1, true, 1).map(x => x._2)
toArr2.coalesce(1, true).saveAsTextFile("/home/user/spark/cor_" + System.currentTimeMillis())

Spark - convert string IDs to unique integer IDs

I have a dataset which looks like this, where each user and product ID is a string:
userA, productX
userA, productX
userB, productY
with ~2.8 million products and 300 million users; about 2.1 billion user-product associations.
My end goal is to run Spark collaborative filtering (ALS) on this dataset. Since it takes int keys for users and products, my first step is to assign a unique int to each user and product, and transform the dataset above so that users and products are represented by ints.
Here's what I've tried so far:
val rawInputData = sc.textFile(params.inputPath)
.filter { line => !(line contains "\\N") }
.map { line =>
val parts = line.split("\t")
(parts(0), parts(1)) // user, product
}
// find all unique users and assign them IDs
val idx1map = rawInputData.map(_._1).distinct().zipWithUniqueId().cache()
// find all unique products and assign IDs
val idx2map = rawInputData.map(_._2).distinct().zipWithUniqueId().cache()
idx1map.map{ case (id, idx) => id + "\t" + idx.toString
}.saveAsTextFile(params.idx1Out)
idx2map.map{ case (id, idx) => id + "\t" + idx.toString
}.saveAsTextFile(params.idx2Out)
// join with user ID map:
// convert from (userStr, productStr) to (productStr, userIntId)
val rev = rawInputData.cogroup(idx1map).flatMap{
case (id1, (id2s, idx1s)) =>
val idx1 = idx1s.head
id2s.map { (_, idx1)
}
}
// join with product ID map:
// convert from (productStr, userIntId) to (userIntId, productIntId)
val converted = rev.cogroup(idx2map).flatMap{
case (id2, (idx1s, idx2s)) =>
val idx2 = idx2s.head
idx1s.map{ (_, idx2)
}
}
// save output
val convertedInts = converted.map{
case (a,b) => a.toInt.toString + "\t" + b.toInt.toString
}
convertedInts.saveAsTextFile(params.outputPath)
When I try to run this on my cluster (40 executors with 5 GB RAM each), it's able to produce the idx1map and idx2map files fine, but it fails with out of memory errors and fetch failures at the first flatMap after cogroup. I haven't done much with Spark before so I'm wondering if there is a better way to accomplish this; I don't have a good idea of what steps in this job would be expensive. Certainly cogroup would require shuffling the whole data set across the network; but what does something like this mean?
FetchFailed(BlockManagerId(25, ip-***.ec2.internal, 48690), shuffleId=2, mapId=87, reduceId=25)
The reason I'm not just using a hashing function is that I'd eventually like to run this on a much larger dataset (on the order of 1 billion products, 1 billion users, 35 billion associations), and number of Int key collisions would become quite large. Is running ALS on a dataset of that scale even close to feasible?
It looks like you are essentially collecting all lists of users, just to split them up again. Try just using join instead of cogroup, which seems to me to do more of what you want. For example:
import org.apache.spark.SparkContext._
// Create some fake data
val data = sc.parallelize(Seq(("userA", "productA"),("userA", "productB"),("userB", "productB")))
val userId = sc.parallelize(Seq(("userA",1),("userB",2)))
val productId = sc.parallelize(Seq(("productA",1),("productB",2)))
// Replace userName with ID's
val userReplaced = data.join(userId).map{case (_,(prod,user)) => (prod,user)}
// Replace product names with ID's
val bothReplaced = userReplaced.join(productId).map{case (_,(user,prod)) => (user,prod)}
// Check results:
bothReplaced.collect() // Array((1,1), (1,2), (2,2))
Please drop a comment on how well it performs.
(I have no idea what FetchFailed(...) means)
My platform: CDH 5.7, Spark 1.6.0 (standalone).
My test data size: 31,815,167 records in total; 31,562,704 distinct user strings; 4,140,276 distinct product strings.
First idea:
My first idea was to use the collectAsMap action and then use that map to change the user/product strings to ints. Even with driver memory up to 12 GB, I got OOM or GC-overhead exceptions (the exception is bounded by the driver memory).
This idea only works on a small data size; with bigger data you need a bigger driver.
Second idea:
The second idea is to use the join method, as Tobber proposed. Here are some test results:
Job setup:
driver: 2 GB, 2 CPUs;
executors: (8 GB, 4 CPUs) * 7;
I followed these steps (a sketch of them is shown after this list):
1) find the unique user strings and zipWithIndex them;
2) join them back onto the original data;
3) save the encoded data.
The job took about 10 minutes to finish.
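For reference, a rough sketch of those three steps (the variable names and output path are illustrative, not the poster's actual code; rawInputData is assumed to be an RDD[(String, String)] of (user, product) pairs):
// step 1: assign a Long id to every distinct user and product string
val userIds    = rawInputData.map(_._1).distinct().zipWithIndex()   // (userString, userId)
val productIds = rawInputData.map(_._2).distinct().zipWithIndex()   // (productString, productId)

// step 2: join the ids back onto the original pairs
val encoded = rawInputData
  .join(userIds)                                   // (userString, (productString, userId))
  .map { case (_, (prod, uid)) => (prod, uid) }
  .join(productIds)                                // (productString, (userId, productId))
  .map { case (_, (uid, pid)) => s"$uid\t$pid" }

// step 3: save the encoded data
encoded.saveAsTextFile("/path/to/encoded_output")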
