How to apply FP-Growth to dataset after groupBy? - apache-spark

I'd like to use FP-Growth from Spark MLlib in Spark 2.1.
My data has only two columns item_group and item.
I have tried the following but it does not work:
sc = SparkSession.builder.appName("Assoziationsanalyse").getOrCreate()
hiveCtx = SQLContext(sc)
input = hiveCtx.sql("""select * from bosch.input_view""").
groupBy("item_group").
agg(collect_list("item")).
alias("items").
rdd.
map(lambda x : x.items)
model = FPGrowth.train(input, minSupport=0.2, numPartitions=10)

I have solved the problem now using an other approch which I have found in a discussion here.
data=hiveCtx.sql("""select * from bosch.input_view""")
datardd=data.rdd.map(lambda x (x[0],x[1])).groupByKey().mapValues(list).values()
model = FPGrowth.train(datardd, minSupport=0.1, numPartitions=10)

Related

what is the mapper and reducer that are used in computeSVD() function?

i am new to Map reduce and i want to do some research to compute svd using mapreduce.
the code side : i have found computeSVD a pyspark function and it uses mapreduce as said in this discussion .
the theory side : what is the mapper and reducer that are used in computeSVD() function ?
my code
findspark.init('C:\spark\spark-3.0.3-bin-hadoop2.7')
conf=SparkConf()
conf.setMaster("local[*]")
conf.setAppName('firstapp')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
rows = np.loadtxt('data.txt', dtype=float) # data.txt is a (m rows x n cols) matrix m>n
rows = sc.parallelize(rows)
mat = RowMatrix(rows)
svd = mat.computeSVD(5, computeU=True)
i would highely appriciate any help.
You can see in this function the usage of mapPartitions and reduceByKey from the RDD object, which do something similar to MapReduce, but is not the same library as Hadoop Mapreduce

K means clustering ml library using apache spark

I am trying to implement k means clustering using apache spark ml version in 2.0.2. After finding the cluster center , facing the issue in how do identity the data belongs to which cluster point. Please help me.. Thanks in advance. Please find my code :
val tokenFrameprocess = TokenizerProcessor.process("value", "tokenized")
val stopwordRemover = StopWordsProcessor.process("tokenized", "stopwords")
val stemmingProcess = StemmingProcessor.process("value", "stemmed")
val HashingProcess = HashingTFProcessor.process("stopwords" ,"features")
val pipeline = new Pipeline().setStages(Array(tokenFrameprocess,stopwordRemover,stemmingProcess,HashingProcess))
val finalFrameProcess = pipeline.fit(df)
val finalFramedata = finalFrameProcess.transform(df);
val kmeans = new KMeans().setK(4).setSeed(1L)
val model = kmeans.fit(finalFramedata)
val WSSSE = model.computeCost(finalFramedata)
// Shows the result.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)
You have to set the feature column which will be used to predict on while instantiating the Kmeans like following example:
val kmeans = new KMeans().setK(4).setSeed(1L).setFeaturesCol("features").setPredictionCol("prediction")
Once you call the .fit() on the kmeans, It returns a Transformer. In case of your example, variable "model" is a transformer. You can call the .transform() To get the desired prediction on the given data. For example below lines will give you the dataframe with prediction column.
val model = kmeans.fit(finalFramedata)
val transformedDF = model.transform(finalFramedata)
transformedDF.show(false)
Prediction column denotes clusters points. If you give k value as 3 and the prediction column will have values like 0,1,2.

How to use L1 penalty in pyspark.ml.regression.LinearRegressionModel for features selection?

Firstly, I use spark 1.6.0. I want to use L1 penalty in pyspark.ml.regression.LinearRegressionModel for features selection.
But I can not get the detailed coefficients when calling the function:
lr = LogisticRegression(elasticNetParam=1.0, regParam=0.01,maxIter=100,fitIntercept=False,standardization=False)
model = lr.fit(df_one_hot_train)
print model.coefficients.toArray().astype(float).tolist()
I only get sparse list like:
[0,0,0,0,0,..,-0.0871650387514,..,]
While when I use sklearn.linear_model.LogisticRegression model, I can get the detailed list without zero value in coef_ list like:
[0.03098372361467529,-0.13709075166114365,-0.15069548597557908,-0.017968044053830862]
With the better performance in spark, I could finished my work faster. I just want to use L1 penalty for feature selection.
I think I should use more detailed values of coefficients for my feature selection work just as sklearn does, how can I solve my problem?
Below is a working code snip in Spark 2.1.
The key to extract values is :
stages(4).asInstanceOf[LinearRegressionModel]
Spark 1.6 may have something similar.
val holIndIndexer = new StringIndexer().setInputCol("holInd").setOutputCol("holIndIndexer")
val holIndEncoder = new OneHotEncoder().setInputCol("holIndIndexer").setOutputCol("holIndVec")
val time_intervaLEncoder = new OneHotEncoder().setInputCol("time_interval").setOutputCol("time_intervaLVec")
val assemblerL1 = (new VectorAssembler()
.setInputCols(Array("time_intervaLVec", "holIndVec", "length")).setOutputCol("features") )
val lrL1 = new LinearRegression().setFeaturesCol("features").setLabelCol("travel_time")
val pipelineL1 = new Pipeline().setStages(Array(holIndIndexer,holIndEncoder,time_intervaLEncoder,assemblerL1, lrL1))
val modelL1 = pipelineL1.fit(dfTimeMlFull)
val l1Coeff =modelL1.stages(4).asInstanceOf[LinearRegressionModel].coefficients
println(l1Coeff)

Get CSV to Spark dataframe

I'm using python on Spark and would like to get a csv into a dataframe.
The documentation for Spark SQL strangely does not provide explanations for CSV as a source.
I have found Spark-CSV, however I have issues with two parts of the documentation:
"This package can be added to Spark using the --jars command line option. For example, to include it when starting the spark shell: $ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3"
Do I really need to add this argument everytime I launch pyspark or spark-submit? It seems very inelegant. Isn't there a way to import it in python rather than redownloading it each time?
df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "cars.csv") Even if I do the above, this won't work. What does the "source" argument stand for in this line of code? How do I simply load a local file on linux, say "/Spark_Hadoop/spark-1.3.1-bin-cdh4/cars.csv"?
With more recent versions of Spark (as of, I believe, 1.4) this has become a lot easier. The expression sqlContext.read gives you a DataFrameReader instance, with a .csv() method:
df = sqlContext.read.csv("/path/to/your.csv")
Note that you can also indicate that the csv file has a header by adding the keyword argument header=True to the .csv() call. A handful of other options are available, and described in the link above.
from pyspark.sql.types import StringType
from pyspark import SQLContext
sqlContext = SQLContext(sc)
Employee_rdd = sc.textFile("\..\Employee.csv")
.map(lambda line: line.split(","))
Employee_df = Employee_rdd.toDF(['Employee_ID','Employee_name'])
Employee_df.show()
for Pyspark, assuming that the first row of the csv file contains a header
spark = SparkSession.builder.appName('chosenName').getOrCreate()
df=spark.read.csv('fileNameWithPath', mode="DROPMALFORMED",inferSchema=True, header = True)
Read the csv file in to a RDD and then generate a RowRDD from the original RDD.
Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
Apply the schema to the RDD of Rows via createDataFrame method provided by SQLContext.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0], p[1].strip()))
# The schema is encoded in a string.
schemaString = "name age"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
# Apply the schema to the RDD.
schemaPeople = spark.createDataFrame(people, schema)
source: SPARK PROGRAMMING GUIDE
If you do not mind the extra package dependency, you could use Pandas to parse the CSV file. It handles internal commas just fine.
Dependencies:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd
Read the whole file at once into a Spark DataFrame:
sc = SparkContext('local','example') # if using locally
sql_sc = SQLContext(sc)
pandas_df = pd.read_csv('file.csv') # assuming the file contains a header
# If no header:
# pandas_df = pd.read_csv('file.csv', names = ['column 1','column 2'])
s_df = sql_sc.createDataFrame(pandas_df)
Or, even more data-consciously, you can chunk the data into a Spark RDD then DF:
chunk_100k = pd.read_csv('file.csv', chunksize=100000)
for chunky in chunk_100k:
Spark_temp_rdd = sc.parallelize(chunky.values.tolist())
try:
Spark_full_rdd += Spark_temp_rdd
except NameError:
Spark_full_rdd = Spark_temp_rdd
del Spark_temp_rdd
Spark_DF = Spark_full_rdd.toDF(['column 1','column 2'])
Following Spark 2.0, it is recommended to use a Spark Session:
from pyspark.sql import SparkSession
from pyspark.sql import Row
# Create a SparkSession
spark = SparkSession \
.builder \
.appName("basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
def mapper(line):
fields = line.split(',')
return Row(ID=int(fields[0]), field1=str(fields[1].encode("utf-8")), field2=int(fields[2]), field3=int(fields[3]))
lines = spark.sparkContext.textFile("file.csv")
df = lines.map(mapper)
# Infer the schema, and register the DataFrame as a table.
schemaDf = spark.createDataFrame(df).cache()
schemaDf.createOrReplaceTempView("tablename")
I ran into similar problem. The solution is to add an environment variable named as "PYSPARK_SUBMIT_ARGS" and set its value to "--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell". This works with Spark's Python interactive shell.
Make sure you match the version of spark-csv with the version of Scala installed. With Scala 2.11, it is spark-csv_2.11 and with Scala 2.10 or 2.10.5 it is spark-csv_2.10.
Hope it works.
Based on the answer by Aravind, but much shorter, e.g. :
lines = sc.textFile("/path/to/file").map(lambda x: x.split(","))
df = lines.toDF(["year", "month", "day", "count"])
With the current implementation(spark 2.X) you dont need to add the packages argument, You can use the inbuilt csv implementation
Additionally as the accepted answer you dont need to create an rdd then enforce schema that has 1 potential problem
When you read the csv as then it will mark all the fields as string and when you enforce the schema with an integer column you will get exception.
A better way to do the above would be
spark.read.format("csv").schema(schema).option("header", "true").load(input_path).show()

Using DataFrame with MLlib

Let's say I have a DataFrame (that I read in from a csv on HDFS) and I want to train some algorithms on it via MLlib. How do I convert the rows into LabeledPoints or otherwise utilize MLlib on this dataset?
Assuming you're using Scala:
Let's say your obtain the DataFrame as follows:
val results : DataFrame = sqlContext.sql(...)
Step 1: call results.printSchema() -- this will show you not only the columns in the DataFrame and (this is important) their order, but also what Spark SQL thinks are their types. Once you see this output things get a lot less mysterious.
Step 2: Get an RDD[Row] out of the DataFrame:
val rows: RDD[Row] = results.rdd
Step 3: Now it's just a matter of pulling whatever fields interest you out of the individual rows. For this you need to know the 0-based position of each field and it's type, and luckily you obtained all that in Step 1 above. For example,
let's say you did a SELECT x, y, z, w FROM ... and printing the schema yielded
root
|-- x double (nullable = ...)
|-- y string (nullable = ...)
|-- z integer (nullable = ...)
|-- w binary (nullable = ...)
And let's say all you wanted to use x and z. You can pull them out into an RDD[(Double, Integer)] as follows:
rows.map(row => {
// x has position 0 and type double
// z has position 2 and type integer
(row.getDouble(0), row.getInt(2))
})
From here you just use Core Spark to create the relevant MLlib objects. Things could get a little more complicated if your SQL returns columns of array type, in which case you'll have to call getList(...) for that column.
Assuming you're using JAVA (Spark version 1.6.2):
Here is a simple example of JAVA code using DataFrame for machine learning.
It loads a JSON with the following structure,
[{"label":1,"att2":5.037089672359123,"att1":2.4100883023159456}, ... ]
splits the data into training and testing,
train the model using the train data,
apply the model to the test data and
stores the results.
Moreover according to the official documentation the "DataFrame-based API is primary API" for MLlib since the current version 2.0.0. So you can find several examples using DataFrame.
The code:
SparkConf conf = new SparkConf().setAppName("MyApp").setMaster("local[2]");
SparkContext sc = new SparkContext(conf);
String path = "F:\\SparkApp\\test.json";
String outputPath = "F:\\SparkApp\\justTest";
System.setProperty("hadoop.home.dir", "C:\\winutils\\");
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
DataFrame df = sqlContext.read().json(path);
df.registerTempTable("tmp");
DataFrame newDF = df.sqlContext().sql("SELECT att1, att2, label FROM tmp");
DataFrame dataFixed = newDF.withColumn("label", newDF.col("label").cast("Double"));
VectorAssembler assembler = new VectorAssembler().setInputCols(new String[]{"att1", "att2"}).setOutputCol("features");
StringIndexer indexer = new StringIndexer().setInputCol("label").setOutputCol("labelIndexed");
// Split the data into training and test
DataFrame[] splits = dataFixed.randomSplit(new double[] {0.7, 0.3});
DataFrame trainingData = splits[0];
DataFrame testData = splits[1];
DecisionTreeClassifier dt = new DecisionTreeClassifier().setLabelCol("labelIndexed").setFeaturesCol("features");
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] {assembler, indexer, dt});
// Train model
PipelineModel model = pipeline.fit(trainingData);
// Make predictions
DataFrame predictions = model.transform(testData);
predictions.rdd().coalesce(1,true,null).saveAsTextFile("justPlay.txt" +System.currentTimeMillis());
RDD based Mllib is on its way to be deprecated, so you should rather use DataFrame based Mllib.
Generally the input to these MLlib apis is a DataFrame containing 2 columns - label and feature. There are various methods to build this DataFrame - low level apis like org.apache.spark.mllib.linalg.{Vector, Vectors}, org.apache.spark.mllib.regression.LabeledPoint, org.apache.spark.mllib.linalg.{Matrix, Matrices} etc. They all take numeric values for feature and label.
Words can be converted to vectors using - org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}. This documentation explains more - https://spark.apache.org/docs/latest/mllib-data-types.html
Once input dataframe with label and feature is created, instantiate the MLlib api and pass in the DataFrame to 'fit' function to get the model and then call 'transform' or 'predict' function on the model to get the results.
Example -
training file looks like -
<numeric label> <a string separated by space>
//Build word vector
val trainingData = spark.read.parquet(<path to training file>)
val sampleDataDf = trainingData
.map { r =>
val s = r.getAs[String]("value").split(" ")
val label = s.head.toDouble
val feature = s.tail
(label, feature)
}.toDF("lable","feature_words")
val word2Vec = new Word2Vec()
.setInputCol("feature_words")
.setOutputCol("feature_vectors")
.setMinCount(0)
.setMaxIter(10)
//build word2Vector model
val model = word2Vec.fit(sampleDataDf)
//convert training text data to vector and labels
val wVectors = model.transform(sampleDataDf)
//train LinearSVM model
val svmAlgorithm = new LinearSVC()
.setFeaturesCol("feature_vectors")
.setMaxIter(100)
.setLabelCol("lable")
.setRegParam(0.01)
.setThreshold(0.5)
.fit(wVectors) //use word vectors created
//Predict new data, same format as training data containing words
val predictionData = spark.read.parquet(<path to prediction file>)
val pDataDf = predictionData
.map { r =>
val s = r.getAs[String]("value").split(" ")
val label = s.head.toDouble
val feature = s.tail
(label, feature)
}.toDF("lable","feature_words")
val pVectors = model.transform(pDataDf)
val predictionlResult = pVectors.map{ r =>
val s = r.getAs[Seq[String]]("feature_words")
val v = r.getAs[Vector]("feature_vectors")
val c = svmAlgorithm.predict(v) // predict using trained SVM
s"$c ${s.mkString(" ")}"
}

Resources