FPGrowth: Input data is not cached pyspark - python-3.x

I am trying to run the following example code. Even though I have cached my data, I am getting the "Input data is not cached" warning. Because of this issue, I am not able to use the FP-growth algorithm for large datasets.
from pyspark.ml.fpm import FPGrowth
from pyspark.sql import SparkSession

"""
An example demonstrating FPGrowth.
Run with:
  bin/spark-submit examples/src/main/python/ml/fpgrowth_example.py
"""

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("FPGrowthExample")\
        .getOrCreate()

    # $example on$
    df = spark.createDataFrame([
        (0, [1, 2, 5]),
        (1, [1, 2, 3, 5]),
        (2, [1, 2])
    ], ["id", "items"])
    df = df.cache()

    fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
    model = fpGrowth.fit(df)

    # Display frequent itemsets.
    model.freqItemsets.show()

    # Display generated association rules.
    model.associationRules.show()

    # transform examines the input items against all the association rules and summarize the
    # consequents as prediction
    model.transform(df).show()

    spark.stop()

Why:
Because ml.fpm.FPGrowth converts the data to an RDD and runs mllib.fpm.FPGrowth on that RDD. The intermediate RDD is not cached, and this is what triggers the warning in the mllib code.
What can you do about it:
In your code, nothing. If you think this is a big issue (it shouldn't be), open a JIRA ticket and create a pull request.
Because of this issue, I am not able to use the FP-growth algorithm for large datasets.
It can cause unnecessary allocation and slowdown, but it shouldn't be limiting. If you experience failures, it is more likely that the parameters require tuning.
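For large inputs, the knobs that usually matter are minSupport and numPartitions on the estimator itself. A minimal sketch, assuming a cached DataFrame named transactions with an array column items (the names and values are illustrative, not from the question):

from pyspark.ml.fpm import FPGrowth

# Hypothetical input: a cached DataFrame with an array column named "items".
transactions = spark.createDataFrame(
    [(0, [1, 2, 5]), (1, [1, 2, 3, 5]), (2, [1, 2])],
    ["id", "items"]
).cache()

# A higher minSupport prunes rare itemsets early; numPartitions controls the
# parallelism of the underlying parallel FP-growth run (by default the input
# DataFrame's partitioning is used).
fp = FPGrowth(itemsCol="items",
              minSupport=0.3,          # illustrative value
              minConfidence=0.6,
              numPartitions=200)       # illustrative value, tune for your cluster
model = fp.fit(transactions)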

Related

How to pass SparseVectors to `mllib` in pyspark

I am using pyspark 1.6.3 through Zeppelin with python 3.5.
I am trying to implement Latent Dirichlet Allocation using the pyspark CountVectorizer and LDA functions. First, the problem: here is the code I am using. Let df be a Spark dataframe with tokenized text in a column 'tokenized':
from pyspark.ml.feature import CountVectorizer
from pyspark.mllib.clustering import LDA

vectors = 'vectors'
cv = CountVectorizer(inputCol='tokenized', outputCol=vectors)
model = cv.fit(df)
df = model.transform(df)

corpus = df.select(vectors).rdd.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
ldaModel = LDA.train(corpus, k=25)
This code is taken more or less from the pyspark api docs.
On the call to LDA I get the following error:
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)
The internet tells me that this is due to a type mismatch.
So let's look at the types expected by LDA and produced by CountVectorizer. From the Spark docs, here is another example of a sparse vector going into LDA:
>>> from pyspark.mllib.linalg import Vectors, SparseVector
>>> data = [
... [1, Vectors.dense([0.0, 1.0])],
... [2, SparseVector(2, {0: 1.0})],
... ]
>>> rdd = sc.parallelize(data)
>>> model = LDA.train(rdd, k=2, seed=1)
I implemented this myself, and this is what the rdd looks like:
>> testrdd.take(2)
[[1, DenseVector([0.0, 1.0])], [2, SparseVector(2, {0: 1.0})]]
On the other hand, if I go to my original code and look at corpus, the rdd with the output of CountVectorizer, I see (edited to remove extraneous bits):
>> corpus.take(3)
[[0, Row(vectors=SparseVector(130593, {0: 30.0, 1: 13.0, ...
[1, Row(vectors=SparseVector(130593, {0: 52.0, 1: 44.0, ...
[2, Row(vectors=SparseVector(130593, {0: 14.0, 1: 6.0, ...
]
So the example I used (from the docs!) doesn't produce a tuple of (index, SparseVector), but a (index, Row(SparseVector))... or something?
Questions:
Is the Row wrapper around the SparseVector what is causing this error?
If so, how do I get rid of the Row object? Row is a property of a df, but I used df.rdd to convert to an rdd; what else would I need to do?
That may well be the problem. Just extract the vectors from the Row object:
corpus = df.select(vectors).rdd.zipWithIndex().map(lambda x: [x[1], x[0]['vectors']]).cache()
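Equivalently, a small sketch of two other ways to unwrap it (both assume the column is named vectors, as above): attribute access on the Row, or positional indexing of its single field.

# Attribute access on the Row gives the same result as x[0]['vectors']:
corpus = (df.select(vectors).rdd
            .zipWithIndex()
            .map(lambda x: [x[1], x[0].vectors])
            .cache())

# Or, since the Row has a single field, index it positionally:
# .map(lambda x: [x[1], x[0][0]])

ldaModel = LDA.train(corpus, k=25)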

Write dataframe to kafka pyspark

I have a Spark dataframe which I would like to write to Kafka. I have tried the snippet below:
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=util.get_broker_metadata())
df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("baz", 3)], ('k', 'v'))

for row in df.rdd.collect():
    producer.send('topic', str(row.asDict()))
producer.flush()
This works, but the problem with this snippet is that it is not scalable: every time collect runs, the data is aggregated on the driver node, which can slow down all operations.
Since a foreach operation on a dataframe can run in parallel on the worker nodes, I tried the approach below:
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=util.get_broker_metadata())
df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("baz", 3)], ('k', 'v'))

def custom_fun(row):
    producer.send('topic', str(row.asDict()))
    producer.flush()

df.foreach(custom_fun)
This doesn't work and gives a pickling error: PicklingError: Cannot pickle objects of type <type 'itertools.count'>. I am not able to understand the reason behind this error. Can anyone help me understand it or provide any other parallel solution?
The error you get looks unrelated to Kafka writes. It looks like somewhere else in your code you use itertools.count (AFAIK it is not used in Spark's source at all; it is of course possible that it comes with KafkaProducer), which for some reason gets serialized with the cloudpickle module. Changing the Kafka writing code might have no impact at all. If KafkaProducer is the source of the error, you should be able to resolve this with foreachPartition:
from kafka import KafkaProducer

def send_to_kafka(rows):
    producer = KafkaProducer(bootstrap_servers=util.get_broker_metadata())
    for row in rows:
        producer.send('topic', str(row.asDict()))
    producer.flush()

df.foreachPartition(send_to_kafka)
That being said:
or provide any other parallel solution?
I would recommend using the built-in Kafka data source instead. Include the Kafka SQL package, for example:
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
And:
from pyspark.sql.functions import to_json, col, struct

(df
    .select(to_json(struct([col(c).alias(c) for c in df.columns])).alias("value"))
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrap_servers)
    .option("topic", topic)
    .save())
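One detail worth spelling out: the Kafka sink expects the serialized payload in a string or binary column named value (hence the alias above), plus an optional key column. A sketch that also sets the message key, assuming the k column should serve as the key:

(df
    .select(col("k").cast("string").alias("key"),
            to_json(struct([col(c).alias(c) for c in df.columns])).alias("value"))
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrap_servers)
    .option("topic", topic)
    .save())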

In PySpark RDD, how to use foreachPartition() to print out the first record of each partition?

In PySpark RDD, how to use foreachPartition() to print out the first record of each partition?
You can do this:
def f(iterator):
    print(next(iterator))
or
def f(iterator):
    print(list(iterator)[0])
Then, you can apply one of the above functions to an RDD as follows:
rdd1 = sc.parallelize([1, 2, 3, 4, 5])
rdd1.foreachPartition(f)
Note that this will print on each of the Spark workers, so you need to look at the workers' logs to see the results.
For more information, check the documentation here.
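If you would rather see the results on the driver instead of digging through worker logs, here is a small sketch (not part of the original answer) that emits at most one record per partition and collects them:

def first_of_partition(iterator):
    for record in iterator:
        yield record   # yield only the first record of the partition ...
        return         # ... then stop (this also handles empty partitions)

rdd1 = sc.parallelize([1, 2, 3, 4, 5])
print(rdd1.mapPartitions(first_of_partition).collect())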

Do not discard keys with null values when converting to JSON in PySpark DataFrame

I am creating a column in a DataFrame from several other columns that I want to store as a JSON serialized string. When the serialization to JSON occurs, keys with null values are dropped. Is there a way to keep keys even if the value is null?
Sample program illustrating the issue:
from pyspark.sql import functions as F
df = sc.parallelize([
    (1, 10),
    (2, 20),
    (3, None),
    (4, 40),
]).toDF(['id', 'data'])
df.collect()
#[Row(id=1, data=10),
# Row(id=2, data=20),
# Row(id=3, data=None),
# Row(id=4, data=40)]
df_s = df.select(F.struct('data').alias('struct'))
df_s.collect()
#[Row(struct=Row(data=10)),
# Row(struct=Row(data=20)),
# Row(struct=Row(data=None)),
# Row(struct=Row(data=40))]
df_j = df.select(F.to_json(F.struct('data')).alias('json'))
df_j.collect()
#[Row(json=u'{"data":10}'),
# Row(json=u'{"data":20}'),
# Row(json=u'{}'), <= would like this to be u'{"data":null}'
# Row(json=u'{"data":40}')]
Running Spark 2.1.0
I could not find a Spark-specific solution, so I just wrote a udf and used the Python json package:
import json
from pyspark.sql import functions as F
from pyspark.sql import types as T

def to_json(data):
    return json.dumps({'data': data})

to_json_udf = F.udf(to_json, T.StringType())

df.select(to_json_udf('data').alias('json')).collect()
# [Row(json=u'{"data": 10}'),
#  Row(json=u'{"data": 20}'),
#  Row(json=u'{"data": null}'),
#  Row(json=u'{"data": 40}')]
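Since the question mentions building the JSON from several columns, here is a small generalization of the same udf idea (a sketch; the cols list is illustrative, not from the original post):

import json
from pyspark.sql import functions as F, types as T

cols = ['id', 'data']  # illustrative list of source columns

def row_to_json(*values):
    # Keys come from the column list; None values are kept as JSON null.
    return json.dumps(dict(zip(cols, values)))

row_to_json_udf = F.udf(row_to_json, T.StringType())
df.select(row_to_json_udf(*[F.col(c) for c in cols]).alias('json')).show(truncate=False)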
Also posted on the corresponding StackOverflow post.
Since PySpark 3, one can use the ignoreNullFields option when writing to a JSON file:
spark_dataframe.write.json(output_path, ignoreNullFields=False)
Pyspark docs:
https://spark.apache.org/docs/3.1.1/api/python/_modules/pyspark/sql/readwriter.html#DataFrameWriter.json
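If the goal is a JSON string column rather than output files, the same option can also be passed to to_json, which accepts the JSON data source options (this assumes Spark 3.x; a sketch based on the question's df):

from pyspark.sql import functions as F

df_j = df.select(
    F.to_json(F.struct('data'), options={'ignoreNullFields': 'false'}).alias('json')
)
df_j.collect()
# The null row should now serialize as u'{"data":null}' instead of u'{}'.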

Caching factor of MatrixFactorizationModel in PySpark

After loading a saved MatrixFactorizationModel I get the warnings:
MatrixFactorizationModelWrapper: Product factor does not have a partitioner. Prediction on individual records could be slow.
MatrixFactorizationModelWrapper: Product factor is not cached. Prediction could be slow.
and indeed the computation is slow and will not scale well.
How do I set a partitioner and cache the product factor?
Adding code that demonstrates the problem:
from pyspark import SparkContext
import sys
sc = SparkContext("spark://hadoop-m:7077", "recommend")
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
model = MatrixFactorizationModel.load(sc, "model")
model.productFeatures.cache()
i get:
Traceback (most recent call last):
File "/home/me/recommend.py", line 7, in
model.productFeatures.cache()
AttributeError: 'function' object has no attribute 'cache'
Concerning the caching, as I wrote in the comment box, you can cache your rdd by doing the following:
rdd.cache() # for Scala, Java and Python
EDIT: The userFeatures and the productFeatures are both of type RDD[(Int, Array[Double])]. (Ref. Official Documentation)
To cache the productFeatures, you can do the following:
model.productFeatures().cache()
Of course, this assumes that the loaded model is called model.
Example :
r1 = (1, 1, 1.0)
r2 = (1, 2, 2.0)
r3 = (2, 1, 2.0)
ratings = sc.parallelize([r1, r2, r3])
from pyspark.mllib.recommendation import ALS
model = ALS.trainImplicit(ratings, 1, seed=10)
model.predict(2, 2)
feats = model.productFeatures()
type(feats)
>> MapPartitionsRDD[137] at mapPartitions at PythonMLLibAPI.scala:1074
feats.cache()
As for the warning concerning the partitioner: even if you partition your model, say by feature with .partitionBy() to balance it, it would still be too expensive performance-wise.
There is a JIRA ticket (SPARK-8708) concerning this issue that should be resolved in the next release of Spark (1.5).
Nevertheless, if you want to learn more about partitioning algorithms, I invite you to read the discussion in the ticket SPARK-3717, which argues about partitioning by features within the DecisionTree and RandomForest algorithms.
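For completeness, here is roughly what partitioning and caching the factor RDD looks like (a sketch only, with the caveat above that it will not make prediction on individual records fast; the partition count is illustrative):

num_partitions = sc.defaultParallelism          # illustrative choice

feats = (model.productFeatures()                # RDD[(Int, Array[Double])]
              .partitionBy(num_partitions)      # hash-partition by product id
              .cache())
feats.count()  # force materialization so the partitioned copy is actually cached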
