Problems receiving messages from pubsublite with spark streaming

I am trying to receive messages from Pub/Sub Lite in real time on a Spark cluster on GCP, but they arrive grouped in blocks of one minute.
My code:
producer.py
import random
import time
from proj_BOLSA import settings
from google.cloud.pubsublite.cloudpubsub import PublisherClient
from google.cloud.pubsublite.types import (
    CloudRegion,
    CloudZone,
    MessageMetadata,
    TopicPath,
)
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/home/vm-sergiolr-development/Desktop/projecteemo_code/proj_BOLSA/credentials/gcp_authentication.json"
regional = True
if regional:
    location = CloudRegion(settings.REGION)
else:
    location = CloudZone(CloudRegion(settings.REGION), settings.ZONE)
topic_path = TopicPath(settings.PROJECT_NUMBER, location, settings.TOPIC)
# PublisherClient() must be used in a `with` block or have __enter__() called before use.
with PublisherClient() as publisher_client:
    for i in range(6000):
        data = "number: "+str(random.randint(0, 300))
        api_future = publisher_client.publish(topic_path, data.encode("utf-8"))
        # result() blocks. To resolve API futures asynchronously, use add_done_callback().
        message_id = api_future.result()
        message_metadata = MessageMetadata.decode(message_id)
        print(
            f"Published {data} to {topic_path} with partition {message_metadata.partition.value} and offset {message_metadata.cursor.offset}."
        )
        time.sleep(20)
consumer.py
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StringType
from pyspark.sql.types import StructType, StructField, StringType, ArrayType
from pyspark.sql.functions import from_json, col
# TODO(developer):
project_number = xxxxxxxx
location = "europe-west1"
subscription_id = "s_producer"
spark = SparkSession.builder.appName("read-app").master("yarn").getOrCreate()
sdf = (
    spark.readStream.format("pubsublite")
    .option(
        "pubsublite.subscription",
        f"projects/{project_number}/locations/{location}/subscriptions/{subscription_id}",
    )
    .option("rowsPerSecond", 1).load()
)
sdf = sdf.withColumn("data", sdf.data.cast(StringType()))
query = (
    sdf.writeStream.format("console")
    .outputMode("append")
    .trigger(processingTime="1 second")
    .option("truncate", False)
    .start()
)
# Wait 120 seconds (must be >= 60 seconds) to start receiving messages.
query.awaitTermination()
query.stop()
results
-------------------------------------------
Batch: 1
-------------------------------------------
+---------------------------------------------------------------------+---------+------+---+-----------+--------------------------+---------------+----------+
|subscription |partition|offset|key|data |publish_timestamp |event_timestamp|attributes|
+---------------------------------------------------------------------+---------+------+---+-----------+--------------------------+---------------+----------+
|projects/658599344059/locations/europe-west1/subscriptions/s_producer|0 |5942 |[] |number: 74 |2022-08-05 08:42:47.796738|null |{} |
|projects/658599344059/locations/europe-west1/subscriptions/s_producer|0 |5943 |[] |number: 288|2022-08-05 08:43:07.849063|null |{} |
|projects/658599344059/locations/europe-west1/subscriptions/s_producer|0 |5944 |[] |number: 156|2022-08-05 08:43:27.952513|null |{} |
+---------------------------------------------------------------------+---------+------+---+-----------+--------------------------+---------------+----------+
-------------------------------------------
Batch: 2
-------------------------------------------
+---------------------------------------------------------------------+---------+------+---+-----------+--------------------------+---------------+----------+
|subscription |partition|offset|key|data |publish_timestamp |event_timestamp|attributes|
+---------------------------------------------------------------------+---------+------+---+-----------+--------------------------+---------------+----------+
|projects/658599344059/locations/europe-west1/subscriptions/s_producer|0 |5945 |[] |number: 162|2022-08-05 08:43:48.00867 |null |{} |
|projects/658599344059/locations/europe-west1/subscriptions/s_producer|0 |5946 |[] |number: 262|2022-08-05 08:44:08.062032|null |{} |
|projects/658599344059/locations/europe-west1/subscriptions/s_producer|0 |5947 |[] |number: 59 |2022-08-05 08:44:28.11492 |null |{} |
+---------------------------------------------------------------------+---------+------+---+-----------+--------------------------+---------------+----------+
-------------------------------------------
Batch: 3
-------------------------------------------
+---------------------------------------------------------------------+---------+------+---+-----------+--------------------------+---------------+----------+
|subscription |partition|offset|key|data |publish_timestamp |event_timestamp|attributes|
+---------------------------------------------------------------------+---------+------+---+-----------+--------------------------+---------------+----------+
|projects/658599344059/locations/europe-west1/subscriptions/s_producer|0 |5948 |[] |number: 54 |2022-08-05 08:44:48.168997|null |{} |
|projects/658599344059/locations/europe-west1/subscriptions/s_producer|0 |5949 |[] |number: 206|2022-08-05 08:45:08.225344|null |{} |
|projects/658599344059/locations/europe-west1/subscriptions/s_producer|0 |5950 |[] |number: 109|2022-08-05 08:45:28.328074|null |{} |
+---------------------------------------------------------------------+---------+------+---+-----------+--------------------------+---------------+----------+
What am I doing wrong? I want to read each message as it arrives instead of receiving them grouped in batches of one minute.
Thank you!

You are using the micro-batch streaming mode, in which the Spark runtime decides how many messages to read from the source at a time. It's actually reading ~30s windows, not 1-minute windows of data.
To read smaller time windows for small amounts of data, you would need to use the experimental continuous processing mode: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#continuous-processing
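As a rough illustration of that suggestion, the writeStream block in consumer.py could use the continuous trigger instead of a processing-time trigger. This is a minimal sketch, not a tested fix: the helper name start_continuous_console_query is hypothetical, and it assumes that the Pub/Sub Lite connector version in use supports continuous mode and that the query contains only map-like operations (such as the cast in consumer.py).

```python
def start_continuous_console_query(sdf, checkpoint_interval="1 second"):
    """Write a streaming DataFrame to the console with the continuous trigger.

    `sdf` is expected to be a streaming DataFrame such as the one built in
    consumer.py. `checkpoint_interval` is how often Spark records progress;
    it is not a batch size.
    """
    return (
        sdf.writeStream.format("console")
        .outputMode("append")
        # continuous=... replaces processingTime=... from the original code.
        .trigger(continuous=checkpoint_interval)
        .option("truncate", False)
        .start()
    )
```

It would be called as query = start_continuous_console_query(sdf), followed by query.awaitTermination() as before.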

Related

What is the best practice to fit time-series based dataframe to predict multiple columns in PySpark?

I'm experimenting with the RandomForestRegressor. The main dataframe is created by aggregations on windows over event time. Below are the train and test sets as dataframes, each including a timestamp column (type struct):
#trainset dataframe
+------------------------------------------+----+.....+----+
|window_frame_24_Hours |A | ... |Z |
+------------------------------------------+---------------+
|{2021-11-28 00:00:00, 2021-11-29 00:00:00}|316 | ... |666 |
|{2021-11-27 00:00:00, 2021-11-28 00:00:00}|324 | ... |526 |
|{2021-11-26 00:00:00, 2021-11-27 00:00:00}|261 | ... |414 |
|{2021-11-25 00:00:00, 2021-11-26 00:00:00}|268 | ... |632 |
|{2021-11-24 00:00:00, 2021-11-25 00:00:00}|284 | ... |578 |
|{2021-11-23 00:00:00, 2021-11-24 00:00:00}|232 | ... |226 |
... .... ...
|{2021-11-02 00:00:00, 2021-11-03 00:00:00}|94 | ... |100 |
|{2021-11-01 00:00:00, 2021-11-02 00:00:00}|106 | ... |666 |
|{2021-10-31 00:00:00, 2021-11-01 00:00:00}|130 | ... |108 |
|{2021-10-30 00:00:00, 2021-10-31 00:00:00}|112 | ... |35 |
+------------------------------------------+----+.....+----+
#testset dataframe
+------------------------------------------+----+.....+----+
|window_frame_24_Hours |A | ... |Z |
+------------------------------------------+---------------+
|{2021-11-28 00:00:00, 2021-11-29 00:00:00}|?? | ... |?? | <-- predict
+------------------------------------------+----+.....+----+
following is my RF model implementation:
#Dependencies
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.regression import RandomForestRegressor, GBTRegressor
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
value = X_train.columns[1:]
assembler = VectorAssembler(inputCols=value, outputCol='features')
scaler = MinMaxScaler(inputCol='features', outputCol='scaled_features')
# Create a RandomForest model.
rf = RandomForestRegressor(featuresCol="scaled_features", labelCol="label") #labelCol=value
# Chain model, assembler and scaler into a Pipeline.
pipeline = Pipeline(stages=[assembler, scaler, rf])
# Train model on training data.
rf_model = pipeline.fit(X_train)
# Make predictions.
predictions = rf_model.transform(X_test)
# Select example rows to display.
predictions = predictions.select("window_frame_24_Hours", "prediction", "features", "scaled_features").sort('window_frame_24_Hours.start')
##predictions.show(45, truncate = False)
I want to predict columns A - Z over a certain time, so I exclude the timestamp column with X_train.columns[1:] and pass the target columns to a VectorAssembler. Then:
When I use labelCol=value in rf = RandomForestRegressor(featuresCol="scaled_features", labelCol=value), I get:
TypeError: Invalid param value given for param "labelCol". Could not convert <class 'list'> to string type
When I use labelCol="label" in rf = RandomForestRegressor(featuresCol="scaled_features", labelCol="label"), I get:
AnalysisException: Cannot resolve column name "A" among (window_frame_24_Hours, A, ..., Z,
The question:
What is the best practice for fitting a time-series-based dataframe to predict multiple columns (a multi-output regression problem)?
Should I use a for-loop and train a model for each column, based on this disappointing post for Spark? If so, how?
I checked other time-series regression examples 1, 2 & 3 as well as this video, but either their problems were not multi-output regressions (predicting many columns) or it is not clear how they handle it. I know that in Python there is a solution for multioutput regression.

How to use Confluent Schema Registry with from_avro standard function? [duplicate]

This question already has answers here:
Integrating Spark Structured Streaming with the Confluent Schema Registry
(10 answers)
Closed 3 years ago.
My Kafka and Schema Registry are based on Confluent Community Platform 5.2.2, and My Spark has version 2.4.4. I started Spark REPL env with:
./bin/spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4,org.apache.spark:spark-avro_2.11:2.4.4
And setup Kafka source for spark session:
val brokerServers = "my_confluent_server:9092"
val topicName = "my_kafka_topic_name"
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokerServers)
.option("subscribe", topicName)
.load()
And I got schema information about key and value with:
import io.confluent.kafka.schemaregistry.client.rest.RestService
val schemaRegistryURL = "http://my_confluent_server:8081"
val restService = new RestService(schemaRegistryURL)
val keyRestResponseSchemaStr: String = restService.getLatestVersionSchemaOnly(topicName + "-key")
val valueRestResponseSchemaStr: String = restService.getLatestVersionSchemaOnly(topicName + "-value")
Firstly, if I queried it with writeStream for "key", i.e.
import org.apache.spark.sql.avro._
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.DataFrame
import java.time.LocalDateTime
val query = df.writeStream
.outputMode("append")
.foreachBatch((batchDF: DataFrame, batchId: Long) => {
val rstDF = batchDF
.select(
from_avro($"key", keyRestResponseSchemaStr).as("key"),
from_avro($"value", valueRestResponseSchemaStr).as("value"))
println(s"${LocalDateTime.now} --- Batch ${batchId}, ${batchDF.count} rows")
//rstDF.select("value").show
rstDF.select("key").show
})
.trigger(Trigger.ProcessingTime("120 seconds"))
.start()
query.awaitTermination()
There are no errors, and even the row counts are shown, but I could not get any data.
2019-09-16T10:30:16.984 --- Batch 0, 0 rows
+---+
|key|
+---+
+---+
2019-09-16T10:32:00.401 --- Batch 1, 27 rows
+---+
|key|
+---+
| []|
| []|
| []|
| []|
| []|
| []|
| []|
| []|
| []|
| []|
| []|
| []|
| []|
| []|
| []|
| []|
| []|
| []|
| []|
| []|
+---+
only showing top 20 rows
But if I select "value":
import org.apache.spark.sql.avro._
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.DataFrame
import java.time.LocalDateTime
val query = df.writeStream
.outputMode("append")
.foreachBatch((batchDF: DataFrame, batchId: Long) => {
val rstDF = batchDF
.select(
from_avro($"key", keyRestResponseSchemaStr).as("key"),
from_avro($"value", valueRestResponseSchemaStr).as("value"))
println(s"${LocalDateTime.now} --- Batch ${batchId}, ${batchDF.count} rows")
rstDF.select("value").show
//rstDF.select("key").show
})
.trigger(Trigger.ProcessingTime("120 seconds"))
.start()
query.awaitTermination()
I got message:
2019-09-16T10:34:54.287 --- Batch 0, 0 rows
+-----+
|value|
+-----+
+-----+
2019-09-16T10:36:00.416 --- Batch 1, 19 rows
19/09/16 10:36:03 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 3)
org.apache.avro.AvroRuntimeException: Malformed data. Length is negative: -1
at org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:336)
at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
at org.apache.avro.io.ResolvingDecoder.readString(ResolvingDecoder.java:201)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:422)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:414)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:181)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
at org.apache.spark.sql.avro.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:50)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.serializefromobject_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
So I think there are two levels of problems:
Firstly, the Avro deserialization logic differs between key and value, and the current from_avro only supports the key, not the value.
Even for the key, there is no error, but the from_avro deserializer could not get the real data.
Did I take any wrong steps? Or do from_avro and to_avro need to be enhanced?
Thanks.

Your key and value are entirely byte arrays, and are prefixed with integer values for their IDs. Spark-Avro does not support that format, only "Avro container object" formats that contain the schema as part of the record.
In other words, you need to invoke the functions from the Confluent deserializers, not the "plain Avro" deserializers, in order to first get Avro objects; then you can put schemas on those.
Spark should enhance from_avro and to_avro?
They should, but they won't. Ref SPARK-26314. As a side note, Databricks does offer Schema Registry integration with functions of the same name, which only adds to the confusion.
The workaround would be to use this library: https://github.com/AbsaOSS/ABRiS
Or see other solutions at Integrating Spark Structured Streaming with the Confluent Schema Registry
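To make the prefix mentioned in this answer concrete: Confluent's serializers frame each message as one magic byte (0), a 4-byte big-endian schema ID, and then the Avro-encoded body. The sketch below splits such a frame in plain Python. The function name split_confluent_frame is hypothetical, and the code only illustrates the layout; it is not a replacement for a library such as ABRiS.

```python
import struct

MAGIC_BYTE = 0   # first byte of every Confluent-framed message
HEADER_SIZE = 5  # 1 magic byte + 4-byte schema ID


def split_confluent_frame(payload: bytes):
    """Return (schema_id, avro_body) for a Confluent-framed message."""
    if len(payload) < HEADER_SIZE:
        raise ValueError("payload is shorter than the 5-byte Confluent header")
    # ">bI": big-endian signed byte followed by an unsigned 32-bit integer.
    magic, schema_id = struct.unpack(">bI", payload[:HEADER_SIZE])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, payload[HEADER_SIZE:]


# Hand-built example frame: schema ID 42 followed by three arbitrary body bytes.
frame = b"\x00" + (42).to_bytes(4, "big") + b"\x06fo"
schema_id, body = split_confluent_frame(frame)
```

The body that remains after the header is what a schema-based Avro decoder would be given. Whether from_avro can then decode it with the schema fetched from the registry should be verified against your own data.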

Error when running a query involving ROUND function in spark sql

I am trying, in pyspark, to obtain a new column by rounding one column of a table to the precision specified, in each row, by another column of the same table, e.g., from the following table:
+--------+--------+
| Data|Rounding|
+--------+--------+
|3.141592| 3|
|0.577215| 1|
+--------+--------+
I should be able to obtain the following result:
+--------+--------+--------------+
| Data|Rounding|Rounded_Column|
+--------+--------+--------------+
|3.141592| 3| 3.142|
|0.577215| 1| 0.6|
+--------+--------+--------------+
In particular, I have tried the following code:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import (
StructType, StructField, FloatType, LongType,
IntegerType
)
pdDF = pd.DataFrame(columns=["Data", "Rounding"], data=[[3.141592, 3],
[0.577215, 1]])
mySchema = StructType([ StructField("Data", FloatType(), True),
StructField("Rounding", IntegerType(), True)])
spark = (SparkSession.builder
.master("local")
.appName("column rounding")
.getOrCreate())
df = spark.createDataFrame(pdDF,schema=mySchema)
df.show()
df.createOrReplaceTempView("df_table")
df_rounded = spark.sql("SELECT Data, Rounding, ROUND(Data, Rounding) AS Rounded_Column FROM df_table")
df_rounded.show()
but I get the following error:
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve 'round(df_table.`Data`, df_table.`Rounding`)' due to data type mismatch: Only foldable Expression is allowed for scale arguments; line 1 pos 23;\n'Project [Data#0, Rounding#1, round(Data#0, Rounding#1) AS Rounded_Column#12]\n+- SubqueryAlias df_table\n +- LogicalRDD [Data#0, Rounding#1], false\n"
Any help would be deeply appreciated :)

With Spark SQL, Catalyst throws the following error in your run: Only foldable Expression is allowed for scale arguments
i.e. @param scale new scale to be round to, this should be a constant int at runtime
ROUND only expects a literal for the scale. You can try writing custom code instead of the Spark SQL way.
EDIT:
With UDF,
val df = Seq(
(3.141592,3),
(0.577215,1)).toDF("Data","Rounding")
df.show()
df.createOrReplaceTempView("df_table")
import org.apache.spark.sql.functions._
def RoundUDF(customvalue:Double, customscale:Int):Double = BigDecimal(customvalue).setScale(customscale, BigDecimal.RoundingMode.HALF_UP).toDouble
spark.udf.register("RoundUDF", RoundUDF(_:Double,_:Int):Double)
val df_rounded = spark.sql("select Data, Rounding, RoundUDF(Data, Rounding) as Rounded_Column from df_table")
df_rounded.show()
Input:
+--------+--------+
| Data|Rounding|
+--------+--------+
|3.141592| 3|
|0.577215| 1|
+--------+--------+
Output:
+--------+--------+--------------+
| Data|Rounding|Rounded_Column|
+--------+--------+--------------+
|3.141592| 3| 3.142|
|0.577215| 1| 0.6|
+--------+--------+--------------+
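Since the question uses PySpark, the same idea can be sketched in Python. The function below rounds half up to a per-row scale using the standard decimal module; the name round_half_up is hypothetical. In a Spark session it could presumably be registered with spark.udf.register("RoundUDF", round_half_up, DoubleType()) and called from SQL as in the Scala version, but that registration step is an assumption and is not run here.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value, scale):
    """Round `value` to `scale` decimal places, rounding halves up."""
    if value is None or scale is None:
        return None  # propagate SQL NULLs
    # Decimal(1).scaleb(-3) is 0.001, i.e. the quantum for three decimals.
    quantum = Decimal(1).scaleb(-scale)
    # Going through str() avoids carrying binary floating-point noise
    # into the decimal value before rounding.
    return float(Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP))


# Same rows as the example table: (Data, Rounding).
rows = [(3.141592, 3), (0.577215, 1)]
rounded = [round_half_up(data, rounding) for data, rounding in rows]
```

Here rounded holds the values shown in the Rounded_Column of the output table.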

Spark gives Error when creating DataFrame

I have downloaded Spark version 2.3.1, Hadoop version 2.7, and Java JDK 8.
Everything works fine for simple exercises, but when I tried to create a DataFrame, it started to throw an error.
The following code runs without error:
import numpy as np
TOTAL = 1000000
dots = sc.parallelize([2.0 * np.random.random(2) - 1.0 for i in range(TOTAL)]).cache()
print("Number of random points:", dots.count())
stats = dots.stats()
print('Mean:', stats.mean())
print('stdev:', stats.stdev())
But when I tried the following code, which requires converting the input into a DataFrame:
df = sc.parallelize([Row(name='ab',age=20), Row(name='ab',age=20)]).toDF()
it throws the following error

You were missing the import for Row.
from pyspark.sql import Row
df = sc.parallelize([Row(name='ab',age=20), Row(name='ab',age=20)]).toDF()
df.show()
Result:
+---+----+
|age|name|
+---+----+
| 20| ab|
| 20| ab|
+---+----+

How can I get the indices of categorical variables in a Spark DataFrame? [duplicate]

How do I handle categorical data with spark-ml and not spark-mllib?
Though the documentation is not very clear, it seems that classifiers, e.g. RandomForestClassifier and LogisticRegression, have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame.
Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to put all my features in a single vector under featuresCol.
However, the VectorAssembler only accepts numeric types, boolean type, and vector type (according to the Spark website), so I can't put strings in my features vector.
How should I proceed?

I just wanted to complete Holden's answer.
Since Spark 2.3.0, OneHotEncoder has been deprecated and it will be removed in 3.0.0. Please use OneHotEncoderEstimator instead. (In Spark 3.0.0 and later, OneHotEncoderEstimator was itself renamed to OneHotEncoder.)
In Scala:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}
val df = Seq((0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3)).toDF("id", "category1", "category2")
val indexer = new StringIndexer().setInputCol("category1").setOutputCol("category1Index")
val encoder = new OneHotEncoderEstimator()
.setInputCols(Array(indexer.getOutputCol, "category2"))
.setOutputCols(Array("category1Vec", "category2Vec"))
val pipeline = new Pipeline().setStages(Array(indexer, encoder))
pipeline.fit(df).transform(df).show
// +---+---------+---------+--------------+-------------+-------------+
// | id|category1|category2|category1Index| category1Vec| category2Vec|
// +---+---------+---------+--------------+-------------+-------------+
// | 0| a| 1| 0.0|(2,[0],[1.0])|(4,[1],[1.0])|
// | 1| b| 2| 2.0| (2,[],[])|(4,[2],[1.0])|
// | 2| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
// | 3| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
// | 4| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
// | 5| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
// +---+---------+---------+--------------+-------------+-------------+
In Python:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator
df = spark.createDataFrame([(0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3)], ["id", "category1", "category2"])
indexer = StringIndexer(inputCol="category1", outputCol="category1Index")
inputs = [indexer.getOutputCol(), "category2"]
encoder = OneHotEncoderEstimator(inputCols=inputs, outputCols=["categoryVec1", "categoryVec2"])
pipeline = Pipeline(stages=[indexer, encoder])
pipeline.fit(df).transform(df).show()
# +---+---------+---------+--------------+-------------+-------------+
# | id|category1|category2|category1Index| categoryVec1| categoryVec2|
# +---+---------+---------+--------------+-------------+-------------+
# | 0| a| 1| 0.0|(2,[0],[1.0])|(4,[1],[1.0])|
# | 1| b| 2| 2.0| (2,[],[])|(4,[2],[1.0])|
# | 2| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
# | 3| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
# | 4| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
# | 5| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
# +---+---------+---------+--------------+-------------+-------------+
Since Spark 1.4.0, MLlib also supplies a OneHotEncoder feature, which maps a column of label indices to a column of binary vectors, with at most a single one-value.
This encoding allows algorithms that expect continuous features, such as Logistic Regression, to use categorical features.
Let's consider the following DataFrame:
val df = Seq((0, "a"),(1, "b"),(2, "c"),(3, "a"),(4, "a"),(5, "c"))
.toDF("id", "category")
The first step would be to create the indexed DataFrame with the StringIndexer:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
.setInputCol("category")
.setOutputCol("categoryIndex")
.fit(df)
val indexed = indexer.transform(df)
indexed.show
// +---+--------+-------------+
// | id|category|categoryIndex|
// +---+--------+-------------+
// | 0| a| 0.0|
// | 1| b| 2.0|
// | 2| c| 1.0|
// | 3| a| 0.0|
// | 4| a| 0.0|
// | 5| c| 1.0|
// +---+--------+-------------+
You can then encode the categoryIndex with OneHotEncoder :
import org.apache.spark.ml.feature.OneHotEncoder
val encoder = new OneHotEncoder()
.setInputCol("categoryIndex")
.setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)
encoded.select("id", "categoryVec").show
// +---+-------------+
// | id| categoryVec|
// +---+-------------+
// | 0|(2,[0],[1.0])|
// | 1| (2,[],[])|
// | 2|(2,[1],[1.0])|
// | 3|(2,[0],[1.0])|
// | 4|(2,[0],[1.0])|
// | 5|(2,[1],[1.0])|
// +---+-------------+
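The index and vector values in the tables above follow from two rules that can be mimicked in plain Python: StringIndexer assigns indices by descending frequency, and OneHotEncoder drops the last category by default, so the least frequent value ("b") becomes an all-zero vector. This is only an illustration of the logic, not Spark's implementation. The function names are hypothetical, breaking frequency ties alphabetically is an assumption, and the vectors are dense lists rather than Spark's sparse vectors.

```python
from collections import Counter


def string_index(values):
    """Map each distinct value to an index, most frequent value first."""
    counts = Counter(values)
    # Assumption: ties in frequency are broken alphabetically.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


def one_hot_drop_last(index, num_categories):
    """Dense one-hot vector of length num_categories - 1 (last one dropped)."""
    position = int(index)
    return [1.0 if j == position else 0.0 for j in range(num_categories - 1)]


categories = ["a", "b", "c", "a", "a", "c"]  # the "category" column above
index = string_index(categories)
vectors = [one_hot_drop_last(index[c], len(index)) for c in categories]
```

With this input, "a" gets index 0.0, "c" gets 1.0 and "b" gets 2.0, and the row for "b" encodes to [0.0, 0.0], matching the (2,[],[]) entry in the table.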

I am going to provide an answer from another perspective, since I was also wondering about categorical features with regards to tree-based models in Spark ML (not MLlib), and the documentation is not that clear how everything works.
When you transform a column in your dataframe using pyspark.ml.feature.StringIndexer extra meta-data gets stored in the dataframe that specifically marks the transformed feature as a categorical feature.
When you print the dataframe you will see a numeric value (which is an index that corresponds with one of your categorical values) and if you look at the schema you will see that your new transformed column is of type double. However, this new column you created with pyspark.ml.feature.StringIndexer.transform is not just a normal double column, it has extra meta-data associated with it that is very important. You can inspect this meta-data by looking at the metadata property of the appropriate field in your dataframe's schema (you can access the schema objects of your dataframe by looking at yourdataframe.schema)
This extra metadata has two important implications:
When you call .fit() when using a tree based model, it will scan the meta-data of your dataframe and recognize fields that you encoded as categorical with transformers such as pyspark.ml.feature.StringIndexer (as noted above there are other transformers that will also have this effect such as pyspark.ml.feature.VectorIndexer). Because of this, you DO NOT have to one-hot encode your features after you have transformed them with StringIndexer when using tree-based models in spark ML (however, you still have to perform one-hot encoding when using other models that do not naturally handle categoricals like linear regression, etc.).
Because this metadata is stored in the data frame, you can use pyspark.ml.feature.IndexToString to reverse the numeric indices back to the original categorical values (which are often strings) at any time.
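The metadata described above can be read straight from the schema. The helper below is a minimal sketch: the name categorical_labels is hypothetical, and it assumes that Spark ML stores the attribute information under the "ml_attr" key with the original labels listed in "vals", which should be checked against the Spark version in use.

```python
def categorical_labels(df, column):
    """Return the original labels recorded for an indexed column, or None.

    `df` is expected to be a Spark DataFrame; df.schema[column] is the
    StructField for that column and exposes a `metadata` dictionary.
    """
    metadata = df.schema[column].metadata
    # Assumption: ML attributes live under "ml_attr" and labels under "vals".
    return metadata.get("ml_attr", {}).get("vals")
```

For a DataFrame produced by StringIndexer, categorical_labels(indexed, "categoryIndex") would be expected to return the labels in index order; for a plain numeric column without ML metadata it returns None.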

There is a component of the ML pipeline called StringIndexer you can use to convert your strings to Double's in a reasonable way. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.StringIndexer has more documentation, and http://spark.apache.org/docs/latest/ml-guide.html shows how to construct pipelines.

I use the following method for oneHotEncoding a single column in a Spark dataFrame:
def ohcOneColumn(df, colName, debug=False):
    colsToFillNa = []
    if debug: print("Entering method ohcOneColumn")
    countUnique = df.groupBy(colName).count().count()
    if debug: print(countUnique)
    collectOnce = df.select(colName).distinct().collect()
    for uniqueValIndex in range(countUnique):
        uniqueVal = collectOnce[uniqueValIndex][0]
        if debug: print(uniqueVal)
        newColName = str(colName) + '_' + str(uniqueVal) + '_TF'
        df = df.withColumn(newColName, df[colName]==uniqueVal)
        colsToFillNa.append(newColName)
    df = df.drop(colName)
    df = df.na.fill(False, subset=colsToFillNa)
    return df
I use the following method for oneHotEncoding Spark dataFrames:
from pyspark.sql.functions import col, countDistinct, approxCountDistinct
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoderEstimator
def detectAndLabelCat(sparkDf, minValCount=5, debug=False, excludeCols=['Target']):
    if debug: print("Entering method detectAndLabelCat")
    newDf = sparkDf
    colList = sparkDf.columns
    for colName in sparkDf.columns:
        uniqueVals = sparkDf.groupBy(colName).count()
        if debug: print(uniqueVals)
        countUnique = uniqueVals.count()
        dtype = str(sparkDf.schema[colName].dataType)
        #dtype = str(df.schema[nc].dataType)
        if (colName in excludeCols):
            if debug: print(str(colName) + ' is in the excluded columns list.')
        elif countUnique == 1:
            newDf = newDf.drop(colName)
            if debug:
                print('dropping column ' + str(colName) + ' because it only contains one unique value.')
            #end if debug
        #elif (1==2):
        elif ((countUnique < minValCount) | (dtype=="String") | (dtype=="StringType")):
            if debug:
                print(len(newDf.columns))
                oldColumns = newDf.columns
            newDf = ohcOneColumn(newDf, colName, debug=debug)
            if debug:
                print(len(newDf.columns))
                newColumns = set(newDf.columns) - set(oldColumns)
                print('Adding:')
                print(newColumns)
                for newColumn in newColumns:
                    if newColumn in newDf.columns:
                        try:
                            newUniqueValCount = newDf.groupBy(newColumn).count().count()
                            print("There are " + str(newUniqueValCount) + " unique values in " + str(newColumn))
                        except:
                            print('Uncaught error discussing ' + str(newColumn))
                    #else:
                    #    newColumns.remove(newColumn)
                print('Dropping:')
                print(set(oldColumns) - set(newDf.columns))
        else:
            if debug: print('Nothing done for column ' + str(colName))
        #end if countUnique == 1, elif countUnique other condition
    #end outer for
    return newDf

You can cast a string column type in a spark data frame to a numerical data type using the cast function.
from pyspark.sql import SQLContext
from pyspark.sql.types import DoubleType, IntegerType
sqlContext = SQLContext(sc)
dataset = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('./data/titanic.csv')
dataset = dataset.withColumn("Age", dataset["Age"].cast(DoubleType()))
dataset = dataset.withColumn("Survived", dataset["Survived"].cast(IntegerType()))
In the above example, we read in a csv file as a data frame, cast the default string datatypes into integer and double, and overwrite the original data frame. We can then use the VectorAssembler to merge the features in a single vector and apply your favorite Spark ML algorithm.
