My RDD's type is RDD[Map], and the map format is like:
{"date": "2015-01-01", "topic": "sports", "content": "foo,bar"}
...
Now I would like to obtain a sequence like
{"date": "2015-01-01", "topic":"sports", "count":22}
that is, the count of every topic for each day.
How to group and count it in Spark?
Here is code using Spark SQL on Spark 1.3.0. This code is well tested, and if you are familiar with SQL you can write simple queries to process your JSON data. Note that the syntax is slightly different in later versions of Spark (e.g. 1.5):
Save the file to HDFS (e.g. /user/cloudera/data.json), then:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.sql("set spark.sql.shuffle.partitions=10");
-- You can change number of partitions to the number you want, by default it will use 200
import sqlContext.implicits._
val jsonData = sqlContext.jsonFile("/user/cloudera/data.json")
jsonData.registerTempTable("jsonData")
// quote `date` with backticks so it is treated as a column name, not a string literal
val tableData = sqlContext.sql("select `date`, topic, count(1) from jsonData group by `date`, topic")
tableData.collect().foreach(println)
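If you want the result back in the JSON shape shown in the question, a possible follow-up (a sketch, relying on DataFrame.toJSON, which returns an RDD of JSON strings) is:
// Alias the aggregate so the JSON field is named "count", then emit one JSON object per row
val countsJson = sqlContext.sql(
  "select `date`, topic, count(1) as `count` from jsonData group by `date`, topic")
countsJson.toJSON.collect().foreach(println)
// e.g. {"date":"2015-01-01","topic":"sports","count":22}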
If Map is an object having the fields you have shown, you can simply do this:
import org.apache.spark.SparkContext._
val resultRDD = yourRDD.map(x => ((x.date, x.topic), 1)).reduceByKey(_ + _)
resultRDD.map { x =>
  // here you have to create the JSON you want as output,
  // knowing that x._1._1 contains the date, x._1._2 contains the topic
  // and x._2 contains the count
}
The code I have written is in Scala, but I'm sure it will be easy for you to adapt it if you're using Java or Python.
Also pay attention to the import at the top, since it is needed for the implicit conversion between an RDD and a PairRDD.
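For example, a minimal sketch of that final map, assuming a plain Scala Map per record is an acceptable output shape:
val outputRDD = resultRDD.map { case ((date, topic), count) =>
  // mirror the input format: one Map per (date, topic) pair with its count
  Map("date" -> date, "topic" -> topic, "count" -> count)
}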
I'd like to infer a Spark DataFrame schema from a directory of CSV files using a small subset of the rows (say limit(100)).
However, setting inferSchema to True means that the Input Size / Records for the FileScanRDD seems to always be equal to the number of rows in all the CSV files.
Is there a way to make the FileScan more selective, such that Spark looks at fewer rows when inferring a schema?
Note: setting the samplingRatio option to be < 1.0 does not have the desired behaviour, though it is clear that inferSchema uses only the sampled subset of rows.
You could read a subset of your input data into a Dataset[String].
The csv reader method accepts such a Dataset as a parameter.
Here is a simple example (I'll leave reading the sample of rows from the input file to you):
import spark.implicits._ // already in scope in spark-shell; needed elsewhere for .toDS
val data = List("1,2,hello", "2,3,what's up?")
val csvRDD = sc.parallelize(data)
val df = spark.read.option("inferSchema", "true").csv(csvRDD.toDS)
df.schema
When run in spark-shell, the final line from the above prints (I reformatted it for readability):
res4: org.apache.spark.sql.types.StructType =
StructType(
StructField(_c0,IntegerType,true),
StructField(_c1,IntegerType,true),
StructField(_c2,StringType,true)
)
That is the correct schema for my limited input data set.
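Once the schema has been inferred from the sample, a possible follow-up (a sketch; the path is hypothetical) is to reuse it for the full read so Spark does not scan everything again for type inference:
// Apply the sampled schema to the full data set; inferSchema is deliberately left off
val fullDF = spark.read
  .schema(df.schema)
  .csv("/path/to/csv/dir") // hypothetical directory of CSV files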
Assuming you are only interested in the schema, here is a possible approach based on cipri.l's post in this link:
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.execution.datasources.csv.{CSVOptions, TextInputCSVDataSource}

def inferSchemaFromSample(sparkSession: SparkSession, fileLocation: String, sampleSize: Int, isFirstRowHeader: Boolean): StructType = {
  // Build a Dataset composed of the first sampleSize lines from the input files as plain text strings
  val dataSample: Array[String] = sparkSession.read.textFile(fileLocation).head(sampleSize)
  import sparkSession.implicits._
  val sampleDS: Dataset[String] = sparkSession.createDataset(dataSample)
  // Provide information about the CSV files' structure
  val firstLine = dataSample.head
  val extraOptions = Map("inferSchema" -> "true", "header" -> isFirstRowHeader.toString)
  val csvOptions: CSVOptions = new CSVOptions(extraOptions, sparkSession.sessionState.conf.sessionLocalTimeZone)
  // Infer the CSV schema based on the sample data
  val schema = TextInputCSVDataSource.inferFromDataset(sparkSession, sampleDS, Some(firstLine), csvOptions)
  schema
}
Unlike GMc's answer above, this approach tries to directly infer the schema the same way DataFrameReader.csv() does in the background (but without going through the effort of building an additional Dataset with that schema, which we would then only use to retrieve the schema back from).
The schema is inferred based on a Dataset[String] containing only the first sampleSize lines from the input files as plain text strings.
When trying to retrieve samples from data, Spark has only 2 types of methods:
Methods that retrieve a given percentage of the data. This operation takes random samples from all partitions. It benefits from higher parallelism, but it must read all the input files.
Methods that retrieve a specific number of rows. This operation must collect the data on the driver, but it could read a single partition (if the required row count is low enough)
Since you mentioned that you want to use a specific small number of rows and want to avoid touching all of the data, I provided a solution based on option 2.
PS: The DataFrameReader.textFile method accepts paths to files and folders, and it also has a varargs variant, so you could pass in one or more files or folders.
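A usage sketch of the function above (the path and sample size are hypothetical):
// Infer the schema from the first 100 lines only, then reuse it for the full read
val schema = inferSchemaFromSample(spark, "/path/to/csv/dir", sampleSize = 100, isFirstRowHeader = true)
val fullDF = spark.read
  .schema(schema)
  .option("header", "true")
  .csv("/path/to/csv/dir")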
I am trying to convert a dataframe of multiple case classes to an rdd of these multiple case classes. I can't find any solution. This WrappedArray has driven me crazy :P
For example, assuming I am having the following:
case class randomClass(a:String,b: Double)
case class randomClass2(a:String,b: Seq[randomClass])
case class randomClass3(a:String,b:String)
val anRDD = sc.parallelize(Seq(
(randomClass2("a",Seq(randomClass("a1",1.1),randomClass("a2",1.1))),randomClass3("aa","aaa")),
(randomClass2("b",Seq(randomClass("b1",1.2),randomClass("b2",1.2))),randomClass3("bb","bbb")),
(randomClass2("c",Seq(randomClass("c1",3.2),randomClass("c2",1.2))),randomClass3("cc","Ccc"))))
val aDF = anRDD.toDF()
Assuming that I have the aDF, how can I get the anRDD back?
I tried something like this just to get the second column but it was giving an error:
aDF.map { case r:Row => r.getAs[randomClass3]("_2")}
You can convert indirectly using Dataset[randomClass3]:
aDF.select($"_2.*").as[randomClass3].rdd
Spark DataFrame / Dataset[Row] represents data as Row objects, using the mapping described in the Spark SQL, DataFrames and Datasets Guide. Any call to getAs should use this mapping.
For the second column, which is struct<a: string, b: string>, it would be a Row as well:
aDF.rdd.map { _.getAs[Row]("_2") }
As commented by Tzach Zohar, to get back a full RDD you'll need:
aDF.as[(randomClass2, randomClass3)].rdd
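Putting it together, a minimal sketch (assuming the same case classes and that spark is your SparkSession):
import spark.implicits._ // provides the Encoders used by .as[...]

// Full round trip back to the original element type, then materialize to check it
val recoveredRDD = aDF.as[(randomClass2, randomClass3)].rdd
recoveredRDD.collect().foreach(println)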
I don't know the Scala API, but have you considered the rdd value?
Maybe something like :
aDF.rdd.map { case r: Row => r.getAs[randomClass3]("_2") }
I am working on a Kafka Spark Streaming project. Spark Streaming gets data from Kafka. The data is in JSON format. Sample input:
{
"table": "tableA",
"Product_ID": "AGSVGF.upf",
"file_timestamp": "2018-07-26T18:58:08.4485558Z000000000000000",
"hdfs_file_name": "null_1532631600050",
"Date_Time": "2018-07-26T13:45:01.0000000Z",
"User_Name": "UBAHTSD"
}
{
"table": "tableB",
"Test_ID": "FAGS.upf",
"timestamp": "2018-07-26T18:58:08.4485558Z000000000000000",
"name": "flink",
"time": "2018-07-26T13:45:01.0000000Z",
"Id": "UBAHTGADSGSCVDGHASD"
}
One JSON string is one message. There are 15 types of JSON string, distinguished by the table column. Now I want to save these 15 different JSON types into Apache Hive. So I have created a DStream, and on the basis of the table column I have filtered the RDD and saved it into Hive. The code works fine, but sometimes it takes much more time than the Spark batch interval. I have controlled the input using spark.streaming.kafka.maxRatePerPartition=10. I have repartitioned the RDD into 9 partitions, but on the Spark UI it shows an unknown stage.
Here is my code.
val dStream = dataStream.transform(rdd => rdd.repartition(9)).map(_._2)
dStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val sparkContext = rdd.sparkContext
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    val hiveContext = getInstance(sparkContext)
    val tableA = rdd.filter(_.contains("tableA"))
    if (!tableA.isEmpty()) {
      HiveUtil.tableA(hiveContext.read.json(tableA))
      tableA.unpersist(true)
    }
    val tableB = rdd.filter(_.contains("tableB"))
    if (!tableB.isEmpty()) {
      HiveUtil.tableB(hiveContext.read.json(tableB))
      tableB.unpersist(true)
    }
    .....
    .... up to 15 tables
    ....
    val tableK = rdd.filter(_.contains("tableK"))
    if (!tableK.isEmpty()) {
      HiveUtil.tableK(hiveContext.read.json(tableK))
      tableK.unpersist(true)
    }
  }
}
How can I optimise the code?
Thank you.
Purely from a management perspective, I would suggest you parameterize your job to accept the table name, then run 15 separate Spark applications. Also ensure that the Kafka consumer group is different for each application.
This way, you can more easily monitor which Spark job is not performing as well as others and a skew of data to one table won't cause issues with others.
It's not clear what the Kafka message keys are, but if they are produced with the table as the key, then Spark could scale along with the Kafka partitions, and you're guaranteed that all messages for each table will be in order.
Overall, I would actually use Kafka Connect or StreamSets for writing to HDFS/Hive, so you don't have to write code or tune Spark settings.
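A minimal sketch of the parameterization idea; processSingleTable is a hypothetical helper, and saveAsTable stands in for whatever your HiveUtil methods actually do:
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream

// Each Spark application is launched for exactly one table,
// e.g. spark-submit ... myjob.jar tableA, and filters only for that table
def processSingleTable(tableName: String, dStream: DStream[String], hiveContext: SQLContext): Unit = {
  dStream.foreachRDD { rdd =>
    val matching = rdd.filter(_.contains(tableName))
    if (!matching.isEmpty()) {
      // Same filter-and-save pattern as before, but one table per application
      hiveContext.read.json(matching).write.mode("append").saveAsTable(tableName)
    }
  }
}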
How can I do stateless aggregations in Spark using Structured Streaming 2.3.0, without using flatMapGroupsWithState or the DStream API? I'm looking for a more declarative way.
Example:
select count(*) from some_view
I want the output to just count whatever records are available in each batch but not aggregate from the previous batch
To do stateless aggregations in Spark using Structured Streaming 2.3.0 without using flatMapGroupsWithState or the DStream API, you can use the following code:
import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

def countValues = (_: String, it: Iterator[(String, String)]) => it.length

val query =
  dataStream
    .select(lit("a").as("newKey"), col("value"))
    .as[(String, String)]
    .groupByKey { case (newKey, _) => newKey }
    .mapGroups[Int](countValues)
    .writeStream
    .format("console")
    .start()
Here is what we are doing:
We added one column to our dataStream, newKey, so that we can group over it using groupByKey. I have used the literal string "a", but you can use anything. You also need to select one column from the available columns in the dataStream; I have selected the value column for this purpose, but you can pick any one.
We created a mapping function, countValues, to count the values aggregated by groupByKey, simply by calling it.length.
So, in this way, we can count whatever records are available in each batch without aggregating across previous batches.
I hope it helps!
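For completeness, a minimal sketch of how dataStream might be defined; the topic and bootstrap servers are placeholders, and the spark-sql-kafka-0-10 package is assumed to be on the classpath:
val dataStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
  .option("subscribe", "some_topic")                   // placeholder
  .load()
  .selectExpr("CAST(value AS STRING) AS value")         // value as String, matching the code above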
I'm trying to write a DataFrame from Spark to Kafka and I couldn't find any solution out there. Can you please show me how to do that?
Here is my current code:
activityStream.foreachRDD { rdd =>
  val activityDF = rdd
    .toDF()
    .selectExpr(
      "timestamp_hour", "referrer", "action",
      "prevPage", "page", "visitor", "product", "inputProps.topic as topic")
  val producerRecord = new ProducerRecord(topicc, activityDF)
  kafkaProducer.send(producerRecord) // <--- this shows an error
}
type mismatch; found : org.apache.kafka.clients.producer.ProducerRecord[Nothing,org.apache.spark.sql.DataFrame] (which expands to) org.apache.kafka.clients.producer.ProducerRecord[Nothing,org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] required: org.apache.kafka.clients.producer.ProducerRecord[Nothing,String] Error occurred in an application involving default arguments.
Do collect on the activityDF to get the records (not Dataset[Row]) and save them to Kafka.
Note that you'll end up with a collection of records after collect so you probably have to iterate over it, e.g.
val activities = activityDF.collect()
// the following is pure Scala and has nothing to do with Spark
activities.foreach { a: Row =>
  val pr: ProducerRecord[String, String] = ??? // map a to pr; the key/value types here are an assumption
  kafkaProducer.send(pr)
}
Use pattern matching on Row to destructure it to fields/columns, e.g.
activities.foreach { case Row(timestamp_hour, referrer, action, prevPage, page, visitor, product, topic) =>
  // ...build a ProducerRecord pr from the extracted fields
  kafkaProducer.send(pr)
}
PROTIP: I'd strongly suggest using a case class and transforming the DataFrame (= Dataset[Row]) into a Dataset[YourCaseClass].
See Spark SQL's Row and Kafka's ProducerRecord docs.
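A sketch of that PROTIP; the Activity case class, its field types, and the value encoding are assumptions based on the column names in the question, and spark is assumed to be your SparkSession:
import org.apache.kafka.clients.producer.ProducerRecord
import spark.implicits._

// Field names mirror the selectExpr in the question; the types are guesses
case class Activity(timestamp_hour: Long, referrer: String, action: String,
                    prevPage: String, page: String, visitor: String,
                    product: String, topic: String)

activityDF.as[Activity].collect().foreach { a =>
  // kafkaProducer is the KafkaProducer[String, String] from the question's code;
  // any serialization of the record into a String would do here
  val value = s"${a.referrer},${a.action},${a.page},${a.product}"
  kafkaProducer.send(new ProducerRecord[String, String](a.topic, value))
}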
As Joe Nate pointed out in the comments:
If you do "collect" before writing to any endpoint, it's going to make all the data aggregate at the driver and then make the driver write it out. 1) Can crash the driver if too much data (2) no parallelism in write.
That's 100% correct. I wish I had said it :)
You may want to use the approach as described in Writing Stream Output to Kafka instead.
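For reference, a sketch of that approach using the built-in Kafka data source (available since Spark 2.2 with the spark-sql-kafka-0-10 package; the bootstrap servers are a placeholder):
import org.apache.spark.sql.functions.{col, struct, to_json}

// Inside foreachRDD, activityDF is a regular (batch) DataFrame, so the batch Kafka sink applies.
// Each row becomes one Kafka record: the whole row is serialized to JSON as the record value,
// and the existing "topic" column tells the sink which topic to write to.
activityDF
  .select(col("topic"), to_json(struct(activityDF.columns.map(col): _*)).as("value"))
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
  .save()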