Spark Avro write RDD to multiple directories by key - apache-spark

I need to split an RDD by first letters (A-Z) and write the files into directories respectively.
The simple solution is to filter the RDD for each letter, but this requires 26 passes.
There is a response to a similar question for writing to text files here, but I cannot figure out how to do this for Avro files.
Has anyone been able to do this?

You can use multipleoutputformat to do this
It is a two step task :-
First you need the multiple output format for avro. Below is the code for that:
package avro
import org.apache.hadoop.mapred.lib.MultipleOutputFormat
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.util.Progressable
import org.apache.avro.mapred.AvroOutputFormat
import org.apache.avro.mapred.AvroWrapper
import org.apache.hadoop.io.NullWritable
import org.apache.spark.rdd.RDD
import org.apache.hadoop.mapred.RecordWriter
class MultipleAvroFileOutputFormat[K] extends MultipleOutputFormat[AvroWrapper[K], NullWritable] {
val outputFormat = new AvroOutputFormat[K]
override def generateFileNameForKeyValue(key: AvroWrapper[K], value: NullWritable, name: String) = {
val name = key.datum().asInstanceOf[String].substring(0, 1)
name + "/" + name
}
override def getBaseRecordWriter(fs: FileSystem,
job: JobConf,
name: String,
arg3: Progressable) = {
outputFormat.getRecordWriter(fs, job, name, arg3).asInstanceOf[RecordWriter[AvroWrapper[K], NullWritable]]
}
}
In your driver code you have to mention that you want to use the Above given output format. You also need to mention the output schema for avro data. Below is sample driver code which stores a RDD of string in avro format with schema {"type":"string"}
package avro
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.hadoop.io.NullWritable
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.hadoop.mapred.JobConf
import org.apache.avro.mapred.AvroJob
import org.apache.avro.mapred.AvroWrapper
object AvroDemo {
def main(args: Array[String]): Unit = {
val conf = new SparkConf
conf.setAppName(args(0));
conf.setMaster("local[2]");
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[AvroWrapper[String]]))
val sc = new SparkContext(conf);
val input = sc.parallelize(Seq("one", "two", "three", "four"), 1);
val pairRDD = input.map(x => (new AvroWrapper(x), null));
val job = new JobConf(sc.hadoopConfiguration)
val schema = "{\"type\":\"string\"}"
job.set(AvroJob.OUTPUT_SCHEMA, schema) //set schema for avro output
pairRDD.partitionBy(new HashPartitioner(26)).saveAsHadoopFile(args(1), classOf[AvroWrapper[String]], classOf[NullWritable], classOf[MultipleAvroFileOutputFormat[String]], job, None);
sc.stop()
}
}

I hope you get a better answer than mine...
I've been in a similar situation myself, except with "ORC" instead of Avro. I basically threw up my hands and ended up calling the ORC file classes directly to write the files myself.
In your case, my approach would entail partitioning the data via "partitionBy" into 26 partitions, one for each first letter A-Z. Then call "mapPartitionsWithIndex", passing a function that outputs the i-th partition to an Avro file at the appropriate path. Finally, to convince Spark to actually do something, have mapPartitionsWithIndex return, say, a List containing the single boolean value "true"; and then call "count" on the RDD returned by mapPartitionsWithIndex to get Spark to start the show.
I found an example of writing an Avro file here: http://www.myhadoopexamples.com/2015/06/19/merging-small-files-into-avro-file-2/

Related

Case Class within foreachRDD causes Serialization Error

I can can create a DF inside foreachRDD if I do not try and use a Case Class and simply let default names for columns be made with toDF() or if I assign them via toDF("c1, "c2").
As soon as I try and use a Case Class, and having looked at the examples, I get:
Task not serializable
If I shift the Case Class statement around I then get:
toDF() not part of RDD[CaseClass]
It's legacy, but I am curious as to the nth Serialization error that Spark can produce and if it carries over into Structured Streaming.
I have an RDD that need not be split, may be that is the issue? NO. Running in DataBricks?
Coding is as follows:
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable
case class Person(name: String, age: Int) //extends Serializable // Some say inherently serializable so not required
val spark = SparkSession.builder
.master("local[4]")
.config("spark.driver.cores", 2)
.appName("forEachRDD")
.getOrCreate()
val sc = spark.sparkContext
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
val rddQueue = new mutable.Queue[RDD[List[(String, Int)]]]()
val QS = ssc.queueStream(rddQueue)
QS.foreachRDD(q => {
if(!q.isEmpty) {
import spark.implicits._
val q_flatMap = q.flatMap{x=>x}
val q_withPerson = q_flatMap.map(field => Person(field._1, field._2))
val df = q_withPerson.toDF()
df.show(false)
}
}
)
ssc.start()
for (c <- List(List(("Fred",53), ("John",22), ("Mary",76)), List(("Bob",54), ("Johnny",92), ("Margaret",15)), List(("Alfred",21), ("Patsy",34), ("Sylvester",7)) )) {
rddQueue += ssc.sparkContext.parallelize(List(c))
}
ssc.awaitTermination()
Having not grown up with Java, but having looked around I found out what to do, but am not expert enough to explain.
I was running in a DataBricks notebook where I prototype.
The clue is that the
case class Person(name: String, age: Int)
was inside the same DB Notebook. One needs to define the case class external to the current notebook - in a separate notebook - and thus separate to the class running the Streaming.

spark a single dataframe convert to two tables with id and a relationship of one to many

The data source is a csv:
name,companyName
shop1,com1
shop2,com1
shop3,com1
shop4,com2
shop5,com2
shop6,com3
I use spark to read it into a dataframe,and want to convert it to two dataframes in one to many relationship,the expected output are two dataframes.
One is companyDF:
companyId,companyName
1,com1
2,com2
3,com3
another is shopDF:
shopId, shopName, shopCompanyId
1,shop1,1
2,shop2,1
3,shop3,1
4,shop4,2
5,shop5,2
6,shop6,3
these two dataframes can join on shopDF.shopCompanyId=companyDF.companyId and getData.I used monotonically_increasing_id() to generate id like 1 2 3 4.. or there is a better way
I have write some code to do it,and works
package delme
import com.qydata.fuyun.util.Utils;
import scala.reflect.api.materializeTypeTag
import java.io.BufferedWriter
import java.io.InputStream
import java.io.InputStreamReader
import java.io.FileInputStream
import java.io.BufferedReader
import scala.util.control.Exception.Finally
import java.io.IOException
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
object OneToMany {
def main(args: Array[String]){
var sparkConf = new SparkConf().setMaster("local[*]")
val builder = SparkSession.builder().config(sparkConf)//.enableHiveSupport()
val ss = builder.getOrCreate()
val sc = ss.sparkContext
import ss.implicits._
var shopdf = ss.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("data/shops.csv")
val companydf=shopdf.select("companyName").distinct().withColumn("companyId", monotonically_increasing_id())
companydf.show()
shopdf=shopdf.join(companydf,shopdf.col("companyName")===companydf.col("companyName"),"left_outer").select("name", "companyName","companyId")
shopdf.show()
}
}
but I feel it is stupid,I want handle it just one time,not a "distinct" and a "join",first the operator on String maybe low efficiency,second if I have incremental data,I don`t how to handle it.for example,another batch of data is:
name,companyName
shop1a,com1
shop2a,com1
shop3a,com1
shop4a,com2
shop5a,com2
shop6a,com3
shop7,com4
And I want to append these to the old table,(I save the data to hive tables before in fact),I have no idea then.
Here the new company com4 with an id 4 should be append to company table,and (13,shop7,4) will be append to shop table
so how to convert the source to two dataframes?
val df1=df.withColumn(“companyId”,dense_rank.over(Window.orderBy(“companyName”))).withColumn(“shopId”,row_number.over(Window.orderBy(“name”)))
val companydf = df1.select(“companyName”,”companyId”).dropDuplicates
val shopdf= df1.select(col(“name”).alias(“shopName”,col(“shopId”),col(“companyId”).alias(“shopCompanyId”))

Reading Avro file in Spark and extracting column values

I want to read an avro file using Spark (I am using Spark 1.3.0 so I don't have data frames)
I read the avro file using this piece of code
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.SparkContext
private def readAvro(sparkContext: SparkContext, path: String) = {
sparkContext.newAPIHadoopFile[
AvroKey[GenericRecord],
NullWritable,
AvroKeyInputFormat[GenericRecord]
](path)
}
I execute this and get an RDD. Now from RDD, how do I extract value of specific columns? like loop through all records and give value of column name?
[edit]As suggested by Justin below I tried
val rdd = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](input)
rdd.map(record=> record._1.get("accountId")).toArray().foreach(println)
but I get an error
<console>:34: error: value get is not a member of org.apache.avro.mapred.AvroKey[org.apache.avro.generic.GenericRecord]
rdd.map(record=> record._1.get("accountId")).toArray().foreach(println)
AvroKey has a datum method to extract the wrapped value. And GenericRecord has a get method that accepts the column name as a string. So you can just extract the columns using a map
rdd.map(record=>record._1.datum.get("COLNAME"))

Xml processing in Spark

Scenario:
My Input will be multiple small XMLs and am Supposed to read these XMLs as RDDs. Perform join with another dataset and form an RDD and send the output as an XML.
Is it possible to read XML using spark, load the data as RDD? If it is possible how will the XML be read.
Sample XML:
<root>
<users>
<user>
<account>1234<\account>
<name>name_1<\name>
<number>34233<\number>
<\user>
<user>
<account>58789<\account>
<name>name_2<\name>
<number>54697<\number>
<\user>
<\users>
<\root>
How will this be loaded into the RDD?
Yes it possible but details will differ depending on an approach you take.
If files are small, as you've mentioned, the simplest solution is to load your data using SparkContext.wholeTextFiles. It loads data as RDD[(String, String)] where the the first element is path and the second file content. Then you parse each file individually like in a local mode.
For larger files you can use Hadoop input formats.
If structure is simple you can split records using textinputformat.record.delimiter. You can find a simple example here. Input is not a XML but it you should give you and idea how to proceed
Otherwise Mahout provides XmlInputFormat
Finally it is possible to read file using SparkContext.textFile and adjust later for record spanning between partitions. Conceptually it means something similar to creating sliding window or partitioning records into groups of fixed size:
use mapPartitionsWithIndex partitions to identify records broken between partitions, collect broken records
use second mapPartitionsWithIndex to repair broken records
Edit:
There is also relatively new spark-xml package which allows you to extract specific records by tag:
val df = sqlContext.read
.format("com.databricks.spark.xml")
.option("rowTag", "foo")
.load("bar.xml")
Here's the way to perform it using HadoopInputFormats to read XML data in spark as explained by #zero323.
Input data:
<root>
<users>
<user>
<account>1234<\account>
<name>name_1<\name>
<number>34233<\number>
<\user>
<user>
<account>58789<\account>
<name>name_2<\name>
<number>54697<\number>
<\user>
<\users>
<\root>
Code for reading XML Input:
You will get some jars at this link
Imports:
//---------------spark_import
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
//----------------xml_loader_import
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{ LongWritable, Text }
import com.cloudera.datascience.common.XmlInputFormat
Code:
object Tester_loader {
case class User(account: String, name: String, number: String)
def main(args: Array[String]): Unit = {
val sparkHome = "/usr/big_data_tools/spark-1.5.0-bin-hadoop2.6/"
val sparkMasterUrl = "spark://SYSTEMX:7077"
var jars = new Array[String](3)
jars(0) = "/home/hduser/Offload_Data_Warehouse_Spark.jar"
jars(1) = "/usr/big_data_tools/JARS/Spark_jar/avro/spark-avro_2.10-2.0.1.jar"
val conf = new SparkConf().setAppName("XML Reading")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.setMaster("local")
.set("spark.cassandra.connection.host", "127.0.0.1")
.setSparkHome(sparkHome)
.set("spark.executor.memory", "512m")
.set("spark.default.deployCores", "12")
.set("spark.cores.max", "12")
.setJars(jars)
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// ---- loading user from XML
// calling function 1.1
val pages = readFile("src/input_data", "<user>", "<\\user>", sc)
val xmlUserDF = pages.map { tuple =>
{
val account = extractField(tuple, "account")
val name = extractField(tuple, "name")
val number = extractField(tuple, "number")
User(account, name, number)
}
}.toDF()
println(xmlUserDF.count())
xmlUserDF.show()
}
Functions:
def readFile(path: String, start_tag: String, end_tag: String,
sc: SparkContext) = {
val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, start_tag)
conf.set(XmlInputFormat.END_TAG_KEY, end_tag)
val rawXmls = sc.newAPIHadoopFile(
path, classOf[XmlInputFormat], classOf[LongWritable],
classOf[Text], conf)
rawXmls.map(p => p._2.toString)
}
def extractField(tuple: String, tag: String) = {
var value = tuple.replaceAll("\n", " ").replace("<\\", "</")
if (value.contains("<" + tag + ">") &&
value.contains("</" + tag + ">")) {
value = value.split("<" + tag + ">")(1).split("</" + tag + ">")(0)
}
value
}
}
Output:
+-------+------+------+
|account| name|number|
+-------+------+------+
| 1234|name_1| 34233|
| 58789|name_2| 54697|
+-------+------+------+
The result obtained is in dataframes you can convert them to RDD as per your requirement like this->
val xmlUserRDD = xmlUserDF.toJavaRDD.rdd.map { x =>
(x.get(0).toString(),x.get(1).toString(),x.get(2).toString()) }
Please evaluate it, if it could help you some how.
This will help you.
package packagename;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import com.databricks.spark.xml.XmlReader;
public class XmlreaderSpark {
public static void main(String arr[]){
String localxml="file path";
String booksFileTag = "user";
String warehouseLocation = "file:" + System.getProperty("user.dir") + "spark-warehouse";
System.out.println("warehouseLocation" + warehouseLocation);
SparkSession spark = SparkSession
.builder()
.master("local")
.appName("Java Spark SQL Example")
.config("spark.some.config.option", "some-value").config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport().config("set spark.sql.crossJoin.enabled", "true")
.getOrCreate();
SQLContext sqlContext = new SQLContext(spark);
Dataset<Row> df = (new XmlReader()).withRowTag(booksFileTag).xmlFile(sqlContext, localxml);
df.show();
}
}
You need to add this dependency in your POM.xml:
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-xml_2.10</artifactId>
<version>0.4.0</version>
</dependency>
and your input file is not in proper format.
Thanks.
There are two good options for simple cases:
wholeTextFiles. Use map method with your XML parser which could be Scala XML pull parser (quicker to code) or the SAX Pull Parser (better performance).
Hadoop streaming XMLInputFormat which you must define the start and end tag <user> </user> to process it, however, it creates one partition per user tag
spark-xml package is a good option too.
With all options you are limited to only process simple XMLs which can be interpreted as dataset with rows and columns.
However, if we make it a little complex, those options won’t be useful.
For example, if you have one more entity there:
<root>
<users>
<user>...</users>
<companies>
<company>...</companies>
</root>
Now you need to generate 2 RDDs and change your parser to recognise the <company> tag.
This is just a simple case, but the XML could be much more complex and you would need to include more and more changes.
To solve this complexity we’ve built Flexter on top of Apache Spark to take the pain out of processing XML files on Spark. I also recommend to read about converting XML on Spark to Parquet. The latter post also includes some code samples that show how the output can be queried with SparkSQL.
Disclaimer: I work for Sonra

Spark: Reading files using different delimiter than new line

I'm using Apache Spark 1.0.1. I have many files delimited with UTF8 \u0001 and not with the usual new line \n. How can I read such files in Spark? Meaning, the default delimiter of sc.textfile("hdfs:///myproject/*") is \n, and I want to change it to \u0001.
You can use textinputformat.record.delimiter to set the delimiter for TextInputFormat, E.g.,
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "X")
val input = sc.newAPIHadoopFile("file_path", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
val lines = input.map { case (_, text) => text.toString}
println(lines.collect)
For example, my input is a file containing one line aXbXcXd. The above code will output
Array(a, b, c, d)
In Spark shell, I extracted data according to Setting textinputformat.record.delimiter in spark:
$ spark-shell
...
scala> import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.LongWritable
scala> import org.apache.hadoop.io.Text
import org.apache.hadoop.io.Text
scala> import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.conf.Configuration
scala> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
scala> val conf = new Configuration
conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml
scala> conf.set("textinputformat.record.delimiter", "\u0001")
scala> val data = sc.newAPIHadoopFile("mydata.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf).map(_._2.toString)
data: org.apache.spark.rdd.RDD[(org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text)] = NewHadoopRDD[0] at newAPIHadoopFile at <console>:19
sc.newAPIHadoopFile("mydata.txt", ...) is a RDD[(LongWritable, Text)], where the first part of the elements is the starting character index, and the second part is the actual text delimited by "\u0001".
In python this could be achieved using:
rdd = sc.newAPIHadoopFile(YOUR_FILE, "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
"org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text",
conf={"textinputformat.record.delimiter": YOUR_DELIMITER}).map(lambda l:l[1])
Here is a ready-to-use version of Chad's and #zsxwing's answers for Scala users, which can be used this way:
sc.textFile("some/path.txt", "\u0001")
The following snippet creates an additional textFile method implicitly attached to the SparkContext using an implicit class (in order to replicate SparkContext's default textFile method):
package com.whatever
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
object Spark {
implicit class ContextExtensions(val sc: SparkContext) extends AnyVal {
def textFile(
path: String,
delimiter: String,
maxRecordLength: String = "1000000"
): RDD[String] = {
val conf = new Configuration(sc.hadoopConfiguration)
// This configuration sets the record delimiter:
conf.set("textinputformat.record.delimiter", delimiter)
// and this one limits the size of one record:
conf.set("mapreduce.input.linerecordreader.line.maxlength", maxRecordLength)
sc.newAPIHadoopFile(
path,
classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
conf
)
.map { case (_, text) => text.toString }
}
}
}
which can be used this way:
import com.whatever.Spark.ContextExtensions
sc.textFile("some/path.txt", "\u0001")
Note the additional setting mapreduce.input.linerecordreader.line.maxlength which limits the maximum size of a record. This comes in handy when reading from a corrupted file for which a record could be too long to fit in memory (more chances of it happening when playing with the record delimiter).
With this setting, when reading a corrupted file, an exception (java.io.IOException - thus catchable) will be thrown rather than getting a messy out of memory which will stop the SparkContext.
If you are using spark-context, the below code helped me
sc.hadoopConfiguration.set("textinputformat.record.delimiter","delimeter")

Resources