Skipping first few lines in Spark - apache-spark

I have Spark 2.0 code which reads .gz (text) files and writes them to a Hive table.
How do I ignore the first two lines from all of my files? I just want to skip the first two lines.
SparkSession spark = SparkSession
    .builder()
    .master("local")
    .appName("SparkSessionFiles")
    .config("spark.some.config.option", "some-value")
    .enableHiveSupport()
    .getOrCreate();

JavaRDD<mySchema> peopleRDD = spark.read()
    .textFile("file:///app/home/emm/zipfiles/myzips/")
    .javaRDD()
    .map(new Function<String, mySchema>()
    {
        @Override
        public mySchema call(String line) throws Exception
        {
            String[] parts = line.split(";");
            mySchema mySchema = new mySchema();
            mySchema.setCFIELD1(parts[0]);
            mySchema.setCFIELD2(parts[1]);
            mySchema.setCFIELD3(parts[2]);
            mySchema.setCFIELD4(parts[3]);
            mySchema.setCFIELD5(parts[4]);
            return mySchema;
        }
    });

// Apply a schema to an RDD of JavaBeans to get a DataFrame
Dataset<Row> myDF = spark.createDataFrame(peopleRDD, mySchema.class);
myDF.createOrReplaceTempView("myView");
spark.sql("INSERT INTO myHIVEtable SELECT * from myView");
UPDATE: Modified code
Lambdas are not working in my Eclipse, so I used regular Java syntax. I am getting an exception now.
.....
Function2<Integer, Iterator<String>, Iterator<String>> removeHeader =
    new Function2<Integer, Iterator<String>, Iterator<String>>() {
        public Iterator<String> call(Integer ind, Iterator<String> iterator) throws Exception {
            System.out.println("ind=" + ind);
            if ((ind == 0) && iterator.hasNext()) {
                iterator.next();
                iterator.next();
                return iterator;
            } else
                return iterator;
        }
    };

JavaRDD<mySchema> peopleRDD = spark.read()
    .textFile(path) // file:///app/home/emm/zipfiles/myzips/
    .javaRDD()
    .mapPartitionsWithIndex(removeHeader, false)
    .map(new Function<String, mySchema>()
    {
    ........
java.util.NoSuchElementException
at java.util.LinkedList.removeFirst(LinkedList.java:268)
at java.util.LinkedList.remove(LinkedList.java:683)
at org.apache.spark.sql.execution.BufferedRowIterator.next(BufferedRowIterator.java:49)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.next(WholeStageCodegenExec.scala:374)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.next(WholeStageCodegenExec.scala:368)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
at com.comcast.emm.vodip.SparkSessionFiles.SparkSessionFiles$1.call(SparkSessionFiles.java:2480)
at com.comcast.emm.vodip.SparkSessionFiles.SparkSessionFiles$1.call(SparkSessionFiles.java:2476)

You could do something like this:
JavaRDD<mySchema> peopleRDD = spark.read()
    .textFile("file:///app/home/emm/zipfiles/myzips/")
    .javaRDD()
    .mapPartitionsWithIndex((index, iter) -> {
        if (index == 0 && iter.hasNext()) {
            iter.next();
            if (iter.hasNext()) {
                iter.next();
            }
        }
        return iter;
    }, true);
...
In Scala, the syntax is simpler. For example:
rdd.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(2) else iter }
EDIT:
I modified the code to avoid the exception.
This code will only delete the first 2 lines of the RDD, not of every file.
If you want to remove the first 2 lines of every file, I suggest you build an RDD for each file, apply .mapPartitionsWithIndex(...) to each RDD, then union the RDDs.
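A rough sketch of that per-file approach (the directory listing with java.io.File, the JavaSparkContext wrapper and the variable names are my own illustration, not part of the original answer; it assumes Java 8 lambdas and requires imports for java.io.File and org.apache.spark.api.java.JavaSparkContext):

// Hedged sketch: one RDD per file, skip its first two lines, then union everything.
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
JavaRDD<String> allLines = jsc.emptyRDD();
for (File gz : new File("/app/home/emm/zipfiles/myzips/").listFiles()) {
    JavaRDD<String> oneFile = spark.read()
        .textFile("file://" + gz.getAbsolutePath())
        .javaRDD()
        .mapPartitionsWithIndex((index, iter) -> {
            // a .gz file is a single gzip stream, so it lands in one partition;
            // dropping two lines from partition 0 drops that file's header lines
            if (index == 0) {
                if (iter.hasNext()) iter.next();
                if (iter.hasNext()) iter.next();
            }
            return iter;
        }, false);
    allLines = allLines.union(oneFile);
}

The resulting allLines RDD can then be mapped to mySchema beans exactly as in the original code.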

Related

SparkSession null pointer exception in Dataset foreach

I am new to Spark.
I want to keep getting messages from Kafka, and then save them to S3 once the number of messages goes over 100000.
I implemented it with Dataset.collectAsList(), but it throws an error: Total size of serialized results of 3 tasks (1389.3 MiB) is bigger than spark.driver.maxResultSize.
So I turned to foreach, and it throws a NullPointerException when the SparkSession is used in createDataFrame.
Any idea about it? Thanks.
---Code---
SparkSession spark = generateSparkSession();
registerUdf4AddPartition(spark);
Dataset<Row> dataset = spark.readStream().format("kafka")
    .option("kafka.bootstrap.servers", args[0])
    .option("kafka.group.id", args[1])
    .option("subscribe", args[2])
    .option("kafka.security.protocol", SecurityProtocol.SASL_PLAINTEXT.name)
    .load();
DataStreamWriter<Row> console = dataset.toDF().writeStream().foreachBatch((rawDataset, time) -> {
    Dataset<Row> rowDataset = rawDataset.selectExpr("CAST(value AS STRING)");
    // using foreach
    rowDataset.foreach(row -> {
        List<Span> rawDataList = new CsvToBeanBuilder(new StringReader(row.getString(0))).withType(Span.class).build().parse();
        spans.addAll(rawDataList);
        batchSave(spark);
    });
    // using collectAsList
    List<Row> rows = rowDataset.collectAsList();
    for (Row row : rows) {
        List<Span> rawDataList = new CsvToBeanBuilder(new StringReader(row.getString(0))).withType(Span.class).build().parse();
        spans.addAll(rawDataList);
        batchSave(spark);
    }
});
StreamingQuery start = console.start();
start.awaitTermination();

public static void batchSave(SparkSession spark){
    synchronized (spans){
        if(spans.size() == 100000){
            System.out.println(spans.isEmpty());
            Dataset<Row> spanDataSet = spark.createDataFrame(spans, Span.class);
            Dataset<Row> finalResult = addCustomizedTimeByUdf(spanDataSet);
            StringBuilder pathBuilder = new StringBuilder("s3a://fwk-dataplatform-np/datalake/log/FWK/ART2/test/leftAndRight");
            finalResult.repartition(1).write().partitionBy("year","month","day","hour").format("csv").mode("append").save(pathBuilder.toString());
            spans.clear();
        }
    }
}
The main SparkSession runs on the driver, while the tasks inside foreach run distributed on the executors, so the spark reference is not defined on the executors.
Also, there is no point in using synchronized inside the foreach task, since the work is distributed across executor JVMs.
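One way around it, sketched loosely here, is to keep all Dataset operations inside foreachBatch (which runs on the driver) and never call the SparkSession from executor-side code. The column names and the output path below are placeholders, and this assumes the CSV payload can be parsed with Spark's own SQL functions instead of CsvToBean on the driver:

// Hedged sketch: stay inside foreachBatch and let Spark do the per-row work.
dataset.toDF().writeStream().foreachBatch((rawDataset, time) -> {
    Dataset<Row> parsed = rawDataset
        .selectExpr("CAST(value AS STRING) AS csv")
        .selectExpr("split(csv, ',') AS cols")
        .selectExpr("cols[0] AS traceId", "cols[1] AS spanId"); // placeholder columns
    parsed.write()
        .format("csv")
        .mode("append")
        .save("s3a://some-bucket/some-prefix/");                // placeholder path
}).start().awaitTermination();

Size-based batching (the 100000-row threshold) could then be approximated with the Kafka source's maxOffsetsPerTrigger option or a trigger interval, instead of a shared list on the driver.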

Cogrouping not supported in streaming DataSet/DataFrames

Executing action:Metrics:P{"input":"tripMetrics","isEnabled":"true","class":"com.mobileum.wcmodel.execution.actions.SaveStreamAction","properties":{"path":"output/tripMetrics","triggerWindow":"5 minutes","checkpointLocation":"output/checkpoints/tripMetrics","format":"console","queryName":"GtpDetailModel"}}}
Exception in thread "main" org.apache.spark.sql.AnalysisException: CoGrouping with a streaming DataFrame/Dataset is not supported;;
I have a use case where I have to cogroup two datasets in streaming. However, when doing so I am getting an exception that cogrouping of Datasets/DataFrames in streaming is not supported.
@Override
public List<Dataset<Row>> transform(SparkSession sparkSession, Map<String, Dataset<Row>> inputDatasets, Properties properties) {
    Encoder<Row> encoder = RowEncoder.apply((StructType) new CatalystSqlParser(sparkSession.sqlContext().conf()).parseDataType("struct<hostnetworkid:string,partnercountryid:string>"));
    try {
        Iterator<Map.Entry<String, Dataset<Row>>> itr = inputDatasets.entrySet().iterator();
        Dataset<Row> trip = null;
        Dataset<Row> registration = null;
        while (itr.hasNext()) {
            trip = itr.next().getValue();
            registration = itr.next().getValue();
        }
        KeyValueGroupedDataset<Long, TripModel> tripKeyValueGroupedDataset =
            trip.map((MapFunction<Row, TripModel>) TripModel::new, Encoders.bean(TripModel.class))
                .groupByKey((MapFunction<TripModel, Long>) TripModel::getKey, Encoders.LONG());
        KeyValueGroupedDataset<Long, RegistrationModel> regKeyValueGroupedDataset =
            registration.map((MapFunction<Row, RegistrationModel>) RegistrationModel::new, Encoders.bean(RegistrationModel.class))
                .groupByKey((MapFunction<RegistrationModel, Long>) RegistrationModel::getKey, Encoders.LONG());
        Dataset<Row> cogrouped = tripKeyValueGroupedDataset.cogroup(regKeyValueGroupedDataset, (CoGroupFunction<Long, TripModel, RegistrationModel, Row>) (key, it1, it2) ->
        {
            Iterable<TripModel> iterable = () -> it1;
            List<TripModel> tripModelList = StreamSupport
                .stream(iterable.spliterator(), false)
                .collect(Collectors.toList());
            List<Row> a1 = new ArrayList<Row>();
            a1.add(RowFactory.create(tripModelList.get(0).getCosid(), "asdf"));
            return a1.iterator();

Spark : cleaner way to build Dataset out of Spark streaming

I want to create an API which looks like this:
public Dataset<Row> getDataFromKafka(SparkContext sc, String topic, StructType schema);
Here:
topic - the Kafka topic name from which the data is going to be consumed.
schema - the schema information for the Dataset.
So my function contains the following code:
JavaStreamingContext jsc = new JavaStreamingContext(javaSparkContext, Durations.milliseconds(2000L));
JavaPairInputDStream<String, String> directStream = KafkaUtils.createDirectStream(
    jsc, String.class, String.class,
    StringDecoder.class, StringDecoder.class,
    kafkaConsumerConfig(), topics
);
Dataset<Row> dataSet = sqlContext.createDataFrame(javaSparkContext.emptyRDD(), schema);
DataSetHolder holder = new DataSetHolder(dataSet);
LongAccumulator stopStreaming = sc.longAccumulator("stop");
directStream.foreachRDD(rdd -> {
    RDD<Row> rows = rdd.values().map(value -> {
        // get type of message from value
        Row row = null;
        if (END == msg) {
            stopStreaming.add(1);
            row = null;
        } else {
            row = new GenericRow(/*row data created from values*/);
        }
        return row;
    }).filter(row -> row != null).rdd();
    holder.union(sqlContext.createDataFrame(rows, schema));
    holder.get().count();
});
jsc.start();
// stop the stream if the stopStreaming value is greater than 0; this runs in a separate thread.
return holder.get();
Here DataSetHolder is a wrapper class around Dataset to combine the results of all the RDDs.
class DataSetHolder {
    private Dataset<Row> df = null;

    public DataSetHolder(Dataset<Row> df) {
        this.df = df;
    }

    public void union(Dataset<Row> frame) {
        this.df = df.union(frame);
    }

    public Dataset<Row> get() {
        return df;
    }
}
This doesn't look good at all, but I had to do it. I am wondering what the good way to do it is, or whether Spark has any provision for this.
Update
So after consuming all the data from the stream, i.e. from the Kafka topic, we create a dataframe out of it so that the data analysts can register it as a temp table and fire any query against it to get meaningful results.
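For what it's worth, a loose sketch of how Structured Streaming could cover this use case without the manual union (this assumes a SparkSession named spark and Java 8 lambdas; the broker address and the query name are placeholders, not part of the question):

// Hedged sketch: the Kafka source plus a "memory" sink keeps the accumulated rows
// in an in-memory table that analysts can query with SQL.
Dataset<Row> kafkaStream = spark.readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "host:9092")   // placeholder broker
    .option("subscribe", topic)
    .load()
    .selectExpr("CAST(value AS STRING) AS value");

StreamingQuery query = kafkaStream.writeStream()
    .format("memory")              // accumulates the stream into an in-memory table
    .queryName("kafka_data")       // placeholder temp table name
    .outputMode("append")
    .start();

// Ad hoc queries against the accumulating table:
spark.sql("SELECT count(*) FROM kafka_data").show();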

Best Approach for duplicate filtering in a Java Spark JavaPairRDD

I am learning Spark and would like advice on the best approach for solving the problem below.
I have 2 datasets, users and transactions, as below, and would like to join them to find the unique locations per item sold.
The headers for the files are as below
id,email,language,location ----------- USER HEADERS
txid,productid,userid,price,desc -------------------- TRANSACTION HEADERS
Below is my approach
/*
* Load user data set into userDataFrame
* Load transaction data set into transactionDataFrame
* join both on user id - userTransactionFrame
* select productid and location columns from the joined dataset into a new dataframe - productIdLocationDataFrame
* convert the new dataframe into a javardd - productIdLocationJavaRDD
* make the javardd a pair rdd - productIdLocationJavaPairRDD
* group the pair rdd by key - productLocationList
* apply mapvalues on the grouped key to convert the list of values to a set of values for duplicate filtering - productUniqLocations
*
* */
I am not very sure that I have done this the right way and still feel "can be done better, differently".
I am doubtful of the part where I have done duplicate filtering from the JavaPairRDD.
Please evaluate the approach and code and let me know better solutions.
Code
SparkConf conf = new SparkConf();
conf.setAppName("Sample App - Uniq Location per item");
JavaSparkContext jsc = new JavaSparkContext("local[*]", "A 1");
//JavaSparkContext jsc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(jsc);

//id email language location ----------- USER HEADERS
DataFrame userDataFrame = sqlContext.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .option("delimiter", "\t")
    .load("user");

//txid pid uid price desc -------------------- TRANSACTION HEADERS
DataFrame transactionDataFrame = sqlContext.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .option("delimiter", "\t")
    .load("transactions");

Column joinColumn = userDataFrame.col("id").equalTo(transactionDataFrame.col("uid"));
DataFrame userTransactionFrame = userDataFrame.join(transactionDataFrame, joinColumn, "rightouter");

DataFrame productIdLocationDataFrame = userTransactionFrame.select(userTransactionFrame.col("pid"), userTransactionFrame.col("location"));
JavaRDD<Row> productIdLocationJavaRDD = productIdLocationDataFrame.toJavaRDD();

JavaPairRDD<String, String> productIdLocationJavaPairRDD = productIdLocationJavaRDD.mapToPair(new PairFunction<Row, String, String>() {
    public Tuple2<String, String> call(Row inRow) throws Exception {
        return new Tuple2(inRow.get(0), inRow.get(1));
    }
});

JavaPairRDD<String, Iterable<String>> productLocationList = productIdLocationJavaPairRDD.groupByKey();

JavaPairRDD<String, Iterable<String>> productUniqLocations = productLocationList.mapValues(new Function<Iterable<String>, Iterable<String>>() {
    public Iterable<String> call(Iterable<String> inputValues) throws Exception {
        return new HashSet<String>((Collection<? extends String>) inputValues);
    }
});

productUniqLocations.saveAsTextFile("uniq");
The good part is that the code runs and generates the output that I expect.
The lowest hanging fruit is getting rid of groupByKey.
Using aggregateByKey should do the job since the output type of the value is different (we want a set per key).
Code in Scala:
pairRDD.aggregateByKey(new java.util.HashSet[String])(
  (locationSet, location) => { locationSet.add(location); locationSet },
  (locSet1, locSet2) => { locSet1.addAll(locSet2); locSet1 }
)
Java equivalent:
Function2<HashSet<String>, String, HashSet<String>> sequenceFunction = new Function2<HashSet<String>, String, HashSet<String>>() {
    public HashSet<String> call(HashSet<String> aSet, String arg1) throws Exception {
        aSet.add(arg1);
        return aSet;
    }
};

Function2<HashSet<String>, HashSet<String>, HashSet<String>> combineFunc = new Function2<HashSet<String>, HashSet<String>, HashSet<String>>() {
    public HashSet<String> call(HashSet<String> arg0, HashSet<String> arg1) throws Exception {
        arg0.addAll(arg1);
        return arg0;
    }
};

JavaPairRDD<String, HashSet<String>> byKey = productIdLocationJavaPairRDD.aggregateByKey(new HashSet<String>(), sequenceFunction, combineFunc);
Secondly, joins work best when the datasets are co-partitioned.
Since you are dealing with dataframes, partitioning out of the box is not possible if you are using Spark < 1.6. Thus, you may want to read the data into RDDs, partition them, and then create the dataframes. For your use case, it might be better not to involve dataframes at all.
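A rough sketch of that RDD-based route (the column names follow the headers in the question, but the partition count and variable names are assumptions on my part; it needs imports for org.apache.spark.HashPartitioner, org.apache.spark.api.java.JavaPairRDD and scala.Tuple2, plus Java 8 lambdas):

// Hedged sketch: co-partition both pair RDDs with the same partitioner before the
// join, then deduplicate locations per product with distinct().
HashPartitioner partitioner = new HashPartitioner(8);   // partition count is a tuning choice

JavaPairRDD<String, String> usersByUid = userDataFrame.toJavaRDD()
    .mapToPair(r -> new Tuple2<>(String.valueOf(r.getAs("id")), String.valueOf(r.getAs("location"))))
    .partitionBy(partitioner);

JavaPairRDD<String, String> txByUid = transactionDataFrame.toJavaRDD()
    .mapToPair(r -> new Tuple2<>(String.valueOf(r.getAs("uid")), String.valueOf(r.getAs("pid"))))
    .partitionBy(partitioner);

// join yields (uid, (location, pid)); re-key by product id and drop duplicate locations
JavaPairRDD<String, String> productToLocation = usersByUid.join(txByUid)
    .mapToPair(t -> new Tuple2<>(t._2()._2(), t._2()._1()))
    .distinct();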

How can I use a Spark SQLContext object in a Spark SQL registered function?

I am new to Spark SQL. The concat function is not available in the Spark SQL query, so we registered a SQL function for it; inside this function I need to access another table, and for that we wrote a Spark SQL query against the SQLContext object.
When I invoke this query I get a NullPointerException. Can you please help with this?
Thanks in advance.
// This is my code
class SalesHistory_2(sqlContext: SQLContext, sparkContext: SparkContext) extends Serializable {
  import sqlContext._
  import sqlContext.createSchemaRDD

  try {
    sqlContext.registerFunction("MaterialTransformation", Material_Transformation _)

    def Material_Transformation(Material_ID: String): String =
    {
      var material: String = null;
      var dd = sqlContext.sql("select * from product_master")
      material
    }

    /* Product master */
    val productRDD = this.sparkContext.textFile("D:\\Realease 8.0\\files\\BHI\\BHI_SOP_PRODUCT_MASTER.txt")
    val product_schemaString = productRDD.first
    val product_withoutHeaders = dropHeader(productRDD)
    val product_schema = StructType(product_schemaString.split("\\|").map(fieldName => StructField(fieldName, StringType, true)))
    val productdata = product_withoutHeaders.map { _.replace("|", "| ") }.map(x => x.split("\\|"))
    var product_rowRDD = productdata.map(line => {
      Row.fromSeq(line.map { _.trim() })
    })
    val product_srctableRDD = sqlContext.applySchema(product_rowRDD, product_schema)
    product_srctableRDD.registerTempTable("product_master")
    cacheTable("product_master")

    /* Customer master */

    /* Sales History */
    val srcRDD = this.sparkContext.textFile("D:\\Realease 8.0\\files\\BHI\\BHI_SOP_TRADE_SALES_HISTORY_DS_4_20150119.txt")
    val schemaString = srcRDD.first
    val withoutHeaders = dropHeader(srcRDD)
    val schema = StructType(schemaString.split("\\|").map(fieldName => StructField(fieldName, StringType, true)))
    val lines = withoutHeaders.map { _.replace("|", "| ") }.map(x => x.split("\\|"))
    var rowRDD = lines.map(line => {
      Row.fromSeq(line.map { _.trim() })
    })
    val srctableRDD = sqlContext.applySchema(rowRDD, schema)
    srctableRDD.registerTempTable("SALES_HISTORY")

    val srcResults = sqlContext.sql("SELECT Delivery_Number,Delivery_Line_Item,MaterialTransformation(Material_ID),Customer_Group_Node,Ops_ID,DC_ID,Mfg_ID,PGI_Date,Delivery_Qty,Customer_Group_Node,Line_Total_COGS,Line_Net_Rev,Material_Description,Sold_To_Partner_Name,Plant_Description,Originating_Doc,Orig_Doc_Line_item,Revenue_Type,Material_Doc_Ref,Mater_Doc_Ref_Item,Req_Delivery_Date FROM SALES_HISTORY")

    val path: Path = Path("D:/Realease 8.0/files/output/")
    try {
      path.deleteRecursively(continueOnFailure = false)
    } catch {
      case e: IOException => // some file could not be deleted
    }
    val successRDDToFile = srcResults.map { x => x.mkString("|") }
    successRDDToFile.coalesce(1).saveAsTextFile("D:/Realease 8.0/files/output/")
  }
  catch {
    case ex: Exception => println(ex) // TODO: handle error
  }

  this.sparkContext.stop()

  def dropHeader(data: RDD[String]): RDD[String] = {
    data.mapPartitionsWithIndex((idx, lines) => {
      if (idx == 0) {
        lines.drop(1)
      }
      lines
    })
  }
The answer here is rather short and probably disappointing: you simply cannot do something like this.
The general rule in Spark is that you cannot trigger an action or transformation from inside another action or transformation; or, to be a bit more precise, outside the driver the SparkContext is no longer accessible / defined.
Calling Spark SQL for each row of the sales history RDD looks like a very bad idea:
val srcResults = sqlContext.sql("SELECT Delivery_Number,Delivery_Line_Item,MaterialTransformation(Material_ID),Customer_Group_Node,Ops_ID,DC_ID,Mfg_ID,PGI_Date,Delivery_Qty,Customer_Group_Node,Line_Total_COGS,Line_Net_Rev,Material_Description,Sold_To_Partner_Name,Plant_Description,Originating_Doc,Orig_Doc_Line_item,Revenue_Type,Material_Doc_Ref,Mater_Doc_Ref_Item,Req_Delivery_Date FROM SALES_HISTORY")
You'd be better off using a join between your tables and forgetting your custom function:
val srcResults = sqlContext.sql("SELECT s.*, p.* FROM SALES_HISTORY s join product_master p on s.Material_ID=p.ID")
