NPE reading a Spark Dataset - apache-spark

I have an RDD whose rows I group by size (each row has a different size) inside a foreachPartition. Then, for each of these groups (a list of Strings), I create a new Dataset so I can write it to storage. Essentially, I am trying to split the RDD by size and write to the target in size-controlled chunks. Spark's maxRecordsPerFile option won't work for me because each row contains a blob, so row sizes vary widely. Here is my code; I am getting a NullPointerException when reading the newly created Dataset:
Caller:
JavaRDD<String> jsonRDD = xmlDataSet.toJavaRDD().map(xmlRow -> extractXMLBlobToJsonString(xmlRow));
GroupBySizeManager mgr = GroupBySizeManager.getInstance();
jsonRDD.foreachPartition(mgr::splitPartition);
Function splitPartition in GroupBySizeManager.java
public void splitPartition(Iterator<String> rowsInPartition) {
    System.out.println("********Number of groups " + this.getGroups().size());
    SparkSession sparkSession = SparkDataflowConfig.getInstance().getSparkSession();
    while (rowsInPartition.hasNext()) {
        addToGroup(rowsInPartition.next());
    }
    for (List<String> rows : getRows()) {
        System.out.println("******Size of List: " + rows.size());
        Dataset<String> dataset = sparkSession.createDataset(rows, Encoders.STRING());
        Dataset<Row> xmldf = sparkSession.read().json(dataset); // NPE here
        System.out.println("Count of xmlfiltered: " + xmldf.count());
        xmldf.write().format("csv").option("header", "true").save("/home/myfilepath/");
    }
}
I receive the following NPE on the sparkSession.read().json(dataset) call above. The same code works if I run it on my initial RDD, before this grouping and Dataset recreation:
java.lang.NullPointerException
at org.apache.spark.sql.execution.SparkPlan.sparkContext(SparkPlan.scala:67)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:132)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:131)
at org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$.inferFromDataset(JsonDataSource.scala:104)
at org.apache.spark.sql.DataFrameReader.$anonfun$json$1(DataFrameReader.scala:567)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:567)
at com.bm.xchange.spark.executor.GroupBySizeManager.splitPartition(GroupBySizeManager.java:57)
at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreachPartition$1(JavaRDDLike.scala:219)
at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreachPartition$1$adapted(JavaRDDLike.scala:219)
at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1020)
I have been reading more and I think I know why I am seeing the exception: it seems a Dataset can only be created on the driver, not inside an executor, so this logic of mine won't work. Is there a better way to do this (split a partition based on file size) without causing a shuffle or repartition?
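One workaround I am considering is to skip creating a Dataset on the executor altogether and write each group out directly from inside foreachPartition with the Hadoop FileSystem API. A rough, untested sketch (the output path and file naming are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.UUID;

// Rough sketch: write each size-based group as one file, directly from the executor.
public void writeGroups(List<List<String>> groups) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    for (List<String> rows : groups) {
        // UUID in the name avoids collisions between executors/partitions
        Path out = new Path("/home/myfilepath/part-" + UUID.randomUUID() + ".json");
        try (FSDataOutputStream os = fs.create(out)) {
            for (String row : rows) {
                os.write((row + "\n").getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}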
Appreciate your help
thank you

Related

Use RDD.foreach to Create a Dataframe and execute actions on the Dataframe in Spark scala

I'm trying to read a config file with spark.read.textFile that basically contains my table list. My task is to iterate through that list and convert each table from Avro to ORC format. Please find below the code snippet that does the logic.
val tableList = spark.read.textFile("tables.txt")
tableList.collect().foreach(tblName => {
  val df = spark.read.format("avro").load(inputPath + "/" + tblName)
  df.write.format("orc").mode("overwrite").save(outputPath + "/" + tblName)
})
Please find my configurations below
DriverMemory: 4GB
ExecutorMemory: 10GB
NoOfExecutors: 5
Input DataSize: 45GB
My question is: will this execute on the executor or the driver? Will this throw an out-of-memory error? Please share your suggestions.
Re:
will this execute in Executor or Driver?
Once you call tableList.collect(), the contents of 'tables.txt' will be brought to the driver application. As long as it fits well within the driver memory, you should be alright.
However, the save operation on the DataFrame will be executed on the executors.
Re:
Will this throw an out-of-memory error?
Have you actually faced one? IMO, unless your tables.txt is huge, you should be alright. I am assuming the 45 GB input data size refers to the data in the tables listed in tables.txt.
Hope this helps.
I would suggest eliminating the collect, since it is an action and therefore all the data from your 45 GB file is loaded into memory. You can try something like this:
val tableList = spark.read.textFile("tables.txt")
tableList.foreach(tblName => {
  val df = spark.read.format("avro").load(inputPath + "/" + tblName)
  df.write.format("orc").mode("overwrite").save(outputPath + "/" + tblName)
})

Converting a Dataset into List<Row> is taking too much time

I am calculating TF-IDF, and for that I need to convert my Dataset into a List<Row>.
My Dataset has 4,000,000 records; when I call the collectAsList function on it, it takes more than 20 minutes to complete.
My machine has 16 GB of RAM.
Basically, I need to work on each individual row to calculate TF-IDF for that particular record.
Please suggest whether there is any other function in Spark to convert a Dataset into a List<Row>.
I also tried for and foreach loops, but they still take a long time.
Below is my sample code.
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
SQLContext sqlContext = new SQLContext(sc);
SparkSession spark = SparkSession.builder().appName("connection example").getOrCreate();
Dataset<Row> tokenlist = sqlContext.read().format("com.databricks.spark.csv").option("header", "true").option("nullValue", "").load("D:\\AI_MATCHING\\exampleTFIDF.csv");
tokenlist = tokenlist.select("features");
tokenlist.show(false);
List<Row> tokenizedWordsList1 = tokenlist.collectAsList();
/*tokenlist.foreach((ForeachFunction<Row>) individaulRow -> {
newtest.ItemIDSourceIndex=individaulRow.fieldIndex("ItemIDSource");
newtest.upcSourceIndex=individaulRow.fieldIndex("upcSource");
newtest.ManufacturerSourceIndex=individaulRow.fieldIndex("ManufacturerSource");
newtest.ManufacturerPartNumberSourceIndex=individaulRow.fieldIndex("Manufacturer part NumberSource");
newtest.PART_NUMBER_SOURCEIndex=individaulRow.fieldIndex("PART_NUMBER_SOURCE");
newtest.productDescriptionSourceIndex=individaulRow.fieldIndex("productDescriptionSource");
newtest.HASH_CODE_dummyIndex=individaulRow.fieldIndex("HASH_CODE_dummy");
newtest.rowIdSourceIndex=individaulRow.fieldIndex("rowIdSource");
newtest.rawFeaturesIndex=individaulRow.fieldIndex("rawfeatures ");
newtest.featuresIndex=individaulRow.fieldIndex("features ");
});*/
A) The Spark ML library already does TF-IDF calculations for you; try to use those methods.
B) If you have large rows (toList() will take time), try to use SQL methods instead, such as registering the Dataset as a table and querying it with certain conditions.
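For option A, a minimal sketch using the Spark ML feature transformers could look like this, reusing the tokenlist Dataset from your code (the column names and the numFeatures value are placeholders):

import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.IDF;
import org.apache.spark.ml.feature.IDFModel;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Assumes the "features" column selected above holds raw text (placeholder name).
Dataset<Row> words = new Tokenizer()
        .setInputCol("features")
        .setOutputCol("words")
        .transform(tokenlist);

Dataset<Row> tf = new HashingTF()
        .setInputCol("words")
        .setOutputCol("rawFeatures")
        .setNumFeatures(1 << 18)   // placeholder vector size
        .transform(words);

IDFModel idfModel = new IDF()
        .setInputCol("rawFeatures")
        .setOutputCol("tfidf")
        .fit(tf);

Dataset<Row> tfidf = idfModel.transform(tf);
tfidf.select("tfidf").show(false);

This keeps the whole computation distributed, so there is no need to collect the rows to the driver at all.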

Converting Dataframe to RDD reduces partitions

In our code, the DataFrame was created as:
DataFrame DF = hiveContext.sql("select * from table_instance");
When I convert my DataFrame to an RDD and try to get its number of partitions with
RDD<Row> newRDD = DF.rdd();
System.out.println(newRDD.getNumPartitions());
the number of partitions is reduced to 1 (1 is printed in the console). Originally my DataFrame had 102 partitions.
UPDATE:
While reading, I repartitioned the DataFrame:
DataFrame DF = hiveContext.sql("select * from table_instance").repartition(200);
and then converted it to an RDD, which gave me 200 partitions.
Does JavaSparkContext have a role to play in this? When we convert a DataFrame to an RDD, is the default minimum partitions flag also considered at the SparkContext level?
UPDATE:
I made a separate sample program in which I read the exact same table into a DataFrame and converted it to an RDD. No extra stage was created for the RDD conversion, and the partition count was also correct. I am now wondering what I am doing differently in my main program.
Please let me know if my understanding is wrong here.
It basically depends on the implementation of hiveContext.sql(). Since I am new to Hive, my guess is that hiveContext.sql() doesn't know how, or is not able, to split the data present in the table.
For example, when you read a text file from HDFS, the SparkContext considers the number of blocks used by that file to determine the partitions.
What you did with repartition is the obvious solution for these kinds of problems. (Note: repartition may cause a shuffle operation if a proper partitioner is not used; a HashPartitioner is used by default.)
Coming to your doubt, hiveContext may consider the default minimum partitions property. But relying on the default property is not going to solve all your problems: for instance, if your Hive table's size increases, your program will still use the default number of partitions.
Update: Avoid shuffle during repartition
Define your custom partitioner:
import org.apache.spark.HashPartitioner;

public class MyPartitioner extends HashPartitioner {
    private final int partitions;

    public MyPartitioner(int partitions) {
        super(partitions);
        this.partitions = partitions;
    }

    @Override
    public int numPartitions() {
        return this.partitions;
    }

    @Override
    public int getPartition(Object key) {
        if (key instanceof String) {
            return super.getPartition(key);
        } else if (key instanceof Integer) {
            return Integer.valueOf(key.toString()) % this.partitions;
        } else if (key instanceof Long) {
            return (int) (Long.valueOf(key.toString()) % this.partitions);
        }
        // TODO ... add more types
        return super.getPartition(key);
    }
}
Use your custom partitioner:
JavaPairRDD<Long, SparkDatoinDoc> pairRdd = hiveContext.sql("select * from table_instance")
        .toJavaRDD()
        .mapToPair(/* TODO ... expose the column as key */);

pairRdd = pairRdd.partitionBy(new MyPartitioner(200));
// ... rest of processing

PHOENIX SPARK - Load Table as DataFrame

I have created a DataFrame from an HBase table (via Phoenix) which has 500 million rows. From the DataFrame I created an RDD of JavaBeans and use it for joining with data from a file.
Map<String, String> phoenixInfoMap = new HashMap<String, String>();
phoenixInfoMap.put("table", tableName);
phoenixInfoMap.put("zkUrl", zkURL);

DataFrame df = sqlContext.read().format("org.apache.phoenix.spark").options(phoenixInfoMap).load();
JavaRDD<Row> tableRows = df.toJavaRDD();

JavaPairRDD<String, String> dbData = tableRows.mapToPair(
    new PairFunction<Row, String, String>() {
        @Override
        public Tuple2<String, String> call(Row row) throws Exception {
            return new Tuple2<String, String>(row.getAs("ID"), row.getAs("NAME"));
        }
    });
Now my question: let's say the file has 2 million unique entries matching the table. Will the entire table be loaded into memory as an RDD, or will only the matching 2 million records from the table be loaded into memory as an RDD?
Your statement
DataFrame df = sqlContext.read().format("org.apache.phoenix.spark").options(phoenixInfoMap)
.load();
will load the entire table into memory. You have not provided any filter for Phoenix to push down into HBase and thereby reduce the number of rows read.
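For illustration (the column name and predicate below are made up), a filter applied to the DataFrame before any join should allow Phoenix to push the predicate down to HBase so far fewer rows are read:

// Hypothetical pushdown filter; column name and predicate are placeholders.
DataFrame filtered = sqlContext.read()
        .format("org.apache.phoenix.spark")
        .options(phoenixInfoMap)
        .load()
        .filter("ACCOUNT_STATUS = 'ACTIVE'");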
If you do a join against a non-HBase data source, e.g. a flat file, then all of the records from the HBase table would first need to be read in. The records not matching the secondary data source would not be kept in the new DataFrame, but the initial read would still have happened.
Update: A potential approach would be to pre-process the file, i.e. extract the IDs you want, and store the results in a new HBase table. Then perform the join directly in HBase via Phoenix, not Spark.
The rationale for that approach is to move the computation to the data. The bulk of the data resides in HBase, so move the small data (the IDs in the file) to it.
I am not directly familiar with Phoenix beyond the fact that it provides an SQL layer on top of HBase. Presumably it would then be capable of doing such a join and storing the result in a separate HBase table. That separate table could then be loaded into Spark and used in your subsequent computations.

Why does Spark shuffle huge amounts of data when using union()/coalesce(1, false) on a DataFrame?

I have a Spark job which does some processing on ORC data and stores it back as ORC using the DataFrameWriter save() API introduced in Spark 1.4.0. The following piece of code uses a lot of shuffle memory. How do I optimize it? Is there anything wrong with it? It works as expected, but it is slow because of GC pauses, and it shuffles lots of data and hits memory issues. I am new to Spark.
JavaRDD<Row> updatedDsqlRDD = orderedFrame.toJavaRDD().coalesce(1, false).map(new Function<Row, Row>() {
    @Override
    public Row call(Row row) throws Exception {
        List<Object> rowAsList;
        Row row1 = null;
        if (row != null) {
            rowAsList = iterate(JavaConversions.seqAsJavaList(row.toSeq()));
            row1 = RowFactory.create(rowAsList.toArray());
        }
        return row1;
    }
}).union(modifiedRDD);

DataFrame updatedDataFrame = hiveContext.createDataFrame(updatedDsqlRDD, renamedSourceFrame.schema());
updatedDataFrame.write().mode(SaveMode.Append).format("orc").partitionBy("entity", "date").save("baseTable");
Edit
As per the suggestion, I tried converting the above code into the following using mapPartitionsWithIndex(). I still see data shuffling; it is better than the code above, but it still fails by hitting the GC limit and throwing an OOM, or it goes into long GC pauses, times out, and YARN kills the executor.
I am using spark.storage.memoryFraction = 0.5 and spark.shuffle.memoryFraction = 0.4; I tried the defaults and many other combinations, but nothing helped.
JavaRDD<Row> indexedRdd = sourceRdd.cache().mapPartitionsWithIndex(new Function2<Integer, Iterator<Row>, Iterator<Row>>() {
    @Override
    public Iterator<Row> call(Integer ind, Iterator<Row> rowIterator) throws Exception {
        List<Row> rowList = new ArrayList<>();
        while (rowIterator.hasNext()) {
            Row row = rowIterator.next();
            List<Object> rowAsList = iterate(JavaConversions.seqAsJavaList(row.toSeq()));
            Row updatedRow = RowFactory.create(rowAsList.toArray());
            rowList.add(updatedRow);
        }
        return rowList.iterator();
    }
}, true).coalesce(200, true);
Coalescing an RDD or DataFrame to a single partition means that all your processing happens on a single machine. This is not a good thing for a variety of reasons: all of the data has to be shuffled across the network, there is no more parallelism, etc. Instead, you should look at other operators like reduceByKey, mapPartitions, or really pretty much anything besides coalescing the data to a single machine.
Note: looking at your code, I don't see why you are bringing it down to a single machine; you can probably just remove that part.
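For example, a sketch of your first snippet with the coalesce(1, false) simply dropped (everything else, including the iterate() helper, reused from your code):

// Same transformation as the question's snippet, just without collapsing
// everything onto one partition; iterate() is the question's own helper.
JavaRDD<Row> updatedDsqlRDD = orderedFrame.toJavaRDD().map(new Function<Row, Row>() {
    @Override
    public Row call(Row row) throws Exception {
        if (row == null) {
            return null;
        }
        List<Object> rowAsList = iterate(JavaConversions.seqAsJavaList(row.toSeq()));
        return RowFactory.create(rowAsList.toArray());
    }
}).union(modifiedRDD);

The row-by-row mapping then stays fully parallel, and the write can still partition the output by "entity" and "date" as before.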

Resources