I am trying to use POI 4.0.1 to extract text from PowerPoint files. I am using all POI 4.0.1 jars and I am getting a method-not-found exception:
Exception in thread "Thread-2" java.lang.NoSuchMethodError:
org.apache.poi.sl.usermodel.Sheet.getPlaceholderDetails(Lorg/apache/poi/sl/usermodel/Placeholder;)Lorg/apache/poi/sl/usermodel/PlaceholderDetails;
at org.apache.poi.sl.extractor.SlideShowExtractor.addSheetPlaceholderDatails(SlideShowExtractor.java:224)
at org.apache.poi.sl.extractor.SlideShowExtractor.printHeaderReturnFooter(SlideShowExtractor.java:183)
at org.apache.poi.sl.extractor.SlideShowExtractor.printShapeText(SlideShowExtractor.java:236)
at org.apache.poi.sl.extractor.SlideShowExtractor.getText(SlideShowExtractor.java:130)
at org.apache.poi.sl.extractor.SlideShowExtractor.getText(SlideShowExtractor.java:120)
I looked at my classpath and didn't find mismatched or duplicate POI jars. I also poked around in the POI 4 distribution jars and could not find the missing method.
// Open the .pptx and extract all slide text with SlideShowExtractor (POI 4.0.1)
try (FileInputStream fis = new FileInputStream(file.getPath());
     XMLSlideShow xmlA = new XMLSlideShow(fis)) {
    SlideShowExtractor<XSLFShape, XSLFTextParagraph> ex = new SlideShowExtractor<>(xmlA);
    String text = ex.getText();
}
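When a NoSuchMethodError like this appears even though the classpath looks clean, it can help to ask the JVM where the offending interface was actually loaded from. A minimal diagnostic sketch (the PoiJarCheck class name is just a placeholder); an old poi jar or an uber-jar shading a pre-4.0 org.apache.poi.sl package would show up in the printed locations:

import org.apache.poi.sl.usermodel.Sheet;
import org.apache.poi.sl.extractor.SlideShowExtractor;

public class PoiJarCheck {
    public static void main(String[] args) {
        // Print the jar (or classes directory) each class was loaded from
        System.out.println(Sheet.class.getProtectionDomain().getCodeSource().getLocation());
        System.out.println(SlideShowExtractor.class.getProtectionDomain().getCodeSource().getLocation());
    }
}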
Related
I am using Java 8 with Spark v2.4.1.
I am trying to add a Map column using the Spark function typedLit, but I am getting compilation errors. How can I do this in the Java API?
Below is the scenario:
Map<Integer,Integer> lookup_map= new HashMap<>();
lookup_map.put(1,11);
lookup_map.put(2,21);
lookup_map.put(3,31);
lookup_map.put(4,41);
lookup_map.put(5,51);
JavaSparkContext javaSparkContext = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
Column typedMapCol = functions.typedLit(lookup_map, Map<Encoders.INT(), Encoders.INT()>);
// this is not correct and gives a compilation error at typedLit
Dataset<Row> resultDs = dataDs
    .withColumn("map_col", typedMapCol);
How do I use functions.typedLit in Java 8?
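For reference, one workaround sometimes used from the Java API (a sketch that sidesteps typedLit, since typedLit expects a Scala TypeTag; it assumes building the column from literals is acceptable) is to expand the HashMap into alternating key/value literal columns and pass them to functions.map:

import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.map;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Expand the Java HashMap into alternating key/value literal columns
List<Column> kv = new ArrayList<>();
for (Map.Entry<Integer, Integer> e : lookup_map.entrySet()) {
    kv.add(lit(e.getKey()));
    kv.add(lit(e.getValue()));
}
// functions.map(...) builds a MapType column from the key/value columns
Column mapCol = map(kv.toArray(new Column[0]));

Dataset<Row> resultDs = dataDs.withColumn("map_col", mapCol);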
Using the Spark 2.3.0 feature ImageSchema, I have read some images as a Dataset, and now, after applying changes to them, I want to save them in an image format (png, jpeg).
I got each dataset row's data (byte[]) and tried to save it as a png file, but the exported file is not valid!
Dataset<Row> images = ImageSchema.readImages("images/");
images.foreach(data_row -> {
    // The image struct is the first (and only) column of each row
    Row row = data_row.getAs(0);
    File file = new File(Paths.get(ImageSchema.getOrigin(row)).getFileName().toString() + ".png");
    try (FileOutputStream fos = new FileOutputStream(file)) {
        // Raw bytes from the image schema's "data" field
        fos.write(ImageSchema.getData(row));
        fos.flush();
    }
});
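For context, the "data" field holds raw interleaved pixel bytes (OpenCV-style BGR order), not an encoded image, so writing it straight to a .png file will not produce a valid PNG. A minimal sketch of one way to encode it, assuming a 3-channel image (the ImageRowWriter class and writePng method names are placeholders):

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;
import org.apache.spark.ml.image.ImageSchema;
import org.apache.spark.sql.Row;

public class ImageRowWriter {
    // Sketch: assumes nChannels == 3; ImageSchema stores pixels interleaved in
    // BGR order, which matches BufferedImage.TYPE_3BYTE_BGR.
    public static void writePng(Row imageRow, File target) throws Exception {
        int width = ImageSchema.getWidth(imageRow);
        int height = ImageSchema.getHeight(imageRow);
        byte[] data = ImageSchema.getData(imageRow);

        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_3BYTE_BGR);
        // Copy the raw interleaved bytes straight into the image raster
        img.getRaster().setDataElements(0, 0, width, height, data);

        // ImageIO performs the actual PNG encoding
        ImageIO.write(img, "png", target);
    }
}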
Recently, my Phoenix integration with Spark has been acting up. It was working very well until last week; in fact, there are still pieces of code running on the cluster that use similar code. Only new code that I write does not seem to work properly, and it gives strange errors while writing a DataFrame to Phoenix. Reading works fine:
def main(args: Array[String]): Unit = {
  if (args.length < 1) {
    println("Needs a month range as: startingmonth,endingmonth")
    System.exit(1)
  }
  val month_range = args(0).split(",")

  val rootLogger = Logger.getRootLogger()
  rootLogger.setLevel(Level.ERROR)

  val start = System.currentTimeMillis()
  val sparkConf = new SparkConf().setAppName("userProfile") //.set("spark.testing.memory", "536870912")
  val sc = new SparkContext(sparkConf)
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  /*val itemProfileDF = sqlContext.read.format("jdbc")
    .options(ImmutableMap.of("driver", "org.apache.phoenix.jdbc.PhoenixDriver", "url",
      "jdbc:phoenix:<ZK URL>:5181", "dbtable", "ITEMPROFILES")).load()
  itemProfileDF.show() //[[ [ TAKE NOTE OF THIS COMMENTED PART] ]]*/

  val callUsageSummary = sqlContext.read.parquet("/edw_data_vol/hp_tab/CALL_USAGE_SUMMARY21_FCT")

  callUsageSummary.write
    .format("org.apache.phoenix.spark")
    .mode(SaveMode.Overwrite)
    .options(ImmutableMap.of(
      "driver", "org.apache.phoenix.jdbc.PhoenixDriver",
      "zkUrl", "jdbc:phoenix:<ZK URL>:5181",
      "table", "AGGREGATIONFINAL"))
    .save()

  print("Done")

  val stop = System.currentTimeMillis()
  System.out.println("Time taken to process the files" + (stop - start) / 1000 + "s")
}
This code throws the following error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 70 in stage 2.0 failed 4 times, most recent failure: Lost task 70.3 in stage 2.0 (TID 13): java.sql.SQLException: No suitable driver found for jdbc:phoenix:<ZK URL>:5181:/hbase;
at java.sql.DriverManager.getConnection(DriverManager.java:596)
at java.sql.DriverManager.getConnection(DriverManager.java:187)
at org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(ConnectionUtil.java:98)
at org.apache.phoenix.mapreduce.util.ConnectionUtil.getOutputConnection(ConnectionUtil.java:82)
at org.apache.phoenix.mapreduce.util.ConnectionUtil.getOutputConnection(ConnectionUtil.java:70)
at org.apache.phoenix.mapreduce.util.PhoenixConfigurationUtil.getUpsertColumnMetadataList(PhoenixConfigurationUtil.java:230)
at org.apache.phoenix.spark.DataFrameFunctions$$anonfun$2.apply(DataFrameFunctions.scala:45)
at org.apache.phoenix.spark.DataFrameFunctions$$anonfun$2.apply(DataFrameFunctions.scala:41)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Whereas if I uncomment the part marked above, the code works just fine, but only in this particular class; in other classes it is pretty much unreliable and may or may not work. Mostly I can't figure out what change, made where, will make it work. It most probably isn't a problem with Phoenix itself, as I have only noticed the behavior change after I repackage the project with Maven. Also, my older project runs just fine.
There are also classes in which the code was working just fine but now does not after I repackaged them (I think).
On the command line I pass:
--conf "spark.driver.extraClassPath=/opt/mapr/spark/spark-1.6.1/lib/phoenix-spark-4.8.1-HBase-1.1.jar,/opt/mapr/spark/spark-1.6.1/lib/hbase-protocol-1.1.1-mapr-1602.jar"
I know that neither of these is the client jar mentioned in the official Phoenix documentation, but the code used to work with these jars, and sometimes it still does. I have tried the client jars and every possible combination of these three jars for the executors and the driver. I have finally given up and am writing here in case someone knows what might have happened.
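Worth noting about the setup above: the failing task runs on an executor, and spark.driver.extraClassPath only affects the driver JVM; its entries are also separated like a classpath (':' on Linux) rather than by commas. A sketch of a submit command that puts the same jars on both the driver and executor classpaths, assuming they exist at these paths on every node (illustrative only, not a confirmed fix; the class and application jar names are placeholders):

spark-submit \
  --conf "spark.driver.extraClassPath=/opt/mapr/spark/spark-1.6.1/lib/phoenix-spark-4.8.1-HBase-1.1.jar:/opt/mapr/spark/spark-1.6.1/lib/hbase-protocol-1.1.1-mapr-1602.jar" \
  --conf "spark.executor.extraClassPath=/opt/mapr/spark/spark-1.6.1/lib/phoenix-spark-4.8.1-HBase-1.1.jar:/opt/mapr/spark/spark-1.6.1/lib/hbase-protocol-1.1.1-mapr-1602.jar" \
  --class your.main.Class \
  your-application.jar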
I have a simple Spark job that splits the words from a file and loads them into a table in Hive.
public static void wordCountJava7() {
    // Define a configuration to use to interact with Spark
    SparkConf conf = new SparkConf().setMaster("local[4]").setAppName("Work Count App");
    SparkContext sc = new SparkContext(conf);

    // Create a Java version of the Spark Context from the configuration
    JavaSparkContext jsc = new JavaSparkContext(sc);

    // Load the input data, which is a text file read from the command line
    JavaRDD<String> input = jsc.textFile("file:///home/priyanka/workspace/ZTA/spark/src/main/java/sample.txt");

    // Java 7 and earlier
    JavaRDD<String> words = input.flatMap(new FlatMapFunction<String, String>() {
        public Iterable<String> call(String s) {
            return Arrays.asList(s.split(" "));
        }
    });

    // Java 7 and earlier: transform the collection of words into pairs (word and 1)
    JavaPairRDD<String, Integer> counts = words.mapToPair(new PairFunction<String, String, Integer>() {
        public Tuple2<String, Integer> call(String s) {
            return new Tuple2(s, 1);
        }
    });

    // Java 7 and earlier: count the words
    JavaPairRDD<String, Integer> reducedCounts = counts.reduceByKey(new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer x, Integer y) {
            return x + y;
        }
    });

    HiveContext hiveContext = new HiveContext(sc);
    DataFrame dataFrame = hiveContext.createDataFrame(words, SampleBean.class);
    dataFrame.write().saveAsTable("Sample");

    words.saveAsTextFile("output");
    jsc.close();
}
The Spark job fails with the following trace:
16/04/29 15:41:21 WARN HiveContext$$anon$2: Could not persist `sample` in a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific format.
org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: at least one column must be specified for the table
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:720)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:677)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply$mcV$sp(ClientWrapper.scala:424)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply(ClientWrapper.scala:422)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply(ClientWrapper.scala:422)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:290)
at org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:237)
at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:236)
at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:279)
at org.apache.spark.sql.hive.client.ClientWrapper.createTable(ClientWrapper.scala:422)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.createDataSourceTable(HiveMetastoreCatalog.scala:358)
at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:280)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
I checked the table sample in Hive, and it does have one column:
hive> desc sample;
OK
word string None
Time taken: 0.218 seconds, Fetched: 1 row(s)
When I try to save it as a table, this error is thrown.
Any help is appreciated.
This error means a column's data type is not correct. I got the same error while using an Avro schema: the DataFrame column type was decimal(20,2), and in the Avro schema I had also declared the type as decimal(20,2), which produced the same error.
Schema with the issue (sketched below):
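Roughly along these lines (a sketch, not the actual schema; the field name and doc are invented for illustration):

{
  "name": "amount",
  "doc": "hypothetical field, for illustration only",
  "type": {"type": "bytes", "logicalType": "decimal", "precision": 20, "scale": 2}
}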
I later changed the data type in the Avro schema to string and it worked fine for me, since Avro converts the internal decimal to a string.
Changed schema (sketched below):
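Again a sketch of the same invented field, now declared as a plain string:

{
  "name": "amount",
  "doc": "hypothetical field, for illustration only",
  "type": "string"
}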
This issue occurred for me because I was doing
DataFrameWriter<Row> dfw =
sparkSession.createDataFrame(jsc.parallelize(uuids), CustomDataClass.class).write();
And I fixed it by doing
DataFrameWriter<Row> dfw =
sparkSession.createDataFrame(uuids, CustomDataClass.class).write();
(No need to parallelize). Or, in general, make sure the Bean Class you pass matches the type in the List you pass to createDataFrame.
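To make that last point concrete, here is a minimal sketch (the UuidBean class and all names in it are invented for illustration): the element type of the List is the same bean class passed to createDataFrame.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BeanExample {
    // Hypothetical bean; Spark derives one column per getter/setter pair
    public static class UuidBean implements java.io.Serializable {
        private String uuid;
        public UuidBean() { }
        public UuidBean(String uuid) { this.uuid = uuid; }
        public String getUuid() { return uuid; }
        public void setUuid(String uuid) { this.uuid = uuid; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").appName("bean-example").getOrCreate();

        // The list element type (UuidBean) matches the bean class passed to createDataFrame
        List<UuidBean> uuids = Arrays.asList(new UuidBean("a"), new UuidBean("b"));
        Dataset<Row> df = spark.createDataFrame(uuids, UuidBean.class);
        df.show();

        spark.stop();
    }
}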
I am trying to use Spark SQL with Parquet file formats. When I try the basic example:
object parquet {
  case class Person(name: String, age: Int)

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("local").setAppName("HdfsWordCount")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
    import sqlContext.createSchemaRDD

    val people = sc.textFile("C:/Users/pravesh.jain/Desktop/people/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
    people.saveAsParquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")

    val parquetFile = sqlContext.parquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
  }
}
I get a null pointer exception:
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.parquet$.main(parquet.scala:16)
which points to the saveAsParquetFile line. What's the issue here?
This error occurred when I was using Spark in Eclipse on Windows. I tried the same code in spark-shell and it works fine. I guess Spark might not be 100% compatible with Windows.
Spark is compatible with Windows. You can run your program in a spark-shell session on Windows, or you can run it using spark-submit with the necessary arguments such as --master (again, on Windows or any other OS).
You cannot just run your Spark program as an ordinary Java program in Eclipse without properly setting up the Spark environment and so on. Your problem has nothing to do with Windows.
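As a concrete illustration of the spark-submit route (a sketch only; the class name, jar path, and master URL are placeholders, not taken from the question):

spark-submit \
  --class your.package.parquet \
  --master local[*] \
  target/your-spark-app.jar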