Spark file stream syntax - apache-spark

JavaPairInputDStream<Text, BytesWritable> dStream = jsc.fileStream("/home/suv/junk/sparkInput");
When I run this code I get:
java.lang.ClassCastException: java.lang.Object cannot be cast to org.apache.hadoop.mapreduce.InputFormat
I am unable to specify the input format for the file stream. How do I provide it?
This is the method signature I found:
public <K,V,F extends org.apache.hadoop.mapreduce.InputFormat<K,V>> JavaPairInputDStream<K,V> fileStream(String directory).
How do I specify the input format here?

After wasting my whole day on this, I wrote a utility in Scala:
class ZipFileStream {
  def fileStream(path: String, ssc: StreamingContext): JavaPairInputDStream[Text, BytesWritable] = {
    return ssc.fileStream[Text, BytesWritable, ZipFileInputFormat](path)
  }
}
and referenced it from Java (a minimal call-site sketch is shown below).
Any better solution would be appreciated.
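For reference, here is a rough sketch of how that Scala wrapper can be called from Java. It assumes the jsc JavaStreamingContext and the ZipFileInputFormat from the question; jsc.ssc() exposes the underlying Scala StreamingContext that the wrapper expects.
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class ZipStreamCaller {
    // Sketch only: ZipFileStream is the Scala wrapper above, ZipFileInputFormat
    // is the custom input format from the question.
    public static JavaPairInputDStream<Text, BytesWritable> zipStream(JavaStreamingContext jsc) {
        return new ZipFileStream().fileStream("/home/suv/junk/sparkInput", jsc.ssc());
    }
}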

I have faced the same issue.
It seems to be a bug, which was fixed in Spark 1.3.0:
https://issues.apache.org/jira/browse/SPARK-5297
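With that fix, JavaStreamingContext gained a fileStream overload that takes the key, value, and input-format classes explicitly, so the Scala wrapper should no longer be needed. A rough sketch, assuming Spark 1.3+ and the same ZipFileInputFormat from the question:
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class FileStreamWithFormat {
    // The class-parameter overload added for SPARK-5297 lets Java callers name
    // the InputFormat directly.
    public static JavaPairInputDStream<Text, BytesWritable> zipStream(JavaStreamingContext jsc) {
        return jsc.fileStream("/home/suv/junk/sparkInput",
                Text.class, BytesWritable.class, ZipFileInputFormat.class);
    }
}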

Related

How to read Cassandra FQL logs in Java?

I have a bunch of Cassandra FQL logs with the "cq4" extension. I would like to read them in Java. Is there a Java class that those log entries can be mapped into?
These are the logs I see.
I want to read this with this code:
import net.openhft.chronicle.Chronicle;
import net.openhft.chronicle.ChronicleQueueBuilder;
import net.openhft.chronicle.ExcerptTailer;
import java.io.IOException;

public class Main {
    public static void main(String[] args) throws IOException {
        Chronicle chronicle = ChronicleQueueBuilder.indexed("/Users/pavelorekhov/Desktop/fql_logs").build();
        ExcerptTailer tailer = chronicle.createTailer();
        while (tailer.nextIndex()) {
            tailer.readInstance(/* class goes here */);
        }
    }
}
I think from the code and screenshot you can understand what kind of class I need in order to read log entries into objects. Does that class exist in some cassandra maven dependency?
You are using Chronicle 3.x, which is very old.
I suggest using Chronicle 5.20.123, which is the version Cassandra uses.
I would assume Cassandra has its own tool for reading the contents of these files. However, you can dump the raw messages with net.openhft.chronicle.queue.main.DumpMain.
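For example, a rough sketch of invoking DumpMain from Java; this assumes its main method takes the queue directory as its first argument:
import net.openhft.chronicle.queue.main.DumpMain;

public class DumpFqlLogs {
    public static void main(String[] args) throws Exception {
        // Dumps the raw contents of the .cq4 files in the given directory to stdout.
        DumpMain.main(new String[]{"/Users/pavelorekhov/Desktop/fql_logs"});
    }
}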
I ended up cloning Cassandra's GitHub repo from here: https://github.com/apache/cassandra
In their code they have the FQLQueryIterator class, which you can use to read the logs like so:
SingleChronicleQueue scq = SingleChronicleQueueBuilder.builder().path("/Users/pavelorekhov/Desktop/fql_logs").build();
ExcerptTailer excerptTailer = scq.createTailer();
FQLQueryIterator iterator = new FQLQueryIterator(excerptTailer, 1);
while (iterator.hasNext()) {
    FQLQuery fqlQuery = iterator.next(); // object that holds the log entry
    // do whatever you need to do with that log entry...
}

Micronaut CompileStatic JSON object -Static type checking- No such property: bookid for class: java.lang.Object

In my Micronaut controller I have the code below to parse the JSON object. When I use the @CompileStatic annotation it throws the error below.
@Post("/save")
def save(@Body Object JSON) {
    String bookid = JSON?.bookid
    String name = JSON?.name
    def b = bookService.save(bookid, name)
    return HttpResponse.created(b)
}
Error
BookController.groovy: 58: [Static type checking] - No such property: bookid for class: java.lang.Object
Is there a way to fix this error while keeping the @CompileStatic annotation?
Thanks
SR
With the help of Jeff Brown I have changed my save method like this:
@Post('/')
Book save(Book b) {
    bookService.save b
}
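For context, this relies on Micronaut binding the JSON body straight into a Book POJO. A hypothetical sketch of such a class (only the bookid and name fields come from the question; the rest is illustrative):
import io.micronaut.core.annotation.Introspected;

@Introspected
public class Book {
    // Hypothetical fields based on the question's bookid/name properties.
    private String bookid;
    private String name;

    public String getBookid() { return bookid; }
    public void setBookid(String bookid) { this.bookid = bookid; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
}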
Micronaut JSON post strips the quotes
You can also keep your method as it is instead of changing it. I encountered the same problem, and the approach that worked for me was using String instead of Object: take the JSON body as a String with @Body and then parse it with ObjectMapper (see the sketch below).
Here is the answer I posted on another question; I hope it helps you out:
https://stackoverflow.com/a/54905403/7803105
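A minimal sketch of that String-body approach, assuming Jackson's ObjectMapper and the bookid/name fields from the question:
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import io.micronaut.http.HttpResponse;
import io.micronaut.http.annotation.Body;
import io.micronaut.http.annotation.Controller;
import io.micronaut.http.annotation.Post;

@Controller("/books")
public class BookController {

    private final ObjectMapper mapper = new ObjectMapper();

    // Take the raw JSON body as a String and parse it explicitly, so no dynamic
    // property access on Object is needed (which is what @CompileStatic rejects).
    @Post("/save")
    public HttpResponse<String> save(@Body String json) throws Exception {
        JsonNode node = mapper.readTree(json);
        String bookid = node.get("bookid").asText();
        String name = node.get("name").asText();
        // bookService.save(bookid, name) would go here, as in the question.
        return HttpResponse.created(bookid + ":" + name);
    }
}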

Spark - ignoring corrupted files

In the ETL process that we are managing, we sometimes receive corrupted files.
We tried this Spark configuration and it seems it works (the Spark job is not failing because the corrupted files are discarded):
spark.sqlContext.setConf("spark.sql.files.ignoreCorruptFiles", "true")
But I don't know if there is any way to find out which files were ignored. Is there any way to get those filenames?
Thanks in advance
One way is to look through your executor logs, provided you have set the following configurations to true in your Spark configuration:
RDD: spark.files.ignoreCorruptFiles
DataFrame: spark.sql.files.ignoreCorruptFiles
Then Spark will log each corrupted file as a WARN message in your executor logs.
Here is the code snippet from Spark that does this:
if (ignoreCorruptFiles) {
  currentIterator = new NextIterator[Object] {
    // The readFunction may read some bytes before consuming the iterator, e.g.,
    // vectorized Parquet reader. Here we use lazy val to delay the creation of
    // iterator so that we will throw exception in `getNext`.
    private lazy val internalIter = readCurrentFile()

    override def getNext(): AnyRef = {
      try {
        if (internalIter.hasNext) {
          internalIter.next()
        } else {
          finished = true
          null
        }
      } catch {
        // Throw FileNotFoundException even `ignoreCorruptFiles` is true
        case e: FileNotFoundException => throw e
        case e @ (_: RuntimeException | _: IOException) =>
          logWarning(
            s"Skipped the rest of the content in the corrupted file: $currentFile", e)
          finished = true
          null
      }
    }
Did you solve it?
If not, maybe you can try the approach below (a sketch follows these steps):
Read everything from the location with the ignoreCorruptFiles setting enabled.
Get the file name each record belongs to using the input_file_name function, and take the distinct names.
Separately, get the list of all the objects in the respective directory.
Find the difference.
Did you use a different approach?
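A rough sketch of that approach, assuming a Parquet input directory (paths are placeholders) and using Spark's input_file_name function together with a Hadoop FileSystem listing; note that the two sides may need path normalization before comparing:
import static org.apache.spark.sql.functions.input_file_name;

import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FindIgnoredFiles {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("find-ignored-files").getOrCreate();
        spark.conf().set("spark.sql.files.ignoreCorruptFiles", "true");

        String inputDir = "/data/input"; // placeholder

        // 1) Read everything; corrupted files are silently skipped.
        Dataset<Row> df = spark.read().parquet(inputDir);

        // 2) Distinct file names that actually contributed records.
        Set<String> readFiles = new HashSet<>();
        List<Row> rows = df.select(input_file_name()).distinct().collectAsList();
        for (Row r : rows) {
            readFiles.add(r.getString(0));
        }

        // 3) All files physically present in the directory.
        FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
        for (FileStatus status : fs.listStatus(new Path(inputDir))) {
            String file = status.getPath().toString();
            // 4) Anything listed but never read was presumably skipped as corrupted.
            if (!readFiles.contains(file)) {
                System.out.println("Possibly ignored: " + file);
            }
        }
        spark.stop();
    }
}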

Extracting age-related information from raw text using NLP

I am new to NLP and I have been trying to extract age-related information from raw text. I googled and didn't find any reliable library in any language for this requirement. It would be great if I could get some help with this. I am open to any language; it is not a constraint. It can be in Java, Python, or any other language. Any help would be much appreciated. Thanks in advance. Cheers!
Update:
I tried adding the annotators mentioned in the Stanford help to my Java parser, and I am facing the exception below:
ERROR: cannot create CorefAnnotator!
java.lang.RuntimeException: Error creating coreference system
at edu.stanford.nlp.scoref.StatisticalCorefSystem.fromProps(StatisticalCorefSystem.java:58)
at edu.stanford.nlp.pipeline.CorefAnnotator.<init>(CorefAnnotator.java:66)
at edu.stanford.nlp.pipeline.AnnotatorImplementations.coref(AnnotatorImplementations.java:220)
at edu.stanford.nlp.pipeline.AnnotatorFactories$13.create(AnnotatorFactories.java:515)
at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:85)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:375)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:139)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:135)
at com.dateparser.SUtime.SUAgeParser.makeNumericPipeline(SUAgeParser.java:85)
at com.dateparser.SUtime.SUAgeParser.<clinit>(SUAgeParser.java:60)
Caused by: java.lang.RuntimeException: Error initializing coref system
at edu.stanford.nlp.scoref.StatisticalCorefSystem.<init>(StatisticalCorefSystem.java:36)
at edu.stanford.nlp.scoref.ClusteringCorefSystem.<init>(ClusteringCorefSystem.java:24)
at edu.stanford.nlp.scoref.StatisticalCorefSystem.fromProps(StatisticalCorefSystem.java:48)
... 9 more
Caused by: java.io.IOException: Unable to open "edu/stanford/nlp/models/hcoref/md-model.ser" as class path, filename or URL
at edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(IOUtils.java:485)
at edu.stanford.nlp.io.IOUtils.readObjectFromURLOrClasspathOrFileSystem(IOUtils.java:323)
at edu.stanford.nlp.hcoref.md.DependencyCorefMentionFinder.<init>(DependencyCorefMentionFinder.java:38)
at edu.stanford.nlp.hcoref.CorefDocMaker.getMentionFinder(CorefDocMaker.java:149)
at edu.stanford.nlp.hcoref.CorefDocMaker.<init>(CorefDocMaker.java:61)
at edu.stanford.nlp.scoref.StatisticalCorefSystem.<init>(StatisticalCorefSystem.java:34)
... 11 more
I upgraded to version 1.6.0 and also added stanford-corenlp-models-current.jar to the classpath. Please let me know if I am missing something.
Update 1:
The exception was fixed after upgrading to 3.9.1, but I am getting the output as a per:duration relation instead of per:age.
private static AnnotationPipeline makePipeline() {
    Properties props = new Properties();
    props.setProperty("annotators",
            "tokenize,ssplit,pos,lemma,ner,depparse,coref,kbp");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    return pipeline;
}

public static void parse(String str) {
    try {
        Annotation doc = new Annotation(str);
        pipeline.annotate(doc);
        ArrayList<CoreMap> resultRelations = new ArrayList<CoreMap>();
        List<CoreMap> mentionsAnnotations = doc.get(MentionsAnnotation.class);
        for (CoreMap currentCoreMap : mentionsAnnotations) {
            System.out.println(currentCoreMap.get(TextAnnotation.class));
            System.out.println(currentCoreMap.get(CharacterOffsetBeginAnnotation.class));
            System.out.println(currentCoreMap.get(CharacterOffsetEndAnnotation.class));
            System.out.println(currentCoreMap.get(NamedEntityTagAnnotation.class));
        }
    } catch (Exception e) {
        // note: any exception is silently swallowed here
    }
}
Is this normal behaviour or am I doing something wrong?
You may find the KBP relation extractor useful.
Example text:
Joe Smith is 58 years old.
Command:
java -Xmx12g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,depparse,coref,kbp -file example.txt -outputFormat text
This should attach Joe Smith to 58 years old with the per:age relation.
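If you want the same result programmatically rather than from the command line, here is a rough sketch that reads the sentence-level KBP triples; it assumes the KBPTriplesAnnotation key and the RelationTriple API available in recent CoreNLP releases:
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ie.util.RelationTriple;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class AgeRelationExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,depparse,coref,kbp");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("Joe Smith is 58 years old.");
        pipeline.annotate(doc);

        // KBP triples are attached per sentence; keep only the per:age relation.
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            List<RelationTriple> triples = sentence.get(CoreAnnotations.KBPTriplesAnnotation.class);
            if (triples == null) {
                continue;
            }
            for (RelationTriple triple : triples) {
                if ("per:age".equals(triple.relationGloss())) {
                    System.out.println(triple.subjectGloss() + " -> " + triple.objectGloss());
                }
            }
        }
    }
}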

Reading/writing with Avro schemas AND Parquet format in SparkSQL

I'm trying to write and read Parquet files from SparkSQL. For reasons of schema evolution, I would like to use Avro schemas with my writes and reads.
My understanding is that this is possible outside of Spark (or manually within Spark) using e.g. AvroParquetWriter and Avro's Generic API. However, I would like to use SparkSQL's write() and read() methods (which work with DataFrameWriter and DataFrameReader), and which integrate well with SparkSQL (I will be writing and reading Datasets).
I can't for the life of me figure out how to do this, and am wondering if this is possible at all. The only options the SparkSQL parquet format seems to support are "compression" and "mergeSchema" -- i.e. no options for specifying an alternate schema format or alternate schema. In other words, it appears that there is no way to read/write Parquet files using Avro schemas using the SparkSQL API. But perhaps I'm just missing something?
To clarify, I also understand that this will basically just add the Avro schema to the Parquet metadata on write, and will add one more translation layer on read (Parquet format -> Avro schema -> SparkSQL internal format) but will specifically allow me to add default values for missing columns (which Avro schema supports but Parquet schema does not).
Also, I am not looking for a way to convert Avro to Parquet, or Parquet to Avro (rather a way to use them together), and I am not looking for a way to read/write plain Avro within SparkSQL (you can do this using databricks/spark-avro).
I am doing something similar. I use an Avro schema to write into a Parquet file; however, I don't read it back as Avro. The same technique should work on read as well. I am not sure if this is the best way to do it, but here it is anyway:
I have AvroData.avsc, which holds the Avro schema.
val kafkaArr = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder, Tuple2[String, Array[Byte]]](ssc, kafkaProps, fromOffsets, messageHandler)

kafkaArr.foreachRDD { (rdd, time) =>
  val schema = SchemaConverters.toSqlType(AvroData.getClassSchema).dataType.asInstanceOf[StructType]
  val ardd = rdd.mapPartitions { itr =>
    itr.map { r =>
      try {
        val cr = avroToListWithAudit(r._2, offsetSaved, loadDate, timeNow.toString)
        Row.fromSeq(cr.toArray)
      } catch {
        case e: Exception =>
          LogHandler.log.error("Exception while converting to Avro" + e.printStackTrace())
          System.exit(-1)
          Row(0) // This is just to allow the compiler to accept it; on exception the application exits before this point.
      }
    }
  }
}
public static List avroToListWithAudit(byte[] kfkBytes, String kfkOffset, String loaddate, String loadtime) throws IOException {
    AvroData av = getAvroData(kfkBytes);
    av.setLoaddate(loaddate);
    av.setLoadtime(loadtime);
    av.setKafkaOffset(kfkOffset);
    return avroToList(av);
}

public static List avroToList(AvroData a) throws UnsupportedEncodingException {
    List<Object> l = new ArrayList<>();
    for (Schema.Field f : a.getSchema().getFields()) {
        String field = f.name().toString();
        Object value = a.get(f.name());
        if (value == null) {
            //System.out.println("Adding null");
            l.add("");
        } else {
            switch (f.schema().getType().getName()) {
                case "union":
                    //System.out.println("Adding union");
                    l.add(value.toString());
                    break;
                default:
                    l.add(value);
                    break;
            }
        }
    }
    return l;
}
The getAvroData method needs to contain the code that constructs the Avro object from the raw bytes. I am also trying to figure out a way to do that without having to specify each attribute setter explicitly, but it seems like there isn't one.
public static AvroData getAvroData(byte[] bytes) {
    AvroData av = AvroData.newBuilder().build();
    try {
        av.setAttr(String.valueOf("xyz"));
        .....
    }
}
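One way to avoid the per-field setters, sketched under the assumption that AvroData is an Avro-generated SpecificRecord class and that the Kafka value bytes are plain Avro binary in the same schema (no Confluent wire-format header), is to deserialize with a SpecificDatumReader:
import java.io.IOException;

import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;

public class AvroDataDeserializer {
    // Sketch only: builds the AvroData object directly from the raw bytes.
    public static AvroData getAvroData(byte[] kfkBytes) throws IOException {
        SpecificDatumReader<AvroData> reader = new SpecificDatumReader<>(AvroData.class);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(kfkBytes, null);
        return reader.read(null, decoder);
    }
}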
Hope it helps
