How to identify or reroute bad XMLs when reading XMLs with Spark - apache-spark

Using Spark, I am trying to read a bunch of XMLs from a path; one of the files is a dummy file which is not an XML.
I would like Spark to tell me, in some way, that one particular file is not valid.
Adding the "badRecordsPath" option writes the bad data to the specified location for JSON files, but the same is not working for XML. Is there some other way?
df = (spark.read.format('json')
      .option('badRecordsPath', '/tmp/data/failed')
      .load('/tmp/data/dummy.json'))
As far as I know, this is unfortunately not available in Spark's XML package today in a declarative way, i.e. in the way you are expecting.
For JSON it works because FailureSafeParser is implemented in DataFrameReader, like below:
/**
 * Loads a `Dataset[String]` storing JSON objects (<a href="http://jsonlines.org/">JSON Lines
 * text format or newline-delimited JSON</a>) and returns the result as a `DataFrame`.
 *
 * Unless the schema is specified using `schema` function, this function goes through the
 * input once to determine the input schema.
 *
 * @param jsonDataset input Dataset with one JSON object per record
 * @since 2.2.0
 */
def json(jsonDataset: Dataset[String]): DataFrame = {
  val parsedOptions = new JSONOptions(
    extraOptions.toMap,
    sparkSession.sessionState.conf.sessionLocalTimeZone,
    sparkSession.sessionState.conf.columnNameOfCorruptRecord)
  val schema = userSpecifiedSchema.getOrElse {
    TextInputJsonDataSource.inferFromDataset(jsonDataset, parsedOptions)
  }
  ExprUtils.verifyColumnNameOfCorruptRecord(schema, parsedOptions.columnNameOfCorruptRecord)
  val actualSchema =
    StructType(schema.filterNot(_.name == parsedOptions.columnNameOfCorruptRecord))
  val createParser = CreateJacksonParser.string _
  val parsed = jsonDataset.rdd.mapPartitions { iter =>
    val rawParser = new JacksonParser(actualSchema, parsedOptions, allowArrayAsStructs = true)
    val parser = new FailureSafeParser[String](
      input => rawParser.parse(input, createParser, UTF8String.fromString),
      parsedOptions.parseMode,
      schema,
      parsedOptions.columnNameOfCorruptRecord)
    iter.flatMap(parser.parse)
  }
  sparkSession.internalCreateDataFrame(parsed, schema, isStreaming = jsonDataset.isStreaming)
}
You can implement the feature in a programmatic way (see the sketch after this list):
Read all the files in the folder using sc.textFile (or sc.wholeTextFiles to keep each file whole).
For each file, parse the entries using an XML parser.
If a file is valid, redirect it to another path.
If it is invalid, write it to the bad-records path.
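A minimal Scala sketch of that approach, assuming the files are small enough for wholeTextFiles and that "parses with scala.xml" is an acceptable validity check (paths are placeholders):

import scala.util.Try
import scala.xml.XML

// Read each file as a (path, content) pair so the file name stays available for routing.
val files = sc.wholeTextFiles("/tmp/data")

// A file counts as "valid" here simply if scala.xml can parse it; adapt the check as needed.
val tagged = files.map { case (path, content) =>
  (path, content, Try(XML.loadString(content)).isSuccess)
}

// Route the names of the bad files to a bad-records path...
tagged.filter { case (_, _, ok) => !ok }
  .map { case (path, _, _) => path }
  .saveAsTextFile("/tmp/data/failed")

// ...and keep only the valid XML contents for further processing (e.g. with spark-xml).
val validXml = tagged.filter { case (_, _, ok) => ok }
  .map { case (_, content, _) => content }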

Related

Parsing data-forge DataFrame with missing data

I have a CSV file formatted like this and containing missing data:
time,col1,col2
0,12.3,99.2
1,,101.0
2,10.5,101.4
3,,102.5
4,11.9,
I'm using data-forge-js to read the data from CSV and convert it into floats:
const stringDF = dataForge.fromCSV(csvData);
const parsedDF = stringDF.parseFloats(stringDF.getColumnNames());
But this results in an empty dataframe because (I assume) there are missing values in the data. It works when there are no missing values. How can I insert NaNs or some stand-in value for missing data?
I was getting confused by the lazy evaluation of DataFrames in data-forge. I needed to call parsedDF.bake() to force evaluation so that when it's printed in the console I could see the parsed data.
Working solution here: https://runkit.com/shrinkinguniverse/parse-dataframe-with-missing-data-using-data-forge
const dataForge = require('data-forge');
const csvString = `time,col1,col2
0,12.3,99.2
1,,101.0
2,10.5,101.4
3,,102.5
4,11.9,`
const stringDF = dataForge.fromCSV(csvString);
let parsedDF = stringDF.parseFloats(stringDF.getColumnNames());
parsedDF = parsedDF.bake(); // Needed to force the lazy evaluation to run.
console.log(parsedDF);

How do you transform a `FixedSqlAction` into a `StreamingDBIO` in Slick?

I'm creating an akka-stream using Alpakka and the Slick module but I'm stuck in a type mismatch problem.
One branch is about getting the total number of invoices in their table:
def getTotal(implicit session: SlickSession) = {
  import session.profile.api._
  val query = TableQuery[Tables.Invoice].length.result
  Slick.source(query)
}
But the last line doesn't compile, because Alpakka expects a StreamingDBIO while I'm providing a FixedSqlAction[Int, slick.dbio.NoStream, slick.dbio.Effect.Read].
How can I move from the non-streaming result to the streaming one?
Taking the length of a table results in a single value, not a stream, so the simplest way to get a Source to feed a stream is:
def getTotal(implicit session: SlickSession): Source[Int, NotUsed] =
  Source.lazyFuture { () =>
    import session.profile.api._
    // Don't actually run the query until the stream has materialized and
    // demand has reached the source
    val query = TableQuery[Tables.Invoice].length.result
    session.db.run(query)
  }
Alpakka's Slick connector is more oriented towards streaming the results of queries that return many rows (including managing pagination, etc.). For a single result, converting the Future that vanilla Slick gives you into a stream is sufficient. A sketch of the genuinely streaming case follows at the end of this answer.
If you want to start executing the query as soon as you call getTotal (note that this happens whether or not the downstream ever runs or demands data from the source), you can have:
def getTotal(implicit session: SlickSession): Source[Int, NotUsed] = {
  import session.profile.api._
  val query = TableQuery[Tables.Invoice].length.result
  Source.future(session.db.run(query))
}
Would something like this work for you?
import scala.concurrent.Await
import scala.concurrent.duration._

def getTotal(): Int = {
  // Doc: Expressions (scalar values)
  // https://scala-slick.org/doc/3.2.0/queries.html
  val query = TableQuery[Tables.Invoice].length.result
  val res = Await.result(session.db.run(query), 60.seconds)
  println(s"Result: $res")
  res
}

Apache Beam - Write BigQuery TableRow to Cassandra

I'm trying to read data from BigQuery (using TableRow) and write the output to Cassandra. How do I do that?
Here's what I've tried. This works:
/* Read BQ */
PCollection<CxCpmMapProfile> data = p.apply(BigQueryIO.read(new SerializableFunction<SchemaAndRecord, CxCpmMapProfile>() {
    public CxCpmMapProfile apply(SchemaAndRecord record) {
        GenericRecord r = record.getRecord();
        return new CxCpmMapProfile(r.get("channel_no").toString(), r.get("channel_name").toString());
    }
}).fromQuery("SELECT channel_no, channel_name FROM `dataset_name.table_name`").usingStandardSql().withoutValidation());

/* Write to Cassandra */
data.apply(CassandraIO.<CxCpmMapProfile>write()
    .withHosts(Arrays.asList("<IP addr1>", "<IP addr2>"))
    .withPort(9042)
    .withUsername("cassandra_user").withPassword("cassandra_password").withKeyspace("cassandra_keyspace")
    .withEntity(CxCpmMapProfile.class));
But when I changed the Read BQ part to use TableRow like this:
/* Read from BQ using readTableRows */
PCollection<TableRow> data = p.apply(BigQueryIO.readTableRows()
    .fromQuery("SELECT channel_no, channel_name FROM `dataset_name.table_name`")
    .usingStandardSql().withoutValidation());
the Write to Cassandra part gave me the following error:
The method apply(PTransform<? super PCollection<TableRow>,OutputT>) in the type PCollection<TableRow> is not applicable for the arguments (CassandraIO.Write<CxCpmMacProfile>)
The error is due to the input PCollection containing TableRow elements, while CassandraIO.Write expects CxCpmMacProfile elements. You need to read the elements from BigQuery as CxCpmMacProfile elements directly. The BigQueryIO documentation has an example of reading rows from a table and parsing them into a custom type via the read(SerializableFunction) method; this is essentially the same pattern as your first snippet, mapping each SchemaAndRecord into a CxCpmMacProfile.

Spark dataset : Casting Columns of dataset

This is my dataset:
Dataset<Row> myResult = pot.select(col("number")
, col("document")
, explode(col("mask")).as("mask"));
I now need to create a new dataset from the existing myResult, something like below:
Dataset<Row> myResultNew = myResult.select(col("number")
    , col("name")
    , col("age")
    , col("class")
    , col("mask"));
name, age and class are created from the column document of Dataset myResult.
I guess I can call functions on the column document and then perform any operation on that.
myResult.select(extract(col("document")));

private String extract(final Column document) {
    // TODO: add new columns name, age, class to the new dataset.
    // Parse the document and get them.
    XMLParser doc = (XMLParser) document; // this doesn't work
}
My question is: document is of type Column and I need to convert it into a different object type and parse it to extract name, age and class. How can I do that? document is an XML string, and I need to parse it to get the other 3 columns, so I can't avoid the conversion.
Converting the extract method into a UDF would be a solution that is as close as possible to what you are asking. A UDF can take the value of one or more columns and execute any logic with this input.
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.udf;

[...]

UserDefinedFunction extract = udf(
    (String document) -> {
        List<String> result = new ArrayList<>();
        XMLParser doc = XMLParser.parse(document);
        String name = ...   // read name from xml document
        String age = ...    // read age from xml document
        String clazz = ...  // read class from xml document
        result.add(name);
        result.add(age);
        result.add(clazz);
        return result;
    }, DataTypes.createArrayType(DataTypes.StringType)
);
A restriction of UDFs is that they can only return one column. Therefore the function returns a String array that has to be unpacked afterwards.
Dataset<Row> myResultNew = myResult
    .withColumn("extract", extract.apply(col("document"))) //1
    .withColumn("name", col("extract").getItem(0))          //2
    .withColumn("age", col("extract").getItem(1))           //2
    .withColumn("class", col("extract").getItem(2))         //2
    .drop("document", "extract");                            //3
1. Call the UDF and use the column that contains the xml document as parameter of the apply function.
2. Create the result columns out of the array returned in step 1.
3. Drop the intermediate columns.
Note: the UDF is executed once per row of the dataset. If the creation of the XML parser is expensive, this might slow down the Spark job, as one parser is instantiated per row. Due to the parallel nature of Spark it is not possible to reuse the parser for the next row. If this is an issue, another option (at least in the Java world slightly more complex) would be to use mapPartitions. With mapPartitions you do not need one parser per row, only one parser per partition of the dataset; a sketch follows below.
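A rough Scala sketch of the mapPartitions idea (the Java version is analogous but needs a MapPartitionsFunction and an explicit Encoder). XMLParser and its accessors are hypothetical, just as in the UDF above, and the columns are assumed to be strings:

import spark.implicits._   // assumes `spark` is the active SparkSession

val myResultNew = myResult
  .select($"number".cast("string"), $"document", $"mask".cast("string"))
  .as[(String, String, String)]
  .mapPartitions { rows =>
    val parser = new XMLParser()          // hypothetical parser, created once per partition
    rows.map { case (number, document, mask) =>
      val doc = parser.parse(document)    // hypothetical parse call
      (number, doc.name, doc.age, doc.clazz, mask)
    }
  }
  .toDF("number", "name", "age", "class", "mask")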
A completely different approach would be to use spark-xml.
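As a pointer for that route: recent versions of spark-xml also expose a from_xml column function (plus a schema_of_xml helper) that can parse an XML string column directly into a struct. Treat the exact imports, signatures and field names below as assumptions to check against the spark-xml version you use:

import com.databricks.spark.xml.functions.from_xml
import com.databricks.spark.xml.schema_of_xml
import spark.implicits._

// Infer a schema from the XML strings, then parse the column in place.
val docSchema = schema_of_xml(myResult.select($"document").as[String])
val parsed = myResult
  .withColumn("doc", from_xml($"document", docSchema))
  .select($"number", $"doc.name", $"doc.age", $"doc.class", $"mask")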

How to read multiple text files into a single RDD?

I want to read a bunch of text files from an HDFS location and perform mapping on them in an iteration using Spark.
JavaRDD<String> records = ctx.textFile(args[1], 1); is capable of reading only one file at a time.
I want to read more than one file and process them as a single RDD. How?
You can specify whole directories, use wildcards, and even a comma-separated list of directories and wildcards. E.g.:
sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")
As Nick Chammas points out, this is an exposure of Hadoop's FileInputFormat, and therefore it also works with Hadoop (and Scalding).
Use union as follows:
val sc = new SparkContext(...)
val r1 = sc.textFile("xxx1")
val r2 = sc.textFile("xxx2")
...
val rdds = Seq(r1, r2, ...)
val bigRdd = sc.union(rdds)
Then the bigRdd is the RDD with all files.
You can use a single textFile call to read multiple files. Scala:
sc.textFile(files.mkString(","))
You can use the following approach.
First, get a Buffer/List of S3 paths:
import scala.collection.JavaConverters._
import java.util.ArrayList
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectListing
import com.amazonaws.services.s3.model.S3ObjectSummary
import com.amazonaws.services.s3.model.ListObjectsRequest
def listFiles(s3_bucket: String, base_prefix: String) = {
  var files = new ArrayList[String]

  // S3 client and ListObjects request
  var s3Client = new AmazonS3Client()
  var objectListing: ObjectListing = null
  var listObjectsRequest = new ListObjectsRequest()

  // Your S3 bucket
  listObjectsRequest.setBucketName(s3_bucket)

  // Your folder path or prefix
  listObjectsRequest.setPrefix(base_prefix)

  // Adding s3:// to the paths and adding them to a list
  do {
    objectListing = s3Client.listObjects(listObjectsRequest)
    for (objectSummary <- objectListing.getObjectSummaries().asScala) {
      files.add("s3://" + s3_bucket + "/" + objectSummary.getKey())
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker())
  } while (objectListing.isTruncated())

  // Removing the base directory name
  files.remove(0)

  // Converting to a Scala collection
  files.asScala
}
Now pass this list object to the following piece of code. Note: sc is an object of SparkContext.
import org.apache.spark.rdd.RDD

var df: RDD[String] = null
for (file <- files) {
  val fileDf = sc.textFile(file)
  if (df != null) {
    df = df.union(fileDf)
  } else {
    df = fileDf
  }
}
Now you have a final unified RDD, i.e. df.
Optionally, you can also repartition it into a single BigRDD:
val files = sc.textFile(filename, 1).repartition(1)
Repartitioning always works :D
In PySpark, I have found an additional useful way to parse files. Perhaps there is an equivalent in Scala, but I am not comfortable enough to come up with a working translation. It is, in effect, a textFile call with the addition of labels (in the example below the key = filename, value = 1 line from the file).
"Labeled" textFile
input:
import glob
from pyspark import SparkContext
from pyspark.sql import SQLContext

SparkContext.stop(sc)
sc = SparkContext("local", "example")  # if running locally
sqlContext = SQLContext(sc)

Spark_Full = sc.emptyRDD()  # start empty, then union in each labeled file
for filename in glob.glob(Data_File + "/*"):
    Spark_Full += sc.textFile(filename).keyBy(lambda x: filename)
output: an RDD where each entry is a tuple with the filename as key and value = one line of the file. (Technically, using this method you can also use a different key besides the actual file path name, perhaps a hashing representation to save memory.) I.e.:
[('/home/folder_with_text_files/file1.txt', 'file1_contents_line1'),
('/home/folder_with_text_files/file1.txt', 'file1_contents_line2'),
('/home/folder_with_text_files/file1.txt', 'file1_contents_line3'),
('/home/folder_with_text_files/file2.txt', 'file2_contents_line1'),
...]
You can also recombine either as a list of lines:
Spark_Full.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()
[('/home/folder_with_text_files/file1.txt', ['file1_contents_line1', 'file1_contents_line2','file1_contents_line3']),
('/home/folder_with_text_files/file2.txt', ['file2_contents_line1'])]
Or recombine entire files back into single strings (in this example the result is the same as what you get from wholeTextFiles, but with the string "file:" stripped from the file paths):
Spark_Full.groupByKey().map(lambda x: (x[0], ' '.join(list(x[1])))).collect()
You can use
JavaPairRDD<String, String> records = sc.wholeTextFiles("path of your directory")
Here you will get the path of each file and the content of that file, so you can act on a whole file at a time, which saves the overhead.
All the answers are correct with sc.textFile.
I was just wondering why not wholeTextFiles, for example, in this case...
val minPartitions = 2
val path = "/pathtohdfs"
sc.wholeTextFiles(path, minPartitions)
  .flatMap { case (path, text) =>
    ...
  }
One limitation is that we have to load small files, otherwise performance will be bad and may lead to OOM.
Note:
The whole file should fit into memory.
Good for file formats that are NOT splittable by line, such as XML files.
There is a straightforward, clean solution available: use the wholeTextFiles() method. It takes a directory and forms key-value pairs. The returned RDD is a pair RDD.
Find below the description from Spark docs:
SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with textFile, which would return one record per line in each file
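A minimal Scala sketch of that, assuming a directory of small files (the path is a placeholder):

// (filename, content) pairs; the result is an RDD[(String, String)]
val pairRdd = sc.wholeTextFiles("hdfs:///path/to/dir")

// Recover a plain line-oriented RDD if that is what the downstream code expects
val lines = pairRdd.flatMap { case (_, content) => content.split("\n") }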
TRY THIS
Interface used to write a DataFrame to external storage systems (e.g. file systems, key-value stores, etc). Use DataFrame.write() to access this.
New in version 1.4.
csv(path, mode=None, compression=None, sep=None, quote=None, escape=None, header=None, nullValue=None, escapeQuotes=None, quoteAll=None, dateFormat=None, timestampFormat=None)
Saves the content of the DataFrame in CSV format at the specified path.
Parameters:
path – the path in any Hadoop supported file system
mode –
specifies the behavior of the save operation when data already exists.
append: Append contents of this DataFrame to existing data.
overwrite: Overwrite existing data.
ignore: Silently ignore this operation if data already exists.
error (default case): Throw an exception if data already exists.
compression – compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and deflate).
sep – sets the single character as a separator for each field and value. If None is set, it uses the default value, ,.
quote – sets the single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default value, ". If you would like to turn off quotations, you need to set an empty string.
escape – sets the single character used for escaping quotes inside an already quoted value. If None is set, it uses the default value, \
escapeQuotes – A flag indicating whether values containing quotes should always be enclosed in quotes. If None is set, it uses the default value true, escaping all values containing a quote character.
quoteAll – A flag indicating whether all values should always be enclosed in quotes. If None is set, it uses the default value false, only escaping values containing a quote character.
header – writes the names of columns as the first line. If None is set, it uses the default value, false.
nullValue – sets the string representation of a null value. If None is set, it uses the default value, empty string.
dateFormat – sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type. If None is set, it uses the default value, yyyy-MM-dd.
timestampFormat – sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type. If None is set, it uses the default value, yyyy-MM-dd'T'HH:mm:ss.SSSZZ.
rdd = sc.textFile('/data/{1.txt,2.txt}')

Resources