DynamicFrame resolve choice between Array and Struct - apache-spark

I'm using AWS Glue to crawl XML files and add them to a Glue database table. The DynamicFrame I'm using identifies several choices in the XML schema. I can resolve most of them, but there's one case that I can't figure out.
The relevant part of the XML structure is:
<root>
  <order>
    <lineitems>
      <lineitem>
        ...
      </lineitem>
    </lineitems>
  </order>
</root>
The DynamicFrame shows lineitems as a struct and lineitems/lineitem as a choice between array and struct, I suspect because some orders have a single lineitem while other orders have multiple. I've tried calling resolveChoice with project:array, but that results in element:unknown, so I can no longer see the structure of the lineitem. I'm not sure what else to try here; any ideas?
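For reference, a minimal sketch of the resolveChoice call described above, assuming the table comes from the Glue Data Catalog; the database name, table name, and field path are placeholders, not taken from the original post:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Placeholder catalog database and table; the field path follows the XML structure above
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="xml_db",
    table_name="orders",
)

# The call described above, which ends up producing element:unknown
resolved = dyf.resolveChoice(specs=[("order.lineitems.lineitem", "project:array")])
resolved.printSchema()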

Related

Invalid date: Error while importing CSV to Cassandra using pySpark

I'm using a Jupyter notebook to run pySpark code to import a CSV file into Cassandra v3.11.3 and I'm getting the error below.
[screenshot of the error stack trace, ending in "... 1 more"]
The pySpark code I have attached as a picture:
[screenshot of the pySpark code]
Any inputs...
Without the full trace it's hard to know exactly where this is failing. The method you pasted is just the py4j wrapper method, and we really would need to see the underlying Java exception.
From what I can tell, it looks like you are also attempting to use some options on the C* write that are unsupported. For example, "mode" = "DROPMALFORMED" is not a valid C* connector option. DataFrame writer and reader options are source specific, so you are unfortunately unable to mix and match.
This makes me think that the data being written actually has a malformed date string or two, and this code is dying when attempting to write the broken record. One way around this would be to do the date casting on the CSV read, which I believe does support DROPMALFORMED-style parsing options.
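A rough sketch of that suggestion (the schema, date format, file path, and the table/keyspace names are placeholders, not taken from the post): parse dates during the CSV read with DROPMALFORMED so broken rows are dropped before they ever reach the Cassandra write.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DateType

spark = SparkSession.builder.appName("csv-to-cassandra").getOrCreate()

# Placeholder schema: an id column plus a date column that may contain malformed values
schema = StructType([
    StructField("id", StringType(), True),
    StructField("created_on", DateType(), True),
])

df = (spark.read
      .schema(schema)
      .option("header", True)
      .option("dateFormat", "yyyy-MM-dd")
      .option("mode", "DROPMALFORMED")  # CSV reader option: rows with unparseable dates are dropped here
      .csv("input.csv"))

# Only clean rows reach the C* connector; table and keyspace are placeholders
(df.write
   .format("org.apache.spark.sql.cassandra")
   .options(table="my_table", keyspace="my_keyspace")
   .mode("append")
   .save())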

Is dataframe.columns a Spark action?

If not, there is no action in the following code, yet "./demo.json" is read once.
val x = spark.read.json("./demo.json")
println(x.columns)
dataframe.columns is not an action per se, but it needs the schema of your dataframe. Depending on the file format, that requires a file scan (JSON, CSV). With other file formats like Parquet, the columns can be extracted from the metadata, so no actual file scan is needed.
spark.read.json is an action that reads all your data to infer the schema (unless you specify it manually). Hence x.columns will not trigger any action.
According to the latest documentation (click on json):
This function goes through the input once to determine the input schema. If you know the schema in advance, use the version that specifies the schema to avoid the extra scan.
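As a small illustration of that last point (a sketch in PySpark with made-up field names, assuming an existing SparkSession named spark): supplying the schema up front avoids the inference scan, and columns is then answered from the provided schema alone.

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Made-up schema for demo.json; adjust to the real fields
schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
])

x = spark.read.schema(schema).json("./demo.json")
print(x.columns)  # no job runs just to list the columns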

Spark: reading files with PERMISSIVE and provided schema - issues with corrupted records column

I am reading a CSV file with Spark. I am providing a schema for the file and reading it in PERMISSIVE mode. I would like to keep all the corrupt records in columnNameOfCorruptRecord (in my case corrupted_records).
I went through hell to set this up and I still get warnings that I cannot suppress; is there something I'm missing?
So first, in order to have the corrupted_records column, I needed to add it to the schema as StringType. This is documented, so okay.
But whenever I read a file I get a warning that the schema doesn't match because the number of columns is different. It's just a warning, but it's filling my logs.
Also, when there is a field that is not nullable and there is a corrupted record, the corrupted record goes to the corrupted_records column and all of its fields are set to null, so I get an exception because I have a non-nullable field. The only way to solve this is to change the non-nullable columns to nullable, which is quite strange.
Am I missing something?
Recap:
1. Is there a way to ignore the warning when I've added the corrupted_records column to the schema?
2. Is there a way to use PERMISSIVE mode and the corrupted_records column with a schema that has non-nullable fields?
Thanks!
The following documentation might help. It would be great if you at least provided the code you've written.
https://docs.databricks.com/spark/latest/data-sources/read-csv.html
A demo of a JSON read snippet:
df = self.spark.read.option("mode", "PERMISSIVE").option("primitivesAsString", True).json(self.src_path)
To answer your point 2, you first need to address point 1 properly.
Point 1: you should do an analysis of your file and map your schema with all the fields in your file. After having imported your CSV file into a DataFrame, I would select your fields of interest and continue with what you were doing.
Point 2: you will solve your problem by defining your schema as follows (shown here in PySpark):
import pyspark.sql.types as types

yourSchema = (types.StructType()
    .add('field0', types.IntegerType(), True)
    # add the rest of your fields with .add(fieldName, fieldType, True); True makes the field nullable
    .add('corrupted_records', types.StringType(), True)  # your corrupted data will end up here
)
With this having been defined, you can import your csv file into a DataFrame as follows:
df = ( spark.read.format("csv")
.schema(yourSchema)
.option("mode", "PERMISSIVE")
.option("columnNameOfCorruptRecord", "corrupted_records")
load(your_csv_files)
)
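As a possible follow-up (a sketch reusing the df above), you can then inspect the rows that landed in the corrupt record column; caching first avoids Spark's restriction on queries that reference only the internal corrupt record column.

df.cache()  # needed before filtering on the corrupt record column alone
bad_rows = df.filter(df["corrupted_records"].isNotNull())
bad_rows.show(truncate=False)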
There are also other ways to do the same operation, and different approaches to handling bad data; have a look at this insightful article: https://python.plainenglish.io/how-to-handle-bad-data-in-spark-sql-5e0276d37ca1

Use recursive globbing to extract XML documents as strings in pyspark

The goal is to extract XML documents, given an XPath expression, from a group of text files as strings. The difficulty is the variety of forms the text files may come in. They might be:
single zip / tar file with 100 files, each 1 XML document
one file, with 100 XML documents (aggregate document)
one zip / tar file, with varying levels of directories, with single XML records as files and aggregate XML files
I thought I had found a solution with Databricks' Spark-XML library, as it handles recursive globbing when reading files. It was amazing. I could do things like:
# read directory of loose files
df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='mods:mods').load('file:///tmp/combine/qs/mods/*.xml')
# recursively discover and parse
df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='mods:mods').load('file:///tmp/combine/qs/**/*.xml')
# even read archive files without additional work
df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='mods:mods').load('file:///tmp/combine/mods_archive.tar')
The problem is that this library is focused on parsing the XML records into DataFrame columns, whereas my goal is to retrieve just the XML documents as strings for storage.
My Scala is not strong enough to easily hack at the Spark-XML library so that it keeps the recursive globbing and XPath selection of documents but skips the parsing and instead saves the entire XML record as a string.
The library comes with the ability to serialize DataFrames to XML, but the serialization is decidedly different from the input (which is to be expected to some degree). For example, element text values become element attributes. Given the following original XML:
<mods:role>
  <mods:roleTerm authority="marcrelator" type="text">creator</mods:roleTerm>
</mods:role>
reading then serializing with Spark-XML returns:
<mods:role>
  <mods:roleTerm VALUE="creator" authority="marcrelator" type="text"></mods:roleTerm>
</mods:role>
However, even if I could get the VALUE to be serialized as an actual element value, I still wouldn't be achieving my end goal of having these XML documents, discovered and read via Spark-XML's excellent globbing and XPath selection, available as plain strings.
Any insight would be appreciated.
Found a solution from this Databricks Spark-XML issue:
xml_rdd = sc.newAPIHadoopFile(
    'file:///tmp/mods/*.xml',
    'com.databricks.spark.xml.XmlInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={
        'xmlinput.start': '<mods:mods>',
        'xmlinput.end': '</mods:mods>',
        'xmlinput.encoding': 'utf-8',
    },
)
Expecting 250 records, and got 250 records. Simple RDD with entire XML record as a string:
In [8]: xml_rdd.first()
Out[8]:
(4994,
'<mods:mods xmlns:mets="http://www.loc.gov/METS/" xmlns:xl="http://www.w3.org/1999/xlink" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.openarchives.org/OAI/2.0/" version="3.0">\n\n\n <mods:titleInfo>\n\n\n <mods:title>Jessie</mods:title>\n\n\n...
...
...
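Each element of the RDD is an (offset, xml_string) pair, so a small map keeps just the XML strings for storage (a sketch reusing the xml_rdd above):

xml_strings = xml_rdd.map(lambda pair: pair[1])
print(xml_strings.count())  # expect 250, matching the count above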
Credit to the Spark-XML maintainer(s) for a wonderful library, and attentiveness to issues.

Infer a dataframe schema in spark from json file

I am looking to infer a data schema from a .json doc and then load data from a database that has type="Customer".
The current code I am using does work, but the issue I face is that not all documents within the Couchbase database contain the exact same schema. Because the schema is inferred from a sample of documents, the inferred schema will not contain all the data fields that I require. When I then load the DataFrame I get a wrapped array error.
Python Code
%pyspark
df = sqlContext.read.format("com.couchbase.spark.sql.DefaultSource").option("schemaFilter", "type=\"Customer\"").load()
If you have had this issue previously, all suggestions on how to solve it are welcome.
Many thanks!!
