Use recursive globbing to extract XML documents as strings in pyspark - apache-spark

The goal is to extract XML documents, given an XPath expression, from a group of text files as strings. The difficulty is the variety of forms the text files may take. They might be:
single zip / tar file with 100 files, each 1 XML document
one file, with 100 XML documents (aggregate document)
one zip / tar file, with varying levels of directories, with single XML records as files and aggregate XML files
I thought I had found a solution with Databricks' Spark-XML library, as it handles recursive globbing when reading files. It was amazing. I could do things like:
# read directory of loose files
df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='mods:mods').load('file:///tmp/combine/qs/mods/*.xml')
# recursively discover and parse
df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='mods:mods').load('file:///tmp/combine/qs/**/*.xml')
# even read archive files without additional work
df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='mods:mods').load('file:///tmp/combine/mods_archive.tar')
The problem is that this library is focused on parsing the XML records into DataFrame columns, whereas my goal is to retrieve just the XML documents as strings for storage.
My Scala is not strong enough to easily hack at the Spark-XML library so that it keeps the recursive globbing and XPath selection of documents but skips the parsing and instead saves the entire XML record as a string.
The library comes with the ability to serialize DataFrames to XML, but the serialization is decidedly different from the input (which is to be expected to some degree). For example, element text values become element attributes. Given the following original XML:
<mods:role>
<mods:roleTerm authority="marcrelator" type="text">creator</mods:roleTerm>
</mods:role>
reading and then serializing with Spark-XML returns:
<mods:role>
<mods:roleTerm VALUE="creator" authority="marcrelator" type="text"></mods:roleTerm>
</mods:role>
However, even if I could get the VALUE to be serialized as an actual element value, I'm still not achieving my end goal of having the XML documents that were discovered and read via Spark-XML's excellent globbing and XPath selection available as plain strings.
Any insight would be appreciated.

Found a solution from this Databricks Spark-XML issue:
xml_rdd = sc.newAPIHadoopFile(
    'file:///tmp/mods/*.xml',
    'com.databricks.spark.xml.XmlInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={
        'xmlinput.start': '<mods:mods>',
        'xmlinput.end': '</mods:mods>',
        'xmlinput.encoding': 'utf-8'
    }
)
Expecting 250 records, I got 250 records: a simple RDD with each entire XML record as a string:
In [8]: xml_rdd.first()
Out[8]:
(4994,
'<mods:mods xmlns:mets="http://www.loc.gov/METS/" xmlns:xl="http://www.w3.org/1999/xlink" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.openarchives.org/OAI/2.0/" version="3.0">\n\n\n <mods:titleInfo>\n\n\n <mods:title>Jessie</mods:title>\n\n\n...
...
...
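Since each element of that RDD is a (byte offset, XML string) pair, dropping the keys leaves just the document strings. A minimal sketch (the output path is hypothetical):
# keep only the XML document strings from the (offset, string) pairs
xml_strings = xml_rdd.values()
# e.g. write them out for storage (hypothetical path)
xml_strings.saveAsTextFile('file:///tmp/combine/mods_strings')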
Credit to the Spark-XML maintainer(s) for a wonderful library, and attentiveness to issues.

Related

pyspark read multiple csv files at once

I'm using Spark to read files in HDFS. There is a scenario where we receive files in chunks from a legacy system in CSV format.
ID1_FILENAMEA_1.csv
ID1_FILENAMEA_2.csv
ID1_FILENAMEA_3.csv
ID1_FILENAMEA_4.csv
ID2_FILENAMEA_1.csv
ID2_FILENAMEA_2.csv
ID2_FILENAMEA_3.csv
These files are loaded into FILENAMEA in Hive using the Hive Warehouse Connector, with a few transformations like adding default values. We have around 70 tables like this. The Hive tables are created in ORC format and partitioned on ID. Right now I'm processing all these files one by one, which takes a long time.
I want to make this process much faster. Files will be in GBs.
Is there any way to read all the FILENAMEA files at the same time and load them into the Hive tables?
There are two ways to read several CSV files in PySpark. If all the CSV files are in the same directory and share the same schema, you can read them at once by passing the directory path directly as the argument:
spark.read.csv('hdfs://path/to/directory')
If the CSV files are in different locations, or in a directory that also contains other CSV/text files, you can pass them as a comma-separated string of paths to the .csv() method:
spark.read.csv('hdfs://path/to/filename1,hdfs://path/to/filename2')
You can find more information about how to read a CSV file with Spark here.
If you need to build this list of paths from the files in an HDFS directory, you can look at this answer. Once you've created your list of paths, you can turn it into a single string to pass to the .csv() method with ','.join(your_file_list).
Using spark.read.csv(["path1", "path2", "path3", ...]) you can read multiple files from different paths. But that means you first have to build a list of the paths: an actual list, not a string of comma-separated file paths.
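A minimal PySpark sketch of both approaches described above; the paths are hypothetical placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Option 1: a single comma-separated string of paths
df = spark.read.csv('hdfs://nn/data/ID1_FILENAMEA_1.csv,hdfs://nn/data/ID1_FILENAMEA_2.csv',
                    header=True, inferSchema=True)

# Option 2: a Python list of paths, e.g. built from an HDFS listing
paths = ['hdfs://nn/data/ID1_FILENAMEA_{}.csv'.format(i) for i in range(1, 5)]
df = spark.read.csv(paths, header=True, inferSchema=True)

# Either way, all matching files land in one DataFrame that can be
# transformed once and written to the Hive table in a single pass.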

How to include SQLAlchemy Data types in Python dictionary

I've written an application using Python 3.6, pandas and sqlalchemy to automate the bulk loading of data into various back-end databases and the script works well.
As a brief summary, the script reads data from various excel and csv source files, one at a time, into a pandas dataframe and then uses the df.to_sql() method to write the data to a database. For maximum flexibility, I use a JSON file to provide all the configuration information including the names and types of source files, the database engine connection strings, the column titles for the source file and the column titles in the destination table.
When my script runs, it reads the JSON configuration, imports the specified source data into a dataframe, renames source columns to match the destination columns, drops any columns from the dataframe that are not required and then writes the dataframe contents to the database table using a call similar to:
df.to_sql(strTablename, con=engine, if_exists="append", index=False, chunksize=5000, schema="dbo")
The problem I have is that I would also like to specify the column data types in the df.to_sql method and provide them as inputs from the JSON configuration file. However, this doesn't appear to be possible, as all the strings in the JSON file need to be enclosed in quotes and they don't then translate when read by my code. This is how the df.to_sql call should look:
df.to_sql(strTablename, con=engine, if_exists="append", dtype=dictDatatypes, index=False, chunksize=5000, schema="dbo")
The entries that form the dtype dictionary from my JSON file look like this:
"Data Types": {
"EmployeeNumber": "sqlalchemy.types.NVARCHAR(length=255)",
"Services": "sqlalchemy.types.INT()",
"UploadActivities": "sqlalchemy.types.INT()",
......
and there are many more, one for each column.
However, when the above is read in as a dictionary and passed to the df.to_sql method, it doesn't work, since the SQLAlchemy datatypes shouldn't be enclosed in quotes, but I can't get around this in my JSON file. The dictionary values therefore aren't recognised by pandas. They look like this:
{'EmployeeNumber': 'sqlalchemy.types.INT()', ....}
And they really need to look like this:
{'EmployeeNumber': sqlalchemy.types.INT(), ....}
Does anyone have experience of this to suggest how I might be able to have the sqlalchemy datatypes in my configuration file?
You could use eval() to convert the string names to objects of that type:
import sqlalchemy as sa
from pprint import pprint

dict_datatypes = {"EmployeeNumber": "sa.INT", "EmployeeName": "sa.String(50)"}
pprint(dict_datatypes)
"""console output:
{'EmployeeName': 'sa.String(50)', 'EmployeeNumber': 'sa.INT'}
"""

# evaluate each string into the SQLAlchemy type object it names
for key in dict_datatypes:
    dict_datatypes[key] = eval(dict_datatypes[key])

pprint(dict_datatypes)
"""console output:
{'EmployeeName': String(length=50),
 'EmployeeNumber': <class 'sqlalchemy.sql.sqltypes.INTEGER'>}
"""
Just be sure that you do not pass untrusted input values to functions like eval() and exec().
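If you want to avoid eval() entirely, a whitelist lookup of allowed type names is a safer pattern. This is only a sketch under the assumption that the JSON stores a simple type name plus an optional length per column; the helper name, config shape, and key names are hypothetical:
import sqlalchemy.types as sqltypes

# whitelist of type names that may appear in the JSON config
ALLOWED_TYPES = {
    "NVARCHAR": sqltypes.NVARCHAR,
    "INT": sqltypes.INTEGER,
    "FLOAT": sqltypes.Float,
}

def build_dtype(config):
    """Turn {'EmployeeNumber': {'type': 'NVARCHAR', 'length': 255}, ...}
    into {'EmployeeNumber': NVARCHAR(length=255), ...} for df.to_sql(dtype=...)."""
    dtype = {}
    for column, spec in config.items():
        sa_type = ALLOWED_TYPES[spec["type"]]
        dtype[column] = sa_type(spec["length"]) if "length" in spec else sa_type()
    return dtype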

Converting 2TB of gzipped multiline JSONs to NDJSONs

For my research I have a dataset of about 20,000 gzipped multiline JSON files (~2TB, all with the same schema). I need to process and clean this data (I should say I'm very new to data analytics tools).
After spending a few days reading about Spark and Apache Beam, I'm convinced the first step should be to convert this dataset to NDJSON. Most books and tutorials assume you are working with some newline-delimited file.
What is the best way to go about converting this data?
I've tried just launching a large instance on gcloud and using gunzip and jq to do this. Not surprisingly, it seems that this will take a long time.
Thanks in advance for any help!
Apache Beam supports decompressing files if you use TextIO, but the delimiter remains the newline.
For multiline JSON you can instead read each complete file in parallel, convert the JSON string to a POJO, and then reshuffle the data to utilize parallelism.
So the steps would be:
Get the file list > Read individual files > Parse file content to JSON objects > Reshuffle > ...
You can get the file list with FileSystems.match("gcs://my_bucker").metadata().
Read individual files with Compression.detect(fileResourceId.getFilename()).readDecompressed(FileSystems.open(fileResourceId)).
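The steps above use the Beam Java SDK; a rough equivalent sketch with the Beam Python SDK might look like the following. The bucket path is a hypothetical placeholder, and gzip decompression is assumed to be detected from the .gz extension:
import json
import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    _ = (
        p
        # Get the file list
        | fileio.MatchFiles('gs://my-bucket/raw/*.json.gz')
        # Read individual files
        | fileio.ReadMatches()
        # Parse each file's content into a JSON object
        | beam.Map(lambda f: json.loads(f.read_utf8()))
        # Reshuffle to spread the parsed records across workers
        | beam.Reshuffle()
        # Re-serialize one object per line and write out as NDJSON
        | beam.Map(json.dumps)
        | beam.io.WriteToText('gs://my-bucket/ndjson/part')
    )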
Converting to NDJSON is not necessary if you use sc.wholeTextFiles. Point this method at a directory, and you'll get back an RDD[(String, String)] where ._1 is the filename and ._2 is the content of the file.
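Since the files here are gzipped, a variant sketch using sc.binaryFiles with explicit decompression (the paths are hypothetical); each file is treated as one multiline JSON document and re-serialized to a single line:
import gzip
import json

# RDD of (file path, file content as bytes); binaryFiles does not decompress
files = sc.binaryFiles('hdfs:///data/raw/*.json.gz')

def to_ndjson_line(kv):
    path, raw = kv
    text = gzip.decompress(raw).decode('utf-8')  # whole multiline JSON document
    return json.dumps(json.loads(text))          # re-serialize as one line

files.map(to_ndjson_line).saveAsTextFile('hdfs:///data/ndjson')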

Python: Universal XML parser

I'm trying to make a simple Python 3 program that reads weather information from an XML web source, converts it into a Python-readable object (maybe a dictionary), and processes it (for example, visualizing multiple observations in a graph).
The source of the data is the national weather service's (direct translation) XML file at the link provided in the code.
What's different from the typical XML parsing questions on Stack Overflow is that there are repeated tags without an in-tag identifier (<station> tags in my example) and some with one (first line, <observations timestamp="14568.....">). Also, I would like to parse it straight from the website, not a local file. Of course, I could create a local temporary file too.
What I have so far is simply a loading script that gives a string containing the XML for both the forecast and the latest weather observations.
from urllib.request import urlopen
#Read 4-day forecast
forecast = urlopen("http://www.ilmateenistus.ee/ilma_andmed/xml/forecast.php").read().decode("iso-8859-1")
#Get current weather
observ = urlopen("http://www.ilmateenistus.ee/ilma_andmed/xml/observations.php").read().decode("iso-8859-1")
In short, I'm looking for as universal a way as possible to parse XML into a Python-readable object (such as a dictionary/JSON or list) while preserving all of the information in the XML file.
P.S. I would prefer a standard Python 3 module such as xml, which I didn't manage to understand.
Try the xmltodict package for simple conversion of an XML structure to a Python dict: https://github.com/martinblech/xmltodict
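A minimal sketch of that approach against the observations feed from the question; the child element names (name, airtemperature) are assumptions about that feed's structure, not confirmed fields:
import xmltodict
from urllib.request import urlopen

raw = urlopen("http://www.ilmateenistus.ee/ilma_andmed/xml/observations.php").read().decode("iso-8859-1")
doc = xmltodict.parse(raw)

# Attributes are prefixed with '@'; repeated <station> elements come back as a list of dicts.
print(doc["observations"]["@timestamp"])
for station in doc["observations"]["station"]:
    # field names below are guesses at the feed's schema
    print(station.get("name"), station.get("airtemperature"))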

Apache Spark Word Count on PDF file

I want to read PDF files in HDFS and do a word count. I know how to do this in MapReduce.
I need to do the same in Apache Spark. Your help would be greatly appreciated.
Do this:
Modify the code in the blog post you referenced to write the PDF words to an HDFS file or even a plain text file. That post references another of the author's posts: https://amalgjose.wordpress.com/2014/04/13/simple-pdf-to-text-conversion/
Then, once you have the PDF-to-text conversion, you can read the HDFS input from Spark.
Go to http://spark.apache.org/examples.html and look for the Word Count example. There are examples in Scala, Python, and Java. The examples even show how you can specify an HDFS location, but you can use a local filesystem as well.
Good luck
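For reference, a minimal PySpark word count over the converted text files; the input and output paths are hypothetical, and sc is the active SparkContext (e.g. from the pyspark shell):
from operator import add

counts = (sc.textFile('hdfs:///pdf-extracted-text/*.txt')  # text produced by the PDF-to-text step
            .flatMap(lambda line: line.split())            # split each line into words
            .map(lambda word: (word, 1))
            .reduceByKey(add))

counts.saveAsTextFile('hdfs:///pdf-word-counts')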
The SparkContext has a method called hadoopFile. You would need to write a custom FileInputFormat, the same way as when reading images with Spark.
Also see the PDF input format implementation for Hadoop MapReduce.