How to parse CSV files with double-quoted strings in Julia?

I want to read CSV files where the columns are separated by commas. The columns can be strings and if those strings contain a comma in their content, they are wrapped in double-quotes. Currently I'm loading my data using:
file = open("data.csv","r")
data = readcsv(file)
But this code would split the following line into five pieces, whereas it should only be four:
1,"text, more text",3,4
Is there a way in Julia's Standard Library to parse CSV while respecting quoting or do I have to write my own custom solution?

The readcsv function in base is super-basic (just blindly splitting on commas).
You will probably be happier with readtable from the DataFrames.jl package: http://juliastats.github.io/DataFrames.jl/io.html
To use the package, you just need to run `Pkg.add("DataFrames")` and then load it with `using DataFrames`.

The readcsv function in base (0.3 prerelease) can now read quoted columns.
julia> readcsv(IOBuffer("1,\"text, more text\",3,4"))
1x4 Array{Any,2}:
1.0 "text, more text" 3.0 4.0
It is much simpler than DataFrames, but it may be quicker if you just need the data as an array.

Related

weird characters in Pandas dataframe - how to standardize to UTF-8?

I'm using Python + Camelot (OCR library) to read a PDF, clean up, and write to Excel or CSV. There are some non-standard dashes that print out as a weird character.
Using Camelot means I'm not calling "read_csv". It's coming from the PDF. A value that is supposed to be "1-4" prints out as 1–4.
I fixed this using a regular expression but a colleague mentioned I should standardize to UTF-8. I tried to do that for the header like this:
header = df.iloc[0, 1:].str.encode('utf-8')
but then that value becomes b'1\xe2\x80\x934'.
Any advice? The goal is to simply use standard text.
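A minimal sketch of one common fix (the DataFrame below is a hypothetical stand-in for Camelot's output): the cell text is already valid Unicode, and .str.encode('utf-8') merely converts it to bytes, which is why you see b'1\xe2\x80\x934'; replacing the en dash (U+2013) with a plain hyphen keeps everything as ordinary text.
import pandas as pd

# Hypothetical stand-in for the table Camelot extracts; "1–4" contains an en dash (U+2013).
df = pd.DataFrame([["label", "1\u20134", "x"]])

# .str.encode('utf-8') turns already-valid Unicode text into bytes, hence b'1\xe2\x80\x934'.
# To get plain text, replace the dash characters with an ASCII hyphen instead:
header = df.iloc[0, 1:].str.replace("\u2013", "-", regex=False)
header = header.str.replace("\u2014", "-", regex=False)  # also catch em dashes, just in case
print(header.tolist())  # ['1-4', 'x']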

Loading Special Character via Polybase

I am trying to load a file whose string fields are delimited by single quotes. I am able to load the data except for certain records whose string values are in the format below. How can I load these values using PolyBase in SQL Data Warehouse? Any input is highly appreciated.
Eg:
'Don''t Include'
'1'''
'Can''t'
'VM''s'
External File Format:
CREATE EXTERNAL FILE FORMAT SAMPLE_HEADER
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '''',
        DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss',
        USE_TYPE_DEFAULT = False
    )
)
In this case your string delimiter needs to be something other than a single quote.
I assume you're using a comma-delimited file. You have a couple of options:
Make your column delimiter something other than comma.
Make your string delimiter a character that does not exist in your data
Use an output format other than CSV, such as Parquet or ORC
If you're going to use a custom delimiter, I suggest ASCII Decimal(31) or Hex(0x1F), which is specifically reserved for this purpose.
If you're going to use a string delimiter you might use double-quote (but I'm guessing this is in your data) or choose some other character.
That said, my next guess is that you're going to come across data with embedded carriage returns, and this is going to cause yet another layer of problems. For that reason, I suggest you move your extracts to something other than CSV, and look to Parquet or ORC.
Currently, PolyBase in SQL DW does not support escape characters in the delimited text format, so you cannot load your file directly into SQL DW.
In order to load your file, you may pre-process it. During pre-processing you may generate another data file, either in a binary format (PARQUET or ORC, which are directly readable by PolyBase) or another delimited file with a special field separator (any character that is not expected in your data, e.g. | or ~). With such a separator there is no need to escape or delimit the string values.
Hope this helps.
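As a rough illustration of that pre-processing idea (a sketch only; the file names and the | separator are assumptions, and the input is assumed to be comma-separated with single-quote string delimiters and '' as the escaped quote), Python's csv module can rewrite the file with a separator that never appears in the data:
import csv

# Read the single-quote-delimited CSV; doublequote=True turns '' inside a field into a single '.
with open("input.csv", newline="", encoding="utf-8") as src, \
     open("output.psv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.reader(src, delimiter=",", quotechar="'", doublequote=True)
    # Write pipe-separated output with no string delimiter, so PolyBase can read it verbatim.
    writer = csv.writer(dst, delimiter="|", quoting=csv.QUOTE_NONE, escapechar="\\")
    for row in reader:
        writer.writerow(row)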
From Azure docs:
<format_options> ::=
{
FIELD_TERMINATOR = field_terminator
| STRING_DELIMITER = string_delimiter
| First_Row = integer -- ONLY AVAILABLE SQL DW
| DATE_FORMAT = datetime_format
| USE_TYPE_DEFAULT = { TRUE | FALSE }
| Encoding = {'UTF8' | 'UTF16'}
}

Python 3: Read text file that is in list format

I have one large text file that contains data in the form of a list, and it's all on one line. See the example below.
Text file contents: [{"input": "data1"}, {"input": "data2"}, {"input": "data2"}]
I am reading this file using Python 3, and when I use the read() method I get one large string. I want to convert this string to a list while maintaining the same format as in the text file. Is there any way this can be achieved? Most of the posts talk about using the split method, which does not work in this case.
In JavaScript I generally use the stringify and parse methods to do these kinds of conversions but I am not able to find this in python. Any help will be appreciated. Thank you.
You can load JSON from a file using Python's built-in json package.
>>> import json
>>> with open('foo.json') as f:
... data = json.load(f)
...
>>> print(data)
[{'input': 'data1'}, {'input': 'data2'}, {'input': 'data2'}]
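For the JavaScript analogy in the question: json.load / json.loads are the counterpart of JSON.parse, and json.dump / json.dumps of JSON.stringify, so you can turn the list back into the same one-line text form when writing it out:
>>> json.dumps(data)
'[{"input": "data1"}, {"input": "data2"}, {"input": "data2"}]'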

Use recursive globbing to extract XML documents as strings in pyspark

The goal is to extract XML documents, given an XPath expression, from a group of text files as strings. The difficulty is the variety of forms the text files may take. They might be:
single zip / tar file with 100 files, each 1 XML document
one file, with 100 XML documents (aggregate document)
one zip / tar file, with varying levels of directories, with single XML records as files and aggregate XML files
I thought I had found a solution with Databricks' Spark-XML library, as it handles recursive globbing when reading files. It was amazing; I could do things like:
# read directory of loose files
df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='mods:mods').load('file:///tmp/combine/qs/mods/*.xml')
# recursively discover and parse
df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='mods:mods').load('file:///tmp/combine/qs/**/*.xml')
# even read archive files without additional work
df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='mods:mods').load('file:///tmp/combine/mods_archive.tar')
The problem is that this library is focused on parsing the XML records into DataFrame columns, whereas my goal is to retrieve just the XML documents as strings for storage.
My Scala is not strong enough to easily hack on the Spark-XML library so that it keeps the recursive globbing and XPath selection of documents but skips the parsing and instead saves each entire XML record as a string.
The library comes with the ability to serialize DataFrames to XML, but the serialization is decidedly different from the input (which is to be expected to some degree). For example, element text values become element attributes. Given the following original XML:
<mods:role>
<mods:roleTerm authority="marcrelator" type="text">creator</mods:roleTerm>
</mods:role>
reading then serializing with Spark-XML returns:
<mods:role>
<mods:roleTerm VALUE="creator" authority="marcrelator" type="text"></mods:roleTerm>
</mods:role>
However, even if I could get the VALUE to be serialized as an actual element value, I'm still not achieving my end goal of having these XML documents, discovered and read via Spark-XML's excellent globbing and XPath selection, available simply as strings.
Any insight would be appreciated.
Found a solution from this Databricks Spark-XML issue:
xml_rdd = sc.newAPIHadoopFile(
    'file:///tmp/mods/*.xml',
    'com.databricks.spark.xml.XmlInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={
        'xmlinput.start': '<mods:mods>',
        'xmlinput.end': '</mods:mods>',
        'xmlinput.encoding': 'utf-8'
    }
)
Expecting 250 records, and got 250 records. Simple RDD with entire XML record as a string:
In [8]: xml_rdd.first()
Out[8]:
(4994,
'<mods:mods xmlns:mets="http://www.loc.gov/METS/" xmlns:xl="http://www.w3.org/1999/xlink" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.openarchives.org/OAI/2.0/" version="3.0">\n\n\n <mods:titleInfo>\n\n\n <mods:title>Jessie</mods:title>\n\n\n...
...
...
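From there, keeping just the XML strings is a matter of dropping the byte-offset keys. A small sketch (the output path is an illustrative assumption, and since the records contain newlines a sink that keeps one record per element may be preferable to plain text files):
# Each element is (byte offset, full XML record as a string); keep only the strings.
xml_strings = xml_rdd.map(lambda kv: kv[1])
# Persist them for storage, e.g. one part-file per partition.
xml_strings.saveAsTextFile('file:///tmp/mods_strings')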
Credit to the Spark-XML maintainer(s) for a wonderful library, and attentiveness to issues.

How does Spark's sc.textFile work?

JavaRDD<String> input = sc.textFile("data.txt");
For the above sample code in Spark, I know it returns a distributed list of strings. But is an individual string in that list a line of data.txt, or its word tokens?
A string in your RDD corresponds to a line in data.txt.
If the data in your data.txt file is some type of CSV data, you can use the spark-csv package, which will split the data into columns for you, so you don't have to parse the lines yourself.
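A small PySpark sketch of the distinction (the file name comes from the question; everything else is illustrative and assumes an existing SparkContext sc):
lines = sc.textFile("data.txt")                         # one RDD element per line of data.txt
print(lines.first())                                    # the whole first line, not a single token
words = lines.flatMap(lambda line: line.split(" "))     # splitting into word tokens is an explicit step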
