Read CSV with linebreaks in pyspark
I want to use pyspark to read a "legal" CSV (it follows RFC 4180) that has line breaks (CRLF) inside some of the rows. The next code sample shows how it looks when opened in Notepad++:
I tried to read it with sqlCtx.read.load using format='com.databricks.spark.csv', and the resulting dataset shows two rows instead of one in these specific cases. I am using Spark version 2.1.0.2.
Is there any command or alternative way of reading the CSV that lets me read these two lines as a single row?
You can use "csv" instead of the Databricks CSV format; the latter now redirects to the default Spark reader. But that's only a hint :)
In Spark 2.2 a new option, wholeFile, was added. If you write this:
spark.read.option("wholeFile", "true").csv("file.csv")
it will read the whole file and handle multiline CSV.
There is no such option in Spark 2.1. You can read the file using sparkContext.wholeTextFiles or just use a newer version.
wholeFile does not exist (anymore?) in the spark api documentation:
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html
This solution will work:
spark.read.option("multiLine", "true").csv("file.csv")
From the api documentation:
multiLine – parse records, which may span multiple lines. If None is set, it uses the default value, false
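To see why a line-oriented reader reports two rows for one record, here is a minimal sketch that uses Python's standard csv module rather than Spark. The sample data is made up for illustration. RFC 4180 allows a CRLF inside a double-quoted field, so counting physical lines over-counts records, while a quote-aware parser does not.

```python
import csv
import io

# Hypothetical sample: the second record has a CRLF inside a quoted field.
raw = 'id,comment\r\n1,"first line\r\nsecond line"\r\n2,plain\r\n'

# Naive approach: treat every physical line as a record.
physical_lines = raw.splitlines()

# Quote-aware approach: newline="" lets the csv module see the raw line breaks.
records = list(csv.reader(io.StringIO(raw, newline="")))

print(len(physical_lines))  # physical lines, including the header
print(len(records))         # logical records, including the header
```

The quoted field comes back as a single value that still contains the line break, which is the behaviour the multiLine option asks Spark's CSV reader for.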
Related
I'm using a Jupyter Notebook to run pySpark code that imports a CSV file into Cassandra v3.11.3. I'm getting the error below.
... 1 more
The pySpark code is attached as a picture:
[image: pyspark_code]
Any input is appreciated.
Without the full trace it's hard to know exactly where this is failing. The method you pasted is just the py4j wrapper method, and we would really need to see the underlying Java exception.
From what I can tell, it looks like you are also attempting to use some options on the C* write that are unsupported. For example, "MODE" - "DROPMALFORMED" is not a valid C* connector option. DataFrame Writer and Reader options are source-specific, so unfortunately you cannot mix and match.
This makes me think that the data being written actually has a malformed date string or two, and the code is dying when it attempts to write the broken record. One way around this would be to do the date casting on CSV read, which I believe does support DROPMALFORMED-style parsing options.
I am trying to read a text file delimited by |. I am trying this
spark.read.format("com.databricks.spark.csv").option("header","true").option("delimiter", "|").option("inferSchema","true").csv("/tmp/file.txt").show()
I only see the header but no data.
When I try the same with textFile, I get data, but all in one column:
spark.read.format("com.databricks.spark.csv").option("header","true").option("delimiter", "|").option("inferSchema","true").textFile("/tmp/file.txt").show()
Is there a way to read data via csv? I am using spark 2.4.4
The cause of the issue was that the file is in UTF-16, so I had to convert it and run dos2unix on it. Thanks for your advice, and apologies, I really did not know that.
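For readers who want to do the same conversion without command-line tools, here is a minimal sketch in plain Python. The function name and file names are hypothetical, and it assumes the file is small enough to read into memory. It re-encodes UTF-16 as UTF-8 and replaces CRLF with LF, which is what the dos2unix step does.

```python
import os
import tempfile

def utf16_to_utf8_unix(src_path, dst_path):
    """Re-encode a UTF-16 text file as UTF-8 and turn CRLF line endings into LF."""
    # newline="" disables newline translation so CRLF can be replaced explicitly.
    with open(src_path, "r", encoding="utf-16", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text.replace("\r\n", "\n"))

# Small self-check with a made-up pipe-delimited file.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "file_utf16.txt")
dst = os.path.join(tmp_dir, "file_utf8.txt")
with open(src, "w", encoding="utf-16", newline="") as f:
    f.write("a|b\r\n1|2\r\n")

utf16_to_utf8_unix(src, dst)
with open(dst, "rb") as f:
    converted = f.read()
print(converted)
```

The converted file can then be passed to the csv reader with the "|" delimiter as in the question.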
For my research I have a dataset of about 20,000 gzipped multiline JSON files (~2TB, all with the same schema). I need to process and clean this data (I should say I'm very new to data analytics tools).
After spending a few days reading about Spark and Apache Beam, I'm convinced that the first step would be to convert this dataset to NDJSON. Most books and tutorials assume you are working with a newline-delimited file.
What is the best way to go about converting this data?
I've tried to just launch a large instance on gcloud and just use gunzip and jq to do this. Not surprisingly, it seems that this will take a long time.
Thanks in advance for any help!
Apache Beam supports decompressing files if you use TextIO.
But the delimiter is still the newline.
For multiline JSON you can read each complete file in parallel, then convert the JSON string to a POJO, and eventually reshuffle the data to make use of parallelism.
So the steps would be:
Get the file list > Read individual files > Parse file content to json objects > Reshuffle > ...
You can get the file list by FileSystems.match("gcs://my_bucker").metadata().
Read individual files with Compression.detect(fileResourceId.getFilename()).readDecompressed(FileSystems.open(fileResourceId))
Converting to NDJSON is not necessary if you use sc.wholeTextFiles. Point this method at a directory, and you'll get back an RDD[(String, String)] where ._1 is the filename and ._2 is the content of the file.
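Whichever framework reads the files, the per-file step is the same: parse the whole document, then write one compact JSON string per record. Here is a minimal sketch in plain Python. The function name is hypothetical, and it assumes each file holds either a single object or a top-level array of objects, which the question does not state.

```python
import gzip
import json

def json_text_to_ndjson(text):
    """Turn one multi-line JSON document into a list of single-line JSON strings."""
    doc = json.loads(text)
    # Assumption: each file holds either one object or a top-level array of objects.
    records = doc if isinstance(doc, list) else [doc]
    return [json.dumps(record, separators=(",", ":")) for record in records]

# Made-up input: a pretty-printed (multi-line) JSON array, gzipped in memory.
pretty = json.dumps([{"id": 1, "tags": ["a", "b"]}, {"id": 2}], indent=2)
compressed = gzip.compress(pretty.encode("utf-8"))

ndjson_lines = json_text_to_ndjson(gzip.decompress(compressed).decode("utf-8"))
print("\n".join(ndjson_lines))
```

In Spark, a function like this could be applied to the content element of each (filename, content) pair with flatMap; that wiring is not shown or tested here.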
I am trying to read a .dat file using the pyspark csv reader, and it contains the newline character ("\n") as part of the data. Spark does not read the value as a single column and instead treats the text after the newline as a new row.
I tried using the "multiLine" option while reading, but it's still not working.
spark.read.csv(file_path, schema=schema, sep=delimiter,multiLine=True)
Data is something like this. Here $ is CRLF for newline shown in vim.
name,test,12345,$
$
,desc$
name2,test2,12345,$
$
,desc2$
So pyspark is treating desc as the next record.
How can I read such data in pyspark?
I tried this in both Spark 2.2 and Spark 2.3.
I created my own Hadoop custom record reader and was able to read the file by invoking this API:
spark.sparkContext.newAPIHadoopFile(file_path,'com.test.multi.reader.CustomFileFormat','org.apache.hadoop.io.LongWritable','org.apache.hadoop.io.Text',conf=conf)
The custom record reader implements the logic for handling the newline characters it encounters.
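The answer does not show the reader's logic. One possible approach, sketched here in plain Python and not taken from the author's implementation, is to keep joining physical lines until a record contains the expected number of delimiters. The function name and the five-field layout are assumptions based on the sample data in the question. The sketch only works when the last field never contains a newline and no field contains the delimiter.

```python
def reassemble_records(lines, delimiter=",", fields_per_record=5):
    """Join physical lines until a record has the expected number of delimiters."""
    expected = fields_per_record - 1
    records, buffer = [], ""
    for line in lines:
        buffer += line
        if buffer.count(delimiter) >= expected:
            # Drop only the record's own trailing line break.
            records.append(buffer.rstrip("\r\n"))
            buffer = ""
    if buffer.strip():
        # Trailing partial record, kept as-is for inspection.
        records.append(buffer.rstrip("\r\n"))
    return records

# Sample shaped like the question's data, with CRLF written out explicitly.
sample = "name,test,12345,\r\n\r\n,desc\r\nname2,test2,12345,\r\n\r\n,desc2\r\n"
records = reassemble_records(sample.splitlines(keepends=True))
for record in records:
    print(record.split(","))
```

Each reassembled record splits into five fields, with the embedded line breaks preserved inside the fourth one.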
The file names don't end with .gz and I cannot change them back as they are shared with other programs.
file1.log.gz.processed is simply a csv file. But how do I read it in pyspark, preferably in pyspark.sql?
I tried to specify the format and compression but couldn't find the correct key/value. E.g.,
sqlContext.load(fn, format='gz')
didn't work. Although Spark could deal with gz files it seems to determine the codec from file names. E.g.,
sc.textFile(fn)
would work if the file ends with .gz but not in my case.
How do I instruct Spark to use the correct codec? Thank you!
You should not use .load that way, as it's deprecated (since version 1.4.0). You should use read.format(source).schema(schema).options(options).load().
df = (sql_context.read.format("com.databricks.spark.csv")
      .options(
          header=...,  # e.g., "true"
          inferSchema=...)
      .load(file_path + ".gz"))
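If the file cannot be renamed, one workaround outside Spark's extension-based codec detection is to decompress by content rather than by name. Here is a minimal sketch in plain Python that checks for the gzip magic bytes. The function name is hypothetical, and it reads the whole file into memory, so it only suits files that fit on one machine.

```python
import csv
import gzip
import io
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def read_csv_rows(path):
    """Read CSV rows from a file, decompressing it if it starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        data = f.read()
    if data[:2] == GZIP_MAGIC:
        data = gzip.decompress(data)
    return list(csv.reader(io.StringIO(data.decode("utf-8"), newline="")))

# Made-up gzipped CSV whose name does not end with .gz.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "file1.log.gz.processed")
with gzip.open(path, "wt", encoding="utf-8", newline="") as f:
    f.write("a,b\n1,2\n")

rows = read_csv_rows(path)
print(rows)
```

The resulting rows could then be turned into a DataFrame on the driver; for large inputs a cluster-side approach would be needed instead.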