difference between spark read textFile and csv - apache-spark

I am trying to read a text file delimited by |. I am trying this
spark.read.format("com.databricks.spark.csv").option("header","true").option("delimiter", "|").option("inferSchema","true").csv("/tmp/file.txt").show()
I am only reading/seeing only the header but no data.
When I try the same with textFile, I am getting data but all in one column
spark.read.format("com.databricks.spark.csv").option("header","true").option("delimiter", "|").option("inferSchema","true").textFile("/tmp/file.txt").show()
Is there a way to read data via csv? I am using spark 2.4.4

The reason for the issue was the file is in UTF16 so I had to convert it and do run dostounix on it. Thanks for your advice. Apologies I really did not know that

Related

Converting 2TB of gziped multiline JSONs to NDJSONs

For my research I have a dataset of about 20,000 gziped multiline json files (~2TB, all have the same schema). I need to process and clean this data (I should say I'm very new to data analytics tools).
After spending a few days reading about Spark and Apache Beam I'm convinced that the first step would be to first convert this dataset to NDJSONs. In most books and tutorials they always assume you are working with some new line delimited file.
What is the best way to go about converting this data?
I've tried to just launch a large instance on gcloud and just use gunzip and jq to do this. Not surprisingly, it seems that this will take a long time.
Thanks in advance for any help!
Apache Beam supports unzipping file if you use TextIO.
But the delimiter remains to be New Line.
For multiline json you can read complete file using in parallel and then convert the json string to pojo and eventually reshuffle the data to utilize parallelism.
So the steps would be
Get the file list > Read individual files > Parse file content to json objects > Reshuffle > ...
You can get the file list by FileSystems.match("gcs://my_bucker").metadata().
Read individual files by Compression Compression.detect((fileResouceId).getFilename()).readDecompressed(FileSystems.open(fileResouceId))
Converting to NDJSON is not necessary if you use sc.wholeTextFiles. Point this method at a directory, and you'll get back an RDD[(String, String)] where ._1 is the filename and ._2 is the content of the file.

Which file formats can I save a pyspark dataframe as?

I would like to save a huge pyspark dataframe as a Hive table. How can I do this efficiently? I am looking to use saveAsTable(name, format=None, mode=None, partitionBy=None, **options) from pyspark.sql.DataFrameWriter.saveAsTable.
# Let's say I have my dataframe, my_df
# Am I able to do the following?
my_df.saveAsTable('my_table')
My question is which formats are available for me to use and where can I find this information for myself? Is OrcSerDe an option? I am still learning about this. Thank you.
Following file formats are supported.
text
csv
ldap
json
parquet
orc
Referece: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
So I was able to write the pyspark dataframe to a compressed Hive table by using a pyspark.sql.DataFrameWriter. To do this I had to do something like the following:
my_df.write.orc('my_file_path')
That did the trick.
https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.write
I am using pyspark 1.6.0 btw

pyspark read csv file multiLine option not working for records which has newline spark2.3 and spark2.2

I am trying to read the dat file using pyspark csv reader and it contains newline character ("\n") as part of the data. Spark is unable to read this file as single column, rather treating it as new row.
I tried using the "multiLine" option while reading , but still its not working.
spark.read.csv(file_path, schema=schema, sep=delimiter,multiLine=True)
Data is something like this. Here $ is CRLF for newline shown in vim.
name,test,12345,$
$
,desc$
name2,test2,12345,$
$
,desc2$
So pyspark is treating desc as next record.
How to read such data in pyspark .
Tried this in both spark2.2 and spark2.3 versions.
I created my own hadoop Custom Record Reader and was able to read it by invoking the api .
spark.sparkContext.newAPIHadoopFile(file_path,'com.test.multi.reader.CustomFileFormat','org.apache.hadoop.io.LongWritable','org.apache.hadoop.io.Text',conf=conf)
And in the Custom Record Reader implemented the logic to handle the newline characters encountered .

Difficulty with encoding while reading data in Spark

In connection with my earlier question, when I give the command,
filePath = sc.textFile("/user/cloudera/input/Hin*/datafile.txt")
filePath.collect()
some part of the data has '\xa0' prefixed to every word, and other part of the data doesn't have that special character. I am attaching 2 pictures, one with '\xa0', and another without '\xa0'. The content shown in 2 pictures belong to same file. Only some part of the data from same file is read that way by Spark. I have checked the original data file present in HDFS, and it was problem free.
I feel that it has something to do with encoding. I tried all methods like using replaceoption in flatMap like flatMap(lambda line: line.replace(u'\xa0', ' ').split(" ")), flatMap(lambda line: line.replace(u'\xa0', u' ').split(" ")), but none worked for me. This question might sound dump, but I am newbie in using Apache Spark, and I require some assistance to overcome this problem.
Can anyone please help me? Thanks in advance.
Check the encoding of your file. When you use sc.textFile, spark expects an UTF-8 encoded file.
One of the solution is to acquire your file with sc.binaryFiles and then apply the expected encoding.
sc.binaryFile create a key/value rdd where key is the path to file and value is the content as a byte.
If you need to keep only the text and apply an decoding function, :
filePath = sc.binaryFile("/user/cloudera/input/Hin*/datafile.txt")
filePath.map(lambda x :x[1].decode('utf-8')) #or another encoding depending on your file

Read CSV with linebreaks in pyspark

Read CSV with linebreaks in pyspark
I want to read with pyspark a "legal" (it follows RFC4180) CSV that has breaklines (CRLF) in some of the rows. The next code sample shows how it does seem when opened it with Notepad++:
I try to read it with sqlCtx.read.load using format ='com.databricks.spark.csv. and the resulting dataset shows two rows instead of one in these specific cases. I am using Spark 2.1.0.2 version.
Is there any command or alternative way of reading the csv that allows me to read these two lines only as one?
You can use "csv" instead of Databricks CSV - the last one redirects now to default Spark reader. But, it's only a hint :)
In Spark 2.2 there was added new option - wholeFile. If you write this:
spark.read.option("wholeFile", "true").csv("file.csv")
it will read all file and handle multiline CSV.
There is no such option in Spark 2.1. You can read file using sparkContext.wholeTextFile or just use newer verison
wholeFile does not exist (anymore?) in the spark api documentation:
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html
This solution will work:
spark.read.option("multiLine", "true").csv("file.csv")
From the api documentation:
multiLine – parse records, which may span multiple lines. If None is set, it uses the default value, false

Resources