load data from csv with encoding utf-16le - apache-spark

I am using Spark version 3.1.2, and I need to load data from a CSV with encoding UTF-16LE.
df = (spark.read.format("csv")
      .option("delimiter", ",")
      .option("header", "true")
      .option("encoding", "utf-16le")
      .load(file_path))
df.show(4)
It seems Spark can only read the first line correctly; starting from the second row, I get either garbled characters or null values.
However, Python can read the data correctly with this code:
with open(file_path, encoding='utf-16le', mode='r') as f:
text = f.read()
print(text)
The printed result shows the data correctly.

Add these options while creating the Spark dataframe from the CSV file source:
.option('encoding', 'UTF-16')
.option('multiline', 'true')

The multiline option ignores the encoding option when using the DataFrameReader, so it is not possible to use both options at the same time.
You may be able to fix the multiline problems in your data first and then specify an encoding to read the characters correctly.
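One practical workaround, a minimal sketch assuming the file is small enough to re-encode on a single machine (utf8_path is a hypothetical path for the converted copy), is to convert the UTF-16LE file to UTF-8 first so that only the multiline option is needed on the Spark side:

# Re-encode the UTF-16LE source to UTF-8 so the multiline parser
# no longer conflicts with the encoding option.
with open(file_path, encoding="utf-16le", mode="r", newline="") as src, \
     open(utf8_path, mode="w", encoding="utf-8", newline="") as dst:
    dst.write(src.read())

df = (spark.read.format("csv")
      .option("delimiter", ",")
      .option("header", "true")
      .option("multiline", "true")
      .load(utf8_path))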

Related

Spark: Store data frame into CSV with unicode separator \u2592

Looking to store a Spark dataframe into CSV, but the columns need to be separated with the Unicode character \u2592.
Considering my dataframe name is myDf:
myDf.write.option("header", true)
  .option("encoding", "......")
  .option("delimiter", ".....")
  .csv(s"$path")
The data should look like:
my_cd▒my_cd▒flag_cd
00000051▒R▒Y
00000051▒R▒Y
0000007a▒D▒Y
Finally I found the solution: I was trying delimiter, but you can pass any desired Unicode separator with option("sep", "\u2592"). It worked for me.
myDf.write.option("header", true)
  .option("sep", "\u2592")
  .csv(s"$path")

spark data read with quoted string

I have a CSV data file as given below.
Each line is terminated by a carriage return ('\r'), but certain text values are multiline fields that use a line feed ('\n') as the line delimiter. How can I use the Spark data source API options to handle this?
Spark 2.2.0 added support for parsing multi-line CSV files. You can use the following to read a CSV with multi-line values:
val df = spark.read
  .option("sep", ",")
  .option("quote", "")
  .option("multiLine", "true")
  .option("inferSchema", "true")
  .csv(file_name)
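A PySpark version of the same read, a sketch that keeps the default double-quote character so that line feeds inside quoted fields are preserved (file_name as above):

# multiLine lets the parser keep '\n' characters that appear inside quoted
# fields instead of treating them as record terminators.
df = (spark.read
      .option("sep", ",")
      .option("quote", '"')
      .option("multiLine", "true")
      .option("inferSchema", "true")
      .csv(file_name))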

Pyspark: How to convert a spark dataframe to json and save it as json file?

I am trying to convert my PySpark SQL dataframe to JSON and then save it as a file.
df_final = df_final.union(join_df)
df_final contains the combined values.
I tried something like this, but it created an invalid JSON:
df_final.coalesce(1).write.format('json').save(data_output_file+"createjson.json", overwrite=True)
{"Variable":"Col1","Min":"20","Max":"30"}
{"Variable":"Col2","Min":"25,"Max":"40"}
My expected file should have data as below:
[
{"Variable":"Col1",
"Min":"20",
"Max":"30"},
{"Variable":"Col2",
"Min":"25,
"Max":"40"}]
For PySpark you can directly store your dataframe as a JSON file; there is no need to convert the dataframe to JSON first.
df_final.coalesce(1).write.format('json').save('/path/file_name.json')
If you still want to convert your dataframe to JSON, you can use df_final.toJSON().
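For reference, df_final.toJSON() returns an RDD of JSON strings, one object per row; a quick sketch of inspecting it:

json_rdd = df_final.toJSON()   # RDD of JSON strings, one per row
print(json_rdd.take(2))        # e.g. ['{"Variable":"Col1","Min":"20","Max":"30"}', ...]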
A solution can be to use collect and then json.dump:
import json

# Row objects are not JSON serializable directly, so convert them to dicts first
collected = [row.asDict() for row in df_final.collect()]
with open(data_output_file + 'createjson.json', 'w') as outfile:
    json.dump(collected, outfile)
Here is how you can do the equivalent of json.dump for a dataframe with PySpark 1.3+.
import json

df_list_of_jsons = df.toJSON().collect()
df_list_of_dicts = [json.loads(x) for x in df_list_of_jsons]
df_json = json.dumps(df_list_of_dicts)
sc.parallelize([df_json]).repartition(1).cache().saveAsTextFile("<HDFS_PATH>")
Note this will result in the whole dataframe being loaded into the driver memory, so this is only recommended for small dataframes.
If you want to use Spark to process the result as JSON files, I think your output in HDFS is already correct.
I assume the issue you ran into is that you cannot read that output from a normal Python script by using:
import json

with open('data.json') as f:
    data = json.load(f)
You should try to read data line by line:
data = []
with open("data.json", 'r') as datafile:
    for line in datafile:
        data.append(json.loads(line))
and you can use pandas to create a dataframe:
import pandas as pd

df = pd.DataFrame(data)
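Alternatively, pandas can parse the JSON Lines output directly; a small sketch:

import pandas as pd

# read_json with lines=True parses one JSON object per line, which matches
# the part files Spark's json writer produces.
df = pd.read_json("data.json", lines=True)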

How to save dataframe as text file GZ format in pyspark? (but not in csv format) [duplicate]

I use Spark 1.6.0 and Scala.
I want to save a DataFrame in compressed CSV format.
Here is what I have so far (assume I already have df and sc as SparkContext):
//set the conf to the codec I want
sc.getConf.set("spark.hadoop.mapred.output.compress", "true")
sc.getConf.set("spark.hadoop.mapred.output.compression.codec", "true")
sc.getConf.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec")
sc.getConf.set("spark.hadoop.mapred.output.compression.type", "BLOCK")
df.write
.format("com.databricks.spark.csv")
.save(my_directory)
The output is not in gz format.
This code works for Spark 2.1, where .codec is not available.
df.write
.format("com.databricks.spark.csv")
.option("codec", "org.apache.hadoop.io.compress.GzipCodec")
.save(my_directory)
For Spark 2.2, you can use the df.write.csv(..., compression="gzip") option described here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=codec
With Spark 2.0+, this has become a bit simpler:
df.write.csv("path", compression="gzip") # Python-only
df.write.option("compression", "gzip").csv("path") // Scala or Python
You don't need the external Databricks CSV package anymore.
The csv() writer supports a number of handy options. For example:
sep: To set the separator character.
quote: Whether and how to quote values.
header: Whether to include a header line.
There are also a number of other compression codecs you can use, in addition to gzip:
bzip2
lz4
snappy
deflate
The full Spark docs for the csv() writer are here: Python / Scala
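For example, a small sketch combining a few of the options above into one write (Spark 2.0+; the output path and separator are illustrative):

# Write a gzip-compressed CSV with a header and a semicolon separator.
(df.write
   .option("sep", ";")
   .option("header", "true")
   .option("compression", "gzip")
   .csv("output/path"))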
Spark 2.2+
df.write.option("compression","gzip").csv("path")
Spark 2.0
df.write.csv("path", compression="gzip")
Spark 1.6
On the spark-csv github:
https://github.com/databricks/spark-csv
One can read:
codec: compression codec to use when saving to file. Should be the fully qualified name of a class implementing org.apache.hadoop.io.compress.CompressionCodec or one of case-insensitive shorten names (bzip2, gzip, lz4, and snappy). Defaults to no compression when a codec is not specified.
In this case, this works:
df.write.format("com.databricks.spark.csv") \
    .option("codec", "gzip") \
    .save('my_directory/my_file.gzip')
To write the CSV file with headers and rename the part-000 file to .csv.gzip:
DF.coalesce(1).write.format("com.databricks.spark.csv").mode("overwrite")
.option("header","true")
.option("codec",org.apache.hadoop.io.compress.GzipCodec").save(tempLocationFileName)
copyRename(tempLocationFileName, finalLocationFileName)
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

def copyRename(srcPath: String, dstPath: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  // the "true" argument deletes the source files once they are merged into the new output
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
}
If you don't need the header then set it to false, and you won't need to do the coalesce either. It will be faster to write too.
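A sketch of the same write without the header and without the single-file merge, in PySpark form and still using the spark-csv package as above (tempLocationFileName as in the snippet above):

# With no header required there is no need to coalesce to one partition,
# so each partition is written and gzip-compressed in parallel.
(df.write
   .format("com.databricks.spark.csv")
   .mode("overwrite")
   .option("header", "false")
   .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
   .save(tempLocationFileName))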

How to read the csv and convert to RDD in sparkR

As an R programmer I want to use R as an interface to Spark, so I installed the SparkR package in R.
I'm new to SparkR. I want to perform some operations on particular data in a CSV record. I'm trying to read a CSV file and convert it to an RDD.
This is the code I used:
sc <- sparkR.init(master="local") # created spark context
data <- read.csv(sc, "/home/data1.csv")
# It throws an error telling me to use read.table
Data I have to load and convert: http://i.stack.imgur.com/sj78x.png
If I am doing this wrong, how do I read this CSV data and convert it to an RDD in SparkR?
TIA
I believe that the problem is the header line; if you remove this line, it should work.
How do I convert csv file to rdd
--edited--
With this code you can test SparkR with CSVs, but you need to remove the header line in your CSV file.
lines <- textFile(sc, "/home/data1.csv")
csvElements <- lapply(lines, function(line) {
  # line represents each CSV line, i.e. strsplit(line, ",") is useful here
})
In the recent SparkR version (2.0+)
read.df(path, source = "csv")
In Spark 1.x
read.df(sc, path, source = "com.databricks.spark.csv")
with
spark.jars.packages com.databricks:spark-csv_2.10:1.4.0
The code below will let you read a CSV with a header. All the best.
val csvrdd = spark.read.option("header", "true").csv(filename)
