How to read the csv and convert to RDD in sparkR - apache-spark

As i am a R programmer i want to use R as a interface to spark, with the sparkR package i installed sparkR in R.
I'm new to sparkR. I want to perform some operations on particular data in a CSV record. I'm trying to read a csv file and convert it to rdd.
This is the code i did:
sc <- sparkR.init(master="local") # created spark content
data <- read.csv(sc, "/home/data1.csv")
#It throws an error, to use read.table
Data i have to load and convert - http://i.stack.imgur.com/sj78x.png
if am wrong, how to read this data in csv and convert to RDD in sparkR
TIA

I believe that the problem is the header line, if you remove this line, it should work.
How do I convert csv file to rdd
--edited--
With this code you can test Sparkr with CSVs, but you need to remove the header line in your CSV file.
lines <- textFile(sc, "/home/data1.csv")
csvElements <- lapply(lines, function(line) {
#line represent each CSV line i. e. strsplit(line, ",") is useful
})

In the recent SparkR version (2.0+)
read.df(path, source = "csv")
In Spark 1.x
read.df(sc, path, source = "com.databricks.spark.csv")
with
spark.jars.packages com.databricks:spark-csv_2.10:1.4.0

This below code will let you read a csv with header . All the best
val csvrdd = spark.read.options(“header”,”true”).csv(filename)

Related

load data from csv with encoding utf-16le

I am using spark version 3.1.2, and I need to load data from a csv with encoding utf-16le.
df = spark.read.format("csv")
.option("delimiter", ",")
.option("header", true)
.option("encoding", "utf-16le")
.load(file_path)
df.show(4)
It seems spark can only read the first line normally:
Starting from the second row, either garbled characters or null values
however, python can read the data correct with code:
with open(file_path, encoding='utf-16le', mode='r') as f:
text = f.read()
print(text)
print result like:
python read correct
Add these options while creating Spark dataframe from CSV file source -
.option('encoding', 'UTF-16')
.option('multiline', 'true')
the multiline option ignores the encoding option when using the DataFrameReader.
It is not possible to use both options at the same time.
Maybe you can process the multiline problems in your data and later specify an encoding to read good characters.

Pyspark dataframe: load from csv and then remove the first line

I am able to load csv file from Azure datalake into pyspark dataframe.
How to remove the first line and make the second line as my header?
I have seen some RDD solution. But I am not able to load the file and I get error using the following code as "RDD is empty"
items = sc.textFile(f"abfss://{container}#{storage_account_name}.dfs.core.windows.net/tmp/items.csv")
firstRow=data.first()
Hence I prefer to load using standard spark as below. I could display the dataframe contents. I have to drop or remove the first line and make 2nrd row as header. Thanks.
items= spark.read.format("csv").load(f"abfss://{container}#{storage_account_name}.dfs.core.windows.net/tmp/items.csv", header=True)
Try this:
it's not an optimized solution but will solve the requirement.
df = spark.createDataFrame([(1,2,3),(4,5,6),(7,8,9)],['a','b','c'])
df.show()
df1 = df.rdd.zipWithIndex().toDF().where(F.col('_2') > 0).drop('_2')
for each_col in df.columns:
df1 = df1.withColumn(each_col, F.col('_1.'+each_col))
df1.drop('_1').show()

RDD String to Spark csv Reader

I want to read the RDD[String] using the spark CSV reader. The reason I am doing this is, I need to filter some records before using the CSV reader.
val fileRDD: RDD[String] = spark.sparkContext.textFile("file")
I need to read the fileRDD using the spark CSV reader. I wish not to commit the file as it increases the IO of the HDFS. I have looked into the options we have in the spark CSV, but didn't found any.
spark.read.csv(file)
Sample Data
PHM|MERC|PHARMA|BLUEDRUG|50
CLM|BSH|CLAIM|VISIT|HSA|EMPLOYER|PAID|250
PHM|GSK|PHARMA|PARAC|70
CLM|UHC|CLAIM|VISIT|HSA|PERSONAL|PAID|72
As you can see all the records starts with PHM has different number of columns and clm has different number of columns. That is the reason i am filtering and then applying schema. PHM and CLM records has different schema.
val fileRDD: RDD[String] = spark.sparkContext.textFile("file").filter(_.startWith("PHM"))
spark.read.option(schema,"phcschema").csv(fileRDD.toDS())
Since Spark 2.2, method ".csv" can read dataset of strings. Can be implemented in this way:
val rdd: RDD[String] = spark.sparkContext.textFile("csv.txt")
// ... do filtering
spark.read.csv(rdd.toDS())

Pyspark: How to convert a spark dataframe to json and save it as json file?

I am trying to convert my pyspark sql dataframe to json and then save as a file.
df_final = df_final.union(join_df)
df_final contains the value as such:
I tried something like this. But it created a invalid json.
df_final.coalesce(1).write.format('json').save(data_output_file+"createjson.json", overwrite=True)
{"Variable":"Col1","Min":"20","Max":"30"}
{"Variable":"Col2","Min":"25,"Max":"40"}
My expected file should have data as below:
[
{"Variable":"Col1",
"Min":"20",
"Max":"30"},
{"Variable":"Col2",
"Min":"25,
"Max":"40"}]
For pyspark you can directly store your dataframe into json file, there is no need to convert the datafram into json.
df_final.coalesce(1).write.format('json').save('/path/file_name.json')
and still you want to convert your datafram into json then you can used
df_final.toJSON().
A solution can be using collect and then using json.dump:
import json
collected_df = df_final.collect()
with open(data_output_file + 'createjson.json', 'w') as outfile:
json.dump(data, outfile)
Here is how you can do the equivalent of json.dump for a dataframe with PySpark 1.3+.
df_list_of_jsons = df.toJSON().collect()
df_list_of_dicts = [json.loads(x) for x in df_list_of_jsons]
df_json = json.dumps(df_list_of_dicts)
sc.parallelize([df_json]).repartition(1).cache().saveAsTextFile("<HDFS_PATH>")
Note this will result in the whole dataframe being loaded into the driver memory, so this is only recommended for small dataframe.
If you want to use spark to process result as json files, I think that your output schema is right in hdfs.
And I assumed you encountered the issue that you can not smoothly read data from normal python script by using :
with open('data.json') as f:
data = json.load(f)
You should try to read data line by line:
data = []
with open("data.json",'r') as datafile:
for line in datafile:
data.append(json.loads(line))
and you can use pandas to create dataframe :
df = pd.DataFrame(data)

Generate single json file for pyspark RDD

I am building a Python script in which I need to generate a json file from json RDD .
Following is code snippet for saving json file.
jsonRDD.map(lambda x :json.loads(x))
.coalesce(1, shuffle=True).saveAsTextFile('examples/src/main/resources/demo.json')
But I need to write json data to a single file instead of data distributed across several partitions.
So please suggest me appropriate solution for it
Without the use of additional libraries like pandas, you could save your RDD of several jsons by reducing them to one big string of jsons, each separated by a new line:
# perform your operation
# note that you do not need a lambda expression for json.loads
jsonRDD = jsonRDD.map(json.loads).coalesce(1, shuffle=True)
# map jsons back to string
jsonRDD = jsonRDD.map(json.dumps)
# reduce to one big string with one json on each line
json_string = jsonRDD.reduce(lambda x, y: x + "\n" + y)
# write your string to a file
with open("path/to/your.json", "w") as f:
f.write(json_string.encode("utf-8"))
I have had issues with pyspark saving off JSON files once I have them in a RDD or dataframe, so what I do is convert them to a pandas dataframe and save them to a non distributed directory.
import pandas
df1 = sqlContext.createDataFrame(yourRDD)
df2 = df1.toPandas()
df2.to_json(yourpath)

Resources