Generate a single JSON file from a PySpark RDD

I am building a Python script in which I need to generate a JSON file from a JSON RDD.
Here is the code snippet I use for saving the JSON file:
jsonRDD.map(lambda x: json.loads(x)) \
    .coalesce(1, shuffle=True) \
    .saveAsTextFile('examples/src/main/resources/demo.json')
But I need to write the JSON data to a single file instead of having it distributed across several partitions.
Please suggest an appropriate solution.

Without additional libraries like pandas, you could save your RDD of JSON objects by reducing it to one big string, with one JSON document per line:
import json

# perform your operation
# note that you do not need a lambda expression for json.loads
jsonRDD = jsonRDD.map(json.loads).coalesce(1, shuffle=True)
# map the parsed objects back to JSON strings
jsonRDD = jsonRDD.map(json.dumps)
# reduce to one big string with one JSON document on each line
json_string = jsonRDD.reduce(lambda x, y: x + "\n" + y)
# write the string to a single local file on the driver
with open("path/to/your.json", "w") as f:
    f.write(json_string)

I have had issues with pyspark saving JSON files once I have them in an RDD or dataframe, so what I do is convert them to a pandas dataframe and save them to a non-distributed location.
import pandas

# convert the RDD to a Spark DataFrame, then collect it into a pandas DataFrame
df1 = sqlContext.createDataFrame(yourRDD)
df2 = df1.toPandas()
# write a single JSON file on the driver's local filesystem
df2.to_json(yourpath)
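If you need the output in the same line-delimited format Spark's JSON writer produces, pandas can also write one JSON object per line. A minimal sketch (the output filename is illustrative, not from the original answer):
# hypothetical example: write newline-delimited JSON, one record per line
df2.to_json("demo.json", orient="records", lines=True)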

Related

Concatenate Excel Files using Dask

I have 20 Excel files and need to concatenate them using Dask (I have already done it using pandas, but the number of files will grow in the future). I used the following solution found here: Reading multiple Excel files with Dask
But it throws an error: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid
I am assuming it does not create a DataFrame, so I tried the following code:
df = pd.DataFrame()
files = glob.glob(r"D:\XX\XX\XX\XX\XXX\*.xlsx")
# note we are wrapping in delayed only the function, not the arguments
delayeds = [dask.delayed(pd.read_excel)(i, skiprows=0) for i in files]
# the line below launches actual computations
results = dask.compute(delayeds)
# after computation is over the results object will
# contain a list of pandas dataframes
df = pd.concat(results, ignore_index=True)
The original solution did not include df=pd.DataFrame(). Where is the mistake?
Thank you!
Using the following solution: Build a dask dataframe from a list of dask delayed objects
I realized that the last line was using pandas rather than Dask, so I built a Dask DataFrame from the delayed objects instead of concatenating with pandas.
Here is the code:
import glob

import dask
import dask.dataframe as dd
import pandas as pd

files = glob.glob(r"D:\XX\XX\XX\XX\XXX\*.xlsx")
# note we are wrapping in delayed only the function, not the arguments
delayeds = [dask.delayed(pd.read_excel)(i, skiprows=0) for i in files]
# build a single Dask DataFrame from the delayed objects (here instead of pd.concat)
dask_df = dd.from_delayed(delayeds)
# compute() collects everything into one pandas DataFrame before writing the CSV.
# Please be aware of the dtypes in your Excel files.
dask_df.compute().to_csv(r"D:\XX\XX\XX\XX\XXX\*.csv")
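As a possible simplification (not part of the original answer): reasonably recent Dask versions can write a single CSV directly from the Dask DataFrame, so you don't have to call compute() yourself. The output filename below is illustrative.
# hypothetical alternative: let Dask write one CSV file directly
dd.from_delayed(delayeds).to_csv(r"D:\XX\XX\XX\XX\XXX\combined.csv", single_file=True)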

Pyspark: How to convert a Spark dataframe to JSON and save it as a JSON file?

I am trying to convert my PySpark SQL dataframe to JSON and then save it as a file.
df_final = df_final.union(join_df)
I tried something like this, but it created invalid JSON:
df_final.coalesce(1).write.format('json').save(data_output_file+"createjson.json", overwrite=True)
{"Variable":"Col1","Min":"20","Max":"30"}
{"Variable":"Col2","Min":"25,"Max":"40"}
My expected file should have data as below:
[
  {"Variable":"Col1", "Min":"20", "Max":"30"},
  {"Variable":"Col2", "Min":"25", "Max":"40"}
]
With PySpark you can store your dataframe as JSON directly; there is no need to convert the dataframe to JSON first.
df_final.coalesce(1).write.format('json').save('/path/file_name.json')
If you still want to convert your dataframe to JSON, you can use
df_final.toJSON()
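For example, a minimal sketch (the local output path is illustrative) that collects the toJSON() output to the driver and writes it as one line-delimited file:
# hypothetical example: collect the JSON strings and write them to one local file
json_lines = df_final.toJSON().collect()
with open("/tmp/df_final.json", "w") as f:  # illustrative path
    f.write("\n".join(json_lines))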
A solution can be to use collect and then json.dump:
import json

# collect the rows to the driver and convert them to plain dictionaries
data = [row.asDict() for row in df_final.collect()]
with open(data_output_file + 'createjson.json', 'w') as outfile:
    json.dump(data, outfile)
Here is how you can do the equivalent of json.dump for a dataframe with PySpark 1.3+.
import json

df_list_of_jsons = df.toJSON().collect()
df_list_of_dicts = [json.loads(x) for x in df_list_of_jsons]
df_json = json.dumps(df_list_of_dicts)
sc.parallelize([df_json]).repartition(1).cache().saveAsTextFile("<HDFS_PATH>")
Note this will result in the whole dataframe being loaded into driver memory, so it is only recommended for small dataframes.
If you want to use Spark to process the result as JSON files, then the output layout you already have in HDFS is fine.
I assume the issue you ran into is that you cannot read that data from a normal Python script with:
with open('data.json') as f:
    data = json.load(f)
You should read the data line by line instead:
import json

data = []
with open("data.json", 'r') as datafile:
    for line in datafile:
        data.append(json.loads(line))
and you can then use pandas to create a dataframe:
import pandas as pd

df = pd.DataFrame(data)

Using pyspark, how do I read multiple JSON documents on a single line in a file into a dataframe?

Using Spark 2.3, I know I can read a file of JSON documents like this:
{'key': 'val1'}
{'key': 'val2'}
With this:
spark.read.json('filename')
How can I read the following in to a dataframe when there aren't newlines between JSON documents?
The following would be an example input.
{'key': 'val1'}{'key': 'val2'}
To be clear, I expect a dataframe with two rows (frame.count() == 2).
Please try:
df = spark.read.json(["fileName1", "fileName2"])
You can also read all the JSON files in a folder:
df = spark.read.json("data/*.json")
As @cricket_007 suggested above, you'd be better off fixing the input file.
If you're sure you have no inline close braces within json objects, you could do the following:
with open('myfilename', 'r') as f:
    txt = f.read()
txt = txt.replace('}', '}\n')
with open('mynewfilename', 'w') as f:
    f.write(txt)
If you do have '}' within keys or values, the task becomes harder but not impossible with regex. It seems unlikely though.
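If you do want to try the regex route, a minimal sketch (assuming '}{' only ever occurs between two top-level documents, which is my own assumption rather than part of the answer) would be to split only at those boundaries instead of after every '}':
import re

with open('myfilename', 'r') as f:
    txt = f.read()
# insert a newline only between back-to-back top-level objects ("}{")
txt = re.sub(r'\}\s*\{', '}\n{', txt)
with open('mynewfilename', 'w') as f:
    f.write(txt)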
We solved this using the RDD API, as we couldn't find a way to use the DataFrame API in a memory-efficient manner (we kept hitting executor OOM errors).
The following function will incrementally parse the JSON and yield the successive documents from your file (from this post):
from functools import partial
from json import JSONDecoder
from io import StringIO
def generate_from_buffer(buffer: str, chunk: str, decoder: JSONDecoder):
    buffer += chunk
    while buffer:
        try:
            result, index = decoder.raw_decode(buffer)
            yield result
            buffer = buffer[index:].lstrip()
        except ValueError:
            # Not enough data to decode, read more
            break
    return buffer


def parse_jsons_file(jsons_str: str, buffer_size: int = 1024):
    decoder = JSONDecoder()
    buffer = ''
    file_obj = StringIO(jsons_str)
    for chunk in iter(partial(file_obj.read, buffer_size), ''):
        buffer = yield from generate_from_buffer(buffer, chunk, decoder)
    if buffer:
        raise ValueError("Invalid input: should be concatenation of json strings")
We first read the JSON with .format("text"):
df: DataFrame = (
    spark
    .read
    .format("text")
    .option("wholetext", True)
    .load(data_source_path)
)
Then convert it to an RDD, flatMap using the function from above, and finally convert it back to a Spark dataframe. For this you have to define the json_schema for the individual JSON documents in your file, which is good practice anyway.
# get the RDD of raw text from the single-row dataframe
df_rdd = df.rdd
rdd_df = (df_rdd.map(lambda row: row["value"])
          .flatMap(lambda jsons_string: parse_jsons_file(jsons_string))
          .toDF(json_schema))
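For reference, json_schema itself is not shown in the answer; a minimal sketch of what it could look like, assuming documents shaped like {'key': 'val1'}, would be:
from pyspark.sql.types import StructType, StructField, StringType

# hypothetical schema matching documents such as {'key': 'val1'}
json_schema = StructType([
    StructField("key", StringType(), True),
])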

Pyspark Pair RDD from Text File

I have a local text file kv_pair.log formatted so that key-value pairs are comma delimited and each record is on its own line:
"A"="foo","B"="bar","C"="baz"
"A"="oof","B"="rab","C"="zab"
"A"="aaa","B"="bbb","C"="zzz"
I am trying to read this to a Pair RDD using pySpark as follows:
from pyspark import SparkContext
sc = SparkContext()
# Read raw text to RDD
lines=sc.textFile('kv_pair.log')
# How to turn this into a Pair RDD?
pairs=lines.map(lambda x: (x.replace('"', '').split(",")))
print type(pairs)
print pairs.take(2)
I feel I am close! The output of the above is:
[[u'A=foo', u'B=bar', u'C=baz'], [u'A=oof', u'B=rab', u'C=zab']]
So it looks like pairs is a list of records, which contains a list of the kv pairs as strings.
How can I use PySpark to transform this into a Pair RDD so that the keys and values are properly separated?
The ultimate goal is to transform this Pair RDD into a DataFrame to perform SQL operations - but one step at a time; please help with transforming this into a Pair RDD first.
You can use flatMap with a custom function, since a lambda can't contain multiple statements:
def tranfrm(x):
    lst = x.replace('"', '').split(",")
    return [(kv.split("=")[0], kv.split("=")[1]) for kv in lst]

pairs = lines.flatMap(tranfrm)
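As a quick sanity check (the expected output here is my own illustration based on the sample lines above):
# each element is now a (key, value) tuple
print(pairs.take(3))
# expected: [('A', 'foo'), ('B', 'bar'), ('C', 'baz')]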
This is really bad practice for a parser, but I believe your example could be done with something like this:
from pyspark import SparkContext
from pyspark.sql import Row
sc = SparkContext()
# Read raw text to RDD
lines = sc.textFile('kv_pair.log')
# Split each record, then build a Row per line
pairs = lines.map(lambda x: x.replace('"', '').split(",")) \
    .map(lambda r: Row(A=r[0].split('=')[1], B=r[1].split('=')[1], C=r[2].split('=')[1]))
print type(pairs)
print pairs.take(2)
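Since the stated end goal is a DataFrame for SQL operations, a minimal sketch of that last step (assuming an active SparkSession named spark, which is not part of the original answer) could be:
# hypothetical follow-up: turn the RDD of Rows into a DataFrame for SQL queries
df = spark.createDataFrame(pairs)
df.createOrReplaceTempView("kv_pairs")
spark.sql("SELECT A, B, C FROM kv_pairs").show()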

How to read the csv and convert to RDD in sparkR

As I am an R programmer, I want to use R as an interface to Spark, so I installed the SparkR package in R.
I'm new to SparkR. I want to perform some operations on particular data in a CSV record. I'm trying to read a CSV file and convert it to an RDD.
This is the code I tried:
sc <- sparkR.init(master="local") # created spark content
data <- read.csv(sc, "/home/data1.csv")
#It throws an error, to use read.table
The data I have to load and convert: http://i.stack.imgur.com/sj78x.png
If I am doing this wrong, how do I read this CSV data and convert it to an RDD in SparkR?
TIA
I believe that the problem is the header line; if you remove it, it should work.
See: How do I convert csv file to rdd
--edited--
With this code you can test SparkR with CSVs, but you need to remove the header line from your CSV file first.
lines <- textFile(sc, "/home/data1.csv")
csvElements <- lapply(lines, function(line) {
  # line represents one CSV line, i.e. strsplit(line, ",") is useful here
})
In recent SparkR versions (2.0+):
read.df(path, source = "csv")
In Spark 1.x
read.df(sc, path, source = "com.databricks.spark.csv")
with
spark.jars.packages com.databricks:spark-csv_2.10:1.4.0
The code below will let you read a CSV with a header. All the best.
val csvrdd = spark.read.option("header", "true").csv(filename)
