How to convert a .csv file to a .json file using PySpark? - python-3.x

I am having a problem converting a .csv file to a multiline JSON file using PySpark.
I read the CSV file via Spark and need to convert it to multiline JSON.
Here is my code:
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jsonconversion").getOrCreate()
df = spark.read.format("csv").option("header", "True").load(csv_file)
df.show()
df_json = df.toJSON()

for row in df_json.collect():
    line = json.loads(row)
    result = []
    for key, value in list(line.items()):
        if key == 'FieldName':
            FieldName = line['FieldName']
            del line['FieldName']
            result.append({FieldName: line})
    res = result
    with open("D:/tasklist/jsaonoutput.json", 'a+') as f:
        f.write(json.dumps(res, indent=4, separators=(',', ':')))
I need the output in the below format:
{
    "Name": {
        "DataType": "String",
        "Length": 4,
        "Required": "Y",
        "Output": "Y",
        "Address": "N",
        "Phone Number": "N",
        "DoorNumber": "N/A",
        "Street": "N",
        "Locality": "N/A",
        "State": "N/A"
    }
}
My input CSV file looks like this:
I am new to PySpark. Any leads on modifying this code into working code would be much appreciated.
Thank you in advance.

Try the following code. It first creates a pandas DataFrame from the Spark DataFrame (unless you are doing something else with the Spark DataFrame, you can load the CSV file directly into pandas). From the pandas DataFrame, it creates groups based on the FieldName column and then writes them to a file, where json.dumps takes care of the formatting.
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jsonconversion").getOrCreate()
df = spark.read.format("csv").option("header", "True").load(csv_file)
df.show()

df_pandas_grped = df.toPandas().groupby('FieldName')
final_dict = {}
for key, grp in df_pandas_grped:
    final_dict[str(key)] = grp.to_dict('records')

with open("D:/tasklist/jsaonoutput.json", 'w') as f:
    f.write(json.dumps(final_dict, indent=4))
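Note that groupby(...).to_dict('records') nests a list of row dicts under each FieldName, while the desired output nests a single object per name. A minimal variation of the above (my sketch, assuming every FieldName value occurs exactly once in the CSV) that matches the target shape more closely:

import json

# Sketch (assumption: each FieldName is unique in the CSV):
# index by FieldName so each key maps to a single dict of the remaining columns.
pdf = df.toPandas().set_index('FieldName')
final_dict = {str(k): row.to_dict() for k, row in pdf.iterrows()}

with open("D:/tasklist/jsaonoutput.json", 'w') as f:
    f.write(json.dumps(final_dict, indent=4))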

Related

How to copy data from a variable list of parquet files using pyspark

I have saved the list of parquet files (to be read) in a variable, say listOffilteredFiles.
Now I want to read all the files from this list and write all the data into a single parquet file in another path. How can I do this? I have written the code below and I'm stuck. Any help would be appreciated.
import os
import time
import datetime
from datetime import datetime
import pandas as pd
import glob
import pyspark
from pyspark.sql import SQLContext

dirName = 'dbfs:/mnt/abc/def/efg'
now = datetime.utcnow()

# Get the list of all files in the directory tree at the given path
listOfFiles = list()
listOffilteredFiles = list()
for (dirpath, dirnames, filenames) in os.walk(dirName):
    listOfFiles += [os.path.join(dirpath, file) for file in filenames]
listOffilteredFiles = filter(lambda x: datetime.utcfromtimestamp(os.path.getmtime(x)) < now, listOfFiles)
Let's assume that the files you're trying to read are parquet files.
You can read all the parquet files from a directory using * syntax.
Suppose you have a directory like this:
/abc/def/[file1.parquet, file2.parquet, file3.parquet]
/abc/ghi/[file1.parquet, file2.parquet, file3.parquet]
and you want to read all the parquet files under the /abc directory. The Spark read statement would be:
df = spark.read.parquet('/abc/*/*')
In another scenario, when you need to read some files after filtering them, you can do:
listOffilteredFiles = list(filter(lambda x: datetime.utcfromtimestamp(os.path.getmtime(x)) < now, listOfFiles))
df = spark.read.parquet(*listOffilteredFiles)
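The question also asks to write everything into a single parquet file in another path; a minimal sketch of that last step (output_path here is a placeholder, and coalesce(1) produces one part file inside that directory):

# Sketch: read the filtered files and write them out as a single parquet file.
# output_path is a hypothetical destination directory.
df = spark.read.parquet(*listOffilteredFiles)
df.coalesce(1).write.mode("overwrite").parquet(output_path)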

Pass schema from hdfs file while creating Spark DataFrame

I am trying to read a schema stored in a text file in HDFS and use it while creating a DataFrame.
schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", TimestampType(), True),
    StructField("col4",
        StructType([
            StructField("col5", StringType(), True),
            StructField("col6",
            .... and so on

jsonDF = spark.read.schema(schema).json('/path/test.json')
Since the schema is too big, I don't want to define it inside the code. Can anyone please suggest the best way to do this?
I tried the ways below, but they don't work.
schema = sc.wholeTextFiles("hdfs://path/sample.schema")
schema = spark.read.text('/path/sample.schema')
I figured out how to do this.
1. Define the schema of the json file
json.schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", TimestampType(), True),
    StructField("col4",
        StructType([
            StructField("col5", StringType(), True),
            StructField("col6",
            .... and so on
2. Print the json output
print(sampletmp.json())
3. Copy and paste the above output into the file sample.schema
4. In the code, recreate the schema as below
schema_file = 'path/sample.schema'
schema_json = spark.read.text(schema_file).first()[0]
schema = StructType.fromJson(json.loads(schema_json))
5. Create a DF using the above schema
jsonDF = spark.read.schema(schema).json('/path/test.json')
6. Insert the data from the DF into a Hive table
jsonDF.write.mode("append").insertInto("hivetable")
I referred to this article: https://szczeles.github.io/Reading-JSON-CSV-and-XML-files-efficiently-in-Apache-Spark/
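For reference, a compact sketch of the round trip described in the steps above (some_df and the paths are placeholders; StructType.json() and StructType.fromJson() are the standard PySpark methods being used):

import json
from pyspark.sql.types import StructType

# Steps 1-3: dump an existing DataFrame's schema as JSON text and save it
# (some_df is a hypothetical DataFrame that already has the desired schema).
with open('sample.schema', 'w') as f:
    f.write(some_df.schema.json())

# Steps 4-6: later, rebuild the schema from the saved file and use it.
schema_json = spark.read.text('path/sample.schema').first()[0]
schema = StructType.fromJson(json.loads(schema_json))
jsonDF = spark.read.schema(schema).json('/path/test.json')
jsonDF.write.mode("append").insertInto("hivetable")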
I haven't tested it with HDFS, but I assume it is similar to reading from a local file. The idea is to store the schema in a file as a dict and then parse it to create the desired schema. I have taken inspiration from here. Currently it lacks support for nullable, and I have not tested it with deeper levels of nested structs.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import *
from fractions import Fraction
from pyspark.sql.functions import udf
import json

spark = SparkSession.builder.appName('myPython').getOrCreate()

f = open("/path/schema_file", "r")
dictString = f.read()

derived_schema = StructType([])
jdata = json.loads(dictString)

def get_type(v):
    if v == "StringType":
        return StringType()
    if v == "TimestampType":
        return TimestampType()
    if v == "IntegerType":
        return IntegerType()

def generate_schema(jdata, derived_schema):
    for k, v in sorted(jdata.items()):
        if isinstance(v, str):
            derived_schema.add(StructField(k, get_type(v), True))
        else:
            added_schema = StructType([])
            added_schema = generate_schema(v, added_schema)
            derived_schema.add(StructField(k, added_schema, True))
    return derived_schema

generate_schema(jdata, derived_schema)

from datetime import datetime
data = [("first", "the", datetime.utcnow(), ["as", 1])]
input_df = spark.createDataFrame(data, derived_schema)
input_df.printSchema()
With the file being:
{
    "col1": "StringType",
    "col2": "StringType",
    "col3": "TimestampType",
    "col4": {
        "col5": "StringType",
        "col6": "IntegerType"
    }
}

Pyspark: How to convert a spark dataframe to json and save it as json file?

I am trying to convert my PySpark SQL DataFrame to JSON and then save it as a file.
df_final = df_final.union(join_df)
df_final contains the value as such:
I tried something like this, but it created invalid JSON:
df_final.coalesce(1).write.format('json').save(data_output_file+"createjson.json", overwrite=True)
{"Variable":"Col1","Min":"20","Max":"30"}
{"Variable":"Col2","Min":"25,"Max":"40"}
My expected file should have data as below:
[
    {"Variable":"Col1",
     "Min":"20",
     "Max":"30"},
    {"Variable":"Col2",
     "Min":"25",
     "Max":"40"}
]
In PySpark you can store your DataFrame directly as a JSON file; there is no need to convert the DataFrame to JSON first.
df_final.coalesce(1).write.format('json').save('/path/file_name.json')
If you still want to convert your DataFrame to JSON, you can use
df_final.toJSON()
A solution can be using collect and then using json.dump:
import json

collected_df = df_final.collect()
data = [row.asDict() for row in collected_df]  # Row objects are not directly JSON serializable
with open(data_output_file + 'createjson.json', 'w') as outfile:
    json.dump(data, outfile)
Here is how you can do the equivalent of json.dump for a dataframe with PySpark 1.3+.
df_list_of_jsons = df.toJSON().collect()
df_list_of_dicts = [json.loads(x) for x in df_list_of_jsons]
df_json = json.dumps(df_list_of_dicts)
sc.parallelize([df_json]).repartition(1).cache().saveAsTextFile("<HDFS_PATH>")
Note this will result in the whole dataframe being loaded into the driver memory, so it is only recommended for small dataframes.
If you want to use Spark to process the result as JSON files, I think your output format in HDFS is right.
I assume the issue you encountered is that you cannot read that data smoothly from a normal Python script using:
with open('data.json') as f:
    data = json.load(f)
You should try to read the data line by line:
data = []
with open("data.json", 'r') as datafile:
    for line in datafile:
        data.append(json.loads(line))
and you can use pandas to create a DataFrame:
df = pd.DataFrame(data)
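If the goal is still a single file containing one JSON array (as in the expected output above), a small follow-up sketch, assuming pandas is acceptable and using a placeholder output path:

import pandas as pd

# Sketch: write the collected records back out as one JSON array.
# 'data_as_array.json' is a hypothetical output path.
df = pd.DataFrame(data)
df.to_json('data_as_array.json', orient='records')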

Apply a function to a single column of a csv in Spark

Using Spark I'm reading a CSV and want to apply a function to a column of the CSV. I have some code that works, but it's very hacky. What is the proper way to do this?
My code:
SparkContext().addPyFile("myfile.py")
spark = SparkSession \
    .builder \
    .appName("myApp") \
    .getOrCreate()
from myfile import myFunction

df = spark.read.csv(sys.argv[1], header=True,
                    mode="DROPMALFORMED")
a = df.rdd.map(lambda line: Row(id=line[0], user_id=line[1], message_id=line[2], message=myFunction(line[3]))).toDF()
I would like to be able to just call the function on the column name instead of mapping each row to line and then calling the function on line[index].
I'm using Spark version 2.0.1
You can simply use a user defined function (udf) combined with withColumn:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
udf_myFunction = udf(myFunction, IntegerType()) # if the function returns an int
df = df.withColumn("message", udf_myFunction("_3")) #"_3" being the column name of the column you want to consider
This will add a new column to the dataframe df containing the result of myFunction(line[3]).
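Since the CSV is read with header=True, you can also pass the actual column name instead of the positional "_3" name. A small sketch (assuming the relevant column is headed "message" in the CSV and that myFunction returns a string):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Assumption: the column is named "message" in the CSV header and
# myFunction returns a string; adjust the return type if it differs.
udf_myFunction = udf(myFunction, StringType())
df = df.withColumn("message", udf_myFunction(df["message"]))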

Reading Avro file in Spark and extracting column values

I want to read an Avro file using Spark (I am using Spark 1.3.0, so I don't have DataFrames).
I read the Avro file using this piece of code:
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.SparkContext

private def readAvro(sparkContext: SparkContext, path: String) = {
  sparkContext.newAPIHadoopFile[
    AvroKey[GenericRecord],
    NullWritable,
    AvroKeyInputFormat[GenericRecord]
  ](path)
}
I execute this and get an RDD. Now, from the RDD, how do I extract the value of specific columns? For example, how do I loop through all records and get the value for a given column name?
[edit] As suggested by Justin below, I tried
val rdd = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](input)
rdd.map(record=> record._1.get("accountId")).toArray().foreach(println)
but I get an error
<console>:34: error: value get is not a member of org.apache.avro.mapred.AvroKey[org.apache.avro.generic.GenericRecord]
rdd.map(record=> record._1.get("accountId")).toArray().foreach(println)
AvroKey has a datum method to extract the wrapped value, and GenericRecord has a get method that accepts the column name as a string. So you can just extract the columns using a map:
rdd.map(record=>record._1.datum.get("COLNAME"))
