In Python 3.7, I want to encode an Avro object to a string.
I found examples converting to a byte array but not to a string.
Code to convert to byte array:
import io
import avro.io

def serialize(mapper, schema):
    # Write the record (a dict) into an in-memory buffer using Avro binary encoding
    bytes_writer = io.BytesIO()
    encoder = avro.io.BinaryEncoder(bytes_writer)
    writer1 = avro.io.DatumWriter(schema)
    writer1.write(mapper, encoder)
    return bytes_writer.getvalue()
mapper is a dictionary that will populate the Avro object.
io provides StringIO, which I assume would need to be used instead of BytesIO, but then what encoder do I use with it? How do I serialize this to a string?
If, for example, a is your Avro object, you can use Avro's a.to_json() method and then json.dumps(a).
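As an illustration only (not part of the answer above): if what you ultimately need is a string, two common options are to base64-encode the binary bytes returned by the serialize() function from the question, or, since mapper is already a dict, to dump it as JSON directly. A minimal sketch:

import base64
import json

avro_bytes = serialize(mapper, schema)                      # binary Avro encoding from above
avro_string = base64.b64encode(avro_bytes).decode('ascii')  # ASCII-safe string of the binary payload

# Alternatively, if a JSON representation is acceptable:
json_string = json.dumps(mapper)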
I am trying to convert my PySpark SQL DataFrame to JSON and then save it as a file.
df_final = df_final.union(join_df)
df_final contains the value as such:
I tried something like this, but it created invalid JSON:
df_final.coalesce(1).write.format('json').save(data_output_file+"createjson.json", overwrite=True)
{"Variable":"Col1","Min":"20","Max":"30"}
{"Variable":"Col2","Min":"25,"Max":"40"}
My expected file should have data as below:
[
  {"Variable":"Col1",
   "Min":"20",
   "Max":"30"},
  {"Variable":"Col2",
   "Min":"25",
   "Max":"40"}
]
For PySpark you can directly store your DataFrame into a JSON file; there is no need to convert the DataFrame into JSON first.
df_final.coalesce(1).write.format('json').save('/path/file_name.json')
If you still want to convert your DataFrame into JSON, you can use
df_final.toJSON().
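As an illustration (not part of the answer above), toJSON() gives you an RDD of JSON strings, one per row:

json_rdd = df_final.toJSON()   # RDD[str], one JSON document per row
print(json_rdd.take(2))
# e.g. ['{"Variable":"Col1","Min":"20","Max":"30"}', '{"Variable":"Col2","Min":"25","Max":"40"}']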
A solution can be to use collect and then json.dump:
import json

collected_df = df_final.collect()
# Row objects are not directly JSON-serializable, so convert them to dicts first
data = [row.asDict() for row in collected_df]
with open(data_output_file + 'createjson.json', 'w') as outfile:
    json.dump(data, outfile)
Here is how you can do the equivalent of json.dump for a DataFrame with PySpark 1.3+.
import json

df_list_of_jsons = df.toJSON().collect()
df_list_of_dicts = [json.loads(x) for x in df_list_of_jsons]
df_json = json.dumps(df_list_of_dicts)
sc.parallelize([df_json]).repartition(1).cache().saveAsTextFile("<HDFS_PATH>")
Note this will result in the whole DataFrame being loaded into driver memory, so it is only recommended for small DataFrames.
If you want to use Spark to process the result as JSON files, your output layout in HDFS is already right.
I assume the issue you encountered is that you cannot smoothly read the data back from a normal Python script by using:
with open('data.json') as f:
    data = json.load(f)
You should try to read data line by line:
import json

data = []
with open("data.json", 'r') as datafile:
    for line in datafile:
        data.append(json.loads(line))
and then you can use pandas to create a DataFrame:
import pandas as pd

df = pd.DataFrame(data)
Using Spark 2.3, I know I can read a file of JSON documents like this:
{'key': 'val1'}
{'key': 'val2'}
With this:
spark.read.json('filename')
How can I read the following in to a dataframe when there aren't newlines between JSON documents?
The following would be an example input.
{'key': 'val1'}{'key': 'val2'}
To be clear, I expect a dataframe with two rows (frame.count() == 2).
Please try -
df = spark.read.json(["fileName1","fileName2"])
You can also do the following if you want to read all JSON files in a folder:
df = spark.read.json("data/*json")
As @cricket_007 suggested above, you'd be better off fixing the input file.
If you're sure you have no inline close braces within json objects, you could do the following:
with open('myfilename', 'r') as f:
    txt = f.read()

txt = txt.replace('}', '}\n')

with open('mynewfilename', 'w') as f:
    f.write(txt)
If you do have '}' within keys or values, the task becomes harder but not impossible with regex. It seems unlikely though.
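Once the file has one object per line, it can be read back directly (a minimal sketch, assuming the rewritten file from above):

df = spark.read.json('mynewfilename')
print(df.count())   # expect 2 rows for the two-document example input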
We solved this using the RDD API, as we couldn't find any way to use the DataFrame API in a memory-efficient way (we were always hitting executor OOM errors).
The following function will incrementally parse the JSON and yield the individual JSON documents from your file (from this post):
from functools import partial
from json import JSONDecoder
from io import StringIO

def generate_from_buffer(buffer: str, chunk: str, decoder: JSONDecoder):
    buffer += chunk
    while buffer:
        try:
            result, index = decoder.raw_decode(buffer)
            yield result
            buffer = buffer[index:].lstrip()
        except ValueError:
            # Not enough data to decode, read more
            break
    return buffer

def parse_jsons_file(jsons_str: str, buffer_size: int = 1024):
    decoder = JSONDecoder()
    buffer = ''
    file_obj = StringIO(jsons_str)
    for chunk in iter(partial(file_obj.read, buffer_size), ''):
        buffer = yield from generate_from_buffer(buffer, chunk, decoder)
    if buffer:
        raise ValueError("Invalid input: should be concatenation of json strings")
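As a quick sanity check (not part of the original answer), the parser can be exercised locally with the two concatenated documents from the question, written as valid JSON:

docs = list(parse_jsons_file('{"key": "val1"}{"key": "val2"}'))
print(docs)   # [{'key': 'val1'}, {'key': 'val2'}]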
We first read the json with .format("text"):
from pyspark.sql import DataFrame

df: DataFrame = (
    spark
    .read
    .format("text")
    .option("wholetext", True)
    .load(data_source_path)
)
Then convert it to an RDD, flatMap using the function from above, and finally convert it back to a Spark DataFrame. For this you have to define the json_schema for the individual JSON documents in your file, which is good practice anyway.
rdd_df = (df.rdd.map(lambda row: row["value"])
          .flatMap(lambda jsons_string: parse_jsons_file(jsons_string))
          .toDF(json_schema))
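The json_schema itself is not shown above; purely as an illustration, here is what it might look like for documents shaped like the {'key': ...} examples from the question (a hypothetical schema, adjust the fields to your data):

from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical schema for documents like {"key": "val1"}
json_schema = StructType([
    StructField("key", StringType(), True),
])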
My Spark program reads a file that contains a gzip-compressed string that is base64 encoded. I have to decode and decompress it.
I used Spark's unbase64 to decode it and generated a byte array:
bytedf = df.withColumn("unbase", unbase64(col("value")))
Is there any method available in Spark that decompresses the byte array?
I wrote a UDF:
import base64
import zlib
from pyspark.sql.functions import udf

def decompress(ip):
    # ip is the base64 string from the "value" column
    bytecode = base64.b64decode(ip)
    d = zlib.decompressobj(32 + zlib.MAX_WBITS)   # 32 + MAX_WBITS handles the gzip header
    decompressed_data = d.decompress(bytecode)
    return decompressed_data.decode('utf-8')

decompress = udf(decompress)
decompressedDF = df.withColumn("decompressed_XML", decompress("value"))
I have a similar case; in my case, I do this:
from pyspark.sql.functions import col,unbase64,udf
from gzip import decompress
bytedf=df1.withColumn("unbase",unbase64(col("payload")))
decompress_func = lambda x: decompress(x).decode('utf-8')
udf_decompress = udf(decompress_func)
df2 = bytedf.withColumn('unbase_decompress', udf_decompress('unbase'))
Spark example using base64-
import base64
.
.
#decode base 64 string using map operation or you may create udf.
df.map(lambda base64string: base64.b64decode(base64string), <string encoder>)
Read here for a detailed Python example.
I want to read an Avro file using Spark (I am using Spark 1.3.0, so I don't have DataFrames).
I read the Avro file using this piece of code:
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.SparkContext
private def readAvro(sparkContext: SparkContext, path: String) = {
  sparkContext.newAPIHadoopFile[
    AvroKey[GenericRecord],
    NullWritable,
    AvroKeyInputFormat[GenericRecord]
  ](path)
}
I execute this and get an RDD. Now, from the RDD, how do I extract the value of specific columns? E.g., loop through all records and print the value of a given column?
[edit] As suggested by Justin below, I tried:
val rdd = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](input)
rdd.map(record=> record._1.get("accountId")).toArray().foreach(println)
but I get an error
<console>:34: error: value get is not a member of org.apache.avro.mapred.AvroKey[org.apache.avro.generic.GenericRecord]
rdd.map(record=> record._1.get("accountId")).toArray().foreach(println)
AvroKey has a datum method to extract the wrapped value, and GenericRecord has a get method that accepts the column name as a string. So you can extract the columns using a map:
rdd.map(record=>record._1.datum.get("COLNAME"))
I am building a Python script in which I need to generate a JSON file from a JSON RDD.
Following is the code snippet for saving the JSON file:
jsonRDD.map(lambda x: json.loads(x)) \
    .coalesce(1, shuffle=True) \
    .saveAsTextFile('examples/src/main/resources/demo.json')
But I need to write the JSON data to a single file instead of having the data distributed across several partitions.
So please suggest an appropriate solution for it.
Without additional libraries like pandas, you could save your RDD of several JSON documents by reducing them to one big string, with one JSON document per line:
import json

# perform your operation
# note that you do not need a lambda expression for json.loads
jsonRDD = jsonRDD.map(json.loads).coalesce(1, shuffle=True)

# map jsons back to string
jsonRDD = jsonRDD.map(json.dumps)

# reduce to one big string with one json on each line
json_string = jsonRDD.reduce(lambda x, y: x + "\n" + y)

# write your string to a file
with open("path/to/your.json", "w", encoding="utf-8") as f:
    f.write(json_string)
I have had issues with PySpark saving off JSON files once I have them in an RDD or DataFrame, so what I do is convert them to a pandas DataFrame and save them to a non-distributed directory.
import pandas
df1 = sqlContext.createDataFrame(yourRDD)
df2 = df1.toPandas()
df2.to_json(yourpath)
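As a small follow-up (not part of the answer above): if you want the bracketed list-of-objects layout from the earlier question, pandas' to_json supports that via the orient parameter:

# orient='records' writes a JSON array of row objects: [{...}, {...}]
df2.to_json(yourpath, orient='records')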