My Spark program reads a file that contains a gzip-compressed string that is base64-encoded. I have to decode and decompress it.
I used Spark's unbase64 to decode it, which produced a byte array:
bytedf = df.withColumn("unbase", unbase64(col("value")))
Is there any method available in Spark that decompresses the byte array?
I wrote a UDF:
import base64
import zlib
from pyspark.sql.functions import udf

def decompress(ip):
    bytecode = base64.b64decode(ip)
    d = zlib.decompressobj(32 + zlib.MAX_WBITS)
    decompressed_data = d.decompress(bytecode)
    return decompressed_data.decode('utf-8')

decompress = udf(decompress)
decompressedDF = df.withColumn("decompressed_XML", decompress("value"))
I had a similar case; this is what I do:
from pyspark.sql.functions import col,unbase64,udf
from gzip import decompress
bytedf=df1.withColumn("unbase",unbase64(col("payload")))
decompress_func = lambda x: decompress(x).decode('utf-8')
udf_decompress = udf(decompress_func)
df2 = bytedf.withColumn('unbase_decompress', udf_decompress('unbase'))
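For reference, here is a minimal self-contained round trip of the same approach (the df1/payload names follow the snippet above; the sample data is made up):
import base64, gzip
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, unbase64, udf

spark = SparkSession.builder.getOrCreate()

# gzip-compress a string, then base64-encode it, to mimic the input described above
sample = base64.b64encode(gzip.compress(b"<note>hello</note>")).decode("ascii")
df1 = spark.createDataFrame([(sample,)], ["payload"])

bytedf = df1.withColumn("unbase", unbase64(col("payload")))
udf_decompress = udf(lambda x: gzip.decompress(bytes(x)).decode("utf-8"))
df2 = bytedf.withColumn("unbase_decompress", udf_decompress("unbase"))
df2.select("unbase_decompress").show(truncate=False)   # prints <note>hello</note>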
Spark example using base64:
import base64
# ... (other setup omitted) ...
# decode the base64 string using a map operation on the underlying RDD, or create a udf
decoded_rdd = df.rdd.map(lambda row: base64.b64decode(row["value"]).decode("utf-8"))
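And a hedged sketch of the UDF variant mentioned above, assuming the base64 text lives in a column named "value":
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import base64

# decode base64 text to a utf-8 string inside a UDF
b64_decode_udf = udf(lambda s: base64.b64decode(s).decode("utf-8"), StringType())
decoded_df = df.withColumn("decoded", b64_decode_udf("value"))
Note that the built-in unbase64 shown earlier in the thread avoids the Python UDF entirely when a binary column is acceptable.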
Related
In Python 3.7, I want to encode an Avro object to String.
I found examples converting to byte array but not to string.
Code to convert to byte array:
import io
import avro.io

def serialize(mapper, schema):
    bytes_writer = io.BytesIO()
    encoder = avro.io.BinaryEncoder(bytes_writer)
    writer1 = avro.io.DatumWriter(schema)
    writer1.write(mapper, encoder)
    return bytes_writer.getvalue()
mapper is a dictionary that will populate the Avro object.
io provides StringIO, which I assume would need to be used instead of BytesIO, but then which encoder do I use with it? How do we serialize this?
If, for example, a is your Avro object, you can use Avro's a.to_json() method and then json.dumps(a).
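If a JSON representation is not what you need, a hedged alternative sketch is to base64-encode the bytes returned by the serialize() function above, which gives a plain string that can be reversed later (serialize, mapper, and schema refer to the question's snippet):
import base64

avro_bytes = serialize(mapper, schema)                    # Avro binary from the question's function
avro_str = base64.b64encode(avro_bytes).decode("ascii")   # plain str, safe to store or send
original_bytes = base64.b64decode(avro_str)               # round trip back to the Avro binary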
Using Spark 2.3, I know I can read a file of JSON documents like this:
{'key': 'val1'}
{'key': 'val2'}
With this:
spark.read.json('filename')
How can I read the following into a dataframe when there aren't newlines between the JSON documents?
The following would be an example input:
{'key': 'val1'}{'key': 'val2'}
To be clear, I expect a dataframe with two rows (frame.count() == 2).
Please try:
df = spark.read.json(["fileName1", "fileName2"])
You can also read all the JSON files in a folder:
df = spark.read.json("data/*.json")
As #cricket_007 suggested above, you'd be better off fixing the input file.
If you're sure you have no inline close braces within the JSON objects, you could do the following:
with open('myfilename', 'r') as f:
    txt = f.read()

txt = txt.replace('}', '}\n')

with open('mynewfilename', 'w') as f:
    f.write(txt)
If you do have '}' within keys or values, the task becomes harder, but not impossible with regex (see the sketch below). It seems unlikely though.
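A slightly safer variant of the same idea, still assuming flat objects like the example input, is to add the newline only where a close brace is immediately followed by an open brace:
import re

with open('myfilename', 'r') as f:
    txt = f.read()

# insert a newline only between back-to-back JSON objects ("}{" or "} {"),
# so a '}' inside a value is left alone unless it is directly followed by '{'
txt = re.sub(r'\}\s*\{', '}\n{', txt)

with open('mynewfilename', 'w') as f:
    f.write(txt)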
We solved this using the RDD API, as we couldn't find any way to use the DataFrame API in a memory-efficient way (we were always hitting executor OOM errors).
The following function incrementally tries to parse the JSON, yielding the subsequent JSON objects from your file (from this post):
from functools import partial
from json import JSONDecoder
from io import StringIO
def generate_from_buffer(buffer: str, chunk: str, decoder: JSONDecoder):
    buffer += chunk
    while buffer:
        try:
            result, index = decoder.raw_decode(buffer)
            yield result
            buffer = buffer[index:].lstrip()
        except ValueError:
            # Not enough data to decode, read more
            break
    return buffer

def parse_jsons_file(jsons_str: str, buffer_size: int = 1024):
    decoder = JSONDecoder()
    buffer = ''
    file_obj = StringIO(jsons_str)
    for chunk in iter(partial(file_obj.read, buffer_size), ''):
        buffer = yield from generate_from_buffer(buffer, chunk, decoder)
    if buffer:
        raise ValueError("Invalid input: should be concatenation of json strings")
We first read the JSON with .format("text"):
df: DataFrame = (
    spark
    .read
    .format("text")
    .option("wholetext", True)
    .load(data_source_path)
)
Then convert it to an RDD, flatMap using the function from above, and finally convert it back to a Spark dataframe. For this you have to define the json_schema for the single JSONs in your file, which is good practice anyway (see the sketch after the snippet below).
rdd_df = (df.rdd.map(lambda row: row["value"])
          .flatMap(lambda jsons_string: parse_jsons_file(jsons_string))
          .toDF(json_schema))
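For completeness, a minimal sketch of what json_schema could look like for the {'key': ...} documents in the example at the top of this question (adjust the fields to your real payload):
from pyspark.sql.types import StructType, StructField, StringType

json_schema = StructType([
    StructField("key", StringType(), nullable=True),
])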
How do I open a file that is stored in HDFS? Here the input file is from HDFS. If I give the file as below, I won't be able to open it; it will show 'file not found'.
from pyspark import SparkConf, SparkContext

conf = SparkConf()
sc = SparkContext(conf=conf)

def getMovieName():
    movieNames = {}
    with open("/user/sachinkerala6174/inData/movieStat") as f:
        for line in f:
            fields = line.split("|")
            mID = fields[0]
            mName = fields[1]
            movieNames[int(mID)] = mName
    return movieNames

nameDict = sc.broadcast(getMovieName())
My assumption was to use something like:
with open(sc.textFile("/user/sachinkerala6174/inData/movieStat")) as f:
But that didn't work either.
To read the text file into an RDD:
rdd_name = sc.textFile("/user/sachinkerala6174/inData/movieStat")
You can use collect() in order to use it in pure Python (not recommended; use it only on very small data), or use Spark RDD methods to manipulate it with PySpark methods (the recommended way).
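A minimal sketch that ties this back to the getMovieName() logic in the question, assuming the movieStat file is small enough to collect to the driver:
rdd_name = sc.textFile("/user/sachinkerala6174/inData/movieStat")

# build the {movieID: movieName} mapping on the driver, then broadcast it,
# mirroring what the original open()-based code intended
movie_names = (rdd_name
               .map(lambda line: line.split("|"))
               .map(lambda fields: (int(fields[0]), fields[1]))
               .collectAsMap())
nameDict = sc.broadcast(movie_names)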
More info from the PySpark API docs:
textFile(name, minPartitions=None, use_unicode=True)
Read a text file from HDFS, a local file system (available on all
nodes), or any Hadoop-supported file system URI, and return it as an
RDD of Strings.
If use_unicode is False, the strings will be kept as str (encoding as
utf-8), which is faster and smaller than unicode. (Added in Spark 1.2)
>>> path = os.path.join(tempdir, "sample-text.txt")
>>> with open(path, "w") as testFile:
... _ = testFile.write("Hello world!")
>>> textFile = sc.textFile(path)
>>> textFile.collect()
[u'Hello world!']
I'm working with Spark in Python, trying to map PDF files to some custom parsing. Currently I'm loading the PDFs with pdfs = sparkContext.binaryFiles("some_path/*.pdf").
I set the RDD to be cacheable on disk with pdfs.persist(pyspark.StorageLevel.MEMORY_AND_DISK).
I then try to map the parsing operation, and then save a pickle file, but it fails with an out-of-memory error in the heap. Could you help me please?
Here is the simplified code of what I do:
from pyspark import SparkConf, SparkContext
import pyspark
# There is some code here that sets up an args object with argparse,
# but it's not very interesting and a bit long, so I skip it.
def extractArticles(tupleData):
    url, bytesData = tupleData
    # convert the bytesData into `content`, a list of dicts
    return content
sc = SparkContext("local[*]","Legilux PDF Analyser")
inMemoryPDFs = sc.binaryFiles( args.filePattern )
inMemoryPDFs.persist( pyspark.StorageLevel.MEMORY_AND_DISK )
pdfData = inMemoryPDFs.flatMap( extractArticles )
pdfData.persist( pyspark.StorageLevel.MEMORY_AND_DISK )
pdfData.saveAsPickleFile( args.output )
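One hedged starting point, since local[*] runs everything inside a single driver JVM, is to raise the driver memory before the context is created; a minimal sketch, where the 8g value is only an example to tune to your machine:
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[*]")
        .setAppName("Legilux PDF Analyser")
        .set("spark.driver.memory", "8g"))   # example value; size it to your machine
sc = SparkContext(conf=conf)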
I am building a Python script in which I need to generate a JSON file from a JSON RDD.
The following is the code snippet for saving the JSON file:
jsonRDD.map(lambda x: json.loads(x)) \
       .coalesce(1, shuffle=True) \
       .saveAsTextFile('examples/src/main/resources/demo.json')
But I need to write the JSON data to a single file instead of having the data spread across several partitions.
So please suggest an appropriate solution for it.
Without the use of additional libraries like pandas, you could save your RDD of several JSONs by reducing them to one big string of JSONs, each separated by a new line:
# perform your operation
# note that you do not need a lambda expression for json.loads
jsonRDD = jsonRDD.map(json.loads).coalesce(1, shuffle=True)

# map the jsons back to strings
jsonRDD = jsonRDD.map(json.dumps)

# reduce to one big string with one json on each line
json_string = jsonRDD.reduce(lambda x, y: x + "\n" + y)

# write the string to a file (text mode with utf-8 encoding; json_string is
# already a str in Python 3, so no .encode() is needed)
with open("path/to/your.json", "w", encoding="utf-8") as f:
    f.write(json_string)
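A hedged DataFrame-based alternative sketch (the output path here is hypothetical): coalesce to a single partition and let Spark write the JSON; note this still produces a directory containing one part-* file rather than a single bare file:
# spark.read.json can read an RDD of JSON strings directly
df = spark.read.json(jsonRDD)
df.coalesce(1).write.mode("overwrite").json("examples/src/main/resources/demo_json_dir")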
I have had issues with PySpark saving off JSON files once I have them in an RDD or dataframe, so what I do is convert them to a pandas dataframe and save it to a non-distributed path.
import pandas
df1 = sqlContext.createDataFrame(yourRDD)
df2 = df1.toPandas()
df2.to_json(yourpath)