Parse a huge JSON file - groovy

I have a very large JSON file (about a gigabyte) which I want to parse.
I tried JsonSlurper, but it looks like it tries to load the whole file into memory, which causes an out-of-memory exception.
Here is a piece of code I have:
import groovy.json.JsonSlurper
import groovy.json.JsonParserType

def parser = new JsonSlurper().setType(JsonParserType.CHARACTER_SOURCE)
def result = parser.parse(new File("equity_listing_full_201604160411.json"))
result.each {
    println it.Listing.ID
}
And the JSON is something like this, but much longer, with more columns and rows:
[
  {"Listing": {"ID":"2013056","VERSION":"20160229:053120:000","NAME":"XXXXXX","C_ID":["1927445"]}},
  {"Listing": {"ID":"2013057","VERSION":"20160229:053120:000","NAME":"XXXXXX","C_ID":["1927446"]}},
  {"Listing": {"ID":"2013058","VERSION":"20160229:053120:000","NAME":"XXXXXX","C_ID":["1927447"]}}
]
I want to be able to read it row by row. I could probably parse each row separately, but I was thinking there might be something that parses as you read.

I suggest using GSON by Google. It has a streaming parsing option: https://sites.google.com/site/gson/streaming

Compress a dataframe to one JSON string in Apache Spark

I have a dataframe that, when I write it to JSON, produces several hundred lines of JSON that are exactly the same. I am trying to compress it to one JSON line. Is there an out-of-the-box way to accomplish this?
import pyspark
import pyspark.sql.functions as F

def collect_metrics(df) -> pyspark.sql.DataFrame:
    # assuming "count" is a column name here, not the DataFrame.count() method
    neg_value = df.where(df["count"] < 0).count()
    return df.withColumn("loader_neg_values", F.lit(neg_value))

def main(args):
    df_metrics = collect_metrics(df)
    df_metrics.write.json(args.metrics)
In the end, the goal is to write one JSON line, and the output has to be a JSON file, not a compressed one.
It seems like you have hundreds of (duplicated) lines but you only want to keep one. You can use limit(1) in that case:
df_metrics.limit(1).write.json(args.metrics)
You want something like this:
df_metrics.limit(1).repartition(1).write.json(args.metrics)
.repartition(1) guarantees 1 output file, and .limit(1) guarantees one output row.
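For context, here is a minimal end-to-end sketch of that approach; the dataframe contents, column name, and output path are placeholders, not taken from the question:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical metrics dataframe: 500 identical rows.
df_metrics = spark.range(500).select(F.lit(3).alias("neg_value_count"))

# limit(1) keeps a single row; repartition(1) collapses the output
# into a single part file, so you end up with one JSON line in one file.
df_metrics.limit(1).repartition(1).write.mode("overwrite").json("/tmp/metrics_json")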

Skip over element in large XML file (python 3)

I'm new to xml parsing, and I've been trying to figure out a way to skip over a parent element's contents because there is a nested element that contains a large amount of data in its text attribute (I cannot change how this file is generated). Here's an example of what the xml looks like:
<root>
    <Parent>
        <thing_1>
            <a>I need this</a>
        </thing_1>
        <thing_2>
            <a>I need this</a>
        </thing_2>
        <thing_3>
            <subgroup>
                <huge_thing>enormous string here</huge_thing>
            </subgroup>
        </thing_3>
    </Parent>
    <Parent>
        <thing_1>
            <a>I need this</a>
        </thing_1>
        <thing_2>
            <a>I need this</a>
        </thing_2>
        <thing_3>
            <subgroup>
                <huge_thing>enormous string here</huge_thing>
            </subgroup>
        </thing_3>
    </Parent>
</root>
I've tried lxml.iterparse and xml.sax implementations to try and work this out, but no dice. These are the majority of the answers I've found in my searches:
Use the tag keyword in iterparse.
This does not work, because, although lxml cleans up the elements in the background, the large text in the element is still parsed into memory, so I'm getting large memory spikes.
Create a flag where you set it to True if the start event for that element is found and then ignore the element in parsing.
This does not work, as the element is still parsed into memory at the end event.
Break before you reach the end event of the specific element.
I cannot just break when I reach the element, because there are multiples of these elements that I need specific children data from.
This is not possible as stream parsers still have an end event and generate the full element.
... ok.
I'm currently trying to directly edit the stream data that the GzipFile sends to iterparse, in the hope that the parser won't even know the element exists, but I'm running into issues with that. Any direction would be greatly appreciated.
I don't think you can get a parser to selectively ignore some part of the XML it's parsing. Here are my findings using the SAX parser...
I took your sample XML, blew it up to just under 400MB, created a SAX parser, and ran it against my big.xml file two different ways.
For the straightforward approach, sax.parse('big.xml', MyHandler()), memory peaked at 12M.
For a buffered file reader approach, using 4K chunks, parser.feed(chunk), memory peaked at 10M.
I then doubled the size, for an 800M file, re-ran both ways, and the peak memory usage didn't change, staying around 10M. The SAX parser seems very efficient.
I ran this script against your sample XML to create some really big text nodes, 400M each.
with open('input.xml') as f:
    data = f.read()

with open('big.xml', 'w') as f:
    f.write(data.replace('enormous string here', 'a' * 400_000_000))
Here's big.xml's size in MB:
du -ms big.xml
763 big.xml
Here's my SAX ContentHandler, which only handles the character data if the path to the data's parent ends in thing_*/a (which, according to your sample, disqualifies huge_thing)...
BTW, much appreciation to l4mpi for this answer, showing how to buffer the character data you do want:
from xml import sax

class MyHandler(sax.handler.ContentHandler):
    def __init__(self):
        self._charBuffer = []
        self._path = []

    def _getCharacterData(self):
        data = ''.join(self._charBuffer).strip()
        self._charBuffer = []
        return data.strip()  # remove strip() if whitespace is important

    def characters(self, data):
        if len(self._path) < 2:
            return
        if self._path[-1] == 'a' and self._path[-2].startswith('thing_'):
            self._charBuffer.append(data)

    def startElement(self, name, attrs):
        self._path.append(name)

    def endElement(self, name):
        self._path.pop()
        if len(self._path) == 0:
            return
        if self._path[-1].startswith('thing_'):
            print(self._path[-1])
            print(self._getCharacterData())
For both the whole-file parse method, and the chunked reader, I get:
thing_1
I need this
thing_2
I need this
thing_3
thing_1
I need this
thing_2
I need this
thing_3
It's printing thing_3 because of my simple logic, but the data in subgroup/huge_thing is ignored.
Here's how I call the handler with the straightforward parse() method:
handler = MyHandler()
sax.parse('big.xml', handler)
When I run that with Unix/BSD time, I get:
/usr/bin/time -l ./main.py
...
1.45 real 0.64 user 0.11 sys
...
11027456 peak memory footprint
Here's how I call the handler with the more complex chunked reader, using a 4K chunk size:
handler = MyHandler()
parser = sax.make_parser()
parser.setContentHandler(handler)
Chunk_Sz = 4096
with open('big.xml') as f:
    chunk = f.read(Chunk_Sz)
    while chunk != '':
        parser.feed(chunk)
        chunk = f.read(Chunk_Sz)
/usr/bin/time -l ./main.py
...
1.85 real 1.65 user 0.19 sys
...
10453952 peak memory footprint
Even with a 512B chunk size, it doesn't get below 10M, but the runtime doubled.
I'm curious to see what kind of performance you're getting.
You cannot use a DOM parser, as that would by definition pull the whole document into RAM. But a DOM parser is basically just a SAX parser that builds a DOM as it goes through the SAX events.
When writing your custom SAX handler, you can not only build the DOM (or whichever other in-memory representation you prefer) but also start ignoring events when they relate to a specific location in the document.
Be aware that parsing still needs to continue so you know when to stop ignoring the events, but the output of the parser will not contain this unneeded large chunk of data.
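As an illustration of that idea (my own sketch, not code from the question), here is a minimal xml.sax handler that keeps receiving events for huge_thing but simply drops its character data instead of buffering it; the file name is a placeholder:

from xml import sax

class SkippingHandler(sax.handler.ContentHandler):
    def __init__(self):
        self._skipping = False  # True while we are inside <huge_thing>
        self._text = []

    def startElement(self, name, attrs):
        if name == 'huge_thing':
            self._skipping = True

    def characters(self, data):
        # Events still arrive for the huge text node, but we discard them
        # instead of keeping them in memory.
        if not self._skipping:
            self._text.append(data)

    def endElement(self, name):
        if name == 'huge_thing':
            self._skipping = False

sax.parse('big.xml', SkippingHandler())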

How can I convert a Pyspark dataframe to a CSV without sending it to a file?

I have a dataframe which I need to convert to a CSV file, and then I need to send this CSV to an API. As I'm sending it to an API, I do not want to save it to the local filesystem and need to keep it in memory. How can I do this?
Easy way: convert your dataframe to a pandas dataframe with toPandas(), then save it to a string. To get a string rather than a file, call to_csv with path_or_buf=None. Then send the string in your API call.
From to_csv() documentation:
Parameters
path_or_buf : str or file handle, default None
File path or object; if None is provided, the result is returned as a string.
So your code would likely look like this:
csv_string = df.toPandas().to_csv(path_or_buf=None)
Alternatives: use tempfile.SpooledTemporaryFile with a large buffer to create an in-memory file. Or you can even use a regular file, just make your buffer large enough and don't flush or close the file. Take a look at Corey Goldberg's explanation of why this works.
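A rough sketch of the string approach (df is the Spark dataframe from the question; the endpoint URL and payload shape are placeholders for whatever your API expects):

import requests

# Convert the Spark dataframe to an in-memory CSV string.
csv_string = df.toPandas().to_csv(path_or_buf=None, index=False)

# Send it to the API without touching the local filesystem.
response = requests.post(
    "https://example.com/upload",  # placeholder endpoint
    files={"file": ("data.csv", csv_string, "text/csv")},
)
response.raise_for_status()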

Transform a string to some code in python 3

I store some data in an Excel file that I extract in JSON format. I also call some data with GET requests from an API I created. With all these data, I do some tests (does the data in the Excel file equal the data returned by the API?).
In my case, I may need to store in the Excel file the path used to select the data from the JSON returned by the GET.
for example, the API returns :
{"countries":
[{"code":"AF","name":"Afghanistan"},
{"code":"AX","name":"Ă…land Islands"} ...
And in my excel, I store :
excelData['countries'][0]['name']
I can retrieve the excelData['countries'][0]['name'] in my code just fine, as a string.
Is there a way to convert excelData['countries'][0]['name'] from a string to some code that actually points and get the data I need from the API json?
Here's how I want to use it:
self.assertEqual(str(valueExcel), path)
# path is the string from the excel that tells where to fetch the data from the
# JSON api
I thought the string would be interpreted, but no:
AssertionError: 'AF' != "excelData['countries'][0]['code']"
- AF
+ excelData['countries'][0]['code']
You are looking for the built-in eval function. Try this:
self.assertEqual(str(valueExcel), eval(path))
Important: Keep in mind that eval can be dangerous, since malicious code could be executed. More warnings here: What does Python's eval() do?
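As a standalone illustration (the data here is made up for the example), eval turns the stored path string back into the lookup it describes:

excelData = {"countries": [{"code": "AF", "name": "Afghanistan"}]}

path = "excelData['countries'][0]['code']"

# eval executes the string as a Python expression in the current scope,
# so it resolves to the value the path points at.
assert eval(path) == "AF"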

read login data from text file into dictionary error

Using the answer on Stack Overflow shown on this link: https://stackoverflow.com/a/4804039, I have attempted to read in the file contents into a dictionary. There is an error that I cannot seem to fix.
Code
def login():
    print("====Login====")
    userinfo = {}
    with open("userinfo.txt", "r") as f:
        for line in f:
            (key, val) = line.split()
            userinfo[key] = val
    print(userinfo)
File Contents
{'user1': 'pass'}
{'user2': 'foo'}
{'user3': 'boo'}
Error:
(key,val)=line.split()
ValueError: not enough values to unpack (expected 2, got 0)
I have a question to which I would very much appreciate a two-fold answer.
What is the best and most efficient way to read in file contents, as shown, into a dictionary, noting that it has already been stored in dictionary format.
Is there a way to WRITE to a dictionary to make this "reading" easier? My code for writing to the userinfo.txt file in the first place is shown below
Write code
with open("userinfo.txt","a",newline="")as fo:
writer=csv.writer(fo)
writer.writerow([{username:password}])
Could any answers please attempt the following
Provide a solution to the error using the original code
Suggest the best method to do the same thing (simplest for teaching purposes) Note, that I do not wish to use pickle, json or anything other than very basic file handling (so only reading from a text file or csv reader/writer tools). For instance, would it be best to read the file contents into a list and then convert the list into a dictionary? Or is there any other way?
Is there a method of writing a dictionary to a text file using csv reader or other basic txt file handling, so that the reading of the file contents into a dictionary could be done more effectively on the other end.
Update:
Blank line removed, and the code works but produces the erroneous output:
{"{"Vjr':": "'open123'}", "{'mvj':": "'mvv123'}"}
I think I need to understand the split and strip commands and how to use them in this context to produce the desired result (reading the contents into a dictionary userinfo)
Well let's start with the basics first. The error message:
ValueError: not enough values to unpack (expected 2, got 0)
means a line was empty, so do you have a blank line in the file?
Yes, there are other options for saving your dictionary out and bringing it back, but first you should understand this; it may work just fine for you. :-) The split() is acting on the string you read from the file, and by default it splits on whitespace, so that is what you are seeing. You could format your text file like 'username:pass' instead and then use split(':').
File Contents
user1:pass
user2:foo
user3:boo
Code
def login():
    print("====Login====")
    userinfo = {}
    with open("userinfo.txt", "r") as f:
        for line in f:
            (key, val) = line.split(':')
            userinfo[key] = val.strip()
    print(userinfo)

if __name__ == '__main__':
    login()
This simple format may be best if you want to be able to edit the text file by hand, and I like to keep it simple as possible. ;-)
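To round out point 3 of the question, here is my own sketch (not part of the answer above) of the matching write side, which keeps each entry in the same username:pass format that login() reads:

def save(userinfo):
    # One 'username:pass' line per entry, matching what login() expects.
    with open("userinfo.txt", "w") as f:
        for username, password in userinfo.items():
            f.write(f"{username}:{password}\n")

save({"user1": "pass", "user2": "foo", "user3": "boo"})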
