Split a 30 GB JSON file into smaller files - python-3.x

I am facing a memory issue when reading a JSON file that is 30 GB in size. Is there any direct way in Python 3.x, like we have in Unix, to split the JSON file into smaller files based on the number of lines?
e.g. the first 100000 records go into the first split file and the rest go into subsequent child JSON files?

Depending on your input data, and whether its structure is known and consistent, this will be easier or harder.
In my example here the idea is to read the file line by line with a lazy generator and write a new file whenever a valid object can be constructed from the input. It's a bit like manual parsing.
In a real-world case, the logic for when to write to a new file would depend heavily on your input and on what you are trying to achieve.
Some sample data
[
  {
    "color": "red",
    "value": "#f00"
  },
  {
    "color": "green",
    "value": "#0f0"
  },
  {
    "color": "blue",
    "value": "#00f"
  },
  {
    "color": "cyan",
    "value": "#0ff"
  },
  {
    "color": "magenta",
    "value": "#f0f"
  },
  {
    "color": "yellow",
    "value": "#ff0"
  },
  {
    "color": "black",
    "value": "#000"
  }
]
# create a generator that yields each individual line
lines = (l for l in open('data.json'))

# o is used to accumulate some lines before
# writing to the files
o = ''
# itemCount is used to count the number of valid json objects
itemCount = 0

# read the file line by line to avoid memory issues
i = -1
while True:
    try:
        line = next(lines)
    except StopIteration:
        break
    i = i + 1
    # ignore the first square bracket
    if i == 0:
        continue
    # in this data I know an object is complete every 4 lines
    # this logic depends on your input data
    if i % 4 == 0:
        itemCount += 1
        # at this point I am able to create a valid json object
        # based on my knowledge of the input file structure
        validObject = o + line.replace("},\n", '}\n')
        o = ''
        # now write each object to its own file
        with open(f'item-{itemCount}.json', 'w') as outfile:
            outfile.write(validObject)
    else:
        o += line
Here is a repl with the working example: https://replit.com/#bluebrown/linebyline
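If the 30 GB file is a single large JSON array, a streaming parser can avoid loading it all into memory. Below is a minimal sketch using the third-party ijson package (an assumption; it is not part of the original answer) that writes every 100000 records to a new child file, as the question asks.

# a minimal sketch, assuming the ijson package is installed (pip install ijson)
# and that data.json is one big JSON array of objects
import json
import ijson

CHUNK_SIZE = 100000  # records per output file

with open('data.json', 'rb') as infile:
    chunk = []
    file_index = 0
    # ijson.items streams one array element at a time instead of loading the whole file
    # note: ijson may return decimal.Decimal for non-integer numbers; if your data
    # contains floats you may need to handle that before json.dump
    for record in ijson.items(infile, 'item'):
        chunk.append(record)
        if len(chunk) == CHUNK_SIZE:
            with open(f'split-{file_index}.json', 'w') as outfile:
                json.dump(chunk, outfile)
            chunk = []
            file_index += 1
    # write whatever is left over
    if chunk:
        with open(f'split-{file_index}.json', 'w') as outfile:
            json.dump(chunk, outfile)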

Related

Pandas Dataframe to JSON add JSON Object Name

I have a dataframe that I'm converting to JSON but I'm having a hard time naming the object. The code I have:
j = (df_import.groupby(['Item', 'Subinventory', 'TransactionUnitOfMeasure', 'TransactionType', 'TransactionDate', 'TransactionSourceId', 'OrganizationName'])
     .apply(lambda x: x[['LotNumber', 'TransactionQuantity']].to_dict('records'))
     .reset_index()
     .rename(columns={0: 'lotItemLots'})
     .to_json(orient='records'))
The result I'm getting:
[
  {
    "Item": "000400MCI00099",
    "OrganizationName": "OR",
    "Subinventory": "LAB R",
    "TransactionDate": "2021-08-19 00:00:00",
    "TransactionSourceId": 3000001595xxxxx,
    "TransactionType": "Account Alias Issue",
    "TransactionUnitOfMeasure": "EA",
    "lotItemLots": [
      {
        "LotNumber": "00040I",
        "TransactionQuantity": -5
      }
    ]
  }
]
The result I need (the transactionLines part), which I can't figure out:
{
  "transactionLines": [
    {
      "Item": "000400MCI00099",
      "Subinventory": "LAB R",
      "TransactionQuantity": -5,
      "TransactionUnitOfMeasure": "EA",
      "TransactionType": "Account Alias Issue",
      "TransactionDate": "2021-08-20 00:00:00",
      "OrganizationName": "OR",
      "TransactionSourceId": 3000001595xxxxx,
      "lotItemLots": [{"LotNumber": "00040I", "TransactionQuantity": -5}]
    }
  ]
}
Index,Item Number,5 Digit,Description,Subinventory,Lot Number,Quantity,EOM,[Qty],Transaction Type,Today's Date,Expiration Date,Source Header ID,Lot Interface Number,Transaction Source ID,TransactionType,Organization Name
1,000400MCI00099,40,ACANTHUS OAK LEAF,LAB R,00040I,-5,EA,5,Account Alias Issue,2021/08/25,2002/01/01,160200,160200,3000001595xxxxx,Account Alias Issue,OR
Would appreciate any guidance on how to get the transactionLines name in there. Thank you in advance.
It would seem to me you could simply parse the json output, and then re-form it the way you want:
import pandas as pd
import json

data = [{'itemID': 0, 'itemprice': 100}, {'itemID': 1, 'itemprice': 200}]
data = pd.DataFrame(data)
pd_json = data.to_json(orient='records')

new_collection = []  # store our reformed records

# loop over parsed json, and reshape it the way we want
for record in json.loads(pd_json):
    nested = {'transactionLines': [record]}  # matching specs of question
    new_collection.append(nested)

new_json = json.dumps(new_collection)  # convert back to json str
print(new_json)
Which results in:
[
{"transactionLines": [{"itemID": 0, "itemprice": 100}]},
{"transactionLines": [{"itemID": 1, "itemprice": 200}]}
]
Note that of course you could probably do this in a more concise manner, without the intermediate json conversion.
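For the exact output in the question (a single object whose transactionLines key wraps all of the records), a more concise sketch would be to build the wrapper in plain Python and serialize once. This assumes j is the records-oriented JSON string produced by the groupby/apply pipeline in the question:

import json

# 'j' is assumed to be the records-oriented JSON string from the question's pipeline
wrapped = {'transactionLines': json.loads(j)}
print(json.dumps(wrapped, indent=4))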

Include base64 code of image in csv file using Nifi

I have a JSON array response from InvokeHTTP. I am using the flow below to convert some of the JSON info to CSV. One of the JSON fields is an id, which is used to fetch an image that is then converted to base64. I need to add this base64 code to my CSV. I don't understand how to save it in an attribute so that it can be put into AttributeToCsv.
Also, I was reading here https://community.cloudera.com/t5/Support-Questions/Nifi-attribute-containing-large-text-value/td-p/190513
that it is not recommended to store large values in attributes due to memory concerns. What would be an optimal approach in this scenario?
JSON response during the first call:
[ {
  "fileNumber" : "1",
  "uuid" : "abc",
  "attachedFiles" : [ {
    "id" : "bkjdbkjdsf",
    "name" : "image1.png"
  }, {
    "id" : "xzcv",
    "name" : "image2.png"
  } ],
  "date" : null
},
{ "fileNumber" : "2",
  "uuid" : "def",
  "attachedFiles" : [],
  "date" : null
} ]
Final CSV (after merge, i.e. the expected output):
Id,File Name,File Data (base64 code)
bkjdbkjdsf,image1.png,iVBORw0KGgo...ji
xzcv,image2.png,ZEStWRGau..74
My approach (will change as per suggestions):
After splitting the JSON response, I use EvaluateJsonPath to get "attachedFiles".
I find the length of the "attachedFiles" array and then decide whether to split further if 2 or more files are there; if 0, do nothing. In a second EvaluateJsonPath I add the properties Id and File Name and set their values from the JSON using $.id etc. I use the Id to invoke another URL, whose response I encode to Base64.
Current output - a CSV file which needs to be updated with a third column, File Data (base64 code), and its value:
Id,File Name
bkjdbkjdsf,image1.png
xzcv,image2.png
As a variant, use ExecuteGroovyScript:
def ff = session.get()
if (!ff) return

ff.write{ sin, sout ->
    sout.withWriter('UTF-8'){ w ->
        // write attribute values for the names 'Id' and 'filename' delimited with a comma
        w << ff.attributes.with{ a -> [a.'Id', a.'filename'] }.join(',')
        w << ','  // write a comma
        // sin.withReader('UTF-8'){ r -> w << r }  // write the current content of the file after the last comma
        w << sin.bytes.encodeBase64()
        w << '\n'
    }
}
REL_SUCCESS << ff
UPD: I put sin.bytes.encodeBase64() instead of copying the flowfile content. This creates a one-line base64 string for the input file. If you are using this option, you should remove Base64EncodeContent to prevent double base64 encoding.

How to append multiple JSON object in a custom list using python?

I have two dictionaries (business and business1). I convert these dictionaries into JSON strings (a and b). Then I append these two JSON objects to a custom list called "all".
Here, the list creation is static; I have to make it dynamic because the number of dictionaries could vary. But the output should keep the same structure.
Here is my code section
Python Code
import json
import something as b

business = {
    "id": "04",
    "target": b.YesterdayTarget,
    'Sales': b.YSales,
    'Achievements': b.Achievement
}

business1 = {
    "id": "05",
    "target": b.YesterdayTarget,
    'Sales': b.YSales,
    'Achievements': b.Achievement
}

# Convert each dictionary to json data
a = str(json.dumps(business, indent=5))
b = str(json.dumps(business1, indent=5))

all = '[' + a + ',\n' + b + ']'
print(all)
Output Sample
[{
     "id": "04",
     "target": 55500000,
     "Sales": 23366927,
     "Achievements": 42.1
},
{
     "id": "05",
     "target": 55500000,
     "Sales": 23366927,
     "Achievements": 42.1
}]
Thanks for your suggestions and efforts.
Try this one.
import ast, re
lines = open(path_to_your_file).read().splitlines()
result = [ast.literal_eval(re.search('({.+})', line).group(0)) for line in lines]
print(len(result))
print(result)
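If the dictionaries are already available as Python objects (rather than read back from a file), a minimal sketch of the dynamic approach is to collect them in a list and serialize the whole list once. The helper name build_business below is hypothetical and only stands in for however each dictionary is produced:

import json

# hypothetical helper that builds one dictionary per id
def build_business(record_id, target, sales, achievement):
    return {
        "id": record_id,
        "target": target,
        "Sales": sales,
        "Achievements": achievement,
    }

# any number of dictionaries can be appended here
records = []
for record_id in ("04", "05", "06"):
    records.append(build_business(record_id, 55500000, 23366927, 42.1))

# one json.dumps call produces the same bracketed structure as in the question
print(json.dumps(records, indent=5))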

How do I use this list as a parameter for this function?

I'm new to Python and I'm using it to write a Spotify app with Spotipy. Basically, I have a dictionary of tracks called topTracks. I can access a track and its name/ID and stuff with
topSongs['items'][0]
topSongs['items'][3]['id']
topSongs['items'][5]['name']
So there's a function I'm trying to use:
recommendations(seed_artists=None, seed_genres=None, seed_tracks=None, limit=20, country=None, **kwargs)
With this function I'm trying to use seed_tracks, which requires a list of track IDs. So ideally I want to input topSongs['items'][0]['id'], topSongs['items'][1]['id'], topSongs['items'][2]['id'], etc. How would I do this? I've read about the * operator but I'm not sure how I can use that or if it applies here.
You can try something like shown below.
ids = [item["id"] for item in topSongs["items"]]
Here, I have just formed a simple example.
>>> topSongs = {
... "items": [
... {
... "id": 1,
... "name": "Alejandro"
... },
... {
... "id": 22,
... "name": "Waiting for the rights"
... }
... ]
... }
>>>
>>> seed_tracks = [item["id"] for item in topSongs["items"]]
>>>
>>> seed_tracks
[1, 22]
>>>
Important note about using the * operator »
The * operator can be used in this case, but for that you will need to form a list/tuple containing the data the function takes, in positional order. You have to form all the variables like seed_tracks above. Something like:
data = [seed_artists, seed_genres, seed_tracks, limit, country]
And finally,
recommendations(*data)
Important note about using the ** operator »
And if you want to use the ** operator, the data will look like:
data = {"seed_artists": seed_artists, "seed_genres": seed_genres, "seed_tracks": seed_tracks, "limit": limit, "country": country}
Finally,
recommendations(**data)
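A minimal usage sketch, assuming sp is an already authenticated spotipy.Spotify client (not shown in the question); note that the Spotify recommendations endpoint accepts at most 5 seed values in total, so the ID list is truncated here:

# assumption: 'sp' (an authenticated spotipy.Spotify client) and 'topSongs' already exist
seed_tracks = [item["id"] for item in topSongs["items"]]

# pass the IDs directly as a keyword argument; at most 5 seeds are accepted
results = sp.recommendations(seed_tracks=seed_tracks[:5], limit=20)

for track in results["tracks"]:
    print(track["name"])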

Build a millions of items set from large file python fast

I'm trying to build a few sets of int pairs from a huge file. Each set in a typical file is built by parsing a few million lines. I have created the following code, but it takes more than 36 hours for just one set made from 2 million lines!
Input file (a few million lines like this); it starts with:
*|NET 2 0.000295965PF
... //unwanted sections
R2_42 2:1 2:2 3.43756e-05 $a=2.909040 $lvl=99 $llx=15.449 $lly=9.679 $urx=17.309 $ury=11.243
R2_43 2:2 2:3 0.805627 $l=0.180 $w=1.564 $lvl=71 $llx=16.199 $lly=9.679 $urx=16.379 $ury=11.243 $dir=0
R2_44 2:2 2:4 4.16241 $l=0.930 $w=1.564 $lvl=71 $llx=16.379 $lly=9.679 $urx=17.309 $ury=11.243 $dir=0
R2_45 2:3 2:5 0.568889 $a=0.360000 $lvl=96 $llx=15.899 $lly=10.185 $urx=16.499 $ury=10.785
R2_46 2:3 2:6 3.35678 $l=0.750 $w=1.564 $lvl=71 $llx=15.449 $lly=9.679 $urx=16.199 $ury=11.243 $dir=0
R2_47 2:5 2:7 0.0381267 $l=0.301 $w=0.600 $lvl=8 $llx=16.199 $lly=10.200 $urx=16.500 $ury=10.800 $dir=0
R2_48 2:5 2:8 0.0378733 $l=0.299 $w=0.600 $lvl=8 $llx=15.900 $lly=10.200 $urx=16.199 $ury=10.800 $dir=0
*|NET OUT 0.000895965PF
...etc
Finally, I need to build a set of integer pairs from the above, where the integers are the indexes of the unique values from column 2 and column 3 of the file.
[(2:1,2:2), (2:2,2:3), (2:2,2:4), (2:3,2:5), (2:3,2:6), (2:5,2:7), (2:5,2:8)] becomes
[(0,1),(1,2),(1,3),(2,4),(2,5),(4,6),(4,7)]
I coded this:
import re
import itertools
import ast
import json


def f6(seq):
    # Not order preserving
    myset = set(seq)
    return list(myset)


if __name__ == '__main__':
    with open('myspf') as infile, open('tmp', 'w') as outfile:
        copy = False
        allspf = []
        for line in infile:
            if line.startswith("*|NET 2"):
                copy = True
            elif line.strip() == "":
                copy = False
            elif copy:
                # capture col2 and col3
                if line.startswith("R"):
                    allspf.extend(re.findall(r'^R.*?\s(.*?)\s(.*?)\s', line))
        final = f6(list(itertools.chain(*allspf)))  # to get the unique list
        # build the final pairs again by index: I've found this was the bottleneck
        for x in allspf:
            left, right = x
            outfile.write("({},{}),".format(final.index(left), final.index(right)))

    pair = []
    f = open('tmp')
    pair = list(ast.literal_eval(f.read()))
    f.close()

    # construct_trees_by_TingYu comes from the page linked below
    fopen = open('hopespringseternal.txt', 'w')
    fopen.write(json.dumps(construct_trees_by_TingYu(pair), indent=1))
    fopen.close()
The bottleneck is in the 'for x in allspf' loop, and the procedure construct_trees_by_TingYu itself also ran out of memory after I gave it the set of millions of items. The procedure from this page requires the entire set all at once: http://xahlee.info/python/python_construct_tree_from_edge.html
The final output is a tree from parent to child:
{
 "3": {
  "1": {
   "0": {}
  }
 },
 "5": {
  "2": {
   "1": {
    "0": {}
   }
  }
 },
 "6": {
  "4": {
   "2": {
    "1": {
     "0": {}
    }
   }
  }
 },
 "7": {
  "4": {
   "2": {
    "1": {
     "0": {}
    }
   }
  }
 }
}
Building a set is always O(n). You need to traverse the entire list to add each item to your set.
However, it does not look like you're even using the set operation in the code excerpt above.
If you are running out of memory, you probably want to iterate over the huge set, rather than wait for the entire set to be created and then pass it to construct_trees_by_TingYu (I have no idea what this is by the way). Also, you can create a generator to yield each item from the set, which will decrease your memory footprint. Whether 'construct_trees_by_TingYu' will handle a generator passed to it, I do not know.
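As a minimal sketch of the generator idea, assuming the allspf list of string pairs from the question is already in memory: a dict lookup is used here instead of list.index (an assumption, not part of the original answer) so that each pair costs O(1) instead of O(n), and the pairs are yielded lazily rather than collected up front. With the sample data above this produces the same (0,1),(1,2),(1,3),(2,4),(2,5),(4,6),(4,7) sequence the question expects.

def index_pairs(allspf):
    # map each unique node string (e.g. '2:1') to a stable integer index
    index_of = {}
    for left, right in allspf:
        for node in (left, right):
            if node not in index_of:
                index_of[node] = len(index_of)
    # yield one integer pair at a time instead of building a giant list
    for left, right in allspf:
        yield (index_of[left], index_of[right])

# usage: pass the generator on, or materialize it only if the consumer needs a list
# pairs = list(index_pairs(allspf))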
