Build a set of millions of items from a large file fast in Python - python-3.x

I'm trying to build a few sets of int pairs from a huge file. In a typical file, each set is built by parsing a few million lines. I wrote the following code, but it takes more than 36 hours for just one set built from 2 million lines!!
The input file (a few million lines like this) starts with:
*|NET 2 0.000295965PF
... //unwanted sections
R2_42 2:1 2:2 3.43756e-05 $a=2.909040 $lvl=99 $llx=15.449 $lly=9.679 $urx=17.309 $ury=11.243
R2_43 2:2 2:3 0.805627 $l=0.180 $w=1.564 $lvl=71 $llx=16.199 $lly=9.679 $urx=16.379 $ury=11.243 $dir=0
R2_44 2:2 2:4 4.16241 $l=0.930 $w=1.564 $lvl=71 $llx=16.379 $lly=9.679 $urx=17.309 $ury=11.243 $dir=0
R2_45 2:3 2:5 0.568889 $a=0.360000 $lvl=96 $llx=15.899 $lly=10.185 $urx=16.499 $ury=10.785
R2_46 2:3 2:6 3.35678 $l=0.750 $w=1.564 $lvl=71 $llx=15.449 $lly=9.679 $urx=16.199 $ury=11.243 $dir=0
R2_47 2:5 2:7 0.0381267 $l=0.301 $w=0.600 $lvl=8 $llx=16.199 $lly=10.200 $urx=16.500 $ury=10.800 $dir=0
R2_48 2:5 2:8 0.0378733 $l=0.299 $w=0.600 $lvl=8 $llx=15.900 $lly=10.200 $urx=16.199 $ury=10.800 $dir=0
*|NET OUT 0.000895965PF
...etc
Finally, I need to build a set of integer pairs from the above, where the integers are indexes into a list of the unique values from columns 2 and 3 of the file.
[(2:1,2:2), (2:2,2:3), (2:2,2:4), (2:3,2:5), (2:3,2:6), (2:5,2:7), (2:5,2:8)] becomes
[(0,1),(1,2),(1,3),(2,4),(2,5),(4,6),(4,7)]
I coded this:
import ast
import itertools
import json
import re

def f6(seq):
    # Not order preserving
    myset = set(seq)
    return list(myset)

if __name__ == '__main__':
    with open('myspf') as infile, open('tmp', 'w') as outfile:
        copy = False
        allspf = []
        for line in infile:
            if line.startswith("*|NET 2"):
                copy = True
            elif line.strip() == "":
                copy = False
            elif copy:
                # capture col2 and col3
                if line.startswith("R"):
                    allspf.extend(re.findall(r'^R.*?\s(.*?)\s(.*?)\s', line))
        final = f6(list(itertools.chain(*allspf)))  # to get a unique list
        # build the final pairs again by index: I've found this was the bottleneck
        for x in allspf:
            left, right = x
            outfile.write("({},{}),".format(final.index(left), final.index(right)))
    f = open('tmp')
    pair = list(ast.literal_eval(f.read()))
    f.close()
    fopen = open('hopespringseternal.txt', 'w')
    # construct_trees_by_TingYu comes from the page linked below
    fopen.write(json.dumps(construct_trees_by_TingYu(pair), indent=1))
    fopen.close()
The bottleneck is the 'for x in allspf' loop, and the procedure construct_trees_by_TingYu itself also ran out of memory once I handed it the set with millions of items. That procedure requires the entire set all at once: http://xahlee.info/python/python_construct_tree_from_edge.html
The final output is a tree from parent to child:
{
 "3": {
  "1": {
   "0": {}
  }
 },
 "5": {
  "2": {
   "1": {
    "0": {}
   }
  }
 },
 "6": {
  "4": {
   "2": {
    "1": {
     "0": {}
    }
   }
  }
 },
 "7": {
  "4": {
   "2": {
    "1": {
     "0": {}
    }
   }
  }
 }
}

Building a set is always O(n): you need to traverse the entire list to add each item to your set.
However, it does not look like you're even using a set operation in the code excerpt above.
If you are running out of memory, you probably want to iterate over the huge set rather than wait for the entire set to be created and then pass it to construct_trees_by_TingYu (I have no idea what this is, by the way). You can also create a generator to yield each item from the set, which will decrease your memory footprint. Whether construct_trees_by_TingYu will handle a generator passed to it, I do not know.
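As a minimal sketch of the generator idea (reusing names from the question's code; note also that final.index is an O(n) scan on every call, which makes the pair loop quadratic, so a one-time value-to-index dict is assumed here):
# sketch, not the author's code: O(1) lookups via a dict, pairs yielded lazily
final_index = {value: i for i, value in enumerate(final)}

def index_pairs(allspf):
    # yields one pair at a time instead of materializing the whole list
    for left, right in allspf:
        yield (final_index[left], final_index[right])

# consume lazily, e.g.: for p in index_pairs(allspf): ...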

Related

Split a 30 GB JSON file into smaller files

I am facing a memory issue reading a JSON file which is 30 GB in size. Is there any direct way in Python 3.x, like we have in Unix, to split the JSON file into smaller files based on lines?
e.g. the first 100000 records go into the first split file and the rest go into subsequent child JSON files?
Depending on your input data, and whether its structure is known and consistent, this will be harder or easier.
In my example here the idea is to read the file line by line with a lazy generator and write a new file whenever a valid object can be constructed from the input. It's a bit like manual parsing.
In a real-world case, the logic for when to write to a new file would depend highly on your input and what you are trying to achieve.
Some sample data
[
  {
    "color": "red",
    "value": "#f00"
  },
  {
    "color": "green",
    "value": "#0f0"
  },
  {
    "color": "blue",
    "value": "#00f"
  },
  {
    "color": "cyan",
    "value": "#0ff"
  },
  {
    "color": "magenta",
    "value": "#f0f"
  },
  {
    "color": "yellow",
    "value": "#ff0"
  },
  {
    "color": "black",
    "value": "#000"
  }
]
# create a generator that yields each individual line
lines = (l for l in open('data.json'))

# o is used to accumulate some lines before
# writing to the files
o = ''
# itemCount is used to count the number of valid json objects
itemCount = 0

# read the file line by line to avoid memory issues
i = -1
while True:
    try:
        line = next(lines)
    except StopIteration:
        break
    i = i + 1
    # ignore the first square bracket
    if i == 0:
        continue
    # in this data I know an object ends every 4th line
    # this logic depends on your input data
    if i % 4 == 0:
        itemCount += 1
        # at this point I am able to create a valid json object
        # based on my knowledge of the input file structure
        validObject = o + line.replace("},\n", '}\n')
        o = ''
        # now write each object to its own file
        with open(f'item-{itemCount}.json', 'w') as outfile:
            outfile.write(validObject)
    else:
        o += line
Here is a repl with the working example: https://replit.com/#bluebrown/linebyline

Is there a method to collect data intelligently from a website?

I want to get data from this link: https://meshb.nlm.nih.gov/treeView
The problem is that to get the whole tree, you have to click on + each time, for each line, to get the child nodes of the tree.
But I want to display the whole tree in one click and then copy all the content.
Any ideas, please?
Well, it all depends on what you mean by "intelligently". Not sure if this meets the criteria, but you might want to try this.
import json
import string
import requests

abc = string.ascii_uppercase

base_url = "https://meshb.nlm.nih.gov/api/tree/children/"
follow_url = "https://meshb.nlm.nih.gov/record/ui?ui="

tree = {}
for letter in abc[:1]:
    res = requests.get(f"{base_url}{letter}").json()
    tree[letter] = {
        "Records": [i["RecordName"] for i in res],
        "FollowURLS": [f"{follow_url}{i['RecordUI']}" for i in res],
    }

print(json.dumps(tree, indent=2))
This prints:
{
  "A": {
    "Records": [
      "Body Regions",
      "Musculoskeletal System",
      "Digestive System",
      "Respiratory System",
      "Urogenital System",
      "Endocrine System",
      "Cardiovascular System",
      "Nervous System",
      "Sense Organs",
      "Tissues",
      "Cells",
      "Fluids and Secretions",
      "Animal Structures",
      "Stomatognathic System",
      "Hemic and Immune Systems",
      "Embryonic Structures",
      "Integumentary System",
      "Plant Structures",
      "Fungal Structures",
      "Bacterial Structures",
      "Viral Structures"
    ],
    "FollowURLS": [
      "https://meshb.nlm.nih.gov/record/ui?ui=D001829",
      "https://meshb.nlm.nih.gov/record/ui?ui=D009141",
      "https://meshb.nlm.nih.gov/record/ui?ui=D004064",
      "https://meshb.nlm.nih.gov/record/ui?ui=D012137",
      "https://meshb.nlm.nih.gov/record/ui?ui=D014566",
      "https://meshb.nlm.nih.gov/record/ui?ui=D004703",
      "https://meshb.nlm.nih.gov/record/ui?ui=D002319",
      "https://meshb.nlm.nih.gov/record/ui?ui=D009420",
      "https://meshb.nlm.nih.gov/record/ui?ui=D012679",
      "https://meshb.nlm.nih.gov/record/ui?ui=D014024",
      "https://meshb.nlm.nih.gov/record/ui?ui=D002477",
      "https://meshb.nlm.nih.gov/record/ui?ui=D005441",
      "https://meshb.nlm.nih.gov/record/ui?ui=D000825",
      "https://meshb.nlm.nih.gov/record/ui?ui=D013284",
      "https://meshb.nlm.nih.gov/record/ui?ui=D006424",
      "https://meshb.nlm.nih.gov/record/ui?ui=D004628",
      "https://meshb.nlm.nih.gov/record/ui?ui=D034582",
      "https://meshb.nlm.nih.gov/record/ui?ui=D018514",
      "https://meshb.nlm.nih.gov/record/ui?ui=D056229",
      "https://meshb.nlm.nih.gov/record/ui?ui=D056226",
      "https://meshb.nlm.nih.gov/record/ui?ui=D056224"
    ]
  }
}
If you want all of it, just remove [:1] from the loop. If there's no entry for a given letter on the page, you'll get, well, an empty entry in the dictionary.
Obviously, you can dump the entire response, but that's just a proof of concept.
Try this; some parts are a bit tricky, but it manages to give you the tree:
import requests as r
import operator
import string

link = 'https://meshb.nlm.nih.gov/api/tree/children/{}'

all_data = []
for i in string.ascii_uppercase:
    all_data.append({'RecordName': i, 'RecordUI': '', 'TreeNumber': i, 'HasChildren': True})
    res = r.get(link.format(i))
    data_json = res.json()
    all_data += data_json

# This request will get all the rest of the data at once, other than A-Z or A..-Z..
# This request takes time to load, depending on your network; it fetched 3 million+ characters
res = r.get(link.format('.*'))
data_json = res.json()
all_data += data_json

# Sort the data by TreeNumber
all_data.sort(key=operator.itemgetter('TreeNumber'))

# Print the tree using tabs
for row in all_data:
    l = len(row['TreeNumber'])
    if l == 3:
        print('\t', end='')
    elif l > 3:
        print('\t' * (len(row['TreeNumber'].split('.')) + 1), end='')
    print(row['RecordName'])

List all system fonts as a dictionary | python

I want to get all system fonts (inside C://Windows//Fonts) as a dictionary, since I need to differentiate between bold and italic etc. When listing the contents of the directory via os.listdir or in the terminal, it's not possible (at least in most cases) to tell which font is which. Further, even if you iterated through all the fonts, you could barely tell whether a file is the 'regular' font or a variant.
Windows lists the folder by font family, each of these 'font folders' contains the different styles, and the raw file listing itself is unreadable and unusable for most cases (screenshots omitted).
So this is the output I wish I could achieve (or similar):
path = "C://Windows//Fonts"
# do some magic
dictionary = {
    'Arial': {'Regular': 'Arial-Regular.ttf', 'Bold': 'Arial-Bold.ttf'},
    'Carlito': {'Regular': '8514fix.fon', 'Bold': 'someweirdotherfile.fon'},
}
The only thing I've got so far is the bare installed font names, not their filenames.
So if there is any way to either get the contents as a dictionary or to get the filenames of the fonts, please be so kind and give me a tip :)
I know you made this post 2 years ago, but it so happens that I wrote a piece of code that does something similar to what you're asking for. I've only been programming for 1.5 months (and this is my first answer on Stack Overflow ever), so it can probably be improved in many ways, but maybe it will help somebody, or give them an idea of how to write it themselves.
from fontTools import ttLib
import os
from os import walk
import json

path = "C://Windows//Fonts"

fonts_path = []
for (dirpath, dirnames, filenames) in walk(path):
    for i in filenames:
        if any(i.endswith(ext) for ext in ['.ttf', '.otf', '.ttc', '.ttz', '.woff', '.woff2']):
            fonts_path.append(dirpath.replace('\\\\', '\\') + '\\' + i)

def getFont(font, font_path):
    x = lambda x: font['name'].getDebugName(x)
    if x(16) is None:
        return x(1), x(2), font_path
    else:
        return x(16), x(17), font_path

fonts = []
for i in range(len(fonts_path)):
    j = fonts_path[i]
    if not j.endswith('.ttc'):
        fonts.append(getFont(ttLib.TTFont(j), j))
    else:
        try:
            for k in range(100):
                fonts.append(getFont(ttLib.TTFont(j, fontNumber=k), j))
        except Exception:
            pass

fonts_dict = {}
no_dups = []
for i in fonts:
    index_0 = i[0]
    if index_0 not in no_dups:
        no_dups.append(index_0)

for i in fonts:
    for k in no_dups:
        if i[0] == k:
            fonts_dict[k] = json.loads('{\"' + str(i[1]) + '\" : \"' + str(i[2]).split('\\')[-1] + '\"}')
            for j in fonts:
                if i[0] == j[0]:
                    fonts_dict[k][j[1]] = j[2].split('\\')[-1]

print(json.dumps(fonts_dict, indent=2))
Some sample output (I made it shorter, because otherwise it'd be too big):
{
  "CAC Moose PL": {
    "Regular": "CAC Moose PL.otf"
  },
  "Calibri": {
    "Bold Italic": "calibriz.ttf",
    "Regular": "calibri.ttf",
    "Bold": "calibrib.ttf",
    "Italic": "calibrii.ttf",
    "Light": "calibril.ttf",
    "Light Italic": "calibrili.ttf"
  },
  "Cambria": {
    "Bold Italic": "cambriaz.ttf",
    "Regular": "cambria.ttc",
    "Bold": "cambriab.ttf",
    "Italic": "cambriai.ttf"
  },
  "Cambria Math": {
    "Regular": "cambria.ttc"
  },
  "Candara": {
    "Bold Italic": "Candaraz.ttf",
    "Regular": "Candara.ttf",
    "Bold": "Candarab.ttf",
    "Italic": "Candarai.ttf",
    "Light": "Candaral.ttf",
    "Light Italic": "Candarali.ttf"
  },
  "Capibara Mono B PL": {
    "Regular": "Capibara Mono B PL.otf"
  }
}
If someone needs the full path to the font, the only thing you need to change is removing .split('\\')[-1] in the last for loop; the output will then look like this:
"Arial": {
"Black": "C:\\Windows\\Fonts\\ariblk.ttf",
"Regular": "C:\\Windows\\Fonts\\arial.ttf",
"Bold": "C:\\Windows\\Fonts\\arialbd.ttf",
"Bold Italic": "C:\\Windows\\Fonts\\arialbi.ttf",
"Italic": "C:\\Windows\\Fonts\\ariali.ttf"
}
A postscript. Fonts in Windows are stored in two folders: one for global fonts (installed for all users) and one for the current user. Global fonts are stored in 'C:\Windows\Fonts', while user-local fonts are stored in 'C:\Users\username\AppData\Local\Microsoft\Windows\Fonts', so keep that in mind. You can simply get the username with os.getlogin(), as in the sketch below.
I decided to ignore .fon fonts because they're very problematic (for me).
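For instance, a minimal sketch (using the two paths quoted above) to collect both folders:
import os

# global fonts plus the current user's fonts (paths as described above)
font_dirs = [
    r'C:\Windows\Fonts',
    rf'C:\Users\{os.getlogin()}\AppData\Local\Microsoft\Windows\Fonts',
]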
Some code explanation:
fonts_path = []
for (dirpath, dirnames, filenames) in walk(path):
    for i in filenames:
        if any(i.endswith(ext) for ext in ['.ttf', '.otf', '.ttc', '.ttz', '.woff', '.woff2']):
            fonts_path.append(dirpath.replace('\\\\', '\\') + '\\' + i)
This takes only .ttf, .otf, .ttc, .ttz, .woff and .woff2 fonts from Windows's global fonts folder and builds a list (fonts_path) of all the font paths.
def getFont(font, font_path):
    x = lambda x: font['name'].getDebugName(x)
    if x(16) is None:
        return x(1), x(2), font_path
    else:
        return x(16), x(17), font_path
This function takes a ttLib.TTFont object (obtained via ttLib.TTFont(font_path) from the fontTools library) and the font path, and checks the font's debug names. Debug names are pre-defined metadata of a font and contain information such as the font name, font family, etc. You can read about them here: https://learn.microsoft.com/en-us/typography/opentype/spec/name#name-ids. So an example use, for the full font name, would be:
NameID = 4
font = ttLib.TTFont(font_path)
font_full_name = font['name'].getDebugName(NameID)
print(font_full_name)
Example output: Candara Bold Italic
Basically, we only need the font family name and the font name. The only problem is that some fonts have a None value for NameIDs 16 and 17, so depending on the font the values are taken from NameIDs 16 and 17 or fall back to 1 and 2. After that, every font is packed into a tuple in this way: (font_family, font_name, font_path)
fonts = []
for i in range(len(fonts_path)):
    j = fonts_path[i]
    if not j.endswith('.ttc'):
        fonts.append(getFont(ttLib.TTFont(j), j))
    else:
        try:
            for k in range(100):
                fonts.append(getFont(ttLib.TTFont(j, fontNumber=k), j))
        except Exception:
            pass
.ttc fonts need special treatment, because the .ttc format contains more than one font in a single file, so we must specify which font we want to use. The ttLib.TTFont(font_path) call then needs one more argument, fontNumber, so it becomes: ttLib.TTFont(font_path, fontNumber=font_index).
After that we have a list full of tuples in the order: (font_family, font_name, font_path)
no_dups = []
for i in fonts:
    index_0 = i[0]
    if index_0 not in no_dups:
        no_dups.append(index_0)
We make a list (named no_dups) of all the font families (stored at index 0 of each tuple), without duplicates.
fonts_dict = {}
for i in fonts:
    for k in no_dups:
        if i[0] == k:
            fonts_dict[k] = json.loads('{\"' + str(i[1]) + '\" : \"' + str(i[2]).split('\\')[-1] + '\"}')
            for j in fonts:
                if i[0] == j[0]:
                    fonts_dict[k][j[1]] = j[2].split('\\')[-1]
This code creates an entry for each font family in the dictionary, and then each family's value is itself a dict of styles.
So it's manipulated in this way:
It starts in the list:
[..., 'Cambria', ...]
Making a dict with one value at a time:
{...}, "Cambria": {"Bold Italic": "cambriaz.ttf"}, {...}
Adding more keys with values to the sub-dicts:
{...}, "Cambria": {"Bold Italic": "cambriaz.ttf", "Regular": "cambria.ttc"}, {...}
And it keeps doing this until everything is assigned to the dictionary.
And lastly:
print(json.dumps(fonts_dict, indent=2))
Printing the result in a nicely formatted way.
Hope this helps somebody.

How to create a list of dictionaries in this code?

I have some names and scores as follows:
input = {
    'Maths': dict(Mohsen=19, Sadegh=18, Hafez=15),
    'Physics': dict(Sadegh=16, Hafez=17, Mohsen=17),
    'Chemistry': dict(Hafez=13),
    'Literature': dict(Sadegh=14),
    'Biology': dict(Mohsen=16, Sadegh=10),
}
If a person doesn't have a score for a lesson, that score is considered zero. I also want to compute each person's average score and sort the final list by average. I want to get an output like this:
answer = [
    dict(Name='Sadegh', Literature=14, Chemistry=0, Maths=18, Physics=16, Biology=10, Average=11.6),
    dict(Name='Mohsen', Maths=19, Physics=17, Chemistry=0, Biology=16, Literature=0, Average=10.4),
    dict(Name='Hafez', Chemistry=13, Biology=0, Physics=17, Literature=0, Maths=15, Average=9),
]
How can I do it?
Essentially, you have a dictionary where the information is arranged by subject, and for each subject you have student marks. You want to collect all the information related to each student in separate dictionaries.
One approach you can try is as below:
Convert the data you have into student-specific data, and then calculate the average of that student's marks across all subjects. There is sample code below.
Please do note that this is just a sample and you should try out a solution by yourself. There are many alternate ways of doing it and you should explore them by yourself.
The below code works with Python 2.7:
from __future__ import division

def convert_subject_data_to_student_data(subject_dict):
    student_dict = {}
    for k, v in subject_dict.items():
        for k1, v1 in v.items():
            if k1 not in student_dict:
                student_dict[k1] = {k: v1}
            else:
                student_dict[k1][k] = v1

    student_list = []
    for k, v in student_dict.items():
        st_dict = {}
        st_dict['Name'] = k
        st_dict['Average'] = sum(v.itervalues()) / len(v.keys())
        st_dict.update(v)
        student_list.append(st_dict)
    print student_list

if __name__ == "__main__":
    subject_dict = {
        'Maths': dict(Mohsen=19, Sadegh=18, Hafez=15),
        'Physics': dict(Sadegh=16, Hafez=17, Mohsen=17),
        'Chemistry': dict(Hafez=13),
        'Literature': dict(Sadegh=14),
        'Biology': dict(Mohsen=16, Sadegh=10),
    }
    convert_subject_data_to_student_data(subject_dict)
sample_input = {
    'Maths': dict(Mohsen=19, Sadegh=18, Hafez=15),
    'Physics': dict(Sadegh=16, Hafez=17, Mohsen=17),
    'Chemistry': dict(Hafez=13),
    'Literature': dict(Sadegh=14),
    'Biology': dict(Mohsen=16, Sadegh=10),
}

def foo(lessons):
    result = {}
    for lesson in lessons:
        for user in lessons[lesson]:  # inner dictionary: student -> mark
            if result.get(user):
                #print(result.get(user))
                result.get(user).setdefault(lesson, lessons[lesson].get(user, 0))
            else:
                result.setdefault(user, dict(name=user))
                result.get(user).setdefault(lesson, lessons[lesson].get(user, 0))
    #return list(result.values())
    return result.values()

#if __name__ == '__main__':
print(foo(sample_input))
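Note that neither snippet above fills in zeros for missing subjects, averages over all five subjects, and sorts by average. A minimal sketch that produces the exact answer shown in the question (reusing sample_input from above; an illustration, not either answerer's code):
def build_answer(lessons):
    # regroup subject -> {student: mark} into student -> {subject: mark}
    students = {}
    for lesson, scores in lessons.items():
        for name, score in scores.items():
            students.setdefault(name, {})[lesson] = score

    answer = []
    for name, scores in students.items():
        row = {'Name': name}
        row.update({lesson: scores.get(lesson, 0) for lesson in lessons})  # missing lessons count as zero
        row['Average'] = sum(scores.values()) / len(lessons)  # average over all lessons
        answer.append(row)

    answer.sort(key=lambda r: r['Average'], reverse=True)  # highest average first
    return answer

print(build_answer(sample_input))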

PyTest: fix repeating code and remove dependencies

I am writing tests for an API with pytest.
The tests are structured like this:
import os

import pytest

import myapi

KEEP_BOX_IDS = ["123abc"]

@pytest.fixture(scope="module")
def s():
    UID = os.environ.get("MYAPI_UID")
    if UID is None:
        raise KeyError("UID not set in environment variable")
    PWD = os.environ.get("MYAPI_PWD")
    if PWD is None:
        raise KeyError("PWD not set in environment variable")
    return myapi.Session(UID, PWD)

@pytest.mark.parametrize("name,description,count", [
    ("Normal Box", "Normal Box Description", 1),
    ("ÄäÖöÜüß!§", "ÄäÖöÜüß!§", 2),
    ("___--_?'*#", "\n\n1738\n\n", 3),
])
def test_create_boxes(s, name, description, count):
    box_info_create = s.create_box(name, description)
    assert box_info_create["name"] == name
    assert box_info_create["desc"] == description
    box_info = s.get_box_info(box_info_create["id"])
    assert box_info["name"] == name
    assert box_info["desc"] == description
    assert len(s.get_box_list()) == count + len(KEEP_BOX_IDS)

def test_update_boxes(s):
    bl = s.get_box_list()
    for b in bl:
        b_id = b['id']
        if b_id not in KEEP_BOX_IDS:
            new_name = b["name"] + "_updated"
            new_desc = b["desc"] + "_updated"
            s.update_box(b_id, new_name, new_desc)
            box_info = s.get_box_info(b_id)
            assert box_info["name"] == new_name
            assert box_info["desc"] == new_desc
I use a fixture to set up the session (this keeps me connected to the API).
As you can see, I am creating 3 boxes at the beginning.
All the tests that follow perform some sort of operation on these 3 boxes. (Boxes are just spaces for folders and files.)
For example: update_boxes, create_folders, rename_folders, upload_files, change_file_names, etc.
I know it's not good, since all the tests depend on each other, but if I execute them in the right order the test run is valid, and that's enough.
The second issue, which bothers me the most, is that all the following tests start with the same lines:
bl = s.get_box_list()
for b in bl:
    b_id = b['id']
    if b_id not in KEEP_BOX_IDS:
        box_info = s.get_box_info(b_id)
I always need this for loop to get each box's id and info.
I've tried to put it in a second fixture, but the problem is that then there will be two fixtures (a sketch of what I mean follows below).
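For illustration, a sketch of what that second fixture might look like (a guess at the intent, reusing the names from the snippet above):
@pytest.fixture
def kept_boxes(s):
    # re-query the API and collect (id, info) for every box not in the keep list
    return [(b['id'], s.get_box_info(b['id']))
            for b in s.get_box_list() if b['id'] not in KEEP_BOX_IDS]

def test_update_boxes(s, kept_boxes):
    for b_id, box_info in kept_boxes:
        ...  # per-box assertions go here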
Is there a better way of doing this?
Thanks
