How to multiprocess large text files in Python?

I tried to process the rows of a DictReader object after reading in a 60 MB CSV file. I asked about that here: how to chunk a csv (dict)reader object in python 3.2? (Code repeated below.)
However, I now realize that chunking the original text file might do the trick just as well, with the DictReader parsing and the line-by-line processing done later on each chunk. The problem is that I found no I/O tool that multiprocessing.Pool could use.
Thanks for any thoughts!
import csv
import time
from multiprocessing import Pool

import networkx as nx

source = open('/scratch/data.txt','r')
def csv2nodes(r):
    strptime = time.strptime
    mktime = time.mktime
    l = []
    ppl = set()
    for row in r:
        cell = int(row['cell'])
        id = int(row['seq_ei'])
        st = mktime(strptime(row['dat_deb_occupation'],'%d/%m/%Y'))
        ed = mktime(strptime(row['dat_fin_occupation'],'%d/%m/%Y'))
        # collect list
        l.append([(id,cell,{1:st,2: ed})])
        # collect separate sets
        ppl.add(id)
    return (l,ppl)
def csv2graph(source):
    r = csv.DictReader(source,delimiter=',')
    MG=nx.MultiGraph()
    l = []
    ppl = set()
    # Remember that I use integers for edge attributes, to save space! Dic above.
    # start: 1
    # end: 2
    p = Pool(processes=4)
    node_divisor = len(p._pool)*4
    # chunks() is not defined in this snippet
    node_chunks = list(chunks(r,int(len(r)/int(node_divisor))))
    num_chunks = len(node_chunks)
    pedgelists = p.map(csv2nodes,
                       zip(node_chunks))
    ll = []
    for l in pedgelists:
        ll.append(l[0])
        ppl.update(l[1])
    MG.add_edges_from(ll)
    return (MG,ppl)
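One possible direction, sketched under assumptions: a csv.DictReader has no len(), so instead of dividing it up front, the rows can be grouped into fixed-size lists while iterating and each list handed to a worker. The helper names iter_chunks, rows_to_edges and parse_parallel, and the default chunk size, are made up for this illustration; the column names are taken from the code above.

```python
import csv
import time
from itertools import islice
from multiprocessing import Pool


def iter_chunks(rows, size):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def rows_to_edges(chunk):
    """Worker: convert a list of row dicts into (edges, people)."""
    edges, people = [], set()
    for row in chunk:
        person = int(row['seq_ei'])
        cell = int(row['cell'])
        start = time.mktime(time.strptime(row['dat_deb_occupation'], '%d/%m/%Y'))
        end = time.mktime(time.strptime(row['dat_fin_occupation'], '%d/%m/%Y'))
        # integer keys for the edge attributes: 1 = start, 2 = end
        edges.append((person, cell, {1: start, 2: end}))
        people.add(person)
    return edges, people


def parse_parallel(source, chunk_size=10000, processes=4):
    """Parse an open CSV file in parallel, one chunk of rows per task."""
    reader = csv.DictReader(source, delimiter=',')
    edges, people = [], set()
    with Pool(processes=processes) as pool:
        # imap returns the per-chunk results in input order
        for chunk_edges, chunk_people in pool.imap(rows_to_edges,
                                                   iter_chunks(reader, chunk_size)):
            edges.extend(chunk_edges)
            people.update(chunk_people)
    return edges, people
```

When run as a script, parse_parallel should be called under an `if __name__ == "__main__":` guard, and the returned edge list can then be passed to MG.add_edges_from. Whether this beats a single process depends on how expensive the per-row work is compared with pickling the rows for the workers.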

Related

appending to a list inside a loop using a variable as the list name

My goal is to create several lists from the contents of several files. In the past, I have used '{}'.format(x) inside loops to change the paths to match whichever item in the list the loop is working on. Now I want to extend that to appending to lists defined outside the loop. Here is the code I am currently using.
import csv
import os
c3List = []
c4List = []
camList = []
plantList = ('c3', 'c4', 'cam')
for p in plantList:
    plantFolder = folder path
    plantCsv = '{}List.csv'.format(p)
    plantPath = os.path.join(plantFolder, plantCsv)
    with open(plantPath) as plantParse:
        reader = csv.reader(plantParse)
        data = list(reader)
        '{}List'.format(p).append(data)
But this gives me AttributeError: 'str' object has no attribute 'append'.
If I try to make a variable like this
pList = '{}List'.format(p)
pList.append(data)
I get the same error. Any advice would be appreciated. I am using Python 3.
Because list objects are mutable, you could create a dict that references all of your lists.
For example with this:
myList = []
myDict = {"a": myList}
myDict["a"].append("appended_by_reference")
myList.append("appended_directly")
print(myList)
you will get ['appended_by_reference', 'appended_directly'] printed.
If you want to learn more about mutability and immutability in Python, see link.
So my own implementation to achieve your goal would be:
import csv
from pathlib import Path
c3List = []
c4List = []
camList = []
plantList = {'c3': c3List, 'c4': c4List, 'cam': camList}
plantFolder = `folder path`
for p in plantList:
    plantCsv = f'{p}List.csv'
    plantPath = Path(plantFolder, plantCsv)
    with open(plantPath) as plantParse:
        reader = csv.reader(plantParse)
        data = list(reader)
        plantList[p].append(data)
Note: I used an f-string to format the string and pathlib to define the file paths.
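A variation on the same idea, as a sketch only: if the separately named list variables are not needed elsewhere, the dict itself can own the lists, so no name has to be built from a string. The function name load_plant_lists and its parameters are hypothetical; the '<name>List.csv' file naming follows the question.

```python
import csv
from pathlib import Path


def load_plant_lists(folder, names):
    """Return {name: rows} for each '<name>List.csv' found in `folder`."""
    lists = {}
    for name in names:
        path = Path(folder) / f'{name}List.csv'
        # newline='' is what the csv module recommends for files it reads
        with open(path, newline='') as handle:
            lists[name] = list(csv.reader(handle))
    return lists
```

The rows for a given plant type are then available as, for example, lists['c3'].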

Working with '"variable"' structure in Python

So while reading a CSV-file into python, some of the variables have the following structure:
'"variable"'
I stored them in a list of tuples.
Some of these variables are numeric and have to be compared to each other.
But I can't seem to find a way to compare them. For example:
counter = 0
if '"120000"' < '"130000"':
    counter += 1
However, the counter remains at 0.
Any advice on how to work with these kinds of data structures?
I tried converting them to integers, but this gives me a ValueError.
The original file has the following layout:
Date,"string","string","string","string","integer"
I read the file as follows:
with open(dataset, mode="r") as flight_information:
    flight_information_header = flight_information.readline()
    flight_information = flight_information.read()
    flight_information = flight_information.splitlines()
flight_information_list = []
for lines in flight_information:
    lines = lines.split(",")
    flight_information_tuple = tuple(lines)
    flight_information_list.append(flight_information_tuple)
For people in the future, the following solved my problem:
Since tuples are immutable, I now remove the quotation marks around my numerical values while loading the CSV file, before the tuples are built.
Example:
with open(dataset, mode="r") as flight_information:
    flight_information_header = flight_information.readline()
    flight_information = flight_information.read()
    flight_information = flight_information.splitlines()
flight_information_list = []
for lines in flight_information:
    lines = lines.replace('"', '').split(",")
    flight_information_tuple = tuple(lines)
    flight_information_list.append(flight_information_tuple)
Note this line in particular:
lines = lines.replace('"', '').split(",")
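The ValueError arises because the quotation marks are part of the string, so int('"120000"') sees non-digit characters. As an alternative to stripping them by hand, the standard csv module removes the quoting while parsing, and the numeric column can then be converted once. This is a minimal sketch; the function name load_flights is made up, and it assumes, as in the layout above, that the last column is the integer.

```python
import csv


def load_flights(handle):
    """Read quoted CSV rows and convert the last column to int."""
    reader = csv.reader(handle)   # strips the surrounding double quotes
    header = next(reader)         # first line holds the column names
    rows = []
    for fields in reader:
        *text_fields, count = fields
        rows.append((*text_fields, int(count)))
    return header, rows
```

With integers in the tuples, a comparison such as rows[0][-1] < rows[1][-1] is numeric rather than character by character.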

Read out .csv and hand results to a dictionary

I am learning some coding, and I am stuck with an error I can't explain. Basically I want to read out a .csv file with birth statistics from the US to figure out the most popular name in the time recorded.
My code looks like this:
# 0:Id, 1: Name, 2: Year, 3: Gender, 4: State, 5: Count
names = {} # initialise dict names
maximum = 0 # store for maximum
l = []
with open("Filepath", "r") as file:
    for line in file:
        l = line.strip().split(",")
        try:
            name = l[1]
            if name in names:
                names[name] = int(names[name]) + int(l(5))
            else:
                names[name] = int(l(5))
        except:
            continue
print(names)
max(names)
def max(values):
    for i in values:
        if names[i] > maximum:
            names[i] = maximum
        else:
            continue
    return(maximum)
print(maximum)
It seems the dictionary never receives any values, since the print command does not show anything. Where did I go wrong? (Incidentally, the file path is correct; it takes a while to get a result because the .csv is quite big. So my assumption is that I somehow made a mistake when writing into the dictionary, but I have been staring at the code for a while now and I don't see it!)
A few suggestions to improve your code:
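names = {}  # maps each name to its total count
with open("Filepath", "r") as file:
    for line in file:
        l = line.strip().split(",")
        try:
            # columns: 0 Id, 1 Name, 2 Year, 3 Gender, 4 State, 5 Count
            names[l[1]] = names.get(l[1], 0) + int(l[5])
        except (IndexError, ValueError):
            continue  # skip lines without a numeric count, such as a header
# sort (count, name) pairs so the largest count comes first
maximum = [(v, k) for k, v in names.items()]
maximum.sort(reverse=True)
print(maximum[0])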
You will want to look into Python dictionaries and learn about get. It lets you build your names dictionary in fewer lines of code (more Pythonic).
Also, you used def to create a function but you never called that function. That is why it's not printing.
I propose the shortened code above. Ask if you have questions!
Figured it out.
I think there were a few flow issues: I called a function before defining it... is that an issue, or is Python okay with that?
Also, I think I used max as a name of my own, but there is a built-in function with the same name, which might cause an issue I guess?! Same with value.
This is my final code:
names = {} # initialise dict names
l = []
def maxval(val):
    maxname = max(val.items(), key=lambda x : x[1])
    return maxname
with open("filepath", "r") as file:
    for line in file:
        l = line.strip().split(",")
        name = l[1]
        try:
            names[name] = names.get(name, 0) + int(l[5])
        except:
            continue
        #print(str(l))
#print(names)
print(maxval(names))
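For comparison, here is a sketch of the same tally using collections.Counter and the csv module from the standard library. The function name most_popular_name is made up, and the column positions (1 for the name, 5 for the count) follow the comment in the question.

```python
import csv
from collections import Counter


def most_popular_name(handle):
    """Return (name, total) for the name with the highest summed count."""
    totals = Counter()
    for row in csv.reader(handle):
        try:
            totals[row[1]] += int(row[5])
        except (IndexError, ValueError):
            # short lines or a non-numeric count, e.g. a header row
            continue
    return totals.most_common(1)[0] if totals else None
```

Catching only IndexError and ValueError, rather than using a bare except, keeps mistakes like l(5) from being silently skipped: calling a list raises TypeError, which would then surface immediately.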

How can I get IfcOpenShell for Python to write with the same Unicode encoding as the file it reads?

I'm using IfcOpenShell to read an .ifc file, make some changes, and then write it to a new .ifc file. But IfcOpenShell does not write Unicode characters the same way as it reads them.
I'm creating a script that adds a pset with properties to each IfcElement. The values of these properties are copied from existing properties, so basically I'm creating a pset that gathers chosen information in a single place.
This worked great until the existing values contained non-ASCII (Unicode) characters.
The value is read and decoded so that it shows correctly when printed, but it is not written with the same encoding as it was read.
I tried changing the encoding used in PyCharm, with no luck. I found similar posts elsewhere, but no fix.
From what I've read elsewhere, it has something to do with the Unicode encoder/decoder that IfcOpenShell uses, but I can't be sure.
def mk_pset():
    global param_name
    global param_type
    global max_row
    global param_map
    wb = load_workbook(b)
    sheet = wb.active
    max_row = sheet.max_row
    max_column = sheet.max_column
    param_name = []
    param_type = []
    param_map=[]
    global pset_name
    pset_name = sheet.cell(row=2, column=1).value
    for pm in range(2, max_row+1):
        param_name.append((sheet.cell(pm, 2)).value)
        param_type.append((sheet.cell(pm, 3)).value)
        param_map.append((sheet.cell(pm,4)).value)
    print(param_type,' - ',len(param_type))
    print(param_name,' - ',len(param_name))
    create_pset()
def create_pset():
    ifcfile = ifcopenshell.open(ifc_loc)
    create_guid = lambda: ifcopenshell.guid.compress(uuid.uuid1().hex)
    owner_history = ifcfile.by_type("IfcOwnerHistory")[0]
    element = ifcfile.by_type("IfcElement")
    sets = ifcfile.by_type("IfcPropertySet")
    list = []
    for sett in sets:
        list.append(sett.Name)
    myset = set(list)
    global antall_parametere
    global index
    index = 0
    antall_parametere = len(param_name)
    if pset_name not in myset:
        property_values = []
        tot_elem = (len(element))
        cur_elem = 1
        for e in element:
            start_time_e=time.time()
            if not e.is_a() == 'IfcOpeningElement':
                type_element.append(e.is_a())
                for rel_e in e.IsDefinedBy:
                    if rel_e.is_a('IfcRelDefinesByProperties'):
                        if not rel_e[5][4] == None:
                            index = 0
                            while index < antall_parametere:
                                try:
                                    ind1 = 0
                                    antall_ind1 = len(rel_e[5][4])
                                    while ind1 < antall_ind1:
                                        if rel_e[5][4][ind1][0] == param_map[index]:
                                            try:
                                                if not rel_e[5][4][ind1][2]==None:
                                                    p_type = rel_e[5][4][ind1][2].is_a()
                                                    p_verdi =rel_e[5][4][ind1][2][0]
                                                    p_t=param_type[index]
                                                    property_values.append(ifcfile.createIfcPropertySingleValue(param_name[index], param_name[index],ifcfile.create_entity(p_type,p_verdi),None),)
                                                    ind1 += 1
                                                else:
                                                    ind1 +=1
                                            except TypeError:
                                                pass
                                            break
                                        else:
                                            ind1 += 1
                                except AttributeError and IndexError:
                                    pass
                                index += 1
                index = 0
                property_set = ifcfile.createIfcPropertySet(create_guid(), owner_history, pset_name, pset_name,property_values)
                ifcfile.createIfcRelDefinesByProperties(create_guid(), owner_history, None, None, [e], property_set)
                ifc_loc_edit = str(ifc_loc.replace(".ifc", "_Edited.ifc"))
                property_values = []
                print(cur_elem, ' av ', tot_elem, ' elementer ferdig. ',int(tot_elem-cur_elem),'elementer gjenstår. Det tok ',format(time.time()-start_time_e),' sekunder')
                cur_elem += 1
        ifcfile.write(ifc_loc_edit)
    else:
        ###print("Pset finnes")
        sg.PopupError("Pset er allerede oprettet i modell.")
I expect p_verdi written to be equal to the p_verdi read.
Original read (D\X2\00F8\X0\r):
#2921= IFCBUILDINGELEMENTPROXYTYPE('3QPADpsq71CHeCe7e3GDm5',#32,'D\X2\00F8\X0\r',$,$,$,$,'DA64A373-DB41-C131-1A0C-A07A0340DC05',$,.NOTDEFINED.);
Written (D\X4\000000F8\X0\r):
#2921=IFCBUILDINGELEMENTPROXYTYPE('3QPADpsq71CHeCe7e3GDm5',#32,'D\X4\000000F8\X0\r',$,$,$,$,'DA64A373-DB41-C131-1A0C-A07A0340DC05',$,.NOTDEFINED.);
Both decode to "Dør".
This also happens with non-breaking spaces:
('2\X2\00A0\X0\090')
prints correctly as: ('2 090')
but gets written as:
('2\X4\000000A0\X0\090')
The written form is unreadable by the software I use to open IFC files.
Not so much an answer as a workaround.
After more research I found out that most IFC-reading software does not seem to support the \X4\ encoding, so I made a workaround with regex. Basically, it finds every occurrence of \X4\0000 and replaces it with \X2\. This has worked with all the special characters I've encountered so far. But as stated, it is just a workaround that probably won't work for everyone.
def X4trans_2(target_file,temp_fil):
    from re import findall
    from os import remove,rename
    dec_file = target_file.replace('.ifc', '_dec.ifc')
    tempname = target_file
    dec_list = []
    with open(temp_fil, 'r+') as r,open(dec_file, 'w', encoding='cp1252') as f:
        for line in r:
            # collect every \X4\0000....\X0\ sequence on this line
            findX4 = findall(r'\\X4\\0000+[\w]+\\X0\\', str(line))
            if findX4:
                for fx in findX4:
                    X4 = str(fx)
                    # backslashes are doubled so that '\X2\' is written literally
                    newX = str(fx).replace('\\X4\\0000', '\\X2\\')
                    line = line.replace(str(X4), newX) # print ('Fant X4')
            f.write(line)
    remove(temp_fil)
    try:
        remove(target_file)
    except FileNotFoundError:
        pass
    rename(dec_file,tempname)
It basically opens the IFC file as text, finds and replaces X4 with X2, and writes it out again.
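A slightly stricter variant of the same text-level workaround, as a sketch only: a \X4\ run can hold several 8-digit code points back to back, so this version converts each one and leaves the run untouched if any code point does not fit into the 4 hex digits that \X2\ uses. The function name x4_to_x2 is made up, and it works purely on the file text without touching IfcOpenShell.

```python
import re

# one \X4\ ... \X0\ run holding one or more 8-hex-digit code points
X4_RUN = re.compile(r'\\X4\\((?:[0-9A-Fa-f]{8})+)\\X0\\')


def x4_to_x2(text):
    """Rewrite \\X4\\ runs as \\X2\\ runs when every code point fits in 16 bits."""
    def convert(match):
        digits = match.group(1)
        groups = [digits[i:i + 8] for i in range(0, len(digits), 8)]
        if any(int(group, 16) > 0xFFFF for group in groups):
            return match.group(0)  # cannot be expressed with 4 hex digits
        return '\\X2\\' + ''.join(group[4:] for group in groups) + '\\X0\\'
    return X4_RUN.sub(convert, text)
```

Applied line by line in place of the findall/replace loop, it turns 'D\X4\000000F8\X0\r' back into 'D\X2\00F8\X0\r'.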

Referenced variable isn't recognized by python

I am developing a program which works with a ; separated csv.
When I try to execute the following code
def accomodate(fil, targets):
    l = fil
    io = []
    ret = []
    for e in range(len(l)):
        io.append(l[e].split(";"))
    for e in io:
        ter = []
        for theta in range(len(e)):
            if targets.count(theta) > 0:
                ter.append(e[theta])
        ret.append(ter)
    return ret
where 'fil' holds the rows read from the CSV file and 'targets' is a list of the columns to select, it raises the following error while applying the split: "'l' name is not defined". As far as I can see, the variable 'l' has already been defined.
Does anyone know why this happens? Thanks in advance.
edit
As many of you have requested, here is an example.
This is a made-up CSV sample, not an excerpt of the original one. It is already split into a list of rows:
k = ["Cookies;Brioche;Pudding;Pie","Dog;Cat;Bird;Fish","Boat;Car;Plane;Skate"]
accomodate(k, [1,2])  # expected: [['Brioche', 'Pudding'], ['Cat', 'Bird'], ['Car', 'Plane']]
You should copy the content of fil list:
l = fil.copy()
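Separately from the error itself, the column selection can be written without the index bookkeeping. This is only a sketch: the name select_columns and the sep parameter are made up, and it assumes, as in the example above, that the rows are ';'-separated strings and the targets are zero-based column indexes.

```python
def select_columns(rows, targets, sep=";"):
    """Keep only the fields whose index is in `targets`, in column order."""
    wanted = sorted(set(targets))
    selected = []
    for row in rows:
        fields = row.split(sep)
        # ignore requested indexes that a short row does not have
        selected.append([fields[i] for i in wanted if i < len(fields)])
    return selected
```

Called with the sample list k and the targets [1, 2], it returns the second and third field of every row.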
