Problems Reading Zip of Shapefiles without loading memory - python-3.x

I've been trying to adapt Andrew Gaidus's shapefile reading routine for my needs. The Jupyter Notebook I'm using acts like it has partitioned the disk of my MacBook Pro, so I can't read or write to disk. Gaidus has a good procedure for avoiding disk use, but it was written for a prior version of Python.
Here is the code:
dls = "https://github.com/ItsMeLarry/Coursera_Capstone/raw/master/tl_2010_25009_tract00%202.zip"
lynntracts = ZipFile(io.BytesIO(urllib.request.urlopen(dls).read()))
print("Done")
filenames = [y for y in sorted(lynntracts.namelist()) for ending in ['dbf', 'prj', 'shp', 'shx'] if y.endswith(ending)]
#For some reason, I get 8, instead of 4, filenames. The first 4 start with __MACOSX. I get rid of those. The problem I
#have with the 'TypeError' occurs no matter which set of 4 files I use.
print(filenames[0], 'Example of the 4 files that I remove in the for loop')
for i in range(0,4):
    del filenames[0]
print(filenames)
dbf, prj, shp, shx = [io.StringIO(ZipFile.read(filename)) for filename in filenames]
r = shapefile.Reader(shp=shp, shx=shx, dbf=dbf)
print(r.numRecords)
Opening with io.BytesIO cured the prior problem of byte/str collision. Now I see the TypeError for ZipFile.read; I get the same error if I use io.BytesIO when calling it. Here is the error output followed by the error info:
Done
__MACOSX/tl_2010_25009_tract00/._tl_2010_25009_tract00.dbf Example of the 4 files that I remove in the for loop
['tl_2010_25009_tract00/tl_2010_25009_tract00.dbf', 'tl_2010_25009_tract00/tl_2010_25009_tract00.prj', 'tl_2010_25009_tract00/tl_2010_25009_tract00.shp', 'tl_2010_25009_tract00/tl_2010_25009_tract00.shx']
TypeError Traceback (most recent call last)
in <module>()
12 del filenames[0]
13 print(filenames)
---> 14 dbf, prj, shp, shx = [io.StringIO(ZipFile.read(filename)) for filename in filenames]
15 r = shapefile.Reader(shp=shp, shx=shx, dbf=dbf)
16 print(r.numRecords)
in <listcomp>(.0)
12 del filenames[0]
13 print(filenames)
---> 14 dbf, prj, shp, shx = [io.StringIO(ZipFile.read(filename)) for filename in filenames]
15 r = shapefile.Reader(shp=shp, shx=shx, dbf=dbf)
16 print(r.numRecords)
TypeError: read() missing 1 required positional argument: 'name'
Clearly, I am a beginner. I've come up empty-handed trying to research this. Where do I go? What do I need to understand here? Thanks.
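The TypeError comes from calling read() on the ZipFile class instead of on the archive object: in ZipFile.read(filename), filename gets bound to self, so the required name argument is missing. Shapefile components are also binary data, so io.BytesIO (not io.StringIO) is the right wrapper. A minimal sketch of the likely fix, reconstructed from the code above rather than taken from Gaidus's post:
import io
import urllib.request
from zipfile import ZipFile
import shapefile  # pyshp

dls = "https://github.com/ItsMeLarry/Coursera_Capstone/raw/master/tl_2010_25009_tract00%202.zip"
lynntracts = ZipFile(io.BytesIO(urllib.request.urlopen(dls).read()))

# Keep only the four shapefile parts and drop the __MACOSX resource-fork entries.
filenames = [name for name in sorted(lynntracts.namelist())
             if name.endswith(('dbf', 'prj', 'shp', 'shx'))
             and not name.startswith('__MACOSX')]

# Call read() on the ZipFile instance, not the class, and keep the data as bytes.
dbf, prj, shp, shx = [io.BytesIO(lynntracts.read(name)) for name in filenames]

r = shapefile.Reader(shp=shp, shx=shx, dbf=dbf)
print(r.numRecords)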

Related

Issues using tkdraw.basic on jupyter VCS

So I tried to execute some code using the tkdraw.basic library in a Jupyter notebook, and this is what I got.
It's just a simple program that is supposed to draw a line across the window:
import tkdraw.basic as graph

def ligne_horiz(y,larg):
    for x in range(larg):
        graph.plot(y,x)

graph.open_win(120,160)
ligne_horiz(50,160)
graph.wait()
Here's the error:
AssertionError Traceback (most recent call last)
Cell In [12], line 7
4 for x in range(larg):
5 graph.plot(y,x)
----> 7 graph.open_win(120,160)
8 ligne_horiz(50,160)
9 graph.wait()
File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tkdraw/basic.py:63, in open_win(height, width, zoom)
59 # pylint: disable=global-statement
60 # I really want to use a global in this module, to make those functions
61 # easier to use.
62 global _WINDOW
---> 63 assert not _WINDOW, "ERROR: function open() was called twice!"
64 _WINDOW = tkd.Screen((height, width), zoom, grid=False)
AssertionError: ERROR: function open() was called twice!
The code works perfectly in the terminal or in a separate .py file, so I think the issue comes from the extension.
I tried to find a similar issue or an update to the library, but nothing looks wrong apparently (maybe a setting in the extension itself?).
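One workaround worth trying (an assumption on my part, not something from the tkdraw documentation): re-running a Jupyter cell does not reset module-level state, so the _WINDOW global set by the first run is still there and open_win() asserts on the second run. Reloading the module at the top of the cell starts each run from a fresh state:
import importlib
import tkdraw.basic as graph

# Re-executing the module resets its module-level _WINDOW global,
# so open_win() can be called again in the same kernel session.
graph = importlib.reload(graph)

def ligne_horiz(y, larg):
    for x in range(larg):
        graph.plot(y, x)

graph.open_win(120, 160)
ligne_horiz(50, 160)
graph.wait()
Restarting the kernel before re-running the cell has the same effect.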

Mimicking bash wc functionalities using python

I have written a very simple Python program, called wc.py, which mimics the behaviour of bash wc by counting the number of words, lines and bytes in a file. My program is as follows:
import sys

path = sys.argv[1]
w = 0
l = 0
b = 0
file = open(path)  # open in text mode
for currentLine in file:
    wordsInLine = currentLine.strip().split(' ')
    wordsInLine = [word for word in wordsInLine if word != '']
    w += len(wordsInLine)
    b += len(currentLine.encode('utf-8'))
    l += 1
#output
print(str(l) + ' ' + str(w) + ' ' + str(b))
To execute my program you should run the following command:
python3 wc.py [a file to read the data from]
As the result it shows
[The number of lines in the file] [The number of words in the file] [The number of bytes in the file] [the file directory path]
The files I used to test my code are as follows:
file.txt which contains the following data:
1
2
3
4
Executing "wc file.txt" returns
4 4 8
Executing "python3 wc.py file.txt" returns 4 4 8
Download "Annual enterprise survey: 2020 financial year (provisional) – CSV" from CSV file download
Executing "wc [fileName].csv" returns
37081 500273 5881081
Executing "python3 wc.py [fileName].csv" returns
37081 500273 5844000
and a [something].pdf file
Executing "wc [something].pdf" works.
Executing "python3 code.py" throws the following errors:
Traceback (most recent call last):
File "code.py", line 10, in <module>
for currentLine in file:
File "/usr/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 10: invalid start byte
As you can see, the output of python3 code.py [something].pdf and python3 code.py [something].csv is not the same as what wc returns. Could you help me to find the reason of this erroneous behaviour in my code?
Regarding the CSV file, if you look at the difference between your result and that of wc:
5881081 - 5844000 = 37081 which is exactly the number of lines.
That is, every line has one additional character in the original file. That character is the carriage return \r, which gets lost in Python because you iterate over the file in text mode, where universal-newline handling translates \r\n into \n. If you want a byte-correct result, you have to first identify the type of line breaks used in the file (and watch out for inconsistencies throughout the document).
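A byte-correct variant, shown only as a sketch (this is not the original wc.py): opening the file in binary mode keeps the \r bytes in the count and also avoids the UnicodeDecodeError on binary files such as PDFs, because nothing is decoded until the word split, and that decode is done leniently:
import sys

path = sys.argv[1]
w = l = b = 0

# Binary mode: no universal-newline translation, so \r survives,
# and non-UTF-8 bytes cannot raise UnicodeDecodeError while reading.
with open(path, 'rb') as f:
    for raw_line in f:
        b += len(raw_line)
        l += 1
        # Decode leniently just for splitting into words.
        w += len(raw_line.decode('utf-8', errors='replace').split())

print(str(l) + ' ' + str(w) + ' ' + str(b))
The word count on a binary file like a PDF may still differ slightly from wc, since replaced bytes can merge or split what wc would treat as separate words.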

for loop over list KeyError: 664

I am trying to iterate over this list of words:
CTCCTC TCCTCT CCTCTC CTCTCC TCTCCC CTCCCA TCCCAA CCCAAA CCAAAC CAAACT
CTGGGC TGGGCC GGGCCA GGCCAA GCCAAT CCAATG CAATGC AATGCC ATGCCT TGCCTG GCCTGC
TGCCAG GCCAGG CCAGGA CAGGAG AGGAGG GGAGGG GAGGGG AGGGGC GGGGCT GGGCTG GGCTGG GCTGGT CTGGTC
TGGTCT GGTCTG GTCTGG TCTGGA CTGGAC TGGACA GGACAC GACACT ACACTA CACTAT
ATTCAG TTCAGC TCAGCC CAGCCA AGCCAG GCCAGT CCAGTC CAGTCA AGTCAA GTCAAC TCAACA CAACAC AACACA
ACACAA CACAAG ACAAGG AGGTGG GGTGGC GTGGCC TGGCCT GGCCTG GCCTGC CCTGCA CTGCAC
TGCACT GCACTC CACTCG ACTCGA CTCGAG TCGAGG CGAGGT GAGGTT AGGTTC GGTTCC
TATATA ATATAC TATACC ATACCT TACCTG ACCTGG CCTGGT CTGGTA TGGTAA GGTAAT GTAATG TAATGG AATGGA
I am using a for loop to read each item in the list and pass it to mk_model.vector.
The code used is as follows:
for x in all_seq_sentences[:]:
    mk_model.vector(x)
    print(x)
Usually, mk_model.vector("AGT") will give an array corresponding to the defined dna2vec model, but here, rather than actually performing the model run, it throws this error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-144-77c47b13e98a> in <module>
1 for x in all_seq_sentences[:]:
----> 2 mk_model.vector(x)
3 print(x)
4
~/Desktop/DNA2vec/dna2vec/dna2vec/multi_k_model.py in vector(self, vocab)
35
36 def vector(self, vocab):
---> 37 return self.data[len(vocab)].model[vocab]
38
39 def unitvec(self, vec):
KeyError: 664
Looking forward to some help here
The above problem occurred because the for loop took everything in a line as one item: each element of all_seq_sentences is a whole line, so len(vocab) in self.data[len(vocab)] is the line length (664 for the failing one) rather than a k-mer length, hence the KeyError. That is why .split() is the right fix; for reference see https://python-reference.readthedocs.io/en/latest/docs/str/split.html
Working code:
for i in all_seq_sentences:
    word = i.split()
    print(word[0])
Then implement another loop to access the model.vector function:
vec_of_all_seq = []
for sentence in all_seq_sentences:
    sentence = sentence.split()
    for word in sentence:
        vec_of_all_seq.append(mk_model.vector(word))
The vector representations derived from model.vector are collected in the list vec_of_all_seq (each entry is a NumPy array).

Is it possible to delete the file if UnicodeEncodeError occur? [duplicate]

This question already has an answer here:
How to catch all exceptions in Try/Catch Block Python?
(1 answer)
Closed 3 years ago.
My code below goes through each .m4v file in the list and converts it to a .wav file using FFmpeg, and it works. I use a Python 3 Jupyter environment.
for fpath in list:
    if (fpath.endswith(".m4v")):
        cdir=os.path.dirname(fpath)
        os.chdir(cdir)
        filename=os.path.basename(fpath)
        os.system("ffmpeg -i {0} temp_name.wav".format(filename))
        ofnamepath=os.path.splitext(fpath)[0]
        temp_name=os.path.join(cdir, "temp_name.wav")
        new_name = os.path.join(ofnamepath+'.wav')
        os.rename(temp_name,new_name)
        old_name=os.path.join(ofnamepath+'.m4v')
        os.remove(old_name)
However, for this particular dataset I get the following error;
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-10-bd3b17e409fa> in <module>()
      7 os.chdir(cdir)
      8 filename=os.path.basename(fpath)
----> 9 os.system("ffmpeg -i {0} temp_name.wav".format(filename))
     10 ofnamepath=os.path.splitext(fpath)[0]
     11 temp_name=os.path.join(cdir, "temp_name.wav")

UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-16: ordinal not in range(128)
Is it possible to add a check to the code, something like:
if a UnicodeEncodeError ('ascii' codec can't encode) occurs,
delete that file and continue to the next file?
You can use a try and except block.
If an exception occurs inside a try block, execution jumps to the except block. What's better is that you can even specify the exception.
Adding this to your code would look something like:
for fpath in list:
    if (fpath.endswith(".m4v")):
        cdir=os.path.dirname(fpath)
        os.chdir(cdir)
        filename=os.path.basename(fpath)
        try:
            os.system("ffmpeg -i {0} temp_name.wav".format(filename))
        except UnicodeEncodeError:
            print("Some failure message.. Continuing to next..")
            # os.remove(filename)
            continue  # This skips the rest of the current iteration and jumps to the top of the loop.
        ofnamepath=os.path.splitext(fpath)[0]
        temp_name=os.path.join(cdir, "temp_name.wav")
        new_name = os.path.join(ofnamepath+'.wav')
        os.rename(temp_name,new_name)
        old_name=os.path.join(ofnamepath+'.m4v')
        os.remove(old_name)
Uncomment the # os.remove(filename) to have your files deleted. Are you sure you want to permanently delete them?
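As a side note (my suggestion, not part of the original answer): passing the arguments to ffmpeg as a list via subprocess.run often avoids the UnicodeEncodeError in the first place, because Python does not have to build an ASCII shell command string from the filename, and it also sidesteps shell-quoting problems with spaces in names:
import subprocess

# filename as in the loop above; assumes ffmpeg is on the PATH.
result = subprocess.run(["ffmpeg", "-i", filename, "temp_name.wav"])
if result.returncode != 0:
    print("ffmpeg failed for", filename)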

Can't solve TypeError: '>' not supported between instances of 'NoneType' and 'int'

I have a long list of audio files, and some of them are longer than an hour. I am using Jupyter notebook, Python 3.6 and the TinyTag library to get the duration of the audio. My code below goes over the files and, if a file is longer than an hour, splits it into one-hour-long pieces plus a leftover piece of less than an hour, and copies the pieces as fname_1, fname_2, etc. The code was working for the previous datasets I tried, but this time, after running for a while, I get the error below. I don't know where this is coming from or how to fix it; I have already read the similarly titled questions, but their contents were different. Thanks in advance.
# fpaths is the list of filepaths
for i in range(0,len(fpaths)):
    fpath=fpaths[i]
    fname=os.path.basename(fpath)
    fname0=os.path.splitext(fname)[0] #name without extension
    tag = TinyTag.get(fname)
    if tag.duration > 3600:
        cmd2 = "ffmpeg -i %s -f segment -segment_time 3600 -c copy %s" %(fpath, fname0) + "_%d.wav"
        os.system(cmd2)
        os.remove(fpath)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-10-79d0ceebf75d> in <module>()
7 fname0=os.path.splitext(fname)[0]
8 tag = TinyTag.get(fname)
----> 9 if tag.duration > 3600:
10 cmd2 = "ffmpeg -i %s -f segment -segment_time 3600 -c copy %s" %(fpath, fname0) + "_%d.wav"
11 os.system(cmd2)
TypeError: '>' not supported between instances of 'NoneType' and 'int'
It seems like some of those files do not have a duration.
Perhaps change the check to:
if tag.duration and tag.duration > 3600:
.....
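Dropped into the loop from the question, the guard might look like this (a sketch; files whose duration TinyTag cannot determine are reported and skipped instead of split):
import os
from tinytag import TinyTag

# fpaths is the list of file paths, as in the question
for fpath in fpaths:
    fname = os.path.basename(fpath)
    fname0 = os.path.splitext(fname)[0]  # name without extension
    tag = TinyTag.get(fname)
    if tag.duration and tag.duration > 3600:
        # split into one-hour pieces named fname0_0.wav, fname0_1.wav, ...
        cmd2 = "ffmpeg -i %s -f segment -segment_time 3600 -c copy %s" % (fpath, fname0) + "_%d.wav"
        os.system(cmd2)
        os.remove(fpath)
    elif tag.duration is None:
        print("No duration found, skipping:", fpath)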
