os.path.getsize returns 0 although folder has files in it (Python 3.5) - python-3.x

I am trying to create a program which auto-backups some folders under certain circumstances.
I am trying to compare the sizes of two folders (source and dest). source has files in it (a FLAC file and a subfolder with a text file), whereas dest is empty.
This is the code I've written so far:
import os.path
sls = os.path.getsize('D:/autobu/source/')
dls = os.path.getsize('D:/autobu/dest/')
print(sls)
print(dls)
if sls > dls:
    print('success')
else:
    print('fail')
And the output is this:
0
0
fail
What have I done wrong? Have I misunderstood how getsize functions?

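os.path.getsize('D:/autobu/source/') returns the size of a single file system entry. For a directory that is the size of the directory entry itself, not the total size of the files inside it, which is why the output above shows 0.
os.stat does not help here either, because os.path.getsize is just a shortcut for os.stat(path).st_size:
src_stat = os.stat('D:/autobu/source/')
sls = src_stat.st_size  # same value as os.path.getsize('D:/autobu/source/')
To get the size of a folder's contents, walk the tree with os.walk and add up os.path.getsize for every file.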
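Since st_size for a directory describes the directory entry rather than its contents, one way to get a folder total is to sum the sizes of the files under it. The following is a minimal sketch under that assumption; the helper name folder_size is hypothetical and not part of os.path.

```python
import os


def folder_size(root):
    """Return the combined size in bytes of all regular files under root."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            # Skip symbolic links so a link's target is not counted.
            if not os.path.islink(full_path):
                total += os.path.getsize(full_path)
    return total
```

With a helper like this, the comparison in the question could be written as folder_size('D:/autobu/source/') > folder_size('D:/autobu/dest/').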

Related

Python Glob - Get Full Filenames, but no directory-only names

This code works, but it's returning directory names and filenames. I haven't found a parameter that tells it to return only files or only directories.
Can glob.glob do this, or do I have to call something in os to test whether each result is a directory or a file? In my case, my files all end with .csv, but I would like to know for more general knowledge as well.
In the loop, I'm reading each file, so it currently fails when it tries to open a directory name as a file.
files = sorted(glob.glob(input_watch_directory + "/**", recursive=True))
for loop_full_filename in files:
    print(loop_full_filename)
Results:
c:\Demo\WatchDir\
c:\Demo\WatchDir\2202
c:\Demo\WatchDir\2202\07
c:\Demo\WatchDir\2202\07\01
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_51.csv
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_52.csv
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_53.csv
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_54.csv
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_55.csv
c:\Demo\WatchDir\2202\07\05
c:\Demo\WatchDir\2202\07\05\polygonData_2022_07_05__12_00.csv
c:\Demo\WatchDir\2202\07\05\polygonData_2022_07_05__12_01.csv
Results needed:
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_51.csv
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_52.csv
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_53.csv
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_54.csv
c:\Demo\WatchDir\2202\07\01\polygonData_2022_07_01__15_55.csv
c:\Demo\WatchDir\2202\07\05\polygonData_2022_07_05__12_00.csv
c:\Demo\WatchDir\2202\07\05\polygonData_2022_07_05__12_01.csv
For this specific program, I can just check whether the file name contains .csv, but I would like to know the general approach for future reference.
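Replace the line:
files = sorted(glob.glob(input_watch_directory + "/**", recursive=True))
with the line:
files = sorted(glob.glob(input_watch_directory + "/**/*.*", recursive=True))
Note that this pattern matches any name containing a dot, so it skips files without an extension and still returns directories whose names contain a dot. glob has no files-only option; for the general case, filter the results with os.path.isfile.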
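For the general case the question asks about, a pattern alone cannot tell files from directories, so the results have to be tested. Below is a minimal sketch that keeps the recursive "**" pattern and filters with os.path.isfile; the function name list_files is hypothetical.

```python
import glob
import os


def list_files(root):
    """Return sorted paths of regular files under root, excluding directories."""
    pattern = os.path.join(root, "**")
    matches = glob.glob(pattern, recursive=True)
    # "**" also returns root itself and every subdirectory, so filter them out.
    return sorted(path for path in matches if os.path.isfile(path))
```

Unlike the "/**/*.*" pattern, this keeps files that have no extension and drops directories whose names contain a dot.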

Vertex AI scheduled notebooks doesn't recognize existence of folders

I have a managed jupyter notebook in Vertex AI that I want to schedule. The notebook works just fine as long as I start it manually, but as soon as it is scheduled, it fails. There are in fact many things that go wrong when scheduled, some of them are fixable. Before explaining what my trouble is, let me first give some details of the context.
The notebook gathers information from an API for several stores and saves the data in different folders before processing it, saving csv-files to store-specific folders and to bigquery. So, in the location of the notebook, I have:
The notebook
Functions needed for the handling of data (as *.py files)
A series of folders, some of which have subfolders which also have subfolders
When I execute this manually, no problem. Everything works well and all files end up exactly where they should, as well as in different bigQuery tables.
However, when scheduling the execution of the notebook, everything goes wrong. First, the files *.py cannot be read (as import). No problem, I added the functions in the notebook.
Now, the following error is where I am at a loss, because I have no idea why it does not work or how to fix it. The code that leads to the error is the following:
internal = "https://api.************************"
df_descriptions = []
storess = internal
response_stores = requests.get(storess,auth = HTTPBasicAuth(userInternal, keyInternal))
pathlib.Path("stores/request_1.json").write_bytes(response_stores.content)
filepath = "stores"
files = os.listdir(filepath)
for file in files:
    with open(filepath + "/"+file) as json_string:
        jsonstr = json.load(json_string)
    information = pd.json_normalize(jsonstr)
    df_descriptions.append(information)
StoreINFO = pd.concat(df_descriptions)
StoreINFO = StoreINFO.dropna()
StoreINFO = StoreINFO[StoreINFO['storeIdMappings'].map(lambda d: len(d)) > 0]
cloud_store_ids = list(set(StoreINFO.cloudStoreId))
LastWeek = datetime.date.today()- timedelta(days=2)
LastWeek =np.datetime64(LastWeek)
and the error reported is:
FileNotFoundError Traceback (most recent call last)
/tmp/ipykernel_165/2970402631.py in <module>
5 storess = internal
6 response_stores = requests.get(storess,auth = HTTPBasicAuth(userInternal, keyInternal))
----> 7 pathlib.Path("stores/request_1.json").write_bytes(response_stores.content)
8
9 filepath = "stores"
/opt/conda/lib/python3.7/pathlib.py in write_bytes(self, data)
1228 # type-check for the buffer interface before truncating the file
1229 view = memoryview(data)
-> 1230 with self.open(mode='wb') as f:
1231 return f.write(view)
1232
/opt/conda/lib/python3.7/pathlib.py in open(self, mode, buffering, encoding, errors, newline)
1206 self._raise_closed()
1207 return io.open(self, mode, buffering, encoding, errors, newline,
-> 1208 opener=self._opener)
1209
1210 def read_bytes(self):
/opt/conda/lib/python3.7/pathlib.py in _opener(self, name, flags, mode)
1061 def _opener(self, name, flags, mode=0o666):
1062 # A stub for the opener argument to built-in open()
-> 1063 return self._accessor.open(self, flags, mode)
1064
1065 def _raw_open(self, flags, mode=0o777):
FileNotFoundError: [Errno 2] No such file or directory: 'stores/request_1.json'
I would gladly find another way to do this, for instance by using GCS buckets, but my issue is the existence of sub-folders. There are many stores and I do not wish to do this operation manually because some retailers for which I am doing this have over 1000 stores. My python code generates all these folders and as I understand it, this is not feasible in GCS.
How can I solve this issue?
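GCS uses a flat namespace, so folders don't actually exist, but they can be simulated as described in this documentation. As for the error itself, the traceback shows that the relative directory "stores" does not exist in the working directory of the scheduled run. For your requirement, you can either use an absolute path (starting with "/", not a relative one) or create the "stores" directory first (with "mkdir"). For more information you can check this blog.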
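The "create the directory" suggestion can be sketched with pathlib, which the question already uses. This is a minimal sketch, not a confirmed fix for the scheduled Vertex AI run; the helper name write_response is hypothetical.

```python
from pathlib import Path


def write_response(content, target):
    """Write bytes to target, creating any missing parent directories first."""
    path = Path(target)
    # parents=True creates nested folders; exist_ok=True makes reruns harmless.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    return path
```

In the question's code this would replace the direct write_bytes call, for example write_response(response_stores.content, "stores/request_1.json"), and the same helper would cover the nested store-specific subfolders.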

Watson LT SDK create_model() fails if forced_glossary is in /tmp/ folder

The Watson LT create_model() call fails if the glossary file is in a folder outside the local directory. Why would the location of the TMX file matter?
It works if I use just the basename (CustomModel_xxxx.tmx) without a folder.
It fails with the error below if I use /tmp/CustomModel_xxxx.tmx.
I don't want TMX files created in my code base.
Running on Python 3.5 in a Jupyter notebook.
WatsonApiException: Error: Error while uploading file(s). Please try again!, Code: 500 , X-dp-watson-tran-id: gateway02-898567107 , X-global-transaction-id: ffea405d5bfc5adf358f0bc3
CODE:
from watson_developer_cloud import LanguageTranslatorV3
lt = LanguageTranslatorV3(....)
DIR = kwargs.get('folder','/tmp')
bn = 'CustomModel_%d.tmx' % os.getpid()
# Fails
tmx_name = os.path.join(DIR, bn)
# Is ok
#tmx_name = bn
with open(tmx_FN,'r', encoding='U8') as fio:
    x = fio.read()
    print("Read ok",)
    r = lt.create_model(
        base_model_id=model_id,
        name = 'xxx',
        **{'forced_glossary': fio}
    )
I tried your example with Python 2.7 and it works fine on my system. My best guess is that you have some kind of permissions problem with /tmp on your system. Or maybe jupyter is remapping /tmp somehow. What happens if you run this as a standalone python app?
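To test the permissions or remapping guess before involving the Watson SDK, it may help to look at the file from inside the notebook kernel. The sketch below uses only the standard library; the helper name describe_file is hypothetical, and a clean result would not rule out a server-side cause.

```python
import os


def describe_file(path):
    """Collect basic facts about path that are useful when an upload fails."""
    exists = os.path.isfile(path)
    return {
        # Shows where a relative name actually resolves for this process.
        "absolute_path": os.path.abspath(path),
        "exists": exists,
        "readable": os.access(path, os.R_OK),
        "size_bytes": os.path.getsize(path) if exists else None,
    }
```

Calling describe_file(tmx_name) in both the notebook and a standalone script, and comparing the two results, would show whether the two environments see the same file.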

Run code on specific files in a directory separately (by the name of file)

I have N files in the same folder with different index numbers like
Fe_1Sec_1_.txt
Fe_1Sec_2_.txt
Fe_1Sec_3_.txt
Fe_2Sec_1_.txt
Fe_2Sec_2_.txt
Fe_2Sec_3_.txt
.
.
.
and so on
Ex: If I need to run my code with only the files with time = 1 Sec, I can do it manually as follows:
path = "input/*_1Sec_*.txt"
files = glob.glob(path)
print(files)
which gave me:
Out[103]: ['input\\Fe_1Sec_1_.txt', 'input\\Fe_1Sec_2_.txt', 'input\\Fe_1Sec_3_.txt']
Now I need to run my code on each group of files separately (grouped by the measurement time in seconds, i.e. by the name of the file).
I tried this code to build the path for each measurement time:
time = 0
while time < 4:
    time += 1
    t = str(time)
    path = ('"input/*_'+t+'Sec_*.txt"')
which gives me:
"input/*_1Sec_*.txt"
"input/*_2Sec_*.txt"
"input/*_3Sec_*.txt"
"input/*_4Sec_*.txt"
After that I tried to use this path as follows:
files = glob.glob(path)
print(files)
But it doesn't find the wanted files and gives me:
"input/*_1Sec_*.txt"
[]
"input/*_2Sec_*.txt"
[]
"input/*_3Sec_*.txt"
[]
"input/*_4Sec_*.txt"
[]
Any suggestions, please??
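Your pattern string contains literal double-quote characters (the '"' inside the single quotes), so glob searches for names that include those quotes and finds nothing. Build the pattern without them. I think the best way would be to simply do
for time in range(1, 5): # 1,2,3,4
    glob_path = 'input/*_{}Sec_*.txt'.format(time)
    for file_path in glob.glob(glob_path):
        do_something(file_path, time) # placeholder for your own processing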
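If the groups are needed together rather than processed one at a time, the same pattern can fill a dictionary keyed by measurement time. This is a minimal sketch assuming the file naming shown in the question; the function name files_by_time is hypothetical.

```python
import glob
import os


def files_by_time(folder, times):
    """Map each measurement time to the sorted list of matching file paths."""
    groups = {}
    for time in times:
        # No quote characters inside the pattern: they would be matched literally.
        pattern = os.path.join(folder, "*_{}Sec_*.txt".format(time))
        groups[time] = sorted(glob.glob(pattern))
    return groups
```

For the question's layout, files_by_time("input", range(1, 5)) would return one list per measurement time, and a time with no matching files maps to an empty list.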

MafftCommandline and io.StringIO

I've been trying to use the Mafft alignment tool from Bio.Align.Applications. Currently, I've had success writing my sequence information out to temporary text files that are then read by MafftCommandline(). However, I'd like to avoid redundant steps as much as possible, so I've been trying to write to a memory file instead using io.StringIO(). This is where I've been having problems. I can't get MafftCommandline() to read internal files made by io.StringIO(). I've confirmed that the internal files are compatible with functions such as AlignIO.read(). The following is my test code:
from Bio.Align.Applications import MafftCommandline
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import io
from Bio import AlignIO
sequences1 = ["AGGGGC",
"AGGGC",
"AGGGGGC",
"AGGAGC",
"AGGGGG"]
longest_length = max(len(s) for s in sequences1)
padded_sequences = [s.ljust(longest_length, '-') for s in sequences1] #padded sequences used to test compatibilty with AlignIO
ioSeq = ''
for items in padded_sequences:
    ioSeq += '>unknown\n'
    ioSeq += items + '\n'
newC = io.StringIO(ioSeq)
cLoc = str(newC).strip()
cLocEdit = cLoc[:len(cLoc)] #create string to remove < and >
test1Handle = AlignIO.read(newC, "fasta")
#test1HandleString = AlignIO.read(cLocEdit, "fasta") #fails to interpret cLocEdit string
records = (SeqRecord(Seq(s)) for s in padded_sequences)
SeqIO.write(records, "msa_example.fasta", "fasta")
test1Handle1 = AlignIO.read("msa_example.fasta", "fasta") #alignIO same for both #demonstrates working AlignIO
in_file = '.../msa_example.fasta'
mafft_exe = '/usr/local/bin/mafft'
mafft_cline = MafftCommandline(mafft_exe, input=in_file) #have to change file path
mafft_cline1 = MafftCommandline(mafft_exe, input=cLocEdit) #fails to read string (same as AlignIO)
mafft_cline2 = MafftCommandline(mafft_exe, input=newC)
stdout, stderr = mafft_cline()
print(stdout) #corresponds to MafftCommandline with input file
stdout1, stderr1 = mafft_cline1()
print(stdout1) #corresponds to MafftCommandline with internal file
I get the following error messages:
ApplicationError: Non-zero return code 2 from '/usr/local/bin/mafft <_io.StringIO object at 0x10f439798>', message "/bin/sh: -c: line 0: syntax error near unexpected token `newline'"
I believe this results due to the arrows ('<' and '>') present in the file path.
ApplicationError: Non-zero return code 1 from '/usr/local/bin/mafft "_io.StringIO object at 0x10f439af8"', message '/usr/local/bin/mafft: Cannot open _io.StringIO object at 0x10f439af8.'
Attempting to remove the arrows by converting the file path to a string and indexing resulted in the above error.
Ultimately my goal is to reduce computation time. I hope to accomplish this by calling internal memory instead of writing out to a separate text file. Any advice or feedback regarding my goal is much appreciated. Thanks in advance.
"I can't get MafftCommandline() to read internal files made by io.StringIO()."
This is not surprising for a couple of reasons:
As you're aware, Biopython doesn't implement Mafft, it simply provides a convenient interface to set up a call to mafft in /usr/local/bin. The mafft executable runs as a separate process that does not have access to your Python program's internal memory, including your StringIO file.
The mafft program only works with an input file, it doesn't even allow stdin as a data source. (Though it does allow stdout as a data sink.) So ultimately, there must be a file in the file system for mafft to open. Thus the need for your temporary file.
Perhaps tempfile.NamedTemporaryFile() or tempfile.mkstemp() might be a reasonable compromise.
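The tempfile suggestion could look roughly like the sketch below, which writes the sequences to a named temporary FASTA file and returns its path. The helper name write_fasta_tempfile and the seq0, seq1, ... record names are assumptions made for illustration.

```python
import os
import tempfile


def write_fasta_tempfile(sequences):
    """Write sequences to a temporary FASTA file and return its path.

    The caller is responsible for deleting the file with os.remove().
    """
    # delete=False keeps the file on disk after it is closed, so that a
    # separate process can open it by name.
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".fasta", delete=False
    ) as handle:
        for index, sequence in enumerate(sequences):
            handle.write(">seq{}\n{}\n".format(index, sequence))
        return handle.name
```

The returned path could then be passed as the input argument of MafftCommandline, as in the question's working call, with os.remove(path) in a finally block once the alignment has been read.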
