Python 3 img2pdf wrong order of images in pdf - python-3.x

I am working on a small program that takes images from a website and puts them into a pdf for easy access and simpler viewing.
I have a small problem as the img2pdf module seems to put the images into the pdf in the wrong order and I don't really get why.
It seems to put the files in the order 1, 10, 11, 2, and so on.
import urllib.request
import os
import img2pdf

n = 50
all = 0
for counter in range(1, n + 1):
    all = all + 1
    urllib.request.urlretrieve("https://website/images/"+str(all)+".jpg", "img"+str(all)+".jpg")
    cwd = os.getcwd()
    if all == 50:
        with open("output2.pdf", "wb") as f:
            f.write(img2pdf.convert([i for i in os.listdir(cwd) if i.endswith(".jpg")]))

Without seeing the filenames you're trying to read in, a guess is that your filenames include numbers that are not zero-padded. Lexicographic ordering (sorting in alphabetical order) of a sequence of files called 0.jpg, 1.jpg, ... 11.jpg will lead to this ordering: 0.jpg, 1.jpg, 10.jpg, 11.jpg, 2.jpg, 3.jpg, 4.jpg, 5.jpg, 6.jpg, 7.jpg, 8.jpg, 9.jpg, because "1" < "2".
To combine your files such that 2 comes before 10, you can zero-pad the filenames (but also beware that some software will interpret leading zeros as indicators of an octal representation of a number, as opposed to just a leading zero.)
If you can't manipulate the filenames, then you could change your file-getting code as follows: use a regular expression to extract the numbers, as int type, from the filenames of your entire list of files, then sort the list of filenames by those extracted numbers (which will be sorted as int, for which 2 < 10).
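A minimal sketch of that approach (the filenames here are made up for illustration): extract the first run of digits from each name with a regular expression, convert it to int, and use it as the sort key.

```python
import re

# Hypothetical filenames; in practice these come from os.listdir()
filenames = ["img10.jpg", "img2.jpg", "img1.jpg", "img11.jpg"]

def numeric_key(name):
    # Pull out the first run of digits and sort by its integer value
    match = re.search(r"\d+", name)
    return int(match.group()) if match else 0

ordered = sorted(filenames, key=numeric_key)
# ordered == ["img1.jpg", "img2.jpg", "img10.jpg", "img11.jpg"]
```

The sorted list can then be passed straight to img2pdf.convert().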

Related

Python3: Recursively compare two directories based on file contents

I have two directories containing a bunch of files and subfolders.
I would like to check if the file contents are the same in both directories (ignoring the file name). The subfolder structure should be the same too.
I looked at filecmp.dircmp but this is not helping because it does not consider the file content; there is no shallow=False option with filecmp.dircmp(), see here.
The workaround in this SO answer does not work either, because it considers the file names.
What's the best way to do my comparison?
Got around to this. After minor testing it seems to work, though more testing is needed. Note that this can take a very long time, depending on both the number of files and their sizes:
import filecmp
import os
from collections import defaultdict
from sys import argv

def compareDirs(d1, d2):
    files1 = defaultdict(set)
    files2 = defaultdict(set)
    subd1 = set()
    subd2 = set()
    for entry in os.scandir(d1):
        if entry.is_dir(): subd1.add(entry)
        else: files1[os.path.getsize(entry)].add(entry)
    # Collecting first to compare length, since we are guessing no
    # match is more likely. Can compare files directly if this is
    # not true.
    for entry in os.scandir(d2):
        if entry.is_dir(): subd2.add(entry)
        else: files2[os.path.getsize(entry)].add(entry)
    # Structure not the same. Checking prior to content.
    if len(subd1) != len(subd2) or len(files1) != len(files2): return False
    for size in files2:
        for entry in files2[size]:
            for fname in files1[size]:  # If size does not exist, will go to else
                if filecmp.cmp(fname, entry, shallow=False): break
            else: return False
            files1[size].remove(fname)
            if not files1[size]: del files1[size]
    # Missed a file
    if files1: return False
    # This is enough since we checked lengths - if all sd2 are matched, sd1
    # will be accounted for.
    for sd1 in subd1:
        for sd2 in subd2:
            if compareDirs(sd1, sd2): break
        else: return False  # Did not find a matching sub-directory
        subd2.remove(sd2)
    return True

print(compareDirs(argv[1], argv[2]))
Recursively enter both directories. Compare files on the first level - fail if no match. Then try and match any sub-dir in the first directory to any sub-dir in the next recursively, until all are matched.
This is the most naive solution. Possibly traversing the tree and only matching sizes and structure would be beneficial in the average case. In that case the function would look similar, except we compare getsize instead of using filecmp, and save the matching tree structures, so the second run would be faster.
Of course, in case of a few sub-directories with the exact same structures and sizes we would still need to compare all possibilities of matching.

Python 3 Opening Binary into a list

I have a binary file consisting only of hex numbers.
I want to open the file and create a list in which each element is one hex number from the file (e.g. 1 byte => AB, for example, would be one element).
I tried it with the "with open" and "readlines" commands and then split the lines into the element size I wanted, but failed.
Also it somehow didn't include a specific hex number (in my case 0A).
my code is
with open(r"C:\Users\James\Desktop\Test1.bin", "rb") as file:
    fileread = file.read
    linesread = file.readlines()
    splitted = linesread.split('\\')
    print(splitted)
How do I go about this?
Thanks for any help.
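For reference, a minimal sketch of one way to do this: read the whole file in binary mode with read() (which, unlike readlines(), does not treat the 0x0A byte as a line break), then format each byte as a two-digit hex string. The byte values below just stand in for the real file contents.

```python
# Stand-in for: data = open(r"C:\Users\James\Desktop\Test1.bin", "rb").read()
data = bytes([0xAB, 0x0A, 0xFF])

# Iterating over a bytes object yields ints 0-255; format each as 2 hex digits
hex_list = ["{:02X}".format(b) for b in data]
# hex_list == ["AB", "0A", "FF"]
```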

Iterate over images with pattern

I have thousands of images which are labeled IMG_####_0 where the first image is IMG_0001_0.png the 22nd is IMG_0022_0.png, the 100th is IMG_0100_0.png etc. I want to perform some tasks by iterating over them.
I used fnames = ['IMG_{}_0.png'.format(i) for i in range(150)] to iterate over the first 150 images, but I get this error: FileNotFoundError: [Errno 2] No such file or directory: '/Users/me/images/IMG_0_0.png', which suggests that it is not the correct way to do it. Any ideas about how to capture this pattern while iterating over the specified number of images, i.e. in my case from IMG_0001_0.png to IMG_0150_0.png?
fnames = ['IMG_{0:04d}_0.png'.format(i) for i in range(1, 151)]
print(fnames)
for fn in fnames:
    try:
        with open(fn, "r") as reader:
            # do smth here
            pass
    except (FileNotFoundError, OSError) as err:
        print(err)
Output:
['IMG_0001_0.png', 'IMG_0002_0.png', ..., 'IMG_0149_0.png', 'IMG_0150_0.png']
Documentation: str.format()
and the Format Specification Mini-Language.
'{:04d}' # format the given parameter with 0 filled to 4 digits as decimal integer
The other way to do it would be to create a normal string and fill it with 0:
print(str(22).zfill(10))
Output:
0000000022
But for your case, format language makes more sense.
You need to use a format pattern to get the format you're looking for. You don't just want the integer converted to a string, you specifically want it to always be a string with four digits, using leading 0's to fill in any empty space. The best way to do this is:
'IMG_{:04d}_0.png'.format(i)
instead of your current format string. The result looks like this:
In [2]: 'IMG_{:04d}_0.png'.format(3)
Out[2]: 'IMG_0003_0.png'
Generating a list of possible names and testing whether each one exists is a slow and clumsy way to iterate over files.
Take a look at https://docs.python.org/3/library/glob.html
so something like:
from glob import iglob
filenames = iglob("/path/to/folder/IMG_*_0.png")
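One caveat worth noting: glob does not guarantee any particular ordering, so it is safest to sort the matches. Because the numbers in these filenames are zero-padded, plain lexicographic sorting already gives numeric order. A sketch, keeping the placeholder path from the answer:

```python
from glob import glob

# "/path/to/folder" is a placeholder; glob returns [] if it does not exist.
# sorted() is needed because glob makes no ordering guarantees; with
# zero-padded numbers, lexicographic order equals numeric order.
names = sorted(glob("/path/to/folder/IMG_*_0.png"))
for path in names:
    pass  # process each image here
```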

How to decode a text file by extracting alphabet characters and listing them into a message?

So we were given an assignment to create a program that sorts through a long message filled with special characters (e.g. [, {, %, $, *), with only a few alphabet characters scattered throughout, to reveal a hidden message.
I've been searching on this site for a while and haven't found anything specific enough that would work.
I put the text file into a pastebin if you want to see it
https://pastebin.com/48BTWB3B
Anywho, this is what I've come up with for code so far
code = open('code.txt', 'r')
lettersList = code.readlines()
lettersList.sort()
for letters in lettersList:
    print(letters)
It prints the code.txt out but into short lists, essentially cutting it into smaller pieces. I want it to find and sort out the alphabet characters into a list and print the decoded message.
This is something you can do pretty easily with regex.
import re

with open('code.txt', 'r') as filehandle:
    contents = filehandle.read()
letters = re.findall("[a-zA-Z]+", contents)
If you want to condense the list into a single string, you can use a join:
single_str = ''.join(letters)
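A quick demonstration on a made-up string resembling the described file (the real input would come from code.txt):

```python
import re

# Made-up sample: letters hidden among special characters
sample = "{h%e$l*l}o [w]o(r)l#d"

letters = re.findall("[a-zA-Z]+", sample)   # runs of letters
message = ''.join(letters)
# message == "helloworld"
```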

I want to extract sentences that containing a drug and gene name from 10,000 articles

I want to extract sentences that containing a drug and gene name from 10,000 articles.
and my code is
import re
import glob
import fnmatch
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

flist = glob.glob("C:/Users/Emma Belladona/Desktop/drug working/*.txt")
print(flist)
for txt in flist:
    #print (txt)
    fr = open(txt, "r")
    tmp = fr.read().strip()
    a = (sent_tokenize(tmp))
    b = (word_tokenize(tmp))
    for c, value in enumerate(a, 1):
        if value.find("SLC22A1") != -1 and value.find("Metformin"):
            print("Result", value)
            re.findall("\w+\s?[gene]+", a)
        else:
            if value.find("Metformin") != -1 and value.find("SLC22A1"):
                print("Results", value)
            if value.find("SLC29B2") != -1 and value.find("Metformin"):
                print("Result", value)
I want to extract sentences that contain both a gene and a drug name from the whole body of each article. For example: "Metformin decreased logarithmically converted SLC22A1 excretion (from 1.58±0.47 to 1.00±0.52, p=0.001)." "In conclusion, we could not demonstrate striking associations of the studied polymorphisms of SLC22A1, ACE, AGTR1, and ADD1 with antidiabetic responses to metformin in this well-controlled study."
This code returns far too many sentences, i.e. a sentence gets printed if just one of the above words appears in it...!
Help me fix the code for this.
You don't show your real code, but the code you have now has at least one mistake that would lead to lots of spurious output. It's on this line:
re.findall("\w+\s?[gene]+", a)
This regexp does not match strings containing gene, as you clearly intended. It matches (almost) any string containing one of the letters g, e, or n.
This cannot be your real code, since a is a list and you would get an error on this line-- plus you ignore the results of the findall()! Sort out your question so it reflects reality. If your problem is still not solved, edit your question and include at least one sentence that is part of the output but you do NOT want to be seeing.
When you do this:
if value.find("SLC22A1") != -1 and value.find("Metformin"):
You're testing for "SLC22A1" in the string and for "Metformin" not at the start of the string (the second part is probably not what you want).
You probably wanted this:
if value.find("SLC22A1") != -1 and value.find("Metformin") != -1:
This find method is error-prone due to its return value, and since you don't care about the position, you'd be better off with in.
To test for 2 words in a sentence (possibly case-insensitive for the 2nd occurrence) do like this:
if "SLC22A1" in value and "metformin" in value.lower():
I'd take a different approach:
Read in the text file
Split the text file into sentences. Check out https://stackoverflow.com/a/28093215/223543 for a hand-rolled approach to do this. Or you could use the nltk.tokenize.punkt module. (Edited after Alexis pointed me in the right direction in the comments below.)
Check if I find your key terms in each sentence and print if I do.
As long as your text files are well formatted, this should work.
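The steps above can be sketched as follows. The sentence splitter here is a crude regex stand-in for a real tokenizer like nltk's sent_tokenize, and the text is made up for illustration:

```python
import re

# Made-up sample text standing in for one article's contents
text = ("Metformin decreased SLC22A1 excretion. "
        "This sentence mentions neither term. "
        "SLC22A1 variants did not predict response to metformin.")

# Crude sentence split: break after ., ! or ? followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", text)

# Keep only sentences containing both key terms (drug match case-insensitive)
hits = [s for s in sentences
        if "SLC22A1" in s and "metformin" in s.lower()]
# hits contains the first and third sentences only
```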
