I have been using the mechanize gem to scrape data from Craigslist. I have a piece of code that uploads multiple images to Craigslist; all the file paths are correct, but a single image gets uploaded multiple times. What is the reason?
unless pic_url_arry.blank?
  unless page.links_with(:text => 'Use classic image uploader').first.blank?
    page = page.links_with(:text => 'Use classic image uploader').first.click
  end
  puts "After classic image uploader"
  form = page.form_with(class: "add")
  # build the full file path before setting it, e.g. file = File.join(APP_ROOT, 'tmp', 'image.jpg')
  i = 0
  pic_url_arry = pic_url_arry.shuffle
  pic_url_arry.each do |p|
    form.file_uploads.first.file_name = p
    i += 1
    page = form.submit
    puts "******#{p.inspect}*******"
    puts "******#{page.inspect}*******"
  end unless pic_url_arry.blank?
  # check that the upload succeeded by comparing the number of files with the number of imgboxes on the page.
  check_image_uploaded = page.search('figure.imgbox').count
  if check_image_uploaded.to_i != i.to_i
    # upload failure: craigslist or network error.
  end
end
The pic array has values like ["/home/codebajra/www/office/autocraig/public/uploads/posting_pic/pic/1/images__4_.jpg", "/home/codebajra/www/office/autocraig/public/uploads/posting_pic/pic/2/mona200.jpg", "/home/codebajra/www/office/autocraig/public/uploads/posting_pic/pic/3/images__1_.jpg"].
The form holding the file field is being fetched only once, so it keeps re-submitting the first image that was set. The form needs to be re-fetched from each freshly returned page. The updated code will be:
unless pic_url_arry.blank?
  unless page.links_with(:text => 'Use classic image uploader').first.blank?
    page = page.links_with(:text => 'Use classic image uploader').first.click
  end
  puts "After classic image uploader"
  form = page.form_with(class: "add")
  # build the full file path before setting it, e.g. file = File.join(APP_ROOT, 'tmp', 'image.jpg')
  i = 0
  pic_url_arry = pic_url_arry.shuffle
  pic_url_arry.each do |p|
    form.file_uploads.first.file_name = p
    i += 1
    page = form.submit
    form = page.form_with(class: "add") # re-fetch the form from the freshly returned page
    puts "******#{p.inspect}*******"
    puts "******#{page.inspect}*******"
  end unless pic_url_arry.blank?
  # check that the upload succeeded by comparing the number of files with the number of imgboxes on the page.
  check_image_uploaded = page.search('figure.imgbox').count
  if check_image_uploaded.to_i != i.to_i
    # upload failure: craigslist or network error.
  end
end
Hoping this will solve the problem.
I'm writing a program to take large PDFs and convert each page to a .jpg, then add the .jpg files of each PDF to their own directory (which the program needs to create).
I have completed the conversion part of the program, but I am stuck on creating a directory and adding the files to it.
Here's my code so far.
import glob, sys, fitz, os, shutil

zoom_x = 2.0
zoom_y = 2.0
mat = fitz.Matrix(zoom_x, zoom_y)  # to get better resolution

all_files = glob.glob('/Users/homefolder/Downloads/*.pdf')  # image path
print(all_files)

for filename in all_files:
    doc = fitz.open(filename)
    head, tail = os.path.split(doc.name)
    save_file_name = tail.split('.')[0]
    for page in doc:  # iterate through the pages
        # print(page)
        pix = page.get_pixmap(matrix=mat)  # render the image
        filepath_save = '/Users/homefolder/Downloads/files' + save_file_name + str(page.number) + '.jpg'
        pix.save(filepath_save)  # save image

sample = glob.glob('/Users/homefolder/Downloads/*.jpg')
How would I write the code to create a directory for each PDF file and add those .jpg files to it?
You can create a directory and save your processed files to it. I also refactored your code a bit:
import glob, fitz, os

zoom_x = 2.0
zoom_y = 2.0
mat = fitz.Matrix(zoom_x, zoom_y)

pdf_files = glob.glob('/Users/homefolder/Downloads/*.pdf')
save_to = '/Users/homefolder/Downloads/pdf_as_img/'

for path in pdf_files:
    doc = fitz.open(path)
    base_name, _ = os.path.splitext(os.path.basename(doc.name))
    directory_to_save = os.path.join(save_to, base_name)
    if not os.path.exists(directory_to_save):
        os.makedirs(directory_to_save)
    for page in doc:
        pix = page.get_pixmap(matrix=mat)
        filepath_save = os.path.join(directory_to_save, str(page.number) + '.jpg')
        pix.save(filepath_save)
This script creates a directory for every PDF file and saves its pages into it as JPGs.
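As a side note, on Python 3.2+ the existence check can be folded into the call itself; with exist_ok=True, os.makedirs simply does nothing when the directory already exists:

    # equivalent to the if/makedirs pair above, in one call
    os.makedirs(directory_to_save, exist_ok=True)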
I'm trying to upload files in the app that I'm making.
I have an input field (ImageField) that accepts multiple images to be uploaded at once on submit.
I do this via AJAX, so I want to return some sort of JSON object that
has the total count of images the user tried to upload and the number of images actually uploaded.
def post_update(request):
    save_status = {'update_save': False, 'image_count': 0, 'image_save': 0}
    if request.method == 'POST':
        update = UpdateForm(request.POST)
        if update.is_valid():
            event = Event.objects.get(pk=request.POST['event_id'])
            update_form = update.save(False)
            update_form.update_by = request.user
            update_form.event = event
            update_form.save()
            save_status['update_save'] = True
            images = ImageForm(request.POST, request.FILES)
            files = request.FILES.getlist('image_path')
            save_status['image_count'] = request.FILES.count
            if images.is_valid():
                for f in files:
                    photo = Image(image_path=f, update_ref=update_form, image_title=images.cleaned_data.get('image_title'))
                    photo.save()
                    save_status['image_save'] += 1
I tried request.FILES.count and request.FILES.length to get the count, but to no avail; I keep getting an error. My question is basically: how can I get the number of files in request.FILES?
You can obtain the number of images with:
num_files = len(request.FILES.getlist('image_path'))
That being said, I would advise making use of the form instead, so use images.cleaned_data['image_path'], since these are the items that the form has cleaned.
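Wired into the posted view, a minimal sketch could look like this (keeping the question's names; the elided update-form handling and the JsonResponse at the end are my assumptions about the surrounding code):

    from django.http import JsonResponse

    def post_update(request):
        save_status = {'update_save': False, 'image_count': 0, 'image_save': 0}
        if request.method == 'POST':
            # ... update-form handling exactly as in the question ...
            images = ImageForm(request.POST, request.FILES)
            files = request.FILES.getlist('image_path')
            save_status['image_count'] = len(files)  # total files the client sent
            if images.is_valid():
                for f in files:
                    photo = Image(image_path=f,
                                  update_ref=update_form,
                                  image_title=images.cleaned_data.get('image_title'))
                    photo.save()
                    save_status['image_save'] += 1  # files actually persisted
        return JsonResponse(save_status)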
I am currently working with several XML files that require the text of the mods:namePart element to be changed. I have created a script that should loop through all the XML files in a specified directory and make the intended changes. However, when I run the script the changes are not reflected in the new files. It executes as expected, and I even get the "namepart changed" output in my console, but the text I want to replace remains the same. PLEASE HELP!! I am extremely new to coding, so any tips/comments are welcome. Here is the code I'm using:
import glob, os
import xml.etree.ElementTree as ET

list_of_files = glob.glob('/Users/#####/Documents/test_xml_files/*.xml')

for file in list_of_files:
    xmlObject = ET.parse(file)
    root = xmlObject.getroot()
    namespaces = {'mods': 'http://www.loc.gov/mods/v3'}
    for namePart in root.iterfind('mods:name/mods:namePart', namespaces):
        if namePart.text == 'Tsukioka, Kōgyo, 1869-1927':
            new_namePart = namePart.text.replace('Tsukioka, Kōgyo, 1869-1927', 'Tsukioka Kōgyo, 1869-1927', 1)
            namePart.text == new_namePart
            print('namepart changed')
        else:
            continue
    nf = open(os.path.join('/Users/####/Documents/updated_test_directory', os.path.basename(file)), 'wb')
    xmlString = ET.tostring(root, encoding="utf-8", method="xml", xml_declaration=None)
    nf.write(xmlString)
    nf.close()
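For reference, the reason the text never changes is the line namePart.text == new_namePart: == is a comparison whose result is discarded, not an assignment. A minimal sketch of the corrected inner loop:

    for namePart in root.iterfind('mods:name/mods:namePart', namespaces):
        if namePart.text == 'Tsukioka, Kōgyo, 1869-1927':
            # '=' assigns the replacement back to the element
            namePart.text = namePart.text.replace(
                'Tsukioka, Kōgyo, 1869-1927', 'Tsukioka Kōgyo, 1869-1927', 1)
            print('namepart changed')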
While trying to download a Sentinel image for a specific location, the TIF file is generated in Drive by default, but it is not readable by OpenCV or PIL.Image(). Below is the code for it. If I use the file format TFRecord, no images are downloaded to Drive.
import datetime
import ee

starting_time = '2018-12-15'
delta = 15
L = -96.98
B = 28.78
R = -97.02
T = 28.74
cordinates = [L, B, R, T]
my_scale = 30
fname = 'sinton_texas_30'

llx = cordinates[0]
lly = cordinates[1]
urx = cordinates[2]
ury = cordinates[3]
geometry = [[llx, lly], [llx, ury], [urx, ury], [urx, lly]]

tstart = datetime.datetime.strptime(starting_time, '%Y-%m-%d')
tend = tstart + datetime.timedelta(days=delta)

# mask2clouds is defined elsewhere in my script
collSent = ee.ImageCollection('COPERNICUS/S2').filterDate(str(tstart).split(' ')[0], str(tend).split(' ')[0]).filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 20)).map(mask2clouds)
medianSent = ee.Image(collSent.reduce(ee.Reducer.median()))
cropLand = ee.ImageCollection('USDA/NASS/CDL').filterDate('2017-01-01', '2017-12-31').first()

task_config = {
    'scale': my_scale,
    'region': geometry,
    'fileFormat': 'TFRecord'
}

f1 = medianSent.select(['B1_median', 'B2_median', 'B3_median'])
taskSent = ee.batch.Export.image(f1, fname + "_Sent", task_config)
taskSent.start()
I expect the output to be readable in Python so I can convert it into a NumPy array. In the case of file format 'TFRecord', I expect the file to be downloaded to my Drive.
I think you should consider the following things:
File format
If you want to open your file with PIL or OpenCV rather than TensorFlow, you should use GeoTIFF. Try this format and see if things improve.
Saving to Drive
Normally saving to your Drive is the default behavior. However, you can try to force writing to your drive:
ee.batch.Export.image.toDrive(image=f1, ...)
You can further try to set up a folder where the images should be sent:
ee.batch.Export.image.toDrive(image=f1, folder='foo', ...)
In addition, the Export data help page and this tutorial are good starting points for further research.
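Putting both suggestions together, a minimal sketch for the code in the question (the folder name 'sentinel_exports' is just an illustration):

    taskSent = ee.batch.Export.image.toDrive(
        image=f1,
        description=fname + '_Sent',
        folder='sentinel_exports',  # hypothetical Drive folder name
        region=geometry,
        scale=my_scale,
        fileFormat='GeoTIFF',       # readable by PIL/OpenCV, unlike TFRecord
    )
    taskSent.start()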
I am just exploring Scrapy with Splash, and I am trying to scrape all the product (pants) data, with product id, name and price, from an e-commerce site, gap. But I don't see all of the dynamic product data loaded when I view the page from the Splash web UI (only 16 items load for every request, and I have no clue why).
I tried the following options, but no luck:
Increasing the wait time up to 20 sec
Starting the Docker container with "--disable-private-mode"
Using a lua_script for page scrolling
Setting the full viewport option splash:set_viewport_full()
lua_script2 = """
function main(splash)
    local num_scrolls = 10
    local scroll_delay = 2.0
    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)
    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(scroll_delay)
    end
    return splash:html()
end"""
yield SplashRequest(
    url,
    self.parse_product_contents,
    endpoint='execute',
    args={
        'lua_source': lua_script2,
        'wait': 5,
    }
)
Can anyone please shed some light on this behavior?
P.S.: I am using the Scrapy framework, and I am able to parse the product information (item id, name and price) from render.html (but render.html has only 16 items' information).
I updated the script to the one below:
function main(splash)
    local num_scrolls = 10
    local scroll_delay = 2.0
    splash:set_viewport_size(1980, 8020)
    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    -- splash:set_viewport_full()
    splash:wait(10)
    splash:runjs("jQuery('span.icon-x').click();")
    splash:wait(1)
    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(scroll_delay)
    end
    splash:wait(30)
    return {
        png = splash:png(),
        html = splash:html(),
        har = splash:har()
    }
end
I ran it in my local Splash; the PNG doesn't come out right, but the HTML has the last product.
The only issue was that the page won't scroll while the email-subscribe popup is present, so I added code to close it (the splash:runjs line above).
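Since the script now returns a Lua table rather than bare HTML, the Scrapy callback reads it from response.data (scrapy-splash wraps JSON results in a SplashJsonResponse). A minimal sketch of the callback inside the spider, assuming the SplashRequest from the question:

    import base64
    import scrapy

    def parse_product_contents(self, response):
        # response.data holds the returned table: {png = ..., html = ..., har = ...}
        html = response.data['html']
        png = base64.b64decode(response.data['png'])  # Splash base64-encodes binary fields
        with open('page.png', 'wb') as f:
            f.write(png)
        # parse the fully scrolled page with the usual selectors
        sel = scrapy.Selector(text=html)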