How to get the number of subfield nodes in an XML file - python-3.x

I am trying to extract data from an XML file. I get the XML file by accessing a previously generated URL for the XML provider's API. Normally the datafields I need are present only once, but sometimes the datafield node is present multiple times.
This is the code I use (it's only part of the script, so variables such as row and clean_aut are defined earlier):
from urllib.request import urlopen
import pandas as pd
import xml.etree.ElementTree as ET

with urlopen(str(row)) as response:
    doc = ET.parse(response)
    root = doc.getroot()

namespaces = {
    "zs": "http://www.loc.gov/zing/srw/",
    "": "http://www.loc.gov/MARC21/slim",
}
datafield_nodes_path = "./zs:records/zs:record/zs:recordData/record/datafield"  # XPath
datafield_attribute_filters = [  # which fields to extract
    {
        "tag": "100",  # author
        "ind1": "1",
        "ind2": " ",
    }]
no_aut = True
for datafield_node in root.iterfind(datafield_nodes_path, namespaces=namespaces):
    if any(datafield_node.get(k) != v for attr_dict in datafield_attribute_filters for k, v in attr_dict.items()):
        continue
    for subfield_node in datafield_node.iterfind("./subfield[@code='a']", namespaces=namespaces):
        clean_aut.append(subfield_node.text)  # this gets the author name
        no_aut = False
if no_aut: clean_aut.append(None)
This works fine for 80% of the URLs I access, but the remaining 20% are either broken or have multiple subfield_nodes matching the datafield_attribute_filters I'm searching for.
Here's an example URL of multiple occurrences: example link
When this URL gets loaded into urlopen I get the Author nine times instead of once.
Is there a way to count the number of occurrences and, if the datafield_node is present more than once, to take only the first occurring datafield_node?
I have tried using findall from ET but got no usable results.
Any help is appreciated.

Although it is not how I wanted to solve it, this did the trick:
append_author = 0
no_aut = True
for datafield_node in root.iterfind(datafield_nodes_path, namespaces=namespaces):
    if any(datafield_node.get(k) != v for attr_dict in datafield_attribute_filters for k, v in attr_dict.items()):
        continue
    if append_author == 0:
        for subfield_node in datafield_node.iterfind("./subfield[@code='a']", namespaces=namespaces):
            clean_aut.append(subfield_node.text)  # this gets the author name
            no_aut = False
            append_author += 1
As soon as the first field gets appended, the others get skipped.
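For reference, a more direct variant is possible; the following is an editorial sketch, not the poster's solution, reusing the same datafield_nodes_path, datafield_attribute_filters, and clean_aut from above. ElementTree's find() returns only the first matching node (or None), and the length of a list of matches gives the occurrence count, so neither a counter nor the inner loop is needed.

# Sketch: count the matching datafield nodes, then take only the first one.
matching = [
    node for node in root.iterfind(datafield_nodes_path, namespaces=namespaces)
    if all(node.get(k) == v
           for attr_dict in datafield_attribute_filters
           for k, v in attr_dict.items())
]
print(len(matching))  # number of matching datafield nodes

if matching:
    # find() returns the first matching subfield, or None if there is none
    subfield_node = matching[0].find("./subfield[@code='a']", namespaces=namespaces)
    clean_aut.append(subfield_node.text if subfield_node is not None else None)
else:
    clean_aut.append(None)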

Related

Python, Selenium, Pandas DataFrame and Excel

I am having trouble piecing together the last part of a puzzle. The entire code is shown below, which includes a non-essential username and password to a site where I am scraping data.
After looping through part numbers from an Excel file using
pd.read_excel()
Selenium is used to scrape various items of the website in question; the code then writes these values to the output window successfully.
As opposed to writing the data to an output window, I aim to write to the same Excel file I am pulling data from, writing it to the appropriate columns.
In the final for loop of the code, I initially tried to write the variables (which were printing to the screen) to Excel by appending
.to_excel('filePathHere')
to the variable in question. As an example, I attempted
description.to_excel('pathToFile/output.xlsx')
which yielded an error of EOL while scanning string literal (<string>, line 1).
I then thought, maybe this variable needs to be converted to a DataFrame, so I then tried
description_DataFrame = pd.DataFrame(description)
description_DataFrame.to_excel('pathToFile/output.xlsx')
which resulted in the same error message.
I am not even sure if this is the correct logic to write each item to the existing (or new) file. If it is, I found an explanation on how to deal with long strings here: StackOverFlow EOL Error. But none of my data constitutes a long string, so I can't see how that applies.
I then start to think I might need to create a dictionary, and then append to it.
So I then removed any attempts from above and tried:
description = []
description.append(mfg_part)
mfg_part.to_excel('pathToFile/output.xlsx')
This still gives me the same EOL error.
I am not too sure what is wrong, or why I can't write the variables mfg_part, mfg_OEM, and description to their respective columns in the loaded Excel file.
Any hints / tips would be greatly appreciated.
The complete working code, printing to the screen, is as follows:
import time
#Need Selenium for interacting with web elements
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
#Need numpy/pandas to interact with large datasets
import numpy as np
import pandas as pd
import itertools

# load in manufacture part number from a collection of components, via an Excel file
mfg_id_list = pd.read_excel("C:/Users/James/Documents/Python Scripts/jupyterNoteBooks/ScrapingData/MasterQuoteTemplate.xls")['Model']

# Create a dictionary to store product and price
# While the below works just fine, we want to create an empty pandas dataframe, so we can output to Excel later
productInfo = {}

chrome_path = r"C:\Users\James\Documents\Python Scripts\jupyterNoteBooks\ScrapingData\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.maximize_window()
driver.get("https://www.tessco.com/login")

userName = "FirstName.SurName321123@gmail.com"
password = "PasswordForThis123"

#Set a wait, for elements to load into the DOM
wait10 = WebDriverWait(driver, 10)
wait20 = WebDriverWait(driver, 20)
wait30 = WebDriverWait(driver, 30)

elem = wait10.until(EC.element_to_be_clickable((By.ID, "userID")))
elem.send_keys(userName)
elem = wait10.until(EC.element_to_be_clickable((By.ID, "password")))
elem.send_keys(password)

#Press the login button
driver.find_element_by_xpath("/html/body/account-login/div/div[1]/form/div[6]/div/button").click()

for i in mfg_id_list:
    #Expand the search bar
    searchBar = wait10.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#searchBar input")))
    #Enter information into the search bar
    #If cell is not blank
    if len(str(i)) != 0:
        searchBar.send_keys(Keys.CONTROL, 'a')
        searchBar.send_keys(i)
        driver.find_element_by_css_selector('a.inputButton').click()
        time.sleep(5)
    try:
        # wait for the products information to be loaded
        products = wait10.until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='CoveoResult']")))
        #isProductsThere = driver.find_element_by_xpath("//div[@class='CoveoResult']")
        if products:
            # iterate through all products in the search result and add details to dictionary
            for product in products:
                # get product info such as OEM, Description and Part Number
                productDescr = product.find_element_by_xpath(".//a[@class='productName CoveoResultLink hidden-xs']").text
                mfgPart = product.find_element_by_xpath(".//ul[@class='unlisted info']").text.split('\n')[3]
                mfgName = product.find_element_by_tag_name("img").get_attribute("alt")
                # There are multiple classes, some are "class sale" or else.
                #We will locate by CSS
                price = product.find_element_by_css_selector("div.price").text.split('\n')[1]
                # add details to dictionary
                productInfo[mfgPart, mfgName, productDescr] = price
            # prints the searched products information
            for (mfg_part, mfg_OEM, description), price in productInfo.items():
                mfg_id = mfg_part.split(': ')[1]
                if mfg_id == i:
                    #Here is where I would write to an Excel file
                    #And where I made attempts as described above
                    print('________________________________________________')
                    print('Part #:', mfg_id)
                    print('Company:', mfg_OEM)
                    print('Description:', description)
                    print('Price:', price)
                    print('________________________________________________')
                    #time.sleep(5)
                    #driver.close()
        else:
            mfg_id = "Not on Tessco"
            mfg_OEM = "Not on Tessco"
            description = "Not on Tessco"
            price = "Not on Tessco"
            #driver.close()
            print("Item was not found on Tessco.com")
    except Exception as e:
        print('________________________________________________')
        print(e)
        mfg_id = "Not on Tessco"
        mfg_OEM = "Not on Tessco"
        description = "Not on Tessco"
        price = "Not on Tessco"
        #driver.close()
        print("Item was not found on Tessco.com")
        print('________________________________________________')

driver.close()
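A sketch of one way to get from printing to an Excel file (an editorial suggestion, not from the original thread; the column names and output path are illustrative assumptions): to_excel is a DataFrame method, so plain strings such as description do not support it. Accumulating each record in a list of dicts and writing a single DataFrame at the end avoids that entirely.

import pandas as pd

rows = []  # one dict per scraped part, collected inside the scraping loop

# inside the loop, in place of the print() calls:
#     rows.append({'Part #': mfg_id, 'Company': mfg_OEM,
#                  'Description': description, 'Price': price})

# after the loop, write everything in one call:
df = pd.DataFrame(rows, columns=['Part #', 'Company', 'Description', 'Price'])
df.to_excel('pathToFile/output.xlsx', index=False)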

Python scraping: trouble extracting a value

I'm trying to extract values from the table in this site: https://www.geonames.org/search.html?q=&country=IT
In my example I want to extract the name 'Rome' and I used this code:
import requests
import lxml.html

html = requests.get('https://www.geonames.org/search.html?q=&country=IT')
doc = lxml.html.fromstring(html.content)
table_body = doc.xpath('//*[@id="search"]/table')[0]
cities = table_body.xpath('//*[@id="search"]/table/tbody/tr[3]/td[2]/a[1]/text()')
Everything seems OK to me, but when I print it the result is:
>>> print(cities)
[]
I really have no idea what the problem could be; does someone have a suggestion?
If you're looking to get "Rome", you can omit tbody. This element was inserted by the browser and isn't present in the original document returned by the request.
Additionally, the extra line table_body = doc.xpath('//*[@id="search"]/table')[0] is redundant; you can search directly from the root.
import requests
import lxml.html

html = requests.get('https://www.geonames.org/search.html?q=&country=IT')
doc = lxml.html.fromstring(html.content)
print(doc.xpath('//*[@id="search"]/table/tr[3]/td[2]/a[1]/text()')[0])  # => Rome
Here is a simple script to extract all the cities on that page:
import requests
import lxml.html

html = requests.get('https://www.geonames.org/search.html?q=&country=IT')
doc = lxml.html.fromstring(html.content)
# corrected the xpath in the below line
cities = doc.xpath("//table[@class='restable']//td[a][2]/a[1]/text()")
for city in cities:
    print(city)

Problem exporting Web Url results into CSV using beautifulsoup3

Problem: I tried to export the results (Name, Address, Phone) into CSV, but the CSV code is not returning the expected results.
#Import the installed modules
import requests
from bs4 import BeautifulSoup
import json
import re
import csv
#To get the data from the web page we will use requests get() method
url = "https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore"
page = requests.get(url)
# To check the http response status code
print(page.status_code)
#Now I have collected the data from the web page, let's see what we got
print(page.text)
#The above data can be viewed in a pretty format by using beautifulsoup's prettify() method. For this we will create a bs4 object and use the prettify method
soup = BeautifulSoup(page.text, 'lxml')
print(soup.prettify())
#Find all DIVs that contain Companies information
product_name_list = soup.findAll("div",{"class":"CompanyInfo"})
#Find all Companies Name under h2tag
company_name_list_heading = soup.findAll("h2")
#Find all Address on page Name under a tag
company_name_list_items = soup.findAll("a",{"class":"address"})
#Find all Phone numbers on page Name under ul
company_name_list_numbers = soup.findAll("ul",{"class":"submenu"})
Created a for loop to print out all company data:
for company_address in company_name_list_items:
    print(company_address.prettify())

# Create for loop to print out all company Names
for company_name in company_name_list_heading:
    print(company_name.prettify())

# Create for loop to print out all company Numbers
for company_numbers in company_name_list_numbers:
    print(company_numbers.prettify())
Below is the code to export the results (name, address & phone number) into CSV:
outfile = open('gymlookup.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["name", "Address", "Phone"])
product_name_list = soup.findAll("div",{"class":"CompanyInfo"})
company_name_list_heading = soup.findAll("h2")
company_name_list_items = soup.findAll("a",{"class":"address"})
company_name_list_numbers = soup.findAll("ul",{"class":"submenu"})
Here is the for loop to loop over the data:
for company_name in company_name_list_heading:
    names = company_name.contents[0]
for company_numbers in company_name_list_numbers:
    names = company_numbers.contents[1]
for company_address in company_name_list_items:
    address = company_address.contents[1]
    writer.writerow([name, Address, Phone])
outfile.close()
You need to work on understanding how for loops work, and also the difference between strings, variables, and other datatypes. You also need to work on using what you have seen from other Stack Overflow questions and learning to apply it. This is essentially the same as your other 2 questions you already posted, but just a different site you're scraping from (I didn't flag it as a duplicate, though, as you're new to Stack Overflow and web scraping, and I remember what it was like trying to learn). I'll still answer your questions, but eventually you need to be able to find the answers on your own and learn how to adapt and apply them; coding isn't paint by numbers. I do see you are adapting some of it, though. Good job finding the "div",{"class":"CompanyInfo"} tag to get the company info.
The data you are pulling (name, address, phone) needs to be within a nested loop of the div class=CompanyInfo element/tag. You could theoretically have it the way you have it now, by putting those into lists and then writing to the csv file from your lists, but there's a risk of data missing, and then your data could end up misaligned with the corresponding company.
Here's what the full code looks like. Notice that the variables are stored within the loop and then written. It then goes to the next block of CompanyInfo and continues.
#Import the installed modules
import requests
from bs4 import BeautifulSoup
import csv

#To get the data from the web page we will use requests get() method
url = "https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore"
page = requests.get(url)

# To check the http response status code
print(page.status_code)

#Now I have collected the data from the web page, let's see what we got
print(page.text)

#The above data can be viewed in a pretty format by using beautifulsoup's prettify() method. For this we will create a bs4 object and use the prettify method
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())

outfile = open('gymlookup.csv', 'w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Name", "Address", "Phone"])

#Find all DIVs that contain Companies information
product_name_list = soup.findAll("div", {"class": "CompanyInfo"})

# Now loop through those elements
for element in product_name_list:
    # Takes 1 block of the "div",{"class":"CompanyInfo"} tag and finds/stores name, address, phone
    name = element.find('h2').text
    address = element.find('address').text.strip()
    phone = element.find("ul", {"class": "submenu"}).text.strip()

    # writes the name, address, phone to csv
    writer.writerow([name, address, phone])
    # now will go to the next "div",{"class":"CompanyInfo"} tag and repeats

outfile.close()

pdf form filled with PyPDF2 does not show in print

I need to fill PDF forms in batch, so I tried to write a Python script to do it for me from a CSV file. I used the second answer in this question and it fills the forms fine; however, when I open the filled forms, the answers do not show unless the corresponding field is selected. The answers also do not show when the form is printed. I looked into the PyPDF2 documentation to see if I can flatten the generated forms, but this feature has not been implemented yet, even though it was requested about a year ago. My preference is not to use pdftk, so I can compile the script without the need for more dependencies. When using the original code in the mentioned question, some fields show in the print and some don't, which makes me confused about how they're working. Any help is appreciated.
Here's the code.
# -*- coding: utf-8 -*-
from collections import OrderedDict
from PyPDF2 import PdfFileWriter, PdfFileReader


def _getFields(obj, tree=None, retval=None, fileobj=None):
    """
    Extracts field data if this PDF contains interactive form fields.
    The *tree* and *retval* parameters are for recursive use.

    :param fileobj: A file object (usually a text file) to write
        a report to on all interactive form fields found.
    :return: A dictionary where each key is a field name, and each
        value is a :class:`Field<PyPDF2.generic.Field>` object. By
        default, the mapping name is used for keys.
    :rtype: dict, or ``None`` if form data could not be located.
    """
    fieldAttributes = {'/FT': 'Field Type', '/Parent': 'Parent', '/T': 'Field Name', '/TU': 'Alternate Field Name',
                       '/TM': 'Mapping Name', '/Ff': 'Field Flags', '/V': 'Value', '/DV': 'Default Value'}
    if retval is None:
        retval = {}  # OrderedDict()
        catalog = obj.trailer["/Root"]
        # get the AcroForm tree
        if "/AcroForm" in catalog:
            tree = catalog["/AcroForm"]
        else:
            return None
    if tree is None:
        return retval
    obj._checkKids(tree, retval, fileobj)
    for attr in fieldAttributes:
        if attr in tree:
            # Tree is a field
            obj._buildField(tree, retval, fileobj, fieldAttributes)
            break
    if "/Fields" in tree:
        fields = tree["/Fields"]
        for f in fields:
            field = f.getObject()
            obj._buildField(field, retval, fileobj, fieldAttributes)
    return retval


def get_form_fields(infile):
    infile = PdfFileReader(open(infile, 'rb'))
    fields = _getFields(infile)
    return {k: v.get('/V', '') for k, v in fields.items()}


def update_form_values(infile, outfile, newvals=None):
    pdf = PdfFileReader(open(infile, 'rb'))
    writer = PdfFileWriter()
    for i in range(pdf.getNumPages()):
        page = pdf.getPage(i)
        try:
            if newvals:
                writer.updatePageFormFieldValues(page, newvals)
            else:
                writer.updatePageFormFieldValues(page,
                                                 {k: f'#{i} {k}={v}'
                                                  for i, (k, v) in
                                                  enumerate(get_form_fields(infile).items())
                                                  })
            writer.addPage(page)
        except Exception as e:
            print(repr(e))
            writer.addPage(page)
    with open(outfile, 'wb') as out:
        writer.write(out)


if __name__ == '__main__':
    import csv
    import os
    from glob import glob

    cwd = os.getcwd()
    outdir = os.path.join(cwd, 'output')
    csv_file_name = os.path.join(cwd, 'formData.csv')
    pdf_file_name = glob(os.path.join(cwd, '*.pdf'))[0]
    if not pdf_file_name:
        print('No pdf file found')
    if not os.path.isdir(outdir):
        os.mkdir(outdir)
    if not os.path.isfile(csv_file_name):
        fields = get_form_fields(pdf_file_name)
        with open(csv_file_name, 'w', newline='') as csv_file:
            csvwriter = csv.writer(csv_file, delimiter=',')
            csvwriter.writerow(['user label'])
            csvwriter.writerow(['fields'] + list(fields.keys()))
            csvwriter.writerow(['Mr. X'] + list(fields.values()))
    else:
        with open(csv_file_name, 'r', newline='') as csv_file:
            csvreader = csv.reader(csv_file, delimiter=',')
            csvdata = list(csvreader)
        fields = csvdata[1][1:]
        for frmi in csvdata[2:]:
            frmdict = dict(zip(fields, frmi[1:]))
            outfile = os.path.join(outdir, frmi[0] + '.pdf')
            update_form_values(pdf_file_name, outfile, frmdict)
I had the same issue, and apparently adding the "/NeedAppearances" attribute to the PdfFileWriter object's AcroForm fixed the problem (see https://github.com/mstamy2/PyPDF2/issues/355). With much help from ademidun (https://github.com/ademidun), I was able to populate a pdf form and have the values of the fields show properly. The following is an example:
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import BooleanObject, NameObject, IndirectObject


def set_need_appearances_writer(writer):
    # See 12.7.2 and 7.7.2 for more information:
    # http://www.adobe.com/content/dam/acom/en/devnet/acrobat/
    # pdfs/PDF32000_2008.pdf
    try:
        catalog = writer._root_object
        # get the AcroForm tree and add the "/NeedAppearances" attribute
        if "/AcroForm" not in catalog:
            writer._root_object.update(
                {
                    NameObject("/AcroForm"): IndirectObject(
                        len(writer._objects), 0, writer
                    )
                }
            )
        need_appearances = NameObject("/NeedAppearances")
        writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
        return writer
    except Exception as e:
        print("set_need_appearances_writer() catch : ", repr(e))
        return writer


reader = PdfFileReader("myInputPdf.pdf", strict=False)
if "/AcroForm" in reader.trailer["/Root"]:
    reader.trailer["/Root"]["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)}
    )

writer = PdfFileWriter()
set_need_appearances_writer(writer)
if "/AcroForm" in writer._root_object:
    writer._root_object["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)}
    )

field_dictionary = {"Field1": "Value1", "Field2": "Value2"}
writer.addPage(reader.getPage(0))
writer.updatePageFormFieldValues(writer.getPage(0), field_dictionary)

with open("myOutputPdf.pdf", "wb") as fp:
    writer.write(fp)
The underlying reason form fields are not showing up after being filled in is that the values are not being added to the stream. Adding "NeedAppearances" tells the PDF reader that it needs to update the appearance, in this case by creating a stream for each field value, but not all PDF readers honor that, and the fields may still look blank or show their default values.
The best solution to make sure the fields are updated for any reader is to create a stream for each field and add it to the field's XObject.
Here is an example solution for single-line text fields. It also encodes the stream, updates the default value, and sets the fields to read only, all of which are optional.
# Imports assumed for this snippet (they were not shown in the original post);
# the module paths below are for PyPDF2 2.x and may differ between versions.
from PyPDF2 import PdfReader, PdfWriter
from PyPDF2.constants import (AnnotationDictionaryAttributes, FieldDictionaryAttributes,
                              FieldFlag, FilterTypes, InteractiveFormDictEntries,
                              PageAttributes, StreamAttributes)
from PyPDF2.filters import FlateDecode
from PyPDF2.generic import NameObject, NumberObject, TextStringObject, encode_pdfdocencoding

# Example data.
data = {
    "field_name": "some value"
}

# Get template.
template = PdfReader("template-form.pdf", strict=False)

# Initialize writer.
writer = PdfWriter()

# Add the template page.
writer.add_page(template.pages[0])

# Get page annotations.
page_annotations = writer.pages[0][PageAttributes.ANNOTS]

# Loop through page annotations (fields).
for index in range(len(page_annotations)):  # type: ignore
    # Get annotation object.
    annotation = page_annotations[index].get_object()  # type: ignore

    # Get existing values needed to create the new stream and update the field.
    field = annotation.get(NameObject("/T"))
    new_value = data.get(field, 'N/A')
    ap = annotation.get(AnnotationDictionaryAttributes.AP)
    x_object = ap.get(NameObject("/N")).get_object()
    font = annotation.get(InteractiveFormDictEntries.DA)
    rect = annotation.get(AnnotationDictionaryAttributes.Rect)

    # Calculate the text position.
    font_size = float(font.split(" ")[1])
    w = round(float(rect[2] - rect[0] - 2), 2)
    h = round(float(rect[3] - rect[1] - 2), 2)
    text_position_h = h / 2 - font_size / 3  # approximation

    # Create a new XObject stream.
    new_stream = f'''
        /Tx BMC
        q
        1 1 {w} {h} re W n
        BT
        {font}
        2 {text_position_h} Td
        ({new_value}) Tj
        ET
        Q
        EMC
    '''

    # Add Filter type to XObject.
    x_object.update(
        {
            NameObject(StreamAttributes.FILTER): NameObject(FilterTypes.FLATE_DECODE)
        }
    )

    # Update and encode XObject stream.
    x_object._data = FlateDecode.encode(encode_pdfdocencoding(new_stream))

    # Update annotation dictionary.
    annotation.update(
        {
            # Update Value.
            NameObject(FieldDictionaryAttributes.V): TextStringObject(
                new_value
            ),
            # Update Default Value.
            NameObject(FieldDictionaryAttributes.DV): TextStringObject(
                new_value
            ),
            # Set Read Only flag.
            NameObject(FieldDictionaryAttributes.Ff): NumberObject(
                FieldFlag(1)
            )
        }
    )

# Clone document root & metadata from template.
# This is required so that the document doesn't try to save before closing.
writer.clone_reader_document_root(template)

# write "output".
with open("output.pdf", "wb") as output_stream:
    writer.write(output_stream)  # type: ignore
Thanks to fidoriel and others from the discussion here: https://github.com/py-pdf/PyPDF2/issues/355.
This is what works for me on Python 3.8 and PyPDF4 (but I think it will work as well with PyPDF2):
#!/usr/bin/env python3
from PyPDF4.generic import NameObject
from PyPDF4.generic import TextStringObject
from PyPDF4.pdf import PdfFileReader
from PyPDF4.pdf import PdfFileWriter
import random
import sys

reader = PdfFileReader(sys.argv[1])
writer = PdfFileWriter()

# Try to "clone" the original one (note the library has cloneDocumentFromReader,
# but the rendered pdf is blank).
writer.appendPagesFromReader(reader)
writer._info = reader.trailer["/Info"]
reader_trailer = reader.trailer["/Root"]
writer._root_object.update(
    {
        key: reader_trailer[key]
        for key in reader_trailer
        if key in ("/AcroForm", "/Lang", "/MarkInfo")
    }
)

page = writer.getPage(0)
params = {"Foo": "Bar"}

# Inspired by updatePageFormFieldValues but also handles checkboxes.
for annot in page["/Annots"]:
    writer_annot = annot.getObject()
    field = writer_annot["/T"]
    if writer_annot["/FT"] == "/Btn":
        value = params.get(field, random.getrandbits(1))
        if value:
            writer_annot.update(
                {
                    NameObject("/AS"): NameObject("/On"),
                    NameObject("/V"): NameObject("/On"),
                }
            )
    elif writer_annot["/FT"] == "/Tx":
        value = params.get(field, field)
        writer_annot.update(
            {
                NameObject("/V"): TextStringObject(value),
            }
        )

with open(sys.argv[2], "wb") as f:
    writer.write(f)
This updates text fields and checkboxes.
I believe the key part is copying some parts from the original file:
reader_trailer = reader.trailer["/Root"]
writer._root_object.update(
    {
        key: reader_trailer[key]
        for key in reader_trailer
        if key in ("/AcroForm", "/Lang", "/MarkInfo")
    }
)
Note: Please feel free to share this solution in other places. I consulted a lot of SO questions related to this topic.
What worked for me was to reopen the file with pdfrw.
The following has worked for me for Adobe Reader, Acrobat, Skim, and Mac OS Preview:
pip install pdfrw

import pdfrw

pdf = pdfrw.PdfReader("<input_name>")
for page in pdf.pages:
    annotations = page.get("/Annots")
    if annotations:
        for annotation in annotations:
            annotation.update(pdfrw.PdfDict(AP=""))
pdf.Root.AcroForm.update(pdfrw.PdfDict(NeedAppearances=pdfrw.PdfObject('true')))
pdfrw.PdfWriter().write("<output_name>", pdf)
alepisa's answer was the closest to working for me (thank you, alepisa), but I just had to change one small section:
elif writer_annot["/FT"] == "/Tx":
    value = params.get(field, field)
    writer_annot.update(
This was producing an output where my PDF had the desired fields updated based on the dictionary of field names and values I passed it, but every fillable field, whether I wanted it filled or not, was populated with the name of that fillable field. I changed the elif statement to the one below and everything worked like a charm!
elif writer_annot["/FT"] == "/Tx":
    value = params.get(field, "")
    writer_annot.update({NameObject("/V"): TextStringObject(value),
                         # This line below is just for formatting
                         NameObject("/DA"): TextStringObject("/Helv 0 Tf 0 g")})
Nested back into the rest of alepisa's script, this should work for anybody having issues getting Acrobat to show the values without clicking on the field!

Trouble getting around list index error

I've written a script to scrape Name and Price from craigslist. It works smoothly until it finds that either of the values is None; as soon as it gets any None value it breaks, displaying "list index out of range". How do I deal with that?
import requests
from lxml import html

page = requests.get('http://bangalore.craigslist.co.in/search/rea?s=120').text
tree = html.fromstring(page)
rows = tree.xpath('//li[@class="result-row"]')
for row in rows:
    link = row.xpath('.//a[contains(@class,"hdrlnk")]/text()')[0]
    price = row.xpath('.//span[@class="result-price"]/text()')[0]
    print(link, price)
By far the most efficient technique I've come across to avoid such errors:
import requests
from lxml import html

page = requests.get('http://bangalore.craigslist.co.in/search/rea?s=120').text
tree = html.fromstring(page)

def if_exist(row, xpath):
    docs = row.xpath(xpath)
    if docs:
        return docs[0]
    return ""

for row in tree.xpath('//li[@class="result-row"]'):
    link = if_exist(row, './/a[contains(@class,"hdrlnk")]/text()')
    price = if_exist(row, './/span[@class="result-price"]/text()')
    print(link, price)
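For reference, an inline variant of the same idea (a sketch, not from the original answer): next() with a default value returns the first matched text, or "" when the XPath result list is empty, so a missing Name or Price can no longer raise "list index out of range".

for row in tree.xpath('//li[@class="result-row"]'):
    # next() yields the first item of the iterator, or "" if nothing matched
    link = next(iter(row.xpath('.//a[contains(@class,"hdrlnk")]/text()')), "")
    price = next(iter(row.xpath('.//span[@class="result-price"]/text()')), "")
    print(link, price)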
