Postgres - unable to clear all rows & then rewrite/replace scrapy pipeline data - python-3.x

I'm using Scrapy to collect data and then saving it to Postgres. I have one table named auto_records that I wish to completely replace each time data is scraped. It seems like it should not be too difficult, but I'm seeing some weird behavior.
import psycopg2

class AutoRecordsPipeline(object):
    def open_spider(self, spider):
        hostname = 'localhost'
        username = 'postgres'
        password = 'xxxxxxx'
        database = 'autos'
        self.connection = psycopg2.connect(host=hostname, user=username, password=password, dbname=database)
        self.cur = self.connection.cursor()
        self.cur.execute("DELETE FROM auto_records")  # MOVED HERE AS PER COMMENTS

    def close_spider(self, spider):
        self.cur.close()
        self.connection.close()

    def process_item(self, item, spider):
        try:
            #self.cur.execute("VACUUM FULL auto_records")
            #self.cur.execute("DELETE FROM auto_records")
            self.cur.execute("INSERT INTO auto_records(make,year,color,miles) VALUES (%s,%s,%s,%s)",
                             (item['make'], item['year'], item['color'], item['miles']))
        except psycopg2.IntegrityError:
            self.connection.rollback()  # was self.conn, which is never defined
        else:
            self.connection.commit()
        return item
Initially I tried VACUUM FULL (commented out above), but I got the error psycopg2.errors.ActiveSqlTransaction: VACUUM cannot run inside a transaction block, so I then tried the current DELETE statement. What I see now when I print the table to the console like this
autorecs = pd.read_sql_query('SELECT * FROM "auto_records"', con=engine)
print('from autos db auto_records',autorecs)
is just the last record that gets scraped
from autos db auto_records
   id       make  year color  miles
0  30  Chevrolet  2019  blue  30157
I don't understand where the 0 30 comes from; it seems like it should be 0 1 or 29 30, since there are 30 records in total. If I comment out the DELETE statement I get too many records because of the INSERT INTO statement. I don't know if it has to do with the Scrapy pipeline or something else, but I'm hoping someone has an idea as to what the real issue is... thanks
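For reference, the ActiveSqlTransaction error mentioned above is only about transactions: psycopg2 opens one implicitly, and VACUUM must run outside it. A minimal sketch of how the commented-out VACUUM could be run with autocommit (same credentials and table as in the question; this does not explain the single-record result, it only illustrates the error message):

import psycopg2

conn = psycopg2.connect(host='localhost', user='postgres', password='xxxxxxx', dbname='autos')
conn.autocommit = True  # VACUUM cannot run inside a transaction block
cur = conn.cursor()
cur.execute("VACUUM FULL auto_records")
cur.close()
conn.close()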

Related

How can I resolve this selenium stale reference error when scraping page 2 of a website?

I'm using Selenium to scrape LinkedIn for jobs, but I'm getting a stale element reference error.
I've tried refresh, wait, WebDriverWait, and a try/except block.
It always fails on page 2.
I'm aware it could be a DOM issue and have run through a few of the answers to that, but none of them seem to work for me.
def scroll_to(self, job_list_item):
    """Just a function that will scroll to the list item in the column"""
    self.driver.execute_script("arguments[0].scrollIntoView();", job_list_item)
    job_list_item.click()
    time.sleep(self.delay)

def get_position_data(self, job):
    """Gets the position data for a posting.

    Parameters
    ----------
    job : Selenium webelement

    Returns
    -------
    list of strings : [position, company, location, details]
    """
    # This is where the error is!
    [position, company, location] = job.text.split('\n')[:3]
    details = self.driver.find_element_by_id("job-details").text
    return [position, company, location, details]

def wait_for_element_ready(self, by, text):
    try:
        WebDriverWait(self.driver, self.delay).until(EC.presence_of_element_located((by, text)))
    except TimeoutException:
        logging.debug("wait_for_element_ready TimeoutException")
        pass
logging.info("Begin linkedin keyword search")
self.search_linkedin(keywords, location)
self.wait()
# scrape pages,only do first 8 pages since after that the data isn't
# well suited for me anyways:
for page in range(2, 3):
jobs = self.driver.find_elements_by_class_name("occludable-update")
#jobs = self.driver.find_elements_by_css_selector(".occludable-update.ember-view")
#WebDriverWait(self.driver, 10).until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'occludable-update')))
for job in jobs:
self.scroll_to(job)
#job.click
[position, company, location, details] = self.get_position_data(job)
# do something with the data...
data = (position, company, location, details)
#logging.info(f"Added to DB: {position}, {company}, {location}")
writer.writerow(data)
# go to next page:
bot.driver.find_element_by_xpath(f"//button[#aria-label='Page {page}']").click()
bot.wait()
logging.info("Done scraping.")
logging.info("Closing DB connection.")
f.close()
bot.close_session()
When job_list_item.click() is performed the page is reloaded; since you are looping over jobs, which is a list of WebElements captured before that navigation, those references become stale. You do return to the listing page, but your jobs list already points at the old DOM.
Usually, to prevent a stale element, I avoid reusing an element inside a loop or keeping it stored in a variable, and re-locate it instead, especially if the element may change.
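A rough sketch of that idea, reusing the names from the question (the elements are re-located on every iteration instead of being reused after the page changes; this is purely illustrative, not the poster's exact code):

for page in range(2, 3):
    num_jobs = len(self.driver.find_elements_by_class_name("occludable-update"))
    for i in range(num_jobs):
        # re-find the elements on each pass so the reference is never stale
        job = self.driver.find_elements_by_class_name("occludable-update")[i]
        self.scroll_to(job)
        [position, company, location, details] = self.get_position_data(job)
        writer.writerow((position, company, location, details))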

BigQuery update how to get number of updated rows

I am using Google Cloud Functions to connect to a Google Bigquery database and update some rows. The cloud function is written using Python 3.
I need help figuring out how to get the result message or the number of updated/changed rows whenever I run an update dml through the function. Any ideas?
from google.cloud import bigquery

def my_update_function(context, data):
    BQ = bigquery.Client()
    query_job = BQ.query("Update table set etc...")
    rows = query_job.result()
    return (rows)
I understand that rows always comes back as an _EmptyRowIterator object. Is there any way I can get a result or result message? The documentation says I have to get it from a BigQuery job method, but I can't seem to figure it out.
I think that you are searching for QueryJob.num_dml_affected_rows. It contains the number of rows affected by an UPDATE or any other DML statement. If you use it in place of rows in the return statement you will get the number as an int, or you can build a message like:
return("Number of updated rows: " + str(query_job.num_dml_affected_rows))
I hope it will help :)
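Put together with the function from the question, that might look roughly like this (the UPDATE statement itself is a placeholder):

from google.cloud import bigquery

def my_update_function(context, data):
    BQ = bigquery.Client()
    query_job = BQ.query("UPDATE mydataset.mytable SET name = 'etc' WHERE id = 1")
    query_job.result()  # wait for the DML job to finish
    return "Number of updated rows: " + str(query_job.num_dml_affected_rows)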
Seems like there is no mention in the BigQuery Python DB-API documentation of rows returned. https://googleapis.dev/python/bigquery/latest/reference.html
I decided to use a roundabout method on dealing with this issue by generating a SELECT statement first to check if there are any matches to the WHERE clause in the UPDATE statement.
Example:
from google.cloud.bigquery import dbapi as bq

def my_update_function(context, data):
    try:
        bq_connection = bq.connect()
        bq_cursor = bq_connection.cursor()
        bq_cursor.execute("select * from table where ID = 1")
        results = bq_cursor.fetchone()
        if results is None:
            print("Row not found.")
        else:
            bq_cursor.execute("UPDATE table set name = 'etc' where ID = 1")
        bq_connection.commit()
        bq_connection.close()
    except Exception as e:
        db_error = str(e)

Cx_Oracle fetch crash

So I've queried data from oracle database using cursor.execute(). A relatively simple select query. It works.
But when I try to fetch data from it, python crashes.
The same occurs for fetchall(), fetchmany() and fetchone().
When the query first broke in fetchmany() I decided to loop through fetchone() and it worked for the first two rows then broke at the third.
I'm guessing it is because there's too much data in third row.
So, is there any way to bypass this issue and pull the data?
(Please ignore the wrong indents; I could not copy properly on my phone.)
EDIT:
I removed four columns with type "ROWID". There was no issue after that. I was easily able to fetch 100 rows in one go.
So to confirm my suspicion I went ahead and created another copy with only those ROWID columns, and it crashes as expected.
So is there any issue with ROWID type?
Test table for the same.
Insert into TEST_FOR_CX_ORACLE (Z$OEX0_LINES,Z$OEX0_ORDER_INVOICES,Z$OEX0_ORDERS,Z$ITEM_ROWID) values ('ABoeqvAEyAAB0HOAAM','AAAL0DAEzAAClz7AAN','AAAVeuABHAAA4vdAAH','ABoeo+AIVAAE6dKAAQ');
Insert into TEST_FOR_CX_ORACLE (Z$OEX0_LINES,Z$OEX0_ORDER_INVOICES,Z$OEX0_ORDERS,Z$ITEM_ROWID) values ('ABoeqvABQAABKo6AAI','AAAL0DAEzAAClz7AAO','AAAVeuABHAAA4vdAAH','ABoeo+AIVAAE6dKAAQ');
Insert into TEST_FOR_CX_ORACLE (Z$OEX0_LINES,Z$OEX0_ORDER_INVOICES,Z$OEX0_ORDERS,Z$ITEM_ROWID) values ('ABoeqvABQAABKo6AAG','AAAL0DAEzAAClz7AAP','AAAVeuABHAAA4vdAAH','ABoeo+AHIAAN+OIAAM');
Insert into TEST_FOR_CX_ORACLE (Z$OEX0_LINES,Z$OEX0_ORDER_INVOICES,Z$OEX0_ORDERS,Z$ITEM_ROWID) values ('ABoeqvAEyAAB0HOAAK','AAAL0DAEzAACl0EAAC','AAAVeuABHAAA4vdAAH','ABoeo+AHIAAN+OIAAM');
Script:
from cx_Oracle import makedsn,connect,Cursor
from pandas import read_sql_table, DataFrame, Series
from time import time
def create_conn( host_link , port , service_name , user_name , password ):
dsn=makedsn(host_link,port,service_name=service_name)
return connect(user=user_name, password=password, dsn=dsn)
def initiate_connection(conn):
try:
dbconnection = create_conn(*conn)
print('Connected to '+conn[2]+' !')
except Exception as e:
print(e)
dbconnection = None
return dbconnection
def execute_query(query,conn):
dbconnection=initiate_connection(conn)
try:
cursor = dbconnection.cursor()
print ('Cursor Created!')
return cursor.execute(query)
except Exception as e:
print(e)
return None
start_time = time()
query='''SELECT * FROM test_for_cx_oracle'''
try:
cx_read_query = execute_query(query,ecspat_c)
time_after_execute_query = time()
print('Query Executed')
columns = [i[0] for i in cx_read_query.description]
time_after_getting_columns = time()
except Exception as e:
print(e)
print(time_after_execute_query-start_time,time_after_getting_columns-time_after_execute_query)
Unfortunately, this is a bug in the Oracle Client libraries. You will see it if you attempt to fetch the same rowid value multiple times in consecutive rows. If you avoid that situation all is well. You can also set the environment variable ORA_OCI_NO_OPTIMIZED_FETCH to the value 1 before you run the query to avoid the problem.
This has been reported earlier here: https://github.com/oracle/python-cx_Oracle/issues/120
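If you want to apply that workaround from the script itself, setting the variable before connecting and running the query should be enough; a minimal sketch:

import os

# workaround mentioned above: disable the optimized fetch path for repeated ROWID values
os.environ["ORA_OCI_NO_OPTIMIZED_FETCH"] = "1"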

asyncpg fetch feedback (python)

I have been using psycopg2 to manage items in my PostgreSQL database. Recently someone suggested that I could improve my database transactions by using asyncio and asyncpg in my code. I have looked around Stack Overflow and read through the documentation for examples. I have been able to create tables and insert records, but I haven't been able to get the execution feedback that I desire.
For example in my psycopg2 code, I can verify that a table exists or doesn't exist prior to inserting records.
def table_exists(self, verify_table_existence, name):
    '''Verifies the existence of a table within the PostgreSQL database'''
    try:
        self.cursor.execute(verify_table_existence, name)
        answer = self.cursor.fetchone()[0]
        if answer == True:
            print('The table - {} - exists'.format(name))
            return True
        else:
            print('The table - {} - does NOT exist'.format(name))
            return False
    except Exception as error:
        logger.info('An error has occurred while trying to verify the existence of the table {}'.format(name))
        logger.info('Error message: {}'.format(error))  # .format() was outside the call in the original
        sys.exit(1)
I haven't been able to get the same feedback using asyncpg. How do I accomplish this?
import asyncpg
import asyncio

async def main():
    conn = await asyncpg.connect('postgresql://postgres:mypassword@localhost:5432/mydatabase')
    answer = await conn.fetch('''
        SELECT EXISTS (
            SELECT 1
            FROM pg_tables
            WHERE schemaname = 'public'
            AND tablename = 'test01'
        ); ''')
    await conn.close()

    #####################
    # the fetch returns
    # [<Record exists=True>]
    # but prints 'The table does NOT exist'
    #####################

    if answer == True:
        print('The table exists')
    else:
        print('The table does NOT exist')

asyncio.get_event_loop().run_until_complete(main())
You used fetchone()[0] with psycopg2, but just fetch(...) with asyncpg. The former will retrieve the first column of the first row, while the latter will retrieve a whole list of rows. Being a list, it doesn't compare as equal to True.
To fetch a single value from a single row, use something like answer = await conn.fetchval(...).
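Applied to the code in the question, that change would look roughly like this (same query and table name as above):

    answer = await conn.fetchval('''
        SELECT EXISTS (
            SELECT 1
            FROM pg_tables
            WHERE schemaname = 'public'
            AND tablename = 'test01'
        ); ''')
    await conn.close()
    if answer:
        print('The table exists')
    else:
        print('The table does NOT exist')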

Multi-threading SQLAlchemy and return results to ObjectListView

I have run into another issue with a program I am working on. Basically what my program does is it takes up to 4 input files, processes them and stores the information I collect from them in a SQLite3 database on my computer. This has allowed me to view the data any time I want without having to run the input files again. The program uses a main script that is essentially just an AUI Notebook that imports an input script, and output scripts to use as panels.
To add the data to the database I am able to use threading, since I am not returning the results directly to my output screen(s). However, when I need to view the entire contents of my main table I end up loading 25,000 records. While these are loading, my GUI is locked and almost always displays "Program not responding".
I would like to use threading/multiprocessing to grab the 25k records from the database and load them into my ObjectListView widget(s) so that my GUI remains usable during this process. When I attempted to use a threading class similar to the one that adds the data to the database, I got nothing returned. When I say I get nothing I am not exaggerating.
So here is my big question, is there a way to thread the query and return the results without using global variables? I have not been able to find a solution with an example that I could understand, but I may be using the wrong search terms.
Here are the snippets of code pertaining to the issue at hand:
This is what I use to make sure the data is ready for my ObjectListView widget.
class OlvMainDisplay(object):
    def __init__(self, id, name, col01, col02, col03, col04, col05,
                 col06, col07, col08, col09, col10, col11,
                 col12, col13, col14, col15):
        self.id = id
        self.name = name
        self.col01 = col01
        self.col02 = col02
        self.col03 = col03
        self.col04 = col04
        self.col05 = col05
        self.col06 = col06
        self.col07 = col07
        self.col08 = col08
        self.col09 = col09
        self.col10 = col10
        self.col11 = col11
        self.col12 = col12
        self.col13 = col13
        self.col14 = col14
        self.col15 = col15
The 2 tables I am pulling data from:
class TableMeta(base):
    __tablename__ = 'meta_extra'
    id = Column(String(20), ForeignKey('main_data.id'), primary_key=True)
    col06 = Column(String)
    col08 = Column(String)
    col02 = Column(String)
    col03 = Column(String)
    col04 = Column(String)
    col09 = Column(String)
    col10 = Column(String)
    col11 = Column(String)
    col12 = Column(String)
    col13 = Column(String)
    col14 = Column(String)
    col15 = Column(String)

class TableMain(base):
    __tablename__ = 'main_data'
    id = Column(String(20), primary_key=True)
    name = Column(String)
    col01 = Column(String)
    col05 = Column(String)
    col07 = Column(String)
    extra_data = relation(
        TableMeta, uselist=False, backref=backref('main_data', order_by=id))
I use 2 queries to collect from these 2 tables, one grabs all records while the other one is part of a function definition that takes multiple dictionaries and applies filters based on the dictionary contents. Both queries are part of my main "worker" script that is imported by each of my notebook panels.
Here is the function that applies the filter(s):
def multiFilter(theFilters, table, anOutput, qType):
    session = Session()
    anOutput = session.query(table)
    try:
        for x in theFilters:
            for attr, value in x.items():
                anOutput = anOutput.filter(getattr(table, attr).in_(value))
    except AttributeError:
        for attr, value in theFilters.items():
            anOutput = anOutput.filter(getattr(table, attr).in_(value))
    anOutput = convertResults(anOutput.all())
    return anOutput
    session.close()
theFilters can be either a single dictionary or a list of dictionaries, hence the try statement. Once the function has applied the filters, it runs the returned results through another function that passes each result through the OlvMainDisplay class and adds them to a list to be handed to the OLV widget.
Again the big question, is there a way to thread the query (or queries) and return the results without using global variables? Or possibly grab around 200 records at a time and add the data "in chunks" to the OLV widget?
Thank you in advance.
-MikeS
--UPDATE--
I have reviewed "how to get the return value from a thread in python" and the accepted answer does not return anything or still locked the GUI (not sure what is causing the variance). I would like to limit the number of threads created to about 5 at the most.
--New Update--
I made some corrections to the filter function.
You probably don't want to load the entire database into memory at once; that is usually a bad idea. Because ObjectListView is a wrapper around ListCtrl, I would recommend using the virtual version of the underlying widget. The flag is wx.LC_VIRTUAL. Take a look at the wxPython demo for an example, but basically you load data on demand via the virtual methods OnGetItemText(), OnGetItemImage(), and OnGetItemAttr(). Note that those are the ListCtrl methods; it may be different in OLV land. Anyway, I know that the OLV version is called VirtualObjectListView and works in much the same way. I'm pretty sure there's an example in the source download.
Ok, I finally managed to get the query to run in a thread and be able to display the results in a standard ObjectListView. I used the answer HERE with some modifications.
I added the code to my main worker script which is imported into my output panel as EW.
Since I am not passing arguments to my query these lines were changed:
def start(self, params):
    self.thread = threading.Thread(target=self.func, args=params)

to

def start(self):
    self.thread = threading.Thread(target=self.func)
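For context, the worker class being modified might look roughly like this. This is only a sketch of the pattern from the linked answer (run the function on a background thread and keep its return value), not the exact class; the poster's version passes self.func straight to Thread, whereas here a small wrapper captures the result:

import threading

class ThreadWorker(object):
    """Run func on a background thread and keep its return value (sketch)."""
    def __init__(self, func):
        self.func = func
        self.thread = None
        self.results = None

    def start(self):
        self.thread = threading.Thread(target=self._run)
        self.thread.start()

    def _run(self):
        self.results = self.func()

    def get_results(self):
        # read after the thread has finished (the poster waits, then calls this)
        return self.results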
In my output panel I changed how I call upon my default query, the one that returns 25,000+ records. In my output panel's init I added self.worker = () as a placeholder and in my function that runs the default query:
def defaultView(self, evt):
    self.worker = EW.ThreadWorker(EW.defaultQuery)
    self.worker.start()
    pub.sendMessage('update.statusbar', msg='Full query started.')
I also added:
def threadUpdateOLV(self):
    time.sleep(10)
    anOutput = self.worker.get_results()
    self.dataOLV.SetObjects(anOutput)

pub.subscribe(self.threadUpdateOLV, 'thread.completed')
The time.sleep(10) was added after trial and error to get the full 25,000+ results; I found a 10-second delay worked fine.
And finally, at the end of my default query I added the PubSub send right before my output return:
    wx.CallAfter(pub.sendMessage, 'thread.completed')
    return anOutput
    session.close()
To be honest I am sure there is a better way to accomplish this, but as of right now it is serving the purpose needed. I will work on finding a better solution though.
Thanks
-Mike S
