I'm stuck on this part. I'm extracting data from Reddit using PRAW, and I need to collect everything I extract into a dictionary and then store that data in a PostgreSQL database. The for loop works and extracts all the values I need, but at the end only the last one is in the dict. I tried using a dict of lists, but then the same values are repeated several times. How can I insert all the data into my dict? I also tested other solutions I found here, but just got an error.
Here's my code:
class RedditExtract:
    def __init__(self, query, token):
        self.query = query
        self.token = token
        self.consulta = self.query.get("query")

    def searchQuery(self):
        reddit = praw.Reddit(
            client_id=REDDIT_CLIENT_ID,
            client_secret=REDDIT_CLIENT_SECRET,
            user_agent="extracting for reddit",
        )
        subreddit = reddit.subreddit("all").search(self.consulta)
        submission = reddit.submission
        top_subreddit = subreddit
        itemB = {}
        con = Conexion()
        for submission in top_subreddit:
            try:
                user = submission.author
                reditor = reddit.redditor(user)
                itemB["id"] = reditor.id
                print("id: " + itemB["id"])
                itemB["name"] = submission.fullname
                #print("name: " + itemB["name"])
                itemB["username"] = submission.author.name
                #print("username: " + itemB["username"])
                itemB["red"] = 13
                #print("red: " + str(itemB["red"]))
                itemB["type"] = "b"
                #print("type: " + str(itemB["type"]))
                itemB["karma"] = submission.author.total_karma
                #print("karma: " + str(itemB["karma"]))
                itemB["avatar"] = reditor.icon_img
                #print("url icon username: " + itemB["avatar"])
                itemB["extract_date"] = datetime.today().strftime("%Y-%m-%d %H:%M:%S")
                #print("extract date: " + itemB["extract_date"])
                itemB["created_at"] = datetime.fromtimestamp(int(submission.created_utc))
                #print("created at: " + str(itemB["created_at"]))
            except:
                print("No se hallo ID del usuario, se omite el post")
The prints are just to evaluate that PRAW extracts the data correctly.
PS: I use PRAW 7.5.0 and Python 3.8 with PyCharm.
I tried using lists to store each key's value and then using the lists to create the dictionary, but just got the same values repeating several times.
Also, tried to create another for to store keys and store values, but many values were missing.
I want to have something like this:
{'id':'kshdh''jajsjs''kasjs''asmjs'...,'name':'asrat''omes',...}
And then, from that dictionary, insert in each column (key) the values (value) in a PostgreSQL database.
TABLE:
I actually got a dict like this:
{'id': 'ajsgs,jhfhd,ajddg,ahsgys,...','name':'maaa,nnn,...',...} but the BIG problem with that is all values are string and I need 'red' and 'karma' to be integers, and can't cast them once in the dict.
My table in PostgreSQL is something like this:
CREATE TABLE IF NOT EXISTS public.salert_basic
(
id character varying(255) COLLATE pg_catalog."default" NOT NULL,
name character varying(255) COLLATE pg_catalog."default",
username character varying(255) COLLATE pg_catalog."default",
red integer,
extract_date timestamp without time zone,
created_at timestamp without time zone,
karma integer,
icon character varying COLLATE pg_catalog."default",
type character varying COLLATE pg_catalog."default",
CONSTRAINT salert_basic_pk PRIMARY KEY (id)
)
And the code to insert data from Python is this:
Conexion.con.autocommit = True
curser = Conexion.cursor
columns = itemB.keys()
for i in itemB.values():
    sql = '''insert into salert_basic(id,name,username,red,type,karma,icon,extraction_date,created_at) values{};'''.format(i)
    curser.execute(sql)
Conexion.con.commit()
Conexion.con.close()
This is how I created my dict:
itemB = defaultdict(list)
Then I fill it with this for each key:
itemB["name"].append(submission.fullname)
And finally, to concatenate the values of the list in the dict, I use this for:
for key in itemB:
    itemB[key] = ", ".join(itemB[key])
But as I said, to do this I have to cast my integers to strings, which I can't put into my database.
What do you say?
PS: How can I avoid the duplicate primary key error? There are some repeated ids.
UPDATE:
I checked the use of %s, I forgot about it.
Well... no, I need all the ids in the "id" key, but each one separated from the others, not like ahsgdshjgjsdgs..... Also, id is a PK, so duplicates are not allowed, but I think that with an ON CONFLICT DO NOTHING in the sql I can skip the duplicate and continue with the others.
Yeah, I try to insert each submission as a row in the database table, but it's giving me headaches.
I'm still not exactly sure what you are trying to achieve. Here is an attempt at something that I think does what you want:
class RedditExtract:
    def __init__(self, query, token):
        self.query = query
        self.token = token
        self.consulta = self.query.get("query")

    def searchQuery(self):
        reddit = praw.Reddit(
            client_id=REDDIT_CLIENT_ID,
            client_secret=REDDIT_CLIENT_SECRET,
            user_agent="extracting for reddit",
        )
        top_subreddit = reddit.subreddit("all").search(self.consulta)
        data_list = []
        con = Conexion()
        for submission in top_subreddit:
            # A fresh dict per submission, so earlier rows are not overwritten.
            item_dict = {}
            try:
                user = submission.author
                reditor = reddit.redditor(user)
                item_dict["id"] = reditor.id
                item_dict["name"] = submission.fullname
                item_dict["username"] = submission.author.name
                item_dict["red"] = 13
                item_dict["type"] = "b"
                item_dict["karma"] = submission.author.total_karma
                # Keys must match the named placeholders in the query below.
                item_dict["icon"] = reditor.icon_img
                item_dict["extract_date"] = datetime.today().strftime("%Y-%m-%d %H:%M:%S")
                item_dict["created_at"] = datetime.fromtimestamp(int(submission.created_utc))
                data_list.append(item_dict)
            except Exception:
                # "User ID not found, post skipped"
                print("No se hallo ID del usuario, se omite el post")

        # Column names follow the CREATE TABLE statement in the question.
        sql = """insert into salert_basic
            (id, name, username, red, type, karma, icon,
            extract_date, created_at)
            values
            (%(id)s, %(name)s, %(username)s, %(red)s, %(type)s, %(karma)s,
            %(icon)s, %(extract_date)s, %(created_at)s)"""
        curser = Conexion.cursor
        curser.executemany(sql, data_list)

        # If this is a large data set then it will perform better with
        # execute_batch instead of executemany:
        # from psycopg2.extras import execute_batch
        # execute_batch(curser, sql, data_list)
The above:
Creates a list of dicts
Modifies sql to use named placeholders so the values in the dict can be mapped to a placeholder.
Runs the sql with either executemany() or execute_batch(). They will iterate over the list and apply the values in each dict to the placeholders in the query string.
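To see the list-of-dicts pattern end to end, including the ON CONFLICT DO NOTHING idea from the question's update, here is a minimal sketch. It is not the PostgreSQL code itself: it assumes the standard-library sqlite3 module as a stand-in so that it runs without a server, uses a cut-down version of the table, and the row values are made up. With psycopg2 the placeholders would be written %(id)s rather than :id.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL so the sketch runs without a server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE salert_basic ("
    "id TEXT PRIMARY KEY, username TEXT, red INTEGER, karma INTEGER)"
)

# One dict per submission; integers stay integers. Values are invented.
rows = [
    {"id": "abc1", "username": "alice", "red": 13, "karma": 120},
    {"id": "def2", "username": "bob", "red": 13, "karma": 45},
    {"id": "abc1", "username": "alice", "red": 13, "karma": 120},  # repeated id
]

# sqlite3 spells named placeholders :name; psycopg2 spells them %(name)s.
sql = (
    "INSERT INTO salert_basic (id, username, red, karma) "
    "VALUES (:id, :username, :red, :karma) "
    "ON CONFLICT (id) DO NOTHING"
)
conn.executemany(sql, rows)
conn.commit()

stored = conn.execute(
    "SELECT id, karma FROM salert_basic ORDER BY id"
).fetchall()
print(stored)
```

The repeated id is skipped instead of raising a primary key error, and karma comes back as an integer because it was never turned into a string.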
Is there a way to aggregate values in a column in sqlite in one-to-many relationship into array?
For example, I have 2 tables like this:
Artists:
ArtistId name
1 AC/DC
2 Accept
Albums:
AlbumId ArtistId Title
1 1 For Those About To Rock We Salute You
2 1 Let There Be Rock
3 2 Balls to the Wall
4 2 Restless and Wild
When I just do a query with a join:
SELECT
Name,
Title
FROM
artists
JOIN albums USING(ArtistId)
WHERE artists.ArtistId = 1;
I get:
I found out that I can do group_concat:
SELECT
Name,
GROUP_CONCAT(Title)
FROM
artists
JOIN albums USING(ArtistId)
WHERE artists.ArtistId = 1;
To concatenate all values together:
But I still have to parse the comma-separated string of titles (For Those About To Rock We Salute You,Let There Be Rock) in the code to get the array of titles for each artist.
I use Python and I'd prefer to get something like a tuple for each row:
(name, titlesArray)
In my case it would be much easier to use json.dumps and json.loads to store all the "many" array members in the same row of the same table, instead of the recommended approach of keeping them in a separate table and retrieving them with joins. The "many" values are just an array on the object, so saving and loading them takes only those two functions. The alternative is to save the "many" values into a separate table manually, bind each one to its "one" value, use group_concat to join them into a string, and then parse that string to get my array back.
Is it possible to get an array of values, or do I have to do group_concat and parse the string?
You might not be able to receive an array from sqlite straight away, but you can achieve the result with a very little edit on your query and a single line in python.
group_concat supports a custom delimiter that you can use later to split the entries.
Let's assume you have something like this:
from typing import Tuple
import sqlite3

def connect(file: str) -> sqlite3.Connection:
    connection = None
    try:
        connection = sqlite3.connect(file)
    except sqlite3.Error:
        raise
    return connection

def select(connection: sqlite3.Connection) -> Tuple[str, str]:
    entry = None
    # The cursor has to be created before it can be used or closed.
    cursor = connection.cursor()
    try:
        sql = """
            SELECT
                Name,
                GROUP_CONCAT(Title)
            FROM artists
            JOIN albums USING(ArtistId)
            WHERE artists.ArtistId = 1;
        """
        cursor.execute(sql)
        reply = cursor.fetchone()
        if reply is not None:
            entry = reply
    except sqlite3.Error:
        raise
    finally:
        cursor.close()
    return entry
that you can use to connect to the database and select from it like so:
connection = connect(r"/path/to/file.sqlite3")
if connection is not None:
    entry = select(connection)
    connection.close()
It is not important if your query is inside a function or not, the important concept is that you are using python to do this query, and you can add some code to manipulate the values.
As you can see here group_concat accepts a separator that you can use to arbitrarily separate values.
Your new select function could be something like:
def select(connection: sqlite3.Connection) -> Tuple[str, Tuple[str, ...]]:
    entry = None
    separator = "|"
    cursor = connection.cursor()
    try:
        # The separator is passed as a bound parameter, not formatted into the SQL.
        sql = """
            SELECT
                Name,
                GROUP_CONCAT(Title, ?)
            FROM artists
            JOIN albums USING(ArtistId)
            WHERE artists.ArtistId = 1;
        """
        cursor.execute(sql, (separator,))
        reply = cursor.fetchone()
        if reply is not None and reply[1] is not None:
            # Rows are tuples and cannot be modified in place, so build a new one.
            entry = (reply[0], tuple(reply[1].split(separator)))
    except sqlite3.Error:
        raise
    finally:
        cursor.close()
    return entry
Without changing how you use this function, you would now have a tuple with all the titles.
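As a self-contained illustration of the separator idea, the sketch below builds the two tables from the question in an in-memory database and returns one (name, titles) tuple per artist. The GROUP BY clause and the choice of "|" as separator are assumptions added for the example; the separator must be a string that never occurs in a title.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artists (ArtistId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE albums (AlbumId INTEGER PRIMARY KEY, ArtistId INTEGER, Title TEXT);
    INSERT INTO artists VALUES (1, 'AC/DC'), (2, 'Accept');
    INSERT INTO albums VALUES
        (1, 1, 'For Those About To Rock We Salute You'),
        (2, 1, 'Let There Be Rock'),
        (3, 2, 'Balls to the Wall'),
        (4, 2, 'Restless and Wild');
""")

SEPARATOR = "|"  # assumption: never appears inside a title

rows = conn.execute(
    """
    SELECT Name, GROUP_CONCAT(Title, ?)
    FROM artists
    JOIN albums USING (ArtistId)
    GROUP BY artists.ArtistId
    ORDER BY artists.ArtistId
    """,
    (SEPARATOR,),
).fetchall()

# Split each concatenated string back into a tuple of titles.
result = [(name, tuple(titles.split(SEPARATOR))) for name, titles in rows]
for name, titles in result:
    print(name, titles)
```

SQLite does not guarantee the order of the concatenated titles, so sort them in Python if the order matters.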
Another idea you'd like to consider is to do a more specific select query, like:
select albums.Title
from albums
where albums.ArtistId = 1;
In this case, you can have a list of titles using: cursor.fetchall().
Of course the band name should be asked separately in this case.
I've scraped some websites and stored the html info in a sqlite database. Now, I want to extract and store the email addresses. I'm able to successfully extract and print the id and emails. But, I keep getting TypeError: "'NoneType' object is not subscriptable" and "sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type" when I try to update the database with these new email addresses.
I've verified that the data types I'm using in the update statement are the same as in my database (id is class int and email is str). I've googled a bunch of different examples and mucked around with the syntax a lot.
I also tried removing the Where Clause in the update statement but got the same errors.
import sqlite3
import re
conn = sqlite3.connect('spider.sqlite')
cur = conn.cursor()
x = cur.execute('SELECT id, html FROM Pages WHERE html is NOT NULL and email is NULL ORDER BY RANDOM()').fetchone()
#print(x)#for testing purposes
for row in x:
    row = cur.fetchone()
    id = row[0]
    html = row[1]
    email = re.findall(b'[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+', html)
    #print(email)#testing purposes
    if not email:
        email = 'no email found'
    print(id, email)
    cur.execute('''UPDATE pages SET email = ? WHERE id = ? ''', (email, id))
conn.commit
I want the update statement to update the database with the extracted email addresses for the appropriate row.
There are a few things going on here.
First off, you don't want to do this:
for row in x:
    row = cur.fetchone()
If you want to iterate over the results returned by the query, you should consider something like this:
for row in cur.fetchall():
    id = row[0]
    html = row[1]
    # ...
To understand the rest of the errors you are seeing, let's take a look at them step by step.
TypeError: "'NoneType' object is not subscriptable":
This is likely generated here:
row = cur.fetchone()
id = row[0]
Cursor.fetchone returns None if the executed query doesn't match any rows or if there are no rows left in the result set. The next line, then, is trying to do None[0] which would raise the error in question.
sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type:
re.findall returns a list of non-overlapping matches, not an individual match. There's no support for binding a Python list to a sqlite3 text column type. To fix this, you'll need to get the first element from the matched list (if it exists) and then pass that as your email parameter in the UPDATE.
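Putting both fixes together, a minimal sketch might look like the following. It assumes an in-memory database with a simplified Pages table, made-up page contents, and html stored as text rather than bytes; the regular expression is a simplified pattern for illustration, not a complete email matcher.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Pages (id INTEGER PRIMARY KEY, html TEXT, email TEXT)")
cur.executemany(
    "INSERT INTO Pages (id, html) VALUES (?, ?)",
    [(1, "<p>write to info@example.com</p>"), (2, "<p>nothing here</p>")],
)

EMAIL_RE = re.compile(r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+", re.IGNORECASE)

# fetchall() copies the result set, so the cursor can be reused for the UPDATEs.
pending = cur.execute(
    "SELECT id, html FROM Pages WHERE html IS NOT NULL AND email IS NULL"
).fetchall()

for page_id, html in pending:
    matches = EMAIL_RE.findall(html)  # always a list, possibly empty
    # Bind a single string, never the list itself.
    email = matches[0] if matches else "no email found"
    cur.execute("UPDATE Pages SET email = ? WHERE id = ?", (email, page_id))

conn.commit()  # with parentheses; a bare conn.commit does not call the method

emails = dict(conn.execute("SELECT id, email FROM Pages"))
print(emails)
```

Only the first match per page is stored here. If a page can contain several addresses worth keeping, a separate table with one row per address would be the usual alternative.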
.findall() returns a list.
You want to iterate over that list:
for email in re.findall(..., str(html)):
    print(id, email)
    cur.execute(...)
Not sure what's going on with that b'[a-z...' expression.
Recommend you use a raw string instead: r'[a-z...'.
It handles regex \ backwhacks nicely.
When making a "SELECT" query to my SQLite3 db, I'm finding it quite cumbersome to do something as simple as reaching into the db, grabbing a value, and having it look in my program exactly as it does in the db. I'm not familiar enough to know whether this is just how it is or whether there's a better way to do these kinds of things.
The first element of this is in regards to the fact that something like this:
player_name = c.execute("""SELECT name FROM players WHERE pid = ?""", (pid,))
print(player_name)
will yield something like this:
<sqlite3.Cursor object at 0x10c185110>
The only way I've found to extract the actual content is by iterating over the resulting cursor and storing the content inside a list:
player_name = []
sql = c.execute("""SELECT name FROM players WHERE pid = ?""", (pid,))
for result in sql:
    player_name.append(result)
player_name = str(player_name).replace("[", "").replace("]", "")
The other element is less of an issue, but I figured I'd throw it out there. It's the fact that each result from my above example comes out looking like this:
>>> print(player_name)
('Ben Johnson',)
In the end, to get the exact string out of the db and into my program I'm doing this for each query:
>>> player_name = []
>>> sql = c.execute("""SELECT name FROM players WHERE pid = ?""", (pid,))
>>> for result in sql:
...     player_name.append(result)
...
>>> player_name = str(player_name).replace("[", "").replace("]", "")
>>> player_name = str(player_name).replace("('", "").replace("',)", "")
>>>
>>> print(player_name)
Ben Johnson
The cursor object returned by execute contains a list of rows (actually, they are computed on demand); you are indeed supposed to iterate over the cursor with for or something like that.
Each row is a tuple that contains all column values. To get the first column, just do row[0]:
for row in c.execute("SELECT SomeColumn FROM ..."):
    print(row[0])
If you want to get a single column value from a single row, you can write a helper function:
def query_single_value(db, query, *parameters):
    cursor = db.execute(query, parameters)
    for row in cursor:
        return row[0]
    else:
        return None  # not found

print(query_single_value(c, "SELECT name FROM players WHERE pid = ?", pid))
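The same helper can also be written with fetchone(), which returns the next row or None. The sketch below is one possible variant, run against an in-memory database with a made-up players table so that it is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (pid INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO players VALUES (1, 'Ben Johnson')")


def query_single_value(db, query, *parameters):
    """Return the first column of the first row, or None if nothing matched."""
    row = db.execute(query, parameters).fetchone()
    return row[0] if row is not None else None


found = query_single_value(conn, "SELECT name FROM players WHERE pid = ?", 1)
missing = query_single_value(conn, "SELECT name FROM players WHERE pid = ?", 99)
print(found)
print(missing)
```

Because row[0] is already a plain Python str, no replace() calls are needed to strip brackets or quotes.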
I have a database where I have imported texts as primary keys.
I then have columns with keywords that can pertain to the texts, for example column "arson". Each of these columns has a default value of 0.
I am trying to get the SQLite3 database to read the texts, check for the presence of specific keywords, and then assign a 1 value to the keywords column, for the row where the text contained the keyword.
The below example is of me trying to change the values in the arson column only for rows where the text contains the words "Arson".
The program is reading the texts and printing yes 3 times, indicating that three of the texts have the words "Arson" in them. However, I cannot get the individual rows to update with 1's. I have tried a few variations of the code below but seem to be stuck on this one.
!# Python3
#import sqlite3
sqlite_file = 'C:\\Users\\xxxx\\AppData\\Local\\Programs\\Python\\Python35-32\\database.sqlite'
conn = sqlite3.connect(sqlite_file)
c = conn.cursor()
texts = c.execute("SELECT texts FROM database")
for articles in texts:
    for words in articles:
        try:
            if "Arson" in words:
                print('yes')
                x = articles
                c.execute("UPDATE database SET arson = 1 WHERE ID = ?" (x))
        except TypeError:
            pass
conn.commit()
conn.close()
This expression:
c.execute("UPDATE database SET arson = 1 WHERE ID = ?" (x))
always will raise a TypeError, because you are trying to treat the string as a function. You are basically doing "..."(argument), as if "..." were callable.
You'd need to add some commas for it to be an attempt to pass in x as a SQL parameter:
c.execute("UPDATE database SET arson = 1 WHERE ID = ?", (x,))
The first comma separates the two arguments passed to c.execute(), so now you pass a query string, and a separate sequence of parameters.
The second comma makes (..,) a tuple with one element in it. It is the comma that matters there, although the (...) parentheses are still needed to disambiguate what the comma represents.
You can drop the try...except TypeError altogether. If the code is still raising TypeError exceptions, you still have a bug.
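Both commas can be checked in isolation, without a database. In this small sketch the variable names and the query text are placeholders chosen for the illustration.

```python
not_a_tuple = ("some id")   # parentheses alone: still a plain string
one_tuple = ("some id",)    # the trailing comma makes a one-element tuple

query = "UPDATE database SET arson = 1 WHERE ID = ?"
try:
    query("some id")        # a string followed by (...) is a call attempt
except TypeError as exc:
    call_error = str(exc)

print(type(not_a_tuple).__name__, type(one_tuple).__name__)
print(call_error)
```

This is the TypeError that the original try...except was silently swallowing on every matching row.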
Four hours later I have finally been able to fix this. I added the commas as recommended above; however, this led to other issues, as the code did not execute the entire loop correctly. To do this, I had to add another cursor object and use the second cursor inside my loop. The revised code may be seen below:
!# Python3
import sqlite3
sqlite_file = 'C:\\Users\\xxxx\\AppData\\Local\\Programs\\Python\\Python35-32\\database.sqlite'
conn = sqlite3.connect(sqlite_file)
c = conn.cursor()
c2 = conn.cursor()
atexts = c.execute("SELECT texts FROM database")
for articles in atexts:
    for words in articles:
        if "arson" in words:
            print('yes')
            c2.execute("UPDATE database SET arson = 1 WHERE texts = ?", (words,))
conn.commit()
conn.close()
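The second cursor matters because calling execute() on a cursor discards the result set it was still iterating over, so the loop ends early when one cursor does both jobs. A minimal sketch of the two-cursor pattern follows; it assumes an in-memory database, a hypothetical table name articles, and made-up texts, and it matches the keyword case-insensitively.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (texts TEXT PRIMARY KEY, arson INTEGER DEFAULT 0)")
conn.executemany(
    "INSERT INTO articles (texts) VALUES (?)",
    [
        ("Arson suspected in warehouse fire",),
        ("Council approves budget",),
        ("Police investigate arson",),
    ],
)

read_cur = conn.cursor()   # iterates over the SELECT results
write_cur = conn.cursor()  # runs the UPDATEs without disturbing read_cur

for (text,) in read_cur.execute("SELECT texts FROM articles"):
    if "arson" in text.lower():
        write_cur.execute("UPDATE articles SET arson = 1 WHERE texts = ?", (text,))
conn.commit()

flagged = [
    t for (t,) in conn.execute(
        "SELECT texts FROM articles WHERE arson = 1 ORDER BY texts"
    )
]
print(flagged)
```

Reading everything first with fetchall() and then updating through the same cursor would be an alternative when the table fits in memory.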
I am retrieving information from a sqlite DB that gives me back around 20 million rows that I need to process. This information is then transformed into a dict of lists which I need to use. I am trying to use generators wherever possible.
Can someone please take a look at this code and suggest optimization please? I am either getting a “Killed” message or it takes a really long time to run. The SQL result set part is working fine. I tested the generator code in the Python interpreter and it doesn’t have any problems. I am guessing the problem is with the dict generation.
EDIT/UPDATE FOR CLARITY:
I have 20 million rows in my result set from my sqlite DB. Each row is of the form:
(2786972, 486255.0, 4125992.0, 'AACAGA', '2005')
I now need to create a dict that is keyed with the fourth element ‘AACAGA’ of the row. The value that the dict will hold is the third element, but it has to hold the values for all the occurrences in the result set. So, in our case here, ‘AACAGA’ will hold a list containing multiple values from the sql result set. The problem here is to find tandem repeats in a genome sequence. A tandem repeat is a genome read (‘AACAGA’) that is repeated at least three times in succession. For me to calculate this, I need all the values in the third index as a list keyed by the genome read, in our case ‘AACAGA’. Once I have the list, I can subtract successive values in the list to see if there are three consecutive matches to the length of the read. This is what I aim to accomplish with the dictionary and lists as values.
#!/usr/bin/python3.3
import sqlite3 as sql

sequence_dict = {}
tandem_repeat = {}

def dict_generator(large_dict):
    dkeys = large_dict.keys()
    for k in dkeys:
        yield(k, large_dict[k])

def create_result_generator():
    conn = sql.connect('sequences_mt_test.sqlite', timeout=20)
    c = conn.cursor()
    try:
        conn.row_factory = sql.Row
        sql_string = "select * from sequence_info where kmer_length > 2"
        c.execute(sql_string)
    except sql.Error as error:
        print("Error retrieving information from the database : ", error.args[0])
    result_set = c.fetchall()
    if result_set:
        conn.close()
    return(row for row in result_set)

def find_longest_tandem_repeat():
    sortList = []
    for entry in create_result_generator():
        sequence_dict.setdefault(entry[3], []).append(entry[2])
    for key,value in dict_generator(sequence_dict):
        sortList = sorted(value)
        for i in range (0, (len(sortList)-1)):
            if((sortList[i+1]-sortList[i]) == (sortList[i+2]-sortList[i+1])
                    == (sortList[i+3]-sortList[i+2]) == (len(key))):
                tandem_repeat[key] = True
                break
    print(max(k for k, v in tandem_repeat.items() if v))

if __name__ == "__main__":
    find_longest_tandem_repeat()
I got some help with this on Code Review, as @hivert suggested. Thanks. This is much better solved in SQL than in Python code. I was new to SQL and so could not write complex queries; someone helped me out with that.
SELECT *
FROM sequence_info AS middle
JOIN sequence_info AS preceding
ON preceding.sequence_info = middle.sequence_info
AND preceding.sequence_offset = middle.sequence_offset -
length(middle.sequence_info)
JOIN sequence_info AS following
ON following.sequence_info = middle.sequence_info
AND following.sequence_offset = middle.sequence_offset +
length(middle.sequence_info)
WHERE middle.kmer_length > 2
ORDER BY length(middle.sequence_info) DESC, middle.sequence_info,
middle.sequence_offset;
Hope this helps someone with around the same idea. Here is a link to the thread on codereview.stackexchange.com
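If the check does have to stay in Python, one way to avoid holding all 20 million rows in a dict is to let SQLite sort the rows and process one read at a time. The sketch below is only an illustration of that idea, not the code from the Code Review thread. It borrows the column names sequence_info, sequence_offset and kmer_length from the query above, uses an in-memory table with made-up rows, and assumes that min_copies=3 copies spaced exactly len(read) apart count as a tandem repeat.

```python
import sqlite3
from itertools import groupby


def has_tandem_run(offsets, step, min_copies=3):
    """True if sorted `offsets` holds min_copies positions spaced `step` apart."""
    run = 1
    for prev, cur in zip(offsets, offsets[1:]):
        run = run + 1 if cur - prev == step else 1
        if run >= min_copies:
            return True
    return False


def find_tandem_repeats(conn, min_copies=3):
    """Stream rows ordered by read, keeping only one read's offsets in memory."""
    query = (
        "SELECT sequence_info, sequence_offset FROM sequence_info "
        "WHERE kmer_length > 2 ORDER BY sequence_info, sequence_offset"
    )
    repeats = set()
    for kmer, rows in groupby(conn.execute(query), key=lambda r: r[0]):
        offsets = [offset for _, offset in rows]
        if has_tandem_run(offsets, len(kmer), min_copies):
            repeats.add(kmer)
    return repeats


# Tiny made-up data set: AACAGA occurs at 0, 6, 12; ACGT is scattered.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequence_info "
    "(sequence_info TEXT, sequence_offset REAL, kmer_length INTEGER)"
)
conn.executemany(
    "INSERT INTO sequence_info VALUES (?, ?, ?)",
    [("AACAGA", 0.0, 6), ("AACAGA", 6.0, 6), ("AACAGA", 12.0, 6),
     ("ACGT", 0.0, 4), ("ACGT", 10.0, 4), ("ACGT", 20.0, 4)],
)

repeats = find_tandem_repeats(conn)
print(repeats)
```

Iterating over the cursor instead of calling fetchall() means rows are pulled on demand, and groupby only needs the rows to arrive sorted by read, which the ORDER BY guarantees.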