Get BigQuery table schema using google.cloud - python-3.x

I can, for example, get BigQuery data into local Python with:
import os
from google.cloud import bigquery
project_id = "example-project"
dataset_id = "exapmle_dataset"
table_id = "table_id"
os.environ["GOOGLE_CLOUD_PROJECT"] = project_id
bq = bigquery.Client()
query = "SELECT * FROM {}.{} LIMIT 5".format(dataset_id, table_id)
resp = bq.run_sync_query(query)
resp.run()
data_list = resp.rows
The result:
print(data_list)
>>> [('BEDD', '1',), ('A75', '1',), ('CE3F', '1',), ('0D8C', '1',), ('3E9C', '1',)]
How do I then go and get the schema for this table? Such that, for example
headings = ('heading1', 'heading2')
# or
schema_dict = {'fields': [{'name': 'heading1', 'type': 'STRING'}, {'name': 'heading2', 'type': 'STRING'}]}

You can use the schema attribute of your resp variable.
After running the query you can retrieve it:
schema = resp.schema
schema will be a list containing the definition for each column in your query.
As an example, let's say this is your query:
query = "select '1' as fv, STRUCT<i INT64, j INT64> (1, 2) t from `dataset.table` limit 1"
The schema will be a list containing 2 entries:
[<google.cloud.bigquery.schema.SchemaField at 0x7ffa64fe6e50>,
<google.cloud.bigquery.schema.SchemaField at 0x7ffa64fe6b10>]
For each object in schema, you have the attributes field_type, fields, mode and name, so if you run:
schema[0].field_type, schema[0].mode, schema[0].name
The result is "STRING", "NULLABLE", "fv".
As the second column is a record, if you run:
schema[1].field_type, schema[1].mode, schema[1].name, schema[1].fields
The result is:
"RECORD", "NULLABLE", "t", [google schema 1, google schema 2]
Where google schema 1 contains the definition for the inner fields within the record.
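For instance, you can pull the inner definitions out of the record the same way, since each entry in fields is itself a SchemaField (a small illustrative snippet, values shown are indicative only):
```
inner = [(f.name, f.field_type, f.mode) for f in schema[1].fields]
# likely something like [('i', 'INTEGER', 'NULLABLE'), ('j', 'INTEGER', 'NULLABLE')]
```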
As far as I know, there's no built-in way to get a dictionary like the one you showed in your question, so you'll have to loop over the entries in schema and build it yourself. It should be simple, though. I haven't fully tested the following, but it might give you an idea of how to do it:
def extract_schema(schema_resp):
    l = []
    for schema_obj in schema_resp:
        r = {}
        r['name'] = schema_obj.name
        r['type'] = schema_obj.field_type
        r['mode'] = schema_obj.mode
        if schema_obj.fields:
            r['fields'] = extract_schema(schema_obj.fields)
        l.append(r)
    return l
So you'd just have to run schema = extract_schema(resp.schema) and (hopefully) you'll be good to go.
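And if all you need are the column headings from your first example, a one-liner along these lines should do (same attributes, just collected into a tuple):
```
headings = tuple(field.name for field in resp.schema)
# e.g. ('heading1', 'heading2')
```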

Related

Get sqlalchemy table Model and Field objects from strings?

Very simple: I am trying to run a query in Python 3 SQLAlchemy to delete some records, given string names of the table and field to query against.
1. How do you get the table object from a string?
2. Given 1., how do you run a query via the ORM with just a string of the field name?
I would assume all ORMs have an internal mapping or a get-style method keyed on the name.
json_config = [
    {"table": "tableA",
     "field": "modified_on",
     "expires": 30},
    {"table": "tableB",
     "field": "event_on",
     "expires": 30}
]

for table_conf_item in self.json_config:
    table_name = table_conf_item["table"]
    field_name = table_conf_item["field"]
    expire_after = table_conf_item["expires"]
    table_obj = self.orm_session.TABLES[table_name]
    field_obj = self.orm_session.TABLES[table_name].FIELDS[field_name]
    result = self.orm_session.delete(table_obj).where(field_obj < expire_after)
    self.orm_session.commit()
    print(f"{table_name}: removed {result.row_count} objects")
Given the table's name, you can use reflection to get a Table object. Using SQLAlchemy's core layer, this is reasonably straightforward:
import sqlalchemy as sa
engine = sa.create_engine(...)
metadata = sa.MetaData()
tbl = sa.Table(name_of_table, metadata, autoload_with=engine)
If you want to work with multiple tables, it may be more efficient to store them in a MetaData instance for later access:
metadata = sa.MetaData()
metadata.reflect(engine, only=list_of_table_names)
tbl = metadata.tables[name_of_table]
Once you have a Table object you can reference columns by name like this: tbl.c[name_of_field].
Full example:
import sqlalchemy as sa

# Setup
engine = sa.create_engine('sqlite://', echo=True, future=True)
tbl = sa.Table(
    't',
    sa.MetaData(),
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('foo', sa.Integer),
)
tbl.create(engine)
with engine.begin() as conn:
    vals = [42, 43, 42, 43, 56, 87, 89]
    conn.execute(tbl.insert(), [{'foo': v} for v in vals])
del tbl

# Reflect the table.
metadata = sa.MetaData()
metadata.reflect(engine, only=['t'])
tbl = metadata.tables['t']

# Define some statements.
q1 = sa.select(tbl).where(tbl.c['foo'] == 42)
q2 = sa.select(tbl.c['id'], tbl.c['foo']).where(tbl.c['foo'] == 43)
q3 = sa.delete(tbl).where(tbl.c['foo'] != 42)

# Execute the statements.
with engine.connect() as conn:
    rows = conn.execute(q1)
    for row in rows:
        print(row)
    print()

    rows = conn.execute(q2)
    for row in rows:
        print(row)
    print()

with engine.begin() as conn:
    conn.execute(q3)

with engine.connect() as conn:
    rows = conn.execute(q1)
    for row in rows:
        print(row)
    print()
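Applying the core-layer approach to the configuration from the question might look roughly like this. This is only a sketch: it assumes json_config is the list shown in the question, that expires is a number of days, and that the named field holds a datetime; the connection URL is a placeholder.
```
import datetime
import sqlalchemy as sa

engine = sa.create_engine("sqlite://")  # replace with your real connection URL
metadata = sa.MetaData()
metadata.reflect(engine, only=[item["table"] for item in json_config])

with engine.begin() as conn:
    for item in json_config:
        tbl = metadata.tables[item["table"]]
        field = tbl.c[item["field"]]
        cutoff = datetime.datetime.now() - datetime.timedelta(days=item["expires"])
        result = conn.execute(sa.delete(tbl).where(field < cutoff))
        print(f"{item['table']}: removed {result.rowcount} rows")
```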
Doing the same through the ORM layer is more complicated, as table and column names must be mapped to ORM entity classes (models) and their attributes. This replicates the previous example for a simple mapping (it assumes the same initial data as above).
import sqlalchemy as sa
from sqlalchemy import orm

Base = orm.declarative_base()

class Thing(Base):
    __tablename__ = 't'
    id = sa.Column(sa.Integer, primary_key=True)
    thing_foo = sa.Column('foo', sa.Integer)

engine = sa.create_engine(...)
Base.metadata.create_all(engine)
Session = orm.sessionmaker(engine, future=True)

tablename = 't'
columnname = 'foo'

with Session.begin() as s:
    # Get the mappers for the Base class.
    mappers = Base.registry.mappers
    # Get the mapper for our table.
    mapper = next(m for m in mappers if m.entity.__tablename__ == tablename)
    # Get the entity class (Thing).
    entity = mapper.entity
    # Get the column from the Table.
    table_column = mapper.selectable.c[columnname]
    # Get the mapper property that corresponds to the column
    # (the entity attribute may have a different name to the
    # column in the database).
    mapper_property = mapper.get_property_by_column(table_column)
    # Get the queryable entity attribute (Thing.thing_foo).
    attr = mapper.all_orm_descriptors[mapper_property.key]

    q = sa.select(entity).where(attr != 42)
    entities = s.scalars(q)
    for entity in entities:
        s.delete(entity)

with Session() as s:
    for thing in s.scalars(sa.select(Thing)):
        print(thing.id, thing.thing_foo)

Error iterating over Join query results using Peewee (PostgreSQL)

I am not able to iterate through my query the way I would like using Peewee.
These are the related models in Models.py:
class Conversation(peewee.Model):
    id = peewee.AutoField(unique=True, index=True)
    creation_date = peewee.DateTimeField(default=datetime.now)
    contact_id = ForeignKeyField(Contact, backref='conversation')
    launch_id = ForeignKeyField(Launch, backref='conversation')
    request_data = peewee.TextField(null=True)
    status = peewee.TextField(null=True)

class Contact(peewee.Model):
    id = peewee.AutoField(unique=True, index=True)
    uuid = peewee.CharField(default=shortuuid.uuid, index=True)
    whatsapp_phone = peewee.CharField(index=True, default='')
    status = peewee.CharField(default='init')
    conversationId = peewee.CharField(null=True)
Here's how I am trying to iterate:
for conversation in Conversation.select().where(Conversation.launch_id == str(launch_id)):
    print(conversation.contact.id)
And this is the error that I am getting:
print(conversation.contact.id)
AttributeError: 'Conversation' object has no attribute 'contact'
I've tried to change the way I do my query:
query = Conversation.select(Contact).join(Contact).where(Conversation.launch_id == str(launch_id))
But I get the exact same error if I iterate in the same way.
The issue is you are, for some reason, trying to access .contact when you've named your foreign-key .contact_id. The peewee docs are clear about foreign key naming, but you want this:
class Conversation(peewee.Model):
    id = peewee.AutoField(unique=True, index=True)
    creation_date = peewee.DateTimeField(default=datetime.now)
    # Data will be stored in a column named "contact_id":
    contact = ForeignKeyField(Contact, backref='conversations')
    # Data will be stored in a column named "launch_id":
    launch = ForeignKeyField(Launch, backref='conversations')
    request_data = peewee.TextField(null=True)
    status = peewee.TextField(null=True)
This allows:
query = (Conversation
         .select()
         .where(Conversation.launch == str(launch_id)))

for conversation in query:
    # Access the underlying foreign-key value.
    print(conversation.contact_id)
Or, if you intend to access other fields on the Contact:
query = (Conversation
         .select(Conversation, Contact)
         .join(Contact)
         .where(Conversation.launch == str(launch_id)))

for conversation in query:
    # We now have a "full" Contact instance we can access efficiently:
    print(conversation.contact.id)
Please read the docs:
http://docs.peewee-orm.com/en/latest/peewee/quickstart.html#lists-of-records
http://docs.peewee-orm.com/en/latest/peewee/relationships.html
http://docs.peewee-orm.com/en/latest/peewee/models.html#foreignkeyfield

Script to insert data into tables dynamically using Flask SQLAlchemy

I understand that the documented way to insert data into a table looks like
```
class Table(db.Model):
    __tablename__ = 'table'
    id = db.Column(db.Integer, primary_key=True)
    data = db.Column(db.String(50))
    ...

insert = Table(id='0', data='new data')
```
However, I am working on a project that has multiple tables all with different columns, lengths, and data. I have worked out how to get the dynamic data into a dict, prepped to create rows. Below is my actual code:
def load_csv_data(self, ctx):
    data_classes = [Locations, Scents, Classes]
    data_tables = ['locations', 'scents', 'classes']
    tables = len(data_tables)
    for i in range(tables):
        with open('./development/csv/{}.csv'.format(data_tables[i]), newline='') as times_file:
            times_reader = csv.reader(times_file, delimiter=',', quotechar='|')
            for row in times_reader:
                data_columns = data_classes[i].__table__.columns
                columns = len(data_columns)
                insert_data = {}
                for col in range(columns):
                    row_key = data_columns[col].key
                    row_value = row[col]
                    insert_data.update({row_key: row_value})
The challenge I am having is finding a way to do the actual insert based on these dynamic params. So if the above returns:
insert_data = {val1: val2, val3: val4, val5: val6}
I would like to convert this to:
insert = Table(val1='val2', val3='val4', val5='val6')
Everything I have tried so far has raised an __init__() missing 2 required positional arguments: error.
Anyone have any thoughts on how I might accomplish this?
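The usual way to pass a dynamic dict as keyword arguments is ** unpacking. A minimal sketch, assuming the dict keys match the model's column attribute names and the model does not define a custom __init__ with required positional arguments (db.session here is the standard Flask-SQLAlchemy session):
```
# Expand the dynamic dict into keyword arguments, equivalent to
# Table(val1='val2', val3='val4', val5='val6').
insert = data_classes[i](**insert_data)
db.session.add(insert)
db.session.commit()
```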

Unable to create table in Amazon redshift using Psycopg2

I am trying to make a simple script in Python which will fetch data from an endpoint, convert it into a dataframe, and write it to an Amazon Redshift cluster, then automate the script using a cron job on AWS. I am using psycopg2 to connect to the Redshift cluster, and the script executes the commands well enough (it creates the table in Redshift and writes the data as well). But when I try to view the table from a SQL client, the table doesn't show up.
from pandas.io.json import json_normalize
import json
import pandas as pd
import requests
import psycopg2

mm_get = requests.get('endpoint link')
mm_json = mm_get.json()
data_1 = json_normalize(data=mm_json['data'],
                        record_path=['courses', 'modules'],
                        record_prefix='courses.modules.',
                        meta=[['courses', 'id'],
                              ['courses', 'title'],
                              'activated',
                              'createdAt',
                              'email',
                              'employeeId',
                              'firstName',
                              'group',
                              'id',
                              'lastName',
                              'phone',
                              'teams'])
data_2 = json_normalize(data=mm_json['data'],
                        record_path='lessons',
                        record_prefix='lessons.',
                        meta='id',
                        meta_prefix='user.')
data_3 = data_1.merge(data_2,
                      how='outer',
                      left_on=['courses.modules.id', 'id'],
                      right_on=['lessons.moduleId', 'user.id'])
cols = data_3.columns
cols = cols.tolist()
cols = pd.DataFrame(cols)
re_cols = pd.DataFrame(cols.loc[:, 0].str.replace('.', '_').tolist(), index=cols.index)
data_3.teams = data_3.teams.astype(str)
data_3.teams = data_3.teams.str.replace('[', '')
data_3.teams = data_3.teams.str.replace(']', '')
data_3.teams = data_3.teams.str.replace("'", "")
con = psycopg2.connect(dbname='name',
                       host='hostname',
                       port='xxxx', user='username', password='password')
cur = con.cursor()
cur.execute('create table testing_learn.test (courses_modules_completionDate DATE, courses_modules_id int, courses_modules_status TEXT, courses_modules_title TEXT, courses_id int, courses_title TEXT, activated bool, createdAt TIMESTAMP, email TEXT, employeeId TEXT, firstName TEXT, group_name TEXT, id TEXT, lastname TEXT, phone int8, teams TEXT, lessons_courseId int, lessons_date DATE, lessons_id int, lessons_lessonNumber int, lessons_moduleId int, lessons_score TEXT, lessons_title TEXT, user_id int);')
cur.close()
data_mat = data_3.as_matrix()
str_mat = b','.join(cur.mogrify('(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)', x) for x in tuple(map(tuple, data_mat)))
cur = con.cursor()
cur.execute('insert into testing_learn.test VALUES ' + str_mat.decode('utf-8'))
I am able to see the data when I query the same table from Python using psycopg2, but the same table doesn't show up in the SQL client. It would be of great help if anyone could point out what I am doing wrong here. Thanks in advance.
According to the Psycopg2 2.7.5 official documentation, one of the main entry points of Psycopg2 is the connection class, which encapsulates a database session. It allows you to create new cursor instances using the cursor() method to execute database commands and queries, and to terminate transactions using the commit() or rollback() methods.
Therefore, you need to call con.commit() after your cur.execute() calls to make the changes to the database persistent. Otherwise your table won't show up to other sessions or SQL clients.
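A minimal sketch of that pattern, reusing the connection from the question (the DDL is abbreviated here):
```
cur = con.cursor()
cur.execute('create table testing_learn.test ( ... );')  # full column list as in the question
con.commit()  # make the CREATE TABLE visible to other sessions / SQL clients

cur.execute('insert into testing_learn.test VALUES ' + str_mat.decode('utf-8'))
con.commit()  # persist the inserted rows as well
cur.close()
con.close()
```
Alternatively, setting con.autocommit = True before executing makes each statement commit implicitly.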

Python BigQuery API - get table schema

I am trying to fetch the schema from a BigQuery table, given sample code like:
from google.cloud import bigquery
from google.cloud import storage

client = bigquery.Client.from_service_account_json('service_account.json')

def test_extract_schema(client):
    project = 'bigquery-public-data'
    dataset_id = 'samples'
    table_id = 'shakespeare'
    dataset_ref = client.dataset(dataset_id, project=project)
    table_ref = dataset_ref.table(table_id)
    table = client.get_table(table_ref)  # API Request
    # View table properties
    print(table.schema)

if __name__ == '__main__':
    test_extract_schema(client)
This is returning value like:
[SchemaField('word', 'STRING', 'REQUIRED', 'A single unique word (where whitespace is the delimiter) extracted from a corpus.', ()), SchemaField('word_count', 'INTEGER', 'REQUIRED', 'The number of times this word appears in this corpus.', ()), SchemaField('corpus', 'STRING', 'REQUIRED', 'The work from which this word was extracted.', ()), SchemaField('corpus_date', 'INTEGER', 'REQUIRED', 'The year in which this corpus was published.', ())]
Where I am trying to capture the schema only in a format like:
'word' 'STRING', 'word_count' 'INTEGER'
Is there any way to get this using API call or any other method?
You can always take the table.schema attribute and iterate over it, since it is a list of SchemaField values:
result = ["{0} {1}".format(schema.name,schema.field_type) for schema in table.schema]
Result for that same dataset and table:
['word STRING', 'word_count INTEGER', 'corpus STRING', 'corpus_date INTEGER']
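If you want the quoted format from the question exactly, the same idea with a different format string should work (an untested variation):
```
schema_str = ", ".join("'{0}' '{1}'".format(field.name, field.field_type)
                       for field in table.schema)
# e.g. "'word' 'STRING', 'word_count' 'INTEGER', ..."
```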
An alternative is, after you have your client and table instances, to do something like this:
import io
f = io.StringIO("")
client.schema_to_json(table.schema, f)
print(f.getvalue())
