Python BigQuery API - get table schema - python-3.x

I am trying to fetch the schema from a BigQuery table. Given sample code like:
from google.cloud import bigquery
from google.cloud import storage

client = bigquery.Client.from_service_account_json('service_account.json')

def test_extract_schema(client):
    project = 'bigquery-public-data'
    dataset_id = 'samples'
    table_id = 'shakespeare'
    dataset_ref = client.dataset(dataset_id, project=project)
    table_ref = dataset_ref.table(table_id)
    table = client.get_table(table_ref)  # API request
    # View table properties
    print(table.schema)

if __name__ == '__main__':
    test_extract_schema(client)
This returns a value like:
[SchemaField('word', 'STRING', 'REQUIRED', 'A single unique word (where whitespace is the delimiter) extracted from a corpus.', ()), SchemaField('word_count', 'INTEGER', 'REQUIRED', 'The number of times this word appears in this corpus.', ()), SchemaField('corpus', 'STRING', 'REQUIRED', 'The work from which this word was extracted.', ()), SchemaField('corpus_date', 'INTEGER', 'REQUIRED', 'The year in which this corpus was published.', ())]
However, I am trying to capture the schema only, in a format like
'word' 'STRING', 'word_count' 'INTEGER'
Is there any way to get this using an API call or any other method?

You can always take table.schema and iterate over it, since it is a list of SchemaField values:
result = ["{0} {1}".format(schema.name, schema.field_type) for schema in table.schema]
Result for that same dataset and table:
['word STRING', 'word_count INTEGER', 'corpus STRING', 'corpus_date INTEGER']
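If you want a single comma-separated string in the format from the question, you can simply join that list; a minimal sketch building on the same table object:
schema_pairs = ["{0} {1}".format(field.name, field.field_type) for field in table.schema]
# word STRING, word_count INTEGER, corpus STRING, corpus_date INTEGER
print(", ".join(schema_pairs))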


An alternative is, after you have your client and table instances, to do something like this:
import io
f = io.StringIO("")
client.schema_to_json(table.schema, f)
print(f.getvalue())
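As far as I know, schema_to_json also accepts a file path as the destination, so if you just want the schema dumped to disk you can skip the StringIO buffer (the file name here is only illustrative):
# Writes the schema as a JSON file; 'shakespeare_schema.json' is an arbitrary example path
client.schema_to_json(table.schema, 'shakespeare_schema.json')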

Related

Get value of column in AWS Glue Custom Transform

I'm working on an ETL job in AWS Glue. I need to decode text from a table that is base64-encoded, and I'm doing that in a Custom Transform in Python 3.
My code is below:
def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    import base64
    newdf = dfc.select(list(dfc.keys())[0]).toDF()
    data = newdf["email"]
    data_to_decrypt = base64.b64decode(data)
I got an error like this:
TypeError: argument should be a bytes-like object or ASCII string, not 'Column'
How do I get a plain string from the Column object?
I was wrong, and it was a completely different thing than I thought.
The Column object from newdf["email"] represents all rows for that single column, so it's not possible to just fetch one value from it.
What I ended up doing is iterating over whole rows and mapping them to new values, like this:
def map_row(row):
    id = row.id
    client_key = row.client_key
    email = decrypt_jasypt_string(row.email.strip())
    phone = decrypt_jasypt_string(row.phone.strip())
    created_on = row.created_on
    return (id, email, phone, created_on, client_key)

df = dfc.select(list(dfc.keys())[0]).toDF()
rdd2 = df.rdd.map(lambda row: map_row(row))
df2 = rdd2.toDF(["id", "email", "phone", "created_on", "client_key"])
dyf_filtered = DynamicFrame.fromDF(df2, glueContext, "does it matter?")
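For the original base64 case (without the Jasypt decryption that turned out to be needed), a per-row Spark UDF is another option; this is only a sketch, assuming the email column holds base64-encoded UTF-8 text:
import base64
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# The UDF is applied value by value, so it never sees a whole Column object
decode_b64 = udf(lambda s: base64.b64decode(s).decode("utf-8") if s else None, StringType())

df = dfc.select(list(dfc.keys())[0]).toDF()
df_decoded = df.withColumn("email", decode_b64(df["email"]))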

Inserting into a JSONB type column

import psycopg2, json, requests, hidden

# Load secrets
secrets = hidden.secrets()
conn = psycopg2.connect(host=secrets['host'], port=secrets['port'],
                        ...., connect_timeout=3)
cur = conn.cursor()

defaulturl = 'https://pokeapi.co/api/v2/pokemon?limit=100&offset=0'

sql = '''
CREATE TABLE IF NOT EXISTS pokeapi
(id INTEGER, body JSONB);
'''
print(sql)
cur.execute(sql)

response = requests.get(defaulturl)
js = json.loads(response.text)

# js is a dictionary and I'm interested in the values of the 'results' key.
results = js['results']

# 'results' is a list of dictionaries and I want to loop through each element of the list
# and extract the value of the 'url' key.
# I NEED TO INSERT EACH VALUE INTO pokeapi (body), note that 'body' is of type JSONB
for x in range(len(results)):
    body = requests.get(results[x]['url'])
    js_body = json.loads(body.text)
    sql = f"INSERT INTO pokeapi (body) VALUES ('{js_body}')::JSONB"
    cur.execute(sql, (defaulturl))

print('Closing database connection...')
conn.commit()
cur.close()
This script keeps throwing back an error:
CREATE TABLE IF NOT EXISTS pokeapi (id INTEGER, body text); Traceback
(most recent call last): File "pokeapi.py", line 45, in
cur.execute(sql, (defaulturl)) psycopg2.errors.SyntaxError: syntax error at or near "{" LINE 1: INSERT INTO pokeapi (body) VALUES
{'abilities': [{'ability':...
I have tried to insert into pokeapi (body) without casting to JSONB, but I keep getting the same error back. Is there something fundamental that I'm missing?
You should pass the JSON string as-is, without parsing it, as a query parameter, and without the quotes and the cast:
js_body = body.text
sql = "INSERT INTO pokeapi (body) VALUES (%s)";
cur.execute(sql, [js_body])
IMPORTANT: DO NOT USE format on random internet data! Always use
psycopg2's built-in parameter handling. It will correctly handle SQL
injection risks for you.
Currently you aren't using defaulturl; if you want to insert it, you need a column to insert it into. Also, you need to make the id auto-increment:
sql = '''
CREATE TABLE IF NOT EXISTS pokeapi
("id" int8 NOT NULL GENERATED BY DEFAULT AS IDENTITY, body JSONB);
'''
Otherwise you will have to supply an id along with the body.
Finally, you should generally avoid executing one INSERT per loop iteration. If you have the memory for it, collect the payloads in a list and then use execute_values(): https://www.psycopg.org/docs/extras.html
from psycopg2.extras import execute_values

rows = list()
for result in results:
    response = requests.get(result['url'])
    rows.append([response.text])

sql = "INSERT INTO pokeapi (body) VALUES %s"
sql_template = "(%s)"
execute_values(cur, sql, rows, sql_template)
(Also, for future reference, the requests library has a .json() method on responses which loads the JSON string into Python primitives for you. That said, you don't need to parse the JSON in this case. https://docs.python-requests.org/en/master/user/quickstart/#json-response-content)
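As a related aside (not part of the original answer), psycopg2 also ships a Json adapter in psycopg2.extras, which lets you pass an already-parsed Python object as a jsonb parameter; a minimal sketch:
from psycopg2.extras import Json

# Adapt a parsed Python dict directly to a jsonb parameter
payload = requests.get(results[0]['url']).json()
cur.execute("INSERT INTO pokeapi (body) VALUES (%s)", [Json(payload)])
conn.commit()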
Here is the solution I ended up with. What I learnt was the need to deserialize the response into a proper Python dictionary, and then serialize that dictionary again before casting it to the JSONB type.
import psycopg2, json, requests, hidden

# Load secrets
secrets = hidden.secrets()
conn = psycopg2.connect(host=secrets['host'], port=secrets['port'],
                        ...., connect_timeout=3)
cur = conn.cursor()

defaulturl = 'https://pokeapi.co/api/v2/pokemon?limit=100&offset=0'

sql = '''
CREATE TABLE IF NOT EXISTS pokeapi
(id SERIAL, body JSONB); -- <== CREATING id OF SERIAL TYPE HELPS AUTO-
                         --     GENERATE ids OF INTEGER TYPE.
'''
print(sql)
cur.execute(sql)

response = requests.get(defaulturl)
js = response.json()  # <== THIS IS ONE OF THE CORRECTIONS: I NEEDED TO
                      #     DESERIALIZE THE RESPONSE SO THAT IT'S A PROPER
                      #     PYTHON DICTIONARY

# js is a dictionary and I'm interested in the values of the 'results' key.
results = js['results']

# 'results' is a list of dictionaries and I want to loop through each element of the list
# and extract the value of the 'url' key.
# I NEED TO INSERT EACH VALUE INTO pokeapi (body), note that 'body' is of type JSONB
for x in range(len(results)):
    body = requests.get(results[x]['url'])
    js_body = json.dumps(body.json())  # <== 2ND MAJOR CORRECTION: I HAVE TO
                                       #     SERIALIZE THE PYTHON DICTIONARY/LIST
                                       #     TO BE ABLE TO CAST IT TO JSONB BELOW
    sql = f"INSERT INTO pokeapi (body) VALUES ('{js_body}'::JSONB)"
    cur.execute(sql)

print('Closing database connection...')
conn.commit()
cur.close()

How to convert sql query to list?

I am trying to convert my SQL query output into a list that looks a certain way.
Here is my code:
def get_sf_metadata():
    import sqlite3

    # Tables I want to be dynamically created
    table_names = ['AcceptedEventRelation', 'Asset', 'Book']

    # SQLite connection
    conn = sqlite3.connect('aaa_test.db')
    c = conn.cursor()

    # Select the metadata table records
    c.execute("select name, type from sf_field_metadata1 limit 10")
    print(list(c))

get_sf_metadata()
Here is my output:
[('Id', 'id'), ('RelationId', 'reference'), ('EventId', 'reference')]
Is there any way to make the output look like this:
[Id id, RelationId reference, EventId reference]
You can try
print(["{} {}".format(i[0], i[1]) for i in list(c)])
That will print:
['Id id', 'RelationId reference', 'EventId reference']
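Equivalently (a small variation, not from the original answer), you can unpack each row from fetchall():
rows = c.fetchall()
# ['Id id', 'RelationId reference', 'EventId reference']
print(["{} {}".format(name, col_type) for name, col_type in rows])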

Unable to create table in Amazon redshift using Psycopg2

I am trying to write a simple script in Python which fetches data from an endpoint, converts it into a dataframe, and writes it to an Amazon Redshift cluster, and then to automate the script using a cron job on AWS. I am using psycopg2 to connect to the Redshift cluster, and the script executes the commands without errors (it creates the table in Redshift and writes the data as well). But when I try to see the table from a SQL client, the table doesn't show up.
from pandas.io.json import json_normalize
import json
import pandas as pd
import requests
import psycopg2

mm_get = requests.get('endpoint link')
mm_json = mm_get.json()

data_1 = json_normalize(data=mm_json['data'],
                        record_path=['courses', 'modules'],
                        record_prefix='courses.modules.',
                        meta=[['courses', 'id'],
                              ['courses', 'title'],
                              'activated',
                              'createdAt',
                              'email',
                              'employeeId',
                              'firstName',
                              'group',
                              'id',
                              'lastName',
                              'phone',
                              'teams'
                              ]
                        )
data_2 = json_normalize(data=mm_json['data'],
                        record_path='lessons',
                        record_prefix='lessons.',
                        meta='id',
                        meta_prefix='user.'
                        )
data_3 = data_1.merge(
    data_2,
    how='outer',
    left_on=['courses.modules.id', 'id'],
    right_on=['lessons.moduleId', 'user.id']
)

cols = data_3.columns
cols = cols.tolist()
cols = pd.DataFrame(cols)
re_cols = pd.DataFrame(cols.loc[:, 0].str.replace('.', '_').tolist(), index=cols.index)

data_3.teams = data_3.teams.astype(str)
data_3.teams = data_3.teams.str.replace('[', '')
data_3.teams = data_3.teams.str.replace(']', '')
data_3.teams = data_3.teams.str.replace("'", "")

con = psycopg2.connect(dbname='name',
                       host='hostname',
                       port='xxxx', user='username', password='password')
cur = con.cursor()
cur.execute('create table testing_learn.test (courses_modules_completionDate DATE, courses_modules_id int, courses_modules_status TEXT, courses_modules_title TEXT, courses_id int, courses_title TEXT, activated bool, createdAt TIMESTAMP, email TEXT, employeeId TEXT, firstName TEXT, group_name TEXT, id TEXT, lastname TEXT, phone int8, teams TEXT, lessons_courseId int, lessons_date DATE, lessons_id int, lessons_lessonNumber int, lessons_moduleId int, lessons_score TEXT, lessons_title TEXT, user_id int);')
cur.close()

data_mat = data_3.as_matrix()
str_mat = b','.join(cur.mogrify('(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)', x) for x in tuple(map(tuple, data_mat)))
cur = con.cursor()
cur.execute('insert into testing_learn.test VALUES ' + str_mat.decode('utf-8'))
I am able to see the data when I query the same table from Python using psycopg2, but the table doesn't show up in the SQL client. It would be of great help if anyone could point out what I am doing wrong here. Thanks in advance.
According to the Psycopg2 2.7.5 official documentation, the main entry points of Psycopg2 include:
The class connection encapsulates a database session. It allows you to:
create new cursor instances using the cursor() method to execute database commands and queries,
terminate transactions using the methods commit() or rollback().
Therefore, you need to call con.commit() every time after you call cur.execute() to make the changes to the database persistent. Otherwise your table won't show up in the database.
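As a minimal sketch (the table testing_learn.demo is only illustrative, not from the question), the pattern looks like this; a psycopg2 connection used as a context manager also commits the transaction when the block exits without an exception:
import psycopg2

con = psycopg2.connect(dbname='name', host='hostname',
                       port='xxxx', user='username', password='password')
cur = con.cursor()

# Hypothetical minimal table, only to illustrate the commit pattern
cur.execute('create table if not exists testing_learn.demo (id int, note varchar(64));')
cur.execute('insert into testing_learn.demo values (%s, %s);', (1, 'hello'))
con.commit()  # without this, other sessions (e.g. a SQL client) never see the table or rows

# Alternative: the with block commits automatically on successful exit
with con:
    with con.cursor() as cur:
        cur.execute('insert into testing_learn.demo values (%s, %s);', (2, 'world'))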

Get BigQuery table schema using google.cloud

I can, for example, get BigQuery data into local Python with:
import os
from google.cloud import bigquery
project_id = "example-project"
dataset_id = "example_dataset"
table_id = "table_id"
os.environ["GOOGLE_CLOUD_PROJECT"] = project_id
bq = bigquery.Client()
query = "SELECT * FROM {}.{} LIMIT 5".format(dataset_id, table_id)
resp = bq.run_sync_query(query)
resp.run()
data_list = resp.rows
The result:
print(data_list)
>>> [('BEDD', '1',), ('A75', '1',), ('CE3F', '1',), ('0D8C', '1',), ('3E9C', '1',)]
How do I then go and get the schema for this table? Such that, for example
headings = ('heading1', 'heading2')
# or
schema_dict = {'fields': [{'name': 'heading1', 'type': 'STRING'}, {'name': 'heading2', 'type': 'STRING'}]}
You can use the schema attribute of your resp variable.
After running the query you can retrieve it:
schema = resp.schema
schema will be a list containing the definition for each column in your query.
As an example, let's say this is your query:
query = "select '1' as fv, STRUCT<i INT64, j INT64> (1, 2) t from `dataset.table` limit 1"
The schema will be a list containing 2 entries:
[<google.cloud.bigquery.schema.SchemaField at 0x7ffa64fe6e50>,
<google.cloud.bigquery.schema.SchemaField at 0x7ffa64fe6b10>]
For each object in schema, you have the attributes field_type, fields, mode and name, so if you run:
schema[0].field_type, schema[0].mode, schema[0].name
The result is "STRING", "NULLABLE", "fv".
Since the second column is a record, if you run:
schema[1].field_type, schema[1].mode, schema[1].name, schema[1].fields
The result is:
"RECORD", "NULLABLE", "t", [google schema 1, google schema 2]
Where google schema 1 contains the definition for the inner fields within the record.
As far as I know, there's no built-in way of getting a dictionary like the one you showed in your question, which means you'll have to loop over the entries in schema and build it yourself. It should be simple, though. I'm not sure this works as-is, since I haven't fully tested it, but it might give you an idea of how to do it:
def extract_schema(schema_resp):
    l = []
    for schema_obj in schema_resp:
        r = {}
        r['name'] = schema_obj.name
        r['type'] = schema_obj.field_type
        r['mode'] = schema_obj.mode
        if schema_obj.fields:
            r['fields'] = extract_schema(schema_obj.fields)
        l.append(r)
    return l
So you'd just have to run schema = extract_schema(resp.schema) and (hopefully) you'll be good to go.
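To end up with something shaped like the schema_dict and headings from the question, you can then wrap the result of that function (a small sketch building on the answer's extract_schema):
schema = extract_schema(resp.schema)
schema_dict = {'fields': schema}  # e.g. {'fields': [{'name': 'fv', 'type': 'STRING', 'mode': 'NULLABLE'}, ...]}
headings = tuple(field['name'] for field in schema)  # e.g. ('fv', 't')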
