Unload multiple files from Redshift to S3


Hi, I am trying to unload multiple tables from Redshift to a particular S3 bucket, but I get the error below:
psycopg2.InternalError: Specified unload destination on S3 is not empty. Consider using a different bucket / prefix, manually removing the target files in S3, or using the ALLOWOVERWRITE option.
If I add the 'allowoverwrite' option to the unload statement, each table overwrites the previous one, so only the last table ends up in S3.
This is my code:
import psycopg2

def unload_data(r_conn, aws_iam_role, datastoring_path, region, table_name):
    unload = '''unload ('select * from {}')
    to '{}'
    credentials 'aws_iam_role={}'
    manifest
    gzip
    delimiter ',' addquotes escape parallel off '''.format(table_name, datastoring_path, aws_iam_role)
    print ("Exporting table to datastoring_path")
    cur = r_conn.cursor()
    cur.execute(unload)
    r_conn.commit()

def main():
    host_rs = 'dataingestion.*********.us******2.redshift.amazonaws.com'
    port_rs = '5439'
    database_rs = '******'
    user_rs = '******'
    password_rs = '********'
    rs_tables = [ 'Employee', 'Employe_details' ]
    iam_role = 'arn:aws:iam::************:role/RedshiftCopyUnload'
    s3_datastoring_path = 's3://mysamplebuck/'
    s3_region = 'us_*****_2'
    print ("Exporting from source")
    src_conn = psycopg2.connect(host = host_rs,
                                port = port_rs,
                                database = database_rs,
                                user = user_rs,
                                password = password_rs)
    print ("Connected to RS")
    for i, tabe in enumerate(rs_tables):
        if tabe[0] == tabe[-1]:
            print("No files to read!")
        unload_data(src_conn, aws_iam_role = iam_role, datastoring_path = s3_datastoring_path, region = s3_region, table_name = rs_tables[i])
        print (rs_tables[i])

if __name__=="__main__":
    main()

Redshift is complaining that you are unloading every table to the same destination.
This is like copying all the files on your computer into a single directory: files with the same name get overwritten.
Give each table its own datastoring_path (prefix), for example (the rstrip avoids a double slash, because datastoring_path already ends with '/'):
.format(table_name, datastoring_path.rstrip('/') + '/' + table_name, aws_iam_role)
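As a minimal sketch of that idea, the statement can be built by a small helper that derives one prefix per table. The function name build_unload_sql is hypothetical, and ending the prefix with '/' is an assumption made here so that each table gets its own "folder"; the UNLOAD options are copied from the question.

```python
def build_unload_sql(table_name, base_path, iam_role):
    """Return an UNLOAD statement that writes table_name under its own S3 prefix."""
    # Per-table prefix, e.g. s3://mysamplebuck/Employee/
    # (the trailing slash is an assumption of this sketch, not part of the answer).
    target = "{}/{}/".format(base_path.rstrip("/"), table_name)
    return (
        "unload ('select * from {table}')\n"
        "to '{target}'\n"
        "credentials 'aws_iam_role={role}'\n"
        "manifest\n"
        "gzip\n"
        "delimiter ',' addquotes escape parallel off"
    ).format(table=table_name, target=target, role=iam_role)
```

Inside unload_data, the result of build_unload_sql(table_name, datastoring_path, aws_iam_role) could be passed to cur.execute() in place of the hand-formatted string.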

Related

Python read oldest file first

I have object and attribute data in separate CSV files, and there are 3 different types of objects.
The directory may contain other files, but I only have to read and process the object and attribute files. After reading an object file, I have to read its respective attribute file.
Below are the code and the file names:
import os
import pandas as pd

# dir_path is defined elsewhere and is expected to end with a path separator
plant = []
flower = []
person = []
for file_name in os.listdir(dir_path):
    if os.path.isfile(os.path.join(dir_path, file_name)):
        if file_name.startswith('plant_file'):
            plant.append(file_name)
        if file_name.startswith('person_file'):
            person.append(file_name)
        if file_name.startswith('flower_file'):
            flower.append(file_name)
for file_name in person:
    object_file_path = dir_path + file_name
    attribute_file_path = dir_path + file_name.replace('file','attributes_file')
    read_object_csv = pd.read_csv(object_file_path)
    read_attribute_csv = pd.read_csv(attribute_file_path)
for file_name in flower:
    object_file_path = dir_path + file_name
    attribute_file_path = dir_path + file_name.replace('file','attributes_file')
    read_object_csv = pd.read_csv(object_file_path)
    read_attribute_csv = pd.read_csv(attribute_file_path)
The file names contain the date and time in the format YYYYMMDDHHMMSS. Sample file names are:
plant_attributes_file_20221013134403.csv
plant_attributes_file_20221013142151.csv
plant_attributes_file_20221013142455.csv
plant_file_20221013134403.csv
plant_file_20221013142151.csv
plant_file_20221013142455.csv
person_file_20221012134948.csv
person_file_20221012140706.csv
person_attributes_file_20221012134948.csv
person_attributes_file_20221012140706.csv
How can I sort the file names in each list by timestamp, so that the oldest file is loaded first and the latest file last?
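One possible approach, sketched under the assumption that every name ends in _YYYYMMDDHHMMSS.csv as in the samples above, is to sort each list with a key that parses the trailing timestamp. The helper names file_timestamp and oldest_first are made up for this sketch.

```python
from datetime import datetime

def file_timestamp(file_name):
    """Parse the trailing YYYYMMDDHHMMSS stamp, e.g. person_file_20221012134948.csv."""
    stem = file_name.rsplit(".", 1)[0]   # drop the extension
    stamp = stem.rsplit("_", 1)[1]       # text after the last underscore
    return datetime.strptime(stamp, "%Y%m%d%H%M%S")

def oldest_first(file_names):
    """Return the names sorted from oldest to newest timestamp."""
    return sorted(file_names, key=file_timestamp)

person = ["person_file_20221012140706.csv", "person_file_20221012134948.csv"]
print(oldest_first(person))
# ['person_file_20221012134948.csv', 'person_file_20221012140706.csv']
```

Because the stamp is fixed-width and ordered from year down to second, sorting on the raw stamp string would give the same order; parsing it with strptime additionally raises an error for a malformed name.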

Tableau Server how to bulk update data source server name?

Our Oracle database was moved to a new server, so it has a new server name. Most of our published workbooks on Tableau Server connect to this Oracle database. The username and password remain the same, but the server address has changed. I used the following Python code. It identifies the right workbooks that need the server address update, but it produces this error:
PW Update Failed with error:
404004: Resource Not Found
Datasource '5f125136-22da-48d0-bdc7-8e5edde8d809' could not be found.
import tableauserverclient as TSC
import re

tableau_auth = TSC.TableauAuth('site_admin_username', 'site_admin_password', site_id='default') # site_id not needed if there is only one
search_server_regex = 'oldserver123' # server to search
replace_server = 'newserver123' # use if server name/address is changing- otherwise make it the same as search_server
overwrite_credentials = False # set to false to use existing credentials
search_for_certain_users = True # set to True if you only want to update connections for certain usernames
search_username = 'username'
replace_username = 'username'
replace_pw = 'password'
request_options = TSC.RequestOptions(pagesize=1000) # this needs to be > # of workbooks/data connections on the site
server = TSC.Server('http://tableau_server:8000') # tableau server
y = 0 # to keep track of how many are changed
try:
    with server.auth.sign_in(tableau_auth):
        all_workbooks, pagination_item = server.workbooks.get(req_options=request_options)
        print("Total Workbooks to Search: {}".format(len(all_workbooks)))
        for wb in all_workbooks:
            server.workbooks.populate_connections(wb)
            for item,conn in enumerate(wb.connections): #make sure to iterate through all connections in the workbook
                if wb.connections[item].connection_type != 'sqlproxy': #sqlproxy indicates published datasource
                    if re.search(search_server_regex ,wb.connections[item].server_address,re.IGNORECASE):
                        connection = wb.connections[item]
                        if search_for_certain_users and re.search(search_username, connection.username, re.IGNORECASE):
                            # print(wb.name, '-', connection.connection_type)
                            connection.server_address = replace_server
                            connection.embed_password = False
                            if overwrite_credentials:
                                connection.embed_password = True
                                connection.username = replace_username
                                connection.password = replace_pw
                            server.datasources.update_connection(wb, connection)
                            y = y + 1
                        elif not search_for_certain_users:
                            # print(wb.name, '-', connection.connection_type)
                            connection.server_address = replace_server
                            connection.embed_password = False
                            if overwrite_credentials:
                                connection.embed_password = True
                                connection.username = replace_username
                                connection.password = replace_pw
                            server.datasources.update_connection(wb, connection)
                            y = y + 1
        print("Workbook Connections Changed: {}".format(y))
except Exception as e:
    print("PW Update Failed with error: {}".format(e))
    print("Connections Updated: {}".format(y))
How can I fix this code?

You have to update the connection through the workbooks endpoint rather than the datasources endpoint:
server.workbooks.update_connection(wb, connection)
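To show where that call sits in the loop, here is a minimal sketch. The function name repoint_workbook_connections is hypothetical; server is assumed to be an already signed-in TSC.Server and workbooks the list returned by server.workbooks.get(). Only populate_connections and update_connection on server.workbooks are used, both of which appear in the question and answer above.

```python
import re

def repoint_workbook_connections(server, workbooks, old_server_regex, new_server):
    """Point matching embedded connections at new_server; return how many changed."""
    changed = 0
    for wb in workbooks:
        server.workbooks.populate_connections(wb)
        for connection in wb.connections:
            if connection.connection_type == 'sqlproxy':
                continue  # published data source, not an embedded connection
            address = connection.server_address or ''
            if re.search(old_server_regex, address, re.IGNORECASE):
                connection.server_address = new_server
                # Workbook connections are updated through the workbooks endpoint.
                server.workbooks.update_connection(wb, connection)
                changed += 1
    return changed
```

The username filter and credential overwrite from the question are left out to keep the sketch short; they would go just before the update_connection call.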

I ran into the same problem, and I hope my solution fixes yours. I created a "helper" class that has one attribute called "id":
class datasource_id:
    def __init__(self, id):
        self.id = id
I put the class at the top of my code. Then I replaced the lines:
if overwrite_credentials:
    connection.embed_password = True
    connection.username = replace_username
    connection.password = replace_pw
server.datasources.update_connection(wb, connection)
with the code below in both places:
if overwrite_credentials:
    connection.embed_password = True
    connection.username = replace_username
    connection.password = replace_pw
d1 = datasource_id(wb.connections[item].datasource_id)
server.datasources.update_connection(d1, connection)
This works because server.datasources.update_connection reads the id attribute of its first argument and uses it as the data source id. Passing "wb" is therefore wrong: its id is the id of the workbook, not of a data source.

Linkedin web scraping snippet

I'm doing a web scraping project for university research. I started from an existing GitHub project, but it does not retrieve all the data.
The project works like this:
Search Google using keywords, for example: (accountant 'email me at' Google)
Extract a snippet.
Retrieve data from this snippet.
The issue is:
The extracted snippets look like this: " ... marketing division in 2009. For more information on career opportunities with our company, email me: vicki@productivedentist.com. Neighborhood Smiles, LLC ..."
The snippet does not show everything: the "..." hides information such as role and location. How can I retrieve all the information with the script?
from googleapiclient.discovery import build #For using Google Custom Search Engine API
import datetime as dt #Importing system date for the naming of the output file.
import sys
from xlwt import Workbook #For working on xls file.
import re #For email search using regex.

if __name__ == '__main__':
    # Create an output file name in the format "srch_res_yyyyMMdd_hhmmss.xls" in the output folder
    now_sfx = dt.datetime.now().strftime('%Y%m%d_%H%M%S')
    output_dir = './output/'
    output_fname = output_dir + 'srch_res_' + now_sfx + '.xls'
    search_term = sys.argv[1]
    num_requests = int(sys.argv[2])
    my_api_key = "replace_with_you_api_key" #Read readme.md to know how to get your api key.
    my_cse_id = "011658049436509675749:gkuaxghjf5u" #Google CSE which searches possible LinkedIn profile according to query.
    service = build("customsearch", "v1", developerKey=my_api_key)
    wb=Workbook()
    sheet1 = wb.add_sheet(search_term[0:15])
    wb.save(output_fname)
    sheet1.write(0,0,'Name')
    sheet1.write(0,1,'Profile Link')
    sheet1.write(0,2,'Snippet')
    sheet1.write(0,3,'Present Organisation')
    sheet1.write(0,4,'Location')
    sheet1.write(0,5,'Role')
    sheet1.write(0,6,'Email')
    sheet1.col(0).width = 256 * 20
    sheet1.col(1).width = 256 * 50
    sheet1.col(2).width = 256 * 100
    sheet1.col(3).width = 256 * 20
    sheet1.col(4).width = 256 * 20
    sheet1.col(5).width = 256 * 50
    sheet1.col(6).width = 256 * 50
    wb.save(output_fname)
    row = 1 #To insert the data in the next row.

    #Function to perform google search.
    def google_search(search_term, cse_id, start_val, **kwargs):
        res = service.cse().list(q=search_term, cx=cse_id, start=start_val, **kwargs).execute()
        return res

    for i in range(0, num_requests):
        # This is the offset from the beginning to start getting the results from
        start_val = 1 + (i * 10)
        # Make an HTTP request object
        results = google_search(search_term,
                                my_cse_id,
                                start_val,
                                num=10 #num value can be 1 to 10. It will give the no. of results.
                                )
        for profile in range (0, 10):
            snippet = results['items'][profile]['snippet']
            myList = [item for item in snippet.split('\n')]
            newSnippet = ' '.join(myList)
            contain = re.search(r'[\w\.-]+@[\w\.-]+', newSnippet)
            if contain is not None:
                title = results['items'][profile]['title']
                link = results['items'][profile]['link']
                org = "-NA-"
                location = "-NA-"
                role = "-NA-"
                if 'person' in results['items'][profile]['pagemap']:
                    if 'org' in results['items'][profile]['pagemap']['person'][0]:
                        org = results['items'][profile]['pagemap']['person'][0]['org']
                    if 'location' in results['items'][profile]['pagemap']['person'][0]:
                        location = results['items'][profile]['pagemap']['person'][0]['location']
                    if 'role' in results['items'][profile]['pagemap']['person'][0]:
                        role = results['items'][profile]['pagemap']['person'][0]['role']
                print(title[:-23])
                sheet1.write(row,0,title[:-23])
                sheet1.write(row,1,link)
                sheet1.write(row,2,newSnippet)
                sheet1.write(row,3,org)
                sheet1.write(row,4,location)
                sheet1.write(row,5,role)
                sheet1.write(row,6,contain[0])
                print('Wrote {} search result(s)...'.format(row))
                wb.save(output_fname)
                row = row + 1
    print('Output file "{}" written.'.format(output_fname))

How to avoid header while exporting BigQuery table in to Google Storage

I have developed the code below, which exports a BigQuery table to a Google Cloud Storage bucket. I want to merge the exported files into a single file without a header, so that downstream processes can use the file without any issue.
def export_bq_table_to_gcs(self, table_name):
    client = bigquery.Client(project=project_name)
    print("Exporting table {}".format(table_name))
    dataset_ref = client.dataset(dataset_name,
                                 project=project_name)
    dataset = bigquery.Dataset(dataset_ref)
    table_ref = dataset.table(table_name)
    size_bytes = client.get_table(table_ref).num_bytes
    # For tables bigger than 1GB uses Google auto split, otherwise export is forced in a single file.
    if size_bytes > 10 ** 9:
        destination_uris = [
            'gs://{}/{}{}*.csv'.format(bucket_name,
                                       f'{table_name}_temp', uid)]
    else:
        destination_uris = [
            'gs://{}/{}{}.csv'.format(bucket_name,
                                      f'{table_name}_temp', uid)]
    extract_job = client.extract_table(table_ref, destination_uris)  # API request
    result = extract_job.result()  # Waits for job to complete.
    if result.state != 'DONE' or result.errors:
        raise Exception('Failed extract job {} for table {}'.format(result.job_id, table_name))
    else:
        print('BQ table(s) export completed successfully')
    storage_client = storage.Client(project=gs_project_name)
    bucket = storage_client.get_bucket(gs_bucket_name)
    blob_list = bucket.list_blobs(prefix=f'{table_name}_temp')
    print('Merging shard files into single file')
    # list_blobs returns an iterator; materialize it into a list before composing
    bucket.blob(f'{table_name}.csv').compose(list(blob_list))
Can you please help me find a way to skip the header?
Thanks,
Raghunath.

You can avoid the header by passing a job config with the print_header parameter set to False. Sample code:
job_config = bigquery.job.ExtractJobConfig(print_header=False)
extract_job = client.extract_table(table_ref, destination_uris,
                                   job_config=job_config)
Thanks

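Alternatively, if the exported files are later loaded back into BigQuery or read as an external table, the reading side can skip the header row with skipLeadingRows (https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#externalDataConfiguration.googleSheetsOptions.skipLeadingRows)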

CouchDB change Database directory

I am trying to change the directory of the CouchDB database. I am using a Python script to import a CSV file into CouchDB. The script runs fine. Here it is, just in case:
from couchdbkit import Server, Database
from couchdbkit.loaders import FileSystemDocsLoader
from csv import DictReader
import sys, subprocess, math, os

def parseDoc(doc):
    for k,v in doc.items():
        if (isinstance(v,str)):
            #print k, v, v.isdigit()
            # #see if this string is really an int or a float
            if v.isdigit()==True: #int
                doc[k] = int(v)
            else: #try a float
                try:
                    if math.isnan(float(v))==False:
                        doc[k] = float(v)
                except:
                    pass
    return doc

def upload(db, docs):
    db.bulk_save(docs)
    del docs
    return list()

def uploadFile(fname, dbname):
    #connect to the db
    theServer = Server()
    db = theServer.get_or_create_db(dbname)
    #loop on file for upload
    reader = DictReader(open(fname, 'rU'), dialect = 'excel')
    docs = list()
    checkpoint = 100
    i = 0
    for doc in reader:
        newdoc = parseDoc(doc)
        docs.append(newdoc)
        if len(docs)%checkpoint==0:
            docs = upload(db,docs)
        i += 1
        print('Number : %d' % i)
    #don't forget the last batch
    docs = upload(db,docs)

if __name__=='__main__':
    x = '/media/volume1/Crimes_-_2001_to_present.csv'
    filename = x
    dbname = 'test'
    uploadFile(filename, dbname)
I have seen plenty of posts on how to change the directory where the database is stored. If I leave /etc/couchdb/local.ini as it is (the original after installation), the script appends data to the default directory /var/lib/couchdb/1.0.1/. When I modify local.ini to store the database on another disk:
database_dir = /media/volume1
view_index_dir = /media/volume1
and restart the CouchDB service, I get this error:
restkit.errors.RequestError: socket.error: [Errno 111] Connection refused
I have checked the open sockets (CouchDB uses port 5984 by default) and the port is not open. But I get no errors when I start the CouchDB service.
Any ideas how to fix it?

I think the error occurs because you changed the directory location in local.ini, but when you try to connect to the existing database, CouchDB cannot find it there.
So move the database_name.couch file to the new location that you set in local.ini, and then try to connect again. I think this should work.
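The move itself could be scripted along these lines. This is only a sketch: the helper name move_couch_files is made up, and it assumes the CouchDB service is stopped while the files are moved. It may also be worth checking that the account CouchDB runs under can write to the new directory.

```python
import shutil
from pathlib import Path

def move_couch_files(old_dir, new_dir):
    """Move every *.couch file from old_dir into new_dir; return the moved file names."""
    source = Path(old_dir)
    target = Path(new_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the new location if needed
    moved = []
    for couch_file in sorted(source.glob("*.couch")):
        shutil.move(str(couch_file), str(target / couch_file.name))
        moved.append(couch_file.name)
    return moved
```

With the paths from the question, the call would be move_couch_files('/var/lib/couchdb/1.0.1', '/media/volume1'), followed by starting the service again.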
