I want to load a large Excel table into AWS Redshift. Loading it with Python's psycopg2 takes a long time, so I am trying SQLAlchemy instead, but the redshift-sqlalchemy documentation is confusing, so I would rather use the regular SQLAlchemy library. The code below can pull data from AWS Redshift, but I don't know how to modify it to INSERT data into Redshift. If possible, I would like to INSERT all the data at once.
import pandas as pd
import psycopg2
from sqlalchemy import create_engine
from sqlalchemy import text
sql = """
SELECT top 10 * FROM pg_user;
"""
redshift_endpoint1 = "YourDBname.cksrxes2iuiu.us-east-1.redshift.amazonaws.com"
redshift_user1 = "YourUserName"
redshift_pass1 = "YourRedshiftPassword"
port1 = 8192  # whatever your Redshift port number is
dbname1 = "YourDBname"
engine_string = "postgresql+psycopg2://%s:%s@%s:%d/%s" \
    % (redshift_user1, redshift_pass1, redshift_endpoint1, port1, dbname1)
engine1 = create_engine(engine_string)
df1 = pd.read_sql_query(text(sql), engine1)
df = pd.DataFrame({ 'id':['444'],'id2':[555]})
df.to_sql('YourTable', con=engine1,if_exists='append',index= False)
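One way to send the rows in fewer round trips is the `method='multi'` option of pandas' `to_sql`, combined with a `chunksize`, which packs many rows into each INSERT statement. The sketch below is only an illustration: it uses an in-memory SQLite database as a stand-in so that it runs anywhere, and the table name `your_table` and the chunk size are assumptions. Against Redshift you would pass `engine1` as `con` instead. For very large tables, staging the file in S3 and running Redshift's COPY command is generally much faster than INSERT statements.

```python
import sqlite3

import pandas as pd

# Stand-in for the Redshift engine so the sketch runs locally (assumption).
con = sqlite3.connect(":memory:")

# 1,000 example rows shaped like the frame in the question.
df = pd.DataFrame({"id": [str(i) for i in range(1000)], "id2": list(range(1000))})

# method="multi" sends many rows per INSERT; chunksize caps the rows per statement.
df.to_sql("your_table", con=con, if_exists="append", index=False,
          method="multi", chunksize=100)

# Read the row count back to confirm that every chunk was written.
row_count = pd.read_sql_query("SELECT COUNT(*) AS n FROM your_table", con)["n"].iloc[0]
print(row_count)
```

The right `chunksize` depends on the number of columns and on statement size limits of the target database, so treat 100 as a starting point rather than a recommendation.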
I'm using Databricks. I created a Delta Lake table for my data. Then I tried to modify a column using the pandas API, but for some reason the following error message pops up:
ValueError: Cannot combine the series or dataframe because it comes from a different dataframe. In order to allow this operation, enable 'compute.ops_on_diff_frames' option.
I use the following code to rewrite data in the table:
df_new = spark.read.format('delta').load(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/{delta_name}")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from math import *
from pyspark.pandas.config import set_option
import pyspark.pandas as ps
%matplotlib inline
win_len = 5000
# For this be sure you have runtime 1.11 or earlier version
df_new = df_new.pandas_api()
print('Creating Average active power for U1 and V1...')
df_new['p_avg1'] = df_new.Current1.mul(df_new['Voltage1']).rolling(min_periods=1, window=win_len).mean()
print('Creating Average active power for U2 and V2...')
df_new['p_avg2'] = df_new.Current2.mul(df_new['Voltage2']).rolling(min_periods=1, window=win_len).mean()
print('Creating Average active power for U3 and V3...')
df_new['p_avg3'] = df_new.Current3.mul(df_new['Voltage3']).rolling(min_periods=1, window=win_len).mean()
print('Creating Average active power for U4 and V4...')
df_new['p_avg4'] = df_new.Current4.mul(df_new['Voltage4']).rolling(min_periods=1, window=win_len).mean()
print('Converting to Spark dataframe')
df_new = df_new.to_spark()
print('Complete')
Previously there was no problem with the pandas API. I'm using the latest Runtime, 11.2. Only one dataframe was loaded while I was using the cluster.
Thank you in advance.
The error message is suggesting this: In order to allow this operation, enable 'compute.ops_on_diff_frames' option
Here's how to enable this option per the docs:
import pyspark.pandas as ps
ps.set_option('compute.ops_on_diff_frames', True)
The docs have this important warning:
Pandas API on Spark disallows the operations on different DataFrames (or Series) by default to prevent expensive operations. It internally performs a join operation which can be expensive in general.
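To sanity-check what the rolling line in the question computes, the same expression can be run in plain pandas on a tiny frame. This is only an illustration of the arithmetic, with made-up values and a window of 2 instead of 5000. It assumes pandas-on-Spark returns the same numbers for this expression, and it does not exercise the 'compute.ops_on_diff_frames' option.

```python
import pandas as pd

# Made-up sample values (assumption); the real data comes from the Delta table.
df = pd.DataFrame({"Current1": [1.0, 2.0, 3.0, 4.0],
                   "Voltage1": [10.0, 10.0, 10.0, 10.0]})

win_len = 2  # small window for readability; the question uses 5000

# Instantaneous power, then its rolling mean; min_periods=1 avoids leading NaNs.
df["p_avg1"] = df["Current1"].mul(df["Voltage1"]).rolling(window=win_len, min_periods=1).mean()

print(df["p_avg1"].tolist())
```

Both operands here come from the same frame, which is the case that normally does not need the option at all.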
I am trying to extract data from Google Trends by using the pytrends library to analyze it in MS PowerBI by using the following script:
import pandas as pd
from pytrends.request import TrendReq
pytrends = TrendReq()
data = pd.DataFrame()
kw_list = ["Bitcoin", "Ethereum"]
pytrends.build_payload(kw_list, timeframe='today 3-m')
data = pytrends.interest_over_time()
print(data)
When I use this simple script in PowerBI, the date column suddenly disappears. How can I include the date column?
import pandas as pd
from pytrends.request import TrendReq
pytrends = TrendReq()
data = pd.DataFrame()
kw_list = ["Bitcoin", "Ethereum"]
pytrends.build_payload(kw_list, timeframe='today 3-m')
data = pytrends.interest_over_time()
data.reset_index(inplace=True)
print(data)
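The date column is the index, so you just need to add the second-to-last line (data.reset_index(inplace=True)), which turns the index back into a regular column.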
Hope this will work
Thanks!
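A quick way to see the effect without calling Google Trends: the frame below is a hand-built stand-in for what `interest_over_time()` returns, assuming (as the answer says) that the dates live in an index named `date`. The values are invented.

```python
import pandas as pd

# Hand-built stand-in for the pytrends result (assumption): dates are the index.
data = pd.DataFrame(
    {"Bitcoin": [50, 55, 60], "Ethereum": [20, 22, 25]},
    index=pd.date_range("2023-01-01", periods=3, freq="D", name="date"),
)

print(list(data.columns))  # the date is not among the columns yet

# Move the index into a regular column so it is exported along with the values.
data = data.reset_index()

print(list(data.columns))
```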
I am new to AWS Glue. I am trying to read a CSV and transform it into a JSON object. From what I have seen, the approach is to read the CSV via a crawler, convert it to a PySpark DataFrame, and then convert that to a JSON object.
So far I have converted it to a JSON object. Now I need to write this JSON back to an S3 bucket. How do I do that?
Below is the code
#########################################
### IMPORT LIBRARIES AND SET VARIABLES
#########################################
#Import python modules
from datetime import datetime
#Import pyspark modules
from pyspark.context import SparkContext
import pyspark.sql.functions as f
#Import glue modules
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
import json
import boto3
#Initialize contexts and session
spark_context = SparkContext.getOrCreate()
glue_context = GlueContext(spark_context)
session = glue_context.spark_session
s3_source = boto3.resource('s3')
#Parameters
glue_db = "umesh-db"
glue_tbl = "read"
#########################################
### EXTRACT (READ DATA)
#########################################
#Read movie data to Glue dynamic frame
dynamic_frame_read = glue_context.create_dynamic_frame.from_catalog(database = glue_db, table_name = glue_tbl)
#Convert dynamic frame to data frame to use standard pyspark functions
data_frame = dynamic_frame_read.toDF()
## Show DF data
print("Showing Df data")
data_frame.show()
### Convert the DF to the json
jsonContent = data_frame.toJSON()
jsonValue={}
arrraYObj=[]
for row in jsonContent.collect():
    print("Row data ", row)
    arrraYObj.append(row)
print("Array Obj",arrraYObj)
jsonValue['Employee']=arrraYObj
print("Json object ", jsonValue)
#Log end time
#dt_end = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
#print("Start time:", dt_end)
I would appreciate it if anyone could point me to the right approach.
Thanks
data_frame.write.format('json').save('s3://bucket/key')
Or directly from dynamic frame
glue_context.write_dynamic_frame.from_options(frame = dynamic_frame_read,
connection_type = "s3",
connection_options = {"path": "s3://bucket/key"},
format = "json")
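If the goal is a single JSON document shaped like `jsonValue` rather than Spark's one-object-per-line output, note that `toJSON()` yields JSON strings, so appending them directly produces a list of strings instead of nested objects. Below is a minimal sketch of one way to handle this, using only the standard library. `build_employee_document` is a hypothetical helper, and the sample rows are invented.

```python
import json


def build_employee_document(json_rows):
    """Parse each JSON string into a dict and wrap the list under 'Employee'."""
    records = [json.loads(row) for row in json_rows]
    return json.dumps({"Employee": records})


# Invented sample rows, shaped like the strings data_frame.toJSON().collect() returns.
sample_rows = ['{"id": 1, "name": "A"}', '{"id": 2, "name": "B"}']
document = build_employee_document(sample_rows)
print(document)

# Uploading the string is then a single boto3 call, for example (bucket and key
# are placeholders):
#   s3_source.Object("bucket", "key").put(Body=document)
```

Because `collect()` brings every row to the driver, this approach is only reasonable for small tables; the Spark and Glue writers above scale better.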
I have 6 files named Data_20190823101010, Data_20190823101112, Data_20190823101214, Data_20190823101310, Data_20190823101410, and Data_20190823101510.
These are daily files to be loaded into a SQL Server DB table.
For size and performance reasons, they need to be loaded one by one.
The Python code must pick one file at a time, process it, and load it into the DB table.
How should I write the code?
Thanks in advance.
import glob
import os
import pandas as pd
import time
from datetime import datetime
import numpy as np
folder_name = 'Data_Folder'
file_type = 'csv'
file_titles = ['C1','C2','C3','C4','C5']
df = pd.concat([pd.read_csv(f, header=None,skiprows=1,names=file_titles,low_memory=False) for f in glob.glob(folder_name + "//*Data_*" )])
You can read each CSV file into a dataframe and use the pandas to_sql function to upload it to the MS SQL Server DB, one file at a time:
from sqlalchemy import create_engine
import urllib
import pyodbc
import pandas as pd
import glob
connection= urllib.parse.quote_plus("DRIVER={SQL Server Native Client 11.0};SERVER=Server_name;DATABASE=DB Name")
engine = create_engine('mssql+pyodbc:///?odbc_connect={}'.format(connection))
path = r'C:\file_path' # local drive File path
all_csv_files = glob.glob(path + "/*.csv")
for filename in all_csv_files:
    # Read one file at a time and append its rows to the target table
    df = pd.read_csv(filename, index_col=None, header=0)
    df.to_sql('Table_Name', schema='dbo', con=engine, if_exists='append')
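Because the file names end in a 14-digit timestamp, the load order can be made explicit by parsing it. The sketch below is one possible approach using only the standard library. `files_in_timestamp_order` is a hypothetical helper, and it assumes every name has the form `Data_YYYYMMDDHHMMSS` with an optional extension.

```python
import os
from datetime import datetime


def files_in_timestamp_order(paths):
    """Return paths sorted by the timestamp embedded in 'Data_YYYYMMDDHHMMSS'."""
    def stamp(path):
        # Strip the folder and extension, e.g. 'Data_20190823101010'
        stem = os.path.splitext(os.path.basename(path))[0]
        return datetime.strptime(stem.split("_", 1)[1], "%Y%m%d%H%M%S")
    return sorted(paths, key=stamp)


names = ["Data_20190823101214", "Data_20190823101010", "Data_20190823101112"]
ordered = files_in_timestamp_order(names)
print(ordered)
```

Each path from this list can then be read and appended in turn, as in the loop above, so only one file is held in memory at a time.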
I am trying to export a dataframe into a MySQL database. I am getting the data via Order and Inventory API calls.
I have successfully saved the results of both API calls into dataframes and exported the Order dataframe into a MySQL table.
The Inventory dataframe, however, throws this error:
TypeError: sequence item 0: expected str instance, dict found
I am not sure what I am doing wrong. I suspect the cause is that the inventory dataframe contains a lot of nested JSON in many of its columns, but I am not sure what to do about it.
Here is my code so far for inventory:
import pandas as pd
#python libary to compare today date for birthday lists.
import numpy as np
import datetime as dt
import datetime
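import requests  # needed for requests.get below
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated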
from pandas.io import sql
import pymysql.cursors
import json
import pymysql
import pandas.io.sql
from sqlalchemy import create_engine
headers_inventory = {
'Accept': '',
'Content-Type': '',
'x-api-key': '',
'x-organization-id': '',
'x-facility-id': '',
'x-user-id': '',
}
r_inventory = requests.get(' URL', headers=headers_inventory, verify=False)
data = json.loads(r_inventory.text)
df_inventory = json_normalize(data)
print (df_inventory)
engine = create_engine('mysql+pymysql://USERNAME:PWD@HOST:3306/DB')
df_inventory.to_sql("inventory", engine, if_exists="replace", index = False)
Here is what the dataframe dtypes are:
int64
object
float64
I had a similar problem running a pivot table with fiscal years. The issue is as above: Python 3 does not seem to recognize the numbers as a category. You have to turn that series into strings (i.e. change 2017, 2018 into FY2017, FY2018). Alternatively, convert just that one column to string rather than the entire df (df.column.astype(str)).
Converting the entire df to strings with this line helped:
df = df.applymap(str)
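A narrower alternative to stringifying the whole frame is to serialise only the cells that hold nested dicts or lists, so numeric columns keep their types. Below is a minimal sketch under that assumption. `stringify_nested` is a hypothetical helper and the sample data is invented.

```python
import json

import pandas as pd


def stringify_nested(df):
    """Return a copy where dict/list cells are replaced by JSON strings."""
    out = df.copy()
    for col in out.columns:
        # Only object columns can hold nested Python values.
        if out[col].dtype == object:
            out[col] = out[col].apply(
                lambda v: json.dumps(v) if isinstance(v, (dict, list)) else v
            )
    return out


# Invented sample: 'tags' holds nested values that a SQL driver cannot bind directly.
df_inventory = pd.DataFrame({"id": [1, 2], "tags": [{"a": 1}, ["x", "y"]]})
clean = stringify_nested(df_inventory)
print(clean.dtypes)
```

The result can then be passed to `to_sql` as in the question, and the JSON strings can be parsed again with `json.loads` when the data is read back.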