I am looking to automatically generate the following string in Python 2.7 using a loop based on the number of columns in a Pandas DataFrame:
INSERT INTO table_name (firstname, lastname) VALUES (534737, 100.115)
This assumes that the DataFrame has 2 columns.
Here is what I have:
import numpy as np
import pandas as pd
# Generate test numbers for table:
df = pd.DataFrame(np.random.rand(5,2), columns=['firstname','lastname'])
# Create list of tuples from numbers in each row of DataFrame:
list_of_tuples = [tuple(x) for x in df.values]
Now, I create the string:
Manually - this works:
add_SQL = "INSERT INTO table_name (firstname, lastname) VALUES %s" % (list_of_tuples[4],)
In this example, I only used 2 column names - 'firstname' and 'lastname'. But I must do this with a loop since I have 156 column names - I cannot do this manually.
What I need:
I need to automatically generate the placeholder %s the same number of times as the number of columns in the Pandas DataFrame. Here, the DataFrame has 2 columns, so I need an automatic way to generate %s twice.
Then I need to create a tuple with 2 entries, without the ''.
My attempt:
sss = ['%s' for x in range(0,len(list(df)))]
add_SQL = "INSERT INTO table_name (" + sss + ") VALUES %s" % (len(df), list_of_tuples[4])
But this is not working.
Is there a way for me to generate this string automatically?
Here is what I came up with - it is based on dwanderson's approach in the 2nd comment of the original post (question):
table_name = name_a #name of table
# Loop through all columns of dataframe and generate one string per column:
cols_n = df.columns.tolist()
placeholder = ",".join(["%s"]*df.shape[1]) #df.shape[1] gives # of columns
column_names = ",".join(cols_n)
insrt = "INSERT INTO %s " % table_name
for qrt in range(0,df.shape[0]):
    add_SQL_a_1 = insrt + "(" + column_names + ") VALUES (" + placeholder + ")" #part 1/2
    add_SQL_a_2 = add_SQL_a_1 % list_of_tuples[qrt] #part 2/2
This way, the final string is in part 2/2.
For some reason, it would not let me do this all in one line and I can't figure out why.
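A likely reason the one-line version fails is operator precedence: in Python, `%` binds more tightly than `+`, so without parentheses the formatting is applied only to the last string fragment, `")"`. The sketch below wraps the whole concatenation in parentheses before applying `%`. The column names and row values are sample data taken from the question, not real table contents.

```python
# Sample data standing in for df.columns and one row of the DataFrame.
columns = ["firstname", "lastname"]
row = (534737, 100.115)
table_name = "table_name"

placeholder = ",".join(["%s"] * len(columns))   # one %s per column
column_names = ",".join(columns)

# Parentheses force the concatenation to happen before the % formatting.
add_sql = ("INSERT INTO " + table_name + " (" + column_names
           + ") VALUES (" + placeholder + ")") % row

print(add_sql)
```

When the statement is sent to a database, it is usually safer to keep the `%s` placeholders in the string and pass the row values to the driver's `execute` call as parameters instead of formatting them in by hand.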
I have a large text file, 15 GB in size. The data inside the text file is a single string holding some 20 million records. Each record has a length of 5000 and 450+ columns.
Now I want to split the text file so that each record is on its own line, and split each record according to the schema with a delimiter so that I can load it as a DataFrame.
This is the sample approach - sample data:
HiIamRowData1HiIamRowData2HiIamRowData3HiIamRowData4HiIamRowData5HiIamRowData6HiIamRowData7HiIamRowData8
Expected output:
Hi#I#am#Row#Data#1#
Hi#I#am#Row#Data#2#
Hi#I#am#Row#Data#3#
Hi#I#am#Row#Data#4#
Hi#I#am#Row#Data#5#
Hi#I#am#Row#Data#6#
Hi#I#am#Row#Data#7#
Hi#I#am#Row#Data#8#
Code:
import pandas as pd
### Schema
schemaData = [['col1',0,2],['col2',2,1],['col3',3,2],['col4',5,3],['col5',8,4],['col6',12,1]]
df = pd.DataFrame(data= schemaData, columns=['FeildName','offset','size'])
print(df.head(5))
file = 'sampleText.txt'
inputFile = open(file, 'r').read()
recordLen = 13
totFileLen = len(inputFile)
finalStr = ''
### First for loop to split each record based on the record length
for i in range(0,totFileLen,recordLen):
    record = inputFile[i:i+recordLen]
    recStr = ''
    ### Second for loop to apply the schema on top of each record.
    for index, row in df.iterrows():
        #print(record[row['offset']:row['offset'] + row['size']])
        recStr = recStr + record[row['offset']:row['offset'] + row['size']] + '#'
    recStr = recStr + '\n'
    finalStr += recStr
print(finalStr)
text_file = open("Output.txt", "w")
text_file.write(finalStr)
text_file.close()
For the 8-row sample data above, this takes 56 iterations in total (8 rows + 48 row-by-column iterations).
In the real data set I have 25 million rows and 500 columns, so it will take 25 million + 25 million x 500 iterations.
Constraints:
The entire content of the text file is sequential data: all the records are placed next to each other and the whole file is one string. I want to read the text file and write the final data to a new text file.
I don't want to split the file into smaller chunks (say, 50 MB each) while processing. If the last record is split between the first 50 MB chunk and the second, then from the second chunk onwards the data will be sliced incorrectly, because each record is sliced based on the record length of 5000.
If I could split each chunk based on the file length inside the text file, that would be a possible approach.
I have tried the Python approach above. It works fine for smaller files, but for files larger than 500 MB it takes hours to split each record according to the schema.
I have also tried multithreading and multiprocessing, but did not see much improvement.
QUESTION: Is there a better approach to this problem, in either Python or PySpark, to reduce the time complexity?
You can process your big file efficiently and iteratively by:
capturing a sequential chunk of the needed size at a time,
passing it to pandas.read_fwf with predefined column widths,
and immediately exporting the constructed DataFrame to the output CSV file (which is created if it doesn't exist), appending each line with the specified separator.
from io import StringIO
import pandas as pd
rec_len = 13
widths = [2, 1, 2, 3, 4, 1]
with open('sampleText.txt') as inp, open('output.txt', 'w+') as out:
    # the walrus operator (:=) requires Python 3.8+
    while (line := inp.read(rec_len).strip()):
        pd.read_fwf(StringIO(line), widths=widths, header=None) \
          .to_csv(out, sep='#', header=False, index=False, mode='a')
The output.txt contents I get:
Hi#I#am#Row#Data#1
Hi#I#am#Row#Data#2
Hi#I#am#Row#Data#3
Hi#I#am#Row#Data#4
Hi#I#am#Row#Data#5
Hi#I#am#Row#Data#6
Hi#I#am#Row#Data#7
Hi#I#am#Row#Data#8
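The concern in the question about a record being cut in half at a chunk boundary goes away if the chunk size is a whole multiple of the record length. The following is a minimal sketch of that idea using plain string slicing with slice bounds computed once, rather than a DataFrame per record. The function name `split_fixed_width` and the parameter `records_per_chunk` are made up for illustration, and the sketch assumes the file contains only complete records (an incomplete trailing piece, such as a final newline, is skipped).

```python
import io

def split_fixed_width(inp, out, widths, records_per_chunk=10000, sep="#"):
    """Read fixed-width records from inp and write delimited lines to out."""
    rec_len = sum(widths)
    # Compute the (start, end) slice bounds of every column once.
    bounds = []
    start = 0
    for width in widths:
        bounds.append((start, start + width))
        start += width
    # A chunk always holds a whole number of records.
    chunk_size = rec_len * records_per_chunk
    while True:
        chunk = inp.read(chunk_size)
        if not chunk:
            break
        lines = []
        for i in range(0, len(chunk), rec_len):
            record = chunk[i:i + rec_len]
            if len(record) < rec_len:
                continue  # skip an incomplete trailing piece
            lines.append(sep.join(record[a:b] for a, b in bounds) + sep)
        if lines:
            out.write("\n".join(lines) + "\n")

# Usage with in-memory text; real files opened with open() work the same way.
source = io.StringIO("HiIamRowData1HiIamRowData2")
target = io.StringIO()
split_fixed_width(source, target, [2, 1, 2, 3, 4, 1], records_per_chunk=1)
result = target.getvalue()
print(result)
```

Only one chunk is held in memory at a time, so `records_per_chunk` can be tuned to the available memory without affecting the output.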
Yes, we can achieve the same result using a PySpark UDF together with Spark functions. Here is how, step by step:
Import the necessary modules
import pandas as pd
from pyspark.sql.functions import udf, split, explode
Reading text file using Spark read method
sample_df = spark.read.text("path/to/file.txt")
Convert your custom function to a PySpark UDF (User Defined Function) in order to use it in Spark
def delimit_records(value):
    recordLen = 13
    totFileLen = len(value)
    finalStr = ''
    for i in range(0,totFileLen,recordLen):
        record = value[i:i+recordLen]
        schemaData = [['col1',0,2],['col2',2,1],['col3',3,2],['col4',5,3],['col5',8,4],['col6',12,1]]
        pdf = pd.DataFrame(data= schemaData, columns=['FeildName','offset','size'])
        recStr = ''
        for index, row in pdf.iterrows():
            recStr = recStr + record[row['offset']:row['offset'] + row['size']] + '#'
        recStr = recStr + '\n'
        finalStr += recStr
    return finalStr.rstrip()
Registering your User Defined Function
delimit_records = udf(delimit_records)
Apply your custom function to the column you want to modify
df1 = sample_df.withColumn("value", delimit_records("value"))
Split the record based on the delimiter "\n" using the PySpark split() function
df2 = df1.withColumn("value", split("value", "\n"))
Use the explode() function to transform a column of arrays or maps into multiple rows
df3 = df2.withColumn("value", explode("value"))
Let's print the output
df3.show()
Output:
+-------------------+
| value|
+-------------------+
|Hi#I#am#Row#Data#1#|
|Hi#I#am#Row#Data#2#|
|Hi#I#am#Row#Data#3#|
|Hi#I#am#Row#Data#4#|
|Hi#I#am#Row#Data#5#|
|Hi#I#am#Row#Data#6#|
|Hi#I#am#Row#Data#7#|
|Hi#I#am#Row#Data#8#|
+-------------------+
I am not sure how to select a substring from a Series in a DataFrame to extract some needed text.
Example: I have 2 Series in the DataFrame and am trying to extract the last portion of the string in the QRY Series, starting at the string "AND".
So if I have "This is XYZ AND y = 1", then I need to extract "AND y = 1".
For this I've chosen rfind("AND"), since "AND" can occur anywhere in the string but I need the highest index, and I then want to extract the substring that begins at the last "AND".
Sample for one string
strg = "This is XYZ AND y = 1"
print(strg[strg.rfind("AND"):])
-- This works, but on a DataFrame it says: cannot do slice indexing on <class 'pandas.core.indexes.range.RangeIndex'>
data set
import pandas as pd
data = {"CELL":["CELL1","CELL2","CELL3"], "QRY": ["This is XYZ AND y = 1","No that is not AND z = 0","Yay AND a= -1"]}
df = pd.DataFrame(data,columns = ["CELL","QRY"])
print(df.QRY.str.rfind("AND"))
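`df.QRY.str.rfind("AND")` returns a Series of positions, one per row, and a Series cannot be used directly as a slice bound, which is presumably where the slice indexing error comes from. One possible way around it, sketched below, is to apply the single-string logic row by row. The helper name `tail_from_last` and the new column name `TAIL` are made up for illustration, and returning an empty string when "AND" is absent is an assumption about the desired behavior.

```python
import pandas as pd

data = {"CELL": ["CELL1", "CELL2", "CELL3"],
        "QRY": ["This is XYZ AND y = 1", "No that is not AND z = 0", "Yay AND a= -1"]}
df = pd.DataFrame(data, columns=["CELL", "QRY"])

def tail_from_last(text, marker="AND"):
    """Return the part of text starting at the last occurrence of marker."""
    pos = text.rfind(marker)
    # rfind returns -1 when the marker is missing; return "" in that case.
    return text[pos:] if pos != -1 else ""

# Apply the per-string logic to every row of the QRY column.
df["TAIL"] = df["QRY"].apply(tail_from_last)
print(df)
```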
I am able to get the column names and table name using sqlparse, but only for simple SELECT statements.
Can somebody help me get the column names and table name from any complex SQL statement?
Here is a solution for extracting column names from complex sql select statements. Python 3.9
import sqlparse
def get_query_columns(sql):
    stmt = sqlparse.parse(sql)[0]
    columns = []
    column_identifiers = []
    # get column_identifiers
    in_select = False
    for token in stmt.tokens:
        if isinstance(token, sqlparse.sql.Comment):
            continue
        if str(token).lower() == 'select':
            in_select = True
        elif in_select and token.ttype is None:
            for identifier in token.get_identifiers():
                column_identifiers.append(identifier)
            break
    # get column names
    for column_identifier in column_identifiers:
        columns.append(column_identifier.get_name())
    return columns
def test():
    sql = '''
    select
        a.a,
        replace(coalesce(a.b, 'x'), 'x', 'y') as jim,
        a.bla as sally -- some comment
    from
        table_a as a
    where
        c > 20
    '''
    print(get_query_columns(sql))
test()
# outputs: ['a', 'jim', 'sally']
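This is how you can print the table name with sqlparse.
1) Using a SELECT statement:
>>> import sqlparse
>>> parse = sqlparse.parse('select * from dbo.table')  # example query, assumed here; the original snippet did not show how parse was created
>>> print([str(t) for t in parse[0].tokens if t.ttype is None][0])
'dbo.table'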
(OR)
2) Using an INSERT statement:
# Note: extract_from_part and extract_table_identifiers are helper
# functions that are not defined in this snippet.
def extract_tables(sql):
    """Extract the table names from an SQL statement.
    Returns a list of (schema, table, alias) tuples
    """
    parsed = sqlparse.parse(sql)
    if not parsed:
        return []
    # INSERT statements must stop looking for tables at the sign of first
    # Punctuation. eg: INSERT INTO abc (col1, col2) VALUES (1, 2)
    # abc is the table name, but if we don't stop at the first lparen, then
    # we'll identify abc, col1 and col2 as table names.
    insert_stmt = parsed[0].token_first().value.lower() == "insert"
    stream = extract_from_part(parsed[0], stop_at_punctuation=insert_stmt)
    return list(extract_table_identifiers(stream))
The column names may be tricky because column names can be ambiguous or even derived. However, you can get the column names, sequence and type from virtually any query or stored procedure.
The function below collects all column names until the FROM keyword is encountered.
import sqlparse
from sqlparse.sql import Identifier, IdentifierList
from sqlparse.tokens import Keyword
def parse_sql_columns(sql):
    columns = []
    parsed = sqlparse.parse(sql)
    stmt = parsed[0]
    for token in stmt.tokens:
        if isinstance(token, IdentifierList):
            for identifier in token.get_identifiers():
                columns.append(str(identifier))
        if isinstance(token, Identifier):
            columns.append(str(token))
        if token.ttype is Keyword: # from
            break
    return columns
I have two DataFrames, df1 and df2. df2 consists of the columns "tagname" and "value". The dictionary "bucket_dict" holds the data from df2.
bucket_dict = dict(zip(df2.tagname,df2.value))
df1 has millions of rows and 3 columns: "apptag", "comments" and "Type". I want to match the two DataFrames like this: if a key from bucket_dict is contained in df1["apptag"], then set df1["comments"] to that dictionary key and df1["Type"] to the corresponding value, bucket_dict[key]. I used the code below:
for each_tag in bucket_dict:
    df1.loc[(df1["apptag"].str.match(each_tag, case = False ,na = False)), "comments"] = each_tag
    df1.loc[(df1["apptag"].str.match(each_tag, case = False ,na = False)), "Type"] = bucket_dict[each_tag]
Is there a more efficient way to do this? It is taking a long time.
Bucketing df from which dictionary has been created:
bucketing_df = pd.DataFrame([["pen", "study"], ["pencil", "study"], ["ersr","study"],["rice","grocery"],["wht","grocery"]], columns=['tagname', 'value'])
other dataframe:
output_df = pd.DataFrame([["test123-pen", "pen"," "], ["test234-pencil", "pencil"," "], ["test234-rice","rice", " "]], columns=['apptag', 'comments','type'])
Required output:
You can do this by calling apply on your apptag column along with a loc on your bucketing_df, in this manner -
def find_type(a):
    try:
        return (bucketing_df.loc[[x in a for x in bucketing_df['tagname']]])['value'].values[0]
    except IndexError:  # no tagname is contained in apptag
        return ""
def find_comments(a):
    try:
        return (bucketing_df.loc[[x in a for x in bucketing_df['tagname']]])['tagname'].values[0]
    except IndexError:  # no tagname is contained in apptag
        return ""
output_df['type'] = output_df['apptag'].apply(lambda a: find_type(a))
output_df['comments'] = output_df['apptag'].apply(lambda a:find_comments(a))
Here I had to make them separate functions so they could handle cases where no tagname exists in apptag.
It gives you this as the output_df -
apptag comments type
0 test123-pen pen study
1 test234-pencil pencil study
2 test234-rice rice grocery
All this code uses is the existing bucketing_df and output_df you provided at the end of your question.
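For millions of rows, a per-row apply or a loop over every tag can still be slow. A possible alternative, sketched below, builds one regular expression from all tag names and extracts the matching tag in a single vectorized pass, then maps it to its bucket. Note that a plain substring check finds "pen" inside "pencil", so this sketch tries longer tags first. It assumes the tag names in the dictionary are lowercase and that the first tag found in each apptag is the one wanted.

```python
import re
import pandas as pd

bucketing_df = pd.DataFrame(
    [["pen", "study"], ["pencil", "study"], ["ersr", "study"],
     ["rice", "grocery"], ["wht", "grocery"]],
    columns=["tagname", "value"])
output_df = pd.DataFrame(
    [["test123-pen", "", ""], ["test234-pencil", "", ""], ["test234-rice", "", ""]],
    columns=["apptag", "comments", "type"])

bucket_dict = dict(zip(bucketing_df["tagname"], bucketing_df["value"]))

# Longer tags first, so "pencil" is tried before "pen".
tags = sorted(bucket_dict, key=len, reverse=True)
pattern = "(" + "|".join(re.escape(tag) for tag in tags) + ")"

# One vectorized pass: the first tag found in each apptag, NaN if none.
found = output_df["apptag"].str.extract(pattern, flags=re.IGNORECASE, expand=False)

output_df["comments"] = found.fillna("")
output_df["type"] = found.str.lower().map(bucket_dict).fillna("")
print(output_df)
```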
I am using psycopg2 in a Python script.
The script parses JSON files and puts them into a Postgres RDS instance.
When a value is missing from the JSON file, the script is supposed to skip the specific column
(so it is supposed to insert a null value into the table, but instead it inserts NaN).
Has anybody encountered this issue?
The part that checks whether the column is empty -
if (str(df.loc[0][col]) == "" or df.loc[0][col] is None or str(df.loc[0][col]) == 'None' or str(df.loc[0][col]) == 'NaN' or str(df.loc[0][col]) == 'null'):
    df.drop(col, axis=1, inplace=True)
else:
    cur.execute("call mrr.add_column_to_table('{0}', '{1}');".format(table_name, col))
The insertion part -
def copy_df_to_sql(df, conn, table_name):
    if len(df) > 0:
        df_columns = list(df)
        columns = '","'.join(df_columns) # create (col1,col2,...)
        # create VALUES(%s, %s, ...) with one %s per column
        values = "VALUES({})".format(",".join(["%s" for _ in df_columns]))
        # create INSERT INTO table (columns) VALUES(%s, ...)
        emp = '"'
        insert_stmt = 'INSERT INTO mrr.{} ({}{}{}) {}'.format(table_name, emp, columns, emp, values)
        cur = conn.cursor()
        import psycopg2.extras
        psycopg2.extras.execute_batch(cur, insert_stmt, df.values)
        conn.commit()
        cur.close()
OK, so this is probably happening because pandas treats null values as NaN,
so when I insert a DataFrame into the table, it inserts the null values as the pandas null, which is NaN.
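One possible workaround, sketched below, is to convert every missing value to Python's None before handing the rows to the database driver, since psycopg2 adapts None to SQL NULL. The helper name `rows_with_none` is made up for illustration, and the small DataFrame is sample data, not the data from the question.

```python
import numpy as np
import pandas as pd

def rows_with_none(df):
    """Return the rows of df as tuples, with missing values replaced by None."""
    return [
        tuple(None if pd.isna(value) else value for value in row)
        for row in df.itertuples(index=False, name=None)
    ]

# Sample data with a missing number and a missing string.
sample = pd.DataFrame({"amount": [1.0, np.nan], "label": ["x", None]})
rows = rows_with_none(sample)
print(rows)
```

The resulting list could then be passed to `execute_batch` in place of `df.values`.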