I am trying to extract the DDL of tables and store it in .sql files using pandas
The code I have tried is:
query = "show table tablename"
df = pd.read_sql(query, connect)
df.to_csv('xyz.sql', index=False, header=False, quoting=None)
This creates a .sql file with the DDL like this -
" CREATE TABLE .....
.... ; "
How do I write the file without the quotes, like -
CREATE TABLE .....
.... ;
Given a string s, such as "CREATE ...", one can delete double-quote characters with:
s = s.replace('"', '')
And don't forget maketrans, which (with translate) is very good at efficiently deleting unwanted characters from very long strings.
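Putting the two ideas together with the original pandas query, a minimal sketch might look like this (the query, connection object, and output file name are taken from the question; everything else is an assumption):

import pandas as pd

# Hedged sketch: fetch the DDL with the question's query, then write it with plain
# file I/O instead of to_csv, deleting any double quotes via str.maketrans/translate.
df = pd.read_sql("show table tablename", connect)
ddl = "\n".join(df.iloc[:, 0].astype(str))   # assumes the DDL text is in the first column

drop_quotes = str.maketrans('', '', '"')     # translation table that deletes every '"'
with open('xyz.sql', 'w') as f:
    f.write(ddl.translate(drop_quotes))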
For example:
1) File has:
ID|Name|job|hobby|salary|hobby2
2) Data:
1|ram|architect|tennis|20000|cricket
1|ram|architect|football|20000|gardening
2|krish|teacher|painting|25000|cooking
3) Table:
Columns in table: ID-Name-Job-Hobby-Salary
Is it possible to load the data into the table as below:
1-ram-architect-tenniscricketfootballgardening-20000
2-krish-teacher-paintingcooking-25000
Command: db2 "Load CLIENT FROM ABC.FILE of DEL MODIFIED BY coldel0x7x keepblanks REPLACE INTO tablename(ID,Name,Job,Hobby,salary) nonrecoverable"
You cannot achieve what you think you want in a single action with either LOAD CLIENT or IMPORT.
You are asking to denormalize, and I presume you understand the consequences.
Regardless, you can use a multi-step approach: first load/import into a temporary table, and then in a second step use SQL to denormalize into the final table, before discarding the temporary table.
Or, if you are adept with awk and the data file is correctly sorted, you can pre-process the file outside the database before the load/import (a Python sketch of this idea follows below).
Or use an ETL tool.
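The awk idea above can also be sketched in Python. This is a hedged illustration only: the input name ABC.FILE comes from the question, the output name ABC.OUT is a placeholder, and it assumes the file is sorted by ID with hobby and hobby2 in columns 4 and 6:

import csv
from itertools import groupby

# Hedged sketch: collapse the pipe-delimited rows per ID, concatenating hobby and hobby2,
# so the result can be loaded directly into tablename(ID, Name, Job, Hobby, Salary).
with open("ABC.FILE", newline="") as src, open("ABC.OUT", "w", newline="") as dst:
    reader = csv.reader(src, delimiter="|")
    writer = csv.writer(dst, delimiter="|")
    # next(reader)  # uncomment if the file starts with a header line
    for key, rows in groupby(reader, key=lambda r: r[0]):   # requires input sorted by ID
        rows = list(rows)
        hobbies = "".join(r[3] + r[5] for r in rows)         # hobby then hobby2, row by row
        first = rows[0]
        writer.writerow([first[0], first[1], first[2], hobbies, first[4]])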
You may use the INGEST command instead of LOAD.
You must create the corresponding infrastructure for this command beforehand with the following command, for example:
CALL SYSINSTALLOBJECTS('INGEST', 'C', 'USERSPACE1', NULL);
Load your file afterwards with the following command:
INGEST FROM FILE ABC.FILE
FORMAT DELIMITED by '|'
(
    $id INTEGER EXTERNAL
  , $name CHAR(8)
  , $job CHAR(20)
  , $hobby CHAR(20)
  , $salary INTEGER EXTERNAL
  , $hobby2 CHAR(20)
)
MERGE INTO tablename
ON ID = $id
WHEN MATCHED THEN
    UPDATE SET hobby = hobby CONCAT $hobby CONCAT $hobby2
WHEN NOT MATCHED THEN
    INSERT (ID, NAME, JOB, HOBBY, SALARY) VALUES($id, $name, $job, $hobby CONCAT $hobby2, $salary);
I want to save a TSV file to ADLS Gen1. I am using the command below to save the data, but it writes the row delimiter as "\n" (LF); I want it to write "\r\n" (CRLF) instead.
df.coalesce(1).write.mode("overwrite").format("csv").options(delimiter="\t",header="true",nullValue= None,lineSep ='\r\n').save(gen1temp)
I have 400+ columns and 2M rows, and the file is about 6 GB.
Please help with an optimal solution.
Support for the lineSep option for CSV files exists only in Spark 3.0 and later; it doesn't exist in earlier versions such as 2.4, so it is simply ignored.
Initially I thought about the following workaround - append \r to the last column:
from pyspark.sql.functions import concat, col, lit

data = spark.range(1, 100).withColumn("abc", col("id")).withColumn("def", col("id"))
cols = [col(cn) for cn in data.schema.fieldNames()]  # build a list (map() returns an iterator in Python 3)
cols[-1] = concat(cols[-1].cast("string"), lit("\r"))
data.select(cols).write.csv("1.csv")
but unfortunately it doesn't work - it looks like the trailing whitespace is stripped when the data is written to CSV...
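If upgrading is an option, the answer's own observation suggests the original write should behave as intended on Spark 3.0 or later, where the CSV writer honors lineSep. A hedged re-statement of the question's command for that case (reusing the question's gen1temp path variable):

# Hedged sketch, assuming Spark >= 3.0 where the CSV writer supports lineSep.
(df.coalesce(1)
   .write.mode("overwrite")
   .option("delimiter", "\t")
   .option("header", "true")
   .option("lineSep", "\r\n")   # honored only on Spark 3.0+
   .csv(gen1temp))

As a side note on the stripped \r: the CSV writer's ignoreTrailingWhiteSpace option defaults to true when writing, which may be why the appended \r disappears; that is a guess, not something verified here.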
When I run my chatbot, it creates a db.sqlite3 file in the backend to store all the conversations. I want to convert this db.sqlite3 file into CSV files using an API. How should I implement this in Python? The image shows the type of file.
There are multiple tables in the db file associated with Chatterbot. They are conversation_association, conversation, response, statement, tag_association and tag. Of all these tables, only the response and statement tables have proper data (at least in my case). However, I converted all tables to CSV anyway, so you may find some empty CSV files too.
import sqlite3, csv

db = sqlite3.connect("chatterbot-database")  # enter your db name here
cursor = db.cursor()

# fetch table names from the db
tables = [table[0] for table in cursor.execute("select name from sqlite_master where type = 'table'")]

for table in tables:
    with open('%s.csv' % table, 'w', newline='') as fd:  # newline='' avoids blank lines on Windows
        csvwriter = csv.writer(fd)
        for data in cursor.execute("select * from '%s'" % table):  # get data from each table
            csvwriter.writerow(data)
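If pandas is available, the same export can be written more compactly; this is a hedged alternative sketch (the db.sqlite3 file name is taken from the question, and pandas writes the column headers for you):

import sqlite3
import pandas as pd

# Hedged sketch: dump every table in the SQLite file to <table>.csv using pandas.
con = sqlite3.connect("db.sqlite3")
tables = pd.read_sql_query("select name from sqlite_master where type = 'table'", con)["name"]

for table in tables:
    pd.read_sql_query(f"select * from '{table}'", con).to_csv(f"{table}.csv", index=False)

con.close()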
I have an Excel file with a few columns (20) and some data that I need to upload into 4 SQL Server tables. The tables are related, and specific columns represent the id for each table.
Is there an ETL tool that I can use to automate this process?
This query uses BULK INSERT to load the file into a #temptable
and then inserts the contents of that temp table into the target table in the database. However, the file being imported must be a .csv, so save your Excel file as CSV before doing this.
CREATE TABLE #temptable (col1 VARCHAR(255), col2 VARCHAR(255), col3 VARCHAR(255)) -- adjust column names/types to match your file

BULK INSERT #temptable FROM 'C:\yourfilelocation\yourfile.csv'
WITH
(
    FIRSTROW = 2,
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '0x0A'
)

INSERT INTO yourTableInDataBase (col1, col2, col3)
SELECT col1, col2, col3
FROM #temptable
To automate this, you can put the query inside a stored procedure and call the stored procedure from a batch file. Edit the code below, put it in a text file, and save it as a .cmd file:
set MYDB=yourDBname
set MYUSER=youruser
set MYPASSWORD=yourpassword
set MYSERVER=yourservername
sqlcmd -S %MYSERVER% -d %MYDB% -U %MYUSER% -P %MYPASSWORD% -h -1 -s "," -W -Q "exec yourstoredprocedure"
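As an alternative to the batch file, a short Python sketch with pandas and SQLAlchemy can do the same staging step. This is a hedged illustration, not part of the answer above; the connection string, file path, and table/column names are all placeholders:

import pandas as pd
from sqlalchemy import create_engine, text

# Hedged sketch: load the Excel sheet into a staging table, then distribute the
# columns into the related target tables with plain SQL. All names are placeholders.
engine = create_engine(
    "mssql+pyodbc://user:password@yourservername/yourDBname?driver=ODBC+Driver+17+for+SQL+Server"
)

df = pd.read_excel(r"C:\yourfilelocation\yourfile.xlsx")
df.to_sql("staging_table", engine, if_exists="replace", index=False)

with engine.begin() as conn:
    # repeat one INSERT ... SELECT per target table, keeping the id columns consistent
    conn.execute(text(
        "INSERT INTO yourTableInDataBase (col1, col2, col3) "
        "SELECT col1, col2, col3 FROM staging_table"
    ))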
I'm processing events using dataframes converted from a stream of JSON events, which eventually get written out in Parquet format.
However, some of the JSON events contain spaces in the keys, which I want to log and then filter/drop from the dataframe before converting it to Parquet, because ;{}()\n\t= are considered special characters in the Parquet schema (CatalystSchemaConverter), as listed in [1] below, and thus are not allowed in column names.
How can I do such validation on the dataframe's column names and drop such an event altogether, without erroring out the Spark Streaming job?
[1]
Spark's CatalystSchemaConverter
def checkFieldName(name: String): Unit = {
  // ,;{}()\n\t= and space are special characters in Parquet schema
  checkConversionRequirement(
    !name.matches(".*[ ,;{}()\n\t=].*"),
    s"""Attribute name "$name" contains invalid character(s) among " ,;{}()\\n\\t=".
       |Please use alias to rename it.
     """.stripMargin.split("\n").mkString(" ").trim
  )
}
For everyone experiencing this in pyspark: this even happened to me after renaming the columns. One way I could get this to work after some iterations is this:
file = "/opt/myfile.parquet"
df = spark.read.parquet(file)
for c in df.columns:
df = df.withColumnRenamed(c, c.replace(" ", ""))
df = spark.read.schema(df.schema).parquet(file)
You can use a regex to replace all invalid characters with an underscore before you write into Parquet. Additionally, strip accents from the column names too.
Here's a normalize function that does this, in both Scala and Python:
Scala
/**
 * Normalize column name by replacing invalid characters with underscore
 * and strips accents
 *
 * @param columns dataframe column names list
 * @return the list of normalized column names
 */
def normalize(columns: Seq[String]): Seq[String] = {
  columns.map { c =>
    org.apache.commons.lang3.StringUtils.stripAccents(c.replaceAll("[ ,;{}()\n\t=]+", "_"))
  }
}

// using the function
val df2 = df.toDF(normalize(df.columns): _*)
Python
import unicodedata
import re

def normalize(column: str) -> str:
    """
    Normalize column name by replacing invalid characters with underscore
    strips accents and make lowercase
    :param column: column name
    :return: normalized column name
    """
    n = re.sub(r"[ ,;{}()\n\t=]+", '_', column.lower())
    return unicodedata.normalize('NFKD', n).encode('ASCII', 'ignore').decode()

# using the function
df = df.toDF(*map(normalize, df.columns))
This is my solution using Regex in order to rename all the dataframe's columns following the parquet convention:
df.columns.foldLeft(df) {
  case (currentDf, oldColumnName) =>
    currentDf.withColumnRenamed(oldColumnName, oldColumnName.replaceAll("[ ,;{}()\n\t=]", ""))
}
I hope it helps,
I had the same problem with column names containing spaces.
The first part of the solution was to put the names in backquotes.
The second part of the solution was to replace the spaces with underscores.
Sorry but I have only the pyspark code ready:
from pyspark.sql import functions as F

df_tmp.select(*(F.col("`" + c + "`").alias(c.replace(' ', '_')) for c in df_tmp.columns))
Use an alias to change your field names so they don't contain those special characters.
I have encountered this error: "Error in SQL statement: AnalysisException: Found invalid character(s) among " ,;{}()\n\t=" in the column names of your schema. Please enable column mapping by setting table property 'delta.columnMapping.mode' to 'name'. For more details, refer to https://learn.microsoft.com/azure/databricks/delta/delta-column-mapping Or you can use alias to rename it."
The issue was that I used MAX(COLUMN_NAME) when creating a table based on a Parquet / Delta table, so the new column in the new table was named "MAX(COLUMN_NAME)" because I forgot to use an alias, and Parquet files don't support brackets '()' in column names.
Solved by using aliases (removing the brackets)
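For illustration, a hedged pyspark sketch of the alias fix described here (the dataframe, column name, and output path are placeholders):

from pyspark.sql import functions as F

# Hedged sketch: without the alias the result column would literally be named
# "max(COLUMN_NAME)", and Parquet rejects the parentheses; the alias removes them.
agg = df.select(F.max("COLUMN_NAME").alias("MAX_COLUMN_NAME"))
agg.write.mode("overwrite").parquet("/tmp/my_table")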
This was fixed in the Spark 3.3.0 release, at least for Parquet files (I tested); it might work with JSON as well.