Spark read single column from PostgreSQL table - apache-spark

Question
Is there a way to load a specific column from a (PostreSQL) database table as a Spark DataFrame?
Below is what I've tried.
Expected behavior:
The code below should result in only the specified column being stored in memory, not the entire table (table is too large for my cluster).
# make connection in order to get column names
conn = p2.connect(database=database, user=user, password=password, host=host, port="5432")
cursor = conn.cursor()
cursor.execute("SELECT column_name FROM information_schema.columns WHERE table_name = '%s'" % table)
for header in cursor:
header = header[0]
df = spark.read.jdbc('jdbc:postgresql://%s:5432/%s' % (host, database), table=table, properties=properties).select(str(header)).limit(10)
# doing stuff with Dataframe containing this column's contents here before continuing to next column and loading that into memory
df.show()
Actual behavior:
Out of memory exception occurs. I'm presuming it is because Spark attempts to load the entire table and then select a column, rather than just loading the selected column? Or is it actually loading just the column, but that column is too large; I limited the column to just 10 values, so that shouldn't be the case?
2018-09-04 19:42:11 ERROR Utils:91 - uncaught error in thread spark-listener-group-appStatus, stopping SparkContext
java.lang.OutOfMemoryError: GC overhead limit exceeded

SQL query with one column only can be used in jdbc instead of "table" parameter, please find some details here:
spark, scala & jdbc - how to limit number of records

Related

mariadb python - executemany using SELECT

Im trying to input many rows to a table in a mariaDB.
For doing this i want to use executemany() to increase speed.
The inserted row is dependent on another table, which is found with SELECT.
I have found statements that SELECT doent work in a executemany().
Are there other ways to sole this problem?
import mariadb
connection = mariadb.connect(host=HOST,port=PORT,user=USER,password=PASSWORD,database=DATABASE)
cursor = connection.cursor()
query="""INSERT INTO [db].[table1] ([col1], [col2] ,[col3])
VALUES ((SELECT [colX] from [db].[table2] WHERE [colY]=? and
[colZ]=(SELECT [colM] from [db].[table3] WHERE [colN]=?)),?,?)
ON DUPLICATE KEY UPDATE
[col2]= ?,
[col3] =?;"""
values=[input_tuplets]
When running the code i get the same value for [col1] (the SELECT-statement) which corresponds to the values from the from the first tuplet.
If SELECT doent work in a executemany() are there another workaround for what im trying to do?
Thx alot!
I think that reading out the tables needed,
doing the search in python,
use exeutemany() to insert all data.
It will require 2 more queries (to read to tables) but will be OK when it comes to calculation time.
Thanks for your first question on stackoverflow which identified a bug in MariaDB Server.
Here is a simple script to reproduce the problem:
CREATE TABLE t1 (a int);
CREATE TABLE t2 LIKE t1;
INSERT INTO t2 VALUES (1),(2);
Python:
>>> cursor.executemany("INSERT INTO t1 VALUES \
(SELECT a FROM t2 WHERE a=?))", [(1,),(2,)])
>>> cursor.execute("SELECT a FROM t1")
>>> cursor.fetchall()
[(1,), (1,)]
I have filed an issue in MariaDB Bug tracking system.
As a workaround, I would suggest reading the country table once into an array (according to Wikipedia there are 195 different countries) and use these values instead of a subquery.
e.g.
countries= {}
cursor.execute("SELECT country, id FROM countries")
for row in cursor:
countries[row[0]]= row[1]
and then in executemany
cursor.executemany("INSERT INTO region (region,id_country) values ('sounth', ?)", [(countries["fra"],) (countries["ger"],)])

How to load data from database to Spark before filtering

I'm trying to run such a PySpark application:
with SparkSession.builder.appName(f"Spark App").getOrCreate() as spark:
dataframe_mysql = spark.read.format('jdbc').options(
url="jdbc:mysql://.../...",
driver='com.mysql.cj.jdbc.Driver',
dbtable='my_table',
user=...,
password=...,
partitionColumn='id',
lowerBound=0,
upperBound=10000000,
numPartitions=11,
fetchsize=1000000,
isolationLevel='NONE'
).load()
dataframe_mysql = dataframe_mysql.filter("date > '2022-01-01'")
dataframe_mysql.write.parquet('...')
And I found that Spark didn't load data from Mysql until executing the write, this means that Spark let the database take care of filtering the data, and the SQL that database received may like:
select * from my_table where id > ... and id< ... and date > '2022-01-01'
My table was too big and there's no index on date column, the database couldn't handle the filtering. How can I load data into Spark's memory before filtering, I hope the query sent to the databse could be:
select * from my_table where id > ... and id< ...
According to #samkart's comment, set pushDownPredicate to False could solve this problem

Write Pandas dataframe data to CSV file

I am trying to write a pipeline to bring oracle database table data to aws.
It only takes a few ms to fill the dataframe, but when I try to write the dataframe to a csv-file it takes more than 2 min to write 10000 rows. In addition, one of the column's datatypes is cx_oracle lob type.
I thought this meant that it must take time to write data. So I converted the data to categorical data. But then the operation will take more memory space. Does anyone have any suggestions on how to optimize this process?
query = 'select * from tablename'
cursor.execute(query)
iter_idx = 0
while True:
results = cursor.fetchmany()
if not results:
break
iter_idx += 1
df = pd.DataFrame(results)
df.columns = field['source_field_names']
rec_count = df.shape[0]
t_rec_count += rec_count
file = generate_micro_file()
print('memory usage : \n', df.info(memory_usage='deep'))
# sd = dd.from_pandas(df, npartitions=1)
df.to_csv(file, encoding=str(encoding_type), header=False, index=False, escapechar='\\',chunksize=arraysize)
code output:
From the data access side, there is room for improvement by optimizing the fetching of rows across the network. Either by:
passing a large num_rows value to fetchmany(), see the cx_Oracle doc on [Cursor.fetchmany()[(https://cx-oracle.readthedocs.io/en/latest/api_manual/cursor.html#Cursor.fetchmany).
or increasing the value of Cursor.arraysize.
Your question didn't explain enough about your LOB usage. See the sample return_lobs_as_strings.py for optimizing fetches.
See the cx_Oracle documentation Tuning Fetch Performance.
Is there a particular reason to spend the overhead of converting to a Pandas dataframe? Why not write directly using the csv module?
Maybe something like this:
with connection.cursor() as cursor:
sql = "select * from all_objects where rownum <= 100000"
cursor.arraysize = 10000
with open("testwrite.csv", "w", encoding="utf-8") as outputfile:
writer = csv.writer(outputfile, lineterminator="\n")
results = cursor.execute(sql)
writer.writerows(results)
You should benchmark and choose the best solution.

Error while writing dataframe with column length more that default value (256) to SQL warehouse

I am trying to write a data frame from Spark to SQL warehouse table. One of the column in this table has values whoes length is greater than the default value of string (256) . As per this link ,
https://docs.databricks.com/spark/latest/data-sources/azure/sql-data-warehouse.html
"maxStrLength" specifies the maximum length that can be used for string while loading to SQL warehouse, but this option is not helping me to increase the length of varchar from default value . Can you please suggest ? The below is my dataframe write statement that I am executing , let me know if you need more detail.
df.write
.format("com.databricks.spark.sqldw")
.option("url", sqlDwUrlSmall).option( "forward_spark_azure_storage_credentials","True").option("tempDir",tempDir).option("maxStrLength ","4000").option("dbTable",sqlschemaName + "." + sqlDwhTbl)
.option("tableOptions", "DISTRIBUTION = ROUND_ROBIN")
.mode("overwrite")
.save()
error message :
Underlying SQLException(s): - com.microsoft.sqlserver.jdbc.SQLServerException: HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopSqlException: String or binary data would be truncated. [ErrorCode = 107090] [SQLState = S0001]
The column is defined as varchar and you could change that to varchar(max) or investigate if any padding or double-byte characters are present in the data, that might be driving the total character count to exceed a column width definition greater than 4000 in DWH.
A quick test to try: ("maxStrLength ","3500") and see if the string is accepted?

Hive - insert into table partition throwing error

I am trying to create a partitioned table in Hive on spark and load it with data available in other table in Hive.
I am getting following error while loading the data:
Error: org.apache.spark.sql.AnalysisException:
org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException:
Partition spec {cardsuit=, cardcolor=, cardSuit=SPA, cardColor=BLA}
contains non-partition columns;
following are the commands used to execute the task:-
create table if not exists hive_tutorial.hive_table(color string, suit string,value string) comment 'first hive table' row format delimited fields terminated by '|' stored as TEXTFILE;
LOAD DATA LOCAL INPATH 'file:///E:/Kapil/software-study/Big_Data_NoSql/hive/deckofcards.txt' OVERWRITE INTO TABLE hive_table; --data is correctly populated(52 rows)
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
create table if not exists hive_tutorial.hive_table_partitioned(color string, suit string,value int) comment 'first hive table' partitioned by (cardSuit string,cardColor string) row format delimited fields terminated by '|' stored as TEXTFILE;
INSERT INTO TABLE hive_table_partitioned PARTITION (cardSuit,cardColor) select color,suit,value,substr(suit, 1, 3) as cardSuit,substr(color, 1, 3) as cardColor from hive_table;
--alternatively i tried
INSERT OVERWRITE TABLE hive_table_partitioned PARTITION (cardSuit,cardColor) select color,suit,value,substr(suit, 1, 3) as cardSuit,substr(color, 1, 3) as cardColor from hive_table;
sample of data:-
BLACK|SPADE|2
BLACK|SPADE|3
BLACK|SPADE|4
BLACK|SPADE|5
BLACK|SPADE|6
BLACK|SPADE|7
BLACK|SPADE|8
BLACK|SPADE|9
I am using spark 2.2.0 and java version 1.8.0_31.
I have checked and tried answers given in similar thread but could not solve my problem:-
SemanticException Partition spec {col=null} contains non-partition columns
Am I missing something here?
After carefully reading the scripts above while creating tables ,value column is type of int in partitioned table where as string in original. Error was misleading !!!

Resources