I've been trying to figure this out for the past day, but have not been successful.
Problem I am facing
I'm reading a Parquet file that is about 2 GB. The initial read has 14 partitions, which eventually get split into 200 partitions. I run a seemingly simple SQL query that takes 25+ minutes, and about 22 of those minutes are spent on a single stage. Looking at the Spark UI, I see that all of the computation is eventually pushed to about 2 to 4 executors, with lots of shuffling. I don't know what is going on. I would appreciate any help.
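For reference, the 200 partitions line up with the default spark.sql.shuffle.partitions = 200. Below is a minimal sketch of checking/overriding that setting, shown in PySpark purely for illustration (the notebook itself uses SparkR):

# Illustration only: inspect and override the shuffle partition count.
spark.conf.get("spark.sql.shuffle.partitions")        # '200' unless overridden
spark.conf.set("spark.sql.shuffle.partitions", "64")  # hypothetical override, not what the notebook does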
Setup
Spark environment - Databricks
Cluster mode - Standard
Databricks Runtime Version - 6.4 ML (includes Apache Spark 2.4.5, Scala 2.11)
Cloud - Azure
Worker Type - 56 GB, 16 cores per machine. Minimum 2 machines
Driver Type - 112 GB, 16 cores
Notebook
Cell 1: Helper functions
load_data = function(path, type) {
input_df = read.df(path, type)
input_df = withColumn(input_df, "dummy_col", 1L)
createOrReplaceTempView(input_df, "__current_exp_data")
## Helper function to run query, then save as table
transformation_helper = function(sql_query, destination_table) {
createOrReplaceTempView(sql(sql_query), destination_table)
}
## Transformation 0: Calculate max date, used for calculations later on
transformation_helper(
"SELECT 1L AS dummy_col, MAX(Date) max_date FROM __current_exp_data",
destination_table = "__max_date"
)
## Transformation 1: Make initial column calculations
transformation_helper(
"
SELECT
cId AS cId
, date_format(Date, 'yyyy-MM-dd') AS Date
, date_format(DateEntered, 'yyyy-MM-dd') AS DateEntered
, eId
, (CASE WHEN isnan(tSec) OR isnull(tSec) THEN 0 ELSE tSec END) AS tSec
, (CASE WHEN isnan(eSec) OR isnull(eSec) THEN 0 ELSE eSec END) AS eSec
, approx_count_distinct(eId) OVER (PARTITION BY cId) AS dc_eId
, COUNT(*) OVER (PARTITION BY cId, Date) AS num_rec
, datediff(Date, DateEntered) AS analysis_day
, datediff(max_date, DateEntered) AS total_avail_days
FROM __current_exp_data
CROSS JOIN __max_date ON __current_exp_data.dummy_col = __max_date.dummy_col
",
destination_table = "current_exp_data_raw"
)
## Transformation 2: Drop row if Date is not valid
transformation_helper(
"
SELECT
cId
, Date
, DateEntered
, eId
, tSec
, eSec
, analysis_day
, total_avail_days
, CASE WHEN analysis_day == 0 THEN 0 ELSE floor((analysis_day - 1) / 7) END AS week
, CASE WHEN total_avail_days < 7 THEN NULL ELSE floor(total_avail_days / 7) - 1 END AS avail_week
FROM current_exp_data_raw
WHERE
isnotnull(Date) AND
NOT isnan(Date) AND
Date >= DateEntered AND
dc_eId == 1 AND
num_rec == 1
",
destination_table = "main_data"
)
cacheTable("main_data_raw")
cacheTable("main_data")
}
spark_sql_as_data_table = function(query) {
data.table(collect(sql(query)))
}
get_distinct_weeks = function() {
spark_sql_as_data_table("SELECT week FROM current_exp_data GROUP BY week")
}
Cell 2: Call helper function that triggers the long running task
library(data.table)
library(SparkR)
spark = sparkR.session(sparkConfig = list())
load_data("/mnt/public-dir/file_0000000.parquet", "parquet")
set.seed(1234)
get_distinct_weeks()
Long running stage DAG
Stats about long running stage
Logs
I trimmed the logs down and show only the entries that appeared multiple times:
BlockManager: Found block rdd_22_113 locally
CoarseGrainedExecutorBackend: Got assigned task 812
ExternalAppendOnlyUnsafeRowArray: Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
InMemoryTableScanExec: Predicate (dc_eId#61L = 1) generates partition filter: ((dc_eId.lowerBound#622L <= 1) && (1 <= dc_eId.upperBound#621L))
InMemoryTableScanExec: Predicate (num_rec#62L = 1) generates partition filter: ((num_rec.lowerBound#627L <= 1) && (1 <= num_rec.upperBound#626L))
InMemoryTableScanExec: Predicate isnotnull(Date#57) generates partition filter: ((Date.count#599 - Date.nullCount#598) > 0)
InMemoryTableScanExec: Predicate isnotnull(DateEntered#58) generates partition filter: ((DateEntered.count#604 - DateEntered.nullCount#603) > 0)
MemoryStore: Block rdd_17_104 stored as values in memory (estimated size <VERY SMALL NUMBER < 10> MB, free 10.0 GB)
ShuffleBlockFetcherIterator: Getting 200 non-empty blocks including 176 local blocks and 24 remote blocks
ShuffleBlockFetcherIterator: Started 4 remote fetches in 1 ms
UnsafeExternalSorter: Thread 254 spilling sort data of <Between 1 and 3 GB> to disk (3 times so far)
I am trying to do a parallel scan, but it looks like it is processing sequentially and not fetching all the records.
My DynamoDB table has 400k records and I am trying to pull all of them using DynamoDB parallel scans. I want to use 40 threads.
import boto3

dynamodb = boto3.resource('dynamodb',
                          region_name='us-east-1',
                          endpoint_url="https://dynamodb.us-east-1.amazonaws.com",
                          aws_access_key_id="123",
                          aws_secret_access_key="456")
table = dynamodb.Table('solved_history')

last_evaluated_key = None
a = 0
while True:
    if last_evaluated_key:
        response = table.scan(ExclusiveStartKey=last_evaluated_key)
        allResults = response['Items']
        for each in allResults:
            storeRespectively(each["decodedText"], each["base64StringOfImage"])
            a += 1
    else:
        response = table.scan(TotalSegments=40, Segment=39)
        allResults = response['Items']
        for each in allResults:
            storeRespectively(each["decodedText"], each["base64StringOfImage"])
            a += 1

    last_evaluated_key = response.get('LastEvaluatedKey')
    if not last_evaluated_key:
        break

print(a)
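For comparison, here is a minimal sketch of what a segmented parallel scan spread across 40 worker threads could look like, with each thread paginating its own segment via ExclusiveStartKey. It reuses the table name and the storeRespectively helper from the code above; the thread-pool structure and the per-thread resource are illustrative assumptions, and credentials/endpoint are omitted for brevity:

import boto3
from concurrent.futures import ThreadPoolExecutor

TOTAL_SEGMENTS = 40

def scan_segment(segment):
    # One resource per thread; a shared boto3 resource is not thread-safe.
    dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
    table = dynamodb.Table('solved_history')
    count = 0
    last_evaluated_key = None
    while True:
        kwargs = {'TotalSegments': TOTAL_SEGMENTS, 'Segment': segment}
        if last_evaluated_key:
            kwargs['ExclusiveStartKey'] = last_evaluated_key
        response = table.scan(**kwargs)
        for item in response['Items']:
            storeRespectively(item["decodedText"], item["base64StringOfImage"])
            count += 1
        last_evaluated_key = response.get('LastEvaluatedKey')
        if not last_evaluated_key:
            return count

with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as pool:
    segment_counts = list(pool.map(scan_segment, range(TOTAL_SEGMENTS)))
print(sum(segment_counts))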
I run Flume to ingest Twitter data into HDFS (in JSON format) and run Spark to read that file.
But somehow, it doesn't return the correct result: it seems the content of the file is not updated.
Here's my Flume configuration:
TwitterAgent01.sources = Twitter
TwitterAgent01.channels = MemoryChannel01
TwitterAgent01.sinks = HDFS
TwitterAgent01.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent01.sources.Twitter.channels = MemoryChannel01
TwitterAgent01.sources.Twitter.consumerKey = xxx
TwitterAgent01.sources.Twitter.consumerSecret = xxx
TwitterAgent01.sources.Twitter.accessToken = xxx
TwitterAgent01.sources.Twitter.accessTokenSecret = xxx
TwitterAgent01.sources.Twitter.keywords = some_keywords
TwitterAgent01.sinks.HDFS.channel = MemoryChannel01
TwitterAgent01.sinks.HDFS.type = hdfs
TwitterAgent01.sinks.HDFS.hdfs.path = hdfs://hadoop01:8020/warehouse/raw/twitter/provider/m=%Y%m/
TwitterAgent01.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent01.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent01.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent01.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent01.sinks.HDFS.hdfs.rollCount = 0
TwitterAgent01.sinks.HDFS.hdfs.rollInterval = 86400
TwitterAgent01.channels.MemoryChannel01.type = memory
TwitterAgent01.channels.MemoryChannel01.capacity = 10000
TwitterAgent01.channels.MemoryChannel01.transactionCapacity = 10000
After that I check the output with hdfs dfs -cat and it returns more than 1000 rows, meaning that the data was successfully inserted.
But in Spark that's not the case:
spark.read.json("/warehouse/raw/twitter/provider").filter("m=201802").show()
returns only 6 rows.
Did I miss something here?
I'm not entirely sure why you specified the last part of the path as the filter condition.
I believe that to read your file correctly you can just write:
spark.read.json("/warehouse/raw/twitter/provider/m=201802").show()
I'm developing Spark SQL code that reads tables from Hive (HDFS).
The problem is that when I run my code in the Spark shell, it repeatedly prints the following message:
"WARN LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems."
The code I run is:
val query_fare_details = sql("""
SELECT *
FROM fare_details
WHERE fardet_cd_carrier = 'LA'
AND fardet_cd_origin_city = 'SCL'
AND fardet_cd_dest_city = 'MIA'
AND fardet_cd_fare_basis = 'NNE0F0O1'
""")
query_fare_details.registerTempTable("query_fare_details")
val matchFAR1 = sql("""
SELECT *
FROM query_fare_details f
JOIN fare_rules r ON f.fardet_cd_carrier = r.farrul_cd_carrier
AND f.fardet_num_rule_tariff = r.farrul_num_rule_tariff
AND f.fardet_cd_fare_rule_bigint = r.farrul_cd_fare_rule_bigint
AND f.fardet_cd_fare_basis = r.farrul_cd_fare_basis
LIMIT 10""")
matchFAR1.show(5)
Any idea what goes wrong?
You can safely ignore this warning; it is not an error.
Refer to https://issues.apache.org/jira/browse/SPARK-3057
I would like to select from a PostgreSQL instance on an openSUSE server, from an Oracle instance on a server running Oracle's own Linux distribution, using the Oracle ODBC gateway. I have done this successfully in the past and continue to do it with the same Oracle box and other SUSE/Postgres boxes.
My ODBC driver manager on the SUSE (Postgres) side is unixODBC.
My odbc.ini is:
[postgresql]
Description = Test to Postgres
Driver = /usr/lib64/psqlodbcw.so
Trace = Yes
TraceFile = sql.log
Database = host
Servername = localhost
UserName = *****
Password = *****
Port = 5432
Protocol =
ReadOnly = Yes
RowVersioning = No
ShowSystemTables = No
ShowOidColumn = No
FakeOidIndex = No
ConnSettings =
My odbcinst.ini is:
[postgresql]
Description = Postgresql driver for Linux
Driver = /usr/lib64/psqlodbcw.so
UsageCount = 1
My tnsnames.ora (with other ODBC entries omitted) is:
# tnsnames.ora Network Configuration File: /opt/oracle/product/11gR1/db/network/admin/tnsnames.ora
# Generated by Oracle configuration tools.
LISTENER_ORCL =
(ADDRESS = (PROTOCOL = TCP)(HOST = dbs1)(PORT = 1522))
ORCL =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = dbs1)(PORT = 1522))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = orcl)
)
)
ODBC_SERVER123=
(DESCRIPTION=
(ADDRESS=
(PROTOCOL=TCP)
(HOST=dbs1)
(PORT=1522)
)
(CONNECT_DATA=
(SID=server123)
)
(HS=OK)
)
My listener.ora (with other SID_DESC entries omitted) is:
# listener.ora Network Configuration File: /opt/oracle/product/11gR1/db/network/admin/listener.ora
# Generated by Oracle configuration tools.
LISTENER =
(DESCRIPTION_LIST =
(DESCRIPTION =
(ADDRESS_LIST=
(ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC1522))
(ADDRESS = (PROTOCOL = TCP)(HOST = dbs1)(PORT = 1522))
)
)
)
ADR_BASE_LISTENER = /opt/oracle
SID_LIST_LISTENER=
(SID_LIST=
(SID_DESC=
(SID_NAME=server123)
(ORACLE_HOME=/opt/oracle/product/11gR1/db)
(PROGRAM=dg4odbc)
(ENVS=LD_LIBRARY_PATH=/usr/lib64:/opt/oracle/product/11gR1/db/lib)
)
)
TRACE_LEVEL_LISTENER = 0
LOGGING_LISTENER = off
Also, here is the inithost123.ora file located in $ORACLE_HOME/hs/admin:
#
# HS init parameters
#
HS_FDS_CONNECT_INFO = server123
HS_FDS_TRACE_LEVEL = 0
#HS_FDS_TRACE_LEVEL=DEBUG
HS_FDS_SHAREABLE_NAME = /usr/lib64/libodbc.so
HS_FDS_SUPPORT_STATISTICS = FALSE
HS_LANGUAGE=AMERICAN_AMERICA.WE8ISO8859P1
#HS_LANGUAGE=AMERICAN_AMERICA.US7ASCII
HS_FDS_TIMESTAMP_MAPPING = "TIMESTAMP(6)"
HS_FDS_FETCH_ROWS=1
HS_FDS_SQLLEN_INTERPRETATION=32
#
# ODBC specific environment variables
#
set ODBCINI=/etc/unixODBC/odbc.ini
#
# Environment variables required for the non-Oracle system
#
#set <envvar>=<value>
And just for good measure, my sqlnet.ora is:
# sqlnet.ora Network Configuration File: /opt/oracle/product/11gR1/db/network/admin/sqlnet.ora
# Generated by Oracle configuration tools.
NAMES.DIRECTORY_PATH= (TNSNAMES, EZCONNECT)
ADR_BASE = /opt/oracle
I added a gateway database link on the Oracle side using the following:
CREATE DATABASE LINK ODBC_SERVER123 CONNECT TO "*****" IDENTIFIED BY "*****" USING 'ODBC_SERVER123';
When I try to perform my select I receive the following error:
[SQL] select * from "legit_view"@ODBC_SERVER123
[Err] ORA-28500: connection from ORACLE to a non-Oracle system returned this message:
[unixODBC][Driver Manager]Data source name not found, and no default driver specified {IM002}
ORA-02063: preceding 2 lines from ODBC_SERVER123
I can successfully test ODBC locally on the SUSE/Postgres box with isql:
>isql -v postgresql ***** *****
+---------------------------------------+
| Connected! |
| |
| sql-statement |
| help [tablename] |
| quit |
| |
+---------------------------------------+
SQL>
What step have I forgotten?
Your odbc.ini entry must live on the box that originates the query, which in this example is the Oracle box, because that is where DG4ODBC and unixODBC load the ODBC driver. Here is an entry that must go into odbc.ini on the Oracle box:
[server123]
Driver = /usr/lib64/psqlodbcw.so
Description = ODBC to PostgreSQL on Server 123
Trace = No
Tracefile = /var/log/sql_host_123.log
Servername = ip.address.for.123
Username = *****
Password = *****
Port = 5432
Protocol = 6.4
Database = host
QuotedId = Yes
ReadOnly = No
RowVersioning = No
ShowSystemTables = No
ShowOidColumn = No
FakeOidIndex = No
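Once that [server123] entry (and the PostgreSQL ODBC driver it points to) is in place on the Oracle box, you can verify it there the same way you already tested on the SUSE box, assuming unixODBC's isql tool is installed there:

isql -v server123 ***** *****

If that connects, DG4ODBC should be able to resolve the data source as well.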