I'm trying to run a Spark job on Spark 3.2.0, and I'm migrating from log4j 1.x to log4j 2.x, as Spark 3.2.0 uses log4j 2.x instead of log4j 1.x.
Below are the contents of my log4j.properties file:
log4j.appender.stdout.layout.extractFieldsFromMessage=false
log4j.appender.stdout.layout.mdcKeys=*
log4j.appender.stdout.layout=<some-package>
log4j.appender.stdout.Target=System.out
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.logger.org.apache.parquet=ERROR
log4j.logger.org.apache.spark.executor.CoarseGrainedExecutorBackend=WARN
log4j.logger.org.apache.spark.executor.Executor=WARN
log4j.logger.org.apache.spark.network.client.TransportClientFactory=WARN
log4j.logger.org.apache.spark.scheduler.DAGScheduler=WARN
log4j.logger.org.apache.spark.scheduler.TaskSetManager=WARN
log4j.logger.org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport=WARN
log4j.logger.org.apache.spark.sql.execution.streaming.state=WARN
log4j.logger.org.apache.spark.storage=WARN
log4j.logger.pie.spark.orchestra.driver.rest.controllers.EventLogController=WARN
How can I rewrite the above properties for log4j 2.x?
I used a reference and tried to convert them to the syntax below, but my Spark job failed with the error shown after it.
New syntax used:
rootLogger.level = INFO
logging.packages=<some-package>
rootLogger.appenderRef.stdout.ref = stdout
appender.stdout.type = Console
appender.stdout.name = stdout
appender.stdout.target = SYSTEM_OUT
appender.stdout.layout = <some-package>
appender.stdout.layout.mdcKeys = *
appender.stdout.layout.extractFieldsFromMessage = false
logger.parquet.name = org.apache.parquet
logger.parquet.level = WARN
logger.spark_executor_CoarseGrainedExecutorBackend.name = org.apache.spark.executor.CoarseGrainedExecutorBackend
logger.spark_executor_CoarseGrainedExecutorBackend.level = WARN
logger.spark_executor_Executor.name = org.apache.spark.executor.Executor
logger.spark_executor_Executor.level = WARN
logger.spark_network.name = org.apache.spark.network.client.TransportClientFactory
logger.spark_network.level = WARN
logger.spark_scheduler_DAGScheduler.name = org.apache.spark.scheduler.DAGScheduler
logger.spark_scheduler_DAGScheduler.level = WARN
logger.spark_scheduler_TaskSetManager.name = org.apache.spark.scheduler.TaskSetManager
logger.spark_scheduler_TaskSetManager.level = WARN
logger.spark_sql_execution_datasources.name = org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport
logger.spark_sql_execution_datasources.name.level = WARN
logger.spark_sql_execution_streaming.name = org.apache.spark.sql.execution.streaming.state
logger.spark_sql_execution_streaming.level = WARN
logger.spark_storage.name = org.apache.spark.storage
logger.spark_storage.level = WARN
logger.spark_orchestra.name = pie.spark.orchestra.driver.rest.controllers.EventLogController
logger.spark_orchestra.level = WARN
Spark Error:
{"lastObservedDriverPodSummary":{"containerStates":
[{"exitCode":1,"message":"nfigurationBuilder.createAppender(PropertiesConfigurationBuilder.java:222)\n\tat
org.apache.logging.log4j.core.config.properties.PropertiesConfigurationBuilder.build(PropertiesConfigurationBuilder.java:158)\n\tat
org.apache.logging.log4j.core.config.properties.PropertiesConfigurationFactory.getConfiguration(PropertiesConfigurationFactory.java:56)\n\tat
org.apache.logging.log4j.core.config.properties.PropertiesConfigurationFactory.getConfiguration(PropertiesConfigurationFactory.java:35)\n\tat
org.apache.logging.log4j.core.config.ConfigurationFactory$Factory.getConfiguration(ConfigurationFactory.java:523)\n\tat
org.apache.logging.log4j.core.config.ConfigurationFactory$Factory.getConfiguration(ConfigurationFactory.java:498)\n\tat
org.apache.logging.log4j.core.config.ConfigurationFactory$Factory.getConfiguration(ConfigurationFactory.java:422)\n\tat
org.apache.logging.log4j.core.config.ConfigurationFactory.getConfiguration(ConfigurationFactory.java:323)\n\tat
org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:695)\n\tat
org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:716)\n\tat
org.apache.logging.log4j.core.LoggerContext.start(LoggerContext.java:270)\n\tat
org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:155)\n\tat
org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:47)\n\tat org.apache.logging.log4j.LogManager.getContext(LogManager.java:196)\n\tat
org.apache.logging.log4j.spi.AbstractLoggerAdapter.getContext(Abst","name":"spark-driver",
"reason":"Error"}],"phase":"Running"}}
I use Glue 3.0, which supports Spark 3.1 and Python 3, from an infrastructure perspective. I am trying to run a MERGE INTO target USING source operation in Spark SQL to UPSERT a table. However, I am getting the below error:
An error occurred while calling o91.sql. MERGE INTO TABLE is not supported temporarily.
I am not using any Delta table; my target is read directly from a PostgreSQL (Aurora DB) database using the Spark DataFrame reader. The source is another DataFrame read from a Parquet file, also using the Spark DataFrame reader.
I have tried changing the Glue version, but it did not help. When I looked for answers on the internet, I got links to Iceberg and Delta tables. Is my approach to the problem correct? Please share your inputs.
The code is provided below:
def changeDataCapture(inputDf, currDf, spark):
    inputDf.createOrReplaceTempView('inputDf')
    currDf.createOrReplaceTempView('currDf')
    currDf = spark.sql("""
MERGE INTO currDf USING inputDf
ON currDf.REG_NB = inputDf.registerNumber
AND currDf.ANN_RTN_DT = inputDf.annual_return_date
WHEN MATCHED
THEN UPDATE SET
currDf.LAST_SEEN_DT = inputDf.LAST_SEEN_DT,
currDf.TO_DB_DT = inputDf.TO_DB_DT,
currDf.TO_DB_TM = inputDf.TO_DB_TM,
currDf.BATCH_ID = inputDf.BATCH_ID,
currDf.DATA_PROC_ID = inputDf.DATA_PROC_ID,
currDf.FIRST_SEEN_DT = CASE
WHEN currDf.CO_REG_DEBT = inputDf.registered_indebtedness
AND currDf.HLDR_LIST_CD = inputDf.holder_list_indicator
AND currDf.HLDR_LEGAL_STAT = inputDf.holder_legal_status
AND currDf.HLDR_REFRESH_CD = inputDf.holder_refresh_flag
AND currDf.HLDR_SUPRESS_IN = inputDf.HLDR_SUPRESS_IN
AND currDf.BULK_LIST_ID = inputDf.Bulk_List_In
THEN currDf.FIRST_SEEN_DT
ELSE inputDf.FIRST_SEEN_DT
END,
currDf.SUPERSEDED_DT = CASE
WHEN currDf.CO_REG_DEBT = inputDf.registered_indebtedness
AND currDf.HLDR_LIST_CD = inputDf.holder_list_indicator
AND currDf.HLDR_LEGAL_STAT = inputDf.holder_legal_status
AND currDf.HLDR_REFRESH_CD = inputDf.holder_refresh_flag
AND currDf.HLDR_SUPRESS_IN = inputDf.HLDR_SUPRESS_IN
AND currDf.BULK_LIST_ID = inputDf.Bulk_List_In
THEN currDf.SUPERSEDED_DT
ELSE inputDf.SUPERSEDED_DT
END
WHEN NOT MATCHED
THEN INSERT
(REG_NB, ANN_RTN_DT, SUPERSEDED_DT, TO_DB_DT, TO_DB_TM, FIRST_SEEN_DT, LAST_SEEN_DT, BATCH_ID,
DATA_PROC_ID, CO_REG_DEBT, HLDR_LIST_CD, HLDR_LIST_DT, HLDR_LEGAL_STAT,
HLDR_REFRESH_CD, HLDR_SUPRESS_IN, BULK_LIST_ID, DOC_TYPE_CD)
VALUES
(registerNumber, annual_return_date, SUPERSEDED_DT, TO_DB_DT, TO_DB_TM, FIRST_SEEN_DT, LAST_SEEN_DT,
BATCH_ID, DATA_PROC_ID, registered_indebtedness, holder_list_indicator,
holder_list_date, holder_legal_status, holder_refresh_flag, HLDR_SUPRESS_IN,
Bulk_List_In, DOC_TYPE_CD)
""")
    return currDf
Thanks
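For what it's worth, plain Spark 3.1 on Glue 3.0 only supports MERGE INTO through table formats that implement it (Delta Lake, Iceberg, Hudi), which is why the statement fails against temp views built from ordinary DataFrames. The same UPSERT effect can be approximated with a full outer join plus coalesce, and the result then written back with the DataFrame writer. A minimal, self-contained sketch follows; the column names key, value, and seen_dt are made up for illustration and are not the real schema from the question.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical target and incoming data; replace with the real DataFrames.
curr_df = spark.createDataFrame([(1, "a", "2021-01-01"), (2, "b", "2021-01-01")],
                                ["key", "value", "seen_dt"])
input_df = spark.createDataFrame([(2, "b2", "2021-02-01"), (3, "c", "2021-02-01")],
                                 ["key", "value", "seen_dt"])

# Full outer join on the business key: matched rows take the incoming values
# (the WHEN MATCHED ... UPDATE branch), unmatched incoming rows are inserted
# (the WHEN NOT MATCHED ... INSERT branch), and untouched target rows are kept.
cond = F.col("c.key") == F.col("i.key")
upserted = (curr_df.alias("c")
            .join(input_df.alias("i"), cond, "full_outer")
            .select(F.coalesce(F.col("i.key"), F.col("c.key")).alias("key"),
                    F.coalesce(F.col("i.value"), F.col("c.value")).alias("value"),
                    F.coalesce(F.col("i.seen_dt"), F.col("c.seen_dt")).alias("seen_dt")))
upserted.show()
If MERGE INTO itself is required, registering the target as a Delta or Iceberg table in the Glue job is the route the links you found point to.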
I am working with Confluent and trying to write data from a topic to Azure Data Lake Gen2 using the sink connector. I have set up a Connect image which contains the connector, deployed it in AKS using Confluent Operator, and configured it to write messages from a topic to storage. However, every time I send messages to the topic, the connector enters a FAILED state with the following message:
{"name":"adls-gen2-sink",
"connector":{"state":"FAILED","worker_id":"connectors-0.connectors.operator.svc.cluster.local:9083",
"trace":"org.apache.kafka.common.errors.TimeoutException: License topic could not be created\n
Caused by: org.apache.kafka.common.errors.TimeoutException: Call(callName=createTopics, deadlineMs=1615683285588, tries=1, nextAllowedTryMs=1615683285689) timed out at 1615683285589 after 1 attempt(s)\n
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment.\n"},
"tasks":[],"type":"sink"}
Connector config:
name = adls-gen2-sink
connector.class = io.confluent.connect.azure.datalake.gen2.AzureDataLakeGen2SinkConnector
tasks.max = 1
topics = clicks
errors.deadletterqueue.topic.replication.factor = 1
format.class = io.confluent.connect.azure.storage.format.avro.AvroFormat
flush.size = 3
azure.datalake.gen2.client.id = xxx
azure.datalake.gen2.token.endpoint = https://login.microsoftonline.com/xxx/oauth2/token
azure.datalake.gen2.account.name = xxx
azure.datalake.gen2.client.key = xxx
storage.class = io.confluent.connect.azure.datalake.gen2.storage.AzureDataLakeGen2Storage
partitioner.class = io.confluent.connect.storage.partitioner.DailyPartitioner
confluent.topic.bootstrap.servers = localhost:9092
confluent.topic = License
confluent.topic.replication.factor = 1
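One observation rather than a confirmed fix: this is a Confluent enterprise connector, so on startup it tries to create its license topic on the cluster named by confluent.topic.bootstrap.servers. From a Connect pod inside AKS, localhost:9092 will not reach the brokers, which matches the "License topic could not be created" timeout and the "Timed out waiting for a node assignment" cause. A sketch of those settings pointed at the in-cluster Kafka bootstrap service (the host name and port below are placeholders for whatever your Confluent Operator deployment actually exposes):
confluent.topic.bootstrap.servers = kafka.operator.svc.cluster.local:9071
confluent.topic.replication.factor = 1
I believe the connector falls back to a default license topic name (_confluent-command) when confluent.topic is left unset, so the License topic setting is optional.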
I run Flume to ingest Twitter data into HDFS (in JSON format) and run Spark to read that file.
But somehow, it doesn't return the correct result: it seems the content of the file is not updated.
Here's my Flume configuration:
TwitterAgent01.sources = Twitter
TwitterAgent01.channels = MemoryChannel01
TwitterAgent01.sinks = HDFS
TwitterAgent01.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent01.sources.Twitter.channels = MemoryChannel01
TwitterAgent01.sources.Twitter.consumerKey = xxx
TwitterAgent01.sources.Twitter.consumerSecret = xxx
TwitterAgent01.sources.Twitter.accessToken = xxx
TwitterAgent01.sources.Twitter.accessTokenSecret = xxx
TwitterAgent01.sources.Twitter.keywords = some_keywords
TwitterAgent01.sinks.HDFS.channel = MemoryChannel01
TwitterAgent01.sinks.HDFS.type = hdfs
TwitterAgent01.sinks.HDFS.hdfs.path = hdfs://hadoop01:8020/warehouse/raw/twitter/provider/m=%Y%m/
TwitterAgent01.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent01.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent01.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent01.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent01.sinks.HDFS.hdfs.rollCount = 0
TwitterAgent01.sinks.HDFS.hdfs.rollInterval = 86400
TwitterAgent01.channels.MemoryChannel01.type = memory
TwitterAgent01.channels.MemoryChannel01.capacity = 10000
TwitterAgent01.channels.MemoryChannel01.transactionCapacity = 10000
After that I check the output with hdfs dfs -cat and it returns more than 1000 rows, meaning that the data was successfully inserted.
But in Spark that's not the case:
spark.read.json("/warehouse/raw/twitter/provider").filter("m=201802").show()
only has 6 rows.
Did I miss something here?
I'm not entirely sure why you specified the last part of the path as the filter condition.
I believe that to correctly read your file you can just write:
spark.read.json("/warehouse/raw/twitter/provider/m=201802").show()
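If you still want the partition value available as a column (for example to keep using filter), a small sketch using the basePath option should also work; the paths are taken from the question, but I have not run this against your data.
# Reading a single partition directory normally drops the "m" column; pointing
# basePath at the table root keeps it available.
df = (spark.read
      .option("basePath", "/warehouse/raw/twitter/provider")
      .json("/warehouse/raw/twitter/provider/m=201802"))
df.show()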
I'm developing Spark SQL code that reads Hive tables (on HDFS).
The problem is that when I run my code in the Spark shell, it repeatedly prints the following message:
"WARN LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems."
The code I run is:
val query_fare_details = sql("""
SELECT *
FROM fare_details
WHERE fardet_cd_carrier = 'LA'
AND fardet_cd_origin_city = 'SCL'
AND fardet_cd_dest_city = 'MIA'
AND fardet_cd_fare_basis = 'NNE0F0O1'
""")
query_fare_details.registerTempTable("query_fare_details")
val matchFAR1 = sql("""
SELECT *
FROM query_fare_details f
JOIN fare_rules r ON f.fardet_cd_carrier = r.farrul_cd_carrier
AND f.fardet_num_rule_tariff = r.farrul_num_rule_tariff
AND f.fardet_cd_fare_rule_bigint = r.farrul_cd_fare_rule_bigint
AND f.fardet_cd_fare_basis = r.farrul_cd_fare_basis
LIMIT 10""")
matchFAR1.show(5)
Any idea what's going wrong?
You can safely ignore this warning; it is not an error.
Refer to https://issues.apache.org/jira/browse/SPARK-3057
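If the repeated message is too noisy, one option (a sketch, assuming the warning comes from Hive's org.apache.hadoop.hive.serde2.lazy.LazyStruct class and that your Spark installation uses conf/log4j.properties) is to raise that logger's level so only errors are printed:
log4j.logger.org.apache.hadoop.hive.serde2.lazy.LazyStruct=ERROR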
We have a Percona XtraDB cluster with 5 nodes and an arbitrator. One of our PHP developers ran a bad query on the cluster, crashing all the nodes. After the crash, we could not collect any error log to tell us what really went wrong, as the entire cluster crashed without performing any logging.
I have always thought that when a single query is executed on the cluster, it is processed by only one of the nodes in the cluster. So if the query is bad (to the point of killing a DB server), it should only crash the one node that's processing it, leaving the cluster running with the remaining 4 nodes.
This behavior has puzzled us, and we would like to understand what is really going on, especially since this is the second time it has happened. Why would a query running on the cluster, while being processed by one of the nodes, cause the other nodes in the cluster to crash if something goes wrong during processing?
Below is our my.cnf config:
#
# Default values.
[mysqld_safe]
flush_caches
numa_interleave
#
#
[mysqld]
back_log = 65535
binlog_format = ROW
character_set_server = utf8
collation_server = utf8_general_ci
datadir = /var/lib/mysql
default_storage_engine = InnoDB
expand_fast_index_creation = 1
expire_logs_days = 7
innodb_autoinc_lock_mode = 2
innodb_buffer_pool_instances = 16
innodb_buffer_pool_populate = 1
innodb_buffer_pool_size = 32G # XXX 64GB RAM, 80%
innodb_data_file_path = ibdata1:64M;ibdata2:64M:autoextend
innodb_file_format = Barracuda
innodb_file_per_table
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT
innodb_io_capacity = 1600
innodb_large_prefix
innodb_locks_unsafe_for_binlog = 1
innodb_log_file_size = 64M
innodb_print_all_deadlocks = 1
innodb_read_io_threads = 64
innodb_stats_on_metadata = FALSE
innodb_support_xa = FALSE
innodb_write_io_threads = 64
log-bin = mysqld-bin
log-queries-not-using-indexes
log-slave-updates
long_query_time = 1
max_allowed_packet = 64M
max_connect_errors = 4294967295
max_connections = 4096
min_examined_row_limit = 1000
port = 3306
relay-log-recovery = TRUE
skip-name-resolve
slow_query_log = 1
slow_query_log_timestamp_always = 1
table_open_cache = 4096
thread_cache = 1024
tmpdir = /db/tmp
transaction_isolation = REPEATABLE-READ
updatable_views_with_limit = 0
user = mysql
wait_timeout = 60
#
# Galera Variable config
wsrep_cluster_address = gcomm://ip_1, ip_2, ip_3,ip_4,ip_4,ip_5
wsrep_cluster_name = cluster_db
wsrep_provider = /usr/lib/libgalera_smm.so
wsrep_provider_options = "gcache.size=4G"
wsrep_slave_threads = 32
wsrep_sst_auth = "user:password"
wsrep_sst_donor = "db1"
#wsrep_sst_method = xtrabackup_throttle
wsrep_sst_method = xtrabackup-v2
#
# XXX You *MUST* change!
server-id = 1
Can you post the query? SELECT queries only execute on a single node but all write queries will execute everywhere. What's in your error log?