Spark-etl load data into accumulo failure - accumulo

While learning to use GeoTrellis to load data into Accumulo, I ran into a problem:
Exception in thread "main"
geotrellis.spark.io.package$LayerWriteError: Failed to write
Layer(name = "example", zoom = 13)......
org.apache.accumulo.core.client.AccumuloException:
file:/geotrellis-ingest/726314aa-5b72-4f9c-9c41-f9521a045603-O45VGIHPpi:
java.io.IOException:
file:/geotrellis-ingest/726314aa-5b72-4f9c-9c41-f9521a045603-O45VGIHPpi
is not in a volume configured for Accumulo
Here are images of my config file:
[config screenshots]

I'm not familiar with geotrellis or spark, but the error message indicates that there's a bulk import into Accumulo being attempted across filesystems (volumes), which Accumulo doesn't support.
The files you bulk import must be on a volume that is already configured for use in Accumulo. Accumulo will move the files within the same volume to its own directories, but it won't move them across volumes. To configure volumes for use within Accumulo, see the documentation for instance.volumes.
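For illustration only: if Accumulo keeps its data in HDFS, instance.volumes (set in accumulo.properties, or accumulo-site.xml on older releases) might look like the line below, and the files you bulk import would have to be written under that same volume. The namenode address and path here are assumptions, not values from your setup.
instance.volumes=hdfs://namenode:8020/accumulo
In the error above the failing path starts with file:/, i.e. the local filesystem, which suggests the ingest output is not being written to any of Accumulo's configured volumes; it would need to land under the Accumulo volume (an hdfs:// path in this example) instead.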

Related

Spark event log not able to write to s3

I am trying to write the event log of my Spark application to S3 so it can be consumed later through the history server, but I get the warning message below in the log:
WARN S3ABlockOutputStream: Application invoked the Syncable API against stream writing to /spark_logs/eventlog_v2_local-1671627766466/events_1_local-1671627766466. This is unsupported
Below is the spark config I used:
config("spark.eventLog.enabled", "true")\
.config("spark.eventLog.dir", 's3a://change-data-capture-cdc-test/pontus_data_load/spark_logs')\
.config("spark.eventLog.rolling.enabled", "true")\
.config("spark.eventLog.rolling.maxFileSize", "10m")
spark version - 3.3.1
dependent jars:
org.apache.hadoop:hadoop-aws:3.3.0
com.amazonaws:aws-java-sdk-bundle:1.11.901
Only the appstatus_local-1671627766466.inprogress file is created; the actual log file is not created. But with my local file system it works as expected.
The warning means exactly what it says: "the Application invoked the Syncable API against stream writing to /spark_logs/eventlog_v2_local-1671627766466/events_1_local-1671627766466. This is unsupported."
Application code that persists data to a filesystem calls sync() to flush and save; clearly the Spark event logging is calling this, and as noted, the s3a client says "no can do".
S3 is not a filesystem. It is an object store; objects are written in single atomic operations. If you look at the S3ABlockOutputStream class (it is all open source, after all) you can see that it may upload data as it goes, but it only completes the write in close().
Therefore the log is not visible during the logging process itself; it only appears once the stream is closed. The warning is there to make clear this is happening.
If you want, you can set spark.hadoop.fs.s3a.downgrade.syncable.exceptions to false and the sync() call will raise an exception instead of being downgraded to a warning. That really makes clear to applications like HBase that the filesystem lacks the semantics they need.
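A minimal sketch of how that could look in the session builder from the question (the eventLog settings are unchanged from above; the extra fs.s3a property is the one just discussed, and setting it to false makes sync() fail fast rather than only warn):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "s3a://change-data-capture-cdc-test/pontus_data_load/spark_logs")
    .config("spark.eventLog.rolling.enabled", "true")
    .config("spark.eventLog.rolling.maxFileSize", "10m")
    # assumption: raise instead of downgrading Syncable calls to a warning
    .config("spark.hadoop.fs.s3a.downgrade.syncable.exceptions", "false")
    .getOrCreate())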

org.postgresql.util.PSQLException: SSL error: Received fatal alert: handshake_failure while writing from Azure Databricks to Azure Postgres Citus

I am trying to write pyspark dataframe to Azure Postgres Citus (Hyperscale).
I am using the latest Postgres JDBC driver and I tried writing on Databricks Runtime 7, 6, and 5.
df.write.format("jdbc").option("url","jdbc:postgresql://<HOST>:5432/citus?user=citus&password=<PWD>&sslmode=require" ).option("dbTable", table_name).mode(method).save()
This is what I get after running the above command
org.postgresql.util.PSQLException: SSL error: Received fatal alert: handshake_failure
I have already tried different parameters in the URL and under the options as well, but no luck so far.
However, I am able to connect to this instance from my local machine and from the Databricks driver/notebook using psycopg2.
Both the Azure Postgres Citus and Databricks are in the same region and Azure Postgres Citus is public.
It worked after overriding the Java security properties for the driver and executor:
spark.driver.extraJavaOptions -Djava.security.properties=
spark.executor.extraJavaOptions -Djava.security.properties=
Explanation:
What is happening in reality is that the JVM's security configuration is read by default from the file /databricks/spark/dbconf/java/extra.security, and in this file some TLS algorithms are disabled by default. That means that if I edit this file and replace the ciphers Postgres Citus needs with an empty string, that should also work.
When I set this option for the executors (spark.executor.extraJavaOptions) it does not change the JVM defaults there; for the driver, however, it does override them, and so it starts to work.
Note: the file has to be edited before the property is read, so an init script is the only way of accomplishing that.
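A minimal sketch of such an init script, assuming the file path from the explanation above and that simply blanking the file is acceptable for your cluster (an illustration, not a tested script):
#!/bin/bash
# Blank out the Databricks override file that disables certain TLS algorithms,
# so the JVMs start with the standard java.security defaults.
echo "" > /databricks/spark/dbconf/java/extra.security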

Access hdfs cluster from pydoop

I have an HDFS cluster and Python on the same Google Cloud Platform. I want to access the files present in the HDFS cluster from Python. I found that pydoop can do that, but I am struggling to give it the right parameters. Below is the code that I have tried so far:
import pydoop.hdfs as hdfs
import pydoop
pydoop.hdfs.hdfs(host='url of the file system goes here',
port=9864, user=None, groups=None)
"""
class pydoop.hdfs.hdfs(host='default', port=0, user=None, groups=None)
A handle to an HDFS instance.
Parameters
host (str) – hostname or IP address of the HDFS NameNode. Set to an empty string (and port to 0) to connect to the local file system; set to 'default' (and port to 0) to connect to the default (i.e., the one defined in the Hadoop configuration files) file system.
port (int) – the port on which the NameNode is listening
user (str) – the Hadoop domain user name. Defaults to the current UNIX user. Note that, in MapReduce applications, since tasks are spawned by the JobTracker, the default user will be the one that started the JobTracker itself.
groups (list) – ignored. Included for backwards compatibility.
"""
#print (hdfs.ls("/vs_co2_all_2019_v1.csv"))
It gives this error:
RuntimeError: Hadoop config not found, try setting HADOOP_CONF_DIR
And if I execute this line of code:
print (hdfs.ls("/vs_co2_all_2019_v1.csv"))
nothing happens. But this "vs_co2_all_2019_v1.csv" file does exist in the cluster; it just was not available at the moment I took the screenshot.
My hdfs screenshot is shown below:
and the credentials that I have are shown below:
Can anybody tell me what I am doing wrong? Which credentials do I need to put where in the pydoop API? Or maybe there is a simpler way around this problem; any help will be much appreciated!
Have you tried the following?
import pydoop.hdfs as hdfs
import pydoop
hdfs_object = pydoop.hdfs.hdfs(host='url of the file system goes here',
port=9864, user=None, groups=None)
hdfs_object.list_directory("/vs_co2_all_2019_v1.csv")
or simply:
hdfs_object.list_directory("/")
Keep in mind that the module-level functions in pydoop.hdfs are not directly related to the hdfs class (hdfs_object). Thus, the connection that you established in the first command is not used in hdfs.ls("/vs_co2_all_2019_v1.csv").
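Also, since the RuntimeError above complains about missing Hadoop configuration, it may help to point pydoop at the cluster's configuration directory before using the module-level helpers. A minimal sketch, where the configuration path is an assumption you would replace with your own:
import os
# assumed location of core-site.xml / hdfs-site.xml on the node running Python
os.environ["HADOOP_CONF_DIR"] = "/etc/hadoop/conf"

import pydoop.hdfs as hdfs
# with the config visible, module-level calls use the default (namenode) filesystem
print(hdfs.ls("/"))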

Apache Spark FileNotFoundException

I am trying to play a little bit with apache-spark cluster mode.
So my cluster consists of a driver on my machine, and a worker and manager on a host machine (a separate machine).
I send a text file using sparkContext.addFile(filepath), where filepath is the path of my text file on the local machine, for which I get the following output:
INFO Utils: Copying /home/files/data.txt to /tmp/spark-b2e2bb22-487b-412b-831d-19d7aa96f275/userFiles-147c9552-1a77-427e-9b17-cb0845807860/data.txt
INFO SparkContext: Added file /home/files/data.txt at http://192.XX.XX.164:58143/files/data.txt with timestamp 1457432207649
But when I try to access the same file using SparkFiles.get("data.txt"), I get the path to the file on my driver instead of on the worker.
I am setting my file like this:
SparkConf conf = new SparkConf().setAppName("spark-play").setMaster("spark://192.XX.XX.172:7077");
conf.setJars(new String[]{"jars/SparkWorker.jar"});
JavaSparkContext sparkContext = new JavaSparkContext(conf);
sparkContext.addFile("/home/files/data.txt");
List<String> file =sparkContext.textFile(SparkFiles.get("data.txt")).collect();
I am getting FileNotFoundException here.
I have recently faced the same issue and hopefully my solution can help other people solve this issue.
We know that when you use SparkContext.addFile(<file_path>), it sends the file to the automatically created working directories in the driver node (in this case, your machine) as well as the worker nodes of the Spark cluster.
The block of code that you shared where you are using SparkFiles.get("data.txt") is being executed on the driver, so it returns the path to the file on the driver, instead of the worker. But, the task is being run on the worker and path to the file on the driver does not match the path to the file on the worker because the driver and worker nodes have different working directory paths. Hence, you get the FileNotFoundException.
There is a workaround to this problem without using any distributed file system or FTP server. You should put the file in your working directory on your host machine. Then, instead of using SparkFiles.get("data.txt"), you use "./data.txt".
List<String> file = sparkContext.textFile("./data.txt").collect();
Now, even though there is a mismatch of working directory paths between the spark driver and worker nodes, you will NOT face FileNotFoundException since you are using a relative path to access the file.
I think that the main issue is that you are trying to read the file via the textFile method. What is inside the brackets of the textFile method is executed in the driver program; on the worker node only the code to be run against an RDD is performed. When you call textFile, all that happens is that an RDD object with a trivial associated DAG is created in your driver program. Nothing happens on the worker node.
Thus, when you try to collect the data, the worker is asked to read the file at the URL you passed to textFile, as told by the driver. Since your file is on the local filesystem of the driver and the worker node does not have access to it, you get the FileNotFoundException.
The solution is to make the file available to the worker node, either by putting it into a distributed filesystem such as HDFS or serving it via (S)FTP, or by transferring the file to the worker node before running the Spark job and then passing textFile the path of the file on the worker's filesystem.

what is zookeeper.broker.path

I'm learning Spark and Kafka and came across this project, kafka-spark-consumer, that seems to consume messages from Kafka efficiently. This project requires configuring a few Kafka and ZooKeeper properties, and that's where I'm struggling. What does the property zookeeper.broker.path mean? Sorry if it's a basic question.
I have configured Kafka on a single node with the following properties:
broker.id=1
port=9093
log.dir=/tmp/kafka-logs-1
and ZooKeeper as:
zookeeper.connect=localhost:2181/brokers
zookeeper.connection.timeout.ms=6000
If I try to configure zookeeper.broker.path with /brokers, I get the following exception from the consumer:
Exception in thread "main" java.lang.RuntimeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/topics/<name>/partitions
at consumer.kafka.ReceiverLauncher.getNumPartitions(ReceiverLauncher.java:217)
at consumer.kafka.ReceiverLauncher.createStream(ReceiverLauncher.java:79)
at consumer.kafka.ReceiverLauncher.launch(ReceiverLauncher.java:51)
at com.ibm.spark.streaming.KafkaConsumer.run(KafkaConsumer.java:78)
at com.ibm.spark.streaming.KafkaConsumer.start(KafkaConsumer.java:43)
at com.ibm.spark.streaming.KafkaConsumer.main(KafkaConsumer.java:103)
Can you help me understand what the ZooKeeper broker path is here and how I can configure it?
EDIT
The above error was caused by a non-existent topic; the moment I created the topic, the error went away.
As answered by user007, the /brokers node is created by Kafka in ZooKeeper by default.
There is no need for '/brokers' in the zookeeper.connect property. It should be:
zookeeper.connect=localhost:2181
I am not familiar with the "kafka-spark-consumer" project you mentioned, but usually /brokers is the default node Kafka creates in ZooKeeper. I haven't seen any library asking the user to configure it.
/brokers is the znode path under which metadata like topics are stored.
Go to the Kafka bin directory, then invoke the ZooKeeper shell: ./zookeeper-shell.sh localhost
Then do ls. You should be able to see topics and other child nodes created there.
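For example, assuming ZooKeeper is listening on localhost:2181 (the exact output depends on your cluster):
./zookeeper-shell.sh localhost:2181
ls /brokers
ls /brokers/topics
ls /brokers should show child nodes such as ids and topics, and ls /brokers/topics should list the topics that have been created.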
