I have a Hadoop cluster with 2 master nodes and 3 data nodes, and I am running a Spark Streaming application. The application streams files from S3 and, after processing, stores the results back to S3. I don't understand why the data node storage keeps filling up when I'm not storing any data on HDFS.
Location:
/hadoopdata/hdfs/datanode/current/BP-1127218281-IP-1662958210599/current/finalized/subdir0
Please help me understand how I can handle this situation.
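For reference, here is a minimal Structured Streaming sketch of the kind of job described above; the bucket names, paths, schema, and formats are placeholders, not taken from the question. Note that even when both the input and the output live on S3, the query still writes state to whatever checkpointLocation is configured.

// Sketch only; all S3 locations and the schema are hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("s3-to-s3-streaming").getOrCreate()

val schema = new StructType()
  .add("id", LongType)
  .add("payload", StringType)

val input = spark.readStream
  .schema(schema)                        // file sources need an explicit schema
  .json("s3a://input-bucket/events/")    // hypothetical input prefix on S3

val processed = input.filter("payload IS NOT NULL")

processed.writeStream
  .format("parquet")
  .option("path", "s3a://output-bucket/results/")             // hypothetical output prefix
  .option("checkpointLocation", "s3a://output-bucket/_chk/")  // checkpoint state is written here
  .start()
  .awaitTermination()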
As Apache Spark is a recommended distributed processing engine for Cassandra, I know that it is possible to run Spark executors alongside the Cassandra nodes.
My question is whether the driver and the Spark connector are smart enough to understand partitioning and shard allocation, so that data is processed in a hyper-converged manner.
Simply put: do the executors read data from partitions hosted on the nodes where the executor is running, so that no unnecessary data is transferred across the network, as Spark does when it runs over HDFS?
Yes, Spark Cassandra Connector is able to do this. From the source code:
The getPreferredLocations method tells Spark the preferred nodes to fetch a partition from, so that the data for the partition are at the same node the task was sent to. If Cassandra nodes are collocated with Spark nodes, the queries are always sent to the Cassandra process running on the same node as the Spark Executor process, hence data are not transferred between nodes. If a Cassandra node fails or gets overloaded during read, the queries are retried to a different node.
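As an illustration, reading through the connector could look like the sketch below; the contact point, keyspace, and table name are hypothetical. When executors are co-located with Cassandra nodes, the token-range partitions report the replica hosts as preferred locations, so tasks get scheduled locally whenever possible.

// Sketch only; connection host, keyspace, and table are hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-locality")
  .config("spark.cassandra.connection.host", "10.0.0.1")
  .getOrCreate()

val events = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "events"))
  .load()

// Tasks for each token-range partition prefer the executors running on that
// range's replica nodes, so co-located reads avoid crossing the network.
println(events.count())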
Theoretically yes, and the same applies to HDFS. However, in practice I have seen less of this in the cloud, where separate nodes are used for Spark and Cassandra when their managed cloud services are used. If you use IaaS and set up your own Cassandra and Spark clusters, you can achieve it.
I would like to add to Alex's answer:
Yes, Spark Cassandra Connector is able to do this. From the source code:

The getPreferredLocations method tells Spark the preferred nodes to fetch a partition from, so that the data for the partition are at the same node the task was sent to. If Cassandra nodes are collocated with Spark nodes, the queries are always sent to the Cassandra process running on the same node as the Spark Executor process, hence data are not transferred between nodes. If a Cassandra node fails or gets overloaded during read, the queries are retried to a different node.
However, I would argue that this is actually bad behavior.
In Cassandra, when you ask for the data of a particular partition, only one node is accessed. Spark can actually access 3 nodes thanks to replication. So, without shuffling, you have only 3 nodes participating in the job.
In Hadoop, however, when you ask for the data of a particular partition, usually all nodes in the cluster are accessed, and Spark can then use all nodes in the cluster as executors.
So if you have 100 nodes: with Cassandra, Spark will take advantage of 3 nodes; with Hadoop, Spark will take advantage of all 100 nodes.
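If you want downstream stages to use more than those 3 nodes, one option is an explicit repartition after the read. Below is a hedged sketch; the keyspace, table, and partition-key predicate are hypothetical, and it assumes an existing spark session.

// Sketch only; names are hypothetical.
val rows = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "events"))
  .load()
  .filter("sensor_id = 'abc-42'")   // partition-key predicate, typically pushed down to Cassandra

// The read itself only touches the replica nodes; repartitioning shuffles the
// rows so that later transformations run on every executor in the cluster.
val spread = rows.repartition(spark.sparkContext.defaultParallelism)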
Cassandra is optimized for real-time operational workloads, and is therefore not optimized for analytics the way data lakes are.
I am using Spark Structured Streaming: I read a stream from Kafka and, after some transformations, write the resulting stream back to Kafka.
I see a lot of hidden ..*tmp.crc files within my checkpoint directory. These files are not being cleaned up, and their number keeps growing.
Am I missing some configuration?
I am not running Spark on Hadoop; I am using an EBS-based volume for checkpointing.
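For reference, a minimal sketch of the kind of query described; the broker addresses, topic names, and checkpoint path are placeholders rather than values from the question.

// Sketch only; brokers, topics, and checkpoint path are hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-to-kafka").getOrCreate()

val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "input-topic")
  .load()

// The Kafka sink expects a "value" column (and optionally "key" / "topic").
val transformed = input.selectExpr("CAST(value AS STRING) AS value")

transformed.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "output-topic")
  .option("checkpointLocation", "/mnt/ebs/checkpoints/my-query")  // EBS-backed path, as in the question
  .start()
  .awaitTermination()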
I am trying to understand whether Spark is an alternative to the vanilla MapReduce approach for analyzing big data. Since Spark keeps operations on data in memory, when using HDFS as the storage system for Spark, does it take advantage of HDFS's distributed storage? For instance, suppose I have a 100 GB CSV file stored in HDFS and I want to run analysis on it. If I load it from HDFS into Spark, will Spark load the complete data in memory to do the transformations, or will it use the distributed environment that HDFS provides for storage, the way MapReduce programs written in Hadoop do? If not, what is the advantage of using Spark over HDFS?
PS: I know Spark spills to disk if RAM overflows, but does this spill happen per node of the cluster (say, 5 GB per node) or for the complete data (100 GB)?
Spark jobs can be configured to spill to the executors' local disks if there is not enough memory to read your files. Alternatively, you can enable HDFS snapshots and caching between Spark stages.
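To illustrate the per-node behaviour, here is a minimal sketch (the HDFS path and CSV options are hypothetical): each executor only materialises the partitions assigned to it, and MEMORY_AND_DISK lets a node spill its own overflow to local disk instead of loading the full 100 GB anywhere.

// Sketch only; the path is hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("csv-analysis").getOrCreate()

val df = spark.read
  .option("header", "true")
  .csv("hdfs://namenode:8020/data/big.csv")   // read as distributed partitions (HDFS blocks)

// Cache what fits in memory; each node spills its own overflow to its local disk.
df.persist(StorageLevel.MEMORY_AND_DISK)

println(df.count())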
You mention CSV, which is just a bad format to have in Hadoop in general. If you have 100 GB of CSV, you could just as easily have less than half of that if the data were written in Parquet or ORC...
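As an illustration, a one-off conversion could look like this sketch (the paths are hypothetical; it reuses the spark session from the sketch above).

// Rewrite the CSV as Parquet: columnar layout plus compression typically
// shrinks the data and speeds up later scans.
spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs://namenode:8020/data/big.csv")
  .write
  .mode("overwrite")
  .parquet("hdfs://namenode:8020/data/big_parquet")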
At the end of the day, you need some processing engine and some storage layer. For example, Spark on Mesos or Kubernetes might work just as well as on YARN, but those are separate systems that are not bundled and tied together as nicely as HDFS and YARN. Plus, like MapReduce, when using YARN you move the execution to the NodeManagers on the DataNodes, rather than pulling data over the network, which is what you would be doing with other Spark execution modes. The NameNode and ResourceManager coordinate this communication to determine where data is stored and processed.
If you are convinced that MapReduce v2 can be better than Spark, I would encourage you to look at Tez instead.
I have been trying to deploy a multi-node Spark cluster on three machines (master, slave1, and slave2). I have successfully deployed the Spark cluster, but I am confused about how to distribute my data over the slaves with HDFS. Do I need to manually put data on my slave nodes, and how can I specify where to read data from when submitting an application from the client? I have searched multiple forums but haven't been able to figure out how to use HDFS with Spark without using Hadoop.
tl;dr Store files to be processed by a Spark application on Hadoop HDFS and Spark executors will be told how to access them.
From HDFS Users Guide:
This document is a starting point for users working with Hadoop Distributed File System (HDFS) either as a part of a Hadoop cluster or as a stand-alone general purpose distributed file system.
A HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.
So HDFS is simply a distributed file system that you can use to store files and access them from a distributed application, including a Spark application.
To my great surprise, it's only in the HDFS Architecture document that you can find an example of an HDFS URI, e.g. hdfs://localhost:8020/user/hadoop/delete/test1, which is an HDFS URL for the resource delete/test1 belonging to the user hadoop.
A URL that starts with hdfs points at an HDFS cluster, which in the above example is managed by a NameNode at localhost:8020.
That means HDFS does not require Hadoop YARN; they are just usually used together because they are shipped together and are simple to use together.
Do I need to manually put data on my slave nodes and how can I specify where to read data from when submitting an application from the client?
Spark supports Hadoop HDFS with or without Hadoop YARN. A cluster manager (aka master URL) is an orthogonal concern to HDFS.
Wrapping it up, just use hdfs://hostname:port/path/to/directory to access files on HDFS.
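For example, a minimal sketch (the NameNode host, port, and paths are hypothetical):

// Sketch only; host, port, and paths are hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hdfs-read").getOrCreate()

// Executors read the HDFS blocks directly from the DataNodes.
val lines = spark.read.textFile("hdfs://namenode-host:8020/path/to/directory")
println(lines.count())

// Writing works the same way: just point at another hdfs:// path.
lines.write.text("hdfs://namenode-host:8020/path/to/output")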
I am using Spark 1.5 without HDFS, in cluster mode, to build an application. I was wondering, when there is a save operation, e.g.,
df.write.parquet("...")
where is the data stored? Is all the data stored at the master, or does each worker store its data locally?
Generally speaking, all worker nodes will write to their own local file systems, with the driver writing only a _SUCCESS file.
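To make that concrete, here is a hedged sketch reusing the df from the question (the paths and bucket name are hypothetical): writing to a plain local path scatters the part files across the workers' own disks, while a shared target keeps the output in one place.

// Each executor writes its own part-* files; only the driver writes _SUCCESS.
df.write.mode("overwrite").parquet("file:///tmp/out")       // ends up scattered across worker-local disks

// A shared file system or object store collects the output in one location.
df.write.mode("overwrite").parquet("s3a://my-bucket/out")   // hypothetical shared target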