I have a Spark Streaming job (2.1.1 with Cloudera 5.12) with Kafka as input and HDFS as output (in Parquet format).
The problem is that I'm getting a LeaseExpiredException randomly (not in every mini-batch):
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/qoe_fixe/data_tv/tmp/cleanData/_temporary/0/_temporary/attempt_20180629132202_0215_m_000000_0/year=2018/month=6/day=29/hour=11/source=LYO2/part-00000-c6f21a40-4088-4d97-ae0c-24fa463550ab.snappy.parquet (inode 135532024): File does not exist. Holder DFSClient_attempt_20180629132202_0215_m_000000_0_-1048963677_900 does not have any open files.
I'm using the Dataset API to write to HDFS:
if (!InputWithDatePartition.rdd.isEmpty()) InputWithDatePartition.repartition(1).write.partitionBy("year", "month", "day", "hour", "source").mode("append").parquet(cleanPath)
My job fails after a few hours because of this error.
Two jobs writing to the same directory share the same _temporary folder.
So when the first job finishes, this code is executed (FileOutputCommitter class):
public void cleanupJob(JobContext context) throws IOException {
  if (hasOutputPath()) {
    Path pendingJobAttemptsPath = getPendingJobAttemptsPath();
    FileSystem fs = pendingJobAttemptsPath
        .getFileSystem(context.getConfiguration());
    // if job allow repeatable commit and pendingJobAttemptsPath could be
    // deleted by previous AM, we should tolerate FileNotFoundException in
    // this case.
    try {
      fs.delete(pendingJobAttemptsPath, true);
    } catch (FileNotFoundException e) {
      if (!isCommitJobRepeatable(context)) {
        throw e;
      }
    }
  } else {
    LOG.warn("Output Path is null in cleanupJob()");
  }
}
It deletes pendingJobAttemptsPath (_temporary) while the second job is still running.
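One way to avoid the collision (a hedged sketch in Java, not the original code: the staging-path naming and the file-moving logic are my assumptions, only cleanPath and the partition columns come from the question) is to give each batch its own base path, so no two writers ever share a _temporary directory, and then move the committed part files under the shared location:

import java.io.IOException;
import java.util.UUID;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

class IsolatedBatchWriter {
    // Write one batch into its own staging directory (so each writer has its own
    // _temporary), then move the committed part files under the shared cleanPath.
    static void writeBatch(SparkSession spark, Dataset<Row> batch, String cleanPath)
            throws IOException {
        String staging = cleanPath + "_staging_" + UUID.randomUUID();

        batch.repartition(1)
             .write()
             .partitionBy("year", "month", "day", "hour", "source")
             .mode(SaveMode.Append)
             .parquet(staging);

        FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
        Path stagingPath = fs.makeQualified(new Path(staging));
        String stagingRoot = stagingPath.toUri().getPath();

        RemoteIterator<LocatedFileStatus> files = fs.listFiles(stagingPath, true);
        while (files.hasNext()) {
            Path src = files.next().getPath();
            if (!src.getName().startsWith("part-")) continue;  // skip _SUCCESS etc.
            String rel = src.toUri().getPath().substring(stagingRoot.length() + 1);
            Path dst = new Path(cleanPath, rel);                // same partition layout
            fs.mkdirs(dst.getParent());
            fs.rename(src, dst);                                // part file names are unique
        }
        fs.delete(stagingPath, true);
    }
}

The simplest fix is still to make sure only one application writes to cleanPath at a time; a staging-and-move step like this is only worth considering when concurrent writers are unavoidable.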
This may be helpful:
Multiple spark jobs appending parquet data to same base path with partitioning
We have files containing millions of lines. I need to read each line from the file and send it to Kinesis. I am trying the following code:
KinesisAsyncClient kinesisClient = KinesisClientUtil.createKinesisAsyncClient(
        KinesisAsyncClient.builder().region(region));

rdd.map(line -> {
    PutRecordRequest request = PutRecordRequest.builder()
            .streamName("MyTestStream")
            .data(SdkBytes.fromByteArray(line.getBytes()))
            .build();
    try {
        kinesisClient.putRecord(request).get();
    } catch (InterruptedException e) {
        LOG.info("Interrupted, assuming shutdown.");
    } catch (ExecutionException e) {
        LOG.error("Exception while sending data to Kinesis. Will try again next cycle.", e);
    }
    return null;
});
I am getting this error message:
object not serializable (class: software.amazon.awssdk.services.kinesis.DefaultKinesisAsyncClient
It seems KinesisAsyncClient is not the right object to use. Which other object can be used? NOTE: We don't want to use Spark (Structured) Streaming for this use case because the files arrive only a few times a day; it doesn't make sense to keep a streaming app running.
Is there a better way to send messages to Kinesis via Spark? NOTE: We want to use Spark so that messages can be sent in a distributed fashion; sending each message sequentially takes too long.
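One common pattern (a hedged sketch, not a definitive answer: it assumes rdd is a JavaRDD<String> and reuses the region and stream name from the code above; the partition-key choice is arbitrary) is to build the client inside foreachPartition, so it is created on each executor instead of being serialized from the driver:

import java.util.concurrent.ExecutionException;
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.kinesis.KinesisAsyncClient;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;

// Capture the region as a String so the closure stays serializable
String regionName = region.id();

rdd.foreachPartition(lines -> {
    // One client per partition, built on the executor, never shipped from the driver
    KinesisAsyncClient client = KinesisAsyncClient.builder()
            .region(Region.of(regionName))
            .build();
    try {
        while (lines.hasNext()) {
            String line = lines.next();
            PutRecordRequest request = PutRecordRequest.builder()
                    .streamName("MyTestStream")
                    .partitionKey(String.valueOf(line.hashCode())) // putRecord needs a partition key
                    .data(SdkBytes.fromUtf8String(line))
                    .build();
            try {
                client.putRecord(request).get();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            } catch (ExecutionException e) {
                // log and decide whether to retry; kept minimal for the sketch
            }
        }
    } finally {
        client.close();
    }
});

For throughput, you could also collect the returned futures and wait on them in batches, or switch to putRecords; the per-record get() just keeps the sketch short.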
dataFrame.coalesce(1).write().save("path") sometimes writes only _SUCCESS and ._SUCCESS.crc files without the expected *.csv.gz, even for a non-empty input DataFrame.
file save code:
private static void writeCsvToDirectory(Dataset<Row> dataFrame, Path directory) {
dataFrame.coalesce(1)
.write()
.format("csv")
.option("header", "true")
.option("delimiter", "\t")
.option("codec", "org.apache.hadoop.io.compress.GzipCodec")
.mode(SaveMode.Overwrite)
.save("file:///" + directory);
}
file get code:
static Path getTemporaryCsvFile(Path directory) throws IOException {
String glob = "*.csv.gz";
try (DirectoryStream<Path> stream = Files.newDirectoryStream(directory, glob)) {
return stream.iterator().next();
} catch (NoSuchElementException e) {
throw new RuntimeException(getNoSuchElementExceptionMessage(directory, glob), e);
}
}
file get error example:
java.lang.RuntimeException: directory /tmp/temp5889805853850415940 does not contain a file with glob *.csv.gz. Directory listing:
/tmp/temp5889805853850415940/_SUCCESS,
/tmp/temp5889805853850415940/._SUCCESS.crc
I rely on this expectation; can someone explain why it works this way?
The output file should (by logic, must) contain at least the header line and some data lines, but it does not exist at all.
This comment was a bit misleading. According to the code on GitHub, this happens only if the DataFrame is empty, and an empty write won't even produce SUCCESS files. Since those files are present, the DataFrame is not empty and writeCsvToDirectory from your code is triggered.
I have a couple of questions:
Does your Spark job finish without errors?
Does the timestamp of the SUCCESS file get updated?
My two main suspects are:
coalesce(1) - if you have a lot of data, this might fail
SaveMode.Overwrite - I have a feeling that those SUCCESS files are in that folder from previous runs
It depends on the storage you choose to write your CSV file to.
If you write to HDFS, everything is OK. But whenever you decide to write to the local file system, be aware that nothing will be written on the driver's local file system; your data will end up on the workers' file systems, and you should look for it in the workers' storage.
Two solutions:
Run Spark in Local Mode
Set the master to local[NUMBER_OF_CORES]; you can submit your job with the --master local[10] config.
Write to a Distributed File System
Write your data to a distributed file system like S3, HDFS, etc.
My own solution solved this problem.
I replace .save("file://" with hadoopFileSystem.copyToLocalFile
The thing is .save("file:// works expectedly only with SparkSession.builder().master("local"), where hdfs:// is emulated by master's file://.
I may be wrong in theory, but it works.
static Path writeCsvToTemporaryDirectory(Dataset<Row> dataFrame) throws IOException {
String temporaryDirectoryName = getTemporaryDirectoryName();
writeCsvToDirectory(dataFrame, temporaryDirectoryName);
return Paths.get(temporaryDirectoryName);
}
static void writeCsvToDirectory(Dataset<Row> dataFrame, String directory) throws IOException {
dataFrame.coalesce(1)
.write()
.option("header", "true")
.option("delimiter", "\t")
.option("codec", "org.apache.hadoop.io.compress.GzipCodec")
.mode(SaveMode.Overwrite)
.csv(directory);
FileSystem hadoopFileSystem = FileSystem.get(sparkContext.hadoopConfiguration());
hadoopFileSystem.copyToLocalFile(true,
new org.apache.hadoop.fs.Path(directory),
new org.apache.hadoop.fs.Path(directory));
}
static Path getTemporaryCsvFile(Path directory) throws IOException {
String glob = "*.csv.gz";
try (DirectoryStream<Path> stream = Files.newDirectoryStream(directory, glob)) {
return stream.iterator().next();
} catch (NoSuchElementException e) {
throw new RuntimeException(getNoSuchElementExceptionMessage(directory, glob), e);
}
}
Path temporaryDirectory = writeCsvToTemporaryDirectory(dataFrame);
Path temporaryFile = DataFrameIOUtils.getTemporaryCsvFile(temporaryDirectory);
try {
return otherStorage.upload(temporaryFile, name, fields).join();
} catch (InterruptedException | ExecutionException e) {
throw new RuntimeException(e);
} finally {
removeTemporaryDirectory(temporaryDirectory);
}
I have a use case where I have to read Parquet files and publish the records to a Kafka topic.
I read the Parquet files using:
spark.read.schema(//path to parquet file )
Then I sort this DataFrame based on the timestamp (a use-case-specific requirement to preserve order).
Finally, I do the following:
binaryFiles below is the DataFrame containing the sorted records from the Parquet file:
binaryFiles.coalesce(1).foreachPartition(partition => {
val producer = new KafkaProducer[String,Array[Byte]](properties)
partition.foreach(file => {
try {
var producerRecord = new ProducerRecord[String,Array[Byte]](targetTopic,file.getAs[Integer](2),file.getAs[String](0),file.getAs[Array[Byte]](1))
var metadata = producer.send(producerRecord, new Callback {
override def onCompletion(recordMetadata:RecordMetadata , e:Exception):Unit = {
if (e != null){
println ("Error while producing" + e);
producer.close()
}
}
});
producer.flush()
}
catch{
case unknown :Throwable => println("Exception obtained with record. Key : " + unknown)
}
})
producer.close()
println("Closing the producer for this partition")
})
While writing the failover strategy for this scenario, one case I am trying to cater to is: what if the node that runs the Kafka producer goes down?
When the Kafka producer is restarted, it will read the Parquet file from the beginning again and publish all the records to the same topic once more.
How can we overcome this and implement some sort of checkpointing like Spark Streaming provides?
PS: I cannot use Spark Structured Streaming, as it does not preserve the order of the messages.
My requirement is to process the hourly data of a stock market.
i.e., get the data from the source once per streaming interval and process it via a DStream.
I have implemented a custom receiver to scrape/monitor the website by implementing the onStart() and onStop() methods, and it is working.
Challenges encountered:
The receiver thread fetches the data continuously, i.e., multiple times per interval.
I am unable to coordinate the receiver and the DStream execution interval.
Options I tried:
Making the receiver thread sleep for a few seconds (equal to the streaming interval).
In this case, the data is not the latest data at processing time.
class CustomReceiver(interval: Int)
extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
def onStart() {
new Thread("Website Scrapper") {
override def run() { receive() }
}.start()
}
def onStop() {
}
/** Create a socket connection and receive data until receiver is stopped */
private def receive() {
println("Entering receive:" + new Date());
try {
while (!isStopped) {
val scriptsLTP = StockMarket.getLiveStockData()
for ((script, ltp) <- scriptsLTP) {
store(script + "," + ltp)
}
println("sent data")
System.out.println("going to sleep:" + new Date());
Thread.sleep(3600 * 1000);
System.out.println("awaken from sleep:" + new Date());
}
println("Stopped receiving")
restart("Trying to connect again")
} catch {
case t: Throwable =>
restart("Error receiving data", t)
}
println("Exiting receive:" + new Date());
}
}
How can I keep the Spark Streaming receiver in sync with the DStream processing?
This use case doesn't seem like a good fit for Spark Streaming. The interval is long enough to treat this as a regular batch job instead. That way, we can make better use of the cluster resources.
I would rewrite it as a Spark job by parallelizing the target tickers, using mapPartitions to turn the executors into distributed web scrapers, and then processing as intended.
Then schedule the Spark job to run each hour, at the exact times wanted, with cron or more advanced alternatives such as Chronos.
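A rough sketch of that approach in Java (the fetchQuote() helper, the ticker list, and the output path are assumptions for illustration, not part of the original setup):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class HourlyQuotesJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("hourly-quotes").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Placeholder ticker list; in practice load it from a file or a table
        List<String> tickers = Arrays.asList("TICKER1", "TICKER2", "TICKER3");

        JavaRDD<String> quotes = jsc.parallelize(tickers, 4)
            .mapPartitions(part -> {
                List<String> out = new ArrayList<>();
                while (part.hasNext()) {
                    String ticker = part.next();
                    // fetchQuote() is a hypothetical scraping helper, playing the role
                    // of StockMarket.getLiveStockData() from the question
                    out.add(ticker + "," + fetchQuote(ticker));
                }
                return out.iterator();
            });

        // Each hourly run writes its own directory; process further as intended
        quotes.saveAsTextFile("hdfs:///stock_quotes/" + System.currentTimeMillis());
        spark.stop();
    }

    // Hypothetical: scrape or call the site for one ticker's last traded price
    static String fetchQuote(String ticker) {
        return "0.0"; // TODO: real scraping logic goes here
    }
}

The job is then launched once per hour by cron (or Chronos) via spark-submit, instead of keeping a receiver alive between intervals.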
I have one Apache access log file which has some data and is continuously growing. I want to analyze that data using the Apache Spark Streaming API.
Spark is new to me, and I created a program in which I use the jssc.textFileStream(directory) function to get the log data, but it does not work as per my requirement.
Please suggest some approaches to analyze that log file using Spark.
Here is my code:
SparkConf conf = new SparkConf()
.setMaster("spark://192.168.1.9:7077")
.setAppName("log streaming")
.setSparkHome("/usr/local/spark")
.setJars(new String[] { "target/sparkstreamingdemo-0.0.1.jar" });
StreamingContext ssc = new StreamingContext(conf, new Duration(5000));
DStream<String> filerdd = ssc.textFileStream("/home/user/logs");
filerdd.print();
ssc.start();
ssc.awaitTermination();
This code does not return any data from existing files. It only works when I create a new file; but when I update that new file, the program again does not return the updated data.
If the file is modified in real time, you can use Tailer from Apache Commons IO.
Here is the simplest sample:
import java.io.File;
import java.util.concurrent.Executor;
import org.apache.commons.io.input.Tailer;
import org.apache.commons.io.input.TailerListener;
import org.apache.commons.io.input.TailerListenerAdapter;

public void readLogs(File f, long delay) {
TailerListener listener = new MyTailerListener();
Tailer tailer = new Tailer(f, listener, delay);
// stupid executor impl. for demo purposes
Executor executor = new Executor() {
public void execute(Runnable command) {
command.run();
}
};
executor.execute(tailer);
}
public class MyTailerListener extends TailerListenerAdapter {
public void handle(String line) {
System.out.println(line);
}
}
The code above can be used as a log reader for Apache Flume and applied as a source. Then you need to configure a Flume sink to redirect the collected logs to a Spark stream and use Spark to analyze the data from the Flume stream (http://spark.apache.org/docs/latest/streaming-flume-integration.html).
More details about the Flume setup are in this post: real time log processing using apache spark streaming
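On the Spark side, a minimal sketch of consuming that Flume stream (assuming the push-based FlumeUtils.createStream approach from the linked guide; the host, port, and 5-second batch interval are placeholders):

import java.nio.charset.StandardCharsets;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.flume.FlumeUtils;
import org.apache.spark.streaming.flume.SparkFlumeEvent;

public class FlumeLogStream {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("log streaming");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(5000));

        // Flume's Avro sink pushes events to this host/port (must match flume.conf)
        JavaReceiverInputDStream<SparkFlumeEvent> flumeStream =
                FlumeUtils.createStream(jssc, "192.168.1.9", 41414);

        // Decode each event body into a log line and print a sample per batch
        JavaDStream<String> lines = flumeStream.map(event ->
                StandardCharsets.UTF_8.decode(event.event().getBody()).toString());
        lines.print();

        jssc.start();
        jssc.awaitTermination();
    }
}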