Logstash: maintaining the order in which data is read

I have a single Logstash instance reading data from multiple files. I want to maintain the order in which data is updated in Elasticsearch, since the _id field is the key.
So if there are two records in the input files with the same key, they have to be applied in order.
How do I enforce ordering from a source in Logstash?
Input file 1:
Key = A1 , Data = abc , time=5:51 PM
Key = B1 , Data = efg , time=5:52 PM
Key = C1 , Data = hij , time=5:53 PM
Input file 2:
Key = A1 , Data = klm, time=5:50 PM
These files will be read by two threads in Logstash, and there may be two filter threads formatting the data.
The output goes to Elasticsearch with the key used as the _id:
output {
  elasticsearch {
    embedded => true
    index => "samples6"
    index_type => "sample"
    document_id => "%{Key}"
  }
}
How do I ensure that Key=A1 ends up with Data=abc and not "klm"?

If the data that needs to be processed in order is read from different files, there's no way of doing this, since Logstash doesn't maintain an ordered queue of events. If you have more than one filter worker (i.e. you start Logstash with -w/--filterworkers greater than one), there's no order guarantee even if you read from a single file.
You'll have to write something yourself to get the ordering right. Perhaps Apache Samza could be of use.
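For reference, the single-filter-worker case mentioned above would be started along these lines (order.conf is just a placeholder for your pipeline config):
bin/logstash -f order.conf -w 1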

Related

How to validate a large CSV file either column-wise or row-wise in a Spark DataFrame

I have a large data file (10GB or more, 150 columns) in which we need to validate each field (datatype/format/null/domain value/primary key ...) against different rules, and finally create two output files: one containing the successful records and one containing the error records with error details. A row should be moved to the error file as soon as any column fails validation; there is no need to validate the remaining columns.
I am reading the file into a Spark DataFrame. Should we validate it column-wise or row-wise, and which way gives the best performance?
To answer your question:
I am reading the file into a Spark DataFrame. Should we validate it column-wise or row-wise, and which way gives the best performance?
A DataFrame is a distributed collection of data organized as a set of rows spread across the cluster, and most of the transformations defined in Spark are applied to those rows, i.e. they work on Row objects, so row-wise validation is the natural fit.
Pseudo code:
import spark.implicits._

// infer the schema once, then read the file with it
val schema = spark.read.csv(inputFile).schema

val validated = spark.read.schema(schema).csv(inputFile).map { row =>
  // stop at the first failing column for this row
  val error: Option[String] = schema.fields.collectFirst {
    // f.dataType gives the field type, f.name the column name,
    // row.getAs(f.name) the field value; put your custom checks here
    case f if row.getAs[Any](f.name) == null => s"${f.name}: failed validation"
  }
  (row.mkString(","), error.getOrElse(""), error.isEmpty)
}

// valid rows to one location, rows with error details to another
validated.filter(_._3).write.save(correctLoc)
validated.filter(!_._3).write.save(errorLoc)

Apache-Spark Combining Data from Two Files

I want to combine only specific data from two files and output a file with columns for those fields and a count of how many times they appeared across both data files for each ssn. Should I be doing something like
val data = spark.read.json("file")
val ssn = data.filter(line => line.contains("ssn")).count()
val tickets = data.filter(line => line.contains("tickets")).count()
Is there a way to join the data based on the ssn and then keep incrementing the ticket count whenever I see the same ssn, in the output file?
Also, the ssn is a JSON key/value pair in both files, but each line is its own little JSON message, and when I tried to read the files I got a "_corrupt_record". Should I just be reading them as plain text?
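If each line really is its own single-line JSON object, spark.read.json should handle it (a persistent "_corrupt_record" usually means multi-line or malformed JSON, in which case reading as text and parsing is the fallback). A rough sketch of the join-and-count idea, with file paths and field names as placeholders:
import org.apache.spark.sql.functions._

val df1 = spark.read.json("file1.json")   // placeholder paths
val df2 = spark.read.json("file2.json")

// keep only the fields of interest and stack the two files
val combined = df1.select("ssn", "tickets").union(df2.select("ssn", "tickets"))

// one output row per ssn with the number of ticket entries seen across both files
val perSsn = combined.groupBy("ssn").agg(count("tickets").as("ticket_count"))
perSsn.write.json("combined_output")      // placeholder output path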

Microbatch lookup in an external database from Spark

I have a requirement where I need to process log lines using Spark. One of the steps in processing is to look up a certain value in an external DB.
For example:
My log line contains multiple key-value pairs. One of the keys present in the log is "key1". This key needs to be used for the lookup call.
I don't want to make multiple sequential lookup calls to the external DB for each value of "key1" in the RDD. Rather, I want to build a list of all values of "key1" present in the RDD and then make a single lookup call to the external DB.
My code to extract the key from each log line looks as follows:
lines.foreachRDD { rdd =>
  rdd.map(line => extractKey(line))
  // next step is lookup
  // then further processing
}
The .map function is called for each log line, so I am not sure how I can create a list of keys that can be used for the external lookup.
Thanks
Use collect:
lines.foreachRDD { rdd =>
  val keys = rdd.map(line => extractKey(line)).collect()
  // here you can use the keys array
}
Probably you'll have to use foreachPartition (or mapPartitions) instead:
lines.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    val keys = iter.map(line => extractKey(line)).toArray
    // here you can use the keys array
  }
}
There will be one call per partition; this method also avoids serialization problems.
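To make the "single lookup call per partition" idea concrete, here is a minimal sketch; externalDb.lookupBatch is a hypothetical stand-in for whatever batched API your external database client provides:
lines.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    val partitionLines = iter.toList                      // materialize so the lines can be reused
    val keys = partitionLines.map(extractKey).distinct
    // one round trip to the external DB for all keys in this partition
    val lookedUp: Map[String, String] = externalDb.lookupBatch(keys)   // hypothetical client call
    partitionLines.foreach { line =>
      val value = lookedUp.get(extractKey(line))
      // ... further processing with line and its looked-up value ...
    }
  }
}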
It looks like you want this:
lines.groupByKey().filter()
Could you provide more info?

Spark Streaming Design Question

I am new to Spark. I want to set up Spark Streaming to retrieve key-value pairs from files of the format below:
file: info1
Note: Each info file will have around 1000 of these records, and our system is continuously generating these info files. Through Spark Streaming I want to map line numbers to info files and get an aggregated result.
Can we give these kinds of files as input to a Spark cluster? I am interested in the "SF" and "DA" delimiters only; "SF" corresponds to the source file and "DA" corresponds to the (line number, count).
As this input data is not in a one-record-per-line format, is it a good idea to feed these files to Spark directly, or should I add an intermediate stage where I clean these files and generate new files that have each record's information on a single line instead of in blocks?
Or can we achieve this in Spark itself? What would be the right approach?
What I want to achieve:
I want line-level information, that is, the line (as the key) and the info files (as the values).
The final output I want looks like below:
line178 -> (info1, info2, info7.................)
line 2908 -> (info3, info90, ..., ... ,)
Do let me know if my explanation is not clear or if I am missing something.
Thanks & Regards,
Vinti
You could do something like this, given your DStream stream:
// this gives you the DA & FP lines, with the line number as the key
val validLines = stream.map(_.split(":"))
  .filter(parts => Seq("DA", "FP").contains(parts(0)))
  .map(parts => parts(1).split(","))
  .map(parts => (parts(0), parts(1)))

// now you should accumulate values per key
val state = validLines.updateStateByKey[Seq[String]](updateFunction _)

def updateFunction(newValues: Seq[String], runningValues: Option[Seq[String]]): Option[Seq[String]] = {
  // append the new values to whatever has already been accumulated for this key
  val newVals = runningValues match {
    case Some(list) => list ++ newValues
    case _          => newValues
  }
  Some(newVals)
}
This should accumulate, for each key, a sequence of the associated values, storing it in state.
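One practical note: stateful operations such as updateStateByKey also require checkpointing to be enabled on the streaming context, along these lines (assuming your StreamingContext is called ssc; the directory is just an example):
ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")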

Cassandra DataStax driver: how to page through columns

I have wide rows with timestamp columns. If I use the DataStax Java driver, I can page through row results by using LIMIT or FETCH_SIZE; however, I could not find any specifics on how to page through the columns of a specific row.
I found this post: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/CQL-3-and-wide-rows-td7594577.html
which explains how I could get ranges of columns based on the column name (timestamp) values.
However, what I need to do is get ALL columns. I just don't want to load them all into memory, but rather "stream" the results and process a chunk of columns (preferably of a controllable size) at a time until all columns of the row are processed.
Does the DataStax driver support streaming of this kind? And if so, what is the syntax for using it?
Additional clarification:
Essentially, what I'm looking for is an equivalent of Hector's ColumnSliceIterator, with which I could iterate over all columns (up to Integer.MAX_VALUE of them) of a specific row in batches of, say, 100 columns at a time, as follows:
SliceQuery sliceQuery = HFactory.createSliceQuery(keySpace, ...);
sliceQuery.setColumnFamily(MY_COLUMN_FAMILY);
sliceQuery.setKey(myRowKey);

// columns to be returned. The null value indicates all columns
sliceQuery.setRange(
    null              // start column
  , null              // end column
  , false             // reversed order
  , Integer.MAX_VALUE // number of columns to return
);

ColumnSliceIterator iter = new ColumnSliceIterator(
    sliceQuery // previously created slice query needs to be passed as parameter
  , null       // starting column name
  , null       // ending column name
  , false      // reverse
  , 100        // column count <-- the batch size
);

while (iter.hasNext()) {
    String myColumnValue = iter.next().getValue();
}
How do I do the exact same thing using the DataStax driver?
thanks!
Marina
The ResultSet object that you get is actually set up to do this sort of paginating for you by default. Calling one() repeatedly or iterating using iterator() will let you access all the data without pulling it all into memory at once. More details are available in the API docs.
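A minimal sketch of that pattern in Scala against the DataStax Java driver (the keyspace, table, and column names here are placeholders; myRowKey is the row key from the question). Setting the fetch size bounds how many rows, i.e. columns of the wide row under CQL3, are pulled per page, and the driver fetches the next page transparently as you iterate:
import com.datastax.driver.core.{Cluster, SimpleStatement}
import scala.collection.JavaConverters._

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("my_keyspace")

// under CQL3 the "columns" of a wide row come back as rows of the partition
val stmt = new SimpleStatement("SELECT ts, value FROM my_table WHERE key = ?", myRowKey)
stmt.setFetchSize(100)   // roughly the "100 columns at a time" batch size

val rs = session.execute(stmt)
// iterating pulls further pages from Cassandra on demand,
// so the whole row is never held in memory at once
rs.iterator().asScala.foreach { row =>
  val value = row.getString("value")
  // process one chunk element at a time here
}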
