I am new to Elasticsearch. I have a huge amount of data to index using Elasticsearch.
I am using Apache Spark to index the data from a Hive table into Elasticsearch.
As part of this functionality, I wrote a simple Spark script:
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // provides saveJsonToEs on RDDs

object PushToES {
  def main(args: Array[String]) {
    val Array(inputQuery, index, host) = args
    val sparkConf = new SparkConf().setMaster("local[1]").setAppName("PushToES")
    sparkConf.set("....", host)   // Elasticsearch host setting (property name redacted)
    sparkConf.set("....", "9200") // Elasticsearch port setting (property name redacted)
    val sc = new SparkContext(sparkConf)
    val hiveSqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
    val ps = hiveSqlContext.sql(inputQuery)
    ps.toJSON.saveJsonToEs(index)
  }
}
After that I generate a jar and submit the job using spark-submit:
spark-submit --jars ~/*.jar --master local[*] --class com.PushToES *.jar "select * from gtest where day=20170711" gest3 localhost
Then I execute the below command to check the count:
curl -XGET 'localhost:9200/test/test_test/_count?pretty'
The first time, it shows the count correctly:
{
"count" : 10,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
If I execute the same curl command a second time, it gives a result like below:
{
"count" : 20,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
If I execute the same command a third time, I get:
{
"count" : 30,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
But I do not understand why it adds the count to the existing index count every time.
Please let me know how I can resolve this issue, i.e. however many times I execute the job, I should get the same (correct) count value of 10.
I am expecting the result below for this case, because the correct count value is 10 (a count(*) query on the Hive table returns 10 every time):
{
"count" : 10,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
Thanks in advance.
If you want to "replace" the data each time you run the job, rather than "append" to it, then you have to configure that scenario in your Spark Elasticsearch properties.
The first thing you need to do is to have an ID in your document, and tell Elasticsearch which field is your id "column" (if you come from a DataFrame) or key (in JSON terms).
This is documented here: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
For cases where the id (or other metadata fields like ttl or timestamp) of the document needs to be specified, one can do so by setting the appropriate mapping namely es.mapping.id. Following the previous example, to indicate to Elasticsearch to use the field id as the document id, update the RDD configuration (it is also possible to set the property on the SparkConf though due to its global effect it is discouraged):
EsSpark.saveToEs(rdd, "spark/docs", Map("es.mapping.id" -> "id"))
A second configuration key is available to control what kind of write operation Elasticsearch performs when writing the data, but the default is correct for your use case:
es.write.operation (default index)
The write operation elasticsearch-hadoop should perform - can be any of:
index (default)
new data is added while existing data (based on its id) is replaced (reindexed).
create
adds new data - if the data already exists (based on its id), an exception is thrown.
update
updates existing data (based on its id). If no data is found, an exception is thrown.
upsert
known as merge or insert if the data does not exist, updates if the data exists (based on its id).
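Applied to your script, a minimal sketch could look like this (assuming the Hive table has a unique column, here hypothetically called id, to use as the document id):
// Write the DataFrame itself so that es.mapping.id can pick up the id column;
// "id" is a placeholder for whichever column uniquely identifies a row.
import org.elasticsearch.spark.sql._

val ps = hiveSqlContext.sql(inputQuery)
ps.saveToEs(index, Map("es.mapping.id" -> "id"))
With a stable document id, re-running the job re-indexes the same 10 documents instead of appending 10 new ones each time.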
I have a large data set coming from NiFi, on which I do an ETL transformation with PySpark.
Unfortunately, one column in the middle got split by a newline, producing an extra column and leaving the remaining fields of the same row NULL, so I need to fix this either with a Linux command in the NiFi flow or in the PySpark code while doing the ETL transformation.
Ex: source.csv
1,hi,21.0,final,splitexthere,done,v1,v2,done
2,hi,21.0,final,splitext
here,done,v1,v2,done
3,hi,21.0,final,splitexthere,done,v1,v2,done
4,hi,21.0,final,splitexthere,done,v1,v2,failed
expected.csv
1,hi,21.0,final,splitexthere,done,v1,v2,done
2,hi,21.0,final,splitexthere,done,v1,v2,done
3,hi,21.0,final,splitexthere,done,v1,v2,done
4,hi,21.0,final,splitexthere,done,v1,v2,failed
Here are some constraints:
we don't know which column will be split, like splitexthere above
the id column will always be a number
and one file can contain multiple such newline splits
As @daggett highlighted, data must conform to the CSV format specification to be valid across heterogeneous systems.
Add a ValidateRecord or ConvertRecord processor to your NiFi flow to validate CSV against a CSV schema. This separates valid records from invalid records in the source data, so you basically get two forks out of a flowfile, and you can then have separate logic to handle/clean the invalid data. The same can be done in Spark as well (a sketch is given after the schema below), but in NiFi it is pretty straightforward!
Note: While configuring the CSVReader schema, make sure that all the fields are NOT NULL.
e.g. a sample schema for two fields (you have nine fields):
{
"type" : "record",
"namespace" : "com.example.etl",
"name" : "validate_csv_data",
"fields" : [
{ "name" : "col_1", "type" : "string" },
{ "name" : "col_2", "type" : "string" }
]
}
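If you go the Spark route instead, here is a rough PySpark sketch (not the NiFi approach above); it assumes every valid record starts with a numeric id followed by a comma, as stated in the question, and that each input file fits into executor memory:
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fix-split-lines").getOrCreate()
sc = spark.sparkContext

def glue_broken_lines(content):
    # Drop any newline that is NOT followed by "<digits>," (the start of a new record),
    # re-joining rows that were split in the middle of a column.
    return re.sub(r"\r?\n(?!\d+,)", "", content)

# wholeTextFiles yields (path, full file content) pairs, so the regex can see across lines
fixed_lines = (
    sc.wholeTextFiles("source.csv")
      .flatMap(lambda kv: glue_broken_lines(kv[1]).splitlines())
)

# spark.read.csv also accepts an RDD of strings, so the repaired lines can be parsed directly
df = spark.read.csv(fixed_lines)
df.show(truncate=False)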
I have been banging my head for a while against the Superset -> Presto (PrestoSQL) -> Prometheus combination (as Superset does not yet support Prometheus) and got stymied by an issue when trying to extract columns from Presto's map-type column containing Prometheus labels.
In order to get the necessary labels mapped as columns from Superset's point of view, I created an extra table (or rather a view in this case) in Superset on top of the existing table, with the following SQL for creating the necessary columns:
SELECT labels['system_name'] AS "system",labels['instance'] AS "instance","timestamp" AS "timestamp","value" AS "value" FROM "up"
This table is then used as a data source in Superset's chart which treats it as a subquery. The resulting SQL query created by Superset and then sent to Presto looks e.g. like this:
SELECT "system" AS "system",
"instance" AS "instance",
"timestamp" AS "timestamp",
"value" AS "value"
FROM
(SELECT labels['system_name'] AS "system",
labels['instance'] AS "instance",
"timestamp" AS "timestamp",
"value" AS "value"
FROM "up") AS "expr_qry"
WHERE "timestamp" >= from_iso8601_timestamp('2020-10-19T12:00:00.000000')
AND "timestamp" < from_iso8601_timestamp('2020-10-19T13:00:00.000000')
ORDER BY "timestamp" ASC
LIMIT 250;
However, what I get out of the above is an error:
io.prestosql.spi.PrestoException: Key not present in map: system_name
at io.prestosql.operator.scalar.MapSubscriptOperator$MissingKeyExceptionFactory.create(MapSubscriptOperator.java:173)
at io.prestosql.operator.scalar.MapSubscriptOperator.subscript(MapSubscriptOperator.java:143)
at io.prestosql.$gen.CursorProcessor_20201019_165636_32.filter(Unknown Source)
After reading a bit about queries in Presto's user guide, I tried a modified query from the command line using WITH:
WITH x AS (SELECT labels['system_name'] AS "system",labels['instance'] AS "instance","timestamp" AS "timestamp","value" AS "value" FROM "up")
SELECT system, timestamp, value FROM x
WHERE "timestamp" >= from_iso8601_timestamp('2020-10-19T12:00:00.000000')
AND "timestamp" < from_iso8601_timestamp('2020-10-19T13:00:00.000000')
LIMIT 250;
And that went through without any issues. But it seems that I have no way to define how Superset executes its queries, so I'm stuck with the first option. The question is: is there anything wrong with it that could be fixed?
I guess that one option (if everything else fails) would be to define extra tables on the Presto side which would do the same trick for mapping the columns, hopefully avoiding the above issue.
The map subscript operator in Presto requires that the key be present in the map. Otherwise, you get the failure you described.
If some keys can be missing, you can use the element_at function instead, which will return a NULL result:
Returns value for given key, or NULL if the key is not contained in the map.
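For example, the inner query from the question could be rewritten roughly like this (same column aliases as the original):
SELECT element_at(labels, 'system_name') AS "system",
       element_at(labels, 'instance') AS "instance",
       "timestamp" AS "timestamp",
       "value" AS "value"
FROM "up"
Rows whose label map has no system_name key then yield NULL in that column instead of failing the whole query.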
I am using spark-sql 2.4.1, spark-cassandra-connector_2.11-2.4.1.jar and Java 8.
I have a Cassandra table like this:
CREATE TABLE company(company_id int, start_date date, company_name text, PRIMARY KEY (company_id, start_date))
WITH CLUSTERING ORDER BY (start_date DESC);
The field start_date here is a derived field, which is calculated in the business logic.
I have spark-sql streaming code in which I call the below map function:
public static MapFunction<Company, CompanyTransformed> mapFunInsertCompany = (record) -> {
    CompanyTransformed rec = new CompanyTransformed();
    rec.setCompany_id(record.getCompanyId());
    rec.setCompany_name(record.getCompanyName());
    if (record.getChangeFlag().equalsIgnoreCase("I") && record.getCreateDate() != null)
        rec.setStart_date(record.getCreateDate());
    if (record.getChangeFlag().equalsIgnoreCase("U"))
        rec.setStart_date(new Date(CommonUtils.today().getTime() + 86400000));
    return rec;
};
When I start my consumer and there are no records in the Kafka topic, the streaming flow continuously calls the above map function.
Because record.getCreateDate() is null, start_date is set to null.
But start_date is part of the primary key in my C* table, hence the insertion fails and Spark waits indefinitely; it can NOT recover and save the data into the C* table.
So:
1. What should be done to fix it? Any clue please?
Part 2:
How to recover from the failure?
latestRecords
    .writeStream()
    .foreachBatch((batchDf, batchId) -> {
        batchDf
            .write()
            .format("org.apache.spark.sql.cassandra")
            .option("table", "company")
            .option("keyspace", "ks_1")
            .mode(SaveMode.Append)
            .save();
    })
    .start()
    .awaitTermination();
I am using the above Java API, and I don't find an equivalent method to check whether an RDD "isEmpty" in Java.
Any clue how to handle this in Java?
Part 3:
I tried this:
.foreachBatch((batchDf, batchId) -> {
    System.out.println("latestRecords batchDf.isEmpty : " +
        batchDf.isEmpty() + "\t length : " + batchDf.rdd().getPartitions().length);
})
It gives the output:
latestRecords batchDf.isEmpty : false length : 6
So how do I check isEmpty, given that isEmpty is false?
Part 4:
When I start the consumer, no data is available in the topic.
Even though the dataset shows no data, the count shows 3 as in the output below. How is that possible?
If I try this:
.foreachBatch((batchDf, batchId) -> {
    System.out.println("latestRecords batchDf.rdd().count : " + batchDf.rdd().count() +
        "\t batchDf.count :" + batchDf.count());
})
The output is:
latestRecords batchDf.rdd().count : 3 batchDf.count :3
You are facing a common problem of Spark Streaming applications. When there is no data in the source (in your case a Kafka topic), Spark creates an empty RDD. You can validate whether an RDD is empty by adding
if(!rdd.isEmpty)
before calling your method mapFunInsertCompany.
Please also have a look at this blog post.
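In the Java API, the same guard can go directly inside foreachBatch on the micro-batch Dataset. A rough sketch based on the code from the question (Dataset.isEmpty() is available from Spark 2.4 on; batchDf.rdd().isEmpty() works as well):
latestRecords
    .writeStream()
    .foreachBatch((batchDf, batchId) -> {
        // Skip the Cassandra write entirely when the micro-batch carries no records
        if (!batchDf.isEmpty()) {
            batchDf
                .write()
                .format("org.apache.spark.sql.cassandra")
                .option("table", "company")
                .option("keyspace", "ks_1")
                .mode(SaveMode.Append)
                .save();
        }
    })
    .start()
    .awaitTermination();
This way, when the topic has no data, the write is skipped instead of attempting to insert rows with a null start_date.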
I'm a bit of a noob with MongoDB, so would appreciate some help with figuring out the best solution/format/structure in storing some data.
Basically, the data that will be stored will be updated every second with a name, value and timestamp for a certain meter reading.
For example, one possibility is water level and temperature in a tank. The tank will have a name and then the level and temperature will be read and stored every second. Overall, there will be 100's of items (i.e. tanks), each with millions of timestamped values.
From what I've learnt so far (and please correct me if I'm wrong), there are a few options as how to structure the data:
A slightly RDBMS-like approach:
This would consist of two collections, Items and Values:
Items : {
_id : "id",
name : "name"
}
Values : {
_id : "id",
item_id : "item_id",
name : "name", // temp or level etc
value : "value",
timestamp : "timestamp"
}
The more document-DB, denormalized method:
This method involves one collection of items, each with an array of timestamped values:
Items : {
_id : "id",
name : "name"
values : [{
name : "name", // temp or level etc
value : "value",
timestamp : "timestamp"
}]
}
A collection for each item
Save all the values in a collection named after that item.
ItemName : {
_id : "id",
name : "name", // temp or level etc
value : "value",
timestamp : "timestamp"
}
The majority of read queries will be to retrieve the timestamped values for a specified time period of an item (i.e. tank) and display in a graph. And for this, the first option makes more sense to me as I don't want to retrieve the millions of values when querying for a specific item.
Is it even possible to query for values between specific timestamps for option 2?
I will also need to query for a list of items, so maybe a combination of the first and third option with a collection for all the items and then a number of collections to store the values for each of those items?
Any feedback on this is greatly appreciated.
Don't store a separate timestamp field if you are not modifying the ObjectId, as the ObjectId itself has a timestamp embedded in it.
So you will save a lot of memory that way.
MongoDB Id Documentation
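As a rough illustration of leaning on the ObjectId (assuming the pymongo driver and hypothetical database/collection names), a time range can be queried through _id alone:
from datetime import datetime
from bson.objectid import ObjectId
from pymongo import MongoClient

client = MongoClient()
values = client["metrics"]["Values"]  # hypothetical database/collection names

# ObjectIds sort by their embedded creation time, so a range scan on _id
# returns the documents created between two instants.
start = ObjectId.from_datetime(datetime(2017, 7, 11))
end = ObjectId.from_datetime(datetime(2017, 7, 12))
docs = values.find({"_id": {"$gte": start, "$lt": end}})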
In case you don't require the previous data, you can use an update query in MongoDB to update the fields every second instead of storing new documents.
If you want to store the updated data each time, then instead of updating, store it in a flat structure:
{ "_id" : ObjectId("XXXXXX"),
"name" : "ItemName",
"value" : "ValueOfItem"
"created_at" : "timestamp"
}
Edit 1: Added timestamp as per the comments
My database currently consists of 3 document collections with between 250k and 1.5M documents each. I set my own document _keys and have added hash indexes on a few top-level fields and lists (the lists containing references to other keys or (indexed) fields).
The collections A and C have an n:m relationship via B. The query I first came up with looks like this:
for a in collection_a
filter a.name != null
filter length(a.bs) > 0
limit 1
return {
'akey': a._key
, 'name': a.name
, 'cs': (
for b in collection_b
filter b.a == a._key
for c in collection_c
filter b.c == c._key
return c.name
)
}
This is excruciatingly slow. I also tried other approaches, such as making the middle for a for b in a.bs (bs being a list of keys of collection_b documents).
Printing out explain() for the above query returns an immense cost, and getExtra() indicates no indexes were used:
{
"stats" : {
"writesExecuted" : 0,
"writesIgnored" : 0,
"scannedFull" : 6009930,
"scannedIndex" : 0
},
"warnings" : [ ]
}
An alternate approach works as fast as I'd expected it to be in the first place:
for a in collection_a
filter a.name != null
filter length(a.bs) > 0
limit 1
return {
'akey': a._key
, 'name': a.name
, 'cs': (
for b in a.bs
return DOCUMENT(collection_c , DOCUMENT(collection_b, b).c ).name
)
}
But even here, no indexes appear to be used:
{
"stats" : {
"writesExecuted" : 0,
"writesIgnored" : 0,
"scannedFull" : 3000,
"scannedIndex" : 0
},
"warnings" : [ ]
}
One thing that might already explain this is that hash indexes don't work for elements of a list (or I made a mistake when creating them)? The getExtras() output of the second example would hint at this.
My expectation, however, would be that ArangoDB indexes all elements of the lists (such as a.bs) and that the query optimizer realizes indexed attributes are used in the query.
If I run for b in collection_b filter b.a == 'somekey', I get an instantaneous result as expected. And that's just running the middle for in isolation. The same behaviour occurs when I run the innermost for in isolation.
Is this a bug? Is there an explanation for this behaviour? Am I doing something wrong in the first query? The AQL examples themselves use nested fors, so that's what I naturally ended up trying first.
This has been fixed in release 2.3.2.
Clarification: the query you posted is correct. There was an issue in release 2.3.0 that prevented indexes in subqueries from being used.
This issue has been fixed in release 2.3.2.
The initial query you posted should properly use indexes in 2.3.2. If there is a hash index available on the join attributes, it should be used because the query only contains equality lookups.