Unable to set _id on elasticsearch-hadoop - python-3.x

I am trying to write from an RDD to Elasticsearch (PySpark, Python 3.5).
I can write the body of the JSON correctly, but instead of taking my _id, Elasticsearch creates its own.
My code:
class Article:
    def __init__(self, title, text, text2):
        self.id_ = title
        self.text = text
        self.text2 = text2

if __name__ == '__main__':
    pt = _sc.parallelize([Article("rt", "ted", "ted2"), Article("rt2", "ted2", "ted22")])
    save = pt.map(lambda item:
        (item.id_,
         {
             'text': item.text,
             'text2': item.text2
         }))
    es_write_conf = {
        "es.nodes": "localhost",
        "es.port": "9200",
        "es.resource": 'db/table1'
    }
    save.saveAsNewAPIHadoopFile(
        path='-',
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=es_write_conf)
Program trace: (attached as an image, not reproduced here)

This can be set through the index mapping; you can find it in the official user guide. Sample code:
curl -XPOST localhost:9200/test -d '{
    "settings" : {
        "number_of_shards" : 1,
        "number_of_replicas" : 0
    },
    "mappings" : {
        "test1" : {
            "_id" : { "path" : "mainkey" },
            "_source" : { "enabled" : false },
            "properties" : {
                "mainkey" : { "type" : "string", "index" : "not_analyzed" }
            }
        }
    }
}'
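Note that the `_id` path mapping shown above was removed in more recent Elasticsearch versions. If you are on such a version, a common alternative with elasticsearch-hadoop is the `es.mapping.id` connector setting, which takes the document id from a field inside each document. A minimal sketch of the idea, assuming that setting is available on your connector version (the Spark write call itself stays as in the question):

```python
# Sketch (assumption): elasticsearch-hadoop reads the _id from a field of each
# document when "es.mapping.id" names that field.
es_write_conf = {
    "es.nodes": "localhost",
    "es.port": "9200",
    "es.resource": "db/table1",
    "es.mapping.id": "id_",  # use the "id_" field of each document as its _id
}

def to_doc(item):
    # Carry the id inside the value map; the tuple key is then unused.
    return (None, {"id_": item["title"], "text": item["text"], "text2": item["text2"]})

doc = to_doc({"title": "rt", "text": "ted", "text2": "ted2"})
```

With this shape, each written document arrives with its `id_` field promoted to `_id` instead of an auto-generated one.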

Related

json file in terraform

I have a JSON file with the following content and am trying to list the shapes, e.g. "t3-nano, t3-micro, t3-small, t2.medium, t3-2xlarge, r6g-medium".
json file = info.json
{
  "t3-nano" : {
    "service_name" : "t3",
    "existing" : 100
  },
  "t3-micro" : {
    "service_name" : "t3",
    "existing" : 1
  },
  "t3-small" : {
    "service_name" : "t3",
    "existing" : 2
  },
  "t2.medium" : {
    "service_name" : "t2",
    "existing" : 0
  },
  "t3-2xlarge" : {
    "service_name" : "t3-2",
    "existing" : 5
  },
  "r6g-medium" : {
    "service_name" : "r6g.medium",
    "existing" : 10
  }
}
I tried the following
locals {
  service_name = flatten([for i in local.info : i[*].service_name])
  shapes       = flatten([for i in local.info : i[*].index])
}
and it failed with:
Error: Unsupported attribute
This object does not have an attribute named "index".
I was expecting to get shapes = [t3-nano, t3-micro, t3-small, t2.medium, t3-2xlarge, r6g-medium]. Is there a way to just list the shapes?
The flatten function and for expression are both unnecessary here. The keys function already provides exactly the functionality and return value you want:
shapes = keys(local.info)
and that will assign the requested value.
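The same behaviour is easy to sanity-check outside Terraform: in Python, the shapes are simply the top-level keys of the parsed JSON (content copied from the question's info.json):

```python
import json

# The top-level keys of info.json are exactly the "shapes" being asked for.
info = json.loads("""
{
  "t3-nano":    {"service_name": "t3",         "existing": 100},
  "t3-micro":   {"service_name": "t3",         "existing": 1},
  "t3-small":   {"service_name": "t3",         "existing": 2},
  "t2.medium":  {"service_name": "t2",         "existing": 0},
  "t3-2xlarge": {"service_name": "t3-2",       "existing": 5},
  "r6g-medium": {"service_name": "r6g.medium", "existing": 10}
}
""")

shapes = list(info.keys())  # mirrors Terraform's keys(local.info)
```

One difference worth knowing: Terraform's keys() returns the keys in lexical order, while Python preserves the file's insertion order.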

How to get list of directories from Artifactory repo using groovy

I have the below groovy script in one of my Jenkins Active choice parameters:
import groovy.json.JsonSlurper
import jenkins.model.Jenkins

versions = getArtifactsVersions()

def getArtifactsVersions() {
    def responseJson = "curl -k -X GET https://{Artifatory_URL}/storage/my-repo/".execute().text
    def projectList = (new groovy.json.JsonSlurper().parseText(responseJson)).children
    return projectList
}
It is supposed to return the list of folders that exist in this path (without subdirectories); however, the result I get is:
{uri=/TEST_FOR_YONI, folder=true}
{uri=/TEST_FOR_PKMLODEL_V2, folder=true}
How can I change it to return:
TEST_FOR_YONI
TEST_FOR_PKMLODEL_V2
For debugging, I ran the below:
import groovy.json.JsonSlurper
import jenkins.model.Jenkins
def responseJson = "curl -k -X GET https://{Artifatory_URL}/storage/my-repo/".execute().text
print responseJson
and it returns the below:
{
  "repo" : "my-repo",
  "path" : "/",
  "created" : "2020-11-29T18:00:42.635+02:00",
  "lastModified" : "2020-11-29T18:00:42.635+02:00",
  "lastUpdated" : "2020-11-29T18:00:42.635+02:00",
  "children" : [ {
    "uri" : "/TEST_FOR_YONI",
    "folder" : true
  }, {
    "uri" : "/TEST_FOR_PKLMODEL_V2",
    "folder" : true
  } ],
  "uri" : "https://{Artifatory_URL}/storage/my-repo/"
}
def responseJson = '''
{
  "repo" : "my-repo",
  "path" : "/",
  "created" : "2020-11-29T18:00:42.635+02:00",
  "lastModified" : "2020-11-29T18:00:42.635+02:00",
  "lastUpdated" : "2020-11-29T18:00:42.635+02:00",
  "children" : [ {
    "uri" : "/TEST_FOR_YONI",
    "folder" : true
  }, {
    "uri" : "/TEST_FOR_PKLMODEL_V2",
    "folder" : true
  } ],
  "uri" : "https://{Artifatory_URL}/storage/my-repo/"
}
'''
def projectList = new groovy.json.JsonSlurper().parseText(responseJson).children.collect { it.uri.substring(1) }
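For comparison, the same extraction in Python follows the identical shape: parse the response, walk the children array, and strip the leading slash (response body copied from the debug output above):

```python
import json

response_json = """
{
  "repo": "my-repo",
  "children": [
    {"uri": "/TEST_FOR_YONI", "folder": true},
    {"uri": "/TEST_FOR_PKLMODEL_V2", "folder": true}
  ]
}
"""

# Keep only folder entries and drop the leading "/" from each uri.
projects = [c["uri"].lstrip("/")
            for c in json.loads(response_json)["children"]
            if c["folder"]]
```

Filtering on the folder flag also guards against any file entries that might appear alongside the directories.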

Elasticsearch Search/filter by occurrence or order in an array

I have a data field in my index. I want only doc 2 as a result, i.e. logically where 'b' comes before 'a' in the array field data.
doc 1:
data = ['a','b','t','k','p']
doc 2:
data = ['p','b','i','o','a']
Currently I am running a terms must query on [a, b] and then checking the order in a separate code snippet.
Please suggest a better way.
My understanding is that the only way to do this is with Span Queries; however, they are not applicable to an array of values.
You would need to concatenate the values into a single text field with whitespace as the delimiter, reingest the documents, and use a Span Near query on that field.
Please find the below mapping, sample document, the query and response:
Mapping:
PUT my_test_index
{
  "mappings": {
    "properties": {
      "data": {
        "type": "text"
      }
    }
  }
}
Sample Documents:
POST my_test_index/_doc/1
{
  "data": "a b"
}

POST my_test_index/_doc/2
{
  "data": "b a"
}
Span Query:
POST my_test_index/_search
{
  "query": {
    "span_near": {
      "clauses": [
        { "span_term": { "data": "a" } },
        { "span_term": { "data": "b" } }
      ],
      "slop": 0,        <--- only `a b` matches; `a c b` won't
      "in_order": true  <--- "a" must come before "b"
    }
  }
}
Note that slop controls the maximum number of intervening unmatched positions permitted.
Response:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.36464313,
    "hits" : [
      {
        "_index" : "my_test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.36464313,
        "_source" : {
          "data" : "a b"
        }
      }
    ]
  }
}
Let me know if this helps!
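To make the slop/in_order semantics concrete, here is a toy plain-Python model of a two-term span_near (an illustration only, not the Lucene implementation): find positions of both terms, then require the right order and at most `slop` unmatched positions between them.

```python
def span_near(tokens, first, second, slop=0, in_order=True):
    """Toy model of span_near with two single-term clauses."""
    for i, a in enumerate(tokens):
        if a != first:
            continue
        for j, b in enumerate(tokens):
            if b != second or j == i:
                continue
            if in_order and j <= i:
                continue  # wrong order
            gap = abs(j - i) - 1  # unmatched positions between the two terms
            if gap <= slop:
                return True
    return False

# Doc 2 from the question ("b" before "a") matches with a generous slop:
span_near(["p", "b", "i", "o", "a"], "b", "a", slop=3)
```

With slop=0 and in_order=true, only adjacent in-order pairs match, which is why doc 1 ("a b") is the sole hit for the query above.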

adding new documents not being show in ElasticSearch index

I am new to Elasticsearch and was messing around with it today. I have a node running on localhost and was creating/updating my cat index. As I added more documents to the index, I noticed that when I send a GET request in Postman to see all of the documents, the new cats I create are not being added. I started noticing the issue after I added my tenth cat. All code is below.
ElasticSearch Version: 6.4.0
Python Version: 3.7.4
my_cat_mapping = {
    "mappings": {
        "_doc": {
            "properties": {
                "breed": { "type": "text" },
                "info": {
                    "cat": { "type": "text" },
                    "name": { "type": "text" },
                    "age": { "type": "integer" },
                    "amount": { "type": "integer" }
                },
                "created": {
                    "type": "date",
                    "format": "strict_date_optional_time||epoch_millis"
                }
            }
        }
    }
}
cat_body = {
    "breed": "Persian Cat",
    "info": {
        "cat": "Black Cat",
        "name": " willy",
        "age": 5,
        "amount": 1
    }
}
def document_add(index_name, doc_type, body, doc_id=None):
    """Function to add a document by providing index_name,
    document type, document contents as body, and document id."""
    resp = es.index(index=index_name, doc_type=doc_type, body=body, id=doc_id)
    print(resp)

document_add("cat", "cat_v1", cat_body, 100)
Since the document id is passed as 100, it just updates the same cat document over and over; presumably it is not changed on every run.
You have to change the document id doc_id each time to add a new cat instead of updating an existing one.
...
cat_id = 100
cat_body = {
    "breed": "Persian Cat",
    "info": {
        "cat": "Black Cat",
        "name": " willy",
        "age": 5,
        "amount": 1
    }
}
...
document_add("cat", "cat_v1", cat_body, cat_id)
With this you can change both cat_id and cat_body to get new cats.
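Another way to avoid silently overwriting document 100 is to stop hard-coding the id at all. A sketch of the idea, with `document_add` replaced by a hypothetical stub that only records what would be indexed (no live cluster is assumed here):

```python
import uuid

indexed = {}  # stand-in for the index: id -> document body

def document_add_stub(index_name, doc_type, body, doc_id=None):
    # Stand-in for es.index(...): generate a fresh id when none is given,
    # mimicking Elasticsearch's auto-id behaviour.
    doc_id = doc_id or uuid.uuid4().hex
    indexed[doc_id] = body
    return doc_id

# Three calls without an explicit id create three separate documents...
ids = [document_add_stub("cat", "cat_v1", {"breed": "Persian Cat"}) for _ in range(3)]
# ...whereas a fixed id like 100 would overwrite the same document each time.
```

With the real client, simply omitting `id` in `es.index(...)` has the same effect: Elasticsearch assigns a unique id per document.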

How to get value from Json using groovy script

Hi, I am new to Groovy and API automation. I have the following JSON and I want to add an assertion to check cycleStartDate and cycleEndDate based on the sequence number.
{
  "status" : "success",
  "locale" : "",
  "data" : {
    "periods" : [
      {
        "payCycleId" : "custompayperiod",
        "sequence" : 1,
        "cycleStartDate" : "2018-10-01",
        "cycleEndDate" : "2018-10-08"
      },
      {
        "payCycleId" : "custompayperiod",
        "sequence" : 2,
        "cycleStartDate" : "2018-10-09",
        "cycleEndDate" : "2018-10-16"
      }
    ]
  }
}
How do I check that cycleStartDate for sequence 1 is 2018-10-01?
Groovy provides JsonSlurper class that makes parsing JSON documents easier. Consider following example that reads JSON document as a String (it supports different initialization methods as well):
import groovy.json.JsonSlurper
def inputJson = '''{
  "status" : "success",
  "locale" : "",
  "data" : {
    "periods" : [
      {
        "payCycleId" : "custompayperiod",
        "sequence" : 1,
        "cycleStartDate" : "2018-10-01",
        "cycleEndDate" : "2018-10-08"
      },
      {
        "payCycleId" : "custompayperiod",
        "sequence" : 2,
        "cycleStartDate" : "2018-10-09",
        "cycleEndDate" : "2018-10-16"
      }
    ]
  }
}'''

def json = new JsonSlurper().parseText(inputJson)

assert json.data.periods.find { it.sequence == 1 }.cycleStartDate == '2018-10-01'
Having loaded the JSON document, you can extract data by accessing nested fields. For instance, json.data.periods gives you access to the array stored in your JSON document. The method find { it.sequence == 1 } then returns the node from this array whose sequence field equals 1. Lastly, you can extract cycleStartDate and compare it with the expected date.
You can find more useful examples in Groovy's official documentation.
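For reference, the same lookup has an identical shape in Python: parse, find the period with sequence 1, and compare the date.

```python
import json

doc = json.loads("""
{"data": {"periods": [
  {"sequence": 1, "cycleStartDate": "2018-10-01", "cycleEndDate": "2018-10-08"},
  {"sequence": 2, "cycleStartDate": "2018-10-09", "cycleEndDate": "2018-10-16"}
]}}
""")

# next(...) plays the role of Groovy's find { it.sequence == 1 }.
period = next(p for p in doc["data"]["periods"] if p["sequence"] == 1)
start = period["cycleStartDate"]
```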
