How to simplify `tracing-actix-web` output - rust

The log output from tracing-actix-web looks like this:
2022-12-13T13:36:46.546387Z INFO HTTP request{http.method=GET http.route=/storeapi/goods/getInfo/{id} http.flavor=1.0 http.scheme=https http.host=store.example.com http.client_ip=127.0.0.1 http.user_agent=okhttp/4.10.0 http.target=/storeapi/goods/getInfo/19 otel.name=HTTP GET /storeapi/goods/getInfo/{id} otel.kind="server" request_id=813af77e-331a-4818-a84d-7e9bee1a83f7}: step 1
2022-12-13T13:36:46.546387Z INFO HTTP request{http.method=GET http.route=/storeapi/goods/getInfo/{id} http.flavor=1.0 http.scheme=https http.host=store.example.com http.client_ip=127.0.0.1 http.user_agent=okhttp/4.10.0 http.target=/storeapi/goods/getInfo/19 otel.name=HTTP GET /storeapi/goods/getInfo/{id} otel.kind="server" request_id=813af77e-331a-4818-a84d-7e9bee1a83f7}: step 2
2022-12-13T13:36:46.546387Z INFO HTTP request{http.method=GET http.route=/storeapi/goods/getInfo/{id} http.flavor=1.0 http.scheme=https http.host=store.example.com http.client_ip=127.0.0.1 http.user_agent=okhttp/4.10.0 http.target=/storeapi/goods/getInfo/19 otel.name=HTTP GET /storeapi/goods/getInfo/{id} otel.kind="server" request_id=813af77e-331a-4818-a84d-7e9bee1a83f7}: step 3
2022-12-13T13:36:46.546387Z INFO HTTP request{http.method=GET http.route=/storeapi/goods/getInfo/{id} http.flavor=1.0 http.scheme=https http.host=store.example.com http.client_ip=127.0.0.1 http.user_agent=okhttp/4.10.0 http.target=/storeapi/goods/getInfo/19 otel.name=HTTP GET /storeapi/goods/getInfo/{id} otel.kind="server" request_id=813af77e-331a-4818-a84d-7e9bee1a83f7 http.status_code=200 otel.status_code="OK"}: tracing_actix_web::root_span_builder: close time.busy=205µs time.idle=13.3µs
It contains a lot of repetitive information.
Can I output a log like the following instead?
2022-12-13T13:36:46.546387Z INFO 813af77e-331a-4818-a84d-7e9bee1a83f7: step 1
2022-12-13T13:36:46.546387Z INFO 813af77e-331a-4818-a84d-7e9bee1a83f7: step 2
2022-12-13T13:36:46.546387Z INFO 813af77e-331a-4818-a84d-7e9bee1a83f7: step 3
2022-12-13T13:36:46.546387Z INFO 813af77e-331a-4818-a84d-7e9bee1a83f7: HTTP request{http.method=GET http.route=/storeapi/goods/getInfo/{id} http.flavor=1.0 http.scheme=https http.host=store.example.com http.client_ip=127.0.0.1 http.user_agent=okhttp/4.10.0 http.target=/storeapi/goods/getInfo/19 otel.name=HTTP GET /storeapi/goods/getInfo/{id} otel.kind="server" request_id=813af77e-331a-4818-a84d-7e9bee1a83f7 http.status_code=200 otel.status_code="OK"} close time.busy=205µs time.idle=13.3µs
The request_id in the span needs to match the one generated by the tracing-actix-web crate.
Part of the code is as follows:
use std::str::FromStr;

use actix_web::{App, HttpServer};
use tracing_actix_web::TracingLogger;

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    tracing_subscriber::FmtSubscriber::builder()
        .with_max_level(tracing::Level::from_str(&APP_CONFIG.log.level).unwrap())
        .with_span_events(tracing_subscriber::fmt::format::FmtSpan::CLOSE)
        .init();
    db_config::init();
    HttpServer::new(move || {
        App::new()
            .wrap(TracingLogger::default())
            .wrap(auth_config::Auth)
            .configure(router_config::init)
    })
    .bind((APP_CONFIG.server.bind.as_str(), APP_CONFIG.server.port))?
    .run()
    .await
}
// Inside a request handler:
{
    tracing::info!("step 1");
    tracing::info!("step 2");
    tracing::info!("step 3");
}
I tried implementing a custom RootSpanBuilder that generates a UUID for the root span, but that UUID is different from the request_id of the request:
2022-12-13T13:36:46.546387Z INFO 3fdc3e0b-571a-46e1-bf4b-0bc6238247d8: HTTP request{ ...... request_id=813af77e-331a-4818-a84d-7e9bee1a83f7 ......}
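One way to keep the ids consistent is to not generate a fresh UUID at all, but to read the RequestId that the TracingLogger middleware stores in the request extensions and record only that field on the root span. Below is a minimal, untested sketch of that idea (MinimalRootSpanBuilder is a made-up name, and the exact RootSpanBuilder trait signature may vary between tracing-actix-web versions):

use actix_web::body::MessageBody;
use actix_web::dev::{ServiceRequest, ServiceResponse};
use actix_web::{Error, HttpMessage};
use tracing::Span;
use tracing_actix_web::{DefaultRootSpanBuilder, RequestId, RootSpanBuilder};

pub struct MinimalRootSpanBuilder;

impl RootSpanBuilder for MinimalRootSpanBuilder {
    fn on_request_start(request: &ServiceRequest) -> Span {
        // Reuse the id that the TracingLogger middleware has already inserted
        // into the request extensions instead of generating a new UUID, so the
        // span field matches the request_id seen elsewhere.
        let request_id = request
            .extensions()
            .get::<RequestId>()
            .map(|id| id.to_string())
            .unwrap_or_default();
        tracing::info_span!("HTTP request", request_id = %request_id)
    }

    fn on_request_end<B: MessageBody>(span: Span, outcome: &Result<ServiceResponse<B>, Error>) {
        // Keep the default close-time behaviour (status code, error recording).
        DefaultRootSpanBuilder::on_request_end(span, outcome);
    }
}

It would be registered with .wrap(TracingLogger::<MinimalRootSpanBuilder>::new()) instead of TracingLogger::default(). How compactly the span prefix is printed on each event is then a formatting concern of the fmt subscriber rather than of the span builder.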

Related

Nakama Godot Error code Not Found when trying to auth

extends Node

var client : NakamaClient
var session : NakamaSession
var socket : NakamaSocket
var username = "example"
var play_service

func _ready():
    if Engine.has_singleton("GodotPlayGamesServices"):
        play_service = Engine.get_singleton("GodotPlayGamesServices")
        play_service.init(true, true, true, "id")
        play_service.connect("_on_sign_in_success", self, "ConnectToNakama")
        play_service.connect("_on_sign_in_failed", self, "_on_sign_in_failed")
        play_service.signIn() # <-

func _on_sign_in_failed(status:int) -> void:
    pass

func ConnectToNakama(account_id:String) -> void:
    var split = account_id.split('"')
    var id = str(split[11])
    print(id)
    print(username)
    client = Nakama.create_client('defaultkey', "ip", 7351, 'http', 3)
    session = yield(client.authenticate_device_async(id, username), 'completed')
    if session.is_exception():
        print("connection has failed " + session.exception.message)
        return
    socket = Nakama.create_socket_from(client)
    yield(socket.connect_async(session), "completed")
    print("Connected!")
Server version: 3.14.0+e2df3a29
Godot version: 3.5.1
I get this error code:
DEBUG === Freeing request 1
01-15 22:52:00.542 20046 20263 I godot : === Nakama : DEBUG === Request 1 returned response code: 404, RPC code: 5, error: Not Found
01-15 22:52:00.542 20046 20263 I godot : connection has failed Not Found
I double-checked everything but I can't find the problem. Help is much appreciated.
I've fixed the issue with someone's help. You just have to change the port from 7351 to 7350, because 7351 is for admins.
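For reference, the only change is the port number in the client creation call (a one-line sketch; everything else stays the same):

# 7350 = client API port, 7351 = admin console
client = Nakama.create_client('defaultkey', "ip", 7350, 'http', 3)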

HTTPBuilder Retry strategy

I have the following method in my REST request handler:
def get_Game(String gameId) {
    def gameInfo
    http.request(Method.GET, ContentType.XML) {
        uri.path = gamePath
        uri.query = [id : gameId]
        response.'200' = { resp, reader ->
            gameInfo = reader.item
        }
        response.'202' = {
            println 'retry'
        }
    }
    return gameInfo
}
The response status for this request is 200 if data is retrieved, or 202 when the request is queued, and I need to keep retrying until the status is 200.
Is there any way to retry a request while the response code is 202?

django and asyncio - fetch data asynchronously from remote REST endpoint

I'm trying to rewrite a django management command in an asynchronous way using asyncio and aiohttp. These are the files involved:
# rest_async.py
async def t_search_coro(token, loop, **kwargs):
    """
    ws T Search Query:
    kwargs:
    - modification_start_date: (str) Format: YYYY-MM-DDTHH:MM:SS (e.g.: 2013-02-26T11:00:00)
    - modification_end_date: (str) Format: YYYY-MM-DDTHH:MM:SS (e.g.: 2013-02-26T11:00:00)
    - lo_type: (str) LO Type. Defaults to 'Event'
    - status: (str) T Status of the LO. Required
    - portal: portal. Default: settings.PORTAL
    - page_nr: PageNumber querystring parameter. Default: 1
    """
    path = '/services/api/TSearch'
    method = 'GET'
    modification_start_date = kwargs.pop('modification_start_date')
    modification_end_date = kwargs.pop('modification_end_date')
    lo_type = kwargs.pop('lo_type', 'Event')
    status = kwargs.pop('status')
    portal = kwargs.pop('portal', settings.PORTAL)
    page_nr = kwargs.pop('page_nr', 1)
    debugging = kwargs.pop('debugging', True)
    signature_kws = get_signature_kwargs(token, path, method)
    headers = signature_kws.get('headers')
    params = {
        'LOType': lo_type,
        'Status': status,
        'PageNumber': page_nr,
        'format': 'JSON'
    }
    if modification_start_date is not None:
        params['ModificationStartDate'] = modification_start_date
    if modification_end_date is not None:
        params['ModificationEndDate'] = modification_end_date
    service_end_point = 'https://{}.example.net{}'.format(portal, path)
    print("fetching data: {} - {}".format(modification_start_date, modification_end_date))
    async with aiohttp.ClientSession(loop=loop) as session:
        async with session.get(url=service_end_point, params=params, headers=headers) as resp:
            assert resp.status == 200
            return await resp.read()
# utils_async.py
async def fetch_t_data_coro(
        loop, lo_type='Session', modification_start_date=now()-timedelta(hours=22), modification_end_date=now(),
        status='Completed', **kwargs):
    date_fmt = "%Y-%m-%dT%H:%M:%S"
    if (modification_end_date - modification_start_date).total_seconds() > timedelta(days=1).total_seconds():
        raise Exception("modification start/end datetime interval must be within 24 hrs."
                        "\nmod. start date: {}\nmod. end date: {}".format(
                            modification_start_date.strftime(date_fmt), modification_end_date.strftime(date_fmt)
                        ))
    debugging = kwargs.pop('debugging', False)
    page_nr = kwargs.get('page_nr', 1)
    modification_start_date = modification_start_date.strftime(date_fmt)
    modification_end_date = modification_end_date.strftime(date_fmt)
    rtn_data = []
    params = {
        'LOType': lo_type, 'Status': status, 'PageNumber': page_nr, 'format': 'JSON'
    }
    already_added = set()
    while True:
        data = await rest_async.t_search_coro(
            token, loop, modification_start_date=modification_start_date, modification_end_date=modification_end_date,
            lo_type=lo_type, status=status, page_nr=page_nr, debugging=debugging
        )
        data_dict = json.loads(data.decode('utf-8'))
        if 'data' not in data_dict:
            break
        total_pages = data_dict['data'][0]['T_Item']['TotalPages']
        t_raw_data = data_dict['data'][0]['T_Item']['T']
        for item in t_raw_data:
            _h = hash(json.dumps(item, sort_keys=True))
            if _h in already_added:
                continue
            already_added.add(_h)
            rtn_data.append(item)
        if page_nr >= total_pages:
            break
        page_nr += 1
    return rtn_data
# load_data_async.py (actual django management command)
import asyncio
from datetime import timedelta, datetime
import argparse
import logging

from django.core.management.base import BaseCommand
from django.utils.timezone import now

from myapp.utils_async import fetch_t_data_coro

RUNNING_INTERVAL_MINS = 60
logger = logging.getLogger('my_proj')
MAX_BACKDAYS = 160
BACKDAYS_HOURS = {3, 9, 15, 21}
DEFAULT_TIMEFRAME = 24
GO_BACK_DAYS = 30
GO_BACK_DAYS_TIMEFRAME = 24


class Command(BaseCommand):
    help = "fetch data asynchronously"

    def add_arguments(self, parser):
        parser.add_argument(
            '--timeframe', action='store', dest='timeframe', default=DEFAULT_TIMEFRAME, type=int,
            help='Timeframe hours to be used (default to 24, range: 1 to 24)'
        )
        parser.add_argument(
            '--backdays', action='store', dest='backdays', default=None, type=int,
            help='repeat the command execution (for the same timeframe) n days before the current day'
        )
        parser.add_argument('--start-date', type=valid_date_type)
        parser.add_argument('--end-date', type=valid_date_type)

    def handle(self, *args, **options):
        self.loop = asyncio.get_event_loop()
        self.loop.run_until_complete(self._handle(*args, **options))

    async def _handle(self, *args, **options):
        timeframe = options.get('timeframe')
        backdays = options.get('backdays', None)
        start_date = options.get('start_date')
        end_date = options.get('end_date')
        backdays = backdays + 1 if backdays is not None else 1
        if all([start_date is not None, end_date is not None]):
            days_range = [start_date + timedelta(days=x) for x in range((end_date - start_date).days + 1)]
        else:
            days_range = [now() - timedelta(days=x) for x in range(backdays)]
        for mod_end_datetime in days_range:
            mod_start_datetime = mod_end_datetime - timedelta(minutes=RUNNING_INTERVAL_MINS * timeframe)
            data = await fetch_t_data_coro(
                loop=self.loop, modification_start_date=mod_start_datetime, modification_end_date=mod_end_datetime
            )


def valid_date_type(arg_date_str):
    try:
        return datetime.strptime(arg_date_str, "%Y-%m-%d")
    except ValueError:
        msg = "Given Date ({0}) not valid! Expected format, YYYY-MM-DD!".format(arg_date_str)
        raise argparse.ArgumentTypeError(msg)
I then tried to run the cmd as:
python manage.py load_data_async --start-date 2018-04-20 --end-date 2018-06-6
The command runs without errors; however, it seems from the print statements that the coroutines are executed sequentially, in the same way as the original synchronous code:
# output
fetching data: 2018-04-19T00:00:00 - 2018-04-20T00:00:00
fetching data: 2018-04-19T00:00:00 - 2018-04-20T00:00:00
fetching data: 2018-04-20T00:00:00 - 2018-04-21T00:00:00
fetching data: 2018-04-20T00:00:00 - 2018-04-21T00:00:00
fetching data: 2018-04-20T00:00:00 - 2018-04-21T00:00:00
fetching data: 2018-04-20T00:00:00 - 2018-04-21T00:00:00
fetching data: 2018-04-20T00:00:00 - 2018-04-21T00:00:00
fetching data: 2018-04-20T00:00:00 - 2018-04-21T00:00:00
fetching data: 2018-04-20T00:00:00 - 2018-04-21T00:00:00
fetching data: 2018-04-21T00:00:00 - 2018-04-22T00:00:00
fetching data: 2018-04-21T00:00:00 - 2018-04-22T00:00:00
fetching data: 2018-04-21T00:00:00 - 2018-04-22T00:00:00
fetching data: 2018-04-22T00:00:00 - 2018-04-23T00:00:00
fetching data: 2018-04-23T00:00:00 - 2018-04-24T00:00:00
fetching data: 2018-04-24T00:00:00 - 2018-04-25T00:00:00
fetching data: 2018-04-24T00:00:00 - 2018-04-25T00:00:00
fetching data: 2018-04-25T00:00:00 - 2018-04-26T00:00:00
fetching data: 2018-04-25T00:00:00 - 2018-04-26T00:00:00
fetching data: 2018-04-25T00:00:00 - 2018-04-26T00:00:00
fetching data: 2018-04-26T00:00:00 - 2018-04-27T00:00:00
fetching data: 2018-04-26T00:00:00 - 2018-04-27T00:00:00
fetching data: 2018-04-26T00:00:00 - 2018-04-27T00:00:00
fetching data: 2018-04-26T00:00:00 - 2018-04-27T00:00:00
fetching data: 2018-04-26T00:00:00 - 2018-04-27T00:00:00
fetching data: 2018-04-26T00:00:00 - 2018-04-27T00:00:00
fetching data: 2018-04-26T00:00:00 - 2018-04-27T00:00:00
...
...
fetching data: 2018-05-22T00:00:00 - 2018-05-23T00:00:00
fetching data: 2018-05-22T00:00:00 - 2018-05-23T00:00:00
fetching data: 2018-05-23T00:00:00 - 2018-05-24T00:00:00
fetching data: 2018-05-23T00:00:00 - 2018-05-24T00:00:00
fetching data: 2018-05-24T00:00:00 - 2018-05-25T00:00:00
fetching data: 2018-05-25T00:00:00 - 2018-05-26T00:00:00
fetching data: 2018-05-25T00:00:00 - 2018-05-26T00:00:00
fetching data: 2018-05-25T00:00:00 - 2018-05-26T00:00:00
fetching data: 2018-05-25T00:00:00 - 2018-05-26T00:00:00
fetching data: 2018-05-26T00:00:00 - 2018-05-27T00:00:00
fetching data: 2018-05-27T00:00:00 - 2018-05-28T00:00:00
fetching data: 2018-05-28T00:00:00 - 2018-05-29T00:00:00
fetching data: 2018-05-29T00:00:00 - 2018-05-30T00:00:00
fetching data: 2018-05-30T00:00:00 - 2018-05-31T00:00:00
fetching data: 2018-05-30T00:00:00 - 2018-05-31T00:00:00
fetching data: 2018-05-30T00:00:00 - 2018-05-31T00:00:00
fetching data: 2018-05-31T00:00:00 - 2018-06-01T00:00:00
fetching data: 2018-05-31T00:00:00 - 2018-06-01T00:00:00
fetching data: 2018-06-01T00:00:00 - 2018-06-02T00:00:00
fetching data: 2018-06-01T00:00:00 - 2018-06-02T00:00:00
fetching data: 2018-06-01T00:00:00 - 2018-06-02T00:00:00
fetching data: 2018-06-01T00:00:00 - 2018-06-02T00:00:00
fetching data: 2018-06-02T00:00:00 - 2018-06-03T00:00:00
fetching data: 2018-06-02T00:00:00 - 2018-06-03T00:00:00
fetching data: 2018-06-02T00:00:00 - 2018-06-03T00:00:00
fetching data: 2018-06-03T00:00:00 - 2018-06-04T00:00:00
fetching data: 2018-06-03T00:00:00 - 2018-06-04T00:00:00
fetching data: 2018-06-04T00:00:00 - 2018-06-05T00:00:00
fetching data: 2018-06-04T00:00:00 - 2018-06-05T00:00:00
fetching data: 2018-06-05T00:00:00 - 2018-06-06T00:00:00
fetching data: 2018-06-05T00:00:00 - 2018-06-06T00:00:00
fetching data: 2018-06-05T00:00:00 - 2018-06-06T00:00:00
fetching data: 2018-06-05T00:00:00 - 2018-06-06T00:00:00
Has anyone noticed something wrong, or is this the correct behavior?
I have no experience with asyncio, but I wasn't expecting sequential execution...
Python version: 3.6.3
The code seems to await the fetch_t_data_coro invocations one by one, which forces them to run in sequence.
To run them in parallel, you can use asyncio.gather:
coros = []
for mod_end_datetime in days_range:
    mod_start_datetime = mod_end_datetime - timedelta(minutes=RUNNING_INTERVAL_MINS * timeframe)
    coros.append(fetch_t_data_coro(
        loop=self.loop, modification_start_date=mod_start_datetime, modification_end_date=mod_end_datetime
    ))
data_list = await asyncio.gather(*coros)
Two unrelated notes:
The code instantiates aiohttp.ClientSession in each t_search_coro call. This is an anti-pattern: you should create a single ClientSession at the top level and pass it down to the individual coroutines (even ones running in parallel), so that they all share the same session instance; see the sketch after these notes.
Beginning with Python 3.5.3, asyncio.get_event_loop() correctly picks up the running event loop when called from a coroutine. As a result, you don't need to pass the loop object down through the coroutine invocations; just call get_event_loop when you need it (which in your code you don't, since ClientSession also correctly infers the event loop on its own).
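For illustration, a minimal sketch combining both points (it assumes fetch_t_data_coro has been adapted to accept a session argument instead of loop; fetch_all is a made-up helper name):

import asyncio
from datetime import timedelta

import aiohttp

RUNNING_INTERVAL_MINS = 60

async def fetch_all(days_range, timeframe):
    # One ClientSession for the whole run; every gathered coroutine reuses
    # its connection pool instead of opening a new session per request.
    async with aiohttp.ClientSession() as session:
        coros = [
            fetch_t_data_coro(
                session,  # passed down instead of the loop object
                modification_start_date=end - timedelta(minutes=RUNNING_INTERVAL_MINS * timeframe),
                modification_end_date=end,
            )
            for end in days_range
        ]
        return await asyncio.gather(*coros)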

Spark Opening multiple threads for a single job while trying to run parallel jobs

We have a use case where we need to run parallel Spark SQL queries on a single Spark session via a REST API (Akka HTTP).
Application Conf
my-blocking-dispatcher {
  type = Dispatcher
  executor = "thread-pool-executor"
  thread-pool-executor {
    // or in Akka 2.4.2+
    fixed-pool-size = 4
  }
  throughput = 100
}
Spark Service
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}
import org.apache.spark.sql.execution.ui.CustomSqlListener
import org.apache.spark.sql.{Row, SparkSession}

import scala.collection.mutable.ListBuffer
import scala.concurrent.{ExecutionContext, Future}
import scala.util.parsing.json.JSON

trait SparkService {
  val session = SparkSession
    .builder()
    .config("spark.scheduler.mode", "FAIR")
    .appName("QueryCancellation")
    .master("local[*]")
    .enableHiveSupport()
    .getOrCreate()

  var queryJobMapStart = Map[String, String]()
  var queryStatusMap = Map[String, String]()

  session.sparkContext.setLogLevel("ERROR")
  session.sparkContext.setCallSite("Reading the file")

  val dataDF = session.read
    .format("csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load("C:\\dev\\QueryCancellation\\src\\main\\resources\\Baby_Names__Beginning_2007.csv")
  dataDF.createOrReplaceTempView("data_tbl")
  //dataDF.printSchema()

  val customListener = new CustomSqlListener(session.sparkContext.getConf, queryJobMapStart, queryStatusMap)
  val appListener = session.sparkContext.addSparkListener(customListener)

  def runQuery(query: String, queryId: String)(implicit ec: ExecutionContext) = {
    // println("queryId: " + queryId + " query:" + query)
    session.sparkContext.setLocalProperty("callSite.short", queryId)
    session.sparkContext.setLocalProperty("callSite.long", query)
    session.sql(query).show(2)
    //Thread.sleep(60000)
    // Future(data)
  }
}
object SparkService extends SparkService
Query Service
import java.util.UUID
import java.util.concurrent.ConcurrentHashMap

import akka.actor.ActorSystem
import akka.http.scaladsl.server.Route
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer

trait QueryService extends SparkService {
  implicit val system: ActorSystem
  implicit val materializer: ActorMaterializer
  // implicit val sparkSession: SparkSession
  // val datasetMap = new ConcurrentHashMap[String, Dataset[Row]]()
  implicit val blockingDispatcher = system.dispatchers.lookup("my-blocking-dispatcher")

  val route: Route =
    pathSingleSlash {
      get {
        complete {
          "welcome to rest service"
        }
      }
    } ~
    path("runQuery" / "county" / Segment) { county =>
      get {
        complete {
          var res = ""
          val documentId = "user ::" + UUID.randomUUID().toString
          val queryId = System.nanoTime().toString
          val stmt = "select a.sex,count(*) from data_tbl a,data_tbl b where b.county=a.county and a.country= '" + county + "' group by a.sex"
          val result = runQuery(stmt, queryId)
          /* var entity = queryResult match {
            case Some(result) => s"Query : $stmt is submitted. Query id is $result. User id is $documentId"
            case None => s"Query : $stmt could not be submitted. User id is $documentId"
          } */
          /* result.onComplete {
            case Success(value) => println(s"Query completed")
            case Failure(e) => None
          } */
          var entity = s"Query : $stmt is submitted. Query id is $queryId. User id is $documentId"
          entity
        }
      }
    } ~
    path("getStatus" / """[\w[0-9]-_]+""".r) { id =>
      get {
        complete {
          val statusResult = getStatus(id)
          var res = statusResult match {
            case Some(result) => s"Status for query id : $id is $result"
            case None => s"Could not find the status of the query id : $id"
          }
          res
        }
      }
    } ~
    path("killQuery" / """[\w[0-9]-_]+""".r) { id =>
      get {
        complete {
          val statusResult = killQuery(id)
          s"Query id $id is cancelled."
        }
      }
    }
}
Query Server
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.stream.ActorMaterializer

import scala.concurrent.Future
//import scala.concurrent.ExecutionContext.Implicits.global

class QueryServer(implicit val system: ActorSystem,
                  implicit val materializer: ActorMaterializer) extends QueryService {

  def startServer(address: String, port: Int) = {
    Http().bindAndHandle(route, address, port)
  }
}

object QueryServer extends App {
  implicit val actorSystem = ActorSystem("query-server")
  implicit val materializer = ActorMaterializer()
  val server = new QueryServer()
  server.startServer("localhost", 8080)
  println("running server at localhost 8080")
}
When I try to run a query on Spark SQL via http://localhost:8080/runQuery/county/'KINGS', multiple job ids are created, and most of them are skipped.
Below is a screenshot of the Spark UI; I cannot understand why the highlighted tasks are being created.
Below is the console log which shows the job was executed only once:
"running server at localhost 8080
173859599588358->2
****************************************************************************************
****************************************************************************************
Job id 2 is completed
--------------------------------------------------------
173859599588358->3
****************************************************************************************
****************************************************************************************
Job id 3 is completed
--------------------------------------------------------
173859599588358->4
****************************************************************************************
****************************************************************************************
Job id 4 is completed
--------------------------------------------------------
173859599588358->5
****************************************************************************************
****************************************************************************************
Job id 5 is completed
--------------------------------------------------------
173859599588358->6
****************************************************************************************
****************************************************************************************
Job id 6 is completed
--------------------------------------------------------
173859599588358->7
****************************************************************************************
****************************************************************************************
Job id 7 is completed
--------------------------------------------------------
+---+--------+
|sex|count(1)|
+---+--------+
| F|12476769|
| M|12095080|
+---+--------+
Spark version: 2.2.1
It looks like the Spark Catalyst optimizer is optimizing the query.
More than one DAG is created, and Spark probably chooses the best execution plan.
You can inspect the execution plan; there may be more than one.
I don't think this is related to Akka HTTP.
Try running the code in the Spark shell to verify this claim.
The part option("inferSchema","true") will trigger a Spark Job in addition to the rest of your logic. See Spark API for details.
Due to the fact that the Jobs are short-lived, they might be consecutive and not parallel (telling from the start times). Have you had a look at the SQL tab of the Spark UI. It is possible that all these jobs (except for the one infering the schema) are part of one SQL query executed.
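If the schema-inference job is the unwanted extra work, one option is to supply an explicit schema so the CSV is not scanned up front. A sketch only; the column names and types below are guesses and should be replaced with the real ones from Baby_Names__Beginning_2007.csv:

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Explicit schema: avoids the extra Spark job that option("inferSchema", "true")
// runs to scan the file. Column names/types are illustrative.
val babyNamesSchema = StructType(Seq(
  StructField("Year", IntegerType),
  StructField("First Name", StringType),
  StructField("County", StringType),
  StructField("Sex", StringType),
  StructField("Count", IntegerType)
))

val dataDF = session.read
  .format("csv")
  .option("header", "true")
  .schema(babyNamesSchema)
  .load("C:\\dev\\QueryCancellation\\src\\main\\resources\\Baby_Names__Beginning_2007.csv")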

Connecting Spark and elasticsearch

I'm trying to run a simple piece of Spark code that copies the contents of an RDD into an Elasticsearch document. Both Spark and Elasticsearch are installed on my local machine.
import org.elasticsearch.spark.sql._
import org.apache.spark.sql.SparkSession

object ES {

  case class Person(ID: Int, name: String, age: Int, numFriends: Int);

  def mapper(line: String): Person = {
    val fields = line.split(',')
    val person: Person = Person(fields(0).toInt, fields(1), fields(2).toInt, fields(3).toInt)
    return person
  }

  def main(args: Array[String]): Unit = {
    val spark: SparkSession =
      SparkSession
        .builder().master("local[*]")
        .appName("SparkEs")
        .config("es.index.auto.create", "true")
        .config("es.nodes", "localhost:9200")
        .getOrCreate()

    import spark.implicits._

    val lines = spark.sparkContext.textFile("/home/herch/fakefriends.csv")
    val people = lines.map(mapper).toDF()
    people.saveToEs("spark/people")
  }
}
I'm getting this error after multiple retries:
INFO HttpMethodDirector: I/O exception (java.net.ConnectException) caught when processing request: Connection timed out (Connection timed out)
INFO HttpMethodDirector: Retrying request
INFO DAGScheduler: ResultStage 0 (runJob at EsSparkSQL.scala:97) failed in 525.902 s due to Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost, executor driver): org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings) - all nodes failed; tried [[192.168.0.22:9200]]
It seems to be a connection problem, but I cannot identify its cause. Elasticsearch is running on my local machine on localhost:9200, and I'm able to query it via the terminal.
Since you are running both locally, you need to set es.nodes.wan.only to true (default false) in your SparkConf. I ran into the same exact problem and that fixed it.
See: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html
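In the question's SparkSession builder that could look roughly like this (a sketch, untested; the host/port split follows the next answer's advice):

val spark: SparkSession = SparkSession
  .builder().master("local[*]")
  .appName("SparkEs")
  .config("es.index.auto.create", "true")
  .config("es.nodes", "localhost")
  .config("es.port", "9200")
  // Treat the configured address as the only reachable node instead of
  // discovering and connecting to data-node IPs, which fails on a local setup.
  .config("es.nodes.wan.only", "true")
  .getOrCreate()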
As seen on the elasticsearch/spark connector documentation page, you need to separate the host and port arguments inside the configuration:
val options13 = Map("path" -> "spark/index",
"pushdown" -> "true",
"es.nodes" -> "someNode", "es.port" -> "9200")
See how es.nodes only contains the host name, and es.port contains the HTTP port.
