Spark 3 custom Source not exported to Prometheus - apache-spark

I wrote a custom Source that successfully exports my metrics to the ConsoleSink. However, the metrics don't show up at the Prometheus endpoint.
Is there a special way to enable exporting metrics to Prometheus?
class TestMetricsSource extends Source {
  override val sourceName: String = "TestMetricSource"
  override val metricRegistry: MetricRegistry = new MetricRegistry
  val METRIC: Counter = metricRegistry.counter(MetricRegistry.name("testCounter"))
}
I registered it:
val source = new TestMetricsSource
SparkEnv.get.metricsSystem.registerSource(source)
and set a value:
source.METRIC.inc(100)
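One thing worth checking (a sketch, not a confirmed fix for this exact setup): Spark 3 only serves Prometheus-format metrics when the PrometheusServlet sink is enabled in the metrics configuration, either in conf/metrics.properties or through spark.metrics.conf.* properties. Assuming the default driver UI port, something like the following should expose driver metrics, including a registered custom source, under /metrics/prometheus:

import org.apache.spark.sql.SparkSession

// Sketch: enable the PrometheusServlet sink via Spark conf instead of metrics.properties.
// Property names follow the Spark 3 monitoring docs; adjust paths/ports to your deployment.
val spark = SparkSession.builder()
  .appName("metrics-test")
  .config("spark.metrics.conf.*.sink.prometheusServlet.class",
          "org.apache.spark.metrics.sink.PrometheusServlet")
  .config("spark.metrics.conf.*.sink.prometheusServlet.path",
          "/metrics/prometheus")
  // Optional: executor metrics in Prometheus format, served by the driver UI.
  .config("spark.ui.prometheus.enabled", "true")
  .getOrCreate()

// Driver metrics should then be scrapeable at http://<driver-host>:4040/metrics/prometheus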

Related

How to get the taskID or mapperID (something like the partitionID in Spark) in a Hive UDF?

As the title says: how do I get the taskID or mapperID (something like the partitionID in Spark) in a Hive UDF?
You can access task information using TaskContext:
import org.apache.spark.TaskContext
sc.parallelize(Seq[Int](), 4).mapPartitions(_ => {
  val ctx = TaskContext.get
  val stageId = ctx.stageId
  val partId = ctx.partitionId
  val hostname = java.net.InetAddress.getLocalHost().getHostName()
  Iterator(s"Stage: $stageId, Partition: $partId, Host: $hostname")
}).collect.foreach(println)
Similar functionality has been added to PySpark in Spark 2.2.0 (SPARK-18576):
from pyspark import TaskContext
import socket
def task_info(*_):
    ctx = TaskContext()
    return ["Stage: {0}, Partition: {1}, Host: {2}".format(
        ctx.stageId(), ctx.partitionId(), socket.gethostname())]

for x in sc.parallelize([], 4).mapPartitions(task_info).collect():
    print(x)
I think this will give you the task information, including the map ID, that you are looking for.
I found the correct answer on my own; we can get the taskID in a Hive UDF as shown below:
public class TestUDF extends GenericUDF {
    private Text result = new Text();
    private String tmpStr = "";

    @Override
    public void configure(MapredContext context) {
        // get the total number of tasks
        int numTasks = context.getJobConf().getNumMapTasks();
        // get the current taskID
        String taskID = context.getJobConf().get("mapred.task.id");
        this.tmpStr = numTasks + "_h_xXx_h_" + taskID;
    }

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) {
        result.set(this.tmpStr);
        return this.result;
    }

    @Override
    public String getDisplayString(String[] children) {
        return "RowSeq-func()";
    }
}
But this is only effective in the MapReduce execution engine; it does not work in the Spark SQL engine.
The test code is below:
add jar hdfs:///home/dp/tmp/shaw/my_udf.jar;
create temporary function seqx AS 'com.udf.TestUDF';
with core as (
select
device_id
from
test_table
where
p_date = '20210309'
and product = 'google'
distribute by
device_id
)
select
seqx() as seqs,
count(1) as cc
from
core
group by
seqx()
order by
seqs asc
The result in the MR engine (screenshot not reproduced here) shows that we got the task number and taskID successfully.
The result in the Spark engine with the same SQL shows that the UDF is not effective there; we get nothing for the taskID.
If you run your HQL on the Spark engine, call the Hive UDF from it, and really need the partitionId in Spark, see the code below:
import org.apache.spark.TaskContext;

public class TestUDF extends GenericUDF {
    private Text result = new Text();
    private String tmpStr = "";

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        // get the Spark partitionId
        this.tmpStr = TaskContext.getPartitionId() + "-initial-pid";
        return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) {
        // get the Spark partitionId
        this.tmpStr = TaskContext.getPartitionId() + "-evaluate-pid";
        result.set(this.tmpStr);
        return this.result;
    }
}
As shown above, you can get the Spark partitionId by calling TaskContext.getPartitionId() in the overridden initialize or evaluate method of the UDF class.
Note: your UDF must take parameters, e.g. select my_udf(param); that is what causes the UDF to be initialized inside multiple tasks. If your UDF takes no parameters, it is initialized on the driver, and the driver has no TaskContext or partitionId, so you would get nothing.
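A rough usage sketch (assuming a Hive-enabled SparkSession; the jar path, function and table names are reused from the example above) of calling the UDF with a column argument so it runs inside tasks:

spark.sql("ADD JAR hdfs:///home/dp/tmp/shaw/my_udf.jar")
spark.sql("CREATE TEMPORARY FUNCTION seqx AS 'com.udf.TestUDF'")
// Passing a column argument forces per-task initialization, so getPartitionId() works.
spark.sql(
  """SELECT seqx(device_id) AS pid, count(1) AS cc
    |FROM test_table
    |WHERE p_date = '20210309' AND product = 'google'
    |GROUP BY seqx(device_id)""".stripMargin
).show(false)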
The result of the above UDF executed on the Spark engine (original screenshot not reproduced here) shows the partitionIds retrieved successfully.

How to broadcast a DataFrame?

I am using spark-sql 2.4.1.
I am creating a broadcast variable as below:
Broadcast<Map<String, Dataset>> bcVariable = javaSparkContext.broadcast(/* read dataset */);
I am passing the bcVariable to a function:
Service.calculateFunction(sparkSession, bcVariable.getValue());
public static class Service {
    public static void calculateFunction(SparkSession sparkSession, Map<String, Dataset> dataSet) {
        System.out.println("---> size : " + dataSet.size()); // prints size 1
        for (Entry<String, Dataset> aEntry : dataSet.entrySet()) {
            System.out.println(aEntry.getKey()); // prints the key
            aEntry.getValue().show();            // throws NullPointerException
        }
    }
}
What is wrong here? How do I pass a Dataset/DataFrame to the function?
Try 2:
Broadcast<Dataset> bcVariable = javaSparkContext.broadcast(/* read dataset */);
I am passing the bcVariable to a function:
Service.calculateFunction(sparkSession, bcVariable.getValue());
public static class Service {
    public static void calculateFunction(SparkSession sparkSession, Dataset dataSet) {
        System.out.println("---> size : " + dataSet.size()); // throws NullPointerException
    }
}
What is wrong here? How do I pass a Dataset/DataFrame to the function?
Try 3:
Dataset metaData = // read dataset from an Oracle table, i.e. the meta-data
I am passing the metaData to a function:
Service.calculateFunction(sparkSession, metaData);
public static class Service {
    public static void calculateFunction(SparkSession sparkSession, Dataset metaData) {
        System.out.println("---> size : " + metaData.size()); // throws NullPointerException
    }
}
What is wrong here? How do I pass a Dataset/DataFrame to the function?
The value to be broadcast can be any plain object, but not a DataFrame.
Service.calculateFunction(sparkSession, metaData) is executed on executors and hence metaData is null (as it was not serialized and sent over the wire from the driver to executors).
broadcast[T](value: T): Broadcast[T]
Broadcast a read-only variable to the cluster, returning a org.apache.spark.broadcast.Broadcast object for reading it in distributed functions. The variable will be sent to each cluster only once.
Think of the DataFrame abstraction as representing a distributed computation that is described in a SQL-like language (the Dataset API or SQL). It simply does not make sense to have it anywhere but on the driver, where computations can be submitted for execution (as tasks on executors).
You simply have to "convert" the data this computation represents (in DataFrame terms) into a local collection, using DataFrame.collect.
Once you have collected the data, you can broadcast it and reference it using the .value method.
The code could look as follows:
Dataset<Row> dataset = // read data
Broadcast<List<Row>> bcVariable =
    javaSparkContext.broadcast(dataset.collectAsList());
Service.calculateFunction(sparkSession, bcVariable.getValue());
The only change compared to your code is collecting the Dataset (e.g. with collectAsList) before broadcasting it.
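To make the pattern concrete, here is a minimal Scala sketch (names and sample data are made up) of the same idea: collect a small lookup DataFrame on the driver, broadcast the plain collection, and read it through .value inside a distributed computation:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("broadcast-sketch").getOrCreate()
import spark.implicits._

// Small lookup data; in the question this would come from the Oracle meta-data table.
val metaDf = Seq(("a", 1), ("b", 2)).toDF("key", "value")

// Collect to a plain Map on the driver, then broadcast that Map (not the DataFrame).
val metaMap = metaDf.collect().map(r => r.getString(0) -> r.getInt(1)).toMap
val bcMeta = spark.sparkContext.broadcast(metaMap)

// Use the broadcast value inside a distributed transformation via .value.
val enriched = Seq("a", "b", "c").toDF("key")
  .map(row => (row.getString(0), bcMeta.value.getOrElse(row.getString(0), -1)))
  .toDF("key", "looked_up_value")
enriched.show()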

how to insert data to HIVE using foreach method in spark structured streaming

I am trying to insert data into a Hive table using the foreach method.
I am using Spark 2.3.0.
Here is my code:
df_drop_window.writeStream
  .foreach(new ForeachWriter[Row]() {
    override def open(partitionId: Long, epochId: Long): Boolean = true
    override def process(value: Row): Unit = {
      println(s">> Processing ${value}")
      // how to convert the value to a dataframe?
    }
    override def close(errorOrNull: Throwable): Unit = {}
  }).outputMode("update").start()
As you can see above, I want to convert the "value" to a dataframe and insert the data into a Hive table, like insert into tablename (select * from dataframe). Can someone help me do this? I am new to Spark streaming.
I can only see the following options available. Can someone say how I can convert value: Row to a dataframe?
I have tried the following, but I am getting an error (org.apache.spark.SparkException: Task not serializable):
df.writeStream
  .foreach(new ForeachWriter[Row]() {
    override def open(partitionId: Long, epochId: Long): Boolean = true
    override def process(value: Row): Unit = {
      val rowsRdd = sc.parallelize(Seq(value))
      val df2 = spark.createDataFrame(rowsRdd, schema)
      df2.createOrReplaceTempView("testing2")
      spark.sql("insert into table are.table_name1 Partition(date) select * from testing2")
    }
    override def close(errorOrNull: Throwable): Unit = {}
  }).outputMode("append").start()
The Spark session is not serializable on the executor side; you would need to broadcast the Spark session.
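An alternative worth noting (not part of the answer above, and it requires upgrading to Spark 2.4+): Structured Streaming's foreachBatch hands each micro-batch to you as a regular DataFrame on the driver, so the normal batch write path can be used instead of building DataFrames inside ForeachWriter. A rough sketch, assuming a Hive-enabled session and reusing the table name from the question:

import org.apache.spark.sql.DataFrame

// Write each micro-batch with the regular batch SQL path (Spark 2.4+ only).
def writeBatch(batchDf: DataFrame, batchId: Long): Unit = {
  batchDf.createOrReplaceTempView("streaming_batch")
  batchDf.sparkSession.sql(
    "insert into table are.table_name1 partition(date) select * from streaming_batch")
}

val query = df_drop_window.writeStream
  .outputMode("update")
  .foreachBatch(writeBatch _)
  .start()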

how to connect to Cassandra at application start up

I have a Play application which needs to connect to Cassandra. I am using DataStax's driver to connect to Cassandra.
I am able to connect to the DB from a controller. The code snippet is below (the full code is from http://manuel.kiessling.net/setting-up-a-scala-sbt-multi-project-with-cassandra-connectivity-and-migrations):
val cluster = new Cluster.Builder().
  addContactPoints(uri.hosts.toArray: _*).
  withPort(uri.port).
  withQueryOptions(new QueryOptions().setConsistencyLevel(defaultConsistencyLevel)).build
val session = cluster.connect
session.execute(s"USE ${uri.keyspace}")
session
I am using the above code in a controller as follows:
class UserController @Inject()(cc: ControllerComponents)(implicit exec: ExecutionContext) extends AbstractController(cc) {
  def addUser = Action.async { implicit request => {
    println("addUser controller called")
    println("testing database connection")
    val uri = CassandraConnectionUri("cassandra://localhost:9042/killrvideo")
    println(s"got uri object ${uri.host}, ${uri.hosts}, ${uri.port}, ${uri.keyspace}")
    val session = Helper.createSessionAndInitKeyspace(uri)
    val resultSet = session.execute(s"select * from users")
    val row = resultSet.one()
    println("got row ", row)
    val user = User(UUID.randomUUID(), UserProfile(true, Some("m@m.com"), Some("m"), Some("c")))
    ...
  }
Though the code works, I suppose I shouldn't be connecting to the database from within a controller. I should connect to the database when the play application starts and inject the connection in the controller. But I don't know how to do this. Is this the right way to create a database application in Play?
Short description:
It's not good practice to connect to C* from a controller class. It is encouraged to have a separate repository/storage class for accessing the DB. You create a DB-access class and inject it into your controller's constructor.
Here is an open-source sample application that I followed to create my own Cassandra application: Play-Framework-Cassandra-Example. You can follow this project.
Long description:
Here are some basic concepts for how to do it:
Step 1:
Define DB configuration in application.conf file:
db {
  keyspace = "persons"
  table = "person_info"
  preparedStatementCacheSize = 100
  session {
    contactPoints = ["127.0.0.1"]
    queryOptions {
      consistencyLevel = "LOCAL_QUORUM"
    }
  }
}
Step 2: Create a singleton class to maintain the connection with the Cassandra DB:
class CassandraConnectionProvider @Inject()(config: Configuration) extends Provider[CassandraConnection] {
  override def get(): CassandraConnection = {
    val hosts = config.getStringList("db.session.contactPoints")
    val keyspace = config.getString("db.keyspace")
    // Use the Cluster Builder if you need to add username/password and handle SSL or tweak the connection
    ContactPoints(hosts.asScala).keySpace(keyspace)
  }
}
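For completeness, a minimal sketch of how such a provider can be wired up so the connection is created once at start-up and injected where needed. The module name is an assumption, and CassandraConnection here is the phantom connectors type used in the referenced sample project:

import com.google.inject.AbstractModule
import com.outworkers.phantom.connectors.CassandraConnection

// Hypothetical Guice module binding the provider as an eager singleton,
// so the Cassandra connection is established at application start-up.
class CassandraModule extends AbstractModule {
  override def configure(): Unit = {
    bind(classOf[CassandraConnection])
      .toProvider(classOf[CassandraConnectionProvider])
      .asEagerSingleton()
  }
}

Registering the module in application.conf (for example play.modules.enabled += "modules.CassandraModule", adjusted to your package) lets Play pick it up when the application starts.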
Step 3: Now create a repository class where you can perform CRUD operations against the DB.
class PhantomPersonRepository @Inject()(config: Configuration, connection: CassandraConnection, ec: ExecutionContext)
  extends CassandraTable[PhantomPersonRepository, Person] with PersonRepository[Future] {

  // See https://github.com/outworkers/phantom/wiki/Using-the-Database-class-and-understanding-connectors
  implicit val session: Session = connection.session
  implicit val keySpace: KeySpace = connection.provider.space
  override val tableName: String = config.getString("db.table").getOrElse("person_info")
  implicit val executionContext: ExecutionContext = ec

  object id extends UUIDColumn(this) with PartitionKey
  object firstName extends StringColumn(this) {
    override def name: String = "first_name"
  }
  object lastName extends StringColumn(this) {
    override def name: String = "last_name"
  }
  object studentId extends StringColumn(this) {
    override def name: String = "student_id"
  }
  object gender extends EnumColumn[Gender.Value](this)

  override implicit val monad: Monad[Future] = cats.instances.future.catsStdInstancesForFuture

  override def create(person: Person): Future[Person] =
    insert.value(_.id, person.id)
      .value(_.firstName, person.firstName)
      .value(_.lastName, person.lastName)
      .value(_.studentId, person.studentId)
      .value(_.gender, person.gender)
      .consistencyLevel_=(ConsistencyLevel.LOCAL_QUORUM)
      .future()
      .map(_ => person)

  // https://github.com/outworkers/phantom/wiki/Querying#query-api
  override def find(personId: UUID): Future[Option[Person]] =
    select.where(_.id eqs personId)
      .consistencyLevel_=(ConsistencyLevel.LOCAL_QUORUM)
      .one()

  override def update(person: Person): Future[Person] = create(person)
  .....
Step 4: Now inject the repository class into your controller class and access the DB:
@Singleton
class PersonController @Inject()(personRepo: PersonRepository[Future])(implicit ec: ExecutionContext) extends Controller {

  def create: Action[JsValue] = Action.async(parse.json) { request =>
    onValidationSuccess[CreatePerson](request.body) { createPerson =>
      val person = Person(UUID.nameUUIDFromBytes(createPerson.studentId.getBytes()), createPerson.firstName,
        createPerson.lastName, createPerson.studentId, createPerson.gender.toModel)
      personRepo.find(person.id).flatMap {
        case None => personRepo.create(person).map(createdPerson => Created(createdPerson.toJson))
        case Some(existing) => Future.successful(Conflict(existing.toJson))
      }.recover { case _ => ServiceUnavailable }
    }
  }
  .....
Hope this helps. All code credit goes to calvinlfer.

Does Spring-data-Cassandra 1.3.2.RELEASE support UDT annotations?

Is @UDT (http://docs.datastax.com/en/developer/java-driver/2.1/java-driver/reference/mappingUdts.html) supported by Spring Data Cassandra 1.3.2.RELEASE? If not, how can I work around this?
Thanks
See the details here:
https://jira.spring.io/browse/DATACASS-172
I faced the same issue, and it sounds like it does not.
Debugging shows me that Spring Data Cassandra checks only for the @Table, @Persistent or @PrimaryKeyClass annotation and raises an exception otherwise:
>
Invocation of init method failed; nested exception is org.springframework.data.cassandra.mapping.VerifierMappingExceptions:
Cassandra entities must have the @Table, @Persistent or @PrimaryKeyClass Annotation
But I found the solution.
I figured out an approach that allows me to manage both entities that include UDTs and those that don't. In my application I use the Spring Data Cassandra project together with the DataStax core driver directly. Repositories whose objects don't contain UDTs use the Spring Data Cassandra approach, and objects that include UDTs use custom repositories.
The custom repositories use the DataStax mapper and work correctly with UDTs
(they live in a separate package; see the notes below on why that's needed):
package com.fyb.cassandra.custom.repositories.impl;

import java.util.List;
import java.util.UUID;
import javax.annotation.PostConstruct;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.cassandra.config.CassandraSessionFactoryBean;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.mapping.Mapper;
import com.datastax.driver.mapping.MappingManager;
import com.datastax.driver.mapping.Result;
import com.google.common.collect.Lists;
import com.fyb.cassandra.custom.repositories.AccountDeviceRepository;
import com.fyb.cassandra.dto.AccountDevice;

public class AccountDeviceRepositoryImpl implements AccountDeviceRepository {

    @Autowired
    public CassandraSessionFactoryBean session;

    private Mapper<AccountDevice> mapper;

    @PostConstruct
    void initialize() {
        mapper = new MappingManager(session.getObject()).mapper(AccountDevice.class);
    }

    @Override
    public List<AccountDevice> findAll() {
        return fetchByQuery("SELECT * FROM account_devices");
    }

    @Override
    public void save(AccountDevice accountDevice) {
        mapper.save(accountDevice);
    }

    @Override
    public void deleteByConditions(UUID accountId, UUID systemId, UUID deviceId) {
        final String query = "DELETE FROM account_devices where account_id =" + accountId + " AND system_id=" + systemId
                + " AND device_id=" + deviceId;
        session.getObject().execute(query);
    }

    @Override
    public List<AccountDevice> findByAccountId(UUID accountId) {
        final String query = "SELECT * FROM account_devices where account_id=" + accountId;
        return fetchByQuery(query);
    }

    /*
     * Take any valid CQL query and try to map the result set to a list of the appropriate <T> type.
     */
    private List<AccountDevice> fetchByQuery(String query) {
        ResultSet results = session.getObject().execute(query);
        Result<AccountDevice> accountsDevices = mapper.map(results);
        List<AccountDevice> result = Lists.newArrayList();
        for (AccountDevice accountsDevice : accountsDevices) {
            result.add(accountsDevice);
        }
        return result;
    }
}
And the Spring Data repositories responsible for managing entities that don't include UDT objects look as follows:
package com.fyb.cassandra.repositories;

import org.springframework.data.cassandra.repository.CassandraRepository;
import com.fyb.cassandra.dto.AccountUser;
import org.springframework.data.cassandra.repository.Query;
import org.springframework.stereotype.Repository;
import java.util.List;
import java.util.UUID;

@Repository
public interface AccountUserRepository extends CassandraRepository<AccountUser> {
    @Query("SELECT * FROM account_users WHERE account_id=?0")
    List<AccountUser> findByAccountId(UUID accountId);
}
I've tested this solution and it works 100%.
In addition I've attached my POJO objects.
POJO that uses only DataStax annotations:
package com.fyb.cassandra.dto;

import java.util.List;
import java.util.Map;
import java.util.UUID;
import com.datastax.driver.mapping.annotations.ClusteringColumn;
import com.datastax.driver.mapping.annotations.Column;
import com.datastax.driver.mapping.annotations.Frozen;
import com.datastax.driver.mapping.annotations.FrozenValue;
import com.datastax.driver.mapping.annotations.PartitionKey;
import com.datastax.driver.mapping.annotations.Table;

@Table(name = "account_systems")
public class AccountSystem {

    @PartitionKey
    @Column(name = "account_id")
    private java.util.UUID accountId;

    @ClusteringColumn
    @Column(name = "system_id")
    private java.util.UUID systemId;

    @Frozen
    private Location location;

    @FrozenValue
    @Column(name = "user_token")
    private List<UserToken> userToken;

    @Column(name = "product_type_id")
    private int productTypeId;

    @Column(name = "serial_number")
    private String serialNumber;
}
POJO without UDTs, using only the Spring Data Cassandra framework:
package com.fyb.cassandra.dto;

import java.util.Date;
import java.util.UUID;
import org.springframework.cassandra.core.PrimaryKeyType;
import org.springframework.data.cassandra.mapping.Column;
import org.springframework.data.cassandra.mapping.PrimaryKeyColumn;
import org.springframework.data.cassandra.mapping.Table;

@Table(value = "accounts")
public class Account {

    @PrimaryKeyColumn(name = "account_id", ordinal = 0, type = PrimaryKeyType.PARTITIONED)
    private java.util.UUID accountId;

    @Column(value = "account_name")
    private String accountName;

    @Column(value = "currency")
    private String currency;
}
Note that the two entities use different annotations:
@PrimaryKeyColumn(name = "account_id", ordinal = 0, type = PrimaryKeyType.PARTITIONED) and @PartitionKey
@ClusteringColumn and @PrimaryKeyColumn(name = "area_parent_id", ordinal = 2, type = PrimaryKeyType.CLUSTERED)
At first glance this is awkward, but it allows you to work with objects that include UDTs and objects that don't.
One important note: the two kinds of repositories (those that use UDTs and those that don't) should reside in different packages, because the Spring config looks for base packages containing repositories:
@Configuration
@EnableCassandraRepositories(basePackages = { "com.fyb.cassandra.repositories" })
public class CassandraConfig {
    ..........
}
User-defined types are now supported by Spring Data Cassandra. The latest release, 1.5.0.RELEASE, uses the Cassandra DataStax driver 3.1.3, and hence it works now. Follow the steps below to make it work.
How to use the UserDefinedType (UDT) feature with Spring Data Cassandra:
We need to use the latest jar of Spring Data Cassandra (1.5.0.RELEASE):
group: 'org.springframework.data', name: 'spring-data-cassandra', version: '1.5.0.RELEASE'
Make sure it pulls in the following jar versions:
datastax.cassandra.driver.version=3.1.3
spring.data.cassandra.version=1.5.0.RELEASE
spring.data.commons.version=1.13.0.RELEASE
spring.cql.version=1.5.0.RELEASE
Create the user-defined type in Cassandra. The type name should be the same as the one used in the POJO class.
Address data type:
CREATE TYPE address_type (
  id text,
  address_type text,
  first_name text,
  phone text
);
Create a column family in Cassandra with one of the columns as the UDT:
Employee table:
CREATE TABLE employee(
  employee_id uuid,
  employee_name text,
  address frozen<address_type>,
  primary key (employee_id, employee_name)
);
In the domain class, define the field with the @CassandraType annotation and a DataType of UDT:
#Table("employee") public class Employee {
-- othere fields--
#CassandraType(type = DataType.Name.UDT, userTypeName = "address_type")
private Address address;
}
Create a domain class for the user-defined type. We need to make sure that the column names in the user-defined type schema are the same as the field names in the domain class.
#UserDefinedType("address_type") public class Address { #CassandraType(type = DataType.Name.TEXT)
private String id; #CassandraType(type = DataType.Name.TEXT) private String address_type; }
In the Cassandra config, change this:
@Bean
public CassandraMappingContext mappingContext() throws Exception {
    BasicCassandraMappingContext mappingContext = new BasicCassandraMappingContext();
    mappingContext.setUserTypeResolver(new SimpleUserTypeResolver(cluster().getObject(), cassandraKeyspace));
    return mappingContext;
}
The user-defined type should have the same name everywhere, e.g.:
@UserDefinedType("address_type")
@CassandraType(type = DataType.Name.UDT, userTypeName = "address_type")
CREATE TYPE address_type
