Serialization problems when using Spark to read from HBase - apache-spark

I want to implement a class with a method that reads from HBase through Spark, like this:
public abstract class QueryNode implements Serializable {
    private static final long serialVersionUID = -2961214832101500548L;
    private int id;
    private int parent;
    protected static Configuration hbaseConf;
    protected static Scan scan;
    protected static JavaSparkContext sc;
    public abstract RDDResult query();
    public int getParent() {
        return parent;
    }
    public void setParent(int parent) {
        this.parent = parent;
    }
    public int getId() {
        return id;
    }
    public void setId(int id) {
        this.id = id;
    }
    public void setScanToConf() {
        try {
            ClientProtos.Scan proto = ProtobufUtil.toScan(scan);
            String scanToString = Base64.encodeBytes(proto.toByteArray());
            hbaseConf.set(TableInputFormat.SCAN, scanToString);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
This is a parent class. I have some subclasses that implement the method query() to read from HBase, but if Configuration, Scan and JavaSparkContext are not static, I get errors saying that these classes are not serializable.
Why must these fields be static? Is there another way to solve this problem? Thanks.

You can try marking these fields as transient to avoid a serialization exception like
Caused by: java.io.NotSerializableException:
org.apache.spark.streaming.api.java.JavaStreamingContext
This tells Java that you do not want to serialize these fields:
protected transient Configuration hbaseConf;
protected transient Scan scan;
protected transient JavaSparkContext sc;
Are you initializing JavaSparkContext, Configuration and Scan in main or in any static method? With static, your fields are shared across all instances. Whether static is appropriate depends on your use case.
The transient approach is better than static here, because serializing JavaSparkContext does not make sense: it is created on the driver.
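The effect of transient can be checked with the JDK alone. The following is a minimal sketch, not the asker's code: DriverOnlyContext and Node are hypothetical stand-ins for JavaSparkContext and QueryNode, and the in-memory round trip only imitates what happens when an object is shipped to another JVM.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a driver-only object such as JavaSparkContext:
// it deliberately does NOT implement Serializable.
class DriverOnlyContext {
}

// Hypothetical stand-in for QueryNode with one transient field.
class Node implements Serializable {
    private static final long serialVersionUID = 1L;
    int id;
    transient DriverOnlyContext context; // skipped by Java serialization

    Node(int id, DriverOnlyContext context) {
        this.id = id;
        this.context = context;
    }
}

class TransientDemo {
    // Serializes and deserializes the node in memory.
    static Node roundTrip(Node node) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(node); // would throw NotSerializableException without transient
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Node) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Node copy = roundTrip(new Node(7, new DriverOnlyContext()));
        System.out.println(copy.id);              // 7: the plain field survives
        System.out.println(copy.context == null); // true: the transient field is not restored
    }
}
```

Note that a transient field is null after deserialization, so code that runs on the receiving side must not rely on it.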
-- edit after discussion in comment:
Javadoc for newAPIHadoopRDD:
public <K,V,F extends org.apache.hadoop.mapreduce.InputFormat<K,V>> JavaPairRDD<K,V> newAPIHadoopRDD(org.apache.hadoop.conf.Configuration conf,
Class<F> fClass,
Class<K> kClass,
Class<V> vClass)
conf - Configuration for setting up the dataset. Note: This will
be put into a Broadcast. Therefore if you plan to reuse this conf
to create multiple RDDs, you need to make sure you won't modify the
conf. A safe approach is always creating a new conf for a new
RDD.
Broadcast:
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
So basically I think static is OK for that case (you create hbaseConf only once), but if you want to avoid static, you can follow the suggestion in the Javadoc and always create a new conf for a new RDD.

Related

Repository always null after initialization of Testcontainers

I am attempting to use Testcontainers. I was able to get it to run, but my repository is always null in my tests. I am trying to avoid mocking and use real data instead.
Repository
@Sql("classpath:data.sql")
class OrderDataRepositoryTest extends AbstractTestConfiguration {
    //@Mock
    @MockBean
    //@Autowired
    private OrderDataRepository orderRepository;
    private AutoCloseable closeable;
    @BeforeEach
    public void init() {
        closeable = MockitoAnnotations.openMocks(this);
    }
    @AfterEach
    void closeService() throws Exception {
        closeable.close();
    }
    @Test
    void getAllUsersTest() {
        List<Order> orders = orderRepository.findAll();
        orders.toString();
    }
}
config
@AutoConfigureTestDatabase(replace = AutoConfigureTestDatabase.Replace.NONE)
@Testcontainers
public abstract class AbstractTestConfiguration {
    @Container
    private MySQLContainer database = new MySQLContainer("mysql:8.0");
    @Test
    public void test() {
        assertTrue(database.isRunning());
    }
}
main
@SpringBootTest
@Sql("classpath:init.sql")
@TestPropertySource("classpath:application-test.yml")
class TentingContainerApplicationTests {
}
application.properties
spring:
application:
datasource:
url: jdbc:mysql:8.0:///test?TC_INITSCRIPT=file:src/main/resources/init.sql
driver-class-name: com.mysql.jdbc.Driver
The commented out
//@Mock
@MockBean
//@Autowired
is what I tried. Of course mocking works, but I want real data for the @Service and @Repository classes.
Any advice?
If you want to test your database-related code in isolation (I assume you're using Spring Data JPA) then @DataJpaTest fits perfectly.
This annotation will create a sliced Spring context for you that contains only persistence-relevant beans like DataSource, EntityManager, and YourRepository. This doesn't include your service classes, your @Component classes, or @RestController.
By default, this annotation tries to configure an embedded in-memory database as the DataSource. We can override this behavior (as you already did in some of your code examples) to use Testcontainers:
@DataJpaTest
@Testcontainers
@AutoConfigureTestDatabase(replace = AutoConfigureTestDatabase.Replace.NONE)
class OrderDataRepositoryTest {
    @Container
    static MySQLContainer database = new MySQLContainer("mysql:8.0");
    @DynamicPropertySource
    static void setDatasourceProperties(DynamicPropertyRegistry propertyRegistry) {
        propertyRegistry.add("spring.datasource.url", database::getJdbcUrl);
        propertyRegistry.add("spring.datasource.password", database::getPassword);
        propertyRegistry.add("spring.datasource.username", database::getUsername);
    }
    @Autowired
    private OrderDataRepository orderRepository;
    @Test
    void shouldReturnOrders() {
    }
}
If you want to write another test that includes all your beans and also starts the embedded servlet container, take a look at @SpringBootTest for writing integration tests.
@SpringBootTest(webEnvironment = WebEnvironment.RANDOM_PORT)
@Testcontainers
class MyIntegrationTest {
    @Container
    static MySQLContainer database = new MySQLContainer("mysql:8.0");
    @DynamicPropertySource
    static void setDatasourceProperties(DynamicPropertyRegistry propertyRegistry) {
        propertyRegistry.add("spring.datasource.url", database::getJdbcUrl);
        propertyRegistry.add("spring.datasource.password", database::getPassword);
        propertyRegistry.add("spring.datasource.username", database::getUsername);
    }
    @Autowired
    private ServiceA serviceA;
    @Autowired
    private OrderDataRepository orderDataRepository;
}
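When working with a Spring TestContext for your test and Mockito, make sure to understand the difference between @Mock and @MockBean. @Mock creates a plain Mockito mock that Spring knows nothing about, while @MockBean replaces the bean in the Spring context with a mock, so a repository annotated with @MockBean never touches the real database.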

Mockito: How to test a class's void method?

Unit test noob here.
I have three classes: Db1Dao, Db2Dao, ExecuteClass where Db1Dao, Db2Dao are database access objects for two different databases. My goal is to fetch some data from db1 using Db1Dao and run executeClass.execute() to "put" the processed data into db2 using Db2Dao.
My ExecuteClass looks like this:
class ExecuteClass {
    private Db1Dao db1Dao;
    private Db2Dao db2Dao;
    public void execute() {
        ...
        List<String> listOfString = getExternalData(someParam);
        List<Metadata> metadatum = db1Dao.get(someInputs);
        ... I do something to generate a list of new class `A` based on listOfString & metadatum ...
        try {
            db2Dao.put(listOfA);
        } catch (PutException e) {
            ...
        }
    }
    public List<String> getExternalData(SomeClass someParam) {
        ... do something
        return listOfString;
    }
}
Now I want to test:
Given a specific listOfString (returned by getExternalData) and a specific metadatum (returned by db1Dao.get):
Will I get the desired listOfA?
Am I able to call db2Dao.put and its input parameter is listOfA?
Particularly, I have hard-coded sample listOfString and metadatum and desired listOfA (and they will be passed via an object MockData, see the following code) but I don't know how to write the test using Mockito. The following is a test class I wrote but it does not work:
class TestClass extends BaseTest {
    @Mock
    private Db1Dao db1Dao;
    @Mock
    private Db2Dao db2Dao;
    private ExecuteClass executeClass;
    @BeforeEach
    public void setUp() {
        MockitoAnnotations.initMocks(this);
        executeClass = new ExecuteClass(db1Dao, db2Dao);
    }
    @ParameterizedTest
    @MethodSource("MockDataProvider")
    public void executeClassTest(final MockData mockData) throws PutException {
        Mockito.when(db1Dao.get(Mockito.any(), ...))
            .thenReturn(mockData.getMetadatum());
        ExecuteClass executeClassSpy = Mockito.spy(executeClass);
        Mockito.when(executeClassSpy.getExternalData(Mockito.any()))
            .thenReturn(mockData.getListOfString());
        executeClassSpy.execute();
        // executeClass.execute(); does not work either...
        List<A> listOfA = mockData.getDesiredListOfA();
        Mockito.verify(db2Dao).put(listOfA);
    }
}
Could anyone please let me know? Thank you in advance!!
You should not create a spy of the same class you want to test. Instead, try to write a unit test for the smallest amount of code (e.g. a public method) and mock every external collaborator (in your case Db1Dao and Db2Dao).
If testing a public method involves calling another public method of the same class, make sure to mock everything inside the other public method (in your case getExternalData). Otherwise, this other public method might be a good candidate for an extra class to have clear separation of concerns.
So, remove the ExecuteClass executeClassSpy = Mockito.spy(executeClass); and make sure you set up with Mockito everything that is called within getExternalData.
To actually verify that Db2Dao was called with the correct parameter, you can keep your current approach of verifying the payload. In that case it is important to build exactly the same data structure your application code produces, because verify compares arguments with equals().
Another solution would be to use Mockito's @Captor. This allows you to capture the value while verifying the invocation of a mock. Later on, you can also write assertions on the captured value:
@Captor
private ArgumentCaptor<ClassOfListOfA> argumentCaptor;
@Test
public void yourTest() {
    Mockito.verify(db2Dao).put(argumentCaptor.capture());
    assertEquals("StringValue", argumentCaptor.getValue().getWhateverGetterYouHave());
}
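To make concrete what capturing buys you, here is a minimal sketch without Mockito: a hand-written fake that records the argument passed to put so the test can assert on it afterwards. This is not Mockito's ArgumentCaptor, only an illustration of the same idea. The Db2Dao interface shape, RecordingDb2Dao, Processor, and the upper-casing logic are all hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the real DAO interface.
interface Db2Dao {
    void put(List<String> items);
}

// Hand-written fake that remembers the last argument it received.
class RecordingDb2Dao implements Db2Dao {
    List<String> captured;

    @Override
    public void put(List<String> items) {
        captured = new ArrayList<>(items); // keep a copy of what was passed
    }
}

// Hypothetical class under test: transforms its input and hands it to the DAO.
class Processor {
    private final Db2Dao db2Dao;

    Processor(Db2Dao db2Dao) {
        this.db2Dao = db2Dao;
    }

    void execute(List<String> input) {
        List<String> processed = new ArrayList<>();
        for (String s : input) {
            processed.add(s.toUpperCase());
        }
        db2Dao.put(processed);
    }
}

class CaptureDemo {
    public static void main(String[] args) {
        RecordingDb2Dao dao = new RecordingDb2Dao();
        new Processor(dao).execute(Arrays.asList("a", "b"));
        // Assert on the value the class under test actually passed to put.
        System.out.println(dao.captured.equals(Arrays.asList("A", "B")));
    }
}
```

With Mockito the recording is done for you by the captor, but the test has the same shape: run the code, then inspect what reached the collaborator.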
The following code worked for me.
I partially accepted @rieckpil's answer. I used @Captor, which is very handy.
The reason I had to mock getExternalData() is because its implementation is still a "TODO".
class TestClass extends BaseTest {
    @Mock
    private Db1Dao db1Dao;
    @Mock
    private Db2Dao db2Dao;
    @Captor
    private ArgumentCaptor<List<A>> argumentCaptor;
    private ExecuteClass executeClass;
    @BeforeEach
    public void setUp() {
        MockitoAnnotations.initMocks(this);
        executeClass = new ExecuteClass(db1Dao, db2Dao);
    }
    @ParameterizedTest
    @MethodSource("MockDataProvider")
    public void executeClassTest(final MockData mockData) throws PutException {
        Mockito.when(db1Dao.get(Mockito.any(), ...))
            .thenReturn(mockData.getMetadatum());
        ExecuteClass executeClassSpy = Mockito.spy(executeClass);
        Mockito.when(executeClassSpy.getExternalData(Mockito.any()))
            .thenReturn(mockData.getListOfString());
        executeClassSpy.execute();
        List<A> listOfA = mockData.getDesiredListOfA();
        Mockito.verify(db2Dao).put(argumentCaptor.capture());
        assertEquals(listOfA, argumentCaptor.getValue());
    }
}

How do you Unit Test a ForeachWriter implementation?

I've been trying to set up some unit tests to verify the logic in a custom ForeachWriter implementation, but I am running into a bit of mocking / duplication trouble.
I'd like to Mock an injected dependency in the ForeachWriter, but my mocks seem to be duplicated during execution. Originally I thought the mocked dependencies weren't getting called, but during debug inspection I've found that multiple versions of them seem to exist (based on hashCode).
Here's some quick sample code of what I've been trying to do:
// Class I'd like to test
public class TestForeachSink extends ForeachWriter<String> {
    @Inject
    SomeDependency dep;
    public TestForeachSink(SomeDependency dep) {
        this.dep = dep;
    }
    @Override
    public boolean open(long partitionId, long version) {
        dep.doSomethingStartupRelatedOrThrow();
        return true;
    }
    @Override
    public void process(String value) {
        dep.processSomething(value);
    }
    @Override
    public void close(Throwable errorOrNull) {
        dep.closeConnections();
    }
}
// Testing class
public class TestForeachSinkTests {
    @Mock SomeDependency _dep;
    TestForeachSink target;
    @BeforeEach
    public void init() {
        _dep = mock(SomeDependency.class, withSettings().serializable());
        target = new TestForeachSink(_dep);
    }
    @Test
    public void shouldVerifyDependencyInteractions() {
        // set up stream, add data to it
        stream.toDS().writeStream().foreach(target).start().processAllAvailable();
        // VERIFY INTERACTIONS WITH MOCK HERE
    }
}
The added data runs through the stream as expected but it seems like the mock I've passed in of SomeDependency is replaced during execution with a copy. I think that makes sense if the execution is running as though it were performing on a separate worker, but I'd still like to be able to test the ForeachWriter.
Is anyone else testing this part of the code? I haven't come across any other tests for ForeachSink custom implementations but direction on moving forward would be very appreciated!
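The copy behavior described above can be reproduced with plain Java serialization. This is a minimal sketch under an assumption: that Spark serializes the writer before running it in a task, which is my understanding but is not confirmed in this thread. CountingDependency and Sink are hypothetical stand-ins for SomeDependency and TestForeachSink.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical dependency that counts how often it was called.
class CountingDependency implements Serializable {
    private static final long serialVersionUID = 1L;
    int calls;

    void processSomething(String value) {
        calls++;
    }
}

// Hypothetical writer holding the dependency, like TestForeachSink.
class Sink implements Serializable {
    private static final long serialVersionUID = 1L;
    final CountingDependency dep;

    Sink(CountingDependency dep) {
        this.dep = dep;
    }

    void process(String value) {
        dep.processSomething(value);
    }
}

class CopyDemo {
    // Imitates shipping the writer to a task: serialize, then deserialize.
    static Sink roundTrip(Sink sink) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(sink);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Sink) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        CountingDependency original = new CountingDependency();
        Sink shipped = roundTrip(new Sink(original));
        shipped.process("x");
        System.out.println(shipped.dep == original); // false: the dependency was copied too
        System.out.println(original.calls);          // 0: the test's instance saw nothing
        System.out.println(shipped.dep.calls);       // 1: only the copy was called
    }
}
```

If that assumption holds, verifying the original mock cannot work. Testing open, process, and close by calling them directly on the writer, without a stream, sidesteps the copy.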

How to mock the DataStax Row object (com.datastax.driver.core.Row) - Unit Test

Below is the code for the entity object, the accessor, and the DAO.
@Table(name = "Employee")
public class Employee {
    @PartitionKey
    @Column(name = "empname")
    private String empname;
    @ClusteringColumn(0)
    @Column(name = "country")
    private String country;
    @Column(name = "status")
    private String status;
}
Accessor:
@Accessor
public interface EmployeeAccessor {
    @Query(value = "SELECT DISTINCT empname FROM EMPLOYEE ")
    ResultSet getAllEmployeeName();
}
The DAO method getAllEmployeeNames returns a List of employee names, sorted in ascending order.
DAO
public class EmployeeDAOImpl implements EmployeeDAO {
    private EmployeeAccessor employeeAccessor;
    @PostConstruct
    public void init() {
        employeeAccessor = datastaxCassandraTemplate.getAccessor(EmployeeAccessor.class);
    }
    @Override
    public List<String> getAllEmployeeNames() {
        List<Row> names = employeeAccessor.getAllEmployeeName().all();
        List<String> empnames = names.stream()
            .map(name -> name.getString("empname")).collect(Collectors.toList());
        empnames.sort(naturalOrder()); //sorted
        return empnames;
    }
}
JUnit Test(mockito):
I am not able to mock the List<Row> (DataStax Row). How can I mock it so that it returns a list of rows with the values "foo" and "bar"? Please help me unit test this.
@Category(UnitTest.class)
@RunWith(MockitoJUnitRunner.class)
public class EmployeeDAOImplUnitTest {
    @Mock
    private ResultSet resultSet;
    @Mock
    private EmployeeAccessor empAccessor;
    //here is the problem....how to mock the List<Row> Object --> com.datastax.driver.core.Row (interface)
    //this code will result in compilation error as we are mapping a List<Row> to the ArrayList<String>
    //how to mock the List<Row> with a list of String row object
    private List<Row> unSortedTemplateNames = new ArrayList() {
        {
            add("foo");
            add("bar");
        }
    };
    //this is a test case to check if the results are sorted or not
    //mock the accessor and send rows as "foo" & "bar"
    //after calling the dao , the first element must be "bar" and not "foo"
    @Test
    public void shouldReturnSorted_getAllTemplateNames() {
        when(empAccessor.getAllEmployeeName()).thenReturn(resultSet);
        when(resultSet.all()).thenReturn(unSortedTemplateNames); //how to mock the List<Row> object ???
        //i am testing if the results are sorted, first element should not be foo
        assertThat(countryTemplates.get(0), is("bar"));
    }
}
Wow! This is overly complex, hard to follow, and not an ideal way to write unit tests.
Using PowerMock(ito) along with "static" references in your own code is not recommended and is a sure sign of a code smell.
First, I am not sure why you decided to use a static reference (e.g. EmployeeAccessor.getAllEmployeeName().all(); inside the EmployeeDAOImpl class, getAllEmployeeNames() method) instead of using the instance variable (i.e. empAccessor), which is more conducive to actual "unit testing"?
The EmployeeAccessor, getAllEmployeeName() "interface" method is not static (clearly). However, seemingly, whatever this (datastaxCassandraTemplate.getAccessor(EmployeeAccessor.class);) generates makes it so (really?), which then requires the use of PowerMock(ito), o.O
Frameworks like PowerMock, and extensions of it (i.e. "PowerMockito"), were meant to test and mock code used by your application (unfortunately, but necessarily so) where this "other" code makes use of statics, Singletons, private methods and so on. This anti-pattern really ought not to be followed in your own application design.
Second, it is not really apparent what the "Subject Under Test" (SUT) is in your test case. You implemented a test class (i.e. EmployeeDAOImplTest) for, supposedly, your EmployeeDAOImpl class (the actual "SUT"), but inside your test case (i.e. shouldReturnSorted_getAllTemplateNames()), you are calling... countryLocalizationDAOImpl.getAllTemplateNames(); thus testing the CountryLocalizationDAOImpl class (??), which is not the "SUT" of the EmployeeDAOImplTest class.
Additionally, it is not apparent that the EmployeeDAOImpl even uses a CountryLocalizationDAO instance (assuming an interface here as well), and if it does, then it is certainly something that should be "mocked" when the EmployeeDAOImpl "interacts" with instances of CountryLocalizationDAO, particularly in the context of a unit test. The only correlation between the EmployeeDAO and CountryLocalizationDAO is that the Employee has a country field.
There are a few other problems with your design/setup as well, but anyway.
Here are a few suggestions...
First, let's test what your EmployeeDAOImplTest is meant to test: that EmployeeDAO.getAllEmployeeNames() returns the names in sorted order. This in turn may give you ideas for how to test your CountryLocalizationDAO.getAllTemplateNames() method, if that even makes sense (i.e. if getAllTemplateNames() is in fact dependent on an Employee's country, when Employees are ordered by name, i.e. "empname", and accessed via EmployeeAccessor).
public class EmployeeDAOImpl implements EmployeeDAO {
    private final EmployeeAccessor employeeAccessor;
    // where does the DataStaxCassandraTemplate reference come from?!
    private DataStaxCassandraTemplate datastaxCassandraTemplate = ...;
    public EmployeeDAOImpl() {
        this(datastaxCassandraTemplate.getAccessor(EmployeeAccessor.class));
    }
    public EmployeeDAOImpl(EmployeeAccessor employeeAccessor) {
        this.employeeAccessor = employeeAccessor;
    }
    protected EmployeeAccessor getEmployeeAccessor() {
        return this.employeeAccessor;
    }
    public List<String> getAllEmployeeNames() {
        List<Row> nameRows = getEmployeeAccessor().getAllEmployeeName().all();
        ...
    }
}
Then in your test class...
// The runner initializes the @Mock fields (as in the question's test class)
@RunWith(MockitoJUnitRunner.class)
public class EmployeeDAOImplUnitTest {
    @Mock
    private EmployeeAccessor mockEmployeeAccessor;
    // SUT
    private EmployeeDAO employeeDao;
    @Before
    public void setup() {
        employeeDao = new EmployeeDAOImpl(mockEmployeeAccessor);
    }
    protected ResultSet mockResultSet(Row... rows) {
        ResultSet mockResultSet = mock(ResultSet.class);
        when(mockResultSet.all()).thenReturn(Arrays.asList(rows));
        return mockResultSet;
    }
    protected Row mockRow(String employeeName) {
        Row mockRow = mock(Row.class, employeeName);
        when(mockRow.getString(eq("empname"))).thenReturn(employeeName);
        return mockRow;
    }
    @Test
    public void getAllEmployeeNamesReturnsSortListOfNames() {
        when(mockEmployeeAccessor.getAllEmployeeName())
            .thenReturn(mockResultSet(mockRow("jonDoe"), mockRow("janeDoe")));
        // containsExactly (AssertJ) also checks the order, which is the point of this test
        assertThat(employeeDao.getAllEmployeeNames())
            .containsExactly("janeDoe", "jonDoe");
        verify(mockEmployeeAccessor, times(1)).getAllEmployeeName();
    }
}
Now, you can apply similar techniques if in fact there is an actual correlation between Employees and CountryLocalizationDAO via the EmployeeAccessor.
Hope this helps get you on a better track!
-j

Pass parameters from driver to executors in spark

I am using Spark 2.0.0.
Is there a way to pass parameters from the Spark driver to the executors? I tried the following.
class SparkDriver {
    public static void main(String argv[]) {
        SparkConf conf = new SparkConf().setAppName("test").setMaster("yarn");
        SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate();
        Dataset<Row> input = sparkSession.read().load("inputfilepath");
        Dataset<Row> modifiedinput = input.mapPartitions(new customMapPartition(5), Encoders.bean(Row.class));
    }
}
class customMapPartition implements MapPartitionsFunction<Row, Row> {
    private static final long serialVersionUID = -6513655566985939627L;
    private static Integer variableThatHastobePassed = null;
    public customMapPartition(Integer passedInteger) {
        customMapPartition.variableThatHastobePassed = passedInteger;
    }
    @Override
    public Iterator<Row> call(Iterator<Row> input) throws Exception {
        System.out.println("number that is passed " + variableThatHastobePassed);
        return input; // pass the rows through unchanged
    }
}
As shown above, I wrote a custom MapPartitionsFunction to pass the parameter, and I access the static variable in its call method. This worked when I ran it locally with setMaster("local"), but it did not work on a cluster with setMaster("yarn"): the System.out.println statements printed null.
Is there a way to pass parameters from the driver to the executors?
My bad, I was using
private static Integer variableThatHastobePassed = null;
The variable should not be declared static.
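The reason is that static fields belong to the class in each JVM and are not part of an object's serialized state, while instance fields travel with the object. In local mode the driver and the tasks share one JVM, so the static value set in the constructor was still visible. A minimal JDK-only sketch of this, where PartitionFunction is a hypothetical stand-in for customMapPartition and resetting the static field imitates an executor JVM in which the constructor never ran:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for customMapPartition.
class PartitionFunction implements Serializable {
    private static final long serialVersionUID = 1L;
    static Integer staticValue = null; // per-JVM state, never serialized
    final Integer instanceValue;       // serialized together with the object

    PartitionFunction(Integer value) {
        staticValue = value;
        instanceValue = value;
    }
}

class StaticFieldDemo {
    static PartitionFunction shipToFreshJvm(PartitionFunction function) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(function);
            }
            // Imitate the executor JVM: the class is loaded there, but the
            // constructor that assigned the static field was never called.
            PartitionFunction.staticValue = null;
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (PartitionFunction) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PartitionFunction received = shipToFreshJvm(new PartitionFunction(5));
        System.out.println(received.instanceValue);        // 5
        System.out.println(PartitionFunction.staticValue); // null
    }
}
```

Deserialization does not run the class's constructor, which is why the static field stays null on the receiving side.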
