Spark Java Map processing entire data set in all executors - apache-spark

I have 2 executors with 2 cores each (4 cores, hence 4 partitions in total). The following function executes 22824 times for a data set containing only 5706 records; I see the log message repeated 22824 (5706 x 4) times.
Here is my code:
Dataset<Row> dbDataSet = dbSvc.getRecords(fkId);
RDD<String> rdd = dbDataSet.toJavaRDD()
        .map(row -> transformSvc.transformBlobToString(row)).rdd();

public class TransformService {
    public static String transformBlobToString(Row row) {
        logger.info("*** transformBlobToString() Invoked on ExecutorId " + SparkEnv.get().executorId());
        // ... rest of the conversion ...
    }
}
In the logs I see this repeated 11412 times (5706 x 2 cores):
23/01/06 18:01:11 INFO TransformService: *** transformBlobToString() Invoked on ExecutorId 1
23/01/06 18:01:11 INFO TransformService: *** transformBlobToString() Invoked on ExecutorId 1
23/01/06 18:01:11 INFO TransformService: *** transformBlobToString() Invoked on ExecutorId 1
and this repeated 11412 times (5706 x 2 cores):
23/01/06 18:01:11 INFO TransformService: *** transformBlobToString() Invoked on ExecutorId 2
23/01/06 18:01:11 INFO TransformService: *** transformBlobToString() Invoked on ExecutorId 2
23/01/06 18:01:11 INFO TransformService: *** transformBlobToString() Invoked on ExecutorId 2
Why is this happening? What I want is to split my total of 5706 records into 4 partitions and process roughly 1426 in each partition. The function I use to fetch from the DB (dbService.getRecords()) already splits the data into 4 equal parts through an incremental column, so I expected these 4 parts to be processed in parallel on the 4 cores, each one handling about 1426 records per partition. Instead it is doing the exact opposite of what I expected: processing the entire 5706 records in each partition. Please help me figure out how to fix this so that each partition only processes a quarter of the records.
Full code:
void mainFunction() {
    Dataset<Row> topLevelDataSet = dbSvc.getRecords(id);
    JavaRDD<String> stringifiedDataSet = processRecords(topLevelDataSet);
    Dataset<Row> xmlDataSet = sparkSession.read().json(stringifiedDataSet);
    Dataset<Row> csvDataSet = filterFormatDataFrame(xmlDataSet);
    int requiredPartitions = calculatePartitions(csvDataSet);
    if (requiredPartitions > dbSvc.getNumberOfDBPartitions()) {
        csvDataSet.repartition(requiredPartitions).write().format("csv").option("header", "true")
                .save(filepath);
    } else {
        csvDataSet.write().format("csv").option("header", "true")
                .save(filepath);
    }
}
public Dataset<Row> getRecords(Long id) throws SQLException {
    String table = "(select t1.*, ROWNUM as num_rows from (select * from myTableName where id = " + id + " ) t1) mytable";
    String totalRowCount = getTotalRowCount(id, otherFilters);
    Properties properties = new Properties();
    properties.setProperty("partitionColumn", "num_rows");
    properties.setProperty("lowerBound", "0");
    properties.setProperty("upperBound", totalRowCount);
    properties.setProperty("numPartitions", "4");
    properties.setProperty("fetchsize", "100");
    properties.setProperty("Driver", driver);
    properties.setProperty("user", user);
    properties.setProperty("password", password);
    Dataset<Row> records = SparkSession.getActiveSession().get().read().jdbc(jdbcUrl, table, properties);
    return records;
}
}
private JavaRDD<String> processRecords(Dataset<Row> topLevelDataSet) {
    JavaRDD<Row> rdd = topLevelDataSet.toJavaRDD();
    // have confirmed that each partition ONLY has 1426 records
    JavaRDD<String> transformed = rdd.mapPartitions(xmlRows -> {
        logger.info("*** Inside Parallellize function mapPArtition");
        List<String> transformedX = new ArrayList<String>();
        int count = 0;
        while (xmlRows.hasNext()) {
            count++;
            transformedX.add(extractBlobToString(xmlRows.next()));
        }
        logger.info("*** processRecords outside loop total count of records :" + count); // <- only 1405 records
        return transformedX.iterator();
    });
    return transformed;
}
private int calculatePartitions(Dataset<Row> csvDataset) {
    LongAccumulator numberOfPartitionAccumulator = sparkSession.sparkContext()
            .longAccumulator("NumberOfPartitions");
    csvDataset.foreachPartition(rows -> {
        numberOfPartitionAccumulator.add(1);
        long dataSize = 0;
        while (rows.hasNext()) {
            dataSize += rows.next().json().length(); // approximate row size
            if (dataSize >= MAX_FILE_SIZE) {
                numberOfPartitionAccumulator.add(1);
                dataSize = 0;
            }
        }
    });
    return numberOfPartitionAccumulator.value().intValue();
}
An update to the original post:
Another piece of critical info I found: there are multiple stages in the DAG. Stage 0 executes as expected (each partition processing only a quarter of the record set), then the next "save" stage re-executes the loop from the very beginning, and then the count-operation stage re-executes the loop all over again. That explains why the same row was processed multiple times, but it does not explain why the stage 0 steps are re-executed in the later stages. What in my code above is causing it?
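For reference, a minimal sketch of persisting the intermediate RDD, assuming Spark's normal lineage recomputation is what re-runs those steps (the schema-inferring read().json(...), the foreachPartition inside calculatePartitions, and the final write are each actions, and each action re-evaluates the lineage unless it is persisted). Variable names are taken from the code above and the sketch is untested against the actual job:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

// Persist the transformed records once so that later actions reuse the cached
// partitions instead of re-running the JDBC read and the blob-to-string map.
JavaRDD<String> stringifiedDataSet = processRecords(topLevelDataSet)
        .persist(StorageLevel.MEMORY_AND_DISK());
Dataset<Row> xmlDataSet = sparkSession.read().json(stringifiedDataSet);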

Related

Using Java 8 features takes more time

A sample program is provided that counts the number of elements in a list that are less than a specified value.
The processing time taken by the program varies; the versions using Java 8 forEach and stream take more time to execute. Please explain whether I should go with Java 8 features and, if not, in which areas they should be avoided. Additionally, will parallelStream or multithreading on a multi-core processor help?
Code:
public static void main(String[] args) {
int lessThan = 4;
// Using for loop iteration
List<Integer> integerForSort1 = Arrays.asList(4, 1, 1, 2, 3);
long startTime1 = System.nanoTime();
long count1 = countNumbers(integerForSort1, lessThan);
long stopTime1 = System.nanoTime();
System.out.println(stopTime1 - startTime1);
System.out.println(count1);
integerForSort1 = null;
System.gc();
// Using binary search
List<Integer> integerForSort2 = Arrays.asList(4, 1, 1, 2, 3);
long startTime2 = System.nanoTime();
long count2 = countByBinarySearch(integerForSort2, lessThan);
long stopTime2 = System.nanoTime();
System.out.println(stopTime2 - startTime2);
System.out.println(count2);
integerForSort2 = null;
System.gc();
// Using Java 8
List<Integer> integerForSort3 = Arrays.asList(4, 1, 1, 2, 3);
long startTime3 = System.nanoTime();
long count3 = integerForSort3.stream()
.filter(p -> p < lessThan)
.count();
long stopTime3 = System.nanoTime();
System.out.println(stopTime3 - startTime3);
System.out.println(count3);
integerForSort3 = null;
System.gc();
//Using Java 8 for each loop
List<Integer> integerForSort4 = Arrays.asList(4, 1, 1, 2, 3);
long startTime4 = System.nanoTime();
long count4 = process(integerForSort4, p -> p < lessThan);
long stopTime4 = System.nanoTime();
System.out.println(stopTime4 - startTime4);
System.out.println(count4);
integerForSort4 = null;
}
public static long countNumbers(List<Integer> integerForSort, int lessThan) {
long count = 0;
Collections.sort(integerForSort);
for (Integer anIntegerForSort : integerForSort) {
if (anIntegerForSort < lessThan)
count++;
}
return count;
}
public static long countByBinarySearch(List<Integer> integerForSort, int lessThan){
if(integerForSort==null||integerForSort.isEmpty())
return 0;
int low = 0, mid = 0, high = integerForSort.size();
Collections.sort(integerForSort);
while(low != high){
mid = (low + high) / 2;
if (integerForSort.get(mid) < lessThan) {
low = mid + 1;
}
else {
high = mid;
}
}
return low;
}
public static long process(List<Integer> integerForSort, Predicate<Integer> predicate) {
final AtomicInteger i = new AtomicInteger(0);
integerForSort.forEach((Integer p) -> {
if (predicate.test(p)) {
i.getAndAdd(1);
}
});
return i.intValue();
}
Output:
345918
4
21509
4
29651234
4
2242999
4
Questions:
Is it possible to reduce the processing time using Java 8 features?
Why does the Java 8 stream take more time?
How can I use a lambda expression with binary search? Will it process faster?
Even using multithreading with java.util.concurrent's ExecutorService gave consistent results:
Result 4 : Thread 'pool-1-thread-1' ran process - Using Java 8 stream in 6 millisecond, from 11:16:05:361 to 11:16:05:367
Result 4 : Thread 'pool-1-thread-2' ran process - Using Java 8 forEach in 3 millisecond, from 11:16:05:361 to 11:16:05:364
Result 4 : Thread 'pool-1-thread-4' ran process - Using Java 7 binary Search in 0 millisecond, from 11:16:05:379 to 11:16:05:379
Result 4 : Thread 'pool-1-thread-3' ran process - Using Java 7 for loop in 1 millisecond, from 11:16:05:362 to 11:16:05:363
I do not know the answer, since I didn't perform any tests, but I think performance tests with 5 elements have no diagnostic value and are pointless. You should generate arrays of tens or hundreds of thousands of elements, or hundreds of millions, to see the performance difference.
Java 8 creates several objects, which causes some overhead compared with a simple for loop. So with only 5 test elements, your results depend mostly on how much work is needed to initialize the execution.
You know, multiplying 5 numbers on a CPU is faster than even copying them to GPU memory, so you have your result on the CPU before the GPU even starts to compute. But if your data grows and your GPU multiplies hundreds of numbers in parallel, you will see the speed difference.
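For illustration only, a rough sketch of a more meaningful comparison on a much larger list (the sizes, warm-up count, and class name are arbitrary choices, and System.nanoTime micro-benchmarks remain only indicative):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class LargerBenchmark {
    public static void main(String[] args) {
        int size = 10_000_000;        // large enough for differences to show
        int lessThan = 500_000;
        Random rnd = new Random(42);
        List<Integer> data = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            data.add(rnd.nextInt(1_000_000));
        }

        // warm-up so the JIT has compiled both paths before measuring
        for (int i = 0; i < 5; i++) {
            forLoopCount(data, lessThan);
            data.stream().filter(p -> p < lessThan).count();
        }

        long t0 = System.nanoTime();
        long c1 = forLoopCount(data, lessThan);
        long t1 = System.nanoTime();
        long c2 = data.stream().filter(p -> p < lessThan).count();
        long t2 = System.nanoTime();
        long c3 = data.parallelStream().filter(p -> p < lessThan).count();
        long t3 = System.nanoTime();

        System.out.printf("for loop:        %d matches in %d ms%n", c1, (t1 - t0) / 1_000_000);
        System.out.printf("stream:          %d matches in %d ms%n", c2, (t2 - t1) / 1_000_000);
        System.out.printf("parallel stream: %d matches in %d ms%n", c3, (t3 - t2) / 1_000_000);
    }

    static long forLoopCount(List<Integer> data, int lessThan) {
        long count = 0;
        for (int v : data) {
            if (v < lessThan) count++;
        }
        return count;
    }
}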

How to divide a huge loop into multiple threads and then add result in collection?

I am performing some task in a loop. I need to divide this loop of 1.2 million iterations across multiple threads. Each thread will collect its results in a list, and when all threads have completed I need to add all of the threads' list data into one common list. I cannot use ExecutorService. How can I do this?
It should be compatible with JDK 1.6.
This is what I am doing right now:
List<Thread> threads = new ArrayList<Thread>();
int elements = 1200000;
public void function1() {
int oneTheadElemCount = 10000;
float fnum_threads = (float)elements / (float)oneTheadElemCount ;
String s = String.valueOf(fnum_threads);
int num_threads = Integer.parseInt(s.substring(0, s.indexOf("."))) + 1 ;
for(int count =0 ; count < num_threads ; count++) {
int endIndex = ((oneTheadElemCount * (num_threads - count)) + 1000) ;
int startindex = endIndex - oneTheadElemCount ;
if(count == (num_threads-1) )
{
startindex = 0;
}
if(startindex == 0 && endIndex > elements) {
endIndex = elements -1 ;
}
dothis( startindex,endIndex);
}
for(Thread t : threads) {
t.run();
}
}
public List dothis(int startindex, int endIndex) throws Exception {
Thread thread = new Thread(new Runnable() {
@Override
public void run() {
for (int i = startindex;
(i < endIndex && (startindex < elements && elements) ) ; i++)
{
//task adding elements in list
}
}
});
thread.start();
threads.add(thread);
return list;
}
I don't know which version of Java you are using, but in Java 7 and higher you can use the Fork/Join framework (ForkJoinPool).
Basically,
Fork/Join, introduced in Java 7, isn't intended to replace or compete
with the existing concurrency utility classes; instead it updates and
completes them. Fork/Join addresses the need for divide-and-conquer,
or recursive task-processing in Java programs (see Resources).
Fork/Join's logic is very simple: (1) separate (fork) each large task
into smaller tasks; (2) process each task in a separate thread
(separating those into even smaller tasks if necessary); (3) join the
results.
Citation.
There are various examples online that can help with it. I haven't used it myself.
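For illustration, here is a rough, untested sketch of how a RecursiveTask could split the 1.2 million iterations and merge the per-range lists; the class name, threshold, and the work done per index are placeholders:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class RangeTask extends RecursiveTask<List<String>> {
    private static final int THRESHOLD = 10_000; // tune to taste
    private final int start;
    private final int end;

    RangeTask(int start, int end) {
        this.start = start;
        this.end = end;
    }

    @Override
    protected List<String> compute() {
        if (end - start <= THRESHOLD) {
            List<String> result = new ArrayList<String>();
            for (int i = start; i < end; i++) {
                result.add("element-" + i);   // placeholder for the real per-index work
            }
            return result;
        }
        int mid = (start + end) >>> 1;
        RangeTask left = new RangeTask(start, mid);
        RangeTask right = new RangeTask(mid, end);
        left.fork();                          // run the left half asynchronously
        List<String> rightResult = right.compute();
        List<String> combined = left.join();  // wait for the left half
        combined.addAll(rightResult);
        return combined;
    }

    public static void main(String[] args) {
        List<String> all = new ForkJoinPool().invoke(new RangeTask(0, 1_200_000));
        System.out.println(all.size());       // 1200000
    }
}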
I hope this helps.
For Java6, you can follow this related SO question.

Why Is BuiltStatement more efficient than BoundStatement in Cassandra?

This link says:
BoundStatement: obtained by binding values to a prepared statement. Typically used for queries that are executed often, with different values.
BuiltStatement: a statement built with the QueryBuilder DSL. It can be executed directly like a simple statement, or prepared.
So in my opinion, a BuiltStatement should be roughly equivalent to a BoundStatement.
However, in my case I found the BuiltStatement to be clearly more efficient than the BoundStatement. Why is this happening?
public static void main(String[] args) {
Data mc = null;
ResultSet results = null;
PK pk = null;
CassandraData dao = new CassandraData();
Session session = dao.getSession();
long start, end;
long start0 = System.currentTimeMillis();
// PreparedStatement prepared = session.prepare(
// "select * from test where E=? and D=? and M=?");
Statement statement = null;
logger.info("Start:");
for (int i = 0; i < 100; i++) {
pk = ValidData.getOnePk();
start = System.currentTimeMillis();
// statement = prepared.bind(pk.E, pk.D, pk.M);
// statement.setReadTimeoutMillis(100000);
statement = getSelect(pk);
results = session.execute(statement);
end = System.currentTimeMillis();
logger.info("Show OneKb:" + (end - start) / 1000.0 + "s.");
}
long end0 = System.currentTimeMillis();
logger.info("Show OneKb Average:" + (end0 - start0) / 1000.0 / 100 + "s/OneKb.");
}
private static Statement getSelect(PK pk) {
Select ss = QueryBuilder.select().from("test");
ss.setConsistencyLevel(com.datastax.driver.core.ConsistencyLevel.ONE);
ss.where(QueryBuilder.eq("E", pk.E))
.and(QueryBuilder.eq("D", pk.D))
.and(QueryBuilder.eq("M", pk.M)).limit(1)
.setReadTimeoutMillis(100 * 1000);
return ss;
}
I ran this case 100 times and the average time of BoundStatement was 1.316s and the average time of BuiltStatement was 0.199s.
I found where I was wrong.
When using the BuiltStatement, I appended the limit(1) method to fetch only one record, but when using the BoundStatement I didn't append LIMIT 1 to restrict the number of rows returned. In fact it returned about 100 records on average, so under those conditions it was slower.
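For a like-for-like comparison, a sketch of the prepared/bound path with the same LIMIT 1, assuming the same session, table and pk objects as in the code above:

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;

// Prepare once, outside the loop, with the same LIMIT 1 the BuiltStatement used.
PreparedStatement prepared = session.prepare(
        "select * from test where E = ? and D = ? and M = ? limit 1");

// Bind per iteration; now both statement types fetch a single row.
BoundStatement bound = prepared.bind(pk.E, pk.D, pk.M);
ResultSet results = session.execute(bound);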

Spark is taking too much time and creating thousands of jobs for some tasks

Machine Config :
RAM: 16 gb
Processor: 4 cores(Xeon E3 3.3 GHz)
Problem:
Time Consuming : Taking more than 18 minutes
Case Scenario :
Spark Mode: Local
Database: Using Cassandra 2.1.12
I am fetching 3 tables into DataFrames, each of which has fewer than 10 rows. Yes, fewer than 10 (ten).
After fetching them into DataFrames I perform join, count, show and collect operations many times. When I execute my program, Spark creates 40404 jobs 4 times; it seems the count operations require those jobs. I use count 4-5 times in the program. After waiting more than 18 minutes (approx. 18.5 to 20) it gives me the expected output.
Why is Spark creating that many jobs?
Is it normal ('ok') for it to take this much time (18 minutes) to execute this number of jobs (40404 * 4 approx.)?
Thanks in advance.
Sample code 1:
def getGroups(id: Array[String], level: Int): DataFrame = {
var lvl = level
if (level >= 0) {
for (iterated_id <- id) {
val single_level_group = supportive_df.filter("id = '" + iterated_id + "' and level = " + level).select("family_id")
//single_level_group.show()
intermediate_df = intermediate_df.unionAll(single_level_group)
//println("for loop portion...")
}
final_df = final_df.unionAll(intermediate_df)
lvl -= 1
val user_id_param = intermediate_df.collect().map { row => row.getString(0) }
intermediate_df = empty_df
//println("new method...if portion...")
getGroups(user_id_param, lvl)
} else {
//println("new method...")
final_df.distinct()
}
}
Sample code 2:
setGetGroupsVars("u_id", user_id.toString(), sa_user_df)
var user_belong_groups: DataFrame = empty_df
val user_array = Array[String](user_id.toString())
val user_levels = sa_user_df.filter("id = '" + user_id + "'").select("level").distinct().collect().map { x => x.getInt(0) }
println(user_levels.length+"...rapak")
println(user_id.toString())
for (u_lvl <- user_levels) {
val x1 = getGroups(user_array, u_lvl)
x1.show()
empty_df.show()
user_belong_groups.show()
user_belong_groups = user_belong_groups.unionAll(x1)
x1.show()
}
setGetGroupsVars("obj_id", obj_id.toString(), obj_type_specific_df)
var obj_belong_groups: DataFrame = empty_df
val obj_array = Array[String](obj_id.toString())
val obj_levels = obj_type_specific_df.filter("id = '" + obj_id + "'").select("level").distinct().collect().map { x => x.getInt(0) }
println(obj_levels.length)
for (ob_lvl <- obj_levels) {
obj_belong_groups = obj_belong_groups.unionAll(getGroups(obj_array, ob_lvl))
}
user_belong_groups = user_belong_groups.distinct()
obj_belong_groups = obj_belong_groups.distinct()
var user_obj_joined_df = user_belong_groups.join(obj_belong_groups)
user_obj_joined_df.show()
println("vbgdivsivbfb")
var user_obj_access_df = user_obj_joined_df
.join(sa_other_access_df, user_obj_joined_df("u_id") === sa_other_access_df("user_id")
&& user_obj_joined_df("obj_id") === sa_other_access_df("object_id"))
user_obj_access_df.show()
println("KDDD..")
val user_obj_access_cond1 = user_obj_access_df.filter("u_id = '" + user_id + "' and obj_id != '" + obj_id + "'")
if (user_obj_access_cond1.count() == 0) {
val user_obj_access_cond2 = user_obj_access_df.filter("u_id != '" + user_id + "' and obj_id = '" + obj_id + "'")
if (user_obj_access_cond2.count() == 0) {
val user_obj_access_cond3 = user_obj_access_df.filter("u_id != '" + user_id + "' and obj_id != '" + obj_id + "'")
if (user_obj_access_cond3.count() == 0) {
default_df
} else {
val result_ugrp_to_objgrp = user_obj_access_cond3.select("permission").agg(max("permission"))
println("cond4")
result_ugrp_to_objgrp
}
} else {
val result_ugrp_to_ob = user_obj_access_cond2.select("permission")
println("cond3")
result_ugrp_to_ob
}
} else {
val result_u_to_obgrp = user_obj_access_cond1.select("permission")
println("cond2")
result_u_to_obgrp
}
} else {
println("cond1")
individual_access
}
These two are the major code blocks in my program where execution takes too long. Most of the time is spent in the show or count operations.
First, you can check in the Spark UI which stage of your program is taking a long time.
Second, you are using distinct() many times, so when you use distinct() you have to look at how many partitions come out of it. I suspect that is the reason why Spark is creating thousands of jobs.
If that is the reason, you can use coalesce() after distinct().
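A minimal illustration of that suggestion (Java Dataset API for consistency with the rest of this page; the variable name and the partition count of 8 are arbitrary, and the same methods exist on the Scala DataFrame):

// distinct() shuffles into spark.sql.shuffle.partitions partitions (200 by default);
// coalesce() collapses them again without another full shuffle.
Dataset<Row> deduplicated = userBelongGroups.distinct().coalesce(8);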
Ok, so let's remember some basics!
Spark is lazy, and show and count are actions.
An action triggers the transformations, and you have loads of them. And since you are pulling data from Cassandra (or any other source), this costs a lot because you do not seem to be caching your transformations!
So you need to consider caching when you compute intensively on a DataFrame or RDD; that will make your actions run faster!
As for why you have so many tasks (jobs): that is of course explained by Spark's parallelism mechanism performing your actions, multiplied by the number of transformations/actions you are executing, not to mention the loops!
Nevertheless, with the information given and the quality of the code snippets posted in the question, this is as far as my answer goes.
I hope this helps!
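As a rough sketch of the caching suggestion (Java Dataset API with placeholder names; the asker's code is Scala, where the same cache()/count() methods exist): caching the joined DataFrame once means the repeated filter(...).count() calls reuse it instead of recomputing the whole lineage back to Cassandra.

// Cache the joined result once; every subsequent count() reuses the cached
// partitions instead of re-reading and re-joining the Cassandra tables.
Dataset<Row> userObjAccessDf = userObjJoinedDf
        .join(saOtherAccessDf, joinCondition)
        .cache();
long cond1Matches = userObjAccessDf.filter("u_id = '" + userId + "' and obj_id != '" + objId + "'").count();
long cond2Matches = userObjAccessDf.filter("u_id != '" + userId + "' and obj_id = '" + objId + "'").count();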

MPI initialize array only on root

I have a working wavefront program using MPJ Express. In this program, for an n x m matrix there are n processes and each process is assigned one row. Each process does the following:
for column = 0 to matrix_width do:
1) x = get the value of this column from the row above (the rank - 1 process)
2) y = get the value to the left of us (our row, column - 1)
3) add (x + y) to our current column value
So on the master process I declare an array of n x m elements. Each slave process should therefore only need an array of length m. But as it stands, in my solution each process has to allocate an array of n x m for the scatter operation to work; otherwise I get a NullPointerException (if I assign it null) or an out-of-bounds exception (if I instantiate it with new int[1]). I'm sure there has to be a solution to this, otherwise each process would require as much memory as the root.
I think I need something like allocatable in C.
In the code below, the important part is the one marked "MASTER". Normally I would pull the allocation inside the if (rank == 0) test and set the array to null (not allocating the memory) in the else branch, but that does not work.
package be.ac.vub.ir.mpi;
import mpi.MPI;
// Execute: mpjrun.sh -np 2 -jar parsym-java.jar
/**
* Parallel and sequential implementation of a prime number counter
*/
public class WaveFront
{
// Default program parameters
final static int size = 4;
private static int rank;
private static int world_size;
private static void log(String message)
{
if (rank == 0)
System.out.println(message);
}
////////////////////////////////////////////////////////////////////////////
//// MAIN //////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
public static void main(String[] args) throws InterruptedException
{
// MPI variables
int[] matrix; // matrix stored at process 0
int[] row; // each process keeps its row
int[] receiveBuffer; // to receive a value from ``row - 1''
int[] sendBuffer; // to send a value to ``row + 1''
/////////////////
/// INIT ////////
/////////////////
MPI.Init(args);
rank = MPI.COMM_WORLD.Rank();
world_size = MPI.COMM_WORLD.Size();
/////////////////
/// ALL PCS /////
/////////////////
// initialize data structures
receiveBuffer = new int[1];
sendBuffer = new int[1];
row = new int[size];
/////////////////
/// MASTER //////
/////////////////
matrix = new int[size * size];
if (rank == 0)
{
// Initialize matrix
for (int idx = 0; idx < size * size; idx++)
matrix[idx] = 0;
matrix[0] = 1;
receiveBuffer[0] = 0;
}
/////////////////
/// PROGRAM /////
/////////////////
// distribute the rows of the matrix to the appropriate processes
int startOfRow = rank * size;
MPI.COMM_WORLD.Scatter(matrix, startOfRow, size, MPI.INT, row, 0, size, MPI.INT, 0);
// For each column each process will calculate it's new values.
for (int col_idx = 0; col_idx < size; col_idx++)
{
// Get Y from row above us (rank - 1).
if (rank > 0)
MPI.COMM_WORLD.Recv(receiveBuffer, 0, 1, MPI.INT, rank - 1, 0);
// Get the X value (left from current column).
int x = col_idx == 0 ? 0 : row[col_idx - 1];
// Assign the new Z value.
row[col_idx] = row[col_idx] + x + receiveBuffer[0];
// Wait for other process to ask us for this value.
sendBuffer[0] = row[col_idx];
if (rank + 1 < size)
MPI.COMM_WORLD.Send(sendBuffer, 0, 1, MPI.INT, rank + 1, 0);
}
// At this point each process should be done so we call gather.
MPI.COMM_WORLD.Gather(row, 0, size, MPI.INT, matrix, startOfRow, size, MPI.INT, 0);
// Let the master show the result.
if (rank == 0)
for (int row_idx = 0; row_idx < size; ++row_idx)
{
for (int col_idx = 0; col_idx < size; ++col_idx)
System.out.print(matrix[size * row_idx + col_idx] + " ");
System.out.println();
}
MPI.Finalize(); // Don't forget!!
}
}
