How to join 3 RDD tables in spark using java? - apache-spark

Please bear with me on this one. I have three RDDs (coming from Hadoop). All three have unique keys such as ipaddress and boxnumber on which they can be matched/joined. Here is some sample data from all tables. Table A's boxnumber column has to be converted to a number before it can be matched.
Table A:
ipaddress|boxnumber|cardnumber
94.254.57.16|59774DEa1|0D1EDF40
94.154.57.176|5F7377Ga9|0D3F796D
Table B:
cardno,boxnumber
1500914,2000096
1500413,2211469
Table C:
ipaddress|kanal|bitrate|kanaltimespent|date|country
94.254.57.16|sky|2023|003DF6A.ts|12-02-2016|chile
94.154.57.176|ITV|3425|003DF6A.ts|23-04-2014|egypt
My first attempt in Java:
//TABLE A
JavaSparkContext sc = SetupSparkContext("SparkSample");
JavaRDD<ExtractTable_A> ta_RDD = ExtractTable_A.getRDD(sc);
JavaPairRDD<String, ExtractTable_A> A_PairRDD = ta_RDD.mapToPair(new PairFunction<ExtractTable_A, String, ExtractTable_A>()
{
    @Override
    public Tuple2<String, ExtractTable_A> call(ExtractTable_A extractTable_A) throws Exception
    {
        // split on a literal pipe ("|" alone is a regex metacharacter)
        String[] A = extractTable_A.toString().split("\\|");
        return new Tuple2<>(A[0], extractTable_A);
    }
});

//TABLE B
JavaRDD<ExtractTable_B> tb_RDD = ExtractTable_B.getRDD(sc);
JavaPairRDD<String, ExtractTable_B> B_PairRDD = tb_RDD.mapToPair(new PairFunction<ExtractTable_B, String, ExtractTable_B>()
{
    @Override
    public Tuple2<String, ExtractTable_B> call(ExtractTable_B extractTable_B) throws Exception
    {
        String[] B = extractTable_B.toString().split(",");
        return new Tuple2<>(B[1], extractTable_B);
    }
});

//TABLE C
JavaRDD<ExtractTable_C> tc_RDD = ExtractTable_C.getRDD(sc);
JavaPairRDD<String, ExtractTable_C> C_PairRDD = tc_RDD.mapToPair(new PairFunction<ExtractTable_C, String, ExtractTable_C>()
{
    @Override
    public Tuple2<String, ExtractTable_C> call(ExtractTable_C extractTable_C) throws Exception
    {
        String[] C = extractTable_C.toString().split("\\|");
        return new Tuple2<>(C[0], extractTable_C);
    }
});
//At this point I need to join and create a .txt output file
The final result should be a file with these headers:
KANAL|BITRATE|TIMESPENT|DATE|COUNTRY
===update===
I have managed to join Table A and Table B, but now I am stuck on how to join Table C to Table A.
//Joined Table A and Table B
JavaPairRDD<String, Tuple2<ExtractTable_A, ExtractTable_B>> join_1 = A_PairRDD.join(B_PairRDD);
. . .
//Joined Table A and Table C
JavaPairRDD<String, Tuple2<ExtractTable_A, ExtractTable_C>> join_2 = A_PairRDD.join(B_PairRDD);
// Output results from Table A and Table B
join_1.map(in -> {
    return new ResultStringBuilder("|")
        .append(Long.parseLong(in._2()._1().getCardno().trim(), 16))
        .append(Long.parseLong(in._2()._1().getBoxno().trim(), 16))
        .append(in._2()._2().getBoxno())
        // *** HERE I NEED TO ALSO APPEND THE COLUMNS FROM Table C
        .toString();
})
.saveAsTextFile("c:\\outfile");

Remember that when you are working with the Spark API, you always create a new RDD when you modify anything in the RDD structure, because RDDs are immutable.
In order to do a three-way join in this case, you need to create a new JavaPairRDD after you join the first two tables, because you want a PairRDD with a new key-value pair: the unique keys for Tables A, B, and C are different.
There are two ways to do this (join A-B first, or A-C first).
The way you could join the tables is like this:
Table A - Table B (PairRDD with key: boxnumber or cardnumber, or maybe both)
After you join Table A and Table B, you need to create a new PairRDD with key ipaddress, because you want to join with Table C.
// joinedAB is the PairRDD resulting from the join of Table A and Table B
// re-key it by ipaddress (sketch; the exact accessor depends on your classes)
JavaPairRDD joinedABForC = joinedAB.mapToPair(l -> new Tuple2(/* ipaddress from l */, l._2()));
// now joinedABForC has ipaddress as the RDD's key
// join rdd joinedABForC with Table C
Once the join column has been moved into the key of the PairRDD, you can join it with Table C and the three-way join is done.
Joined Table AB - Table C (PairRDD with key: ipaddress)
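For concreteness, here is a minimal Java sketch of that second step. It assumes join_1 from the question (Table A joined with Table B) and a C_PairRDD keyed by ipaddress; getIpaddress() and the getters on ExtractTable_C are hypothetical accessors (assumed to return Strings), so substitute whatever your classes actually expose.

// Re-key the joined A-B records by ipaddress so they can be joined with Table C
// (getIpaddress() is a hypothetical accessor on ExtractTable_A)
JavaPairRDD<String, Tuple2<ExtractTable_A, ExtractTable_B>> joinedABForC =
    join_1.mapToPair(t -> new Tuple2<>(t._2()._1().getIpaddress(), t._2()));

// Three-way join: each value is ((A, B), C)
JavaPairRDD<String, Tuple2<Tuple2<ExtractTable_A, ExtractTable_B>, ExtractTable_C>> joinedABC =
    joinedABForC.join(C_PairRDD);

// Build the output lines (KANAL|BITRATE|TIMESPENT|DATE|COUNTRY) from the Table C side
joinedABC.map(t -> {
    ExtractTable_C c = t._2()._2();
    return String.join("|", c.getKanal(), c.getBitrate(),
            c.getKanaltimespent(), c.getDate(), c.getCountry());
}).saveAsTextFile("c:\\outfile_joined");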

Related

How to get value of a Spark dataset column value to use it dynamically in SQL Query?

I have a Dataset DS1 which has a column "LEVEL". I want to check this column's value and get another column "COMPANIES", which is an array; based on some business logic, I have to update its values.
For this update operation, I am using withColumn() method.
DS1.withColumn("COMPANIES", functions.when(functions.col("LEVEL").gt(1), someMethod(sparkSession, functions.col("COMPANIES"), functions.col("LEVEL"))).otherwise(functions.col("value")));
Inside someMethod(), I am trying to use the Columns as parameters.
private int[] someMethod(SparkSession sparkSession, Column companies, Column Level) {
    String query = "Select cs.level from DS1 cs inner join DS2 cp on cs.level=" + (Level.minus(1)) + " and cs.company_private_id=ANY(" + companies + ")";
    sparkSession.sql(query);
    List<Integer> list = sparkSession.sql(query).collectAsList().get(0).getList(0);
    return list.stream().mapToInt(i -> i).toArray();
}
I could not get the values of the variables Level and companies, as they are of Column type. How do I do the logic here?
Assuming the data type of level is integer; if it is something else, change row.getInt(0) accordingly (e.g. row.getDecimal(0) if the data type is BigDecimal).
List<Row> dataSet = sparkSession.sql(query).collectAsList();
List<Integer> levels = dataSet.stream().map(row -> row.getInt(0)).collect(Collectors.toList());
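If the end goal is just to run the query with concrete values, one way to restructure someMethod is to pass the already-collected values instead of Column objects and build the SQL string on the driver. This is only a sketch under the assumptions of the question (table names DS1/DS2, an IN list standing in for ANY, and the companies already collected into a plain Java list):

// Sketch: the query is built from plain driver-side values, not Column objects
private static int[] someMethod(SparkSession sparkSession, List<Integer> companies, int level) {
    String inList = companies.stream().map(String::valueOf)
            .collect(java.util.stream.Collectors.joining(","));
    String query = "select cs.level from DS1 cs inner join DS2 cp"
            + " on cs.level = " + (level - 1)
            + " and cs.company_private_id in (" + inList + ")";
    List<Row> rows = sparkSession.sql(query).collectAsList();
    return rows.stream().mapToInt(r -> r.getInt(0)).toArray();
}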

How to get Cassandra cql string given a Apache Spark Dataframe in 2.2.0?

I am trying to get a CQL string given a DataFrame. I came across this function,
where I can do something like this:
TableDef.fromDataFrame(df, "test", "hello", ProtocolVersion.NEWEST_SUPPORTED).cql()
It looks to me like the library uses the first column as the partition key and does not care about the clustering key, so how do I specify that a particular set of columns of a DataFrame should be used as the partition key and a particular set of columns as the clustering key?
It looks like I can create a new TableDef, however I have to do the entire mapping myself, and in some cases the necessary types like ColumnType are not accessible in Java. For example, I tried to create a new ColumnDef like below:
new ColumnDef("col5", new PartitionKeyColumn(), ColumnType is not accessible in Java)
Objective: To get a CQL create Statement from a Spark DataFrame.
Input: My DataFrame can have any number of columns with their respective Spark types. Say I have a Spark DataFrame with 100 columns, where col8 and col9 correspond to Cassandra partition key columns and col10 corresponds to a Cassandra clustering key column:
col1| col2| ...|col100
Now I want to use the spark-cassandra-connector library to give me a CQL create table statement given the info above.
Desired Output
create table if not exists test.hello (
col1 bigint, (whatever col1's type is in my dataframe; I just picked bigint randomly)
col2 varchar,
col3 double,
...
...
col100 bigint,
primary key((col8, col9), col10)
) WITH CLUSTERING ORDER BY (col10 DESC);
Because the required components (PartitionKeyColumn and the instances of ColumnType) are Scala objects, you need to use the following syntax to access their instances:
// imports
import com.datastax.spark.connector.cql.ColumnDef;
import com.datastax.spark.connector.cql.PartitionKeyColumn$;
import com.datastax.spark.connector.types.TextType$;
// actual code
ColumnDef a = new ColumnDef("col5",
PartitionKeyColumn$.MODULE$, TextType$.MODULE$);
See the code for ColumnRole & PrimitiveTypes to find the full list of names of objects/classes.
Update after additional requirements: Code is lengthy, but should work...
SparkSession spark = SparkSession.builder()
        .appName("Java Spark SQL example").getOrCreate();

Set<String> partitionKeys = new TreeSet<String>() {{
    add("col1");
    add("col2");
}};
Map<String, Integer> clusteringKeys = new TreeMap<String, Integer>() {{
    put("col8", 0);
    put("col9", 1);
}};

Dataset<Row> df = spark.read().json("my-test-file.json");
TableDef td = TableDef.fromDataFrame(df, "test", "hello",
        ProtocolVersion.NEWEST_SUPPORTED);

List<ColumnDef> partKeyList = new ArrayList<ColumnDef>();
List<ColumnDef> clusterColumnList = new ArrayList<ColumnDef>();
List<ColumnDef> regColumnList = new ArrayList<ColumnDef>();

scala.collection.Iterator<ColumnDef> iter = td.allColumns().iterator();
while (iter.hasNext()) {
    ColumnDef col = iter.next();
    String colName = col.columnName();
    if (partitionKeys.contains(colName)) {
        partKeyList.add(new ColumnDef(colName,
                PartitionKeyColumn$.MODULE$, col.columnType()));
    } else if (clusteringKeys.containsKey(colName)) {
        int idx = clusteringKeys.get(colName);
        clusterColumnList.add(new ColumnDef(colName,
                new ClusteringColumn(idx), col.columnType()));
    } else {
        regColumnList.add(new ColumnDef(colName,
                RegularColumn$.MODULE$, col.columnType()));
    }
}

// a java.util.List cannot simply be cast to a scala.collection.Seq, so convert it
TableDef newTd = new TableDef(td.keyspaceName(), td.tableName(),
        scala.collection.JavaConverters.asScalaBufferConverter(partKeyList).asScala(),
        scala.collection.JavaConverters.asScalaBufferConverter(clusterColumnList).asScala(),
        scala.collection.JavaConverters.asScalaBufferConverter(regColumnList).asScala(),
        td.indexes(), td.isView());
String cql = newTd.cql();
System.out.println(cql);

Transform JavaPairDStream to Tuple3 in Java

I am experimenting with the Spark job that streams data from Kafka and produces to Cassandra.
The sample I am working with takes a bunch of words in a given time interval and publishes the word count to Cassandra. I am also trying to publish the timestamp along with the word and its count.
What I have so far is as follows:
JavaPairReceiverInputDStream<String, String> messages =
KafkaUtils.createStream(jssc, zkQuorum, groupId, topicMap);
JavaDStream<String> lines = messages.map(Tuple2::_2);
JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(SPACE.split(x)).iterator());
JavaPairDStream<String, Integer> wordCounts = words.mapToPair(s -> new Tuple2<>(s, 1))
.reduceByKey((i1, i2) -> i1 + i2);
Now I am trying to append the timestamp to these records. What I have tried is something like this:
Tuple3<String, Date, Integer> finalRecord =
wordCounts.map(s -> new Tuple3<>(s._1(), new Date().getTime(), s._2()));
This, of course, is shown as wrong in my IDE. I am completely new to working with the Spark libraries and to writing functions in this form (lambda-based, I guess).
Can someone help me correct this error and achieve what I am trying to do?
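As a side note, the error the IDE flags in the snippet above is twofold: wordCounts.map(...) returns a JavaDStream<Tuple3<...>> rather than a single Tuple3, and new Date().getTime() is a long, while the middle slot of the Tuple3 is declared as Date. A minimal sketch of the direct mapping with the types lined up (everything else unchanged from the code above):

// Declare the result as a stream and use Long for the timestamp slot
// (or pass new Date() instead and keep Date in the Tuple3)
JavaDStream<Tuple3<String, Long, Integer>> stampedCounts =
        wordCounts.map(t -> new Tuple3<>(t._1(), System.currentTimeMillis(), t._2()));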
After some searching on the web and studying some examples, I was able to achieve what I wanted as follows.
In order to append the timestamp attribute to the existing two-value tuple, I had to create a simple bean which represents my Cassandra row:
public static class WordCountRow implements Serializable {
    String word = "";
    long timestamp;
    Integer count = 0;
    // constructor used by the map() call below
    public WordCountRow(String w, long ts, Integer c) { word = w; timestamp = ts; count = c; }
}
Then I had to map the (word, count) Tuple2 objects in the JavaPairDStream structure to a JavaDStream structure that holds objects of the above WordCountRow class:
JavaDStream<WordCountRow> wordCountRows = wordCounts.map((Function<Tuple2<String, Integer>, WordCountRow>)
tuple -> new WordCountRow(tuple._1, new Date().getTime(), tuple._2));
Finally, I could call the foreachRDD method on this structure and write the resulting WordCountRow objects to Cassandra one after the other:
wordCountRows.foreachRDD((VoidFunction2<JavaRDD<WordCountRow>, Time>) (rdd, time) -> {
    final SparkConf sc = rdd.context().getConf();
    final CassandraConnector cc = CassandraConnector.apply(sc);
    rdd.foreach((VoidFunction<WordCountRow>) wordCount -> {
        try (Session session = cc.openSession()) {
            String query = String.format(Joiner.on(" ").join(
                    "INSERT INTO test_keyspace.word_count",
                    "(word, ts, count)",
                    "VALUES ('%s', %s, %s);"),
                    wordCount.word, wordCount.timestamp, wordCount.count);
            session.execute(query);
        }
    });
});
Thanks

spark UDF to return list of Rows

In Spark SQL I am trying to join multiple tables that are already in place.
However, I need to use a function which will take user input, get details from two other tables, and then use this result in the join.
The query is something like below:
select t1.col1,t1.col2,t2.col3,cast((t1.value * t3.value) from table1 t1
left join table2 t2 on t1.col = t2.col
left join fn_calculate (value1, value2) as t3 on t1.value = t3.value
Here fn_calculate is the function which takes value1 and value2 as parameters and returns a table of rows (in SQL Server it returns a table).
I am trying to do this by using a Hive generic UDF which will take the input params and then return the DataFrame, like below:
public String evaluate(DeferredObject[] arguments) throws HiveException {
    if (arguments.length != 1) {
        return null;
    }
    if (arguments[0].get() == null) {
        return null;
    }
    DataFrame dataFrame = sqlContext
            .sql("select * from A where col1 = value and col2 = value2");
    javaSparkContext.close();
    return "dataFrame";
}
Or do I need to use Scala functions like below?
static class Z extends scala.runtime.AbstractFunction0<DataFrame> {
    @Override
    public DataFrame apply() {
        // TODO Auto-generated method stub
        return sqlContext.sql("select * from A where col1 = value and col2 = value2");
    }
}

Accessing a global lookup Apache Spark

I have a list of CSV files, each with a bunch of category names as header columns. Each row is a user with a boolean value (0, 1) indicating whether they are part of that category or not. The CSV files do not all have the same set of header categories.
I want to create a composite CSV across all the files which has the following output:
Header is a union of all the headers
Each row is a unique user with a boolean value corresponding to the category column
The way I wanted to tackle this is to create a tuple of a user_id and a unique category_id for each cell with a '1'. Then reduce all these columns for each user to get the final output.
How do I create the tuple to begin with? Can I have a global lookup for all the categories?
Example Data:
File 1
user_id,cat1,cat2,cat3
21321,,,1
21322,1,1,1
21323,1,,
File 2
user_id,cat4,cat5
21321,1,
21323,,1
Output
user_id,cat1,cat2,cat3,cat4,cat5
21321,,,1,1,
21322,1,1,1,,
21323,1,,,,1
The title of the question is probably misleading in the sense that it conveys a certain implementation choice: there's no need for a global lookup in order to solve the problem at hand.
In big data, there's a basic principle guiding most solutions: divide and conquer. In this case, the input CSV files can be divided into tuples of (user, category).
Any number of CSV files containing an arbitrary number of categories can be transformed into this simple format. The final CSV results from the union of the previous step, the extraction of the total number of categories present, and some data transformation to get it into the desired format.
In code this algorithm would look like this:
import org.apache.spark.SparkContext._
val file1 = """user_id,cat1,cat2,cat3|21321,,,1|21322,1,1,1|21323,1,,""".split("\\|")
val file2 = """user_id,cat4,cat5|21321,1,|21323,,1""".split("\\|")
val csv1 = sparkContext.parallelize(file1)
val csv2 = sparkContext.parallelize(file2)
import org.apache.spark.rdd.RDD
def toTuples(csv: RDD[String]): RDD[(String, String)] = {
  val headerLine = csv.first
  val header = headerLine.split(",")
  val data = csv.filter(_ != headerLine).map(line => line.split(","))
  data.flatMap { elem =>
    val merged = elem.zip(header)
    val id = elem.head
    merged.tail.collect { case (v, cat) if v == "1" => (id, cat) }
  }
}
val data1 = toTuples(csv1)
val data2 = toTuples(csv2)
val union = data1.union(data2)
val categories = union.map{case (id, cat) => cat}.distinct.collect.sorted //sorted category names
val categoriesByUser = union.groupByKey.mapValues(v=>v.toSet)
val numericCategoriesByUser = categoriesByUser.mapValues{catSet => categories.map(cat=> if (catSet(cat)) "1" else "")}
val asCsv = numericCategoriesByUser.collect.map{case (id, cats)=> id + "," + cats.mkString(",")}
Results in:
21321,,,1,1,
21322,1,1,1,,
21323,1,,,,1
(Generating the header is simple and left as an exercise for the reader)
You don't need to do this as a two-step process if all you need is the resulting values.
A possible design:
1/ Parse your CSV files. You don't mention whether your data is on a distributed FS, so I'll assume it is not.
2/ Enter your (K,V) pairs into a mutable parallelized (to take advantage of Spark) map.
pseudo-code:
val directory = ..
mutable.ParHashMap map = new mutable.ParHashMap()
while (files[i] != null)
{
val file = directory.spark.textFile("/myfile...")
val cols = file.map(_.split(","))
map.put(col[0], col[i++])
}
and then you can access your (K/V) tuples by way of an iterator on the map.
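A rough, sequential Java sketch of this driver-side variant (assuming a JavaSparkContext sc is in scope, the usual java.util imports, placeholder file paths, and data small enough to collect onto the driver):

// Read each CSV, collect it, and fold (user_id, category) pairs into one map
Map<String, Set<String>> userToCats = new TreeMap<>();
for (String path : Arrays.asList("/myfile1.csv", "/myfile2.csv")) {
    List<String> lines = sc.textFile(path).collect();
    String[] header = lines.get(0).split(",", -1);
    for (String line : lines.subList(1, lines.size())) {
        String[] cols = line.split(",", -1);
        for (int i = 1; i < cols.length && i < header.length; i++) {
            if ("1".equals(cols[i])) {
                userToCats.computeIfAbsent(cols[0], k -> new TreeSet<>()).add(header[i]);
            }
        }
    }
}
// Iterating over userToCats then gives the union header and one row per user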
