How to convert List to JavaRDD - apache-spark

We know that in Spark there is a method rdd.collect() which converts an RDD to a list:
List<String> f = rdd.collect();
String[] array = f.toArray(new String[f.size()]);
I am trying to do exactly the opposite in my project: I have an ArrayList of String which I want to convert to a JavaRDD. I have been looking for a solution for quite some time but have not found the answer. Can anybody please help me out here?

You're looking for JavaSparkContext.parallelize(List) and similar. This is just like in the Scala API.
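For example, a minimal sketch (the app name and local master are illustrative assumptions):
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Assumed local configuration, for illustration only
SparkConf conf = new SparkConf().setAppName("ListToRDD").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);

List<String> list = Arrays.asList("a", "b", "c");
JavaRDD<String> rdd = sc.parallelize(list); // copies the list into a distributed RDD
System.out.println(rdd.collect());          // [a, b, c]
sc.close();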

Adding to Sean Owen's and the other solutions:
You can use JavaSparkContext#parallelizePairs for a List of Tuple2:
List<Tuple2<Integer, Integer>> pairs = new ArrayList<>();
pairs.add(new Tuple2<>(0, 5));
pairs.add(new Tuple2<>(1, 3));
JavaSparkContext sc = new JavaSparkContext();
JavaPairRDD<Integer, Integer> rdd = sc.parallelizePairs(pairs);

There are two ways to convert a collection to an RDD:
1) sc.parallelize(collection)
2) sc.makeRDD(collection)
Both methods behave identically (in the Scala API, makeRDD simply delegates to parallelize), so you can use either of them; note that makeRDD exists only on the Scala SparkContext, not on JavaSparkContext.

If you are using a .scala file, or you don't want to or cannot use JavaSparkContext, then you could:
use SparkContext instead of JavaSparkContext
convert your Java List to a Scala List
use SparkContext's parallelize method
For example:
import scala.collection.JavaConverters._

val javaList = new java.util.ArrayList[String]()
javaList.add("abc")
javaList.add("def")
sc.parallelize(javaList.asScala)
This will generate an RDD for you.

Another option is to build the schema explicitly and create a DataFrame from a List<Row>:
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("fieldx1", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("fieldx2", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("fieldx3", DataTypes.LongType, true));
StructType schema = DataTypes.createStructType(fields);
List<Row> data = new ArrayList<>();
data.add(RowFactory.create("", "", 0L)); // values must match the schema types; fieldx3 is LongType
Dataset<Row> rawDataSet = spark.createDataFrame(data, schema);

Related

How to convert Java ArrayList to Apache Spark Dataset?

I have a list like this:
List<String> dataList = new ArrayList<>();
dataList.add("A");
dataList.add("B");
dataList.add("C");
I need to convert it to a Dataset, something like Dataset<Row> dataDs = Seq(dataList).toDs(); but in the Java API.
You can create a Dataset<String> directly from a List with an Encoder:
List<String> data = Arrays.asList("abc", "abc", "xyz");
Dataset<String> dataDs = spark.createDataset(data, Encoders.STRING());
Dataset<String> dataListDs = spark.createDataset(dataList, Encoders.STRING());
dataDs.show();
You can convert a List<String> to a Dataset<Row> like so:
1) Build a List<Object> from the List<String>, converting each element to its correct Object class, e.g. Integer, String, etc.
2) Generate a List<Row> from the List<Object>.
3) Prepare the datatype list and header list you want for the Dataset<Row> schema.
4) Construct the schema object.
5) Create the dataset:
List<Object> data = new ArrayList<>();
data.add("hello");
data.add(null);
List<Row> ls = new ArrayList<Row>();
Row row = RowFactory.create(data.toArray());
ls.add(row);
List<DataType> datatype = new ArrayList<DataType>();
datatype.add(DataTypes.StringType);
datatype.add(DataTypes.IntegerType);
List<String> headerList = new ArrayList<String>();
headerList.add("Field_1_string");
headerList.add("Field_2_integer");
StructField structField1 = new StructField(headerList.get(0), datatype.get(0), true, org.apache.spark.sql.types.Metadata.empty());
StructField structField2 = new StructField(headerList.get(1), datatype.get(1), true, org.apache.spark.sql.types.Metadata.empty());
List<StructField> structFieldsList = new ArrayList<>();
structFieldsList.add(structField1);
structFieldsList.add(structField2);
StructType schema = new StructType(structFieldsList.toArray(new StructField[0]));
Dataset<Row> dataset = sparkSession.createDataFrame(ls, schema);
dataset.show();
dataset.printSchema();
This is a derived answer that worked for me; it is inspired by NiharGht's answer.
Suppose we have a list like this (not meant to run, just to give the idea):
List<List<Integer>> data = [
[1, 2, 3],
[2, 3, 4],
[3, 4, 5]
];
Now convert each inner List to a Row so that it can be used to make a DataFrame:
List<Row> rows = new ArrayList<>();
for (List<Integer> that_line : data){
Row row = RowFactory.create(that_line.toArray());
rows.add(row);
}
Then just make the DataFrame (note: instead of using an RDD, use the List):
Dataset<Row> r2DF = sparkSession.createDataFrame(rows, schema); // supposing you have schema already.
r2DF.show();
The catch is in this line:
Dataset<Row> r2DF = sparkSession.createDataFrame(rows, schema);
This is where we would usually pass an RDD, but here we pass the List instead.
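For comparison, a minimal sketch of both overloads (the rows and schema variables are the ones built above; jsc is an assumed JavaSparkContext):
// createDataFrame accepts either a List<Row> or a JavaRDD<Row>, together with a schema
Dataset<Row> fromList = sparkSession.createDataFrame(rows, schema); // directly from the List<Row>
Dataset<Row> fromRdd = sparkSession.createDataFrame(jsc.parallelize(rows), schema); // via a JavaRDD<Row>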

access Broadcast Variables in Spark java

I need to process Spark broadcast variables using the Java RDD API. This is the code I have tried so far:
This is only sample code to check whether it works; in my actual case I need to work on two CSV files.
SparkConf conf = new SparkConf().setAppName("BroadcastVariable").setMaster("local");
JavaSparkContext ctx = new JavaSparkContext(conf);
Map<Integer,String> map = new HashMap<Integer,String>();
map.put(1, "aa");
map.put(2, "bb");
map.put(9, "ccc");
Broadcast<Map<Integer, String>> broadcastVar = ctx.broadcast(map);
List<Integer> list = new ArrayList<Integer>();
list.add(1);
list.add(2);
list.add(9);
JavaRDD<Integer> listrdd = ctx.parallelize(list);
JavaRDD<Object> mapr = listrdd.map(x -> broadcastVar.value());
System.out.println(mapr.collect());
and it prints output like this:
[{1=aa, 2=bb, 9=ccc}, {1=aa, 2=bb, 9=ccc}, {1=aa, 2=bb, 9=ccc}]
and my requirement is:
[{aa, bb, ccc}]
Is it possible to do like in my required way?
I used JavaRDD<Object> mapr = listrdd.map(x -> broadcastVar.value().get(x));
instead of JavaRDD<Object> mapr = listrdd.map(x -> broadcastVar.value());.
It's working now.
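Putting it together, a minimal sketch of the corrected mapping (reusing broadcastVar and listrdd from the question):
// Look up each key in the broadcast map instead of returning the whole map
JavaRDD<String> mapr = listrdd.map(x -> broadcastVar.value().get(x));
System.out.println(mapr.collect()); // [aa, bb, ccc]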

Save as Table in Hive : Failure with exception 'atleast one column must be specified for the table'

I have a simple Spark job that splits the words from a file and loads them into a table in Hive.
public static void wordCountJava7() {
// Define a configuration to use to interact with Spark
SparkConf conf = new SparkConf().setMaster("local[4]").setAppName("Work Count App");
SparkContext sc = new SparkContext(conf);
// Create a Java version of the Spark Context from the configuration
JavaSparkContext jsc = new JavaSparkContext(sc);
// Load the input data, which is a text file read from the command line
JavaRDD<String> input = jsc.textFile("file:///home/priyanka/workspace/ZTA/spark/src/main/java/sample.txt");
// Java 7 and earlier
JavaRDD<String> words = input.flatMap(new FlatMapFunction<String, String>() {
public Iterable<String> call(String s) {
return Arrays.asList(s.split(" "));
}
});
// Java 7 and earlier: transform the collection of words into pairs (word and 1)
JavaPairRDD<String, Integer> counts = words.mapToPair(new PairFunction<String, String, Integer>() {
public Tuple2<String, Integer> call(String s) {
return new Tuple2(s, 1);
}
});
// Java 7 and earlier: count the words
JavaPairRDD<String, Integer> reducedCounts = counts.reduceByKey(new Function2<Integer, Integer, Integer>() {
public Integer call(Integer x, Integer y) {
return x + y;
}
});
HiveContext hiveContext = new HiveContext(sc);
DataFrame dataFrame = hiveContext.createDataFrame(words, SampleBean.class);
dataFrame.write().saveAsTable("Sample");
words.saveAsTextFile("output");
jsc.close();
}
The spark jobs fails with the following trace:
16/04/29 15:41:21 WARN HiveContext$$anon$2: Could not persist `sample` in a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific format.
org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: at least one column must be specified for the table
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:720)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:677)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply$mcV$sp(ClientWrapper.scala:424)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply(ClientWrapper.scala:422)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply(ClientWrapper.scala:422)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:290)
at org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:237)
at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:236)
at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:279)
at org.apache.spark.sql.hive.client.ClientWrapper.createTable(ClientWrapper.scala:422)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.createDataSourceTable(HiveMetastoreCatalog.scala:358)
at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:280)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
Checked the table sample in Hive. It did have one column
hive> desc sample;
OK
word string None
Time taken: 0.218 seconds, Fetched: 1 row(s)
When I try to save it as a table, this error is thrown.
Any help is appreciated.
This error can also mean that a column's data type is not correct. I got the same error while using an Avro schema: in the DataFrame the data type was decimal(20,2), and in the Avro schema I had declared the type as Decimal(20,2), which gave the same error. I later changed the data type in the Avro schema to string and it worked fine for me, since Avro converts the internal decimal to string.
This issue occurred for me because I was doing:
DataFrameWriter<Row> dfw =
sparkSession.createDataFrame(jsc.parallelize(uuids), CustomDataClass.class).write();
And I fixed it by doing
DataFrameWriter<Row> dfw =
sparkSession.createDataFrame(uuids, CustomDataClass.class).write();
(There is no need to parallelize.) In general, make sure the bean class you pass matches the element type of the List or RDD you pass to createDataFrame.
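Applied to the original word-count job, a minimal sketch (it assumes SampleBean is a simple bean with a single word property and the matching getter/setter, which lines up with the word string column shown by desc sample):
// Map each String into a SampleBean so the bean class matches the RDD element type
JavaRDD<SampleBean> beans = words.map(new Function<String, SampleBean>() {
    public SampleBean call(String w) {
        SampleBean bean = new SampleBean();
        bean.setWord(w); // assumed setter on the hypothetical SampleBean
        return bean;
    }
});
DataFrame dataFrame = hiveContext.createDataFrame(beans, SampleBean.class);
dataFrame.write().saveAsTable("sample");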

How to parallelize a list of key value pairs to JavaPairRDD in Apache spark Java API?

I have a List of key-value pairs such as List((A,1),(B,2),(C,3)) in heap memory. How can I parallelize this list to create a JavaPairRDD?
In Scala:
val pairs = sc.parallelize(List((A,1),(B,2),(C,3)))
Is there a similar way with the Java API?
I found the answer. First store the List of tuples in a JavaRDD and then convert it to a JavaPairRDD.
List<Tuple2<String, Integer>> data = Arrays.asList(new Tuple2<>("panda", 0), new Tuple2<>("panda", 1));
JavaRDD<Tuple2<String, Integer>> rdd = sc.parallelize(data);
JavaPairRDD<String, Integer> pairRdd = JavaPairRDD.fromJavaRDD(rdd);
Have a look at this answer.
This one works for me:
sc.parallelizePairs(Arrays.asList(new Tuple2<>("123", "123")));
Parallelized collections are created by calling JavaSparkContext’s parallelize method on an existing Collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
List data = ......;
JavaRDD rdd = sc.parallelize(data);
Convert a Tuple into a List with the snippet below:
Tuple2<Sensor, Integer> tuple = new Tuple2<Sensor, Integer>(arg0._2, 1);
List<Tuple2<Sensor, Integer>> list = new ArrayList<Tuple2<Sensor, Integer>>();
list.add(tuple);

Spark job on hbase data

I am new to Spark and I am trying to get my Facebook data from an HBase table with the following schema:
I want to run a Spark job on it as explained below. Following is my code to get the JavaPairRDD:
SparkConf sparkConf = new SparkConf().setAppName("HBaseRead").setMaster("local[2]");
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
sparkConf.set("spark.kryoserializer.buffer.mb", "256");
sparkConf.set("spark.kryoserializer.buffer.max", "512");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "localhost:2181");
conf.set("hbase.regionserver.port", "60010");
String tableName = "fbData";
conf.set("hbase.master", "localhost:60010");
conf.set(TableInputFormat.INPUT_TABLE, tableName);
JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(conf, TableInputFormat.class,
ImmutableBytesWritable.class, Result.class);
Now, using map() on the RDD, I am able to get a JavaRDD for posts/comments/replies using the type column:
JavaRDD<Post> results = hBaseRDD.map(new Function<Tuple2<ImmutableBytesWritable, Result>, Post>() {
    public Post call(Tuple2<ImmutableBytesWritable, Result> tuple) {
        // fetching posts from the HBase Result
        return post;
    }
});
Now I have 3 JavaRDDs for posts, comments and replies. The Post POJO has fields for comments and replies, so I want to add the comments and replies to each post using the parent post id. How can I accomplish this with Spark? The way I thought of was to iterate through all posts, then iterate through all the comments and replies. Thanks in advance.
One way you can do this is by turning your 3 RDDs into JavaPairRDDs, with the common field (the parent post id) as the key. You can then use the join method.
Assuming that the results and comments RDDs are pair RDDs, you can just do:
JavaPairRDD<??> aggregatedResults = results.join(comments)
I do not know what type you would use for the combined objects.
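A minimal sketch of that approach (the posts and comments RDDs, and the getId, getParentPostId, and addComment accessors, are assumptions about the POJOs, not part of the original code):
// Key posts by their id and comments by their parent post id so they can be joined
JavaPairRDD<String, Post> postsById = posts.mapToPair(p -> new Tuple2<>(p.getId(), p));
JavaPairRDD<String, Comment> commentsByPost = comments.mapToPair(c -> new Tuple2<>(c.getParentPostId(), c));
// groupByKey gathers all comments of a post; join pairs them with the post itself
JavaPairRDD<String, Tuple2<Post, Iterable<Comment>>> joined = postsById.join(commentsByPost.groupByKey());
JavaRDD<Post> postsWithComments = joined.map(t -> {
    Post post = t._2()._1();
    for (Comment c : t._2()._2()) {
        post.addComment(c);
    }
    return post;
});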
