How to control Spark JavaRDD<MyTable> to take specific n rows?

I have my own data structure called MyTable, which is a kind of columnar data-store format table. Now I want to use Spark to create MyTable in a distributed environment, since my datasets are in HDFS. I have used Spark before and am familiar with it.
I am not able to figure out how to control a JavaRDD to take n rows, where n could be 80k, 90k rows, etc. As you can see below, the JavaRDD will always create a MyTable with one row; how do I create a MyTable with n rows?
JavaRDD<MyTable> rdd_records = sc.textFile("/path/to/hdfs").map(
    new Function<String, MyTable>() {
        public MyTable call(String line) throws Exception {
            String[] fields = line.split(",");
            Record record = create Record from above fields
            MyTable table = new MyTable();
            return table.append(record);
        }
    });
If I knew how to tell the RDD to take a certain number of rows, I could use that to create MyTable in a distributed way.

When you load data using sc.textFile, Spark automatically splits the data on newlines and distributes it across partitions. So what you need is custom partitioning based on your parameter (the 80k-row requirement); you can use partitionBy on the RDD for that. After that, use mapPartitions instead of map to generate one of your row data structures per partition (a sketch follows below).
One piece of advice: this seems like a case for DataFrames. If you are on Spark 1.3, take a look; they already handle converting tuples to a schema in a distributed way.
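For illustration, here is a minimal Java sketch of the mapPartitions approach against the Spark 1.x API. It assumes MyTable exposes the append method from the question, a hypothetical Record(String[]) constructor, and that you choose the partition sizing yourself; treat it as a sketch, not a drop-in implementation.
import java.util.Collections;
import java.util.Iterator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

// Choose numPartitions so that each partition holds roughly the n (e.g. 80k) rows you want per MyTable
JavaRDD<MyTable> rdd_tables = sc.textFile("/path/to/hdfs", numPartitions)
    .mapPartitions(new FlatMapFunction<Iterator<String>, MyTable>() {
        public Iterable<MyTable> call(Iterator<String> lines) throws Exception {
            // Build one MyTable per partition instead of one MyTable per line
            MyTable table = new MyTable();
            while (lines.hasNext()) {
                String[] fields = lines.next().split(",");
                table.append(new Record(fields)); // hypothetical Record constructor
            }
            return Collections.singletonList(table);
        }
    });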

Related

Using DataFrame.foreachPartition, processing partitions as data frames

I have a dataframe that's partitioned by col0; there are many rows in the DF per value of col0. I have a database from which I want to fetch batches of data using the values of col0 in each partition, but I can't for the life of me figure out how to use foreachPartition, since it gives me an Iterator[Row].
Here's pseudocode for what I'm wanting to do:
var df = spark.read.parquet(...).repartition(numPartitions, "col0")
df.foreachPartition((part_df : DataFrame) => {
    val values = part_df.select("col0").distinct
    val sql = s"select * from table0 where col0 in (${values})" // or some smarter method :)
    val db_df = spark.read.jdbc(..., table = sql)
    part_df.join(db_df, "col0") // and/or whatever else
})
Any ideas?
I wasn't able to find an elegant solution to this, but I did find an inelegant one.
When you write out to a filesystem, Spark writes a separate file for each partition. You can then use the filesystem API to list those files, and read in and operate on each one individually as a separate dataframe.
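A rough Java sketch of that workaround (the same idea works in Scala); the output path, the part-file prefix, and the follow-up join are assumptions, with df and spark taken from the question:
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// df is already repartitioned by "col0", so each part-file holds complete col0 groups
df.write().parquet("/tmp/by_col0");   // hypothetical output path; one part-file per partition

// List the part-files and process each one as its own DataFrame
FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
for (FileStatus status : fs.listStatus(new Path("/tmp/by_col0"))) {
    String name = status.getPath().getName();
    if (name.startsWith("part-")) {
        Dataset<Row> part = spark.read().parquet(status.getPath().toString());
        // fetch the matching batch from the database using part.select("col0").distinct(),
        // then join it back with part
    }
}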

Converting a dataset into a list of rows is taking too much time

I am calculating TF-IDF, and for that I need to convert my dataset into a list of rows.
My dataset has 4,000,000 records; when I call the collectAsList function on the dataset it takes more than 20 minutes to complete.
My machine has 16 GB of RAM.
Basically I need to work on each individual row to calculate TF-IDF for that particular record.
Please suggest whether there is any other kind of function in Spark to convert a dataset into a list of rows.
I also tried for and foreach loops, but they still take a long time.
Below is my sample code.
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
SQLContext sqlContext = new SQLContext(sc);
SparkSession spark = SparkSession.builder().appName("connection example").getOrCreate();
Dataset<Row> tokenlist = sqlContext.read().format("com.databricks.spark.csv")
        .option("header", "true").option("nullValue", "")
        .load("D:\\AI_MATCHING\\exampleTFIDF.csv");
tokenlist = tokenlist.select("features");
tokenlist.show(false);
List<Row> tokenizedWordsList1 = tokenlist.collectAsList();
/*tokenlist.foreach((ForeachFunction<Row>) individaulRow -> {
    newtest.ItemIDSourceIndex = individaulRow.fieldIndex("ItemIDSource");
    newtest.upcSourceIndex = individaulRow.fieldIndex("upcSource");
    newtest.ManufacturerSourceIndex = individaulRow.fieldIndex("ManufacturerSource");
    newtest.ManufacturerPartNumberSourceIndex = individaulRow.fieldIndex("Manufacturer part NumberSource");
    newtest.PART_NUMBER_SOURCEIndex = individaulRow.fieldIndex("PART_NUMBER_SOURCE");
    newtest.productDescriptionSourceIndex = individaulRow.fieldIndex("productDescriptionSource");
    newtest.HASH_CODE_dummyIndex = individaulRow.fieldIndex("HASH_CODE_dummy");
    newtest.rowIdSourceIndex = individaulRow.fieldIndex("rowIdSource");
    newtest.rawFeaturesIndex = individaulRow.fieldIndex("rawfeatures ");
    newtest.featuresIndex = individaulRow.fieldIndex("features ");
});*/
A) The Spark ML library already does TF-IDF calculations itself; try using those methods (see the sketch below).
B) If you have a large number of rows (collectAsList() will take time), try using SQL methods instead,
such as registering the dataset as a table and querying it with specific conditions.
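As a minimal, hedged sketch of option A: the column names below (productDescriptionSource as the raw text column, plus the output column names) are assumptions based on the commented-out code in the question, and it presumes you load the CSV without dropping the text column first. The whole pipeline runs distributed, so nothing like collectAsList over 4,000,000 rows is needed.
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.IDF;
import org.apache.spark.ml.feature.IDFModel;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Split the assumed raw-text column into words
Tokenizer tokenizer = new Tokenizer()
        .setInputCol("productDescriptionSource").setOutputCol("words");
Dataset<Row> words = tokenizer.transform(tokenlist);

// Term frequencies
HashingTF hashingTF = new HashingTF()
        .setInputCol("words").setOutputCol("rawFeatures");
Dataset<Row> featurized = hashingTF.transform(words);

// Inverse document frequencies, giving TF-IDF vectors
IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("tfidfFeatures");
IDFModel idfModel = idf.fit(featurized);
Dataset<Row> tfidf = idfModel.transform(featurized);

tfidf.select("tfidfFeatures").show(false);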

Filter Partition Before Reading Hive table (Spark)

Currently I'm trying to filter a Hive table by the latest date_processed.
The table is partitioned by:
System
date_processed
Region
The only way I've managed to filter it, is by doing a join query:
query = "select * from contracts_table as a join (select (max(date_processed) as maximum from contract_table as b) on a.date_processed = b.maximum"
This approach is really time consuming, as I have to do the same procedure for 25 tables.
Does anyone know a way to directly read the latest loaded partition of a table in Spark < 1.6?
This is the method I'm using to read.
public static DataFrame loadAndFilter(String query)
{
    return SparkContextSingleton.getHiveContext().sql(query);
}
Many thanks!
A DataFrame with all of the table's partitions can be obtained with:
val partitionsDF = hiveContext.sql("show partitions TABLE_NAME")
The values can then be parsed to get the maximum.
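A hedged Java sketch of that parsing, matching the asker's HiveContext setup; it assumes partition strings of the form system=.../date_processed=.../region=... and that date_processed sorts correctly as a string, so adjust the parsing to your actual partition layout:
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

// Each row of "show partitions" is a string like "system=A/date_processed=2016-10-01/region=EU"
DataFrame partitionsDF = SparkContextSingleton.getHiveContext()
        .sql("show partitions contracts_table");
String maxDate = null;
for (Row r : partitionsDF.collectAsList()) {
    for (String part : r.getString(0).split("/")) {
        if (part.startsWith("date_processed=")) {
            String d = part.substring("date_processed=".length());
            if (maxDate == null || d.compareTo(maxDate) > 0) {
                maxDate = d;
            }
        }
    }
}
// Filtering on the partition column means only that partition is scanned
DataFrame latest = SparkContextSingleton.getHiveContext()
        .sql("select * from contracts_table where date_processed = '" + maxDate + "'");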

How to create multiple RDD rows from a single file record in Apache Spark

I'm struggling with the following logic using Apache Spark. My input file has pipe-delimited rows in the following format:
14586|9297,0.000128664|9298,0.0683921
14587|4673,0.00730174
14588|9233,1.15112e-07|9234,4.80094e-05|9235,1.91492e-05|9236,0.00776722
The first column is a key. There may be one or more columns after that. Each subsequent column has a secondary key and a value, like this: 4673,0.00730174
While reading this file, I want the resulting RDD to have only 3 columns, flattening the extra columns after the first one but retaining the main key, like this:
14586|9297,0.000128664
14586|9298,0.0683921
14587|4673,0.00730174
14588|9233,1.15112e-07
14588|9234,4.80094e-05
14588|9235,1.91492e-05
14588|9236,0.00776722
How can I do that in Scala?
Is this the thing you're looking for?
val sc: SparkContext = ...
val rdd = sc.parallelize(Seq(
  "14586|9297,0.000128664|9298,0.0683921",
  "14587|4673,0.00730174",
  "14588|9233,1.15112e-07|9234,4.80094e-05|9235,1.91492e-05|9236,0.00776722"
)).flatMap { line =>
  val splits = line.split('|')
  val key = splits.head
  val pairs = splits.tail
  pairs.map { pair =>
    s"$key|$pair"
  }
}
rdd.collect().foreach(println)
Output:
14586|9297,0.000128664
14586|9298,0.0683921
14587|4673,0.00730174
14588|9233,1.15112e-07
14588|9234,4.80094e-05
14588|9235,1.91492e-05
14588|9236,0.00776722
Have you considered using flatMap? It allows you to create 0 to n rows from a single row of input. Just parse each line and rebuild one row per value, pairing each value with the primary row key.

PHOENIX SPARK - Load Table as DataFrame

I have created a DataFrame from an HBase table (via Phoenix) which has 500 million rows. From the DataFrame I created an RDD of JavaBeans and use it for joining with data from a file.
Map<String, String> phoenixInfoMap = new HashMap<String, String>();
phoenixInfoMap.put("table", tableName);
phoenixInfoMap.put("zkUrl", zkURL);
DataFrame df = sqlContext.read().format("org.apache.phoenix.spark").options(phoenixInfoMap).load();
JavaRDD<Row> tableRows = df.toJavaRDD();
JavaPairRDD<String, String> dbData = tableRows.mapToPair(
    new PairFunction<Row, String, String>()
    {
        @Override
        public Tuple2<String, String> call(Row row) throws Exception
        {
            return new Tuple2<String, String>(row.getAs("ID"), row.getAs("NAME"));
        }
    });
Now my question: let's say the file has 2 million unique entries matching the table. Is the entire table loaded into memory as an RDD, or will only the matching 2 million records from the table be loaded into memory as an RDD?
Your statement
DataFrame df = sqlContext.read().format("org.apache.phoenix.spark").options(phoenixInfoMap)
.load();
will load the entire table into memory. You have not provided any filter for Phoenix to push down into HBase and thus reduce the number of rows read.
If you join with a non-HBase data source, e.g. a flat file, then all of the records from the HBase table would first need to be read in. The records not matching the secondary data source would not be kept in the new DataFrame, but the initial read would still have happened.
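For illustration, a hedged sketch of supplying such a filter before converting to an RDD; the column name and literal IDs are assumptions, and how much of the predicate actually gets pushed down depends on the Phoenix connector version:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

DataFrame filtered = sqlContext.read()
        .format("org.apache.phoenix.spark")
        .options(phoenixInfoMap)
        .load()
        // The predicate gives Phoenix something to push down to HBase,
        // so far fewer than 500 million rows need to be read
        .filter("ID in ('id1', 'id2', 'id3')");
JavaRDD<Row> filteredRows = filtered.toJavaRDD();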
Update: A potential approach would be to pre-process the file, i.e. extract the IDs you want, and store the results in a new HBase table. Then perform the join directly in HBase via Phoenix, not Spark.
The rationale for that approach is to move the computation to the data: the bulk of the data resides in HBase, so move the small data (the IDs from the file) there instead.
I am not directly familiar with Phoenix beyond the fact that it provides a SQL layer on top of HBase. Presumably it would be capable of doing such a join and storing the result in a separate HBase table? That table could then be loaded into Spark and used in your subsequent computations.
