What is the most effective way to get elements of an RDD in Spark?

I need to get the values of two columns of a DataFrame converted to an RDD.
The first solution I thought of:
First convert the DataFrame to a list of Rows with collect();
then, for each element of the list, get the values with Row[i].getInt(column_index).
This solution works fine with small and medium-sized data, but with large data I run out of memory.
My temporary workaround is to create a new DataFrame that contains only the two columns instead of all columns, and then apply the solution above; this reduces most of the needed memory.
The current implementation looks like this:
Row[] rows = sparkDataFrame.collect();
for (int i = 0; i < rows.length; i++) { // about 50 million rows
    int yTrue = rows[i].getInt(0);
    int yPredict = rows[i].getInt(1);
}
Could you help me improve my solution, or suggest other solutions?
Thanks!
P.S.: I'm new to Spark!

First convert your big RDD into a DataFrame; then you can directly select whichever columns you require.
// Create the DataFrame
DataFrame df = sqlContext.jsonFile("examples/src/main/resources/people.json");
// Select only the "name" and "age" columns
df.select(df.col("name"), df.col("age")).show();
For more detail you can follow this link.
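Since collecting 50 million rows to the driver is what causes the out-of-memory error, here is a minimal Scala sketch of the same idea. The column names "yTrue" and "yPredict" are assumptions, and the aggregation is only an example; the point is to select just the two columns and keep the computation on the cluster, collecting only the final result.
val pairs = sparkDataFrame
  .select("yTrue", "yPredict")   // hypothetical column names
  .rdd
  .map(row => (row.getInt(0), row.getInt(1)))

// Example: count matching label/prediction pairs without collecting the rows
val correct = pairs.filter { case (yTrue, yPredict) => yTrue == yPredict }.count()

// If the rows really are needed on the driver, stream them one partition at a
// time instead of materializing all of them at once:
val it = sparkDataFrame.select("yTrue", "yPredict").rdd.toLocalIterator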

Related

How can I repartition an RDD by key and then pack it into shards?

I have many files containing millions of rows in the format:
id, created_date, some_value_a, some_value_b, some_value_c
This way of repartitioning was super slow and created over a million small ~500 B files:
rdd_df = rdd.toDF(["id", "created_time", "a", "b", "c"])
rdd_df.write.partitionBy("id").csv("output")
I would like the output files to each contain about 10000 unique IDs and all of their rows.
How could I achieve something like this?
You can repartition by adding a random salt key:
import org.apache.spark.sql.functions.{col, max, rand}

val totRows = rdd_df.count()
// largest row count held by any single id
val maxRowsForAnId = rdd_df.groupBy("id").count().agg(max("count")).first().getLong(0)
val numParts1 = totRows / maxRowsForAnId
// partitions needed for roughly 10000 unique ids per file
val totalUniqueIds = rdd_df.select("id").distinct().count()
val numParts2 = totalUniqueIds / 10000
val numPart = math.max(numParts1.min(numParts2), 1L).toInt
rdd_df
  .repartition(numPart, col("id"), rand())
  .write
  .csv("output")
The main concept is that each partition will be written as one file, so you have to bring the required rows into one partition via repartition(numPart, col("id"), rand()).
The first few operations just calculate how many partitions we need to get roughly 10000 ids per file:
Calculate the partition count assuming 10000 ids per partition.
Corner case: a single id may have so many rows that it does not fit in the partition size calculated above, so we also calculate the number of partitions based on the largest per-id row count.
Take the minimum of the two partition counts.
rand() is necessary so that multiple ids can land in a single partition.
NOTE: Although this gives you larger files, each file is guaranteed to contain a distinct set of unique ids. But it involves a shuffle, so the operation may actually be slower than the code mentioned in the question.
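If the goal is strictly "about 10000 unique ids per file", another option is to derive an explicit shard column from a hash of the id and partition the output by it. This is a hedged sketch assuming the Spark 2.x API; the numShards value, the "shard" column name, and the output path are all hypothetical, not part of the answer above.
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

val numShards = 100  // assumed: roughly totalUniqueIds / 10000
val sharded = rdd_df.withColumn("shard", pmod(hash(col("id")), lit(numShards)))

sharded
  .repartition(col("shard"))   // all rows of one shard value land in one task
  .write
  .partitionBy("shard")        // one output directory (and file) per shard
  .csv("output_sharded")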
You would need something like this:
rdd_df.repartition(*number of partitions you want*).write.csv("output", header = True)
or honestly, just let the job decide the number of partitions instead of repartitioning. In theory, that should be faster:
rdd_df.write.csv("output", header = True)
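If the concern with letting Spark decide is ending up with very large files, a related knob (available since Spark 2.2, which is an assumption about the version in use) caps the number of rows written per output file. Note that it limits rows per file, not unique ids per file:
spark.conf.set("spark.sql.files.maxRecordsPerFile", 500000)  // hypothetical cap
rdd_df.write.csv("output")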

Avoid lazy evaluation of code in Spark without cache

How can I avoid lazy evaluation in Spark? I have a data frame which needs to be populated at once, since I need to filter the data based on a random number generated for each row: if the generated number is >= 0.5 the row goes to dataA, and if it is < 0.5 it goes to dataB.
val randomNumberDF = df.withColumn("num", Math.random())
val dataA = randomNumberDF.filter(col("num") >= 0.5)
val dataB = randomNumberDF.filter(col("num") < 0.5)
Since Spark does lazy evaluation, the filters do not see a consistent set of values for "num", so there is no reliable split of rows into dataA and dataB (sometimes the same row is present in both dataA and dataB).
How can I avoid this re-computation of the "num" column? I have tried using cache, which worked, but given that my data size is going to be big, I am ruling out that solution.
I have also tried using other actions on the randomNumberDF, like :
count
rdd.count
show
first
These didn't solve the problem.
Please suggest something other than cache/persist or writing the data to HDFS and reading it back.
References I have already checked :
How to force spark to avoid Dataset re-computation?
How to force Spark to only execute a transformation once?
How to force Spark to evaluate DataFrame operations inline
If all you're looking for is a way to ensure that the same values are in randomNumberDF.num, then you can generate random numbers with a seed (using org.apache.spark.sql.functions.rand()):
The example below uses 112 as the seed value:
val randomNumberDF = df.withColumn("num", rand(112))
val dataA = randomNumberDF.filter(col("num") >= 0.5)
val dataB = randomNumberDF.filter(col("num") < 0.5)
That will ensure that the values in num are the same across the multiple evaluations of randomNumberDF.
Besides using org.apache.spark.sql.functions.rand with a given seed, you could use eager checkpointing:
df.checkpoint(true)
This will materialize the DataFrame to disk.
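As a minimal sketch of the checkpoint route (assuming Spark 2.1+ and a SparkSession named spark): Dataset.checkpoint requires a checkpoint directory to be configured first, otherwise it throws. The directory path below is an assumption; use whatever reliable storage your cluster has.
import org.apache.spark.sql.functions.{col, rand}

spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")  // assumed path

// Eager checkpoint: "num" is computed and materialized once, so both filters
// below read the stored values instead of recomputing the random column
val randomNumberDF = df.withColumn("num", rand()).checkpoint(true)
val dataA = randomNumberDF.filter(col("num") >= 0.5)
val dataB = randomNumberDF.filter(col("num") < 0.5)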

Spark - Performing union of Dataframes inside a for loop starting from empty DataFrame

I have a DataFrame with a column called "generationId" and other fields. The field "generationId" takes integer values from 1 to N (the upper bound N is known and small, between 10 and 15), and I want to process the DataFrame in the following way (pseudo code):
results = emptyDataFrame <=== how do I do this ?
for (i <- 0 until getN(df)) {
val input = df.filter($"generationId" === i)
results.union(getModel(i).transform(input))
}
Here getN(df) gives the N for that data frame based on some criteria. In the loop, input is filtered based on matching against "i" and then fed to some model (some internal library) which transforms the input by adding 3 more columns to it.
Ultimately I would like to get the union of all those transformed data frames, so that I have all columns of the original data frame plus the 3 additional columns added by the model for each row. I am not able to figure out how to initialize results and union the results in each iteration. I do know the exact schema of the result ahead of time. So I did
val newSchema = ...
but I am not sure how to pass that to the emptyRDD function, build an empty DataFrame from it, and use it inside the loop.
Also, if there is a more efficient way to do this inside a map operation, please suggest it.
You can do something like this:
(0 until getN(df))
  .map(i => {
    val input = df.filter($"generationId" === i)
    getModel(i).transform(input)
  })
  .reduce(_ union _)
That way you don't need to worry about the empty DataFrame at all.
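If you do want the empty-DataFrame-plus-loop version from the question anyway, here is one sketch. It assumes a SparkSession named spark and that newSchema is the StructType you already know; note that union returns a new DataFrame, so the result has to be reassigned (the question's loop discarded it):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
import spark.implicits._

val newSchema: StructType = ???  // the schema you already know
var results = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], newSchema)
for (i <- 0 until getN(df)) {
  val input = df.filter($"generationId" === i)
  results = results.union(getModel(i).transform(input))  // union returns a new DataFrame
}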

Find holes in a DateTime Spark RDD

I have a question about ordering a DateTime RDD, finding the holes contained in it, and filling them. For example, suppose we have these records in my database:
20160410,"info1"
20160409,"info2"
20160407,"info3"
20160404,"info4"
Basically, for my purposes I also need the holes, because they impact my calculations, so I would like something like this at the end:
Some(20160410,"info1")
Some(20160409,"info2")
None
Some(20160407,"info3")
None
None
Some(20160404,"info4")
What is the best strategy to do that?
This is a slightly incomplete code excerpt:
val records = bdao // RDD[(String, List[RecordPO])]
  .findRecords
  .filter(_.getRecDate >= startDate)
  .filter(_.getRecDate < endDate)
  .keyBy(_.getId)
  .aggregateByKey(List[RecordPO]())((list, value) => value +: list, _ ++ _)
...
/* transformations */
...
val finalRecords = ... // RDD[(String, List[Option[RecordPO]])]
Thanks in advance
You will need to create a DataFrame of all the dates you want to see in the resulting dataset (for example, all dates from 20160404 to 20160410). Then perform a left outer join of this dataset with your records, and you will get None (null) where you expect the holes.
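A minimal Scala sketch of that idea, assuming a SparkSession named spark and a simplified recordsDF with a yyyyMMdd string column "recDate" and an "info" column (both names are hypothetical, standing in for the real RecordPO fields):
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.col
import spark.implicits._

val fmt = DateTimeFormatter.ofPattern("yyyyMMdd")
val start = LocalDate.parse("20160404", fmt)
val end   = LocalDate.parse("20160410", fmt)

// one-column DataFrame with every calendar date in the range
val allDates = Iterator.iterate(start)(_.plusDays(1))
  .takeWhile(!_.isAfter(end))
  .map(_.format(fmt))
  .toSeq
  .toDF("recDate")

// dates with no matching record come back with null info: those are the holes
val withHoles = allDates
  .join(recordsDF, Seq("recDate"), "left_outer")
  .orderBy(col("recDate").desc)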

How to process tab-separated files in Spark?

I have a file which is tab-separated. The third column should be my key and the entire record should be my value (as per the MapReduce concept).
val cefFile = sc.textFile("C:\\text1.txt")
val cefDim1 = cefFile.filter { line => line.startsWith("1") }
val joinedRDD = cefFile.map(x => x.split("\\t"))
joinedRDD.first().foreach { println }
I am able to get the value of the first column but not the third. Can anyone suggest how I could accomplish this?
After you've done the split with x.split("\\t"), your RDD (which in your example you called joinedRDD, but I'm going to call it parsedRDD since we haven't joined it with anything yet) is going to be an RDD of arrays. We can turn this into an RDD of key/value tuples with parsedRDD.map(r => (r(2), r)). That being said, you aren't limited to just map and reduce operations in Spark, so it's possible that another data structure would be a better fit. Also, for tab-separated files you could use spark-csv along with Spark DataFrames, if that is a good fit for the eventual problem you are looking to solve.
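A minimal sketch of that suggestion, reusing the file path from the question and assuming every useful line has at least three tab-separated fields:
val cefFile = sc.textFile("C:\\text1.txt")

val parsedRDD = cefFile.map(_.split("\\t", -1))   // -1 keeps trailing empty fields
val keyedRDD  = parsedRDD
  .filter(_.length >= 3)                          // skip short or malformed lines
  .map(fields => (fields(2), fields))             // third column as key, whole record as value

keyedRDD.take(5).foreach { case (key, fields) =>
  println(s"$key -> ${fields.mkString("\t")}")
}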
