Pyspark - truncate a dataframe based on index - apache-spark

I want to get the n items in a specific index range of a dataframe.
data = [("Java", "123456"), ("Python", "123456"), ("Go", "123456"), ("Scala", "123456"), ("TypeScript", "123456")]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF()
How can I truncate df so that it keeps only, for example, the 2nd through 4th items?
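No answer is recorded for this question here. One common way to think about it is to attach a positional index to every row and keep the rows whose index falls in the wanted range; on an RDD, zipWithIndex() followed by filter plays that role. The sketch below models the same idea in plain Python, with no Spark session, so keep_range and its 1-based, inclusive bounds are illustrative assumptions rather than a Spark API.

```python
# Plain-Python model of "index every row, then filter on the index".
# keep_range is a hypothetical helper; on an RDD the analogous step would be
# rdd.zipWithIndex().filter(lambda pair: start - 1 <= pair[1] <= end - 1).

def keep_range(rows, start, end):
    """Return the rows whose 1-based position lies in [start, end]."""
    return [row for position, row in enumerate(rows, start=1)
            if start <= position <= end]


data = [("Java", "123456"), ("Python", "123456"), ("Go", "123456"),
        ("Scala", "123456"), ("TypeScript", "123456")]

kept = keep_range(data, 2, 4)
print([name for name, _ in kept])
```

Note that zipWithIndex numbers rows from 0, which is why the comment shifts both bounds by one.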

Related

Dataframe Comparison in Spark: Scala

I have two dataframes in Spark/Scala that share some columns, such as salary, bonus and increment.
I need to compare these columns across the two dataframes and write the result to a new dataframe. If the salary is 3000 in the first dataframe and 5000 in the second, the new dataframe should hold 5000-3000=2000 as salary. If the salary is 5000 in the first dataframe and 3000 in the second, it should hold 5000+3000=8000. If the salary is the same in both dataframes, the value from the second dataframe should be inserted.
val columns = df1.schema.fields.map(_.name)
val salaryDifferences = columns.map(col => df1.select(col).except(df2.select(col)))
salaryDifferences.map(diff => {if(diff.count > 0) diff.show})
I tried the code above, but it only shows the columns and values where there is a difference. I also need to check whether the difference is negative or positive and apply the logic accordingly. Can anyone give me a hint on how to implement this and insert the records into a third dataframe?
Join the dataframes and use nested when and otherwise clauses.
See also the comments in the code.
import org.apache.spark.sql.functions._
object SalaryDiff {
  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess // the answerer's own helper; any SparkSession works here
    import spark.implicits._
    // Salaries are numeric so that < and > compare values rather than strings
    val df1 = List(("1", 5000), ("2", 3000), ("3", 5000)).toDF("id", "salary") // First dataframe
    val df2 = List(("1", 3000), ("2", 5000), ("3", 5000)).toDF("id", "salary") // Second dataframe
    val df3 = df1 // This becomes your third dataframe
      .join(
        df2,
        df1("id") === df2("id") // Join both dataframes on the id column
      ).withColumn("finalSalary",
        when(df1("salary") < df2("salary"), df2("salary") - df1("salary")) // 5000-3000=2000 check
          .otherwise(
            when(df1("salary") > df2("salary"), df1("salary") + df2("salary")) // 5000+3000=8000 check
              .otherwise(df2("salary")))) // equal: insert from second dataframe
      .drop(df1("salary"))
      .drop(df2("salary"))
      .withColumnRenamed("finalSalary", "salary")
    // show() returns Unit, so call it separately to keep df3 usable as a dataframe
    df3.show()
  }
}
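The nested when/otherwise chain boils down to a three-way rule on each pair of salaries. As a minimal sketch, assuming plain Python dictionaries keyed by id stand in for the two dataframes, the rule can be checked without Spark; final_salary is a hypothetical name used only for this illustration.

```python
# Plain-Python model of the nested when/otherwise rule from the answer.
# final_salary is a hypothetical helper, not part of any Spark API.

def final_salary(first, second):
    """Apply the rule from the question to one pair of salaries."""
    if first < second:
        return second - first   # e.g. 5000 - 3000 = 2000
    if first > second:
        return first + second   # e.g. 5000 + 3000 = 8000
    return second               # equal: take the second dataframe's value


# Same sample values as in the Scala answer, keyed by id.
df1 = {"1": 5000, "2": 3000, "3": 5000}
df2 = {"1": 3000, "2": 5000, "3": 5000}

# Only ids present on both sides survive, as with an inner join.
result = {key: final_salary(df1[key], df2[key]) for key in df1 if key in df2}
print(result)
```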

Get all records from the nth bucket in Hive SQL

How can I get all the records from the nth bucket in Hive? Something like:
Select * from bucketTable from bucket 9;
You can achieve this in different ways.
Approach-1: Get the table's storage location from desc formatted <db>.<tab_name>, then read the 9th bucket file directly from the HDFS filesystem.
Approach-2: Use input_file_name(), then keep only the 9th bucket's data by filtering on the file name.
Example:
Approach-1:
Scala:
val df = spark.sql("desc formatted <db>.<tab_name>")
//get table location in hdfs path
val loc_hdfs = df.filter('col_name === "Location").select("data_type").collect.map(x => x(0)).mkString
//based on your table format change the read format
val ninth_buk = spark.read.orc(s"${loc_hdfs}/000008_0*")
//display the data
ninth_buk.show()
Pyspark:
from pyspark.sql.functions import col
df = spark.sql("desc formatted <db>.<tab_name>")
# get table location in hdfs path
loc_hdfs = df.filter(col("col_name") == "Location").select("data_type").collect()[0]["data_type"]
# based on your table format change the read format
ninth_buk = spark.read.orc(loc_hdfs + "/000008_0*")
ninth_buk.show()
Approach-2:
Scala:
val df = spark.read.table("<db>.<tab_name>")
//add input_file_name
val df1 = df.withColumn("filename",input_file_name())
//filter only the 9th bucket filename and select only required columns
val ninth_buk = df1.filter('filename.contains("000008_0")).select(df.columns.head,df.columns.tail:_*)
ninth_buk.show()
Pyspark:
from pyspark.sql.functions import col, input_file_name
df = spark.read.table("<db>.<tab_name>")
df1 = df.withColumn("filename",input_file_name())
ninth_buk = df1.filter(col("filename").contains("000008_0")).select(*df.columns)
ninth_buk.show()
Approach-2 is not recommended for huge datasets, because it has to filter through the whole dataframe.
In Hive:
set hive.support.quoted.identifiers=none;
select `(fn)?+.+` from (
select *,input__file__name fn from table_name)e
where e.fn like '%000008_0%';
If it is an ORC table:
SELECT * FROM orc.<bucket_HDFS_path>
You can also use TABLESAMPLE:
select * from bucketing_table tablesample(bucket n out of y on clustered_criteria_column);
where bucketing_table is the name of your bucketed table
n => nth bucket
y => total no. of buckets
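The file-based approaches above rely on the 9th bucket living in a file whose name starts with 000008_0, that is, the bucket number minus one, zero-padded to six digits. The sketch below turns that convention into a small helper and applies it to (row, file path) pairs in plain Python. The helper names and the sample paths are made up for illustration, and the naming convention is taken from the example in the text; files written by other engines may be named differently.

```python
import posixpath

# bucket_file_prefix and rows_from_bucket are hypothetical helpers; the
# "zero-padded (n - 1) followed by _0" convention comes from the example above.

def bucket_file_prefix(n):
    """File-name prefix for the nth bucket (1-based), e.g. 9 -> '000008_0'."""
    if n < 1:
        raise ValueError("bucket numbers start at 1")
    return f"{n - 1:06d}_0"


def rows_from_bucket(rows_with_file, n):
    """Keep the rows whose source file belongs to the nth bucket."""
    prefix = bucket_file_prefix(n)
    return [row for row, path in rows_with_file
            if posixpath.basename(path).startswith(prefix)]


# Made-up (row, input file) pairs, as input_file_name() would attach them.
sample = [
    ({"id": 1}, "hdfs://nn/warehouse/db.db/tab/000008_0"),
    ({"id": 2}, "hdfs://nn/warehouse/db.db/tab/000000_0"),
    ({"id": 3}, "hdfs://nn/warehouse/db.db/tab/000008_0_copy_1"),
]

print(bucket_file_prefix(9))
print(rows_from_bucket(sample, 9))
```

Matching on the base name rather than on the whole path avoids false hits when a directory name happens to contain the same digits.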

Taking value from one dataframe and passing that value into loop of SqlContext

I am trying to do something like this:
I have a dataframe with a single column of IDs, called ID_LIST. I would like to loop through ID_LIST with foreach, pass each ID into a Spark SQL call, and return the results in another dataframe.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val id_list = sqlContext.sql("select distinct id from item_orc")
id_list.registerTempTable("ID_LIST")
id_list.foreach(i => println(i))
id_list println output:
[123]
[234]
[345]
[456]
Trying to now loop through ID_LIST and run a Spark SQL call for each:
id_list.foreach(i => {
  val items = sqlContext.sql("select * from another_items_orc where id = " + i)
  items.foreach(println)
})
First, I am not sure how to pull the individual value out. I get this error:
org.apache.spark.sql.AnalysisException: cannot recognize input near '[' '123' ']' in expression specification; line 1 pos 61
Second, how can I alter my code to output the result to a dataframe that I can use later?
Thanks, any help is appreciated!
Answer To First Question
When you call foreach, Spark converts the dataframe into an RDD of type Row. When you then println each element, it prints the whole Row, the first one being "[123]". The square brackets are how a Row is displayed around its elements. The elements of a Row are accessed by position. If you want to print just 123, 234, etc., try
id_list.foreach(i => println(i(0)))
Or you can use native primitive access
id_list.foreach(i => println(i.getString(0))) //For Strings
Read the Spark documentation on Row. Also note that sqlContext lives on the driver and cannot be used inside a foreach that runs on the executors, so collect the ids to the driver first. This transforms your code to:
id_list.collect().foreach(i => {
  val items = sqlContext.sql("select * from another_items_orc where id = " + i.getString(0))
  items.foreach(r => println(r.getString(0)))
})
Answer to Second Question
I have a sneaking suspicion about what you actually are trying to do but I'll answer your question as I have interpreted it.
Create an empty dataframe and, in a loop over the distinct ids from the first dataframe, union each query result onto it.
import org.apache.spark.sql.types.{StructType, StringType}
import org.apache.spark.sql.Row
// Create the empty dataframe. The schema should reflect the columns
// of the dataframe that you will be adding to it.
val schema = new StructType()
  .add("col1", StringType, true)
var df = sqlContext.createDataFrame(sc.emptyRDD[Row], schema)
// Collect the ids to the driver, then loop over them, select, and union to the empty df
id_list.collect().foreach { i =>
  val items = sqlContext.sql("select * from another_items_orc where id = " + i.getString(0))
  df = df.union(items)
}
df.show()
You now have the dataframe df, which you can use later.
NOTE: It would probably be easier to join the two dataframes on the matching column.
val another_items = sqlContext.table("another_items_orc")
val bar = another_items.join(id_list, Seq("id"), "inner")
bar.show()
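To see why the join can replace the loop, it may help to model both in plain Python: querying once per id and concatenating the results selects the same rows as a single inner join on id, provided the ids are distinct (as they are after select distinct). The function names and sample rows below are assumptions made for this illustration only.

```python
# Plain-Python model: "one query per id, then union" versus "one inner join".
# loop_and_union and inner_join_on_id are hypothetical helpers.

def loop_and_union(ids, items):
    """Run one filter per id and append the matches, like the foreach loop."""
    result = []
    for wanted in ids:
        result.extend(row for row in items if row["id"] == wanted)
    return result


def inner_join_on_id(ids, items):
    """Keep every item whose id appears in the id list, in a single pass."""
    wanted = set(ids)
    return [row for row in items if row["id"] in wanted]


id_list = [123, 234, 345, 456]           # distinct ids, as in the question
another_items = [                        # made-up rows
    {"id": 123, "name": "a"},
    {"id": 999, "name": "b"},
    {"id": 345, "name": "c"},
    {"id": 123, "name": "d"},
]

looped = loop_and_union(id_list, another_items)
joined = inner_join_on_id(id_list, another_items)


def by_key(row):
    return (row["id"], row["name"])


# Row order differs between the two, so compare after sorting.
print(sorted(looped, key=by_key) == sorted(joined, key=by_key))
```

The loop scans the items once per id, while the join scans them once in total, which is the practical reason to prefer it.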

What is the most efficient way to repartition a data frame with the following constraints?

I have a time series dataframe stored in a single partition:
+-------------+------+----+-------+
| TimeStamp| X| Y| Z|
+-------------+------+----+-------+
|1448949705421|-35888|4969|3491754|
|1448949705423|-35081|2795|3489177|
|1448949705425|-35976|5830|3488618|
|1448949705426|-36927|4729|3491807|
|1448949705428|-36416|6246|3490364|
|1448949705429|-36073|7067|3491556|
|1448949705431|-38553|3714|3489545|
|1448949705433|-39008|3034|3490230|
|1448949705434|-35295|4005|3489426|
|1448949705436|-36397|5076|3490941|
+-------------+------+----+-------+
I want to repartition this dataframe into 10 partitions, such that the first partition has roughly the first 1/10 rows, the second partition has roughly the second 1/10 rows, and so on.
One way I can think of is:
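import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import sqlContext.implicits._
val df: DataFrame = ???
// add index to df
val rdd = df.rdd.zipWithIndex().map(indexedRow =>
  Row.fromSeq(indexedRow._2 +: indexedRow._1.toSeq))
val newstructure = StructType(Seq(StructField("rn", LongType, true)).++(df.schema.fields))
val dfWithIndex = sqlContext.createDataFrame(rdd, newstructure)
// create a group number using the index; dividing by a Double keeps the groups
// within 0..9 even when the row count is not a multiple of 10
val udfToInt = udf[Int, Double](_.toInt)
val dfWithGrp = dfWithIndex.withColumn("group", udfToInt($"rn" / (df.count / 10.0)))
// repartition by the "group" column
val partitionedDF = dfWithGrp.repartition(10, $"group")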
Another way I can think of is by using a partitioner:
import org.apache.spark.HashPartitioner
// after creating a group number
val grpIndex = dfWithGrp.schema.fields.size - 1
val partitionedRDD = dfWithGrp.rdd.map(r => (r.getInt(grpIndex), r))
  .partitionBy(new HashPartitioner(10))
  .values
Neither approach seems efficient, because both first add an index and then derive a group number from it. Is there a way to do this without adding an extra group column?
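Whichever mechanism does the repartitioning, the core of both attempts is the mapping from a row's index to one of 10 contiguous groups. As a minimal sketch in plain Python, assuming a 0-based index such as the one zipWithIndex produces, integer arithmetic gives groups that differ in size by at most one row; group_for_index is a hypothetical name for this illustration.

```python
from collections import Counter

# group_for_index is a hypothetical helper that maps a 0-based row index to a
# contiguous group number in 0 .. num_groups - 1.

def group_for_index(index, total_rows, num_groups=10):
    """Rows 0..total_rows-1 are split into num_groups contiguous slices."""
    return index * num_groups // total_rows


# 25 rows do not divide evenly by 10, which is the interesting case.
groups = [group_for_index(i, 25) for i in range(25)]
sizes = Counter(groups)

print(groups)
print(sorted(sizes.values()))
```

Multiplying before dividing avoids the rounding problem of first computing total_rows / 10 as an integer, which would spill rows into an eleventh group or beyond whenever the count is not a multiple of 10.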

Randomly shuffle column in Spark RDD or dataframe

Is there any way I can shuffle a column of an RDD or dataframe so that the entries in that column appear in random order? I'm not sure which APIs I could use to accomplish such a task.
What about selecting the column to shuffle, orderBy(rand) the column and zip it by index to the existing dataframe?
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, rand}
import org.apache.spark.sql.types.{LongType, StructField, StructType}
def addIndex(df: DataFrame) = spark.createDataFrame(
  // Add index
  df.rdd.zipWithIndex.map { case (r, i) => Row.fromSeq(r.toSeq :+ i) },
  // Create schema
  StructType(df.schema.fields :+ StructField("_index", LongType, false))
)
case class Entry(name: String, salary: Double)
val r1 = Entry("Max", 2001.21)
val r2 = Entry("Zhang", 3111.32)
val r3 = Entry("Bob", 1919.21)
val r4 = Entry("Paul", 3001.5)
val df = addIndex(spark.createDataFrame(Seq(r1, r2, r3, r4)))
// Shuffle only the salary column, then index the shuffled values
val df_shuffled = addIndex(df
  .select(col("salary").as("salary_shuffled"))
  .orderBy(rand()))
// Join original and shuffled rows back together on the index
df.join(df_shuffled, Seq("_index"))
  .drop("_index")
  .show(false)
+-----+-------+---------------+
|name |salary |salary_shuffled|
+-----+-------+---------------+
|Max |2001.21|3001.5 |
|Zhang|3111.32|3111.32 |
|Paul |3001.5 |2001.21 |
|Bob |1919.21|1919.21 |
+-----+-------+---------------+
If you don't need a global shuffle across your data, you can shuffle within partitions using the mapPartitions method.
rdd.mapPartitions(Random.shuffle(_));
For a PairRDD (RDDs of type RDD[(K, V)]), if you are interested in shuffling the key-value mappings (mapping an arbitrary key to an arbitrary value):
pairRDD.mapPartitions(iterator => {
val (keySequence, valueSequence) = iterator.toSeq.unzip
val shuffledValueSequence = Random.shuffle(valueSequence)
keySequence.zip(shuffledValueSequence).toIterator
}, true)
The boolean flag at the end denotes that partitioning is preserved (keys are not changed) for this operation so that downstream operations e.g. reduceByKey can be optimized (avoid shuffles).
While one cannot shuffle a single column directly, it is possible to permute the records in an RDD via RandomRDDs. https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/random/RandomRDDs.html
A potential approach to having only a single column permuted might be:
use mapPartitions to do some setup/teardown on each worker task
pull all of the records of the partition into memory, i.e. iterator.toList; make sure you have many small partitions of data to avoid an OutOfMemoryError
within the mapPartitions, create an in-memory sorted list
for the desired column, copy its values into a separate collection and randomly sample that collection to replace each record's entry
using the Row object, rewrite every record as it was originally, except for the given column
return the result as list.toIterator from the mapPartitions
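The steps above can be modelled in plain Python, with each partition represented as a list of dictionaries. This is only a sketch of the logic that a mapPartitions function would carry out: the sampling step is implemented as a shuffle of the column's values (sampling without replacement), and shuffle_column_in_partition as well as the sample data are assumptions made for the illustration.

```python
import random

# shuffle_column_in_partition is a hypothetical helper that models the body of
# a mapPartitions function; each "partition" is a plain list of dict rows.

def shuffle_column_in_partition(rows, column, rng):
    """Permute one column's values among the rows of a single partition."""
    rows = list(rows)                          # pull the partition into memory
    values = [row[column] for row in rows]     # copy the column's values aside
    rng.shuffle(values)                        # permute them in place
    # Rewrite every row unchanged except for the chosen column.
    return [{**row, column: value} for row, value in zip(rows, values)]


partitions = [
    [{"name": "Max", "salary": 2001.21}, {"name": "Zhang", "salary": 3111.32}],
    [{"name": "Bob", "salary": 1919.21}, {"name": "Paul", "salary": 3001.5}],
]

rng = random.Random(0)  # seeded only to make the sketch repeatable
shuffled = [shuffle_column_in_partition(part, "salary", rng) for part in partitions]
print(shuffled)
```

Because values never leave their partition, this is a local shuffle only; a value from the first partition can never end up in the second.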
You can add an additional column of randomly generated values and then sort the records by this column. In this way, you randomly shuffle your target column.
This way you do not need to hold all the data in memory, which can easily cause an OOM. Spark takes care of the sorting and the memory limits by spilling to disk if necessary.
If you don't want the extra column, you can drop it after sorting.
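As a minimal sketch of this idea in plain Python, each value is paired with a random key, the pairs are sorted by that key, and the key is then discarded. In Spark the corresponding steps would be adding a rand() column, ordering by it and dropping it; shuffle_by_random_key is a hypothetical name used only here.

```python
import random

# shuffle_by_random_key is a hypothetical helper modelling
# "add a random column, sort by it, drop it".

def shuffle_by_random_key(values, seed=None):
    """Return the values in the order given by freshly drawn random keys."""
    rng = random.Random(seed)
    keyed = [(rng.random(), value) for value in values]   # add the random column
    keyed.sort(key=lambda pair: pair[0])                  # sort by that column
    return [value for _, value in keyed]                  # drop the column again


salaries = [2001.21, 3111.32, 1919.21, 3001.5]
shuffled = shuffle_by_random_key(salaries, seed=42)
print(shuffled)
```

The result is a permutation of the input: no value is lost or duplicated, only the order changes.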
In case someone is looking for a PySpark equivalent of Sascha Vetter's post, you can find it below:
import numpy as np
from pyspark.sql.functions import rand
from pyspark.sql.types import LongType, StructField, StructType
def add_index_to_df(df):
    # append each row's position as a trailing "index" field; tuples keep the original column order
    rows_with_index = df.rdd.zipWithIndex().map(lambda x: tuple(x[0]) + (x[1],))
    new_schema = StructType(df.schema.fields + [StructField("index", LongType(), True)])
    return spark.createDataFrame(rows_with_index, new_schema)
def shuffle_single_column(df, column_name):
    df_cols = df.columns
    # select the desired column and shuffle it (i.e. order it by column with random numbers)
    shuffled_col = df.select(column_name).orderBy(rand())
    # add explicit index to the shuffled column
    shuffled_col_index = add_index_to_df(shuffled_col)
    # add explicit index to the original dataframe
    df_index = add_index_to_df(df)
    # drop the desired column from df, join it with the shuffled column on created index and finally drop the index column
    df_shuffled = df_index.drop(column_name).join(shuffled_col_index, "index").drop("index")
    # reorder columns so that the shuffled column comes back to its initial position instead of the last position
    df_shuffled = df_shuffled.select(df_cols)
    return df_shuffled
# initialize random array
z = np.random.randint(20, size=(10, 3)).tolist()
# create the pyspark dataframe
example_df = spark.sparkContext.parallelize(z).toDF(["a", "b", "c"])
# shuffle one column of the dataframe
example_df_shuffled = shuffle_single_column(df=example_df, column_name="a")
