I'm working with pyspark data frames (pyspark.sql.dataframe.DataFrame).
The data frame consists of ~20,000 rows and I want to extract only 1 in 20 rows into a new data frame, which then has ~1,000 rows.
In Python/pandas this can easily be done with df_new = df_old.iloc[::20, :].
How can this be done in pyspark?
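Since Spark rows have no built-in position, one common approach is to attach an index with zipWithIndex and keep every 20th row. A minimal sketch, assuming the dataframe's current order is acceptable and that spark / df_old refer to your SparkSession and original frame:

# a minimal sketch, assuming the dataframe's current order is acceptable
df_new = spark.createDataFrame(
    df_old.rdd.zipWithIndex()                        # attach a 0-based positional index
          .filter(lambda pair: pair[1] % 20 == 0)    # keep every 20th row
          .map(lambda pair: pair[0]),                # drop the index again
    df_old.schema)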
I'm trying to drop duplicates based on column1 and select the row with the max value in column2. Column2 holds a year (2019, 2020, etc.) as its value and is of type String. The solution I have is to convert column2 to an integer and select the max value.
Dataset<Row> ds; // The dataset with column1, column2 (year), column3, etc.
Dataset<Row> newDs = ds.withColumn("column2Int", col("column2").cast(DataTypes.IntegerType));
newDs = newDs.groupBy("column1").max("column2Int"); // drops all other columns
This approach drops all the other columns in the original dataset 'ds' when I do the group by, so I have to join 'ds' and 'newDs' to get the original columns back. Also, casting the String column to Integer feels like an inefficient workaround.
Is it possible to drop the duplicates and keep the row with the larger string value directly from the original dataset?
This is a classic de-duplication problem and you'll need a Window + rank + filter combo for it.
I'm not very familiar with the Java syntax, but the sample code should look something like this:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;

Dataset<Row> df = ???;

// rank rows within each column1 partition by the year (as int), highest first
WindowSpec windowSpec = Window.partitionBy("column1").orderBy(functions.desc("column2Int"));

Dataset<Row> result =
    df.withColumn("column2Int", functions.col("column2").cast(DataTypes.IntegerType))
      .withColumn("rank", functions.rank().over(windowSpec))
      .where("rank == 1")
      .drop("rank");

result.show(false);
Overview of what happens:
Add the cast integer column to the df so it can be used for sorting.
Subsections/windows (partitions) are formed in your dataset based on the value of column1.
Within each of these subsections/windows/partitions, the rows are sorted on the column cast to int, in descending order since you want the max.
Ranks (like row numbers) are assigned to the rows in each partition/window created.
All rows where the rank is 1 are kept (the max value, since the ordering was descending).
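For reference, a sketch of the same approach in PySpark (assuming the same column names) would look roughly like:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# rank rows within each column1 partition by the year cast to int, highest first
w = Window.partitionBy("column1").orderBy(F.col("column2").cast("int").desc())

result = (df
          .withColumn("rank", F.rank().over(w))
          .where(F.col("rank") == 1)
          .drop("rank"))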
I am trying to read the first row from a file and then filter that from the dataframe.
I am using take(1) to read the first row. I then want to filter this from the dataframe (it could appear multiple times within the dataset).
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext(appName = "solution01")
spark = SparkSession(sc)
df1 = spark.read.csv("/Users/abc/test.csv")
header = df1.take(1)
print(header)
final_df = df1.filter(lambda x: x != header)
final_df.show()
However, I get the following error: TypeError: condition should be string or Column.
I was trying to follow the answer from Nicky here: How to skip more then one lines of header in RDD in Spark.
The data looks like this (but will have multiple columns that I need to do the same for):
customer_id
1
2
3
customer_id
4
customer_id
5
I want the result as:
1
2
3
4
5
take on a dataframe returns a list(Row), so we need to get the value with [0][0].
Then, in the filter clause, use the column name and keep only the rows that are not equal to the header:
from pyspark.sql.functions import col

header = df1.take(1)[0][0]
# keep only the rows that are not equal to the header
final_df = df1.filter(col("<col_name>") != header)
final_df.show()
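Since the question mentions having multiple columns, one way to extend the same idea is to compare every column against the header row and keep any row that differs in at least one of them. A sketch, assuming df1 was read without a header so the header values sit in the first row:

from functools import reduce
from pyspark.sql.functions import col

header_row = df1.take(1)[0]
# keep a row if at least one column differs from the corresponding header value
not_header = reduce(lambda a, b: a | b,
                    [col(c) != header_row[c] for c in df1.columns])
final_df = df1.filter(not_header)
final_df.show()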
From this data frame I would like to select rows with the same concentration and almost the same name. For example, the first three rows have the same concentration and the same name except for the suffix Dig_I, Dig_II, Dig_III. I would like to somehow select these three rows, take the mean value of each column, and then create a new data frame from the result.
Here is the whole data frame:
import pandas as pd
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js")
import pandas as pd
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js")
new_df = df.groupby('concentration').mean()
Note: this will only compute averages for columns with dtype float or int, so it will drop the img_name column and take the averages of all the numeric columns...
This may be faster...
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js").groupby('concentration').mean()
If you would like to preserve the img_name...
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js")
new = df.groupby('concentration').mean()
pd.merge(df, new, left_on = 'concentration', right_on = 'concentration', how = 'inner')
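Alternatively, if the goal is a new, collapsed data frame with one row per concentration and a representative img_name kept next to the column means, a groupby/agg sketch along these lines may work (img_name and concentration are the column names mentioned above; adjust them if they differ in your file):

import pandas as pd

df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js")

# average the numeric columns and keep the first img_name of each group
num_cols = [c for c in df.select_dtypes("number").columns if c != "concentration"]
agg_spec = {c: "mean" for c in num_cols}
agg_spec["img_name"] = "first"
new_df = df.groupby("concentration", as_index=False).agg(agg_spec)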
Does that help?
I have a dataframe df1 with 150 columns and many rows. I also have a dataframe df2 with the same schema but very few rows, containing edits that should be applied to df1 (there's a key column id to identify which row to update). df2 has only the columns with updates populated; the rest of the columns are null. What I want to do is update the rows in df1 with the corresponding rows from dataframe df2 in the following way:
if a column in df2 is null, it should not cause any change in df1
if a column in df2 contains a tilde "~", it should result in nullifying that column in df1
otherwise the value in the column in df1 should be replaced with the value from df2
How can I best do it? Can it be done in a generic way, without listing all the columns, by iterating over them? Can it be done using the dataframe API or do I need to switch to RDDs?
(Of course by updating dataframe df1 I mean creating a new, updated dataframe.)
Example
Let's say the schema is: id:Int, name:String, age: Int.
df1 is:
1,"Greg",18
2,"Kate",25
3,"Chris",30
df2 is:
1,"Gregory",null
2,~,26
The updated dataframe should look like this:
1,"Gregory",18
2,null,26
3,"Chris",30
You can also use case (when) or coalesce with a full outer join to merge the two dataframes. See the link below for an explanation.
Spark incremental loading overwrite old record
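A minimal PySpark sketch of that idea, assuming a key column id and renaming df2's columns with a _u suffix before the full outer join (the suffix and variable names are illustrative, not from the linked post):

from pyspark.sql import functions as F

# rename df2's columns so they don't clash with df1's after the join
df2r = df2.select([F.col(c).alias(c + "_u") for c in df2.columns])
joined = df1.join(df2r, df1["id"] == df2r["id_u"], "full_outer")

updated = [
    F.when(F.col(c + "_u") == "~", F.lit(None))            # "~" nullifies the column
     .when(F.col(c + "_u").isNotNull(), F.col(c + "_u"))   # non-null edit replaces the value
     .otherwise(F.col(c))                                  # null edit keeps df1's value
     .alias(c)
    for c in df1.columns if c != "id"
]

result = joined.select(F.coalesce(F.col("id"), F.col("id_u")).alias("id"), *updated)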
I figured out how to do it with an intermediate conversion to RDD. First, create a map idsToEdits where keys are row ids and values are maps of column numbers to values (only the non-null ones).
val idsToEdits = df2.rdd.map { row =>
  (row(0),
    // keep only the non-null columns, keyed by field index, and turn "~" into null
    row.getValuesMap[AnyVal](row.schema.fieldNames
        .filterNot(colName => row.isNullAt(row.fieldIndex(colName))))
      .map { case (k, v) => (row.fieldIndex(k), if (v == "~") null else v) })
}.collectAsMap()
Broadcast that map and define an editRow function that updates a row.
val idsToEditsBr=sc.broadcast(idsToEdits)
import org.apache.spark.sql.Row
val editRow: Row => Row = { row =>
  idsToEditsBr
    .value
    .get(row(0))
    .map { edits =>
      Row.fromSeq(edits.foldLeft(row.toSeq) {
        case (rowSeq, (idx, newValue)) => rowSeq.updated(idx, newValue)
      })
    }
    .getOrElse(row)
}
Finally, use that function on RDD derived from df1 and convert back to a dataframe.
val updatedDF = spark.createDataFrame(df1.rdd.map(editRow), df1.schema)
It sounds like your question is how to perform this without explicitly naming all the columns, so I will assume you have some "doLogic" UDF function or dataframe functions to perform your logic after joining.
import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.sql.types.StringType

// build one expression per non-key column, picking the edited value when present
val cols = df1.schema.filterNot(_.name == "id").map { x =>
  if (x.dataType == StringType) {
    doLogicUdf(col(x.name), col(x.name + "2"))
  } else {
    when(col(x.name + "2").isNotNull, col(x.name + "2")).otherwise(col(x.name))
  }
} :+ col("id")

// suffix df2's column names so they don't collide with df1's after the join
val df2Renamed = df2.select(df2.columns.map(c => col(c).alias(c + "2")): _*)

// a left join keeps the df1 rows that have no matching edit row in df2
df1.join(df2Renamed, col("id") === col("id2"), "left_outer").select(cols: _*)