How to turn a column's values into a list - apache-spark

I have a dataset from which I want to extract a column of values into a list. How can I achieve that?
I have tried:
List<Row> rows = dF.select("col1").collectAsList();
But then how do I iterate over the values of col1?
Thanks

You can iterate over a list of rows the same way you can iterate over any list.
For example, assuming you want to access the first column (string type) of all the rows, you can use the following snippet:
import scala.collection.JavaConversions.asScalaBuffer
val rows = df.select("col1").collectAsList();
rows.map(r => r.getAs[String](0))

This is how you can iterate over the list collected from a column, but it is not recommended if the column is too big to fit on the driver node:
df.select("col1").collectAsList()
.stream().forEach(row -> System.out.println(row.getString(0)));
There are other functions for other data types to get a value from a Row, such as getString, getLong, getInt, etc.

Related

How to select corresponding items of a column without making it the index column in a pandas dataframe

I have a pandas dataframe like this:
How do I get the price of item1 without making the 'Items' column an index column?
I tried df['Price (R)'][item1], but it returns the price of item2, while I expect the output to be 1.
The loc operator is required in front of the selection brackets []. When using loc, the part before the comma is the rows you want, and the part after the comma is the columns you want to select. Therefore, the code can be:
result = df.loc[df['Items']=="item1","Price(R)"]
The data type of the created output is a pandas Series object.
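For reference, here is a minimal, self-contained sketch of that pattern; the two-row frame below is made up, but it is consistent with the question (item1 priced at 1):
import pandas as pd

# Made-up frame standing in for the one in the question.
df = pd.DataFrame({"Items": ["item1", "item2"], "Price (R)": [1, 2]})

# Rows before the comma (the boolean mask), columns after (the price column).
result = df.loc[df["Items"] == "item1", "Price (R)"]

print(result)          # a Series with a single entry whose value is 1
print(result.iloc[0])  # the scalar price itself: 1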

Power Query: Split table column with multiple cells in the same row

I have a SharePoint list as a datasource in Power Query.
It has an "AttachmentFiles" column that is a table; from that table I want the values from the column "ServerRelativeURL".
I want to split that column so each value in "ServerRelativeURL" gets its own column.
I can get the values if I use the expand table function, but it splits them into multiple rows; I want to keep everything in one row.
I only want one row per unique ID.
Example:
I can live with a fixed number of columns as there are usually no more than 3 attachments per ID.
I'm thinking that I can add a custom column that refers to "AttachmentFiles ServerRelativeURL Value(1)" but I don't know how.
Can anybody help?
Try this code:
let
    fn = (x) => {x, #table({"ServerRelativeUrl"}, List.FirstN(List.Zip({{"a".."z"}}), x*2))},
    Source = #table({"id", "AttachmentFiles"}, {fn(2), fn(3), fn(1)}),
    replace = Table.ReplaceValue(Source, 0, 0, (a, b, c) => a[ServerRelativeUrl], {"AttachmentFiles"}),
    cols = List.Transform({1..List.Max(List.Transform(replace[AttachmentFiles], List.Count))}, each "url" & Text.From(_)),
    split = Table.SplitColumn(replace, "AttachmentFiles", (x) => List.Transform({0..List.Count(x)-1}, each x{_}), cols)
in
    split
I managed to solve it myself.
I added 3 custom columns like this:
CustomColumn1: [AttachmentFiles]{0}
CustomColumn2: [AttachmentFiles]{1}
CustomColumn3: [AttachmentFiles]{2}
And expanded them with only the "ServerRelativeURL" selected.
It would be nice to have a dynamic solution. But this will work fine for now.

Pick from the first occurrence till the last value in an array column in a PySpark df

I have a problem in which I have to search for the first occurrence of "Employee_ID" in "Mapped_Project_ID", and then pick the values in the array from that first matching occurrence through to the last value.
I have a dataframe like below:
Employee_Name|Employee_ID|Mapped_Project_ID
Name1|E101|[E101, E102, E103]
Name2|E102|[E101, E102, E103]
Name3|E103|[E101, E102, E103, E104, E105]
I want the output df to look like below:
Employee_Name|Employee_ID|Mapped_Project_ID
Name1|E101|[E101, E102, E103]
Name2|E102|[E102, E103]
Name3|E103|[E103, E104, E105]
Not sure how to achieve this.
Can someone provide help on this, or the logic to handle it in Spark, without the need for any UDFs?
Once you have your dataframe, you can use Spark 2.4's higher-order array function filter (see https://docs.databricks.com/_static/notebooks/apache-spark-2.4-functions.html) to filter out any values within the array that are lower than the value in the Employee_ID column, like so:
myDataframe
  .selectExpr(
    "Employee_Name",
    "Employee_ID",
    "filter(Mapped_Project_ID, x -> x >= Employee_ID) as Mapped_Project_ID"
  );
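Since the question is about PySpark, here is the same expression through the DataFrame API in Python; this is just a sketch assuming Spark 2.4+, using the sample rows from the question:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Sample rows taken from the question.
df = spark.createDataFrame(
    [
        ("Name1", "E101", ["E101", "E102", "E103"]),
        ("Name2", "E102", ["E101", "E102", "E103"]),
        ("Name3", "E103", ["E101", "E102", "E103", "E104", "E105"]),
    ],
    ["Employee_Name", "Employee_ID", "Mapped_Project_ID"],
)

# Keep only the array elements that are >= the row's Employee_ID
# (Spark 2.4+ higher-order function `filter` via a SQL expression).
result = df.select(
    "Employee_Name",
    "Employee_ID",
    F.expr("filter(Mapped_Project_ID, x -> x >= Employee_ID)").alias("Mapped_Project_ID"),
)
result.show(truncate=False)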

Error while selecting rows from pandas data frame

I have a pandas dataframe df with the column Name. I did:
for name in df['Name'].unique():
    X = df[df['Name'] == name]
    print(X.head())
but then X contains all kinds of different Names, not the single name I want.
What did I do wrong?
Thanks a lot
You probably don't want to overwrite X on every iteration of your loop, keeping only the dataframe for the last value of df['Name'].unique().
Depending on your data and goal, you might want to use groupby as jezrael suggests, or maybe do something like df[~df['Name'].duplicated()].
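For instance, here is a minimal sketch of the groupby route on a made-up frame with a repeated Name column:
import pandas as pd

# Made-up frame with repeated values in the Name column.
df = pd.DataFrame({"Name": ["a", "a", "b"], "Value": [1, 2, 3]})

# groupby yields each name together with the sub-frame for that name only.
for name, group in df.groupby("Name"):
    print(name)
    print(group.head())

# Or keep just the first row per Name.
print(df[~df["Name"].duplicated()])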

Spark: filter out all rows based on key/value

I have an RDD, x, in which I have two fields: id, value. If a row has a particular value, I want to take the id and filter out all rows with that id.
For example if I have:
id1,value1
id1,value2
and I want to filter out all ids where any row with that id has the value value1, then I would expect all rows to be filtered out. But currently only the first row is filtered out, because it has a value of value1.
I've tried something like
val filter = x.filter(row => (set contains row.value))
This filters out all rows with a particular value, but leaves the other rows with the same id still in the RDD.
You have to apply a filter function to each RDD row, and the function after the => should check whether or not the row, treated as a sequence, contains that token (the id). You may have to adjust the index of the token, but it should look something like this (whether you use contains or !contains depends on whether you want to filter in or filter out):
val filteredRDD = rawRDD
  .filter(rowItem => !(rowItem.map(_.toString).toSeq
    .contains(rowItem(0).toString)))
or even something like:
val filteredRDD = rawRDD.filter(rowItem => !(rowItem._2 contains rowItem._1))
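As an aside, another way to get the behaviour the question describes (dropping every row whose id ever appears with value1) is to collect the offending ids first and then filter on them. Below is a rough PySpark sketch of that idea; the RDD contents mirror the question's example and the names are illustrative:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# (id, value) pairs mirroring the example in the question.
x = sc.parallelize([("id1", "value1"), ("id1", "value2")])

# Ids that appear with the unwanted value anywhere in the RDD.
bad_ids = set(x.filter(lambda kv: kv[1] == "value1").keys().collect())

# Drop every row whose id is in that set.
filtered = x.filter(lambda kv: kv[0] not in bad_ids)
print(filtered.collect())  # [] here, since id1 is excluded entirely
For a large set of offending ids, a broadcast variable or an anti-join would scale better than closing over a plain Python set.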
