Spark: good practice to check that values in a column are all the same? - apache-spark

I have a dataset ds with a column isInError; the dataset is read from a path.
For each dataset that I read, all values in this column should be the same (all true or all false).
Now I want to call some method based on this column (if all values in the column are true, I will add a new column; if all values are false, I will not).
How can I do this properly? I could surely do something like
dsFiltered = ds.filter(col("isInError").equals("true"))
and then check whether dsFiltered is empty, but I don't think that's best practice?
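One option, sketched here in PySpark rather than the Scala of the snippet above (ds and isInError come from the question; the added column name and literal are made up for illustration), is to look at the distinct values of the column instead of filtering and testing for emptiness:

from pyspark.sql import functions as F

# Assumption: ds is the dataset read from the path and isInError is a boolean column.
flags = [row[0] for row in ds.select("isInError").distinct().collect()]

if flags == [True]:
    # every row is true: add the new column
    ds = ds.withColumn("extraCol", F.lit(1))
elif flags == [False]:
    # every row is false: leave the dataset untouched
    pass
else:
    raise ValueError(f"Unexpected mix of isInError values: {flags}")

This still scans the column, but it makes the all-true, all-false and mixed cases explicit instead of hiding them behind an emptiness check.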

Related

Difference between setting value to a column in a dataframe with and without .loc

I was wondering what the difference is between, and what the pros and cons are of, setting a constant value in a new column of a dataframe using the two approaches below:
The first being directly assigning a value to a column
df["column"] = "value"
The second using .loc
df.loc[:, "column"] = "value"
Thanks!
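For reference, a minimal pandas sketch of the two forms side by side (the DataFrame is made up); on a DataFrame you own outright, both create or overwrite the column with the constant:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Direct assignment: creates or overwrites the column.
df["column"] = "value"

# .loc with a full-row slice: same end result here.
df.loc[:, "column"] = "value"

print(df)

In practice the difference mostly matters when df is itself a slice of another DataFrame, which is where the SettingWithCopyWarning discussed further down this page comes into play.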

How to update a dataframe column in pyspark based on another column value without using withColumn feature?

I have a dataframe in which I am trying to update a column value.
To do this, we can simply use spark.sql and run an update query on the dataframe.
But is there a way we can use dataframe's native API to do the same ?
I was able to set values of a new column by first creating it with withColumn and then setting its value based on a condition.
val df2 = df.withColumn("req_id", when(col("status") === "9088", "Generated")
  .when(col("status") === "9089", "Deactive")
  .otherwise("Unknown"))
But what if I already have a column req_id that holds some values (not default values), and I want to update it based on the value of the status column?
How can I update the value of the column req_id without doing a workaround like creating a new column and then updating its value ?
Any help is appreciated.
As @blackbishop mentioned in his comment:
val df2 = df.withColumn("req_id", when(col("status") === "9088", "Generated")
  .when(col("status") === "9089", "Deactive")
  .otherwise("Unknown"))
will either create the column req_id or replace the values inside it if the column already exists.
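Since the title asks about pyspark, a rough PySpark equivalent of the same when/otherwise chain would be (column names and status codes taken from the question):

from pyspark.sql import functions as F

# withColumn("req_id", ...) overwrites req_id if it already exists, otherwise it creates it.
df2 = df.withColumn(
    "req_id",
    F.when(F.col("status") == "9088", "Generated")
     .when(F.col("status") == "9089", "Deactive")
     .otherwise("Unknown"),
)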

pick from first occurrences till last values in array column in pyspark df

I have a problem in which I have to search for the first occurrence of "Employee_ID" in "Mapped_Project_ID" and pick the values in the array from that first matching occurrence through to the last value.
I have one dataframe like below :
Employee_Name|Employee_ID|Mapped_Project_ID
Name1|E101|[E101, E102, E103]
Name2|E102|[E101, E102, E103]
Name3|E103|[E101, E102, E103, E104, E105]
I want to have output df like below:
Employee_Name|Employee_ID|Mapped_Project_ID
Name1|E101|[E101, E102, E103]
Name2|E102|[E102, E103]
Name3|E103|[E103, E104, E105]
Not sure how to achieve this.
Can someone help with this, or with the logic to handle it in Spark, without the need for any UDFs?
Once you have your dataframe you can use Spark 2.4's higher-order array functions (see https://docs.databricks.com/_static/notebooks/apache-spark-2.4-functions.html) to filter out any values within the array that are lower than the value in the Employee_ID column, like so:
myDataframe
  .selectExpr(
    "Employee_Name",
    "Employee_ID",
    "filter(Mapped_Project_ID, x -> x >= Employee_ID) as Mapped_Project_ID"
  );
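For completeness, the same idea in PySpark, calling the built-in filter higher-order function through expr (column names as in the question; Spark 2.4+ assumed):

from pyspark.sql import functions as F

result = myDataframe.select(
    "Employee_Name",
    "Employee_ID",
    F.expr("filter(Mapped_Project_ID, x -> x >= Employee_ID)").alias("Mapped_Project_ID"),
)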

Iterating over rows of dataframe but keep each row as a dataframe

I want to iterate over the rows of a dataframe, but keep each row as a dataframe that has the exact same format of the parent dataframe, except with only one row. I know about calling DataFrame() and passing in the index and columns, but for some reason this doesn't always give me the same format of the parent dataframe. Calling to_frame() on the series (i.e. the row) does cast it back to a dataframe, but often transposed or in some way different from the parent dataframe format. Isn't there some easy way to do this and guarantee it will always be the same format for each row?
Here is what I came up with as my best solution so far:
def transact(self, orders):
    # Buy or Sell
    if len(orders) > 1:
        empty_order = orders.iloc[0:0]
        for index, order in orders.iterrows():
            empty_order.loc[index] = order
            # empty_order.append(order)
            self.sub_transact(empty_order)
    else:
        self.sub_transact(orders)
In essence, I empty the dataframe and then insert the series, from the For loop, back into it. This works correctly, but gives the following warning:
C:\Users\BNielson\Google Drive\My Files\machine-learning\Python-Machine-Learning\ML4T_Ex2_1.py:57: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
empty_order.loc[index] = order
C:\Users\BNielson\Anaconda3\envs\PythonMachineLearning\lib\site-packages\pandas\core\indexing.py:477: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.obj[item] = s
So it's this line giving the warning:
empty_order.loc[index] = order
This is particularly strange because I am using .loc already, when normally you get this error when you don't use .loc.
There is a much much easier way to do what I want.
order.to_frame().T
So...
if len(orders) > 1:
    for index, order in orders.iterrows():
        self.sub_transact(order.to_frame().T)
else:
    self.sub_transact(orders)
What this actually does is translate the series (which still contains the necessary column and index information) back into a dataframe. But for some moronic (though I'm sure Pythonic) reason it comes back transposed, so that the previous row is now the column and the previous columns are now multiple rows! So you just transpose it back.
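A quick standalone demo of that round trip (made-up data):

import pandas as pd

df = pd.DataFrame({"price": [1.0, 2.0], "qty": [10, 20]}, index=["a", "b"])
row = df.loc["a"]             # a Series: the columns become its index
back = row.to_frame().T       # a one-row DataFrame with the original columns restored
print(back.columns.tolist())  # ['price', 'qty']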
Use groupby with a key that is unique per row. groupby does exactly what you are asking for: it iterates over each group, and each group is a dataframe. So if you group by a value that is unique for each and every row, you'll get a single-row dataframe when you iterate over the groups.
import numpy as np

for n, group in df.groupby(np.arange(len(df))):
    pass  # do stuff with `group`, a one-row dataframe
If I can suggest an alternative way, then it would be like this:
for index, order in orders.iterrows():
    orders.loc[index:index]
orders.loc[index:index] is exactly a one-row dataframe slice with the same structure, including the index and column names.
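A tiny self-contained illustration of that slice behaviour (the example DataFrame is made up):

import pandas as pd

orders = pd.DataFrame({"symbol": ["AAPL", "MSFT"], "qty": [10, 5]})

for index, order in orders.iterrows():
    one_row = orders.loc[index:index]    # a 1-row DataFrame, not a Series
    print(type(one_row), one_row.shape)  # <class 'pandas.core.frame.DataFrame'> (1, 2)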

Using which Query Implementation to get a row from Cassandra

I want to retrieve a row from Cassandra using column family and row key.
However, when I use SliceQuery, there is an exception: Caused by: me.prettyprint.hector.api.exceptions.HectorException: Neither column names nor range were set, this is an invalid slice predicate.
Does anyone know whether I have used a wrong Query implementation?
This will give you an entire row:
SliceQuery<String, String, String> query =
    HFactory.createSliceQuery(_keyspace, _stringSerializer, _stringSerializer, _stringSerializer);
query.setColumnFamily(columnFamily)
     .setKey(key)
     .setRange("", "", false, Integer.MAX_VALUE);
