pyspark how to pass the values dynamically to countDistinct - apache-spark

I have a CSV file with (FileName, ColumnName, Rule, RuleDetails) as headers.
The Rule column can hold multiple rules such as NotNull, Max, Min, etc.
For the rule "Unique" there can be multiple columns, and I need to pass those columns and perform countDistinct.
If I pass the values dynamically instead of hardcoding them, I get the error below:
AnalysisException: Column '`"SITEID", "ASSETNUM"`' does not exist. Did you mean one of the following? [spark_catalog.maximo_dq.Assets_new.ASSETNUM, spark_catalog.maximo_dq.Assets_new.HasLD, spark_catalog.maximo_dq.Assets_new.SITEID, spark_catalog.maximo_dq.Assets_new.Status, spark_catalog.maximo_dq.Assets_new.SerialNumber, spark_catalog.maximo_dq.Assets_new.Description, spark_catalog.maximo_dq.Assets_new.InstallDate, spark_catalog.maximo_dq.Assets_new.Classification, spark_catalog.maximo_dq.Assets_new.LongDescription];
Similarly, how do I get the count of records that do not match a specified date format?
I need to check how many records in INSTALLDATE are not in the format given in RuleDetails.

Use tuple unpacking to pass the values:
UNIQUECOLSString = ['a', 'b', 'c']  # keep the column names in a Python list, not one combined string
df.select(countDistinct(*UNIQUECOLSString))
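As a fuller sketch of how this could be wired up to the rules CSV: the error above suggests the whole string '"SITEID", "ASSETNUM"' was passed as a single column name, so split RuleDetails into a list first. The variable names, the date pattern, and the assumption that RuleDetails holds a comma-separated column list are illustrative, not from the original post.

from pyspark.sql import functions as F

# RuleDetails arrives from the CSV as one string, e.g. '"SITEID", "ASSETNUM"'.
rule_details = '"SITEID", "ASSETNUM"'                      # hypothetical value read from the rules file
unique_cols = [c.strip().strip('"') for c in rule_details.split(",")]

# Unpack the clean list of column names into countDistinct.
distinct_count = df.select(F.countDistinct(*unique_cols)).collect()[0][0]

# For the date-format rule: to_date returns NULL when the value does not match
# the pattern, so count the non-null values that fail to parse.
date_pattern = "yyyy-MM-dd"                                # hypothetical pattern taken from RuleDetails
bad_dates = df.filter(
    F.col("INSTALLDATE").isNotNull()
    & F.to_date(F.col("INSTALLDATE"), date_pattern).isNull()
).count()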

Related

Expand list columns with variable length into rows of a pandas dataframe

I have an input dataframe containing multiple list columns with an unequal number of elements within each list. I need to expand all the list columns into rows so that each bin has the corresponding value in the same row.
Code for generating the df:
df_dict = {'vin': ['VIN123', 'VIN123', 'VIN123', 'VIN234', 'VIN345'],
           'date': ['01-22-2022', '01-23-2022', '01-23-2022', '01-23-2022', '01-22-2022'],
           'celltype': ['A', 'A', 'B', 'A', 'B'],
           'soc_bins': [['0-10', '10-20', '50-80', '85-90', '100-150', '150-170'], ['0-10', '10-20', '50-80', '85-90', '100-150', '150-170'], ['0-10', '10-20', '50-80', '85-90', '100-150', '150-170'], ['0-10', '10-20', '50-80', '85-90', '100-150', '150-170'], ['0-10', '10-20', '50-80', '85-90', '100-150', '150-170']],
           'soc_value': [[10, 300, 85, 20, 5, 0], [20, 400, 125, 670, 5, 7], [20, 500, 55, 60, 9, 9], [40, 300, 65, 90, 1, 0], [20, 700, 35, 50, 2, 0]],
           'temp_bins': [['50f-55f', '60f-70f', '90f-110f'], ['50f-55f', '60f-70f', '90f-110f'], ['50f-55f', '60f-70f', '90f-110f'], ['50f-55f', '60f-70f', '90f-110f'], ['50f-55f', '60f-70f', '90f-110f']],
           'temp_value': [[1, 2, 3], [4, 3, 4], [5, 3, 5], [6, 900, 7], [3, 600, 9]]}
Input_df: (built from df_dict above)
Output_df (first two rows shown):
vin|date|celltype|soc_bins|soc_value|temp_bins|temp_value
VIN123|01-22-2022|A|0-10|10|50f-55f|1
VIN123|01-22-2022|A|10-20|300|60f-70f|2
In short, each value in soc_value corresponds to the bin at the same position in soc_bins, and the same goes for the temp columns.
A few problems I encountered with the explode method and similar approaches:
The number of bins in soc_bins (6) and temp_bins (3) is not equal.
Also, two bins may share the same value (e.g. in the 3rd row, soc_value contains the value 9 twice), so when I first expand the soc_value column the explode function has no way to tell the two rows apart, and I get the error "cannot handle a non-unique multi-index!"
There are many more columns that have to be manipulated in the same way.
I can use df.set_index(['date', 'vin', 'celltype']).apply(lambda x: x.apply(pd.Series).stack()).reset_index(), but I get NaNs in the indexed columns.
To fill the NaNs I could use .ffill(), but then I cannot distinguish them from the original null values.
Also, with this method, if some of the index values are null I get the error "cannot handle a non-unique multi-index!"
Current output:
Required output: I need output similar to my current output but without the null values. I could use .ffill() to fill the null values, but then I cannot differentiate the actual null values from the ones created by df.set_index().
Assigning a row_number to the df before expanding it, and including it in the index, solved the "cannot handle a non-unique multi-index!" issue:
import numpy as np
df['row_number'] = np.arange(len(df))
df.set_index(['date', 'vin', 'celltype', 'row_number']).apply(lambda x: x.apply(pd.Series).stack()).reset_index()
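If you are on pandas 1.3 or later, a hedged alternative sketch is to explode each bins/value pair separately, number the positions within each original row with cumcount, and left-merge the two exploded frames back together. The row_number and pos helper columns are illustrative names, not from the original post.

import numpy as np
import pandas as pd

df = pd.DataFrame(df_dict)
df['row_number'] = np.arange(len(df))   # unique id per original row

# Explode each pair separately (the lists within a pair have equal lengths).
soc = df[['row_number', 'vin', 'date', 'celltype', 'soc_bins', 'soc_value']].explode(['soc_bins', 'soc_value']).reset_index(drop=True)
soc['pos'] = soc.groupby('row_number').cumcount()

temp = df[['row_number', 'temp_bins', 'temp_value']].explode(['temp_bins', 'temp_value']).reset_index(drop=True)
temp['pos'] = temp.groupby('row_number').cumcount()

# Align by original row and position; temp columns become NaN once their 3 bins run out.
out = soc.merge(temp, on=['row_number', 'pos'], how='left').drop(columns=['row_number', 'pos'])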

How can I reset the index in a .groupby output?

I have the following dataframe. I want to group the name and brand columns by their respective unique values.
[screenshot: dataframe]
I wrote the following Python code to group them:
high_products = products.reset_index().groupby(['name', 'brand'])[['name', 'brand', 'count_name']]
and printed the output with the following code:
high_products.head().sort_values(by='count_name', ascending=False)
[screenshot: grouped output]
As you can see from the image above, it appears that it is also grouping them based on the index. In the end, I just want the unique name and brand values and their respective count names.
How can I do this with .groupby?
Thank you.
Change the first line of your code to (count aggregation on name, brand):
high_products = products.groupby(['name', 'brand']).agg(['count']).reset_index()
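A small usage sketch on made-up data (the toy products frame below is an assumption, not the poster's data) to show the resulting shape and the MultiIndex columns produced by agg(['count']):

import pandas as pd

products = pd.DataFrame({
    'name':  ['tv', 'tv', 'radio'],
    'brand': ['acme', 'acme', 'acme'],
    'count_name': [1, 2, 3],
})

high_products = products.groupby(['name', 'brand']).agg(['count']).reset_index()

# agg(['count']) yields MultiIndex columns, so sort on the ('count_name', 'count') pair.
print(high_products.sort_values(('count_name', 'count'), ascending=False))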

pick from first occurrences till last values in array column in pyspark df

I have a problem in which I have to search for the first occurrence of "Employee_ID" in "Mapped_Project_ID" and pick the values in the array from that first matching occurrence through the last value.
I have a dataframe like below:
Employee_Name|Employee_ID|Mapped_Project_ID
Name1|E101|[E101, E102, E103]
Name2|E102|[E101, E102, E103]
Name3|E103|[E101, E102, E103, E104, E105]
I want an output df like below:
Employee_Name|Employee_ID|Mapped_Project_ID
Name1|E101|[E101, E102, E103]
Name2|E102|[E102, E103]
Name3|E103|[E103, E104, E105]
I am not sure how to achieve this.
Can someone help with the logic to handle this in Spark without needing any UDFs?
Once you have your dataframe, you can use Spark 2.4's higher-order array functions (see https://docs.databricks.com/_static/notebooks/apache-spark-2.4-functions.html) to filter out any values within the array that are lower than the value in the Employee_ID column, like so:
myDataframe
  .selectExpr(
    "Employee_Name",
    "Employee_ID",
    "filter(Mapped_Project_ID, x -> x >= Employee_ID) as Mapped_Project_ID"
  );
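Since the question asks about PySpark, here is the same expression as a hedged sketch through expr(); it assumes Spark 2.4+ and, like the answer above, relies on the IDs in the array being in ascending order so that "not lower than Employee_ID" matches "from the first occurrence onward".

from pyspark.sql import functions as F

result = myDataframe.select(
    "Employee_Name",
    "Employee_ID",
    # The SQL lambda can reference the row's Employee_ID column directly.
    F.expr("filter(Mapped_Project_ID, x -> x >= Employee_ID)").alias("Mapped_Project_ID"),
)
result.show(truncate=False)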

Iterating over rows of dataframe but keep each row as a dataframe

I want to iterate over the rows of a dataframe, but keep each row as a dataframe that has the exact same format of the parent dataframe, except with only one row. I know about calling DataFrame() and passing in the index and columns, but for some reason this doesn't always give me the same format of the parent dataframe. Calling to_frame() on the series (i.e. the row) does cast it back to a dataframe, but often transposed or in some way different from the parent dataframe format. Isn't there some easy way to do this and guarantee it will always be the same format for each row?
Here is what I came up with as my best solution so far:
def transact(self, orders):
    # Buy or Sell
    if len(orders) > 1:
        empty_order = orders.iloc[0:0]
        for index, order in orders.iterrows():
            empty_order.loc[index] = order
            # empty_order.append(order)
            self.sub_transact(empty_order)
    else:
        self.sub_transact(orders)
In essence, I empty the dataframe and then insert the series from the for loop back into it. This works correctly, but gives the following warning:
C:\Users\BNielson\Google Drive\My Files\machine-learning\Python-Machine-Learning\ML4T_Ex2_1.py:57: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
empty_order.loc[index] = order
C:\Users\BNielson\Anaconda3\envs\PythonMachineLearning\lib\site-packages\pandas\core\indexing.py:477: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.obj[item] = s
So it's this line giving the warning:
empty_order.loc[index] = order
This is particularly strange because I am using .loc already, when normally you get this error when you don't use .loc.
There is a much much easier way to do what I want.
order.to_frame().T
So...
if len(orders) > 1:
    for index, order in orders.iterrows():
        self.sub_transact(order.to_frame().T)
else:
    self.sub_transact(orders)
What this actually does is translate the series (which still contains the necessary column and index information) back into a dataframe. The quirk is that to_frame() turns the row into a single column, so the previous row label becomes the column and the previous columns become multiple rows; you just transpose it back with .T.
Use groupby with a key that is unique per row. groupby does exactly what you are asking for: it iterates over each group, and each group is a dataframe. So if you group by a value that is unique for each and every row, you will get a single-row dataframe when you iterate over the groups:
import numpy as np

for n, group in df.groupby(np.arange(len(df))):
    # do stuff with the single-row dataframe `group`
    pass
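A quick illustration on a made-up two-row frame (the orders data below is hypothetical) confirming that each group comes back as a one-row DataFrame with the parent's columns:

import numpy as np
import pandas as pd

orders = pd.DataFrame({'symbol': ['AAPL', 'MSFT'], 'shares': [100, -50]})

for _, single_row_df in orders.groupby(np.arange(len(orders))):
    print(type(single_row_df).__name__, single_row_df.shape)   # DataFrame (1, 2)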
If I can suggest an alternative way, it would be like this:
for index, order in orders.iterrows():
    orders.loc[index:index]
orders.loc[index:index] is a one-row dataframe slice with exactly the same structure, including the index and column names.
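A small hedged comparison on a toy frame (the orders data is made up) also shows a practical difference from the to_frame().T route: the .loc slice keeps the original dtypes, while a row pulled out as a Series and transposed back is upcast to object when the columns have mixed types.

import pandas as pd

orders = pd.DataFrame({'symbol': ['AAPL', 'MSFT'], 'shares': [100, -50]})

for index, order in orders.iterrows():
    via_slice = orders.loc[index:index]   # one-row DataFrame, dtypes preserved (object, int64)
    via_series = order.to_frame().T       # one-row DataFrame, but every column is now object
    print(via_slice.dtypes.tolist(), via_series.dtypes.tolist())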

How to keep Rows and Columns headers when applying operation using Matlab

I have a data set stored in an Excel file. When I import the data using the MATLAB function
A = xlsread(xls-filename)
matrix A only stores the numeric values of my table. And when I use another function such as
B = readtable(xls-filename)
the table shows the complete data, including the row and column headers, but when I apply an operation to it like
Bnorm = normc(B)
it is unable to perform the normalization because of the row and column headers.
My questions are:
Is there any way to avoid the row and column headers in table B?
Is there any way to store the row and column headers when reading the table using the xlsread function, such that:
column headers = the first row in (xls-filename)
row headers = the first column in (xls-filename)
Thanks for any suggestions.
[screenshot: dataset table]
[screenshot: normalized matrix after applying xlsread(xls-filename)]
The answers to your specific questions are:
With a table, you can avoid row labels but column labels always exist.
As per the doc for xlsread, the first output is the numeric data, and the second output is the text data, which in this case would include your header information.
But, in this case, you just need to learn how to work with tables properly. You want something like,
>> Bnorm = normc(B{:,2:end});
which extracts all the numeric elements of table B and uses them as input to normc.
If you want the result to be a table then use
Bnorm = B;
Bnorm{:,2:end} = normc(B{:,2:end});
