Get Spark dataset metadata - apache-spark

I am trying to convert a Dataset&lt;Row&gt; into another object, possibly a java.util.List. I also need to extract the metadata for this dataset, such as the number of columns, the column names and the column types. Is there any way to do it?
Thank you

You can get the schema from the dataset as
ds.schema
This gives you a StructType, which contains all the information.
ds.schema.fieldNames
This gives you the list of column names.
ds.schema.fields
This gives you a list of StructField objects, each containing the column name, the data type and nullable as a boolean value.
ds.schema.size
This gives the total number of columns.
Also, you can see the details with ds.printSchema()
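If you are working from PySpark rather than Java, the equivalent calls look like this; a minimal sketch with a made-up dataframe:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative dataframe; substitute your own Dataset/DataFrame here
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

schema = df.schema                # StructType with all the metadata
print(len(schema.fields))         # number of columns
print(schema.fieldNames())        # list of column names
for field in schema.fields:       # name, data type and nullability per column
    print(field.name, field.dataType, field.nullable)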
Hope this helps!

Related

pyspark how to pass the values dynamically to countDistinct

I have a CSV file that contains (FileName, ColumnName, Rule and RuleDetails) as headers.
I have multiple rules in the Rule column, like NotNull, Max, Min, etc.
For the rule "Unique" there can be multiple columns; I need to pass those columns and perform countDistinct.
If I pass the values dynamically instead of hardcoding them, I get the error below:
AnalysisException: Column '`"SITEID", "ASSETNUM"`' does not exist. Did you mean one of the following? [spark_catalog.maximo_dq.Assets_new.ASSETNUM, spark_catalog.maximo_dq.Assets_new.HasLD, spark_catalog.maximo_dq.Assets_new.SITEID, spark_catalog.maximo_dq.Assets_new.Status, spark_catalog.maximo_dq.Assets_new.SerialNumber, spark_catalog.maximo_dq.Assets_new.Description, spark_catalog.maximo_dq.Assets_new.InstallDate, spark_catalog.maximo_dq.Assets_new.Classification, spark_catalog.maximo_dq.Assets_new.LongDescription];
Similarly, how do I get the count of records that do not match the specified date format?
I need to check how many records in INSTALLDATE are not in the format given by RuleDetails.
Use unpacking (the * operator) to pass the values:
from pyspark.sql.functions import countDistinct

UNIQUUECOLSString = ['a', 'b', 'c']  # keep the column names in a list
df.select(countDistinct(*UNIQUUECOLSString))
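A minimal runnable sketch of the same idea, assuming a made-up dataframe with the SITEID and ASSETNUM columns from the error message:
from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("S1", "A1"), ("S1", "A2"), ("S2", "A1")],
    ["SITEID", "ASSETNUM"],
)

# Columns for the "Unique" rule, e.g. parsed from the RuleDetails field of the CSV
unique_cols = ["SITEID", "ASSETNUM"]

# Unpack the list so each name is passed as a separate argument,
# instead of a single string like '"SITEID", "ASSETNUM"'
df.select(countDistinct(*unique_cols).alias("distinct_count")).show()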

How to Flatten a semicolon Array properly in Azure Data Factory?

Context: I have a data flow that extracts data from a SQL DB. The data arrives as just one column containing a tab-separated string, so in order to manipulate it properly I've tried to separate every single column with its corresponding data.
Firstly, to 'rebuild' the table properly, I used a 'Derived Column' activity replacing tabs with semicolons (1):
dropLeft(regexReplace(regexReplace(regexReplace(descripcion,[\t],';'),[\n],';'),[\r],';'),1)
After that, I used the split() function to get an array and build the columns (2):
split(descripcion, ';')
Problem: When I try to use the 'Flatten' activity (as described here: https://learn.microsoft.com/en-us/azure/data-factory/data-flow-flatten), it just does not work: the data flow gives me only one column, and if I add an additional column in the 'Flatten' activity I just get another column with the same data as the first one.
Expected output:
column2|column1|column3
2000017|ENVASE CORONA CLARA 24/355 ML GRAB|PC13
2004297|ENVASE V FAM GRAB 12/940 ML USADO|PC15
Could you tell me what I'm doing wrong? Thanks in advance.
You can use the Derived Column activity itself; try it as below.
After the first derived column, what you have is a string array, which can just be split again using the Derived Column schema modifier.
Here firstc represents the source column equivalent to your column descripcion:
Column1: split(firstc, ';')[1]
Column2: split(firstc, ';')[2]
Column3: split(firstc, ';')[3]
Optionally, you can select the columns you need to write to the SQL sink.
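For comparison only (outside ADF), the same split-a-delimited-string-into-columns step sketched in pandas, with a made-up descripcion value:
import pandas as pd

# Illustrative data: one semicolon-separated string per row
df = pd.DataFrame({"descripcion": ["2000017;ENVASE CORONA CLARA 24/355 ML GRAB;PC13"]})

# Split into separate columns (0-based here, unlike ADF's 1-based array indexing)
df[["column1", "column2", "column3"]] = df["descripcion"].str.split(";", expand=True)
print(df[["column1", "column2", "column3"]])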

how to rename badly typed students' names in a dataframe column based on a reference list

We have students answer MCQs after each lesson on Socrative.
They enter their name first, then answer. For each lesson, we collect data from the Socrative platform, but have issues "normalizing" the names, so that 'John Doe', 'johndoe' or 'John,Doe' can be transformed into 'doe', as it is written in our main file.
Our main file for following up on students (treated as a dataframe with Python) initially has just one column, the name (as a string, 'doe' for Mr. John Doe).
I'd like to write a function that goes through the 'name' column of my lesson1 dataframe and, for each value of the name column, replaces the badly typed name with the reference name.
To lowercase the names and remove excess spaces and punctuation, I've used the following code:
lesson1["name"] = lesson1["name"].str.lower()
lesson1["name"] = lesson1["name"].str.strip()
import re
lesson1["name"]=lesson1["name"].apply(lambda x : re.sub('[^A-Za-z0-9]+', '', x))
Then I want to replace the 'name' values with the reference name where necessary.
I've tried the following code on 2 lists:
bad = lesson1['name']
good = reference['name']

def changenames(lesson_list, reference_list):
    for i, name in enumerate(lesson_list):
        for j, ref in enumerate(reference_list):
            if ref in name:
                lesson_list[i] = ref

changenames(bad, good)
but 1/ it's not working, due to a SettingWithCopyWarning, and
2/ I fail to apply it to a column of the dataframe.
Could you help me?
Thanks,
L.
I've found a way.
I have 2 dataframes:
- the reference_list dataframe, with the names of the students; it has a column 'name'
- the lesson dataframe, with the names as the students type them when they answer the MCQs (not standardized) and the answers to the MCQs.
To transform the names of the students in the lesson dataframe, based on the well-typed names in reference_list['name'], I have used:
for i in lesson['name']:
    for ref in reference_list['name']:
        if ref in i:
            lesson.loc[lesson['name'] == i, 'name'] = ref
and it works fine.
After that, you can apply functions to treat duplicates, merge data, etc.
I found help in this thread: Replace single value in a pandas dataframe, when index is not known and values in column are unique.
Hope it'll help some of you.
Louis
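For reference, here is a minimal end-to-end sketch of the same approach with made-up names (the data and column names below are purely illustrative):
import re

import pandas as pd

# Illustrative reference names and raw lesson entries
reference_list = pd.DataFrame({"name": ["doe", "smith"]})
lesson = pd.DataFrame({"name": ["John Doe", "johndoe", "Ann,Smith"], "score": [3, 5, 4]})

# Normalize: lowercase, strip, drop anything that is not a letter or digit
lesson["name"] = lesson["name"].str.lower().str.strip()
lesson["name"] = lesson["name"].apply(lambda x: re.sub("[^A-Za-z0-9]+", "", x))

# Replace each raw name by the reference name it contains
for i in lesson["name"]:
    for ref in reference_list["name"]:
        if ref in i:
            lesson.loc[lesson["name"] == i, "name"] = ref

print(lesson)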

pick from first occurrences till last values in array column in pyspark df

I have a problem in which I have to search for the first occurrence of "Employee_ID" in "Mapped_Project_ID", and then pick the values in the array from that first matching occurrence through to the last value.
I have a dataframe like below:
Employee_Name|Employee_ID|Mapped_Project_ID
Name1|E101|[E101, E102, E103]
Name2|E102|[E101, E102, E103]
Name3|E103|[E101, E102, E103, E104, E105]
I want to have output df like below:
Employee_Name|Employee_ID|Mapped_Project_ID
Name1|E101|[E101, E102, E103]
Name2|E102|[E102, E103]
Name3|E103|[E103, E104, E105]
Not sure how to achieve this.
Can someone provide some help on this, or the logic to handle it in Spark without needing any UDFs?
Once you have your dataframe, you can use Spark 2.4's higher-order array functions (see https://docs.databricks.com/_static/notebooks/apache-spark-2.4-functions.html) to filter out any values within the array that are lower than the value in the Employee_ID column, like so:
myDataframe
  .selectExpr(
    "Employee_Name",
    "Employee_ID",
    "filter(Mapped_Project_ID, x -> x >= Employee_ID) as Mapped_Project_ID"
  );
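The same approach in PySpark, as a minimal sketch using the example data from the question (the filter() higher-order function needs Spark 2.4+):
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        ("Name1", "E101", ["E101", "E102", "E103"]),
        ("Name2", "E102", ["E101", "E102", "E103"]),
        ("Name3", "E103", ["E101", "E102", "E103", "E104", "E105"]),
    ],
    ["Employee_Name", "Employee_ID", "Mapped_Project_ID"],
)

# Keep only the array elements that are >= the row's Employee_ID
result = df.withColumn(
    "Mapped_Project_ID",
    expr("filter(Mapped_Project_ID, x -> x >= Employee_ID)"),
)
result.show(truncate=False)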

Iterating over rows of dataframe but keep each row as a dataframe

I want to iterate over the rows of a dataframe, but keep each row as a dataframe that has the exact same format as the parent dataframe, except with only one row. I know about calling DataFrame() and passing in the index and columns, but for some reason this doesn't always give me the same format as the parent dataframe. Calling to_frame() on the series (i.e. the row) does cast it back to a dataframe, but often transposed or in some way different from the parent dataframe's format. Isn't there some easy way to do this that guarantees it will always be the same format for each row?
Here is what I came up with as my best solution so far:
def transact(self, orders):
    # Buy or Sell
    if len(orders) > 1:
        empty_order = orders.iloc[0:0]
        for index, order in orders.iterrows():
            empty_order.loc[index] = order
            # empty_order.append(order)
            self.sub_transact(empty_order)
    else:
        self.sub_transact(orders)
In essence, I empty the dataframe and then insert the series from the for loop back into it. This works correctly, but gives the following warning:
C:\Users\BNielson\Google Drive\My Files\machine-learning\Python-Machine-Learning\ML4T_Ex2_1.py:57: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
empty_order.loc[index] = order
C:\Users\BNielson\Anaconda3\envs\PythonMachineLearning\lib\site-packages\pandas\core\indexing.py:477: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.obj[item] = s
So it's this line giving the warning:
empty_order.loc[index] = order
This is particularly strange because I am using .loc already, when normally you get this warning when you don't use .loc.
There is a much, much easier way to do what I want:
order.to_frame().T
So...
if len(orders) > 1:
    for index, order in orders.iterrows():
        self.sub_transact(order.to_frame().T)
else:
    self.sub_transact(orders)
What this actually does is translate the series (which still contains the necessary column and index information) back into a dataframe. But for some moronic (though I'm sure Pythonic) reason it transposes it, so that the previous row is now a column and the previous columns are now multiple rows! So you just transpose it back with .T.
Use groupby with a key that is unique per row. groupby does exactly what you are asking for: it iterates over each group, and each group is a dataframe. So, if you group by a value that is unique for each and every row, you'll get a single-row dataframe when you iterate over the groups:
import numpy as np

for n, group in df.groupby(np.arange(len(df))):
    # do stuff; each `group` is a one-row dataframe
    pass
If I can suggest an alternative way, it would be like this:
for index, order in orders.iterrows():
    orders.loc[index:index]
orders.loc[index:index] is exactly a one-row dataframe slice with the same structure, including index and column names.
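A small self-contained sketch comparing the three approaches above on made-up data (the dataframe here is purely illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0], "qty": [1, 2]}, index=["a", "b"])

# 1) Transpose the row Series back into a one-row dataframe
for index, row in df.iterrows():
    print(row.to_frame().T)

# 2) Group by a key that is unique per row, so each group is a one-row dataframe
for n, group in df.groupby(np.arange(len(df))):
    print(group)

# 3) Slice with .loc[index:index] to keep the dataframe structure
for index in df.index:
    print(df.loc[index:index])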
