How to separate a string in Databricks

I am trying to separate a string like LESOES DO OMBRO (M75) using the function split_part in Databricks, but I get an error: AnalysisException: Undefined function: 'SPLIT_part'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'. I need to separate the code in parentheses from the rest of the text.
I have a column "patologia" whose values look like LESOES DO OMBRO (M75), and I need a new column with the value M75.

If I understood correctly and you need a new column with the value that's between parentheses in another column, then you can extract that value with a regular expression, like this:
from pyspark.sql.functions import regexp_extract

regex_df = spark.createDataFrame([("LESOES DO OMBRO (M75)",)], "patologia: string")
# \(([^)]+)\) captures everything between the parentheses as group 1
extracted_col_df = regex_df.withColumn("extracted_value", regexp_extract("patologia", r'\(([^)]+)\)', 1))
extracted_col_df.show(truncate=False)
+---------------------+---------------+
|patologia            |extracted_value|
+---------------------+---------------+
|LESOES DO OMBRO (M75)|M75            |
+---------------------+---------------+
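Note that when a row has no parentheses at all, regexp_extract returns an empty string rather than null, so downstream filters may need to check for '' explicitly. A quick sketch reusing the session and import from above (the sample value is made up):

# a row without parentheses: extracted_value comes back as '' (empty), not null
no_match_df = spark.createDataFrame([("SEM CODIGO",)], "patologia: string")
no_match_df.withColumn("extracted_value", regexp_extract("patologia", r'\(([^)]+)\)', 1)).show()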

Related

pyspark - what is the real use of "col" function

I have yet to find the real use of the "col" function; so far, I am seeing the same result with or without it. Can someone elaborate on a use case that can only be done with the "col" function?
Both return the same result. So, what is the real need for the "col" function? I understood from the documentation that it returns a Column type.
from pyspark.sql.functions import col, upper

employeesDF. \
    select(upper("first_name"), upper("last_name")). \
    show()

employeesDF. \
    select(upper(col("first_name")), upper(col("last_name"))). \
    show()
In some cases functions accept either column names (plain strings) or Column objects as input, as in the select above. A select is always going to return a DataFrame of columns, so supporting both input types makes sense; selecting by just the column name is the more common style, however.
In many situations, though, there is a big difference between a plain string column name and col(string), and you have to be explicit. For example, say you have something like
when(col("my_col").isNull(), "other_col")
In that expression you would be returning the literal string "other_col" when "my_col" is null, instead of the value from the "other_col" column; to get the column's value you would need col("other_col").
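To see the difference end to end, here is a small runnable sketch (the DataFrame contents are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(None, "fallback")], "my_col: string, other_col: string")

df.select(
    # bare string: treated as the literal value "other_col"
    when(col("my_col").isNull(), "other_col").alias("as_literal"),
    # col(...): a reference to the other_col column, so its value is returned
    when(col("my_col").isNull(), col("other_col")).alias("as_column"),
).show()
# as_literal -> "other_col", as_column -> "fallback"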

transform columns from column list within select method that's attached to join method

I have two data frames with the same schema. I'm using the outer join method on both data frames and I'm using the select and coalesce methods to select and transform all columns. I want to iterate over the column list within the select method without explicitly defining each column within the coalesce method. It would be great to know if there's a solution without using a UDF. The two tables that are being joined are songs and staging_songs within the code snippets below.
Instead of explicitly defining each column like so:
import pyspark.sql.functions as f

updated_songs = songs.join(staging_songs, songs.song_id == staging_songs.song_id, how='full').select(
    f.coalesce(staging_songs.song_id, songs.song_id),
    f.coalesce(staging_songs.artist_name, songs.artist_name),
    f.coalesce(staging_songs.song_name, songs.song_name)
)
Doing something along the lines of:
# column names to iterate over in the select method
songs_columns = songs.columns

updated_songs = songs.join(staging_songs, songs.song_id == staging_songs.song_id, how='full').select(
    # using a for loop like this raises a syntax error
    for col in songs_columns:
        f.coalesce(staging_songs.col, songs.col))
Try this:
updated_songs = songs.join(
    staging_songs, songs["song_id"] == staging_songs["song_id"], how='full'
).select(
    *[f.coalesce(staging_songs[col], songs[col]).alias(col) for col in songs_columns]
)
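The * unpacks the list comprehension into select's varargs, which is what the inline for statement could not do. A self-contained sketch with made-up rows (schema assumed from the question) showing the coalesce preferring staging values:

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# made-up rows; staging values should win wherever they are non-null
songs = spark.createDataFrame(
    [("s1", "Artist A", "Song A"), ("s2", "Artist B", None)],
    "song_id: string, artist_name: string, song_name: string",
)
staging_songs = spark.createDataFrame(
    [("s2", "Artist B", "Song B (fixed)"), ("s3", "Artist C", "Song C")],
    "song_id: string, artist_name: string, song_name: string",
)

songs_columns = songs.columns
updated_songs = songs.join(
    staging_songs, songs["song_id"] == staging_songs["song_id"], how="full"
).select(*[f.coalesce(staging_songs[c], songs[c]).alias(c) for c in songs_columns])
updated_songs.show()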

Returning a Pandas DataFrame Index as a String

I want to return the index of my DataFrame as a string. I am using this command: p_h = peak_hour_df.index.astype(str).str.zfill(4). It is not working; I am getting this result: Index(['1645'], dtype='object', name. I need it to return the string '1645'. How do I accomplish this?
In short:
do p_h = list(peak_hour_df.index.astype(str).str.zfill(4)). This will return a list, which you can then index.
In more detail:
When you do peak_hour_df.index.astype(str), as you see, the dtype is already object (string), so that job is done. Note this is the type of the contents, not of the object itself. I am also leaving .str.zfill(4) aside, as it does not change the nature of the problem or the return type.
Then the type of the whole object you are returning is pandas.core.indexes.base.Index. You can check this like so: type(peak_hour_df.index.astype(str)). If you want to return a single value from it as a str (e.g. the first value), then you can either index the pandas object directly like so:
peak_hour_df.index.astype(str)[0]
or (as I show above) you can convert it to a list and then index that list (for some reason, most people find this more intuitive):
peak_hour_df.index.astype(str).to_list()[0]
list(peak_hour_df.index.astype(str))[0]
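Putting it together, a minimal runnable sketch (the frame's contents are assumed; only the index matters here):

import pandas as pd

# assumed toy frame: the peak hour lives in the index
peak_hour_df = pd.DataFrame({"volume": [123]}, index=pd.Index([1645], name="hour"))

p_h = peak_hour_df.index.astype(str).str.zfill(4)[0]
print(p_h, type(p_h))  # 1645 <class 'str'>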

groupby select value only if match

I got my data sorted correctly, but now I'm trying to find a way to group by the "first non-empty string value". Is there a way to do this without changing the rest of the data? first() was close, but not quite what I needed:
grouped = sortedvals.groupby(['name']).first().reset_index()
doesn't work if the first value is empty, i.e. '', 2 (my goal is to return 2), but does work for everything else.
Use the replace function to replace blank values with np.nan; groupby(...).first() skips NaN, so the first non-empty value per group is returned:
import numpy as np
grouped = sortedvals.replace('',np.nan).groupby(['name']).first().reset_index()
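A quick sketch with made-up data (column names assumed) to show the effect:

import numpy as np
import pandas as pd

# made-up data: the first value in group 'a' is an empty string
sortedvals = pd.DataFrame({"name": ["a", "a", "b"], "val": ["", 2, 3]})

grouped = sortedvals.replace('', np.nan).groupby(['name']).first().reset_index()
print(grouped)  # group 'a' yields 2, not ''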

Clojure - No matching method found for select method in DataFrame when using Flambo

I'm using Flambo to work with Spark. I want to retrieve a DataFrame which contains given column names. I wrote a simple function as follows:
(defn make-dataset
  ([data-path column-names and-another]
   (let [data (sql/read-csv sql-context data-path)
         cols (map #(.col data %) column-names)]
     (.select data (Column. "C0")))))
I get the following exception when I execute it:
IllegalArgumentException No matching method found: select for class org.apache.spark.sql.DataFrame clojure.lang.Reflector.invokeMatchingMethod (Reflector.java:80)
What am I doing wrong? Why does .col work whereas .select doesn't, when both of them are available from the same class? Please correct me if I am mistaken.
The DataFrame.select you are trying to call has the following signature:
def select(cols: Column*): DataFrame
As you can see, it accepts varargs of Column, whereas you provide a single bare Column value, which doesn't match the signature; hence the exception. Scala varargs are wrapped in scala.collection.Seq. You can wrap your column(s) into something that implements Seq using the following code:
(scala.collection.JavaConversions/asScalaBuffer [(Column. "C0")])
In Clojure, use arrays (or collections) to pass values to varargs parameters. I had the same issue; it was resolved when I called the select function on the DataFrame with a String and an array of Strings.
Something like:
(def cols-vec ["a" "b" "c"])
(defn convert->spark-cols [columns]
  (into [] (map #(Column. %) columns)))
We get fooled by the way the Java API works when it comes to collections: when a method signature says Column*, Java is OK with a single value, whereas from Clojure you must pass a collection.
