transform columns from column list within select method that's attached to join method - python-3.x

I have two data frames with the same schema. I'm using the outer join method on both data frames and I'm using the select and coalesce methods to select and transform all columns. I want to iterate over the column list within the select method without explicitly defining each column within the coalesce method. It would be great to know if there's a solution without using a UDF. The two tables that are being joined are songs and staging_songs within the code snippets below.
Instead of explicitly defining each column like so:
from pyspark.sql import functions as f

updated_songs = songs.join(staging_songs, songs.song_id == staging_songs.song_id, how='full').select(
    f.coalesce(staging_songs.song_id, songs.song_id),
    f.coalesce(staging_songs.artist_name, songs.artist_name),
    f.coalesce(staging_songs.song_name, songs.song_name)
)
Doing something along the lines of:
# column names to iterate over in the select method
songs_columns = songs.columns
updated_songs = songs.join(staging_songs, songs.song_id == staging_songs.song_id, how='full').select(
    # using a for loop like this raises a SyntaxError
    for col in songs_columns:
        f.coalesce(staging_songs.col, songs.col))

Try this: a for statement is not an expression, so it cannot appear inside a call's argument list, but a list comprehension can, and the * unpacks the resulting expressions into select. Note the bracket indexing (songs[col]); attribute access (songs.col) cannot take a variable column name:
updated_songs = songs.join(
    staging_songs, songs["song_id"] == staging_songs["song_id"], how='full'
).select(
    *[f.coalesce(staging_songs[col], songs[col]).alias(col) for col in songs_columns]
)
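To see the pattern end to end, here is a minimal, self-contained sketch; the sample rows and the local SparkSession setup are invented for illustration:
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()
songs = spark.createDataFrame(
    [("s1", "Artist A", "Old Name")],
    ["song_id", "artist_name", "song_name"])
staging_songs = spark.createDataFrame(
    [("s1", "Artist A", "New Name"), ("s2", "Artist B", "Another Song")],
    ["song_id", "artist_name", "song_name"])

songs_columns = songs.columns
updated_songs = songs.join(
    staging_songs, songs["song_id"] == staging_songs["song_id"], how="full"
).select(*[f.coalesce(staging_songs[c], songs[c]).alias(c) for c in songs_columns])

# staging values win where both sides match; unmatched rows from either side survive
updated_songs.show()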

Related

How to separate a string in Databricks

I am trying to separate a string like LESOES DO OMBRO (M75) using the function split_part in Databricks, but I get an error: AnalysisException: Undefined function: 'SPLIT_part'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'. I need to separate the code in parentheses from the rest of the text.
I have a column "patologia"; it contains values such as LESOES DO OMBRO (M75), and I need a new column with the value M75.
If I understood correctly and you need a new column with the value that's between parentheses in another column, then you can extract that value with a regular expression, like this:
from pyspark.sql.functions import regexp_extract
regex_df = spark.createDataFrame([("LESOES DO OMBRO (M75)",)], "patologia: string")
extracted_col_df = regex_df.withColumn("extracted_value", regexp_extract("patologia", r'\(([^)]+)\)', 1))
extracted_col_df.show(truncate=False)
+---------------------+---------------+
|patologia            |extracted_value|
+---------------------+---------------+
|LESOES DO OMBRO (M75)|M75            |
+---------------------+---------------+
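For completeness, the same extraction also works through the SQL API; a small sketch, assuming the DataFrame is registered as a temp view (the view name here is illustrative):
regex_df.createOrReplaceTempView("patologia_tbl")
spark.sql(
    "SELECT patologia, regexp_extract(patologia, '\\\\(([^)]+)\\\\)', 1) "
    "AS extracted_value FROM patologia_tbl"
).show(truncate=False)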

Dynamically filtering a Pandas DataFrame based on user input

I would appreciate suggestions for a more computationally efficient way to dynamically filter a Pandas DataFrame.
The size of the DataFrame, len(df.index), is around 680,000.
This code from the callback function of a Plotly Dash dashboard is triggered when points on a scatter graph are selected. These points are passed to points as a list of dictionaries whose properties include the keys 'a' to 'd'. This allows the user to select a subset of the data in the pandas.DataFrame instance df for cross-filtering analysis.
import pandas

rows_boolean = pandas.Series([False] * len(df.index))
for point in points:
    current_condition = (
        (df['A'] == point['a']) & (df['B'] == point['b'])
        & (df['C'] >= point['c']) & (df['C'] < point['d'])
    )
    rows_boolean = rows_boolean | current_condition
filtered = df.loc[rows_boolean, list_of_column_names]
The body of this for loop is very slow, as each iteration scans the whole data frame; running it once is manageable, but not inside a loop.
Note that these filters are not cumulative restrictions: each successive iteration of the for loop increases, rather than decreases, the size of filtered (since | rather than & is used).
Note also that I am aware of the method df['C'].between(point['c'], point['d']) as an alternative to the last two comparison operators; however, I only want this comparison to be inclusive at the lower end.
Solutions I have considered
Searching the many frustratingly similar posts on SO reveals a few ideas which get some of the way:
Using pandas.DataFrame.query() will require building a (potentially very large) query string as follows:
query = ' | '.join(
    f'((A == {point["a"]}) & (B == {point["b"]}) & (C >= {point["c"]}) & (C < {point["d"]}))'
    for point in points
)
filtered = df.query(query)
My main concern here is that I don’t know how efficient the query method becomes when the query passed has several dozen (or even several hundred) conditions strung together. This solution also currently does not allow the selection of columns using list_of_column_names.
Another possible solution could come from implementing something like this.
To reiterate, speed is key here, so I'm not just after something that works, but something that works a darn sight faster than my boolean implementation above:
There should be one-- and preferably only one --obvious way to do it. (PEP 20)
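One direction worth timing, sketched under the assumption that points is a flat list of dicts with keys 'a' to 'd' as above and that df has a default integer index: move the equality tests into a single merge and apply the range tests once on the merged result, so no Python-level loop ever touches df.
import pandas

# hypothetical: the selected points as their own frame, one row per point
points_df = pandas.DataFrame(points)  # columns 'a', 'b', 'c', 'd'

# inner merge handles the equality conditions for all points at once
merged = df.reset_index().merge(
    points_df, left_on=['A', 'B'], right_on=['a', 'b'], how='inner'
)
# one vectorized pass for the range conditions
in_range = merged[(merged['C'] >= merged['c']) & (merged['C'] < merged['d'])]

# a row can match several points, so de-duplicate on the original index
filtered = df.loc[in_range['index'].unique(), list_of_column_names]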

Using custom table to feed drop-down list datasource

Let's assume that I have a Custom Table named Possible URL target parameters, with code name xyz.PossibleTargets and two columns:
Explanation and Value.
How do I feed a drop-down field on a page type so that Value (from the table) is the stored value and Explanation is the displayed name?
What I have already tried, without success:
Generating value;name pairs divided by newlines and placing the result as List of options:
z = ""; foreach (x in CMSContext.Current.GlobalObjects.CustomTables["xyz.PossibleTargets"].Items) {z += x.GetValue("Value"); z += ";"; z += x.GetValue("Explanation"); z += "\n"}; return z;
The validator does not allow me to do such a trick.
Setting the option Macro expression and providing an enumerable object:
CMSContext.Current.GlobalObjects.CustomTables["xyz.PossibleTargets"].Items
In Item transformation: {%Explanation%} and in Value column {%TargetValue%}.
This does not work either.
Dropdown configuration
How do I do this correctly? The documentation and the hints on the fields are not helpful.
Kentico v11.0.26
I think that you should do it without marking the field as a macro; just type the macro there. Take a look at the screenshot.
There is no need to use a macro; straight SQL will do, and a macro only complicates what appears to be a simple drop-down list.
SELECT '' AS TargetValue, '-- select one --' AS Explanation
UNION
SELECT TargetValue, Explanation
FROM xyz_PossibleTargets -- make sure to use the correct table name
ORDER BY Explanation
This should populate exactly what you're looking for without the complication of a macro.

Selecting all cells between two strings in a column

I posted this question previously as "using “.between” for string values not working in python", but I was not clear enough and could not edit, so I am reposting with clarity here.
I have a DataFrame. In cell [0,61] I have a string, and in [0,69] I have another string. I want to slice all the data in cells [0,62:68] between these two, merge it, and paste the result into [1,61]. Subsequently, [0,62:68] will be blank, but that is not important.
However, I have several hundred documents, and I want to write a script that executes on all of them. The strings in [0,61] and [0,69] are always present in all the documents, but at different locations in that column. So I tried using:
For_Paste = df[0][df[0].between('DESCRIPTION OF WORK / STATEMENT OF WORK', 'ADDITIONAL REQUIREMENTS / SUPPORTING DOCUMENTATION', inclusive=False)]
But the output I get is: Series([], Name: 0, dtype: object)
I was expecting a list or array with the desired data that I could merge and paste. Thanks.
If you want to select the rows between two indices (say idx_start and idx_end, excluding those two rows) on column col of the dataframe df, you will want to use
df.loc[idx_start + 1 : idx_end - 1, col]
(note that .loc slicing is inclusive at both ends, hence the - 1 on the end index).
To find the first index matching a string s, use
idx = df.index[df[col] == s][0]
So for your case, to return a Series of the rows between these two indices, try the following:
start_string = 'DESCRIPTION OF WORK / STATEMENT OF WORK'
end_string = 'ADDITIONAL REQUIREMENTS / SUPPORTING DOCUMENTATION'
idx_start = df.index[df[0] == start_string][0]
idx_end = df.index[df[0] == end_string][0]
For_Paste = df.loc[idx_start + 1 : idx_end - 1, 0]
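As a quick check, here is a miniature, self-contained version; the filler rows are invented for illustration:
import pandas as pd

df = pd.DataFrame({0: [
    'HEADER',
    'DESCRIPTION OF WORK / STATEMENT OF WORK',
    'line one',
    'line two',
    'ADDITIONAL REQUIREMENTS / SUPPORTING DOCUMENTATION',
]})

idx_start = df.index[df[0] == 'DESCRIPTION OF WORK / STATEMENT OF WORK'][0]
idx_end = df.index[df[0] == 'ADDITIONAL REQUIREMENTS / SUPPORTING DOCUMENTATION'][0]
For_Paste = df.loc[idx_start + 1 : idx_end - 1, 0]
print(' '.join(For_Paste))  # -> 'line one line two'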

How to use a vector of strings to call dataframe columns by their headers

In R, I want to use a subset of a dataframe 'RL' by selecting specific headers (e.g. 'RL$age01', etc.). I generate the selected headers as a vector of strings:
v = c('ID', sprintf("sex%02d", seq(1,15)), sprintf("age%02d", seq(1,15)))
and the dataframe index as:
c = sprintf('RL$%s', v)
How can I evaluate these strings to call the dataframe columns by header and rearrange them in a matrix, in the sense of x = cbind(RL$ID, RL$age01, ...)?
cbind(c) does not work, nor does using things like eval(), parse() or expression().
Thanks for any help
Rafael
Just use
RL[,v]
Indexing the data frame with the character vector selects those columns by name; wrap it in as.matrix() if you need a matrix rather than a data frame. Just noticed this was already mentioned in the comments.
