pyspark - what is the real use of "col" function - apache-spark

I have yet to find the real use of the "col" function. So far, I see the same result with or without it. Can someone elaborate on a use case that can only be done with the "col" function?
Both of the following return the same result. So what is the real need for the "col" function? I understand from the documentation that it returns a Column type.
employeesDF. \
select(upper("first_name"), upper("last_name")). \
show()
employeesDF. \
select(upper(col("first_name")), upper(col("last_name"))). \
show()

Some functions accept either a column name (a string) or a Column object as input, as in your select above. A select always returns a DataFrame of columns, so supporting both input types makes sense. Selecting by plain column name is much more common, however.
In many situations, though, there is a big difference between the string "columnName" and col("columnName"), and you have to be explicit. For example, say you have something like
when(col("my_col").isNull(), "other_col").otherwise(col("my_col"))
In that expression you would be returning the literal string "other_col" when "my_col" is null, instead of the value from the other_col column. To return the column's value you have to write col("other_col").

Related

Python Warning Panda Dataframe "Simple Issue!" - "A value is trying to be set on a copy of a slice from a DataFrame"

First post / total Python novice, so please be patient with my slow understanding!
I have a dataframe containing a list of transactions by order of transaction date.
I've appended a new field/column called ["DB/CR"] that, depending on the presence of "-" in the ["Amount"] field, is populated with 'Debit', or with 'Credit' in the absence of "-".
Noting that the transactions are in date order, I've included another new field/column called ["Top x"]. In it I want to populate an incremental number (starting at 1), counted independently for debits and for credits.
As such, I have created a simple loop with an associated 'if' / 'elif' statement (I could probably use else, as it's binary) that loops through the dataset from row 0 to the last row in the df and, depending on whether the row is 1) "Debit" or 2) "Credit", increments that type's own counter: integer 'i' for "Debit" and integer 'ii' for "Credit".
The code works as expected in terms of the output of 'Top x'; however, I always receive the warning "A value is trying to be set on a copy of a slice from a DataFrame".
Trying to perfect my script so that it runs without any warnings, I've been trying to understand what I'm doing incorrectly, but I'm not getting it for my use case.
I'd appreciate it if someone could kindly shed light on / propose how the code needs to be refactored to avoid receiving this warning.
Code (the df source data is an imported csv):
#top x debits/credits
i = 0
ii = 0
for ind in df.index:
    if df["DB/CR"][ind] == "Debit":
        i = i+1
        df["Top x"][ind] = i
    elif df["DB/CR"][ind] == "Credit":
        ii = ii+1
        df["Top x"][ind] = ii
Interpreter
df["Top x"][ind] = i
G:\Finances Backup\venv\Statementsv.03.py:173: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df["Top x"][ind] = ii
Many thanks :)
You should use df.loc[ind, "Top x"] = i (row label first, then column name) instead of the chained assignment df["Top x"][ind] = i.
Use iterrows() to iterate over the DataFrame. However, updating the DataFrame while iterating over it is not recommended. From the iterrows() documentation:
You should never modify something you are iterating over. This is not
guaranteed to work in all cases. Depending on the data types, the
iterator returns a copy and not a view, and writing to it will have no
effect.
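A minimal sketch of both suggestions, assuming the Amount column holds strings and using made-up sample rows in place of the imported CSV. The first variant keeps the original loop but performs each read and write through a single df.loc[row, column] call; the second drops the loop and relies on groupby().cumcount(), which numbers the rows inside each group starting from 0. The column name "Top x loop" is only there so both results can be compared side by side.

```python
import pandas as pd

# Hypothetical sample data standing in for the imported CSV.
df = pd.DataFrame({"Amount": ["-10.00", "25.00", "-3.50", "-7.25", "40.00"]})
df["DB/CR"] = df["Amount"].str.contains("-", regex=False).map(
    {True: "Debit", False: "Credit"}
)

# Variant 1: same loop as in the question, but one .loc call per access,
# so there is no chained indexing like df["Top x"][ind] = i.
df["Top x loop"] = 0
i = 0
ii = 0
for ind in df.index:
    if df.loc[ind, "DB/CR"] == "Debit":
        i += 1
        df.loc[ind, "Top x loop"] = i
    else:
        ii += 1
        df.loc[ind, "Top x loop"] = ii

# Variant 2: no loop. cumcount() counts rows within each "DB/CR" group
# from 0, in the existing row order, so adding 1 starts the numbering at 1.
df["Top x"] = df.groupby("DB/CR").cumcount() + 1

print(df)
```

If df was itself produced by slicing another DataFrame, the warning can still show up; taking an explicit df = df.copy() first is the usual way around that.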

Dynamically filtering a Pandas DataFrame based on user input

I would appreciate suggestions for a more computationally efficient way to dynamically filter a Pandas DataFrame.
The size of the DataFrame, len(df.index), is around 680,000.
This code from the callback function of a Plotly Dash dashboard is triggered when points on a scatter graph are selected. These points are passed to points as a list of dictionaries containing various properties with keys 'A' to 'C'. This allows the user to select a subset of the data in the pandas.DataFrame instance df for cross-filtering analysis.
rows_boolean = pandas.Series([False] * len(df.index))
for point in points:
    current_condition = ((df['A'] == point['a']) & (df['B'] == point['b'])
                         & (df['C'] >= point['c']) & (df['C'] < point['d']))
    rows_boolean = rows_boolean | current_condition
filtered = df.loc[rows_boolean, list_of_column_names]
The body of this for loop is very slow because every iteration scans the whole DataFrame. Running it once is manageable, but not once per selected point inside a loop.
Note that these filters are not additive, as in this example; each successive iteration of the for loop increases, rather than decreases, the size of filtered (as | rather than & operator is used).
Note also that I am aware of the existence of the method df['C'].between(point['c'], point['d']) as an alternative to the last two comparison operators, however, I only want this comparison to be inclusive at the lower end.
Solutions I have considered
Searching the many frustratingly similar posts on SO reveals a few ideas which get some of the way:
Using pandas.DataFrame.query() will require building a (potentially very large) query string as follows:
query = ' | '.join([f'((A == {point["a"]}) & (B == {point["b"]}) '
                    f'& (C >= {point["c"]}) & (C < {point["d"]}))' for point in points])
filtered = df.query(query)
My main concern here is that I don’t know how efficient the query method becomes when the query passed has several dozen (or even several hundred) conditions strung together. This solution also currently does not allow the selection of columns using list_of_column_names.
Another possible solution could come from implementing something like this.
To reiterate, speed is key here, so I'm not just after something that works, but something that works a darn sight faster than my boolean implementation above:
There should be one-- and preferably only one --obvious way to do it. (PEP 20)
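One possible direction, offered as an untimed sketch rather than a measured answer: turn points into a small DataFrame and merge it with df on A and B, so the equality tests happen in a single join instead of once per point, then apply the half-open range test on C to the joined rows. The sample data, the helper column row_id and the bound names lo and hi are made up for illustration; the other names follow the question.

```python
import pandas as pd

# Hypothetical sample data standing in for the 680,000-row DataFrame.
df = pd.DataFrame({
    "A": [1, 1, 2, 2, 3],
    "B": [10, 10, 20, 20, 30],
    "C": [0.5, 1.5, 2.5, 3.5, 4.5],
    "D": ["v", "w", "x", "y", "z"],
})
points = [
    {"a": 1, "b": 10, "c": 0.0, "d": 1.0},
    {"a": 2, "b": 20, "c": 2.5, "d": 4.0},
]
list_of_column_names = ["A", "C", "D"]

# One row per selected point; rename so the join keys match df's columns.
points_df = pd.DataFrame(points).rename(
    columns={"a": "A", "b": "B", "c": "lo", "d": "hi"}
)

# Carry the original row labels through the merge in a helper column.
candidates = df[["A", "B", "C"]].assign(row_id=df.index).merge(
    points_df, on=["A", "B"], how="inner"
)

# Range test: inclusive at the lower end, exclusive at the upper end.
in_range = (candidates["C"] >= candidates["lo"]) & (candidates["C"] < candidates["hi"])
matched = candidates.loc[in_range, "row_id"].unique()

# A row is kept if it matched at least one point (the OR of the conditions).
filtered = df.loc[df.index.isin(matched), list_of_column_names]
print(filtered)
```

The join produces one row per (df row, point) pair that shares A and B, so if many points share the same pair the intermediate frame grows accordingly. Whether this is actually faster than the boolean loop would need timing on the real data.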

Can a comparator function be made from two conditions connected by an 'and' in python (For sorting)?

I have a list of type:
ans=[(a,[b,c]),(x,[y,z]),(p,[q,r])]
I need to sort the list by using the following condition :
if (ans[j][1][1]>ans[j+1][1][1]) or (ans[j][1][1]==ans[j+1][1][1] and ans[j][1][0]<ans[j+1][1][0]):
    # do something (like swap(ans[j],ans[j+1]))
I was able to implement using bubble sort, but I want a faster sorting method.
Is there a way to sort my list using the sort() or sorted() functions (with a comparator or something similar) while respecting my condition?
You can create a key function that returns a tuple; tuples are compared from left to right until one of the elements is "larger" than the other. Your input/output example is quite lacking, but I believe this will result in what you want:
def my_compare(x):
    return x[1][1], x[1][0]
ans.sort(key=my_compare)
# ans = sorted(ans, key=my_compare)
Essentially this will first compare the x[1][1] values of ans[j] and ans[j+1], and if they are the same it will then compare the x[1][0] values. You can rearrange and add more elements to the tuple as you wish if this doesn't match your use case perfectly.
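Read literally, the swap condition in the question sorts ascending by x[1][1] but puts the larger x[1][0] first when there is a tie, whereas the key above sorts both ascending. A sketch of two ways to get that exact ordering, assuming the inner values are numbers (the sample list is made up): functools.cmp_to_key wraps a two-argument comparator, and negating the second tuple element flips the tie-break in a plain key function.

```python
from functools import cmp_to_key

# Hypothetical sample data in the same shape as the question's list.
ans = [("a", [3, 2]), ("x", [5, 1]), ("p", [7, 2]), ("q", [1, 1])]


def compare(left, right):
    """Return a negative number if left should come before right."""
    # Primary order: ascending by the second inner value.
    if left[1][1] != right[1][1]:
        return -1 if left[1][1] < right[1][1] else 1
    # Tie-break: descending by the first inner value.
    if left[1][0] != right[1][0]:
        return -1 if left[1][0] > right[1][0] else 1
    return 0


by_cmp = sorted(ans, key=cmp_to_key(compare))

# Same ordering with a key function; negation only works for numeric values.
by_key = sorted(ans, key=lambda item: (item[1][1], -item[1][0]))

print(by_cmp)
```

The key-function version is usually the simpler choice; cmp_to_key is handy when the tie-break cannot be expressed by negating a value.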

Cognos query calculation - how to obtain a null/blank value?

I have a query calculation that should throw me either a value (if conditions are met) or a blank/null value.
The code is in the following form:
if([attribute] > 3)
then ('value')
else ('')
At the moment the only way I could find to obtain the result is to use '' (i.e. an empty character string), but this is a value as well, so when I subsequently count the number of distinct values in another query I struggle to get the correct number (the empty string should be removed from the count, if found).
I can get the result with the following code:
if (attribute='') in ([first_query].[attribute])
then (count(distinct(attribute)) - 1)
else (count(distinct(attribute)))
How to avoid the double calculation in all later queries involving the count of attribute?
I use this Cognos function:
nullif(1, 1)
I found out that this can be managed using the case when function:
case
when ([attribute] > 3)
then ('value')
end
The difference is that case when doesn't need to cover all the possible options for handling the data: if it finds a case that is not in the list, it just returns a blank cell.
Perfect for what I needed (and not as well documented on the web as the opposite case, i.e. dealing with null cases that should be zero).

How to use a vector of strings to call dataframe columns by its header

In R, I want to use a subset of a dataframe 'RL' by selecting specific headers (e.g. 'RL$age01', etc.). I generate the selected headers as a vector of strings:
v = c('ID', sprintf("sex%02d", seq(1,15)), sprintf("age%02d", seq(1,15)))
and the dataframe index as:
c = sprintf('RL$%s', v)
How can I evaluate these strings to call the dataframe columns by header and rearrange them in a matrix, in the sense of x = cbind(RL$ID, RL$age01, ...)?
cbind(c) does not work, and neither does using things like eval(), parse() or expression().
Thanks for any help
Rafael
Just use
RL[,v]
Just noticed this was already mentioned in the comments.
