I need to add a customized condition that spans many columns inside .withColumn.
My scenario is roughly this: for each row, I have to check many columns for null values and add the names of the null columns to a new column. My code looks somewhat like this:
df = df.withColumn("MissingColumns",
    array(
        when(col("firstName").isNull(), lit("firstName")),
        when(col("salary").isNull(), lit("salary"))))
The problem is that I have many columns to add to this condition, so I tried to build the condition with a loop and f-strings and then use it:
df = df.withColumn("MissingColumns", condition)
But this condition does not work, probably because what I built is a plain string rather than a Column expression.
Is there any efficient way to do this?
You need to unpack your list inside the array as follows:
columns = ["firstName","salary"]
condition = array(*[when(col(c).isNull(),lit(c)) for c in columns])
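To make this concrete, here is a minimal end-to-end sketch; the SparkSession setup and the sample data are illustrative, only the two column names come from the question:
from pyspark.sql import SparkSession
from pyspark.sql.functions import array, col, lit, when

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Alice", None), (None, 3000)],
    "firstName string, salary int",
)

columns = ["firstName", "salary"]
# One when(...) per column: yields the column name where the value is null, otherwise null
condition = array(*[when(col(c).isNull(), lit(c)) for c in columns])
df = df.withColumn("MissingColumns", condition)
df.show(truncate=False)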
PowerQuery/Excel:
I have a table with a dynamic number of columns named Level 1, Level 2, Level 3, etc., and I need to apply Table.Sort(x, Order.Ascending) to all of them in the same order as they appear.
I tried to create a list from Table.ColumnNames and insert it directly into the column-name parameter of Table.Sort, but it doesn't work. I also tried to create a function that would loop through all the column names and apply the sort to each, but my knowledge of functions in DAX is far too limited for this.
Any help would be very welcome.
Assuming you only want to sort columns whose names start with 'Level', you could use something like this:
Table.Sort(Source, List.Select(Table.ColumnNames(Source), each Text.Start(_, 5) = "Level"))
I would like to know if there is a better alternative to VLOOKUP to find matches between two cells (or Python DataFrames).
Say I have the DataFrames below.
I want my code to check whether the values in DF1 are in DF2; if the values match exactly OR partially, return the corresponding value from DF2.
Just like the matches in the 4th column, rows 2 and 3, returned values.
Thanks Amigo!
Well, as you probably suspected already, you have several options. You can easily search for an exact match, like this.
=VLOOKUP(value,data,column,FALSE)
Here is an example.
https://www.excelfunctions.net/vlookup-example-exact-match.html
Or, consider doing a partial match, as such.
=VLOOKUP(value&"*",data,column,FALSE)
Here is an example.
https://exceljet.net/formula/partial-match-with-vlookup
Oh, you can do a fuzzy match as well. Use the AddIn below for this kind of task.
https://www.microsoft.com/en-us/download/details.aspx?id=15011
In Python, it would be done like this.
matches = []
for c in checklist:
    if c in words:
        matches.append(c)
Obviously, checklist and words are plain Python lists here, holding the values from your two sources (the items go inside the square brackets).
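For the partial-match case (the Python counterpart of the wildcard VLOOKUP above), here is a minimal sketch using substring containment; the list names follow the snippet above and the sample values are purely illustrative:
checklist = ["app", "banana"]            # values to look up (illustrative)
words = ["apple", "banana", "cherry"]    # values to search in (illustrative)

# Keep the word from `words` whenever a checklist value matches it exactly or partially
partial_matches = [(c, w) for c in checklist for w in words if c == w or c in w]
print(partial_matches)  # [('app', 'apple'), ('banana', 'banana')]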
For Python fuzzy matches, follow the steps outlined in the link below.
https://marcobonzanini.com/2015/02/25/fuzzy-string-matching-in-python/
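If you would rather stay in the standard library than follow that tutorial, difflib gives you a rough fuzzy match out of the box (a sketch, reusing the illustrative lists from above):
from difflib import get_close_matches

checklist = ["aple", "banan"]            # misspelled lookup values (illustrative)
words = ["apple", "banana", "cherry"]

# For each value, keep the closest word scoring above the cutoff, if any
fuzzy_matches = {c: get_close_matches(c, words, n=1, cutoff=0.6) for c in checklist}
print(fuzzy_matches)  # {'aple': ['apple'], 'banan': ['banana']}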
I have an Excel file where A1, A2, A3 are empty but A4:A53 contain column names.
In R, when you read that data, the column names for A1, A2, A3 would be "X_1, X_2, X_3", but pandas.read_excel simply skips the first three columns and ignores them. The problem is that the number of columns in each file is dynamic, so I cannot parse a fixed column range, and I cannot edit the files to add "dummy names" for A1, A2, A3.
Use parameter skip_blank_lines=False, like so:
pd.read_excel('your_excel.xlsx', header=None, skip_blank_lines=False)
This stackoverflow question (finally) pointed me in the right direction:
Python Pandas read_excel doesn't recognize null cell
The pandas.read_excel docs don't mention this parameter, since it is one of the pass-through keywords, but you can find it in the general IO docs here: http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table
A quick fix would be to pass header=None to pandas' read_excel() function, manually insert the missing values into the first row (which will now contain the column names), then assign that row to df.columns and drop it afterwards. Not the most elegant way, but I don't know of a built-in solution to your problem.
EDIT: by "manually insert" I mean some massaging with fillna(), since this appears to be an automated process of some sort.
I realize this is an old thread, but I solved it by specifying the column names and naming the final empty column, rather than importing with no names and then having to deal with a row of names in the data (I also used usecols). See below:
use_cols = 'A:L'
column_names = ['Col Name1', 'Col Name 2', 'Empty Col']
df = pd.read_excel(self._input_path, usecols=use_cols, names=column_names)
I have a need to query from a parquet file where the column names are completely inconsistent. In order to remedy this issue and ensure that my model gets exactly the data it expects, I need to 'prefetch' the list of columns and then apply some regex patterns to qualify which columns I need to retrieve. In pseudocode:
PrefetchList = sqlContext.read.parquet(my_parquet_file).schema.fields
# Insert variable statements to check/qualify the columns against rules here
dfQualified = SELECT [PrefetchList] from parquet;
I've searched around to see if this is achievable but not had any success. If this is syntactically correct (or close) or if someone has other suggestions I am open to it.
Thanks
You can use the schema method, but you can also use the .columns method.
Notice that the select method in Spark is a little odd: it's defined as def select(col: String, cols: String*), so you can't simply call select(fields: _*); you'd have to use df.select(fields.head, fields.tail: _*), which is kind of ugly. Luckily there's selectExpr(exprs: String*) as an alternative. So the snippet below will work; it takes only the columns that begin with 'user':
val fields = df.columns.filter(_.matches("^user.+")) // BYO regexp
df.selectExpr(fields: _*)
This of course assumes that df contains your dataframe, loaded with sqlContext.read.parquet().
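Since the question's pseudocode is written in Python, roughly the same thing in PySpark might look like this (a sketch; the parquet path and the 'user' prefix rule are illustrative):
import re

# Prefetch the column list from the parquet file, then qualify it against a regex
df = sqlContext.read.parquet("my_parquet_file.parquet")
fields = [c for c in df.columns if re.match(r"^user.+", c)]

# In PySpark, select accepts unpacked column-name strings directly
dfQualified = df.select(*fields)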
In Excel, I want to use something other than nested IF statements to execute a task. Is there a cleaner way of handling cases besides nested IFs? Is there a CASE statement in Excel? For example, given an ordered tuple of ones and zeros (e.g. (1,1,0)), I want the value of a cell to be something specific. Can I specify the ordered tuples in advance using something other than nested IF statements?
If you already know the ordered tuples and what you want the final value to be, why not create a reference table somewhere else on your sheet with Col1 = tuple ; Col2 = Wanted output?
Then just use a Vlookup() statement on that table...
Hope this makes sense / does what you want....
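For instance (a sketch; the cell layout is illustrative): if the tuple values sit in A1:C1 and the reference table lives in E1:F8, with the tuple written as text like "1,1,0" in column E and the wanted output in column F, you can join the tuple into a single key and look it up:
=VLOOKUP(A1&","&B1&","&C1,$E$1:$F$8,2,FALSE)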