Using orderBy with DataFrames in Spark (Python) - apache-spark

I am not sure what is going wrong here. I have a dataframe:
DFexample.columns
['url', 'weight1', 'weight2']
and I am trying to order it in descending order by weight2:
DFexample.orderBy(DFexample.weight2.desc()).show(4)
+-----+---------+-------------------+
| url | weight1 |           weight2 |
+-----+---------+-------------------+
|   x |       0 | 9.800000342342E-4 |
|  x2 |       1 |       2.432432432 |
|  x3 |     1.2 |       2.134234234 |
|  x4 |    1.32 |          1.232324 |
+-----+---------+-------------------+
Everything seems to be ordered except for the first value. Why would this happen?
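A likely explanation (an assumption, since the schema isn't shown) is that weight2 was read as a string rather than a numeric type: in descending lexicographic order, "9.800000342342E-4" sorts before "2.432432432" because "9" > "2", which matches the output above exactly. If that is the case, casting the column before ordering should fix it. A minimal sketch, assuming weight2 currently has a string type:

from pyspark.sql.functions import col

# Cast weight2 to double so orderBy compares numbers, not strings.
DFexample.orderBy(col("weight2").cast("double").desc()).show(4)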

Related

Extract values from a column in a Spark dataframe into two new columns

I have a spark dataframe that looks like this:
+----+------+-------------+
|user| level|value_pair   |
+----+------+-------------+
| A  | 25   |(23.52,25.12)|
| A  | 6    |(0,0)        |
| A  | 2    |(11,12.12)   |
| A  | 32   |(17,16.12)   |
| B  | 22   |(19,57.12)   |
| B  | 42   |(10,3.2)     |
| B  | 43   |(32,21.0)    |
| C  | 33   |(12,0)       |
| D  | 32   |(265.21,19.2)|
| D  | 62   |(57.12,50.12)|
| D  | 32   |(75.12,57.12)|
| E  | 63   |(0,0)        |
+----+------+-------------+
How do I extract the values in the value_pair column and add them to two new columns called value1 and value2, using the comma as the separator?
+----+------+-------------+-------+
|user| level|value1       |value2 |
+----+------+-------------+-------+
| A  | 25   |23.52        |25.12  |
| A  | 6    |0            |0      |
| A  | 2    |11           |12.12  |
| A  | 32   |17           |16.12  |
| B  | 22   |19           |57.12  |
| B  | 42   |10           |3.2    |
| B  | 43   |32           |21.0   |
| C  | 33   |12           |0      |
| D  | 32   |265.21       |19.2   |
| D  | 62   |57.12        |50.12  |
| D  | 32   |75.12        |57.12  |
| E  | 63   |0            |0      |
+----+------+-------------+-------+
I know I can separate the values like so:
df = df.withColumn('value1', pyspark.sql.functions.split(df['value_pair'], ',')[0])
df = df.withColumn('value2', pyspark.sql.functions.split(df['value_pair'], ',')[1])
But how do I also get rid of the parentheses?
For the parentheses, as shown in the comments you can use regexp_replace, but you need to escape them with a backslash \, since ( and ) are special characters in regular expressions.
Also, I believe you need to remove the parentheses first, and then split the column.
from pyspark.sql.functions import regexp_replace, split

# Strip the opening and closing parentheses (escaped with a backslash,
# since they are regex metacharacters), then split on the comma.
df = df.withColumn('value_pair', regexp_replace(df.value_pair, r"\(", ""))
df = df.withColumn('value_pair', regexp_replace(df.value_pair, r"\)", ""))
df = df.withColumn('value1', split(df['value_pair'], ',').getItem(0)) \
       .withColumn('value2', split(df['value_pair'], ',').getItem(1))
>>> df.show(truncate=False)
+----+-----+-----------+------+---------+
|user|level|value_pair |value1|value2   |
+----+-----+-----------+------+---------+
| A  |25   |23.52,25.12|23.52 |25.12    |
| A  |6    |0,0        |0     |0        |
| A  |2    |11,12.12   |11    |12.12    |
| A  |32   |17,16.12   |17    |16.12    |
| B  |22   |19,57.12   |19    |57.12    |
| B  |42   |10,3.2     |10    |3.2      |
| B  |43   |32,21.0    |32    |21.0     |
| C  |33   |12,0       |12    |0        |
| D  |32   |265.21,19.2|265.21|19.2     |
| D  |62   |57.12,50.12|57.12 |50.12    |
| D  |32   |75.12,57.12|75.12 |57.12    |
| E  |63   |0,0        |0     |0        |
+----+-----+-----------+------+---------+
As you may have noticed, I slightly changed how you grab the two items, using getItem instead of indexing.
More information can be found here
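A slightly more compact variant of the same approach: a character class such as [()] matches either parenthesis, so a single regexp_replace cleans the column without any backslash escaping, and the split result can be computed once and reused. A sketch, assuming the same df as above:

from pyspark.sql.functions import regexp_replace, split

# [()] matches "(" or ")", so one pass strips both; split once and
# take both items from the resulting array column.
pair = split(regexp_replace("value_pair", r"[()]", ""), ",")
df = df.withColumn("value1", pair.getItem(0)) \
       .withColumn("value2", pair.getItem(1))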

Excel CUBESET function to get a 2-column set

I am trying to use the CUBESET function to get a set of 2 columns. The data table is something like below:
TABLE
+--------+-------+-------+
| CLIENT | PRODA | PRODB |
+--------+-------+-------+
|   1    |   A   |   X   |
|   1    |   A   |   Y   |
|   1    |   B   |   X   |
|   2    |   A   |   Y   |
|   2    |   B   |   X   |
|   2    |   C   |   Y   |
+--------+-------+-------+
The code I am running returns only a single-column set:
=CUBESET("ThisWorkbookDataModel";"[TABLE].[CLIENT].&[1]*[TABLE].[PRODA].children";"result set")
What I need is to return both related columns, PRODA and PRODB. This is what I tried:
=CUBESET("ThisWorkbookDataModel";"[TABLE].[CLIENT].&[1]*[TABLE].[PRODA].[PRODB].children";"result set")
Expected result set:
+-------+-------+
| PRODA | PRODB |
+-------+-------+
|   A   |   X   |
|   A   |   Y   |
|   B   |   X   |
+-------+-------+
So what is the correct way to write this to retrieve both related columns?
Appreciate any help.
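One thing worth trying (a hedged, untested sketch, since the right expression depends on your data model): in MDX, * is the cross-join operator and it chains, so PRODB's members can be cross-joined as a third set instead of being nested under PRODA; wrapping the set in NONEMPTY should drop combinations that never occur for that client:
=CUBESET("ThisWorkbookDataModel";"NONEMPTY([TABLE].[CLIENT].&[1]*[TABLE].[PRODA].children*[TABLE].[PRODB].children)";"result set")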

drop rows matching string from an excel file using pandas

I have an Excel file users.xls containing all users from the enterprise's departments (203 lines), with a three-line header before the data.
I'd like to drop all users from the compta department in the file.
I tried to write a script that uses pandas to do it.
I can list all compta users in the file, but when I try to drop them, it fails:
import pandas as pd
cu = 'users.xls'
users = pd.read_excel(cu, skiprows=3)
sorted_by_username = users.sort_values(['Department'], ascending=True)
df = pd.DataFrame(sorted_by_username)
comptausers = df['Department'].str.contains('compta')
dfu = df[comptausers]
print(dfu)
df.drop(index=dfu, axis=0, inplace=True)
Actual users.xls file:
cat users.xls
+--+----------+----------+-------------+
|  |List Of users                      |
+--+----------+----------+-------------+
|  |          |          |             |
+--+----------+----------+-------------+
|  |          |          |             |
+--+----------+----------+-------------+
|ID|User Name |Department|User Location|
+--+----------+----------+-------------+
|2 |smith     |Sales     |Paris        |
+--+----------+----------+-------------+
|10|Foo       |Compta    |Paris        |
+--+----------+----------+-------------+
|1 |Targaryan |CEO       |London       |
+--+----------+----------+-------------+
|21|Cooper    |Compta    |Berlin       |
+--+----------+----------+-------------+
What I want:
cat users.xls
+--+----------+----------+-------------+
|  |List Of users                      |
+--+----------+----------+-------------+
|  |          |          |             |
+--+----------+----------+-------------+
|  |          |          |             |
+--+----------+----------+-------------+
|ID|User Name |Department|User Location|
+--+----------+----------+-------------+
|2 |smith     |Sales     |Paris        |
+--+----------+----------+-------------+
|1 |Targaryan |CEO       |London       |
+--+----------+----------+-------------+
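Two things stand out in the script above: df.drop(index=...) expects index labels rather than a DataFrame (dfu.index would work), and str.contains('compta') will not match the capitalized 'Compta' stored in the sheet. A minimal sketch of a fix, assuming the Department column reads in as text (the output file name users_filtered.xlsx is just an example):

import pandas as pd

users = pd.read_excel('users.xls', skiprows=3)

# Flag compta rows case-insensitively; na=False treats empty
# Department cells as non-matches instead of raising an error.
is_compta = users['Department'].str.contains('compta', case=False, na=False)

# Keep everything that is NOT compta and write the result out.
users[~is_compta].to_excel('users_filtered.xlsx', index=False)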

SparkSQL Get all prefixes of a word

Say I have a column in a SparkSQL DataFrame like this:
+-------+
| word  |
+-------+
| chair |
| lamp  |
| table |
+-------+
I want to explode out all the prefixes like so:
+--------+
| prefix |
+--------+
| c      |
| ch     |
| cha    |
| chai   |
| chair  |
| l      |
| la     |
| lam    |
| lamp   |
| t      |
| ta     |
| tab    |
| tabl   |
| table  |
+--------+
Is there a good way to do this WITHOUT using UDFs or functional-programming methods such as flatMap in Spark SQL? (I'm talking about a solution using the codegen-optimized functions in org.apache.spark.sql.functions._.)
Technically it is possible but I doubt it will perform any better than a simple flatMap (if performance is the reason to avoid flatMap):
val df = Seq("chair", "lamp", "table").toDF("word")
df.withColumn("len", explode(sequence(lit(1), length($"word"))))
  .select($"word".substr(lit(1), $"len") as "prefix")
  .show()
Output:
+------+
|prefix|
+------+
|     c|
|    ch|
|   cha|
|  chai|
| chair|
|     l|
|    la|
|   lam|
|  lamp|
|     t|
|    ta|
|   tab|
|  tabl|
| table|
+------+
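Since the rest of this page uses Python, here is a PySpark sketch of the same trick (assuming an existing SparkSession named spark); sequence and explode behave the same way there:

from pyspark.sql.functions import col, explode, length, lit, sequence

df = spark.createDataFrame([("chair",), ("lamp",), ("table",)], ["word"])

# Explode a 1..length(word) index, then take the prefix of that length.
(df.withColumn("len", explode(sequence(lit(1), length("word"))))
   .select(col("word").substr(lit(1), col("len")).alias("prefix"))
   .show())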

Excel if statement to display if anything is in the row

I don't really know how to search for this question or come up with an appropriate title, so I hope this makes sense.
I'm trying to construct an Excel spreadsheet to keep track of which functions of a piece of software currently have tests written for them. The spreadsheet looks something like below, where A-F are placeholders for the tests and 1-5 are placeholders for functions.
| | A | B | C | D | E | F |
|:-:|---|---|---|---|---|---|
| 1 | X | | | | | X |
| 2 | | | | | | |
| 3 | | X | | | | |
| 4 | | | X | | | |
| 5 | | | | X | X | |
I would like to have another column at the end that would do something like this:
| | A | B | C | D | E | F | Tested? |
|:-:|---|---|---|---|---|---|---------|
| 1 | X | | | | | X | Yes |
| 2 | | | | | | | No |
| 3 | | X | | | | | Yes |
| 4 | | | X | | | | Yes |
| 5 | | | | X | X | | Yes |
where the final column is an IF statement that displays a conditional string based on whether there are any entries in the row. I know that Excel's IF statements work something like this: =IF(A1=10,"YES","NO"), but I can't work out how to construct an IF statement that would print YES or NO based on whether there are any entries at all in the row.
EDIT: To add a little more detail, I've thought about constructing an IF statement like this: =IF(SUM(C3:AI3)>0, "YES", "NO"), and this essentially works if I use 1s and 0s instead of X or O, but I'd rather use the latter; really, I'd just rather use strings than integers.
You can use the following formula:
=IF(COUNTA(A1:F1)>0,"Yes","No")
You might also think of the ISBLANK function, but note that it does not aggregate over a range, so a formula like =IF(ISBLANK(A1:F1),"Yes","No") will not behave as expected (and its Yes/No branches would be inverted anyway). COUNTBLANK does work on ranges:
=IF(COUNTBLANK(A1:F1)<COLUMNS(A1:F1),"Yes","No")
