I want a single graph that shows all of these values.
One search is:
index="cumu_open_csv" Assignee="ram"
| eval open_field=if(in(Status,"Open","Reopened","Waiting","In Progress"), 1,0)
| stats count(eval(open_field=1)) AS Open, count(eval(open_field=0)) AS closed by CW_Created
This gives me a table of Open and closed counts by CW_Created.
Similarly, I have another search:
index="cumu_open_csv" Assignee="ram"
| eval open_field=if(in(Status,"Open","Reopened","Waiting","In Progress"), 1,0)
| stats count(eval(open_field=1)) As DueOpen by CW_DueDate
which gives me another table, of DueOpen counts by CW_DueDate.
I tried to combine these two using appendcols, but the X-axis only has CW_Created, and the second table's details end up under the wrong CW.
I want CW_Created and CW_DueDate combined so that the result is a single table with columns CW, Open, Close, DueCount. Wherever a CW has no DueCount, fill it with 0; otherwise display the data, like so:
CW |Open |Close |DueCount
CW27 |7 |0 |0
CW28 |2 |0 |0
CW29 |0 |0 |4
CW30 |0 |7 |3
CW31 |0 |0 |1
CW32 |0 |0 |1
The appendcols command is a bit tricky to use. Events from the main search and the subsearch are paired one-to-one, without regard to any field value, which means event CW27 will be matched with CW29, CW28 with CW30, and so on.
Try the append command instead. The results of the subsearch will follow the results of the main search, and a stats command can then be used to merge them:
index="cumu_open_csv" Assignee="ram"
| eval open_field=if(in(Status,"Open","Reopened","Waiting","In Progress"), 1,0)
| stats count(eval(open_field=1)) AS Open, count(eval(open_field=0)) AS closed by CW_Created
| append [ index="cumu_open_csv" Assignee="ram"
| eval open_field=if(in(Status,"Open","Reopened","Waiting","In Progress"), 1,0)
| stats count(eval(open_field=1)) As DueOpen by CW_DueDate ]
| eval CW = coalesce(CW_Created, CW_DueDate)
| stats values(*) as * by CW
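To match your desired output, where a CW with no due items shows 0 rather than a blank, you should be able to finish with fillnull (assuming the field names from the query above):
| fillnull value=0 Open closed DueOpen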
This may be what you are looking for:
index="cumu_open_csv" Assignee="ram"
| eval open_field=if(in(Status,"Open","Reopened","Waiting","In Progress"), 1,0)
| stats count(eval(open_field=1)) AS Open, count(eval(open_field=0)) AS closed by CW_Created
| rename CW_Created as CW
| join type=outer CW
[| search index="cumu_open_csv" Assignee="ram"
| eval open_field=if(in(Status,"Open","Reopened","Waiting","In Progress"), 1,0)
| stats count(eval(open_field=1)) As DueOpen by CW_DueDate
| rename CW_DueDate as CW ]
Or possibly this:
index="cumu_open_csv" Assignee="ram"
| eval open_field=if(in(Status,"Open","Reopened","Waiting","In Progress"), 1,0)
| eval CW=if(len(CW_Created)>1,CW_Created,CW_DueDate)
| stats count(eval(open_field=1)) AS Open, count(eval(open_field=0)) AS closed, count(eval(open_field=1)) as DueOpen by CW
Sample data would make it substantially easier to help you with this.
I am working on an Excel sheet like this:
I would like to create a condition that matches values between the two tables' columns (Tool and tools) to automatically fill in the Unit Price column.
I want this result:
| Tool  | Unit Price |
|:------|:----------:|
| Axe   | 5,9        |
| Axe   | 5,9        |
| Hoe   | 9,1        |
| Drill | 7,8        |
| Hoe   | 9,1        |
| Hoe   | 9,1        |
| Drill | 7,8        |
I tried to use VLOOKUP(A2; E2:F4; 2; FALSE), but it doesn't work.
I think you want to use a lookup function in the Unit Price cells. I'd suggest making both ranges proper tables. From the image they just look like loose cells, but with tables you can use structured references, which make the formulas cleaner and easier to maintain.
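For example, a sketch assuming the data range has been formatted as a table (with a Tool column) and the price list as a table named Prices — both names are hypothetical:
=VLOOKUP([@Tool]; Prices; 2; FALSE)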
Try:
=VLOOKUP(A2; $E$2:$F$4; 2; FALSE)
This will fix the position of the lookup array.
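If your version of Excel has XLOOKUP, that also avoids counting columns; a sketch assuming the same layout as above:
=XLOOKUP(A2; $E$2:$E$4; $F$2:$F$4)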
I would like to take my dictionary which contains keywords and check a column in a pyspark df to see if that keyword exists and if so then return the value from the dictionary in a new column.
The problem looks like this;
myDict = {
'price': 'Pricing Issue',
'support': 'Support Issue',
'android': 'Left for Competitor'
}
df = sc.parallelize([('1','Needed better Support'),('2','Better value from android'),('3','Price was to expensive'),('4','Support problems')]).toDF(['id','reason'])
+-----+-------------------------+
| id |reason |
+-----+-------------------------+
|1 |Needed better support |
|2 |Better value from android|
|3 | Price was to expensive |
|4 | Support problems |
+-----+-------------------------+
The end result that I am looking for is this:
+-----+-------------------------+---------------------+
| id |reason |new_reason |
+-----+-------------------------+---------------------+
|1 |Needed better support | Support Issue |
|2 |Better value from android| Left for Competitor |
|3 |Price was to expensive | Pricing Issue |
|4 |Support issue | Support Issue |
+-----+-------------------------+---------------------+
What's the best way to build an efficient function to do this in pyspark?
You can use when expressions to check whether the column reason matches the dict keys, and generate the chain of when expressions dynamically with Python's functools.reduce over myDict.keys():
from functools import reduce
from pyspark.sql import functions as F
df2 = df.withColumn(
    "new_reason",
    reduce(
        lambda c, k: c.when(F.lower(F.col("reason")).rlike(rf"\b{k.lower()}\b"), myDict[k]),
        myDict.keys(),
        F
    )
)
df2.show(truncate=False)
#+---+-------------------------+-------------------+
#|id |reason |new_reason |
#+---+-------------------------+-------------------+
#|1 |Needed better Support |Support Issue |
#|2 |Better value from android|Left for Competitor|
#|3 |Price was to expensive |Pricing Issue |
#|4 |Support problems |Support Issue |
#+---+-------------------------+-------------------+
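If you would rather have a fallback label than null for a reason that matches none of the keywords, the chained when expression can be finished with otherwise; a sketch reusing the same reduce call (the "Other" label is just an example):
from functools import reduce
from pyspark.sql import functions as F

df2 = df.withColumn(
    "new_reason",
    reduce(
        lambda c, k: c.when(F.lower(F.col("reason")).rlike(rf"\b{k.lower()}\b"), myDict[k]),
        myDict.keys(),
        F
    ).otherwise("Other")  # rows matching no keyword get this fallback label instead of null
)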
You can create a keywords dataframe, and join to the original dataframe using an rlike condition. I added \\\\b before and after the keywords so that only words between word boundaries will be matched, and there won't be partial word matches (e.g. "pineapple" matching "apple").
import pyspark.sql.functions as F
keywords = spark.createDataFrame([[k,v] for (k,v) in myDict.items()]).toDF('key', 'new_reason')
result = df.join(
    keywords,
    F.expr("lower(reason) rlike '\\\\b' || lower(key) || '\\\\b'"),
    'left'
).drop('key')
result.show(truncate=False)
+---+-------------------------+-------------------+
|id |reason |new_reason |
+---+-------------------------+-------------------+
|1 |Needed better Support |Support Issue |
|2 |Better value from android|Left for Competitor|
|3 |Price was to expensive |Pricing Issue |
|4 |Support problems |Support Issue |
+---+-------------------------+-------------------+
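Since the keywords DataFrame is tiny, it may also be worth broadcasting it so the non-equi rlike join does not shuffle the large side; a sketch with the same names as above:
result = df.join(
    F.broadcast(keywords),  # hint that the small keywords table should be replicated to the executors
    F.expr("lower(reason) rlike '\\\\b' || lower(key) || '\\\\b'"),
    'left'
).drop('key')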
I tried to do a group by in Spark SQL. It works, but most of the rows seem to have gone missing.
spark.sql(
"""
| SELECT
| website_session_id,
| MIN(website_pageview_id) as min_pv_id
|
| FROM website_pageviews
| GROUP BY website_session_id
| ORDER BY website_session_id
|
|
|""".stripMargin).show(10,truncate = false)
I am getting output like this:
+------------------+---------+
|website_session_id|min_pv_id|
+------------------+---------+
|1 |1 |
|10 |15 |
|100 |168 |
|1000 |1910 |
|10000 |20022 |
|100000 |227964 |
|100001 |227966 |
|100002 |227967 |
|100003 |227970 |
|100004 |227973 |
+------------------+---------+
The same query in MySQL gives the desired result, sorted numerically.
What is the best way to do this so that all rows are fetched in my query?
Please note I already checked other answers related to this, like joining to get all rows, but I want to know if there is any other way by which we can get the result like we get in MySQL.
It looks like the result is ordered alphabetically (as strings), in which case 10 comes before 2.
You might want to check that the column's type is a number, not a string.
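If you want to keep website_session_id as a string in the output and only fix the sort order, casting inside the ORDER BY alone should work; a sketch against the same table:
// sort numerically while leaving the column itself as a string
spark.sql(
  """
    | SELECT
    |   website_session_id,
    |   MIN(website_pageview_id) AS min_pv_id
    | FROM website_pageviews
    | GROUP BY website_session_id
    | ORDER BY CAST(website_session_id AS int)
    |""".stripMargin).show(10, truncate = false)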
What datatypes do the columns have (printSchema())?
I think website_session_id is of string type. Cast it to an integer type and see what you get:
spark.sql(
"""
| SELECT
| CAST(website_session_id AS int) as website_session_id,
| MIN(website_pageview_id) as min_pv_id
|
| FROM website_pageviews
| GROUP BY website_session_id
| ORDER BY website_session_id
|
|
|""".stripMargin).show(10,truncate = false)
I wrote a nested IF formula based on columns A, B and C, placed it in D2 and dragged it down to D15, but it stops working when it hits the sports section and returns FALSE for the remaining values.
=IF(A2="fruit","fruit",
IF(A2="instrument","instrument",
IF(A2="colours",
IF(B2="red","red",
IF(B2="orange","orange",
IF(A2="sports",
IF(B2="soccer","soccer",
IF(B2="basketball","basketball",
IF(A2="fighting",
IF(B2="taekwando","taekwando",
IF(B2="boxing","boxing",
IF(B2="blood",
IF(C2="mma","mma",
IF(C2="ufc","ufc",
IF(A2="planets",
IF(B2="earth","earth",
IF(B2="dwarf",
IF(C2="pluto","pluto",
IF(A2="cars",
IF(B2="ford","ford",
IF(B2="toyota","toyota")))))))))))))))))))))
Not sure where I am going wrong with the nested IFs, but the formula fails when it hits the sports group and produces FALSE for the remaining values in column D.
______A_____ _____B_____ ______C_____ _____D______
1|Product |Category |Sub-category|Result |
2|fruit |fruit | |fruit |
3|instrument |instrument | |instrument |
4|colours |red | |red |
5|colours |orange | |orange |
6|sports |soccer | |FALSE | <-- failure starts here
7|sports |basketball | |FALSE |
8|fighting |taekwando | |FALSE |
9|fighting |boxing | |FALSE |
10|fighting |blood |mma |FALSE |
11|fighting |blood |ufc |FALSE |
12|planets |earth | |FALSE |
13|planets |pluto |dwarf |FALSE |
14|cars |ford | |FALSE |
15|cars |toyota | |FALSE |
Would appreciate some help improving the formula so it returns the intended values.
If you are just trying to get the last non-empty (sub) category :
=IF(C2 > "", C2, B2)
I tried to reproduce the issue, and I think the parentheses are misplaced.
Try using IFS where you need multiple nested conditions, so you can reduce the complexity of the formula. Try the snippet below.
=IF(A7="fruit","fruit",
IF(A7="instrument","instrument",
IF(A7="colours",
IFS(B7="red","red",B7="orange","orange"),
IF(A7="sports",
IFS(B7="soccer","soccer",B7="basketball","basketball"),
IF(A7="cars",
IFS(B7="ford","ford",B7="toyota","toyota"))))))
Hope this helps.
Using the formula given by @Ashwin (thank you!), I extended it to include all categories. This seems to be working as expected:
=IF(A2="fruit","fruit",
IF(A2="instrument","instrument",
IF(A2="colours",
IFS(B2="red","red",B2="orange","orange"),
IF(A2="sports",
IFS(B2="soccer","soccer",B2="basketball","basketball"),
IF(A2="fighting",
IFS(B2="taekwando","taekwando",B2="boxing","boxing",B2="blood",IF(C2="mma","mma",IF(C2="ufc","ufc","not met"))),
IF(A2="planets",
IFS(B2="earth","earth",B2="dwarf",IF(C2="pluto","pluto","not met")),
IF(A2="cars",
IFS(B2="ford","ford",B2="toyota","toyota"))))))))
I have a DataFrame; assume my data is in the tabular format below.
|ID | Serial | Updated
-------------------------------------------------------
|10 |pers1 | |
|20 | | |
|30 |entity_1, entity_2, entity_3|entity_1, entity_3|
Now, using withColumn("Serial", explode(split("Serial", ","))), I have broken the column into multiple rows, as below. This was the first part of the requirement.
|ID | Serial | Updated
-------------------------------------------------------
|10 |pers1 | |
|20 | | |
|30 |entity_1 |entity_1, entity_3|
|30 |entity_2 |entity_1, entity_3|
|30 |entity_3 |entity_1, entity_3|
Now, for rows where there are no values, the result should be 0.
For a value present in the 'Serial' column, it should be searched for in the 'Updated' column: if it is present there, display 1, otherwise 2.
So in this case entity_1 and entity_3 should give 1, and entity_2 should give 2.
How do I achieve this?
AFAIK, there is no way to check if one column is contained within or is a substring of another column directly without using a udf.
However, if you wanted to avoid using a udf, one way is to explode the "Updated" column. Then you can check for equality between the "Serial" column and the exploded "Updated" column and apply your conditions (1 if match, 2 otherwise)- call this "contains".
Finally, you can then groupBy("ID", "Serial", "Updated") and select the minimum of the "contains" column.
For example, after the two calls to explode() and checking your condition, you will have a DataFrame like this:
df.withColumn("Serial", f.explode(f.split("Serial", ",")))\
.withColumn("updatedExploded", f.explode(f.split("Updated", ",")))\
.withColumn(
"contains",
f.when(
f.isnull("Serial") |
f.isnull("Updated") |
(f.col("Serial") == "") |
(f.col("Updated") == ""),
0
).when(
f.col("Serial") == f.col("updatedExploded"),
1
).otherwise(2)
)\
.show(truncate=False)
#+---+--------+-----------------+---------------+--------+
#|ID |Serial |Updated |updatedExploded|contains|
#+---+--------+-----------------+---------------+--------+
#|10 |pers1 | | |0 |
#|20 | | | |0 |
#|30 |entity_1|entity_1,entity_3|entity_1 |1 |
#|30 |entity_1|entity_1,entity_3|entity_3 |2 |
#|30 |entity_2|entity_1,entity_3|entity_1 |2 |
#|30 |entity_2|entity_1,entity_3|entity_3 |2 |
#|30 |entity_3|entity_1,entity_3|entity_1 |2 |
#|30 |entity_3|entity_1,entity_3|entity_3 |1 |
#+---+--------+-----------------+---------------+--------+
The "trick" of grouping by ("ID", "Serial", "Updated") and taking the minimum of "contains" works because:
If either "Serial" or "Updated" is null (or equal to empty string in this case), the value will be 0.
If at least one of the values in "Updated" matches with "Serial", one of the columns will have a 1.
If there are no matches, you will have only 2's
The final output:
df.withColumn("Serial", f.explode(f.split("Serial", ",")))\
.withColumn("updatedExploded", f.explode(f.split("Updated", ",")))\
.withColumn(
"contains",
f.when(
f.isnull("Serial") |
f.isnull("Updated") |
(f.col("Serial") == "") |
(f.col("Updated") == ""),
0
).when(
f.col("Serial") == f.col("updatedExploded"),
1
).otherwise(2)
)\
.groupBy("ID", "Serial", "Updated")\
.agg(f.min("contains").alias("contains"))\
.sort("ID")\
.show(truncate=False)
#+---+--------+-----------------+--------+
#|ID |Serial |Updated |contains|
#+---+--------+-----------------+--------+
#|10 |pers1 | |0 |
#|20 | | |0 |
#|30 |entity_3|entity_1,entity_3|1 |
#|30 |entity_2|entity_1,entity_3|2 |
#|30 |entity_1|entity_1,entity_3|1 |
#+---+--------+-----------------+--------+
I'm chaining calls to pyspark.sql.functions.when() to check the conditions. The first part checks to see if either column is null or equal to the empty string. I believe that you probably only need to check for null in your actual data, but I put in the check for empty string based on how you displayed your example DataFrame.
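If you prefer to avoid the second explode, an alternative (not the approach above) is to split "Updated" into an array and test membership with array_contains; a sketch using the same column names, assuming there is no space after the commas (otherwise the elements would need trimming first). Because there is no second explode, the groupBy/min step is not needed.
import pyspark.sql.functions as f

result = df.withColumn("Serial", f.explode(f.split("Serial", ",")))\
    .withColumn(
        "contains",
        f.when(
            # nulls or empty strings in either column -> 0
            f.isnull("Serial") |
            f.isnull("Updated") |
            (f.col("Serial") == "") |
            (f.col("Updated") == ""),
            0
        ).when(
            # Serial value found in the Updated list -> 1
            f.expr("array_contains(split(Updated, ','), Serial)"),
            1
        ).otherwise(2)
    )
result.show(truncate=False)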