I'm trying to do a conditional count on one column; here's my code:
spark.sql(
s"""
|SELECT
| date,
| sea,
| contract,
| project,
| COUNT(CASE WHEN type = 'ABC' THEN 1 ELSE 0 END) AS abc,
| COUNT(CASE WHEN type = 'DEF' THEN 1 ELSE 0 END) AS def,
| COUNT(CASE WHEN type = 'ABC' OR type = 'DEF' OR type = 'GHI' THEN 1 ELSE 0 END) AS all
|FROM someTable
|GROUP BY date, seat, contract, project
""".stripMargin).createOrReplaceTempView("something")
This throws up a weird error.
Diagnostic messages truncated, showing last 65536 chars out of 124764:
What am I doing wrong here?
Any help appreciated.
It seems you want the count of rows with type = 'ABC', type = 'DEF', etc. per group.
If that is the case, COUNT will not give you the desired results: COUNT counts every non-null value, and both 1 and 0 are non-null, so each alias ends up with the same total for a group.
Use SUM instead of COUNT: adding up the 0s and 1s gives you the correct conditional count.
If you still want to resolve the error you are getting, please paste the full error and, if possible, some of the data you are using to create the DataFrame.
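In the meantime, here is a minimal sketch of the SUM-based query, written with PySpark's spark.sql (the question uses Scala, but the SQL is the same). The table and column names simply mirror the question; note that the original SELECT lists sea while the GROUP BY lists seat, so this sketch assumes the column is actually named seat, and the all alias is renamed to avoid a possible reserved-word clash:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sketch only: table and column names are taken from the question.
# SUM adds up the 1s and 0s, so each alias becomes a true conditional count.
spark.sql("""
    SELECT
      date,
      seat,
      contract,
      project,
      SUM(CASE WHEN type = 'ABC' THEN 1 ELSE 0 END) AS abc,
      SUM(CASE WHEN type = 'DEF' THEN 1 ELSE 0 END) AS def,
      SUM(CASE WHEN type IN ('ABC', 'DEF', 'GHI') THEN 1 ELSE 0 END) AS all_types
    FROM someTable
    GROUP BY date, seat, contract, project
""").createOrReplaceTempView("something")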
Consider Microsoft's first example for Table.FromList. Invoking it in the trivial query
let
result = Table.FromList({"a", "b", "c", "d"}, null, {"Letters"})
in
result
results in a table that looks like
| Letters |
+---------+
| a |
| b |
| c |
| d |
Substituting numbers for letters results in the query
let
result = Table.FromList({1,2,3,4},null,{"Integers"})
in
result
which produces the error
Expression.Error: We cannot convert the value 1 to type Text.
Details:
Value=1
Type=Type
I expected the table
| Integers |
+----------+
| 1 |
| 2 |
| 3 |
| 4 |
How do I get the expected table?
What is happening that is causing this problem?
It's written in the function description:
By default, the list is assumed to be a list of text values that is split by commas.
If you convert the list to a table using the UI, you can see that the default splitter is replaced:
let
Source = {1,2,3,4},
#"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error)
in
#"Converted to Table"
So the solution to your problem is:
let
result = Table.FromList({1,2,3,4}, Splitter.SplitByNothing(), {"Integers"})
in
result
Because numbers and text are stored differently in computer memory.
You can't perform calculations with text, so 1 as a number has to be treated differently from "1" as a symbol. The solution is to put quotation marks around those numbers; the quotation marks tell the engine you mean the text symbols rather than the numeric values.
Try this:
let
result = Table.FromList({"1","2","3","4"},null,{"Integers"})
in
result
Try:
let
List = {1..4},
result = Table.FromList(List.Transform(List, each Text.From(_)), null, {"Numbers"})
in
result
I have a PySpark dataframe that looks like this
| Date       | Value | Shift_Index |
+------------+-------+-------------+
| 2021/02/11 | 50.12 | 0           |
| 2021/02/12 | 72.30 | 4           |
| 2021/02/15 | 81.87 | 1           |
| 2021/02/16 | 90.12 | 2           |
| 2021/02/17 | 91.31 | 1           |
| 2021/02/18 | 81.23 | 2           |
| 2021/02/19 | 73.45 | 1           |
| 2021/02/22 | 87.17 | 0           |
I want to use lead with an offset that depends on the value in the Shift_Index column (an integer column) for each row.
Can we somehow use an offset that depends on a column value in the lead/lag function in Spark SQL?
I wanted something like this, which works fine in SQL Server but unfortunately throws an exception in Spark SQL:
Create table test_table(ID int identity(1,1), Value float, shift_col int, New_Value float)
SELECT Value, shift_col,
ISNULL(LEAD(Value, shift_col) OVER(ORDER BY ID ASC), Value) AS New_Value
FROM test_table
The final result that I need looks something like this:
| Date       | Value | Shift_Index | New_Value |
+------------+-------+-------------+-----------+
| 2021/02/11 | 50.12 | 0           | 50.12     |
| 2021/02/12 | 72.30 | 4           | 81.23     |
| 2021/02/15 | 81.87 | 1           | 90.12     |
| 2021/02/16 | 90.12 | 2           | 81.23     |
| 2021/02/17 | 91.31 | 1           | 81.23     |
| 2021/02/18 | 81.23 | 2           | 87.17     |
| 2021/02/19 | 73.45 | 1           | 87.17     |
| 2021/02/22 | 87.17 | 0           | 87.17     |
The following exception is encountered:
Py4JJavaError: An error occurred while calling o77.sql.
: org.apache.spark.sql.AnalysisException: cannot resolve 'lead(sample_data_temp.shift_col, NULL)' due to data type mismatch: Offset expression 'shift_col#2835' must be a literal
Any help will be really appreciated.
Thanks in advance.
You can do this with a window and lead. If the values of Shift_index are very spread out, you could do a select distinct to determine which shifts you actually need instead of computing everything up to the maximum shift.
Ideally you have something to partition your window by; otherwise this can be very heavy for large datasets, and Spark warns about it:
WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
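If the data does have a natural grouping key, a partitioned window avoids pulling everything onto one executor. A minimal sketch, assuming a hypothetical group_id column that is not in the sample data:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

# 'group_id' is a hypothetical partitioning column; replace it with whatever
# key your data actually has. Rows are still ordered by Date within each group.
w = Window().partitionBy('group_id').orderBy(f.col('Date'))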
Edit: here is a solution without a join; it still has no partitioning, so it does not parallelize well.
from pyspark.sql import functions as f
from pyspark.sql.window import Window

# Without a partition column the window spans the whole dataset (hence the warning above).
w = Window().orderBy(f.col('Date'))

# Largest shift that has to be supported.
max_shift = df.agg(f.max('Shift_index')).collect()[0][0]

# Pre-compute lead(Value, 1) .. lead(Value, max_shift) as extra columns.
for shift in range(1, max_shift + 1):
    df = df.withColumn('Value' + str(shift), f.lead(f.col('Value'), shift).over(w))

# Pick the right pre-computed column based on each row's Shift_index.
case_shift = 'CASE Shift_index WHEN 0 THEN Value ' + ' '.join([f'WHEN {i} THEN Value{i}' for i in range(1, max_shift + 1)]) + ' ELSE NULL END'
df = df.select(
    f.col('Date'),
    f.col('Value'),
    f.col('Shift_index'),
    f.expr(case_shift).alias('New_Value')
)
df.show()
+----------+-----+-----------+---------+
| Date|Value|Shift_index|New_Value|
+----------+-----+-----------+---------+
|2021/02/11|50.12| 0| 50.12|
|2021/02/12| 72.3| 4| 81.23|
|2021/02/15|81.87| 1| 90.12|
|2021/02/16|90.12| 2| 81.23|
|2021/02/17|91.31| 1| 81.23|
|2021/02/18|81.23| 2| 87.17|
|2021/02/19|73.45| 1| 87.17|
|2021/02/22|87.17| 0| 87.17|
+----------+-----+-----------+---------+
I want to show the total number of requests and the total number of failing requests tracked by Application Insights.
When there are no failing requests in the table, the query returns an empty object via the API (in the portal it says: 'NO RESULTS FOUND 0 records matched').
I've tried setting up a variable initialized to 0 and giving it a new value in the join.
I also tried checking whether the joined value is null or empty and substituting 0 in that case.
Neither helped.
requests
| where timestamp > ago(1h)
| summarize totalCount=sum(itemCount) by timestamp
| join (
requests
| where success == false and timestamp > ago(1h)
| summarize totalFailCount =sum(itemCount) by timestamp
) on timestamp
| project timestamp, totalCount, totalFailCount
What I want as a result is that, if there are no failing requests, totalFailCount should display 0.
It seems that you do not need a join in this case. If you aggregate by timestamp you get one bucket per distinct value in that column; most people prefer to count by time buckets, for example one minute. Here is an example of that:
requests
| where timestamp > ago(1h)
| summarize totalCount=count(), totalFailCount = countif(success == false) by bin(timestamp, 1m)
I'm trying to add a column to an output table that calculates the percentage of total items:
Something like this:
ITEM | COUNT | PERCENTAGE
item 1 | 4 | 80
item 2 | 1 | 20
I can easily get a table with rows of ITEM and COUNT, but I can't figure out how to get the total (5 in this case) as a number so that I can calculate the percentage column.
someTable
| where name == "Some Name"
| summarize COUNT = count() by ITEM = tostring( customDimensions.["SomePar"])
| project ITEM, COUNT, PERCENTAGE = (C/?)*100
Any ideas? Thank you.
It's a bit messy to create a query like that.
I've done it based on the customEvents table in Application Insights, so take a look and see if you can adapt it to your specific situation.
You have to create a table that contains the total count of records and then join it. Since you can join only on a common column, you need a column that always has the same value; I chose appName for that.
So the whole query looks like:
let totalEvents = customEvents
// | where name contains "Opened form"
| summarize count() by appName
| project appName, count_ ;
customEvents
// | where name contains "Opened form"
| join kind=leftouter totalEvents on appName
| summarize count() by name, count_
| project name, totalCount = count_ , itemCount = count_1, percentage = (todouble(count_1) * 100 / todouble(count_))
If you need a filter you have to apply it to both tables.
This outputs name, totalCount, itemCount, and percentage for each event name.
It is not even necessary to do a join or to create a table containing your totals.
Just calculate your total and save it in a let, like so:
let totalEvents = toscalar(customEvents
| where timestamp > "someDate"
and name == "someEvent"
| summarize count());
Then you can simply add a column to the table where you need the percentage calculation by doing:
| extend total = totalEvents
This will add a new column to your table filled with the total you calculated.
After that you can calculate the percentages as described in the other two answers.
| extend percentages = todouble(count_)*100/todouble(total)
where count_ is the column created by the summarize count() that you presumably run before adding the percentages.
Hope this also helps someone.
I think the following is more intuitive. Just extend the set with a dummy property and join on that:
requests
| summarize count()
| extend a="b"
| join (
requests
| summarize count() by name
| extend a="b"
) on a
| project name, percentage = (todouble(count_1) * 100 / todouble(count_))
This might work too:
someTable
| summarize count() by item
| as T
| extend percent = 100.0*count_/toscalar(T | summarize sum(count_))
| sort by percent desc
| extend row_cumsum(percent)
Hi, I have a column family in a Cassandra DB, and when I check the contents of the table it is shown differently depending on how cqlsh is invoked. Using
./cqlsh -2
select * from table1;
KEY,31281881-1bef-447a-88cf-a227dae821d6 | A,0xaa| Cidr,10.10.12.0/24 | B,0xac | C,0x01 | Ip,10.10.12.1 | D,0x00000000 | E,0xace | F,0x00000000 | G,0x7375626e657431 | H,0x666230363 | I,0x00 | J,0x353839
While the output looks like this for
./cqlsh -3
select * from table1;
key | Cidr | Ip
--------------------------------------+---------------+------------
31281881-1bef-447a-88cf-a227dae821d6 | 10.10.12.0/24 | 10.10.12.1
These values are inserted by a running Java program.
Suppose I want to manually update the value of column "B", which is only visible when using the -2 option; it gives me an error that it is a hex value.
I am using these commands to update but always get an error:
cqlsh:sdnctl_db> update table1 SET B='0x7375626e657431' where key='31281881-1bef-447a-88cf-a227dae821d6';
Bad Request: cannot parse '0x7375626e657431' as hex bytes
cqlsh:sdnctl_db> update table1 SET B=0x7375626e657431 where key='31281881-1bef-447a-88cf-a227dae821d6';
Bad Request: line 1:30 no viable alternative at input 'x7375626e657431'
cqlsh:sdnctl_db> update table1 SET B=7375626e657431 where key='31281881-1bef-447a-88cf-a227dae821d6';
Bad Request: line 1:37 mismatched character '6' expecting '-'
I need to insert the hex value itself, which will be picked up by the application, but I am not able to insert it.
Kindly help me in correcting the syntax.
It depends on what data type column B has. Here is the reference for the CQL data types. The documentation says that blob data is represented as a hexadecimal string, so my assumption is that your column B is also a blob. From this other Cassandra question (and answer) you can see how to insert strings into blobs.
cqlsh:so> CREATE TABLE test (a blob, b int, PRIMARY KEY (b));
cqlsh:so> INSERT INTO test(a,b) VALUES (textAsBlob('a'), 0);
cqlsh:so> SELECT * FROM test;
b | a
---+------
0 | 0x61
cqlsh:so> UPDATE test SET a = textASBlob('b') WHERE b = 0;
cqlsh:so> SELECT * FROM test;
b | a
---+------
0 | 0x62
In your case, you could convert your hex string to a char string in code, and then use the textAsBlob function of cqlsh.
If the data type of column B is int instead, why don't you convert the hex number to int before executing the insert?
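For what it's worth, here is a small Python sketch of both conversions, using the hex string from the question; it only illustrates the decoding, and the actual update still goes through cqlsh or your driver:
# Decode the hex string into the text it represents,
# e.g. to use it with textAsBlob('subnet1') in an UPDATE.
text_value = bytes.fromhex("7375626e657431").decode("ascii")
print(text_value)  # subnet1

# If the column were an int instead, parse the hex digits as a number.
int_value = int("7375626e657431", 16)
print(int_value)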