A value in my application logs changed a few weeks ago and now when I query the logs, I receive two different values in my count. I'm using Azure Logs for graphing so this is rather painful.
How can I query one row value as the other (or query the two row values together)?
For example
I'm counting the fruit column values ('PinkLadyApple', 'Orange', 'Banana'),
The PinkLadyApple is now a JazzApple and when I count the results for 90 days I receive
Name | Sum
PinkLadyApple | 100
JazzApple | 20
Orange | 150
Banana | 80
I'm looking for a solution that either renames the value in the response to just Apple or combines the two values under JazzApple.
I will then render the results in a graph.
My query looks like
fruits
| where timestamp > ago(90d)
| where fruitname in ('PinkLadyApple', 'Orange', 'Banana')
| summarize FruitCount = sum(OtherColumn) by fruitname
You can change the old value to the new one, using the iff() function, like this:
fruits
| where timestamp > ago(90d)
| where fruitname in ('JazzApple', 'PinkLadyApple', 'Orange', 'Banana')
| extend fruitname = iff(fruitname == 'PinkLadyApple', 'JazzApple', fruitname)
| summarize FruitCount = sum(OtherColumn) by fruitname
Note that for best performance, I first filter on both the old and the new value (which uses indexes), and only then change the old value to the new one, just before the summarize operator.
And if you have multiple renamed values, then you can use the case() function instead, like this:
fruits
| where timestamp > ago(90d)
| where fruitname in ('JazzApple', 'PinkLadyApple', 'Orange', 'Banana', 'OldValue2', 'OldValue3')
| extend fruitname = case(
fruitname == 'PinkLadyApple', 'JazzApple',
fruitname == 'OldValue2', 'NewValue2',
fruitname == 'OldValue3', 'NewValue3',
fruitname)
| summarize FruitCount = sum(OtherColumn) by fruitname
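The normalize-then-aggregate idea can also be checked on a small sample outside Azure. The following is a minimal Python sketch, not KQL: the `RENAMES` mapping and the sample rows are assumptions made up for illustration, with the numbers taken from the counts in the question, and it combines the two apple values under JazzApple as the question asks.

```python
from collections import defaultdict

# Hypothetical mapping from old values to their current names.
RENAMES = {"PinkLadyApple": "JazzApple"}


def summarize(rows):
    """Sum the numeric column per fruit name after applying RENAMES."""
    totals = defaultdict(int)
    for fruitname, other_column in rows:
        # Normalize first, so old and new names land in the same bucket.
        totals[RENAMES.get(fruitname, fruitname)] += other_column
    return dict(totals)


# Illustrative rows shaped like (fruitname, OtherColumn).
rows = [("PinkLadyApple", 100), ("JazzApple", 20), ("Orange", 150), ("Banana", 80)]
result = summarize(rows)
print(result)
```

Adding another renamed value only means adding another entry to the mapping, which mirrors adding a branch to case().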
I am new to Spark and I have a specific question about how to use spark to address my problem, which may be simple.
Problem:
I have a model, which predicts the sales of products. Each product also belongs to a category like shoes, clothes etc. And we also have actual sales data. So the data look like this:
+------------+----------+-----------------+--------------+
| product_id | category | predicted_sales | actual_sales |
+------------+----------+-----------------+--------------+
| pid1 | shoes | 100.0 | 123 |
| pid2 | hat | 232 | 332 |
| pid3 | hat | 202 | 432 |
+------------+----------+-----------------+--------------+
What I'd like to do is calculate, for each category, the number (or percentage) of products in the intersection between the top 5% of products ranked by actual_sales and the top 5% of products ranked by predicted_sales.
Doing this for all products together, rather than per category, would be easy, something like below:
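import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc

def getIntersectionRatio(df: DataFrame, per: Int): Double = {
  // Number of rows that make up the top `per` percent
  val limit_num = (df.count() * per / 100.0).toInt
  // Sort descending so that limit() keeps the highest values, then intersect on product_id
  val intersection = df.orderBy(desc("actual_sales")).limit(limit_num)
    .join(df.orderBy(desc("predicted_sales")).limit(limit_num), Seq("product_id"), "inner")
  intersection.count() * 100.0 / limit_num
}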
However, I need to calculate the intersection for each category. The result will be something like this
+-----------+------------------------+
| Category | intersection_percentage|
+-----------+------------------------+
My ideas
User Defined Aggregation Function or Aggregators
I think I can achieve my goal if I use groupBy or groupByKey with a UDAF or an Aggregator, but that seems too inefficient: they take one row at a time, so I would have to store every row in the buffer inside the UDAF or Aggregator.
df.groupBy("category").agg(myUdaf)
class myUdaf extends UserDefinedAggregateFunction {
  // Save all the rows to an ArrayBuffer,
  // then transform the buffer back to a DataFrame,
  // and then do the same thing as for all products in getIntersectionRatio defined previously
}
Self implemented partitioning
I can select the distinct categories and then use map to process each category, joining each element with df to get its partition:
df.select("category").distinct.map(myfun(df))
def myfun(df: DataFrame)(row: Row): Double = {
  val dfRow = row.toDF // not supported, but feasible with other APIs
  val group = df.join(broadcast(dfRow), Seq("category"), "inner")
  getIntersectionRatio(group, 5)
}
Do we have a better solution for this?
Thanks in advance!
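To make the per-category requirement concrete, here is a minimal sketch of the logic in plain Python rather than Spark. It groups the rows by category and then applies the same top-N intersection as getIntersectionRatio inside each group. The function name, the extra sample row pid4, and the choice to return None for categories with too few products are assumptions for illustration only. In Spark, the same grouping is usually expressed with a window partitioned by category and a ranking function such as row_number.

```python
from collections import defaultdict


def intersection_ratio_by_category(rows, per):
    """rows: iterable of (product_id, category, predicted_sales, actual_sales)."""
    by_category = defaultdict(list)
    for row in rows:
        by_category[row[1]].append(row)

    result = {}
    for category, items in by_category.items():
        # Size of the top `per` percent within this category.
        limit_num = int(len(items) * per / 100.0)
        if limit_num == 0:
            result[category] = None  # too few products to form a top set
            continue
        by_predicted = sorted(items, key=lambda r: r[2], reverse=True)
        by_actual = sorted(items, key=lambda r: r[3], reverse=True)
        top_predicted = {r[0] for r in by_predicted[:limit_num]}
        top_actual = {r[0] for r in by_actual[:limit_num]}
        result[category] = len(top_predicted & top_actual) * 100.0 / limit_num
    return result


rows = [
    ("pid1", "shoes", 100.0, 123),
    ("pid2", "hat", 232, 332),
    ("pid3", "hat", 202, 432),
    ("pid4", "shoes", 90.0, 50),  # hypothetical extra row
]
# 50% is used only so that this tiny sample yields non-empty top sets.
ratios = intersection_ratio_by_category(rows, 50)
print(ratios)
```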
I have a data set (Excel) with hundreds of entries. Most of the information sits in one string column. The pieces of information are separated by '_' and were typed in by humans, so it is not possible to rely on index positions.
To get a usable data set, I need to extract information from this column into another column.
The search pattern '*v*' alone is not enough, but combined with the condition that the first character has to be a digit it works.
I tried to get it to work with iterrows, iteritems, str.strip, str.extract and many more, but the best result I got was with a for-loop.
pattern = '_*v*_'
test = []
for i in df['col']:
    # Split the string into substrings
    i = i.split('_')
    for c in i:
        if c.find('x') == 1:
            if c[0].isdigit():
                # print(c)
                test.append(c)
        else:
            # To be able to fix a few rows manually
            test.append(0)
test = ['22v3', '33v55', '4v2']
#Input
+-----------+-----------+
| col | targetcol |
+-----------+-----------+
| as_22v3 | |
| 33v55_bdd | |
| Ave_4v2 | |
+-----------+-----------+
#Output
+-----------+-----------+
| col | targetcol |
+-----------+-----------+
| as_22v3 | 22v3 |
| 33v55_bdd | 33v55 |
| Ave_4v2 | 4v2 |
+-----------+-----------+
My code does work, but only for the first few rows. It stops after 36 values and I can't figure out why. There is no error message, apart from the expected one that the list cannot be assigned to a DataFrame column because it does not have the same length.
pandas.Series.str.extract should help:
>>> df['col'].str.extract(r'(\d+v+\d+)')
0
0 22v3
1 33v55
2 4v2
import pandas as pd

df = pd.DataFrame({
    'col': ['as_22v3', '33v55_bdd', 'Ave_4v2']
})
df['targetcol'] = df['col'].str.extract(r'(\d+v+\d+)')
EDIT
df = pd.DataFrame({
    'col': ['as_22v3', '33v55_bdd', 'Ave_4v2', '_22 v3', 'space 2,2v3', '2.v3',
            '2.111v999', 'asd.123v77', '1 v7', '123 v 8135']
})
pattern = r'(\d+(\,[0-9]+)?(\s+)?v\d+)'
df['result'] = df['col'].str.extract(pattern)[0]
col result
0 as_22v3 22v3
1 33v55_bdd 33v55
2 Ave_4v2 4v2
3 _22 v3 22 v3
4 space 2,2v3 2,2v3
5 2.v3 NaN
6 2.111v999 111v999
7 asd.123v77 123v77
8 1 v7 1 v7
9 123 v 8135 NaN
You say it stops after 36 values, and that you are processing an Excel file? One thing you could try is to save the data set as a .csv file and read it in with pd.read_csv. Excel files sometimes contain extra characters that are not easily visible.
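On the length mismatch: the loop in the question appends inside the inner loop over substrings, so the list does not necessarily end up with one entry per row. A minimal sketch that always yields exactly one value per row, using only the standard `re` module and the same core pattern as the first str.extract call, could look like this. The helper name `extract_token` and the fourth sample value are assumptions for illustration.

```python
import re

# Same core pattern as the first str.extract call: digits, 'v', digits.
TOKEN = re.compile(r"\d+v+\d+")


def extract_token(value):
    """Return the first matching token in value, or None if there is none."""
    match = TOKEN.search(str(value))
    return match.group(0) if match else None


# The last entry is a made-up value without a token.
col = ["as_22v3", "33v55_bdd", "Ave_4v2", "no_token_here"]
targetcol = [extract_token(v) for v in col]
print(targetcol)
```

Because every input value produces one output, the result can be assigned back to a column of the same length, and the None entries mark the rows that need a manual fix.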
I am not sure how to go about creating a custom field to count instances given a condition.
I have a field, ID, that exists in two formats:
A#####
B#####
I would like to create two columns (one for A and one for B) and count instances by month. Something like COUNTIF ID STARTS WITH A for the first column resulting in something like below. Right now I can only create a table with the total count.
+-------+------+------+
| Month | ID A | ID B |
+-------+------+------+
| Jan | 100 | 10 |
+-------+------+------+
| Feb | 130 | 13 |
+-------+------+------+
| Mar | 90 | 12 |
+-------+------+------+
Define ID A as...
CASE
WHEN ID LIKE 'A%' THEN 1
ELSE 0
END
...and set the Default aggregation property to Total.
Do the same for ID B.
Apologies if I misunderstood the requirement, but you may be able to turn the list into a crosstab using the section option on the toolbar; your measure value would be count(ID).
Try this:
Query 1 to count A, filtering by substring(ID,1,1) = 'A'
Query 2 to count B, filtering by substring(ID,1,1) = 'B'
Join Query 1 and Query 2 by Year/Month
List by Month with Count A and Count B
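The conditional-count idea behind these answers (emit 1 when the ID starts with a given letter, 0 otherwise, then total per month) can be sketched in plain Python. This is only an illustration of the logic, not report-tool syntax, and the function name and sample records are made up.

```python
def count_by_month_and_prefix(records, prefixes=("A", "B")):
    """records: iterable of (month, id). Returns {month: {prefix: count}}."""
    counts = {}
    for month, record_id in records:
        row = counts.setdefault(month, {p: 0 for p in prefixes})
        for p in prefixes:
            # Same idea as CASE WHEN ID LIKE 'A%' THEN 1 ELSE 0 END, then totalled.
            row[p] += 1 if record_id.startswith(p) else 0
    return counts


# Made-up sample records shaped like (month, ID).
records = [("Jan", "A00001"), ("Jan", "A00002"), ("Jan", "B00001"), ("Feb", "B00002")]
table = count_by_month_and_prefix(records)
print(table)
```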
I want to calculate each value's portion of its partition's total, with only two partitions (type == red and type != red).
ID | type | value
-----------------------------
1 | red | 10
2 | blue | 20
3 | yellow | 30
result should be :
ID | type | value | portion
-----------------------------
1 | red | 10 | 1
2 | blue | 20 | 0.4
3 | yellow | 30 | 0.6
The normal window function in Spark only supports partitionBy on a whole column, but I need "blue" and "yellow" to be treated together as the "non-red" type.
Any idea?
First add a column is_red to more easily differentiate between the two groups. Then you can groupBy this new column and get the sums for each of the two groups respectively.
To get the fraction (portion), simply divide each row's value by the correct sum, taking into account if the type is red or not. This part can be done using when and otherwise in Spark.
Below is the Scala code to do this. There is a sortBy since when using groupBy the order of results is not guaranteed. With the sort, sum1 below will contain the total sum for all non-red types while sum2 is the sum for red types.
// Keep the flagged DataFrame so that is_red is available in both steps
val flagged = df.withColumn("is_red", $"type" === lit("red"))
val sum1 :: sum2 :: _ = flagged
  .groupBy($"is_red")
  .agg(sum($"value"))
  .collect()
  .map(row => (row.getAs[Boolean](0), row.getAs[Long](1)))
  .toList
  .sortBy(_._1) // false (non-red) sorts before true (red)
  .map(_._2)
val df2 = flagged.withColumn("portion", when($"is_red", $"value"/lit(sum2)).otherwise($"value"/lit(sum1)))
The extra is_red column can be removed using drop.
Inspired by Shaido, I used an extra column is_red and the Spark window function, but I'm not sure which one performs better.
df.withColumn("is_red", when(col("type") === "red", "red").otherwise("not red"))
  .withColumn("portion", col("value") / sum("value").over(Window.partitionBy(col("is_red"))))
  .drop("is_red")
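Both answers compute the same thing: a total per derived group (red versus not red), then each value divided by its group's total. A minimal plain-Python sketch of that logic, using the sample rows from the question, is shown here to make the expected numbers easy to verify. The function name `add_portion` is an assumption for illustration.

```python
def add_portion(rows):
    """rows: list of (id, type, value). Returns rows with a portion appended."""
    # One running total per derived group: True for red, False for everything else.
    totals = {True: 0, False: 0}
    for _, type_, value in rows:
        totals[type_ == "red"] += value
    return [
        (id_, type_, value, value / totals[type_ == "red"])
        for id_, type_, value in rows
    ]


rows = [(1, "red", 10), (2, "blue", 20), (3, "yellow", 30)]
result = add_portion(rows)
for row in result:
    print(row)
```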
Googling it a bit, I found this to be an interesting question. I'd like to hear your thoughts.
Given my table:
USER | MAP | STARTDAY | ENDDAY
1 | A | 20110101 | 20110105
1 | B | 20110106 | 20110110
2 | A | 20110101 | 20110107
2 | B | 20110105 | 20110110
What I want is to fix user 2's case, where maps A and B overlap by a couple of days (from 20110105 until 20110107).
I wish I could query that table in a way that never returns overlapping ranges. My input data is flaky already, so I don't have to worry about how conflicts are resolved; I just want to be able to get a single value for any given date BETWEEN these dates.
Possible outputs for the query I'm trying to build would be like
USER | MAP | STARTDAY | ENDDAY
2 | B | 20110108 | 20110110 -- pushed overlapping days ahead..
2 | A | 20110101 | 20110104 -- shrunk overlapping range
It doesn't even matter if the algorithm causes "invalid ranges", e.g. Start = 20110105, End = 20110103; I'll just put null when I get to these cases.
What would you say? Is there any straightforward way to get this done?
Thanks!
f.
Analytic functions could help:
select userid, map
, case when prevend >= startday then prevend+1 else startday end newstart
, endday
from
( select userid, map, startday, endday
, lag(endday) over (partition by userid order by startday) prevend
from mytable
)
order by userid, startday
Gives:
USERID MAP NEWSTART ENDDAY
1 A 01/01/2011 01/05/2011
1 B 01/06/2011 01/10/2011
2 A 01/01/2011 01/07/2011
2 B 01/08/2011 01/10/2011
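The lag-based query can be mirrored in plain Python to check the logic on the sample data. This is a minimal sketch under the assumption that the day columns are real dates (so that adding one day is well defined); the function name `remove_overlaps` is made up for illustration.

```python
from datetime import date, timedelta


def remove_overlaps(rows):
    """rows: list of (userid, map, startday, endday) with date values."""
    fixed = []
    prev_end = {}  # previous row's endday per user, i.e. lag(endday)
    # Same ordering as: partition by userid order by startday
    for userid, map_, startday, endday in sorted(rows, key=lambda r: (r[0], r[2])):
        prevend = prev_end.get(userid)
        if prevend is not None and prevend >= startday:
            newstart = prevend + timedelta(days=1)  # push the start past the overlap
        else:
            newstart = startday
        fixed.append((userid, map_, newstart, endday))
        prev_end[userid] = endday
    return fixed


rows = [
    (1, "A", date(2011, 1, 1), date(2011, 1, 5)),
    (1, "B", date(2011, 1, 6), date(2011, 1, 10)),
    (2, "A", date(2011, 1, 1), date(2011, 1, 7)),
    (2, "B", date(2011, 1, 5), date(2011, 1, 10)),
]
fixed = remove_overlaps(rows)
for row in fixed:
    print(row)
```

As in the query, a range that is fully covered by the previous one comes out with a start after its end, which matches the "invalid ranges" the question is willing to null out.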