Query-time analytics with Solr faceting and pivoting on log file data

I am doing some analytics using Solr, specifically the faceting and pivot functionality, on a large set of log files. I have a large log file indexed in Solr along the lines of:
   Keyword   Visits  log_date_ISO
1  red        1,938    2013-01-01
2  blue         435    2013-02-01
3  green        318    2013-04-01
4  red blue     279    2013-01-01
I then run a query and facet by 'log_date_ISO' to get, per date, the count of keywords that contain the query term. Two questions:
(1) Is there a way to sum the visits per keyword for each date - because what I really want is to sum visits across keywords that contain the query:
-> e.g. if I ran query 'red' for the above - I would want date 2013-01-01 to have a count of 1938 + 279 = 2217 (i.e. the sum of the visits associated with the keywords that contain the query 'red') rather than '2' (i.e. the count of the keywords containing the query).
(2) Is there a way to normalise by monthly query volume?
-> e.g. if the query volume for '2013-01-01' was 10,000 then the normalised volume for the query 'red' would be 2217/10000 = 0.2217
LAST RESORT: If these are not possible, I will pre-process the log file using pandas/Python to group by date, then by keyword, and then normalise - but I was wondering whether it is possible in Solr.
Thanks in advance.

Here's one way (similar to Dan Allen's answer here):
In [11]: keywords = df.pop('Keyword').apply(lambda x: pd.Series(x.split())).stack()
In [12]: keywords.index = keywords.index.droplevel(-1)
In [13]: keywords.name = 'Keyword'
In [14]: df1 = df.join(keywords)
In [15]: df1
Out[15]:
   Visits log_date_ISO Keyword
1    1938   2013-01-01     red
2     435   2013-02-01    blue
3     318   2013-04-01   green
4     279   2013-01-01     red
4     279   2013-01-01    blue
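As an aside, on newer pandas (0.25+) the same split-and-stack reshaping can be done with str.split plus explode; a minimal sketch that builds the same df1, assuming df still has its Keyword column:

import pandas as pd

# Split each multi-word Keyword and give every word its own row,
# keeping the original index aligned with Visits / log_date_ISO.
df1 = df.assign(Keyword=df['Keyword'].str.split()).explode('Keyword')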
Then you can do the relevant groupby:
In [16]: df1.groupby(['log_date_ISO', 'Keyword']).sum()
Out[16]:
                     Visits
log_date_ISO Keyword
2013-01-01   blue       279
             red       2217
2013-02-01   blue       435
2013-04-01   green      318
To get the visits as a percentage (to avoid double-counts) I'd do a transform first:
df['VisitsPercentage'] = df.groupby('log_date_ISO')['Visits'].transform(lambda x: x / x.sum())
# follow the same steps as above
In [21]: df2 = df.join(keywords)
In [22]: df2
Out[22]:
   Visits log_date_ISO  VisitsPercentage Keyword
1    1938   2013-01-01          0.874154     red
2     435   2013-02-01          1.000000    blue
3     318   2013-04-01          1.000000   green
4     279   2013-01-01          0.125846     red
4     279   2013-01-01          0.125846    blue
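To tie this back to question (2): summing these per-row shares per (date, keyword) gives the normalised visits for each keyword within each date (here the per-date total in df stands in for the monthly query volume from the question). A short sketch using df2 from above:

# Normalised visits per (date, keyword): e.g. 'red' on 2013-01-01
# is (1938 + 279) / 2217 = 1.0 with this toy data.
normalised = df2.groupby(['log_date_ISO', 'Keyword'])['VisitsPercentage'].sum()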

One can use Solr to group records by one field and sum another field per group, using
(1) Facets/pivots (group data by a specified field)
(2) StatsComponent (calculates field statistics for a specified field, including the sum)
The call I made is below (the field names differ from those in the question: the 'Keyword' field is called 'q_string', 'Visits' is called 'q_visits' and 'log_date_ISO' is called 'q_date'):
http://localhost:8983/solr/select?q=neuron&stats=true&stats.field=q_visits&rows=1&indent=true&stats.facet=q_date
This provides basic statistics, including the sum, for the q_visits field broken down by date; the specific value I was interested in was the sum:
<double name="min">1.0</double>
<double name="max">435.0</double>
<long name="count">263</long>
<long name="missing">0</long>
<double name="sum">845.0</double>
<double name="sumOfSquares">192917.0</double>
<double name="mean">3.2129277566539924</double>
<double name="stddev">26.94368427501248</double>
The field for which the statistics are gathered is declared as type float in schema.xml (if it is declared as a string then the sum, stddev and mean will not be shown).
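For question (2) in the original post (normalising by monthly query volume), one option is to run the same stats call twice, once with the actual query and once with q=*:*, and divide the per-date sums client side. A minimal Python sketch, assuming the field names above and the classic StatsComponent JSON response layout (verify against your Solr version):

import requests

SOLR = "http://localhost:8983/solr/select"

def visits_by_date(query):
    """Return {date: sum of q_visits} for documents matching `query`."""
    params = {
        "q": query,
        "rows": 0,
        "wt": "json",
        "stats": "true",
        "stats.field": "q_visits",
        "stats.facet": "q_date",
    }
    resp = requests.get(SOLR, params=params).json()
    facets = resp["stats"]["stats_fields"]["q_visits"]["facets"]["q_date"]
    return {date: stats["sum"] for date, stats in facets.items()}

matched = visits_by_date("red")   # visits for keywords containing 'red'
totals = visits_by_date("*:*")    # total visits per date
normalised = {d: matched[d] / totals[d] for d in matched if totals.get(d)}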

Related

Pandas - number of occurrences of IDs from a column in one dataframe in several columns of a second dataframe

I'm new to python and pandas, and trying to "learn by doing."
I'm currently working with two football/soccer (depending on where you're from!) dataframes:
player_table has several columns, among others 'player_name' and 'player_id'
player_id player_name
0 223 Lionel Messi
1 157 Cristiano Ronaldo
2 962 Neymar
match_table also has several columns, among others 'home_player_1', '..._2', '..._3' and so on, as well as the corresponding 'away_player_1', '...2' , '..._3' and so on. The content of these columns is a player_id, such that you can tell which 22 (2x11) players participated in a given match through their respective unique IDs.
I'll just post a 2 vs. 2 example here, because that works just as well:
match_id home_player_1 home_player_2 away_player_1 away_player_2
0 321 223 852 729 853
1 322 223 858 157 159
2 323 680 742 223 412
What I would like to do now is add a new column, player_table['appearances'], which counts the number of times each player_id appears in the part of match_table bounded horizontally by (home_player_1, away_player_2) and vertically by (first match, last match).
Desired result:
player_id player_name appearances
0 223 Lionel Messi 3
1 157 Cristiano Ronaldo 1
2 962 Neymar 0
Coming from other programming languages I think my standard solution would be a nested for loop, but I understand that is frowned upon in python...
I have tried several solutions, but none really work. This at least seems to give the number of appearances as "home_player_1":
player_table['appearances'] = player_table['player_id'].map(match_table['home_player_1'].value_counts())
Is there a way to expand the map function to include several columns in a dataframe? Or do I have to stack the 22 columns on top of one another in a new dataframe, and then map? Or is map not the appropriate function?
Would really appreciate your support, thanks!
Philipp
Edit: added specific input and desired output as requested
What you could do is use .melt() on the match_table player columns (turning your wide table into a long table with a single column of player IDs), then do a .value_counts() on that one column, and finally join the result to player_table on the 'player_id' column.
import pandas as pd

player_table = pd.DataFrame({'player_id': [223, 157, 962],
                             'player_name': ['Lionel Messi', 'Cristiano Ronaldo', 'Neymar']})
match_table = pd.DataFrame({
    'match_id': [321, 322, 323],
    'home_player_1': [223, 223, 680],
    'home_player_2': [852, 858, 742],
    'away_player_1': [729, 157, 223],
    'away_player_2': [853, 159, 412]})

player_cols = [x for x in match_table.columns if 'player_' in x]

# Melt the wide player columns into one long column of player IDs,
# then count how many times each ID occurs.
df1 = (match_table[player_cols]
       .melt(var_name='columns', value_name='player_id')['player_id']
       .value_counts()
       .rename('appearances')
       .rename_axis('player_id')
       .reset_index())

appearances_df = (df1.merge(player_table, how='right', on='player_id')
                     [['player_id', 'player_name', 'appearances']]
                     .fillna(0))
Output:
print(appearances_df)
player_id player_name appearances
0 223 Lionel Messi 3.0
1 157 Cristiano Ronaldo 1.0
2 962 Neymar 0.0
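To answer the follow-up in the question directly: map works on a single Series, but you can first stack the player columns into one long Series and then map the counts back onto player_table. A small sketch using the frames above:

# Stack all home/away player columns into one Series of player IDs,
# count how often each ID occurs, then map those counts per player.
counts = match_table[player_cols].stack().value_counts()
player_table['appearances'] = (
    player_table['player_id'].map(counts).fillna(0).astype(int)
)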

Counting the number of elements in a column and grouping them

Hope you guys are doing well. I have taken up a small project in Python so I can learn how to code and do basic data analysis along the way. I need some help counting the number of elements present in a column of a DataFrame and grouping them.
Below is the DataFrame I am using:
dates open high low close volume % change
372 2010-01-05 15:28:00 5279.2 5280.25 5279.1 5279.5 131450
373 2010-01-05 15:29:00 5279.75 5279.95 5278.05 5279.0 181200
374 2010-01-05 15:30:00 5277.3 5279.0 5275.0 5276.45 240000
375 2010-01-06 09:16:00 5288.5 5289.5 5288.05 5288.45 32750 0.22837324337386275
376 2010-01-06 09:17:00 5288.15 5288.75 5285.05 5286.5 55004
377 2010-01-06 09:18:00 5286.3 5289.0 5286.3 5288.2 37650
I would like to create another DF with the count of entries in the % change column, grouped as: x<=0.5, 0.5<x<=1, 1<x<=1.5, 1.5<x<=2, 2<x<=2.5, or x>2.5.
Below would be the desired output
Group no.of instances
x<= 0.5 1
0.5<x<=1 0
1<x<=1.5 0
1.5<x<=2 0
2<x<=2.5 0
x>2.5 0
Looking forward to a reply ,
Fudgster
You could get the number of elements in each category by using the bins option of the pandas Series.value_counts() method. This returns a Series with the number of records that fall within each specified range.
Here is the code:
df["% change"].value_counts(bins=[0,0.5,1,1.5,2,2.5])

Is there a Python function to sum all of the columns of a particular row? If not, what would be the best way to go about this? [duplicate]

I'm going through the Khan Academy course on Statistics as a bit of a refresher from my college days, and as a way to get me up to speed on pandas & other scientific Python.
I've got a table that looks like this from Khan Academy:
| Undergraduate | Graduate | Total
-------------+---------------+----------+------
Straight A's | 240 | 60 | 300
-------------+---------------+----------+------
Not | 3,760 | 440 | 4,200
-------------+---------------+----------+------
Total | 4,000 | 500 | 4,500
I would like to recreate this table using pandas. Of course I could create a DataFrame using something like
"Graduate": {...},
"Undergraduate": {...},
"Total": {...},
But that seems like a naive approach that would both fall over quickly and just not really be extensible.
I've got the non-totals part of the table like this:
df = pd.DataFrame(
{
"Undergraduate": {"Straight A's": 240, "Not": 3_760},
"Graduate": {"Straight A's": 60, "Not": 440},
}
)
df
I've been looking and found a couple of promising things, like:
df['Total'] = df.sum(axis=1)
But I didn't find anything terribly elegant.
I did find the crosstab function that looks like it should do what I want, but it seems like in order to do that I'd have to create a dataframe consisting of 1/0 for all of these values, which seems silly because I've already got an aggregate.
I have found some approaches that seem to manually build a new totals row, but it seems like there should be a better way, something like:
totals(df, rows=True, columns=True)
or something.
Does this exist in pandas, or do I have to just cobble together my own approach?
Or in two steps, using the .sum() function as you suggested (which might be a bit more readable as well):
import pandas as pd
df = pd.DataFrame( {"Undergraduate": {"Straight A's": 240, "Not": 3_760},"Graduate": {"Straight A's": 60, "Not": 440},})
#Total sum per column:
df.loc['Total',:] = df.sum(axis=0)
#Total sum per row:
df.loc[:,'Total'] = df.sum(axis=1)
Output:
Graduate Undergraduate Total
Not 440 3760 4200
Straight A's 60 240 300
Total 500 4000 4500
append and assign
The point of this answer is to provide an in line and not an in place solution.
append
I use append to stack a Series or DataFrame vertically. It also creates a copy so that I can continue to chain.
assign
I use assign to add a column. However, the DataFrame I'm working on is in that in-between nether space, so I use a lambda in the assign argument, which tells Pandas to apply it to the calling DataFrame.
df.append(df.sum().rename('Total')).assign(Total=lambda d: d.sum(1))
Graduate Undergraduate Total
Not 440 3760 4200
Straight A's 60 240 300
Total 500 4000 4500
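One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same in-line chain can be written with pd.concat. A sketch, assuming the same df as above:

import pandas as pd

# Stack a one-row 'Total' frame under df, then add the row-wise Total column.
out = (pd.concat([df, df.sum().rename('Total').to_frame().T])
         .assign(Total=lambda d: d.sum(axis=1)))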
Fun alternative
Uses drop with errors='ignore' to get rid of potentially pre-existing Total rows and columns.
Also, still in line.
def tc(d):
    return d.assign(Total=d.drop('Total', errors='ignore', axis=1).sum(1))
df.pipe(tc).T.pipe(tc).T
Graduate Undergraduate Total
Not 440 3760 4200
Straight A's 60 240 300
Total 500 4000 4500
Starting from the original data you can also use crosstab; based on your input, you just need a melt before the crosstab:
s=df.reset_index().melt('index')
pd.crosstab(index=s['index'],columns=s.variable,values=s.value,aggfunc='sum',margins=True)
Out[33]:
variable Graduate Undergraduate All
index
Not 440 3760 4200
Straight A's 60 240 300
All 500 4000 4500
Toy data
df=pd.DataFrame({'c1':[1,2,2,3,4],'c2':[2,2,3,3,3],'c3':[1,2,3,4,5]})
# before `agg`, I think your input is the result after `groupby`
df
Out[37]:
c1 c2 c3
0 1 2 1
1 2 2 2
2 2 3 3
3 3 3 4
4 4 3 5
pd.crosstab(df.c1, df.c2, df.c3, aggfunc='sum', margins=True)
Out[38]:
c2 2 3 All
c1
1 1.0 NaN 1
2 2.0 3.0 5
3 NaN 4.0 4
4 NaN 5.0 5
All 3.0 12.0 15
The original data is:
>>> df = pd.DataFrame(dict(Undergraduate=[240, 3760], Graduate=[60, 440]), index=["Straight A's", "Not"])
>>> df
Out:
Graduate Undergraduate
Straight A's 60 240
Not 440 3760
You can simply use df.T to recreate this table:
>>> df_new = df.T
>>> df_new
Out:
Straight A's Not
Graduate 60 440
Undergraduate 240 3760
Then compute the totals by row and by column:
>>> df_new.loc['Total',:]= df_new.sum(axis=0)
>>> df_new.loc[:,'Total'] = df_new.sum(axis=1)
>>> df_new
Out:
Straight A's Not Total
Graduate 60.0 440.0 500.0
Undergraduate 240.0 3760.0 4000.0
Total 300.0 4200.0 4500.0

Issue with the Normalizer transformation in Informatica PowerCenter

I am trying to normalize records from my source table using the Normalizer transformation in Informatica, but the sequence is not regenerating for different rows.
Below is the source table:
Store_Name Sales_Quarter1 Sales_Quarter2 Sales_Quarter3 Sales_Quarter4
DELHI 150 240 455 100
MUMBAI 100 500 350 340
Target table:
Store_name
Sales
Quarter
I am using Occurrence = 4 on the Sales column to get GCID Sales.
For Quarter, I am using the GCID Sales column.
Output:
STORE_NAME SALES_COLUMN QUARTER
Mumbai 100 1
Mumbai 500 2
Mumbai 350 3
Mumbai 340 4
Delhi 150 5
Delhi 240 6
Delhi 455 7
Delhi 100 8
Why is the Quarter value not restarting from 1 for Delhi, and instead continuing from 5?
There is a GK column that keeps sequential numbers across all rows, whereas GCID is the column that keeps the occurrence number within a row. So double-check that it is the GCID port, and not the GK port, that is linked to the QUARTER port in the target…
It would help to provide a screenshot of the mapping and of the Normalizer transformation (Normalizer tab) to make the question/issue clearer…
But I suppose you have the 'Store_Name' port at level 1 and the 'Sales_Quarter1', 'Sales_Quarter2', 'Sales_Quarter3' and 'Sales_Quarter4' ports grouped at level 2 on the Normalizer tab (using the >> button at the top left), with Occurrence set to 4 at the group level for these four ports.

Sorting and Grouping in Pandas data frame column alphabetically

I want to sort and group by a pandas data frame column alphabetically.
a b c
0 sales 2 NaN
1 purchase 130 230.0
2 purchase 10 20.0
3 sales 122 245.0
4 purchase 103 320.0
I want to sort column "a" such that it is in alphabetical order and is grouped as well i.e., the output is as follows:
a b c
1 purchase 130 230.0
2 10 20.0
4 103 320.0
0 sales 2 NaN
3 122 245.0
How can I do this?
I think you should use the sort_values method of pandas:
result = dataframe.sort_values('a')
It will sort your dataframe by column 'a', and since identical values end up next to each other after sorting, the rows are effectively grouped as well. See ya!
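If you also want the display from the question, where each group label appears only on its first row, you can blank out the repeats after sorting; a cosmetic sketch (it turns column 'a' into strings for display only):

# Starting from the sorted `result` above, show each group label only once
# by blanking consecutive repeats.
result['a'] = result['a'].where(result['a'] != result['a'].shift(), '')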
