How to embed equation queries in Elasticsearch - python-3.x

I have built my search engine using document embeddings in the ELK stack. I want to query using mathematical expressions and have them interpreted as numeric ranges.
For example, the query "Give me water pH 8 +- 1" should return every document with a pH value from 7 to 9.
Example 2: a query for "Density 1080" should search Density 1080 +- 10%, i.e. 1080 +- 108; likewise "pH 7" should search pH 7 +- 10%, i.e. 7 +- 0.7.
How can I do this?
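A minimal sketch of one way to approach this, assuming the documents carry a numeric ph field and using the elasticsearch Python client (the index name, field name, regex, and helper below are illustrative, not from the original setup): parse the value and tolerance out of the query text, fall back to +- 10% when no explicit tolerance is given, and issue a range query.
import re
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # assumed local cluster

def ph_range_query(query_text, default_tolerance=0.10):
    """Turn e.g. 'pH 8 +- 1' or 'pH 7' into an Elasticsearch range clause."""
    m = re.search(r"pH\s*([\d.]+)(?:\s*\+-\s*([\d.]+))?", query_text, re.I)
    if not m:
        return None
    value = float(m.group(1))
    # Explicit tolerance if given, otherwise +- 10% of the value.
    tol = float(m.group(2)) if m.group(2) else value * default_tolerance
    return {"range": {"ph": {"gte": value - tol, "lte": value + tol}}}

query = ph_range_query("Give me water pH 8 +- 1")   # -> ph between 7 and 9
hits = es.search(index="documents", query=query)    # 8.x-style keyword argument
The same pattern extends to other fields: for "Density 1080" you would match the field name, compute 1080 +- 108, and build the corresponding range clause on a density field.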

Spark web UI showing two stages for a job without any wide transformation

I am trying to understand Spark jobs and stages through the web UI. I ran a simple piece of code for a word count. The file has 44 lines in total.
strings = spark.read.text("word.txt")
filtered = strings.filter(strings.value.contains("The"))
filtered.count()
I see there is no wide transformation, and there is only 1 action. So, the application should have a single-stage job. However, I see there is a shuffle operation after the read in the web UI, and it shows a 2-stage job. I am not sure why that is. Can anyone please help me here?
Edit: adding the SQL plan.
== Parsed Logical Plan ==
Aggregate [count(1) AS count#5L]
+- AnalysisBarrier
   +- Filter Contains(value#0, The)
      +- Relation[value#0] text
== Analyzed Logical Plan ==
count: bigint
Aggregate [count(1) AS count#5L]
+- Filter Contains(value#0, The)
   +- Relation[value#0] text
== Optimized Logical Plan ==
Aggregate [count(1) AS count#5L]
+- Project
   +- Filter (isnotnull(value#0) && Contains(value#0, The))
      +- Relation[value#0] text
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count#5L])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#8L])
      +- *(1) Project
         +- *(1) Filter (isnotnull(value#0) && Contains(value#0, The))
            +- *(1) FileScan text [value#0] Batched: false, Format: Text, Location: InMemoryFileIndex[hdfs://dev/user/rk/word.txt], PartitionFilters: [], PushedFilters: [IsNotNull(value), StringContains(value,The)], ReadSchema: struct<value:string>

Physical Plan and Optimizing a Non-Equi Join in Spark SQL

I am using Spark SQL 2.4.0. I have a couple of tables as below:
CUST table:
id | name | age | join_dt
-------------------------
12 | John | 25 | 2019-01-05
34 | Pete | 29 | 2019-06-25
56 | Mike | 35 | 2020-01-31
78 | Alan | 30 | 2020-02-25
REF table:
eff_dt
------
2020-01-31
The requirement is to select all the records from CUST whose join_dt is <= eff_dt in the REF table. So, for this simple requirement, I put together the following query:
version#1:
select
c.id,
c.name,
c.age,
c.join_dt
from cust c
inner join ref r
on c.join_dt <= r.eff_dt;
Now, this creates a BroadcastNestedLoopJoin in the physical plan, and hence the query takes a long time to run.
Question 1:
Is there a better way to implement the same logic without inducing a BNLJ, so that the query executes faster? Is it possible to avoid the BNLJ?
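One workaround that can at least be sketched here, assuming the tables are registered as cust and ref: since REF holds a single eff_dt value, that value can be collected to the driver first and then applied as a plain filter on CUST, so no join, and therefore no BNLJ, is needed at all.
# Pull the single eff_dt value to the driver (REF has one row).
eff_dt = spark.sql("select max(eff_dt) as eff_dt from ref").first()["eff_dt"]

# Filter CUST directly; the physical plan is then a scan plus filter, with no join operator.
result = spark.sql(
    "select id, name, age, join_dt from cust "
    "where join_dt <= cast('{}' as date)".format(eff_dt)
)
result.explain()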
Part 2:
Now, I broke the query into 2 parts:
version#2:
select c.id, c.name, c.age, c.join_dt
from cust c
inner join ref r
on c.join_dt = r.eff_dt --equi join
union all
select c.id, c.name, c.age, c.join_dt
from cust c
inner join ref r
on c.join_dt < r.eff_dt; --theta join
Now, for the query in version#1, the physical plan shows that the CUST table is scanned only once, whereas the physical plan for version#2 indicates that the same input table CUST is scanned twice (once for each of the 2 queries combined with the union). However, I am surprised to find that version#2 executes faster than version#1.
Question 2:
How can version#2 execute faster than version#1, even though version#2 scans the table twice as opposed to once in version#1, and both versions induce a BNLJ?
Can anyone please clarify? Please let me know if additional information is required.
Thanks.

Generating Samples from Distribution

I am in the process of learning about statistics, and let's say I have the outcome probabilities from some experiment:
Outcome | Probability
--------+------------
1       | 0.34
2       | 0.10
3       | 0.05
4       | 0.13
5       | 0.13
6       | 0.25
I am interested in generating samples from this distribution using a uniform random number generator. Any suggestions?
This is a very standard problem with a very standard solution. Form an array where each entry contains not the probability of that index, but the sum of all probabilities up to that index. For your example problem, the array is p[1] = 0.34, p[2] = 0.44, p[3] = 0.49, etc. Use your uniform RNG to generate u between 0 and 1. Then find the index i such that p[i-1] < u < p[i]. For a very small array like this, you can use linear search, but for a large array you will want to use binary search. Notice that you can re-use the array to generate multiple deviates, so don't re-form it every time.
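A short Python sketch of that recipe, using the probabilities from the question (the function name is just illustrative):
import bisect
import random

# Probabilities for outcomes 1..6 from the question.
probs = [0.34, 0.10, 0.05, 0.13, 0.13, 0.25]

# Build the cumulative array once: cum[i] = p(1) + ... + p(i+1).
cum = []
running = 0.0
for p in probs:
    running += p
    cum.append(running)

def draw():
    """Return one outcome (1..6) from a single uniform deviate."""
    u = random.random()                 # uniform in [0, 1)
    i = bisect.bisect_left(cum, u)      # binary search: first i with cum[i] >= u
    return min(i, len(probs) - 1) + 1   # guard against float round-off at the top

print([draw() for _ in range(10)])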

Stata tabstat change order/sort?

I am using tabstat in Stata, with estpost and esttab to get its output into LaTeX. I use tabstat to display statistics by group. For example,
tabstat assets, by(industry) missing statistics(count mean sd p25 p50 p75)
The question I have is whether there is a way for tabstat (or other Stata commands) to display the output ordered by the value of the mean, so that the categories with higher means are on top. By default, Stata displays the groups in alphabetical order of industry when I use tabstat.
tabstat does not offer such a hook, but there is an approach to problems like this that is general and quite easy to understand.
You don't provide a reproducible example, so we need one:
. sysuse auto, clear
(1978 Automobile Data)
. gen Make = word(make, 1)
. tab Make if foreign
Make | Freq. Percent Cum.
------------+-----------------------------------
Audi | 2 9.09 9.09
BMW | 1 4.55 13.64
Datsun | 4 18.18 31.82
Fiat | 1 4.55 36.36
Honda | 2 9.09 45.45
Mazda | 1 4.55 50.00
Peugeot | 1 4.55 54.55
Renault | 1 4.55 59.09
Subaru | 1 4.55 63.64
Toyota | 3 13.64 77.27
VW | 4 18.18 95.45
Volvo | 1 4.55 100.00
------------+-----------------------------------
Total | 22 100.00
Make here is like your variable industry: it is a string variable, so in tables Stata will tend to show it in alphabetical (alphanumeric) order.
The work-around has several easy steps, some optional.
Calculate a variable on which you want to sort. egen is often useful here.
. egen mean_mpg = mean(mpg), by(Make)
Map those values to a variable with distinct integer values. As two groups could have the same mean (or other summary statistic), make sure you break ties on the original string variable.
. egen group = group(mean_mpg Make)
This variable is created to have value 1 for the group with the lowest mean (or other summary statistic), 2 for the next lowest, and so forth. If the opposite order is desired, as in this question, flip the grouping variable around.
. replace group = -group
(74 real changes made)
There is a problem with this new variable: the values of the original string variable, here Make, are nowhere to be seen. labmask (to be installed from the Stata Journal website after search labmask) is a helper here. We use the values of the original string variable as the value labels of the new variable. (The idea is that the value labels become the "mask" that the integer variable wears.)
. labmask group, values(Make)
Optionally, adjust the variable label of the new integer variable.
. label var group "Make"
Now we can tabulate using the categories of the new variable.
. tabstat mpg if foreign, s(mean) by(group) format(%2.1f)
Summary for variables: mpg
by categories of: group (Make)
group | mean
--------+----------
Subaru | 35.0
Mazda | 30.0
VW | 28.5
Honda | 26.5
Renault | 26.0
Datsun | 25.8
BMW | 25.0
Toyota | 22.3
Fiat | 21.0
Audi | 20.0
Volvo | 17.0
Peugeot | 14.0
--------+----------
Total | 24.8
-------------------
Note: other strategies are sometimes better or as good here.
If you collapse your data to a new dataset, you can then sort it as you please.
graph bar and graph dot are good at displaying summary statistics over groups, and the sort order can be tuned directly.
UPDATE 3 and 5 October 2021: A new helper command myaxis from SSC and the Stata Journal (see the associated Stata Journal paper) condenses the example here with tabstat:
* set up data example
sysuse auto, clear
gen Make = word(make, 1)
* sort order variable and tabulation
myaxis Make2 = Make, sort(mean mpg) descending
tabstat mpg if foreign, s(mean) by(Make2) format(%2.1f)
I would look at the egenmore package on SSC. You can get that package by typing ssc install egenmore in Stata. In particular, I would look at the entry for axis() in the help file of egenmore. That contains an example that does exactly what you want.

gnuplot - error bars and fitting for calculations from raw data

What does gnuplot do if I have a data.txt file like this:
#x y dx dy
1 2 0.2 0.1
3 5 0.1 0.3
where dx and dy are the errors associated with x and y (x +- dx, y +- dy), and I do this:
plot "data.txt" using (1/$1):($2*5):3:4 with xyerrorbars
Will gnuplot do this for x and y:
(1/x) +- dx
5y +- dy
or this:
1/(x +- dx)
5(y +- dy)
or this:
1/x +- 1/dx
or this, the Gaussian error propagation, which would be the correct one, obtained from the square root of the sum of (partial derivative times error) squared:
1/x +- dx/x^2
And how do I fit in this case?
Gnuplot will certainly do the first option (1/x) +- dx. As for the fitting, you'll have to be a little more specific ...
