What are the map and reduce phases in search?

I want to use Hadoop to implement a simple search engine.
So I created an inverted index using the Hadoop streaming API and bash, which outputs a file like this:
ab (744 1) 1
abbrevi (122 1) 1
abil (51 1) (77 1) (738 1) 3
abl (99 1) (132 1) (536 1) (581 1) (695 1) (763 1) (908 1) (914 1) (986 1) (1114 2) 10
ablat (82 2) (274 2) (553 7) (587 1) (1065 3) (1096 2) (1097 7) (1098 3) (1099 4) (1100 4) (1101 3) (1226 3) (1241 3) (1279 1) 14
about (27 1) (32 1) (39 1) (46 1) (49 2) (56 1) (57 1) (69 2) (77 2) (81
2) (83 2) (113 1) (134 1) (139 2) (140 1) (155 1) (156 2) (162 1) (163 1) (165 2) (171 1) (174 1) (177 1) (193 5) (205 1) (206 3) (212 1) (216 3) (218 1)
(225 2) (249 3) (255 1) (257 1) (262 1) (266 3) (272 6) (273 1) (285 1) (292
2) (313 1) (315 2) (346 2) (368 1) (370 1) (371 1) (372 1) (373 1) (381 2) (391 1) (410 3) (420 1) (452 1) (456 4) (469 1) (479 1) (489 1) (498 3) (511 1)
(518 1) (531 1) (536 1) (548 1) (555 1) (556 1) (560 2) (565 1) (567 1) (572
1) (575 1) (577 1) (589 1) (601 1) (603 1) (610 1) (612 1) (614 1) (620 1) (621 4) (625 3) (626 1) (646 1) (649 1) (651 2) (657 2) (662 1) (679 1) (685 2)
(686 1) (704 2) (706 2) (709 1) (717 2) (721 1) (740 2) (757 2) (759 1) (774
1) (786 1) (792 2) (793 1) (794 2) (796 2) (801 2) (805 1) (806 1) (807 2) (808 2) (811 1) (815 1) (816 1) (829 2) (844 1) (869 1) (876 1) (912 1) (917 1)
(921 1) (927 1) (928 2) (958 1) (976 6) (991 1) (992 2) (993 1) (994 1) (996
1) (999 1) (1000 1) (1002 1) (1004 2) (1006 1) (1040 1) (1092 1) (1095 2) (1104 4) (1105 1) (1115 1) (1143 4) (1156 2) (1162 1) (1164 3) (1165 1) (1166 3) (1169 1) (1191 1)
(1194 1) (1202 1) (1209 1) (1212 1) (1218 1) (1223 1) (1224 1) (1229 1) (1230 1) (1231
1) (1239 1) (1241 1) (1244 1) (1246 1) (1248 1) (1255 2) (1262 1) (1275 2) (1282 1) (1303 1) (1304 1) (1307 1) (1310 3) (1316 1) (1335 1) (1341 1) (1344 1) (1345 1) (1353 1)
(1354 3) (1355 1) (1363 1) (1377 1) 178
For example, the word ab occurs only once, in document number 744.
Now I want to implement AND query searching (meaning a matching document must contain all the words in the query) using the Hadoop streaming API.
So what exactly are the map and reduce phases in search? Can you also give me some hints on how to implement it with the streaming API (for example, what should the input be)? I don't have any idea where to start.
Thanks

Here is my take on your query search problem. I'm just giving you a rough overview of what should be done rather than the code (my bash skills are a bit rusty anyway).
Job Setup
First you will need to tokenize the query and put the tokens into a config value as a comma-separated list. You can do this on the mapper/reducer side if you like, but I would recommend centralizing this part in the job setup.
Mapper
Read the query tokens from the config value and put them in a "set" or some other structure that has fast key lookups.
The mapper should process each line (a word mapped to n documents) and, if the word on this line is in your query set, "emit" it. This stage should emit the document-id as the key with the word as the value (this creates "n" output records, where "n" is the number of documents for that word).
Reducer
The reducer then receives a document-id as key and the tokens that matched your query as values. Read the config value again and check whether this document matched all the tokens from your query.
You should emit the document-id as key; usually in search you output some "match-score" as the value. In your case you only search for "full" matches, so this score doesn't matter, as it will be a constant.
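The mapper and reducer steps above can be sketched in Python, which Hadoop streaming can run as scripts that read stdin and write tab-separated key/value lines. This is only a minimal local simulation under assumptions: the function names map_line and reduce_pairs are made up for illustration, the index lines are shortened from the sample in the question, and in a real job the query terms would come from the job configuration (for instance an environment variable passed with the streaming -cmdenv option) rather than from a hard-coded set.

```python
import re
from itertools import groupby

# Matches one posting of the form "(doc_id term_count)".
POSTING = re.compile(r"\((\d+) (\d+)\)")


def map_line(line, query_terms):
    """Mapper: emit (doc_id, term) for every posting of a term that is in the query."""
    parts = line.split(None, 1)
    if len(parts) < 2 or parts[0] not in query_terms:
        return []
    term = parts[0]
    return [(doc_id, term) for doc_id, _count in POSTING.findall(parts[1])]


def reduce_pairs(sorted_pairs, query_terms):
    """Reducer: keep only the documents that matched every query term."""
    wanted = set(query_terms)
    matches = []
    for doc_id, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        if {term for _doc, term in group} == wanted:
            matches.append(doc_id)
    return matches


# Shortened index lines (assumption: same layout as the file in the question).
index_lines = [
    "ab (744 1) 1",
    "abil (51 1) (77 1) (738 1) 3",
    "abl (99 1) (132 1) (536 1) 3",
    "about (27 1) (77 2) (536 1) 3",
]
query = {"abil", "about"}

# sorted() stands in for the shuffle/sort that Hadoop does between map and reduce.
pairs = sorted(pair for line in index_lines for pair in map_line(line, query))
result = reduce_pairs(pairs, query)
print(result)
```

In an actual streaming job, map_line would be applied to each line of sys.stdin and would print the document id and the term separated by a tab; the reducer would read those sorted lines back from its own stdin.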
Some improvements
Think of some improvements after doing this. In this case the mapper emits all the tokens: do you really need them as separate records? Maybe you can use a combiner to save some network bandwidth?
I leave those as an exercise for the reader ;-)

Related

Redis 4.0.10: why do we have 0 keys but 9GB dataset size?

We use ElastiCache from AWS with redis_version:4.0.10.
We see the stats below, which do not seem to make sense (0 keys, 9 GB consumed).
Redis is indeed full and does not accept writes.
hostname.amazonaws.com:6379> memory stats
1) "peak.allocated"
2) (integer) 9562806680
3) "total.allocated"
4) (integer) 9168470408
5) "startup.allocated"
6) (integer) 4197000
7) "replication.backlog"
8) (integer) 1048576
9) "clients.slaves"
10) (integer) 33940
11) "clients.normal"
12) (integer) 117622
13) "aof.buffer"
14) (integer) 0
15) "overhead.total"
16) (integer) 5397138
17) "keys.count"
18) (integer) 0
19) "keys.bytes-per-key"
20) (integer) 0
21) "dataset.bytes"
22) (integer) 9163073270
23) "dataset.percentage"
24) "99.986907958984375"
25) "peak.percentage"
26) "95.876358032226562"
27) "fragmentation"
28) "0.65151870250701904"
It turned out that a client running MONITOR was still seen as connected, although the client process no longer existed. Redis kept queuing commands to send to this client, and that queue held 6 GB. I killed the client on Redis and the memory was freed instantly.

Retrieving messages from redis stream

I have a Node.js application that uses Redis streams (library 'ioredis') to pass information around. The problem is that when I add a message to a stream and try to retrieve it, I have to go down many levels of nested arrays:
const message = await redis.xreadgroup('GROUP', orderGroup, orderConsumer, 'COUNT', 1, 'STREAMS', orderStream, '>');
const messageId: string = message[0][1][0][0];
const pMsg: Obj = JSON.parse(JSON.parse(message[0][1][0][1][1]));
This is how I create the stream:
await redis.xgroup('CREATE', orderStream, orderGroup, '0', 'MKSTREAM')
.catch((err) => {
console.error(`Group already exists error: ${err}`);
})
Is this normal? In the Redis doc (https://redis.io/commands/xreadgroup) it shows that the return value is an array with the id of the message at position 0 and the fields at position 1. I feel like I'm missing something...
Here is an example output of XREADGROUP. As you can see, the values sit at nesting level 5.
127.0.0.1:6379> XREADGROUP Group g1 c1 COUNT 100 STREAMS s1 >
1) 1) "s1"
   2) 1) 1) "1608445334963-0"
         2) 1) "f1"
            2) "v1"
            3) "f2"
            4) "v2"
      2) 1) "1608445335464-0"
         2) 1) "f1"
            2) "v1"
            3) "f2"
            4) "v2"
      3) 1) "1608445335856-0"
         2) 1) "f1"
            2) "v1"
            3) "f2"
            4) "v2"
For more details see https://redis.io/commands/xread
It is normal and expected. XREADGROUP supports reading from multiple stream keys and multiple messages, and each message can have multiple field-value pairs.
Consider the following example:
> XGROUP CREATE mystream1 mygroup 0 MKSTREAM
OK
> XGROUP CREATE mystream2 mygroup 0 MKSTREAM
OK
> XADD mystream1 * field1 value1 field2 value2
"1608444656005-0"
> XADD mystream1 * field1 value3 field2 value4
"1608444660566-0"
> XADD mystream2 * field3 value5 field4 value6
"1608444665238-0"
> XADD mystream2 * field3 value7 field4 value8
"1608444670070-0"
> XREADGROUP GROUP mygroup yo COUNT 2 STREAMS mystream1 mystream2 > >
1) 1) "mystream1"
   2) 1) 1) "1608444656005-0"
         2) 1) "field1"
            2) "value1"
            3) "field2"
            4) "value2"
      2) 1) "1608444660566-0"
         2) 1) "field1"
            2) "value3"
            3) "field2"
            4) "value4"
2) 1) "mystream2"
   2) 1) 1) "1608444665238-0"
         2) 1) "field3"
            2) "value5"
            3) "field4"
            4) "value6"
      2) 1) "1608444670070-0"
         2) 1) "field3"
            2) "value7"
            3) "field4"
            4) "value8"
The structure you get has multiple nested arrays. Using 0-based indexes, as in Node, the levels are:
[index of the stream key]
[0: the key name or 1: an array for messages]
[index of the message]
[0: the message ID or 1: an array for fields & values]
[even for field name or odd for value]
In the code below, data[0][1] is the root-level array, i.e. the message list of the first stream in the reply (adjust this entry point for your own use).
Variables
rd: Return Data
el: Element
sel: Sub-element
rel: Relative-element
p: Relative-Object
c: Iterate-Counter for deciding if it is a key or value.
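// Wrapped in a function (the name parseMessages is illustrative) so that "return rd" is valid.
function parseMessages(data) {
  var rd = []
  for (var el of data[0][1]) {
    var sel = el[1]
    var p = {}
    var c = 0
    for (var rel of sel) {
      if (c % 2 == 0) {
        // Even index: rel is a field name and the next element is its value.
        p[rel] = sel[c + 1]
      }
      c++
    }
    rd.push(p)
  }
  return rd
}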
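For comparison, the same unpacking can be written so that it covers every stream in the reply rather than only data[0][1]. The sketch below is in Python and works on a plain nested list shaped like the XREADGROUP reply shown earlier; parse_xreadgroup is a hypothetical helper name, not part of any Redis client library.

```python
def parse_xreadgroup(reply):
    """Turn a raw XREADGROUP reply into {stream: [(message_id, {field: value})]}."""
    parsed = {}
    for stream_name, messages in reply:
        parsed[stream_name] = [
            # flat is [field, value, field, value, ...]; pair even and odd slots.
            (message_id, dict(zip(flat[0::2], flat[1::2])))
            for message_id, flat in messages
        ]
    return parsed


# Nested list with the same shape as the redis-cli output above (shortened).
reply = [
    ["mystream1", [
        ["1608444656005-0", ["field1", "value1", "field2", "value2"]],
        ["1608444660566-0", ["field1", "value3", "field2", "value4"]],
    ]],
    ["mystream2", [
        ["1608444665238-0", ["field3", "value5", "field4", "value6"]],
    ]],
]

parsed = parse_xreadgroup(reply)
print(parsed["mystream1"][0])
```

Keeping the message ID next to the field dictionary is a design choice made here because the ID is usually needed later to acknowledge the message.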

How to update an old vba code to work on excel?

I have this code from an old book that does not work in my Excel. It is supposed to calculate the slope and intercept of a Deming regression from a range of numbers, x and y
Function Deming(XValues, YValues)
Dim MeanX(), MeanY()
Ncells = XValues.Count
ReDim MeanX(Ncells / 2), MeanY(Ncells / 2)
For x = 2 To Ncells Step 2
MeanX(x / 2) = (XValues(x - 1) + XValues(x)) / 2
MeanY(x / 2) = (YValues(x - 1) + YValues(x)) / 2
SumX = SumX + MeanX(x / 2): SumY = SumY + MeanY(x / 2)
SumX2 = SumX2 + (MeanX(x / 2)) ^ 2
SumY2 = SumY2 + (MeanY(x / 2)) ^ 2
SumXY = SumXY + MeanX(x / 2) * MeanY(x / 2)
SumDeltaX2 = SumDeltaX2 + (XValues(x - 1) - XValues(x)) ^ 2
SumDeltaY2 = SumDeltaY2 + (YValues(x - 1) - YValues(x)) ^ 2
Next
XBar = SumX / N: YBar = SumY / N
Sx2 = (N * SumX2 - SumX ^ 2) / (N * (N - 1))
Sy2 = (N * SumY2 - SumY ^ 2) / (N * (N - 1))
Sdx2 = SumDeltaX2 / (2 * N)
Sdy2 = SumDeltaY2 / (2 * N)
rPearson = (N * SumXY - SumX * SumY) / Sqr((N * SumX2 - SumX ^ 2) * (N *
SumY2 - SumY ^ 2))
lambda = Sdx2 / Sdy2
U = (Sy2 - Sx2 / lambda) / (2 * rPearson * Sqr(Sx2) * Sqr(Sy2))
Slope = U + Sqr(U ^ 2 + 1 / lambda)
Intercept = YBar - Slope * XBar
Deming = Array(Slope, Intercept)
End Function
Does this have a bad syntax or not?
First, this is not old code; it is simply bad code.
As a rule of thumb, any VBA code that does not compile once Option Explicit is written at the top of the module/worksheet counts as bad syntax. In this case, if the code is pasted into the editor, the following line turns red:
rPearson = (N * SumXY - SumX * SumY) / Sqr((N * SumX2 - SumX ^ 2) * (N *
SumY2 - SumY ^ 2))
This is because the statement should be on 1 line, not 2 (or the first line must end with the line-continuation character, a space followed by an underscore).
So, concerning the question of how to update it: as a first step, make sure the code compiles with Option Explicit at the top. Write Option Explicit and then go to Debug > Compile in the VB Editor's menu. The VB Editor will highlight the problem. The first one is that Ncells is not declared.
Then find a way to declare it, e.g. write Dim Ncells As Variant, or any other type you consider useful, above the highlighted line. Just declaring a variable may not be enough, as there is a calculation XBar = SumX / N in the code. There, N must be declared and assigned a value. If it is only declared, it will be 0, and the division by 0 will raise an error. Thus, depending on the logic, something like this should probably be written: Dim N As Double: N = 1
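To make the role of N concrete, here is a hedged Python transcription of the arithmetic in the question's function. It rests on one assumption that the listing never states: N is taken to be the number of duplicate pairs (Ncells / 2), because the loop walks the values two at a time and the sums are built from pair means. The function name deming and the sample data are illustrative only.

```python
from math import sqrt


def deming(x_values, y_values):
    """Return (slope, intercept) from duplicate measurements stored pairwise."""
    n = len(x_values) // 2  # ASSUMPTION: N = number of duplicate pairs
    pairs_x = [(x_values[2 * i], x_values[2 * i + 1]) for i in range(n)]
    pairs_y = [(y_values[2 * i], y_values[2 * i + 1]) for i in range(n)]
    mean_x = [(a + b) / 2 for a, b in pairs_x]
    mean_y = [(a + b) / 2 for a, b in pairs_y]

    sum_x, sum_y = sum(mean_x), sum(mean_y)
    sum_x2 = sum(v * v for v in mean_x)
    sum_y2 = sum(v * v for v in mean_y)
    sum_xy = sum(a * b for a, b in zip(mean_x, mean_y))

    # Variances of the pair means and of the measurement errors.
    sx2 = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))
    sy2 = (n * sum_y2 - sum_y ** 2) / (n * (n - 1))
    sdx2 = sum((a - b) ** 2 for a, b in pairs_x) / (2 * n)
    sdy2 = sum((a - b) ** 2 for a, b in pairs_y) / (2 * n)

    r_pearson = (n * sum_xy - sum_x * sum_y) / sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    lam = sdx2 / sdy2
    u = (sy2 - sx2 / lam) / (2 * r_pearson * sqrt(sx2) * sqrt(sy2))
    slope = u + sqrt(u ** 2 + 1 / lam)
    intercept = sum_y / n - slope * (sum_x / n)
    return slope, intercept


# Made-up duplicates whose pair means lie on y = 2x + 1.
xs = [0.9, 1.1, 1.9, 2.1, 2.9, 3.1, 3.9, 4.1]
ys = [2.8, 3.2, 4.8, 5.2, 6.8, 7.2, 8.8, 9.2]
slope, intercept = deming(xs, ys)
print(round(slope, 6), round(intercept, 6))
```

Like the original, this divides by the error variance of y and by the correlation, so it fails when the duplicates are identical or the means are uncorrelated.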

Absolute Value of the Differences In Two (2) Ranges - PART II

Hopefully this is easier to read than yesterday. I am trying to find a way to vary the number of periods "N" that measure "VOLATILITY". The code for the complete function, as suggested yesterday, is below and fixes "N" at 10. The function works just fine for KAMA with the default value for "N" (N0Addr and N1Addr are not needed in this default version of the KAMA function, but are steps toward a variable "N").
This formula works in Excel:
=SUMPRODUCT((ABS(I26:I36-I25:I35)))
I can also obtain the correct sum of differences in the two ranges but not the absolute value. This VBA Code does that with the named ranges "N0Addr" and "N1Addr":
Rng0 = WorksheetFunction.Sum(Range(N0Addr)) - WorksheetFunction.Sum(Range(N1Addr))
Function nTEST(Price, nPer, mPer, N)
'Variables
Fast = 2 / (nPer + 1)
Slow = 2 / (mPer + 1)
'One(1) Prior Period Calculation
nTEST1 = Application.Caller.Offset(-1)
N0Addr = Application.WorksheetFunction.Concat(Price.Offset(-N, 0).Address & ":" & Price.Address)
N1Addr = Application.WorksheetFunction.Concat(Price.Offset(-(N + 1), 0).Address & ":" & (Price.Offset(-1, 0).Address))
'Change Formula (Y - Yn)
E = Abs(Price - Price.Offset(-N, 0))
'Volatility Formula { =SUM(ABS(Y:Yn)-(Y1:Yn1))) }
'VOLATILITY (N = 10)
'1-10
R = Abs(Price - Price.Offset(-1, 0)) + Abs(Price.Offset(-1, 0) - Price.Offset(-2, 0)) _
+ Abs(Price.Offset(-2, 0) - Price.Offset(-3, 0)) _
+ Abs(Price.Offset(-3, 0) - Price.Offset(-4, 0)) + Abs(Price.Offset(-4, 0) - Price.Offset(-5, 0)) _
+ Abs(Price.Offset(-5, 0) - Price.Offset(-6, 0)) + Abs(Price.Offset(-6, 0) - Price.Offset(-7, 0)) _
+ Abs(Price.Offset(-7, 0) - Price.Offset(-8, 0)) + Abs(Price.Offset(-8, 0) - Price.Offset(-9, 0)) _
+ Abs(Price.Offset(-9, 0) - Price.Offset(-10, 0))
'EFFICIENCY RATIO
ER = E / R
Smooth = (ER * (Fast - Slow) + Slow) ^ 2
'Formula Calculation
nKAMA = Smooth * Price + (1 - Smooth) * nKAMA1
End Function
I am looking for a VBA formula or method to enter a working formula for "volatility" that can vary over N periods. I can get the sum of differences but not the sum of absolute differences.
Rng0 = WorksheetFunction.Sum(Range(N0Addr)) - WorksheetFunction.Sum(Range(N1Addr))
I can also enter one formula in Excel that provides the sum of absolute differences.
=SUMPRODUCT((ABS(I26:I36-I25:I35)))
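The calculation being asked for is a loop over N consecutive absolute differences, which replaces the ten hard-coded Abs(...) terms. A minimal sketch of that logic in Python follows; the names volatility and efficiency_ratio and the price list are illustrative assumptions. In VBA the same loop would add Abs(Price.Offset(-i, 0) - Price.Offset(-(i + 1), 0)) for i from 0 to N - 1.

```python
def volatility(prices, n):
    """Sum of |price[i] - price[i-1]| over the last n periods."""
    if n < 1 or len(prices) < n + 1:
        raise ValueError("need at least n + 1 prices")
    window = prices[-(n + 1):]
    return sum(abs(cur - prev) for prev, cur in zip(window, window[1:]))


def efficiency_ratio(prices, n):
    """Net change over n periods divided by the volatility over the same periods."""
    change = abs(prices[-1] - prices[-1 - n])
    vol = volatility(prices, n)
    # Guard against a flat window, where the volatility is zero.
    return change / vol if vol else 0.0


prices = [10, 11, 9, 12, 12, 15]  # made-up prices, oldest first
print(volatility(prices, 3))
print(efficiency_ratio(prices, 5))
```

Taking the absolute value inside the loop is the point: subtracting two range sums, as in the Rng0 line, lets positive and negative moves cancel.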

VBA: Slow solver-loop

I have code that, in a simplified way, calculates blast furnace gases for 168 hours (e.g. a week).
It reads in some input values and chemical values and calculates the molar masses in the system. After that, the solver calculates in what chemical form (mainly CO, CO2) the gases exit.
The problem is that it's extremely slow. With only this Excel workbook open it takes minutes, and I haven't even bothered to let it run to the end when more workbooks are open.
I'm very new to VBA, but I'd expect it to be a bit faster if I could solve the equations in VBA instead of letting the values "iterate" back and forth between the Excel worksheet and the VBA solver, gradually solving the cell values. I just don't know how, or whether it is possible or a good idea.
The code in its entirety, first the general calculations:
Sub BFgas()
Datamatrix = Range(Cells.Find("Datamatrix").Offset(1, 0).Address, Cells.Find("Datamatrix").Offset(21, 0).Address)
ReDim BFoutput(1 To 168, 1 To 3) As Double
M_pigiron_Matrix = Range(Cells.Find("BF1 Pig iron production").Offset(1, 0).Address, Cells.Find("BF1 Pig iron production").Offset(168 + 1, 0).Address)
Bf_blast_Matrix = Range(Cells.Find("BF 1 - Blast").Offset(1, 0).Address, Cells.Find("BF 1 - Blast").Offset(168 + 1, 0).Address)
Bf_oxygen_Matrix = Range(Cells.Find("BF 1 - Oxygen").Offset(1, 0).Address, Cells.Find("BF 1 - Oxygen").Offset(168 + 1, 0).Address)
M = 1
Do
M_pigiron = M_pigiron_Matrix(M, 1) 'Tons of pig iron
Bf_blast = Bf_blast_Matrix(M, 1) 'Nm3
Bf_oxygen = Bf_oxygen_Matrix(M, 1) 'Nm3
If Bf_blast = 0 Or Bf_oxygen = 0 Then
Do
M_pigiron = M_pigiron_Matrix(M, 1) 'Tons of pig iron
Bf_blast = Bf_blast_Matrix(M, 1) 'Nm3
Bf_oxygen = Bf_oxygen_Matrix(M, 1) 'Nm3
M = M + 1
Loop While Bf_oxygen = 0 Or Bf_blast = 0
End If
n_N2_blast = Bf_blast * Datamatrix(19, 1) / Datamatrix(17, 1) 'kmol
n_O2_blast = Bf_blast * Datamatrix(18, 1) / Datamatrix(17, 1) 'kmol
n_O2_oxygenintake = Bf_oxygen / Datamatrix(17, 1) 'kmol
n_total_O_in = (n_O2_blast + n_O2_oxygenintake) * 2 'kmol
'Calculates the amounts of coke, briquettes and scrap
Cokeratio = Cells.Find("Input data").Offset(1, 1).Value2
Briqratio = Cells.Find("Input data").Offset(2, 1).Value2
Scrapratio = Cells.Find("Input data").Offset(3, 1).Value2
m_oil = Cells.Find("Input data").Offset(4, 1).Value2
m_coke = Cokeratio * M_pigiron * 1000 'kg
m_briq = Briqratio * M_pigiron 'kg
m_scrap = Scrapratio * M_pigiron 'kg
'Fe/Iron calculations
'Calculates the molar masses of iron and coal in pig iron, briqettes and scrap
n_Fe_pigiron = Datamatrix(3, 1) * M_pigiron * 1000 / Datamatrix(15, 1) 'kmol
n_Fe_briq = Datamatrix(12, 1) * m_briq / Datamatrix(15, 1) 'kmol
n_Fe_scrap = Datamatrix(13, 1) * m_scrap / Datamatrix(15, 1) 'kmol
'Calculates how many kmol is needed from pellets
n_Fe_pellets = n_Fe_pigiron - n_Fe_briq - n_Fe_scrap
m_pellets = n_Fe_pellets / Datamatrix(11, 1) * Datamatrix(15, 1) 'Divides by the iron content 0.72, to get the total mass
'O/Oxygen calculations
'Calculates the total incoming oxygen
'(m_pel*x_pellets,O + m_briq*x_O,briq)/M_O + n_blast,O2*2 + n_Oxygen,O2*2
Oxygen_in = m_pellets * Datamatrix(10, 1) / Datamatrix(16, 1) + m_briq * Datamatrix(9, 1) / Datamatrix(16, 1) + n_total_O_in 'kmol
Cells.Find("Solutions").Offset(0, 1).Value = Oxygen_in
'C/Coal calculations
'Calculates the incoming coal minus what comes out with the pig iron, leaving what comes out with the bf-gases
'm_coke,*x_C,coke + m_oil*x_C,oil + m_br*x_br,C = m_rj*x_C,rj + V_tg*(y,co + y,co2)
Coal_for_bf_gas = (m_coke * Datamatrix(4, 1) / Datamatrix(14, 1) + m_oil * Datamatrix(5, 1) / Datamatrix(14, 1) + m_briq * Datamatrix(6, 1) / Datamatrix(14, 1)) - M_pigiron * 1000 * Datamatrix(1, 1) / Datamatrix(14, 1)
Cells.Find("Solutions").Offset(0, 2).Value = Coal_for_bf_gas
'N/Nitrogen
'Nitrogen is mainly what comes in with the blast
N2_for_bf_gas = n_N2_blast
Cells.Find("Solutions").Offset(0, 3).Value = N2_for_bf_gas
'Sets in the hydrogen just in case
'H/hydrogen
'H_for_bf_gas = m_coke * Datamatrix(21, 1) / Datamatrix(20, 1) + m_oil * Datamatrix(7, 1) / Datamatrix(20, 1)
The solver part:
SolverReset 'Code solves the problem for a specific set of lines, in this case meaning hours
SolverOptions Precision:=1, Iterations:=100, AssumeNonNeg:=True
SolverOk setCell:=Cells.Find("Differences").Offset(1, 0).Address, MaxMinVal:=3, ValueOf:="0", ByChange:=Range(Cells.Find("Testing here").Offset(0, 1).Address, Cells.Find("Testing here").Offset(0, 3).Address)
SolverAdd cellRef:=Range(Cells.Find("Testing here").Offset(0, 2).Address, Cells.Find("Testing here").Offset(0, 3).Address), _
relation:=3, _
formulaText:=0.1
SolverAdd cellRef:=Range(Cells.Find("Testing here").Offset(0, 2).Address, Cells.Find("Testing here").Offset(0, 3).Address), _
relation:=1, _
formulaText:=0.4
SolverAdd cellRef:=Cells.Find("Testing here").Offset(0, 1).Address, _
relation:=3, _
formulaText:=(Bf_blast + Bf_oxygen) * 1.2
SolverAdd cellRef:=Cells.Find("Testing here").Offset(0, 1).Address, _
relation:=1, _
formulaText:=(Bf_blast + Bf_oxygen) * 2
SolverSolve userFinish:=True
BFoutput(M, 1) = Cells.Find("Testing here").Offset(0, 1).Value
BFoutput(M, 2) = Cells.Find("Testing here").Offset(0, 2).Value
BFoutput(M, 3) = Cells.Find("Testing here").Offset(0, 3).Value
M = M + 1
Loop While M < 169
Cells.Find("BF1 - Output data").Offset(2, 0).Resize(UBound(BFoutput, 1), 3).Value = BFoutput
I'm not a chemical engineer, so I don't know the equations you're trying to solve.
I'm guessing that they're non-linear, transient, and iterative. 168*3 = 504 degrees of freedom doesn't seem that large to me, but it could be a lot of work if you have lots of small time steps with iterations for each one.
I can't tell if you're doing a transient or steady state problem from the code you posted.
The numerical problems that I'm more familiar with (solid mechanics and heat transfer) are very sensitive to the choice of algorithm. The equations can be subject to time step restrictions for stability reasons, depending on the integration scheme chosen.
If you're solving a non-linear steady state problem the same comments would apply, except for iteration step size instead of time step size.
I can't glean much about this from your VB code, but I'll offer these recommendations:
Write out your equations, do a Fourier transform, and see if there are any stability restrictions.
Think about a tool kit like Matlab. They've got out of the box implementations that might be more highly optimized than your custom code.
I'm not aware of any profiling capabilities for VB or Excel, but you can't fix a problem without data. I'd see if I could get some information about where the time is being spent before hypothesizing about a solution.
