Filter Solr query by field

I've got a large set of records like the ones below in my index, and what I'm trying to do is find objects by their sub property. For example, if I filter by sub = "5 7 8 10 820", it should return objects B and C, because they both have 5, 7, 8, 10 and 820 in their sub property.
To generalize: an object's sub should contain all of the values (5, 7, 8, 10, 820) passed in the filter.
Object A has only 5, 7 and 8. Therefore it doesn't satisfy the filter.
Object B has 5, 7, 8, 10 and 820 in its sub property, so it satisfies the filter, and so does object C.
How can I fix my query to achieve such behavior?
This is my current query, which seems to return every object that contains any of the filter values in its properties:
q=*:*&rows=100&start=0&sort=id+asc&fq=%2Bsub:5+7+8+10+820
Object A: {
"id":"ke131j-nan139-1239Mzf-sazr",
"sub":"0 1 3 4 5 7 8"
etc...
}
Object B: {
"id":"ke131j-1239Mzf-nan139-sacr",
"sub":"5 7 8 9 10 517 820 1121 1124"
etc...
}
Object C: {
"id":"nan139-1239Mzf-sazr-ke131j",
"sub":"5 7 8 10 11 15 783 820 825 921 924"
etc...
}

The answer from the comments was helpful: q=sub:(5 AND 7 AND 8 AND 10 AND 820) worked.
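For completeness, here is a minimal sketch of sending the corrected filter from Python with requests; the host, port and core name ("mycore") are placeholders, not taken from the question. Because every term in the fq clause is required, only documents whose sub field contains all of the values match.

import requests

# Hypothetical Solr endpoint; adjust host/core to your setup.
params = {
    "q": "*:*",
    "rows": 100,
    "start": 0,
    "sort": "id asc",
    "fq": "sub:(5 AND 7 AND 8 AND 10 AND 820)",  # all terms are required
}
resp = requests.get("http://localhost:8983/solr/mycore/select", params=params)
print(resp.json()["response"]["numFound"])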

Related

Sum rows with same values and write it in new cell

I have the following table:

OrderNumber  Value
123          2
123          3
333          5
333          6
555          8
555          9
My goal is to sum the Values of all rows that share the same OrderNumber (e.g. for OrderNumber 123 the sum should be 5) and output the result in a new column.
The output should be like this:
OrderNumber  Value  Result
123          2      5
123          3      5
333          5      11
333          6      11
555          8      17
555          9      17
I've seen some formulas beginning with =SUM(A2:A6;A2;B2:B6). What's important to me is that the search criteria are determined dynamically, because my table has about 1k rows.
Do you have any references or suggestions?
You need the SUMIF() function.
=SUMIF($A$2:$A$7,A2,$B$2:$B$7)
If you are a Microsoft 365 user, you can try BYROW() to do it in one go.
=BYROW(A2:A7,LAMBDA(x,SUMIF(A2:A7,x,B2:B7)))
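As a side note (not part of the Excel answer above), the same per-group total can be sketched in pandas, assuming the two columns are loaded into a DataFrame:

import pandas as pd

df = pd.DataFrame({
    "OrderNumber": [123, 123, 333, 333, 555, 555],
    "Value": [2, 3, 5, 6, 8, 9],
})
# transform('sum') broadcasts each group's total back onto every row of that group
df["Result"] = df.groupby("OrderNumber")["Value"].transform("sum")
print(df)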
This is exactly what the "Subtotals" feature was invented for.

pandas value_counts shows duplicates

Here is the code that I am using
all_data.groupby('BsmtFullBath').BsmtFullBath.count()
and the output is coming up as
BsmtFullBath
0 856
1 588
2 15
3 1
0 849
1 584
2 23
3 1
NA 2
Name: BsmtFullBath, dtype: int64
I'm expecting each value to appear only once, but "0" shows up twice.
I believe that if you want to get rid of the duplicated values, you can use the map function just like in the example below:
df_final['DC'] = df_final['DC'].map({'NO':0, 'WT':1, 'BU':2,'CT':3,'BT':4, 'CD':5})
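My guess (an assumption, since the dtypes aren't shown in the question) is that the column has object dtype and mixes numeric and string entries (e.g. 0 and "0", plus the literal string "NA"), so the same-looking value is counted under two different keys. A sketch that normalises the dtype before counting:

import pandas as pd

# "NA" and any other non-numeric entries become NaN; 0 and "0" collapse to 0.0
counts = pd.to_numeric(all_data["BsmtFullBath"], errors="coerce").value_counts(dropna=False)
print(counts)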

Python - group by + transform + substring

I'm trying to extract values from a string in a pandas data frame, using two other columns as indices.
My data looks like this.
Address second_dot third_dot
0 1.273.1735.0 5 10
1 1.263.48.0 5 8
2 1.273.1341.0 5 10
3 1.273.1527.0 5 10
4 1.273.1379.0 5 10
5 1.273.1094.0 5 10
6 1.273.845.0 5 9
7 1.273.1393.0 5 10
8 1.275.988.0 5 9
9 1.273.973.0 5 9
In the columns second_dot and third_dot I've stored the positions of the '.' characters within the Address column. What I would like to do is extract, from each row, all characters between the second and third dot.
The result should be like this:
Result
273
263
273
273
273
273
273
273
275
273
I've already managed to do it by using apply on axis 1 with a custom function, but it takes too long (I've got millions of records in my data frame). Since the addresses are repeated across rows, I'm trying to group the calculation by Address, hoping to speed it up.
This is my last attempt, but it does not work.
df.groupby(['Address']).transform(lambda x:
    x['Address'].str[x['first_dot']:x['second_dot']])
I get the error: KeyError: ('Address', 'occurred at index MachineIdentifier').
MachineIdentifier is the first column of my df (not the index, a normal column)
Thanks a lot in advance
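Judging from the expected output, the wanted piece is simply the second dot-separated field of Address, so a vectorized string split may already be enough. This is a sketch under that assumption, not the groupby/transform route the question asks about:

import pandas as pd

df = pd.DataFrame({"Address": ["1.273.1735.0", "1.263.48.0", "1.275.988.0"]})
# split on '.' and take the second field; avoids apply(axis=1) entirely
df["Result"] = df["Address"].str.split(".").str[1]
print(df)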

Assigning integers to dataframe fields: `OverflowError: Python int too large to convert to C unsigned long`

I have a dataframe df that looks like this:
var val
0 clump_thickness 5
1 unif_cell_size 1
2 unif_cell_shape 1
3 marg_adhesion 1
4 single_epith_cell_size 2
5 bare_nuclei 1
6 bland_chrom 3
7 norm_nucleoli 1
8 mitoses 1
9 class 2
11 unif_cell_size 4
12 unif_cell_shape 4
13 marg_adhesion 5
14 single_epith_cell_size 7
15 bare_nuclei 10
17 norm_nucleoli 2
20 clump_thickness 3
25 bare_nuclei 2
30 clump_thickness 6
31 unif_cell_size 8
32 unif_cell_shape 8
34 single_epith_cell_size 3
35 bare_nuclei 4
37 norm_nucleoli 7
40 clump_thickness 4
43 marg_adhesion 3
50 clump_thickness 8
51 unif_cell_size 10
52 unif_cell_shape 10
53 marg_adhesion 8
... ... ...
204 single_epith_cell_size 5
211 unif_cell_size 5
215 bare_nuclei 7
216 bland_chrom 7
217 norm_nucleoli 10
235 bare_nuclei -99999
257 norm_nucleoli 6
324 single_epith_cell_size 8
I want to create a new column that holds the var and val values combined and converted to a number. I wrote the following code:
df['id'] = df.apply(lambda row: int.from_bytes('{}{}'.format(row.var, row.val).encode(), 'little'), axis = 1)
When I run this code I get the following error:
df['id'] = df.apply(lambda row: int.from_bytes('{}{}'.format(row.var, row.val).encode(), 'little'), axis = 1)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4262, in apply
ignore_failures=ignore_failures)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4384, in _apply_standard
result = Series(results)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/series.py", line 205, in __init__
default=np.nan)
File "pandas/_libs/src/inference.pyx", line 1701, in pandas._libs.lib.fast_multiget (pandas/_libs/lib.c:68371)
File "pandas/_libs/src/inference.pyx", line 1165, in pandas._libs.lib.maybe_convert_objects (pandas/_libs/lib.c:58498)
OverflowError: Python int too large to convert to C unsigned long
I don't understand why. If I run
maximum = 0
for column in df['var'].unique():
    for value in df['val'].unique():
        if int.from_bytes('{}{}'.format(column, value).encode(), 'little') > maximum:
            maximum = int.from_bytes('{}{}'.format(column, value).encode(), 'little')
        print(int.from_bytes('{}{}'.format(column, value).encode(), 'little'))
print()
print(maximum)
I get the following result:
65731626445514392434127804442952952931
67060854441299308307031611503233297507
65731626445514392434127804442952952931
68390082437084224179935418563513642083
69719310432869140052839225623793986659
73706994420223887671550646804635020387
16399285238650560638676108961167827102819
67060854441299308307031611503233297507
72377766424438971798646839744354675811
75036222416008803544454453864915364963
69719310432869140052839225623793986659
16399285238650560638676108961167827102819
76365450411793719417358260925195709539
68390082437084224179935418563513642083
76365450411793719417358260925195709539
73706994420223887671550646804635020387
83632281929131549175300318205721294812263623257187
71048538428654055925743032684074331235
75036222416008803544454453864915364963
72377766424438971798646839744354675811
277249955343544548646026928445812341
256480767909405238131904943128931957
266865361626474893388965935787372149
287634549060614203903087921104252533
64059424565585367137514643836585471605
261673064767940065760435439458152053
282442252202079376274557424775032437
.....
60968996531299
69179002195346541894528099
58769973275747
62068508159075
59869484903523
6026341019714892551838472781928948268513458935618931750446847388019
Based on these results I would say that the conversion to integers works fine. Furthermore, the largest created integer is not so big that it should cause problems when being inserted into the dataframe, right?
Question: How can I successfully create a new column with the newly created integers? What am I doing wrong here?
Edit: Although bws's solution
str(int.from_bytes('{}{}'.format(column, value).encode(), 'little'))
solves the error, I now have a new problem: the ids are all unique. I don't understand why this happens, but I suddenly have 3000 unique ids, while there are only 92 unique var/val combinations.
I don't know why. Maybe the lambda uses Python's int by default instead of int64?
I have a workaround that maybe is useful for you.
Convert the result to a string (object):
df['id'] = df.apply(lambda row: str(int.from_bytes('{}{}'.format(row["var"], row["val"]).encode(), 'little')), axis = 1)
This is interesting to know: https://docs.scipy.org/doc/numpy-1.10.1/user/basics.types.html
uint64 Unsigned integer (0 to 18446744073709551615)
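Those ids have dozens of digits, far beyond that 64-bit ceiling. A quick illustration (the exact error text may vary with numpy version and platform):

import numpy as np

print(np.iinfo(np.uint64).max)          # 18446744073709551615
try:
    np.array([2**64], dtype=np.uint64)  # one past the maximum
except OverflowError as exc:
    print("OverflowError:", exc)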
Edit:
After reading the last link, I assume that when you use a loop you are using the Python int type, not the int that pandas uses (which comes from numpy). So when you work with a DataFrame you are using the types that numpy provides...
numpy's int types are fixed-size, unlike Python's object-based int, so I think the correct way to work with such large integers is to use object dtype.
That's my conclusion, but maybe I am wrong.
Edit (second question):
A simple example works:
import pandas as pd

d2 = {'val': [2, 1, 1, 2],
      'var': ['clump_thickness', 'unif_cell_size', 'unif_cell_size', 'clump_thickness']}
df2 = pd.DataFrame(data=d2)
df2['id'] = df2.apply(lambda row: str(int.from_bytes('{}{}'.format(row["var"], row["val"]).encode(), 'little')), axis = 1)
Result of df2:
print (df2)
val var id
0 2 clump_thickness 67060854441299308307031611503233297507
1 1 unif_cell_size 256480767909405238131904943128931957
2 1 unif_cell_size 256480767909405238131904943128931957
3 2 clump_thickness 67060854441299308307031611503233297507
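Not what the question literally asked for, but if the end goal is only a distinct id per (var, val) combination, a sketch with pandas.factorize sidesteps huge integers entirely and yields exactly one small id per unique combination (which also speaks to the edit about 92 combinations):

import pandas as pd

# assumes df with 'var' and 'val' columns as in the question
key = df["var"].astype(str) + "|" + df["val"].astype(str)
df["id"] = pd.factorize(key)[0]   # small ints: 0, 1, 2, ...
print(df["id"].nunique())         # equals the number of unique var/val pairs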

Different strategies for finding quartiles in Excel

Let us consider the following data:
9
5
3
10
14
6
12
7
14
I would like to find Q1, Q2, Q3. Let's sort the data:
3
5
6
7
9
10
12
14
14
In Excel we can calculate them very easily:
=QUARTILE(A2:A10,1)
=QUARTILE(A2:A10,2)
=QUARTILE(B2:B10,3)
The results are:
6
9
12
but if we calculate by hand, we get the following results:
5.5
9
13
Why is the result so different? Thanks in advance.
The definition of quartile is not unequivocal, so there are multiple methods to calculate quartiles. See https://en.wikipedia.org/wiki/Quartile
In Excel there are multiple Quartile functions now, see https://support.office.com/en-us/article/QUARTILE-function-93cf8f62-60cd-4fdb-8a92-8451041e1a2a?ui=en-US&rs=en-US&ad=US
QUARTILE and QUARTILE.INC use Method 2, while QUARTILE.EXC uses Method 1.
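The same split shows up in Python's statistics module, which implements both conventions; a small sketch (the 'exclusive' method should match the hand calculation, 'inclusive' should match Excel's QUARTILE/QUARTILE.INC):

import statistics

data = [9, 5, 3, 10, 14, 6, 12, 7, 14]

# median excluded from both halves -> the "by hand" numbers
print(statistics.quantiles(data, n=4, method="exclusive"))  # [5.5, 9.0, 13.0]

# median included -> what Excel's QUARTILE / QUARTILE.INC returns
print(statistics.quantiles(data, n=4, method="inclusive"))  # [6.0, 9.0, 12.0]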
