In this example, how could I know how many 0s, 1s and 2s there are in each variable?
It looks like you want to count the number of occurrences of each digit in each observation.
You can do this as follows:
clear
input str5 string
"22112"
"21012"
"22012"
"22022"
"21122"
"21112"
"21002"
"...0."
"...0."
"20002"
"..00."
"2..01"
"22212"
"21022"
"12212"
end
generate x0 = length(string) - length(subinstr(string, "0", "", .))
generate x1 = length(string) - length(subinstr(string, "1", "", .))
generate x2 = length(string) - length(subinstr(string, "2", "", .))
The idea here is to calculate the difference in the length of the string after you eliminate every instance of the digit of interest.
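For comparison, here is the same length-difference trick in Python (an illustration only, not part of the Stata answer; Python's built-in str.count makes the trick unnecessary there):

s = "21012"
x0 = len(s) - len(s.replace("0", ""))  # 1
x1 = len(s) - len(s.replace("1", ""))  # 2
x2 = len(s) - len(s.replace("2", ""))  # 2
assert x0 == s.count("0")  # the built-in gives the same answer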
The generate commands above produce the desired output:
list
+-----------------------+
| string x0 x1 x2 |
|-----------------------|
1. | 22112 0 2 3 |
2. | 21012 1 2 2 |
3. | 22012 1 1 3 |
4. | 22022 1 0 4 |
5. | 21122 0 2 3 |
|-----------------------|
6. | 21112 0 3 2 |
7. | 21002 2 1 2 |
8. | ...0. 1 0 0 |
9. | ...0. 1 0 0 |
10. | 20002 3 0 2 |
|-----------------------|
11. | ..00. 2 0 0 |
12. | 2..01 1 1 1 |
13. | 22212 0 1 4 |
14. | 21022 1 1 3 |
15. | 12212 0 2 3 |
+-----------------------+
My DataFrame looks something like this:
+----------------------------------+---------+
| Col1 | Col2 |
+----------------------------------+---------+
| Start A | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End A | 6 |
| value 6 | 3 |
| value 7 | 4 |
| value 8 | 5 |
| Start B | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End B | 6 |
| value 6 | 3 |
| value 7 | 4 |
| value 8 | 5 |
| Start C | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End C | 6 |
+----------------------------------+---------+
What I am trying to achieve: if rows containing the substrings Start and End are present, I want the rows between them (inclusive of the Start and End rows).
Expected Result is:
+----------------------------------+---------+
| Col1 | Col2 |
+----------------------------------+---------+
| Start A | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End A | 6 |
| Start B | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End B | 6 |
| Start C | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End C | 6 |
+----------------------------------+---------+
I tried the code from this question: How to filter dataframe columns between two rows that contain specific string in column?
m = df['Col1'].isin(['Start A', 'End A']).cumsum().eq(1)
df[m|m.shift()]
But this only returns the first pair of Start/End, and it also expects the exact strings.
output:
+----------------------------------+---------+
| Col1 | Col2 |
+----------------------------------+---------+
| Start A | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End A | 6 |
+----------------------------------+---------+
The answer you linked to was designed to work with a single pair of Start/End.
A more generic variant of it would be to check the parity of the cumulative count of Start/End markers (assuming strictly alternating Start/End): rows from a Start up to, but excluding, the matching End have an odd count, and the shifted mask adds the End row back:
m1 = df['Col1'].str.match(r'Start|End').cumsum().mod(2).eq(1)
# boolean indexing
out = df[m1|m1.shift()]
Alternatively, use each Start as a flag to keep the following rows and each End as a flag to drop them. This wouldn't, however, check that the letter (A/B/C) after Start matches the one after End, as the nice answer of @Quang does:
# extract Start/End
s = df['Col1'].str.extract(r'^(Start|End)', expand=False)
# set flags and ffill
m1 = s.replace({'Start': True, 'End': False}).ffill()
# boolean slicing
out = df[m1|m1.shift()]
Output:
Col1 Col2
0 Start A 1
1 value 1 2
2 value 2 3
3 value 3 4
4 value 5 5
5 End A 6
9 Start B 1
10 value 1 2
11 value 2 3
12 value 3 4
13 value 5 5
14 End B 6
18 Start C 1
19 value 1 2
20 value 2 3
21 value 3 4
22 value 5 5
23 End C 6
Let's try:
# extract the label after `Start/End`
groups = df['Col1'].str.extract(r'(?:Start|End) (.*)', expand=False)
# keep rows whose forward fill and backward fill agree: between "Start X"
# and "End X" both fills yield the label X, while rows outside a pair get
# different labels (or NaN) from the two fills and are dropped
df[groups.bfill() == groups.ffill()]
Output:
Col1 Col2
0 Start A 1
1 value 1 2
2 value 2 3
3 value 3 4
4 value 5 5
5 End A 6
9 Start B 1
10 value 1 2
11 value 2 3
12 value 3 4
13 value 5 5
14 End B 6
18 Start C 1
19 value 1 2
20 value 2 3
21 value 3 4
22 value 5 5
23 End C 6
One option is with an interval index:
Get the positions of the starts and ends:
starts = df.Col1.str.startswith("Start").to_numpy().nonzero()[0]
ends = df.Col1.str.startswith("End").to_numpy().nonzero()[0]
Build an interval index, and get matches where the index lies between Start and End:
intervals = pd.IntervalIndex.from_arrays(starts, ends, closed='both')
intervals = intervals.get_indexer(df.index)
Filter the original dataframe with the indexer, keeping rows where it is not negative (get_indexer returns -1 for rows that fall outside every interval):
df.loc[intervals >= 0]
Col1 Col2
0 Start A 1
1 value 1 2
2 value 2 3
3 value 3 4
4 value 5 5
5 End A 6
9 Start B 1
10 value 1 2
11 value 2 3
12 value 3 4
13 value 5 5
14 End B 6
18 Start C 1
19 value 1 2
20 value 2 3
21 value 3 4
22 value 5 5
23 End C 6
This is a question about Window Functions in Spark.
Assume I have this DF
DATE_S | ID | STR | VALUE
-------------------------
1 | 1 | A | 0.5
1 | 1 | A | 1.23
1 | 1 | A | -0.4
2 | 1 | A | 2.0
3 | 1 | A | -1.2
3 | 1 | A | 0.523
1 | 2 | A | 1.0
2 | 2 | A | 2.5
3 | 2 | A | 1.32
3 | 2 | A | -3.34
1 | 1 | B | 1.5
1 | 1 | B | 0.23
1 | 1 | B | -0.3
2 | 1 | B | -2.0
3 | 1 | B | 1.32
3 | 1 | B | 523.0
1 | 2 | B | 1.3
2 | 2 | B | -0.5
3 | 2 | B | 4.3243
3 | 2 | B | 3.332
This is just an example! Assume that there are many more DATE_S values for each (ID, STR), many more IDs and STRs, and many more entries per (DATE_S, ID, STR). Obviously there are multiple values per combination of (DATE_S, ID, STR).
Now I do this:
val w = Window.partitionBy("ID", "STR").orderBy("DATE_S").rangeBetween(-N, -1)
df.withColumn("RESULT", function("VALUE").over(w))
where N might lead to the inclusion of a large range of rows, from 100 to 100000 and more, depending on ("ID", "STR")
The result will be something like this
DATE_S | ID | STR | VALUE | RESULT
----------------------------------
1      | 1  | A   | 0.5   | R1
1      | 1  | A   | 1.23  | R1
1      | 1  | A   | -0.4  | R1
2      | 1  | A   | 2.0   | R2
3      | 1  | A   | -1.2  | R3
3      | 1  | A   | 0.523 | R3
1      | 2  | A   | 1.0   | R4
2      | 2  | A   | 2.5   | R5
3      | 2  | A   | 1.32  | R6
3      | 2  | A   | -3.34 | R6
1      | 1  | B   | 1.5   | R7
1      | 1  | B   | 0.23  | R7
1      | 1  | B   | -0.3  | R7
2      | 1  | B   | -2.0  | R8
3      | 1  | B   | 1.32  | R9
3      | 1  | B   | 523.0 | R9
1      | 2  | B   | 1.3   | R10
2      | 2  | B   | -0.5  | R11
3      | 2  | B   | 4.3243| R12
3      | 2  | B   | 3.332 | R12
There are identical "RESULT"s because for every row with an identical (DATE_S, ID, STR), the values that go into the calculation of "function" are the same.
My question is this:
Does Spark call "function" once for each ROW (recalculating the same value multiple times), or does it calculate it once per range (frame?) of values and just paste the result onto all rows that fall in that range?
Thanks for reading :)
From what I can see, the result may not be the same if you run it twice on your data, as there is no ordering that breaks ties within DATE_S. But let us leave that aside.
Whilst there is codegen optimization, there is nothing documented saying that Spark checks, in the way you describe, whether the next row's frame covers the same set of data as the previous row's. I have never read of that type of optimization. There is fusing due to the lazy evaluation approach, but that is another matter. So, per row it calculates again.
From a great source: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-functions-windows.html
... At its core, a window function calculates a return value for every
input row of a table based on a group of rows, called the frame. Every
input row can have a unique frame associated with it. ...
... In other words, when executed, a window function computes a value
for each and every row in a window (per window specification). ...
The biggest issue is to have a suitable number of partitions for parallel processing, which is expensive, but this is big data. partitionBy("ID", "STR") is the clue here, and that is a good thing.
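If the per-row recalculation matters for performance and your function decomposes over rows (a sum, count, or a mean built from sum/count), one workaround is to pre-aggregate to one row per (DATE_S, ID, STR), run the window there, and join the result back. A minimal PySpark sketch, assuming df and N as above and using sum as a stand-in for "function":

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# one row per (DATE_S, ID, STR), carrying the per-date total
daily = df.groupBy("DATE_S", "ID", "STR").agg(F.sum("VALUE").alias("day_sum"))

# the window now sees a single row per date, so each frame is computed once
w = Window.partitionBy("ID", "STR").orderBy("DATE_S").rangeBetween(-N, -1)
daily = daily.withColumn("RESULT", F.sum("day_sum").over(w))

# paste the per-frame result back onto the original rows
out = df.join(daily.select("DATE_S", "ID", "STR", "RESULT"),
              ["DATE_S", "ID", "STR"], "left")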
I'm doing a case-control study about ovarian cancer. I want to do stratified analyses for the different histotypes but haven't found a good way of doing it in SPSS. I was thinking about copying the information about the diagnoses from the cases to the controls, but I don't know the proper syntax to do it.
So - what I want to do is to find the diagnosis within the case-control pair, copy it, and paste it into the same variable for all the controls within that pair. Does anyone know a good way to do this?
ID = unique ID for the individual; casecontrol = 1 for case, 0 for control; caseset = stratum ID, shared by each matched group of individuals.
My dataset looks like this:
ID | casecontrol | caseset | diagnosis
1 | 1 | 1 | 1
2 | 0 | 1 | 0
3 | 0 | 1 | 0
4 | 0 | 1 | 0
5 | 1 | 2 | 3
6 | 0 | 2 | 0
7 | 0 | 2 | 0
8 | 0 | 2 | 0
And I want it to look like this:
ID | casecontrol | caseset | diagnosis
1 | 1 | 1 | 1
2 | 0 | 1 | 1
3 | 0 | 1 | 1
4 | 0 | 1 | 1
5 | 1 | 2 | 3
6 | 0 | 2 | 3
7 | 0 | 2 | 3
8 | 0 | 2 | 3
Thank you very much.
According to your example, within each value of caseset you have one line where diagnosis equals some positive number, while in the rest of the lines diagnosis equals zero (or is it missing?).
If this is true, all you need to do is this:
aggregate outfile=* mode=addvariables overwrite=yes /break=caseset /diagnosis=max(diagnosis).
The above command will overwrite the original diagnosis variable, so make sure you have the data backed up, or use a different name for the aggregated variable (e.g. /FullDiagnosis=max(diagnosis)).
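For what it's worth, the same group-wise fill can be written in pandas (just an illustration, not part of the SPSS answer; the DataFrame below rebuilds the question's example data):

import pandas as pd

df = pd.DataFrame({'ID': range(1, 9),
                   'casecontrol': [1, 0, 0, 0, 1, 0, 0, 0],
                   'caseset':     [1, 1, 1, 1, 2, 2, 2, 2],
                   'diagnosis':   [1, 0, 0, 0, 3, 0, 0, 0]})

# propagate the maximum diagnosis within each caseset to all of its rows
df['diagnosis'] = df.groupby('caseset')['diagnosis'].transform('max')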
I have the following table of results in a Pandas DataFrame. Each player has been assigned an ID number:
+----------------+----------------+-------------+-------------+
| Home Player ID | Away Player ID | Home Points | Away Points |
+----------------+----------------+-------------+-------------+
| 1 | 2 | 3 | 0 |
| 3 | 4 | 1 | 1 |
| 2 | 3 | 3 | 0 |
| 4 | 1 | 3 | 0 |
| 2 | 4 | 1 | 1 |
| 3 | 1 | 1 | 1 |
| 2 | 1 | 0 | 3 |
| 4 | 3 | 1 | 1 |
| 3 | 2 | 0 | 3 |
| 1 | 4 | 0 | 3 |
| 4 | 2 | 1 | 1 |
| 1 | 3 | 1 | 1 |
+----------------+----------------+-------------+-------------+
The aim is to create a 4x4 numpy matrix (dimensions equal to the number of players) and fill the matrix with the points they earned from games between the respective players.
The matrix should end up like this:
+--------+---+---+---+---+
| Matrix | 1 | 2 | 3 | 4 |
+--------+---+---+---+---+
| 1 | 0 | 3 | 1 | 0 |
| 2 | 0 | 0 | 3 | 1 |
| 3 | 1 | 0 | 0 | 1 |
| 4 | 3 | 1 | 1 | 0 |
+--------+---+---+---+---+
The left hand column is the ID number of the home players, with the column headers the IDs of the away players.
For example, when the Home Player ID = 1 and the Away Player ID = 2, Player 1 earned 3 points, so the entry for the Matrix(1,2) (or 0,1 because of the zero indexing) would equal 3.
I can just about manage to do this with two for loops, but it seems quite inefficient. Is there a better way to achieve this?
Would really appreciate any advice!
Use
In [217]: df.pivot_table(columns='Home Player ID', index='Away Player ID',
values='Away Points', fill_value=0)
Out[217]:
Home Player ID 1 2 3 4
Away Player ID
1 0 3 1 0
2 0 0 3 1
3 1 0 0 1
4 3 1 1 0
Or use
In [221]: df.set_index(['Away Player ID', 'Home Player ID'])['Away Points'].unstack(fill_value=0)
Out[221]:
Home Player ID 1 2 3 4
Away Player ID
1 0 3 1 0
2 0 0 3 1
3 1 0 0 1
4 3 1 1 0
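Either result is a DataFrame; call .to_numpy() on it if you need a plain 4x4 numpy array. Both snippets build the matrix from the away side; a variant that follows the question's wording literally (home points, rows = Home Player ID, columns = Away Player ID) is the sketch below, which for this data yields the same numbers:

df.pivot_table(index='Home Player ID', columns='Away Player ID',
               values='Home Points', fill_value=0).to_numpy()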
So I'm trying to count how many people sold something at the end of the day; it doesn't matter how many items a person sold.
Name Shoes Shirts Hat
A 1 2
B 1
C 1 3
D 1 1
E
So if A sold anything, that should count as 1 person who sold something.
If E did not sell anything, E should not be counted at all.
Example spreadsheet with data from C6 through F10:
| C | D | E | F |
5 |name|shoe|shirt|hat|
6 | a | 1 | 1 | 0 |
7 | b | 0 | 1 | 0 |
8 | c | 1 | 1 | 0 |
9 | d | 1 | 1 | 0 |
10| e | 0 | 0 | 0 |
You could use this formula on the data example above:
=SUM(IF($D$6:$D$10+$E$6:$E$10+$F$6:$F$10>0,1,0))
but you have to hit Ctrl+Shift+Enter, not just Enter (because it is an array formula). You will know you did this correctly because it automatically adds { } around the formula, so it will look like this:
{=SUM(IF($D$6:$D$10+$E$6:$E$10+$F$6:$F$10>0,1,0))}
What this formula does:
It evaluates IF(D6+E6+F6 > 0, 1, 0) for each row down to D10+E10+F10, which leaves you with either a 1 or a 0 per row: 1 if the sum of the row is > 0, and 0 if it is not. SUM then adds those values up, giving you a count of the rows that had at least one sale of a shoe, shirt, or hat.
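For comparison, the same count in pandas (an illustration only, using the answerer's example data):

import pandas as pd

df = pd.DataFrame({'name':  ['a', 'b', 'c', 'd', 'e'],
                   'shoe':  [1, 0, 1, 1, 0],
                   'shirt': [1, 1, 1, 1, 0],
                   'hat':   [0, 0, 0, 0, 0]})

# a person counts once if they sold at least one item of any kind
sellers = (df[['shoe', 'shirt', 'hat']].sum(axis=1) > 0).sum()
print(sellers)  # 4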