How to parse a baseball box score in R - string

I am working on a research project with baseball data from retrosheet.org. I want to create variables for the score of each team in each inning (Vis1, Home1, Vis2, Home2, etc). The problem is that the variable for the box score is coded strangely. Each team has its own variable for the whole game, and each inning gets one value. Because leading zeros are cut off, a value of "12(10)1X" would mean that a team did not score in the first 4 innings, scored once in the fifth, twice in the sixth, ten times in the seventh, once in the eighth, and did not have to play the ninth because they had won by that point.
Any advice? I'm at a loss. The parentheses confuse me the most.

There is an example in this talk at useR! 2012 that may provide more information specific to your baseball project.
You can find it here.

I'm Irish and live in Wales and have no clue about baseball, but I think I remember hearing that there can only be a maximum of 9 innings? (Honestly, no clue!)
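bbscore = function(x)
{
  scores = c()
  # Split the line score into single characters.
  score = unlist(strsplit(x, split=""))
  i = 1
  while (i <= length(score))
  {
    if (score[i] == "(")
    {
      # A parenthesised group is one inning with 10 or more runs:
      # collect every digit up to the matching ")".
      close = i + which(score[(i+1):length(score)] == ")")[1]
      scores = c(scores, paste(score[(i+1):(close-1)], collapse=""))
      i = close + 1
      next
    }
    scores = c(scores, score[i])
    i = i + 1
  }
  return(scores)
}
> x
[1] "12(10)1X"
> bbscore(x)
[1] "1"  "2"  "10" "1"  "X"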
> scores.df = read.csv("GL1995.TXT",header=F)
> head(scores.df$V20)
[1] 200030300 000000000 000300020 000000010 100100010 001002300
1355 Levels: (11)00033102 00000000 000000000 0000000000 ... 710001001
> scores.df$V20 = as.character(scores.df$V20)
> V20.1995.scores = lapply(scores.df$V20, bbscore)
> V20.1995.scores[[1]]
[1] "2" "0" "0" "0" "3" "0" "3" "0" "0"
> V20.1995.scores[[2]]
[1] "0" "0" "0" "0" "0" "0" "0" "0" "0"
> V20.1995.scores[[3]]
[1] "0" "0" "0" "3" "0" "0" "0" "2" "0"
Of course you'll have to do some further manipulation to get them into numbers and deal with the X's. This will also break if there are any other unexpected characters, in addition to being beholden to the assumption of 9 innings.
EDIT:
I removed the stipulation of 9 innings and show how to do this for the entire column (assuming that the scores you spoke of are indeed the 20th variable in the csv file). Extra processing is required for games with different numbers of innings: do.call(rbind,...) won't work as-is. Maybe find the longest game and append "X"s to the end of the others to make them all the same length? I'm not sure, but I think this question has been answered at least.
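For comparison, here is a minimal sketch of the same tokenising idea in Python, including the padding step suggested above. It assumes a line score contains only single digits, parenthesised multi-digit innings, and a trailing "X" for an unplayed half-inning. The names parse_line_score and pad_games are made up for this illustration and do not come from any package.

```python
import re

# One token per inning: "(10)" style group, a single digit, or X (not played).
TOKEN = re.compile(r"\((\d+)\)|(\d)|([Xx])")

def parse_line_score(line):
    """Return one entry per inning: an int, or None where the inning was not played."""
    runs = []
    pos = 0
    for m in TOKEN.finditer(line):
        if m.start() != pos:
            # Something between two tokens was not recognised.
            raise ValueError(f"unexpected character at position {pos} in {line!r}")
        pos = m.end()
        if m.group(3):
            runs.append(None)
        else:
            runs.append(int(m.group(1) or m.group(2)))
    if pos != len(line):
        raise ValueError(f"unexpected character at position {pos} in {line!r}")
    return runs

def pad_games(games, fill=None):
    """Pad every game to the length of the longest one so they form a rectangle."""
    width = max((len(g) for g in games), default=0)
    return [g + [fill] * (width - len(g)) for g in games]

games = [parse_line_score(s) for s in ["12(10)1X", "200030300", "0000000001"]]
for row in pad_games(games):
    print(row)
```

Padding with None rather than "X" keeps the played innings numeric, so unplayed innings can be treated as missing values later on.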

Late answer, but...
There is a new R package for fetching data from the MLB server including box score and much more. Might be worth a look!
openWar

Related

Kusto String Difference

I need help finding the difference between 2 strings. For example, the difference between the strings outlook and outlooka should be "a"; even just the number of characters that differ would work fine.
I am okay with converting the strings to array and calculating the set difference as well.
Any help is much appreciated. Thank you.
I am trying to identify homoglyph domains with minor changes.
This query counts each character occurrences in each string and returns the differences.
datatable(id:int, str1:string, str2:string)
[
1 ,"outlook" ,"outlooka"
,2 ,"outlook" ,"outlok"
,3 ,"outlook" ,"outllooook"
,4 ,"outlook" ,"lookout"
]
| mv-apply c = extract_all("(.)", strcat(str1, str2)) to typeof(string)
,s = array_concat(repeat("1", strlen(str1)), repeat("2", strlen(str2))) to typeof(string) on
(
summarize count_diff = countif(s == 2) - countif(s == 1) by c
| summarize char_diff = make_bag_if(bag_pack(c, count_diff), count_diff != 0)
)
id  str1     str2        char_diff
1   outlook  outlooka    {"a":1}
2   outlook  outlok      {"o":-1}
3   outlook  outllooook  {"o":2,"l":1}
4   outlook  lookout     {}
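The per-character counting that the query performs can also be sketched outside Kusto. The following Python version is an illustration only, not a translation of the query. The function name char_diff is borrowed from the result column, and a positive count means the character occurs more often in the second string.

```python
from collections import Counter

def char_diff(str1, str2):
    """Per-character count difference (str2 minus str1), omitting zero entries."""
    counts = Counter(str2)
    counts.subtract(Counter(str1))  # subtract keeps negative counts
    return {ch: n for ch, n in counts.items() if n != 0}

for other in ["outlooka", "outlok", "outllooook", "lookout"]:
    print(other, char_diff("outlook", other))
```

Like the query, this ignores character order, so "lookout" yields an empty difference. Summing the absolute values of the counts gives the number of differing characters.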

Pattern Matching BASIC programming Language and Universe Database

I need to identify following patterns in string.
- "2N':'2N':'2N"
- "2N'-'2N'-'2N"
- "2N'/'2N'/'2N"
- "2N'/'2N'-'2N"
AND SO ON.....
Basically, written in plain language, I want this pattern:
2 NUMBERS [: / -] 2 NUMBERS [: / -] 2 NUMBERS
Is there any way to write one pattern that covers all the possible cases? Otherwise I have to write 9 patterns and match each of them against the string. My real code is worse: I have to match four 2-digit numbers separated by [: / -], for which I would have to write 27 patterns. I have used the case of three 2-digit numbers here only to keep the example simple.
Please help. Thank you.
Maybe you could try something like (Pick R83 style)
OK = X MATCH "2N1X2N1X2N" AND X[3,1]=X[6,1] AND INDEX(":/-",X[3,1],1) > 0
Where variable X is some input string like: 12-34-56
Should set variable OK to 1 if validation passes, else 0 for any invalid format.
This seems to get all your required validation into a single statement. I have assumed that the non-numeric characters have to be the same. If this is not true, the check could be changed to something like:
OK = X MATCH "2N1X2N1X2N" AND INDEX(":/-",X[3,1],1) > 0 AND INDEX(":/-",X[6,1],1) > 0
Ok, I guess the requirement of surrounding characters was not obvious to me. Still, it does not make it much harder. You just need to parse the string, looking for the first (I assume) such pattern, if any, in the input string. This can be done in a couple of lines of code. Here is a (rather untested) R83-style test program:
PROMPT ":"
LOOP
   LOOP
      CRT 'Enter test string':
      INPUT S
   WHILE S # "" AND LEN(S) < 8 DO
      CRT "Invalid input! Hit RETURN to exit, or enter a string with >= 8 chars!"
   REPEAT
UNTIL S = "" DO
   *
   * Look for 1st occurrence of pattern in string..
   CARDNUM = ""
   FOR I = 1 TO LEN(S)-7 WHILE CARDNUM = ""
      IF S[I,8] MATCH "2N1X2N1X2N" THEN
         IF INDEX(":/-",S[I+2,1],1) > 0 AND INDEX(":/-",S[I+5,1],1) > 0 THEN
            CARDNUM = S[I,8] ;* Found it!
         END ELSE I = I + 8
      END
   NEXT I
   *
   CRT CARDNUM
REPEAT
There are only 7 or 8 lines here that actually look for the card number pattern in the source/test string.
Not quite perfect, but how about 2N1X2N1X2N? This gets you 2 numbers followed by 1 of any character, followed by 2 numbers, and so on.
This might help:
BIG.STRING ="HELLO TILDE ~ CARD 12:34:56 IS IN THIS STRING"
TEMP.STRING = BIG.STRING
CONVERT "~:/-" TO "*~~~" IN TEMP.STRING
IF TEMP.STRING MATCHES '0X2N"~"2N"~"2N0X' THEN
FIRST.TILDE.POSN = INDEX(TEMP.STRING,"~",1)
CARD.STRING = BIG.STRING[FIRST.TILDE.POSN-2,8]
PRINT CARD.STRING
END
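For readers more used to regular expressions than to MATCH templates, here is a sketch of the same two checks in Python: one pattern that allows any mix of the three separators, and one that requires both separators to be identical. This is only an illustration and says nothing about what UniVerse BASIC itself supports. The names ANY_SEP, SAME_SEP and find_first are invented for the example.

```python
import re

# Two digits, a separator from ":/-", two digits, a separator, two digits.
ANY_SEP = re.compile(r"\d{2}[:/-]\d{2}[:/-]\d{2}")

# Same shape, but the backreference \1 forces the second separator
# to equal the first one.
SAME_SEP = re.compile(r"\d{2}([:/-])\d{2}\1\d{2}")

def find_first(pattern, text):
    """Return the first substring matching the pattern, or "" if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else ""

big_string = "HELLO TILDE ~ CARD 12:34:56 IS IN THIS STRING"
print(find_first(ANY_SEP, big_string))
print(find_first(ANY_SEP, "12/34-56"))
print(repr(find_first(SAME_SEP, "12/34-56")))
```

The character class [:/-] plays the role of the INDEX(":/-", ...) tests, so a single pattern covers every separator combination.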

I'm being asked to initialize a list, l_counts, with as many 0s as there are characters in the English alphabet

L_counts will keep the count for 'a' at position 0, the count for 'b' at position 1, and so on. I must have a way to know which English letter corresponds to each position in L_counts. I don't quite understand the instruction: should I create an empty list, or put 0, 1, 2 and so on in the list?
L_count = []
L_counts = [0,1,2,3.. so on]
Any problem with the following?
L_counts = [0]*26
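As for knowing which letter belongs to which position: one possible approach, sketched below, is to compute the index from the character code, so that 'a' maps to 0 and 'z' to 25. The helper name count_letters and the sample word are made up for the illustration.

```python
import string

# One zero per letter of the English alphabet (26 entries).
L_counts = [0] * len(string.ascii_lowercase)

def count_letters(text, counts):
    """Add the letter frequencies of text to counts; 'a' is index 0, 'b' is index 1, ..."""
    for ch in text.lower():
        if ch in string.ascii_lowercase:
            counts[ord(ch) - ord("a")] += 1
    return counts

count_letters("Banana", L_counts)

# Going back from a position to its letter:
for position, count in enumerate(L_counts):
    if count:
        print(string.ascii_lowercase[position], count)
```

Here string.ascii_lowercase[position] recovers the letter for a given position, and ord(ch) - ord("a") goes the other way.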

R string comparison

I am new to R and am trying to bring together two datasets (here answc and diagc) based on matching contents. Since the string "1 - Tester1" occurs twice in answc, I would expect answc==diagc to put a 1 (= true) in res at least twice. See the example below.
Where did I go wrong?
head(answc)
[1] "1 - Tester1" "2 - Tester2" "3 - Tester3" "1 - Tester1" "2 - Tester2"
[6] "3 - Tester3"
is.character(answc)
[1] TRUE
head(diagc)
[1] "1 - Tester1"
is.character(diagc)
[1] TRUE
res<-ifelse(answc==diagc, 1, 0)
head(res)
[1] 0 0 0 0 0 0
Thank you for the feedback
The hint about str() confirmed that the problem was probably in the data types. I redid the whole process with data from ANSI-formatted csv files, read them with "stringsAsFactors=FALSE", and made sure that the relevant answc and diagc really are "chr".
The second attempt produced the desired matches, and although I can't point to the exact error, I would like to close this question.
Thank you
Christian
PS: From now on I'll always check the encoding and the classes of the elements involved in a comparison/match...

Inconsistent behavior between str_split and strsplit

The documentation for str_split in the stringr package states that for the pattern argument:
If "" splits into individual characters.
which suggests it behaves the same as strsplit in this regard. However,
library(stringr)
str_split("abcab","")
[[1]]
[1] "" "a" "b" "c" "a" "b"
with a leading empty string. This compares with,
strsplit("abcab","")
[[1]]
[1] "a" "b" "c" "a" "b"
Leading empty strings seem to be normal behavior when splitting on non-empty strings,
strsplit("abcab","ab")
[[1]]
[1] "" "c"
but even then, str_split generates an 'extra' trailing empty string:
str_split("abcab","ab")
[[1]]
[1] "" "c" ""
Is this discrepancy a bug, feature, an error in the documentation or just a different notion of what's 'expected behavior'?
If you use commas as delimiters, the "expected" (your mileage may vary) result is more obvious:
# expect "" "2" "3" "4" ""
strsplit(",2,3,4,", ",")
# [[1]]
# [1] "" "2" "3" "4"
str_split(",2,3,4,", ",")
# [[1]]
# [1] "" "2" "3" "4" ""
If I have n commas then I expect (n+1) elements to be returned, so I prefer the results from str_split. However, I wouldn't necessarily call this a bug in strsplit, since it performs as advertised:
(from ?strsplit) Note that this means that if there is a match at the beginning of
a (non-empty) string, the first element of the output is ‘""’, but
if there is a match at the end of the string, the output is the
same as with the match removed.
"" is trickier, as there is no way to count the number of times "" appears in a string. Therefore treating it as a special case seems justified.
(from ?str_split) If ‘""’ splits into individual characters.
Based on this I suggest you have found a bug and should take hadley's advice and report it!
