I'm training myself to solve classical statistics exercises with Spark, and its MLlib module when needed, so that I am ready for any situation later.
Spark is mainly meant for computations on datasets and matrices, but suppose my program has a small side calculation to do at some point and I would like to solve it with Spark, without adding other APIs.
The calculation is simple: find the expected value and the variance
of the number of sixes obtained when a die is thrown three times.
Currently, I solve it with the help of the Apache Commons Math API and a bit of Spark:
/** Among imports: */
import org.apache.commons.math3.distribution.*;
/**
 * E8.5: Three dice are thrown. Probability of getting a six?
 * Follows a binomial distribution B(3, 1/6).
 * By calculation: F(0) = 125/216, F(1) = 125/216 + 75/216, F(2) = (125 + 75 + 15)/216,
 * F(3) = (125 + 75 + 15 + 1)/216 = 1
 */
@Test
@DisplayName("Three dice are thrown. Probability of getting a six? Characteristics of the B(3, 1/6) distribution")
public void troisDésDonnentUnSix() {
   int n = 3; // Number of dice throws.
   double v[] = {0, 1, 2, 3}; // Possible values of the random variable.
   double f[] = new double[n+1]; // Cumulative probabilities.
   double p[] = new double[n+1]; // Probabilities.
   // Compute the probabilities and the cumulative probabilities.
   BinomialDistribution loiBinomale = new BinomialDistribution(n, 1.0/6.0);
   for(int i=0; i <= n; i++) {
      p[i] = loiBinomale.probability(i);
      f[i] = loiBinomale.cumulativeProbability(i);
   }
   LOGGER.info("P(x) = {}", p);
   LOGGER.info("F(x) = {}", f);
   Dataset<Row> ds = fromDouble(v, p);
   LOGGER.info("E(X) = {}, V(X) = {}", esperance(ds), variance(ds));
}
where the fromDouble(v, p) method creates a Dataset from the values of the random variable (column x) and their associated probabilities (column Px):
/**
 * Return a Dataset built from a series of values and their probabilities.
 * @param valeurs Values.
 * @param probabilites Probabilities.
 * @return Dataset with a column x (double) holding the values<br>
 * and a column Px (double) holding the probabilities.
 */
protected Dataset<Row> fromDouble(double[] valeurs, double[] probabilites) {
   StructType schema = new StructType()
      .add("x", DoubleType, false)
      .add("Px", DoubleType, false);
   List<Row> rows = new ArrayList<>();
   for(int index=0; index < valeurs.length; index ++) {
      rows.add(RowFactory.create(valeurs[index], probabilites[index]));
   }
   return this.session.createDataFrame(rows, schema);
}
The esperance (expected value) and variance methods it calls perform these calculations:
/**
 * Compute the expected value on a Dataset of values and probabilities.
 * @param ds Dataset with columns: <br>
 * x : value<br>
 * Px : probability<br>
 * @return expected value.
 */
protected double esperance(Dataset<Row> ds) {
   // E(X) = sum of x * P(x)
   return ds.agg(sum(col("x").multiply(col("Px")))).first().getDouble(0);
}
/**
 * Compute the variance on a Dataset of values and probabilities.
 * @param ds Dataset with columns: <br>
 * x : value<br>
 * Px : probability<br>
 * @return variance.
 */
protected double variance(Dataset<Row> ds) {
   // V(X) = sum of P(x) * (x - E(X))^2
   Column variation = col("x").minus(esperance(ds));
   Column variationCarre = variation.multiply(variation);
   Column termeCalculVariance = col("Px").multiply(variationCarre);
   return ds.agg(sum(termeCalculVariance)).first().getDouble(0);
}
LOGGER output :
P(x) = [0.5787037037037037, 0.34722222222222215, 0.06944444444444445, 0.0046296296296296285]
F(x) = [0.5787037037037035, 0.9259259259259259, 0.9953703703703703, 1.0]
E(X) = 0.49999999999999994, V(X) = 0.41666666666666663
It works (? I calculated E(X) = 53/108 by hand, whereas it finds 54/108 = 0.5, so I might be wrong), but it's not perfect.
Is there a more elegant way to solve this problem using Spark (and Spark MLlib, if needed)?
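On the hand calculation: summing the definition directly is a quick cross-check that does not depend on Spark. The sketch below is plain JavaScript, and the helper names choose and binomialMoments are made up for illustration. It computes E(X) and V(X) for B(3, 1/6) from the probabilities and prints them next to the closed forms np and np(1-p).

```javascript
// Binomial coefficient C(n, k), built up one factor at a time.
function choose(n, k) {
  let c = 1;
  for (let i = 1; i <= k; i++) {
    c = (c * (n - k + i)) / i;
  }
  return c;
}

// Mean and variance of B(n, p), summed from the definition.
function binomialMoments(n, p) {
  let mean = 0;
  let secondMoment = 0;
  for (let k = 0; k <= n; k++) {
    const pk = choose(n, k) * p ** k * (1 - p) ** (n - k);
    mean += k * pk;
    secondMoment += k * k * pk;
  }
  return { mean, variance: secondMoment - mean * mean };
}

const moments = binomialMoments(3, 1 / 6);
console.log("E(X)", moments.mean, "closed form", 3 * (1 / 6));
console.log("V(X)", moments.variance, "closed form", 3 * (1 / 6) * (5 / 6));
```

With the probabilities 125/216, 75/216, 15/216 and 1/216, the sum is (0*125 + 1*75 + 2*15 + 3*1)/216 = 108/216 = 1/2, which agrees with the value the Spark code returned rather than with 53/108.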
The problem statement is as follows -
There is a text messaging service. It provides an API to send SMSes to a user,
but they can be at most 30 characters long.
Also it doesn't guarantee the order in which the messages will be received.
You have to build a function which splits the text in chunks so that it can
be sent in multiple messages. Each chunk has to be :
- up to 30 characters long
- no word should be split in the middle
- each chunk has to have its order suffixed in the form of '(k/n)'
e.g. "this is the first chunk (1/2)", "this is the second chunk (2/2)"
- if the text provided to the function is within 30 characters limit,
no ordering should be suffixed
Input/Output Format
Each test case consists of a single line of words. Words are space
separated. Any other character other than space is considered part of
the word. For each test case, output the minimum number of chunks C
required to fit the entire SMS.
Restrictions
1 <= C <= 99; the input will be such that C remains within this limit
if your algorithm is optimal. No word will have a length that does
not fit in a single chunk (even after considering the suffix).
Sample Input:
The best lies are always mixed with a little truth
There is no creature on earth half so terrifying as a truly just man!!!!!
You know nothing, Jon Snow
Sample Output
3
3
1
Explanation:
In first case, we will have to split as below
The best lies are always (1/3)
mixed with a little (2/3)
truth (3/3)
First line is fully utilised with 30 characters in it.
Second line has 25 characters, but if we try to fit the last word in this line,
it becomes 31 characters: 'mixed with a little truth (2/2)'.
Hence we must split into 3 parts as above.
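The lengths quoted in this explanation can be checked mechanically. A minimal sketch, assuming each chunk is written as the text, one space, and the '(k/n)' suffix; the variable names are illustrative only:

```javascript
const limit = 30;
const candidates = [
  "The best lies are always (1/3)",
  "mixed with a little (2/3)",
  "mixed with a little truth (2/2)", // the rejected two-chunk attempt
];
const lengths = candidates.map((chunk) => chunk.length);
candidates.forEach((chunk, i) => {
  // Print the length and whether the chunk respects the limit.
  console.log(lengths[i], lengths[i] <= limit, chunk);
});
```

The first two lines come out at 30 and 25 characters, and the two-chunk attempt at 31, so it is rejected.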
My approach was to estimate the number of chunks first and then refine it, but that didn't work. I was wondering: is it even possible to calculate mathematically how many chunks will be required, or do we actually have to build the chunks and see? But then how do we build chunks without knowing the 'n' of 'k/n'?
You have to know n to be able to know how many words can be put in each chunk because that depends on n.
Even if n is expressed in a base large enough (100 or more) that it only takes one character, you still need to examine the length of every word individually.
I have a suspicion that the optimal distribution of words between chunks is not the simple method of putting words (and spaces) into lines until the next item won't fit: it could be better to make some smaller chunks somewhere earlier to enable better packing later. However, this is not the cutting stock problem because the order must be preserved.
By the simple method, I mean packing the words in on the assumption that there are fewer than ten chunks and, if that fails, starting again on the assumption that there are fewer than 100 chunks. For example, in VB.NET:
Imports System.Text
Imports System.Text.RegularExpressions
Module Module1
Function NumberLength(n As Integer) As Integer
Return If(n < 10, 1, 2)
End Function
Sub Main()
Dim msg = "Escobazos tu dios pisan amor sus el las grupos se y no de los pulso mudas muerte mi inocentes vilo los las bajaba viciosa tierra de amor horizonte la se deja de tierra heridas ni los la es los recodos horizonte diminutas la de es nube talco hombrecillo de piel los se escobazos nadadora de bajo consume las se con ni las en por que donde tierra es sillas el de de que latido viva lo a grupos torre escaleras desnudo dolor me a la la quedo que sepultura signos criaturas de desnudo subía la húmedo desnuda latido nube quedo de la el nadadora el cielo dolor arroyo escobazos quedo donde de amor venas el viva poniendo desangradas torre que resonancia los fría ansioso el de subía el tierra todo se ansioso manteles por amor amor con de el quemadas resonancia con mujer el y que luna los bajaba quedo los yo a alegrísima de ilesa huido el mi que los se bajo la hombrecillo luna en de vilo es de el aire despenada que latido aire para sus horizonte todo muelles heridas viva hule tierra para huido de las a los llenando los que por húmedo tránsito tierra la la aire olvidando recodos de de la ligeros los término por luna bajaba tierra llenando del al que bajo de que de a pupila mueven que grupos se tránsito los ciudades de de nino mármol vuelve lenguas se los pisotean la vengo con faraón tránsito ballenas la se los tierra del escaleras de tierra nunca lenta se musgos que desgarrados la de desgarrados la imperturbable la resonancia y duro subía tierra me mi de talco escaleras el duro los desangradas sus buscando desangradas de pies algodón golondrina por que las no larga con diana que el en imperturbable de los luna al la huevos muertos las los las larga para borrachos de el aire los la bajo tierra fría talco los los comida en llanura en en los todo que en olvidando es de el de tu la de los muerte los las de que húmedo llenando de los pasan los hombrecillo se duro lenta ballenas ninos hule la con a la tierra por gustada es y se tierra amor las recientes manteles tierra de 
para signos el es un diana es del dios es imperturbable de consume de muelles luna para al nube tierra bajo apariencia encuentro es diminutas"
Dim nPart = 1
Dim nPartsTotal = 1 'assume 9 or less parts
Dim nPartsTotalLength = 1
Dim maxPartLength = 30
If msg.Length <= maxPartLength Then
Console.WriteLine("1")
Console.ReadLine()
Exit Sub
End If
Dim words = Regex.Split(msg, "( )")
Dim suffixLength = 5 ' up to 9 parts
Dim pos = 0
Dim nWord = 0
Dim thisPart As New StringBuilder
Dim partText As New List(Of String)
While nWord < words.Count()
suffixLength = 3 + NumberLength(nPart) + nPartsTotalLength
If pos + suffixLength + words(nWord).Length <= maxPartLength Then
pos += words(nWord).Length
nWord += 1
thisPart.Append(words(nWord - 1))
Else
partText.Add(thisPart.ToString())
pos = 0
nPart += 1
nPartsTotal += 1
thisPart.Clear()
If nPartsTotal > 9 AndAlso nPartsTotalLength = 1 Then
' start again knowing that there are more than 9 parts required
nPartsTotalLength = 2
nPart = 1
nPartsTotal = 1
nWord = 0
partText.Clear()
End If
End If
End While
If thisPart.Length > 0 Then
partText.Add(thisPart.ToString())
End If
Console.WriteLine(nPartsTotal)
Console.WriteLine(New String("|"c, maxPartLength)) ' show max length
For i = 1 To partText.Count()
Console.WriteLine($"{partText(i - 1)}({i}/{nPartsTotal})")
Next
Console.ReadLine()
End Sub
End Module
That happens to generate 99 chunks. The question doesn't ask for the actual chunks to be output; that part of the example code is there so the chunks can be inspected, in case it is obvious where a different algorithm could do better.
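One way to probe the suspicion about the simple method on small inputs is to compare it with an exhaustive search over every possible split. The sketch below does that in JavaScript, assuming the chunk format 'text (k/n)' from the problem statement. The names greedyMin and bruteForceMin are hypothetical, and the exhaustive search is only practical for short messages.

```javascript
const LIMIT = 30;
const digits = (k) => String(k).length;

// Simple method: fill each chunk as far as it goes, first assuming a
// one-digit total, then a two-digit total if more than 9 chunks are needed.
function greedyMin(words) {
  if (words.join(" ").length <= LIMIT) return 1; // short text, no suffix
  for (const totalDigits of [1, 2]) {
    let chunk = 1; // 1-based index of the chunk being filled
    let used = 0; // length of the text already placed in that chunk
    for (const w of words) {
      const suffix = 4 + digits(chunk) + totalDigits; // " (k/n)"
      const needed = used === 0 ? w.length : used + 1 + w.length;
      if (needed + suffix <= LIMIT) {
        used = needed;
      } else {
        chunk++; // assumes w fits alone, as the problem guarantees
        used = w.length;
      }
    }
    if (chunk < 10 ** totalDigits) return chunk;
  }
  return -1; // more than 99 chunks
}

// Exhaustive search: try every way of cutting the word list into
// consecutive chunks (2^(words - 1) candidates).
function bruteForceMin(words) {
  const gaps = words.length - 1;
  let best = Infinity;
  for (let mask = 0; mask < (1 << gaps); mask++) {
    const chunks = [[]];
    words.forEach((w, i) => {
      chunks[chunks.length - 1].push(w);
      // Bit i set means: start a new chunk after words[i].
      if (i < gaps && ((mask >> i) & 1) === 1) chunks.push([]);
    });
    const n = chunks.length;
    const valid = chunks.every((c, idx) => {
      const suffix = n === 1 ? 0 : 4 + digits(idx + 1) + digits(n);
      return c.join(" ").length + suffix <= LIMIT;
    });
    if (valid) best = Math.min(best, n);
  }
  return best;
}

const samples = [
  "The best lies are always mixed with a little truth",
  "There is no creature on earth half so terrifying as a truly just man!!!!!",
  "You know nothing, Jon Snow",
];
for (const s of samples) {
  const words = s.split(" ");
  console.log(greedyMin(words), bruteForceMin(words));
}
```

On the three sample inputs both functions agree. That is an empirical check, not a proof. An informal argument for why the simple method seems sufficient: once the digit width of n is fixed, the room in chunk k depends only on k, so filling each chunk as far as possible never leaves a later chunk starting further back than in any other split.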
It was rather simpler than I thought. I started by trying to find the number of chunks first, and that is where it went wrong. Andrew's solution is correct.
I am attaching my code in JavaScript (just for reference):
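// Count the chunks assuming fewer than 10 are needed, so the suffix is "(k/n)" with one digit each.
// Returns -1 as soon as a 10th chunk is reached.
function splitSMSIn10Chunks(string){
    let parts = string.split(' ');
    let suffix = '(x/y)'; // placeholder with the same length as the real suffix
    let chunks = 0;
    let currentChunk = ''; // words placed so far, each followed by a space
    for(let i=0;i<parts.length;i++){
        // The +1 is the space that follows the new word, before the suffix.
        if((currentChunk.length+1+parts[i].length+suffix.length)<=30){
            currentChunk = currentChunk + parts[i] + " ";
        }else{
            // The word does not fit: close the current chunk and start a new one with it.
            currentChunk = parts[i]+" " ;
            chunks++;
        }
        if(i==parts.length-1){
            //Last chunk
            currentChunk = currentChunk + suffix;
            chunks++;
        }
        if(chunks==10) return -1;
    }
    return chunks;
};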
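// Count the chunks assuming 10 to 99 are needed, so n always takes two digits.
function splitSMSIn99Chunks(string){
    let parts = string.split(' ');
    let suffix = '(x/yy)'; // chunks 1 to 9: one digit for k, two for n
    let chunks = 0;
    let currentChunk = ''; // words placed so far, each followed by a space
    for(let i=0;i<parts.length;i++){
        if((currentChunk.length+1+parts[i].length+suffix.length)<=30){
            currentChunk = currentChunk + parts[i] + " ";
        }else{
            // The word does not fit: close the current chunk and start a new one with it.
            currentChunk = parts[i]+" " ;
            chunks++;
            if(chunks==9){
                // From the 10th chunk on, k takes two digits as well.
                suffix = '(xx/yy)';
            }
        }
        if(i==parts.length-1){
            //Last chunk
            currentChunk = currentChunk + suffix;
            chunks++;
            console.log(currentChunk);
        }
    }
    return chunks;
};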
let sms = "The best lies are always mixed with a little truth The best lies are always mixed with a little truth The best lies are always mixed with a little truth The best lies are always mixed with a little truth";
if(sms.length<=30){
    // A text within the limit is sent as is, without a suffix.
    console.log(1);
}else{
    let chunksRequired = splitSMSIn10Chunks(sms);
    if(chunksRequired==-1){
        console.log(splitSMSIn99Chunks(sms));
    }else{
        console.log(chunksRequired);
    }
}
It can also be done in one function, but to keep it simpler and easier to read I have created two separate functions.
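For comparison, here is a sketch of how the two passes might be folded into one entry point. The names minChunks and countWithWidth are hypothetical, not part of the code above. It assumes the chunk format 'text (k/n)' and relies on the problem's guarantee that every word fits in a chunk on its own.

```javascript
// Greedy count for a fixed number of digits in n (totalDigits is 1 or 2).
// Returns -1 if that width cannot express the number of chunks needed.
function countWithWidth(words, totalDigits, limit) {
  const maxChunks = 10 ** totalDigits - 1; // 9 or 99
  let chunk = 1; // 1-based index of the chunk being filled
  let used = 0; // length of the text already placed in that chunk
  for (const word of words) {
    // Suffix " (k/n)": space, "(", digits of k, "/", digits of n, ")".
    const suffix = 4 + String(chunk).length + totalDigits;
    const needed = used === 0 ? word.length : used + 1 + word.length;
    if (needed + suffix <= limit) {
      used = needed;
    } else {
      chunk++;
      if (chunk > maxChunks) return -1;
      used = word.length; // the word is assumed to fit on its own
    }
  }
  return chunk;
}

function minChunks(text, limit = 30) {
  // A text within the limit is sent as is, without a suffix.
  if (text.length <= limit) return 1;
  const words = text.trim().split(/ +/);
  for (const totalDigits of [1, 2]) {
    const n = countWithWidth(words, totalDigits, limit);
    if (n !== -1) return n;
  }
  return -1; // more than 99 chunks, outside the stated restrictions
}

console.log(minChunks("The best lies are always mixed with a little truth"));
console.log(minChunks("There is no creature on earth half so terrifying as a truly just man!!!!!"));
console.log(minChunks("You know nothing, Jon Snow"));
```

The three calls correspond to the sample input of the problem statement, for which the expected answers are 3, 3 and 1.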
Hello:
I have a function which gets a string and, depending on what it gets, calls some other functions. All but one of them take no arguments. The one that does take an argument expects a value of a type I defined myself. My intention is to ask the user for that input. However, getLine, getChar, getInt and the like store the input with its own type ([Char], Char, etc.), and I need to pass the raw input to that function in such a way that type inference treats it as my user-defined type (Fecha).
Extracts from the code:
type Fecha = [(NombreJug,PuntosLogrados,MinutosJugados)]
armarListaDeTuplasPuntosFecha::Fecha->[(NombreJug,PuntosLogrados)]
armarListaDeTuplasPuntosFecha [] = []
armarListaDeTuplasPuntosFecha (ej:ejs) = [((\ (nombre,puntos,_)-> (nombre,puntos)) ej)] ++ armarListaDeTuplasPuntosFecha ejs
jugadorConMayorCantidadDePuntoEnFecha unaFecha = (\ (nombre,puntos)->nombre) (maximumBy mayorTupla (armarListaDeTuplasPuntosFecha unaFecha))
mejorJugadorPor::String->NombreJug
mejorJugadorPor criterio
    | criterio == "Mayor Cantidad de puntos en fecha" = do
        fecha<-getLine
        jugadorConMayorCantidadDePuntoEnFecha (fecha)
    | otherwise = "No es un criterio valido, reintente una proxima vez"
I would really appreciate it if you could help me with this issue. The documentation I've found is not enough for me, as I'm a rookie with Haskell.
Thank you very much in advance.
Regards
It looks like he is trying to keep track of player names (NombreJug), points gained (PuntosLogrados) and playing times (MinutosJugados), and then find the best player by some criterion.
armarListaDeTuplasPuntosFecha throws away the playing times and returns tuples of player name and points.
mejorJugadorPor ("best player by") is trying to ask the user for a list of inputs and then select the player with the highest score. I think you are right that he needs a Read instance for his type, or a function to parse the input and turn it into the type Fecha defined at the top. It also depends on how NombreJug, PuntosLogrados and MinutosJugados are defined. Are they type synonyms?
mejorJugadorPor also looks like it should be of type String -> IO NombreJug, since it performs IO actions.
This is my attempt to do what you want:
import Data.List
type NombreJug = String
type PuntosLogrados = Int
type MinutosJugados = Int
type Fecha = [(NombreJug,PuntosLogrados,MinutosJugados)]
armarListaDeTuplasPuntosFecha::Fecha->[(NombreJug,PuntosLogrados)]
armarListaDeTuplasPuntosFecha = map desechar
    where desechar (x,y,_) = (x,y)
jugadorConMayorCantidadDePuntoEnFecha unaFecha = fst (maximumBy mayorTupla (armarListaDeTuplasPuntosFecha unaFecha))
mayorTupla = undefined
mejorJugadorPor:: String -> IO NombreJug
mejorJugadorPor criterio
    | criterio == "Mayor Cantidad de puntos en fecha" = do
        fecha <- readLn
        return $ jugadorConMayorCantidadDePuntoEnFecha fecha
    | otherwise = return "No es un criterio valido, reintente una proxima vez"
I added "mayorTupla = undefined" to get it to compile, because that function isn't defined in the code you posted.
Changes I made:
- Your function armarListaDeTuplasPuntosFecha is better expressed with map. map applies a function to every element of a list, which is what you are doing manually.
- jugadorConMayorCantidadDePuntoEnFecha can be expressed with fst, which returns the first element of a tuple of two values.
- mejorJugadorPor needs to be in the IO monad because it performs input/output actions (reading something that the user types). You do this by changing the return type from String to IO String, to say that the return value depends on IO (i.e. the function isn't pure).
- The function readLn does what you want, because it converts the input string to the correct type as long as the type has an instance of Read. The Read type class basically means that a string can be converted into a value of the type.
- Because mejorJugadorPor is monadic, you need to make sure that the value it returns is wrapped in the IO monad. This is what the function return does: it takes a value of type "a" and turns it into a value of type "m a", where m is any monad.
From what I could gather, you want to make your data types instances of the Read class and then use the read function to read string data into them.
If that was not what you had in mind, let me know.
After several hours I got out of the mess, with help from the UK (Julian Porter: www.jpembedded.co.uk, www.porternet.org). I found a way to do it without creating monads or modifying classes (I'm not at that level yet):
import Data.List
type NombreJug = String
type NombrePart = String
type Club = String
type Posicion = String
type Cotizacion = Integer
type PuntosLogrados = Integer
type MinutosJugados = Integer
type Jugador = (NombreJug,Club,Posicion,Cotizacion)
type Jugadores = [Jugador]
type PartConSusJug = (NombrePart,[NombreJug])
type Participantes = [PartConSusJug]
type Fecha = [(NombreJug,PuntosLogrados,MinutosJugados)]
type Fechas = [Fecha]
participantes = [("Natalia", ["Abbondazieri","Lluy","Battaglia", "Lazzaro"]),
                 ("Romina", ["Islas", "Lluy", "Battaglia", "Lazzaro"]),
                 ("Jessica", ["Islas"])
                ]
clubes = ["Boca", "Racing", "Tigre"]
jugadores = [("Abbondazieri", "Boca", "Arquero", 6500000),
             ("Islas", "Tigre", "Arquero", 5500000),
             ("Lluy", "Racing", "Defensor", 1800000),
             ("Battaglia", "Boca", "Volante", 8000000),
             ("Lazzaro", "Tigre", "Delantero", 5200000),
             ("Monzon","Boca","Defensor",3500000),
             ("Guzman","Newells","Arquero",1000000),
             ("Diaz","Velez","Defensor",3600000),
             ("Palermo","Boca","Delantero",12000000),
             ("Aguirre","Lanus","Volante",4500000),
             ("Cura","Huracan","Defensor",1700000),
             ("Espinoza","Gimnasia","Volante",300000),
             ("Clemente","Deportivo Piraña","Volante",60000000)
            ]
miListaTuplasFechas = [("quinta",[("Lluy", 8, 90),("Lazzaro", 6, 90)]),("sexta",[("Lazzaro", 7, 77),("Islas", 6, 90),("Lluy", 7, 90)]),("septima",[("Battaglia", 13, 90), ("Lluy", 6, 90), ("Lazzaro", 8, 77)]),("octava",[("Islas", 4, 84), ("Battaglia", 8, 90)])]
fechas = [quinta, sexta, septima, octava]
quinta = [("Lluy", 8, 90), ("Lazzaro", 6, 90)]
sexta = [("Lazzaro", 7, 77), ("Islas", 6, 90), ("Lluy", 7, 90)]
septima = [("Battaglia", 13, 90), ("Lluy", 6, 90), ("Lazzaro", 8, 77)]
octava = [("Islas", 4, 84), ("Battaglia", 8, 90)]
-- 10) mejorJugadorPor takes a criterion and returns the best player according to that criterion.
-- Also give example queries that solve the following requirements:
mayorTupla (n1, c1) (n2, c2)
    | c1 > c2 = GT
    | c1 <= c2 = LT
-- 1. the player who scored the most points in the whole tournament. -> "Lazzaro"
armarListaDeTuplasPuntos::Jugadores->[(NombreJug,PuntosLogrados)]
armarListaDeTuplasPuntos [] = []
armarListaDeTuplasPuntos (ej:ejs) = [ (((\ (nombre,_,_,_)-> nombre) ej), (totalPuntosJugador ((\ (nombre,_,_,_)-> nombre) ej))) ] ++ armarListaDeTuplasPuntos ejs
mostrarmeLasTuplasPuntos = armarListaDeTuplasPuntos jugadores
jugadorConMayorCantidadDePuntosEnTorneo = (\ (nombre,puntos)->nombre) (maximumBy mayorTupla mostrarmeLasTuplasPuntos)
-- 2. the player with the highest market value. -> "Battaglia"
armarListaDeTuplasCotizacion::Jugadores->[(NombreJug,Cotizacion)]
armarListaDeTuplasCotizacion [] = []
armarListaDeTuplasCotizacion (ej:ejs) = [((\ (nombre,_,_,cotizacion)-> (nombre,cotizacion)) ej)] ++ armarListaDeTuplasCotizacion ejs
mostrarmeLasTuplasCotizaciones = armarListaDeTuplasCotizacion jugadores
jugadorConLaMayorCotizacion = (\ (nombre,cotizacion)->nombre) (maximumBy mayorTupla mostrarmeLasTuplasCotizaciones)
-- This is an example of a higher-order function: maximumBy takes the comparison function mayorTupla as an argument.
-- 3. the player who scored the most points in a given match day (the 5th) -> "Lluy"
armarListaDeTuplasPuntosFecha::Fecha->[(NombreJug,PuntosLogrados)]
armarListaDeTuplasPuntosFecha [] = []
armarListaDeTuplasPuntosFecha (ej:ejs) = [((\ (nombre,puntos,_)-> (nombre,puntos)) ej)] ++ armarListaDeTuplasPuntosFecha ejs
jugadorConMayorCantidadDePuntoEnFecha [] = "Fecha no definida"
jugadorConMayorCantidadDePuntoEnFecha unaFecha = (\ (nombre,puntos)->nombre) (maximumBy mayorTupla (armarListaDeTuplasPuntosFecha unaFecha))
-- 4. the player with the best average in the whole tournament. -> "Battaglia"
armarListaDeTuplasPromedios::Jugadores->[(NombreJug,Float)]
armarListaDeTuplasPromedios [] = []
armarListaDeTuplasPromedios (ej:ejs) = [ (((\ (nombre,_,_,_)-> nombre) ej), (promedioPuntosJugador ((\ (nombre,_,_,_)-> nombre) ej))) ] ++ armarListaDeTuplasPromedios ejs
mostrarmeLasTuplasPromedios = armarListaDeTuplasPromedios jugadores
jugadorConMejorPromedioDelTorneo = (\ (nombre,puntos)->nombre) (maximumBy mayorTupla mostrarmeLasTuplasPromedios)
-- Another higher-order use: maximumBy mayorTupla picks the best tuple from mostrarmeLasTuplasPromedios, and the lambda expression then extracts the name.
otroCaso = "No es un criterio valido, reintente una proxima vez"
listaDeCriterios criterio
    | (criterio == "jugadorConMayorCantidadDePuntosEnTorneo") = jugadorConMayorCantidadDePuntosEnTorneo
    | (criterio == "jugadorConLaMayorCotizacion") = jugadorConLaMayorCotizacion
    | (criterio == "jugadorConMejorPromedioDelTorneo") = jugadorConMejorPromedioDelTorneo
    | ((criterio /= "jugadorConMayorCantidadDePuntosEnTorneo")&& (criterio /= "jugadorConLaMayorCotizacion")&&(criterio /= "jugadorConMejorPromedioDelTorneo")) = otroCaso
devolverFecha::String->[(String,Fecha)]->Fecha
devolverFecha laFecha [] = []
devolverFecha laFecha (f:fs)
    | (((\ fechaIngresada (fechaAComparar,_)-> fechaIngresada == fechaAComparar) laFecha f) == True) = snd f
    | otherwise = devolverFecha laFecha fs
criterios1 = do
    putStrLn "Ingrese la fecha deseada: "
    x<-getLine
    let resultado = ((jugadorConMayorCantidadDePuntoEnFecha (devolverFecha x miListaTuplasFechas)))
    putStrLn ("\""++resultado++"\"")
criterios2::String->IO ()
criterios2 criterio = do
    let resultado = (listaDeCriterios criterio)
    putStrLn ("\""++resultado++"\"")
eleccionDeCriterios criterioElegido
    | (criterioElegido == "jugadorConMayorCantidadDePuntoEnFecha") = criterios1
    | otherwise = criterios2 criterioElegido
mejorJugadorPor = do
    putStrLn "Por favor, ingrese un criterio: "
    criterio<-getLine
    eleccionDeCriterios criterio
Console output:
Main> mejorJugadorPor
Por favor, ingrese un criterio:
jugadorConMejorPromedioDelTorneo
"Battaglia"
Main> mejorJugadorPor
Por favor, ingrese un criterio:
pepe
"No es un criterio valido, reintente una proxima vez"
Main>
Main>
Main> mejorJugadorPor
Por favor, ingrese un criterio:
jugadorConMayorCantidadDePuntoEnFecha
Ingrese la fecha deseada:
quinta
"Lluy"
Main> mejorJugadorPor
Por favor, ingrese un criterio:
jugadorConMayorCantidadDePuntoEnFecha
Ingrese la fecha deseada:
decima
"Fecha no definida"
It's in Spanish. If somebody finds it useful, contact me and I'll translate it into English.
Thank you very much to those who commented on this issue, and for their recommendations.