Set the precision and import the necessary packages
%precision %g
%run "~/github/butools/Python/BuToolsInit.py"
First the global butools.verbose flag is set to True to obtain more messages from the functions.
butools.verbose = True
First we load a trace (BC-pAug89), calculate the pdf, the cdf, and the first 5 moments. To obtain the empirical pdf, we need to specify the interval boundaries as well. PdfFromTrace returns the relative number of samples falling into the boundaries divided by the interval sizes.
tr = np.loadtxt("../test/data/bctrace.iat")
intBounds = np.linspace(0, 0.2, 250)
(xt,yt) = PdfFromTrace (tr, intBounds)
(cxtFull,cytFull) = CdfFromTrace (tr)
momt = MarginalMomentsFromTrace (tr, 5)
The first algorithm to test is G-Fit, that obtains a hyper-Erlang distribution by the EM algorithm. It has many parameters, but fortunately it can be used in a very user-friendly way just specifying the number of phases of the result.
alpha, A, logli = PHFromTrace(tr, 5)
print("alpha=",alpha)
print("A=",A)
The density functions of the result and of the trace are compared below. Note that we did not use PdfFromPH, but IntervalPdfFromPH instead, which is more appropriate to compare with an empirical density. IntervalPdfFromPH takes interval boundaries, and returns the probabilities that the PH distributed random variable falls into these intervals divided by the interval lengths.
(xg,yg) = IntervalPdfFromPH (alpha, A, intBounds)
plt.figure(figsize=(12,4))
plt.subplot(121)
plt.title("Pdf comparison, body")
plt.plot(xt,yt,label="Trace")
plt.plot(xg,yg,label="G-Fit")
plt.xlim(0,0.012)
plt.legend(loc='best')
plt.subplot(122)
plt.title("Pdf comparison, tail")
plt.loglog(xt,yt,label="Trace")
plt.loglog(xg,yg,label="G-Fit")
plt.legend(loc='best')
plt.xlim(1e-3,2e-1)
The evolution of the result of the GFIT algorithm can be demonstrated with the following fancy animation.
from IPython.display import clear_output, display
butools.verbose = False
fig = plt.figure(figsize=(12,4))
pl1 = fig.add_subplot(121)
plt.title("Pdf comparison, body")
plt.xlim(0,0.012)
plt.ylim(0,350)
plt.legend(loc='best')
pl2 = fig.add_subplot(122)
plt.title("Pdf comparison, tail")
plt.xlim(1e-3,2e-1)
plt.ylim(1e-6,1e3)
plt.legend(loc='best')
# first we let gfit generate an initial guess and return it in a vector-vector format
pi, lambd, logli = PHFromTrace (tr, [1,2,3], maxIter=0, result="vecvec")
# the following loop executes 5 EM steps in each iteration
for i in range(20):
# execute 5 EM steps
pi, lambd, logli = PHFromTrace (tr, [1,2,3], maxIter=5, initial=(pi, lambd), result="vecvec")
# obtain the PH representation
alpha, A, logli = PHFromTrace (tr, [1,2,3], maxIter=0, initial=(pi, lambd))
(xg,yg) = IntervalPdfFromPH (alpha, A, intBounds)
if i>0:
clear_output()
pl1.lines.pop(0)
pl1.lines.pop(0)
pl2.lines.pop(0)
pl2.lines.pop(0)
pl1.plot(xt,yt,label="Trace")
pl1.plot(xg,yg,label="G-Fit")
pl2.loglog(xt,yt,label="Trace")
pl2.loglog(xg,yg,label="G-Fit")
display(fig)
clear_output()
There are other ways to quantify the success of the fitting method as well, like
Let us start with the comparison of the moments
momg = MomentsFromPH (alpha, A, 5)
print("Moments of the trace: ", momt)
print("Moments of the PH: ", momg)
We also have a function to obtain the squared difference of the empirical pdfs.
print("Squared difference of the pdfs: ", EmpiricalSquaredDifference (yt, yg, intBounds))
Obtaining the squared difference of the cdfs is a bit more involved. The empirical cdf is too large, thus we take only every 1000th item of it. Furthermore, we do not have the interval bounds in this case. The "x" values can serve as interval bounds, but in this case the last "y" has to be dropped out (as the number of interval bounds needs to be larger than the number of values by one)
cxt = cxtFull[::1000]
cyt = cytFull[::1000]
cyg = CdfFromPH (alpha, A, cxt)
print("Squared difference of the pdfs: ", EmpiricalSquaredDifference (cyt[:-1], cyg[:-1], cxt))
The relative entropy of the two pdfs is
print("Relative entropy: ", EmpiricalRelativeEntropy(yt, yg, intBounds))
The likelihood of the trace is
LikelihoodFromTrace(tr,alpha,A)
MAPFromTrace is similar to PHFromTrace, but it returns a MAP instead of a PH. Unfortunately it is much slower, thus we need to obtain a smaller trace for the experiments. The C++ implementation of the procedure (SPEMFIT, https://bitbucket.org/ghorvath78/spemfit) is much faster and is able to operate on millions of measurements with a reasonable speed.
trSmall = tr[0:10000]
D0, D1, logli = MAPFromTrace(trSmall, 5)
The LikleihoodFromTrace works with MAPs as well, not only with PHs.
LikelihoodFromTrace(trSmall, D0, D1)
To evaluate the accuracy of the obtained MAP, the autocorrelation functions are compared.
trAcf = LagCorrelationsFromTrace(tr, 10)
print("trAcf=", trAcf)
mapAcf = LagCorrelationsFromMAP(D0, D1, 10)
print("mapAcf=",mapAcf)
plt.plot(np.array([trAcf,mapAcf]).T);
The squared difference and the relative entropy of the autocorrelation functions:
sqAcf = SquaredDifference(mapAcf, trAcf)
reAcf = RelativeEntropy(mapAcf, trAcf)
print("sqAcf=",sqAcf)
print("reAcf=",reAcf)