1 Introduction
We consider the problem of clustering a sequence of multinomial observations. To be specific, consider a sequence
of independent multinomial random vectors taking values in
for some , where . For each , each is a multinomial random vector such that the number of trials is and the success probability vector is
. To simplify our notation, we write . We allow the value of to depend on the value of and similarly, we allow the value of to depend on the value of . Moreover, to formulate our clustering problem, we assume that there is a finite collection of dimensional probability vectors such that . Since , it follows that for each , there exists such that . For each , we let provided that , and to simplify our notation, we may also write to mean . We assume that the values of , and are unknown, but the value of is observed and forms the basis for statistical inference. Let , and let be the set of all possible values for . Note that can be represented with an element in and can be represented with an element in , i.e., a nonnegative matrix, where each column sums to . Since the value of is assumed to be unknown, from a parameter estimation point of view, one must also consider the set for all as a potential set to which the true parameter belongs. Then, we let
(1) 
Henceforth, we write and for the parameter that generates the data . The estimates of , and are denoted by , and , respectively, and we use the letters , , for generic values that can take, respectively.
In this paper, we propose to take , and to be solutions to the optimization problems specified in (4) and (5). To this end, the rest of this paper is organized as follows.
In Section 2.1, we present the overall description of our approach, specifically introducing (4) and (5). In Section 2.2, we present a preliminary estimation technique, which can be used prior to performing a numerical search for the solution to (4). In Section 2.3, we specify the penalty term for our model selection criteria in (5).
In Section 3.1, we motivate our choice of the penalty term in (9) through Theorems 1 and 2. In Section 3.2, we motivate, in Theorem 3, our usage of the reduced rank projection step within our estimation steps.
In Section 4, we compare our model selection criterion with the conventional AIC via a Monte Carlo simulation experiment. We also study, through our approach, a two-sample test problem for comparing two graphs. This is done using simulated data sets, as well as a real data set involving the chemical and electrical connectivity structure of neurons of a C. elegans. Lastly, we apply our technique to a problem of determining the model dimension associated with the so-called Swimmer dataset, which is well known to the nonnegative factorization community (c.f.
[1]).

2 Background materials
2.1 General framework
To begin, we represent the sequence as an integer-valued random matrix so that the element in the th column of is . With slight abuse of notation, we denote the th row and the th column of by . Since the sample value of is known, it follows that the sample values of are known. Then, denoting by the matrix whose th column is given by , we have
(2)
where is a diagonal matrix such that its th diagonal is and the expectation is taken with respect to the probability measure specified by . Moreover, in general, it can be seen that can be factored as a product of two column stochastic nonnegative matrices, namely, and . Specifically,
(3) 
where for each , the th column of is and for each , the th column of is the basis vector of such that its th entry is if and only if .
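To make the factorization concrete, the following NumPy sketch (with hypothetical sizes; all variable names are stand-ins for the paper's symbols, since the original notation is not reproduced here) builds a column-stochastic matrix of cluster probability vectors and a 0/1 membership matrix whose columns are standard basis vectors, and forms their product:

```python
import numpy as np

# Hypothetical sizes (stand-ins for the paper's symbols): d categories,
# K clusters, n observations.
d, K, n = 5, 2, 6
rng = np.random.default_rng(0)

# Column-stochastic matrix whose K columns are the cluster
# success-probability vectors.
B = rng.random((d, K))
B /= B.sum(axis=0)

# Cluster label of each observation, and the membership matrix whose
# i-th column is the standard basis vector selecting that label.
z = np.array([0, 1, 0, 1, 1, 0])
C = np.zeros((K, n))
C[z, np.arange(n)] = 1.0

# Each column of the product is the success-probability vector of the
# cluster that the corresponding observation belongs to.
P = B @ C
```

Since each column of the membership matrix sums to one, the product is itself column stochastic, as required of the factorization.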
In light of (2) and (3), when is observed without noise, i.e., , and given that the value of is known, application of a nonnegative factorization algorithm can recover and from . However, in general, is random. Specifically, we have that for each ,
where for simplicity, we write
Alternatively, we may also write, by grouping according to the value of , that
For simplicity, we may write
Then, for each , let be an maximum estimate of with the restriction that the value of must be an element of , for some value of (c.f. [2]). Specifically, for each and , we denote by , an maximum estimate of given , and we have
(4) 
Note that, by taking to in the limit, we see that
and as such, reduces to a maximum likelihood estimate. When the value of is clear from context, we suppress from and write instead.
Then, we let be the smallest value of that minimizes the following expression:
(5) 
where is assumed to be chosen a priori, , and
denoting by
the Kullback-Leibler divergence of
from . For reference, we let

2.2 Preliminary estimation prior to
For our model selection problem, for each , an estimator of as a function of must minimize the size of the error while also allowing for nonnegative factorization, i.e., where and are and nonnegative matrices. Directly computing an to achieve this can be done numerically with varying degrees of complexity, but in all cases, starting the search for near can be beneficial. To achieve this approximately when initializing our search algorithm for numerical experiments, we propose a multi-step procedure in which and are used together. For more details on (and also on USVT, a related approach), we direct the reader to [3] (and [4]), and for , to [5], [6] and [7].
Iteratively searching for a solution to the estimation problem in (4) can be computationally expensive. An approximate solution, which can be used as the initial point of the search, can be obtained in four steps, which are listed in Algorithm 1 collectively for convenience.
First, we take a reduced rank projection of at the rank . Specifically, we first compute the singular value decomposition of with the diagonal of sorted in decreasing order, e.g. . Then, we take
(6)
where is the upper block of , and and are the first columns of and respectively.
Because need not be nonnegative, we then reset the negative entries of to zero. However, resetting the negative entries of to zero can change the rank of .
To correct this, after computing , we further perform a nonnegative factorization, i.e., we minimize over all possible pairs of a matrix and a matrix (c.f. [5]).
In the last step, we define by letting, for each , if and only if for all , where a tie, if any exists, is resolved by uniformly choosing among the tied indices. Then, estimate .
The aforementioned steps for obtaining and are collectively denoted as in the listing of Algorithm 1.
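The four steps above can be sketched as follows. This is a minimal stand-in for Algorithm 1, not the authors' exact listing: it uses a simple Lee–Seung-style multiplicative-update NMF for the factorization step and resolves ties by plain argmax (the function name, iteration count, and tolerances are assumptions):

```python
import numpy as np

def init_estimate(A, K, n_iter=300, seed=0):
    """Sketch of the four initialization steps: rank-K SVD projection,
    clipping of negative entries, multiplicative-update NMF, and
    argmax cluster assignment."""
    rng = np.random.default_rng(seed)
    # Step 1: reduced rank projection at rank K.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_hat = (U[:, :K] * s[:K]) @ Vt[:K, :]
    # Step 2: reset negative entries to zero (this may change the rank).
    A_hat = np.clip(A_hat, 0.0, None)
    # Step 3: nonnegative factorization A_hat ~ B @ H, minimizing the
    # Frobenius error by Lee-Seung multiplicative updates (a stand-in
    # for the NMF routine cited in the text).
    d, n = A_hat.shape
    B = rng.random((d, K)) + 1e-3
    H = rng.random((K, n)) + 1e-3
    for _ in range(n_iter):
        H *= (B.T @ A_hat) / (B.T @ B @ H + 1e-12)
        B *= (A_hat @ H.T) / (B @ H @ H.T + 1e-12)
    # Step 4: assign each column to the cluster with the largest mixing
    # weight (the text resolves ties uniformly at random).
    labels = H.argmax(axis=0)
    return B, labels

# Noise-free example with two well-separated column groups.
A = np.array([[4., 4., 4., 0., 0., 0.],
              [4., 4., 4., 0., 0., 0.],
              [0., 0., 0., 4., 4., 4.],
              [0., 0., 0., 4., 4., 4.]])
B_hat, labels = init_estimate(A, K=2)
```

On this noise-free rank-2 input, the three left columns receive one label and the three right columns the other, matching the intended use as an initializer.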
Upon obtaining the initial value of , one can perform a numerical iterative search for , for example, using a variational method, an EM algorithm, an MCMC method, or a brute-force iterative search. For the interested reader, in Section D, we outline an objective function to be maximized for an MCMC approach. In all cases, it is known that a good initialization of the chosen algorithm can improve its rate of convergence and help the algorithm avoid a poor local stationary point. On the other hand, the number of clusters to be estimated still needs to be supplied to each of these algorithms.
2.3 Model selection criterion
For many applications, the following standard model selection criteria are often used:
(7)  
(8) 
where
, ,
and is chosen to be an MLE. Their derivation is based on an analysis of an appropriate expected discrepancy (c.f. [8]) for a Gaussian regression model. In this section, we follow this general approach as well, adapting it to our model.
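For reference, the familiar generic forms behind such criteria (AIC = −2 log L + 2k and BIC = −2 log L + k log n; the paper's displays (7) and (8) are specialized versions in its own notation) can be computed as:

```python
import math

def aic(log_lik, k):
    """Standard Akaike information criterion: -2 log-likelihood plus
    twice the number of free parameters k."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    """Standard Bayesian information criterion; the penalty on the
    k parameters grows with the sample size n."""
    return -2.0 * log_lik + k * math.log(n)
```

For a fitted model with log-likelihood −10 and 3 parameters, `aic(-10.0, 3)` gives 26, and `bic` exceeds it once the sample size passes e².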
Our modelbased information criterion is obtained by appropriately penalizing the weighted loglikelihood of the multinomial model as specified in (5). Specifically, we consider, for each and ,
(9) 
where is the estimate of assuming that , is the matrix such that , , . In words, is the number of “successes” from the th cluster specified by . Also, counts the number of nonzero entries from the th cluster’s success probability vector , as specified by .
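The two cluster-level quantities described in words above, the per-cluster success count and the per-cluster count of nonzero probability entries, can be computed directly from a count matrix, a label vector, and an estimated probability matrix. The following sketch uses small invented values purely for illustration:

```python
import numpy as np

# Hypothetical small instance (all values invented): A holds the
# observed counts (one column per observation), z the cluster labels,
# B the estimated cluster success-probability vectors (one per column).
A = np.array([[3, 0, 2],
              [1, 4, 0],
              [0, 1, 5]])
z = np.array([0, 1, 0])
B = np.array([[0.5, 0.0],
              [0.2, 0.8],
              [0.3, 0.2]])

# Number of "successes" (trials) attributed to each cluster.
n_per_cluster = np.array([A[:, z == k].sum() for k in range(B.shape[1])])

# Number of nonzero entries in each cluster's success-probability vector.
nnz_per_cluster = (B > 0).sum(axis=0)
```

Here cluster 0 receives the first and third columns (11 successes in total) and its probability vector has all 3 entries nonzero, while cluster 1 receives 5 successes and has 2 nonzero entries.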
We detail our motivation for (9) by way of Theorems 1 and 2. Intuitively, as increases, the term in (9) is expected to increase for a larger value of , especially when and for many values of . In other words, when some columns of are "overly" similar to each other, the penalty term becomes more prominent (c.f. Table IV).
In Section E, we reduce (9) to and in (7) and (8) respectively, under some simplifying assumptions. However, for clustering a sequence of sparse multinomial data, the penalty terms in (7) and (8), which are appropriate for classical normal regression problems, can overpenalize, especially when the probability vectors contain many zeros (c.f. Figure 3).
3 Theoretical results
The main theoretical results of this paper are twofold. First, we motivate a particular choice of the form of the penalty term in (9
) through an asymptotic analysis of
, under simplifying assumptions that , and . Following [9], we use the superscripted as a mnemonic for "merging", and use the superscripted as a mnemonic for "splitting". Second, we motivate the reduced rank projection approach for initializing the numerical search for the maximum likelihood estimate .

3.1 Asymptotic derivation of the penalty term
In this section, we motivate a specific choice for the penalty term, , by computing the asymptotic form of the expected weighted discrepancy of while taking the value of to along some sequence of index . Let where is associated with some through (2) and (3), the expectation is taken with respect to , whence , and .
Theorem 1.
Suppose that

for each , we have such that for each , the number of trials is the same for all ,

for each , there exists such that for each
Then,
where and
(10) 
Theorem 1 suggests for the value of the pair in (9). More importantly, we note that the nonzero entries do not contribute to the value of in (10).
For the rest of this section, we further study, through Theorem 2, the question of for which values of and we can expect that , chosen according to (5) with the penalty term specified by (9), estimates the true value with high probability.
Specifically, denoting by , in Theorem 2, we study the case in which choosing will lead to a model selection criterion that tends
 (i)

not to underestimate the value of when the alternative model is one in which the st and the th clusters are "merged" into one,
 (ii)

not to overestimate the value of when the alternative model is one in which, for some and , and , i.e., the th cluster is "split" into two.
Theorem 2.
Let and , where we let

be such that for all that and for all that or ,

be such that for and is a probability vector,

be such that for all that and for all that , with being surjective,

be such that for and then let .
Suppose that for each , and that
(11)  
(12) 
Then,
(13)  
(14) 
where
with
In other words, in Theorem 2, assuming that is such that its takes the form of and its takes the form of , it follows that as , with high probability, , suggesting . Similarly, in Theorem 2, assuming that is such that its takes the form of and its takes the form of , and that , it follows that as , with high probability, , suggesting . Also, (5) with the penalty term specified by (9) with yields the specific forms of , and in Theorem 2.
While they are proven under simplifying assumptions, Theorems 1 and 2 lead us to propose choosing the value of to be for consistent estimation of .
On the other hand, as discussed in Section E, under some simplifying assumptions, can be associated with the conventional AIC, and similarly, can be associated with the conventional BIC. For and , following the proof of Theorem 2, one can show results similar to (13), while the probability in (14) is positive but can be strictly less than . For our numerical experiments in Section 4, to be comparable to the conventional AIC, we take and we give preference to a smaller value of if a near-tie occurs.
3.2 Reduced rank projection as a smoothing routine
The motivation behind using a reduced rank projection step is to remove random variation. As discussed in [3], when there are no missing entries in , the algorithm is equivalent to performing a reduced-rank projection (or equivalently, singular value thresholding at a fixed rank). Specifically, it can be seen from [3, Theorem 4.4] that, provided a random matrix is bounded appropriately and its entries
are independent random variables, using
yields a consistent estimate of under various conditions. To give a more precise statement of our contribution on the topic, we begin by introducing some notation. Given a constant , for each and , let
Then, we let be the result of a single iteration of the singular value thresholding of using the (true) value of of the matrix . We suppress, in our notation, the dependence of , , and on for simplicity.
Truncating each at yields an estimate that is biased due to truncation, although the effect may diminish as the value of increases. We present an asymptotic result in which is allowed to grow as a function of and , under several simplifying assumptions.
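A minimal sketch of this truncate-then-project estimator, under the reading that entries are capped at the truncation level and the capped matrix is then projected to a fixed rank (the function name and parameters are illustrative, not the paper's notation):

```python
import numpy as np

def truncate_and_project(X, tau, r):
    """Cap the entries of X at the level tau, then project the capped
    matrix onto the set of rank-r matrices via the SVD (singular value
    thresholding at a fixed rank)."""
    X_tau = np.minimum(X, tau)
    U, s, Vt = np.linalg.svd(X_tau, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

# On a rank-1 matrix whose entries are already below tau, truncation is
# inactive and the estimator returns the matrix itself.
X = np.outer([1.0, 2.0], [1.0, 1.0, 1.0])
X_hat = truncate_and_project(X, tau=5.0, r=1)
```

When some entries exceed the truncation level, the capped matrix differs from the original, which is the source of the truncation bias discussed above.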
Our first simplifying assumption is that the mean matrix has a "finite" block structure, or equivalently, a "finite" checkerboard-type pattern. Specifically, we assume that partition the index set , where the value of is constant and does not depend on , and . Next, we assume also that for each , there exists a pair such that for each , and . We assume that the values of are constant and do not depend on , and .
We suppress in our notation, the dependence of , , and on and for simplicity. Also, means that the pair is indexed by , so that .
Theorem 3.
Suppose that are independent Poisson random variables, and that the rank of is .
If and ,
then
where .
Note that generally, the rank of , and in some cases, it is also possible to have the rank of . In Theorem 3, to simplify our analysis, we have assumed that the rank of is . Next, to compare Theorem 3 with [3, Theorem 4.4], we note that the result in [3, Theorem 4.4] applies when the errors are independent, while the entries of are correlated. Specifically, as a corollary to Theorem 3, we also have that, under the hypothesis of Theorem 3,
(15) 
where , since given the value of ,
where . In this manner, in addition to giving a motivation to reduced rank projection as a smoothing routine, Theorem 3 can be of interest on its own.
4 Numerical results
4.1 Simulation experiments
Table I
penalty  penalty
1  22.18  0.01  22.18  0.02  22.20
2  22.14  0.02  22.16  0.08  22.22
4.1.1 Simple experiment
In this section, through a simple numerical experiment, we reiterate the last observation made in Section 2.3. Consider a sequence of multinomial random vectors taking values in , where their (common) number of multinomial trials is . Specifically, the first success probability vector is proportional to the vector and the second probability vector is proportional to the vector , where in both cases, the number of entries with value is and the number of entries with value is . In other words, the value of is , whence in this case, our model selection procedure seeks to reach the minimum value of with . As shown in Table I, the value of is minimized at while the value of is minimized at .
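A simulation of this kind of two-cluster multinomial setup can be sketched as follows; the specific probability vectors, sample size, and number of trials below are invented stand-ins, not the values used in the experiment:

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented stand-ins for the two success-probability vectors: both have
# the same number of zero entries, and the larger mass sits on different
# coordinates (the experiment's actual vectors differ).
p1 = np.array([2., 2., 2., 1., 1., 0., 0., 0.]); p1 /= p1.sum()
p2 = np.array([1., 1., 0., 0., 0., 2., 2., 2.]); p2 /= p2.sum()

n, m = 40, 100                         # sample size and trials (assumed)
z = rng.integers(0, 2, size=n)         # true cluster labels, K = 2
P = np.where(z[:, None] == 0, p1, p2)  # n x 8 matrix of probability rows
X = np.vstack([rng.multinomial(m, P[i]) for i in range(n)])
```

The rows of `X` are then the multinomial observations whose cluster structure a selection criterion should recover with the correct number of clusters, two.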
More generally, in Table II, we allow the value of to grow, while keeping the first success probability vector a scalar multiple of the vector and the second success probability vector a scalar multiple of , where in both cases, the number of entries with value is fixed at and the numbers of entries with values and are the same or differ only by . A general pattern in Table II is that, for all values of , the conventional AIC, i.e. , performs poorly in comparison to , and we attribute this to the fact that overpenalizes in comparison to .
Table II
20  11  0
25  61  1 
30  86  6 
35  100  16 
40  99  25 
45  100  52 
50  100  56 
55  100  60 
60  100  70 
65  100  70 
70  100  78 
75  100  72 
80  100  83 
85  100  76 
90  100  81 
95  100  80 
100  100  82 
4.1.2 Comparison to rank determination strategies
We now present numerical results comparing our approach to two conventional rank determination methods. Specifically, we denote our first baseline algorithm by (pamk o dist) and the second by (mclust o pca), where o denotes composition. These competing algorithms are often used in practice for choosing the rank of a (random) matrix. In comparison, we denote our model selection procedure by (aic o nmf). For (pamk o dist), one first computes the distance/dissimilarity matrix using pairwise Euclidean/Frobenius distances between the columns of , and then performs partitioning around medoids for clustering (c.f. [10]) together with the "Silhouette" criterion (c.f. [11]) for deciding the number of clusters. For (mclust o pca), one first uses an "elbow-finding" algorithm to estimate the rank of the data matrix (c.f. [12]), say, by , and then uses a clustering algorithm (c.f. [13]) to cluster the columns of into clusters.
The result of our experiment is summarized in Figure 1, which illustrates that in all cases, (aic o nmf) either outperforms or is nearly on par with the two baseline algorithms.
To explain our result, we now specify the setup for our Monte Carlo experiment. Our experiment is motivated by the real data experiment studied in Section 4.2, where a problem of comparing two graphs representing the electrical and chemical neuron pathways of C. elegans is studied.
Specifically, we consider random graphs on vertices such that each has a block-structured pattern, i.e., a checkerboard-like pattern (c.f. Figure 3). For each , we take to be a (weighted) graph on vertices, where each is a Poisson random variable. To parameterize the block structures, we set the number of vertices , where .
The value of equals the number of rows in each block. Then, given a value for the intensity parameter , for each and , we let , where and are such that and . We take
In Figure 1, the horizontal axis specifies the number of nodes after aggregation. For the level of aggregation (or equivalently, vertex contraction), if the number of nodes after vertex contraction is (i.e. the far right side of Figure 1), the original graph is reduced to a graph with vertices. Aggregation of edge weights is only done within the same block. Then, we take to be the nonnegative matrix such that its th column is the vectorized version of the aggregation of . Our problem is then to estimate the number of clusters using the data , where the correct value for is . In this particular case, is a rank matrix, and as such, our problem can also be thought of as a problem of estimating the rank of after observing , whence (pamk o dist) and (mclust o pca) are applicable.
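The data-generating mechanism described above, block-structured Poisson edge weights followed by within-block aggregation, can be sketched as follows; the block sizes and intensity values are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed block structure: v vertices split into R equal blocks, with an
# R x R matrix of Poisson intensities (values invented for illustration).
v, R = 12, 3
block = np.repeat(np.arange(R), v // R)      # block label of each vertex
Lam = np.array([[4.0, 1.0, 1.0],
                [1.0, 4.0, 1.0],
                [1.0, 1.0, 4.0]])

# Each edge weight is Poisson with the intensity of its (row-block,
# column-block) cell, giving the checkerboard-like mean pattern.
G = rng.poisson(Lam[np.ix_(block, block)])

# Aggregation (vertex contraction): sum the edge weights within blocks.
M = np.zeros((R, v))
M[block, np.arange(v)] = 1.0                 # block membership indicator
G_agg = M @ G @ M.T                          # R x R aggregated graph
```

Since every vertex belongs to exactly one block, aggregation preserves the total edge weight while reducing the graph to one node per block.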
In summary, there are two parameters that we varied: the level of intensity and the level of aggregation. The level of intensity is controlled by the value of , which is distinguished in Figure 1 by the shape of the points. Note that a bigger value of means a higher chance of each entry of taking a large integer value.
Then, as the performance index, we use adjusted Rand index (ARI) values (c.f. [14]). In general, ARI takes a value in . Cases in which the value of ARI is close to are ideal, indicating that the clustering is consistent with the truth, and cases in which the value of ARI is less than are cases in which the performance is worse than randomly assigned clusters. Then, to compare the three algorithms, we compare the values of ARI for each fixed value of .
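For completeness, a compact pure-NumPy implementation of the ARI, following the standard contingency-table formula (this is the textbook formula, not code from the paper):

```python
from math import comb

import numpy as np

def adjusted_rand_index(a, b):
    """ARI via the standard contingency-table formula; values near 1
    indicate near-perfect agreement with the reference clustering, and
    negative values indicate worse-than-random agreement."""
    a, b = np.asarray(a), np.asarray(b)
    n = a.size
    labels_a, labels_b = np.unique(a), np.unique(b)
    # Contingency table of co-occurrence counts between the two labelings.
    ct = np.array([[np.sum((a == x) & (b == y)) for y in labels_b]
                   for x in labels_a])
    sum_ij = sum(comb(int(nij), 2) for nij in ct.ravel())
    sum_a = sum(comb(int(ni), 2) for ni in ct.sum(axis=1))
    sum_b = sum(comb(int(nj), 2) for nj in ct.sum(axis=0))
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)
```

Note that the ARI is invariant to a relabeling of the clusters: the labelings `[0, 0, 1, 1]` and `[1, 1, 0, 0]` score exactly 1.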
4.1.3 Comparison to a nonparametric twosample test procedure for random graphs
In this section, we consider a sequence of undirected loopless (unweighted) random graphs such that is an element of a set of distinct matrices whose entries are probabilities. Then, we consider the problem of clustering the graphs into a finite number of groups from the data . This is an abstraction of a problem in neuroscience, where each can represent a copy of a neuron-to-neuron interaction pattern, and each portrays a different mode of connectivity between neurons. For instance, in Section 4.2, the modes are the chemical and the electrical pathways.
Since each is undirected and loopless, its adjacency matrix can be embedded as a vector in an element in