[ Information ] [ Publications ] [Signal processing codes] [ Signal & Image Links ]
[ Main blog: A fortunate hive ] [ Blog: Information CLAde ] [ Personal links ]
[ SIVA Conferences ] [ Other conference links ] [ Journal rankings ]
[ Tutorial on 2D wavelets ] [ WITS: Where is the starlet? ]
If you cannot find anything more, look for something else (Bridget Fountain)
Si vous ne trouvez plus rien, cherchez autre chose (Brigitte Fontaine)

 Web www.laurent-duval.eu lcd.siva.free.fr

# Smoothed p-Over-q (SPOQ) lp/lq norm ratio Regularization for Sparse Signal reconstruction

Abstract: Underdetermined or ill-posed inverse problems require additional information to yield sound solutions with tractable optimization algorithms. Sparsity arguments are consequent heuristics to that matter, with numerous applications in signal restoration, image inversion or machine learning. Since the l0 count measure is barely tractable, many statistical or learning approaches have invested in computable proxies, such as the l1 norm. However, the latter does not exhibit the desirable property of scale invariance for sparse data. Generalizing the SOOT Euclidean/Taxicab l1/l2 norm-ratio initially introduced for blind deconvolution, we propose SPOQ, a family of smoothed scale-invariant penalty regularizations. It consists of surrogates for lp-over-lq quasi-norm/norm ratios, p in ]0,2[, q≥2 with Lipschitz regularity. These surrogates are embedded into a novel majorize-minimize trust-region approach, generalizing the variable metric forward-backward algorithm. On naturally sparse mass-spectrometry signals, we show that SPOQ significantly outperforms l0, l1, Cauchy, Welsch and penalties on several performance measures. Guidelines on SPOQ hyperparameters tuning are also provided, suggesting simple data-driven choices.

# Introduction and background[Introduction]

fchen

## On the role and measures of sparsity in data science

The law of parsimony (or Occam's razor1) is an important heuristic principle, at least a guideline, in history, social and empirical sciences . In modern terms, it can be expressed as a preference to simpler models, when they possess - on observed phenomena - a power of explanation comparable to more complex ones. In statistical data processing, it can be used to limit the degrees of freedom for parametric models, to reduce a space of search, to define stopping criteria, to bound filter support, to simplify signals or images believed to encompass meaningful structures, with the help of sparsifying transformations. For processes that inherently generate sparse observations (spiking neurons, chemical sensing), potentially corrupted by smoothing kernels and noise, sparsity may provide a quantitative target on restored data. In the context of partial observations, it can become a means to selecting one solution, among all potential solutions that are consistent with observations.

A natural playground for sparsity, in discrete time series analysis, is c_{00}(\eR)$, the space of almost-zero real sequences, which is closed under finite addition and convolution , . Unless stated otherwise, in the following, we consider finite sequences of$c_{00}(\eR)$,$\ex=(x_n)_{1 \leq n \leq N} \in \eR^N$, denoted by bold lower case. The cornerstone measure of parsimony is the count index2, i.e. the number of non-zero terms in$\ex$, denoted by$\lzx{\ex}$. It is also called cardinality function, numerosity measure, parsimony or sparsity. For p ∈ ]0, + ∞[, we define$\lpx{p}{\ex} = \left( \sum_{n=1}^N|x_n|^p \right)^{1/p}$. It satisfies norm axioms for p ≥ 1. For quasinorms lp, p < 1, a weaker form of the triangle inequality,$\| \ex+\ey \| \le K \left(\| \ex\|+\| \ey\| \right)$with K ∈ [1, + ∞[, is satisfied3. Note that the quasinorm (p < 1) is sometimes called p-norm-like . Piecewise constant function is nonsmooth and nonconvex. It is often considered as unusable for data optimization in large linear systems4, since its leads to NP-hard problems . Under certain drastic conditions, an$\lz$-problem can however be solved exactly using a convex relation by surrogate penalties, like the -norm . In practice, such conditions are rarely met, and the use of the -norm yields approximate solutions and becomes more heuristic . Mixed-integer programming reformulations using branch-and-bound methods are possible, albethey for relatively small-sized problems. Norm- or quasinorm-based penalties have subsequently played an important role in sparse data processing or parsimonious models for high-dimension regression. Squared Euclidean norm l22 possesses efficient implementations but often heavily degrades data sharpness. As a data fidelity term, the l22 cost function alone cannot address, at the same time, residual noise properties and additional data assumptions. It can be supplemented by different forms of variational regularization, à la Tikhonov . Those act on the composition of data with a well-suited sparsifying operator, e.g. identity, gradients, higher-order derivatives, or wavelet frames . The l22 penalty case corresponds to ridge regression . In the basis pursuit or lasso method (least absolute shrinkage and selection operator ) the one-norm or taxicab distance () is preferred, as it promotes a form of sparsity on the solution. Solutions to this “l22 fidelity plus l1” regularization problem are related to total variation regularization in image restoration . A convex combination of ridge and lasso regularizations yields the naive elastic net regularization . Other convex p-norms (p ≥ 1) regularizations have been addressed, as in bridge regression which interpolates between lasso and ridge . Expecting sparser solutions in practice, non-convex least-squares plus lp (p < 1) problems have been addressed . Although a priori appealing for sparse data restoration, such problems retain NP-hard complexities . Another caveat of the above norm/quasinorm penalties, as proxies for reasonable approximations to , is their scale-variance: norms and quasinorms satisfy the absolute homogeneity axiom ($\lpx{p}{\lambda\ex}=|\lambda|\lpx{p}{\ex}$, for$\lambda\in \eR $). Either to copycat the 0-degree homogeneity of , or to cope with scaling ambiguity, scale-invariant contrast functions were suggested . Recently, the nonconvex hyperbolic penalty has reemerged as a triad: a useful stopping criterion, a sparsity index and a sparse deconvolution penalty, as detailed in . The latter work (SOOT: Smoothed One-Over-Two norm ratio) also proposed an efficient optimization algorithm with theoretically guaranteed convergence. We now motivate its extension to a broader family of (quasi-)norm ratio$\lpx{p}{\ex}/\lpx{q}{\ex}$with couples (p, q) ∈ ]0, 2[ × [2, ∞[, based on both their counting properties and probabilistic interpretation. First, we have equivalence relations in finite dimension: $$\lpx{q}{\ex} \leq \lpx{p}{\ex} \leq \lzx{\ex}^{\frac{1}{p}-\frac{1}{q}}\lpx{q}{\ex} \leq N^{\frac{1}{p}-\frac{1}{q}}\lpx{q}{\ex} \label{eq_lp-lq}$$ with p ≤ q, from the standard power-mean inequality implying classical lp-space embeddings and generalized Rogers-Hölder's inequalities5 The LHS in is attained when$\ex$realizes an instance of the most prototypical sparse signals of$c_{00}(\eR)$, with only one non-zero component. The RHS is reached by a maximally non-sparse$\ex$, where all the samples are set to a non-zero constant (the constant N can even be set to the actual support of$\ex$). Thus, quasinorm-ratios provide interesting proxies for a sparseness measure of$\ex$, to quantify how much the “action” or “energy” of a discrete signal is concentrated into only a few of its components. They are invariant under integer (circular) shift or sample shuffling in the sequence, and under non-zero scale change (or 0-degree homogeneity). Those ratios are sometimes termed pq-means. ## Penalties with quasinorm and norm ratios For every p ∈ ]0, 2[ and q ∈ [2, + ∞[, we thus define: $$%\xinrn \quad \ell_p / \ell_q (\ex) = \left(\sum_{n =1}^N \left(\frac{|x_n|^q}{\sum_{n =1}^N |x_{n}|^q} \right)^{p/q} \right)^{1/p} \,. \label{eqlplqexact}$$ It appeared in the context of Jensen's inequalities in . If one interprets the inner term $$p_n = \frac{|x_n|^q}{\sum_{n =1}^N |x_n|^q}$$ as a discrete probability distribution, then lp/lq rewrites as an increasing function (u → u1/p) of a sum of concave functions (u → up/q when p ≤ q) of probabilities. The minimization of such an additive information cost function , a special case of Schur-concave functionals , is used for instance in best basis selection . Resultingly, special cases of lp/lq quasinorm ratios have served as sparsity-inducing penalties in the long history of blind signal deconvolution or image deblurring, as as stopping criteria (for instance in NMF, non-negative matrix factorization ), measures of sparsity satisfying a number of sound parsimony-prone axioms , estimates for time-frequency spread or concentration . The lp quasinorm weighted by l2 is considered as a “possible sparseness criterion” in when scaled with the normalizing factor$N^{\frac{1}{p}-\frac{1}{2}}$. It bears close connection with the kurtosis for centered distributions and central moments, widely used in sparse sensory coding or adaptive filtering . The most frequent one with (p, q) = (1, 2) is used as a surrogate to l0 , see for instance . This ratio was recently used to enhance Lasso recovery on graphs . The early history of this track includes the “minimum entropy deconvolution” proposed in , where the ”varimax norm“, an adaptation of the kurtosis (l4/l2)4, is maximized to yield visually simpler (spikier) signals. It was inspired by simplicity measures proposed in factor analysis , and meant to improve one of the earliest mentioned l0 regularization for robust modeling in seismic. The relationship with the concept of entropy was later explained asymptotically in . It was generalized to the so-called ”variable norm deconvolution" by maximizing (lq/l2)q by . Note that the techniques in are relatively rudimentary. They aim at finding some inverse filter that maximizes a given contrast. They do not explicitly take into account noise statistics. Even more, the deconvolved signal estimate is linearly obtained from observations, see for an overview. More recently, uses l1/l for sparse recovery, and l/l0 for cardinality-penalized clustering. The work in introduces the family of entropy-based sparsity measures$\left(\ell_q/\ell_1 \right)^{\frac{q}{1-q}}$(termed q-ratio sparsity level in ), extending a previous work on squared l1/l2 ratios for compressed sensing. Finally, proposes an extension of to an lq/l2 ratio to discriminate between sharp and blurry images, and use norm ratio for the purpose of impulsive signature enhancement in sparse filtering, still without rigorous convergence proofs. ## Contribution and outline Our main contribution resides in providing a set of smooth-enough surrogates to with sufficient Lipschitz regularity. The resulting penalties, called SPOQ, extend the SOOT approach from . We also propose a novel trust-region which generalizes and improves the variable metric forward-backward algorithm from . This paper is organized as follows. Section [Introduction] recalled the parsimony role and sparsity measures. Section [ProposedFormulation] describes the observation model and the proposed SPOQ quasinorm-norm or norm-ratio sparsity regularization. We derive our trust-region minimization algorithm and analyze its convergence guarantees in Section [MinimizationAlgorithm]. Section [Application] illustrates the performance of proposed SPOQ regularization in the recovery of “naturally sparse” mass spectrometry signals, over , , Cauchy, Welsch and sparsity penalties. 1. Named after William of Ockham, to whom it is often attributed, and stated as “Entities should not be multiplied without necessity”, or “Non sunt multiplicanda entia sine necessitate”. 2. It is neither a norm nor a quasinorm. Its pseudonorm moniker depends on the definition of the subhomogeneity axiom. 3. The best such K is usually called the modulus of concavity of the quasinorm, and saturates to 1 for norms. For 0 < p ≤ 1, one obtains$\ell_p(\ex_+\ey) \le 2^{\frac{1-p}{p}} \big(\ell_p(\ex)+\ell_p(\ey)\big)$. 4. Other denominations are subset selection, minimum weight solution, sparse null-space or minimum set cover. 5. For finite positive sequences$\ex$and$\ey\$, u, v, w > 0 and 1/u + 1/v ≤ 1/w we have , or :

(∑n = 1N(xnyn)w)1/w ≤ (∑n = 1N(xn)u)1/u(∑n = 1N(yn)v)1/v .