In this background page, we explore in more detail the geometric properties of the short-and-sparse deconvolution (SaSD) problem. Typical optimization formulations of SaSD are nonconvex. This nonconvexity can be attributed to the signed shift symmetry of the short-and-sparse model: because of symmetry there are many distinct minimizers. Understanding the geometry of SaSD will enable us to give provable algorithms for certain model problem instances, and, more importantly, to develop better practical algorithms.

Bilinear Lasso and Approximations

Our study centers around the “bilinear Lasso” formulation:

minaSp1,xRnλx1+12axy22, \min_{\mathbf a\in\mathbb S^{p-1},\, \mathbf x\in\mathbb R^{n}} \lambda \lVert \mathbf x\rVert_1 + \tfrac12 \lVert\mathbf a*\mathbf x - \mathbf y\rVert_2^2,

which balances sparsity and fidelity to the observed data.

To study its geometry rigorously, we develop an approximation to this objective function, by approximating the loss as

12axy22=  12ax22ax,y+12y22  12a22+x22ax,y+12y22. \begin{aligned} & \tfrac12 \lVert\mathbf a*\mathbf x - \mathbf y\rVert_2^2 \\ =\;& \tfrac12 \lVert\mathbf a * \mathbf x \rVert_2^2 - \langle \mathbf a * \mathbf x, \mathbf y \rangle +\tfrac12 \lVert \mathbf y\rVert_2^2 \\ \approx\;& \tfrac12 \lVert\mathbf a \rVert_2^2 + \lVert \mathbf x \rVert_2^2 - \langle \mathbf a * \mathbf x, \mathbf y \rangle +\tfrac12 \lVert \mathbf y\rVert_2^2. \end{aligned}

Since both a22\lVert\mathbf a\rVert_2^2 and y22\lVert\mathbf y\rVert_2^2 are constants, the simplified objective becomes

minaSp1,xRnλx1+12x22ax,y.(3) \min_{\mathbf a\in\mathbb S^{p-1},\, \mathbf x\in\mathbb R^{n}} \lambda \lVert \mathbf x\rVert_1 + \tfrac12 \lVert\mathbf x\rVert_2^2 - \langle\mathbf a*\mathbf x,\mathbf y\rangle. \tag{3}

This objective still solves the short-and-sparse deconvolution, but requires more stringent signal conditions and larger sample size (longer signal y\mathbf y); nevertheless, the geometry of this objective represents that of the bilinear-Lasso well and therefore it draws our attention to study it. Specifically, we will investigate the effect of the symmetric solutions of short signal a\mathbf a on the objective landscape φABL\varphi_{\text{ABL}}, the marginal minimization version of (3)(3), over the sphere Sp1\mathbb S^{p-1}:

minaSp1φABL(a):=  minaSp1(minxRnλx1+12x22ax,y). \begin{aligned} &\min_{\mathbf a\in\mathbb S^{p-1} } \varphi_{\text{ABL}}(\mathbf a) \\ :=\;& \min_{\mathbf a\in\mathbb S^{p-1} }\left( \min_{\mathbf x\in\mathbb R^{n}} \lambda \lVert \mathbf x\rVert_1 + \tfrac12 \lVert\mathbf x\rVert_2^2 - \langle\mathbf a*\mathbf x,\mathbf y\rangle\right). \end{aligned}

As it turns out, the geometry of this objective will be indeed dictated by the solutions of problem, that is, the shifts of short ground truth a0\mathbf a_0.

From Shifts to Geometry

Shifts of a0\mathbf a_0

From now on, we will assume that the length of the true short signal a0\mathbf a_0 is p0p_0 as that we optimize over a space of dimension p3p0p \geq 3p_0. Write the shift of a0\mathbf a_0 by ii samples as si[a0]s_i[\mathbf a_0] and put it in the space of dimension pp, we say

si[a0]:=[0,,0p+i,a0,0,,0pi] s_i[\mathbf a_0]\,:=\, [\overbrace{0,\ldots,0}^{p+i},\mathbf a_0,\overbrace{0,\ldots,0}^{p-i}]

where i{p,p}i\in \{-p\ldots,p\}. As we mentioned, these shifts are the solutions to the short-and-sparse deconvolution, which are also the local minimizers of the objective φABL\varphi_{\text{ABL}} and dictate the overall landscape of the objective.

Geometry over subspace of shifts

Specifically, the geometry of φABL\varphi_{\text{ABL}} over the subspace formed by linear combination of a few shifts will exhibit a “symmetric structure”. We will define a subspace spanned by the shifts of a0\mathbf a_0, s1[a0],,st[a0]s_{\ell_1}[\mathbf a_0],\ldots, s_{\ell_t}[\mathbf a_0], as:

S{1,,t}:=span{s1[a0],,st[a0]}. \mathcal S_{\{\ell_1,\ldots,\ell_t\}} \,:=\, \mathrm{span}\{s_{\ell_1}[\mathbf a_0],\ldots, s_{\ell_t}[\mathbf a_0]\}.

As it turns out, these symmetric structure of landscape for φABL\varphi_{\text{ABL}} over S{1,,t}Sp1 \mathcal S_{\{\ell_1,\ldots,\ell_t\}}\cap \mathbb S^{p-1} will enable efficient methods to find the ground truth when several conditions on the ground truth signals (a0,x0)(\mathbf a_0,\mathbf x_0) are satisfied.

We demonstrate three pictorial examples for the φABL\varphi_{\text{ABL}} landscape over sphere near the subspace that is spanned by one, two, and three shifts, as follows:

test

The geometry of φABL\varphi_{\text{ABL}} over the intersection between the shift subspace and unit sphere. The minimizers are close to the shifts, forming the symmetric structure for the geometry; in between the shifts, the negative curvature of φABL\varphi_{\text{ABL}} breaks the symmetric structure; and has positive curvature away from the subspace.

Firstly, the figure on the left shows that the landscape of φABL\varphi_{\text{ABL}} is strongly convex near the shift s1[a0]s_{\ell_1}[\mathbf a_0], and, not-surprisingly, its unique local minimizer is also close to that exact shift. Though this local geometry is certainly ideal for the minimization algorithm to solving the deconvolution problem, this region is very small compared to overall space the sphere Sp1\mathbb S^{p-1}, meaning that it can be difficult (computationally inefficient) to locate these regions without additional understanding of geometric landscape for φABL\varphi_{\text{ABL}}.

Secondly, the figure on the middle shows φABL\varphi_{\text{ABL}} landscape near the subspace S{1,2}\mathcal S_{\{\ell_1,\ell_2\}} spanned by two shifts s1[a0],s2[a0]s_{\ell_1}[\mathbf a_0], s_{\ell_2}[\mathbf a_0] exhibits a symmetric structure. The landscape is constituted by two convex region near either s1[a0]s_{\ell_1}[\mathbf a_0] and s2[a0]s_{\ell_2}[\mathbf a_0]; in between the two shifts, the negative curvature of φABL\varphi_{\text{ABL}} breaks this symmetric structure; and away from this subspace, the function values φABL(a)\varphi_{\text{ABL}}(\mathbf a) are growing. Though this geometry is nonconvex, it is expected that a minimization algorithm that starts within this region solves the deconvolution problem, because of two reasons: (1) the minimizing procedure is very unlikely to be stuck in between the symmetric structure—the saddle point has zero measure and has large negative curvature directions pointing at either solutions, and (2) the minimization algorithm will be trapped within this region near subspace, since the function value away from this subspace always grows.

Lastly, the symmetric structure for φABL\varphi_{\text{ABL}} can be extended to higher dimensional subspaces, as we show in the figure on the right, which demonstrates the landscape over span of three shifts s1[a0],s2[a0],s3[a0]s_{\ell_1}[\mathbf a_0], s_{\ell_2}[\mathbf a_0], s_{\ell_3}[\mathbf a_0]. Noticeably, all local minimizers are near shifts, and in between all three shifts (the peak point right in the middle), all curvatures with directions inside this subspace are negative; meanwhile, we can also see the substructure formed by any pair of two shifts exhibits similar symmetric structure as in the second figure. Therefore, minimization starts within this subspace can also successfully find the shifts; and more importantly, this geometry structure can be extended to even higher dimensional subspace (much higher then three), and to any subspaces formed by different combinations of shifts.

Finally, we want to list these key findings base on the investigation of this geometry over shifts subspace:

  • Minimizer: Within the subspace, all local minimizers are close to one of the shifts, and all these local regions have symmetric structure.

  • Negative Curvature: Within the subspace, and in between the shifts, the negative curvature of φABL\varphi_{\text{ABL}} will breaks the symmetric structure, meaning that the minimizing algorithm will not stuck in between and will eventually find one of the good local minimizer.

  • Positive Curvature: Away from the subspace, the function value of φABL\varphi_{\text{ABL}} will be growing, meaning that the minimizing algorithm will stay trapped near the subspace over iterates.

Geometry over union of subspaces

The benign geometry of φABL\varphi_{\text{ABL}} for the minimization algorithm to solve the short-and-sparse deconvolution holds not just near a single subspace, but near any subspace formed by span of any different combination of shifts:

The landscape of φABL\varphi_{\text{ABL}} is ideal for minimization over the union of subspaces that is spanned by any combinations of shifts. If the minimization algorithm starts near this region, it successfully finds a local minimizer which is one of the shift ai[a0]a_{i}[\mathbf a_0] and solves short-and-sparse deconvolution problem.

We provide a pictorial example once again, as it shows the geometry of φABL\varphi_{\text{ABL}} over the sphere Sp1\mathbb S^{p-1} exhibits similar symmetric structure for every pair of different shifts (s1[a0],s2[a0])(s_{\ell_1}[\mathbf a_0], s_{\ell_2}[\mathbf a_0]), (s2[a0],s3[a0])(s_{\ell_2}[\mathbf a_0], s_{\ell_3}[\mathbf a_0]) and (s3[a0],s1[a0])(s_{\ell_3}[\mathbf a_0], s_{\ell_1}[\mathbf a_0]), in which all the local minimizers of φABL\varphi_{\text{ABL}} over this union of subspace are close to shifts that span the subspace. Similar geometric property holds for union of shifts subspaces of much higher dimension. In theory, depending on p0p_0 the length of a0\mathbf a_0 and θ\theta the sparsity rate of x0\mathbf x_0, this preferable geometry holds for union of subspaces spanned any combination of 4p0θ4p_0\theta shifts.

When is SaSD Solvable?

Short-and-sparse deconvolution is impossible in general, but it can be solved efficiently, whenever the following conditions are satisfied:

  1. Short: a0\mathbf a_0 is enough short compares to the observed signal.

  2. Incoherence: all signed shifts of a0\mathbf a_0 are not close to each other in 2\ell^2 distance.

  3. Sparse: The density of the support for x0\mathbf x_0 is sufficiently sparse, and somewhat irregular.

There is a catch here, if we push these conditions to extreme, say if the pattern is so short and the map is sparse enough that most of the patterns being non overlapping, then short-and-sparse deconvolution problem becomes trivial—-which simply boils down to looking for the isolated short signal as the ground truth a0\mathbf a_0. This extreme case does not apply to most of the practical scenarios, and a well designed algorithm should be able to solve short-and-sparse deconvolution even under other more complicated, non-trivial cases.

Shift coherence of a0\mathbf a_0

Short-and-sparse deconvolution becomes an easier problem if the short a0\mathbf a_0 is more shift-incoherent. The coherence μ(a0)[0,1]\mu(\mathbf a_0)\in[0,1] is simply defined as the largest absolute inner-product between all different shift pairs, namely

μ(a0):=maxij  si[a0],sj[a0]. \mu(\mathbf a_0) \,:=\, \max_{i\neq j}\; \lvert\langle s_i[\mathbf a_0], s_j[\mathbf a_0] \rangle\rvert.

μ(a0)\mu(\mathbf a_0) characterizes the difference between the solutions of short-and-sparse deconvolution: if the coherence is larger μ(a0)1\mu(\mathbf a_0)\nearrow 1, then there are two shifts being closer on the sphere, which will be harder for the algorithm to distinguish whether the shift of a0\mathbf a_0 is fitting to solve the problem. Geometrically speaking, the closer these shifts are, the easier a bad local minimizer may be created in between the shift, and the algorithm could wrongly report this minimizer, the linear combination of shifts, as the solution.

fig

When a0\mathbf a_0 has low shift-coherence such as a single spike, the short-and-sparse deconvolution is easier, while if a0\mathbf a_0 is a smooth function, the shift-coherence is closer to one and the problem is more difficult.

Generally speaking, if the spectrum of the short a0\mathbf a_0 is biased (a0\mathbf a_0 is lowpass, highpass, etc.), then the coherence of a0\mathbf a_0 tends to become larger, and vice versa.

Sparsity of x0\mathbf x_0

We characterize the sparsity of x0\mathbf x_0 by assuming it to be a random vector whose entries are subordinate to Bernoulli-Gaussian distribution with support activation rate θ\theta.

x0i.i.d.BG(θ) \mathbf x_0 \sim_{\text{i.i.d.}}\text{BG}(\theta)

Concretely, the random vector x0\mathbf x_0 can be written as x0=gω\mathbf x_0 = \mathbf g \circ \mathbf \omega where g\mathbf g is Gaussian vector gi.i.d.N(0,1)Rn\mathbf g \sim_{\text{i.i.d.}} \mathcal N(0,1) \in\mathbb R^n and ω\mathbf \omega is Bernoulli vector ωi.i.d.Ber(θ)Rn\mathbf\omega\sim_{\text{i.i.d.}}\text{Ber}(\theta) \in\mathbb R^n .

img

If x0\mathbf x_0 is a random Bernoulli-Gaussian vector with sparsity rate θ\theta, then naturally if θ\theta is higher then short-and-sparse deconvolution problem becomes harder.

As expected, when the sparsity θ\theta is higher, the short-and-sparse deconvolution problem becomes harder, and vice versa. Notice, that if the sparsity rate θ\theta is lesser then 1/p01/p_0, there will have many isolated a0\mathbf a_0 in the observation signal y\mathbf y that can be found via simply algorithm, which makes the short-and-sparse deconvolution a trivial problem. The algorithm we are studying works under more complicating signal condition, namely when θ1/p0\theta \geq 1/p_0.

Sparsity-coherence tradeoff

We can further articulate these conditions for solving short-and-sparse deconvolution through “sparsity-coherence tradeoff”. That is, the deconvolution problem becomes easier to solve when either the short a0\mathbf a_0 has low coherence μ(a0)0\mu(\mathbf a_0) \searrow 0 or the sparsity of x0\mathbf x_0 is lower θ1/p0\theta \searrow 1/p_0. If the short a0\mathbf a_0 has low shift-coherence, then the sparsity of x0\mathbf x_0 is allowable to larger, and if a0\mathbf a_0 turns out to be smoother and low-pass, then the allowable of x0\mathbf x_0 has to be lower.

img

When the short a0\mathbf a_0 consists of a signal spike, the problem can be solved even if the sparsity of x0\mathbf x_0 is high (left). On the other hand, if the a0\mathbf a_0 is a smooth function, whose shift-coherence become much higher, then the allowable sparsity of x0\mathbf x_0 is lower (right).

From Geometry to Algorithm

The understanding of how the geometry of φABL\varphi_{\text{ABL}} is shaped by the shifts of a0\mathbf a_0 suggests it is possible to design an algorithm that guarantees solving short-and-sparse deconvolution. The algorithm is straightforward. It starts near one of the shift subspaces, and stays near the subspace through the course of minimization.

Initialization in union of subspaces

Let us consider a case where the observation of noiseless y\mathbf y consists of isolated, non-overlapping a0\mathbf a_0. In this trivial case, solving the deconvolution problem becomes finding the isolated copy of the short signal a0\mathbf a_0, which can be done via locating the shortest continuous segment of y\mathbf y, simple as it can possibly be. This algorithm, though, is unrealistic in even slightly more general cases, when y\mathbf y contains noise or x0\mathbf x_0 has higher sparsity.

Nevertheless, we can still adopt this idea under realistic scenarios when y\mathbf y contains noise and overlapping a0\mathbf a_0 by seeing that, due to the sparsity of x0\mathbf x_0, a chunk of data [y1,,yp][\mathbf y_1,\ldots,\mathbf y_p] will contain only a few shifts of a0\mathbf a_0. This observation implies that a chunk of y\mathbf y will be closer to the set of the shifts, and will be close to the shift subspace where φABL\varphi_{\text{ABL}} has nice geometry.

img

A chunk of signal y\mathbf y will contain truncated shifts of a0\mathbf a_0. It is easy to create the initializer a(0)\mathbf a^{(0)} by tweeking the zero-padded chunk of y\mathbf y such that become close to some linear combination of signed-shifts, and starts near one the subspace, over which the objective φABL\varphi_{\text{ABL}} has good landscape geometry.

The figure shows an example where the chunk of the data y\mathbf y contains a few truncated shift of a0\mathbf a_0, and with simple modification (taking gradient) we can easily obtain a initial vector that stays in the subspace spanned by these shifts. When the sparsity of x0\mathbf x_0 is θ\theta, then in expectation, the number of participating shift in this initialization scheme is roughly about 3p0θ3p_0\theta. Since, it is known that φABL\varphi_{\text{ABL}} has nice geometry over subspace with dimension up to 4p0θ4p_0\theta, it is expected that minimization starting with the appointed a(0)\mathbf a^{(0)} solves the deconvolution problem.

As it turns out, this initialization a(0)\mathbf a^{(0)} using a chunk of data y\mathbf y, can be also applied to the practical problems using bilinear-lasso.

Minimization within the subspace

Finally, knowing that the function value of φABL\varphi_{\text{ABL}} is growing away from the subspace, implies the small stepping gradient descent method on this objective will not leave the subspace, and will eventually converge to a local minimizer within, which are exactly the solution, shifts of a0\mathbf a_0.

To adopt this phenomenon in practice while solving bilinear-lasso for deconvolution, since the studied objective φABL\varphi_{\text{ABL}} is the marginal minimization over x\mathbf x of the Lasso-like objective, it is advised the initialization x(0)\mathbf x^{(0)} should also be the minimizer of the objective with assigned a(0)\mathbf a^{(0)}, then alternation minimization of both a\mathbf a and x\mathbf x would well approximate the gradient descent procedure of minimizing φABL\varphi_{\text{ABL}} within the subspace, hence the correct solution can be attained.