$\ell_1$-regularized optimization problems are quite common in machine learning, as they lead to sparse solutions of the modelling problem. Consider the following optimization problem with an $\ell_1$ penalty[1]:

$$\min_{w} \; L(w) = \min_{w} \; f(w) + \alpha \|w\|_1, \tag{1}$$
where $f$ is the (unregularized) loss, $w \in \mathbb{R}^d$ are the parameters of the model to be optimized, and $\alpha > 0$ controls the strength of the penalty. Let us study the reparameterization $w_i = u_i v_i$ for all $i$, i.e. $w = u \odot v$, where $\odot$ represents the element-wise (Hadamard) product of the vectors $u$ and $v$. This is called the Hadamard product parameterization (HPP), or simply the Hadamard parameterization[2]. Then equation (1) can be written as

$$\min_{u,\,v}\; f(u \odot v) + \alpha \sum_i |u_i|\,|v_i| \;\le\; \min_{u,\,v}\; f(u \odot v) + \frac{\alpha}{2}\left(\|u\|_2^2 + \|v\|_2^2\right), \tag{2}$$

where the upper bound follows from the AM–GM inequality $|u_i|\,|v_i| \le \tfrac{1}{2}\left(u_i^2 + v_i^2\right)$, which holds with equality if and only if $|u_i| = |v_i|$.
Therefore, the auxiliary optimization function is

$$\tilde{L}(u, v) = f(u \odot v) + \frac{\alpha}{2}\left(\|u\|_2^2 + \|v\|_2^2\right). \tag{3}$$
The above function is constructed rather purposefully to satisfy the following properties.
- The over-parameterized function $\tilde{L}(u, v)$ is an upper bound on our optimization function $L(u \odot v)$, i.e. $\tilde{L}(u, v) \ge L(u \odot v)$. Equality occurs where $|u_i| = |v_i|$ for all $i$, as can be observed from equation (2).
- The function satisfies the equality $\min_{u,\,v} \tilde{L}(u, v) = \min_{w} L(w)$. This means that the optimal value (a minimum, in this case) of $\tilde{L}$ is equal to the optimal value of $L$. In other words, if $u^*$ and $v^*$ are optimal (in terms of $\tilde{L}$), then $w^* = u^* \odot v^*$ is an optimal (in terms of $L$) value of $w$.
Moreover, since $\tilde{L}$ uses $\ell_2$ regularization, it is differentiable (wherever $f$ is) and biconvex in $(u, v)$ (whenever $f$ is convex). Thus, the auxiliary function is amenable to gradient-based optimization algorithms like SGD, Adam, etc. Interestingly, Ziyin and Wang[3] also proved a stronger property than the equality of the minima: they show that $|u| = |v|$ (element-wise) at all stationary points of (3), and that every local minimum of $\tilde{L}$, given by equation (3), is a local minimum of $L$, given by equation (1).
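To make this concrete, here is a minimal PyTorch sketch of optimizing the auxiliary objective (3) with SGD on a synthetic sparse linear-regression problem. The data, the choice of loss $f$, and all hyperparameter values are illustrative assumptions rather than settings taken from the references; the point is only that the non-differentiable $\ell_1$ penalty has been replaced by ordinary weight decay on `u` and `v`.

```python
import torch

torch.manual_seed(0)

# Synthetic sparse linear-regression data (illustrative assumption):
# only the first 3 of 20 true coefficients are non-zero.
n, d = 100, 20
w_true = torch.zeros(d)
w_true[:3] = torch.tensor([2.0, -3.0, 1.5])
X = torch.randn(n, d)
y = X @ w_true + 0.01 * torch.randn(n)

alpha = 0.1  # regularization strength (illustrative)

# Hadamard product parameterization: w = u * v.
u = torch.randn(d, requires_grad=True)
v = torch.randn(d, requires_grad=True)

# weight_decay=alpha adds alpha*u (resp. alpha*v) to the gradients,
# which is exactly the gradient of (alpha/2) * (||u||^2 + ||v||^2) in (3).
opt = torch.optim.SGD([u, v], lr=1e-2, weight_decay=alpha)

for step in range(5000):
    opt.zero_grad()
    w = u * v                          # element-wise (Hadamard) product
    loss = ((X @ w - y) ** 2).mean()   # differentiable loss f(u ⊙ v)
    loss.backward()
    opt.step()

print((u * v).detach())  # inactive coordinates shrink toward zero
```

For plain SGD, passing `weight_decay=alpha` is equivalent to writing $\tfrac{\alpha}{2}\left(\|u\|_2^2 + \|v\|_2^2\right)$ into the loss explicitly, since the decay term $\alpha u$ (resp. $\alpha v$) is exactly the gradient of that penalty.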

*Figure: Optimization trajectories of the $\ell_1$-regularized and Hadamard-parameterized objectives.*
The above figure shows that, for the same initial conditions and optimizer settings, the $\ell_1$-regularized objective function (Griewank, in this case) gets stuck in a local minimum, while the Hadamard-parameterized function correctly reaches the global minimum at $w = 0$. Note that the $\ell_1$-regularized objective can still be used with PyTorch's SGD optimizer, since autograd assigns a fixed subgradient to $|x|$ at the non-differentiable point (the value of $\operatorname{sign}(0)$, i.e. $0$). But this is only a convention, as the subdifferential of $|x|$ at $x = 0$ is the whole set $[-1, 1]$.
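The figure itself is not reproduced here, but a comparison along these lines can be sketched as follows. The two-dimensional Griewank function is standard; the starting point, learning rate, number of steps, and $\alpha$ below are illustrative assumptions rather than the exact settings behind the figure.

```python
import torch

def griewank(x):
    """Griewank function: global minimum of 0 at the origin."""
    i = torch.arange(1, x.numel() + 1, dtype=x.dtype)
    return 1 + (x ** 2).sum() / 4000 - torch.cos(x / torch.sqrt(i)).prod()

alpha, lr, steps = 0.1, 0.05, 2000      # illustrative hyperparameters
w0 = torch.tensor([6.0, 8.0])           # illustrative (non-negative) start

# (a) l1-regularized objective, optimized directly with SGD.
w = w0.clone().requires_grad_(True)
opt = torch.optim.SGD([w], lr=lr)
for _ in range(steps):
    opt.zero_grad()
    (griewank(w) + alpha * w.abs().sum()).backward()
    opt.step()
print("l1-regularized :", w.detach())

# (b) Hadamard-parameterized objective from equation (3),
#     initialized so that u * v equals the same starting point w0.
u = w0.sqrt().requires_grad_(True)
v = w0.sqrt().requires_grad_(True)
opt = torch.optim.SGD([u, v], lr=lr)
for _ in range(steps):
    opt.zero_grad()
    w_hp = u * v
    (griewank(w_hp) + 0.5 * alpha * (u.pow(2).sum() + v.pow(2).sum())).backward()
    opt.step()
print("Hadamard-param :", (u * v).detach())
```

Both runs start from the same point in $w$-space, since $u \odot v = w_0$ at initialization. Whether a particular run escapes the nearby local minima depends on these settings; the figure corresponds to one configuration where the Hadamard-parameterized objective reaches the origin and the $\ell_1$-regularized one does not.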
Lastly, similar to interpreting the $\ell_1$ and $\ell_2$ regularizers in least-squares problems as Laplacian and Gaussian priors respectively, equation (3) can also be examined through a probabilistic framework. Here, with the parameters $u$ and $v$ subject to $\ell_2$-norm regularization, they can be construed as being governed by a Gaussian prior distribution $\mathcal{N}\!\left(0, \tfrac{1}{\alpha}\right)$ on each component. This implies a regularization effect on the components $u_i$ and $v_i$.
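To spell this out, assume (as in the usual MAP reading of regularized objectives; this assumption is not needed anywhere else in the post) that the loss is a negative log-likelihood, $f(w) = -\log p(\mathcal{D} \mid w)$, and place independent Gaussian priors on the components of $u$ and $v$:

$$
\begin{aligned}
u_i,\, v_i \sim \mathcal{N}\!\left(0, \tfrac{1}{\alpha}\right) \;\;\Longrightarrow\;\;
-\log p(u) - \log p(v) &= \frac{\alpha}{2}\left(\|u\|_2^2 + \|v\|_2^2\right) + \text{const}, \\
-\log p(u, v \mid \mathcal{D}) \;=\; f(u \odot v) + \frac{\alpha}{2}\left(\|u\|_2^2 + \|v\|_2^2\right) + \text{const} &= \tilde{L}(u, v) + \text{const}.
\end{aligned}
$$

Minimizing the auxiliary objective (3) is therefore MAP estimation under these Gaussian priors, and the prior induced on each product $w_i = u_i v_i$ is sharply peaked at zero, which is where the sparsity-promoting behaviour comes from.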
[1] | TIL that the lowercase notation $\ell_1$ is reserved for (norms of) vectors, while the uppercase notation $L_1$ is for operators and functions. So, the appropriate notations are $\ell_1$ and $L_1$ for the vector norm and the function norm, respectively. |
[2] | Hoff, P. D. (2017). Lasso, fractional norm and structured sparse estimation using a Hadamard product parametrization. Computational Statistics & Data Analysis, 115:186–198. ArXiv Link. |
[3] | Ziyin, L. & Wang, Z. (2023). spred: Solving L1 Penalty with SGD. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:43407–43422. ArXiv Link. |