2-Cats: 2d Copula Approximating Transforms (2024)

Flavio Figueiredo¹, José G. Fernandes¹, Jackson N. Silva and Renato Assunção^1,2
${}^{1}\,$ Universidade Federal de Minas Gerais
${}^{2}\,$ ESRI Inc.
Reproducibility: https://anonymous.4open.science/r/2cats-E765/
Contact Author: flavio@dcc.ufmg.br

Abstract

Copulas are powerful statistical tools for capturing dependencies across data dimensions. Applying Copulas involves estimating independent marginals, a straightforward task, followed by the much more challenging task of determining a single copulating function, $C$ , that links these marginals. For bivariate data, a copula takes the form of a two-increasing function $C:(u,v)\in\mathbb{I}^{2}\rightarrow\mathbb{I}$ , where $\mathbb{I}=[0,1]$ . This paper proposes 2-Cats, a Neural Network (NN) model that learns two-dimensional Copulas without relying on specific Copula families (e.g., Archimedean). Furthermore, via both theoretical properties of the model and a Lagrangian training approach, we show that 2-Cats meets the desiderata of Copula properties. Moreover, inspired by the literature on Physics-Informed Neural Networks and Sobolev Training, we further extend our training strategy to learn not only the output of a Copula but also its derivatives. Our proposed method exhibits superior performance compared to the state-of-the-art across various datasets while respecting (provably for most and approximately for a single other) properties of C.

1 Introduction

Modeling univariate data is relatively straightforward for several reasons. Firstly, a wide range of probability distributions exist that are suitable for different data types, including Gaussian, Beta, Log-Normal, Gamma, and Poisson. This set is expanded by mixture models that use the elemental distributions as components and can model multimodality and other behavior. Secondly, visual tools such as histograms or Kernel Density Estimators (KDE) provide valuable assistance in selecting the appropriate distribution. Lastly, univariate models typically involve few parameters and may have exact solutions, simplifying the modeling process.

However, the process becomes more challenging when modeling multivariate data and obtaining a joint probability distribution. Firstly, only a few classes of multivariate probability distributions are available, with the primary exceptions being the multivariate Gaussian, Elliptical, and Dirichlet distributions. Secondly, simultaneously identifying the dependencies via conditional distributions based solely on empirical data is highly complex.

In 1959, Abe Sklar formalized the idea of Copulas[61, 60]. This statistical tool allows us to model a multivariate random variable of dimension $d$ by learning its cumulative multivariate distribution function (CDF) using $d$ independent marginal CDFs and a single additional Copulating function $C$ . A bivariate example may separately identify one marginal as a Log-Normal distribution and the other as a Beta distribution. Subsequently, the Copula model links these two marginal distributions.

However, the Copula approach to modeling multivariate data faced a significant limitation until recently. Our model choices were confined to only a handful of closed forms for the $C$ Copulas, such as the Gaussian, Frank, or Clayton Copulas[16, 25]. Unfortunately, this approach using closed-form Copula models proved inadequate for accurately capturing the complex dependencies in real data. The limited parameterization of these Copulas prevented them from fully representing the intricate relationships between variables.

This situation has changed recently.Neural Networks (NNs) are well-known for their universal ability to approximate any function. Consequently, researchers have started exploring NNs as alternatives to closed forms for $C$ [13, 27, 66, 31, 38, 51]. However, a limitation of NNs is that they neglect the importance of maintaining the fundamental mathematical properties of their domain of study.

For Copulas, only three properties define $C$ [50]:P1: $C:(u,v)\in\mathbb{I}^{2}\mapsto\mathbb{I}$ , where $\mathbb{I}=[0,1]$ ;P2: For any $0\leq u_{1}<u_{2}\leq 1$ and $0\leq v_{1}<v_{2}\leq 1$ , we have that the volume of $C$ is: $V_{C}(u_{1},u_{2},v_{1},v_{2})\equiv C(u_{2},v_{2})-C(u_{2},v_{1})-C(u_{1},v_{%2})+C(u_{1},y_{1})\geq 0$ ;and, P3: $C$ is grounded. That is, $C(0,v)=0$ , $C(v,0)=0$ . Moreover, $C(1,v)=v$ and $C(u,1)=u$ .

This paper proposes 2-Cats (2D Copula Approximating Transforms) as a NN approach that meets the above desiderata. How the model meets P1 to P3 is discussed in Section3. P1 and P2 are natural consequences of the model, whereas P3 is guaranteed via Lagragian Optimization.

To understand our approach, let $H_{\theta}(u,v)$ represent our model (or Hypothesis). Let $G:\mathbb{R}^{2}\mapsto\mathbb{I}$ be any bivariate CDF with $\mathbb{R}^{2}$ as support. We connect $H_{\bm{\theta}}$ with $G$ as follows:

	$\displaystyle H_{\bm{\theta}}(u,v)$	$\displaystyle=G\Big{(}z\big{(}t_{v}(u)\big{)},z(t_{u}(v)\big{)}\Big{)};\,\,z(x%)=\log(\frac{x}{1-x});$
	$\displaystyle t_{v}(u)$	$\displaystyle=\frac{\int_{0}^{u}m_{\bm{\theta}}(x,v)\,dx}{\int_{0}^{1}m_{\bm{%\theta}}(x^{\prime},v)\,dx^{\prime}};\,\,t_{u}(v)=\frac{\int_{0}^{v}m_{\bm{%\theta}}(u,y)\,dy}{\int_{0}^{1}m_{\bm{\theta}}(u,y^{\prime})\,dy^{\prime}};$		(1)

Here, $m_{\bm{\theta}}(u,v)$ is any NN that outputs a positive number and $z(x)$ is the Logit function that is both monotonic and maps $[0,1]$ to $[-\infty,\infty]$ , i.e., $z:\mathbb{I}\mapsto\mathbb{R}$ . Here,the function $t_{v}(u)$ monotonically increases concerning $u$ because of the positive NN output; similarly, $t_{u}(v)$ is monotonic on $v$ .

The monotonic properties of all transformations ensure that our output $H_{\bm{\theta}}(u,v)$ is monotonic for both $u$ and $v$ . Let $z_{u}=z(t_{v}(u))$ and $z_{v}=z(t_{u}(v))$ to simplify notation and $(U,V)$ a random vector with distribution function $H_{\bm{\theta}}(u,v)$ . Then, the probability transform below is valid[11, section 4.3]:

	$\displaystyle H_{\bm{\theta}}(u,v)$	$\displaystyle=\Pr\big{[}Z_{u}\leq z_{u},Z_{v}\leq z_{v}\big{]}=\Pr\big{[}t^{-1%}_{v}(U)\leq t^{-1}_{v}(u),t^{-1}_{v}(V)\leq t^{-1}_{v}(v)\big{]}$
		$\displaystyle=\Pr\big{[}U\leq u,V\leq v\big{]}$		(2)

To ensure that our model derivatives also approximate the derivatives of $C$ , a second contribution of our work is that 2-Cats models are trained similarly to Physics Informed Neural Networks (PINNs)[32, 42] by employing Sobolev training[17] approximate Copula derivatives.

There have been several recent NN[13, 38, 51] and non-NN models[49, 5, 47] Copula-like models.We compare our 2-Cats model with these alternatives, and our results show that the 2-Cats performance (in negative log-likelihood) is better or statistically tied to baselines. Our contributions are:

•
We introduce the 2-Cats model, a novel NN Copula approximation. Unlike other NN approaches, we focus on satisfying (either provably or via constraints) the essential requirements of a Copula. We also demonstrate the empirical accuracy of 2-Cats;
•
We are the first to apply Sobolev training[17] in NN-based Copula methodologies. Our approach involves the introduction of a simple empirical estimator for the first derivative of a Copula, which is seamlessly integrated into our training framework.

2 Related Work

After being introduced by Sklar[61, 60], Copulas have emerged as valuable tools across various domains[25, 50], including engineering[56], geosciences[46, 72], and finance[12, 56]. Recently, Copulas have gained attention from the ML community, particularly with the advent of DL methods for Copula estimation and their utilization in generative models[13, 27, 66, 38, 31, 51].

Our proposed approach aligns closely with the emerging literature in this domain but makes significant advancements on two fronts. This paper primarily focuses on developing an NN-based Copula model that adheres to the fundamental properties of a Copula. Two methods sharing similar objectives are discussed in [38, 51] and [13]. However, our approach diverges from these prior works by not confining our method to Archimedean Copulas, as seen in[38] and [51].

As Chilinski and Silva[13], we begin by estimating densities through CDF estimations followed by automatic differentiation to acquire densities. However, our primary difference lies in our emphasis on relating 2-Cats to a Copula, an aspect not prioritized by the authors of that study.

To assess the effectiveness of our approach, we conduct comparative analyses against previous proposals, including the method presented by Janke et al.[31], proposed as a generative model. Furthermore, 2-Cats does not presume a fixed density function for copula estimation; instead, it is guided by the neural network (NN). Hence, our work shares similarities with non-parametric copula estimators used as baselines[49, 47, 5]. Ultimately, all these methods function as baseline approaches against which we evaluate and demonstrate the superiority of our proposed approach.

For training our models, we utilize Sobolev training as a regularizer. Sobolev training is tailored to estimate neural networks (NNs) by incorporating losses that not only address errors toward the primary objective but also consider errors toward their derivatives. This approach ensures a more comprehensive and accurate training of the NNs. A similar concept is explored in Physics Informed Neural Networks (PINNs)[32, 35], where NNs incorporate derivative information from physical models. This approach enables PINNs to model complex physics-related relationships.

In summary, our objective is to develop an NN-based Copula model that respects the mathematical properties of Copulas and benefits from considering derivative information. Consequently, although our focus is on 2D Copulas, the Pair Copula Decomposition (PCC)[1, 15] that requires derivative information and is used to model $d$ -dimensional data is valid for our models[16]. Due to space constraints, we leave the evaluation of such a decomposition for future work.

Our work also aligns with the literature on monotonic networks[58, 18, 71] and Integral Networks[36, 39]. The concept of using integral transforms to build monotonic models was introduced by[71].

3 On the Desiderata of a Copula and 2-Cats Properties

We now revisit the discussion on the 2-Cats model as introduced in Section1.

One of the building blocks of 2-Cats is the NN $m_{\bm{\theta}}(u,v)$ . In particular, we require that this NN outputs only positive numbers so that its integral captures a monotonic function. We achieve this by employing Exponential Linear Units (ELU) plus one activation (ELU+1) in every layer.

Let us consider the computation details of $t_{v}(u)$ , with the understanding that the process for $t_{u}(v)$ is analogous. In our implementation, we adopt Cumulative Trapezoid integration. Initially, we divide the range $[0,1]$ into $200$ equally spaced intervals. When computing $t_{v}(u)$ , the value of $u$ is inserted into this interval while preserving the order. Subsequently, the $m_{\bm{\theta}}(x,v)$ NN is evaluated for the $201$ corresponding values of parameter $x$ . Trapezoidal integration is then applied to calculate the integral value at $u$ (for the numerator) and $1$ (for the denominator). Finally, when $u=0$ , $t_{v}(0)=0$ .

To compute the Jacobian and Hessian of $H_{\bm{\theta}}$ , as is done in PINNs[32, 42], we rely on symbolic computation from modern differentiable computing frameworks such as Jax and Pytorch for this case.In other words, $\frac{\partial H_{\bm{\theta}}(u,v)}{\partial u}$ and $\frac{\partial H_{\bm{\theta}}(u,v)}{\partial v}$ are estimated via symbolic differentiation via the Jacobian¹¹1https://jax.readthedocs.io/en/latest/_autosummary/jax.jacobian.html, while $\frac{\partial^{2}H_{\bm{\theta}}(u,v)}{\partial u\partial v}$ is estimated via symbolic differentiation using the Hessian²²2https://jax.readthedocs.io/en/latest/_autosummary/jax.hessian.html.

Now, let us present our main properties. AppendixB and D complements these properties.

Theorem P1.

$H_{\bm{\theta}}:(u,v)\in\mathbb{I}^{2}\mapsto\mathbb{I}$ , where $\mathbb{I}=[0,1]$ .

Proof.

Remind that $z_{u}=z\big{(}t_{v}(u)\big{)}$ and $z_{v}=z\big{(}t_{u}(v)\big{)}$ . Also, $(t_{v}(u),t_{u}(v))\in\mathbb{I}^{2}$ . These transforms are also able to cover the entire range of $\mathbb{I}^{2}$ (when $u=0$ we have $t_{v}(0)=0$ , when $u=1$ we have $t_{v}(1)=1$ , the same goes for the $t_{u}(v)$ transform). The function $z$ maps the domain $\mathbb{I}^{2}$ into $\mathbb{R}^{2}$ . Given that the bivariate CDFs work in this domain: $H_{\bm{\theta}}:(u,v)\in\mathbb{I}^{2}\mapsto\mathbb{I}$ , where $\mathbb{I}=[0,1]$ .∎

Theorem P2.

The 2-Cats copula $H_{\bm{\theta}}(u,v)$ satisfies the non-negative volume property. That is,for any $0\leq u_{1}<u_{2}\leq 1$ and $0\leq v_{1}<v_{2}\leq 1$ , we have that $V_{H_{\bm{\theta}}}(u_{1},u_{2},v_{1},v_{2})\equiv H_{\bm{\theta}}(u_{2},v_{2}%)-H_{\bm{\theta}}(u_{2},v_{1})-H_{\bm{\theta}}(u_{1},v_{2})+H_{\bm{\theta}}(u_%{1},v_{1})\geq 0$ .

Intuition..

This is a straightforward consequence of $G$ being a bivariate cumulative distribution function and the monotonicity of the $z$ , $t_{u}$ , and $t_{v}$ transforms.The same transform we explored above for the single variate case is valid for multiple variables. That is, fact that $z_{u}=z(t_{v}(u))$ and $z_{v}=z(t_{u}(v))$ are monotonic guarantees that $G(z_{u},z_{v})$ defines a valid transform Eqn(2).

Considering a bivariate CDF, for any given point in its domain (e.g., $t_{v}(u),t_{u}(v)$ ), the CDF represents the accumulated probability up to that point. Let the origin be $(t_{v}(0),t_{u}(0))$ . As either $t_{v}(u)$ , $t_{u}(v)$ , or both values increase, the CDF values can only increase or remain constant. Therefore, the volume under the CDF surface will always be positive. Now, consider the following Corollary of Eq2.∎

Corollary: The second derivative of 2-Cats is a pseudo-likelihood..

Let $F(x_{1},x_{2})=\Pr[X_{1}\leq x_{1},X_{2}\leq x_{2}]$ be the bivariate CDF associated with RVs $X_{1}$ and $X_{2}$ . With $u=F_{X_{1}}(x_{1})$ and $v=F_{X_{2}}(x_{2})$ being the marginal CDFs, we reach that: $F(x_{1},x_{2})=\Pr\big{[}U\leq u,V\leq v\big{]}$ . By Eq(2), $\Pr\big{[}U\leq u,V\leq v\big{]}=H_{\bm{\theta}}(u,v)=H_{\bm{\theta}}(F_{X_{1}%}(x_{1}),F_{X_{2}}(x_{2}))$ . 2-Cats is also capturing the same bivariate CDF of $C$ . The bivariate density, $f(x_{1},x_{2})=\frac{\partial^{2}F(x_{1},x_{2})}{\partial x_{1}\,\partial x_{2}}$ , thus is:

\displaystyle f(x_{1},x_{2})=\frac{\partial^{2}H_{\bm{\theta}}(F_{X_{1}}(x_{1}%),F_{X_{2}}(x_{2}))}{{\partial x_{1}\,\partial x_{2}}}=f_{X_{1}}(x_{1})\,f_{X_%{2}}(x_{2})\frac{\partial^{2}H_{\bm{\theta}}(u,v)}{{\partial u\,\partial v}}

(3)

The above equation is solved via the Chain rule. A consequence of it, is that the likelihood $f(x_{1},x_{2})$ , is proportional to $h_{\bm{\theta}}(u,v)=\frac{\partial^{2}H_{\bm{\theta}}(u,v)}{{\partial u\,%\partial v}}$ . This term is known as a pseudo-likelihood[23, 22].∎

Proof P2..

Given that $f(x_{1},x_{2})\geq 0$ by definition, and also $f_{X_{1}}(x_{1})\geq 0$ and $f_{X_{2}}(x_{2})\geq 0$ : $h_{\bm{\theta}}(u,v)\geq 0$ . The volume being the double integral of $h_{\bm{\theta}}(u,v)$ is always positive or zero.∎

Property P3.

$H_{\bm{\theta}}(u,0)=0$ , $H_{\bm{\theta}}(0,v)=0$ , $H_{\bm{\theta}}(u,1)\approx u$ , and $H_{\bm{\theta}}(1,v)\approx v$ . Notice that this property is a relaxation of the one presented in Section 1.

This last property of a Copula is the one that guarantees that the Copula will have uniform marginals. In particular, the terms: $C(u,1)=u$ and $C(1,v)=v$ are of utmost importance for sampling. To grasp this, consider some valid Copula $C$ . Here, $C(u,1)=\Pr[U\leq u,V\leq 1]=\Pr[U\leq u]=u$ , which is the CDF of the Uniform distribution. A similar argument exists for $v$ .

2-Cats does not provably meet P3 as it does P1 and P2. Nevertheless, we prove that: $H_{\bm{\theta}}(u,0)=0$ and $H_{\bm{\theta}}(0,v)=0$ , while approximating $H_{\bm{\theta}}(u,1)\approx u$ , and $H_{\bm{\theta}}(1,v)\approx v$ .

Lemma P3.1.

$H_{\bm{\theta}}(u,0)=0$ and $H_{\bm{\theta}}(0,v)=0$

Proof..

Consider the valueof $H_{\bm{\theta}}(u,0)$ when $v=0$ . As $t_{u}(0)=0$ , we have $z(t_{u}(0))=z(0)=\lim_{x\rightarrow 0}z(x)=-\infty$ and hence $H_{\bm{\theta}}(u,0)=G(z_{u},-\infty)$ . Given that $G$ is a bivariate CDF, $G(z_{u},-\infty)=\lim_{w\rightarrow-\infty}G(z_{u},w)=0$ for any $z_{u}$ . The same proof exists for $H_{\bm{\theta}}(0,v)$ .∎

Conjecture P3.2.

$H_{\bm{\theta}}(u,1)\approx u$ and $H_{\bm{\theta}}(1,v)\approx v$ .

Meeting P3.2 via Lagrangian..

To meet this relaxed property, we propose the following constrained optimization. Let the inputs of our model be comprised of the set $\mathcal{D}=\{u_{i},v_{i}\}$ , where $i\in[1,n=|\mathcal{D}|]$ . Moreover, let $L_{\bm{\theta}}(D)$ be some loss term used to optimize our model (see the next section). Our optimization for 2-Cats will focus on the following constrained problem:

\displaystyle\arg\,\min_{\bm{\theta}}L_{\bm{\theta}}(D)\quad\textrm{s.t.}\quad%\forall u\in[0,1]:H_{\bm{\theta}}(u,1)-u=0\,\text{and}\,\forall v\in[0,1]:H_{%\bm{\theta}}(1,v)-v=0.

(4)

Although it is trivial to understand why this optimization meets P3.2, performing such an optimization presents several challenges. The first challenge is how to model constraints. A natural first choice is to consider the square of constraints. However, squared constraint optimization is an ill-posed optimization problem[8, Chapter2.1], and [54, Section2.1]. Secondly, our model may correctly estimate that $H_{\bm{\theta}}(u,1)=u$ (or similarly that $H_{\bm{\theta}}(1,v)=v$ ), but this will not necessarily ensure that the partial derivatives of $H_{\bm{\theta}}(u,1)$ and $H_{\bm{\theta}}(1,v)$ are accurate (see AppendixA.3). Such derivatives must equal one for these inputs as the Copula marginals are distributed according to an $Uniform(0,1)$ . Thirdly, evaluating these constraints over the entire domain $u\in[0,1]$ and $v\in[0,1]$ is impossible. The final challenge is how to optimize constraints. Here, a natural approach is to consider Lagrangian multipliers[54, 8, 69]. However, in such cases, the solution to the optimization problem will lie on a saddle point of the loss surface, and gradient methods (used in NNs and 2-Cats) do not converge on saddle points (Section 3.1 of[54] provides a simple intuition on why this is so).

We tackle the first and second challenges by considering the derivative of the square of the above constraints, i.e: $\frac{\partial(H_{\bm{\theta}}(u,1)-u)^{2}}{\partial u}$ and $\frac{\partial(H_{\bm{\theta}}(1,v)-v)^{2}}{\partial v}$ , which via the chain rule will be defined as follows:

\displaystyle r^{1}_{\bm{\theta}}(u)=2\,\big{(}H_{\bm{\theta}}(u,1)-u\big{)}%\big{(}\frac{\partial H_{\bm{\theta}}(u,1)}{\partial u}-1\big{)}\,\text{and}\,%r^{2}_{\bm{\theta}}(v)=2\,\big{(}H_{\bm{\theta}}(1,v)-v\big{)}\big{(}\frac{%\partial H_{\bm{\theta}}(1,v)}{\partial v}-1\big{)}.

(5)

$r^{1}_{\bm{\theta}}(u)$ and $r^{2}_{\bm{\theta}}(v)$ stand for requirements on $u$ and $v$ , respectively. Both requirements are always positive or zero because $f(x)<1\iff\int_{0}^{x}f(x)dx<x$ . Consequently, both requirements will only equal zero when P3.2 is met (our goal). One advantage of this constraint is that it also considers our model’s derivative, which tackles our second challenge above.

We evaluated the constraints on our training set to tackle the third challenge. This leads to the optimization stated below. $r_{\bm{\theta}}(D)$ are the constraints on the training data. From the arguments above, we have that: $r_{\bm{\theta}}(D)\geq 0$ and $r_{\bm{\theta}}(D)=0$ if and only if the constraints are met.

\displaystyle\hat{\bm{\theta}}=\arg\,\min_{\bm{\theta}}L_{\bm{\theta}}(D)+r_{%\bm{\theta}}(D),\,\,\text{where: }\,\,r_{\bm{\theta}}(D)=\sum_{i=1}^{n}r^{1}_{%\bm{\theta}}(u_{i})+\sum_{i=1}^{n}r^{2}_{\bm{\theta}}(v_{i}).

(6)

As in[7], to solve the above problem, we treat $r_{\bm{\theta}}(D)$ as a barrier and optimize as follows:

1.
Let $\lambda\in(0,1]$ and $\alpha\in(0,1)$ be hyper-parameters (e.g., we set $\lambda=1$ and $\alpha=0.95$ ). Also, let $\bm{\theta}^{(0)}$ be the initial model parameters (initialized as in[34] – aka Lecun Normal).
2.
For every training iteration $t\in[1,T]$ :
1. (a)
  $\bm{\theta}^{(t)}=\text{stochatic gradient step}\big{(}\lambda\,L_{\bm{\theta}%^{(t-1)}}(D)+r_{\bm{\theta}^{(t-1)}}(D)\big{)}$ via ADAM[33]
2. (b)
  $\lambda=\alpha\,\lambda$
3.
Set our model as $\hat{\bm{\theta}}=\bm{\theta}^{(T)}$

As $t\rightarrow\infty$ , we have that $\lambda\rightarrow 0$ , leading our optimization to focus on the constraints. Even though we may not start our optimization in a feasible solution to $r_{\bm{\theta}^{(t-1)}}(D)$ [7], the fact that our loss surface is fully differentiable allows the optimizer to reach such a solution if not stuck on some local minima/saddle. Nevertheless, we align with the empirical evidence that in overparametrized models, such as deep NNs, local minima are rare (see Chapter 6.7 – The Optimality of Backpropagation – of[6]) and that such stochastic optimizers, such as ADAM, are suitable for escaping saddle points[6]. Moreover, our initialization, $\bm{\theta}^{(0)}$ , is suitable for satisfying regularizers[34].

For comparison, we shall present results with and without these Lagrangian terms. ∎

4 Sobolev Losses for 2-Cats Models

We have already discussed how our approach ensures the validity of Copula functions. To train our models, we designed our loss function as a weighted sum, i.e.:

L_{\bm{\theta}}(\mathcal{D})=w^{C}L^{C}_{\bm{\theta}}(\mathcal{D})+w^{dC}L^{dC%}_{\bm{\theta}}(\mathcal{D})+w^{c}L^{c}_{\bm{\theta}}(\mathcal{D})

(7)

Here, $w^{C}$ , $w^{dC}$ , and $w^{c}$ are loss weights.

The first component, $L^{C}_{\bm{\theta}}(\mathcal{D})$ , stimulates the copula to closely resemble the empirical cumulative distribution function derived from the observed data. In this way, our model learns to capture the essential characteristics of the data.The second component, $L^{dC}_{\bm{\theta}}(\mathcal{D})$ , imposes penalties on any disparities between the fitted copula’s first-order derivatives and the data-based estimates. A Copula’s first derivative is essential for sampling (see AppendixB) and Vine methods. The third component, $L^{c}_{\bm{\theta}}(\mathcal{D})$ , focuses on the copula’s second-order derivative, which is linked to the probability density function of the data, and evaluates its proximity to the empirical likelihood. Incorporating this aspect in our loss function enhances the model’s ability to capture the data’s distribution.

From the arguments above, all three terms play a role in Copula modeling. While obtaining the first component is relatively straightforward, the other two components necessitate some original work. We explain each of them in turn. Before continuing, we point out that AppendixA provides an example of how NNs behave when estimating derivatives and integrals of functions.

Let $F(x_{1},x_{2})=\Pr\big{[}X_{1}\leq x_{1},X_{2}<x_{2}]$ be the bivariate CDF for some random vector $\mathbf{x}$ and $\widehat{F^{n}}(\mathbf{x})$ be the empirical cumulative distribution function (ECDF) estimated from a sample of size $n$ .The Glivenko-Cantelli theorem states that: $\widehat{F^{n}}(x_{1},x_{2})\ \xrightarrow{\text{a.s.}}\ F(\mathbf{x})$ as $n\xrightarrow{\ }\infty$ [48].

We can explore these results to define the relationship between our model, the ECDF, and the true CDF. Being $\hat{\bm{\theta}}$ the estimated parameters and given that the CDF $F(x_{1},x_{2})$ and the Copula $C(u,v)$ evaluate to the same value (i.e., $u$ and $v$ are the inverse of the marginal CDFs of $x_{1}$ and $x_{2}$ respectively):

\displaystyle C(u,

\displaystyle v)=F(F_{X_{1}}^{-1}(u),F_{X_{2}}^{-1}(v))=F(x_{1},x_{2})\approx%\widehat{F^{n}}(x_{1},x_{2})\approx H_{\hat{\bm{\theta}}}(u,v)=H_{\hat{\bm{%\theta}}}(\widehat{F_{X_{1}}}(x_{1}),\widehat{F_{X_{2}}}(x_{2})).

Considering these definitions, we can define the loss functions based on the ECDF and the output:

\displaystyle L^{C}_{\bm{\theta}}(\mathcal{D})=\frac{1}{n}\sum_{u_{i},v_{i}\in%\mathcal{D}}\big{(}H_{\bm{\theta}}(u_{i},v_{i})-\widehat{F^{n}}(x_{i,1},x_{i,2%})\big{)}^{2}.

(8)

Now for our second term, $L^{dC}_{\bm{\theta}}(\mathcal{D})$ , a natural step for the first derivatives would be to define a mean squared error (or similar) loss towards these derivatives. The issue is that we need to define empirical estimates from data for both derivatives: $\frac{\partial C(u,v)}{\partial u}$ and $\frac{dC(u,v)}{\partial v}$ . To estimate such derivatives, we could have employed conditional density estimators[53, 63, 26]. Nevertheless, methods as[53, 63] are also deep NNs that have the drawback of requiring extra parameters to be estimated. Even classical methods such as[26] suffer from this issue. As a minor contribution, we explore the underlying properties of a Copula to present an empirical approximation for such derivatives.

For a 2d Copula, the first derivative of $C$ has the form[50]: $\frac{\partial C(u,v)}{\partial u}=\Pr\big{[}V\leq v\mid U=u\big{]}.$

The issue with estimating this function from data is that we cannot filter our dataset to condition on $U=u$ (we are working with data where $u\in\mathbb{I}$ , a real number). However, using Bayes rule we can rewrite this equation. First, let us rewrite the above equation in terms of density functions. That is, if $c_{u}(v)=F(V\leq v\mid U=u\big{)}=\Pr\big{[}V\leq v\mid U=u\big{]}$ is a cumulative function, $c(v)$ and $c(u)$ are the marginal copula densities (uniform by definition). Using Bayes rule, we have that:

\displaystyle\frac{\partial C(u,v)}{\partial u}

\displaystyle=c_{u}(v)=\frac{c\big{(}U=u\mid V\leq v\big{)}F\big{(}V\leq v\big%{)}}{c\big{(}u\big{)}}=v\,c\big{(}U=u\mid V\leq v\big{)}.

(9)

Where the last identity above comes from the fact that for Copulas, the marginal distributions $U$ and $V$ are uniform, leading to $c(u)=c(v)=1$ . Also, we shall have that $F\big{(}V\leq v\big{)}=\Pr\big{[}V\leq v\big{]}=v$ and $F\big{(}U\leq u\big{)}\Pr\big{[}U\leq u\big{]}=u$ [50]. Now, estimate $c\big{(}U=u\mid V\leq v\big{)}$ .

To do so, let us define $\frac{\widehat{\partial C^{n}}(u,v)}{\partial u}\approx c\big{(}U=u\mid V\leq v%\big{)}$ as an empirical estimate of such a density using $n$ data points. We employ the widely used Kernel Density Estimation (KDE) to estimate this function. A fast algorithm for our estimation will work as follows: we arrange our dataset as a table of two columns, where each row contains the pairs of $u,v$ (the columns). For efficiency, we create views of this table sorted on column $u$ or column $v$ . When we iterate this table sorted on $v$ , finding the points where $V\leq v$ is trivial, as these are simply the previously iterated rows. If we perform KDE on the column $u$ for these points, we shall have an empirical estimate of the density: $c\big{(}U=u\mid V\leq v\big{)}$ . By simply multiplying this estimate with $v$ , we have our empirical estimation of the partial derivative with regards to $u$ , that is: $\frac{\widehat{\partial C^{n}}(u,v)}{\partial u}$ . Similarly, we can estimate $\frac{\widehat{\partial C^{n}}(u,v)}{\partial v}$ . Now, our loss term is:

\displaystyle L^{dC}_{\bm{\theta}}(\mathcal{D})=\frac{1}{2n}\sum_{u_{i},v_{i}%\in\mathcal{D}}\big{(}

\displaystyle(\frac{\partial H_{\bm{\theta}}(u_{i},v_{i})}{\partial u}-\frac{%\widehat{\partial C^{n}}(u_{i},v_{i})}{\partial u})^{2}+(\frac{\partial H_{\bm%{\theta}}(u_{i},v_{i})}{\partial v}-\frac{\widehat{\partial C^{n}}(u_{i},v_{i}%)}{\partial v})^{2}\big{)}.

(10)

Finally, we focus on the last part of our loss, the a pseudo-likelihood:

\displaystyle L^{c}_{\bm{\theta}}(\mathcal{D})=-\frac{1}{n}\sum_{u_{i},v_{i}%\in\mathcal{D}}\log(\frac{\partial^{2}H_{\bm{\theta}}(u_{i},v_{i})}{\partial u%\partial v}).

(11)

It is essential to understand why the three losses are needed. It might seem that simply minimizing this loss would be sufficient. However, we can show that integrals of NN are also approximations up to an asymptotical constant[39], but we still need to ensure that this constant is acceptable (see AppendixA.2). Moreover, AppendixC presents an ablation study on the impact of all three losses.

5 Experimental Results

We now present our experimental results on different fronts: (1) validating our Empirical estimates for the first derivative, a crucial step when training 2-Cats; (2) evaluating 2-Cats on synthetic and real datasets without the lagrangian term; and, (3) evaluating the impact of the lagrangian term.

Before presenting results, we note that our models were implemented using Jax³³3https://jax.readthedocs.io/en/latest//Flax⁴⁴4https://flax.readthedocs.io. Given that Jax does not implement CDF methods for Multivariate Normal Distributions, we ported a fast approximation of bivariate CDFs[67] and employed it for our models that rely on Gaussian CDFs/Copulas. Our code is available at https://anonymous.4open.science/r/2cats-E765/. Baseline methods were evaluated using the author-supplied code (details are on AppendixG).

5.1 Empirical First Derivative Estimator

First, we validate our empirical estimations for the first derivative of Copulas. We use the closed-form equations for the first derivatives of Gaussian, Clayton, and Frank Copulas described in [57].

These Copulas have a single parameter, which we varied. Gaussian Copulas ( $\rho$ ) has a correlation parameter, and we set it to: 0.1, 0.5, and 0.9. Clayton and Frank copulas’ mixing parameter ( $\theta$ ) was set to 1, 5, and 10. For each of these configurations, we measured the coefficient of determination ( $R2$ ) between empirical estimates and exact formulae. Overall, we found that R2 was above $0.899$ in every setting, validating that our estimates are pretty accurate in practive.

5.2 2-Cats on Datasets (Without Lagrangian Term)

We now turn to our main comparisons. Our primary focus, as is the case on baseline methods, will be on capturing the data likelihood (Eq(3)). In our experiments, the PDFs for $X_{1}$ and $X_{2}$ were estimated via KDE. The bandwidth for these methods is determined using Silverman’s rule[59].

As is commonly done, we evaluate the natural log of this quantity as the data log-likelihood and use its negative to compare 2-Cats with other methods. With this metric of interest in place, we considered two 2-Cats models: (1) 2-Cats-G (Final Layer Gaussian): 2-Cats with a Gaussian bivariate CDF as $G$ ; and, (2) 2-Cats-L (Final Layer Logistic): 2-Cats with a Logistic bivariate CDF as $G$ .

Thus, we considered two forms of $G$ . The first was the CDF of a bivariate Gaussian Distribution[67]. The second one is the CDF of the Flexible bivariate Logistic (see Section 11.8 of[4]):

\displaystyle F(x_{1},x_{2})=\Pr[X_{1}<x_{1},X_{2}<x_{2}]=\big{(}(1+e^{-\alpha%\frac{x_{1}-\mu_{1}}{\sigma_{1}}}+e^{-\alpha\frac{x_{2}-\mu_{2}}{\sigma_{2}}}+%e^{-\alpha(\frac{x_{1}-\mu_{1}}{\sigma_{1}}+\frac{x_{2}-\mu_{2}}{\sigma_{2}})}%)^{\frac{1}{\alpha}}\big{)}^{-1}

$\mu_{1}$ , $\mu_{2}$ , $\sigma_{1}$ , $\sigma_{2}$ and $\alpha$ are free parameters optimized by the model. The same goes for the free parameters of the bivariate Gaussian CDF (the means of each dimension, $\mu_{1}$ and $\mu_{2}$ , and the correlation parameter $\rho$ ). Moreover, for each model, we employed four-layer networks with hidden layers having sizes of 128, 64, 32, and 16. We now discuss loss weights.

Recall that the data density is proportional to the pseudo-likelihood. We primarily emphasized the likelihood loss ( $w^{c}$ ). This accentuates our pivotal component. Notably, the scale of its values differs from that of the MSE components, necessitating a proportional adjustment in weight selection. In light of these factors, we fixed $w^{C}=0.01$ , $w^{dC}=0.5$ , and $w^{c}=0.1$ . We note that these hyper-parameters were sufficient to show how our approach is better than the literature. Our models were trained using early stopping. Here, for each training epoch, we evaluated the pseudo-log-likelihood on the training set and kept the weights of the epoch with the best likelihood on the training dataset.

Gaussian ( $\rho$ )Clayton ( $\theta$ )Frank/Joe ( $\theta$ )0.10.50.915101510Non-Deep LearnPar $2.91\pm 0.09$ $2.69\pm 0.09$ $2.09\pm 0.09$ $2.65\pm 0.09$ $1.89\pm 0.11$ $1.39\pm 0.10$ $2.78\pm 0.08$ $2.57\pm 0.09$ $2.10\pm 0.09$ Bern $2.91\pm 0.09$ $2.69\pm 0.09$ $2.15\pm 0.09$ $2.66\pm 0.08$ $2.06\pm 0.09$ $1.76\pm 0.08$ $2.78\pm 0.08$ $2.58\pm 0.09$ $2.15\pm 0.08$ PBern $2.91\pm 0.09$ $2.69\pm 0.09$ $2.15\pm 0.10$ $2.67\pm 0.09$ $2.12\pm 0.10$ $1.92\pm 0.08$ $2.78\pm 0.08$ $2.57\pm 0.09$ $2.13\pm 0.08$ PSPL1 $2.91\pm 0.09$ $2.69\pm 0.09$ $2.15\pm 0.11$ $2.67\pm 0.09$ $2.14\pm 0.18$ $1.61\pm 0.13$ $2.78\pm 0.08$ $2.57\pm 0.09$ $2.10\pm 0.09$ PSPL2 $2.91\pm 0.09$ $2.69\pm 0.09$ $2.13\pm 0.10$ $2.67\pm 0.09$ $2.02\pm 0.11$ $1.62\pm 0.10$ $2.78\pm 0.08$ $2.57\pm 0.09$ $2.11\pm 0.09$ TTL0 $2.92\pm 0.09$ $2.69\pm 0.09$ $2.09\pm 0.09$ $2.65\pm 0.08$ $2.00\pm 0.21$ $1.46\pm 0.09$ $2.79\pm 0.08$ $2.57\pm 0.09$ $2.12\pm 0.09$ TLL1 $2.92\pm 0.09$ $2.70\pm 0.10$ $2.09\pm 0.09$ $2.65\pm 0.09$ $1.98\pm 0.21$ $1.43\pm 0.10$ $2.79\pm 0.08$ $2.57\pm 0.09$ $2.17\pm 0.20$ TLL2 $2.91\pm 0.09$ $2.75\pm 0.20$ $2.08\pm 0.09$ $2.65\pm 0.09$ $2.03\pm 0.25$ $1.41\pm 0.10$ $2.79\pm 0.09$ $2.57\pm 0.09$ $2.11\pm 0.09$ TLL2nn $2.91\pm 0.09$ $2.69\pm 0.09$ $2.09\pm 0.10$ $2.65\pm 0.09$ $1.93\pm 0.11$ $1.42\pm 0.10$ $2.78\pm 0.08$ $2.57\pm 0.09$ $2.12\pm 0.09$ MR $2.91\pm 0.09$ $2.70\pm 0.09$ $2.16\pm 0.10$ $2.68\pm 0.08$ $2.01\pm 0.11$ $1.54\pm 0.11$ $2.79\pm 0.08$ $2.57\pm 0.09$ $2.11\pm 0.08$ Probit $2.91\pm 0.09$ $2.69\pm 0.09$ $2.11\pm 0.09$ $2.66\pm 0.08$ $2.05\pm 0.20$ $1.50\pm 0.10$ $2.78\pm 0.08$ $2.57\pm 0.09$ $2.12\pm 0.08$ DLNL $1.46\pm 0.08$ $1.32\pm 0.08$ $0.63\pm 0.07$ $1.20\pm 0.06$ $0.47\pm 0.09$ $-0.05\pm 0.10$ $1.39\pm 0.07$ $1.26\pm 0.08$ $0.84\pm 0.09$ IGC $2.92\pm 0.10$ $2.76\pm 0.10$ $2.09\pm 0.10$ $2.64\pm 0.08$ $1.93\pm 0.09$ $1.56\pm 0.11$ $2.87\pm 0.09$ $2.74\pm 0.12$ $2.32\pm 0.13$ Our2-Cats-L $2.95\pm 0.09$ $2.79\pm 0.09$ $1.91\pm 0.10$ $2.67\pm 0.08$ $2.01\pm 0.09$ $0.97\pm 0.10$ $2.90\pm 0.09$ $2.73\pm 0.10$ $2.13\pm 0.10$ 2-Cats-G $3.07\pm 0.14$ $2.90\pm 0.15$ $1.77\pm 0.10$ $3.04\pm 0.21$ $1.55\pm 0.11$ $0.96\pm 0.10$ $3.13\pm 0.15$ $2.60\pm 0.12$ $2.03\pm 0.13$

In both our synthetic and real data experiments, we considered the following approaches as Deep Learning baselines: Deep Archimedian Copulas (ACNET) [38]; Generative Archimedian Copulas (GEN-AC) [51]; Neural Likelihoods (NL) [13]; and Implicit Generative Copulas (IGC) [31]. Here, we use the same hyperparameters employed by the authors. We also considered several Non-Deep Learning baselines. These were the Parametric and Non-Parametric Copula Models from[49]. They are referenced as: The best Parametric approach from several known parametric copulas (Par), non-penalized Bernstein estimator (Bern), penalized Bernstein estimator (PBern), penalized linear B-spline estimator (PSPL1), penalized quadratic B-spline estimator (PSPL2), transformation local likelihood kernel estimator of degree $q=0$ (TTL0), degree $q=1$ (TTL1), and $q=2$ (TTL2), and $q=2$ with nearest neughbors (TTL2nn). We also considered the recent non-parametric approach by Mukhopadhyay and Parzen[47] (MR), as well as the Probit transformation-based estimator from[21].

To assess the performance of our model on synthetic bivariate datasets, we generated Gaussian copulas with correlations $\rho=0.1,0.5,0.9$ , and Clayton and Frank copulas with parameters $\theta=1,5,10$ . The marginal distributions were chosen as uncorrelated Normal distributions with parameters ( $\mu=0,\sigma=1$ ). We generated 1,500 training samples and 500 test samples. Results are for test samples. Before continuing, we point out that some deep methods require datasets to be scaled to [0, 1] intervals[38, 51]. Indeed, the author-supplied source code does not execute with data outside of this range (this choice was first used[38, Section 4.2] and followed by [51, Section 5.1.1]). For fair comparisons, in AppendixE, we present results for models with a min-max scaling of the data.

The results are presented in Table1, which contains the average negative log-likelihood on the test set for each copula. Results were estimated with a fixed seed of 30091985. The table also shows 95% Bootstrap Confidence Intervals[70]. Following suit with several guidelines, we choose to report CIs for better interpretability on the expected range of[20, 65, 3, 28, 19, 24, 30].

Initially, note that statistical ties (same average or overlapping intervals) are present throughout the table. The exception is for the NL[13] method, which severely underperforms in this setting, and the methods that did not execute (due to the required data scaling). When reading this table, it is imperative to consider that the simulated data comes from the models evaluated in the Par (first) row.Thus, parametric methods are naturally expected to overperform. Nevertheless, the statistical ties with such methods show how 2-Cats can capture this behavior even if it is a family-free approach.

		Boston	INTC-MSFT	GOOG-FB
Non-Deep Learn	Par	$-0.16\pm 0.09$	$-0.05\pm 0.07$	$-0.88\pm 0.08$
	Bern	$-0.09\pm 0.06$	$-0.03\pm 0.05$	$-0.70\pm 0.04$
	PBern	$-0.11\pm 0.07$	$-0.02\pm 0.06$	$-0.62\pm 0.04$
	PSPL1	$-0.06\pm 0.14$	$-0.02\pm 0.06$	$-0.72\pm 0.14$
	PSPL2	$-0.05\pm 0.14$	$-0.02\pm 0.06$	$-0.83\pm 0.07$
	TTL0	$-0.16\pm 0.08$	$-0.06\pm 0.06$	$-1.00\pm 0.07$
	TTL1	$-0.18\pm 0.09$	$-0.06\pm 0.06$	$-1.00\pm 0.08$
	TTL2	$-0.18\pm 0.09$	$-0.06\pm 0.06$	$-0.71\pm 0.16$
	TLL2	$-0.15\pm 0.09$	$-0.04\pm 0.05$	$-0.95\pm 0.13$
	MR	$-0.07\pm 0.06$	$-0.01\pm 0.05$	$-0.84\pm 0.07$
	Probit	$-0.10\pm 0.06$	$-0.03\pm 0.07$	$-0.92\pm 0.06$
Deep Learn	ACNet	$-0.28\pm 0.11$	$-0.18\pm 0.08$	$-0.91\pm 0.12$
	GEN-AC	$-0.29\pm 0.11$	$-0.17\pm 0.07$	$-0.75\pm 0.10$
	NL	$-0.28\pm 0.16$	$-0.16\pm 0.08$	$-1.09\pm 0.13$
	IGC	$0.07\pm 0.08$	$0.14\pm 0.04$	$-0.30\pm 0.02$
Our	2-Cats-L	$-0.30\pm 0.09$	$-0.21\pm 0.05$	$-1.75\pm 0.04$
Our	2-Cats-G	$-0.27\pm 0.07$	$-0.07\pm 0.06$	$-1.65\pm 0.08$

We now turn to real datasets. As was done in previous work, we employ the Boston housing dataset, the Intel-Microsoft (INTC-MSFT), and Google-Facebook (GOOG-FB) stock ratios. These pairs are commonly used in Copula research; in particular, they are the same datasets used by[38, 51]. We employed the same train/test and pre-processing as previous work for fair comparisons. Thus, the data is scaled with the same code as the literature, and all deep methods are executed.

The results are presented in Table2. The winning 2-Cats models are highlighted in the table in Green. The best, on average, baseline methods are in Red. We consider a method better or worse than another when their confidence intervals do not overlap; thus, ties with the best model are shown in underscores. Overall, the table highlights the superior performance of the 2-Cats models across all datasets. Only in one setting, 2-Cats-G on INTC-MSOFT, did the model underperform.

5.3 On the Lagrangian Term for P3

So far, 2-Cats presents itself as a promising approach for Copula modeling. Nevertheless, to sample from Copulas, we require that the marginals of the model come from a uniform distribution. This is our primary for the Lagrangian terms used to meet P3 (see Section3).

We trained 2-Cats models on the three real-world datasets with and without the Lagrangian optimization of Eq(6). Due to space constraints, we present results for 2-Cats-L only. Here, models were trained for 1000 iterations. No early stopping was performed as our focus was on meeting P3.

We compare both the average absolute deviations (i.e., $abs_{u}=\frac{1}{n}\sum_{i}|H_{\bm{\theta}}(u_{i},1)-u_{i}|$ and $abs_{v}=\frac{1}{n}\sum_{i}|H_{\bm{\theta}}(1,v_{i})-v_{i}|$ ) as well as the average relative deviations (i.e., $rel_{u}=100\,\frac{1}{n}\sum_{i}|H_{\bm{\theta}}(u_{i},1)-u_{i}|/u_{i}$ and $rel_{v}=100\,\frac{1}{n}\sum_{i}|H_{\bm{\theta}}(1,v_{i})-v_{i}|/v_{i}$ ) for models trained with and without constraints.

For the GOOG-FB dataset, significant gains were achieved with the constraints, i.e.: $abs_{u}$ went from $0.17$ to $0.09$ and $abs_{v}$ improved from $0.16$ to $0.03$ . $rel_{u}$ improved from $57\%$ to $35\%$ and $rel_{v}$ went from $54\%$ to $14\%$ . The negative log-likelihood reduced slightly when Lagrangian was used (from -1.70 to -1.14). Nevertheless, the model is still better than the baselines (see Table2). Results for the negative log-likelihood do not exactly match the previous ones, as no early stopping was done here. Even though relative errors may appear large, note that such scores are severely sensitive to the tail of the distributions. Here, even a small deviation (e.g., 0.001 to 0.0015) incurs a large relative increase.

For INTC-MSOFT, results were similar across $u$ and $v$ dimensions. As in both versions, no significant gains were observed in $abs_{u}$ nor $abs_{v}$ ; these scores are pretty small (below $0.009$ ) in both cases. In relative terms, an increase was observed in $abs_{v}$ from roughly 4% to 2%, with $abs_{u}$ being 4% in both cases. No significant changes were observed in the Boston dataset where errors were already minor $abs_{u}$ and $abs_{v}$ below $0.02$ in both instances, with relative errors ranging from 2% to 4% regardless of the constraints being employed. This shows that 2-Cats may already approximate the Uniform marginals even without constraints.

These results show the efficacy of our Lagrangian penalty. In some settings, such as GOOG-GB, constraints may present significant improvements. Using the Lagrangian term will depend on whether simulation is essential to the end-user (see AppendixB).

6 Conclusions

In this paper, we presented Copula Approximating Transform models (2-Cats).Different from the literature when training our models; we focus not only on capturing the pseudo-likelihood of Copulas but also on meeting or approximating several other properties, such as the partial derivatives of the $C$ function. Moreover, a second major innovation is proposing Sobolev training for Copulas. Overall, our results show the superiority of 2-Cats on real datasets.

A natural follow-up for 2-Cats is on using the Pair Copula Construction (PCC)[1, 15] to port Vines[16, 41, 23, 22] to incorporate our method. PCC is a well-established algorithm to go from a 2D Copula to an ND One using Vines. We also believe that evaluating other non-negative NNs for $m_{\bm{\theta}}$ is a promising direction[45, 55, 68, 62, 40, 36].

References

[1]Kjersti Aas, Claudia Czado, Arnoldo Frigessi, and Henrik Bakken.Pair-copula constructions of multiple dependence.Insurance: Mathematics and economics, 44(2), 2009.
[2]Mohomed Abraj, You-Gan Wang, and MHelen Thompson.A new mixture copula model for spatially correlated multiplevariables with an environmental application.Scientific Reports, 12(1), 2022.
[3]Douglas Altman, David Machin, Trevor Bryant, and Martin Gardner.Statistics with confidence: confidence intervals and statisticalguidelines.John Wiley & Sons, 2013.
[4]BarryC Arnold.Multivariate logistic distributions.Marcel Dekker New York, 1992.
[5]Yves INgounou Bakam and Denys Pommeret.Nonparametric estimation of copulas and copula densities byorthogonal projections.Econometrics and Statistics, 2023.
[6]Pierre Baldi.Deep learning in science.Cambridge University Press, 2021.
[7]Kevin Bello, Bryon Aragam, and Pradeep Ravikumar.Dagma: Learning dags via m-matrices and a log-determinant acyclicitycharacterization.In NeurIPS, 2022.
[8]DimitriP Bertsekas.Constrained optimization and Lagrange multiplier methods.Academic press, 2014.
[9]ChristopherM Bishop.Mixture density networks.Online Report., 1994.
[10]Yongqiang Cai.Achieve the minimum width of neural networks for universalapproximation.In ICLR, 2023.
[11]George Casella and RogerL Berger.Statistical inference.Cengage Learning, 2001.
[12]Umberto Cherubini, Elisa Luciano, and Walter Vecchiato.Copula methods in finance.John Wiley & Sons, 2004.
[13]Pawel Chilinski and Ricardo Silva.Neural likelihoods via cumulative distribution functions.In UAI, 2020.
[14]George Cybenko.Approximation by superpositions of a sigmoidal function.Mathematics of control, signals, and systems, 2(4), 1989.
[15]Claudia Czado.Pair-copula constructions of multivariate copulas.In Copula Theory and Its Applications. Springer, 2010.
[16]Claudia Czado and Thomas Nagler.Vine copula based modeling.Annual Review of Statistics and Its Application, 9, 2022.
[17]WojciechM Czarnecki, Simon Osindero, Max Jaderberg, Grzegorz Swirszcz, andRazvan Pascanu.Sobolev training for neural networks.In NeurIPS, 2017.
[18]Hennie Daniels and Marina Velikova.Monotone and partially monotone neural networks.IEEE Transactions on Neural Networks, 21(6), 2010.
[19]Pierre Dragicevic.Fair statistical communication in hci.Modern statistical methods for HCI, 2016.
[20]Martin Gardner and Douglas Altman.Confidence intervals rather than p values: estimation rather thanhypothesis testing.Br Med J (Clin Res Ed), 292(6522), 1986.
[21]Gery Geenens, Arthur Charpentier, and Davy Paindaveine.Probit transformation for nonparametric kernel estimation of thecopula density.Bernoulli, 2017.
[22]Christian Genest and Anne-Catherine Favre.Everything you always wanted to know about copula modeling but wereafraid to ask.Journal of hydrologic engineering, 12(4), 2007.
[23]Christian Genest, Kilani Ghoudi, and L-P Rivest.A semiparametric estimation procedure of dependence parameters inmultivariate families of distributions.Biometrika, 82(3), 1995.
[24]Sander Greenland, StephenJ Senn, KennethJ Rothman, JohnB Carlin, CharlesPoole, StevenN Goodman, and DouglasG Altman.Statistical tests, p values, confidence intervals, and power: a guideto misinterpretations.European journal of epidemiology, 31, 2016.
[25]Joshua Größer and Ostap Okhrin.Copulae: An overview and recent developments.Wiley Interdisciplinary Reviews: Computational Statistics,14(3), 2022.
[26]Peter Hall, RodneyCL Wolff, and Qiwei Yao.Methods for estimating a conditional distribution function.Journal of the American Statistical association, 94(445), 1999.
[27]Marcel Hirt, Petros Dellaportas, and Alain Durmus.Copula-like variational inference.In NeurIPS, 2019.
[28]Joses Ho, Tayfun Tumkaya, Sameer Aryal, Hyungwon Choi, and Adam Claridge-Chang.Moving beyond p values: data analysis with estimation graphics.Nature methods, 16(7), 2019.
[29]Kurt Hornik, Maxwell Stinchcombe, and Halbert White.Multilayer feedforward networks are universal approximators.Neural networks, 2(5), 1989.
[30]GuidoW Imbens.Statistical significance, p-values, and the reporting of uncertainty.Journal of Economic Perspectives, 35(3), 2021.
[31]Tim Janke, Mohamed Ghanmi, and Florian Steinke.Implicit generative copulas.In NeurIPS, 2021.
[32]GeorgeEm Karniadakis, IoannisG Kevrekidis, LuLu, Paris Perdikaris, SifanWang, and Liu Yang.Physics-informed machine learning.Nature Reviews Physics, 3(6), 2021.
[33]DiederikP Kingma and Jimmy Ba.Adam: A method for stochastic optimization.In ICLR, 2015.
[34]Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter.Self-normalizing neural networks.In NeurIPS, 2017.
[35]Ivan Kobyzev, SimonJD Prince, and MarcusA Brubaker.Normalizing flows: An introduction and review of current methods.IEEE transactions on pattern analysis and machine intelligence,43(11), 2020.
[36]Ryan Kortvelesy.Fixed integral neural networks.arXiv preprint arXiv:2307.14439, 2023.
[37]Georg Lindgren, Holger Rootzén, and Maria Sandsten.Stationary stochastic processes for scientists and engineers.CRC press, 2013.
[38]ChunKai Ling, Fei Fang, and JZico Kolter.Deep archimedean copulas.In NeurIPS, 2020.
[39]Yucong Liu.Neural networks are integrable.arXiv preprint arXiv:2310.14394, 2023.
[40]Lorenzo Loconte, Stefan Mengel, Nicolas Gillis, and Antonio Vergari.Negative mixture models via squaring: Representation and learning.In The 6th Workshop on Tractable Probabilistic Modeling, 2023.
[41]Rand KwongYew Low, Jamie Alco*ck, Robert Faff, and Timothy Brailsford.Canonical vine copulas in the context of modern portfolio management:Are they worth it?Asymmetric Dependence in Finance: Diversification, Correlationand Portfolio Management in Market Downturns, 2018.
[42]LuLu, Raphael Pestourie, Wenjie Yao, Zhicheng Wang, Francesc Verdugo, andStevenG Johnson.Physics-informed neural networks with hard constraints for inversedesign.SIAM Journal on Scientific Computing, 43(6), 2021.
[43]Yulong Lu and Jianfeng Lu.A universal approximation theorem of deep neural networks forexpressing probability distributions.In NeurIPS, 2020.
[44]Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang.The expressive power of neural networks: A view from the width.In NeurIPS, 2017.
[45]Ulysse Marteau-Ferey, Francis Bach, and Alessandro Rudi.Non-parametric models for non-negative functions.In NeurIPS, 2020.
[46]Sadegh Modiri, Santiago Belda, Mostafa Hoseini, Robert Heinkelmann, JoséMFerrándiz, and Harald Schuh.A new hybrid method to improve the ultra-short-term prediction oflod.Journal of geodesy, 94, 2020.
[47]Subhadeep Mukhopadhyay and Emanuel Parzen.Nonparametric universal copula modeling.Applied Stochastic Models in Business and Industry, 36(1),2020.
[48]Michael Naaman.On the tight constant in the multivariatedvoretzky–kiefer–wolfowitz inequality.Statistics & Probability Letters, 173, 2021.
[49]Thomas Nagler, Christian Schellhase, and Claudia Czado.Nonparametric estimation of simplified vine copula models: comparisonof methods.Dependence Modeling, 5(1), 2017.
[50]RogerB Nelsen.An introduction to copulas.Springer, 2006.
[51]Yuting Ng, Ali Hasan, Khalil Elkhalil, and Vahid Tarokh.Generative archimedean copulas.In UAI, 2021.
[52]TTin Nguyen, HienD Nguyen, Faicel Chamroukhi, and GeoffreyJ McLachlan.Approximation by finite mixtures of continuous density functions thatvanish at infinity.Cogent Mathematics & Statistics, 7(1), 2020.
[53]George Papamakarios, Theo Pavlakou, and Iain Murray.Masked autoregressive flow for density estimation.In NeurIPS, 2017.
[54]John Platt and Alan Barr.Constrained differential optimization.In NeurIPS, 1987.
[55]Alessandro Rudi and Carlo Ciliberto.Psd representations for effective probability models.In NeurIPS, 2021.
[56]GSalvadori and Carlo DeMichele.Frequency analysis via copulas: Theoretical aspects and applicationsto hydrological events.Water resources research, 40(12), 2004.
[57]Ulf Schepsmeier and Jakob Stöber.Derivatives and fisher information of bivariate copulas.Statistical papers, 55(2), 2014.
[58]Joseph Sill.Monotonic networks.In NeurIPS, 1997.
[59]BernardW Silverman.Density estimation for statistics and data analysis, volume26.CRC press, 1986.
[60]Abe Sklar.Random variables, distribution functions, and copulas: a personallook backward and forward.Lecture notes-monograph series, 1996.
[61]MSklar.Fonctions de répartition à n dimensions et leurs marges.Annales de l’ISUP, 8(3), 1959.
[62]AleksanteriMikulus Sladek, Martin Trapp, and Arno Solin.Encoding negative dependencies in probabilistic circuits.In The 6th Workshop on Tractable Probabilistic Modeling, 2023.
[63]Kihyuk Sohn, Honglak Lee, and Xinchen Yan.Learning structured output representation using deep conditionalgenerative models.In NeurIPS, 2015.
[64]BSohrabian.Geostatistical prediction through convex combination of archimedeancopulas.Spatial Statistics, 41, 2021.
[65]JonathanAC Sterne and GeorgeDavey Smith.Sifting the evidence—what’s wrong with significance tests?Physical therapy, 81(8), 2001.
[66]Natasa Tagasovska, Firat Ozdemir, and Axel Brando.Retrospective uncertainties for deep models using vine copulas.In AISTATS, 2023.
[67]Wen-Jen Tsay and Peng-Hsuan Ke.A simple approximation for the bivariate normal integral.Communications in Statistics-Simulation and Computation, 52(4),2023.
[68]Russell Tsuchida, ChengSoon Ong, and Dino Sejdinovic.Squared neural families: A new class of tractable density models.arXiv preprint arXiv:2305.13552, 2023.
[69]GRWalsh.Saddle-point property of lagrangian function.Methods of Optimization. New York: John Wiley & Sons, 1975.
[70]Larry Wasserman.All of statistics: a concise course in statistical inference,volume26.Springer, 2004.
[71]Antoine Wehenkel and Gilles Louppe.Unconstrained monotonic neural networks.In NeurIPS, 2019.
[72]Liuyang Yang, Jinghuai Gao, Naihao Liu, Tao Yang, and Xiudi Jiang.A coherence algorithm for 3-d seismic data analysis based on themutual information.IEEE Geoscience and Remote Sensing Letters, 16(6), 2019.
[73]Qingyang Zhang and Xuan Shi.A mixture copula bayesian network model for multimodal genomic data.Cancer Informatics, 16, 2017.

Appendix A On Derivatives and Integrals of NNs

In this appendix, we present an example of the issue of approximating derivatives/integrals of NNs.

A.1 An Example on why approximating derivatives and integrals fail

We begin with a pictorial example of the issue. Before doing so, we take some time to revise the universal approximation properties of NNs[14, 29, 43]. We also shift our focus from Copulas for this argument. It is expected to state that the universal approximation theorem (UAT) guarantees that functions are approximated with NNs. However, given that when learning models from training data, it would be more correct to say that function evaluations at training points are approximated.

2-Cats: 2d Copula Approximating Transforms (1)

Consider a simple 4-Layer Relu NN with layers of 128 neurons. This will be our model. To avoid overfitting, this model will be optimized with a validation set. The problem with relying only on the UAT is shown in Figure1. The figure shows a ground truth data-generating process of $y=\sin(x)+\eta$ , where $\eta$ is some noise term. On the first column, when $\eta=0$ , the model will approximate the data generating function (middle row), its derivative (bottom row), and integral (top row). When $\eta\sim N(0,0.5)$ , the model will approximate the data points, but not necessarily the derivatives.

This example may seem to contradict[17, 29] others that looked into the universal approximation properties of NN regarding derivative. We point out that this is not the case. In AppendixA.3, we show how the variance will increase when estimating the derivative, leading to the figure’s oscillating behavior. From the figure, also note that integrals are approximated up to an asymptotic constant (see AppendixA.2, and [39]); however, we still need to control for this constant (note the growing deviation on the top plot of the middle row).

We now present some results on the Integral (exemplified on the top row of Figure1) and Derivative (bottom row) that tie in with our example. Furthermore, we briefly discuss the last column of the figure. Notice that on the extreme left and extreme right, the derivative of the network begins to diverge from the underlying process. The NN was not trained with points sampled after these boundaries. Under data shifts (e.g., data from a similar process not seen on the training set), the issues we point out throughout this Appendix should increase.

A.2 On the Integral of Neural Networks

To present a view of the theoretical bounds on the error of approximation of integrals, let $\hat{f}_{\bm{\theta}}(x)$ be some NN parametrized by $\bm{\theta}$ . Also, let $f(x)$ represent our training dataset. We also make the reasonable assumption our dataset comes from some real function, $f^{\ast}(x)$ , sampled under a symmetric additive noise, i.e., $f(x)=f^{\ast}(x)+\eta$ , with an expected value equal to zero: $\mathbb{E}[\eta]=0$ . Gaussian noise is such a case.

If a NN is a universal approximator[14, 29, 42, 10, 44] and it is trained on the dataset $f(x)$ , we reach:

\lVert\hat{f}_{\bm{\theta}}(x)-f(x)\rVert_{p}<\epsilon_{1}

Most UAT proofs are valid for any $p$ norm ( $p$ norms bound one another), and assuming the 1 norm, we reach: $\hat{f}_{\bm{\theta}}(x)=f(x)\pm\epsilon$ for some constant $\epsilon$ . As a consequence, we reach:

\hat{f}_{\bm{\theta}}(x)=f^{\ast}(x)+\eta\pm\epsilon_{1}

Given that integration is a linear operator, we can show that:

E_{\eta}[\int\hat{f}_{\bm{\theta}}(x)]=\int f^{\ast}(x)\pm\epsilon_{2}.

To do so, we shall use the law of the unconscious statistician as follows:

	$\displaystyle g(x)$	$\displaystyle=\int\hat{f}{{}_{\theta}}(x)\,dx$
	$\displaystyle E_{\eta}[g(x)]$	$\displaystyle=E_{\eta}[\int\hat{f}_{\theta}(x)dx]$
		$\displaystyle=\int\int\hat{f}_{\theta}(x)p_{\eta}(n)\,dx\,dn$
		$\displaystyle=\int\int p_{\eta}(n)\hat{f}_{\theta}(x)\,dn\,dx$
		$\displaystyle=\int\int p_{\eta}(n)\big{(}f^{\ast}(x)+\eta\pm\epsilon_{1}\big{)%}\,dn\,dx$
		$\displaystyle=\int\int p_{\eta}(n)f^{\ast}(x)+p_{\eta}(n)\eta\pm p_{\eta}(n)%\epsilon_{1}\,dn\,dx$
		$\displaystyle=\int E_{\eta}[f^{\ast}(x)]+E_{\eta}[\eta]\pm E_{\eta}[\epsilon_{%1}]dx$
		$\displaystyle=\int E_{\eta}[f^{\ast}(x)]\pm E_{\eta}[\epsilon_{1}]dx$
		$\displaystyle=\int f^{\ast}(x)\pm\epsilon_{1}dx$
		$\displaystyle=\int f^{\ast}(x)dx\pm\int\epsilon_{1}dx=\int f^{\ast}(x)dx\pm%\epsilon_{2}dx$

where $\epsilon_{2}$ is a constant as $\epsilon_{1}$ is also a constant. However, this constant will grow depending on the integrating interval. This growing error is visible on the plot on the center top of Figure1.Controlling for $\epsilon_{2}$ will depend on the dataset’s quality and the training procedure.

A similar result is discussed in[39].

A.3 On the Derivative of Underlying Process and its impact on Neural Networks

To understand the oscillations on the derivatives of the example, let us rewrite the underlying process that generated the data as a Gaussian process. That is: $Y(x)=sin(x+\phi)+\eta$ . $Y$ is the stochastic process and $\phi\sim Uniform(-2\pi,2\pi)$ is some phase component necessary for stationarity. The variance of this process will be given by the kernel function $K(\tau)$ , which is now controlled by parameter $\tau$ . Given that this process has the same variance over time, $x$ , it is stationary, and a stationary Gaussian process has $K(\tau)=e^{-\frac{\tau}{2}}$ [37]. Before continuing, we note that in our example, we sampled from a Gaussian with a standard deviation of $\sigma=0.5$ and variance of $\sigma^{2}=0.25$ . This leads to a value of $\tau$ that is of $\tau=\approx 2.8$ since $e^{-\frac{2.8}{2}}\approx 0.25$ .

Now, let $Y^{\prime}(x)$ be the derivative of this process. This may be defined as:

\displaystyle\frac{Y(x+h)-Y(x)}{h}

\displaystyle\rightarrow Y^{\prime}(x)

if:

\displaystyle E\big{[}(\frac{Y(x+h)-Y(x)}{h}-Y^{\prime}(x))^{2}\big{]}

\displaystyle\rightarrow 0

when: $h\rightarrow 0$ .

Given that our process is stationary and has the expected value of the noise term as zero, we can estimate the value of the derivative process as zero, i.e., $E[Y^{\prime}(x)]=0$ [37]. We also have that the variance is equal to the second derivative of $K(\tau)$ : $Var[Y^{\prime}(x)]=\frac{\partial K}{\partial\tau\partial\tau}$ [37].

From here, we can derive $K$ twice to show that: $Var[Y^{\prime}(x)]=-e^{-\frac{\tau^{2}}{2}}+\tau^{2}e^{-\frac{\tau^{2}}{2}}$ . Recall that $Var[Y(x)]=\sigma^{2}=0.25$ and $\tau\approx 2.8$ . Now, by solving a simple inequality, we can show that the variance of the derivative process, $Var[Y^{\prime}(x)]$ , is greater than the variance of the original process $Var[Y(x)]$ , i.e., $Var[Y^{\prime}(x)]>Var[Y(x)]$ when $\tau>\sqrt{2}\approx 1.4$ (as is our example where $\tau\approx 2.8$ ).

This result explains the oscillation in the bottom middle plot of Figure1. The derivative process has a higher variation; thus, we expect a higher variance in estimations of the derivative.

Overall, even if an NN is a universal approximation of derivatives[17, 29], the variance of the noise term will likely increase (here we began with a variance of 0.25 and it is already sufficient to show such an increase) even for the derivative of the NN.

Appendix B Sampling from 2-Cats

We now detail how to sample from 2-Cats. Let the first derivatives of 2-Cats be: $h_{v}(u)=\Pr[U\leq u\mid V=v]=\frac{\partial H_{\bm{\theta}}(u,v)}{\partial v}$ , and $h_{u}(v)=\Pr[V\leq v\mid U=u]=\frac{\partial H_{\bm{\theta}}(u,v)}{\partial u}$ . Now, let $h^{\prime}_{v}(u)=\frac{\partial^{2}H_{\bm{\theta}}(u,v)}{\partial v^{2}}$ be the first derivative of $h_{v}(u)$ , and similarly $h^{\prime}_{u}(v)=\frac{\partial^{2}H_{\bm{\theta}}(u,v)}{\partial u^{2}}$ . These derivatives are readily available in the Hessian matrix that we estimate symbolically for 2-Cats (see Section3).

Now, let us determine the inverse of $h_{u}(v)$ , that is: $h^{-1}_{u}(v_{i})$ (the subscript stands for inverse). The same arguments are valid for $h_{v}(u)$ and $h^{-1}_{v}(u_{i})$ , and thus we omit them. Notice that with this inverse, we can sample from the CDF defined by $h_{u}(v)$ using the well-known Inverse transform sampling. Now, notice that by definition, $h_{u}(v)$ is already the derivative of $H_{\bm{\theta}}(u,v)$ with regards to $u$ .

Using a Legendre transform⁵⁵5https://en.wikipedia.org/wiki/Inverse_function_rule, the inverse of a derivative is:

h^{-1}_{u}(v_{i})=\big{(}h^{\prime}_{u}(v_{i})\big{)}^{-1}=\big{(}\frac{%\partial^{2}H_{\bm{\theta}}(u,v_{i})}{\partial u^{2}}\big{)}^{-1}

Where, again, the derivative inside the parenthesis is readily available to us via symbolic computation.

With such results in hand, the algorithm for sampling is:

1.
Generate two independent values $u\sim Uniform(0,1)$ and $v_{i}\sim Uniform(0,1)$ .
2.
Set $v=h^{-1}_{u}(v_{i})$ . This is an Inverse transform sampling for $v$
3.
Now we have the $(u,v)$ pair. We can again use Inverse transform sampling to determine:
1. (a)
  $x_{1}=F^{-1}_{x_{1}}(u)$
2. (b)
  $x_{2}=F^{-1}_{x_{2}}(v)$

Appendix C Ablation Study

We now provide an ablation study to understand the impact of our three losses. In this study, our model was trained without the Lagrangian terms of Property P3.

BOSTON
	$L^{C}$	$L^{dC}$	$L^{c}$
Only $L^{c}$ for training	0.115	0.033	-0.448
$L^{c}$ and $L^{dC}$ for training	0.107	0.035	-0.454
$L^{c}$ and $L^{C}$ for training	0.103	0.023	-0.236
All Three Losses (paper)	0.107	0.020	-0.630

INCT-MSOFT
	$L^{C}$	$L^{dC}$	$L^{c}$
Only $L^{c}$ for training	0.131	0.007	-0.327
$L^{c}$ and $L^{dC}$ for training	0.141	0.004	-0.314
$L^{c}$ and $L^{C}$ for training	0.137	0.007	-0.320
All Three Losses (paper)	0.141	0.018	-0.402

GOOG-FB
	$L^{C}$	$L^{dC}$	$L^{c}$
Only $L^{c}$ for training	0.166	6.88	-3.324
$L^{c}$ and $L^{dC}$ for training	0.133	0.066	-2.163
$L^{c}$ and $L^{C}$ for training	0.195	2.50	-3.201
All Three Losses (paper)	0.136	0.087	-2.881

The three tables (see Table3) of this section below present the values for $L^{C}$ (the squared error of the cumulative function C), $L^{dC}$ (the squared error for the first derivatives of C), and $L^{c}$ (the copula likelihood). We present one table for each real-world dataset.

$L^{c}$ is the score of interest when comparing models and the one we report in our paper. However, as discussed in AppendixA of our manuscript, when an NN approximates one aspect of a function (here being the Copula density $L^{c}$ ), the NN may miss other elements, such as the integral and derivatives of the function. This is the reason why we now present results for all three metrics. The metrics are in the columns of the tables. Each row presents a different 2-Cats approach using: (1) only $L^{c}$ for training (the metric of most interest); (2) $L^{c}$ and $L^{dC}$ ; $L^{c}$ and $L^{C}$ ; and, (3) all three losses (as in the main text).

From the tables, we can see that every loss plays a role when training the model. When using all three losses, gains/ties are achieved in 5 out of 9 cases (bold). When ignoring such terms, large losses (italic) will also arise

Appendix D Conjecture: A Mixture of 2-Cats is a Universal Copula Approximator

With sufficient components, any other CDF (or density function) may be approximated by a mixture of other CDFs[52]. As a consequence, a mixture of 2-Cats of the form:

H_{\bm{m},\bm{\Theta}}(u,v)=\sum_{i=1}^{k}w_{i}H_{i,\bm{\theta_{i}}}(u,v),

where $H_{i,\bm{\theta_{i}}}(u,v)$ is the $i$ -th 2-Cats model. $\bm{\Theta}$ is the parameter vector comprised of concatenating the individual parameters of each mixture component: $\bm{\Theta}=[\bm{\theta_{1}},\bm{\theta_{2}},\dots,\bm{\theta_{k}}]$ , and $\bm{m}=[m_{1},m_{2},\dots m_{k}]$ are real numbers related to the mixture weights.

This model may be trained as follows:

1.
Map mixture parameters to the $k-1$ Simplex: $\mathbf{w}=\text{softmax}(\bm{\theta}_{w})$ . $w_{i}$ if the $i$ -th position of this vector.
2.
When training, backpropagate to learn $\bm{\Theta}$ and $\bm{m}$ .

This is similar to the Mixture Density NN[9]. Over the last few years, mixture of Copulas approaches has been gaining traction in several fields[2, 64, 73]; here, we are proposing an NN variation. By definition, this is a valid Copula.

Appendix E Scaled Synthetic Data

Gaussian ( $\rho$ )Clayton ( $\theta$ )Frank/Joe ( $\theta$ )0.10.50.915101510Non-Deep LearnPar $-0.55\pm 0.12$ $-0.55\pm 0.13$ $-0.55\pm 0.16$ $-1.12\pm 0.09$ $-1.12\pm 0.16$ $-1.12\pm 0.21$ $-0.86\pm 0.09$ $-0.86\pm 0.12$ $-0.86\pm 0.15$ Bern $-0.54\pm 0.12$ $-0.54\pm 0.13$ $-0.54\pm 0.14$ $-1.10\pm 0.08$ $-1.10\pm 0.12$ $-1.10\pm 0.18$ $-0.85\pm 0.09$ $-0.85\pm 0.12$ $-0.85\pm 0.14$ PBern $-0.55\pm 0.11$ $-0.55\pm 0.14$ $-0.55\pm 0.14$ $-1.11\pm 0.08$ $-1.11\pm 0.12$ $-1.11\pm 0.18$ $-0.86\pm 0.09$ $-0.86\pm 0.13$ $-0.86\pm 0.15$ PSPL1 $-0.55\pm 0.11$ $-0.55\pm 0.13$ $-0.55\pm 0.14$ $-1.11\pm 0.07$ $-1.11\pm 0.15$ $-1.11\pm 0.30$ $-0.86\pm 0.09$ $-0.86\pm 0.12$ $-0.86\pm 0.18$ PSPL2 $-0.55\pm 0.11$ $-0.55\pm 0.13$ $-0.55\pm 0.15$ $-1.10\pm 0.08$ $-1.10\pm 0.14$ $-1.10\pm 0.24$ $-0.86\pm 0.09$ $-0.86\pm 0.12$ $-0.86\pm 0.17$ TTL0 $-0.54\pm 0.12$ $-0.54\pm 0.14$ $-0.54\pm 0.15$ $-1.11\pm 0.09$ $-1.11\pm 0.16$ $-1.11\pm 0.27$ $-0.86\pm 0.08$ $-0.86\pm 0.13$ $-0.86\pm 0.16$ TLL1 $-0.53\pm 0.12$ $-0.53\pm 0.14$ $-0.53\pm 0.16$ $-1.12\pm 0.08$ $-1.12\pm 0.17$ $-1.12\pm 0.32$ $-0.85\pm 0.09$ $-0.85\pm 0.12$ $-0.85\pm 0.17$ TLL2 $-0.54\pm 0.12$ $-0.54\pm 0.14$ $-0.54\pm 0.15$ $-1.12\pm 0.09$ $-1.12\pm 0.19$ $-1.12\pm 0.40$ $-0.86\pm 0.09$ $-0.86\pm 0.13$ $-0.86\pm 0.16$ TLL2nn $-0.55\pm 0.12$ $-0.55\pm 0.14$ $-0.55\pm 0.15$ $-1.12\pm 0.08$ $-1.12\pm 0.17$ $-1.12\pm 0.40$ $-0.86\pm 0.09$ $-0.86\pm 0.13$ $-0.86\pm 0.15$ MR $-0.54\pm 0.12$ $-0.54\pm 0.13$ $-0.54\pm 0.14$ $-1.08\pm 0.08$ $-1.08\pm 0.14$ $-1.08\pm 0.34$ $-0.84\pm 0.09$ $-0.84\pm 0.12$ $-0.84\pm 0.15$ Probit $-0.55\pm 0.12$ $-0.55\pm 0.13$ $-0.55\pm 0.15$ $-1.10\pm 0.08$ $-1.10\pm 0.14$ $-1.10\pm 0.29$ $-0.85\pm 0.09$ $-0.85\pm 0.12$ $-0.85\pm 0.16$ Deep LearnACNet $-0.06\pm 0.09$ $-0.29\pm 0.09$ $-1.05\pm 0.08$ $-0.51\pm 0.07$ $-0.91\pm 0.11$ $-1.35\pm 0.04$ $-0.13\pm 0.09$ $-0.49\pm 0.08$ $-0.58\pm 0.10$ GEN-AC $-0.06\pm 0.08$ $-0.29\pm 0.08$ $-1.06\pm 0.07$ $-0.52\pm 0.06$ $-0.93\pm 0.12$ $-1.34\pm 0.05$ $-0.15\pm 0.07$ $-0.49\pm 0.08$ $-0.56\pm 0.10$ NL $-0.34\pm 0.09$ $-0.43\pm 0.07$ $-1.07\pm 0.07$ $-0.64\pm 0.07$ $-1.17\pm 0.13$ $-1.03\pm 0.07$ $-0.42\pm 0.07$ $-0.59\pm 0.08$ $-0.57\pm 0.06$ IGC $-0.58\pm 0.07$ $-0.67\pm 0.06$ $-0.86\pm 0.06$ $-0.66\pm 0.06$ $-0.88\pm 0.06$ $-0.83\pm 0.06$ $-0.66\pm 0.06$ $-0.73\pm 0.05$ $-0.75\pm 0.07$ Our2-Cats-L $-0.48\pm 0.14$ $-0.63\pm 0.13$ $-1.20\pm 0.23$ $-0.64\pm 0.14$ $-1.24\pm 0.20$ $-1.69\pm 0.19$ $-0.52\pm 0.17$ $-1.34\pm 0.17$ $-1.34\pm 0.14$ 2-Cats-G $-0.38\pm 0.15$ $-0.45\pm 0.19$ $-1.04\pm 0.28$ $-0.44\pm 0.19$ $-1.27\pm 0.30$ $-1.06\pm 0.18$ $-0.51\pm 0.15$ $-1.12\pm 0.09$ $-1.06\pm 0.18$

Table4 presents results without scaling the input data as is done in Deep Learning methods. The table presents the average negative log-likelihood and the 95% confidence interval.

The colors and highlights on this table match the main text. The winning 2-Cats models are highlighted in the table in Green. The best, on average, baseline methods are in Red. The table shows that 2-Cats is better than baseline methods when the dependency ( $\rho$ or $\theta$ ) increases. For small dependencies, methods not based on Deep Learning outperform Deep ones (including 2-Cats). Nevertheless, 2-Cats is the winning method in 6 out of 9 datasets and is tied with the best in one case.

Appendix F A Flexible Variation of the Model

A Flexible 2-Cats model works similarly to our 2-Cats. However, the transforms are different:

1.
Let $m_{\bm{\theta}}:\mathbb{I}^{2}\mapsto\mathbb{R}+$ be an MLP outputs positive numbers. We achieve this by employing Elu plus one activation in every layer as in[71].

Define the transforms:

\displaystyle t_{v}(u)=\int_{0}^{u}m_{\bm{\theta}}(x,v)\,dx;\,\,t_{u}(v)=\int_%{0}^{v}m_{\bm{\theta}}(y,u)\,dy.

3.
The 2-Cats-FLEX hypothesis is now defined as the function: $H_{\bm{\theta}}(u,v)=G\big{(}(t_{v}(u),t_{u}(v)\big{)}$ , where $G(x,y)$ is any bivariate CDF on the $\mathbb{R}^{2}$ domain (e.g., the Bivariate Logistic or Bivariate Gaussian).

This model meets P2, but not P1 nor P3. As already stated in the introduction, the fact that $t_{v}(u)$ and $t_{u}(v)$ are monotonic, one-to-one, guarantees that $G(t_{v}(u),t_{u}(v))$ defines a valid probability transform to a new CDF. A major issue with this approach is that we have no guarantee that the model’s derivatives are conditional cumulative functions (first derivatives) or density functions (second derivatives). That’s why we call it Flexible (FLEX).

F.1 Full Results

	Boston	INTC-MSFT	GOOG-FB
Par	$-0.16\pm 0.09$	$-0.05\pm 0.07$	$-0.88\pm 0.08$
Bern	$-0.09\pm 0.06$	$-0.03\pm 0.05$	$-0.70\pm 0.04$
PBern	$-0.11\pm 0.07$	$-0.02\pm 0.06$	$-0.62\pm 0.04$
PSPL1	$-0.06\pm 0.14$	$-0.02\pm 0.06$	$-0.72\pm 0.14$
PSPL2	$-0.05\pm 0.14$	$-0.02\pm 0.06$	$-0.83\pm 0.07$
TTL0	$-0.16\pm 0.08$	$-0.06\pm 0.06$	$-1.00\pm 0.07$
TTL1	$-0.18\pm 0.09$	$-0.06\pm 0.06$	$-1.00\pm 0.08$
TTL2	$-0.18\pm 0.09$	$-0.06\pm 0.06$	$-0.71\pm 0.16$
TLL2	$-0.15\pm 0.09$	$-0.04\pm 0.05$	$-0.95\pm 0.13$
MR	$-0.07\pm 0.06$	$-0.01\pm 0.05$	$-0.84\pm 0.07$
Probit	$-0.10\pm 0.06$	$-0.03\pm 0.07$	$-0.92\pm 0.06$
ACNet	$-0.28\pm 0.11$	$-0.18\pm 0.08$	$-0.91\pm 0.12$
GEN-AC	$-0.29\pm 0.11$	$-0.17\pm 0.07$	$-0.75\pm 0.10$
NL	$-0.28\pm 0.16$	$-0.16\pm 0.08$	$-1.09\pm 0.13$
IGC	$0.07\pm 0.08$	$0.14\pm 0.04$	$-0.30\pm 0.02$
2-Cats-FLEX-L	$0.38\pm 0.71$	$0.19\pm 0.08$	$-1.15\pm 0.07$
2-Cats-FLEX-G	$0.50\pm 0.43$	$0.23\pm 0.09$	$-1.01\pm 0.07$
2-Cats-G	$-0.30\pm 0.09$	$-0.21\pm 0.05$	$-1.75\pm 0.04$
2-Cats-L	$-0.27\pm 0.07$	$-0.07\pm 0.06$	$-1.65\pm 0.08$

In Table5, we show the results for these models on real datasets. The definitions of the models are as follows: (2-Cats-P Gaus.) A parametric mixture of 10 Gaussian Copula densities.; (2-Cats-P Frank) Similar to the above, but is a mixture of 10 Frank Copula densities; (2-Cats-FLEX-G) A Flexible model version where the final activation layer is Bivariate Gaussian CDF; (2-Cats-FLEX-L) A Flexible version of the model where the final activation layer is Bivariate Logistic CDF.

Models were trained with the same hyperparameters and network definition as the ones in our main text.

Appendix G Appendix Baseline Source Code

		Link
Non-Deep Learn	Par	https://cran.r-project.org/web/packages/VineCopula/index.html
	Bern	https://cran.r-project.org/web/packages/kdecopula/index.html
	PBern	https://rdrr.io/cran/penRvine/
	PSPL1	https://rdrr.io/cran/penRvine/
	PSPL2	https://rdrr.io/cran/penRvine/
	TTL0	https://cran.r-project.org/web/packages/kdecopula/index.html
	TTL1	https://cran.r-project.org/web/packages/kdecopula/index.html
	TTL2	https://cran.r-project.org/web/packages/kdecopula/index.html
	TLL2	https://cran.r-project.org/web/packages/kdecopula/index.html
	MR	https://cran.r-project.org/web/packages/kdecopula/index.html
	Probit	https://cran.r-project.org/web/packages/kdecopula/index.html
Deep Learn	ACNet	https://github.com/lingchunkai/ACNet
	GEN-AC	https://github.com/yutingng/gen-AC
	NL	https://github.com/pawelc/NeuralLikelihoods0
	IGC	https://github.com/TimCJanke/igc

The links for baseline source code are in Table6.