2-Cats: 2d Copula Approximating Transforms (2024)

Flavio Figueiredo¹, José G. Fernandes¹, Jackson N. Silva and Renato Assunção¹,²
¹Universidade Federal de Minas Gerais
²ESRI Inc.
Reproducibility: https://anonymous.4open.science/r/2cats-E765/
Contact Author: flavio@dcc.ufmg.br

Abstract

Copulas are powerful statistical tools for capturing dependencies across data dimensions. Applying Copulas involves estimating independent marginals, a straightforward task, followed by the much more challenging task of determining a single copulating function, $C$, that links these marginals. For bivariate data, a copula takes the form of a two-increasing function $C: (u,v) \in \mathbb{I}^2 \rightarrow \mathbb{I}$, where $\mathbb{I} = [0,1]$. This paper proposes 2-Cats, a Neural Network (NN) model that learns two-dimensional Copulas without relying on specific Copula families (e.g., Archimedean). Furthermore, via both theoretical properties of the model and a Lagrangian training approach, we show that 2-Cats meets the desiderata of Copula properties. Moreover, inspired by the literature on Physics-Informed Neural Networks and Sobolev Training, we further extend our training strategy to learn not only the output of a Copula but also its derivatives. Our proposed method exhibits superior performance compared to the state-of-the-art across various datasets while respecting (provably for most, and approximately for a single other) the properties of $C$.

1 Introduction

Modeling univariate data is relatively straightforward for several reasons. Firstly, a wide range of probability distributions exists for different data types, including the Gaussian, Beta, Log-Normal, Gamma, and Poisson. This set is expanded by mixture models that use these elemental distributions as components and can capture multimodality and other behaviors. Secondly, visual tools such as histograms or Kernel Density Estimators (KDEs) provide valuable assistance in selecting the appropriate distribution. Lastly, univariate models typically involve few parameters and may have exact solutions, simplifying the modeling process.

However, the process becomes more challenging when modeling multivariate data and obtaining a joint probability distribution. Firstly, only a few classes of multivariate probability distributions are available, the primary examples being the multivariate Gaussian, Elliptical, and Dirichlet distributions. Secondly, simultaneously identifying the dependencies via conditional distributions based solely on empirical data is highly complex.

In 1959, Abe Sklar formalized the idea of Copulas [61, 60]. This statistical tool allows us to model a multivariate random variable of dimension $d$ by learning its cumulative multivariate distribution function (CDF) using $d$ independent marginal CDFs and a single additional Copulating function $C$. A bivariate example may separately identify one marginal as a Log-Normal distribution and the other as a Beta distribution. Subsequently, the Copula model links these two marginal distributions.

However, the Copula approach to modeling multivariate data faced a significant limitation until recently. Our model choices were confined to only a handful of closed forms for the Copula $C$, such as the Gaussian, Frank, or Clayton Copulas [16, 25]. Unfortunately, this approach using closed-form Copula models proved inadequate for accurately capturing the complex dependencies in real data. The limited parameterization of these Copulas prevented them from fully representing the intricate relationships between variables.

This situation has changed recently. Neural Networks (NNs) are well known for their universal ability to approximate functions. Consequently, researchers have started exploring NNs as alternatives to closed forms for $C$ [13, 27, 66, 31, 38, 51]. However, a limitation of these NN approaches is that they neglect the importance of maintaining the fundamental mathematical properties of their domain of study.

For Copulas, only three properties define $C$ [50]: P1: $C: (u,v) \in \mathbb{I}^2 \mapsto \mathbb{I}$, where $\mathbb{I} = [0,1]$; P2: for any $0 \leq u_1 < u_2 \leq 1$ and $0 \leq v_1 < v_2 \leq 1$, the volume of $C$ is non-negative: $V_C(u_1, u_2, v_1, v_2) \equiv C(u_2, v_2) - C(u_2, v_1) - C(u_1, v_2) + C(u_1, v_1) \geq 0$; and P3: $C$ is grounded, that is, $C(0, v) = 0$ and $C(u, 0) = 0$; moreover, $C(1, v) = v$ and $C(u, 1) = u$.
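
As a worked illustration (not part of the original text), the short sketch below numerically checks P1-P3 for the independence copula $C(u,v) = uv$; any candidate $C$ could be plugged into the same checks.

```python
import numpy as np

def C_indep(u, v):
    # Independence copula: C(u, v) = u * v.
    return u * v

rng = np.random.default_rng(0)
u1, v1 = rng.uniform(size=2) * 0.5                 # pick u1 < u2 and v1 < v2 inside [0, 1]
u2, v2 = u1 + 0.5 * rng.uniform(), v1 + 0.5 * rng.uniform()

# P1: the output stays in [0, 1] for inputs in the unit square.
assert 0.0 <= C_indep(u2, v2) <= 1.0

# P2: non-negative volume V_C(u1, u2, v1, v2) >= 0.
volume = C_indep(u2, v2) - C_indep(u2, v1) - C_indep(u1, v2) + C_indep(u1, v1)
assert volume >= 0.0

# P3: groundedness and uniform marginals.
assert C_indep(0.0, v1) == 0.0 and C_indep(u1, 0.0) == 0.0
assert np.isclose(C_indep(u1, 1.0), u1) and np.isclose(C_indep(1.0, v1), v1)
```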

This paper proposes 2-Cats (2D Copula Approximating Transforms) as an NN approach that meets the above desiderata. How the model meets P1 to P3 is discussed in Section 3. P1 and P2 are natural consequences of the model, whereas P3 is approximately enforced via Lagrangian optimization.

To understand our approach, let $H_{\bm{\theta}}(u,v)$ represent our model (or Hypothesis). Let $G: \mathbb{R}^2 \mapsto \mathbb{I}$ be any bivariate CDF with $\mathbb{R}^2$ as support. We connect $H_{\bm{\theta}}$ with $G$ as follows:

$$H_{\bm{\theta}}(u,v) = G\Big(z\big(t_v(u)\big),\, z\big(t_u(v)\big)\Big); \quad z(x) = \log\Big(\frac{x}{1-x}\Big);$$
$$t_v(u) = \frac{\int_0^u m_{\bm{\theta}}(x, v)\,dx}{\int_0^1 m_{\bm{\theta}}(x', v)\,dx'}; \quad t_u(v) = \frac{\int_0^v m_{\bm{\theta}}(u, y)\,dy}{\int_0^1 m_{\bm{\theta}}(u, y')\,dy'}. \tag{1}$$

Here, $m_{\bm{\theta}}(u,v)$ is any NN that outputs a positive number, and $z(x)$ is the logit function, which is monotonic and maps $[0,1]$ to $[-\infty, \infty]$, i.e., $z: \mathbb{I} \mapsto \mathbb{R}$. The function $t_v(u)$ increases monotonically in $u$ because the NN output is positive; similarly, $t_u(v)$ is monotonic in $v$.

The monotonicity of all transformations ensures that our output $H_{\bm{\theta}}(u,v)$ is monotonic in both $u$ and $v$. Let $z_u = z(t_v(u))$ and $z_v = z(t_u(v))$ to simplify notation, and let $(U, V)$ be a random vector with distribution function $H_{\bm{\theta}}(u,v)$. Then, the probability transform below is valid [11, Section 4.3]:

$$H_{\bm{\theta}}(u,v) = \Pr\big[Z_u \leq z_u, Z_v \leq z_v\big] = \Pr\big[t_v^{-1}(U) \leq t_v^{-1}(u),\, t_u^{-1}(V) \leq t_u^{-1}(v)\big] = \Pr\big[U \leq u, V \leq v\big] \tag{2}$$

To ensure that our model derivatives also approximate the derivatives of $C$, a second contribution of our work is that 2-Cats models are trained similarly to Physics-Informed Neural Networks (PINNs) [32, 42] by employing Sobolev training [17] to approximate Copula derivatives.

There have been several recent NN [13, 38, 51] and non-NN [49, 5, 47] Copula-like models. We compare our 2-Cats model with these alternatives, and our results show that 2-Cats performance (in negative log-likelihood) is better than or statistically tied to the baselines. Our contributions are:

  • We introduce the 2-Cats model, a novel NN Copula approximation. Unlike other NN approaches, we focus on satisfying (either provably or via constraints) the essential requirements of a Copula. We also demonstrate the empirical accuracy of 2-Cats;

  • We are the first to apply Sobolev training[17] in NN-based Copula methodologies. Our approach involves the introduction of a simple empirical estimator for the first derivative of a Copula, which is seamlessly integrated into our training framework.

2 Related Work

After being introduced by Sklar[61, 60], Copulas have emerged as valuable tools across various domains[25, 50], including engineering[56], geosciences[46, 72], and finance[12, 56]. Recently, Copulas have gained attention from the ML community, particularly with the advent of DL methods for Copula estimation and their utilization in generative models[13, 27, 66, 38, 31, 51].

Our proposed approach aligns closely with the emerging literature in this domain but makes significant advancements on two fronts. This paper primarily focuses on developing an NN-based Copula model that adheres to the fundamental properties of a Copula. Two methods sharing similar objectives are discussed in [38, 51] and [13]. However, our approach diverges from these prior works by not confining our method to Archimedean Copulas, as seen in[38] and [51].

Like Chilinski and Silva [13], we begin by estimating CDFs and obtaining densities via automatic differentiation. However, our primary difference lies in our emphasis on relating 2-Cats to a Copula, an aspect not prioritized by the authors of that study.

To assess the effectiveness of our approach, we conduct comparative analyses against previous proposals, including the method presented by Janke et al.[31], proposed as a generative model. Furthermore, 2-Cats does not presume a fixed density function for copula estimation; instead, it is guided by the neural network (NN). Hence, our work shares similarities with non-parametric copula estimators used as baselines[49, 47, 5]. Ultimately, all these methods function as baseline approaches against which we evaluate and demonstrate the superiority of our proposed approach.

For training our models, we utilize Sobolev training as a regularizer. Sobolev training estimates neural networks (NNs) by incorporating losses that penalize errors not only on the primary objective but also on its derivatives, ensuring a more comprehensive and accurate training of the NNs. A similar concept is explored in Physics-Informed Neural Networks (PINNs) [32, 35], where NNs incorporate derivative information from physical models, enabling them to capture complex physics-related relationships.

In summary, our objective is to develop an NN-based Copula model that respects the mathematical properties of Copulas and benefits from considering derivative information. Consequently, although our focus is on 2D Copulas, the Pair Copula Decomposition (PCC) [1, 15], which requires derivative information and is used to model $d$-dimensional data, is compatible with our models [16]. Due to space constraints, we leave the evaluation of such a decomposition for future work.

Our work also aligns with the literature on monotonic networks [58, 18, 71] and Integral Networks [36, 39]. The concept of using integral transforms to build monotonic models was introduced in [71].

3 On the Desiderata of a Copula and 2-Cats Properties

We now revisit the discussion on the 2-Cats model as introduced in Section 1.

One of the building blocks of 2-Cats is the NN $m_{\bm{\theta}}(u,v)$. In particular, we require that this NN outputs only positive numbers so that its integral is a monotonically increasing function. We achieve this by employing an Exponential Linear Unit plus one (ELU+1) activation in every layer.
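
As a minimal sketch (assuming a small fully connected architecture in JAX; the layer sizes and initialization are illustrative, not necessarily the authors' exact configuration), a positive-output NN with ELU+1 activations can be written as:

```python
import jax
import jax.numpy as jnp

def elu_plus_one(x):
    # ELU(x) + 1 > 0 for every real x, so every layer output stays positive.
    return jax.nn.elu(x) + 1.0

def init_mlp(key, sizes=(2, 16, 16, 1)):
    # Scaled-normal initialization (1 / sqrt(fan_in)); illustrative only.
    params = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        W = jax.random.normal(sub, (d_in, d_out)) / jnp.sqrt(d_in)
        params.append((W, jnp.zeros(d_out)))
    return params

def m_theta(params, u, v):
    # Positive scalar m_theta(u, v) whose integrals appear in Eq. (1).
    h = jnp.array([u, v])
    for W, b in params:
        h = elu_plus_one(h @ W + b)
    return h[0]

params = init_mlp(jax.random.PRNGKey(0))
print(m_theta(params, 0.3, 0.7))  # always > 0
```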

Let us consider the computation details of $t_v(u)$, with the understanding that the process for $t_u(v)$ is analogous. In our implementation, we adopt cumulative trapezoidal integration. Initially, we divide the range $[0,1]$ into $200$ equally spaced intervals. When computing $t_v(u)$, the value of $u$ is inserted into this grid while preserving the order. Subsequently, the NN $m_{\bm{\theta}}(x, v)$ is evaluated at the $201$ corresponding values of $x$. Trapezoidal integration is then applied to calculate the integral value at $u$ (for the numerator) and at $1$ (for the denominator). Finally, when $u = 0$, $t_v(0) = 0$.
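
Below is a hedged sketch of this procedure, reusing the illustrative m_theta and params from the previous sketch; the loop over grid points is kept simple for readability rather than vectorized.

```python
import jax.numpy as jnp

def t_v(params, u, v, n_intervals=200):
    # Grid over [0, 1] with u inserted while preserving order, as described above.
    grid = jnp.linspace(0.0, 1.0, n_intervals + 1)
    xs = jnp.sort(jnp.append(grid, u))
    ms = jnp.array([m_theta(params, x, v) for x in xs])   # positive NN evaluations

    # Cumulative trapezoidal rule: integral of m_theta(., v) from 0 up to each grid point.
    cum = jnp.concatenate([jnp.zeros(1),
                           jnp.cumsum(0.5 * (ms[1:] + ms[:-1]) * jnp.diff(xs))])
    numerator = cum[jnp.searchsorted(xs, u)]   # integral from 0 to u
    denominator = cum[-1]                      # integral from 0 to 1
    return numerator / denominator             # so t_v(0) = 0 and t_v(1) = 1

print(t_v(params, 0.3, 0.7))
```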

To compute the Jacobian and Hessian of $H_{\bm{\theta}}$, as is done in PINNs [32, 42], we rely on the automatic differentiation of modern frameworks such as Jax and Pytorch. In other words, $\frac{\partial H_{\bm{\theta}}(u,v)}{\partial u}$ and $\frac{\partial H_{\bm{\theta}}(u,v)}{\partial v}$ are obtained via the Jacobian (https://jax.readthedocs.io/en/latest/_autosummary/jax.jacobian.html), while $\frac{\partial^2 H_{\bm{\theta}}(u,v)}{\partial u \partial v}$ is obtained via the Hessian (https://jax.readthedocs.io/en/latest/_autosummary/jax.hessian.html).
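
The sketch below shows how jax.jacobian and jax.hessian provide these derivatives for a function of $(u, v)$; the closed-form H_theta used here (the independence copula) is only a stand-in for the trained model.

```python
import jax
import jax.numpy as jnp

def H_theta(uv):
    # Stand-in for the 2-Cats model: the independence copula H(u, v) = u * v.
    u, v = uv
    return u * v

uv = jnp.array([0.3, 0.7])

jac = jax.jacobian(H_theta)(uv)    # [dH/du, dH/dv]
hess = jax.hessian(H_theta)(uv)    # 2x2 matrix of second derivatives
dH_du, dH_dv = jac[0], jac[1]
d2H_dudv = hess[0, 1]              # mixed partial: the copula density

print(dH_du, dH_dv, d2H_dudv)      # 0.7, 0.3, 1.0 for the independence copula
```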

Now, let us present our main properties. Appendices B and D complement these properties.

Theorem P1.

$H_{\bm{\theta}}: (u,v) \in \mathbb{I}^2 \mapsto \mathbb{I}$, where $\mathbb{I} = [0,1]$.

Proof.

Recall that $z_u = z(t_v(u))$ and $z_v = z(t_u(v))$. Also, $(t_v(u), t_u(v)) \in \mathbb{I}^2$. These transforms also cover the entire range of $\mathbb{I}^2$ (when $u = 0$ we have $t_v(0) = 0$; when $u = 1$ we have $t_v(1) = 1$; the same goes for the $t_u(v)$ transform). The function $z$ maps the domain $\mathbb{I}^2$ into $\mathbb{R}^2$. Given that the bivariate CDF $G$ works on this domain and outputs values in $\mathbb{I}$, we have $H_{\bm{\theta}}: (u,v) \in \mathbb{I}^2 \mapsto \mathbb{I}$, where $\mathbb{I} = [0,1]$. ∎

Theorem P2.

The 2-Cats copula $H_{\bm{\theta}}(u,v)$ satisfies the non-negative volume property. That is, for any $0 \leq u_1 < u_2 \leq 1$ and $0 \leq v_1 < v_2 \leq 1$, we have $V_{H_{\bm{\theta}}}(u_1, u_2, v_1, v_2) \equiv H_{\bm{\theta}}(u_2, v_2) - H_{\bm{\theta}}(u_2, v_1) - H_{\bm{\theta}}(u_1, v_2) + H_{\bm{\theta}}(u_1, v_1) \geq 0$.

Intuition.

This is a straightforward consequence of $G$ being a bivariate cumulative distribution function and of the monotonicity of the $z$, $t_u$, and $t_v$ transforms. The transform we explored above for the univariate case is valid for multiple variables. That is, the fact that $z_u = z(t_v(u))$ and $z_v = z(t_u(v))$ are monotonic guarantees that $G(z_u, z_v)$ defines a valid transform, Eqn. (2).

Considering a bivariate CDF, for any given point in its domain (e.g., $(t_v(u), t_u(v))$), the CDF represents the accumulated probability up to that point. Let the origin be $(t_v(0), t_u(0))$. As either $t_v(u)$, $t_u(v)$, or both values increase, the CDF values can only increase or remain constant. Therefore, the volume under the CDF surface is always non-negative. Now, consider the following Corollary of Eq. (2). ∎

Corollary: The second derivative of 2-Cats is a pseudo-likelihood.

Let $F(x_1, x_2) = \Pr[X_1 \leq x_1, X_2 \leq x_2]$ be the bivariate CDF associated with RVs $X_1$ and $X_2$. With $u = F_{X_1}(x_1)$ and $v = F_{X_2}(x_2)$ being the marginal CDFs, we reach $F(x_1, x_2) = \Pr[U \leq u, V \leq v]$. By Eq. (2), $\Pr[U \leq u, V \leq v] = H_{\bm{\theta}}(u,v) = H_{\bm{\theta}}(F_{X_1}(x_1), F_{X_2}(x_2))$. Thus, 2-Cats captures the same bivariate CDF as $C$. The bivariate density, $f(x_1, x_2) = \frac{\partial^2 F(x_1, x_2)}{\partial x_1\, \partial x_2}$, is therefore:

$$f(x_1, x_2) = \frac{\partial^2 H_{\bm{\theta}}(F_{X_1}(x_1), F_{X_2}(x_2))}{\partial x_1\, \partial x_2} = f_{X_1}(x_1)\, f_{X_2}(x_2)\, \frac{\partial^2 H_{\bm{\theta}}(u,v)}{\partial u\, \partial v} \tag{3}$$

The above equation follows from the chain rule. A consequence is that the likelihood $f(x_1, x_2)$ is proportional to $h_{\bm{\theta}}(u,v) = \frac{\partial^2 H_{\bm{\theta}}(u,v)}{\partial u\, \partial v}$. This term is known as a pseudo-likelihood [23, 22]. ∎
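
To illustrate Eq. (3), the hedged sketch below evaluates a log-density by adding the marginal log-densities to the log of the copula's mixed second derivative; the Gaussian marginals and the independence stand-in H_theta (from the previous sketch) are assumptions for this example only.

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

def log_density(x1, x2):
    # Eq. (3) in log space: log f(x1, x2) = log f1(x1) + log f2(x2) + log h_theta(u, v),
    # with u, v the (here Gaussian) marginal CDFs evaluated at x1, x2.
    u, v = norm.cdf(x1), norm.cdf(x2)
    h = jax.hessian(H_theta)(jnp.array([u, v]))[0, 1]   # mixed second derivative
    return norm.logpdf(x1) + norm.logpdf(x2) + jnp.log(h)

print(log_density(0.1, -0.5))
```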

Proof of P2.

Given that $f(x_1, x_2) \geq 0$ by definition, and also $f_{X_1}(x_1) \geq 0$ and $f_{X_2}(x_2) \geq 0$, we have $h_{\bm{\theta}}(u,v) \geq 0$. The volume, being the double integral of $h_{\bm{\theta}}(u,v)$, is therefore always non-negative. ∎

Property P3.

$H_{\bm{\theta}}(u, 0) = 0$, $H_{\bm{\theta}}(0, v) = 0$, $H_{\bm{\theta}}(u, 1) \approx u$, and $H_{\bm{\theta}}(1, v) \approx v$. Notice that this property is a relaxation of the one presented in Section 1.

This last property of a Copula is the one that guarantees that the Copula has uniform marginals. In particular, the terms $C(u, 1) = u$ and $C(1, v) = v$ are of utmost importance for sampling. To grasp this, consider some valid Copula $C$. Here, $C(u, 1) = \Pr[U \leq u, V \leq 1] = \Pr[U \leq u] = u$, which is the CDF of the Uniform distribution. A similar argument holds for $v$.

2-Cats does not provably meet P3 as it does P1 and P2. Nevertheless, we prove that $H_{\bm{\theta}}(u, 0) = 0$ and $H_{\bm{\theta}}(0, v) = 0$, while approximating $H_{\bm{\theta}}(u, 1) \approx u$ and $H_{\bm{\theta}}(1, v) \approx v$.

Lemma P3.1.

$H_{\bm{\theta}}(u, 0) = 0$ and $H_{\bm{\theta}}(0, v) = 0$.

Proof.

Consider the value of $H_{\bm{\theta}}(u, 0)$ when $v = 0$. As $t_u(0) = 0$, we have $z(t_u(0)) = z(0) = \lim_{x \rightarrow 0} z(x) = -\infty$ and hence $H_{\bm{\theta}}(u, 0) = G(z_u, -\infty)$. Given that $G$ is a bivariate CDF, $G(z_u, -\infty) = \lim_{w \rightarrow -\infty} G(z_u, w) = 0$ for any $z_u$. The same proof holds for $H_{\bm{\theta}}(0, v)$. ∎

Conjecture P3.2.

$H_{\bm{\theta}}(u, 1) \approx u$ and $H_{\bm{\theta}}(1, v) \approx v$.

Meeting P3.2 via a Lagrangian.

To meet this relaxed property, we propose the following constrained optimization. Let the inputs of our model be the set $\mathcal{D} = \{u_i, v_i\}$, where $i \in [1, n = |\mathcal{D}|]$. Moreover, let $L_{\bm{\theta}}(D)$ be some loss term used to optimize our model (see the next section). Our optimization for 2-Cats focuses on the following constrained problem:

$$\arg\min_{\bm{\theta}} L_{\bm{\theta}}(D) \quad \text{s.t.} \quad \forall u \in [0,1]: H_{\bm{\theta}}(u, 1) - u = 0 \;\text{ and }\; \forall v \in [0,1]: H_{\bm{\theta}}(1, v) - v = 0. \tag{4}$$

Although it is trivial to understand why this optimization meets P3.2, performing such an optimization presents several challenges. The first challenge is how to model the constraints. A natural first choice is to consider the square of the constraints. However, squared constraint optimization is an ill-posed optimization problem [8, Chapter 2.1], [54, Section 2.1]. Secondly, our model may correctly estimate that $H_{\bm{\theta}}(u, 1) = u$ (or similarly that $H_{\bm{\theta}}(1, v) = v$), but this will not necessarily ensure that the partial derivatives of $H_{\bm{\theta}}(u, 1)$ and $H_{\bm{\theta}}(1, v)$ are accurate (see Appendix A.3). Such derivatives must equal one for these inputs, as the Copula marginals are distributed according to a $Uniform(0, 1)$. Thirdly, evaluating these constraints over the entire domain $u \in [0,1]$ and $v \in [0,1]$ is impossible. The final challenge is how to optimize the constraints. Here, a natural approach is to consider Lagrangian multipliers [54, 8, 69]. However, in such cases, the solution to the optimization problem will lie on a saddle point of the loss surface, and gradient methods (used in NNs and 2-Cats) do not converge on saddle points (Section 3.1 of [54] provides a simple intuition on why this is so).

We tackle the first and second challenges by considering the derivative of the square of the above constraints, i.e., $\frac{\partial (H_{\bm{\theta}}(u, 1) - u)^2}{\partial u}$ and $\frac{\partial (H_{\bm{\theta}}(1, v) - v)^2}{\partial v}$, which via the chain rule are defined as follows:

$$r^1_{\bm{\theta}}(u) = 2\,\big(H_{\bm{\theta}}(u, 1) - u\big)\Big(\frac{\partial H_{\bm{\theta}}(u, 1)}{\partial u} - 1\Big) \;\text{ and }\; r^2_{\bm{\theta}}(v) = 2\,\big(H_{\bm{\theta}}(1, v) - v\big)\Big(\frac{\partial H_{\bm{\theta}}(1, v)}{\partial v} - 1\Big). \tag{5}$$

$r^1_{\bm{\theta}}(u)$ and $r^2_{\bm{\theta}}(v)$ stand for requirements on $u$ and $v$, respectively. Both requirements are always positive or zero because $f(x) < 1 \iff \int_0^x f(x)\,dx < x$. Consequently, both requirements equal zero only when P3.2 is met (our goal). One advantage of this constraint is that it also considers our model's derivative, which tackles our second challenge above.
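
A hedged sketch of Eq. (5) follows, evaluating the boundary residuals with jax.grad on the independence stand-in H_theta from the earlier sketch; for a valid copula both residuals vanish.

```python
import jax
import jax.numpy as jnp

dH_du_at = jax.grad(lambda u, v: H_theta(jnp.array([u, v])), argnums=0)
dH_dv_at = jax.grad(lambda u, v: H_theta(jnp.array([u, v])), argnums=1)

def r1(u):
    # Eq. (5): derivative of the squared boundary constraint H(u, 1) - u = 0.
    return 2.0 * (H_theta(jnp.array([u, 1.0])) - u) * (dH_du_at(u, 1.0) - 1.0)

def r2(v):
    # Symmetric requirement for the boundary H(1, v) - v = 0.
    return 2.0 * (H_theta(jnp.array([1.0, v])) - v) * (dH_dv_at(1.0, v) - 1.0)

print(r1(0.3), r2(0.7))   # both are exactly 0 for the independence copula
```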

To tackle the third challenge, we evaluate the constraints on our training set. This leads to the optimization stated below, where $r_{\bm{\theta}}(D)$ denotes the constraints on the training data. From the arguments above, we have that $r_{\bm{\theta}}(D) \geq 0$, and $r_{\bm{\theta}}(D) = 0$ if and only if the constraints are met.

$$\hat{\bm{\theta}} = \arg\min_{\bm{\theta}} L_{\bm{\theta}}(D) + r_{\bm{\theta}}(D), \quad \text{where: } r_{\bm{\theta}}(D) = \sum_{i=1}^n r^1_{\bm{\theta}}(u_i) + \sum_{i=1}^n r^2_{\bm{\theta}}(v_i). \tag{6}$$

As in [7], to solve the above problem, we treat $r_{\bm{\theta}}(D)$ as a barrier and optimize as follows:

  1. Let $\lambda \in (0, 1]$ and $\alpha \in (0, 1)$ be hyper-parameters (e.g., we set $\lambda = 1$ and $\alpha = 0.95$). Also, let $\bm{\theta}^{(0)}$ be the initial model parameters (initialized as in [34], aka Lecun Normal).

  2. For every training iteration $t \in [1, T]$:

     (a) $\bm{\theta}^{(t)} = \text{stochastic gradient step}\big(\lambda\, L_{\bm{\theta}^{(t-1)}}(D) + r_{\bm{\theta}^{(t-1)}}(D)\big)$ via ADAM [33];

     (b) $\lambda = \alpha\, \lambda$.

  3. Set our model as $\hat{\bm{\theta}} = \bm{\theta}^{(T)}$.

As $t \rightarrow \infty$, we have that $\lambda \rightarrow 0$, leading our optimization to focus on the constraints. Even though we may not start our optimization at a feasible solution to $r_{\bm{\theta}^{(t-1)}}(D)$ [7], the fact that our loss surface is fully differentiable allows the optimizer to reach such a solution if it does not get stuck on some local minimum or saddle point. Nevertheless, we align with the empirical evidence that in overparametrized models, such as deep NNs, local minima are rare (see Chapter 6.7, The Optimality of Backpropagation, of [6]) and that stochastic optimizers such as ADAM are suitable for escaping saddle points [6]. Moreover, our initialization, $\bm{\theta}^{(0)}$, is suitable for satisfying regularizers [34].
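
The sketch below illustrates the decaying-barrier loop (steps 1-3 above) under simplifying assumptions: plain gradient descent replaces ADAM for brevity, and loss_fn and constraint_fn are placeholders for $L_{\bm{\theta}}(D)$ and $r_{\bm{\theta}}(D)$.

```python
import jax

def train(params, data, loss_fn, constraint_fn,
          lam=1.0, alpha=0.95, lr=1e-3, n_steps=1000):
    # Minimize lam * L_theta(D) + r_theta(D), shrinking lam each step so that
    # the constraint term dominates as training progresses.
    def objective(p, lam):
        return lam * loss_fn(p, data) + constraint_fn(p, data)

    grad_fn = jax.jit(jax.grad(objective))
    for _ in range(n_steps):
        grads = grad_fn(params, lam)
        params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
        lam = alpha * lam
    return params
```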

For comparison, we shall present results with and without these Lagrangian terms. ∎

4 Sobolev Losses for 2-Cats Models

We have already discussed how our approach ensures the validity of Copula functions. To train our models, we designed our loss function as a weighted sum, i.e.:

$$L_{\bm{\theta}}(\mathcal{D}) = w^C L^C_{\bm{\theta}}(\mathcal{D}) + w^{dC} L^{dC}_{\bm{\theta}}(\mathcal{D}) + w^{c} L^{c}_{\bm{\theta}}(\mathcal{D}) \tag{7}$$

Here, $w^C$, $w^{dC}$, and $w^{c}$ are loss weights.

The first component, $L^C_{\bm{\theta}}(\mathcal{D})$, stimulates the copula to closely resemble the empirical cumulative distribution function derived from the observed data. In this way, our model learns to capture the essential characteristics of the data. The second component, $L^{dC}_{\bm{\theta}}(\mathcal{D})$, imposes penalties on any disparities between the fitted copula's first-order derivatives and the data-based estimates. A Copula's first derivative is essential for sampling (see Appendix B) and for Vine methods. The third component, $L^{c}_{\bm{\theta}}(\mathcal{D})$, focuses on the copula's second-order derivative, which is linked to the probability density function of the data, and evaluates its proximity to the empirical likelihood. Incorporating this aspect in our loss function enhances the model's ability to capture the data's distribution.

From the arguments above, all three terms play a role in Copula modeling. While obtaining the first component is relatively straightforward, the other two components necessitate some original work. We explain each of them in turn. Before continuing, we point out that Appendix A provides an example of how NNs behave when estimating derivatives and integrals of functions.

Let $F(x_1, x_2) = \Pr[X_1 \leq x_1, X_2 \leq x_2]$ be the bivariate CDF for some random vector $\mathbf{x}$ and $\widehat{F^n}(\mathbf{x})$ be the empirical cumulative distribution function (ECDF) estimated from a sample of size $n$. The Glivenko-Cantelli theorem states that $\widehat{F^n}(x_1, x_2) \xrightarrow{\text{a.s.}} F(\mathbf{x})$ as $n \rightarrow \infty$ [48].

We can explore these results to define the relationship between our model, the ECDF, and the true CDF. With $\hat{\bm{\theta}}$ being the estimated parameters, and given that the CDF $F(x_1, x_2)$ and the Copula $C(u,v)$ evaluate to the same value (i.e., $u = F_{X_1}(x_1)$ and $v = F_{X_2}(x_2)$ are the marginal CDFs evaluated at $x_1$ and $x_2$, respectively):

$$C(u, v) = F(F_{X_1}^{-1}(u), F_{X_2}^{-1}(v)) = F(x_1, x_2) \approx \widehat{F^n}(x_1, x_2) \approx H_{\hat{\bm{\theta}}}(u, v) = H_{\hat{\bm{\theta}}}(\widehat{F_{X_1}}(x_1), \widehat{F_{X_2}}(x_2)).$$

Considering these definitions, we can define the loss functions based on the ECDF and the output:

$$L^C_{\bm{\theta}}(\mathcal{D}) = \frac{1}{n} \sum_{u_i, v_i \in \mathcal{D}} \big(H_{\bm{\theta}}(u_i, v_i) - \widehat{F^n}(x_{i,1}, x_{i,2})\big)^2. \tag{8}$$
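
A hedged sketch of Eq. (8) is given below: a brute-force bivariate ECDF, pseudo-observations built from ranks, and the mean squared error against a stand-in model; all names are illustrative.

```python
import numpy as np

def bivariate_ecdf(x):
    # x has shape (n, 2); the ECDF at sample i counts samples j with x_j <= x_i coordinate-wise.
    n = x.shape[0]
    below = (x[None, :, 0] <= x[:, None, 0]) & (x[None, :, 1] <= x[:, None, 1])
    return below.sum(axis=1) / n

def loss_C(H_theta_fn, u, v, ecdf_values):
    # Eq. (8): mean squared error between the model and the empirical CDF.
    preds = np.array([H_theta_fn(ui, vi) for ui, vi in zip(u, v)])
    return np.mean((preds - ecdf_values) ** 2)

x = np.random.default_rng(0).normal(size=(500, 2))
n = x.shape[0]
u = (np.argsort(np.argsort(x[:, 0])) + 1) / (n + 1)   # pseudo-observations: rank / (n + 1)
v = (np.argsort(np.argsort(x[:, 1])) + 1) / (n + 1)
print(loss_C(lambda a, b: a * b, u, v, bivariate_ecdf(x)))   # independence copula as stand-in
```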

Now, for our second term, $L^{dC}_{\bm{\theta}}(\mathcal{D})$, a natural step would be to define a mean squared error (or similar) loss towards the first derivatives. The issue is that we need empirical estimates from data for both derivatives: $\frac{\partial C(u,v)}{\partial u}$ and $\frac{\partial C(u,v)}{\partial v}$. To estimate such derivatives, we could have employed conditional density estimators [53, 63, 26]. Nevertheless, methods such as [53, 63] are also deep NNs that have the drawback of requiring extra parameters to be estimated. Even classical methods such as [26] suffer from this issue. As a minor contribution, we explore the underlying properties of a Copula to present an empirical approximation for such derivatives.

For a 2d Copula, the first derivative of $C$ has the form [50]: $\frac{\partial C(u,v)}{\partial u} = \Pr\big[V \leq v \mid U = u\big]$.

The issue with estimating this function from data is that we cannot filter our dataset to condition on $U = u$ (we are working with data where $u \in \mathbb{I}$, a real number). However, using Bayes' rule we can rewrite this equation. First, let us rewrite it in terms of density functions. That is, $c_u(v) = F(V \leq v \mid U = u) = \Pr[V \leq v \mid U = u]$ is a cumulative function, while $c(u)$ and $c(v)$ are the marginal copula densities (uniform by definition). Using Bayes' rule, we have that:

\[
\frac{\partial C(u,v)}{\partial u}=c_u(v)=\frac{c\big(U=u\mid V\leq v\big)\,F\big(V\leq v\big)}{c(u)}=v\,c\big(U=u\mid V\leq v\big).\tag{9}
\]

The last identity above comes from the fact that, for Copulas, the marginal distributions of $U$ and $V$ are uniform, leading to $c(u)=c(v)=1$. We also have that $F(V\leq v)=\Pr[V\leq v]=v$ and $F(U\leq u)=\Pr[U\leq u]=u$ [50]. It remains to estimate $c(U=u\mid V\leq v)$.

To do so, we employ the widely used Kernel Density Estimation (KDE) to obtain, from $n$ data points, an empirical estimate of $c(U=u\mid V\leq v)$. A fast algorithm works as follows: we arrange our dataset as a table of two columns, where each row contains a pair $(u,v)$. For efficiency, we create views of this table sorted on column $u$ or on column $v$. When we iterate over the table sorted on $v$, finding the points where $V\leq v$ is trivial, as these are simply the previously iterated rows. Performing KDE on column $u$ for these points yields an empirical estimate of the density $c(U=u\mid V\leq v)$. Multiplying this estimate by $v$ gives our empirical estimate of the partial derivative with respect to $u$, denoted $\frac{\widehat{\partial C^{n}}(u,v)}{\partial u}$. Similarly, we can estimate $\frac{\widehat{\partial C^{n}}(u,v)}{\partial v}$ (a sketch of this estimator is shown after Eq. (10)). Our loss term is then:

\[
L^{dC}_{\bm{\theta}}(\mathcal{D})=\frac{1}{2n}\sum_{u_i,v_i\in\mathcal{D}}\Big(\big(\tfrac{\partial H_{\bm{\theta}}(u_i,v_i)}{\partial u}-\tfrac{\widehat{\partial C^{n}}(u_i,v_i)}{\partial u}\big)^{2}+\big(\tfrac{\partial H_{\bm{\theta}}(u_i,v_i)}{\partial v}-\tfrac{\widehat{\partial C^{n}}(u_i,v_i)}{\partial v}\big)^{2}\Big).\tag{10}
\]
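A minimal sketch of the empirical derivative estimator described above, assuming plain NumPy and a Gaussian kernel with a Silverman-style bandwidth (the exact KDE configuration is an implementation choice, not fixed by the text):

import numpy as np

def empirical_dC_du(u, v, eps=1e-6):
    """Estimate dC/du at every observed pair: v * KDE of u among rows with V <= v."""
    n = len(u)
    order = np.argsort(v)                     # view of the table sorted on v
    est = np.zeros(n)
    for rank, i in enumerate(order):
        cond_u = u[order[: rank + 1]]         # previously iterated rows: V <= v[i]
        m = len(cond_u)
        h = 1.06 * max(cond_u.std(), eps) * m ** (-1 / 5)   # Silverman-style bandwidth
        kde = np.exp(-0.5 * ((u[i] - cond_u) / h) ** 2) / (h * np.sqrt(2 * np.pi))
        est[i] = v[i] * kde.mean()            # v * c(U = u_i | V <= v_i)
    return est

# The estimate of dC/dv is obtained symmetrically by swapping the roles of u and v.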

Finally, we focus on the last part of our loss, the pseudo-likelihood:

\[
L^{c}_{\bm{\theta}}(\mathcal{D})=-\frac{1}{n}\sum_{u_i,v_i\in\mathcal{D}}\log\Big(\frac{\partial^{2}H_{\bm{\theta}}(u_i,v_i)}{\partial u\,\partial v}\Big).\tag{11}
\]
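Because $H_{\bm{\theta}}$ is a NN, the mixed derivative in Eq. (11) can be taken symbolically with automatic differentiation. A minimal Jax sketch, again with a placeholder H_theta for the per-pair forward pass:

import jax
import jax.numpy as jnp

def copula_density(params, H_theta, u, v):
    # d^2 H / (du dv) for a single scalar pair (u, v)
    mixed = jax.grad(jax.grad(H_theta, argnums=1), argnums=2)
    return mixed(params, u, v)

def loss_c(params, H_theta, u, v, eps=1e-12):
    dens = jax.vmap(lambda uu, vv: copula_density(params, H_theta, uu, vv))(u, v)
    return -jnp.mean(jnp.log(dens + eps))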

It is essential to understand why all three losses are needed. It might seem that minimizing the pseudo-likelihood alone would be sufficient. However, while integrals of NNs are also approximations up to an asymptotic constant [39], we still need to ensure that this constant is acceptable (see Appendix A.2). Moreover, Appendix C presents an ablation study on the impact of all three losses.

5 Experimental Results

We now present our experimental results on three fronts: (1) validating our empirical estimates for the first derivative, a crucial step when training 2-Cats; (2) evaluating 2-Cats on synthetic and real datasets without the Lagrangian term; and (3) evaluating the impact of the Lagrangian term.

Before presenting results, we note that our models were implemented using Jax (https://jax.readthedocs.io/en/latest//) and Flax (https://flax.readthedocs.io). Given that Jax does not implement CDF methods for Multivariate Normal distributions, we ported a fast approximation of bivariate CDFs [67] and employed it for our models that rely on Gaussian CDFs/Copulas. Our code is available at https://anonymous.4open.science/r/2cats-E765/. Baseline methods were evaluated using the author-supplied code (details are in Appendix G).

5.1 Empirical First Derivative Estimator

First, we validate our empirical estimations for the first derivative of Copulas. We use the closed-form equations for the first derivatives of Gaussian, Clayton, and Frank Copulas described in [57].

These Copulas have a single parameter, which we varied. The Gaussian Copula has a correlation parameter (ρ), which we set to 0.1, 0.5, and 0.9. The Clayton and Frank Copulas' mixing parameter (θ) was set to 1, 5, and 10. For each configuration, we measured the coefficient of determination (R²) between the empirical estimates and the exact formulae. Overall, R² was above 0.899 in every setting, validating that our estimates are quite accurate in practice.
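As an illustration of this validation, the sketch below samples from a Clayton Copula, evaluates its standard closed-form h-function, and computes R² against the empirical estimator; empirical_dC_du is the hypothetical function from the sketch after Eq. (10), and the sample size and θ value below are illustrative.

import numpy as np

def sample_clayton(n, theta, rng):
    # Standard conditional-inverse sampler for the Clayton copula.
    u = rng.uniform(size=n)
    w = rng.uniform(size=n)
    v = (u ** (-theta) * (w ** (-theta / (1 + theta)) - 1) + 1) ** (-1 / theta)
    return u, v

def clayton_dC_du(u, v, theta):
    # Closed-form dC/du (the Clayton h-function).
    return u ** (-theta - 1) * (u ** (-theta) + v ** (-theta) - 1) ** (-1 / theta - 1)

rng = np.random.default_rng(0)
u, v = sample_clayton(1500, theta=5.0, rng=rng)
exact = clayton_dC_du(u, v, theta=5.0)
approx = empirical_dC_du(u, v)                     # from the earlier sketch
r2 = 1 - np.sum((exact - approx) ** 2) / np.sum((exact - exact.mean()) ** 2)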

5.2 2-Cats on Datasets (Without Lagrangian Term)

We now turn to our main comparisons. Our primary focus, as with the baseline methods, is on capturing the data likelihood (Eq. (3)). In our experiments, the PDFs for $X_1$ and $X_2$ were estimated via KDE. The bandwidth for these estimates is determined using Silverman's rule [59].

As is commonly done, we evaluate the natural log of this quantity as the data log-likelihood and use its negative to compare 2-Cats with other methods. With this metric in place, we considered two 2-Cats models: (1) 2-Cats-G (final-layer Gaussian): 2-Cats with a Gaussian bivariate CDF as $G$; and (2) 2-Cats-L (final-layer Logistic): 2-Cats with a Logistic bivariate CDF as $G$.

Thus, we considered two forms of $G$. The first was the CDF of a bivariate Gaussian distribution [67]. The second is the CDF of the Flexible bivariate Logistic (see Section 11.8 of [4]):

\[
F(x_1,x_2)=\Pr[X_1<x_1,X_2<x_2]=\Big(\big(1+e^{-\alpha\frac{x_1-\mu_1}{\sigma_1}}+e^{-\alpha\frac{x_2-\mu_2}{\sigma_2}}+e^{-\alpha\left(\frac{x_1-\mu_1}{\sigma_1}+\frac{x_2-\mu_2}{\sigma_2}\right)}\big)^{\frac{1}{\alpha}}\Big)^{-1}
\]

$\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$ and $\alpha$ are free parameters optimized by the model. The same goes for the free parameters of the bivariate Gaussian CDF (the means of each dimension, $\mu_1$ and $\mu_2$, and the correlation parameter $\rho$). Moreover, for each model, we employed four-layer networks with hidden layers of sizes 128, 64, 32, and 16. A short sketch of this Logistic CDF is shown below; we then discuss loss weights.
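A minimal Jax sketch of the flexible bivariate Logistic CDF above, with mu1, mu2, sigma1, sigma2 and alpha as the free parameters described:

import jax.numpy as jnp

def logistic_bivariate_cdf(x1, x2, mu1, mu2, sigma1, sigma2, alpha):
    z1 = (x1 - mu1) / sigma1
    z2 = (x2 - mu2) / sigma2
    inner = 1.0 + jnp.exp(-alpha * z1) + jnp.exp(-alpha * z2) + jnp.exp(-alpha * (z1 + z2))
    return inner ** (-1.0 / alpha)   # ((inner)^(1/alpha))^(-1)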

Recall that the data density is proportional to the pseudo-likelihood. We primarily emphasized the likelihood loss ($w^c$), as it accentuates our pivotal component. Notably, the scale of its values differs from that of the MSE components, necessitating a proportional adjustment in weight selection. In light of these factors, we fixed $w^{C}=0.01$, $w^{dC}=0.5$, and $w^{c}=0.1$. These hyper-parameters were sufficient to show how our approach improves on the literature. Our models were trained using early stopping: for each training epoch, we evaluated the pseudo-log-likelihood on the training set and kept the weights of the epoch with the best likelihood on the training dataset.
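For concreteness, a sketch of the combined objective with these fixed weights. loss_C and loss_c refer to the earlier sketches (hypothetical names), and loss_dC below implements Eq. (10) by comparing Jax derivatives of the placeholder forward pass H_theta with the precomputed empirical estimates:

import jax
import jax.numpy as jnp

def loss_dC(params, H_theta, u, v, dC_du_hat, dC_dv_hat):
    dH_du = jax.vmap(jax.grad(H_theta, argnums=1), in_axes=(None, 0, 0))(params, u, v)
    dH_dv = jax.vmap(jax.grad(H_theta, argnums=2), in_axes=(None, 0, 0))(params, u, v)
    return 0.5 * (jnp.mean((dH_du - dC_du_hat) ** 2) + jnp.mean((dH_dv - dC_dv_hat) ** 2))

def total_loss(params, H_theta, batch, w_C=0.01, w_dC=0.5, w_c=0.1):
    u, v, x1, x2, dC_du_hat, dC_dv_hat = batch
    return (w_C * loss_C(params, H_theta, u, v, x1, x2)
            + w_dC * loss_dC(params, H_theta, u, v, dC_du_hat, dC_dv_hat)
            + w_c * loss_c(params, H_theta, u, v))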

Table 1: Results on the synthetic test sets (values are mean ± 95% bootstrap CI).

                 Gaussian (ρ)                        Clayton (θ)                         Frank/Joe (θ)
                 0.1         0.5         0.9         1           5           10          1           5           10
Non-Deep Learn
  Par            2.91±0.09   2.69±0.09   2.09±0.09   2.65±0.09   1.89±0.11   1.39±0.10   2.78±0.08   2.57±0.09   2.10±0.09
  Bern           2.91±0.09   2.69±0.09   2.15±0.09   2.66±0.08   2.06±0.09   1.76±0.08   2.78±0.08   2.58±0.09   2.15±0.08
  PBern          2.91±0.09   2.69±0.09   2.15±0.10   2.67±0.09   2.12±0.10   1.92±0.08   2.78±0.08   2.57±0.09   2.13±0.08
  PSPL1          2.91±0.09   2.69±0.09   2.15±0.11   2.67±0.09   2.14±0.18   1.61±0.13   2.78±0.08   2.57±0.09   2.10±0.09
  PSPL2          2.91±0.09   2.69±0.09   2.13±0.10   2.67±0.09   2.02±0.11   1.62±0.10   2.78±0.08   2.57±0.09   2.11±0.09
  TTL0           2.92±0.09   2.69±0.09   2.09±0.09   2.65±0.08   2.00±0.21   1.46±0.09   2.79±0.08   2.57±0.09   2.12±0.09
  TLL1           2.92±0.09   2.70±0.10   2.09±0.09   2.65±0.09   1.98±0.21   1.43±0.10   2.79±0.08   2.57±0.09   2.17±0.20
  TLL2           2.91±0.09   2.75±0.20   2.08±0.09   2.65±0.09   2.03±0.25   1.41±0.10   2.79±0.09   2.57±0.09   2.11±0.09
  TLL2nn         2.91±0.09   2.69±0.09   2.09±0.10   2.65±0.09   1.93±0.11   1.42±0.10   2.78±0.08   2.57±0.09   2.12±0.09
  MR             2.91±0.09   2.70±0.09   2.16±0.10   2.68±0.08   2.01±0.11   1.54±0.11   2.79±0.08   2.57±0.09   2.11±0.08
  Probit         2.91±0.09   2.69±0.09   2.11±0.09   2.66±0.08   2.05±0.20   1.50±0.10   2.78±0.08   2.57±0.09   2.12±0.08
Deep Learn
  NL             1.46±0.08   1.32±0.08   0.63±0.07   1.20±0.06   0.47±0.09   -0.05±0.10  1.39±0.07   1.26±0.08   0.84±0.09
  IGC            2.92±0.10   2.76±0.10   2.09±0.10   2.64±0.08   1.93±0.09   1.56±0.11   2.87±0.09   2.74±0.12   2.32±0.13
Our
  2-Cats-L       2.95±0.09   2.79±0.09   1.91±0.10   2.67±0.08   2.01±0.09   0.97±0.10   2.90±0.09   2.73±0.10   2.13±0.10
  2-Cats-G       3.07±0.14   2.90±0.15   1.77±0.10   3.04±0.21   1.55±0.11   0.96±0.10   3.13±0.15   2.60±0.12   2.03±0.13

In both our synthetic and real data experiments, we considered the following Deep Learning baselines: Deep Archimedean Copulas (ACNet) [38]; Generative Archimedean Copulas (GEN-AC) [51]; Neural Likelihoods (NL) [13]; and Implicit Generative Copulas (IGC) [31]. We use the same hyperparameters employed by the authors. We also considered several non-Deep Learning baselines, namely the parametric and non-parametric Copula models from [49]: the best parametric approach among several known parametric copulas (Par), the non-penalized Bernstein estimator (Bern), the penalized Bernstein estimator (PBern), the penalized linear B-spline estimator (PSPL1), the penalized quadratic B-spline estimator (PSPL2), and the transformation local likelihood kernel estimators of degree q=0 (TTL0), q=1 (TTL1), q=2 (TTL2), and q=2 with nearest neighbors (TTL2nn). We also considered the recent non-parametric approach by Mukhopadhyay and Parzen [47] (MR), as well as the Probit transformation-based estimator from [21].

To assess the performance of our model on synthetic bivariate datasets, we generated Gaussian Copulas with correlations ρ = 0.1, 0.5, 0.9, and Clayton and Frank Copulas with parameters θ = 1, 5, 10. The marginal distributions were chosen as uncorrelated Normal distributions (μ = 0, σ = 1). We generated 1,500 training samples and 500 test samples; reported results are on the test samples. Before continuing, we point out that some deep methods require datasets to be scaled to the [0, 1] interval [38, 51]. Indeed, the author-supplied source code does not execute with data outside of this range (this choice was first made in [38, Section 4.2] and followed by [51, Section 5.1.1]). For fair comparisons, Appendix E presents results for models with a min-max scaling of the data.

The results are presented in Table 1, which contains the average negative log-likelihood on the test set for each copula. Results were estimated with a fixed seed of 30091985. The table also shows 95% Bootstrap Confidence Intervals [70]. Following several guidelines [20, 65, 3, 28, 19, 24, 30], we choose to report CIs for better interpretability of the expected range of results.
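For reference, a minimal sketch of one way to compute such intervals (a percentile bootstrap over per-sample negative log-likelihoods; the exact bootstrap variant is an assumption):

import numpy as np

def bootstrap_ci(neg_log_liks, n_boot=10000, seed=30091985):
    rng = np.random.default_rng(seed)
    n = len(neg_log_liks)
    means = np.array([rng.choice(neg_log_liks, size=n, replace=True).mean()
                      for _ in range(n_boot)])
    return np.percentile(means, [2.5, 97.5])   # 95% percentile interval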

Initially, note that statistical ties (same average or overlapping intervals) are present throughout the table. The exceptions are the NL [13] method, which severely underperforms in this setting, and the methods that did not execute (due to the required data scaling). When reading this table, it is imperative to consider that the simulated data comes from the models evaluated in the Par (first) row. Thus, parametric methods are naturally expected to overperform. Nevertheless, the statistical ties with such methods show how 2-Cats can capture this behavior even though it is a family-free approach.

Table 2: Results on the real datasets (values are mean ± 95% bootstrap CI).

                 Boston        INTC-MSFT     GOOG-FB
Non-Deep Learn
  Par            -0.16±0.09    -0.05±0.07    -0.88±0.08
  Bern           -0.09±0.06    -0.03±0.05    -0.70±0.04
  PBern          -0.11±0.07    -0.02±0.06    -0.62±0.04
  PSPL1          -0.06±0.14    -0.02±0.06    -0.72±0.14
  PSPL2          -0.05±0.14    -0.02±0.06    -0.83±0.07
  TTL0           -0.16±0.08    -0.06±0.06    -1.00±0.07
  TTL1           -0.18±0.09    -0.06±0.06    -1.00±0.08
  TTL2           -0.18±0.09    -0.06±0.06    -0.71±0.16
  TLL2           -0.15±0.09    -0.04±0.05    -0.95±0.13
  MR             -0.07±0.06    -0.01±0.05    -0.84±0.07
  Probit         -0.10±0.06    -0.03±0.07    -0.92±0.06
Deep Learn
  ACNet          -0.28±0.11    -0.18±0.08    -0.91±0.12
  GEN-AC         -0.29±0.11    -0.17±0.07    -0.75±0.10
  NL             -0.28±0.16    -0.16±0.08    -1.09±0.13
  IGC             0.07±0.08     0.14±0.04    -0.30±0.02
Our
  2-Cats-L       -0.30±0.09    -0.21±0.05    -1.75±0.04
  2-Cats-G       -0.27±0.07    -0.07±0.06    -1.65±0.08

We now turn to real datasets. As was done in previous work, we employ the Boston housing dataset and the Intel-Microsoft (INTC-MSFT) and Google-Facebook (GOOG-FB) stock ratios. These pairs are commonly used in Copula research; in particular, they are the same datasets used by [38, 51]. We employed the same train/test split and pre-processing as previous work for fair comparisons. Thus, the data is scaled with the same code as the literature, and all deep methods execute.

The results are presented in Table 2. The winning 2-Cats models are highlighted in green in the original table, and the best baseline methods (on average) in red. We consider a method better or worse than another when their confidence intervals do not overlap; thus, ties with the best model are underlined. Overall, the table highlights the superior performance of the 2-Cats models across all datasets. Only in one setting, 2-Cats-G on INTC-MSFT, did the model underperform.

5.3 On the Lagrangian Term for P3

So far, 2-Cats presents itself as a promising approach for Copula modeling. Nevertheless, to sample from Copulas, we require that the marginals of the model follow a uniform distribution. This is our primary motivation for the Lagrangian terms used to meet P3 (see Section 3).

We trained 2-Cats models on the three real-world datasets with and without the Lagrangian optimization of Eq. (6). Due to space constraints, we present results for 2-Cats-L only. Here, models were trained for 1000 iterations. No early stopping was performed, as our focus was on meeting P3.

We compare both the average absolute deviations, $abs_u=\frac{1}{n}\sum_i|H_{\bm{\theta}}(u_i,1)-u_i|$ and $abs_v=\frac{1}{n}\sum_i|H_{\bm{\theta}}(1,v_i)-v_i|$, as well as the average relative deviations, $rel_u=100\,\frac{1}{n}\sum_i|H_{\bm{\theta}}(u_i,1)-u_i|/u_i$ and $rel_v=100\,\frac{1}{n}\sum_i|H_{\bm{\theta}}(1,v_i)-v_i|/v_i$, for models trained with and without constraints.
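A minimal sketch of these diagnostics, with H_theta again a placeholder for the per-pair 2-Cats forward pass:

import jax
import jax.numpy as jnp

def marginal_deviations(params, H_theta, u, v):
    Hu = jax.vmap(H_theta, in_axes=(None, 0, None))(params, u, 1.0)   # H(u, 1), ideally = u
    Hv = jax.vmap(H_theta, in_axes=(None, None, 0))(params, 1.0, v)   # H(1, v), ideally = v
    abs_u = jnp.mean(jnp.abs(Hu - u))
    abs_v = jnp.mean(jnp.abs(Hv - v))
    rel_u = 100.0 * jnp.mean(jnp.abs(Hu - u) / u)
    rel_v = 100.0 * jnp.mean(jnp.abs(Hv - v) / v)
    return abs_u, abs_v, rel_u, rel_v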

For the GOOG-FB dataset, significant gains were achieved with the constraints: $abs_u$ went from 0.17 to 0.09 and $abs_v$ improved from 0.16 to 0.03; $rel_u$ improved from 57% to 35% and $rel_v$ went from 54% to 14%. The negative log-likelihood increased slightly when the Lagrangian was used (from -1.70 to -1.14). Nevertheless, the model is still better than the baselines (see Table 2). Results for the negative log-likelihood do not exactly match the previous ones, as no early stopping was done here. Even though relative errors may appear large, note that such scores are severely sensitive to the tails of the distributions: even a small deviation (e.g., 0.001 to 0.0015) incurs a large relative increase.

For INTC-MSFT, results were similar across the u and v dimensions. No significant gains were observed in $abs_u$ or $abs_v$; these scores are already quite small (below 0.009) in both versions. In relative terms, $rel_v$ improved from roughly 4% to 2%, with $rel_u$ being 4% in both cases. No significant changes were observed on the Boston dataset, where errors were already minor: $abs_u$ and $abs_v$ were below 0.02 in both instances, with relative errors ranging from 2% to 4% regardless of whether the constraints were employed. This shows that 2-Cats may already approximate the Uniform marginals even without constraints.

These results show the efficacy of our Lagrangian penalty. In some settings, such as GOOG-FB, the constraints provide significant improvements. Whether to use the Lagrangian term will depend on whether simulation is essential to the end-user (see Appendix B).

6 Conclusions

In this paper, we presented Copula Approximating Transform models (2-Cats). Differently from the literature, when training our models we focus not only on capturing the pseudo-likelihood of Copulas but also on meeting or approximating several other properties, such as the partial derivatives of the $C$ function. A second major innovation is the proposal of Sobolev training for Copulas. Overall, our results show the superiority of 2-Cats on real datasets.

A natural follow-up for 2-Cats is using the Pair Copula Construction (PCC) [1, 15] to port Vines [16, 41, 23, 22] to incorporate our method. PCC is a well-established algorithm to go from a 2d Copula to an n-dimensional one using Vines. We also believe that evaluating other non-negative NNs for $m_{\bm{\theta}}$ is a promising direction [45, 55, 68, 62, 40, 36].

References

  • [1] Kjersti Aas, Claudia Czado, Arnoldo Frigessi, and Henrik Bakken. Pair-copula constructions of multiple dependence. Insurance: Mathematics and Economics, 44(2), 2009.
  • [2] Mohomed Abraj, You-Gan Wang, and M. Helen Thompson. A new mixture copula model for spatially correlated multiple variables with an environmental application. Scientific Reports, 12(1), 2022.
  • [3] Douglas Altman, David Machin, Trevor Bryant, and Martin Gardner. Statistics with Confidence: Confidence Intervals and Statistical Guidelines. John Wiley & Sons, 2013.
  • [4] Barry C. Arnold. Multivariate Logistic Distributions. Marcel Dekker, New York, 1992.
  • [5] Yves I. Ngounou Bakam and Denys Pommeret. Nonparametric estimation of copulas and copula densities by orthogonal projections. Econometrics and Statistics, 2023.
  • [6] Pierre Baldi. Deep Learning in Science. Cambridge University Press, 2021.
  • [7] Kevin Bello, Bryon Aragam, and Pradeep Ravikumar. DAGMA: Learning DAGs via M-matrices and a log-determinant acyclicity characterization. In NeurIPS, 2022.
  • [8] Dimitri P. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Academic Press, 2014.
  • [9] Christopher M. Bishop. Mixture density networks. Online report, 1994.
  • [10] Yongqiang Cai. Achieve the minimum width of neural networks for universal approximation. In ICLR, 2023.
  • [11] George Casella and Roger L. Berger. Statistical Inference. Cengage Learning, 2001.
  • [12] Umberto Cherubini, Elisa Luciano, and Walter Vecchiato. Copula Methods in Finance. John Wiley & Sons, 2004.
  • [13] Pawel Chilinski and Ricardo Silva. Neural likelihoods via cumulative distribution functions. In UAI, 2020.
  • [14] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4), 1989.
  • [15] Claudia Czado. Pair-copula constructions of multivariate copulas. In Copula Theory and Its Applications. Springer, 2010.
  • [16] Claudia Czado and Thomas Nagler. Vine copula based modeling. Annual Review of Statistics and Its Application, 9, 2022.
  • [17] Wojciech M. Czarnecki, Simon Osindero, Max Jaderberg, Grzegorz Swirszcz, and Razvan Pascanu. Sobolev training for neural networks. In NeurIPS, 2017.
  • [18] Hennie Daniels and Marina Velikova. Monotone and partially monotone neural networks. IEEE Transactions on Neural Networks, 21(6), 2010.
  • [19] Pierre Dragicevic. Fair statistical communication in HCI. Modern Statistical Methods for HCI, 2016.
  • [20] Martin Gardner and Douglas Altman. Confidence intervals rather than p values: estimation rather than hypothesis testing. Br Med J (Clin Res Ed), 292(6522), 1986.
  • [21] Gery Geenens, Arthur Charpentier, and Davy Paindaveine. Probit transformation for nonparametric kernel estimation of the copula density. Bernoulli, 2017.
  • [22] Christian Genest and Anne-Catherine Favre. Everything you always wanted to know about copula modeling but were afraid to ask. Journal of Hydrologic Engineering, 12(4), 2007.
  • [23] Christian Genest, Kilani Ghoudi, and L.-P. Rivest. A semiparametric estimation procedure of dependence parameters in multivariate families of distributions. Biometrika, 82(3), 1995.
  • [24] Sander Greenland, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, and Douglas G. Altman. Statistical tests, p values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology, 31, 2016.
  • [25] Joshua Größer and Ostap Okhrin. Copulae: An overview and recent developments. Wiley Interdisciplinary Reviews: Computational Statistics, 14(3), 2022.
  • [26] Peter Hall, Rodney C. L. Wolff, and Qiwei Yao. Methods for estimating a conditional distribution function. Journal of the American Statistical Association, 94(445), 1999.
  • [27] Marcel Hirt, Petros Dellaportas, and Alain Durmus. Copula-like variational inference. In NeurIPS, 2019.
  • [28] Joses Ho, Tayfun Tumkaya, Sameer Aryal, Hyungwon Choi, and Adam Claridge-Chang. Moving beyond p values: data analysis with estimation graphics. Nature Methods, 16(7), 2019.
  • [29] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 1989.
  • [30] Guido W. Imbens. Statistical significance, p-values, and the reporting of uncertainty. Journal of Economic Perspectives, 35(3), 2021.
  • [31] Tim Janke, Mohamed Ghanmi, and Florian Steinke. Implicit generative copulas. In NeurIPS, 2021.
  • [32] George Em Karniadakis, Ioannis G. Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang. Physics-informed machine learning. Nature Reviews Physics, 3(6), 2021.
  • [33] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [34] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In NeurIPS, 2017.
  • [35] Ivan Kobyzev, Simon J. D. Prince, and Marcus A. Brubaker. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11), 2020.
  • [36] Ryan Kortvelesy. Fixed integral neural networks. arXiv preprint arXiv:2307.14439, 2023.
  • [37] Georg Lindgren, Holger Rootzén, and Maria Sandsten. Stationary Stochastic Processes for Scientists and Engineers. CRC Press, 2013.
  • [38] Chun Kai Ling, Fei Fang, and J. Zico Kolter. Deep Archimedean copulas. In NeurIPS, 2020.
  • [39] Yucong Liu. Neural networks are integrable. arXiv preprint arXiv:2310.14394, 2023.
  • [40] Lorenzo Loconte, Stefan Mengel, Nicolas Gillis, and Antonio Vergari. Negative mixture models via squaring: Representation and learning. In The 6th Workshop on Tractable Probabilistic Modeling, 2023.
  • [41] Rand Kwong Yew Low, Jamie Alcock, Robert Faff, and Timothy Brailsford. Canonical vine copulas in the context of modern portfolio management: Are they worth it? Asymmetric Dependence in Finance: Diversification, Correlation and Portfolio Management in Market Downturns, 2018.
  • [42] Lu Lu, Raphael Pestourie, Wenjie Yao, Zhicheng Wang, Francesc Verdugo, and Steven G. Johnson. Physics-informed neural networks with hard constraints for inverse design. SIAM Journal on Scientific Computing, 43(6), 2021.
  • [43] Yulong Lu and Jianfeng Lu. A universal approximation theorem of deep neural networks for expressing probability distributions. In NeurIPS, 2020.
  • [44] Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural networks: A view from the width. In NeurIPS, 2017.
  • [45] Ulysse Marteau-Ferey, Francis Bach, and Alessandro Rudi. Non-parametric models for non-negative functions. In NeurIPS, 2020.
  • [46] Sadegh Modiri, Santiago Belda, Mostafa Hoseini, Robert Heinkelmann, José M. Ferrándiz, and Harald Schuh. A new hybrid method to improve the ultra-short-term prediction of LOD. Journal of Geodesy, 94, 2020.
  • [47] Subhadeep Mukhopadhyay and Emanuel Parzen. Nonparametric universal copula modeling. Applied Stochastic Models in Business and Industry, 36(1), 2020.
  • [48] Michael Naaman. On the tight constant in the multivariate Dvoretzky–Kiefer–Wolfowitz inequality. Statistics & Probability Letters, 173, 2021.
  • [49] Thomas Nagler, Christian Schellhase, and Claudia Czado. Nonparametric estimation of simplified vine copula models: comparison of methods. Dependence Modeling, 5(1), 2017.
  • [50] Roger B. Nelsen. An Introduction to Copulas. Springer, 2006.
  • [51] Yuting Ng, Ali Hasan, Khalil Elkhalil, and Vahid Tarokh. Generative Archimedean copulas. In UAI, 2021.
  • [52] T. Tin Nguyen, Hien D. Nguyen, Faicel Chamroukhi, and Geoffrey J. McLachlan. Approximation by finite mixtures of continuous density functions that vanish at infinity. Cogent Mathematics & Statistics, 7(1), 2020.
  • [53] George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In NeurIPS, 2017.
  • [54] John Platt and Alan Barr. Constrained differential optimization. In NeurIPS, 1987.
  • [55] Alessandro Rudi and Carlo Ciliberto. PSD representations for effective probability models. In NeurIPS, 2021.
  • [56] G. Salvadori and Carlo De Michele. Frequency analysis via copulas: Theoretical aspects and applications to hydrological events. Water Resources Research, 40(12), 2004.
  • [57] Ulf Schepsmeier and Jakob Stöber. Derivatives and Fisher information of bivariate copulas. Statistical Papers, 55(2), 2014.
  • [58] Joseph Sill. Monotonic networks. In NeurIPS, 1997.
  • [59] Bernard W. Silverman. Density Estimation for Statistics and Data Analysis, volume 26. CRC Press, 1986.
  • [60] Abe Sklar. Random variables, distribution functions, and copulas: a personal look backward and forward. Lecture Notes–Monograph Series, 1996.
  • [61] M. Sklar. Fonctions de répartition à n dimensions et leurs marges. Annales de l'ISUP, 8(3), 1959.
  • [62] Aleksanteri Mikulus Sladek, Martin Trapp, and Arno Solin. Encoding negative dependencies in probabilistic circuits. In The 6th Workshop on Tractable Probabilistic Modeling, 2023.
  • [63] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In NeurIPS, 2015.
  • [64] B. Sohrabian. Geostatistical prediction through convex combination of Archimedean copulas. Spatial Statistics, 41, 2021.
  • [65] Jonathan A. C. Sterne and George Davey Smith. Sifting the evidence—what's wrong with significance tests? Physical Therapy, 81(8), 2001.
  • [66] Natasa Tagasovska, Firat Ozdemir, and Axel Brando. Retrospective uncertainties for deep models using vine copulas. In AISTATS, 2023.
  • [67] Wen-Jen Tsay and Peng-Hsuan Ke. A simple approximation for the bivariate normal integral. Communications in Statistics - Simulation and Computation, 52(4), 2023.
  • [68] Russell Tsuchida, Cheng Soon Ong, and Dino Sejdinovic. Squared neural families: A new class of tractable density models. arXiv preprint arXiv:2305.13552, 2023.
  • [69] G. R. Walsh. Saddle-point property of Lagrangian function. In Methods of Optimization. John Wiley & Sons, New York, 1975.
  • [70] Larry Wasserman. All of Statistics: A Concise Course in Statistical Inference, volume 26. Springer, 2004.
  • [71] Antoine Wehenkel and Gilles Louppe. Unconstrained monotonic neural networks. In NeurIPS, 2019.
  • [72] Liuyang Yang, Jinghuai Gao, Naihao Liu, Tao Yang, and Xiudi Jiang. A coherence algorithm for 3-D seismic data analysis based on the mutual information. IEEE Geoscience and Remote Sensing Letters, 16(6), 2019.
  • [73] Qingyang Zhang and Xuan Shi. A mixture copula Bayesian network model for multimodal genomic data. Cancer Informatics, 16, 2017.

Appendix A On Derivatives and Integrals of NNs

In this appendix, we present an example of the issue of approximating derivatives/integrals of NNs.

A.1 An Example on why approximating derivatives and integrals fail

We begin with a pictorial example of the issue. Before doing so, we take some time to revise the universal approximation properties of NNs [14, 29, 43]. We also shift our focus away from Copulas for this argument. It is common to state that the universal approximation theorem (UAT) guarantees that functions are approximated with NNs. However, given that models are learned from training data, it would be more correct to say that function evaluations at training points are approximated.

[Figure 1: Fits of a ReLU NN to y = sin(x) + η. Rows show the integral (top), the function (middle), and the derivative (bottom); columns contrast the noiseless and noisy cases.]

Consider a simple 4-layer ReLU NN with layers of 128 neurons. This will be our model. To avoid overfitting, this model is optimized with a validation set. The problem with relying only on the UAT is shown in Figure 1. The figure shows a ground-truth data-generating process of $y=\sin(x)+\eta$, where $\eta$ is a noise term. In the first column, when $\eta=0$, the model approximates the data-generating function (middle row), its derivative (bottom row), and integral (top row). When $\eta\sim N(0,0.5)$, the model approximates the data points, but not necessarily the derivatives.
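A minimal sketch of this toy experiment in Jax. The architecture follows the description above, while the initialization, optimizer, learning rate, and number of steps are illustrative assumptions and may need tuning for a close match to Figure 1.

import jax
import jax.numpy as jnp

def init_mlp(key, sizes):
    params = []
    for din, dout in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        w = jax.random.normal(sub, (din, dout)) * jnp.sqrt(2.0 / din)
        params.append((w, jnp.zeros(dout)))
    return params

def mlp(params, x):
    h = jnp.atleast_1d(x)                      # accept a scalar input
    for w, b in params[:-1]:
        h = jax.nn.relu(h @ w + b)
    w, b = params[-1]
    return (h @ w + b)[0]

def mse(params, xs, ys):
    preds = jax.vmap(lambda x: mlp(params, x))(xs)
    return jnp.mean((preds - ys) ** 2)

key, k_noise, k_init = jax.random.split(jax.random.PRNGKey(0), 3)
xs = jnp.linspace(-2 * jnp.pi, 2 * jnp.pi, 512)
ys = jnp.sin(xs) + 0.5 * jax.random.normal(k_noise, xs.shape)   # eta with std 0.5

params = init_mlp(k_init, [1, 128, 128, 128, 128, 1])
grad_fn = jax.jit(jax.grad(mse))
for _ in range(5000):                          # plain full-batch gradient descent
    params = jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g,
                                    params, grad_fn(params, xs, ys))

# Compare the autodiff derivative of the fit with the true derivative cos(x)
d_net = jax.vmap(jax.grad(lambda x: mlp(params, x)))(xs)
d_true = jnp.cos(xs)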

This example may seem to contradict [17, 29] and other works that looked into the universal approximation properties of NNs regarding derivatives. We point out that this is not the case. In Appendix A.3, we show how the variance increases when estimating the derivative, leading to the oscillating behavior in the figure. From the figure, also note that integrals are approximated up to an asymptotic constant (see Appendix A.2 and [39]); however, we still need to control for this constant (note the growing deviation in the top-center plot).

We now present some results on the integral (exemplified in the top row of Figure 1) and the derivative (bottom row) that tie in with our example. Furthermore, we briefly discuss the last column of the figure. Notice that at the extreme left and extreme right, the derivative of the network begins to diverge from the underlying process. The NN was not trained with points sampled beyond these boundaries. Under data shifts (e.g., data from a similar process not seen in the training set), the issues we point out throughout this Appendix should increase.

A.2 On the Integral of Neural Networks

To present a view of the theoretical bounds on the error of approximating integrals, let $\hat{f}_{\bm{\theta}}(x)$ be some NN parametrized by $\bm{\theta}$. Also, let $f(x)$ represent our training dataset. We make the reasonable assumption that our dataset comes from some real function $f^{\ast}(x)$ sampled under symmetric additive noise, i.e., $f(x)=f^{\ast}(x)+\eta$, with expected value zero: $\mathbb{E}[\eta]=0$. Gaussian noise is such a case.

If an NN is a universal approximator [14, 29, 42, 10, 44] and it is trained on the dataset $f(x)$, we reach:

\[
\lVert\hat{f}_{\bm{\theta}}(x)-f(x)\rVert_{p}<\epsilon_{1}
\]

Most UAT proofs are valid for any $p$-norm ($p$-norms bound one another), and assuming the 1-norm we reach $\hat{f}_{\bm{\theta}}(x)=f(x)\pm\epsilon_{1}$ for some constant $\epsilon_{1}$. As a consequence, we reach:

\[
\hat{f}_{\bm{\theta}}(x)=f^{\ast}(x)+\eta\pm\epsilon_{1}
\]

Given that integration is a linear operator, we can show that:

\[
E_{\eta}\Big[\int\hat{f}_{\bm{\theta}}(x)\,dx\Big]=\int f^{\ast}(x)\,dx\pm\epsilon_{2}.
\]

To do so, we shall use the law of the unconscious statistician as follows:

\begin{align*}
g(x) &= \int \hat{f}_{\theta}(x)\,dx \\
E_{\eta}[g(x)] &= E_{\eta}\Big[\int \hat{f}_{\theta}(x)\,dx\Big] \\
&= \int\!\!\int \hat{f}_{\theta}(x)\,p_{\eta}(n)\,dx\,dn \\
&= \int\!\!\int p_{\eta}(n)\,\hat{f}_{\theta}(x)\,dn\,dx \\
&= \int\!\!\int p_{\eta}(n)\big(f^{\ast}(x)+\eta\pm\epsilon_{1}\big)\,dn\,dx \\
&= \int\!\!\int p_{\eta}(n)f^{\ast}(x)+p_{\eta}(n)\eta\pm p_{\eta}(n)\epsilon_{1}\,dn\,dx \\
&= \int E_{\eta}[f^{\ast}(x)]+E_{\eta}[\eta]\pm E_{\eta}[\epsilon_{1}]\,dx \\
&= \int E_{\eta}[f^{\ast}(x)]\pm E_{\eta}[\epsilon_{1}]\,dx \\
&= \int f^{\ast}(x)\pm\epsilon_{1}\,dx \\
&= \int f^{\ast}(x)\,dx\pm\int\epsilon_{1}\,dx=\int f^{\ast}(x)\,dx\pm\epsilon_{2}
\end{align*}

where $\epsilon_{2}$ is a constant, as $\epsilon_{1}$ is also a constant. However, this constant grows with the integration interval. This growing error is visible in the top-center plot of Figure 1. Controlling for $\epsilon_{2}$ will depend on the dataset's quality and the training procedure.

A similar result is discussed in [39].

A.3 On the Derivative of Underlying Process and its impact on Neural Networks

To understand the oscillations in the derivatives of the example, let us rewrite the underlying process that generated the data as a Gaussian process: $Y(x)=\sin(x+\phi)+\eta$, where $Y$ is the stochastic process and $\phi\sim \mathrm{Uniform}(-2\pi,2\pi)$ is a phase component necessary for stationarity. The variance of this process is given by the kernel function $K(\tau)$, controlled by the parameter $\tau$. Given that this process has the same variance over time $x$, it is stationary, and a stationary Gaussian process has $K(\tau)=e^{-\frac{\tau}{2}}$ [37]. Before continuing, we note that in our example we sampled from a Gaussian with standard deviation $\sigma=0.5$ and variance $\sigma^{2}=0.25$. This leads to a value of $\tau\approx 2.8$, since $e^{-\frac{2.8}{2}}\approx 0.25$.

Now, let $Y^{\prime}(x)$ be the derivative of this process. This may be defined as:

\[
\frac{Y(x+h)-Y(x)}{h}\rightarrow Y^{\prime}(x)
\]

if:

\[
E\Big[\Big(\frac{Y(x+h)-Y(x)}{h}-Y^{\prime}(x)\Big)^{2}\Big]\rightarrow 0
\]

when: h00h\rightarrow 0italic_h → 0.

Given that our process is stationary and the noise term has expected value zero, the expected value of the derivative process is also zero, i.e., $E[Y^{\prime}(x)] = 0$ [37]. We also have that its variance is given by the second derivative of $K(\tau)$: $Var[Y^{\prime}(x)] = \frac{\partial^{2} K}{\partial \tau^{2}}$ [37].

From here, we can differentiate $K$ twice to show that $Var[Y^{\prime}(x)] = -e^{-\frac{\tau^{2}}{2}} + \tau^{2} e^{-\frac{\tau^{2}}{2}}$. Recall that $Var[Y(x)] = \sigma^{2} = 0.25$ and $\tau \approx 2.8$. By solving a simple inequality, we can show that the variance of the derivative process exceeds that of the original process, i.e., $Var[Y^{\prime}(x)] > Var[Y(x)]$, whenever $\tau > \sqrt{2} \approx 1.4$ (as in our example, where $\tau \approx 2.8$).

This result explains the oscillation in the bottom-middle plot of Figure 1. The derivative process has a higher variance; thus, we expect a higher variance in estimates of the derivative.

Overall, even though an NN is a universal approximator of derivatives [17, 29], the variance of the noise term will likely increase for the derivative of the NN (here, a starting variance of 0.25 is already sufficient to show such an increase).
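To make this concrete, the following is a minimal numerical sketch (not taken from our code) of the effect described above: with the example's noise level of $\sigma = 0.5$, finite-difference derivatives of the noisy signal fluctuate far more than the signal itself. The grid spacing `h` is an arbitrary illustrative choice.

```python
import numpy as np

# Minimal sketch: noise in finite-difference derivatives of a noisy sine.
# sigma = 0.5 follows the example above; h = 0.1 is an arbitrary choice.
rng = np.random.default_rng(7)
h = 0.1
x = np.arange(0.0, 4.0 * np.pi, h)
y = np.sin(x) + rng.normal(scale=0.5, size=x.size)  # noisy observations

dy = np.diff(y) / h  # finite-difference derivative estimate

# The observation noise has variance ~0.25, but the derivative noise has
# variance ~2 * sigma^2 / h^2 = 50, explaining the oscillations in Figure 1.
print("variance around sin(x):", np.var(y - np.sin(x)))
print("variance around cos(x):", np.var(dy - np.cos(x[:-1])))
```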

Appendix B Sampling from 2-Cats

We now detail how to sample from 2-Cats. Let the first derivatives of 2-Cats be: $h_{v}(u) = \Pr[U \leq u \mid V = v] = \frac{\partial H_{\bm{\theta}}(u,v)}{\partial v}$ and $h_{u}(v) = \Pr[V \leq v \mid U = u] = \frac{\partial H_{\bm{\theta}}(u,v)}{\partial u}$. Now, let $h^{\prime}_{v}(u) = \frac{\partial^{2} H_{\bm{\theta}}(u,v)}{\partial v^{2}}$ be the first derivative of $h_{v}(u)$, and similarly $h^{\prime}_{u}(v) = \frac{\partial^{2} H_{\bm{\theta}}(u,v)}{\partial u^{2}}$. These derivatives are readily available in the Hessian matrix that we estimate symbolically for 2-Cats (see Section 3).

Now, let us determine the inverse of $h_{u}(v)$, that is, $h^{-1}_{u}(v_{i})$ (the superscript $-1$ denotes the inverse). The same arguments hold for $h_{v}(u)$ and $h^{-1}_{v}(u_{i})$, and thus we omit them. Notice that with this inverse, we can sample from the CDF defined by $h_{u}(v)$ using the well-known inverse transform sampling. Also notice that, by definition, $h_{u}(v)$ is already the derivative of $H_{\bm{\theta}}(u,v)$ with respect to $u$.

Using a Legendre transform (see https://en.wikipedia.org/wiki/Inverse_function_rule), the inverse of a derivative is:

$$h^{-1}_{u}(v_{i}) = \big(h^{\prime}_{u}(v_{i})\big)^{-1} = \Big(\frac{\partial^{2} H_{\bm{\theta}}(u, v_{i})}{\partial u^{2}}\Big)^{-1}$$

where, again, the derivative inside the parentheses is readily available to us via symbolic computation.

With these results in hand, the algorithm for sampling is as follows (a numerical sketch is given after the list):

1. Generate two independent values $u \sim \text{Uniform}(0,1)$ and $v_{i} \sim \text{Uniform}(0,1)$.

2. Set $v = h^{-1}_{u}(v_{i})$. This is an inverse transform sampling step for $v$.

3. Now that we have the $(u, v)$ pair, we can again use inverse transform sampling to determine:

   (a) $x_{1} = F^{-1}_{x_{1}}(u)$;

   (b) $x_{2} = F^{-1}_{x_{2}}(v)$.
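Below is a minimal sketch of this procedure, assuming only a callable copula $H(u,v)$. For self-containment, a closed-form Clayton copula stands in for a trained 2-Cats network, the conditional $h_u$ is obtained by finite differences rather than the symbolic Hessian, and the inverse is computed by bisection instead of the analytic route above; all function names here are illustrative assumptions.

```python
import numpy as np

# Sketch of the sampling algorithm above. A closed-form Clayton copula
# stands in for a trained 2-Cats model H(u, v); h_u and its inversion are
# computed numerically for illustration only.

THETA = 2.0  # Clayton dependence parameter for the stand-in copula

def H(u, v, theta=THETA):
    """Stand-in copula C(u, v); replace with a trained 2-Cats network."""
    return np.maximum(u ** -theta + v ** -theta - 1.0, 1e-12) ** (-1.0 / theta)

def h_u(v, u, eps=1e-5):
    """Conditional CDF Pr[V <= v | U = u] = dH/du via central differences."""
    return (H(u + eps, v) - H(u - eps, v)) / (2.0 * eps)

def h_u_inverse(t, u, tol=1e-8):
    """Invert v -> h_u(v) by bisection (h_u is increasing in v)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h_u(mid, u) < t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sample_pair(rng):
    """Steps 1 and 2: u, v_i ~ Uniform(0, 1); v = h_u^{-1}(v_i)."""
    u, v_i = rng.uniform(1e-3, 1.0 - 1e-3, size=2)  # keep away from the boundary
    return u, h_u_inverse(v_i, u)

rng = np.random.default_rng(0)
uv = np.array([sample_pair(rng) for _ in range(1000)])
# Step 3 maps (u, v) back to data space with the inverse marginal CDFs:
# x1 = F_x1^{-1}(u), x2 = F_x2^{-1}(v).
```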

Appendix C Ablation Study

We now provide an ablation study to understand the impact of our three losses. In this study, our model was trained without the Lagrangian terms of Property P3.

Table 3: Ablation results, one sub-table per real-world dataset (rows: losses used for training; columns: evaluation metrics).

BOSTON

| Training losses | $L^{C}$ | $L^{dC}$ | $L^{c}$ |
| Only $L^{c}$ | 0.115 | 0.033 | -0.448 |
| $L^{c}$ and $L^{dC}$ | 0.107 | 0.035 | -0.454 |
| $L^{c}$ and $L^{C}$ | 0.103 | 0.023 | -0.236 |
| All three losses (paper) | 0.107 | 0.020 | -0.630 |

INTC-MSFT

| Training losses | $L^{C}$ | $L^{dC}$ | $L^{c}$ |
| Only $L^{c}$ | 0.131 | 0.007 | -0.327 |
| $L^{c}$ and $L^{dC}$ | 0.141 | 0.004 | -0.314 |
| $L^{c}$ and $L^{C}$ | 0.137 | 0.007 | -0.320 |
| All three losses (paper) | 0.141 | 0.018 | -0.402 |

GOOG-FB

| Training losses | $L^{C}$ | $L^{dC}$ | $L^{c}$ |
| Only $L^{c}$ | 0.166 | 6.88 | -3.324 |
| $L^{c}$ and $L^{dC}$ | 0.133 | 0.066 | -2.163 |
| $L^{c}$ and $L^{C}$ | 0.195 | 2.50 | -3.201 |
| All three losses (paper) | 0.136 | 0.087 | -2.881 |

The three tables of this section (Table 3) present the values for $L^{C}$ (the squared error of the cumulative function $C$), $L^{dC}$ (the squared error for the first derivatives of $C$), and $L^{c}$ (the copula likelihood). We present one table for each real-world dataset.

$L^{c}$ is the score of interest when comparing models and the one we report in our paper. However, as discussed in Appendix A of our manuscript, when an NN approximates one aspect of a function (here, the Copula density via $L^{c}$), the NN may miss other elements, such as the integral and derivatives of the function. This is why we now present results for all three metrics. The metrics are in the columns of the tables. Each row presents a different 2-Cats approach trained with: (1) only $L^{c}$ (the metric of most interest); (2) $L^{c}$ and $L^{dC}$; (3) $L^{c}$ and $L^{C}$; and (4) all three losses (as in the main text).

From the tables, we can see that every loss plays a role when training the model. When using all three losses, gains or ties are achieved in 5 out of 9 cases. When ignoring these terms, large losses may also arise (e.g., $L^{dC} = 6.88$ on GOOG-FB when training with $L^{c}$ only).

Appendix D Conjecture: A Mixture of 2-Cats is a Universal Copula Approximator

With sufficient components, any CDF (or density function) may be approximated by a mixture of other CDFs [52]. As a consequence, consider a mixture of 2-Cats of the form:

$$H_{\bm{m},\bm{\Theta}}(u,v) = \sum_{i=1}^{k} w_{i}\, H_{i,\bm{\theta_{i}}}(u,v),$$

where $H_{i,\bm{\theta_{i}}}(u,v)$ is the $i$-th 2-Cats model, $\bm{\Theta}$ is the parameter vector formed by concatenating the individual parameters of each mixture component, $\bm{\Theta} = [\bm{\theta_{1}}, \bm{\theta_{2}}, \dots, \bm{\theta_{k}}]$, and $\bm{m} = [m_{1}, m_{2}, \dots, m_{k}]$ are real numbers related to the mixture weights.

This model may be trained as follows:

1. Map the mixture parameters to the $(k-1)$-simplex: $\mathbf{w} = \text{softmax}(\bm{m})$; $w_{i}$ is the $i$-th position of this vector.

2. When training, backpropagate to learn $\bm{\Theta}$ and $\bm{m}$.

This is similar to a Mixture Density Network [9]. Over the last few years, mixtures of Copulas have been gaining traction in several fields [2, 64, 73]; here, we propose an NN variation. By definition, this mixture is a valid Copula. A minimal sketch of the mixture is shown below.
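As an illustration only, the sketch below combines two closed-form copulas (Clayton and Frank) with softmax weights in place of trained 2-Cats components; the function names and parameter values are assumptions, not our implementation.

```python
import numpy as np

# Illustrative mixture of copulas with softmax weights. Closed-form Clayton
# and Frank copulas stand in for trained 2-Cats components H_i.

def clayton(u, v, theta=2.0):
    return np.maximum(u ** -theta + v ** -theta - 1.0, 1e-12) ** (-1.0 / theta)

def frank(u, v, theta=5.0):
    num = (np.exp(-theta * u) - 1.0) * (np.exp(-theta * v) - 1.0)
    return -np.log1p(num / (np.exp(-theta) - 1.0)) / theta

def mixture_cdf(u, v, m):
    """H_{m,Theta}(u, v) = sum_i w_i H_i(u, v), with w = softmax(m)."""
    w = np.exp(m - np.max(m))
    w = w / w.sum()
    return float(w @ np.array([clayton(u, v), frank(u, v)]))

m = np.array([0.3, -0.1])  # unconstrained mixture parameters (learned by backprop)
print(mixture_cdf(0.4, 0.7, m))
```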

Appendix E Scaled Synthetic Data

Table 4: Average negative log-likelihood (± 95% confidence interval) on synthetic data without input scaling. Columns give the parameter of the copula that generated each dataset.

| Method | Gaussian ρ=0.1 | Gaussian ρ=0.5 | Gaussian ρ=0.9 | Clayton θ=1 | Clayton θ=5 | Clayton θ=10 | Frank/Joe θ=1 | Frank/Joe θ=5 | Frank/Joe θ=10 |

Non-Deep Learn
| Par | -0.55±0.12 | -0.55±0.13 | -0.55±0.16 | -1.12±0.09 | -1.12±0.16 | -1.12±0.21 | -0.86±0.09 | -0.86±0.12 | -0.86±0.15 |
| Bern | -0.54±0.12 | -0.54±0.13 | -0.54±0.14 | -1.10±0.08 | -1.10±0.12 | -1.10±0.18 | -0.85±0.09 | -0.85±0.12 | -0.85±0.14 |
| PBern | -0.55±0.11 | -0.55±0.14 | -0.55±0.14 | -1.11±0.08 | -1.11±0.12 | -1.11±0.18 | -0.86±0.09 | -0.86±0.13 | -0.86±0.15 |
| PSPL1 | -0.55±0.11 | -0.55±0.13 | -0.55±0.14 | -1.11±0.07 | -1.11±0.15 | -1.11±0.30 | -0.86±0.09 | -0.86±0.12 | -0.86±0.18 |
| PSPL2 | -0.55±0.11 | -0.55±0.13 | -0.55±0.15 | -1.10±0.08 | -1.10±0.14 | -1.10±0.24 | -0.86±0.09 | -0.86±0.12 | -0.86±0.17 |
| TTL0 | -0.54±0.12 | -0.54±0.14 | -0.54±0.15 | -1.11±0.09 | -1.11±0.16 | -1.11±0.27 | -0.86±0.08 | -0.86±0.13 | -0.86±0.16 |
| TLL1 | -0.53±0.12 | -0.53±0.14 | -0.53±0.16 | -1.12±0.08 | -1.12±0.17 | -1.12±0.32 | -0.85±0.09 | -0.85±0.12 | -0.85±0.17 |
| TLL2 | -0.54±0.12 | -0.54±0.14 | -0.54±0.15 | -1.12±0.09 | -1.12±0.19 | -1.12±0.40 | -0.86±0.09 | -0.86±0.13 | -0.86±0.16 |
| TLL2nn | -0.55±0.12 | -0.55±0.14 | -0.55±0.15 | -1.12±0.08 | -1.12±0.17 | -1.12±0.40 | -0.86±0.09 | -0.86±0.13 | -0.86±0.15 |
| MR | -0.54±0.12 | -0.54±0.13 | -0.54±0.14 | -1.08±0.08 | -1.08±0.14 | -1.08±0.34 | -0.84±0.09 | -0.84±0.12 | -0.84±0.15 |
| Probit | -0.55±0.12 | -0.55±0.13 | -0.55±0.15 | -1.10±0.08 | -1.10±0.14 | -1.10±0.29 | -0.85±0.09 | -0.85±0.12 | -0.85±0.16 |

Deep Learn
| ACNet | -0.06±0.09 | -0.29±0.09 | -1.05±0.08 | -0.51±0.07 | -0.91±0.11 | -1.35±0.04 | -0.13±0.09 | -0.49±0.08 | -0.58±0.10 |
| GEN-AC | -0.06±0.08 | -0.29±0.08 | -1.06±0.07 | -0.52±0.06 | -0.93±0.12 | -1.34±0.05 | -0.15±0.07 | -0.49±0.08 | -0.56±0.10 |
| NL | -0.34±0.09 | -0.43±0.07 | -1.07±0.07 | -0.64±0.07 | -1.17±0.13 | -1.03±0.07 | -0.42±0.07 | -0.59±0.08 | -0.57±0.06 |
| IGC | -0.58±0.07 | -0.67±0.06 | -0.86±0.06 | -0.66±0.06 | -0.88±0.06 | -0.83±0.06 | -0.66±0.06 | -0.73±0.05 | -0.75±0.07 |

Our
| 2-Cats-L | -0.48±0.14 | -0.63±0.13 | -1.20±0.23 | -0.64±0.14 | -1.24±0.20 | -1.69±0.19 | -0.52±0.17 | -1.34±0.17 | -1.34±0.14 |
| 2-Cats-G | -0.38±0.15 | -0.45±0.19 | -1.04±0.28 | -0.44±0.19 | -1.27±0.30 | -1.06±0.18 | -0.51±0.15 | -1.12±0.09 | -1.06±0.18 |

Table 4 presents results without scaling the input data, as is done in Deep Learning methods. The table presents the average negative log-likelihood and the 95% confidence interval.

The colors and highlights of this table (in the original rendered version) match the main text: the winning 2-Cats models are highlighted in green, and the best baseline methods, on average, are in red. The table shows that 2-Cats outperforms the baseline methods as the dependency ($\rho$ or $\theta$) increases. For small dependencies, methods not based on Deep Learning outperform Deep ones (including 2-Cats). Nevertheless, 2-Cats is the winning method in 6 out of 9 datasets and is tied with the best in one case.

Appendix F A Flexible Variation of the Model

A Flexible 2-Cats model works similarly to our 2-Cats. However, the transforms are different:

1. Let $m_{\bm{\theta}}: \mathbb{I}^{2} \mapsto \mathbb{R}^{+}$ be an MLP that outputs positive numbers. We achieve this by employing an ELU-plus-one activation in every layer, as in [71].

2. Define the transforms:

$$t_{v}(u) = \int_{0}^{u} m_{\bm{\theta}}(x, v)\,dx; \quad t_{u}(v) = \int_{0}^{v} m_{\bm{\theta}}(y, u)\,dy.$$

3. The 2-Cats-FLEX hypothesis is now defined as the function $H_{\bm{\theta}}(u,v) = G\big(t_{v}(u), t_{u}(v)\big)$, where $G(x,y)$ is any bivariate CDF on the $\mathbb{R}^{2}$ domain (e.g., the Bivariate Logistic or Bivariate Gaussian).

This model meets P2, but neither P1 nor P3. As stated in the introduction, the fact that $t_{v}(u)$ and $t_{u}(v)$ are monotonic (one-to-one) guarantees that $G(t_{v}(u), t_{u}(v))$ defines a valid probability transform to a new CDF. A major issue with this approach is that we have no guarantee that the model's derivatives are conditional cumulative distribution functions (first derivatives) or density functions (second derivatives). That is why we call it Flexible (FLEX). A minimal sketch of this construction follows.
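For concreteness, the sketch below illustrates the forward pass under stated assumptions: a tiny random-weight MLP with ELU-plus-one activations stands in for a trained $m_{\bm{\theta}}$, the integrals are approximated with the trapezoidal rule, and $G$ is Gumbel's bivariate logistic CDF. None of this is our implementation; it only mirrors the three steps above.

```python
import numpy as np

# Sketch of the 2-Cats-FLEX forward pass. A random-weight MLP with ELU+1
# activations stands in for a trained m_theta; integrals use the
# trapezoidal rule; G is Gumbel's bivariate logistic CDF.

rng = np.random.default_rng(42)
W1, b1 = rng.normal(size=(2, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 1)), rng.normal(size=1)

def elu_plus_one(z):
    # ELU(z) + 1: equals z + 1 for z > 0 and exp(z) otherwise, hence > 0.
    return np.where(z > 0, z + 1.0, np.exp(z))

def m_theta(x, y):
    """Positive-valued stand-in for the MLP m_theta(x, y)."""
    h = elu_plus_one(np.stack([x, y], axis=-1) @ W1 + b1)
    return elu_plus_one(h @ W2 + b2)[..., 0]

def transform(upper, other, n=256):
    """t_v(u) = int_0^u m_theta(x, v) dx, via the trapezoidal rule."""
    xs = np.linspace(0.0, upper, n)
    vals = m_theta(xs, np.full(n, other))
    dx = xs[1] - xs[0]
    return dx * (vals.sum() - 0.5 * (vals[0] + vals[-1]))

def G_logistic(x, y):
    """Gumbel's bivariate logistic CDF on R^2."""
    return 1.0 / (1.0 + np.exp(-x) + np.exp(-y))

def H_flex(u, v):
    """2-Cats-FLEX hypothesis: H(u, v) = G(t_v(u), t_u(v))."""
    return G_logistic(transform(u, v), transform(v, u))

print(H_flex(0.3, 0.8))
```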

F.1 Full Results

Table 5: Results for these models on the real datasets.

| Method | Boston | INTC-MSFT | GOOG-FB |
| Par | -0.16±0.09 | -0.05±0.07 | -0.88±0.08 |
| Bern | -0.09±0.06 | -0.03±0.05 | -0.70±0.04 |
| PBern | -0.11±0.07 | -0.02±0.06 | -0.62±0.04 |
| PSPL1 | -0.06±0.14 | -0.02±0.06 | -0.72±0.14 |
| PSPL2 | -0.05±0.14 | -0.02±0.06 | -0.83±0.07 |
| TTL0 | -0.16±0.08 | -0.06±0.06 | -1.00±0.07 |
| TTL1 | -0.18±0.09 | -0.06±0.06 | -1.00±0.08 |
| TTL2 | -0.18±0.09 | -0.06±0.06 | -0.71±0.16 |
| TLL2 | -0.15±0.09 | -0.04±0.05 | -0.95±0.13 |
| MR | -0.07±0.06 | -0.01±0.05 | -0.84±0.07 |
| Probit | -0.10±0.06 | -0.03±0.07 | -0.92±0.06 |
| ACNet | -0.28±0.11 | -0.18±0.08 | -0.91±0.12 |
| GEN-AC | -0.29±0.11 | -0.17±0.07 | -0.75±0.10 |
| NL | -0.28±0.16 | -0.16±0.08 | -1.09±0.13 |
| IGC | 0.07±0.08 | 0.14±0.04 | -0.30±0.02 |
| 2-Cats-FLEX-L | 0.38±0.71 | 0.19±0.08 | -1.15±0.07 |
| 2-Cats-FLEX-G | 0.50±0.43 | 0.23±0.09 | -1.01±0.07 |
| 2-Cats-G | -0.30±0.09 | -0.21±0.05 | -1.75±0.04 |
| 2-Cats-L | -0.27±0.07 | -0.07±0.06 | -1.65±0.08 |

In Table 5, we show the results for these models on the real datasets. The definitions of the models are as follows: (2-Cats-P Gaus.) a parametric mixture of 10 Gaussian Copula densities; (2-Cats-P Frank) similar to the above, but a mixture of 10 Frank Copula densities; (2-Cats-FLEX-G) a Flexible model version where the final activation layer is the Bivariate Gaussian CDF; (2-Cats-FLEX-L) a Flexible model version where the final activation layer is the Bivariate Logistic CDF.

Models were trained with the same hyperparameters and network definition as the ones in our main text.

Appendix G Baseline Source Code

Table 6: Source code links for the baseline methods.

Non-Deep Learn
| Par | https://cran.r-project.org/web/packages/VineCopula/index.html |
| Bern | https://cran.r-project.org/web/packages/kdecopula/index.html |
| PBern | https://rdrr.io/cran/penRvine/ |
| PSPL1 | https://rdrr.io/cran/penRvine/ |
| PSPL2 | https://rdrr.io/cran/penRvine/ |
| TTL0 | https://cran.r-project.org/web/packages/kdecopula/index.html |
| TTL1 | https://cran.r-project.org/web/packages/kdecopula/index.html |
| TTL2 | https://cran.r-project.org/web/packages/kdecopula/index.html |
| TLL2 | https://cran.r-project.org/web/packages/kdecopula/index.html |
| MR | https://cran.r-project.org/web/packages/kdecopula/index.html |
| Probit | https://cran.r-project.org/web/packages/kdecopula/index.html |

Deep Learn
| ACNet | https://github.com/lingchunkai/ACNet |
| GEN-AC | https://github.com/yutingng/gen-AC |
| NL | https://github.com/pawelc/NeuralLikelihoods0 |
| IGC | https://github.com/TimCJanke/igc |

The links for the baseline source code are in Table 6.


References
