Array Code in C++

xtensor (http://xtensor.readthedocs.io/) is a C++ array library. It provides multi-dimensional array data structures amenable to compile-time optimization. Such an array library helps us organize code and achieve fast runtime.

  1. Python is slow but easy to write
  2. Speed up by using numpy (still in Python)
  3. xtensor: write loop-based code at C++ speed using arrays
  4. Effect of house-keeping code

More on array-based system design:

  1. Design interface with arrays
  2. Conversion between dynamic and static semantics
  3. Insert profiling code

Python Is Slow

Python is usually slow at number-crunching, but it is very convenient to write.

Here we consider a boundary value problem of the Laplace equation for temperature distribution in a \(1\times1\) square area.

\[\begin{split}& \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \quad (0<x<1; 0<y<1) \\ &\left\{\begin{array}{lll} u(0,y) = 0, & u(1,y) = \sin(\pi y) & \rule{4ex}{0pt} (0 \le y \le 1) \\ u(x,0) = 0, & u(x,1) = 0 & \rule{4ex}{0pt} (0 \le x \le 1) \end{array}\right.\end{split}\]

To solve it numerically, we choose the finite-difference method. The finite-difference method needs a grid to discretize the spatial domain. The simplest spatial discretization is the homogeneous Cartesian grid. Let’s make a \(51\times51\) Cartesian grid.

import numpy as np
from matplotlib import pyplot as plt

def make_grid():
    nx = 51
    x = np.linspace(0, 1, nx)
    gx, gy = np.meshgrid(x, x)
    u = np.zeros_like(gx)
    u[0,:] = 0
    u[-1,:] = 1 * np.sin(np.linspace(0,np.pi,nx))
    u[:,0] = 0
    u[:,-1] = 0
    return nx, x, u

def show_grid(size):
    fig, ax = plt.subplots(figsize=(size,size))
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.set_xticks(x, minor=True)
    ax.set_yticks(x, minor=True)
    ax.set_xlabel('$x$')
    ax.set_ylabel('$y$')
    ax.grid(True, which='minor')

nx, x, uoriginal = make_grid()
show_grid(10)
../../_images/01_grid.png

After the grid is defined, we may derive the finite-differencing formula. Use the Taylor series expansion to obtain the difference equation:

\[\begin{split}& \frac{u(x_{i+1}, y_j) - 2u(x_i, y_j) + u(x_{i-1}, y_j)}{(\Delta x)^2} \\ &\quad + \frac{u(x_i, y_{j+1}) - 2u(x_i, y_j) + u(x_i, y_{j-1})}{(\Delta y)^2} = 0\end{split}\]

Note \(\Delta x = \Delta y\). The difference equation is rewritten as

\[u(x_i, y_j) = \frac{u(x_{i+1}, y_j) + u(x_{i-1}, y_j) + u(x_i, y_{j+1}) + u(x_i, y_{j-1})}{4}\]

Apply the point-Jacobi method to write a formula to iteratively solve the difference equation:

\[u^{n+1}(x_i, y_j) = \frac{u^n(x_{i+1}, y_j) + u^n(x_{i-1}, y_j) + u^n(x_i, y_{j+1}) + u^n(x_i, y_{j-1})}{4}\]

where \(u^n\) is the solution at the \(n\)-th iteration.

Now we can use Python to quickly implement the solver:

import time
import numpy as np
from matplotlib import pyplot as plt

class Timer:
    def __enter__(self):
        self._t0 = time.time()
    def __exit__(self, *args, **kw):
        self._t1 = time.time()
        print("Wall time: {:g} s".format(self._t1 - self._t0))

def solve_python_loop():
    u = uoriginal.copy()
    un = u.copy()
    converged = False
    step = 0
    # Outer loop.
    while not converged:
        step += 1
        # Inner loops. One for x and the other for y.
        for it in range(1, nx-1):
            for jt in range(1, nx-1):
                un[it,jt] = (u[it+1,jt] + u[it-1,jt] + u[it,jt+1] + u[it,jt-1]) / 4
        norm = np.abs(un-u).max()
        u[...] = un[...]
        converged = True if norm < 1.e-5 else False
    return u, step, norm

def show_result(u, step, norm, size=7):
    print("step", step, "norm", norm)
    fig, ax = plt.subplots(figsize=(size,size))
    cs = ax.contour(x, x, u.T)
    ax.clabel(cs, inline=1, fontsize=10)

    ax.set_xticks(np.linspace(0,1,6))
    ax.set_yticks(np.linspace(0,1,6))
    ax.set_xlabel('$x$')
    ax.set_ylabel('$y$')

    ax.grid(True, which='minor')
>>> with Timer():
>>>     u, step, norm = solve_python_loop()
Wall time: 4.79688 s
../../_images/01_solve_python_loop.png

It takes quite a while (around 5 seconds) to converge with 2097 iterations.

Is the calculation correct? For any numerical application, correctness is the first condition.

We may compare the numerical solution with the analytical solution. Recall the PDE and its boundary conditions:

\[\begin{split}& \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \quad (0<x<1; 0<y<1) \\ &\left\{\begin{array}{lll} u(0,y) = 0, & u(1,y) = \sin(\pi y) & \rule{4ex}{0pt} (0 \le y \le 1) \\ u(x,0) = 0, & u(x,1) = 0 & \rule{4ex}{0pt} (0 \le x \le 1) \end{array}\right.\end{split}\]

Use separation of variables. Assume the solution \(u(x,y) = \phi(x)\psi(y)\).

\[\begin{split}& \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \quad \Rightarrow \quad \phi''\psi + \phi\psi'' = 0 \\ & \Rightarrow \quad \frac{\phi''}{\phi} = -\frac{\psi''}{\psi} = \lambda\end{split}\]
\[\begin{split}\left\{\begin{array}{ll} \phi'' - \lambda\phi = 0, & \phi(0) = 0 \\ \psi'' + \lambda\psi = 0, & \psi(0) = 0, \, \psi(1) = 0 \end{array}\right.\end{split}\]

\(\lambda\) is the eigenvalue. The general solution of the ODE of \(\psi\) can be obtained as

\[\psi(y) = c\cos(\kappa y) + d\sin(\kappa y), \quad \kappa^2 = \lambda\]

Substitute the boundary conditions into the general solution:

\[\begin{split}& \psi(0) = c = 0 \, \Rightarrow \, c = 0 \\ & \psi(1) = d\sin(\kappa) = 0 \, \Rightarrow \, \kappa = n\pi, \, n = 1, 2, 3, \ldots \\ & \psi(y) = \sum_{n=1}^{\infty} \psi_n(y), \; \mbox{where} \; \psi_n(y) = d_n\sin(n\pi y) \\ & \Rightarrow \psi(y) = \sum_{n=1}^{\infty} d_n \sin(n\pi y)\end{split}\]

Substitute the eigenvalue \(\lambda = \kappa^2 = (n\pi)^2\) into the ODE of \(\phi\):

\[\phi'' - (n\pi)^2\phi = 0\]

The general solution is

\[\phi_n(x) = a_n\cosh(\kappa x) + b_n\sinh(\kappa x)\]

Apply the boundary condition \(\phi_n(0) = a_n = 0\) and obtain \(\phi_n(x) = b_n\sinh(\kappa x)\).

The solution \(u(x, y)\) can now be written as

\[u(x,y) = \sum_{n=1}^{\infty}\phi_n(x)\psi_n(y) = \sum_{n=1}^{\infty} \alpha_n \sinh(n\pi x)\sin(n\pi y)\]

where \(\alpha_n = b_n d_n\). Apply the last boundary condition:

\[u(1,y) = \sin(\pi y) = \sum_{n=1}^{\infty}\alpha_n\sinh(n\pi)\sin(n\pi y)\]

Matching the coefficients gives \(\alpha_1 = 1/\sinh(\pi)\) and \(\alpha_k = 0\) for \(k = 2, 3, \ldots\). The solution of \(u\) is obtained:

\[u(x, y) = \frac{\sinh(\pi x)}{\sinh(\pi)} \sin(\pi y)\]
import numpy as np

def solve_analytical():
    u = np.empty((len(x), len(x)), dtype='float64')
    for ix, cx in enumerate(x):
        u[ix, :] = np.sinh(np.pi*cx) / np.sinh(np.pi) * np.sin(np.pi*x)
    return u
>>> uanalytical = solve_analytical()
>>> # Calculate the L inf norm.
>>> print("Linf of difference is %f" % np.abs(u - uanalytical).max())
Linf of difference is 0.004962
../../_images/01_solve_analytical.png

Say \(u_a\) is the analytical solution. The difference \(|u-u_a|_{\infty} \approx 0.005\) from the above result is small enough for the \(51\times51\) grid.

Array-Based Code with Numpy

We can usually use numpy to speed up the slow Python loops. Numpy implements fast calculations in C. By using numpy, we essentially delegate the calculation to C.

import numpy as np

def solve_array():
    u = uoriginal.copy()
    un = u.copy()
    converged = False
    step = 0
    while not converged:
        step += 1
        un[1:nx-1,1:nx-1] = (u[2:nx,1:nx-1] + u[0:nx-2,1:nx-1] +
                             u[1:nx-1,2:nx] + u[1:nx-1,0:nx-2]) / 4
        norm = np.abs(un-u).max()
        u[...] = un[...]
        converged = True if norm < 1.e-5 else False
    return u, step, norm
>>> # Run the numpy array solver
>>> with Timer():
>>>     u, step, norm = solve_array()
Wall time: 0.0552309 s
../../_images/02_solve_array.png

The speed is much better: less than 0.1 second. Compared to the naive Python loop, the speedup is more than 50x (87x, to be exact).

Nested Loop in C++

Oftentimes numpy is still not fast enough. Besides, the array-slicing style is not always easy to read. For our point-Jacobi method, nested loops read more straightforwardly. Now xtensor comes to help.

Except for the parentheses, the C++ version looks almost the same as the Python version.

#include <pybind11/pybind11.h>
#define FORCE_IMPORT_ARRAY
#include <xtensor-python/pyarray.hpp>

#include <vector>
#include <algorithm>
#include <tuple>
#include <iostream>

#include <xtensor/xarray.hpp>
#include <xtensor/xadapt.hpp>
#include <xtensor/xview.hpp>

std::tuple<xt::xarray<double>, size_t, double>
solve1(xt::xarray<double> u)
{
    const size_t nx = u.shape(0);
    xt::xarray<double> un = u;
    bool converged = false;
    size_t step = 0;
    double norm;
    while (!converged)
    {
        ++step;
        for (size_t it=1; it<nx-1; ++it)
        {
            for (size_t jt=1; jt<nx-1; ++jt)
            {
                un(it,jt) = (u(it+1,jt) + u(it-1,jt) + u(it,jt+1) + u(it,jt-1)) / 4;
            }
        }
        norm = xt::amax(xt::abs(un-u))();
        if (norm < 1.e-5) { converged = true; }
        u = un;
    }
    return std::make_tuple(u, step, norm);
}

PYBIND11_MODULE(solve_cpp, m)
{
    xt::import_numpy();
    m.def
    (
        "solve_cpp", [](xt::pyarray<double> & uin) { return solve1(xt::xarray<double>(uin)); }
    );
}
$ g++ solve_cpp.cpp -o solve_cpp.so -O3 -fPIC -shared -std=c++17 -lpython3.9
>>> with Timer():
>>>     u, step, norm = solve_cpp.solve_cpp(uoriginal)
Wall time: 0.0251369 s
../../_images/03_solve_cpp.png
Runtime with Python loop, numpy array, and C++ loop:

Code                 Time (s)   Speedup (vs Python)   Speedup (vs numpy)
Python nested loop   4.797      1 (baseline)          n/a
Numpy array          0.055      87.22                 1 (baseline)
C++ nested loop      0.025      191.88                2.2

Major Source of Overhead: Data Preparation

Numerical calculation takes time. Intuitively, developers spend time on optimizing the number-crunching code. However, for a useful application, the house-keeping code for preparing the calculation data and post-processing the results is equally important.

In our previous example of solving the Laplace equation, all the conditions are hard-coded. That is fine for teaching, but not useful to those who do not know much about the math and the numerical methods. This time, I will use an example of curve fitting to show how the house-keeping code affects performance, and how xtensor comes to help.

We will do polynomial curve fitting for data in groups of variable length.
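The C++ listing below fits each group by building and solving the normal equations: matrix entry \((i,j)\) accumulates \(\sum_k x_k^{i+j}\) and right-hand-side entry \(j\) accumulates \(\sum_k x_k^j y_k\). Before reading the C++, here is a sketch of the same math in Python (the name fit_poly_normal is made up for illustration, not part of any helper):

```python
import numpy as np

def fit_poly_normal(xarr, yarr, order):
    """Polynomial least-squares fit via the normal equations.

    Illustrative sketch of the math used by the C++ fit_poly() below;
    returns coefficients highest power first, like numpy.polyfit.
    """
    powers = range(order + 1)
    # matrix(i, j) = sum_k x_k^(i+j)
    matrix = np.array([[np.sum(xarr ** (i + j)) for j in powers] for i in powers])
    # rhs(j) = sum_k x_k^j * y_k
    rhs = np.array([np.sum(xarr ** j * yarr) for j in powers])
    coef = np.linalg.solve(matrix, rhs)  # a_0, a_1, ..., a_order
    return coef[::-1]  # reverse to make numpy.poly1d happy

# Sanity check against numpy's own fitter on an exact quadratic.
x = np.linspace(0., 1., 50)
y = 2. * x**2 - 3. * x + 1.
print(fit_poly_normal(x, y, 2))  # close to [2, -3, 1]
```

For low orders on a unit interval the normal equations work fine; for high orders they become ill-conditioned, which is one reason numpy.polyfit uses a least-squares solver instead.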

#include <pybind11/pybind11.h>
#define FORCE_IMPORT_ARRAY
#include <xtensor-python/pyarray.hpp>

#include <vector>
#include <algorithm>

#include <xtensor/xarray.hpp>
#include <xtensor/xadapt.hpp>
#include <xtensor/xview.hpp>
#include <xtensor-blas/xlinalg.hpp>

using array_type = xt::xarray<double>;

template <class AT>
xt::xarray<double> fit_poly(AT & xarr, AT & yarr, size_t order)
{
    if (xarr.size() != yarr.size()) { throw std::runtime_error("xarr and yarr size mismatch"); }

    xt::xarray<double> matrix(std::vector<size_t>{order+1, order+1});

    for (size_t it=0; it<order+1; ++it)
    {
        for (size_t jt=0; jt<order+1; ++jt)
        {
            double & val = matrix(it, jt);
            val = 0;
            for (size_t kt=0; kt<xarr.size(); ++kt) { val += pow(xarr[kt], it+jt); }
        }
    }

    xt::xarray<double> rhs(std::vector<size_t>{order+1});
    for (size_t jt=0; jt<order+1; ++jt)
    {
        rhs[jt] = 0;
        for (size_t kt=0; kt<yarr.size(); ++kt) { rhs[jt] += pow(xarr[kt], jt) * yarr[kt]; }
    }

    xt::xarray<double> lhs = xt::linalg::solve(matrix, rhs);
    std::reverse(lhs.begin(), lhs.end()); // to make numpy.poly1d happy.

    return lhs;
}

template <class AT>
xt::xarray<double> fit_polys(xt::xarray<double> & xarr, xt::xarray<double> & yarr, size_t order)
{
    size_t xmin = std::floor(*std::min_element(xarr.begin(), xarr.end()));
    size_t xmax = std::ceil(*std::max_element(xarr.begin(), xarr.end()));
    size_t ninterval = xmax - xmin;

    xt::xarray<double> lhs(std::vector<size_t>{ninterval, order+1});
    lhs.fill(0); // sentinel.
    size_t start=0;
    for (size_t it=0; it<ninterval; ++it)
    {
        size_t stop;
        for (stop=start; stop<xarr.size(); ++stop) { if (xarr[stop]>=xmin+it+1) { break; } }

        AT const sub_x = xt::view(xarr, xt::range(start, stop));
        AT const sub_y = xt::view(yarr, xt::range(start, stop));

        xt::xarray<double> sub_lhs = fit_poly(sub_x, sub_y, order);
        xt::view(lhs, it, xt::all()) = sub_lhs;

        start = stop;
    }

    return lhs;
}

PYBIND11_MODULE(data_prep, m)
{
    xt::import_numpy();
    m.def
    (
        "fit_poly"
      , [](xt::pyarray<double> & xarr_in, xt::pyarray<double> & yarr_in, size_t order)
        {
            std::vector<size_t> xarr_shape(xarr_in.shape().begin(), xarr_in.shape().end());
            xt::xarray<double> xarr = xt::adapt(xarr_in.data(), xarr_shape);

            std::vector<size_t> yarr_shape(yarr_in.shape().begin(), yarr_in.shape().end());
            xt::xarray<double> yarr = xt::adapt(yarr_in.data(), yarr_shape);

            return fit_poly(xarr, yarr, order);
        }
    );
    m.def
    (
        "fit_polys"
      , [](xt::pyarray<double> & xarr_in, xt::pyarray<double> & yarr_in, size_t order)
        {
            std::vector<size_t> xarr_shape(xarr_in.shape().begin(), xarr_in.shape().end());
            xt::xarray<double> xarr = xt::adapt(xarr_in.data(), xarr_shape);
            std::vector<size_t> yarr_shape(yarr_in.shape().begin(), yarr_in.shape().end());
            xt::xarray<double> yarr = xt::adapt(yarr_in.data(), yarr_shape);
            return fit_polys<array_type>(xarr, yarr, order);
        }
    );
}
>>> with Timer():
>>>     # np.unique returns a sorted array.
>>>     xdata = np.unique(np.random.sample(1000000) * 1000)
>>>     ydata = np.random.sample(len(xdata)) * 1000
Wall time: 0.114635 s
import data_prep

plt.rc('figure', figsize=(12, 8))

def plot_poly_fitted(i):
    slct = (xdata>=i)&(xdata<(i+1))
    sub_x = xdata[slct]
    sub_y = ydata[slct]
    poly = data_prep.fit_poly(sub_x, sub_y, 3)
    print(poly)
    poly = np.poly1d(poly)
    xp = np.linspace(sub_x.min(), sub_x.max(), 100)
    plt.plot(sub_x, sub_y, '.', xp, poly(xp), '-')
>>> plot_poly_fitted(10)
../../_images/04_fit_poly.png

Now, let’s see the impact on runtime from the house-keeping code outside the calculating helper.

>>> with Timer():
>>>     # Do the calculation for the 1000 groups of points.
>>>     polygroup = np.empty((1000, 3), dtype='float64')
>>>     for i in range(1000):
>>>         # Use numpy to build the point group.
>>>         slct = (xdata>=i)&(xdata<(i+1))
>>>         sub_x = xdata[slct]
>>>         sub_y = ydata[slct]
>>>         polygroup[i,:] = data_prep.fit_poly(sub_x, sub_y, 2)
Wall time: 1.49671 s
>>> with Timer():
>>>     # Using numpy to build the point groups takes a lot of time.
>>>     data_groups = []
>>>     for i in range(1000):
>>>         slct = (xdata>=i)&(xdata<(i+1))
>>>         data_groups.append((xdata[slct], ydata[slct]))
Wall time: 1.24653 s
>>> with Timer():
>>>     # Fitting helper runtime is much less than building the point groups.
>>>     polygroup = np.empty((1000, 3), dtype='float64')
>>>     for it, (sub_x, sub_y) in enumerate(data_groups):
>>>         polygroup[it,:] = data_prep.fit_poly(sub_x, sub_y, 2)
Wall time: 0.215859 s

It’s very productive to write house-keeping code in Python. As we see, the price to pay is the runtime, and oftentimes memory as well. But spending 5x the runtime on house-keeping code is intolerable. We need to write C++ to speed it up.

Now see the fit_polys() C++ helper. It detects the point groups right before fitting, so all the house-keeping stays in C++.

>>> with Timer():
>>>     rbatch = data_prep.fit_polys(xdata, ydata, 2)
Wall time: 0.21058 s
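As an aside, the group-boundary scan that fit_polys() performs in its inner loop can also be expressed with numpy's np.searchsorted on the sorted x data. This is only a sketch of the boundary-detection idea (group_bounds is a made-up name), not what the C++ helper does internally:

```python
import numpy as np

def group_bounds(xdata, xmin, xmax):
    """Index bounds of the unit intervals starting at xmin in sorted data.

    For sorted xdata, bounds[i]:bounds[i+1] selects the points with
    xmin+i <= x < xmin+i+1.  Made-up helper for illustration only.
    """
    edges = np.arange(xmin, xmax + 1)
    return np.searchsorted(xdata, edges, side='left')

# Tiny example with three unit intervals.
xdata = np.array([0.1, 0.5, 1.2, 1.8, 2.5])
bounds = group_bounds(xdata, 0, 3)
print(bounds)                      # [0 2 4 5]
print(xdata[bounds[1]:bounds[2]])  # the points in [1, 2)
```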

Struct of Array and Array of Struct

/*
 * Struct of array:
 */
struct StructOfArray
{
    std::vector<double> x;
    std::vector<double> y;
};

struct PointProxy
{
    StructOfArray * soa;
    size_t idx;
    double   x() const { return soa->x[idx]; }
    double & x()       { return soa->x[idx]; }
    double   y() const { return soa->y[idx]; }
    double & y()       { return soa->y[idx]; }
};

/*
 * Array of struct:
 */
struct Point
{
    double x, y;
};

using ArrayOfStruct = std::vector<Point>;
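The choice matters for memory access: an array of struct interleaves x and y, so a loop that touches only x also strides over the y values, while a struct of array keeps each field contiguous. The two layouts can be mimicked in numpy, where a structured array plays the role of the array of struct (a sketch for illustration):

```python
import numpy as np

npoint = 4

# Array of struct: one 16-byte record per point, x and y interleaved.
aos = np.zeros(npoint, dtype=[('x', 'float64'), ('y', 'float64')])
aos['x'] = np.arange(npoint)
aos['y'] = np.arange(npoint) * 10.

# Struct of array: each field is its own contiguous buffer.
soa_x = np.arange(npoint, dtype='float64')
soa_y = np.arange(npoint, dtype='float64') * 10.

# Both layouts hold the same data, but the field view of the
# structured array walks memory with the 16-byte record stride,
# while the plain array is densely packed.
print(aos['x'].strides)  # (16,)
print(soa_x.strides)     # (8,)
```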

Conversion between Dynamic and Static

It’s not a bad idea to do it manually. Spelling out the static-to-dynamic conversion makes it clear what we want to do. When we work in the inner-most loop, no PyObject or virtual function table should be there.

template <size_t ND>
class StaticGridBase
{
public:
    static constexpr const size_t NDIM = ND;
    using serial_type = uint32_t;
    using real_type = double;
}; /* end class StaticGridBase */

class StaticGrid1d
  : public StaticGridBase<1>
{
}; /* end class StaticGrid1d */

class StaticGrid2d
  : public StaticGridBase<2>
{
}; /* end class StaticGrid2d */

class StaticGrid3d
  : public StaticGridBase<3>
{
}; /* end class StaticGrid3d */

/*
 * WrapStaticGridBase has the pybind11 wrapping code.
 */

class WrapStaticGrid1d
  : public WrapStaticGridBase< WrapStaticGrid1d, StaticGrid1d >
{
}; /* end class WrapStaticGrid1d */

class WrapStaticGrid2d
  : public WrapStaticGridBase< WrapStaticGrid2d, StaticGrid2d >
{
}; /* end class WrapStaticGrid2d */

class WrapStaticGrid3d
  : public WrapStaticGridBase< WrapStaticGrid3d, StaticGrid3d >
{
}; /* end class WrapStaticGrid3d */

PYBIND11_MODULE(_modmesh, mod)
{
    WrapStaticGrid1d::commit(mod);
    WrapStaticGrid2d::commit(mod);
    WrapStaticGrid3d::commit(mod);
}

Example: pybind11::cpp_function

pybind11::cpp_function

https://github.com/pybind/pybind11/blob/v2.4.3/include/pybind11/pybind11.h#L56

/// Wraps an arbitrary C++ function/method/lambda function/.. into a callable Python object
class cpp_function : public function {
public:
    cpp_function() { }
    cpp_function(std::nullptr_t) { }

    /// Construct a cpp_function from a vanilla function pointer
    template <typename Return, typename... Args, typename... Extra>
    cpp_function(Return (*f)(Args...), const Extra&... extra) {
        initialize(f, f, extra...);
    }

    /// Construct a cpp_function from a lambda function (possibly with internal state)
    template <typename Func, typename... Extra,
              typename = detail::enable_if_t<detail::is_lambda<Func>::value>>
    cpp_function(Func &&f, const Extra&... extra) {
        initialize(std::forward<Func>(f),
                   (detail::function_signature_t<Func> *) nullptr, extra...);
    }

    /// Construct a cpp_function from a class method (non-const)
    template <typename Return, typename Class, typename... Arg, typename... Extra>
    cpp_function(Return (Class::*f)(Arg...), const Extra&... extra) {
        initialize([f](Class *c, Arg... args) -> Return { return (c->*f)(args...); },
                   (Return (*) (Class *, Arg...)) nullptr, extra...);
    }

    /// Construct a cpp_function from a class method (const)
    template <typename Return, typename Class, typename... Arg, typename... Extra>
    cpp_function(Return (Class::*f)(Arg...) const, const Extra&... extra) {
        initialize([f](const Class *c, Arg... args) -> Return { return (c->*f)(args...); },
                   (Return (*)(const Class *, Arg ...)) nullptr, extra...);
    }

// ...

pybind11::cpp_function::initialize

https://github.com/pybind/pybind11/blob/v2.4.3/include/pybind11/pybind11.h#L98

/// Special internal constructor for functors, lambda functions, etc.
template <typename Func, typename Return, typename... Args, typename... Extra>
void initialize(Func &&f, Return (*)(Args...), const Extra&... extra) {
  // ...
}

pybind11::cpp_function::dispatcher

https://github.com/pybind/pybind11/blob/v2.4.3/include/pybind11/pybind11.h#L423

static PyObject *dispatcher(PyObject *self, PyObject *args_in, PyObject *kwargs_in) {
  // ...
}

Insert Profiling Code

In addition to using OS-provided profiling tools, e.g., Linux’s perf and macOS’s Instruments, we should also add a custom profiling layer in the code. You may need to port your code to a platform that doesn’t have a very good system profiler. Your custom profiler then becomes the safety net.

/*
 * MODMESH_PROFILE defined: Enable profiling API.
 */
#ifdef MODMESH_PROFILE

// Two-level concatenation so that __LINE__ expands to the actual line
// number and timers on different lines get distinct names.
#define MODMESH_TIME_CONCAT2(A, B) A ## B
#define MODMESH_TIME_CONCAT(A, B) MODMESH_TIME_CONCAT2(A, B)
#define MODMESH_TIME(NAME) \
    ScopedTimer MODMESH_TIME_CONCAT(local_scoped_timer_, __LINE__)(NAME);

/*
 * No MODMESH_PROFILE defined: Disable profiling API.
 */
#else // MODMESH_PROFILE

#define MODMESH_TIME(NAME)

#endif // MODMESH_PROFILE
/*
 * End MODMESH_PROFILE.
 */

struct ScopedTimer
{

    ScopedTimer() = delete;

    ScopedTimer(const char * name) : m_name(name) {}

    ~ScopedTimer()
    {
        TimeRegistry::me().add(m_name, m_sw.lap());
    }

    StopWatch m_sw;
    char const * m_name;

}; /* end struct ScopedTimer */

// Manually
void StaticGrid1d::fill(StaticGrid1d::real_type val)
{
    MODMESH_TIME("StaticGrid1d::fill");
    std::fill(m_coord.get(), m_coord.get()+m_nx, val);
}
>>> import modmesh as mm
>>>
>>> gd = mm.StaticGrid1d(1000000)
>>> for it in range(100):
>>>     gd.fill(0)
>>> print(mm.time_registry.report())
StaticGrid1d.__init__ : count = 1 , time = 6.36e-06 (second)
StaticGrid1d.fill : count = 100 , time = 0.0346712 (second)
StaticGrid1d::fill : count = 100 , time = 0.0344818 (second)
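The same registry pattern is easy to prototype in Python first. Below is a minimal sketch with made-up names (this TimeRegistry/ScopedTimer pair is illustrative, not the modmesh API), pairing a context manager with a global accumulator the way the C++ RAII timer pairs with TimeRegistry::me():

```python
import time

class TimeRegistry:
    """Accumulate (count, total seconds) per timer name."""
    def __init__(self):
        self._entry = {}
    def add(self, name, seconds):
        cnt, tot = self._entry.get(name, (0, 0.0))
        self._entry[name] = (cnt + 1, tot + seconds)
    def report(self):
        return "\n".join(
            "{} : count = {} , time = {:g} (second)".format(n, c, t)
            for n, (c, t) in sorted(self._entry.items()))

time_registry = TimeRegistry()

class ScopedTimer:
    """Context-manager counterpart of the C++ RAII ScopedTimer."""
    def __init__(self, name):
        self._name = name
    def __enter__(self):
        self._t0 = time.perf_counter()
    def __exit__(self, *args):
        time_registry.add(self._name, time.perf_counter() - self._t0)

# Time a trivial body 100 times, mirroring the fill() session above.
for it in range(100):
    with ScopedTimer("fill"):
        pass  # work to be timed goes here
print(time_registry.report())
```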

Exercises

  1. xtensor allows writing array-based code in C++, just like what numpy does for Python. Use xtensor to write array-based code in C++ by modifying the C++ version of the point-Jacobi solver. The array-based C++ version should not have the inner loops.
  2. Suppose we may change the signature of the function fit_poly(). How can we ensure the shapes of xarr and yarr are the same, without the explicit check that throws "xarr and yarr size mismatch"? Write code to show.

References

[1] xtensor; multi-dimensional arrays with broadcasting and lazy computing: https://xtensor.readthedocs.io
[2] xtensor-python; Python bindings for the xtensor C++ multi-dimensional array library: https://xtensor-python.readthedocs.io
[3] pybind11; seamless operability between C++11 and Python: https://pybind11.readthedocs.io/en/stable/
[4] ipybind; IPython / Jupyter integration for pybind11: https://github.com/aldanor/ipybind