# Array Code in C++

xtensor (http://xtensor.readthedocs.io/) is an array library in C++. It provides multi-dimensional array data structures that are amenable to compile-time optimization. The array library helps us organize code and achieve fast runtime.

1. Python is slow but easy to write
2. Speed up by using numpy (still in Python)
3. xtensor: write iterative code at C++ speed using arrays
4. Effect of house-keeping code

More on array-based system design:

1. Struct of array and array of struct
2. Conversion between dynamic and static semantics
3. Insert profiling code

## Python Is Slow

Python is usually slow when it comes to number-crunching, but so convenient to code.

Here we consider a boundary value problem of the Laplace equation for temperature distribution in a $$1\times1$$ square area.

$\begin{split}& \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \quad (0<x<1; 0<y<1) \\ &\left\{\begin{array}{lll} u(0,y) = 0, & u(1,y) = \sin(\pi y) & \rule{4ex}{0pt} (0 \le y \le 1) \\ u(x,0) = 0, & u(x,1) = 0 & \rule{4ex}{0pt} (0 \le x \le 1) \end{array}\right.\end{split}$

To solve it numerically, we choose the finite-difference method. The finite-difference method needs a grid to discretize the spatial domain. The simplest spatial discretization is the homogeneous Cartesian grid. Let’s make a $$51\times51$$ Cartesian grid.

```python
import numpy as np
from matplotlib import pyplot as plt

def make_grid():
    # Make a grid of 51 points in each direction on the unit square.
    nx = 51
    x = np.linspace(0, 1, nx)
    gx, gy = np.meshgrid(x, x)
    # Set the boundary condition.
    u = np.zeros_like(gx)
    u[0,:] = 0
    u[-1,:] = 1 * np.sin(np.linspace(0,np.pi,nx))
    u[:,0] = 0
    u[:,-1] = 0
    return nx, x, u

def show_grid(size):
    fig, ax = plt.subplots(figsize=(size,size))
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.set_xticks(x, minor=True)
    ax.set_yticks(x, minor=True)
    ax.set_xlabel('$x$')
    ax.set_ylabel('$y$')
    ax.grid(True, which='minor')

nx, x, uoriginal = make_grid()
show_grid(10)
```

After the grid is defined, we may derive the finite-differencing formula. Use the Taylor series expansion to obtain the difference equation:

$\begin{split}& \frac{u(x_{i+1}, y_j) - 2u(x_i, y_j) + u(x_{i-1}, y_j)}{(\Delta x)^2} \\ &\quad + \frac{u(x_i, y_{j+1}) - 2u(x_i, y_j) + u(x_i, y_{j-1})}{(\Delta y)^2} = 0\end{split}$

Note $$\Delta x = \Delta y$$. The difference equation is rewritten as

$u(x_i, y_j) = \frac{u(x_{i+1}, y_j) + u(x_{i-1}, y_j) + u(x_i, y_{j+1}) + u(x_i, y_{j-1})}{4}$

Apply the point-Jacobi method to write a formula to iteratively solve the difference equation:

$u^{n+1}(x_i, y_j) = \frac{u^n(x_{i+1}, y_j) + u^n(x_{i-1}, y_j) + u^n(x_i, y_{j+1}) + u^n(x_i, y_{j-1})}{4}$

where $$u^n$$ is the solution at the $$n$$-th iteration.
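Before implementing the full solver, the update formula can be sanity-checked on a made-up 3×3 grid (the values are arbitrary; only the single interior point gets updated):

```python
import numpy as np

# A made-up 3x3 grid; the only interior point is (1, 1).
u = np.array([[0., 1., 2.],
              [3., 4., 5.],
              [6., 7., 8.]])
un = u.copy()
# One point-Jacobi update: the new value is the average of the four neighbors.
un[1,1] = (u[2,1] + u[0,1] + u[1,2] + u[1,0]) / 4
print(un[1,1])  # (7 + 1 + 5 + 3) / 4 = 4.0
```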

Now we can use Python to quickly implement the solver:

```python
import time

import numpy as np
from matplotlib import pyplot as plt

class Timer:
    def __enter__(self):
        self._t0 = time.time()
    def __exit__(self, *args, **kw):
        self._t1 = time.time()
        print("Wall time: {:g} s".format(self._t1 - self._t0))

def solve_python_loop():
    u = uoriginal.copy()
    un = u.copy()
    converged = False
    step = 0
    # Outer loop.
    while not converged:
        step += 1
        # Inner loops. One for x and the other for y.
        for it in range(1, nx-1):
            for jt in range(1, nx-1):
                un[it,jt] = (u[it+1,jt] + u[it-1,jt] + u[it,jt+1] + u[it,jt-1]) / 4
        norm = np.abs(un-u).max()
        u[...] = un[...]
        converged = True if norm < 1.e-5 else False
    return u, step, norm

def show_result(u, step, norm, size=7):
    print("step", step, "norm", norm)
    fig, ax = plt.subplots(figsize=(size,size))
    cs = ax.contour(x, x, u.T)
    ax.clabel(cs, inline=1, fontsize=10)
    ax.set_xticks(np.linspace(0,1,6))
    ax.set_yticks(np.linspace(0,1,6))
    ax.set_xlabel('$x$')
    ax.set_ylabel('$y$')
    ax.grid(True, which='minor')
```
>>> with Timer():
>>>     u, step, norm = solve_python_loop()
Wall time: 4.79688 s


It takes quite a while (around 5 seconds) to converge with 2097 iterations.

Is the calculation correct? For any numerical application, correctness is the first condition.

We may compare the numerical solution with the analytical solution. Recall the PDE and its boundary conditions:

$\begin{split}& \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \quad (0<x<1; 0<y<1) \\ &\left\{\begin{array}{lll} u(0,y) = 0, & u(1,y) = \sin(\pi y) & \rule{4ex}{0pt} (0 \le y \le 1) \\ u(x,0) = 0, & u(x,1) = 0 & \rule{4ex}{0pt} (0 \le x \le 1) \end{array}\right.\end{split}$

Use separation of variables. Assume the solution takes the form $$u(x,y) = \phi(x)\psi(y)$$.

$\begin{split}& \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \quad \Rightarrow \quad \phi''\psi + \phi\psi'' = 0 \\ & \Rightarrow \quad \frac{\phi''}{\phi} = -\frac{\psi''}{\psi} = \lambda\end{split}$
$\begin{split}\left\{\begin{array}{ll} \phi'' - \lambda\phi = 0, & \phi(0) = 0 \\ \psi'' + \lambda\psi = 0, & \psi(0) = 0, \, \psi(1) = 0 \end{array}\right.\end{split}$

$$\lambda$$ is the eigenvalue. The general solution of the ODE of $$\psi$$ can be obtained as

$\psi(y) = c\cos(\kappa y) + d\sin(\kappa y), \quad \kappa^2 = \lambda$

Substitute the boundary conditions into the general solution

$\begin{split}& \psi(0) = c = 0 \, \Rightarrow \, c = 0 \\ & \psi(1) = d\sin(\kappa) = 0 \, \Rightarrow \, \kappa = n\pi, \, n = 1, 2, 3, \ldots \\ & \psi(y) = \sum_{n=1}^{\infty} \psi_n(y), \; \mbox{where} \; \psi_n(y) = d_n\sin(n\pi y) \\ & \Rightarrow \psi(y) = \sum_{n=1}^{\infty} d_n \sin(n\pi y)\end{split}$

Substitute the eigenvalue $$\lambda = (n\pi)^2$$ into the ODE of $$\phi$$

$\phi'' - (n\pi)^2\phi = 0$

The general solution is

$\phi_n(x) = a_n\cosh(\kappa x) + b_n\sinh(\kappa x)$

Apply the boundary condition $$\phi_n(0) = a_n = 0$$ and obtain $$\phi_n(x) = b_n\sinh(\kappa x)$$.

The solution $$u(x, y)$$ can now be written as

$u(x,y) = \sum_{n=1}^{\infty}\phi_n(x)\psi_n(y) = \sum_{n=1}^{\infty} \alpha_n \sinh(n\pi x)\sin(n\pi y)$

where $$\alpha_n = b_n d_n$$. Apply the last boundary condition

$u(1,y) = \sin(\pi y) = \sum_{n=1}^{\infty}\alpha_n\sinh(n\pi)\sin(n\pi y)$

Matching terms gives $$\alpha_1 = 1/\sinh(\pi)$$ and $$\alpha_k = 0$$ for $$k = 2, 3, \ldots$$. The solution of $$u$$ is obtained:

$u(x, y) = \frac{\sinh(\pi x)}{\sinh(\pi)} \sin(\pi y)$
```python
import numpy as np

def solve_analytical():
    u = np.empty((len(x), len(x)), dtype='float64')
    for ix, cx in enumerate(x):
        u[ix, :] = np.sinh(np.pi*cx) / np.sinh(np.pi) * np.sin(np.pi*x)
    return u
```
>>> uanalytical = solve_analytical()
>>> # Calculate the L inf norm.
>>> print("Linf of difference is %f" % np.abs(u - uanalytical).max())
Linf of difference is 0.004962


Say $$u_a$$ is the analytical solution. The difference $$|u-u_a|_{\infty} \approx 0.005$$ from the above result is good enough for a $$51\times51$$ grid.

## Array-Based Code with Numpy

We can usually use numpy to speed up slow Python loops. Numpy implements fast array calculations in C. By using numpy, we essentially delegate the calculation to C.
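A minimal illustration of the delegation, using the same five-point stencil on a made-up 5×5 array: the slice expression computes exactly what the explicit loops do, but the looping happens in C.

```python
import numpy as np

u = np.arange(25, dtype='float64').reshape(5, 5)

# Explicit Python loops over the interior points.
looped = u.copy()
for it in range(1, 4):
    for jt in range(1, 4):
        looped[it,jt] = (u[it+1,jt] + u[it-1,jt] + u[it,jt+1] + u[it,jt-1]) / 4

# The same stencil as one array expression; numpy loops in C.
sliced = u.copy()
sliced[1:4,1:4] = (u[2:5,1:4] + u[0:3,1:4] + u[1:4,2:5] + u[1:4,0:3]) / 4

print(np.allclose(looped, sliced))  # True
```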

```python
import numpy as np

def solve_array():
    u = uoriginal.copy()
    un = u.copy()
    converged = False
    step = 0
    while not converged:
        step += 1
        # Update all interior points with a single array expression.
        un[1:nx-1,1:nx-1] = (u[2:nx,1:nx-1] + u[0:nx-2,1:nx-1]
                           + u[1:nx-1,2:nx] + u[1:nx-1,0:nx-2]) / 4
        norm = np.abs(un-u).max()
        u[...] = un[...]
        converged = True if norm < 1.e-5 else False
    return u, step, norm
```
>>> # Run the numpy array-based solver
>>> with Timer():
>>>     u, step, norm = solve_array()
Wall time: 0.0552309 s


The speed is much better: less than 0.1 second. Compared with the naive Python loop, the speed-up is 87x.

## Nested Loop in C++

Oftentimes numpy is still not fast enough. Besides, the vectorized code is not really easy to read. A nested loop reads more straightforwardly for our point-Jacobi method. This is where xtensor comes in.

Except for the parentheses (xtensor indexes elements with `u(it,jt)` rather than `u[it,jt]`), the C++ version looks almost the same as the Python version.

```cpp
#include <pybind11/pybind11.h>
#define FORCE_IMPORT_ARRAY
#include <xtensor-python/pyarray.hpp>

#include <vector>
#include <algorithm>
#include <tuple>
#include <iostream>
#include <cstddef>

#include <xtensor/xarray.hpp>
#include <xtensor/xmath.hpp>
#include <xtensor/xview.hpp>

std::tuple<xt::xarray<double>, size_t, double> solve1(xt::xarray<double> u)
{
    const size_t nx = u.shape(0);
    xt::xarray<double> un = u;
    bool converged = false;
    size_t step = 0;
    double norm;
    while (!converged)
    {
        ++step;
        for (size_t it=1; it<nx-1; ++it)
        {
            for (size_t jt=1; jt<nx-1; ++jt)
            {
                un(it,jt) = (u(it+1,jt) + u(it-1,jt) + u(it,jt+1) + u(it,jt-1)) / 4;
            }
        }
        norm = xt::amax(xt::abs(un-u))();
        if (norm < 1.e-5) { converged = true; }
        u = un;
    }
    return std::make_tuple(u, step, norm);
}

PYBIND11_MODULE(solve_cpp, m)
{
    xt::import_numpy();
    m.def
    (
        "solve_cpp"
      , [](xt::pyarray<double> & uin)
        {
            return solve1(xt::xarray<double>(uin));
        }
    );
}
```
$ g++ solve_cpp.cpp -o solve_cpp.so -O3 -fPIC -shared -std=c++17 -lpython3.9

>>> with Timer():
>>>     u, step, norm = solve_cpp.solve_cpp(uoriginal)
Wall time: 0.0251369 s

| Code | Time (s) | Speedup (vs Python) | Speedup (vs numpy) |
| --- | --- | --- | --- |
| Python nested loop | 4.797 | 1 (baseline) | n/a |
| numpy array | 0.055 | 87.22 | 1 (baseline) |
| C++ nested loop | 0.025 | 191.88 | 2.2 |

## Major Source of Overhead: Data Preparation

Numerical calculation takes time. Intuitively, developers spend that time optimizing the number-crunching code. However, for a useful application, the house-keeping code that prepares the calculation data and post-processes the results is equally important.

In our previous example of solving the Laplace equation, all the conditions are hard-coded. That is OK for teaching purposes, but not useful to those who are unfamiliar with the math and the numerical method. This time, I will use an example of curve fitting to show how the house-keeping code affects performance, and how xtensor comes to help.

We will do polynomial curve fitting for data in groups of variable length.
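For reference, the fitting math can be sketched in Python first. The helper below (`fit_poly_py` is a hypothetical name; `numpy.polyfit` is the robust library route) solves the least-squares normal equations for the polynomial coefficients:

```python
import numpy as np

def fit_poly_py(xarr, yarr, order):
    """Least-squares polynomial fit via the normal equations.

    Coefficients are returned from the highest order down, the order
    numpy.poly1d expects.
    """
    powers = np.arange(order + 1)
    # matrix[i,j] = sum_k x_k**(i+j); rhs[i] = sum_k x_k**i * y_k.
    matrix = np.array([[np.sum(xarr**(i + j)) for j in powers] for i in powers])
    rhs = np.array([np.sum(xarr**i * yarr) for i in powers])
    return np.linalg.solve(matrix, rhs)[::-1]

x = np.linspace(0, 1, 11)
y = 3*x**2 + 2*x + 1
print(fit_poly_py(x, y, 2))  # close to [3. 2. 1.]
```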

```cpp
#include <pybind11/pybind11.h>
#define FORCE_IMPORT_ARRAY
#include <xtensor-python/pyarray.hpp>

#include <vector>
#include <algorithm>
#include <cmath>

#include <xtensor/xarray.hpp>
#include <xtensor/xadapt.hpp>
#include <xtensor/xview.hpp>
#include <xtensor-blas/xlinalg.hpp>

using array_type = xt::xarray<double>;

template <class AT>
xt::xarray<double> fit_poly(AT & xarr, AT & yarr, size_t order)
{
    if (xarr.size() != yarr.size())
    {
        throw std::runtime_error("xarr and yarr size mismatch");
    }

    // Build the normal matrix for least-square regression.
    xt::xarray<double> matrix(std::vector<size_t>{order+1, order+1});
    for (size_t it=0; it<order+1; ++it)
    {
        for (size_t jt=0; jt<order+1; ++jt)
        {
            double & val = matrix(it, jt);
            val = 0;
            for (size_t kt=0; kt<xarr.size(); ++kt)
            {
                val += pow(xarr[kt], it+jt);
            }
        }
    }

    // Build the right-hand-side vector.
    xt::xarray<double> rhs(std::vector<size_t>{order+1});
    for (size_t jt=0; jt<order+1; ++jt)
    {
        rhs[jt] = 0;
        for (size_t kt=0; kt<yarr.size(); ++kt)
        {
            rhs[jt] += pow(xarr[kt], jt) * yarr[kt];
        }
    }

    // Solve for the polynomial coefficients.
    xt::xarray<double> lhs = xt::linalg::solve(matrix, rhs);
    std::reverse(lhs.begin(), lhs.end()); // to make numpy.poly1d happy.

    return lhs;
}

/*
 * This function calls fit_poly() for each point group of one interval.
 */
template <class AT>
xt::xarray<double> fit_polys(xt::xarray<double> & xarr, xt::xarray<double> & yarr, size_t order)
{
    size_t xmin = std::floor(*std::min_element(xarr.begin(), xarr.end()));
    size_t xmax = std::ceil(*std::max_element(xarr.begin(), xarr.end()));
    size_t ninterval = xmax - xmin;

    xt::xarray<double> lhs(std::vector<size_t>{ninterval, order+1});
    lhs.fill(0); // sentinel.

    size_t start=0;
    for (size_t it=0; it<ninterval; ++it)
    {
        // Find the end of the interval [it, it+1).
        size_t stop;
        for (stop=start; stop<xarr.size(); ++stop)
        {
            if (xarr[stop]>=it+1) { break; }
        }
        // Obtain the slice of the input.
        AT const sub_x = xt::view(xarr, xt::range(start, stop));
        AT const sub_y = xt::view(yarr, xt::range(start, stop));
        // Fit the polynomial and store the coefficients.
        xt::xarray<double> sub_lhs = fit_poly(sub_x, sub_y, order);
        xt::view(lhs, it, xt::all()) = sub_lhs;
        start = stop;
    }

    return lhs;
}

PYBIND11_MODULE(data_prep, m)
{
    xt::import_numpy();
    m.def
    (
        "fit_poly"
      , [](xt::pyarray<double> & xarr_in, xt::pyarray<double> & yarr_in, size_t order)
        {
            std::vector<size_t> xarr_shape(xarr_in.shape().begin(), xarr_in.shape().end());
            xt::xarray<double> xarr = xt::adapt(xarr_in.data(), xarr_shape);
            std::vector<size_t> yarr_shape(yarr_in.shape().begin(), yarr_in.shape().end());
            xt::xarray<double> yarr = xt::adapt(yarr_in.data(), yarr_shape);
            return fit_poly(xarr, yarr, order);
        }
    );
    m.def
    (
        "fit_polys"
      , [](xt::pyarray<double> & xarr_in, xt::pyarray<double> & yarr_in, size_t order)
        {
            std::vector<size_t> xarr_shape(xarr_in.shape().begin(), xarr_in.shape().end());
            xt::xarray<double> xarr = xt::adapt(xarr_in.data(), xarr_shape);
            std::vector<size_t> yarr_shape(yarr_in.shape().begin(), yarr_in.shape().end());
            xt::xarray<double> yarr = xt::adapt(yarr_in.data(), yarr_shape);
            return fit_polys<array_type>(xarr, yarr, order);
        }
    );
}
```
>>> with Timer():
>>>     # np.unique returns a sorted array.
>>>     xdata = np.unique(np.random.sample(1000000) * 1000)
>>>     ydata = np.random.sample(len(xdata)) * 1000
Wall time: 0.114635 s

```python
import data_prep

plt.rc('figure', figsize=(12, 8))

def plot_poly_fitted(i):
    # Fit a cubic polynomial to the points in the interval [i, i+1).
    slct = (xdata>=i)&(xdata<(i+1))
    sub_x = xdata[slct]
    sub_y = ydata[slct]
    poly = data_prep.fit_poly(sub_x, sub_y, 3)
    print(poly)
    poly = np.poly1d(poly)
    xp = np.linspace(sub_x.min(), sub_x.max(), 100)
    plt.plot(sub_x, sub_y, '.', xp, poly(xp), '-')
```

>>> plot_poly_fitted(10)


Now, let's see the impact on runtime from the house-keeping code outside the calculating helper.

>>> with Timer():
>>>     # Do the calculation for the 1000 groups of points.
>>>     polygroup = np.empty((1000, 3), dtype='float64')
>>>     for i in range(1000):
>>>         # Use numpy to build the point group.
>>>         slct = (xdata>=i)&(xdata<(i+1))
>>>         sub_x = xdata[slct]
>>>         sub_y = ydata[slct]
>>>         polygroup[i,:] = data_prep.fit_poly(sub_x, sub_y, 2)
Wall time: 1.49671 s

>>> with Timer():
>>>     # Using numpy to build the point groups takes a lot of time.
>>>     data_groups = []
>>>     for i in range(1000):
>>>         slct = (xdata>=i)&(xdata<(i+1))
>>>         data_groups.append((xdata[slct], ydata[slct]))
Wall time: 1.24653 s

>>> with Timer():
>>>     # Fitting helper runtime is much less than building the point groups.
>>>     polygroup = np.empty((1000, 3), dtype='float64')
>>>     for it, (sub_x, sub_y) in enumerate(data_groups):
>>>         polygroup[it,:] = data_prep.fit_poly(sub_x, sub_y, 2)
Wall time: 0.215859 s


It's very productive to write house-keeping code in Python. As we see, the price to pay is runtime, and oftentimes memory as well. But spending 5x the runtime of the calculation on house-keeping code is intolerable. We need to write C++ to speed it up.

Now see the fit_polys() C++ helper. It detects the point groups right before fitting, so the house-keeping is done in C++ as well.

>>> with Timer():
>>>     rbatch = data_prep.fit_polys(xdata, ydata, 2)
Wall time: 0.21058 s
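A side note on the grouping itself: because np.unique returns a sorted array, the group boundaries can also be located in Python with a single vectorized binary search instead of 1000 boolean scans (a sketch with made-up data, not a replacement for the C++ helper):

```python
import numpy as np

xdata = np.sort(np.random.sample(10000) * 100)  # sorted, like np.unique output

# One searchsorted call yields every boundary of the intervals [i, i+1).
bounds = np.searchsorted(xdata, np.arange(101))
groups = [xdata[bounds[i]:bounds[i+1]] for i in range(100)]

assert sum(len(g) for g in groups) == len(xdata)
assert all((g >= i).all() and (g < i + 1).all() for i, g in enumerate(groups))
```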


## Struct of Array and Array of Struct

An array of struct (AOS) keeps all fields of one record together in memory, while a struct of array (SOA) keeps each field in its own contiguous array, which is usually friendlier to vectorization and cache use. A proxy object can give the SOA layout a record-like interface:

```cpp
struct StructOfArray
{
    std::vector<double> x;
    std::vector<double> y;
};

struct PointProxy
{
    StructOfArray * soa;
    size_t idx;
    double   x() const { return soa->x[idx]; }
    double & x()       { return soa->x[idx]; }
    double   y() const { return soa->y[idx]; }
    double & y()       { return soa->y[idx]; }
};

/*
 * Array of struct:
 */
struct Point
{
    double x, y;
};

using ArrayOfStruct = std::vector<Point>;
```
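The same contrast can be sketched with numpy (the field names are made up): a structured dtype interleaves the fields like an array of struct, while separate per-field arrays form a struct of array.

```python
import numpy as np

n = 4

# Array of struct: x and y of each point are adjacent in memory.
aos = np.zeros(n, dtype=[('x', 'f8'), ('y', 'f8')])

# Struct of array: all x values are contiguous, as are all y values.
soa_x = np.zeros(n)
soa_y = np.zeros(n)

aos['x'][2] = soa_x[2] = 7.0
print(aos.strides)    # (16,): stepping to the next point skips both fields
print(soa_x.strides)  # (8,): stepping to the next point skips one double
```

Which layout is faster depends on the access pattern: SOA favors sweeping one field across many records, while AOS favors touching all fields of one record at a time.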

## Conversion between Dynamic and Static

It's not a bad idea to do the conversion manually. Spelling out the static-to-dynamic conversion makes it clear what we want to do. When we work in the inner-most loop, no PyObject or virtual function table should be there.

```cpp
template <size_t ND>
class SpaceBase
{
public:
    static constexpr const size_t NDIM = ND;
    using serial_type = uint32_t;
    using real_type = double;
}; /* end class SpaceBase */

/* StaticGridBase<ND> (derived from SpaceBase<ND>) is elided. */

class StaticGrid1d
  : public StaticGridBase<1>
{
}; /* end class StaticGrid1d */

class StaticGrid2d
  : public StaticGridBase<2>
{
}; /* end class StaticGrid2d */

class StaticGrid3d
  : public StaticGridBase<3>
{
}; /* end class StaticGrid3d */

/*
 * WrapStaticGridBase has the pybind11 wrapping code.
 */
class WrapStaticGrid1d
  : public WrapStaticGridBase< WrapStaticGrid1d, StaticGrid1d >
{
}; /* end class WrapStaticGrid1d */

class WrapStaticGrid2d
  : public WrapStaticGridBase< WrapStaticGrid2d, StaticGrid2d >
{
}; /* end class WrapStaticGrid2d */

class WrapStaticGrid3d
  : public WrapStaticGridBase< WrapStaticGrid3d, StaticGrid3d >
{
}; /* end class WrapStaticGrid3d */

PYBIND11_MODULE(_modmesh, mod)
{
    WrapStaticGrid1d::commit(mod);
    WrapStaticGrid2d::commit(mod);
    WrapStaticGrid3d::commit(mod);
}
```

### Example: pybind11::cpp_function

#### pybind11::cpp_function

https://github.com/pybind/pybind11/blob/v2.4.3/include/pybind11/pybind11.h#L56

```cpp
/// Wraps an arbitrary C++ function/method/lambda function/.. into a callable Python object
class cpp_function : public function {
public:
    cpp_function() { }
    cpp_function(std::nullptr_t) { }

    /// Construct a cpp_function from a vanilla function pointer
    template <typename Return, typename... Args, typename... Extra>
    cpp_function(Return (*f)(Args...), const Extra&... extra) {
        initialize(f, f, extra...);
    }

    /// Construct a cpp_function from a lambda function (possibly with internal state)
    template <typename Func, typename... Extra,
              typename = detail::enable_if_t<detail::is_lambda<Func>::value>>
    cpp_function(Func &&f, const Extra&... extra) {
        initialize(std::forward<Func>(f),
                   (detail::function_signature_t<Func> *) nullptr, extra...);
    }

    /// Construct a cpp_function from a class method (non-const)
    template <typename Return, typename Class, typename... Arg, typename... Extra>
    cpp_function(Return (Class::*f)(Arg...), const Extra&... extra) {
        initialize([f](Class *c, Arg... args) -> Return { return (c->*f)(args...); },
                   (Return (*) (Class *, Arg...)) nullptr, extra...);
    }

    /// Construct a cpp_function from a class method (const)
    template <typename Return, typename Class, typename... Arg, typename... Extra>
    cpp_function(Return (Class::*f)(Arg...) const, const Extra&... extra) {
        initialize([f](const Class *c, Arg... args) -> Return { return (c->*f)(args...); },
                   (Return (*)(const Class *, Arg ...)) nullptr, extra...);
    }

    // ...
```

#### pybind11::cpp_function::initialize

https://github.com/pybind/pybind11/blob/v2.4.3/include/pybind11/pybind11.h#L98

```cpp
/// Special internal constructor for functors, lambda functions, etc.
template <typename Func, typename Return, typename... Args, typename... Extra>
void initialize(Func &&f, Return (*)(Args...), const Extra&... extra) {
    // ...
}
```


#### pybind11::cpp_function::dispatcher

https://github.com/pybind/pybind11/blob/v2.4.3/include/pybind11/pybind11.h#L423

```cpp
static PyObject *dispatcher(PyObject *self, PyObject *args_in, PyObject *kwargs_in) {
    // ...
}
```


## Insert Profiling Code

In addition to using OS-provided profiling tools, e.g., Linux's perf and macOS's Instruments, we should also add a custom profiling layer in the code. You may need to port your code to a platform that doesn't have a good system profiler. Your custom profiler then becomes the safety net.

```cpp
/*
 * MODMESH_PROFILE defined: Enable profiling API.
 */
#ifdef MODMESH_PROFILE

#define MODMESH_TIME(NAME) \
    ScopedTimer local_scoped_timer_ ## __LINE__(NAME);

/*
 * No MODMESH_PROFILE defined: Disable profiling API.
 */
#else // MODMESH_PROFILE

#define MODMESH_TIME(NAME)

#endif // MODMESH_PROFILE
/*
 * End MODMESH_PROFILE.
 */

struct ScopedTimer
{
    ScopedTimer() = delete;
    ScopedTimer(const char * name) : m_name(name) {}
    ~ScopedTimer()
    {
        TimeRegistry::me().add(m_name, m_sw.lap());
    }
    StopWatch m_sw;
    char const * m_name;
}; /* end struct ScopedTimer */

// Manually put the timer in a function.
void StaticGrid1d::fill(StaticGrid1d::real_type val)
{
    MODMESH_TIME("StaticGrid1d::fill");
    std::fill(m_coord.get(), m_coord.get()+m_nx, val);
}
```
>>> import modmesh as mm
>>>
>>> gd = mm.StaticGrid1d(1000000)
>>> for it in range(100):
>>>     gd.fill(0)
>>> print(mm.time_registry.report())
StaticGrid1d.__init__ : count = 1 , time = 6.36e-06 (second)
StaticGrid1d.fill : count = 100 , time = 0.0346712 (second)
StaticGrid1d::fill : count = 100 , time = 0.0344818 (second)
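The same mechanism can be sketched in Python with a context manager (the names mirror the C++ ScopedTimer and TimeRegistry above, but this is an illustrative sketch, not the modmesh implementation):

```python
import time
from collections import defaultdict

class TimeRegistry:
    """Accumulate a (count, total-seconds) entry per profiled name."""
    def __init__(self):
        self._entry = defaultdict(lambda: [0, 0.0])
    def add(self, name, seconds):
        entry = self._entry[name]
        entry[0] += 1
        entry[1] += seconds
    def report(self):
        return "\n".join(
            "{} : count = {} , time = {:g} (second)".format(name, count, total)
            for name, (count, total) in sorted(self._entry.items()))

time_registry = TimeRegistry()

class ScopedTimer:
    """Time a with-block and report to the registry, like the C++ RAII timer."""
    def __init__(self, name):
        self._name = name
    def __enter__(self):
        self._t0 = time.perf_counter()
    def __exit__(self, *args):
        time_registry.add(self._name, time.perf_counter() - self._t0)

for _ in range(3):
    with ScopedTimer("fill"):
        sum(range(10000))
print(time_registry.report())
```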


## Exercises

1. xtensor allows writing array-based code in C++, just like what numpy does for Python. Use xtensor to write array-based code in C++ by modifying the C++ version of the point-Jacobi solver. The array-based C++ version should not have the inner loops.
2. Suppose we are allowed to change the signature of the function fit_poly(): how can we ensure that the shapes of xarr and yarr are the same without the explicit runtime check that throws "xarr and yarr size mismatch"? Write code to show.