Neuron Models

TLU Implementation

It turns out that implementing the TLU sequentially is trivial. However, more considerations need to be taken into account when attempting to implement the neuron in parallel, and also when implementing the model as part of a larger library.


#include <vector>
#include <algorithm>

class TLU {
public:

  explicit TLU(int theta) : m_theta{ theta } {}

  // Fires (returns true) when the number of active input signals
  // reaches the threshold theta.
  bool apply(const std::vector<bool> & s) {
    int sum = 0;
    std::for_each(s.begin(), s.end(),
                  [&sum] (const bool v) { sum += static_cast<int>(v); } );
    return sum >= m_theta;
  }

private:
  int m_theta;

}; // class TLU

This implementation is sequential; to parallelise it, the for-each loop needs to be parallelised. This could be done using threads, OpenMP, or any other means, depending on the programming language used. In essence, we need to distribute chunks of the input signals to different workers. In C++17, which incorporated the Parallelism TS into the standard library, this can be done by simply passing an execution policy to the standard algorithms.
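As a minimal sketch of this, assuming a C++17 standard library with parallel algorithm support (on GCC this typically requires linking against TBB), the summation can be expressed with std::transform_reduce and an execution policy. A parallel for-each accumulating into a shared variable would itself introduce a data race, which is why the reduction algorithm is used here instead; note also that std::vector<bool>'s proxy iterators may not satisfy the parallel overloads' iterator requirements on every implementation. The ParallelTLU name is purely illustrative.


#include <vector>
#include <numeric>
#include <functional>
#include <execution>

class ParallelTLU {
public:

  explicit ParallelTLU(int theta) : m_theta{ theta } {}

  bool apply(const std::vector<bool> & s) {
    // Map each signal to 0/1 and sum in parallel; the library decides
    // how to split the work across threads.
    const int sum = std::transform_reduce(
      std::execution::par,
      s.begin(), s.end(),
      0,
      std::plus<>{},
      [](const bool v) { return static_cast<int>(v); });
    return sum >= m_theta;
  }

private:
  int m_theta;

}; // class ParallelTLU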

To distribute the input signals to different worker threads, we split the signals vector into chunks and calculate partial sums over each chunk independently, then reduce the partial sums to a terminal sum. For this specific implementation, it is simpler to add a partial sum to the terminal sum whenever a partial sum becomes available. However, we will follow the more general approach as it is more likely to be used in a distributed architecture.
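As a rough sketch of this chunk-and-reduce pattern, before the full thread-based implementation below, the partial sums could also be collected as futures via std::async and reduced once every worker has completed. The tlu_apply_chunked free function and the chunk size of 10 are illustrative assumptions only.


#include <cstddef>
#include <vector>
#include <algorithm>
#include <future>

// Sketch: split the signals into fixed-size chunks, compute each partial
// sum asynchronously, then reduce the partials to the terminal sum.
bool tlu_apply_chunked(const std::vector<bool> & s, int theta) {
  constexpr size_t chunk_size = 10;   // illustrative, mirrors k_signals_per_thread below

  std::vector<std::future<int>> partials;
  for (size_t begin = 0; begin < s.size(); begin += chunk_size) {
    const size_t end = std::min(begin + chunk_size, s.size());
    partials.push_back(std::async(std::launch::async, [&s, begin, end]() {
      int partial = 0;
      for (size_t i = begin; i < end; ++i) partial += static_cast<int>(s[i]);
      return partial;
    }));
  }

  // Reduce: wait for every worker and add its partial sum.
  int sum = 0;
  for (auto & f : partials) sum += f.get();
  return sum >= theta;
}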

Note, however, that the activation function, i.e. the apply function, is very simple in this case. For more complicated activation functions, and when high performance is required, GPUs are used to process the input signals in a massively parallel manner.


#include <vector>
#include <algorithm>
#include <numeric>
#include <thread>
#include <mutex>

class TLU {
public:

  explicit TLU(int theta) : m_theta{ theta } {}

  bool apply(const std::vector<bool> & s) {

    const size_t chunks = s.size() / k_signals_per_thread;
    const size_t remaining = s.size() % k_signals_per_thread;

    auto next = s.begin();
    std::mutex m;
    std::vector<std::thread> threads;
    std::vector<int> partials;

    // One worker per full chunk: each computes a partial sum over its
    // chunk and appends it to the shared container under the lock.
    for (size_t i = 0; i < chunks; ++i) {
      const auto first = next;
      const auto last = next + k_signals_per_thread;
      threads.emplace_back([this, &partials, &m, first, last]() {
        const int partial = partial_sum(first, last);
        std::lock_guard<std::mutex> lock(m);
        partials.push_back(partial);
      });
      next = last;
    }

    // One extra worker for the remaining signals, if any.
    if (remaining > 0) {
      const auto first = next;
      const auto last = next + remaining;
      threads.emplace_back([this, &partials, &m, first, last]() {
        const int partial = partial_sum(first, last);
        std::lock_guard<std::mutex> lock(m);
        partials.push_back(partial);
      });
    }

    for (auto & th : threads) th.join();

    // Reduce the partial sums to the terminal sum.
    const int sum = std::accumulate(partials.begin(), partials.end(), 0);
    return sum >= m_theta;
  }

private:

  template <typename Iterator>
  int partial_sum(Iterator begin, Iterator end) {
    int partial = 0;
    std::for_each(
      begin, end,
      [&](const auto v) { partial += static_cast<int>(v); });
    return partial;
  }

  int m_theta;

  static constexpr size_t k_signals_per_thread{ 10 };

}; // class TLU

Note that instead of threads, separate nodes in a cluster, a single GPU, or a cluster of GPUs could be used. That would be overkill for our purpose, but the pattern holds. Also, instead of assigning a fixed number of signals to each worker, the allocation could be derived dynamically from the input size, as sketched below. Of course, using a stream of signals is much more flexible in terms of input source, and in some cases memory footprint as well; at that point, however, parallelisation becomes more involved.
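For instance, rather than the fixed k_signals_per_thread constant, the chunk size could be derived from the input size and the number of hardware threads. A minimal sketch follows; the helper name chunk_size_for is an assumption.


#include <cstddef>
#include <algorithm>
#include <thread>

// Hypothetical helper: derive a chunk size from the input size and the
// number of hardware threads, instead of using a fixed constant.
size_t chunk_size_for(size_t input_size) {
  // hardware_concurrency() may return 0, so fall back to a single worker.
  const size_t workers = std::max<size_t>(1, std::thread::hardware_concurrency());
  // Round up so that at most `workers` chunks are produced.
  return std::max<size_t>(1, (input_size + workers - 1) / workers);
}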

Another interesting point is the lock guard we used. It is necessary to guard against concurrent modification of the partial sums container, which is not thread-safe. The same would apply if we updated the summation variable directly: a race condition would occur unless a different summation strategy were devised, for example storing the partial sums in a concurrent or lock-free container and delaying the reduction until all threads have finished (as shown above). An alternative strategy is to have each worker write its partial sum into its own variable or container slot, and then reduce these to the terminal sum after all workers have finished.
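A minimal sketch of that alternative: the partials container is sized up-front so that each worker writes only into its own element, which removes the need for the mutex, and the reduction happens after all threads have joined. The tlu_apply_slotted name and the free-function form are illustrative assumptions.


#include <cstddef>
#include <vector>
#include <algorithm>
#include <numeric>
#include <thread>

// Sketch: each worker writes its partial sum into its own pre-allocated
// slot, so no lock is needed; the reduction happens after the joins.
bool tlu_apply_slotted(const std::vector<bool> & s, int theta,
                       size_t signals_per_thread = 10) {
  const size_t chunks = (s.size() + signals_per_thread - 1) / signals_per_thread;
  std::vector<int> partials(chunks, 0);
  std::vector<std::thread> threads;

  for (size_t i = 0; i < chunks; ++i) {
    threads.emplace_back([&s, &partials, i, signals_per_thread]() {
      const size_t begin = i * signals_per_thread;
      const size_t end = std::min(begin + signals_per_thread, s.size());
      int partial = 0;
      for (size_t j = begin; j < end; ++j) partial += static_cast<int>(s[j]);
      partials[i] = partial;   // exclusive slot: no data race, no mutex
    });
  }

  for (auto & th : threads) th.join();

  const int sum = std::accumulate(partials.begin(), partials.end(), 0);
  return sum >= theta;
}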