The performance-energy gap between fully customized ASICs and general-purpose CPUs can be extremely large, especially for compute-intensive applications and for the domains that drive today’s progress in computing technology, including image and video processing, cryptography, machine learning, and bioinformatics. Heterogeneous architectures that combine general-purpose CPUs with domain-specialized accelerators offer a way to close this power-performance gap. However, a challenge that emerges from heterogeneity is the increased complexity of design-space exploration and of partitioning applications across the available computing fabric (general-purpose CPUs, GPUs, FPGAs). Applications generally have a set of Pareto-optimal implementations that trade off numerous metrics such as performance, throughput, power, energy, cost, quality of service, timing constraints, and fault tolerance. A promising way out of the flexibility vs. efficiency dilemma outlined above comes from the emergence of a few rapidly growing application domains that are increasingly becoming the key drivers of market adoption for computing systems.
Our applications group focuses on the following disruptive application domains that are driving the evolution of computing systems:
1. Video Processing (VP): In this domain we address video coding and transcoding of ever-higher-resolution video data, which today already constitutes over 80% of global internet traffic. Recent advances in the field have led to the adoption of the HEVC video coding standard, which poses high computational demands that have yet to be met effectively in large-scale video delivery systems with real-time on-demand transcoding. The most demanding computational kernels in VP (such as motion estimation, the discrete cosine transform, and interpolation) exhibit abundant fine-grained data-level parallelism, with throughput as the most critical metric (see the sketch below). The most common acceleration strategies in existing heterogeneous systems exploit SIMD/vector extensions on general-purpose CPUs or GPGPU computing on graphics units. However, these approaches do not weigh energy equally with performance. Moreover, HEVC innovations such as tiles and wavefront parallel processing cannot be fully exploited on GPUs; they are meant to be exploited by tiles of CPU cores that concurrently process video frame partitions and complement the fine-grained pixel-level parallelism exploited on GPUs. The added coding complexity of HEVC requires some parts to be customized specifically, either through dedicated CPU instructions (e.g., SIMD) or through the addition of dedicated accelerators.
Bolt65 is the codename of an ongoing project at the Faculty of Electrical Engineering and Computing in Zagreb. The aim of the project is to develop a “clean room” software/hardware suite consisting of an encoder, decoder, and transcoder. Find out more about Bolt65 here.
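To make the kind of parallelism mentioned above concrete, the following is a minimal, illustrative C++ sketch (not Bolt65 code) of a sum-of-absolute-differences (SAD) kernel, the inner loop of block-based motion estimation. Every per-pixel operation is independent, which is exactly what SIMD lanes, GPU threads, or a dedicated motion-estimation accelerator exploit; the block size and brute-force search window here are arbitrary choices for the example.

```cpp
// Illustrative SAD kernel for motion estimation (not Bolt65 code).
// The independent per-pixel operations map naturally onto SIMD lanes
// or a dedicated hardware accelerator.
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <vector>

// SAD between an 8x8 block of the current frame and a candidate block in the
// reference frame; `stride` is the frame width in pixels.
uint32_t sad8x8(const uint8_t* cur, const uint8_t* ref, int stride) {
    uint32_t sad = 0;
    for (int y = 0; y < 8; ++y)
        for (int x = 0; x < 8; ++x)   // every iteration is independent
            sad += std::abs(int(cur[y * stride + x]) - int(ref[y * stride + x]));
    return sad;
}

int main() {
    const int w = 64, h = 64;
    std::vector<uint8_t> cur(w * h, 128), ref(w * h, 120);
    // Exhaustive search over a small window: pick the candidate with minimal SAD.
    uint32_t best = UINT32_MAX;
    int best_dx = 0, best_dy = 0;
    for (int dy = -4; dy <= 4; ++dy)
        for (int dx = -4; dx <= 4; ++dx) {
            uint32_t s = sad8x8(&cur[20 * w + 20], &ref[(20 + dy) * w + (20 + dx)], w);
            if (s < best) { best = s; best_dx = dx; best_dy = dy; }
        }
    std::cout << "best motion vector: (" << best_dx << ", " << best_dy
              << ") SAD = " << best << "\n";
}
```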
2. Machine Learning (ML): ML is penetrating innovative fields such as autonomous driving, natural language recognition, smart robots, image and video recognition, expert systems, and virtual personal assistants. We are witnessing the rise of deep learning techniques and of popular libraries and APIs such as TensorFlow. Deep neural networks (DNNs) have a compute-demanding training phase, which is currently performed mostly on GPUs with floating-point arithmetic, while the inference phase runs on both high-performance and embedded devices and, in many applications such as image recognition, resorts to reduced-precision numerical representations. It is important to note that the most computationally intensive kernels in ML are linear algebra operators (matrix multiplication, matrix-vector products, multi-dimensional convolutional stencils), which are also very common in traditional HPC applications. Focusing on these kernels and implementing accelerators for them therefore delivers both specialization and flexibility across multiple application domains (see the sketch below).
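As an illustration of the kind of kernel meant above, the sketch below shows a plain int8 matrix multiplication with 32-bit accumulation, the reduced-precision pattern commonly used for inference. It is a naive reference implementation rather than a tuned or accelerator-specific one, and all names and matrix sizes are illustrative only.

```cpp
// Sketch of the core ML kernel: matrix multiply with reduced-precision (int8)
// operands and 32-bit accumulation, the kind of operation a tensor unit or
// DNN accelerator specializes. Matrices are stored row-major.
#include <cstdint>
#include <iostream>
#include <vector>

void matmul_int8(const std::vector<int8_t>& A,   // M x K
                 const std::vector<int8_t>& B,   // K x N
                 std::vector<int32_t>& C,        // M x N (output)
                 int M, int K, int N) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            int32_t acc = 0;                     // wide accumulator avoids overflow
            for (int k = 0; k < K; ++k)
                acc += int32_t(A[i * K + k]) * int32_t(B[k * N + j]);
            C[i * N + j] = acc;
        }
}

int main() {
    const int M = 2, K = 3, N = 2;
    std::vector<int8_t> A = {1, 2, 3, 4, 5, 6};
    std::vector<int8_t> B = {7, 8, 9, 10, 11, 12};
    std::vector<int32_t> C(M * N);
    matmul_int8(A, B, C, M, K, N);
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) std::cout << C[i * N + j] << " ";
        std::cout << "\n";
    }   // expected output: 58 64 / 139 154
}
```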
3. Crypto Processing (CP): Efficient implementations of cryptographic primitives are essential for the digital services that a digital world depends on. While cryptography is associated first of all with security and privacy, cryptographic functions are also core components of systems that identify and differentiate individual users (authentication), ensure that digital data is not modified (integrity), and guarantee that digital transactions can be ratified (non-repudiation). All practical digital services require cryptographic primitives (hash functions, block ciphers, etc.). In the IoT domain, where devices have limited resources, efficient implementations of these primitives are even more important. The situation is exacerbated by side-channel attacks, which exploit physical properties of practical implementations of these primitives; additional resources (area, power) therefore have to be invested to reduce the side-channel attack surface (see the sketch below). A third complication is the emergence of blockchain-based systems, which have proven to be an enabling technology wherever a chain of trust is critical (e.g., financial transaction processing, cryptocurrencies). Implementing blockchain-based decentralized storage and communication in the IoT domain places higher demands on both cryptographic processing and communication on the IoT node. This will require energy- and computation-efficient accelerators that realize the required functionality while remaining resilient to malicious attacks.
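As a small, concrete example of the extra effort side-channel resistance demands, the sketch below shows one common software-level countermeasure, a constant-time comparison of two authentication tags. It is a generic illustration, not tied to any particular primitive or to a specific accelerator design; hardware implementations invest analogous area and power to keep timing and power profiles data-independent.

```cpp
// Constant-time comparison of two authentication tags. Unlike memcmp, it
// does not return early on the first mismatch, so its timing leaks no
// information about where the tags differ.
#include <cstddef>
#include <cstdint>
#include <iostream>

bool constant_time_equal(const uint8_t* a, const uint8_t* b, size_t len) {
    uint8_t diff = 0;
    for (size_t i = 0; i < len; ++i)
        diff |= a[i] ^ b[i];     // accumulate differences without branching
    return diff == 0;
}

int main() {
    uint8_t tag1[4] = {0xde, 0xad, 0xbe, 0xef};
    uint8_t tag2[4] = {0xde, 0xad, 0xbe, 0xef};
    uint8_t tag3[4] = {0xde, 0xad, 0x00, 0xef};
    std::cout << constant_time_equal(tag1, tag2, 4) << " "    // prints 1
              << constant_time_equal(tag1, tag3, 4) << "\n";  // prints 0
}
```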
4. Computational Biology (CB): With advances in computing technology, bioinformatics, the discipline that connects biology and computer science, has expanded and attracted growing interest. The vast amount of biological data available in digital format has inspired the development of new algorithms for storage, analysis, and visualization. Genome assembly has been a central task in computational biology, and its most important remaining challenge is automated full-genome reconstruction. Current assembly methods usually take a graph-based approach: they build a graph by joining overlapping reads and then use heuristics to find a path that visits each read once (see the sketch below). However, this is often infeasible because of tangles in the graph caused by incorrect read overlaps and repetitive regions. This is particularly critical both for long genomes with many chromosomes and for metagenomic samples containing anything from ten to several hundred genomes. Sequencing technology has matured to the third generation, which can produce much longer contiguous assemblies than the previous two generations. The drawback of the new sequencers is a higher error rate, which must be addressed by new genome assembly methods that pipeline the sequencing and error-correction steps into a tool better suited to such long and error-prone reads. Despite the error rate, improved algorithms for mapping and alignment have enabled de novo assembly (sequencing a previously unknown genome without a reference sequence available for alignment).
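To illustrate the overlap-graph idea in its simplest form, the sketch below computes suffix-prefix overlaps between a handful of toy reads and chains them greedily into a single sequence. Real assemblers must additionally cope with sequencing errors, repeats, and reverse complements, none of which this illustration handles; the reads and the overlap threshold are arbitrary example values.

```cpp
// Toy illustration of the overlap-graph idea: reads become nodes, an edge
// connects two reads whose suffix/prefix overlap is at least `min_overlap`,
// and a greedy walk tries to visit each read once.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Length of the longest suffix of `a` that is also a prefix of `b`.
int overlap(const std::string& a, const std::string& b) {
    int max_len = static_cast<int>(std::min(a.size(), b.size()));
    for (int len = max_len; len > 0; --len)
        if (a.compare(a.size() - len, len, b, 0, len) == 0) return len;
    return 0;
}

int main() {
    std::vector<std::string> reads = {"ATGCGT", "GCGTAC", "GTACGA", "ACGATT"};
    const int min_overlap = 3;
    const int n = static_cast<int>(reads.size());
    std::vector<bool> used(n, false);

    // Greedy walk: start from read 0 and always follow the heaviest edge.
    int cur = 0;
    used[cur] = true;
    std::string assembly = reads[cur];
    for (int step = 1; step < n; ++step) {
        int best = -1, best_ov = 0;
        for (int j = 0; j < n; ++j) {
            if (used[j]) continue;
            int ov = overlap(reads[cur], reads[j]);
            if (ov >= min_overlap && ov > best_ov) { best_ov = ov; best = j; }
        }
        if (best < 0) break;                      // dead end (a "tangle" in a real graph)
        assembly += reads[best].substr(best_ov);  // append only the non-overlapping tail
        used[best] = true;
        cur = best;
    }
    std::cout << assembly << "\n";                // prints ATGCGTACGATT
}
```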
These compute-intensive domains are characterized by a limited number of highly specific computational patterns that make up a major fraction of the total workload, and they therefore offer an opportunity to exploit heterogeneity and hardware customization even more aggressively than current-generation computing systems do. We envision the emergence of domain-specific programming models and hardware that will let the programmer express computation and communication in terms of the domain (neural processing, streaming, cryptographic primitives, multidimensional processing primitives for 2D, 3D, or 4D video data coding and transcoding) and map them to specialized execution units (tensor units, video engines, crypto-processors).