ProbLP: A framework for low-precision probabilistic inference

Bayesian reasoning is a powerful mechanism for probabilistic inference in smart edge-devices. During such inferences, a low-precision arithmetic representation can enable improved energy efficiency. However, its impact on inference accuracy is not yet understood. Furthermore, general-purpose hardware does not natively support low-precision representation. To address this, we propose ProbLP, a framework that automates the analysis and design of low-precision probabilistic inference hardware. It automatically chooses an appropriate energy-efficient representation based on worst-case error-bounds and hardware energy-models. It generates custom hardware for the resulting inference network exploiting parallelism, pipelining and low-precision operation. The framework is validated on several embedded-sensing benchmarks.


INTRODUCTION
The use of probabilistic inference is popular for robust classification, diagnosis and decision-making problems, because of its ability to assign a confidence level to every result in terms of probability. Probabilistic graphical models (PGMs) [16], an established tool for probabilistic inference, are widely used for such problems. PGMs have several interesting properties that make them suitable for embedded applications. Specifically, PGMs: 1) are capable of dealing with missing data; 2) allow incorporating information from different domains, as well as expert knowledge; 3) can be trained with less data; and 4) can explicitly model uncertainty and causal relationships in the system. In addition, the performance of PGMs is competitive with other state-of-the-art machine learning implementations on embedded sensing applications [9,19,10,13].
Inference in PGMs is prominently performed using a versatile representation known as an Arithmetic Circuit (AC) (or sum-product network) [6]. An AC refers to a model of computation and is often represented as a graph of additions and multiplications. ACs allow for an integration of both statistical and symbolic methods in artificial intelligence, a promising combination that is pursued in state-of-the-art machine learning methods [17,14,13]. They are also central to performing inference in the field of probabilistic (logic) programming [8,14]. Furthermore, recent approaches learn ACs directly from data, with state-of-the-art performance in certain applications [13]. In this work, we focus on ACs representing Bayesian networks (BN), a type of PGM.
Inference in ACs is generally restricted to obtaining exact solutions on general-purpose computing devices. A significant improvement in energy efficiency would be possible by tolerating some error in the probability computed by ACs. Take for instance a smartphone-based activity identification application for the elderly, wherein a probability is evaluated for different activities (e.g., a user walking up the stairs). The application chooses to identify an activity only if its probability is higher than a certain threshold, say 0.60. Here, allowing an output error of 0.01 would only affect the decisions within the probability range of 0.59 to 0.61, while enabling improved energy efficiency.
A promising hardware optimization that can exploit the available error tolerance is to realize the additions and multiplications in a reduced-precision representation. Yet, the state of the art lacks an analysis of the impact of such precision reduction on the output probability error. Previous works [5,4,18] have studied the impact of low precision in the leaf nodes of an AC, but do not account for noisy or low-precision computations in its internal nodes. However, the error in the imprecise internal nodes can accumulate and become the dominant source of imprecision in the inference output.
In this paper, we propose ProbLP, a holistic framework to automate the design of low-precision energy-efficient hardware for probabilistic inference in Arithmetic Circuits. Our contributions are as follows:
• We derive bounds on the error in probabilistic queries due to low-precision representation, taking into account the error introduced in all the nodes in an AC.
• We develop energy models to help choose the most energy-efficient representation.
• We develop a tool to automatically generate low-precision inference hardware, and validate its performance on several embedded sensing benchmarks.
This paper is organized as follows. Section 2 gives an introduction to Arithmetic circuits compiled from Bayesian networks and an overview of related work. In Section 3, we derive analytical error bounds for ACs, introduce the ProbLP framework, and elaborate on how it selects the optimal precision and chooses between fixed- and floating-point representation. Section 4 demonstrates the validity of the framework on a suite of embedded sensing benchmarks and Section 5 concludes this work.

BACKGROUND AND PREVIOUS WORK
In this paper, we denote a random variable with an uppercase letter $X$ and its instantiation with a lowercase letter $x$. A set of multiple random variables is denoted with a bold uppercase letter $\mathbf{X}$ and its joint assignment with a bold lowercase letter $\mathbf{x}$.
Bayesian networks (BN) are directed acyclic graphs that compactly encode a joint probability distribution over a set of random variables $\{X_1, \dots, X_n\}$ [16]:
$$\Pr(X_1, \dots, X_n) = \prod_{i=1}^{n} \Pr(X_i \mid \Pi_{X_i}) \qquad (1)$$
where $\Pi_{X_i}$ denotes the parents of $X_i$ and $\Pr(X_i \mid \Pi_{X_i})$ are the conditional dependencies between variables and their parents, which can be represented as Conditional Probability Tables (CPTs). In the graphical component of BNs, the variables are represented as nodes and their probabilistic or causal relationships are indicated by the direction of the edges among them, as depicted in Figure 1a. The joint probability distribution in (1) allows answering a number of probabilistic queries such as the marginal probability, the conditional probability or the Most Probable Explanation (MPE) [16].
Probabilistic inference on a BN can be made efficient by compiling it to an Arithmetic circuit, which consists only of multiplications and additions. Figure 1b shows an example of an AC generated by compiling the BN in Figure 1a. The first type of inputs to this AC are the BN's parameters, represented by $\theta_{x|\mathbf{u}}$, where $x$ and $\mathbf{u}$ are the instantiations of random variable $X$ and its parents. The second type of inputs to the AC are indicators $\lambda_x$, which are binary variables that indicate the evidence of the observed nodes. The probability of an evidence $\mathbf{e}$ can be computed with an upward pass on the AC by setting the indicators that contradict the evidence to 0, and all the others to 1.
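The upward pass described above can be illustrated with a minimal Python sketch. The node encoding, the toy two-variable network A -> B, its CPT values, and the names `evaluate_ac` and `probability_of_evidence` are all assumptions made for illustration; they are not taken from Figure 1 or from the ProbLP tool.

```python
from math import prod

# Hypothetical AC encoding (not from the paper): nodes are listed in
# topological order, children before parents.
#   ("leaf", name)  reads inputs[name] (a parameter or an indicator)
#   ("add", ids)    sums the listed child nodes
#   ("mul", ids)    multiplies the listed child nodes
def evaluate_ac(nodes, inputs):
    """Upward pass: return the value of the last (root) node."""
    values = []
    for kind, arg in nodes:
        if kind == "leaf":
            values.append(inputs[arg])
        elif kind == "add":
            values.append(sum(values[i] for i in arg))
        elif kind == "mul":
            values.append(prod(values[i] for i in arg))
        else:
            raise ValueError(f"unknown node kind: {kind}")
    return values[-1]

# Toy network A -> B with made-up CPT values (assumption).
THETA = {"t_a1": 0.3, "t_a2": 0.7,
         "t_b1|a1": 0.9, "t_b2|a1": 0.1,
         "t_b1|a2": 0.2, "t_b2|a2": 0.8}

AC = [
    ("leaf", "l_a1"), ("leaf", "l_a2"), ("leaf", "l_b1"), ("leaf", "l_b2"),  # 0-3
    ("leaf", "t_a1"), ("leaf", "t_a2"),                                      # 4-5
    ("leaf", "t_b1|a1"), ("leaf", "t_b2|a1"),                                # 6-7
    ("leaf", "t_b1|a2"), ("leaf", "t_b2|a2"),                                # 8-9
    ("mul", [2, 6]), ("mul", [3, 7]), ("add", [10, 11]),                     # 10-12
    ("mul", [2, 8]), ("mul", [3, 9]), ("add", [13, 14]),                     # 13-15
    ("mul", [0, 4, 12]), ("mul", [1, 5, 15]),                                # 16-17
    ("add", [16, 17]),                                                       # 18 (root)
]

def probability_of_evidence(evidence):
    """evidence maps a variable name to its observed state, e.g. {"b": 1}."""
    indicators = {}
    for var in ("a", "b"):
        for state in (1, 2):
            # An indicator is 0 only if it contradicts the evidence.
            contradicts = var in evidence and evidence[var] != state
            indicators[f"l_{var}{state}"] = 0.0 if contradicts else 1.0
    return evaluate_ac(AC, {**THETA, **indicators})

pr_b1 = probability_of_evidence({"b": 1})   # 0.3*0.9 + 0.7*0.2
pr_empty = probability_of_evidence({})      # no evidence: all indicators are 1
```

With no evidence, every indicator is 1 and the root evaluates to the sum of the full joint distribution, which is the same setting the later max-value analysis relies on.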
Previous works have studied the impact of finite-precision CPT parameters on marginal and conditional probability [5,4,18]. This helps to reduce memory footprint due to a smaller inference model. However, they did not study the effect of low-precision arithmetic operations. The authors of [20,12] studied the effect of fixed-point arithmetic on marginal probability for a few BNs, but not of conditional probability, and did not provide error bounds. Moreover, the impact under floating-point arithmetic operation is also unclear.
This work provides analytical bounds on the absolute and the relative error in marginal and conditional probabilities for fixed- and floating-pt operations in the entire AC. A holistic framework, ProbLP, is introduced, which also takes energy consumption into account to choose the optimal representation among fixed-point and floating-point. Subsequently, it automatically generates custom hardware for the AC evaluation.

METHODOLOGY
The different components of the ProbLP framework are shown in figure 2. ProbLP takes in an AC together with some user requirements, based on which it calculates the least number of fixed-pt and floating-pt bits needed to meet these requirements. To do so, ProbLP evaluates error bounds for the AC, based on the error models of the two representations. To choose between these two representations, it subsequently estimates the energy of the complete AC based on energy models. Finally, it generates fully-parallel pipelined hardware in the selected low-precision representation. The three inputs of ProbLP are as follows:
Arithmetic circuit: The Arithmetic circuit to be implemented using low-precision hardware. In this paper, we use ACs compiled from Bayesian networks, but they can as well be compiled from probabilistic (logic) programs or can be trained directly from data.
Type of query: The type of probabilistic query to be performed using the AC, to be chosen from marginal probability, conditional probability or the probability of most probable explanation (MPE).
Error tolerance: The amount of error on the output that can be tolerated in the probabilistic queries by the application, for all possible combinations of inputs, in terms of absolute or relative error. An absolute error is given as $|\widetilde{\Pr} - \Pr|$ and a relative error is given as $|\widetilde{\Pr} - \Pr| / \Pr$, where $\Pr$ is the output probability of interest and $\widetilde{\Pr}$ is its low-precision counterpart.

Error analysis
The aim of error analysis is to estimate the minimum number of bits required to achieve the user-specified error tolerance. For this, it has to take into account the impact of reducing the number of bits on the error in the AC output probability. There are two sources of error in an AC: an error in the leaf nodes when CPT values are quantized to finite precision, and an error injected in the intermediate nodes of the AC due to the finite-precision arithmetic operations. Unlike previous research works, we formally treat the error in the intermediate nodes as well to derive the error bounds. We consider two representations: fixed-point and floating-point. Operators of both types round bits during computation. For example, a multiplication of 2 inputs of $N$ bits produces an exact result with $2N$ bits, which is subsequently rounded to fit an $N$-bit output. The error introduced can be modeled as an additive noise source.
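The rounding step in a multiplier can be made concrete with a small Python sketch. It assumes round-to-nearest fixed-point values with `F` fraction bits and uses exact rationals so that the rounding noise is isolated; the helper name `quantize` and the operand values are illustrative only.

```python
from fractions import Fraction

def quantize(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (round-to-nearest)."""
    scale = 1 << frac_bits
    return Fraction(round(Fraction(x) * scale), scale)

F = 8                                   # fraction bits (assumed value)
a = quantize(Fraction(1, 3), F)         # operands already on the F-bit grid
b = quantize(Fraction(5, 7), F)

exact = a * b                           # exact product needs up to 2F fraction bits
rounded = quantize(exact, F)            # result rounded back to F fraction bits
noise = rounded - exact                 # the additive noise injected by the rounding

print(exact, rounded, float(noise))
```

Under round-to-nearest the injected noise never exceeds half of the least significant bit, that is, $2^{-(F+1)}$ in magnitude.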
The error models used for the leaf nodes and the intermediate nodes are described next. Some of the models are inspired by [11], but the authors of that work did not perform the error analysis for ACs, and some of the models would render unbounded errors if [...]

3.1.1 Fixed-pt error estimation. Let $I$ and $F$ be the number of integer and fraction bits. All the numbers are assumed to be in the range of the fixed-pt format, implying an absence of overflow during computation, which can be ensured by using an appropriate number of integer bits $I$, discussed in detail in section 3.1.4.
Fixed-pt leaf node: Let $a$ be the real value of a leaf node in an AC, and $\tilde{a}$ be its fixed-pt representation. The error in fixed-pt conversion can be bounded as,
$$\Delta a = |\tilde{a} - a| \le 2^{-(F+1)} \qquad (2)$$
Fixed-pt adder node: If $\tilde{a}$ and $\tilde{b}$ are the fixed-pt representations of adder inputs $a$ and $b$, the error in the output $\tilde{c} = \tilde{a} + \tilde{b}$ is given as
$$\Delta c \le \Delta a + \Delta b \qquad (3)$$
Note that the fixed-pt adder does not add any error of its own, as it does not round bits, and hence simply accumulates the error of the inputs. Note again that the adder output cannot overflow, as all the numbers are ensured to be in range.
Fixed-pt multiplier node: With $\tilde{a}$ and $\tilde{b}$ as the fixed-pt representations of multiplier inputs $a$ and $b$, the error in the fixed-pt multiplier output $\tilde{c}$ can be bounded as,
$$\tilde{c} = \tilde{a}\,\tilde{b} \pm 2^{-(F+1)} \qquad (4)$$
$$\Delta c \le a_{max}\,\Delta b + b_{max}\,\Delta a + \Delta a\,\Delta b + 2^{-(F+1)} \qquad (5)$$
In (4), the error term $2^{-(F+1)}$ models the error introduced when the LSB bits of the intermediate multiplication result are rounded to fit back into $F$ fractional bits. Equation (5) produces an unbounded error unless $a_{max}$ and $b_{max}$ can be bounded.
The values $a_{max}$ and $b_{max}$ can be efficiently bounded by taking into account the AC-specific properties. An AC consists of adders and multipliers and only operates on non-negative numbers. As a result, each internal node in the AC is a monotonically non-decreasing function of its inputs. Hence, all the nodes are at their maximum value when all the inputs are at their maximum. Since CPT parameters stay constant across AC evaluations, this is achieved when all the indicator variables are set to 1. This makes it possible to assess $a_{max}$ and $b_{max}$ of every operator in the AC with just a single AC evaluation, which in turn allows ProbLP to bound the error of fixed-pt multipliers.
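The fixed-pt analysis of this subsection can be sketched in Python as a single pass that tracks, for every node, its maximum value (all indicators at 1) and a worst-case absolute error. The sketch assumes round-to-nearest with `F` fraction bits, 2-input operators, exactly representable indicators, and enough integer bits to avoid overflow; the node encoding and the toy AC for a network A -> B are hypothetical, not the ProbLP implementation.

```python
from fractions import Fraction as P
from itertools import product

# Hypothetical 2-input AC encoding, children listed before parents:
#   ("ind", name), ("par", value), ("add", i, j), ("mul", i, j)
AC = [
    ("ind", "a1"), ("ind", "a2"), ("ind", "b1"), ("ind", "b2"),   # 0-3
    ("par", P(3, 10)), ("par", P(7, 10)),                         # 4-5
    ("par", P(9, 10)), ("par", P(1, 10)),                         # 6-7
    ("par", P(2, 10)), ("par", P(8, 10)),                         # 8-9
    ("mul", 2, 6), ("mul", 3, 7), ("add", 10, 11),                # 10-12
    ("mul", 2, 8), ("mul", 3, 9), ("add", 13, 14),                # 13-15
    ("mul", 0, 4), ("mul", 16, 12),                               # 16-17
    ("mul", 1, 5), ("mul", 18, 15),                               # 18-19
    ("add", 17, 19),                                              # 20 (root)
]
NAMES = ("a1", "a2", "b1", "b2")

def quantize(x, F):
    """Round-to-nearest onto the grid of F fraction bits."""
    return P(round(x * (1 << F)), 1 << F)

def output_error_bound(nodes, F):
    """Worst-case absolute error at the root, propagated leaf to root."""
    half_lsb = P(1, 2 ** (F + 1))
    vmax, err = [], []
    for node in nodes:
        kind = node[0]
        if kind == "ind":                      # 0 and 1 are exact (assumption)
            vmax.append(P(1)); err.append(P(0))
        elif kind == "par":                    # quantized CPT parameter
            vmax.append(node[1]); err.append(half_lsb)
        elif kind == "add":                    # adder only accumulates error
            i, j = node[1], node[2]
            vmax.append(vmax[i] + vmax[j]); err.append(err[i] + err[j])
        else:                                  # multiplier adds rounding noise
            i, j = node[1], node[2]
            vmax.append(vmax[i] * vmax[j])
            err.append(vmax[i] * err[j] + vmax[j] * err[i]
                       + err[i] * err[j] + half_lsb)
    return err[-1]

def evaluate(nodes, indicators, F=None):
    """Exact evaluation if F is None, otherwise fixed-pt with F fraction bits."""
    vals = []
    for node in nodes:
        kind = node[0]
        if kind == "ind":
            v = P(indicators[node[1]])
        elif kind == "par":
            v = node[1] if F is None else quantize(node[1], F)
        elif kind == "add":
            v = vals[node[1]] + vals[node[2]]
        else:
            v = vals[node[1]] * vals[node[2]]
            if F is not None:
                v = quantize(v, F)
        vals.append(v)
    return vals[-1]

F = 8
bound = output_error_bound(AC, F)
worst = max(
    abs(evaluate(AC, dict(zip(NAMES, bits)), F) - evaluate(AC, dict(zip(NAMES, bits))))
    for bits in product((0, 1), repeat=len(NAMES))
)
print(float(worst), "<=", float(bound))
```

The exhaustive loop over indicator settings is only feasible for toy circuits; the point of the analytical bound is that it needs just the one pass in `output_error_bound`.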
3.1.2 Floating-pt error estimation. Let $E$ and $M$ be the number of exponent and mantissa bits. We only consider normalized floating-pt here. All the numbers are assumed to be within the range of the given format, ensured by a method explained in detail in section 3.1.4.
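Float-pt leaf node: Let $a$ be the real value of a leaf in an AC and $\tilde{a}$ be its floating-pt representation. The absolute error introduced due to the floating-pt conversion can be bounded as described in [11], which can be expressed alternatively as,
$$\tilde{a} = a\,(1 \pm \epsilon) \qquad (6)$$
where:
$$\epsilon = 2^{-(M+1)} \qquad (7)$$
Float-pt adder node: Let $\tilde{a}$ and $\tilde{b}$ be the float-pt versions of adder inputs $a$ and $b$, $c$ be the ideal output, and $\tilde{c}$ be the output of a floating-pt adder. $\tilde{a}$ and $\tilde{b}$ can be represented as,
$$\tilde{a} = a\,(1 \pm \epsilon)^{n_a}, \qquad \tilde{b} = b\,(1 \pm \epsilon)^{n_b} \qquad (8)$$
Here, $x(1 \pm \epsilon)^{n}$ denotes a value between $x(1-\epsilon)^{n}$ and $x(1+\epsilon)^{n}$, and $n_a$ and $n_b$ depend on the amount of error accumulated in $\tilde{a}$ and $\tilde{b}$, respectively. The bound on $\tilde{c}$ can be given as follows,
$$\tilde{c} = (\tilde{a} + \tilde{b})\,(1 \pm \epsilon) \qquad (9)$$
$$\tilde{c} = (a + b)\,(1 \pm \epsilon)^{\max(n_a, n_b) + 1} \qquad (10)$$
The error term $(1 \pm \epsilon)$ in (9) is due to the rounding of LSB bits of the mantissa of the smaller input before addition.
Float-pt multiplier node: Just as in the case of the adder, the inputs $\tilde{a}$ and $\tilde{b}$ can be bounded as in (8). With that, the output of a floating-pt multiplier can be given as,
$$\tilde{c} = \tilde{a}\,\tilde{b}\,(1 \pm \epsilon) \qquad (11)$$
$$\tilde{c} = a\,b\,(1 \pm \epsilon)^{n_a + n_b + 1} \qquad (12)$$
The error term $(1 \pm \epsilon)$ in (11) is due to the rounding of the LSB bits of the mantissa to fit the result in $M$ mantissa bits.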
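A floating-pt counterpart can be sketched in Python under the usual relative-error model: every rounding multiplies the value by a factor between $1-\epsilon$ and $1+\epsilon$ with $\epsilon = 2^{-(M+1)}$, so each node only needs an integer count of accumulated factors. The sketch assumes an adder contributes one factor on top of the larger count of its inputs, a multiplier one factor on top of the sum of the counts, exact indicators, and an unbounded exponent range; the encoding and the toy AC are hypothetical.

```python
from fractions import Fraction as P
from itertools import product

# Hypothetical 2-input AC encoding, children listed before parents.
AC = [
    ("ind", "a1"), ("ind", "a2"), ("ind", "b1"), ("ind", "b2"),   # 0-3
    ("par", P(3, 10)), ("par", P(7, 10)),                         # 4-5
    ("par", P(9, 10)), ("par", P(1, 10)),                         # 6-7
    ("par", P(2, 10)), ("par", P(8, 10)),                         # 8-9
    ("mul", 2, 6), ("mul", 3, 7), ("add", 10, 11),                # 10-12
    ("mul", 2, 8), ("mul", 3, 9), ("add", 13, 14),                # 13-15
    ("mul", 0, 4), ("mul", 16, 12),                               # 16-17
    ("mul", 1, 5), ("mul", 18, 15),                               # 18-19
    ("add", 17, 19),                                              # 20 (root)
]
NAMES = ("a1", "a2", "b1", "b2")

def round_float(x, M):
    """Round a non-negative rational to M mantissa bits (exponent unbounded)."""
    if x == 0:
        return x
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if P(2) ** e > x:
        e -= 1                      # now 2**e <= x < 2**(e+1)
    ulp = P(2) ** (e - M)           # spacing of representable values near x
    return round(x / ulp) * ulp

def error_exponent(nodes):
    """Number of (1 +/- eps) factors accumulated at the root."""
    n = []
    for node in nodes:
        kind = node[0]
        if kind == "ind":
            n.append(0)             # 0 and 1 are exact (assumption)
        elif kind == "par":
            n.append(1)             # one rounding at conversion
        elif kind == "add":
            n.append(max(n[node[1]], n[node[2]]) + 1)
        else:
            n.append(n[node[1]] + n[node[2]] + 1)
    return n[-1]

def evaluate(nodes, indicators, M=None):
    """Exact evaluation if M is None, otherwise every result is rounded."""
    vals = []
    for node in nodes:
        kind = node[0]
        if kind == "ind":
            v = P(indicators[node[1]])
        elif kind == "par":
            v = node[1]
        elif kind == "add":
            v = vals[node[1]] + vals[node[2]]
        else:
            v = vals[node[1]] * vals[node[2]]
        if M is not None and kind != "ind":
            v = round_float(v, M)
        vals.append(v)
    return vals[-1]

M = 8
eps = P(1, 2 ** (M + 1))
c = error_exponent(AC)
within_bound = []
for bits in product((0, 1), repeat=len(NAMES)):
    ind = dict(zip(NAMES, bits))
    exact, approx = evaluate(AC, ind), evaluate(AC, ind, M)
    within_bound.append(exact * (1 - eps) ** c <= approx <= exact * (1 + eps) ** c)
print(c, all(within_bound))
```

Because the bound is multiplicative, it translates directly into a relative error at the output, independent of how small the exact probability is.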

3.1.3 Error-bound at the AC output. Equations (2), (3), (5), (6), (10) and (12) correspond to the error models shown in figure 2. As these models generate the output in the same format as the inputs, they provide a way to recursively propagate the error from the leaves of an AC all the way up to its output node, by accumulating the error introduced in every adder and multiplier. Figure 3 shows an example of error propagation using the fixed-pt error models. This is performed as a part of the fixed-pt error analysis and float-pt error analysis blocks of ProbLP shown in figure 2.
The error propagation in fixed-pt arithmetic produces a bound of the form $\Delta \le c_{fx}$, where $\Delta$ is the absolute error in the output node, and $c_{fx}$ is a constant that depends on the size and structure of the AC, its parameters, and the number of fixed-pt bits. The constant $c_{fx}$ can be estimated recursively with our error models for any given AC. Similarly, the error propagation in floating-pt arithmetic produces a bound of the form $f(1-\epsilon)^{c} \le \tilde{f} \le f(1+\epsilon)^{c}$, where $\tilde{f}$ is the output of an AC with floating-pt operators, $f$ is the ideal output, $\epsilon$ is a constant related to the number of floating-pt bits, and $c$ is a constant related to the size and structure of the AC. Again, the constant $c$ can be estimated recursively using the models we proposed, for any AC.
Alternatively, the floating-pt bound can be expressed as $|\tilde{f} - f| / f \le c'$ for some constant $c'$, i.e., a bounded relative error at the output.

3.1.4 Number of integer or exponent bits.
For the error models proposed in sections 3.1.1 and 3.1.2 to be valid, the numbers encountered during the computation must stay within the range of the representation. This can be ensured by using an appropriate number of integer bits and exponent bits for fixed- and floating-pt, respectively. Otherwise, the error in some of the probability evaluations would exceed the predicted bounds. It is hence important to automatically derive the required range of numbers for any given AC.
Max-value analysis: The largest number to be encountered in an AC can be derived by setting all the indicator variables to 1, as explained in section 3.1.1. Analyzing the internal AC data values of this query allows deriving the required $I$, resp. $E$, to avoid overflow.
Min-value analysis: The floating-pt models are invalid in case of underflow as well. Hence, it is necessary to also estimate the smallest positive non-zero value for an AC. It can be proven that all the nodes in an AC are at their respective minimum non-zero values when all the indicator variables are set to 1 and the adders are replaced with minimum operators $\min(a, b)$. The resulting efficient AC evaluation allows ProbLP to derive a lower bound on AC values, and find the $E$ required to prevent underflow. The fixed-pt models remain valid even if the number of fraction bits is not enough to represent small values in the AC, so no special precautions are needed here regarding underflow.
In this way, ProbLP performs the max-value and min-value analysis to select $I$, resp. $E$, that satisfies both requirements.
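The two analyses can be sketched in Python as one pass over the AC with all indicators at 1: the maximum uses the real adders, the minimum replaces each adder with `min`. How a value range maps to a bit count depends on the number format; the sketch assumes unsigned fixed-pt for the integer bits and an IEEE-754-style normalized exponent range of $[2 - 2^{E-1},\, 2^{E-1} - 1]$ for the exponent bits, and it assumes strictly positive parameters. These format choices, the encoding and the toy AC are assumptions, not details given in the paper.

```python
from fractions import Fraction as P

# Hypothetical 2-input AC encoding, children listed before parents.
AC = [
    ("ind", "a1"), ("ind", "a2"), ("ind", "b1"), ("ind", "b2"),   # 0-3
    ("par", P(3, 10)), ("par", P(7, 10)),                         # 4-5
    ("par", P(9, 10)), ("par", P(1, 10)),                         # 6-7
    ("par", P(2, 10)), ("par", P(8, 10)),                         # 8-9
    ("mul", 2, 6), ("mul", 3, 7), ("add", 10, 11),                # 10-12
    ("mul", 2, 8), ("mul", 3, 9), ("add", 13, 14),                # 13-15
    ("mul", 0, 4), ("mul", 16, 12),                               # 16-17
    ("mul", 1, 5), ("mul", 18, 15),                               # 18-19
    ("add", 17, 19),                                              # 20 (root)
]

def floor_log2(x):
    """Exact floor(log2(x)) for a positive rational."""
    e = x.numerator.bit_length() - x.denominator.bit_length()
    return e - 1 if P(2) ** e > x else e

def value_range(nodes):
    """Largest and smallest non-zero value over all nodes (indicators = 1)."""
    hi, lo = [], []
    for node in nodes:
        kind = node[0]
        if kind == "ind":
            hi.append(P(1)); lo.append(P(1))
        elif kind == "par":                      # assumed strictly positive
            hi.append(node[1]); lo.append(node[1])
        elif kind == "add":
            hi.append(hi[node[1]] + hi[node[2]])
            lo.append(min(lo[node[1]], lo[node[2]]))   # adder replaced by min
        else:
            hi.append(hi[node[1]] * hi[node[2]])
            lo.append(lo[node[1]] * lo[node[2]])
    return max(hi), min(lo)

def integer_bits(largest):
    """Smallest I with largest < 2**I (unsigned fixed-pt, assumption)."""
    return max(0, floor_log2(largest) + 1)

def exponent_bits(largest, smallest):
    """Smallest E whose assumed normalized range covers both extremes."""
    e_hi, e_lo = floor_log2(largest), floor_log2(smallest)
    E = 1
    while not (2 - 2 ** (E - 1) <= e_lo and e_hi <= 2 ** (E - 1) - 1):
        E += 1
    return E

largest, smallest = value_range(AC)
I_bits = integer_bits(largest)
E_bits = exponent_bits(largest, smallest)
print(largest, smallest, I_bits, E_bits)
```

For this toy circuit the smallest non-zero value is far larger than in realistic ACs, where long chains of multiplications push the minimum down and drive up the exponent width.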

Bounds for probabilistic queries
As shown in figure 2, ProbLP aims to estimate the optimal fixed-pt and float-pt bit width for a given type of probabilistic query and error tolerance. However, the bounds derived so far apply only to a single AC evaluation. Some types of probabilistic queries require a combination of multiple AC evaluations. In this section, we derive bounds for two types of probabilistic queries: 1) marginal probability and MPE, and 2) conditional probability.

Marginal probability and MPE.
Marginal probabilities $\Pr(\mathbf{q}, \mathbf{e})$ and the most probable explanation (MPE) need only one AC evaluation. Hence, the bounds derived in section 3.1.3 apply for these queries.

Conditional probability.
Conditional probability $\Pr(\mathbf{q} \mid \mathbf{e})$ is evaluated by performing two AC evaluations, one for $\Pr(\mathbf{q}, \mathbf{e})$ and one for $\Pr(\mathbf{e})$, followed by taking the ratio of the two results.$^{2}$
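Fixed-pt bounds: In the case of fixed-pt arithmetic, the absolute error in each of the AC queries remains bounded. The impact on the conditional probability can hence be given as,
$$\Pr_{fx}(\mathbf{q} \mid \mathbf{e}) = \frac{\Pr(\mathbf{q}, \mathbf{e}) \pm \Delta_1}{\Pr(\mathbf{e}) \pm \Delta_2} \qquad (13)$$
Here, maximum error is achieved when $\Delta_2 = 0$ and $\Delta_1 = \Delta_{1max}$. In such a case, the following equations show the impact on the absolute and relative error in the conditional probability,
$$\Delta \Pr_{fx}(\mathbf{q} \mid \mathbf{e}) \le \frac{\Delta_{1max}}{\Pr(\mathbf{e})} \qquad (14)$$
$$\frac{\Delta \Pr_{fx}(\mathbf{q} \mid \mathbf{e})}{\Pr(\mathbf{q} \mid \mathbf{e})} \le \frac{\Delta_{1max}}{\Pr(\mathbf{e})\,\Pr(\mathbf{q} \mid \mathbf{e})} \qquad (15)$$
In (14) and (15), the error in the numerator is divided by $\Pr(\mathbf{e})$ and $\Pr(\mathbf{e})\,\Pr(\mathbf{q} \mid \mathbf{e})$, respectively, and these probabilities can become very small. Hence, a large number of fixed-pt bits is generally required to achieve a reasonable error bound, especially for the relative-error bound of (15). The absolute-error bound in (14) can be quantified by estimating the minimum possible value for $\Pr(\mathbf{e})$ as described in section 3.1.4, wherein adders are replaced with min operators.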
As the denominator of (15) can become very small, it is not a good idea to use fixed-pt when requiring a relative error bound in conditional probabilities. Moreover, quantifying a bound for (15) is also not straightforward. Hence, ProbLP will always choose float-pt for relative error in conditional probability.
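The sensitivity of the fixed-pt bounds to small evidence probabilities can be illustrated with a short Python sketch. It follows the worst case stated above, in which the whole error sits in the numerator, so the absolute error of the ratio is the numerator error divided by $\Pr(\mathbf{e})$ and the relative error is additionally divided by $\Pr(\mathbf{q} \mid \mathbf{e})$. The function names and the numeric values are illustrative assumptions, not results from the paper.

```python
import math

def conditional_abs_error(delta1_max, pr_e):
    """Worst-case absolute error of Pr(q|e) for a numerator error delta1_max."""
    return delta1_max / pr_e

def conditional_rel_error(delta1_max, pr_e, pr_q_given_e):
    """Worst-case relative error of Pr(q|e) under the same assumption."""
    return delta1_max / (pr_e * pr_q_given_e)

delta1_max = 2.0 ** -20          # assumed absolute error bound of one AC evaluation
for pr_e in (0.5, 1e-3, 1e-6):   # made-up evidence probabilities
    print(pr_e,
          conditional_abs_error(delta1_max, pr_e),
          conditional_rel_error(delta1_max, pr_e, 0.05))
```

With an AC-level error of about $10^{-6}$, an evidence probability of $10^{-6}$ already makes the bound on the conditional probability close to 1, which is why the minimum of $\Pr(\mathbf{e})$ has to be estimated before a fixed-pt width can be chosen.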
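Float-pt bounds: The impact of using float-pt arithmetic on conditional probability can be given as follows.
$$\Pr_{fl}(\mathbf{q} \mid \mathbf{e}) = \frac{\Pr(\mathbf{q}, \mathbf{e})\,(1 \pm \epsilon)^{c_1}}{\Pr(\mathbf{e})\,(1 \pm \epsilon)^{c_2}} \qquad (16)$$
In (16), $c_1$ and $c_2$ are upper bounded by a constant, say $c$, but not lower bounded. In the worst case, one of them can become 0, while the other is $c$. Even in this worst case, the floating-pt version of the conditional probability still remains bounded as follows.
$$\Pr_{fl}(\mathbf{q} \mid \mathbf{e}) = \Pr(\mathbf{q} \mid \mathbf{e})\,(1 \pm \epsilon)^{\pm c} \qquad (17)$$
This ensures a bound on the relative error $\Delta \Pr_{fl}(\mathbf{q} \mid \mathbf{e}) / \Pr(\mathbf{q} \mid \mathbf{e})$.
$^{2}$ $\Pr(\mathbf{q} \mid \mathbf{e})$ can also be estimated by an upward and a downward pass in an AC followed by a division. We do not consider it explicitly, but similar error bounds are expected.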

Selecting optimal representation
Sections 3.1 and 3.2 establish a method to evaluate error bounds for a given AC in terms of the number of bits. Next, ProbLP finds the least number of fixed-pt and float-pt bits needed for the given requirements. For this, it evaluates the bounds starting with 2 fraction bits and 2 mantissa bits, and increments them until the error requirement is satisfied. Then, it estimates the least number of integer and exponent bits required by the min and max analysis explained in section 3.1.4. In this way, ProbLP comes up with the optimal fixed-pt and float-pt representation shown in figure 2.
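The incremental search described above can be sketched in a few lines of Python. The function `least_bits` and the two stand-in bound functions are hypothetical: a real run would plug in the AC-specific bounds from the error analysis, whereas here a made-up constant of 50 takes the place of the circuit-dependent constant.

```python
def least_bits(error_bound, tolerance, start=2, limit=64):
    """Return the smallest bit count in [start, limit] whose bound meets the
    tolerance, or None if no width up to the limit is sufficient."""
    for bits in range(start, limit + 1):
        if error_bound(bits) <= tolerance:
            return bits
    return None

# Stand-in bounds with a made-up circuit constant (assumption).
C = 50

def fixed_bound(frac_bits):
    """Absolute-error bound that shrinks by half per extra fraction bit."""
    return C * 2.0 ** -(frac_bits + 1)

def float_bound(mant_bits):
    """Relative-error bound (1 + eps)**C - 1 with eps = 2**-(mant_bits + 1)."""
    return (1.0 + 2.0 ** -(mant_bits + 1)) ** C - 1.0

F_min = least_bits(fixed_bound, tolerance=0.01)   # least fraction bits
M_min = least_bits(float_bound, tolerance=0.01)   # least mantissa bits
print(F_min, M_min)
```

A linear scan is sufficient here because the bounds decrease monotonically with the bit count and the search range is small.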
Subsequently, the framework has to select between fixed-pt and float-pt. ProbLP selects the one with the lowest energy consumption, estimated using operator-level energy models. Energy models for the adders and multipliers are developed by synthesizing them with varying fraction/mantissa bits and integer/exponent bits in TSMC 65nm technology and extracting post-synthesis energy consumption. The models were fitted to the simulation results using the least-squares method, and are summarized in Table 1.

Automatic hardware generation
ProbLP suggests the most appropriate low-precision representation for the AC, but this may not translate to energy savings unless the hardware has custom arithmetic operators. To address this, ProbLP has an integrated hardware generator that generates custom parallel hardware that is fully pipelined and consists of arithmetic operators of the exact precision that is required to meet the user requirements. There are two major stages in the hardware generation process. In the first stage, all AC operators with more than two inputs are decomposed into a tree of 2-input operators. An example of such decomposition is shown in figure 4, wherein the F operator is decomposed into a tree of F1, F2 and F3. In the second stage, the generator inserts pipeline registers after every operator. In some cases, it may have to insert multiple registers due to a mismatch in path timings, as shown in the path between A and G in figure 4. The final output of ProbLP is Verilog code of the custom hardware.
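The two stages can be sketched in Python on a small graph. This is a hedged illustration, not the ProbLP generator: the node encoding, the balanced-tree decomposition order, and the timing convention (leaf inputs valid at cycle 0, one output register per operator, extra delay registers added so both operands of an operator arrive in the same cycle) are assumptions.

```python
def decompose(nodes):
    """Stage 1: rewrite every operator with more than two inputs as a tree
    of 2-input operators. Nodes are ("leaf", name) or (op, [child ids]),
    listed children before parents."""
    out, remap = [], {}
    for idx, (kind, args) in enumerate(nodes):
        if kind == "leaf":
            out.append((kind, args))
        else:
            level = [remap[a] for a in args]
            while len(level) > 2:
                nxt = []
                for k in range(0, len(level) - 1, 2):   # pair up neighbours
                    out.append((kind, [level[k], level[k + 1]]))
                    nxt.append(len(out) - 1)
                if len(level) % 2:                      # odd one carries over
                    nxt.append(level[-1])
                level = nxt
            out.append((kind, level))
        remap[idx] = len(out) - 1
    return out

def balance(nodes):
    """Stage 2: pipeline depth per node and extra delay registers per edge."""
    depth, extra = [], {}
    for idx, (kind, args) in enumerate(nodes):
        if kind == "leaf":
            depth.append(0)
        else:
            d = 1 + max(depth[a] for a in args)
            depth.append(d)
            for a in args:
                # Operands must be valid one cycle before the operator output.
                extra[(a, idx)] = d - 1 - depth[a]
    return depth, extra

# Example: a 4-input multiplier feeding an adder that also reads leaf "e".
GRAPH = [("leaf", "a"), ("leaf", "b"), ("leaf", "c"), ("leaf", "d"),
         ("leaf", "e"), ("mul", [0, 1, 2, 3]), ("add", [5, 4])]

binary = decompose(GRAPH)
depth, extra = balance(binary)
print(binary[5:], depth, extra[(4, 8)])
```

In this example the 4-input multiplier becomes three 2-input multipliers, and the short path from leaf "e" to the final adder needs two delay registers to stay aligned with the three-stage path through the multiplier tree.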

EXPERIMENTAL RESULTS
We validate the functionality of ProbLP for Arithmetic circuits targeting embedded sensing applications, by performing two types of experiments on four datasets. Three of these datasets (HAR [1], UNIMIB [15], UIWADS [3] in table 2) correspond to activity and user identification applications in smartphones and therefore rely on the accurate estimation of a conditional probability of the form $\Pr(\mathbf{q} \mid \mathbf{e})$ to make threshold-based decisions. The fourth dataset (Alarm [2] in table 2) is of a patient monitoring application and is often used as a standard Bayesian network benchmark. The ACs used in this section are compiled using the ACE tool [7], with the -cd06 and -forceC2d options enabled. For the experiments on HAR, UNIMIB, and UIWADS, we trained a Naive Bayes classifier on 60% of the data and used the rest for testing. The testing dataset for Alarm is generated by sampling 1000 instances from the trained network. In all the experiments, the leaf nodes of the BN were used as evidence nodes e and one of the root nodes in the BN (the class node in the case of the classifiers) as a query node q.

Validation of bounds
This experiment confirms the validity of the derived error bounds for the AC compiled from the Alarm network. The experimental setting is as follows:
Fixed-pt: The number of integer bits is set to 1 based on the max-value analysis, and the number of fraction bits is varied from 8 to 40.
Float-pt: The number of exponent bits is set to 8 based on the max- and min-value analysis, and the number of mantissa bits is varied from 8 to 40.
Figure 5 shows the max and mean error on the test set, which confirm the validity of the bounds.

Overall performance
In this experiment, the complete ProbLP framework is deployed to choose an appropriate arithmetic representation and generate hardware for different ACs and for given user requirements. The results of the experiment are summarized in Table 2. Experiments are performed for all combinations of queries and types of error tolerances for the HAR AC, and two combinations for the rest of the ACs. The table shows the optimal fixed-pt and float-pt representations that meet the target error tolerance. Among these, ProbLP selects the one with the lower predicted energy, highlighted in bold. The resulting maximum error observed on the test sets remains within the required error tolerance. The post-synthesis energy consumption matches well with the energy predicted by the framework. The energy consumption of the hardware with a 32b float (E=8, M=23, 1 sign bit) is also shown for comparison. Note here that the choice of 0.01 error tolerance is arbitrary and higher energy efficiency can be achieved for relaxed error tolerances.

CONCLUSION
Probabilistic inference with Arithmetic circuits can be made energy-efficient by tolerating a small amount of error in output probabilities and by designing custom hardware to exploit this error tolerance. This paper, therefore, proposes ProbLP, a holistic framework to automate the design of low-precision custom hardware for ACs. The framework estimates worst-case error bounds for ACs, taking into account the error incurred in reduced-precision fixed- and floating-point operators. It estimates the impact of these errors on different types of probabilistic queries and finds the least number of fixed-pt and float-pt bits required to meet the error tolerance. Subsequently, it chooses between the fixed-pt and float-pt representations based on the energy models developed for this purpose. Next, ProbLP automatically converts an AC to pipelined logic with custom arithmetic operators. The analytically derived error bounds are validated for varying fixed- and float-pt bits. Finally, the ProbLP framework is used for several embedded sensing benchmarks, confirming that the error requirements are met and that the energy consumption of the automatically generated hardware matches the prediction.