From: Hans-Peter Nilsson
Subject: [Rapp-dev] Vector Abstraction Layer / SIMD API updates for integral images
Date: Mon, 9 Jan 2012 08:20:10 +0100

The RAPP integral operations, rc_integral_sum_bin_u8,
rc_integral_sum_bin_u16, rc_integral_sum_bin_u32,
rc_integral_sum_u8_u16 and rc_integral_sum_u8_u32 only have
generic implementations at the moment; no SIMD implementations.

The kernel operation is "val_ = left_ + cur_ + up_ - upleft_"
where left_ and upleft_ are the shifted values (the element to
the "left") of the current and "above" lines respectively, after
previous operations have been carried out at increasing indexes
(left-to-right), and "cur_" is the 8-bit or binary element value
(as 0 or 1) at the specific position before the operation.  The
main caveat seems to be that the "left_" value is accumulative,
so the operation is not trivially vectorizable.
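
To make the dependency concrete, here is a minimal scalar sketch of
the kernel for the 8-bit input, 32-bit sum case.  The function name
and the tightly packed row-major layout are assumptions made for
illustration; this is not the signature of rc_integral_sum_u8_u32.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference helper, not the RAPP API: computes a w x h
   integral image.  Both buffers are row-major without row padding. */
static void
integral_sum_u8_u32_ref(uint32_t *dst, const uint8_t *src,
                        size_t w, size_t h)
{
    for (size_t y = 0; y < h; y++) {
        uint32_t left = 0;    /* Result just written to the left. */
        uint32_t upleft = 0;  /* Result above and to the left. */

        for (size_t x = 0; x < w; x++) {
            uint32_t up  = y > 0 ? dst[(y - 1) * w + x] : 0;
            uint32_t cur = src[y * w + x];
            uint32_t val = left + cur + up - upleft;

            dst[y * w + x] = val;
            left = val;       /* Loop-carried: the next x needs val. */
            upleft = up;
        }
    }
}
```

Only "left" depends on the result of the previous iteration; up_,
upleft_ and cur_ are all available before the row is processed.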

Someone mentioned a paper presenting a vectorized version, but I
don't remember seeing any specific references.  A web search
seems to find me algorithms _using_ integral images as a base
part, but not the vectorized basic operation to compute an
integral image.  Pointers?  Better search results?

The left_ and upleft_ values seem like they can each be
adjusted with a RC_VEC_ALIGNC (right, besides the accumulation).
We also need to convert the binary elements to 8-, 16- and 32-bit
integral values; an operation that would also be applicable to a
vectorization of rc_type_bin_to_u8.  Then, just add and subtract
the intermediate 16- or 32-bit size elements.

While it's likely that the parallelization of the accumulation
requires additional operations or basic operations, I have some
suggestions for SIMD API additions to cover the rest.  I don't
think we need additional vector types for the 16- and 32-bit
size vectors; current SIMD back-ends don't require more than
casting within the API operation.  That might of course change
after inspecting the generated code.  Adding 16- and 32-bit
addition and subtraction operations seems otherwise trivial.
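
As an illustration of the intended wrapping semantics, here is a
sketch that assumes a vector emulated by a 64-bit integer holding
four 16-bit fields.  The names are made up for this example and it
is not taken from any existing back-end; on ALTIVEC, NEON and SSE2
a single native instruction would be expected to do the same job.

```c
#include <stdint.h>

/* High bit of each 16-bit field in a 64-bit word. */
#define H16 UINT64_C(0x8000800080008000)

/* Field-wise a + b, wrapping within each 16-bit field. */
static uint64_t add16_wrap(uint64_t a, uint64_t b)
{
    /* Add the low 15 bits of each field, then patch in the high bit
       with XOR so that no carry crosses a field boundary. */
    return ((a & ~H16) + (b & ~H16)) ^ ((a ^ b) & H16);
}

/* Field-wise a - b, wrapping within each 16-bit field. */
static uint64_t sub16_wrap(uint64_t a, uint64_t b)
{
    /* Force the high bit of each minuend field so that no borrow
       crosses a field boundary, then correct the high bit with XOR. */
    return ((a | H16) - (b & ~H16)) ^ ((a ^ ~b) & H16);
}
```

The 32-bit variants would follow the same pattern with a mask of
0x8000000080000000.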

For the binary conversion part, I briefly checked ALTIVEC, NEON
and SSE2.  I found no specific instructions for the task; the
conversion mapping in the suggested RC_VEC_UNPACKB<n> operation
seems to be as close as it gets.  Note, the suggested operation
yields 0 and -1 (255, 65535, 4294967295) as needed by
rc_type_bin_to_u8 rather than 0 and 1 as needed for the integral
sum.  I think rather than letting an API operation yield (0, 1),
we'll just open-code that part for the integral sum use as, say,
 rc_vec_t one; RC_VEC_SPLAT(one, 1); RC_VEC_AND(result, one, indata);
It might be that a negation schedules better for some
SIMD back-ends.  Then again, a comparison (most SIMD yield 0 and
-1 for each element for that part) followed by an AND to yield
another value should be common enough.
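
The two alternatives for getting from (0, -1) to (0, 1) can be
sketched on plain arrays of 8-bit elements standing in for vector
fields.  The helper names are hypothetical and the loops only model
what a single vector operation would do per field.

```c
#include <stddef.h>
#include <stdint.h>

/* Map 0x00 / 0xff fields to 0 / 1 by masking with the value 1. */
static void mask_to_01_and(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] & 1;
}

/* Same mapping by negation: 0 - 0xff wraps to 1, and 0 - 0 stays 0. */
static void mask_to_01_neg(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(0 - src[i]);
}
```

Both give identical results for inputs restricted to 0x00 and 0xff;
which one is cheaper would depend on the back-end.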

For ALTIVEC, SSE2 and NEON alike, it seems the implementation of
RC_VEC_UNPACKB<n> would go like this:
 - Form a vector with power-of-two n-size elements 1<<idx
   (a constant vector (1, 2, 4, 8, ...); the compiler will move
   it out of loops).
 - Repeat, across all vector elements, the part of the source
   vector with the binary input to convert.
 - AND the power-of-two vector with the repeated source.
 - Compare each n-size element in the AND result with zero, yielding
   (0, -1) for each element in the result.  (Sort-of RC_VEC_CMPGT, but
   yielding 0 and -1, as most SIMD do, rather than just the high bit.)
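
The steps above can be emulated in portable C for the 8-bit case.
This sketch assumes eight 8-bit fields per vector and that bit idx
of the input controls field idx; the real bit order depends on the
platform, and the function name is hypothetical.

```c
#include <stdint.h>

enum { LANES = 8 };  /* Assumed: eight 8-bit fields per vector. */

/* Emulates the four steps for one byte of binary input. */
static void unpackb8_emul(uint8_t dst[LANES], uint8_t bits)
{
    for (int idx = 0; idx < LANES; idx++) {
        uint8_t pow2 = (uint8_t)(1u << idx);  /* Constant (1, 2, 4, ...). */
        uint8_t rep  = bits;                  /* Source repeated per field. */
        uint8_t hit  = pow2 & rep;            /* AND with the power of two. */

        /* Compare with zero, yielding 0 or -1 (0xff) per field. */
        dst[idx] = hit != 0 ? 0xff : 0x00;
    }
}
```

For example, the input 0x05 sets fields 0 and 2 to 0xff and leaves
the remaining six fields at zero.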

I think I can put to rest any speculation that the binary input
to be converted should be read by other means than the basic
vector load.  There'll be an additional loop, iterating over
each binary input vector load, shifting the value after each
conversion, but no indications that the input is generally
better formed from, say, byte loads.
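
A rough sketch of that additional loop, with a 32-bit word standing
in for the loaded binary vector and eight bits consumed per
conversion.  The word size, bit order and function name are all
assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: expand one 32-bit word of binary pixels into 32
   bytes of 0x00 / 0xff, eight bits per conversion step. */
static void unpack_word(uint8_t dst[32], uint32_t word)
{
    for (size_t step = 0; step < 4; step++) {
        /* One conversion: the low eight bits become eight fields. */
        for (size_t idx = 0; idx < 8; idx++)
            dst[step * 8 + idx] = ((word >> idx) & 1) ? 0xff : 0x00;

        word >>= 8;  /* Shift the next group of bits into place. */
    }
}
```

The input is read once and then only shifted, which matches the
idea of a single basic vector load followed by repeated conversions.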

My current plan is to implement the framework (mostly the
testing bits) for these operations for a few SIMD back-ends and
put that on a public branch as a base for future work.

Comments?

diff --git a/compute/backend/rc_vec_api.h b/compute/backend/rc_vec_api.h
index 4815035..6c126f0 100644
--- a/compute/backend/rc_vec_api.h
+++ b/compute/backend/rc_vec_api.h
@@ -530,6 +530,54 @@ typedef arch_vector_t rc_vec_t;
 #define RC_VEC_ADDS(dstv, srcv1, srcv2)
 
 /**
+ *  Non-saturating addition, uint16_t elements.
+ *  Computes dstv = srcv1 + srcv2 for each 16-bit field in two's
+ *  complement truncating arithmetic (wrapping around at zero without
+ *  exceptions).
+ *
+ *  @param dstv   The output vector.
+ *  @param srcv1  The first input vector.
+ *  @param srcv2  The second input vector.
+ */
+#define RC_VEC_ADD16(dstv, srcv1, srcv2)
+
+/**
+ *  Non-saturating subtraction, uint16_t elements.
+ *  Computes dstv = srcv1 - srcv2 for each 16-bit field in two's
+ *  complement truncating arithmetic (wrapping around at zero without
+ *  exceptions).
+ *
+ *  @param dstv   The output vector.
+ *  @param srcv1  The first input vector.
+ *  @param srcv2  The second input vector.
+ */
+#define RC_VEC_SUB16(dstv, srcv1, srcv2)
+
+/**
+ *  Non-saturating addition, uint32_t elements.
+ *  Computes dstv = srcv1 + srcv2 for each 32-bit field in two's
+ *  complement truncating arithmetic (wrapping around at zero without
+ *  exceptions).
+ *
+ *  @param dstv   The output vector.
+ *  @param srcv1  The first input vector.
+ *  @param srcv2  The second input vector.
+ */
+#define RC_VEC_ADD32(dstv, srcv1, srcv2)
+
+/**
+ *  Non-saturating subtraction, uint32_t elements.
+ *  Computes dstv = srcv1 - srcv2 for each 32-bit field in two's
+ *  complement truncating arithmetic (wrapping around at zero without
+ *  exceptions).
+ *
+ *  @param dstv   The output vector.
+ *  @param srcv1  The first input vector.
+ *  @param srcv2  The second input vector.
+ */
+#define RC_VEC_SUB32(dstv, srcv1, srcv2)
+
+/**
  *  Average value, truncated.
  *  Computes dstv = (srcv1 + srcv2) >> 1 for each 8-bit field.
  *
@@ -751,6 +799,33 @@ typedef arch_vector_t rc_vec_t;
  */
 #define RC_VEC_GETMASKV(maskv, vec)
 
+/**
+ *  From a binary mask vector, unpack each of the left-most bits into
+ *  the corresponding 8-bit field as the value zero or 0xff.
+ *
+ *  @param  vec    The output vector.
+ *  @param  maskv  The input mask vector.
+ */
+#define RC_VEC_UNPACKB8(vec, maskv)
+
+/**
+ *  From a binary mask vector, unpack each of the left-most bits into
+ *  the corresponding 16-bit field as the value zero or 0xffff.
+ *
+ *  @param  vec    The output vector.
+ *  @param  maskv  The input mask vector.
+ */
+#define RC_VEC_UNPACKB16(vec, maskv)
+
+/**
+ *  From a binary mask vector, unpack each of the left-most bits into
+ *  the corresponding 32-bit field as the value zero or 0xffffffff.
+ *
+ *  @param  vec    The output vector.
+ *  @param  maskv  The input mask vector.
+ */
+#define RC_VEC_UNPACKB32(vec, maskv)
+
 /* @} */
 
 
brgds, H-P


