I'm experiencing an issue where calling AVX2 functions through function pointers or std::function works fine on Linux but crashes on Windows. Using the AVX2 intrinsics directly works on both platforms.
Specifically, I have these AVX2 functions (dummy versions; my real functions are much more complicated), which work fine when called directly:
// Simple test function that just multiplies vector by 2
__m256d test_simple_AVX2(const __m256d x) {
const __m256d two = _mm256_set1_pd(2.0);
const __m256d res = _mm256_mul_pd(x, two);
return res;
}
// The scalar version for comparison
double test_simple_double(const double x) {
const double res = 2.0*x;
return res;
}
Calling test_simple_AVX2 directly works fine. However, I need the code to be more general: I want to pass the function to another function and call it from inside that function.
Specifically, like this:
template <typename T>
inline void TEST_fn_AVX2_row_or_col_vector( Eigen::Ref<T> x_Ref,
FuncAVX fn_AVX,
FuncDouble fn_double) {
const int N = x_Ref.size();
const int vect_size = 4;
const int N_divisible_by_vect_size = (N / vect_size) * vect_size; // largest multiple of vect_size that is <= N (integer division truncates)
Eigen::Matrix<double, -1, 1> x_tail = Eigen::Matrix<double, -1, 1>::Zero(vect_size); // last vect_size elements
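if (N >= vect_size) { // only copy the tail when it exists; for N < vect_size the start index N - vect_size would be negative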
int counter = 0;
for (int i = N - vect_size; i < N; ++i) {
x_tail(counter) = x_Ref(i);
counter += 1;
}
}
if (N >= vect_size) {
alignas(32) double buffer[4]; // using an aligned buffer for AVX operations
for (int i = 0; i + vect_size <= N_divisible_by_vect_size; i += vect_size) {
// Copy data to aligned buffer
for(int j = 0; j < vect_size; j++) {
buffer[j] = x_Ref(i + j);
}
const __m256d AVX_array = _mm256_load_pd(buffer);
const __m256d AVX_array_out = fn_AVX(AVX_array); //// ERROR / ABORTED SESSION OCCURS HERE (so when calling "fn_AVX").
//// HOWEVER, if I do it manually (without calling a separate function), then it works, i.e.:
// const __m256d two = _mm256_set1_pd(2.0);
// const __m256d AVX_array_out = _mm256_mul_pd(AVX_array, two);
_mm256_store_pd(buffer, AVX_array_out);
// Copy back to Eigen
for(int j = 0; j < vect_size; j++) {
x_Ref(i + j) = buffer[j];
}
}
if (N_divisible_by_vect_size != N) { // Handle remainder
int counter = 0;
for (int i = N - vect_size; i < N; ++i) {
x_Ref(i) = fn_double(x_tail(counter));
counter += 1;
}
}
} else { // If N < vect_size, handle everything with scalar operations
for (int i = 0; i < N; ++i) {
x_Ref(i) = fn_double(x_Ref(i));
}
}
}
I have tried defining "FuncAVX" both as a function pointer and as a std::function. Neither works on Windows, but both work fine on Linux.
Here's my attempt using function pointers:
typedef __m256d (*FuncAVX)(const __m256d); /// not working (on Windows - fine on Linux)!!
Using std::function also doesn't work:
typedef std::function<__m256d(const __m256d)> FuncAVX; /// not working (on Windows - fine on Linux)!!
So just to make it clear, when I do:
const __m256d AVX_array = _mm256_load_pd(buffer);
const __m256d two = _mm256_set1_pd(2.0);
const __m256d AVX_array_out = _mm256_mul_pd(AVX_array, two);
_mm256_store_pd(buffer, AVX_array_out);
It works fine (even on Windows). However, if I call the function through "fn_AVX" instead, it fails on Windows but works on Linux, i.e. if I try this:
const __m256d AVX_array = _mm256_load_pd(buffer);
const __m256d AVX_array_out = fn_AVX(AVX_array); //// ERROR / ABORTED SESSION OCCURS HERE (so when calling "fn_AVX").
_mm256_store_pd(buffer, AVX_array_out);
It doesn't work on Windows.
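A cut-down sketch of the failing pattern, without Eigen or Rcpp, might look like the following. It is meant to illustrate the call path (load, indirect call through FuncAVX, store) rather than to be a verified reproducer. The helper name apply_via_pointer and the REPRO_AVX2 macro are made up for this sketch; the target("avx2") attribute is GCC/Clang-specific and is only there so the snippet builds without -mavx2.

```cpp
#include <immintrin.h>
#include <cassert>

// Assumption for this sketch only: enable AVX2 per function instead of via -mavx2.
#define REPRO_AVX2 __attribute__((target("avx2")))

// Same function-pointer type as in the question.
typedef __m256d (*FuncAVX)(const __m256d);

// Same dummy kernel as above: multiplies each lane by 2.
REPRO_AVX2 __m256d test_simple_AVX2(const __m256d x) {
    const __m256d two = _mm256_set1_pd(2.0);
    return _mm256_mul_pd(x, two);
}

// Hypothetical helper: applies fn_AVX in place to n doubles, 4 at a time.
// n must be a multiple of 4; unaligned loads/stores are used throughout.
REPRO_AVX2 void apply_via_pointer(double* data, int n, FuncAVX fn_AVX) {
    assert(n % 4 == 0);
    for (int i = 0; i < n; i += 4) {
        const __m256d v = _mm256_loadu_pd(data + i);
        const __m256d out = fn_AVX(v); // indirect call, __m256d passed and returned by value
        _mm256_storeu_pd(data + i, out);
    }
}
```

Calling apply_via_pointer(d, 8, test_simple_AVX2) on an array of 8 doubles should double every element on a machine with AVX2 support.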
Does anybody have any idea why this doesn't work on Windows?
Also, I have tried using the unaligned AVX intrinsics (i.e. _mm256_loadu_pd and _mm256_storeu_pd instead of _mm256_load_pd and _mm256_store_pd), and I still get the same issue!
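One variation that may be worth trying, under the assumption (not confirmed here) that the problem is tied to passing or returning __m256d by value through the indirect call: give the callback a signature that only takes plain double pointers, and keep all __m256d values local to the kernel. The names FuncAVXPtr, test_simple_AVX2_ptr and apply_chunks are hypothetical, and the target("avx2") attribute is again a GCC/Clang-specific addition so the sketch builds without -mavx2.

```cpp
#include <immintrin.h>
#include <cassert>

// Assumption for this sketch only: enable AVX2 per function instead of via -mavx2.
#define SKETCH_AVX2 __attribute__((target("avx2")))

// Hypothetical alternative signature: read 4 doubles from `in`, write 4 doubles to `out`.
// No __m256d crosses the function boundary.
typedef void (*FuncAVXPtr)(const double* in, double* out);

// Pointer-based version of the dummy kernel: multiplies 4 doubles by 2.
SKETCH_AVX2 void test_simple_AVX2_ptr(const double* in, double* out) {
    const __m256d x = _mm256_loadu_pd(in);      // unaligned load, no alignment requirement
    const __m256d two = _mm256_set1_pd(2.0);
    _mm256_storeu_pd(out, _mm256_mul_pd(x, two)); // unaligned store
}

// Applies fn in place to every full chunk of 4 doubles.
// Any remaining n % 4 elements are left untouched (scalar tail not shown).
void apply_chunks(double* data, int n, FuncAVXPtr fn) {
    assert(n >= 0);
    const int n_full = (n / 4) * 4; // largest multiple of 4 that is <= n
    for (int i = 0; i < n_full; i += 4) {
        fn(data + i, data + i);
    }
}
```

If this version runs on Windows while the by-value version does not, that would point at how the vector argument is passed rather than at the loads and stores in the loop.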
More info: I'm using C++ via Rcpp.
Compiler: g++
Compiler flags: -O3 -march=znver3 -mtune=znver3 -fPIC -D_REENTRANT -DSTAN_THREADS -pthread -fpermissive -mfma -mavx -mavx2 -flarge-source-files