google cloud platform - Issues with RedisCluster Connection Using ioredis - Stack Overflow


I’m experiencing some connection issues with RedisCluster.

I’m using Redis version 7.0 and connecting to a Redis Cluster (GCP Memorystore, using IAM auth with TLS disabled) with the ioredis package in a Node.js environment.

In my development environment, I notice frequent connection closures. I suspect this is due to inactivity, so I’ve set keepAlive to 600000.

In production, some pods occasionally encounter this error: "WRONGPASS invalid username-password pair or user is disabled."

Additionally, in Cloud Functions, some instances report this error: "[ioredis] Unhandled error event: ClusterAllFailedError: Failed to refresh slots cache."

Do you think the issue could be related to the Redis Cluster configuration? Any suggestions on how to resolve this?

Thanks

Code implementation:

 new Redis.Cluster(hosts, {
    scaleReads: 'all', // Send write queries to masters and read queries to masters or slaves randomly.
    redisOptions: {
        password: token,
        keepAlive: 600000, // 10 min in milliseconds
        reconnectOnError: (err) => {
            console.error('Reconnect on error:', err);
            return true;
        },
        maxRetriesPerRequest: null // Infinite retries for requests: commands wait until the connection is alive again.
    },
    slotsRefreshTimeout: 5000,
    clusterRetryStrategy: (times) => this.exponentialBackoffWithJitter(times)
})

asked Feb 1 at 16:14 by Shahar Ben Ezra, edited Feb 22 at 19:30 by avifen
  • Which type of auth do you use? Do you use IAM? – avifen Commented Feb 1 at 19:46
  • What is the host you supply? – avifen Commented Feb 1 at 19:47
  • Please provide more information - did the keepalive help? Do the pods get this error on connection, or after some time? Do all the pods have the same permissions, and are they all in the same VPC as the cluster? Are the Cloud Functions errors on connection, or later on? Same as for the pods: do all instances have the same permissions and are they in the same VPC? Did you try other things to debug it? For example, did you try a larger slotsRefreshTimeout? – avifen Commented Feb 1 at 20:05
  • I am using Identity and Access Management (IAM), and the host is the Discovery Endpoint. – Shahar Ben Ezra Commented Feb 3 at 11:39
  • The keepalive didn't help because I was still getting uncaughtException: WRONGPASS invalid username-password pair or user is disabled. ReplyError: WRONGPASS invalid username-password pair or user is disabled. at parseError (/app/node_modules/redis-parser/lib/parser.js:179:12) at parseType (/app/node_modules/redis-parser/lib/parser.js:302:14). The pods get the error after some time; all pods have the same permissions and are in the same VPC. As for slotsRefreshTimeout, I increased the number and I don't get that error anymore. – Shahar Ben Ezra Commented Feb 3 at 11:42

1 Answer

An IAM auth token is short-lived: in the GCP context it is valid for one hour only.
That means if one of your connections disconnects for any reason after an hour, you need to generate a new access token and use it as the password.
Even if you never disconnect, an authenticated connection is valid for only 12 hours and must then be re-authenticated.

In addition, GCP's Redis Cluster closes an idle connection after 10 minutes of inactivity.

So a few things are happening here:

  1. Each time you are idle for more than 10 minutes, your pod/local connection gets disconnected.
  2. If the disconnection happens after more than one hour, you get the WRONGPASS error, since the token is no longer valid.
  3. If you never disconnect but also never refresh the connection with a new token, your connection gets kicked out after 12 hours, and you again get the WRONGPASS error.
  4. In Cloud Functions, probably because of limited compute and networking, the slots refresh takes longer and fails to complete within the timeout.

Solutions:

For the idle issue, just send a PING every few minutes.
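A minimal sketch of such a keepalive loop follows. The `client` parameter stands in for the ioredis Cluster instance from the question, and the 4-minute default interval is an assumption chosen to stay under the 10-minute idle limit:

```javascript
// Keep the connection warm by pinging on a fixed interval.
// Returns the timer so the caller can stop it on shutdown.
function startKeepAlivePing(client, intervalMs = 4 * 60 * 1000) {
    return setInterval(() => {
        client.ping().catch((err) => console.error('Keepalive ping failed:', err));
    }, intervalMs);
}

// Usage (hypothetical):
// const pingTimer = startKeepAlivePing(clusterClient);
// clearInterval(pingTimer); // stop when shutting down
```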

ioredis doesn't support out-of-band credential providers, so you will need to set new tokens on the client connection object manually. The best solution for all of the issues above is to schedule a password replacement every 50 minutes and a re-authentication every 10 hours (less than the actual limits, for a safety margin):

function renewToken() {
    // Logic to generate or retrieve a fresh IAM access token
    return 'yourNewToken'; // Replace with your actual token logic
}

function updatePassword() {
    const newToken = renewToken();
    // Update the password used for future (re)connections
    clusterClient.options.redisOptions.password = newToken;
    console.log('Password updated successfully');
}

async function authenticate() {
    const newToken = renewToken();
    try {
        // Re-authenticate the existing connections with the new token
        await clusterClient.auth(newToken);
        console.log('Authenticated successfully with new token');
    } catch (error) {
        console.error('Error during authentication:', error);
    }
}

function schedulePasswordUpdates() {
    // Initial password update immediately
    updatePassword();

    // Update the password every 50 minutes
    setInterval(() => {
        updatePassword();
    }, 3000000); // 50 minutes in milliseconds

    // Every 10 hours: update the password, then re-authenticate
    setInterval(async () => {
        updatePassword();
        await authenticate(); // Run authentication after updating the password
    }, 36000000); // 10 hours in milliseconds
}

// Start the periodic password update and authentication
schedulePasswordUpdates();

See more on automating renew token in GCP docs: https://cloud.google.com/memorystore/docs/cluster/manage-iam-auth#automate_access_token_retrieval.

Other options:

  1. You can choose to just replace the token every hour and let the client handle the 12-hour disconnection by automatically retrying with the new token you set.
  2. You can choose not to renew at all and rely on error handling: each time you get disconnected, you kill the client, get a new token, and recreate the client.
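Option 2 could be sketched as follows. Everything here is hypothetical: `getFreshToken` and `hosts` stand in for your own IAM token retrieval and cluster endpoints, and the error handling that decides when to call `recreateClient` is up to you:

```javascript
// Build a factory that tears down the old client and creates a fresh one
// with a newly fetched token as the password.
function makeClusterFactory(RedisCluster, hosts, getFreshToken) {
    return function recreateClient(oldClient) {
        if (oldClient) {
            oldClient.disconnect(); // in-flight commands on the old client are lost
        }
        return new RedisCluster(hosts, {
            redisOptions: { password: getFreshToken() },
        });
    };
}

// Usage (hypothetical): on a WRONGPASS error,
// currentClient = recreateClient(currentClient);
```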

What to consider when choosing from the above:

  1. If you get disconnected, reconnecting is more costly than simply sending AUTH, so if it happens during heavy traffic, it is not the best idea.
  2. If you choose to kill the client and recreate it, you lose all the in-flight commands that were on their way.

So in general, I recommend the first option.

For the [ioredis] Unhandled error event: ClusterAllFailedError: Failed to refresh slots cache. error, just increase the slotsRefreshTimeout so Cloud Functions has enough time to complete the refresh.
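For example, a Cloud Functions-oriented tuning of the cluster options might look like the sketch below. Both option names exist in ioredis, but the numbers here are illustrative assumptions, not GCP-recommended values:

```javascript
// Hypothetical tuning for constrained Cloud Functions environments.
const cloudFunctionClusterOptions = {
    slotsRefreshTimeout: 20000,  // allow up to 20s per slots refresh (ioredis default is 1000)
    slotsRefreshInterval: 60000, // refresh the slot map once a minute
};

// Usage (hypothetical):
// new Redis.Cluster(hosts, { ...cloudFunctionClusterOptions, redisOptions: { password: token } });
```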

Disclosure:
I'm from AWS ElastiCache, not from GCP, and I don't use Memorystore myself.
My knowledge of GCP Memorystore and IAM comes from working with GCP engineers on valkey-glide, and from currently designing out-of-the-box IAM integration for valkey-glide, which will do all of the above for both GCP and AWS without the user needing to set it up manually; ElastiCache IAM usage and Memorystore IAM usage are very similar.
I might be missing something unique to GCP, but I don't think so - my current design work includes integration with both, plus a fair amount of research on GCP IAM auth.
See GCP pointing to glide as the future client for Valkey/Redis OSS.
