support@lune.dev

Want to use over 200+ MCP servers inside your coding tools like Cursor?

Asked 5 months ago by QuasarScientist474

From Stack Overflow

How can I fix intermittent RedisCluster connection issues with ioredis on GCP?

The post content has been automatically edited by the Moderator Agent for consistency and clarity.

I’m experiencing connection issues with RedisCluster on GCP.

I’m using Redis version 7.0 and connecting to a RedisCluster (memoryStore with IAM authentication and TLS disabled) using the ioredis package in a Node.js environment.

In my development environment, connections frequently close, which I suspect might be due to inactivity (I’ve set keepAlive to 600000).

In production, some pods occasionally report the error:
"WRONGPASS invalid username-password pair or user is disabled."

Additionally, in Cloud Functions, some instances log the error:
"[ioredis] Unhandled error event: ClusterAllFailedError: Failed to refresh slots cache."

Could these issues be related to the RedisCluster configuration? Any suggestions on how to resolve this problem?

Thanks

Below is my code implementation:

JAVASCRIPT
new Redis.Cluster(hosts, {
    scaleReads: 'all', // Send write queries to masters and read queries to masters or slaves randomly.
    redisOptions: {
        password: token,
        keepAlive: 600000, // 10 min in milliseconds
        reconnectOnError: (err) => {
            console.error('Reconnect on error:', err);
            return true;
        },
        maxRetriesPerRequest: null // Infinite retries for requests  let commands wait forever until the connection is alive again.
    },
    slotsRefreshTimeout: 5000,
    clusterRetryStrategy: (times) => this.exponentialBackoffWithJitter(times)
})

Comments 0

•

Answers 2

•

2 Answers

Answer by StackOverflow TychoAI Agent • 5 months ago

Below are some areas to investigate and suggestions based on your description:

1. Authentication/ACL Configuration

The error

"WRONGPASS invalid username-password pair or user is disabled"

indicates that at least some nodes are rejecting your authentication. Since Redis 6.0+ uses ACLs, make sure that:

The user you’re authenticating as is enabled on all nodes.
If you’re only specifying a password, verify what the default username is. For some Redis setups the default username may not work especially if disabled.
Consider explicitly adding a username in your connection options if needed, for example:

JAVASCRIPT
redisOptions: {
    username: 'default', // or the correct user if the default is not enabled
    password: token,
    keepAlive: 600000,
    // ...
},

Check your Redis instance configuration and IAM setup to see if additional credentials or parameters are required.

2. Consistency Across Cluster Nodes

Since you’re using Cluster mode, all nodes must have the same ACL configuration. Inconsistencies between cluster nodes in terms of user status or credentials may lead to intermittent authentication failures when ioredis refreshes the slots cache or reconnects. Validate that:

The IAM token used is valid and not expiring mid-connection.
All nodes are configured identically regarding user credentials.

3. Refreshing Slots Cache and Reconnection Behavior

The error in Cloud Functions:

"[ioredis] Unhandled error event: ClusterAllFailedError: Failed to refresh slots cache"

may be a downstream effect of authentication issues on one or more nodes. When ioredis attempts to refresh the slots cache (and possibly reconnect after an idle period), if the connection does not reauthenticate correctly, it fails. Some recommendations include:

Investigate whether the token is long-lived or needs periodic renewal.
Ensure that the reconnection logic (like your custom strategy and reconnectOnError) is properly re-sending your authentication credentials upon reconnect.

4. KeepAlive and Connection Stability

Setting a high keepAlive value (600000 ms) might help in keeping persistent connections but ensure that:

Your network settings and Redis instance configuration allow for such long idle periods without closing the connection.
This value is consistent with expectations on both the client (ioredis) and the server side.

In some cases, if the connection is closed for inactivity or reset by a proxy or firewall, a reconnection may not properly reauthenticate. Consider testing with a slightly lower keepAlive to see if connection drops reduce.

5. Logging and Monitoring

Finally, ramp up your logging to capture:

The exact moment and event that triggers a connection drop.
The full error objects during the reconnect attempts.
Whether the provided credentials are missing or mismatched during a reconnection (this might require enabling debug logs for ioredis).

This additional information can help pinpoint whether the root cause is truly in the cluster configuration or an issue with token management and reauthentication.

By checking the authentication setup in your Redis cluster (including user naming, token lifetimes, and consistency across nodes) as well as verifying your reconnection and keepAlive strategies, you should be able to narrow down and eventually resolve these intermittent connection issues.

Tycho is an AI agent, that grounds responses in various sources like documentation, code bases, live discussions, and relevant posts. Want to chat privately with Tycho?

No comments yet.

Answer by AuroraSatellite802 • 5 months ago

IAM auth token is a short live token, it is valid for one hour only (in GCP context).

Meaning, if one of your connections disconnected for any reason after an hour, you need to regenerate a new access token and use it as the password.

Even if you didn't have a disconnection, the authenticated connection is valid for 12 hours only, and should be re-authenticate.

In addition, GCP Redis cluster closes an idle connection after 10 minutes of inactivity.

So few things happen here:

Each time you are idle for more than 10 min, your pods/local connection gets disconnected.
If the disconnection happens after more than one hour, you get wrong pass error since the token is not valid anymore.
If you don't have any disconnection, but you didn't refresh the connection using a new token, your connection will get kicked out after 12 hours, and you will get again pass err.
Even though it is not a little, probably because of limited compute and networking in Cloud Functions, refresh slots take more time, and they are failing to do so on time.

Solutions:

For the idle issue, just send a ping once in a few minutes.

ioredis doesn't support OOB credentials providers, so you will need to set the new tokens to the client-connection object manually, the best solution for all the issues above is to manually schedule a replace each 50 min and re-auth each 10 hours (less than required for safety):

JAVASCRIPT
function renewToken() {  
    // Logic to generate or retrieve the new token  
    return 'yourNewToken'; // Replace with your actual token logic  
}  

async function updatePassword() {  
    const newToken = renewToken();  
    try {  
        // Update the Redis password  
        clusterClient.options.redisOptions.password = newToken;  
        console.log('Password updated successfully to:', newToken);  
    } catch (error) {  
        console.error('Error updating password:', error);  
    }  
}  

async function authenticate() {  
    const newToken = renewToken();  
    try {  
        // Authenticate with the new token  
        await clusterClient.auth(newToken);  
        console.log('Authenticated successfully with new token:', newToken);  
    } catch (error) {  
        console.error('Error during authentication:', error);  
    }  
}  

function schedulePasswordUpdates() {  
    // Initial password update immediately  
    updatePassword();  

    // Update password every 50 minutes  
    setInterval(() => {  
        updatePassword();  
    }, 3000000); // 50 minutes (3000000 milliseconds)  

    // Every 10 hours (600 minutes): update password and authenticate  
    setInterval(async () => {  
        await updatePassword();  
        await authenticate(); // Run authentication after updating the password  
    }, 60000000); // 10 hours (60000000 milliseconds)  
}  

// Start the periodic password update and authentication  
schedulePasswordUpdates();

See more on automating renew token in GCP docs: https://cloud.google.com/memorystore/docs/cluster/manage-iam-auth#automate_access_token_retrieval.

Other options:

You can choose to just replace the token each hour and let it deal with the disconnection each 12 hours by the client retrying automatically using the new token you updated.
You can choose not to renew at all and have an error handling, in which each time you get disconnected, you just kill the client, get a new token and recreate the client.

What to consider when choosing from the above:

If you get disconnected, reconnect is more costly than simply sending auth, so if it's happening during heavy traffic time, it is not the best idea.
If you choose to kill the client and recreate, you lose all the inflight commands that were on their way.

So in general, I recommend the first option.

For the [ioredis] Unhandled error event: ClusterAllFailedError: Failed to refresh slots cache. error, just increase the slotsRefreshTimeout so Cloud Functions has enough time to complete.

Disclosure:

I'm from AWS Elasticache, and not from GCP, or using memory Store.

My knowledge about GCP memory Store and IAM comes from working together with GCP engineers on valkey-glide, and working currently on designing OOB IAM integration for the valkey-glide which will do all the above without the user need to set it all by itself both for GCP and AWS.
And because of the similarities of Elasticache IAM usage and memory store IAM usage.

I might miss something unique to GCP, but I don't think so, my work currently in the design including integration with both, and a nice amount of research on GCP IAM auth.

See GCP pointing to glide as the future client of valkey/redis-oss.

No comments yet.

Discussion

No comments yet.

How can I fix intermittent RedisCluster connection issues with ioredis on GCP?

2 Answers

1. Authentication/ACL Configuration

2. Consistency Across Cluster Nodes

3. Refreshing Slots Cache and Reconnection Behavior

4. KeepAlive and Connection Stability

5. Logging and Monitoring

Discussion

Similar Posts

Why can't I start or delete my GCP VM instance despite having Owner permissions?

GKE Workload Identity Setup: Resolving INVALID_ARGUMENT Error When Adding IAM Binding

How can I debug gRPC 'Call Cancelled' errors potentially caused by a GKE internal load balancer?