User dispersion
It is crucial to consider throttling not only on an individual level but also across an entire tenancy. For instance, consider an organization with 3,000 users who load Cloud Drive Mapper (CDM) between 08:30 and 09:00, where every user has a dynamic site configured and each is a member of around 200 sites. This generates over 600,000 requests during the mapping process. Since the provider cache setting (CacheLifetimeInSeconds) is also applied universally across the tenancy, the process repeats every 3 hours by default, potentially causing major load spikes at predictable times throughout the day.
To avoid throttling across an entire tenancy, CDM must:
- Reduce the volume of requests at startup: Minimize the number of calls made to providers during the initial startup and mapping process.
- Eliminate fixed caching Time to Live (TTL) values: Fixed cache expiration times result in synchronized calls from all users, leading to load spikes that strain the remote provider.
There are two ways CDM handles these issues:
Cache refresh frequency
The Cache refresh frequency setting can be passed through to the provider for each drive. It defaults to an average of 3 hours but can be configured to a value of your choosing. Note, however, that the lower this value, the higher the probability of being throttled.
The cache refresh frequency value is shared across all endpoints within a drive. If an endpoint is shared between multiple drives with differing cache refresh frequencies, the first drive to initialize the endpoint sets the cache duration.
CDM adds a randomized component to the cache expiry to introduce variability in cache refresh times and reduce synchronized requests. Specifically, when an endpoint initializes, it generates a random time between 0 and 1/6 of the Cache refresh frequency (by default, between 0 and 1,800 seconds). CDM then adjusts the cache refresh frequency as follows:
- If the random number is even, CDM adds the random delay to the Cache refresh frequency.
- If the random number is odd, CDM subtracts the random delay from the Cache refresh frequency.
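The even/odd rule above can be expressed as a minimal sketch. The function name and parameter below are illustrative, not part of CDM's actual API:

```python
import random

def jittered_refresh_interval(base_seconds: int = 10_800) -> int:
    """Sketch of the jitter rule described above: draw a random offset
    between 0 and 1/6 of the base interval, then add it to the base if
    it is even, or subtract it if it is odd."""
    offset = random.randint(0, base_seconds // 6)  # default: 0-1,800 seconds
    if offset % 2 == 0:
        return base_seconds + offset
    return base_seconds - offset
```

With the default 3-hour (10,800-second) frequency, the effective interval always lands between 9,000 and 12,600 seconds, i.e. 2.5 to 3.5 hours.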
Example
Assume the mapping process takes 120 seconds per user, which represents a very large environment and is typically the upper end of how long this takes. For simplicity, consider the cache refresh frequency of the first 4 users logging in with an empty database and no startup cache persistence:
- User 1: Logs in at 08:30 and completes mapping by 08:32. With a base cache refresh frequency of 10,800 seconds, their next refresh is scheduled for 11:32 (finishing at 11:34), and then again at 14:34.
- Users 2-4: Without a random delay, users 2-4 logging in at 08:30 would all have refresh schedules identical to User 1, synchronizing high request volumes across all users.
By applying a random delay to the default average of 3 hours:
- User 2: Receives a random delay of 2,100 seconds (35 minutes), setting their cache refresh at 12:07 (finishing at 12:09), with the following refresh at 15:44.
- User 3: Receives a random delay of 1,400 seconds (23 minutes), setting their cache refresh at 11:55 (finishing at 11:57), with the next at 15:20.
- User 4: Receives a random delay of 1,101 seconds (18 minutes), setting their cache refresh at 11:13 (finishing at 11:15), with the following at 13:56.
This randomization generates a much more distributed pattern, allowing the provider to receive more evenly distributed requests throughout the day.
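The schedule above can be reproduced by applying the even/odd rule to each user's random delay. This is a sketch under the example's assumptions (mapping completes at 08:32, base frequency of 10,800 seconds); the names are illustrative, not CDM's:

```python
from datetime import datetime, timedelta

BASE = 10_800  # default Cache refresh frequency, in seconds

def next_refresh(finish: datetime, delay: int) -> datetime:
    """Even delay: add it to the base interval; odd delay: subtract it."""
    interval = BASE + delay if delay % 2 == 0 else BASE - delay
    return finish + timedelta(seconds=interval)

finish = datetime(2024, 1, 1, 8, 32)  # mapping completed at 08:32
for user, delay in [(2, 2_100), (3, 1_400), (4, 1_101)]:
    print(f"User {user}: next refresh at {next_refresh(finish, delay):%H:%M}")
# User 2: next refresh at 12:07
# User 3: next refresh at 11:55
# User 4: next refresh at 11:13
```

Note that 2,100 and 1,400 are even, so their delays are added; 1,101 is odd, so User 4's delay is subtracted, pulling their refresh earlier than the base schedule.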
Startup and mapping processes
During startup, CDM does one of two things:
- Condition 1: If the endpoint exists in the database or is a direct type, the drive mounts for the user immediately, before the initial delta runs.
- Condition 2: If condition 1 is not met, CDM must perform an initial delta on the root endpoint to confirm that items exist, ensuring that empty drives aren't shown. In this scenario, the user must wait until the initial delta completes before the drive mounts, which can lead to noticeable delays, especially if throttling slows the process.
Throttling can extend this process significantly if CDM must attempt the delta multiple times. While this is unavoidable when condition 2 is triggered for a user, CDM takes steps to minimize the chances of encountering that condition.
To improve user experience and avoid overloading the provider, CDM applies the following measures:
- Persisted cache: For users in condition 2, CDM avoids calling the remote provider immediately after mapping the drive. Instead, CDM returns results from the previous session's cached delta, regardless of its expiration status.
- Randomized TTL: If the cache has expired, CDM applies a randomized TTL based on a single random number rather than the standard CacheLifetimeInSeconds. CDM disregards the even/odd condition and sets the TTL to this random number alone.
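The startup TTL behaviour described above can be sketched as follows. This is an illustration under the stated rules, not CDM's actual implementation; the function and parameter names are hypothetical:

```python
import random

BASE = 10_800  # default Cache refresh frequency, in seconds

def first_delta_ttl(cache_expired: bool, remaining_ttl: int) -> int:
    """Sketch of the startup rule: the drive is mapped from the persisted
    cache either way; only the time until the first delta differs."""
    if cache_expired:
        # Expired cache: the TTL is the random number alone; the even/odd
        # add-or-subtract rule is not applied here.
        return random.randint(0, BASE // 6)
    # Unexpired cache: respect the TTL remaining from the previous session.
    return remaining_ttl
```

This keeps every startup fast (the cached delta is served immediately) while spreading the first round of provider calls across the randomization window.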
Example
Using the previous example, users 2-4 with dispersed TTLs would see the following staggered refresh timings:
- User 2: Logs in at 08:30 and completes immediately using cache. The next delta occurs at 09:05, completing by 09:07, and then again at 12:42.
- User 3: Logs in at 08:30 and completes immediately using cache. The next delta occurs at 08:53, completing by 08:55, and then again at 12:18.
- User 4: Logs in at 08:30 and completes immediately using cache. The next delta occurs at 08:48, completing by 08:50, and then again at 11:31.
In this example, each user's cache timing is rounded to the nearest minute for simplicity; actual timings are precise to the second. If a cache from a recent session persists (i.e., it has not expired), CDM respects its existing TTL, which further aids dispersion by spreading refresh times.
This approach distributes calls to the provider across the startup window. As a result, any user who does need to perform a delta (because condition 2 applies and no prebuilt cache exists) has unobstructed access to the provider, and the overall number of requests remains evenly distributed.
Consider a school environment where hundreds of students log in simultaneously within a short time. Without dispersion, this would generate a massive number of simultaneous requests to the remote provider.
The dispersion value is applied per endpoint, meaning that even multiple endpoints accessed by the same user are distributed, further minimizing burst loads on the remote provider.
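To illustrate the effect at scale, the sketch below simulates such a login burst, assuming each endpoint draws an independent randomized startup TTL as described above. The simulation and its parameters (300 users, fixed seed) are illustrative assumptions, not CDM behaviour beyond the stated rule:

```python
import random

random.seed(7)  # fixed seed so the simulation is repeatable
BASE = 10_800   # default Cache refresh frequency, in seconds

# 300 users all log in at t=0 with expired caches; each endpoint draws
# its own randomized startup TTL (seconds until its first delta).
first_deltas = sorted(random.randint(0, BASE // 6) for _ in range(300))

# Without dispersion, every delta would fire at the same instant; with it,
# the requests spread across the 0-1,800 second window.
print(f"earliest: {first_deltas[0]} s, latest: {first_deltas[-1]} s")
```

Running this shows the first deltas scattered across the half-hour window rather than arriving as a single spike.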