World Record: Finding the first repeating 24-digit substring of Pi
In this article, I describe how I found the longest sequence known to date that repeats within the decimal representation of π - the determining factor being the position at which the repetition is completed. The sequence is 24 digits long and reads 307680366924568801265656.
If you look at the decimal representation of π, you will find sequences that occur several times - such as 1 and 26 here:
Special sequences of this kind are collected in the On-Line Encyclopedia of Integer Sequences (OEIS). This specific sequence can be found under the identifier A197123, called a(k) in the following. I got the inspiration to search for this sequence, and to tackle problems from computational science in general, from Jeff Sponaugle, who found a(19) - a(23) and described his procedure in detail in his video. It was also a great opportunity to push different data structures to their limits and compare them.
How many digits do we need?
I wanted to find the next sequence of length 24. However, the question arises as to how many digits have to be searched through in order to find a repetition of length 24 with sufficient probability. The longer the sequence we are looking for, the more digits we need. We can estimate how many with the help of the birthday paradox.
In its basic form, it states that in a relatively small group (n = 23) there is a surprisingly high probability (over 50%) that two people share a birthday.
Applied to a(k), the number of sequences we need to search corresponds to the number of people in the group, while the number of possible birthdays (365 in a normal year) corresponds to the number of possible sequences. Since the sequence we are looking for is \(k = 24\) digits long and the digits of π are effectively randomly distributed, there are \(10^{24}\) possibilities.
Let us first calculate the opposite event, namely the probability \(P\) that there is no repetition of length \(k\) for \(n\) randomly selected sequences:
\[P = \prod_{i=0}^{n-1} \left( 1 - \frac{i}{10^{k}} \right)\]
With \(P\) we can estimate how many sequences \(n\) we have to examine in order to find a repetition with sufficient probability, e.g. 90%.
\[1 - P \geq 0.90\]
A Poisson approximation can be used to estimate this, as the number of sequences considered is small relative to all possible sequences of length 24:
\[ P \approx \exp\left(-\frac{n^2}{2 \cdot 10^k}\right) \]
Setting \(P = 0.10\) (so that \(\ln(1/P) = \ln 10\)) and solving for \(n\):
\[ n \approx \sqrt{2 \cdot 10^k \cdot \ln(10)} \]
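As a sanity check, this estimate can be reproduced with a few lines of C#; k and the 90% confidence level are the values used in this section, the rest is just a small sketch.

```csharp
using System;

class BirthdayEstimate
{
    static void Main()
    {
        int k = 24;                 // length of the sequence we are looking for
        double confidence = 0.90;   // desired probability of finding a repetition

        // n ≈ sqrt(2 * 10^k * ln(1 / (1 - confidence)));
        // for 90% confidence this is sqrt(2 * 10^k * ln(10)).
        double n = Math.Sqrt(2 * Math.Pow(10, k) * Math.Log(1.0 / (1.0 - confidence)));

        Console.WriteLine($"Sequences needed for k = {k}: {n:E2}");   // ~2.15E+12
    }
}
```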
Now we can compare the approximated values with the positions where the already known repetitions were found. To illustrate this, I have calculated the real positions from a(1) to a(24):
This calculation shows that we would have to search through \(2.15 \times 10^{12}\) digits of π in order to find a repetition of length 24 with at least 90% certainty. We can also see that the 90% estimate would not have been sufficient for several sequence lengths (e.g. 10, 11, 19, 21). In the worst case, we would compute for a long time only to end up with no match because we had too few digits, and we would have to start all over again. Nevertheless, I chose a confidence level of 90%, which requires a Pi.txt with a size of 2.15 TB. I got this here (y-cruncher is required to assemble the parts):
Methods & Optimizations
The algorithm to find the first repetition in π is relatively simple. We need a data structure into which we insert all k-digit sequences one after the other; a sliding window over the digit stream is suitable for producing them. Before inserting, however, we check whether the current sequence already exists in the data structure. If it does, we have seen it before and have therefore found a repetition.
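As a minimal sketch of this idea (ignoring all of the memory concerns discussed below), the sliding window plus membership check could look roughly like this in C#; the method name and the assumption that the stream contains ASCII digits are mine, not taken from the original code:

```csharp
using System.Collections.Generic;
using System.IO;

class NaiveSearch
{
    // Slides a k-digit window over the digit stream and returns the 0-based
    // start position of the first window that has been seen before, or null.
    static long? FindFirstRepetition(TextReader digits, int k)
    {
        var seen = new HashSet<string>();
        var window = new Queue<char>();
        long position = 0;          // counts digits only

        int c;
        while ((c = digits.Read()) != -1)
        {
            if (!char.IsDigit((char)c)) continue;   // skip "3." and line breaks

            window.Enqueue((char)c);
            if (window.Count < k) { position++; continue; }

            string sequence = new string(window.ToArray());
            if (!seen.Add(sequence))                // Add returns false if already present
                return position - k + 1;            // start of the repeated window

            window.Dequeue();                       // slide the window by one digit
            position++;
        }
        return null;
    }
}
```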
Hash maps, BCD and jumping around
A hash map is ideally suited for this case. Inserting a new sequence and checking whether a sequence exists are both \(O(1)\) operations. This means that the time required for these operations is independent of the size of the data structure - in particular, the data structure does not have to be traversed.
For insertion, we first apply a suitable hash function to the sequence to calculate the position at which it will be stored. With a good hash function, all hash values occur with approximately the same probability, which reduces the collision rate.
The following example illustrates how a hash map works. Instead of a 24-digit sequence, we insert the following numbers one after the other into a hash map of size 5: 1, 2, 3, 6. A simple hash function is this: \[h(x) = x \bmod 5\]
A problem occurs when the same hash value is returned for two different inputs, as is the case here for 1 and 6. I handled this case using C# dictionaries located in the bins of the hash map. Internally, a dictionary is also based on a hash map and can grow dynamically. Alternatively, a binary tree would also be an option.
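A rough sketch of that layout - an outer array of bins whose entries are dictionaries - could look like this; string keys are used here only for readability, and the names are placeholders rather than the original code:

```csharp
using System.Collections.Generic;

class BinnedHashMap
{
    // Outer hash map: a fixed number of bins, each bin a Dictionary that
    // absorbs collisions of the outer hash function.
    private readonly Dictionary<string, long>[] bins;

    public BinnedHashMap(int binCount)
    {
        bins = new Dictionary<string, long>[binCount];
        for (int i = 0; i < binCount; i++)
            bins[i] = new Dictionary<string, long>();
    }

    // Returns true if the sequence was seen before, otherwise stores it
    // together with its position in the decimal expansion of pi.
    public bool CheckAndInsert(string sequence, long positionInPi)
    {
        int bin = (int)((uint)sequence.GetHashCode() % (uint)bins.Length);
        if (bins[bin].ContainsKey(sequence)) return true;
        bins[bin].Add(sequence, positionInPi);
        return false;
    }
}
```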
However, it quickly became clear that saving the entire sequence in ASCII format would require too much memory. We have already calculated that we need to search \(2.15 \times 10^{12}\) digits in π, which is almost exactly the number of sequences that need to be stored. In ASCII format, each digit requires 1 byte, so 24 bytes are required per stored sequence.
\[ 2.15 \times 10^{12} \times 24 \,\text{bytes} = 51.6 \,\text{TB} \]
My first idea was not to store the sequences in ASCII format, but as Binary Coded Decimal (BCD). As there are 10 possibilities per digit, only 4 bits are needed to represent a digit, since \(2^{4} = 16 > 10\). Accordingly, only 12 bytes are required to store a 24-digit sequence, which is a reduction of 50%. However, this is still too much.
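A possible BCD packing of a 24-digit window, two digits per byte, could look like this sketch (the helper name is mine):

```csharp
using System;

static class Bcd
{
    // Packs a string of decimal digits (even length) into one byte per two
    // digits: first digit in the high nibble, second digit in the low nibble.
    public static byte[] Pack(string digits)
    {
        if (digits.Length % 2 != 0)
            throw new ArgumentException("expected an even number of digits");

        var packed = new byte[digits.Length / 2];
        for (int i = 0; i < packed.Length; i++)
        {
            int hi = digits[2 * i] - '0';
            int lo = digits[2 * i + 1] - '0';
            packed[i] = (byte)((hi << 4) | lo);
        }
        return packed;   // 24 digits -> 12 bytes
    }
}
```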
Another optimization could be to save only the position of the sequence in π in the hash map instead of the sequence itself. However, the hash function, whose value determines the location in the hash map, is still calculated on the sequence itself. Compared to saving the sequence directly in ASCII format, a 50% reduction in memory could be expected. Combined with the BCD optimization, 30% or 15.05 TB of memory would be required, assuming an average of 12 decimal places per index. In the event of a collision, we would have to seek to the corresponding position in the Pi file and check whether the sequence found matches the current one. This can take some time depending on the storage medium (HDD vs. NVMe SSD), but should not happen too often with a sufficiently large hash map.
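The collision check could, for example, be a plain seek into the Pi file; this is a sketch under the assumption that the file contains only the raw decimal digits (one ASCII byte per digit, no separators):

```csharp
using System.IO;
using System.Text;

static class PiFile
{
    // On a hash collision: re-read the digits that start at the stored
    // position and compare them with the current window.
    public static bool SequencesMatch(string path, long storedPosition, string currentWindow)
    {
        using var stream = new FileStream(path, FileMode.Open, FileAccess.Read);
        stream.Seek(storedPosition, SeekOrigin.Begin);   // one byte per digit in ASCII

        var buffer = new byte[currentWindow.Length];
        stream.ReadExactly(buffer);                      // .NET 7+; otherwise use a read loop
        return Encoding.ASCII.GetString(buffer) == currentWindow;
    }
}
```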
Prefix Filter: Reduce memory usage at the cost of time
Instead of searching all sequences in π at once, we can also divide them into sections. The most elegant way is to only look at sequences that start with a fixed prefix at any time. Each prefix can then be checked individually, and each prefix search updates a global variable whenever it finds an earlier hit. As soon as a match is found in the current prefix, we can continue with the next one, as the current prefix can no longer yield a better candidate. In this way, the memory required at any given time is reduced by a factor of 10, 100 or even 1,000 - depending on how long the fixed prefix is. The prefix approach also avoids the handling of edge cases, such as a sequence lying on the boundary between two blocks, which would arise if the Pi file were naively subdivided into sections. The disadvantage, however, is that the runtime increases by almost the same factor as the amount of memory saved, as the decimal expansion of π has to be traversed correspondingly often. In practice, a prefix length of 1 should therefore be aimed for, or at most 2 if several prefixes can be searched at the same time.
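To make the idea concrete, here is a sketch of the outer loop over two-digit prefixes. SlidingWindows is a hypothetical helper that yields every 24-digit window together with the position at which it ends; the structure is illustrative, not the original implementation:

```csharp
using System;
using System.Collections.Generic;

class PrefixSearch
{
    // Hypothetical helper: yields each k-digit window and the position where
    // it ends (see the sliding-window sketch above for one way to do this).
    static IEnumerable<(string Window, long Position)> SlidingWindows(string piPath, int k)
        => throw new NotImplementedException();

    static void Search(string piPath)
    {
        long bestSecondOccurrence = long.MaxValue;   // earliest completed pair so far (global)

        for (int p = 0; p < 100; p++)                // fixed two-digit prefixes "00".."99"
        {
            string prefix = p.ToString("D2");
            var seen = new HashSet<string>();

            foreach (var (window, position) in SlidingWindows(piPath, 24))
            {
                if (position >= bestSecondOccurrence) break;   // no better hit possible
                if (!window.StartsWith(prefix)) continue;      // prefix filter

                if (!seen.Add(window))                         // repetition within this prefix
                {
                    bestSecondOccurrence = position;
                    break;                                     // continue with the next prefix
                }
            }
        }
    }
}
```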
Disk-based strategy still requires some fine-tuning
Instead of using memory, I also considered a disk-based approach, as the speeds of modern SSDs are getting closer and closer to those of memory. The problem, however, is the file system, which rewrites an entire sector, at least 4 KB in size, for every change, even if only one bit is changed. Accordingly, a multiple of the amount actually required would be written, so that the TBW (maximum terabytes written, usually ~600 TB) specified by the manufacturer would quickly be reached. A disk-based approach would certainly be possible if you buffer changes in RAM and regularly write them back to disk in larger batches instead of individually. Modern databases follow a similar approach.
The solution: Bloom Filter
Despite the optimizations mentioned, 1.5 TB of RAM would still be required with a fixed prefix length of 1. Especially with regard to a(25) and beyond, a better approach was needed. Therefore, I replaced the hash map with a bloom filter.
A bloom filter is a probabilistic data structure that indicates whether an object may have been inserted or has certainly not been inserted yet. Several hash values are calculated for the same element (using different hash functions or slight modifications of the same hash function), and a bit is set in the bloom filter at each of these hash locations. If we now want to know whether a certain element has already been inserted, we calculate the hash functions and check whether the bits are set at all positions. If at least one is not set, then the element has certainly not been inserted into the bloom filter before; otherwise it probably has.
Here is a short example of how a bloom filter works:
Let's say we use the hash functions \(f(x) = x \bmod 3\) and \(h(x) = x \bmod 7\). We want to insert 8, 13 and 20. In the beginning, the bloom filter is empty.
For 8, we get the hash values 2 and 1. Since both places are empty, we know that 8 has not been inserted yet. We set the bits at the hash locations.
Now we want to insert 13. The hash values are 1 and 6. Since the bloom filter is empty at 6 (even though index 1 is set) we certainly know that 13 has not been inserted previously. We set the remaining bit at index 6 and start inserting number 20.
The hash values for 20 are 2 and 6. Since both index 2 and 6 are already set in our bloom filter, we have to (wrongly) assume that we probably inserted the number 20 before.
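The toy example can be reproduced directly; the filter is just a bit array of 7 positions (because the larger modulus is 7) and the two hash functions from above:

```csharp
using System;

class ToyBloomFilter
{
    private readonly bool[] bits = new bool[7];

    private static int[] Hashes(int x) => new[] { x % 3, x % 7 };

    // Returns true if x was *possibly* inserted before, false if certainly not,
    // and sets the corresponding bits in either case.
    public bool CheckAndInsert(int x)
    {
        bool possiblySeen = true;
        foreach (int h in Hashes(x))
        {
            if (!bits[h]) possiblySeen = false;
            bits[h] = true;
        }
        return possiblySeen;
    }

    static void Main()
    {
        var filter = new ToyBloomFilter();
        Console.WriteLine(filter.CheckAndInsert(8));    // False: bits 2 and 1 were empty
        Console.WriteLine(filter.CheckAndInsert(13));   // False: bit 6 was still empty
        Console.WriteLine(filter.CheckAndInsert(20));   // True: bits 2 and 6 already set (false positive)
    }
}
```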
To find repetitions in π, we check all sequences in the decimal expansion of π one after the other to see if they have already been inserted into the bloom filter, and insert them otherwise. If a sequence appears to have been inserted already, we can record it as a candidate and traverse the entire Pi file once at the end to verify all potential matches. Alternatively, we could start the verification immediately for every possible hit. On the one hand, running through the entire Pi file is time-consuming and should be avoided. On the other hand, we could save ourselves a lot of work if a repetition is found for an early prefix, as we then don't have to search any deeper in π than the known hit.
The probability of false positives, i.e. that an object is recognized as inserted although it has not been, can be reduced by increasing the bloom filter size and selecting the correct number of hash functions.
For example, with just 135 GB of memory and 30 hash functions, one two-digit prefix of the \(2.15 \times 10^{12}\) sequences of π can be checked with an expected number of false positives of less than 1. We would also get by with just 90 GB of memory, but would then have to expect approx. 2200 false positives per prefix, which can be verified within one traversal at the end. The low use of resources makes it possible to work with finer prefixes, for example those of length 2, and to search several of them simultaneously.
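Numbers like these follow from the standard approximation for the false-positive rate of a bloom filter with \(m\) bits, \(h\) hash functions and \(n\) inserted elements; multiplying it by the number of lookups gives the expected number of false positives:

\[ p \approx \left(1 - e^{-hn/m}\right)^{h} \]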
With more memory, fewer hash functions are required, which can increase the speed. However, my profiling showed that most of the time is spent on the memory accesses for insertion anyway, not on calculating the hash functions.
The relationship between the false positive rate, number of hash functions and size can be calculated using this calculator, for example:
The existing bloom filter libraries in C# have limitations: they either lack performance or use signed integers for indexing, which limits the capacity to \(2^{31}\) elements. To overcome this, I implemented my own filter in C#. While still using integers for indexing, my approach divides the desired size of the bloom filter into multiple 2 GB bit arrays. This choice takes advantage of the CPU's ability to optimize integer calculations. In addition, I use "unsafe" code in C# to manipulate bits directly via pointers, ensuring the most efficient access possible.
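A simplified sketch of this layout (not the original code): the bit array is split into several large byte[] chunks, kept at 1 GiB here to stay safely under .NET's per-object limit, and the bits are read and written through pointers inside unsafe blocks (the project must be compiled with AllowUnsafeBlocks):

```csharp
using System;

unsafe class LargeBloomFilter
{
    private const long ChunkBytes = 1L << 30;   // 1 GiB per chunk
    private readonly byte[][] chunks;
    private readonly ulong totalBits;

    public LargeBloomFilter(ulong totalBits)
    {
        this.totalBits = totalBits;
        long totalBytes = (long)((totalBits + 7) / 8);
        int chunkCount = (int)((totalBytes + ChunkBytes - 1) / ChunkBytes);
        chunks = new byte[chunkCount][];
        for (int i = 0; i < chunkCount; i++)
        {
            long remaining = totalBytes - (long)i * ChunkBytes;
            chunks[i] = new byte[Math.Min(ChunkBytes, remaining)];
        }
    }

    public void Set(ulong hash)
    {
        ulong bit = hash % totalBits;
        long byteIndex = (long)(bit >> 3);
        fixed (byte* p = chunks[byteIndex / ChunkBytes])        // pointer-based access
            p[byteIndex % ChunkBytes] |= (byte)(1 << (int)(bit & 7));
    }

    public bool IsSet(ulong hash)
    {
        ulong bit = hash % totalBits;
        long byteIndex = (long)(bit >> 3);
        fixed (byte* p = chunks[byteIndex / ChunkBytes])
            return (p[byteIndex % ChunkBytes] & (1 << (int)(bit & 7))) != 0;
    }
}
```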
In the end, however, I replaced the C# bloom filter with my own bloom filter implementation in C++, which I integrated into my C# project via a wrapper. This roughly doubled the speed.
I use the 128-bit version of MurmurHash3 as the hash function.
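MurmurHash3's 128-bit output is convenient because the two 64-bit halves can be combined to derive as many index values as needed without re-hashing the input (Kirsch-Mitzenmacher double hashing). Whether the original implementation does exactly this is my assumption; the sketch below only shows the derivation step:

```csharp
static class HashDerivation
{
    // Derives the i-th bloom filter index from the two 64-bit halves (h1, h2)
    // of a single 128-bit digest: g_i = h1 + i * h2 (reduced modulo the filter
    // size by the caller).
    public static ulong Derive(ulong h1, ulong h2, int i) => h1 + (ulong)i * h2;
}
```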
Hardware used in my little lab
I tested the hash map and bloom filter based implementations on a shared TUM server that is available to all students. Since I wanted to use the resources responsibly, I only searched two two-digit prefixes at the same time, each with 90 GB of memory.
I originally wanted to search the entire Pi file on this machine, which would have taken several months; I wasn't in a hurry either. However, it became apparent that Jeff was also searching for a(24) and that he had (potentially) allocated considerably more resources to it. So the competition was on 😃.
I therefore moved the processing of the remaining 90 prefixes to my small compute cluster, which consists of 3x AMD EPYC servers with 750 GB of memory each, 2x AMD EPYC servers with 512 GB RAM and 2x Intel Xeon servers with 256 GB RAM. With the exception of the Xeon servers, all have DDR5 RAM. On these machines, I was able to search 38 prefixes in parallel. On the EPYC servers, each run takes just under an hour per billion sequences added to the bloom filter. In total, the remaining 2-digit prefixes took about 3-4 days.
For testing, I also rented a server with dual Intel Xeon Platinum CPUs, which also has DDR5 memory and generally has similar specifications to the AMD servers, but the Pi search was much slower there. This is probably due to the lower memory bandwidth, caused by the lower number of memory channels. In addition, in the NUMA architecture each physical processor has its own local part of the main memory, to which it has fast access. However, if a task running on CPU A tries to access the memory area of CPU B, access is considerably slower, which could also play a role.
I ran the disk-based tests on a low-end Intel Xeon server with 128 GB of memory, as it has a data-center NVMe SSD that can handle a higher write load.
I streamed the 2 TB Pi file from my storage server to the workers via a 10 Gbit/s connection. I also implemented a load-balancing system that ensures that no two instances ever process the same prefix.
307680366924568801265656
occurs at position 720,433,323,463 (~720 billion) and is repeated at 1,024,968,002,034 (~1 trillion). It is the first 24-digit sequence that repeats itself in the decimal expansion of π. The repeat position is key, as it marks the point where the pair is completed. For example, I also found the 24-digit sequence 236354815530106853645486, which occurs for the first time at approximately 398 billion, but its second occurrence is only at around 1.5 trillion, so it is not the first 24-digit repetition.