6 min read

The prime number theorem: Explorations in R

Prime numbers

A prime number is defined as a number greater than 1 that is only divisible by 1 and itself. Thus, 2, 3, 5, and 7 are all primes, but the following are not:

4 = 2 x 2
6 = 3 x 2
8 = 2 x 2 x 2
9 = 3 x 3

In fact, no even numbers greater than 2 are primes, given that they are all divisible by 2. As a result, primes are actually the “building blocks” of all other natural numbers via multiplication, since every number has a unique prime number decomposition (up to the order of the factors), known as the fundamental theorem of arithmetic.

Despite (or perhaps because of!) their fundamental nature, prime numbers are unwieldy to work with and it is often difficult to prove mathematical facts about them. In fact two of the most famous mathematical conjectures (unsolved problems) are actually about primes:

  • The Goldbach conjecture: Every even number greater than 2 can be written as a sum of 2 primes (4 = 2+2, 6 = 3+3, 8 = 3+5 etc)
  • The twin prime conjecture: There are an infinite number of primes exactly 2 apart (such as 3 and 5, 5 and 7, 11 and 13 etc)

The prime number theorem

One of the most famous results about prime numbers that has been proven is the prime number theorem, which gives the asymptotic distribution of prime numbers. The mathematical statement is that:

\(\pi(x)\) ~ \(x/log(x)\), where \(\pi(x)\) is the number of primes less than a real number x and \(log(x)\) is the logarithm of x in base e.

The interpretation of the “~” notation is that as x goes towards infinity (gets larger), \(\pi(x)\) asymptotically approaches \(x/log(x)\), meaning that the limit of \(\pi(x)/[x/log(x)]\) is 1. Another way of stating this is that the probability of any integer less than or equal to N being a prime is close to \(1/log(N)\) for large values of \(N\). This is because the fraction of integers less than or equal to N that are primes is \(\pi(N)/N\), which is equal to the probability of picking out a prime when choosing a number between 1 and N.

Prime numbers in R

I was very happy to see that the R statistical programming language actually has a library that deals with primes. I took a look at it and it relies on underlying code written in C++ to do an exhaustive search of whether a number is a prime (when considering a natural number \(N\), you only need to search for primes up to \(\sqrt{N}\)), so it’s actually pretty fast!

We first load the prime library:

library(primes)

Generate primes

As an example, we generate all primes between 2 and 50:

generate_primes(2,50)
##  [1]  2  3  5  7 11 13 17 19 23 29 31 37 41 43 47

Plot prime numbers on an array

We can plot prime numbers on an array. Below, we show the first 2,500 numbers as a 50 x 50 array, with the non-primes in grey and the primes in black:

library(ggplot2)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
##Make a 50 x 50 array
array_50_50 <- expand.grid(Row=1:50,Column=1:50)
##add the numbers corresponding to cells
array_50_50$Number <- 1:50^2
##add whether each number is a prime
array_50_50$Prime <- is_prime(array_50_50$Number)
head(array_50_50)
##   Row Column Number Prime
## 1   1      1      1 FALSE
## 2   2      1      2  TRUE
## 3   3      1      3  TRUE
## 4   4      1      4 FALSE
## 5   5      1      5  TRUE
## 6   6      1      6 FALSE
ggplot(array_50_50, aes(Column, Row, col=Prime)) +
  geom_point() +
  scale_color_manual(values=c("grey", "black"))

Explore the prime number theorem in R

The prime number theorem gives us an idea of the density of primes among all numbers, so the density of the black dots among all dots. The actual number of primes in the array above is 367, whereas the number predicted by the prime number theorem is 2500/log(2500) = 319.5277733. To understand the theorem a bit better, we can create function to obtain the number of primes less than or equal to a certain value:

pi_nr_primes <- function(x)
{
  sapply(x, function(x){length(generate_primes(1,x))})
}

We now plot \(\pi(x)\) for values between 2 and 2,500, with the orange line indicating the identity line:

x <- 2:2500

plot(pi_nr_primes(x) ~ x,
     ylab=expression(pi(x)))
abline(0,1, col="#E69F00", lwd=2)
text(350,200, "y=x", col="#E69F00")

We add \(x/log(x)\) to the plot in blue:

x_over_log_x <- x/log(x)

plot(pi_nr_primes(x) ~ x,
     ylab=expression(pi(x)))
abline(0,1, col="#E69F00", lwd=2)
lines(x_over_log_x ~ x, col="#56B4E9",lwd=2)
text(350,200, "y=x", col="#E69F00")
text(1000,100, "y=x/log(x)", col="#56B4E9")

It does not look like \(\pi(x)\) gets closer to \(x/log(x)\) as \(x\) increases, but note that the prime number theorem is actually about their ratio, not their difference. As a result, we plot the ratio instead, with the green line indicating 1:

y <- pi_nr_primes(x)/x_over_log_x
plot(y ~ x,
     ylab=expression(pi(x)/(x/log(x))))
abline(h=1, col="#009E73", lwd=2)

It still looks like \(\pi(x)/[x/log(x)]\) may flatten i.e. asymptote at a value higher than 1. In fact, pi(2500)/[2500/log(2500)] = 1.14857. If we go up to 10,000, however, we get 1.1319508. At 1,000,000, it becomes 1.0844899 and at a 100,000,000 (warning! this takes 40 seconds on my laptop) it becomes 1.0612992. I’m not sure we can really get to much higher numbers than that in R, but we know for a fact from graduate-level number theory that the limit is indeed 1 as x goes to infinity.

It would not make sense to make a plot for all points up to 100,000,000, because it would take an extraordinarily long time and also we would get a lot of overlapping points. Instead, we could randomly sample 2,500 points between 1 and 100,000,000 and make this plot for them. That was also causing my laptop to massively stall, so I had to be happy with looking at 1 to 1,000,000 (which also took over 1 minute to run):

x <- 1:(10^6)
set.seed(31048) ##set the seed so the same numbers are sampled each time
x_sample <- sample(x, 2500)
x_over_log_x_sample <- x_sample/log(x_sample)

y <- pi_nr_primes(x_sample)/x_over_log_x_sample

If we plot this to be on the same scale as before, it can be hard to see the decrease:

plot(y ~ x_sample,
     ylab=expression(pi(x)/(x/log(x))),
     ylim=c(0,max(pi_nr_primes(2:2500)/(2:2500/log(2:2500)))))
abline(h=1, col="#009E73", lwd=2)

However, if we plot just the values over 2,500, it is easier to see that there is an overall decrease, until it gets between 1.08 and 1.09:

plot(y[x_sample > 2500] ~ x_sample[x_sample > 2500],
     ylab=expression(pi(x)/(x/log(x))))
abline(h=1, col="#009E73", lwd=2)

Probability interpretation of the prime number theorem in R

Let’s also take a quick look at the probability interpretation of the prime number theorem. Thus, let’s sample 10,000 numbers with replacement from the first 2,500 numbers:

set.seed(4028502)
sample_from_1_to_2500 <- sample(1:2500, 10000, replace = TRUE)

The probability that each sampled number is a prime is exactly \(\pi(2500)/2500\) = 0.1468 but it can also be approximated as the number of “successes” in 10,000 trials, where a “success” is being a prime number. We can do this as follows:

sample_of_primes <- is_prime(sample_from_1_to_2500)
sum(sample_of_primes)/10000
## [1] 0.1448

This number is close to 0.1468. The approximation from the prime number is 1/log(2500) = 0.1278111.