How I solved Wordle with Python and a bit of statistics
The last article I wrote for this substack was over 2.5 years ago. I’m excited to restart this newsletter (special shoutout to my friend Ayesha for inspiring me) with a broader focus on my passions and interests and will try my best to write more consistently! Hopefully you find these posts interesting and enjoy the small periodic glimpses into my life, and thank you for reading and/or subscribing —Dhruv
When I discover a new game, I sometimes like to create a simulation of it to delve into its mechanics and explore different strategies. This sometimes involves adapting the game into a model for a different purpose, as I previously demonstrated with Monopoly. Alternatively, I may implement various strategies for playing the game to compare their effectiveness.
Although it’s been a few years since Wordle had a stranglehold on our group chats, I wanted to revisit a program I made at the time with this second purpose in mind. I cleaned up and posted the code on my GitHub (link), and in this post, I’ll walk through the method. If you’re more of a visual learner, I would recommend watching Grant Sanderson’s video on the YouTube channel 3blue1brown. The video came out a few days after I wrote my engine—I thought it was cool that we landed on basically the same idea with a slight variation.
Wordle explained
The goal of Wordle is to correctly guess a five-letter word in six or fewer tries. After each guess, the game color codes the squares to give you hints about the letters in the solution. A “green” square means that you have the correct letter in the correct position. A “yellow” square means you have chosen a letter that is in the word but placed it in the wrong position. A “black/grey” square means that the letter is not in the final word. Take a look at this example:
The S, A and E are in the correct position, and the T is the right letter but in the wrong position (in this case, the answer was SKATE). As you might have noticed, this was a suboptimal guess. The first clue already showed that “R” is not in the final word, so a better guess would have been something like STAKE, which would have resulted in a yellow T and K, or something like STAVE/STADE which would have generated the same color sequence but eliminated another letter.
This logic leads us to the start of a very basic strategy. The pool of words you want to draw your next guess from should consist of:
Words that have letters in the same positions as the green letters in your previous guesses
Words that do not have letters in the same positions as yellow letters, but do have those letters in the word with at least the same frequency
Words that do not have any letters previously identified as black/grey
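As a sketch, the three rules above might look like the following in Python. This is my own illustrative helper, not the code from the repo; I’m assuming feedback is encoded as a string like "GYGXG" (G = green, Y = yellow, X = black/grey), and duplicate letters are handled approximately via letter counts:

```python
from collections import Counter

def is_valid(word, guess, colors):
    """Check whether `word` is consistent with the clue `colors`
    we received for `guess` (G = green, Y = yellow, X = grey)."""
    for i, (g, c) in enumerate(zip(guess, colors)):
        if c == "G" and word[i] != g:
            return False  # rule 1: greens must stay in place
        if c == "Y" and word[i] == g:
            return False  # rule 2: yellows must move position
    # rule 2 (cont.): green/yellow letters must appear at least as often
    need = Counter(g for g, c in zip(guess, colors) if c in "GY")
    have = Counter(word)
    for letter, count in need.items():
        if have[letter] < count:
            return False
    # rule 3: a grey letter means the word has no extra copies of it
    for g, c in zip(guess, colors):
        if c == "X" and have[g] > need[g]:
            return False
    return True
```

For instance, after guessing STARE against SKATE (clue "GYGXG"), SKATE and SPATE remain valid while SHALE (no T) and STALE (T still in the yellow position) are filtered out.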
While a human’s “next guess” is determined by whatever words they can conjure up that best conform to the above rules, a computer has access to the entire dictionary of five-letter words—how do we pick our guess from the set of all possible valid guesses? We could select from the pool randomly, but a better approach is to use probability.
The intuition behind the algorithm
At turn t, let W be the set of valid next guesses and w be one possible guess from this set. Now, suppose we pairwise compare w with each word u in W \ {w} (meaning every word in W that isn’t w). In particular, we let H(w, u) be the string that defines the relationship between w and u; for example, H(STAKE, SKATE) = GYGYG. We can imagine putting the word SKATE in a bucket labeled “GYGYG” and repeating this process for every u to sort the set W \ {w} into buckets.
Now, when we get the actual results of the guess w, we throw away all the buckets except for the one labeled with the result we got. This bucket now represents the pool of guesses to choose from in turn t+1, and we repeat the process, selecting a new w from the remaining set until we reach GGGGG.
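In code, computing H(w, u) and sorting the pool into buckets might look like this sketch (function names are mine, not necessarily the repo’s; feedback strings use G/Y/X for green/yellow/grey):

```python
from collections import Counter, defaultdict

def feedback(guess, answer):
    """H(guess, answer): G = green, Y = yellow, X = black/grey."""
    colors = ["X"] * len(guess)
    remaining = Counter()
    # first pass: mark greens, and count the answer's unmatched letters
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            colors[i] = "G"
        else:
            remaining[a] += 1
    # second pass: mark yellows left to right, consuming unmatched letters
    for i, g in enumerate(guess):
        if colors[i] == "X" and remaining[g] > 0:
            colors[i] = "Y"
            remaining[g] -= 1
    return "".join(colors)

def make_buckets(w, pool):
    """Sort every other word u in the pool into a bucket labeled H(w, u)."""
    buckets = defaultdict(list)
    for u in pool:
        if u != w:
            buckets[feedback(w, u)].append(u)
    return buckets
```

For example, `feedback("STAKE", "SKATE")` returns `"GYGYG"`, matching the example above.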
Obviously, the fastest way to get to the right answer is to select a w at each step that minimizes the number of words in whatever bucket ultimately gets chosen. Intuitively, the way to get the best performance on average is to choose a w that “spreads out” the words across as many buckets as possible. A natural way to capture this is to leverage the concept of expectation.
Calculating expected bucket size
Because any word is equally likely to be the solution, if there are n words in W and |k| is the number of words in bucket k, the probability of that bucket containing the answer is |k|/n. If we let B(w | W) be the set of buckets created by guess w given the word pool W, the expected size of the chosen bucket (each bucket’s size times the probability of that bucket, summed over all buckets) is

E[size] = Σ_k |k| · (|k|/n) = (1/n) Σ_k |k|²

where the sum runs over the buckets k in B(w | W).
So, to find our next guess w*, we compute

w* = argmin_w (1/n) Σ_k |k|²

where w ranges over W and k ranges over the buckets in B(w | W).
To summarize in one line: at every step, we guess the word that minimizes the expected number of remaining options we’ll have after we get the result back!1
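Putting the pieces together, here’s a minimal self-contained sketch of that selection rule (helper names are mine; the compact `feedback` re-implements the G/Y/X pattern so the snippet stands alone):

```python
from collections import Counter

def feedback(guess, answer):
    # G = green, Y = yellow, X = grey
    colors = ["X"] * len(guess)
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            colors[i] = "G"
    for i, g in enumerate(guess):
        if colors[i] == "X" and remaining[g] > 0:
            colors[i] = "Y"
            remaining[g] -= 1
    return "".join(colors)

def expected_bucket_size(w, pool):
    # E[size] = sum over buckets of |k| * (|k| / n)
    sizes = Counter(feedback(w, u) for u in pool if u != w)
    n = len(pool)
    return sum(s * s for s in sizes.values()) / n

def best_guess(pool):
    # w* = argmin over w in W of the expected remaining pool size
    return min(pool, key=lambda w: expected_bucket_size(w, pool))
```

Python’s `min` with a `key` function plays the role of the argmin over W.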
It’s worth noting that while I minimize the expected size of the chosen bucket, Grant leverages information theory (maximizing the expected information, in bits, of the result), which leads to a slightly different objective function.
The reason Grant does this is that he assumes the words themselves are not equally likely to show up as answers; instead, the probability of a word being selected by the NYT is proportional to its frequency in the English language. I disagree with this assumption, but maybe I’m wrong and that’s how the game actually works.
Wrapping up
The method works! The code is available here with instructions on how to run it.
I also included support for the game Absurdle, a variant of Wordle where the secret word changes at each step so that it falls into the bucket with the MOST words. In practice, this meant changing the guess selection mechanism to opt for the word with the smallest maximum bucket size.2
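That minimax change might be sketched as follows (again a hypothetical helper of mine, with a compact `feedback` re-declared so the snippet stands alone):

```python
from collections import Counter

def feedback(guess, answer):
    # G = green, Y = yellow, X = grey
    colors = ["X"] * len(guess)
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            colors[i] = "G"
    for i, g in enumerate(guess):
        if colors[i] == "X" and remaining[g] > 0:
            colors[i] = "Y"
            remaining[g] -= 1
    return "".join(colors)

def absurdle_guess(pool):
    """Absurdle's answer always hides in the largest bucket, so
    pick the w whose largest bucket is as small as possible."""
    def worst_case(w):
        sizes = Counter(feedback(w, u) for u in pool if u != w)
        return max(sizes.values(), default=0)
    return min(pool, key=worst_case)
```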
I had a lot of fun writing (and rewriting3) this engine and found it to be a very practical application of probability theory. Reach out if you have any comments or feedback on the method or suggestions on how to improve what I’ve written!
1 As a sidenote, there is a variant of Wordle called Absurdle where the solution changes each turn to a word from the largest “bucket”. So, this solution should, I think, be optimal for that variant of the game as well!

2 Although this approach makes sense, as of writing, I wasn’t able to get my engine down to the theoretical best of four guesses. This is probably worth spending more time debugging.

3 While I was writing this post, I realized there was a 10x simpler way to implement the algorithm and ended up rewriting the whole thing. It’s way more readable. I guess there’s something to be said about writing and coding going hand in hand (or maybe Substack is my new rubber duck debugging).