Trivipedia 0 -> 0.6

Sep 06, 2021

Trivipedia is a trivia game based on the most popular non-pornographic pages on Wikipedia.

While I was at Google Research, I became super interested in data mining and unsupervised learning. Manually labelling data to train a machine learning model or to do some natural language processing was a lot of work. Even though there were companies such as Crowdflower to help with it, I’ve never liked the idea of someone sitting down with piles of mundane work in service of some hungry Artificial Intelligence algorithm.

Wikipedia is often used as a data source for data mining and unsupervised learning for these reasons:

All of the data can be downloaded in bulk and there are several formats for different methods.
The view counts for every Wikipedia page, aggregated by day, can also be readily downloaded in bulk, meaning that it’s easy to focus on the top content and crowd-source other behavioral information.
Because pages reference each other and there are categories, Wikipedia contains a nice graph of information that is extremely useful.

So the idea behind Trivipedia was to make the world’s largest trivia game by scraping popular facts from Wikipedia and turning them into multiple-choice questions where players had to guess the right answer for points.

The algorithm to generate the questions works like this:

1. Fetch the most popular X pages on Wikipedia based on views.
2. For each page, get the 3 nearest neighboring pages based on the page graph.
3. Get sentences inside the page containing the title of the page.
4. Replace the page title with blanks.
5. Create possible answers using the page title and the titles of the nearest neighbors.
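
Steps 3 through 5 can be sketched in a few lines of Python. This is a minimal illustration and not the original Trivipedia code: the `Question` class, the `make_questions` function, and the naive regex sentence splitter are all assumptions, and the neighbor titles from step 2 are simply passed in, since how "nearest" is measured on the page graph is not spelled out above.

```python
import random
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Question:
    prompt: str         # sentence with the page title blanked out
    choices: List[str]  # shuffled answer options
    answer: str         # the correct choice (the page title)


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def make_questions(title: str, text: str, neighbors: List[str],
                   rng: Optional[random.Random] = None) -> List[Question]:
    """Build one multiple-choice question per sentence that mentions the title."""
    rng = rng or random.Random()
    pattern = re.compile(re.escape(title), re.IGNORECASE)
    questions = []
    for sentence in split_sentences(text):
        if not pattern.search(sentence):
            continue  # step 3: keep only sentences containing the title
        prompt = pattern.sub("_____", sentence)  # step 4: blank the title
        choices = [title] + neighbors[:3]        # step 5: title plus 3 neighbors
        rng.shuffle(choices)
        questions.append(Question(prompt, choices, title))
    return questions


if __name__ == "__main__":
    page_text = ("Japan is an island country in East Asia. "
                 "It has a large population. "
                 "The capital of Japan is Tokyo.")
    for q in make_questions("Japan", page_text, ["China", "South Korea", "Taiwan"]):
        print(q.prompt, q.choices)
```

A sentence that only mentions the title in passing still becomes a question here, which is one way the very hard or nonsensical questions get in.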

This approach worked well! Some questions are extremely difficult (e.g. can you figure out Japan is Japan just from the amount of landmass?) but overall it was stellar. With some player data, the nonsense questions could be thrown out over time as more people played the game.

Surprisingly, the most viewed pages on Wikipedia are about porn stars. Even on Wikipedia! How many times does one read about the Early Life of Sasha Grey? Fortunately, all of the porn star pages belong to a porn star category so it was easy to remove them. Someone reading this will make a Pornpedia app and become a millionaire.
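
Combining the popularity ranking with the category filter might look roughly like the sketch below. The `top_pages` function, its inputs (a title-to-views mapping and a title-to-categories mapping), and the category names are hypothetical stand-ins for whatever the real view-count and category dumps provide.

```python
from typing import Dict, List, Set


def top_pages(view_counts: Dict[str, int],
              categories: Dict[str, Set[str]],
              excluded: Set[str],
              x: int) -> List[str]:
    """Return the x most viewed titles that belong to no excluded category."""
    allowed = [
        (title, views)
        for title, views in view_counts.items()
        # Drop a page if any of its categories is in the excluded set.
        if not (categories.get(title, set()) & excluded)
    ]
    allowed.sort(key=lambda pair: pair[1], reverse=True)
    return [title for title, _ in allowed[:x]]


if __name__ == "__main__":
    views = {"Page A": 100, "Page B": 90, "Page C": 80}
    cats = {"Page A": {"excluded-category"}, "Page B": {"countries"}}
    print(top_pages(views, cats, {"excluded-category"}, 2))
```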

The Good

Trivipedia was incredibly fun and one of these days I need to bring it back from the dead. I loved the graph of people’s performance in your geographic location.

The Bad

With no error handling or reporting of any kind, it wasn’t clear how to grow the app’s popularity and it faded into disrepair.

The Ugly

Despite the server being down for years, I don’t know how to remove the app from the Amazon app store. People are still buying it and demanding a refund.

The Takeaway

I learned a ton about mobile development and about data mining. The data mining was wildly successful and it showed me how incredibly important semi-supervised methods are. I ended up adding semi-supervised support to Sibyl during my tenure at Google Research, which helped many teams leverage unsupervised models to improve performance of their supervised models. I also remember a bunch of random factoids such as the cup size of Japan and the number of square miles in Sasha Grey.

ThreadPool.cc
