Thursday, May 23, 2013

Problems with using player data to assess level difficulty

Recently, I've been trying to build a system into my project, BOTS, to help assess the difficulty and "type" of levels submitted by users. The end goal is to integrate high-quality user submissions into the level progression, so the game experience grows organically and stays varied enough to keep users interested long into the game's lifespan. Working with user-generated content always brings problems; my most recent work has dealt with designing mechanisms into the game that discourage trolling or sandboxing behavior in level creators, reducing the number of messy or abusive submissions. Removing the worst content is quite helpful, but what we'd really like is to identify more user-centric information about each level and use it to sort the content.

The technique I have been exploring works in tandem with Knowledge Tracing to identify which knowledge components a level contains, and to assess how "difficult" it is within each of those components. I chose to explore a hill-climbing method driven by observations from play: when we are presented with a new level, we simply assume it is as hard as the hardest level in the game and that it contains every knowledge component. At first, only "expert" users, the ones who are most confident, competent, and challenge-seeking, will be shown this material, since we have overestimated its difficulty. Based on those users' successes and failures, we then adjust our estimate of the content so that our model stays consistent with what we already know about the students' concept states and proficiency.
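
To make that concrete, here is a minimal sketch of the pessimistic initialization in Python. The knowledge component names, the 0-to-1 difficulty scale, and the update rule are all illustrative assumptions of mine, not the actual BOTS implementation.

```python
# Toy sketch: assume a new level is as hard as anything in the game and
# requires every knowledge component, then only adjust downward as
# strong players succeed. Names and scales are assumptions, not BOTS code.

ALL_KCS = ["loops", "functions", "variables"]   # hypothetical knowledge components
MAX_DIFFICULTY = 1.0                            # assume difficulty is scaled to [0, 1]

def new_level_estimate():
    """Start from the worst case: hardest possible, contains every KC."""
    return {
        "difficulty": MAX_DIFFICULTY,
        "kc_present": {kc: 1.0 for kc in ALL_KCS},  # P(level requires this KC)
    }

def update_on_solve(estimate, hardest_level_solved_by_player):
    """Hill-climb downward: a solve caps the level's difficulty at the
    hardest level that player has already beaten."""
    estimate["difficulty"] = min(estimate["difficulty"],
                                 hardest_level_solved_by_player)
    return estimate
```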

One problem with this approach became apparent as soon as I began working with it: a very significant trade-off between information gained and user experience.

Let's say a very good user solves the problem. What have we actually learned, and how can we reasonably adjust the estimate? At most, we can say the problem is no harder than the most difficult problem that player has already solved, but since we're deliberately using the best users, that isn't much information. If the very good user gets the problem wrong, again, we don't learn much: either the problem is spectacularly difficult or the user simply made a mistake, and neither possibility does much to pin down the content's values.
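
As a toy illustration of how little an expert's success tells us, using the same made-up 0-to-1 difficulty scale as the sketch above:

```python
# Toy numbers on a 0-1 difficulty scale.
estimate = 1.0                  # pessimistic starting estimate for the new level
expert_hardest_solved = 0.95    # the expert has already beaten nearly everything

# A solve by the expert only caps the estimate at 0.95 -- it barely moves.
estimate = min(estimate, expert_hardest_solved)

# A failure is even less informative: the level may be near-impossible,
# or the expert may simply have slipped, so there is no clean update.
```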

Now suppose we give this content to an average user who knows only some of the game's core concepts. Based on whether they get the problem right or wrong, we learn something about which concepts the problem contains: if a user does not understand functions but solves the problem anyway, the problem probably doesn't contain functions.
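
Here is a naive sketch of that inference, with made-up concept names, mastery values, and constants; the real model would presumably be probabilistic (e.g., driven by Knowledge Tracing), but this captures the direction of the update.

```python
# Naive rule: a solve by a player who is weak in some knowledge component (KC)
# is evidence the level probably doesn't require that KC.
# All names, thresholds, and probabilities here are illustrative assumptions.

level_kc_estimate = {"loops": 1.0, "functions": 1.0, "variables": 1.0}
player_mastery    = {"loops": 0.9, "functions": 0.2, "variables": 0.8}

THRESHOLD = 0.5  # below this mastery we treat the KC as "unknown" to the player
DAMP = 0.5       # arbitrary shrink factor for the level's KC probability

# The player solved the level, so shrink the probability of every KC they
# probably don't know.
for kc, mastery in player_mastery.items():
    if mastery < THRESHOLD:
        level_kc_estimate[kc] *= DAMP

# P(level requires "functions") drops from 1.0 to 0.5; the others stay at 1.0.
```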

However, in contrast to the very good user, there is significant risk to the average user's experience. In the worst case for the expert, they are bored because the problem is very easy; in the worst case for the average user, the problem contains a concept they don't know. This is especially unfortunate because the worst case for the user is the best case for evaluating the content: that situation isolates a specific unknown concept, and if the player manages to solve the problem anyway, it shows us the problem cannot contain that concept!
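
One way to see the tension is to compute, for the same level estimate, a crude "what we could learn" score and a crude "risk of frustration" score for each player. The formulas below are purely illustrative stand-ins, not anything from the actual system.

```python
# Toy view of the trade-off: the same unknown KCs that make an outcome
# informative are the ones that put the player at risk of frustration.

def unknown_kcs(level_kc_estimate, player_mastery, threshold=0.5):
    """KCs the level may require that this player probably hasn't mastered."""
    return [kc for kc, p in level_kc_estimate.items()
            if p > 0.0 and player_mastery.get(kc, 0.0) < threshold]

def info_vs_risk(level_kc_estimate, player_mastery):
    risky = unknown_kcs(level_kc_estimate, player_mastery)
    expected_info = len(risky)                   # a solve could rule these KCs out
    frustration_risk = 1.0 - 0.5 ** len(risky)   # crude stand-in probability of a bad failure
    return expected_info, frustration_risk

expert  = {"loops": 0.95, "functions": 0.9, "variables": 0.95}
average = {"loops": 0.9,  "functions": 0.2, "variables": 0.8}
level   = {"loops": 1.0,  "functions": 1.0, "variables": 1.0}

print(info_vs_risk(level, expert))   # (0, 0.0)  -- safe for the player, but we learn little
print(info_vs_risk(level, average))  # (1, 0.5)  -- informative for us, but risky for the player
```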

I'm thinking more about this trade-off and will write more on this topic in the future.