Focus on the Assessment Construct

A structured assessment response to generative AI.

As Artificial Intelligence continues to reshape our educational landscape, we need to refine our assessment methodologies to ensure their value and purpose remain relevant. However, it is important to note that in most cases, AI should not drive the purpose of an assessment task; rather, the assessment construct is what should define the form of an assessment.

An assessment construct refers to the specific trait, attribute, skill, knowledge, or ability that an assessment or measure is designed to evaluate.

Drawing an analogy from traffic light signals (Red, Yellow, Green), we can categorise and clarify the purpose behind each type of assessment in the age of generative AI.

Red

The “Red” category represents tasks typically designed to assess foundational knowledge and skill development. Despite the emerging opportunities that AI is bringing to society, and specifically to education, there is still an important place for learning and assessing foundational knowledge and skills. It becomes very problematic to send red tasks home or to allow them to be completed in unsupervised conditions. Why? Generative AI could significantly compromise the construct validity of these specific assessments. If we are trying to assess an individual’s ability to think about what has been learnt, generative AI is obviously going to be problematic in ways Grammarly never was.

This is not a new problem. Sending a Maths test home has always seemed counterproductive: there are countless sources of undue assistance that would significantly reduce the construct validity of the Maths concepts being evaluated. In an AI world, if we want any chance of upholding construct validity across many subjects, we need to include some red tasks. This is particularly important for subjects that have traditionally moved away from supervised assessment conditions. Red tasks do not need to be tests. Interviews, oral reports, a simple thesis defence and in-class group tasks are all effective forms of assessment that will protect the construct validity from the undue assistance of generative AI.

I am not proposing that all assessments should be red. Ultimately, however, their inclusion across a course will significantly reduce the workload of teachers.

The attempt to police ‘AI-enabled cheating’ is a fool’s errand. Since the release of ChatGPT in late 2022, thousands of hours have been wasted across schools and universities by well-meaning educators unsuccessfully trying to catch students excessively using these new AI tools. I’m not suggesting we do not take academic integrity seriously; however, a paradigm shift is required.

A necessary paradigm shift: with the introduction of generative AI, it becomes imperative to prioritise construct validity across the entire course grade, rather than place too much emphasis on individual, isolated tasks. This paradigm shift can be found in propositions 2 and 5 of a recent paper, ‘Assessment reform for the age of artificial intelligence’, by the Tertiary Education Quality and Standards Agency (TEQSA). Including red tasks across a course means we can still trust the construct validity of the overall semester result, because red tasks moderate the possible reduction of construct validity inherent in yellow tasks (see below).

After 12 months of implementing this assessment structure across a high school, my experience is that once a teacher makes this paradigm shift, their anxiety about the challenges AI can bring to academic integrity significantly reduces. They can then spend more time on the things that actually matter to student learning. No one got into teaching because they love catching young people ‘cheating’.

Yellow

The purpose of a yellow task is to measure domain-specific assessment constructs that are most effectively demonstrated over an extended period of time; a red task is not suitable for these situations. Using creative writing as an example, during a red task students could be asked to write a narrative in a single, supervised lesson. However, for constructs such as this, creative output needs more time and space for the assessment to be meaningful.

Hence, for certain assessment constructs, we allow students to complete yellow tasks for homework, and we assume they will have access to generative AI. We acknowledge that, given the chance, some students will take the opportunity in unsupervised environments to use AI tools excessively for these tasks, to the point that it could be considered ‘cheating’. However, we can tolerate this risk if we have included some red tasks across our course (see above).

Furthermore, rather than just hoping students won’t use AI to cheat, another necessary paradigm shift should occur: we can teach students how AI technologies can appropriately assist in the demonstration of the specific assessment constructs being measured.

As a simple example, for domains where spelling and syntax are not part of the key assessment constructs being measured, AI tools could be used to assist a student who struggles with this aspect of their communication. Empowering students to use AI tools to augment their communication will remove unnecessary cognitive load so they can focus on what is being communicated. In this situation, the ‘what’ is the most important assessment construct.

To repeat myself: in unsupervised environments, it is certain that some students will find it too tempting to stop at editing assistance alone. Some will use AI inappropriately and have tools like ChatGPT assist with their thinking too. Hence the importance of red tasks. Students who rely too heavily on AI technologies during yellow tasks will still need to demonstrate their ability to think for themselves during red tasks.

The key to yellow tasks is that teachers ought to clearly state what AI assistance can look like within the specific assessment, and students should then acknowledge this assistance. I recommend Leon Furze’s AI Assessment Scale as an effective structure for communicating the appropriate level of AI assistance across a yellow task.


Just like driving through a traffic intersection, the yellow light causes the most ambiguity. Should I accelerate through, or should I stop? Within reason, yellow tasks will give students the opportunity to practise self-regulation and consider their use of AI carefully: am I using this tech too much? Are there ways I can use it more efficiently? However, from my experience so far, many young people don’t yet have the metacognitive skills to do this intuitively on their own. Students need guidance and explicit teaching to apply this new skill.

Green

Lastly, the “Green” category signals that generative AI is not only allowed but expected. The purpose of the task is for students to use Artificial Intelligence. The assessment construct of these tasks focuses on the ability to use AI technologies to enhance creative productivity or to augment learning. The assessment should still be grounded within a content-related context, but this is when students can practise leveraging the potential of this emerging technology. However, there is an important caveat: our expectation of what high-level success looks like in green tasks needs to increase significantly. If a student is equipped with the most powerful literary tool ever created, how can they be allowed to continue producing mediocre work? If we expect our students to leverage this technology, we need to expect more from our AI-augmented students.

There is an important difference between yellow tasks and green tasks, and it comes back to the assessment construct: the purpose of the assessment itself. Yellow tasks are primarily interested in assessing domain-specific assessment constructs not inherently tied to generative AI, whereas the sole purpose of a green task is to assess an individual’s ability to utilise generative AI within a domain. At this point, yellow tasks will most likely look very similar to the homework tasks we were giving students before advanced chatbots were introduced in 2022, fundamentally because their assessment constructs are still valued within our domains. In contrast, many green tasks have only become viable with the advent of generative AI; their assessment constructs are new.

So far, some effective green tasks that I have used or seen are:

Closing Remarks

It is important to note that there is no hierarchy across the three colours. Even as we enter a new age of AI, green is not inherently better than red; each represents a unique purpose and value in education. All three types of assessment constructs should coexist in modern education. Whilst green tasks provide a significant opportunity, the use of AI shouldn’t become the sole purpose of education. Thus, AI should not dictate the form and function of all our assessment practices either. We measure what we value.

This post has been edited after 12 months of reflection using this assessment structure in an Australian high school.

