Before I forget to introduce myself, my name is Tony. Two or three years ago I was in the back listening to great talks from DataPhilly as a student studying data science, and now I'm giving a talk here.
So today I'm presenting my latest publication from CIIP 2025. It's on student engagement and how to formulate unknown, qualitative, abstract tasks as a preference optimization problem.
So in essence, from a computer science perspective and from an engineering perspective, I have no idea what student engagement is. I have no idea what engagement with the classroom environment means.
It needs lots of domain expertise. It needs evaluation methodologies that are cross-discipline with psychology, with education, with a plethora of other disciplines as well.
So from my perspective, I don't know what that is. But let's train a foundation model anyway. So this is the topic:
how do I use computational methods to classify and deal with tasks related to qualitative, abstract problems?
So here's me. Yeah, my name. This is me coding on a yacht in SF.
Just a little bit of background: I graduated from Temple University a year ago, studying computational data science.
And last year, I was working in SF, mainly as a founding AI engineer. Long story short, we nailed demo day for an a16z-backed startup during SF Tech Week,
and then the founders had some immigration issues and crashed the car in Colombia. Now I'm back. So yeah,
here's my website and the publication itself. We also have a Philly Startup Expo on March 6th; I just want to self-promote anyway. But enough about me.
So what is student engagement? I have no idea what that is.
So a lot of people, when they come to qualitatively abstract tasks or unknown things, just ChatGPT it now. So let's ChatGPT this thing.
But ChatGPT, as a foundation model, has a lot of niche areas it's not trained for, or proprietary data that wasn't available for its foundation model training,
and a lot of its performance also relies on human feedback and domain expertise that we don't have.
So what solution do we have for this from a model-training perspective? I know nowadays we have agentic systems, self-learning loops, long-horizon optimization, and customization with AI agents behind an API call, but this is lower level. I published this paper recently, but I was working on it two years ago.
With anything involving foundation model training, performance follows what's described in the scaling laws paper, which I believe was Kaplan et al. at OpenAI. Basically, the scaling laws paper states that the performance of an LLM scales, as a power law, with the compute and the data provided.
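As a rough sketch of what that scaling-law relationship looks like, here is the Chinchilla-style form; the constants below are the published Chinchilla fits and are purely illustrative, not anything from this talk:

```python
def scaling_loss(params, tokens, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-style scaling law: predicted LM loss as a power law in
    model size (params) and training data (tokens). Constants are the
    published Chinchilla fits, used here purely for illustration."""
    return E + A / params ** alpha + B / tokens ** beta

# More parameters and more data mean lower predicted loss,
# with diminishing returns as both grow.
small_model = scaling_loss(1e8, 1e9)
large_model = scaling_loss(1e10, 1e12)
```

The point the talk makes follows directly: without data (`tokens`) and compute (which bounds `params` and `tokens`), predicted performance is stuck.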
So the two main problems with trying to fine-tune for proprietary spaces or niche out-of-distribution problems are that there's usually no data, and there's no compute, because how can you convince someone to give you compute when you don't even know your problem space?
So that's really the problem I'm trying to solve with my paper: student engagement, some avant-garde thing I know nothing about. Enter RLHF, where you can do post-training on already fine-tuned foundation models. Oh, sorry: reinforcement learning from human feedback.
Yes, thank you very much for helping me out there; I just go on autopilot.
So, basically, you can train a model traditionally with data, an objective function, and an evaluation metric that compares the model output y hat with the labeled output y.
What RLHF does, to put it in more human terms, is use specific objectives, optimizations, and policies to guide the model, after it was trained, toward specific knowledge it doesn't surface on its own: knowledge that may have been contained within the original dataset but wasn't highlighted enough during training.
So RLHF is what OpenAI, Anthropic, and these large AI companies use to guide an already-trained model towards a more preferred output:
say, don't talk about certain topics, don't swear, don't be too candid, be a little bit more nuanced.
These types of preferences, whether from Reddit or from your training data, are what we use RLHF to guide the model towards.
Well, I was going to say, you could essentially have a human evaluator do an A/B test and say, I prefer response B to response A for these five reasons, and that'd be an example of it. Right?
In this case, the human is deciding which of the two outputs they prefer and why, but you can also use a rubric-based model, which I think is what the article describes.
Sort of, sort of. But extending on that specific question: to get those annotations, to get the human feedback,
in traditional RLHF, say with proximal policy optimization (PPO), which OpenAI first used, you actually need to train a reward model with reinforcement learning.
You need to get those annotations of whether something is good or not, and train a reward model on them.
And this formulation here, PPO, is really just a trade-off between the reward and the distribution difference, a KL term, between the original fine-tuned foundation model and the model you're currently training.
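That trade-off can be sketched in a toy way; the function name, the numbers, and the sequence-level KL proxy below are mine for illustration, not anything from the paper:

```python
def rlhf_objective(reward, logprobs_policy, logprobs_reference, beta=0.1):
    """PPO-style RLHF objective for one sampled response: maximize the
    reward while penalizing drift from the frozen reference model.
    The summed per-token log-prob differences act as a KL-divergence proxy."""
    kl_penalty = sum(p - r for p, r in zip(logprobs_policy, logprobs_reference))
    return reward - beta * kl_penalty

# Toy numbers: a well-rewarded response that drifted from the reference
# model gets part of its reward clawed back by the KL term.
score = rlhf_objective(reward=1.0,
                       logprobs_policy=[-0.5, -0.4],
                       logprobs_reference=[-0.9, -0.8])
```

The `reward` here is what the separately trained reward model would supply, which is exactly the expensive piece the talk goes on to cut.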
So there are three parts that are all cost-intensive. The LLM itself: you need multiple nodes of H100s. The annotation: you need an entire department of human annotators.
And then you need to train another model, the reward model.
Just some more context on what this paper does in terms of optimization, without going too far into the weeds and losing time:
we're using something called direct preference optimization (DPO) to cut out the reward model, and we're formulating this qualitative engagement problem within a self-aligning framework. We have a rule-based evaluation methodology; we run inference against the labels we have,
so we know whether an output is good or bad, and we get free, self-annotating data. We cut the reward model with DPO, and then we cut the cost of fine-tuning
foundation models, and the cost of data annotation, to approach something like student engagement with a computational method. So that's the difference between RLHF and DPO.
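For the curious, the DPO loss for a single preference pair can be sketched in a few lines; the names and numbers here are illustrative, not from the paper:

```python
import math

def dpo_loss(logp_policy_chosen, logp_policy_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=0.1):
    """Direct preference optimization loss for one (chosen, rejected) pair.
    No separate reward model: the policy's log-probabilities, measured
    against a frozen reference model, act as the implicit reward."""
    chosen_margin = logp_policy_chosen - logp_ref_chosen
    rejected_margin = logp_policy_rejected - logp_ref_rejected
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log(sigmoid(logits))

# The loss shrinks as the policy prefers the chosen answer more strongly
# than the reference model does.
good = dpo_loss(-10.0, -12.0, -11.0, -11.0)  # policy already agrees with the label
bad = dpo_loss(-12.0, -10.0, -11.0, -11.0)   # policy prefers the rejected answer
```

Compared with the PPO setup, everything here comes from two forward passes; that is the compute and annotation saving the talk is pointing at.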
The way we formulated this task for the publication: we were using prior datasets that already have labeled data. So all we needed to know was
whether an output was correct or incorrect, and then sample more data from the incorrect options to further post-train this foundation model with just A6000s.
Still a large GPU, but not the clusters that OpenAI has, for this niche domain called student engagement.
So here's a summarization of RLHF versus DPO. The reward model is the main thing DPO cuts from RLHF, and for our publication it also cuts the human annotation and its cost.
This is more math, explaining the DPO objective function. Where does the HA come from? Hallucination-aware.
I was using this paper from Shanghai AI Lab that added an auxiliary cross-entropy loss to the original DPO loss to stabilize the optimization during training.
Yes, I should probably move on.
So the task. Here's a still image of a student taking an online math quiz.
And we ask, based off a system prompt plus an additional prompt, whether the student is looking at the screen, looking away, or looking at a piece of paper.
So this is an example for context, but I'm going to move on.
So this is visual question answering: an input image and input text, with output text answering the question based off those modalities. This is a high-level system architecture, but actually, let's jump straight to the datasets
first before I explain more of the architecture, sorry for that. So we used three different datasets: one called Student Engagement, depicted here; another called the DAiSEE dataset; and a third called EngageNet.
We found these open-source datasets in the wild, and they're the only data available regarding student engagement or classroom performance.
The Student Engagement dataset is divided into three classes: screen, wander, and paper. The context is that students are doing a math test in a confined space in front of a screen.
It's a little bit draconian; however, the professor leading this research had, for the past six years, only ever taught online courses, so he really needed this. The DAiSEE dataset
is more video-based. We took individual seconds of each of its 10-second videos and re-annotated each still image according to this framework, so we could use the DAiSEE dataset, as still images, as an out-of-distribution dataset.
So when thinking about student engagement as an out-of-distribution or niche task, we need some sort of benchmark. Say we train on the Student Engagement dataset, but we use another dataset that's completely outside the scope of student engagement to see how well our DPO fine-tuned model performs. So that's the high level.
The task is visual question answering: input image, input question, output description of whether the student is looking at the screen, looking at the paper, or looking away.
We use basic F1, precision, recall, and accuracy. Does anyone know what those metrics are, or need me to explain a little? So F1 is the harmonic mean of precision and recall.
Macro means focusing specifically on the class level of each label. Instead of taking the overall precision and recall and aggregating them into one score, you take the average of each class's performance, which gives you a better sense of the performance per class. And from the confusion matrix you can see how many true positives, false negatives, and so on you have. Yes, because that's what's available in the dataset.
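A minimal sketch of macro F1 over the three classes from the talk (screen, wander, paper); the toy labels are mine:

```python
def macro_f1(y_true, y_pred, classes):
    """Macro F1: compute F1 for each class separately, then average,
    so every class counts equally regardless of how many samples it has."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

labels = ["screen", "wander", "paper"]
y_true = ["screen", "screen", "wander", "paper"]
y_pred = ["screen", "wander", "wander", "paper"]
score = macro_f1(y_true, y_pred, labels)
```

Macro averaging matters here precisely because, as the talk notes later, the sets can be imbalanced: a rare class drags the score down even if the common classes look fine.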
So, the professor leading the research for us, and we of course borrowed the dataset, from the University of Madison I think it was, teaches a lot of online courses, a lot of them on Zoom. Since COVID, I think he's refused to come back to the classroom. For him, being able to identify how engaged a student is directly correlates to his education delivery and performance, so it's something very close to his heart. Yes, we are first focusing on still
images. Continuous variables, in the sense of video, further encoding it into a time series and translating that time series into a category, are something that can be done, but our task already uses so many resources that it's too costly. Yes, that is a very simple task indeed.
But we also then further used other out-of-distribution data, like the DAiSEE dataset. Oh my god. More so because of the qualitative nature of the eye gazing, the tracking, and the facial features, that's why we use this computational method: to abstract away all the need for egocentric computer vision into just a neural network, and then converge it to a few classes.
So for me, as a computer science student, I don't have to think about all the nuances in psychology and all that. So yeah, that's the problem task.
So here's the system architecture of the framework we employed for abstracting away the human curation of preferred and dispreferred post-training data.
Basically, think of this part as DPO. You have an already fine-tuned reference model that outputs embeddings, and a training model that outputs preferred and dispreferred embeddings, and those go through the loss.
We have an evaluation methodology based off the labeled data: whenever a data point is wrongly classified, we have a new data point. Even more so, we sample 30% of the correctly labeled outputs, take another option, and pair it with the wrong answer to create even more data.
So then we have this infinite curating optimization loop, using DPO, direct preference optimization, to keep iteratively boosting the performance of the model without human annotation: cutting the cost of foundation models with less compute but more data.
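The self-annotating loop described above might look roughly like this; the field names, class names, and exact pairing rule are my reconstruction from the talk, with only the 30% sampling rate taken from it:

```python
import random

def build_preference_pairs(samples, sample_rate=0.3, seed=0):
    """Self-annotating pair construction, as I understand the talk:
    every wrong prediction becomes a 'rejected' answer paired with the
    gold label as 'chosen'; a fraction of correct predictions are also
    paired against a randomly drawn wrong class to create extra pairs."""
    rng = random.Random(seed)
    classes = ["screen", "wander", "paper"]
    pairs = []
    for image_id, gold, predicted in samples:
        if predicted != gold:
            pairs.append({"image": image_id, "chosen": gold, "rejected": predicted})
        elif rng.random() < sample_rate:
            wrong = rng.choice([c for c in classes if c != gold])
            pairs.append({"image": image_id, "chosen": gold, "rejected": wrong})
    return pairs

pairs = build_preference_pairs([("img1", "screen", "wander"),
                                ("img2", "paper", "paper")])
```

Each emitted pair feeds straight into a DPO step, which is what closes the loop without any human annotator in it.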
So yes, this is like a system diagram. I think, how much time do I have? Oh, that's great.
So, just introducing a little bit about MiniGPT-4. It's just a vision-language model. I think what you need to know about it is that it basically just has a linear layer that
adapts the LLaMA-based Vicuna model to the vision tokens; it just adapts the latent space. That's really all there is. There are some details about the training setup I used, but yeah.
So I explained a little bit about student engagement data sets.
The reason we use deep learning, as opposed to, say, egocentric computer vision, is more so that we can just throw compute at the labels and the data and not have to care about the trajectories and the different geometries of gaze.
So that's why we approach it as a vision-language model problem.
So here's an example of the instruction prompts that we used, paired with an image ID and a caption, for fine-tuning.
Some of our prompts: "Describe this image in detail" or "Take a look at this image and describe what you notice."
One very important thing during model training for this work: general prompts like these don't actually work well for niche tasks
like student engagement. But if I prompt it a little more specifically, and I think everyone who uses ChatGPT writes more specific prompts
now, it really boosts performance. Fine-tuning with more specific prompts boosts accuracy by about 20%, just like that.
With specific prompts we reach, say, 97% precision and recall, whereas with these generic prompts it stabilizes at 80%.
So prompt structure, system prompting, and fine-tuning: there's lots of research about this as well, and it's also very important.
So here's an example of the annotated data, the chosen and rejected preferences, that we further use in our iterated DPO loop. It's really just what I explained before.
Let me know if I'm boring you guys because it gets really into the weeds.
What was that last bit there? There were these fascinating, quasi-English sentences up there. The person rejected those, but they were chosen?
Those were chosen. It kind of made sense. It kind of went on the right track.
But I think it's more so the input prompt that was used, and the AI that generated these outputs; the embedding-space distance of the output was closer to the labeled detection, so that's why they were chosen.
So question? Yeah.
In your training workflow, one of the things I noticed is that you kind of landed on a binary: is this person paying attention, is this person not paying attention? Is there a way you could approach this with a confidence score as to whether the person is paying attention, with degrees?
So that's, I think, exactly what algorithms like GRPO and ORPO do, right? With their cross-entropy-based optimization. This work was two or three years ago, and those weren't out yet. What GRPO does is literally take the average over a group of outputs from the fine-tuned model, and depending on which one is closer to the labeled y hat, take that as the latent direction you optimize towards.
The problem with that, though, for something more categorical and qualitative in nature: can you really give it a number, a confidence score? That's really the problem.
I think it's doing that because if you look at the top set, it's "person" and "individual", which is a specific class of words, and the bottom set is an abstracted version of that. We have "person" and "individual", and that's pretty much all they say at the top, but at the bottom you have "figure", "trope", "image", "epitome": these are all categorical qualifications around what a person or individual would be. So it's already trying to do
that abstraction and focus specifically on it. I used to read sociolinguistics and cognitive semantics. Yes. So even though it does look a little bit weird, it is heading in the right direction, one step at a time.
Yeah, thanks for letting me off the hook for that one. I probably just took this screenshot when I was rushing for a capstone project and forgot to really read it.
But yes, we have this Student Engagement dataset that we fine-tuned on, only a balanced set of about 600 to 800 samples for each class. We then further divided the Student Engagement dataset into hard samples, like students covering their eyes and so on,
but the original dataset is actually 18,721 still-image frames. So the task for us was to use the small number of images available and see whether the model would perform well after post-training with DPO on imbalanced sets,
on hard samples, and on re-annotated DAiSEE and EngageNet datasets, as well as using their own original DAiSEE and EngageNet labels for inference and evaluation. That's where our paper really tried to make its significance
and novelty: when we approach something like student engagement, where it's hard to define what it even is, we can use this computational approach to fill the problem gap of unknowns. A
foundation model traditionally takes months and lots of compute to train, but we can lower the compute cost and the data annotation cost and still make it work for tasks not contained within the original scope of the problem, and even extend it to further tasks within this abstraction.
So, yeah, here are some performance metrics. For our balanced set, MiniGPT-4 with DPO sort of just works; it tops the benchmarks from the original Student Engagement paper.
They were just using MobileNets, Xception nets, and VGG, so it might not be a fair comparison; I just want to highlight it because those are traditional computer vision models, whereas here it's an entire large vision-language model with transformers, fine-tuning techniques, and optimizations that are like ten years ahead of their time.
For the imbalanced set it also works quite well on the Student Engagement dataset, and with our iterative DPO method it works quite well on the hard samples as well as the relabeled DAiSEE samples, sort of proving the point that with this method, with this curation of human annotations and this lowering of compute costs, we can still perform well on out-of-scope or even abstract concepts that
I myself don't know about. And here are additional results using our fine-tuned MiniGPT-4 model, inferencing on labels it was completely not trained for.
Here are the EngageNet labels. They had four labels: highly engaged, engaged, barely engaged, and not engaged, I think.
And the DAiSEE dataset had eight different labels that we just threw out the window, because I think DAiSEE was poorly formulated as a paper.
But the methodology of our iterated DPO framework was still topping the available benchmarks given by the dataset.
So yes, here is an additional LLM-as-a-judge framework that we used for evaluations, beyond the accuracy, F1, and sentence-similarity scores.
We used GPT-3.5 to judge the correctness and the consistency of the outputs. Say we run it n times: is it able to consistently give the same output, say that the label is "wandering", every time?
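That consistency check can be sketched generically; `generate` stands in for whatever judge or model you call (the talk used GPT-3.5), and the stub below is mine:

```python
from collections import Counter

def consistency_score(generate, prompt, n=5):
    """Consistency check in the spirit of the talk's LLM-as-a-judge setup:
    query the model n times with the same prompt and report the share of
    runs agreeing with the majority label. `generate` is any callable."""
    outputs = [generate(prompt) for _ in range(n)]
    label, count = Counter(outputs).most_common(1)[0]
    return label, count / n

# Stub model that answers "wander" in 4 runs out of 5.
answers = iter(["wander", "wander", "screen", "wander", "wander"])
label, score = consistency_score(lambda p: next(answers), "Is the student engaged?")
```

A real run would pass a function that calls the judge model's API; the majority label and agreement rate are then compared across models, as the slide does.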
And our DPO plus MiniGPT-4 model topping the benchmarks is consistent with our other metrics. So: different ways to frame our problem
with different approaches, and to evaluate it rigorously and relatively with LLM-as-a-judge. So yeah, that's really what I was trying to present. I'm not actually presenting student engagement at all;
it's purely: how can I use computation to ChatGPT away things that I don't care about? Okay, I mean, the professor cares a lot about it. But for future work, there are ways you
can use context-free grammars and, say, abstract syntax trees to constrain the LLM output within a rigorous, discrete domain space.
And you could embed these activation functions, or this combinatorics, into the decoder layer of the LLM. That was something I wanted to play with, maybe even getting more distinct and discrete outputs
so that we don't have to use LLM-as-a-judge and all that.
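As a toy illustration of that future-work idea: grammar-constrained decoding, at its simplest, just masks out disallowed tokens before picking the next one. The vocabulary and scores here are made up, and a real constrained decoder would derive the allowed set from a grammar state rather than a fixed set:

```python
def constrained_argmax(logits, vocab, allowed):
    """Toy grammar-constrained decoding step: discard every token that the
    (here trivial) grammar disallows, then pick the best remaining one."""
    best_token, best_score = None, float("-inf")
    for token, score in zip(vocab, logits):
        if token in allowed and score > best_score:
            best_token, best_score = token, score
    return best_token

vocab = ["screen", "wander", "paper", "banana"]
logits = [1.2, 0.7, 0.1, 3.5]  # unconstrained argmax would pick "banana"
token = constrained_argmax(logits, vocab, allowed={"screen", "wander", "paper"})
```

Constraining the output space to the three task labels is what would make a judge model unnecessary: the output is discrete by construction.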
Sentence similarity compares large embeddings with large embeddings, but underneath the hood, what is the model actually thinking? We don't know.
And here are some additional evaluation frameworks that I utilized for my own capstone project. You can formulate evaluations where you ask adversarial samples: say the student was wandering, but you ask whether it's
looking at a piece of paper, looking away, or looking at the screen, and with these methods you can judge how consistent the model's performance is. Popular sampling, random sampling, different evaluation
methodologies; I think there's a GitHub for all of these, like using context-free grammars embedded into the activation functions of the decoding layers. Then, of course, in my own capstone I also used something called Kahneman-Tversky optimization (KTO). The optimization is a lot better
at capturing human concepts; they call these HALOs, human-aware losses, in the optimization. What I think they did was further reduce the need for
comparing the policy and reference model embeddings, your supervised fine-tuned model and your training model, over preference pairs: all you need is whether an output got a thumbs up
or a thumbs down, fed into the loss function itself. So that cuts compute as well. Then this came out.
I mean, I think they all have GitHubs, but this one, I think, came out before GRPO. It basically reformulated the preference optimization problem
as a cross-entropy loss, such that it could self-learn different preferences. And of course the mathematics is in the paper; I won't explain it.
And then GRPO, which is what you mentioned about confidence and comparing different embeddings. For this specific problem, though,
for student engagement, because it's qualitative and categorical, I would refrain from using something like GRPO.
The same goes for domain spaces like legal or proprietary healthcare data, which are also very categorical and require human judgment. You can't really give it a confidence score during the optimization; otherwise it might just go off within the optimization space and lose itself,
and it's hard to guardrail that during training. Whereas with current AI systems, you can use something like agentic systems with a self-optimization loop. There's a paper from Stanford called agentic context engineering,
where you have a generator, reflector, and curator loop. There are papers like Alibaba's, and AlphaEvolve, that also utilize these more easily understandable, domain-level systems, embed them into the
agentic loop, and then self-improve. But even then, you need a very rigorous evaluation methodology. And in most business use cases, you're asking the business guy or the product guy: how do you tell whether an output is good or not?
Even they don't know. So what do you do then, as an engineer? You A/B test it in production and hope the business doesn't fail. Yeah, so hopefully I didn't bore everyone.
And do sign up, I'm just advertising here: we have the Philly Startup Expo on March 6th. I'm bringing some of my friends from a16z and 161 Ventures from SF that I met. I'm truly trying to show the best of Philadelphia
and get the VCs coming back, as a series of events. And here's my publication. Even though I worked on this two or three years ago, and most of the work was done by then,
it finally got published in 2025. It's weird.
Thank you. Thank you, Tony.