LLM Application in Data Cleaning

Introduction

I'm very happy to meet everybody today. The last time I was here, I remember, I was one of you, listening to Hugo's, the next presenter's, presentation about a very interesting automation. I think you did 8&8. Yes, you see, I remember. It was about a year ago.

And you see, everybody: I think if you're curious enough, you can be the speaker for the next event.

About the Speaker and the Research Context

So just a short introduction about myself. My name is Vicky, and my company's name is Mindset. It has nothing to do with Mindstone; they just sound similar. We are a marketing research agency specialized in travel retail, like those duty-free shops in the airport.

It's not that we actually go there ourselves; there is another team that does the field work. They give out the questionnaires so that we collect the data. Talking about this, may I ask: how many of you work in the marketing research area?

Oh, really nice to meet you. And how many of you have worked with a questionnaire? You as well? Okay, so I think that would perhaps be your point of interest today.

Why Open-Ended Survey Responses Are Hard to Work With

Yes, you can take a photo, go ahead. So, does anybody know what an open-ended question is? Can you give me some examples? Yeah, exactly: there are no A, B, C, D options listed. So what can be the obstacles if you now receive an Excel sheet where each cell contains a different kind of open-ended answer? What can be the potential difficulties? Hard to analyze, yes. And what about other people's opinions?

Can you think of any other obstacles? I will show you later, so don't worry. Yes, it's unstructured, exactly, because it's normally qualitative, not quantitative.

Multilingual Responses, Typos, and Unstructured Text

And then there is also a special thing about our company: since we are a multinational company, we have respondents of different nationalities, and our questionnaire is mainly delivered in English. That means the non-English native speakers can have typos.

They are also multilingual, so there are grammar issues, and sometimes we receive other languages, like my mother tongue, Chinese; today I'm also going to show you Arabic. Somehow that creates difficulties for analyzing, and how could we get insights out of this information? So, just to give

The Dataset and Real-World Examples

you a bit of background on today's application: the data set. Thanks to my company for allowing me to actually use real data from our database. In this data set there are 7,560 entries of this unstructured data. So here, what you can see... I'm not sure if you can see.

Oh, there is no mouse. Okay, the mouse is not appearing.

So on the top there, it's Chinese, and on the top right you can see the English. I'll just read it out loud.

"I don't like much about the packaging. It looks too busy. The color platy is not that appealing." Hence, you can see there are some grammar issues.

I think "busy" was, like... I don't know, packaging can't really be busy. And then there is "vety", as in "vety good taste": it's obviously a typo.

And then the third line here, in the screenshot I took, is all in shouting, angry mode: everything is uppercase. And here we also have a bit of Arabic.

But in this specific data set I chose, I think less than 5% of it is Arabic. So yeah, those were real examples from our company.

Goal: Make Responses Readable Through Cleaning

So my target is to make it readable. I'm just doing cleaning; I'm not even stepping into coding the responses or anything further.

Proposed Workflow and Tooling

So this is my logic. And I know, as you already mentioned, that some people here work in AI. This is the logic behind it; I used Python.

First, load the data, and then choose the columns. You have to specify which part to process, because typically in a questionnaire you have a user ID, and you're not going to process the user ID. You tell the machine which columns to read from, so you specify the area.

Using a Local LLM Server (Ollama) for Privacy

Then, check the LLM server. Here I used a locally hosted one; it's just safer, with no data leak.

And I used Ollama. Ollama is basically a language model manager; I think many people here already know it.

Then we define the single-task cleaning function, including the prompts.

Then we dedupe, and clean the unique entries in parallel; there are some ways of setting that up, which I will also talk about later.

And then map the results back to the full data set and export.
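The steps above (choose columns, dedupe, clean the uniques in parallel, map back, export) could be sketched in Python roughly like this. The column names, the sample data, and the placeholder `clean_one` function are illustrative assumptions; in the real pipeline the placeholder is replaced by one call per unique answer to the local LLM server:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# Illustrative data: in practice this would come from pd.read_excel(...) on the export.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "q1_open": ["GOOD TASTE", "GOOD TASTE", "vety nice packaging", "vety nice packaging"],
})
columns_to_clean = ["q1_open"]  # specify the area: skip user_id and other non-text columns

def clean_one(text: str) -> str:
    # Placeholder transform: in the real pipeline, this is one call to Ollama.
    return text.strip().capitalize()

for col in columns_to_clean:
    uniques = df[col].dropna().unique().tolist()          # dedupe before calling the LLM
    with ThreadPoolExecutor(max_workers=8) as pool:
        cleaned = list(pool.map(clean_one, uniques))      # clean unique entries in parallel
    df[col + "_clean"] = df[col].map(dict(zip(uniques, cleaned)))  # map back to the full data set

df.to_csv("cleaned_responses.csv", index=False)           # export
```

Deduplicating first matters here: identical answers are cleaned once, which is what makes the hour-and-a-half run tolerable.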

Prompting and the First Implementation

So this was the first version I coded. I used Ollama, and the orange box is the part I highlighted, with the three key parameters that I will also talk about a little bit later.

And the pink part, which is not very clear on the slide, is where I put the first prompt. I think the first one I coded was "You are a precise editor", etc., and it specifies the task that I wanted it to do: rewrite the survey response into clear, natural English, etc. And this is the second one I did.

What the Cleaning Looked Like in Practice

So here are the examples. Since I'm not able to run the full demo (with my company's laptop it takes an hour and a half to run through and process the whole data), I would like to share the results I achieved directly. The first one is the one from the screenshot,

where you can see a little bit of broken English. After the first iteration, which I tested with Llama, it actually already started to become quite readable; here it replaced "busy" with "cluttered".

In fact, in the very last one, after I tuned the LLM parameters, it actually went back to "too busy". I will explain why these changes happen later.

And then in the next example I chose, there is some uppercase, and after several rounds this uppercase has finally been fixed. Okay, no mouse; the blue ones over there are what you can see, and that is where I also wanted to highlight "CLR". People working in a certain industry may have their own shorthand. At the beginning, when I first saw this, I had no idea what CLR meant; it was just shorthand for "color", and after the machine read it, it actually expanded it into "color". The Arabic was also translated into English, and I double-checked that it was correct.
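The single-task cleaning function with the "precise editor" prompt described above might be sketched like this. The exact prompt wording, the model tag, and the option values are my assumptions reconstructed from the talk; the endpoint and JSON shape are Ollama's standard `/api/generate` interface:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

# Assumed prompt wording, reconstructed from the description above.
INSTRUCTIONS = (
    "You are a precise editor. Rewrite the survey response below into clear, "
    "natural English: fix typos and grammar, expand shorthand, translate any "
    "non-English text, and preserve the original meaning. Return only the "
    "rewritten text."
)

def build_prompt(response_text: str) -> str:
    return f"{INSTRUCTIONS}\n\nSurvey response:\n{response_text}"

def clean_response(response_text: str, model: str = "qwen2.5") -> str:
    # One blocking call to the local server; no data leaves the machine.
    payload = {
        "model": model,
        "prompt": build_prompt(response_text),
        "stream": False,
        "options": {"temperature": 0.2, "num_predict": 400, "top_p": 0.95},
    }
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()
```

With an Ollama server running and the chosen model pulled, `clean_response("VETY GOOD TASTE, CLR not appealing")` should come back as a corrected, lowercase-normalized English sentence.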

Common Failure Modes: Partial Translations and Truncation

Then after the first iteration, I still found it quite annoying, because Chinese is my mother tongue and I could see that the Chinese was not translated exactly accurately. Plus there is an obvious issue, which I think everybody can see in label number four, the green one in the middle: there is still some Chinese there; it's not completely translated.

What's even funnier is that in number five it just stops: number five green and number five dark green both stop in the middle. They didn't really translate the entire paragraph. But after several rounds I also managed it. The sixth example I gave was more or less similar: the LLM didn't really translate the entire thing.

Model Choice and Parameter Tuning

So here are the parameters that I tested with.

So finally, since this is quite Chinese-heavy data, I chose the Qwen model, developed by Alibaba.

You can easily pull it through Ollama. And then, the parameters I changed.

There are also people here who work in IT. But, you know, sometimes we also have the habit of asking ChatGPT to check our French grammar, since we are not native French speakers.

Reducing Hallucinations With Temperature

But somehow, ChatGPT will start to have some hallucinations. One of the ways to reduce hallucination is by lowering the temperature.

It reduces the creativity level, so the model stops giving you things that don't exist. But by just using the model straight away in the UI, which I think everybody knows from ChatGPT, you just select the model, and now you can select, okay, deep thinking; but there is no way for you to change the parameters unless you host the model locally.

So the first one is the temperature. That's why, for me, I'm not such a big fan of using the interface directly.

Avoiding Cut-Off Outputs With Max Prediction Length

And then the second one that I changed is num_predict. The reason num_predict is very important for this translation-specific task is something I also heard from my non-technical-background colleagues: they were saying that if you just put an Excel sheet into ChatGPT, it doesn't read it. And needless to say, we wanted it to do the actual task of translating a paragraph into English.

So num_predict defines the maximum length of the output. Because I roughly estimated that each answer is a paragraph, at most around 250 English words, I changed num_predict to define the output length.

That is why, in the previous example I showed you, in number five, the green one, and number five, the dark green one, the translation just stops. This has been largely improved: as soon as I changed num_predict and defined the output length, the text stopped getting cut off.

Reducing Randomness With Top-P Sampling

And then the last one is top_p. Top-p sampling itself has very interesting applications, even in the biopharmaceutical industry; I was reading an article about it the other day.

It actually reduces the randomness, because the way an LLM works is that for each next word the model assigns a probability, and by limiting sampling to the top P percent of the probability mass, you can reduce the randomness.

For example, here I just took a screenshot from Wikipedia. As you can see, the model could start to give you a sentence like "the a cat", but "the a cat" doesn't make sense; even if you add "dog", it's still not a complete sentence.

Hence, you restrict the top-p level a bit. For these translation-specific tasks, normally the range would be 0.9 to 1, and I tried the value 1. But you can play around with it if you want to experiment or have some other specific task. I think those three key parameters need to change based on the task, the objective you want to achieve. So here I set them up for a translation-focused task.
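Concretely, all three parameters go into the `options` object of each Ollama request. The values below are one plausible configuration for a translation-focused cleaning task, in the spirit of the talk rather than the speaker's exact numbers; note that num_predict counts tokens, so ~250 English words needs a cushion above 250:

```python
# One plausible options block for a translation/cleaning-focused task.
options = {
    "temperature": 0.2,  # low creativity: fewer hallucinated additions
    "num_predict": 400,  # max tokens to generate; headroom for ~250 English words
    "top_p": 1.0,        # nucleus sampling; 0.9-1.0 is a typical range for translation
}
```

For a brainstorming task you would push temperature up and top_p down instead; the point is that these knobs are only reachable when you call the model yourself rather than through a chat UI.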

Conclusion and Q&A

So yeah, that was a very tiny sharing. I didn't really write a conclusion, but I think everybody can draw their own conclusions, and if you have some questions, you can ask me.

Thank you. It was very short. Thank you.
