Formatting
Consistent text formatting is crucial: the information the AI model processes should be uniform, since discrepancies in data format can induce hallucinations. Some of the activities we perform to format the data in preparation for the AI are:
Format Standardization: It’s crucial to have your documents in a consistent format that the retrieval AI can process. For example, converting PDFs or images of text into plain text, or ensuring all documents use a uniform file format like .txt or .docx.
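For instance, extracting the text layer from a PDF might look like the sketch below, assuming the pypdf package (the file names are placeholders):

```python
from pathlib import Path

from pypdf import PdfReader  # assumes the pypdf package is installed


def pdf_to_txt(pdf_path: str, txt_path: str) -> None:
    """Extract the text layer of a PDF and save it as a plain .txt file."""
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    Path(txt_path).write_text(text, encoding="utf-8")


pdf_to_txt("report.pdf", "report.txt")  # placeholder file names
```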
Normalization: This is about converting text into a uniform style. It could include turning all characters to lowercase (so that “Apple” and “apple” are treated the same), standardizing date formats (e.g., DD-MM-YYYY), or converting measurements to a single system (metric or imperial).
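A minimal normalization pass along these lines might look as follows; the exact rules always depend on your corpus:

```python
import re


def normalize(text: str) -> str:
    text = text.lower()  # treat "Apple" and "apple" the same
    # Rewrite DD/MM/YYYY or DD.MM.YYYY dates as DD-MM-YYYY
    text = re.sub(r"\b(\d{2})[/.](\d{2})[/.](\d{4})\b", r"\1-\2-\3", text)
    # Convert miles to kilometres so all distances use one system
    text = re.sub(
        r"(\d+(?:\.\d+)?)\s*miles?\b",
        lambda m: f"{float(m.group(1)) * 1.609:.1f} km",
        text,
    )
    return text


print(normalize("On 01/05/2023 we drove 12 Miles."))
# -> "on 01-05-2023 we drove 19.3 km."
```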
Tokenization: Breaking the text down into smaller pieces, like words, phrases, or sentences. It’s a critical step for text analysis, as it defines the basic units for the AI to process and analyze.
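A simple word-and-punctuation tokenizer is enough to illustrate the idea, though production pipelines typically use a library tokenizer (e.g., from nltk or spaCy) or the model’s own:

```python
import re


def tokenize(text: str) -> list[str]:
    """Split text into word tokens and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


print(tokenize("The quick brown fox jumps!"))
# -> ['The', 'quick', 'brown', 'fox', 'jumps', '!']
```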
Data Structuring: Organizing the text in a structured format like JSON, XML, or even tables can be crucial, especially if the AI system is designed to understand structured data better.
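As an illustration, a cleaned document could be structured as JSON along these lines (the field names here are ours, not a required schema):

```python
import json

document = {
    "id": "doc-001",  # illustrative fields, not a fixed schema
    "title": "Quarterly Report",
    "sections": [
        {"heading": "Summary", "text": "Revenue grew 12% year over year."},
        {"heading": "Outlook", "text": "Growth is expected to continue."},
    ],
}

print(json.dumps(document, indent=2))
```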
Data Cleaning
Most documentation is messy, written for human eyes and understanding. That messiness adds noise and can result in low-quality responses from the model. Some activities that we complete in this stage are:
Removing Irrelevant Information: This involves eliminating parts of the document that don’t contribute to the AI’s understanding. For instance, in a research paper, the AI might not need to process the acknowledgments or references section for content retrieval.
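One simple heuristic, sketched below, cuts a document off at its References or Acknowledgments heading:

```python
import re


def strip_back_matter(text: str) -> str:
    """Drop everything from a References/Acknowledgments heading onward."""
    return re.split(r"(?im)^\s*(references|acknowledgments)\s*$", text)[0]


paper = "Main findings go here.\nReferences\n[1] Smith, 2020."
print(strip_back_matter(paper))  # -> "Main findings go here."
```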
Handling Missing Data: Identifying gaps in the information and deciding how to deal with them. For example, if key information like dates or names is missing from a historical document, we decide whether to fill it in based on context, mark it as missing, or exclude the incomplete section altogether.
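A policy like “mark it as missing” can be as simple as:

```python
records = [
    {"name": "Treaty of Ghent", "date": "1814-12-24"},
    {"name": "Unknown charter", "date": None},  # the source gives no date
]

for record in records:
    # Flag the gap explicitly rather than guessing or silently dropping it
    record["date"] = record["date"] or "MISSING"

print(records)
```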
Correcting Typos and Standardizing Language: Ensuring the text is free from spelling errors and grammatical mistakes. Also, if the document uses different dialects or variations of a language (like American and British English), standardize them to one form to maintain consistency.
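A toy version of dialect standardization is a lookup table; a real pipeline would use a fuller dictionary or a spell checker:

```python
import re

UK_TO_US = {"colour": "color", "organise": "organize", "centre": "center"}


def standardize(text: str) -> str:
    for uk, us in UK_TO_US.items():
        text = re.sub(rf"\b{uk}\b", us, text, flags=re.IGNORECASE)
    return text


print(standardize("The centre will organise the event."))
# -> "The center will organize the event."
```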
Removing Stop Words: Stop words are common words that usually don’t carry significant meaning and are often filtered out in data processing to reduce noise. Words like “and”, “the”, “is”, etc., are typical examples.
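Filtering them out is a one-liner once the text is tokenized:

```python
STOP_WORDS = {"and", "the", "is", "a", "an", "of", "to"}  # a tiny illustrative set

tokens = ["the", "cat", "is", "on", "the", "mat"]
content_tokens = [t for t in tokens if t not in STOP_WORDS]
print(content_tokens)  # -> ['cat', 'on', 'mat']
```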
Stemming and Lemmatization: These are techniques to reduce words to their base or root form. Stemming might cut off prefixes or suffixes (turning “running” to “run”), while lemmatization involves using vocabulary and morphological analysis (turning “better” to “good”).
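Both are available off the shelf; the sketch below assumes the nltk package and its WordNet data:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time download of the WordNet data

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))                  # -> 'run'
print(lemmatizer.lemmatize("better", pos="a"))  # -> 'good' (as an adjective)
```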
Entity Recognition: This is about identifying and categorizing key elements in the text like names, organizations, locations, dates, etc. It helps in understanding the context and key subjects in the document.
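Libraries such as spaCy ship pretrained entity recognizers; the sketch below assumes its small English model, installed separately with `python -m spacy download en_core_web_sm`:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tim Cook announced Apple's results in Cupertino on January 5, 2024.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. "Tim Cook" PERSON, "Apple" ORG, "Cupertino" GPE, "January 5, 2024" DATE
```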
Handling Special Characters: Special characters (like @, #, &, etc.) or non-standard symbols might need to be removed or encoded, especially if they could interfere with the AI’s processing.
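A conservative cleanup pass might normalize Unicode and blank out the symbols you know you don’t need:

```python
import re
import unicodedata


def clean_special_chars(text: str) -> str:
    # Fold Unicode variants (fancy quotes, full-width chars) to a canonical form
    text = unicodedata.normalize("NFKC", text)
    # Blank out symbols that rarely help retrieval, then collapse whitespace
    text = re.sub(r"[@#&*^~|<>]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_special_chars("Email us @ #support & mention ~pricing~"))
# -> "Email us support mention pricing"
```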
Testing and Optimization
There are many variables that we set when configuring a knowledge-based chatbot, and these configurations need to be optimized based on the outputs we receive. Some of the activities we perform in this stage include:
Studying the use case and determining the accuracy metrics that best suit your context and needs. Common metrics we adopt are precision, recall, F1 score, and BLEU score.
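Precision, recall, and F1 in particular reduce to a few lines of arithmetic:

```python
def precision_recall_f1(true_pos: int, false_pos: int, false_neg: int):
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 8 relevant answers retrieved, 2 spurious ones, 4 missed
print(precision_recall_f1(8, 2, 4))  # -> (0.8, 0.667, 0.727) approximately
```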
Building a standardized test set by collecting potential questions from key stakeholders and potential users. This gives us a set of test questions to use for measuring success.
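The test set itself can be as simple as a list of question/expected-answer pairs, stored wherever your evaluation harness can read it:

```python
import json

test_set = [
    {"question": "What is the refund window?", "expected": "30 days"},
    {"question": "Who handles billing issues?", "expected": "the billing team"},
]

with open("test_set.json", "w", encoding="utf-8") as f:
    json.dump(test_set, f, indent=2)
```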
Apart from accuracy metrics, qualitative analysis is also crucial. Sometimes, numerical metrics don’t tell the whole story. We read through the AI outputs and assess them for quality, relevance, and coherence. This step often involves human evaluators.
A/B testing and benchmarking are a crucial part of the evaluation. We run the same test set through multiple models or configurations and compare their metric performance.
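A skeleton of that comparison is below; `ask` stands in for whichever model client you use (hypothetical), and the naive scorer should be swapped for your chosen metric:

```python
def ask(model: str, question: str) -> str:
    """Placeholder for a call to the model under test (hypothetical)."""
    raise NotImplementedError


def score(answer: str, expected: str) -> float:
    """Naive containment check; swap in your chosen metric."""
    return float(expected.lower() in answer.lower())


def benchmark(models: list[str], test_set: list[dict]) -> dict[str, float]:
    """Average each model's score over the shared test set."""
    return {
        m: sum(score(ask(m, t["question"]), t["expected"]) for t in test_set)
           / len(test_set)
        for m in models
    }
```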