How to use Generative AI on custom data sets across industries to build domain-specific use cases

Generative AI is transforming the landscape of machine learning by enabling the creation of new content that can mimic human-like capabilities. One of the most powerful forms of generative AI is the Large Language Model (LLM), which can understand and generate human language with remarkable proficiency.
A key feature of LLMs is their ability to be re-trained or fine-tuned on custom data sets. This allows organizations to tailor the model's responses to specific domains or use cases. For instance, a legal firm could train an LLM on legal documents to assist in drafting contracts, while a medical research company could train it on scientific papers to generate new research hypotheses.
The process of re-training an LLM involves several steps. First, a suitable data set must be curated. This data set should be large enough to cover the desired scope and diverse enough to enable the model to learn various patterns and nuances. Next, the model must be fine-tuned, which involves adjusting the model's parameters so that it better aligns with the new data set. Finally, the re-trained model must be evaluated to ensure that it performs as expected on tasks relevant to the new domain. Sounds easy?
Re-training LLMs with custom data sets not only enhances their applicability but also improves their accuracy and efficiency in generating relevant content. As generative AI continues to evolve, the ability to customise and adapt these models will become increasingly important for businesses looking to leverage AI for competitive advantage.
Fine-tuning an LLM involves several critical steps to adapt the model to a specific domain. Here's a simplified overview of the process:
1) Data Preparation: Gather a dataset that is representative of the domain or task for which the model will be fine-tuned. This dataset should include examples of the type of content the model will generate or analyze.
2) Model Selection: Choose a pre-trained LLM that best fits the size and scope of your data and the hardware you have available. Larger models can potentially yield better results, but they cost more to train and serve and may need more data to adapt well.
3) Parameter Adjustment: Set the training hyperparameters, such as learning rate, batch size, and number of epochs, to optimize the training process.
4) Training: Use the custom dataset to train the model, allowing it to learn from the new examples and adjust its internal weights accordingly.
5) Evaluation: Test the fine-tuned model on a separate validation set to measure its performance and make any necessary adjustments.
6) Deployment: Once satisfied with the model's performance, deploy it to start generating or analyzing content as required.
Keep in mind, fine-tuning is an iterative process that may require multiple rounds of adjustment and evaluation to achieve the desired performance.
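The evaluation step above depends on a validation set the model never saw during training. Here is a minimal sketch of one way to carve that set out, assuming the examples are plain strings held in a list. The function name `split_train_validation`, the 10% default and the fixed seed are illustrative choices, not part of any library.

```python
import random


def split_train_validation(texts, validation_fraction=0.1, seed=42):
    """Shuffle a copy of `texts` and split it into (train, validation) lists."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(texts)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical stand-in data, only to show the call
examples = [f"transcript {i}" for i in range(20)]
train_texts, validation_texts = split_train_validation(examples)
print(len(train_texts), len(validation_texts))
```

Shuffling before splitting matters when the source file is ordered, for example by date or by product line, because a plain head/tail split would then validate on a different kind of data than it trained on.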

Custom data sets for re-training LLMs can vary widely depending on the specific needs and goals of the project. Here are some examples:
1) Customer Service Data: Transcripts of customer service calls or online chat logs can be used to train an LLM to understand and respond to customer inquiries.
2) Legal Documents: A collection of contracts, legal briefs, and case law can help an LLM learn the language of legal discourse for applications like automated contract analysis.
3) Medical Records: Anonymized patient records, treatment plans, and medical research papers can train an LLM to assist with medical diagnosis or literature review.
4) Technical Manuals: Manuals and documentation for specific products or technologies can help an LLM understand technical language and assist with customer support.
5) Literary Works: A corpus of literary texts can be used to train an LLM to generate creative writing or analyze literary styles.
6) Research Articles: A dataset of scientific articles from a particular field can train an LLM to summarize research findings or generate new research ideas.
We need to ensure that each of these data sets is carefully prepared so that it is representative, comprehensive, and appropriately formatted for the training process.
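As a rough illustration of what "carefully prepared" can mean in practice, the sketch below normalises whitespace, masks e-mail addresses, and drops empty or duplicate entries. The helper name `prepare_transcripts`, the `[EMAIL]` placeholder and the regular expression are assumptions made for this example. Real anonymisation, especially for medical or legal data, needs far more than a single pattern.

```python
import re

# Deliberately simple pattern; it will not catch every address format
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def prepare_transcripts(raw_texts):
    """Return cleaned, de-duplicated texts in their original order."""
    cleaned, seen = [], set()
    for text in raw_texts:
        if not isinstance(text, str):
            continue  # skip missing values such as None
        text = EMAIL_PATTERN.sub("[EMAIL]", text)  # mask e-mail addresses
        text = " ".join(text.split())  # collapse newlines and repeated spaces
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


# Hypothetical raw rows, only to show the behaviour
raw = [
    "  Hi, my email is jane@example.com  ",
    "",
    "Hi, my email is jane@example.com",
    None,
    "Order\n123 is late",
]
print(prepare_transcripts(raw))
```

Collapsing each transcript onto a single line also helps if the texts are later written one per line to a training file.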

Let's consider a scenario where we have a dataset of customer service transcripts that we want to use to fine-tune an LLM. Below is a simplified example of how you might select, load, and train the model using this data.
It shows a basic framework for fine-tuning an LLM on customer service data: loading the data, preparing the datasets, initialising the model and tokenizer, defining training arguments, and training and evaluating the model.


#python code

import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer, TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

# Load the dataset (expects a 'transcript' column)
data = pd.read_csv('customer_service_transcripts.csv')
texts = data['transcript'].dropna().astype(str).tolist()

# Hold out the last 10% of the transcripts for evaluation
split_index = int(len(texts) * 0.9)
train_texts = texts[:split_index]
test_texts = texts[split_index:]

# Save the training and test sets as plain-text files
with open('train_data.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(train_texts))
with open('test_data.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(test_texts))

# Load a pre-trained model and tokenizer
model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

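# Prepare the datasets
# Note: TextDataset is deprecated in recent transformers releases in favour of the datasets library
train_dataset = TextDataset(tokenizer=tokenizer, file_path='train_data.txt', block_size=128)
test_dataset = TextDataset(tokenizer=tokenizer, file_path='test_data.txt', block_size=128)
# mlm=False selects causal (next-token) language modelling rather than masked LM
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)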

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    eval_steps=500,  # only used when step-based evaluation is enabled
    save_steps=1000,
    warmup_steps=500,
    prediction_loss_only=True,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# Train the model
trainer.train()

# Evaluate the model on the held-out set and show the metrics
metrics = trainer.evaluate()
print(metrics)
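The dictionary returned by `trainer.evaluate()` includes an `eval_loss` entry, which is the average cross-entropy on the evaluation set. A common way to make that number easier to read for a language model is to convert it to perplexity by taking its exponential. The sketch below shows only the arithmetic; the helper name and the loss value of 2.0 are made up for illustration and are not results from the script above.

```python
import math


def perplexity_from_metrics(metrics):
    """Convert the average cross-entropy loss in `metrics` to perplexity."""
    loss = metrics["eval_loss"]  # key used by Trainer.evaluate() by default
    return math.exp(loss)


# Made-up loss value, standing in for the dict returned by trainer.evaluate()
example_metrics = {"eval_loss": 2.0}
print(round(perplexity_from_metrics(example_metrics), 2))  # prints 7.39
```

Lower perplexity on the held-out transcripts after fine-tuning, compared with the base model on the same file, is one rough sign that the model has adapted to the domain.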
