
How to use Generative AI on custom data sets across industries to build domain-specific use cases

Generative AI is transforming the landscape of machine learning by enabling the creation of new content that can mimic human-like capabilities. One of the most powerful forms of generative AI is the Large Language Model (LLM), which can understand and generate human language with remarkable proficiency.
A key feature of LLMs is their ability to be re-trained or fine-tuned on custom data sets. This allows organizations to tailor the model's responses to specific domains or use cases. For instance, a legal firm could train an LLM on legal documents to assist in drafting contracts, while a medical research company could train it on scientific papers to generate new research hypotheses.
The process of re-training an LLM involves several steps. First, a suitable data set must be curated. This data set should be large enough to cover the desired scope and diverse enough to enable the model to learn various patterns and nuances. Next, the model must be fine-tuned, which involves adjusting the model's parameters so that it better aligns with the new data set. Finally, the re-trained model must be evaluated to ensure that it performs as expected on tasks relevant to the new domain. Sounds easy, right? In practice, each of these steps takes some care.
Re-training LLMs with custom data sets not only enhances their applicability but also improves their accuracy and efficiency in generating relevant content. As generative AI continues to evolve, the ability to customise and adapt these models will become increasingly important for businesses looking to leverage AI for competitive advantage.
Fine-tuning an LLM (Large Language Model) involves several critical steps to adapt the model to a specific domain. Here's a simplified overview of the process:
1) Data Preparation: Gather a dataset that is representative of the domain or task for which the model will be fine-tuned. This dataset should include examples of the type of content the model will generate or analyze.
2) Model Selection: Choose a pre-trained LLM that best fits the size and scope of your data. Larger models may require more data but can potentially yield better results (see the sketch of comparing candidate model sizes after this list).
3) Parameter Adjustment: Modify the model's hyperparameters, such as learning rate, batch size, and number of epochs, to optimize the training process.
4) Training: Use the custom dataset to train the model, allowing it to learn from the new examples and adjust its internal weights accordingly.
5) Evaluation: Test the fine-tuned model on a separate validation set to measure its performance and make any necessary adjustments.
6) Deployment: Once satisfied with the model's performance, deploy it to start generating or analyzing content as required.
Keep in mind, fine-tuning is an iterative process that may require multiple rounds of adjustment and evaluation to achieve the desired performance.
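
To make the model-selection step a little more concrete, here is a minimal sketch of one way to compare candidate model sizes before committing to one. The model names below (distilgpt2, gpt2, gpt2-medium) are just illustrative choices from the Hugging Face hub, not recommendations from this post; any causal language models could be compared the same way.

from transformers import AutoModelForCausalLM

# Rough size comparison of a few candidate models (names are illustrative)
for name in ['distilgpt2', 'gpt2', 'gpt2-medium']:
    candidate = AutoModelForCausalLM.from_pretrained(name)
    n_params = sum(p.numel() for p in candidate.parameters())
    print(f"{name}: roughly {n_params / 1e6:.0f}M parameters")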

Custom data sets for re-training LLMs can vary widely depending on the specific needs and goals of the project. Here are some examples:
1) Customer Service Data: Transcripts of customer service calls or online chat logs can be used to train an LLM to understand and respond to customer inquiries.
2) Legal Documents: A collection of contracts, legal briefs, and case law can help an LLM learn the language of legal discourse for applications like automated contract analysis.
3) Medical Records: Anonymized patient records, treatment plans, and medical research papers can train an LLM to assist with medical diagnosis or literature review.
4) Technical Manuals: Manuals and documentation for specific products or technologies can help an LLM understand technical language and assist with customer support.
5) Literary Works: A corpus of literary texts can be used to train an LLM to generate creative writing or analyze literary styles.
6) Research Articles: A dataset of scientific articles from a particular field can train an LLM to summarize research findings or generate new research ideas.
We need to ensure that each of these data sets is carefully prepared so that it is representative, comprehensive, and appropriately formatted for the training process.
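
As a small illustration of that preparation step, the sketch below shows one way a raw customer-service export might be cleaned before training: dropping empty rows, removing exact duplicates, and writing the result to the CSV file used by the fine-tuning example later in this post. The input file name and the 'transcript' column are assumptions for the example.

import pandas as pd

# Illustrative clean-up of a raw transcript export (file and column names are assumed)
raw = pd.read_csv('raw_customer_service_export.csv')
clean = (raw.dropna(subset=['transcript'])           # drop rows with no transcript text
            .drop_duplicates(subset=['transcript'])  # remove exact duplicate conversations
            .reset_index(drop=True))
clean['transcript'] = clean['transcript'].str.strip()

# Write the cleaned data to the file the fine-tuning example below expects
clean.to_csv('customer_service_transcripts.csv', index=False)
print(f"Kept {len(clean)} of {len(raw)} transcripts")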

Let's consider a scenario where we have a dataset of customer service transcripts that we want to use to fine-tune an LLM. The simplified example below shows a basic framework for doing this: it loads the data, prepares the training and test datasets, initialises the model and tokenizer, defines the training arguments, and then trains and evaluates the model.


#python code

import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer, TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

# Load the dataset
data = pd.read_csv('customer_service_transcripts.csv')
texts = data['transcript'].tolist()
# Select a subset of the data for training
train_texts = texts[:int(len(texts) * 0.9)]
test_texts = texts[int(len(texts) * 0.9):]
# Save the training and test sets
with open('train_data.txt', 'w') as f:
    f.write('\n'.join(train_texts))
with open('test_data.txt', 'w') as f:
    f.write('\n'.join(test_texts))
# Load a pre-trained model and tokenizer
model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Prepare the datasets
# Note: TextDataset is deprecated in newer transformers releases; the datasets library is the recommended replacement
train_dataset = TextDataset(tokenizer=tokenizer, file_path='train_data.txt', block_size=128)
test_dataset = TextDataset(tokenizer=tokenizer, file_path='test_data.txt', block_size=128)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # mlm=False means standard causal (left-to-right) language modelling

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    eval_steps=500,  # only takes effect if an evaluation strategy is enabled; otherwise evaluation runs only when trainer.evaluate() is called
    save_steps=1000,
    warmup_steps=500,
    prediction_loss_only=True)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset)

# Train the model
trainer.train()

# Evaluate the model
trainer.evaluate()
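
Once training completes, it helps to sanity-check what the fine-tuned model has actually learned. The snippet below is a small sketch that continues from the script above (it reuses the trainer, model, and tokenizer objects): it converts the evaluation loss into perplexity and generates a reply for a made-up customer prompt. The prompt text and generation settings are illustrative, not part of the original example.

import math
from transformers import pipeline

# Perplexity is a common sanity check for language models; lower means the model finds the held-out transcripts less surprising
eval_results = trainer.evaluate()
perplexity = math.exp(eval_results['eval_loss'])
print(f"Validation perplexity: {perplexity:.2f}")

# Try the fine-tuned model on an illustrative customer query
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
sample = generator(
    "Customer: My order arrived damaged. Agent:",  # hypothetical prompt
    max_new_tokens=50,
    do_sample=True,
    top_p=0.9)
print(sample[0]['generated_text'])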
