Creating a training and validation split for torchvision datasets
I’ve had an annoying issue with some of the torchvision datasets: they don’t provide separate training and validation splits. I was trying to decide on the best way to handle this today and decided to ask ChatGPT (since this is something we can verify!).
The solution it proposed is below:
import torch
import torchvision
from sklearn.model_selection import train_test_split

# `transform` and `BATCH_SIZE` are assumed to be defined earlier

# The 'trainval' split contains all the labelled training data;
# torchvision doesn't split it into train and validation for you
trainval_dataset = torchvision.datasets.OxfordIIITPet(
    root='/tmp',
    split='trainval',
    download=True,
    transform=transform
)
testing_dataset = torchvision.datasets.OxfordIIITPet(
    root='/tmp',
    split='test',
    download=True,
    transform=transform
)

# Shuffle the trainval indices and split them 80/20 into train and validation
val_split = 0.2
train_indices, val_indices = train_test_split(
    range(len(trainval_dataset)), test_size=val_split, random_state=8208
)
training_dataset = torch.utils.data.Subset(trainval_dataset, train_indices)
validation_dataset = torch.utils.data.Subset(trainval_dataset, val_indices)

train_loader = torch.utils.data.DataLoader(training_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = torch.utils.data.DataLoader(validation_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = torch.utils.data.DataLoader(testing_dataset, batch_size=BATCH_SIZE, shuffle=False)
Which seems to work quite nicely! I like the fact that you can shuffle the indices to create different training and validation sets each time (or fix the split with a seed for reproducibility).
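If you'd rather avoid the scikit-learn dependency, torch.utils.data.random_split can do the same job. Here's a minimal sketch, assuming the same trainval_dataset, val_split and BATCH_SIZE as above; the seeded generator plays the role of random_state:

import torch
from torch.utils.data import DataLoader, random_split

# Work out integer lengths for an 80/20 split of the trainval data
n_val = int(len(trainval_dataset) * val_split)
n_train = len(trainval_dataset) - n_val

# The seeded generator makes the split reproducible, like random_state above
training_dataset, validation_dataset = random_split(
    trainval_dataset,
    [n_train, n_val],
    generator=torch.Generator().manual_seed(8208),
)

train_loader = DataLoader(training_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(validation_dataset, batch_size=BATCH_SIZE, shuffle=True)

Either way you end up with Subset objects wrapping the same underlying dataset, so the DataLoader code is unchanged.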