How to design negative samples for Florence-2 model training? #52

Open
1 task done
David-19940718 opened this issue Sep 18, 2024 · 10 comments
Labels
question Further information is requested

Comments

@David-19940718

Search before asking

  • I have searched the Multimodal Maestro issues and found no similar feature requests.

Question

Hi, @skylargivens,

We currently have a good understanding of how to create positive samples for the Florence-2 model, using a format like this:

{
  "image": "IMG_20220316_144445_jpg.rf.a79f523e54855af2323f0cfdb9a4dedc.jpg",
  "prefix": "<OD>",
  "suffix": "5 of hearts<loc_54><loc_213><loc_291><loc_598>6 of hearts<loc_205><loc_251><loc_471><loc_670>7 of hearts<loc_363><loc_309><loc_688><loc_797>8 of hearts<loc_598><loc_395><loc_973><loc_974>"
}
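For reference, a suffix in this format can be assembled from pixel-space boxes roughly as follows. This is a minimal sketch, not the library's own code: the helper names `to_loc_token` and `build_od_suffix` are hypothetical, and it assumes coordinates are quantized into 1000 bins (`<loc_0>` to `<loc_999>`) in `x1, y1, x2, y2` order, which matches the example above.

```python
def to_loc_token(value: float, size: float, bins: int = 1000) -> str:
    """Quantize a pixel coordinate into a <loc_N> token with N in 0..bins-1.

    Assumption: 1000 bins, as suggested by the sample suffix in this thread.
    """
    index = int(value * bins / size)
    index = max(0, min(bins - 1, index))  # clamp values on the image border
    return f"<loc_{index}>"


def build_od_suffix(boxes, image_width: float, image_height: float) -> str:
    """Build an <OD> suffix from (label, x1, y1, x2, y2) boxes given in pixels."""
    parts = []
    for label, x1, y1, x2, y2 in boxes:
        parts.append(
            label
            + to_loc_token(x1, image_width)
            + to_loc_token(y1, image_height)
            + to_loc_token(x2, image_width)
            + to_loc_token(y2, image_height)
        )
    return "".join(parts)


# Hypothetical 1000x1000 image, so pixel values map one-to-one onto bins.
suffix = build_od_suffix([("5 of hearts", 54, 213, 291, 598)], 1000, 1000)
print(suffix)
```

An image with no objects would yield an empty string here; whether an empty suffix is a good training target for negative samples is exactly the open question in this issue.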

However, I'm unclear on how to properly design negative samples for training. Negative samples are crucial for improving the model's ability to discriminate and reduce false positives. Some questions I have:

  1. Should negative samples use the same image but with incorrect object descriptions?
  2. Do we need to use completely unrelated images and descriptions?
  3. How do we handle the location tags for negative samples?
  4. What's the recommended ratio of positive to negative samples in the training set?

Any guidance or best practices for creating effective negative samples would be greatly appreciated. This will help ensure we're training the Florence-2 model optimally for object detection tasks.

Additional

If there are any existing resources, documentation, or examples specifically for Florence-2 negative sample creation, please point me in that direction. Also, if there are any tools or scripts the team recommends for generating or augmenting negative samples, that information would be very helpful.

@David-19940718 David-19940718 added the question Further information is requested label Sep 18, 2024
@David-19940718
Author

We're currently experiencing a situation where our model's mAP (mean Average Precision) metrics are degrading while the loss values suggest overfitting. Our current saving strategy is based solely on validation loss, as shown in the following code snippet:

    def save_best(self, processor: AutoProcessor, model: AutoModelForCausalLM, val_loss: float):
        """Saves the best model checkpoint if the validation loss improves.

        Args:
            processor (AutoProcessor): The processor to save.
            model (AutoModelForCausalLM): The model to save.
            val_loss (float): The current validation loss.
        """
        if val_loss < self.best_val_loss:
            self.best_val_loss = val_loss
            save_model(self.best_checkpoint_dir, processor, model)
            print(f"New best model saved with validation loss: {self.best_val_loss}")

I've been looking at our model saving strategy, and I'm curious about your thoughts on its effectiveness. While we're using validation loss as the primary metric for saving the best model, it seems that our mAP scores are not reflecting the improvements we see in the loss. Do you think relying solely on validation loss is the best approach for designing our model saving criteria?

Would it be more beneficial to consider a combination of metrics, such as both validation loss and mAP, to ensure we're not just minimizing loss but also improving the model's precision? Or are there other metrics or strategies you believe would be more suitable for our current situation?
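One way to make the saving criterion configurable is sketched below. It is only an illustration under assumptions: `BestCheckpointTracker`, its `metric_name`/`mode` parameters, and the `save_fn` callback are hypothetical names, not part of the library, and the mAP value is assumed to be computed elsewhere during validation.

```python
import math
from typing import Callable, Dict


class BestCheckpointTracker:
    """Track the best value of one metric and save when it improves.

    Hypothetical helper: mode="min" suits losses, mode="max" suits mAP.
    """

    def __init__(self, metric_name: str = "val_loss", mode: str = "min"):
        if mode not in ("min", "max"):
            raise ValueError("mode must be 'min' or 'max'")
        self.metric_name = metric_name
        self.mode = mode
        self.best_value = math.inf if mode == "min" else -math.inf

    def is_improvement(self, value: float) -> bool:
        if self.mode == "min":
            return value < self.best_value
        return value > self.best_value

    def update(self, metrics: Dict[str, float], save_fn: Callable[[], None]) -> bool:
        """Call save_fn and return True if the tracked metric improved."""
        value = metrics[self.metric_name]
        if not self.is_improvement(value):
            return False
        self.best_value = value
        save_fn()
        return True


# Usage with made-up validation numbers: loss keeps falling while mAP drops.
tracker = BestCheckpointTracker(metric_name="map", mode="max")
saved_epochs = []
history = [{"val_loss": 1.2, "map": 0.41}, {"val_loss": 0.9, "map": 0.38}]
for epoch, metrics in enumerate(history):
    tracker.update(metrics, lambda: saved_epochs.append(epoch))
print(saved_epochs)
```

In the trainer above, `save_fn` could simply wrap the existing `save_model(self.best_checkpoint_dir, processor, model)` call, so only the comparison logic changes.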

Looking forward to your insights on this matter.

[image attachment]

@SkalskiP
Collaborator

Hi @David-19940718 👋🏻 First of all, I'm thrilled to have users like you who are eager to experiment early on and push the library forward.

Regarding negative samples, I don't think there are any established best practices at the moment, but I'll ask a few people involved in VLM training about it.

One idea that seems good, and potentially simple to implement, is to use the COCO dataset as negative samples, for example by splitting the training into two phases. In the first phase, you fine-tune only on your dataset; in the second, you fine-tune on a mix of your dataset and the COCO dataset. This way, the model quickly learns your classes in the first phase and becomes more resistant to overfitting in the second.
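The second phase could be assembled along these lines. This is a rough sketch under assumptions: `build_phase_two_entries` and `coco_ratio` are hypothetical, the entries are plain dicts in the JSONL format from the question, and the thread does not specify either the mixing ratio or how the COCO images should be annotated.

```python
import random


def build_phase_two_entries(own_entries, coco_entries, coco_ratio=0.5, seed=0):
    """Mix the custom dataset with a random subset of COCO entries.

    coco_ratio is the number of COCO samples per custom sample; the value
    0.5 is an arbitrary placeholder, not a recommendation from the thread.
    """
    rng = random.Random(seed)
    n_coco = min(len(coco_entries), int(len(own_entries) * coco_ratio))
    mixed = list(own_entries) + rng.sample(list(coco_entries), n_coco)
    rng.shuffle(mixed)
    return mixed


# Phase 1 would train on own_entries only; phase 2 on the mixed list.
own_entries = [{"image": f"own_{i}.jpg", "prefix": "<OD>"} for i in range(10)]
coco_entries = [{"image": f"coco_{i}.jpg", "prefix": "<OD>"} for i in range(100)]
phase_two = build_phase_two_entries(own_entries, coco_entries)
print(len(phase_two))
```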

As for your second question, the ability to define any metric as a condition for saving a checkpoint sounds very reasonable. I'll try to add a GH issue to add such support.

@David-19940718
Author

Thank you for your detailed and encouraging response. 😄

@David-19940718
Author

Hi @SkalskiP,

By introducing appropriate data augmentation strategies, I've observed a significant reduction in overfitting. Moreover, under the same experimental conditions, mAP has improved by several percentage points.

It might be worth adding this feature in a future release.

[image attachment]

@SkalskiP
Collaborator

Hi @David-19940718 👋🏻 That looks fantastic! Could you tell me exactly what strategies you employed?

@David-19940718
Author

Sure! The main strategies I employed are:

  • Random horizontal flipping (50% chance)
  • Color jittering (adjusting brightness, contrast, saturation, and hue)

    from torch.utils.data import Dataset
    from torchvision import transforms

    class DetectionDataset(Dataset):
        def __init__(self, jsonl_file_path: str, image_directory_path: str, split_name: str):
            # JSONLDataset is the JSONL loader used by the training code.
            self.dataset = JSONLDataset(jsonl_file_path, image_directory_path)
            self.mode = split_name
            self.transform = None
            if split_name == "train":
                # Augment only the training split.
                self.transform = transforms.Compose([
                    transforms.RandomHorizontalFlip(p=0.5),
                    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1)
                ])

        def __len__(self):
            return len(self.dataset)

        def __getitem__(self, idx):
            image, data = self.dataset[idx]
            prefix = data["prefix"]
            suffix = data["suffix"]
            # Apply data augmentation.
            # NOTE: the flip changes the image only; the <loc_*> tokens in
            # the suffix are not updated here.
            if self.mode == "train":
                image = self.transform(image)

            return prefix, suffix, image

@SkalskiP
Collaborator

Hi @David-19940718 👋🏻 Oh, so you ended up using fairly traditional data augmentation techniques?

From what I see, you applied flipping. I understand that you also had to augment the object detection suffix in the process.
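Updating the suffix for a horizontal flip could look like the sketch below. It is an illustration under assumptions, not code from this thread: `flip_suffix_horizontally` is a hypothetical helper, and it assumes boxes are written as four consecutive tokens in `x1, y1, x2, y2` order over 1000 bins (`<loc_0>` to `<loc_999>`).

```python
import re

# Four consecutive location tokens describe one box: x1, y1, x2, y2.
LOC_BOX = re.compile(r"<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>")


def flip_suffix_horizontally(suffix: str, bins: int = 1000) -> str:
    """Mirror every box in an <OD> suffix around the vertical image axis."""

    def flip(match: re.Match) -> str:
        x1, y1, x2, y2 = (int(group) for group in match.groups())
        # Mirroring swaps the roles of the left and right edges.
        new_x1 = max(0, bins - 1 - x2)
        new_x2 = max(0, bins - 1 - x1)
        return f"<loc_{new_x1}><loc_{y1}><loc_{new_x2}><loc_{y2}>"

    return LOC_BOX.sub(flip, suffix)


original = "5 of hearts<loc_54><loc_213><loc_291><loc_598>"
flipped = flip_suffix_horizontally(original)
print(flipped)
```

Because `transforms.RandomHorizontalFlip` makes its random decision internally, keeping the image and suffix in sync would mean drawing the random number once yourself and then flipping both, for example with `torchvision.transforms.functional.hflip` for the image and this helper for the suffix.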

@David-19940718
Author

Yes, I just did a simple initial validation. I applied some basic data augmentation techniques to get started and test things out. 😄

@SkalskiP
Collaborator

@David-19940718 would you perhaps have a moment to draft a PR introducing basic data augmentation?

@kengboonang

Hello! Would be interested to know if there are any updates regarding this! Currently working on fine-tuning Florence2-base-ft for Object Detection tasks and have tried the following:

  • leaving out negative samples entirely
  • using the following annotations:
    • none<loc_000><loc_000><loc_000><loc_000> (only on negative samples)
    • background<loc_000><loc_1000><loc_000><loc_1000> (for all samples)

Leaving the negative samples out entirely still gave better results than either of the two annotation methods I tried; with those annotations, the model did not converge as well.
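For anyone wanting to reproduce the "leave negatives out" baseline, a filter over the annotation file might look like this. It is a small sketch with a hypothetical helper name, `drop_negative_entries`, and it assumes that a negative sample is any JSONL entry whose suffix contains no `<loc_` token.

```python
import json


def drop_negative_entries(jsonl_lines):
    """Keep only entries whose suffix contains at least one location token."""
    kept = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        if "<loc_" in entry.get("suffix", ""):
            kept.append(entry)
    return kept


# Made-up entries: one positive sample and one image without objects.
lines = [
    '{"image": "a.jpg", "prefix": "<OD>", "suffix": "cat<loc_1><loc_2><loc_3><loc_4>"}',
    '{"image": "b.jpg", "prefix": "<OD>", "suffix": ""}',
]
positives = drop_negative_entries(lines)
print([entry["image"] for entry in positives])
```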

3 participants