CS 1678/2078: Homework 4

Due: 4/10/2024, 11:59pm


This assignment is worth 50 points.

In this assignment, you will implement essential elements of a transformer decoder for image captioning. You will also be introduced to self-supervised learning through SimCLR, a framework that learns visual representations with contrastive learning, which you will evaluate on image classification.

Starter code is provided here. Save a copy in your own Google Drive so that you can edit it. The starter code contains two self-explanatory notebooks, one on transformers and one on self-supervised learning. You will implement the required components by editing the Python files (.py) inside the starter code; the sections where your implementation should start and end are explicitly marked by comment lines. These Python files are imported by the notebooks, so use the notebooks for background information, to follow the instructions, and to test your implementations. A small portion of the COCO dataset is used for captioning in the transformer part, while the CIFAR10 dataset is used for image classification in the self-supervised learning part; both datasets are downloaded automatically by the respective notebooks. Grading will primarily be based on the notebook outputs, but you must also submit the modified Python files.


Part A: Implementing Transformers (25 Points)

Please follow the detailed instructions in the "Transformer_Captioning.ipynb" notebook. The implementations that you need to complete are listed for each bullet below. After completing the implementations in the python files, you should run the related cells in the notebook to test your implementations.

  1. [5 pts] Implement multi-headed scaled dot-product attention in the designated sections of the MultiHeadAttention class in cs1678_2078/transformer_layers.py.
  2. [5 pts] Implement the designated sections of the PositionalEncoding class in cs1678_2078/transformer_layers.py.
  3. [5 pts] Answer the inline question inside the notebook.
  4. [5 pts] Implement the forward function of the CaptioningTransformer class in cs1678_2078/classifiers/transformer.py.
  5. [5 pts] After training the transformer-based captioning model on a small training set, report the final loss and the captions sampled at test time. You do not need to write or change any code here; if your previous transformer implementations are correct, simply running the cells at the end of the notebook will produce the results.
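For reference, the core computations behind items 1 and 2 can be sketched in NumPy as follows. This is a minimal illustration of the math only, not the starter code's required interface: the assignment operates on PyTorch tensors inside the MultiHeadAttention and PositionalEncoding classes, and the function names below are hypothetical.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V.
    q, k, v: arrays of shape (..., seq_len, head_dim)."""
    d = q.shape[-1]
    scores = q @ k.swapaxes(-2, -1) / np.sqrt(d)          # (..., seq_len, seq_len)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)             # block masked positions
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def sinusoidal_positional_encoding(max_len, d):
    """Sinusoidal encoding: even dimensions get sin, odd dimensions get cos."""
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    i = np.arange(d // 2)[None, :]                  # (1, d/2)
    angles = pos / np.power(10000.0, 2 * i / d)     # (max_len, d/2)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

In the multi-headed version you additionally split the embedding dimension into H heads, run this attention once per head, and concatenate the per-head outputs before the final linear projection.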

Part B: Self-supervised Learning (25 Points)

Please follow the detailed instructions in the "Self_Supervised_Learning.ipynb" notebook. The implementations that you need to complete are listed for each bullet below. After completing the implementations in the python files, you should run the related cells in the notebook to test your implementations.

  1. [5 pts] Implement the compute_train_transform() and CIFAR10Pair.__getitem__() functions for the data augmentation transforms in cs1678_2078/simclr/data_utils.py.
  2. [5 pts] Implement the sim and simclr_loss_naive functions in cs1678_2078/simclr/contrastive_loss.py.
  3. [5 pts] Implement the sim_positive_pairs, compute_sim_matrix, and simclr_loss_vectorized functions in cs1678_2078/simclr/contrastive_loss.py.
  4. [5 pts] Implement the train function in cs1678_2078/simclr/utils.py
  5. [5 pts] Report the baseline and self-supervised accuracies, along with the comparison plot. You do not need to write or change any code here; if your previous implementations are correct, simply running the cells at the end of the notebook will produce the results.
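As a reference point for the contrastive-loss items, the SimCLR NT-Xent loss over a batch of N positive pairs can be sketched in NumPy as follows. This is a simplified illustration of the math, not the starter code's required interface: the function name is hypothetical, and the actual contrastive_loss.py functions operate on PyTorch tensors and are split across the pieces listed above.

```python
import numpy as np

def nt_xent_loss(z1, z2, tau=0.5):
    """SimCLR NT-Xent loss.
    z1, z2: (N, d) embeddings of two augmented views of the same N images;
    row i of z1 and row i of z2 form a positive pair, all other rows are negatives."""
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)              # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-normalize for cosine similarity
    sim = (z @ z.T) / tau                             # pairwise similarities / temperature
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity from denominators
    # the positive partner of row i is row (i + N) mod 2N
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    m = sim.max(axis=1, keepdims=True)                # stable log-sum-exp
    log_prob = sim - m - np.log(np.exp(sim - m).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

The naive version in the assignment computes the same quantity pair by pair with an explicit loop, while the vectorized version builds the full similarity matrix as above.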

Some tips:
Submission: After completing the implementations and running the related cells in the notebooks, submit your CS1678_2078_HW4 folder. Before zipping the CS1678_2078_HW4 folder in Colab, exclude the cs1678_2078/datasets/coco_captioning and pretrained_model directories to reduce the submission size (otherwise, the file would exceed the Canvas upload limit).
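If you prefer to build the zip programmatically (e.g., in a Colab cell) rather than through the file browser, something like the sketch below works. The exclusion paths are assumptions based on the directory names above, so adjust them to match your actual folder layout:

```python
import os
import zipfile

# Directories to skip, relative to the folder being zipped (assumed layout)
EXCLUDE = ("cs1678_2078/datasets/coco_captioning", "pretrained_model")

def zip_submission(root, out_path, exclude=EXCLUDE):
    """Zip the contents of `root` into `out_path`, skipping excluded subdirectories."""
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _, filenames in os.walk(root):
            rel_dir = os.path.relpath(dirpath, root).replace(os.sep, "/")
            if any(rel_dir == e or rel_dir.startswith(e + "/") for e in exclude):
                continue  # skip large dataset / pretrained-model directories
            for name in filenames:
                full = os.path.join(dirpath, name)
                zf.write(full, os.path.relpath(full, root))
```

Double-check the resulting archive size before uploading to Canvas.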


Acknowledgement: This assignment is adapted from the Stanford CS231n course.