# Assignment 4

#### Student ID: *Double click here to fill the Student ID*

#### Name: *Double click here to fill the name*

## Q1: Transfer Learning on the Dataset Used by the CNN Explainer

[CNN Explainer](https://poloclub.github.io/cnn-explainer/) is an interactive, open-source visualization tool designed to provide a comprehensive understanding of Convolutional Neural Networks (CNNs). In the last assignment, we attempted to replicate the original experiment using TinyVGG by training the model from scratch. However, the results were not satisfactory. In this exercise, we will continue our work and leverage transfer learning to improve performance on the test dataset. First, load the dataset using the following code (feel free to modify the code if you prefer to use PyTorch or other frameworks).

To ensure reproducibility, please set all the random seeds to 2024:

In [None]:
!gdown --fuzzy https://drive.google.com/file/d/1DvyriY4ehA56Bj3asAbCz_syYZvTkumW/view?usp=sharing
!unzip cnn_data.zip

In [None]:
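from tensorflow.keras.utils import image_dataset_from_directory, set_random_seed

seed = 2024
set_random_seed(seed)  # seeds Python's random module, NumPy, and TensorFlow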

training_images = 'train_images/'
vali_images = 'val_images/'
test_images = 'test_images/'

train_dataset = image_dataset_from_directory(
    training_images,
    image_size=(64, 64),
    shuffle=True,
    batch_size=32,
    seed=seed)
validation_dataset = image_dataset_from_directory(
    vali_images,
    image_size=(64, 64),
    shuffle=False,
    batch_size=32)
test_dataset = image_dataset_from_directory(
    test_images,
    image_size=(64, 64),
    shuffle=False,
    batch_size=32)

#### (a) EfficientNet is a modern convolutional neural network obtained through [network architecture search](https://lilianweng.github.io/posts/2020-08-06-nas/). We will use it to perform transfer learning by following the procedure below: (10%)

1. Add a callback to monitor the validation loss and save the best model based on the validation loss.

2. Import the convolutional base of [`EfficientNetV2Backbone`](https://keras.io/api/keras_cv/models/backbones/efficientnetv2/) (`efficientnetv2_b0`) with pre-trained weights from ImageNet. Freeze all the weights in the convolutional base.

3. Add a dropout layer after the convolutional base (remember to flatten the output of the base before applying dropout) with a dropout rate of 0.5, followed by a dense layer with softmax activation to classify the 10 classes.

4. Train the model for 10 epochs using the [`Nadam`](https://keras.io/api/optimizers/Nadam/) optimizer with the default learning rate. Finally, report the accuracy on the test set. Remember to reload the best model before testing.
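
The rule behind step 1 and the reload in step 4 can be sketched without any framework. This is only an illustration: `best_epoch` is a hypothetical helper and the loss values are made up. In Keras, `ModelCheckpoint` with `monitor="val_loss"` and `save_best_only=True` applies the same rule during training.

```python
def best_epoch(val_losses):
    """Return (epoch_index, loss) for the lowest validation loss.

    Mirrors a save-best-only checkpoint: the stored model is replaced
    only when the monitored value strictly improves.
    """
    best_idx, best_loss = None, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:  # strict improvement triggers a new save
            best_idx, best_loss = epoch, loss
    return best_idx, best_loss


# Made-up validation losses for 10 epochs (illustrative values only).
val_losses = [1.90, 1.42, 1.10, 0.95, 0.97, 0.91, 0.93, 0.96, 1.01, 1.05]
epoch, loss = best_epoch(val_losses)
print(f"reload the checkpoint from epoch {epoch + 1} (val_loss={loss:.2f})")
```

In this toy run the lowest loss occurs before the final epoch, which is why the best checkpoint, not the last set of weights, should be evaluated on the test set.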

In [None]:
# coding your answer here.

#### (b) Contrastive Language-Image Pre-Training (CLIP) is a foundational model that includes both text and image encoders. Its multimodal nature enables zero-shot classification. In this problem, we will leverage [`CLIP`](https://huggingface.co/docs/transformers/model_doc/clip) to perform classification. (10%)

1. Load the model and processor using the checkpoint [`openai/clip-vit-base-patch32`](https://huggingface.co/openai/clip-vit-base-patch32) from Hugging Face.

2. Use the following list as the candidate text labels:
    ```
    ['boat',
     'bug',
     'bus',
     'car',
     'espresso',
     'koala',
     'orange',
     'panda',
     'pepper',
     'pizza']
    ```

3. Perform zero-shot 10-class classification using the processor and the `CLIP` model on the test dataset.

4. Calculate the accuracy on the test dataset.

**Hint:** Refer to our lab to learn how to perform zero-shot classification. Remember that you can extract the image data and labels from the test set into `X` and `y` `NumPy` arrays.
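
The scoring step behind zero-shot classification can be sketched in plain Python. Everything here is a stand-in: the 3-dimensional embeddings are invented, `zero_shot_predict` is a hypothetical helper, and the scale of 100 is an assumed substitute for the learned temperature of `CLIP`. The real model produces the embeddings with its image and text encoders; only the compare-and-pick logic is shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_predict(image_emb, text_embs, scale=100.0):
    """Return (predicted label index, probabilities over the labels)."""
    logits = [scale * cosine(image_emb, t) for t in text_embs]
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__), probs


# Invented embeddings for three candidate labels and three images.
labels = ["boat", "koala", "pizza"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embs = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.8], [0.2, 0.7, 0.1]]
y_true = [0, 2, 0]  # the third image is deliberately a miss

y_pred = [zero_shot_predict(e, text_embs)[0] for e in image_embs]
accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)
print([labels[i] for i in y_pred], f"accuracy={accuracy:.3f}")
```

The accuracy in step 4 is computed the same way once `y_pred` holds the index of the highest-scoring label for every test image.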

In [None]:
# coding your answer here.

#### (c) [Stable Diffusion](https://huggingface.co/blog/stable_diffusion) is a text-to-image model that can generate high-quality images using a diffusion model. Design a prompt (e.g., "a photo of ...") to generate three realistic photos that belong to one of the given 10 classes in the CNN Explainer (e.g., koala, bug) using `diffusers` or `KerasCV`. Finally, plot the three generated images. Additionally, resize the three images you generate into $64 \times 64$ pixels. (10%)

**Hint:** You can use `PIL`, `OpenCV`, `Skimage`, or even `NumPy` to resize the images.
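
To show what resizing does at the pixel level, here is a minimal nearest-neighbour sketch on nested lists. `resize_nearest` is a hypothetical helper written for this illustration, and the 4 × 4 grayscale image is toy data. Library resizers that filter before subsampling will usually give smoother results on real photos.

```python
def resize_nearest(image, new_h, new_w):
    """Resize a 2-D list of pixels by nearest-neighbour index mapping."""
    old_h, old_w = len(image), len(image[0])
    return [
        [image[(i * old_h) // new_h][(j * old_w) // new_w] for j in range(new_w)]
        for i in range(new_h)
    ]


# Toy 4x4 grayscale image, downsampled to 2x2.
image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
small = resize_nearest(image, 2, 2)
print(small)
```

Each output pixel copies the source pixel at the proportionally scaled row and column, so no new intensity values are created.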

In [None]:
# coding your answer here.

#### (d) Use the best model you identified in part (a) and the `CLIP` model from part (b) to perform inference on the three downsampled images you generated in (c). Are the results from both models correct for all three images? (5%)

In [None]:
# coding your answer here.

> Ans: *double click here to answer the question.*

#### (e) Use [`KerasTuner`](https://keras.io/keras_tuner/) to perform network architecture search. The search space is described as follows: (15%)

|        | Type                | Activation | Notes |
|--------|---------------------|------------|-------|
| Output | Fully connected     | Softmax    |       |
| D1     | Dropout             |            |       |
| F1     | Fully connected     | ReLU       |       |
| FL     | Flatten             |            |       |
| ...    |                     |            | The convolution block may repeat 1 to 3 times |
| P1     | Max pooling         |            | Convolution block |
| R2     | ReLU                |            | Convolution block |
| B2     | Batch normalization |            | Convolution block |
| C2     | Convolution         |            | Convolution block (layers C1 through P1: these 7 layers form 1 convolution block) |
| R1     | ReLU                |            | Convolution block |
| B1     | Batch normalization |            | Convolution block |
| C1     | Convolution         |            | Convolution block |
| In     | Input               |            |       |

1. **Number of Convolutional Blocks:** Search for the number of convolutional blocks from 1 to 3. Each block consists of seven layers: (Convolution, Batch Normalization, ReLU) × 2, followed by a pooling layer. Fix the kernel size of each convolution layer to 3.
2. **Number of Filters:** Search for the number of filters used in the convolutional layers within the convolutional blocks from 16 to 96, with a step size of 16.
3. **Number of Neurons in Dense Layer:** Search for the number of neurons in the first dense layer from 20 to 50, with a step size of 10.
4. **Dropout Rate:** Search for the dropout rate from 0.3 to 0.8, with a step size of 0.1.
5. **Learning Rate:** Use the Adam optimizer and search for the learning rate from 0.0001 to 0.01, using logarithmic sampling with a step factor of 10 (i.e., 0.0001, 0.001, and 0.01).

Use Bayesian optimization with a maximum of 3 trials and 2 executions per trial. During the search, train each model for 10 epochs and evaluate it on the entire validation set.

Finally, use the architecture you identified and report the test accuracy achieved with this architecture.
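
To get a feel for how large this search space is, the candidate values can be enumerated in plain Python. This is only a sketch: the dictionary keys are hypothetical names, and it assumes a single filter count shared by all convolution layers, which the description above does not state explicitly.

```python
from itertools import product

# Candidate values implied by the five search dimensions above.
search_space = {
    "num_blocks": [1, 2, 3],
    "filters": list(range(16, 97, 16)),        # 16, 32, ..., 96
    "dense_units": list(range(20, 51, 10)),    # 20, 30, 40, 50
    "dropout": [round(0.3 + 0.1 * i, 1) for i in range(6)],  # 0.3 ... 0.8
    "learning_rate": [10.0 ** e for e in (-4, -3, -2)],      # log scale, factor 10
}

sizes = {name: len(values) for name, values in search_space.items()}
total = len(list(product(*search_space.values())))
print(sizes)
print(f"{total} configurations in total")
```

With only 3 trials, the tuner can visit a very small fraction of these configurations, so the reported architecture is the best one sampled rather than the best one possible.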

In [None]:
# coding your answer here.

> Ans: *double click here to answer the question.*

## Q2: Analyze Sentiment Dataset Using Different Models


[Sentiment Analysis](https://huggingface.co/tasks/text-classification) is a subfield of Natural Language Processing (NLP) that involves determining the emotional tone behind words. Its purpose is to understand the attitudes, opinions, and emotions of a speaker or writer with respect to a specific topic or the overall contextual polarity of a document. In this problem, we will use datasets sourced from three different websites: `imdb.com`, `amazon.com`, and `yelp.com`. We will classify each sentence as positive (1) or negative (0).

First, execute the following code cells to import the datasets and organize them into a unified dataset. Feel free to modify the code if you prefer to use `PyTorch` or other frameworks.

To ensure reproducibility, please set all the random seeds to 2024:

In [None]:
!gdown --fuzzy https://drive.google.com/file/d/1E_l4Mh3OU6tRJWXFVrtXaHV-FN6j0Urm/view?usp=sharing
!unzip -qq /content/nlp_data.zip

In [None]:
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split

filepath_dict = {'yelp':   'nlp_data/yelp_labelled.txt',
                 'amazon': 'nlp_data/amazon_cells_labelled.txt',
                 'imdb':   'nlp_data/imdb_labelled.txt'}

df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source
    df_list.append(df)

df = pd.concat(df_list)
sentences = df['sentence'].to_numpy()
y = df['label'].to_numpy()

sentences_train, sentences_test, y_train, y_test = train_test_split(sentences, y, test_size=0.3, random_state=2024)

train_ds = tf.data.Dataset.from_tensor_slices((sentences_train, y_train))
val_ds = tf.data.Dataset.from_tensor_slices((sentences_test, y_test))
# For perfect shuffling, a buffer size greater than or equal to the full size of the dataset is required.
train_ds = train_ds.shuffle(3000).batch(32)
val_ds = val_ds.batch(32)
text_only_train_ds = train_ds.map(lambda x, y: x)

#### (a) The Bag-of-Words model is a popular approach in NLP for representing text data, and we will use this model first. (10%)

1. **Build Text Representation:** Create representations of the input text using bigrams with three encoding methods: multi-hot, count, and TF-IDF. Set the maximum number of tokens to 10,000 during vectorization to obtain a 10,000-dimensional vector for each sample.

2. **Construct Classifier:** Build a Random Forest classifier with `n_estimators` set to 10.

3. **Compare Encoding Methods:** Use the model and the accuracy metric to compare the multi-hot, count, and TF-IDF encoding methods. Determine which encoding method performs best based on validation accuracy.

**Hint:** Refer to our lab to learn how to use the `TextVectorization` layer in `Keras` to vectorize the training and validation sets. If you are using `PyTorch`, you may find `torchtext` and `TfidfVectorizer` from `sklearn` useful.
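
The difference between the three encodings is easiest to see on a toy corpus. The sketch below is framework-free and rests on stated assumptions: the vocabulary holds unigrams and bigrams together, tokenization is a plain lowercase whitespace split, and the IDF uses the smoothed form log((1 + N) / (1 + df)) + 1. Keras and scikit-learn each apply their own IDF variant and normalization, so absolute values will differ. `ngram_tokens` and `encode` are hypothetical helpers.

```python
import math
from collections import Counter


def ngram_tokens(sentence):
    """Lowercase, split on whitespace, and return unigrams plus bigrams."""
    words = sentence.lower().split()
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams


corpus = ["good food good service", "bad food", "good phone"]
docs = [Counter(ngram_tokens(s)) for s in corpus]
vocab = sorted(set().union(*docs))
# Document frequency: in how many sentences each token appears.
doc_freq = Counter(tok for d in docs for tok in d)


def encode(doc, mode):
    """Encode one tokenized document as a vector over `vocab`."""
    n_docs = len(docs)
    vec = []
    for tok in vocab:
        count = doc[tok]
        if mode == "multi_hot":
            vec.append(1 if count else 0)
        elif mode == "count":
            vec.append(count)
        else:  # "tf_idf" with the assumed smoothed IDF
            idf = math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1
            vec.append(count * idf)
    return vec


i = vocab.index("good")
for mode in ("multi_hot", "count", "tf_idf"):
    print(mode, encode(docs[0], mode)[i])
```

For the token "good" in the first sentence, multi-hot records presence only, count records two occurrences, and TF-IDF scales that count down because the token also appears in another sentence.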

In [None]:
# coding your answer here.

> Ans: *double click here to answer the question.*

#### (b) Now we will try to improve the performance by tuning the hyperparameters using [`Optuna`](https://optuna.org/). (10%)

1. **Define the Parameter Grid:**  
   ```python
   param_grid = {
       'n_estimators': (10, 50),
       'criterion': ['gini', 'entropy'],
       'min_samples_split': (2, 4),
       'max_features': ['sqrt', 'log2']
   }
   ```

2. **Perform Hyperparameter Search:**  
   Use Bayesian optimization with [`GPSampler`](https://optuna.readthedocs.io/en/stable/reference/samplers/generated/optuna.samplers.GPSampler.html) to search for the best hyperparameters. Treat `n_estimators` and `min_samples_split` as integer parameters, and `criterion` and `max_features` as categorical parameters. Perform cross-validation with 5 folds during the search.

3. **Construct and Evaluate the Model:**  
   Build the Random Forest classifier using the best hyperparameters found in step 2. Report the accuracy on the validation set.

**Hint:** Refer to the [`Optuna` documentation](https://optuna.readthedocs.io/en/stable/index.html) or our lab for guidance on setting up the sampler and defining the search space. Ensure that cross-validation is properly integrated into the hyperparameter tuning process to obtain reliable performance estimates.
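
How cross-validation plugs into the search can be sketched as follows: for each set of hyperparameters, the objective returns the mean score over the 5 folds. The code is a dependency-free illustration; `k_fold_indices`, `cv_score`, and `toy_score` are hypothetical names, and `toy_score` is a stand-in that does not train anything. In practice, scikit-learn's `cross_val_score` can compute the fold scores inside the Optuna objective.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def cv_score(params, n_samples, score_fn, k=5):
    """Mean of the per-fold scores: the value an objective would return."""
    scores = [score_fn(params, tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


def toy_score(params, train_idx, val_idx):
    # Stand-in for "fit on train_idx, return accuracy on val_idx".
    return 0.70 + 0.001 * params["n_estimators"]


folds = list(k_fold_indices(12, k=5))
mean_acc = cv_score({"n_estimators": 50}, 12, toy_score)
print([len(va) for _, va in folds], round(mean_acc, 3))
```

This splitter does not shuffle, so real data should be shuffled (or stratified by label) before the folds are formed.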

In [None]:
# coding your answer here.

> Ans: *double click here to answer the question.*

#### (c) We will now proceed with a sequence model. To determine the maximum length used in the vectorizer, draw a histogram showing the distribution of the number of words in each sentence in the dataset. Furthermore, calculate the average number of words and the 95th percentile of the word counts, and label them on the plot. (10%)
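
The two statistics requested here can be computed as in the sketch below. The sentences are toy data, and `percentile` is a hypothetical helper that interpolates linearly between ranks, which is also the default method of `numpy.percentile`. Plotting is left out; the values would be drawn as vertical lines on the histogram.

```python
def percentile(values, q):
    """q-th percentile (0 to 100) with linear interpolation between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


# Toy sentences standing in for the real dataset.
toy_sentences = [
    "Great phone",
    "Not good at all",
    "I loved it",
    "Terrible",
    "The food was cold and the service was slow",
]
word_counts = [len(s.split()) for s in toy_sentences]
mean_words = sum(word_counts) / len(word_counts)
p95_words = percentile(word_counts, 95)
print(word_counts, mean_words, p95_words)
```

A single long sentence pulls the 95th percentile well above the mean, which is the reason for using the percentile rather than the mean as the truncation length.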

In [None]:
# coding your answer here.

#### (d) Build and train a recurrent sequence model as follows: (10%)

1. **Set Vocabulary Size and Sequence Length:** Set the maximum number of tokens (vocabulary size) to 10,000 when vectorizing the text, and use the 95th percentile of the number of words per sentence identified in part (c) as the maximum sequence length.

2. **Build an RNN Model:** Construct a Recurrent Neural Network (RNN) with the following architecture:

|        | Type                             | Channels | Activation | Notes |
|--------|----------------------------------|----------|------------|-------|
| Output | Fully connected                  |          | Sigmoid    | Binary classification |
| D1     | Dropout                          |          |            | Dropout rate set to 0.8 |
| R1     | Bidirectional RNN with GRU cells | 8        |            | Bidirectional RNN layer with 8 GRU units |
| E1     | Embedding                        |          |            | Embedding output is set to 8 dimensions; remember to mask the padded zeros |
| In     | Input                            |          |            | Input is truncated to x words (x is the 95th percentile from part (c)), with a 10,000-token vocabulary |

3. **Compile and Train the Model:**  
    - Add a callback that monitors **validation accuracy** and saves the best model.
    - Compile the model using the `Adam` optimizer with the default learning rate.
    - Fit the model for 10 epochs.

4. **Evaluate the Model:**  
    - Report the accuracy on the validation set.
    - Compare this accuracy to the results obtained in part (b) and provide comments on the performance differences.

**Hint:** Refer to our lab to learn how to use the `TextVectorization` layer in `Keras` to vectorize the training and validation sets. If you are using `PyTorch`, you may find `torchtext` useful.
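
What "truncated to x words" and "mask the padded zeros" mean at the token level can be sketched as follows. The token ids are invented, `max_len = 5` is a stand-in for x, and padding at the end with id 0 is an assumption of this sketch. In Keras, the `Embedding` layer's `mask_zero=True` option builds a mask like this and passes it to the recurrent layer.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Clip a sequence to max_len, then right-pad it with pad_id."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


def padding_mask(padded, pad_id=0):
    """True for real tokens, False for padding positions."""
    return [tok != pad_id for tok in padded]


max_len = 5  # stand-in for x, the 95th-percentile sentence length
short = pad_or_truncate([12, 7, 431], max_len)
long_ = pad_or_truncate([5, 9, 2, 88, 41, 3, 17], max_len)
mask = padding_mask(short)
print(short, mask)
print(long_)
```

With the mask in place, the recurrent layer skips the padded positions, so short sentences are not summarized from a trailing run of zeros.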

In [None]:
# coding your answer here.

> Ans: *double click here to answer the question.*

#### (e) Generative Pre-trained Transformer (GPT) is a powerful pre-trained foundation model, built on the [Transformer](https://transformer.realcat.top/) architecture, that generates text. (10%)

1. **Load the Model and Tokenizer:**  
   Load the model and tokenizer using the checkpoint [`gpt2-xl`](https://huggingface.co/gpt2-xl) from Hugging Face.

2. **Generate a Positive Review:**  
   Design a prompt and use the `generate()` method to produce 30 tokens that represent a positive review. For example, you can input the text "The movie is great since" and allow the model to generate 30 tokens.

3. **Classify the Generated Review:**  
   Feed the generated review into the best models identified in parts (a) and (d). Determine whether each model predicts the review as positive.

**Hint:** You can adjust parameters such as `do_sample`, `top_k`, `num_beams`, or `temperature` in the `generate()` function to make the generated text more convincing. Refer to our lab for a brief introduction. You may also find the [Transformer Explainer](https://transformer.realcat.top/) useful for understanding GPT.
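
The effect of `top_k` and `temperature` on the next-token choice can be illustrated with a small sampler. This is a simplified sketch, not the Hugging Face implementation: the four-word vocabulary and the logits are invented, and `top_k_probs` and `sample_token` are hypothetical helpers.

```python
import math
import random


def top_k_probs(logits, top_k, temperature=1.0):
    """Return {token_index: probability} over the top_k highest logits."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = ranked[:top_k]
    scaled = [logits[i] / temperature for i in kept]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return {i: e / total for i, e in zip(kept, exps)}


def sample_token(logits, top_k, temperature, rng):
    """Draw one token index from the truncated, rescaled distribution."""
    probs = top_k_probs(logits, top_k, temperature)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


vocab = ["great", "fine", "boring", "awful"]
logits = [2.0, 1.0, 0.5, -1.0]  # invented next-token scores
rng = random.Random(2024)

greedy = vocab[sample_token(logits, top_k=1, temperature=1.0, rng=rng)]
sharp = top_k_probs(logits, top_k=2, temperature=0.5)
flat = top_k_probs(logits, top_k=2, temperature=2.0)
print(greedy, round(sharp[0], 3), round(flat[0], 3))
```

A lower temperature concentrates probability on the top-ranked token, while a higher temperature spreads it across the kept candidates; `top_k=1` reduces sampling to a greedy choice.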

In [None]:
# coding your answer here.

> Ans: *double click here to answer the question.*