# Assignment 4

#### Student ID: *Double click here to fill the Student ID*

#### Name: *Double click here to fill the name*

To ensure reproducibility, please set all random seeds to 2023.
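One possible way to do this is sketched below, assuming the TensorFlow/Keras setup used in the rest of the notebook. The helper name `set_all_seeds` is our own, and the TensorFlow call is skipped when the package is not installed.

```python
import random

import numpy as np

SEED = 2023


def set_all_seeds(seed=SEED):
    """Seed Python's RNG, NumPy's legacy RNG and, if installed, TensorFlow."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import tensorflow as tf  # optional: only seeded when available
        tf.random.set_seed(seed)
    except ImportError:
        pass


set_all_seeds()
```

If you use PyTorch instead, `torch.manual_seed(2023)` plays the same role as the TensorFlow call.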

## Q1 Transfer learning on the dataset used by the CNN explainer

[CNN Explainer](https://poloclub.github.io/cnn-explainer/) is an interactive, open-source visualization tool designed to provide a comprehensive understanding of Convolutional Neural Networks (CNNs). In the last assignment, we tried to replicate the original experiment using TinyVGG by training the model from scratch. However, the results were not satisfactory. In this exercise, we will continue our journey and leverage transfer learning to improve the performance on the given test dataset. First, load the dataset using the following code (feel free to change the code if you want to use `PyTorch` or other frameworks):

In [None]:
!unzip -qq data_hw3.zip

In [None]:
training_images = 'class_10_train/'
vali_images = 'class_10_val/val_images/'
test_images = 'class_10_val/test_images/'

In [None]:
from tensorflow.keras.utils import image_dataset_from_directory

train_dataset = image_dataset_from_directory(
    training_images,
    image_size=(64, 64),
    batch_size=32)
validation_dataset = image_dataset_from_directory(
    vali_images,
    image_size=(64, 64),
    batch_size=32)
test_dataset = image_dataset_from_directory(
    test_images,
    image_size=(64, 64),
    batch_size=32)

Found 5000 files belonging to 10 classes.
Found 250 files belonging to 10 classes.
Found 250 files belonging to 10 classes.


#### (a) EfficientNet is a family of modern convnets obtained through [neural architecture search](https://lilianweng.github.io/posts/2020-08-06-nas/). We will use it for transfer learning with the following procedure: (10%)

1. First, add a callback that monitors the validation loss and saves the best model based on the validation loss.

2. Import the convolutional base of `EfficientNetV2S` (`efficientnet_v2_s`) with weights pre-trained on ImageNet. Freeze all the weights in the convolutional base.

3. Add a dropout layer after the convolutional base (remember to flatten the output of the base before dropout) with a dropout rate set to 0.5, followed by a dense layer with softmax activation to classify the given 10 classes.

4. Train the model for 10 epochs using the `Adam` optimizer with the default learning rate. Finally, report the accuracy on the test set. Remember to reload the best model before testing.

Hint: Remember that each pre-trained model may need different input preprocessing. Check the documentation for the pre-trained model first.

In [None]:
# coding your answer here.

#### (b) Looking at the training/validation loss, you can see that the model is overfitting. Try to add a data augmentation layer for the model in (a) as follows: (10%)

* Applies random horizontal flipping
* Rotates the input images by a random value in the range `[-36 degrees, +36 degrees]`
* Zooms in or out of the image by a random factor in the range `[-20%, +20%]`
* Randomly chooses a location to crop images down to a target size of `[56, 56]`
* Randomly adjusts the contrast of images by a factor in the range `[0.85, 1.15]` relative to the original image

In addition, unfreeze the last three layers of the convolutional base (i.e., we will fine-tune the last three layers and the classification head). Fit your model using the `Adam` optimizer for enough epochs (40, for instance). Finally, report the accuracy on the test set. Remember to reload the best model before testing.

Hint: You can find out how to apply data augmentation in the previous lab. Note that, to keep the shapes compatible, you need to construct the convolutional base with an input size that matches the output of the data augmentation layer.
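The augmentation list above can be translated into Keras preprocessing layers roughly as follows. This is a sketch rather than the lab's reference solution: the names `AUGMENTATION_PARAMS` and `build_augmentation` are our own, and TensorFlow is imported lazily so that the parameter table can be inspected without it. Note that `RandomRotation` takes a fraction of a full turn, so 36 degrees becomes `36 / 360 = 0.1`.

```python
# Augmentation settings from the list above, expressed as Keras layer arguments.
ROTATION_DEGREES = 36

AUGMENTATION_PARAMS = {
    "flip_mode": "horizontal",
    "rotation_factor": ROTATION_DEGREES / 360,  # fraction of a full turn
    "zoom_factor": 0.2,                         # zoom in/out by up to 20%
    "crop_size": (56, 56),                      # target height and width
    "contrast_factor": 0.15,                    # contrast factor in [0.85, 1.15]
}


def build_augmentation(params=AUGMENTATION_PARAMS):
    """Return a Sequential model holding the five augmentation layers."""
    # Imported lazily so the constants above load without TensorFlow.
    from tensorflow import keras
    from tensorflow.keras import layers

    height, width = params["crop_size"]
    return keras.Sequential(
        [
            layers.RandomFlip(params["flip_mode"]),
            layers.RandomRotation(params["rotation_factor"]),
            layers.RandomZoom(params["zoom_factor"]),
            layers.RandomCrop(height, width),
            layers.RandomContrast(params["contrast_factor"]),
        ],
        name="data_augmentation",
    )
```

Because the crop reduces the images to 56 x 56, the convolutional base placed after this block would be built with `input_shape=(56, 56, 3)`.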

In [None]:
# coding your answer here.

#### (c) `CLIP` is a foundation model that contains both a text encoder and an image encoder. Its multimodal nature makes zero-shot classification possible. In this problem, we will use it to classify the test images. (15%)

1. Firstly, load the model and processor from the checkpoint [`openai/clip-vit-large-patch14`](https://huggingface.co/openai/clip-vit-large-patch14) on Hugging Face. You can use `TFAutoModelForZeroShotImageClassification` (or `AutoModelForZeroShotImageClassification`) and `AutoProcessor` to load them.

2. Secondly, use the following mapping to generate the candidate text labels: 
```
{'n01882714': 'koala', 
 'n02165456': 'ladybug', 
 'n02509815': 'lesser panda', 
 'n03662601': 'lifeboat', 
 'n04146614': 'school bus', 
 'n04285008': 'sports car', 
 'n07720875': 'bell pepper', 
 'n07747607': 'orange', 
 'n07873807': 'pizza', 
 'n07920052': 'espresso'}
```

3. Thirdly, perform 10-class zero-shot classification on the test dataset using the processor and the `CLIP` model.

4. Finally, calculate the accuracy on the test dataset.

Hint: If you are using `image_dataset_from_directory` function, it will assign labels to images based on their directory names. The labels are integer indices of the class names sorted alphabetically (i.e., 'n01882714'-> 0, 'n02165456'-> 1 ...). In addition, refer to our lab to see how to perform zero-shot classification. Finally, remember that you can extract the image data and labels from the test set into `X` and `y` `NumPy` arrays.
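The label bookkeeping described in the hint can be sketched in plain Python. The mapping is copied from the assignment; the prompt template (`"a photo of a ..."`) and the helper names `with_article` and `accuracy` are illustrative assumptions, not part of the lab code.

```python
# Mapping from the assignment: WordNet ID (directory name) -> class name.
WNID_TO_NAME = {
    "n01882714": "koala",
    "n02165456": "ladybug",
    "n02509815": "lesser panda",
    "n03662601": "lifeboat",
    "n04146614": "school bus",
    "n04285008": "sports car",
    "n07720875": "bell pepper",
    "n07747607": "orange",
    "n07873807": "pizza",
    "n07920052": "espresso",
}

# image_dataset_from_directory sorts directory names alphabetically,
# so integer label i corresponds to the i-th sorted WordNet ID.
class_names = [WNID_TO_NAME[wnid] for wnid in sorted(WNID_TO_NAME)]


def with_article(name):
    """Prefix a class name with 'a' or 'an' (simple vowel check)."""
    article = "an" if name[0] in "aeiou" else "a"
    return f"{article} {name}"


# Hypothetical prompt template for the candidate text labels.
candidate_labels = [f"a photo of {with_article(name)}" for name in class_names]


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted index equals the true index."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

With this ordering, the index of the highest-scoring candidate label can be compared directly with the integer labels in `y`.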

In [None]:
# coding your answer here.

#### (d) [Stable Diffusion](https://huggingface.co/blog/stable_diffusion) is a text-to-image model that can generate high-quality images using a diffusion model. Design a prompt (e.g., "a photo of ...") to generate three realistic photos that belong to one of the given 10 classes in the CNN Explainer dataset (e.g., koala, ladybug, ...) using `diffusers` (`StableDiffusionPipeline`) or `KerasCV` (`keras_cv.models.StableDiffusion`). Finally, plot the three generated images. (15%)

Hint: Refer to our lab to see how to use stable diffusion.

In [None]:
# coding your answer here.

#### (e) Firstly, resize the three images you generated in (d) to $64 \times 64$ pixels. Secondly, use the best model you found in (a)-(b) and the `CLIP` model from (c) to perform inference on these three downsampled images. Are the results from the two models correct on all three images? (10%)

Hint: You can use PIL/OpenCV/Skimage or even NumPy to resize the images.
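As one illustration of the NumPy route mentioned in the hint, the sketch below downsamples by averaging non-overlapping pixel blocks. It assumes the generated image is an `H x W x C` `uint8` array whose height and width are integer multiples of 64 (for example 512 x 512); the function name `downsample_block_mean` and the random stand-in image are our own.

```python
import numpy as np


def downsample_block_mean(image, target=(64, 64)):
    """Downsample an H x W x C uint8 image by averaging non-overlapping blocks.

    Assumes H and W are integer multiples of the target height and width.
    """
    h, w, c = image.shape
    th, tw = target
    if h % th or w % tw:
        raise ValueError("image size must be a multiple of the target size")
    # Split each axis into (output index, offset inside block) and average
    # over the two offset axes.
    blocks = image.reshape(th, h // th, tw, w // tw, c).astype(np.float32)
    return blocks.mean(axis=(1, 3)).round().astype(np.uint8)


# Random stand-in for a generated image (not real model output).
rng = np.random.default_rng(2023)
fake_image = rng.integers(0, 256, size=(512, 512, 3), dtype=np.uint8)
small = downsample_block_mean(fake_image)
```

When the sizes do not divide evenly, a library call such as PIL's `Image.resize((64, 64))` is the simpler choice.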

In [None]:
# coding your answer here.

> Ans: *double click here to answer the question.*

## Q2 Analyze sentiment dataset using different models

[Sentiment analysis](https://demo.allennlp.org/sentiment-analysis/glove-sentiment-analysis) is a subfield of natural language processing (NLP) that involves determining the emotional tone behind words. Its purpose is to understand the attitudes, opinions, and emotions of a speaker or writer with respect to some topic or the overall contextual polarity of a document. In this problem, we will use a dataset that comes from three different websites/fields: `imdb.com`, `amazon.com` and `yelp.com`, and we will classify each sentence as positive (1) or negative (0). Firstly, execute the following code cells to load the data and organize it into datasets (feel free to change the code if you would like to use `PyTorch` or other frameworks):

In [None]:
!unzip -qq /content/sentiment-data.zip

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf

filepath_dict = {'yelp':   'sentiment labelled sentences/sentiment labelled sentences/yelp_labelled.txt',
                 'amazon': 'sentiment labelled sentences/sentiment labelled sentences/amazon_cells_labelled.txt',
                 'imdb':   'sentiment labelled sentences/sentiment labelled sentences/imdb_labelled.txt'}

df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source
    df_list.append(df)

df = pd.concat(df_list)

sentences = df['sentence'].to_numpy()
y = df['label'].to_numpy()
sentences_train, sentences_test, y_train, y_test = train_test_split(sentences, y, test_size=0.25, random_state=2023)

train_ds = tf.data.Dataset.from_tensor_slices((sentences_train, y_train))
val_ds = tf.data.Dataset.from_tensor_slices((sentences_test, y_test))
train_ds = train_ds.shuffle(3000).batch(32)
val_ds = val_ds.batch(32)
text_only_train_ds = train_ds.map(lambda x, y: x)

#### (a) The bag-of-words model is a popular approach in NLP to represent text data, and we will try this model first. (15%)

1. Firstly, build the representation of the input text using bigrams and TF-IDF encoding. Set the maximum number of tokens to 10,000 when performing vectorization so that we get a vector of 10,000 dimensions for each sample.

2. Secondly, build a simple MLP model as follows: 

|        | Type            | Units | Activation | Notes |
|--------|-----------------|-------|------------|-------|
| Output | Fully connected |       | Sigmoid    | Binary classification |
| D2     | Dropout         |       |            | Dropout rate set to 0.75 |
| D1     | Fully connected | 16    | ReLU       | 16 neurons |
| In     | Input           |       |            | Input has 10,000 dimensions |


3. Add a callback to monitor the validation loss and save the best model based on the **validation accuracy**. Fit the model with the `Adam` optimizer for 10 epochs with the default learning rate.

4. Finally, report the best validation accuracy.

Hint: Refer to our lab to see how to use the `TextVectorization` layer in `Keras` to vectorize the training and validation sets. If you are using `PyTorch`, you may find `torchtext` and `TfidfVectorizer` from `sklearn` useful.
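To make the bigram representation concrete, the sketch below first shows in plain Python which tokens a sentence contributes when unigrams and bigrams are both kept, and then a Keras layer configured along the lines of step 1. The function names are our own. The plain-Python tokenizer only lowercases and splits on whitespace, whereas the Keras layer also strips punctuation by default.

```python
def unigrams_and_bigrams(sentence):
    """Lowercase, split on whitespace, and return unigrams followed by bigrams."""
    tokens = sentence.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def build_tfidf_vectorizer(max_tokens=10_000):
    """Return a TextVectorization layer for TF-IDF weighted bigrams."""
    # Imported lazily so the helper above runs without TensorFlow.
    from tensorflow.keras import layers

    return layers.TextVectorization(
        ngrams=2,               # keep unigrams and bigrams
        max_tokens=max_tokens,  # vocabulary size = output dimension
        output_mode="tf_idf",
    )


example_tokens = unigrams_and_bigrams("The food was great")
```

The layer has to be fitted on text only before use, for example with `vectorizer.adapt(text_only_train_ds)`, and can then be mapped over `train_ds` and `val_ds`.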

In [None]:
# coding your answer here.

#### (b) Now, we will try a sequence model. (15%)

1. Firstly, when vectorizing the text, set the maximum sequence length to 200 and the maximum number of tokens to 10,000.

2. Secondly, build an RNN as follows: 

|        | Type                            | Units | Activation | Notes |
|--------|---------------------------------|-------|------------|-------|
| Output | Fully connected                 |       | Sigmoid    | Binary classification |
| D1     | Dropout                         |       |            | Dropout rate set to 0.75 |
| R1     | Bidirectional RNN with GRU cell | 16    |            | Bidirectional RNN layer with 16 GRU cells |
| E1     | Embedding                       |       |            | Embedding output is 64-dimensional; remember to mask the padded zeros |
| In     | Input                           |       |            | Input is truncated to 200 words with a vocabulary of 10,000 tokens |

3. Add a callback to monitor the validation loss and save the best model based on the **validation accuracy**. Fit the model with the `Adam` optimizer for 10 epochs with the default learning rate.

4. Finally, report the best validation accuracy.

Hint: Refer to our lab to see how to use the `TextVectorization` layer in `Keras` to vectorize the training and validation sets. If you are using `PyTorch`, you may find `torchtext` useful.
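The effect of the 200-token limit and of the padded zeros can be sketched as follows. `pad_or_truncate` is a plain-Python illustration of what integer-mode vectorization does to each sentence, and `build_sequence_layers` shows one possible Keras configuration for the input and embedding rows of the table. Both names are our own.

```python
def pad_or_truncate(token_ids, max_len=200, pad_id=0):
    """Cut a sequence to max_len tokens, or right-pad it with pad_id."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


def build_sequence_layers(max_tokens=10_000, max_len=200, embed_dim=64):
    """Return an integer vectorizer and a masked embedding layer."""
    # Imported lazily so the helper above runs without TensorFlow.
    from tensorflow.keras import layers

    vectorizer = layers.TextVectorization(
        max_tokens=max_tokens,
        output_mode="int",
        output_sequence_length=max_len,  # truncate or pad to 200 tokens
    )
    embedding = layers.Embedding(
        input_dim=max_tokens,
        output_dim=embed_dim,
        mask_zero=True,  # index 0 is padding and is masked downstream
    )
    return vectorizer, embedding
```

Masking matters here because most sentences in this dataset are far shorter than 200 tokens, so without it the GRU would spend most of its steps processing padding.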

In [None]:
# coding your answer here.

#### (c) There is a [rule of thumb](https://developers.google.com/machine-learning/guides/text-classification/step-2-5#algorithm_for_data_preparation_and_model_building) that you should pay close attention to the **ratio between the number of samples in your training data and the mean number of words per sample** when approaching a new text classification task. If that ratio is less than 1,500, the bag-of-bigrams model will perform better. If that ratio exceeds 1,500, you should go with a sequence model. In other words, sequence models work best when lots of training data are available and each sample is relatively short. (10%)

Plot a histogram of the number of words per sample for the training dataset and calculate the ratio described above. Finally, compare the accuracy you get using the bag-of-bigrams model in (a) with the results you get in (b). Comment on the rule of thumb.
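The ratio itself is simple arithmetic, sketched below on three made-up sentences. The helper name `samples_to_words_ratio` and the toy data are our own, and words are counted by whitespace splitting, which is an assumption about what counts as a "word".

```python
def samples_to_words_ratio(sentences):
    """Return (number of samples / mean words per sample, per-sample word counts)."""
    word_counts = [len(sentence.split()) for sentence in sentences]
    mean_words = sum(word_counts) / len(word_counts)
    return len(sentences) / mean_words, word_counts


# Toy stand-in for the training sentences (not real dataset rows).
toy_sentences = [
    "Great phone",
    "The battery died after one week",
    "Not worth the money",
]
ratio, counts = samples_to_words_ratio(toy_sentences)
```

On the assignment data, you would pass `sentences_train` instead of the toy list and feed the returned counts to a histogram call such as `plt.hist(counts, bins=30)`.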

In [None]:
# coding your answer here.

> Ans: *double click here to answer the question.*

#### (d) GPT is a powerful pretrained foundation model that can generate text. (10%)

1. Firstly, load the model and tokenizer from the checkpoint [`gpt2-xl`](https://huggingface.co/gpt2-xl) on Hugging Face. You can load them using `TFAutoModelForCausalLM` (or `AutoModelForCausalLM`) and `AutoTokenizer`.

2. Secondly, design a prompt and use the `generate()` method to generate 200 tokens that represent a positive review (for instance, you can feed the text "The movie is great" to the model and let it generate 200 tokens).

3. Finally, feed the review to the best model you found in (a) and (b). Does the model predict the review as positive? 

Hint: You can tune `do_sample`, `top_k`, `num_beams` or `temperature` in the `generate()` function so that the generated text is more convincing. Again, refer to our lab for a brief introduction.
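To build intuition for two of these knobs, the toy sketch below shows what `top_k` and `temperature` do to a next-token distribution. It is a plain-Python illustration on made-up logits, not the Hugging Face implementation, and the function names are our own.

```python
import math
import random


def top_k_probabilities(logits, k, temperature=1.0):
    """Keep the k largest logits, divide by temperature, and apply softmax.

    Returns a dict {token_index: probability} over the kept tokens only.
    """
    if k < 1 or temperature <= 0:
        raise ValueError("k must be >= 1 and temperature must be > 0")
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return {i: e / total for i, e in zip(top, exps)}


def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Draw one token index from the top-k distribution."""
    probs = top_k_probabilities(logits, k, temperature)
    indices = list(probs)
    weights = [probs[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
default_probs = top_k_probabilities(toy_logits, k=2)
sharper_probs = top_k_probabilities(toy_logits, k=2, temperature=0.5)
```

A smaller `top_k` removes unlikely tokens from consideration, and a temperature below 1 concentrates probability on the most likely token, which tends to make sampled text more conservative.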

In [None]:
# coding your answer here.

> Ans: *double click here to answer the question.*