{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "51NJxmTLJqC4"
      },
      "source": [
        "<table align=\"left\">\n",
        "  <td>\n",
        "    <a href=\"https://colab.research.google.com/github/phonchi/Fundamental-tools-for-data-science/blob/master/Colab_tutorial.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://kaggle.com/kernels/welcome?src=https://github.com/phonchi/Fundamental-tools-for-data-science/blob/main/Colab_tutorial.ipynb\"><img src=\"https://kaggle.com/static/images/open-in-kaggle.svg\" /></a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://github.com/phonchi/Fundamental-tools-for-data-science/blob/main/Colab_tutorial.ipynb\">\n",
        "      <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\"><br> View on GitHub\n",
        "    </a>\n",
        "  </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "QsA8y9eTJqC5"
      },
      "outputs": [],
      "source": [
        "import multiprocessing\n",
        "cores = multiprocessing.cpu_count() # Number of logical CPU cores on this machine\n",
        "cores"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "_PRvLbd57nPi"
      },
      "outputs": [],
      "source": [
        "from psutil import virtual_memory\n",
        "ram_gb = virtual_memory().total / 1e9\n",
        "print('Your runtime has {:.1f} gigabytes of available RAM\\n'.format(ram_gb))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "7b5JDtlj7h55"
      },
      "outputs": [],
      "source": [
        "import sys\n",
        "# Is this notebook running on Colab or Kaggle?\n",
        "IS_COLAB = \"google.colab\" in sys.modules\n",
        "IS_KAGGLE = \"kaggle_secrets\" in sys.modules\n",
        "\n",
        "gpu_info = !nvidia-smi\n",
        "gpu_info = '\\n'.join(gpu_info)\n",
        "if gpu_info.find('failed') >= 0:\n",
        "  if IS_COLAB:\n",
        "    print(\"Go to Runtime > Change runtime type and select a GPU hardware accelerator.\")\n",
        "  if IS_KAGGLE:\n",
        "    print(\"Go to Settings > Accelerator and select GPU.\")\n",
        "else:\n",
        "  from tensorflow.python.client import device_lib\n",
        "  print(device_lib.list_local_devices())"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "MUDfkpqbGLrf"
      },
      "outputs": [],
      "source": [
        "!nvidia-smi -L"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "CkIBYM5VN1XI"
      },
      "source": [
        "Rough GPU ranking by performance: A100 > V100 > L4 > P100 ~ T4 > K80. On the free tier of Colab you will get a T4 most of the time."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ltYd8to5y_bW"
      },
      "source": [
        "# Working with Bash"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "dMRLJpZvzCVW"
      },
      "outputs": [],
      "source": [
        "!pip install colab-xterm -qq\n",
        "%load_ext colabxterm"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "tFFi3r6PzDwm"
      },
      "outputs": [],
      "source": [
        "%xterm height=500"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Wej_mEyXQSHc"
      },
      "source": [
        "## Bash commands\n",
        "\n",
        "Jupyter includes shortcuts for common operations, such as ls:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "5OCYEvK5QSHf"
      },
      "outputs": [],
      "source": [
        "!ls /bin"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MtDXTg-LYIqv"
      },
      "source": [
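        "`!` runs a command in a shell (in a new process), while `%` runs a magic that affects the process associated with the notebook. This is why `!cd sample_data` has no lasting effect on the working directory, whereas `%cd sample_data` does.\n",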
        "\n",
        "<!--- https://stackoverflow.com/questions/45784499/what-is-the-difference-between-and-in-jupyter-notebooks\n",
        "-->"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "D5Z5P-9U2dvN"
      },
      "outputs": [],
      "source": [
        "!cd sample_data"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "7YryRSj5Y4MW"
      },
      "outputs": [],
      "source": [
        "%pwd"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "kUW2l8k3X4aG"
      },
      "outputs": [],
      "source": [
        "%cd sample_data"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "y8Da6JWKQSHh"
      },
      "source": [
        "That `!ls` probably generated a large output. You can select the cell and clear the output by either:\n",
        "\n",
        "1. Clicking on the clear output button (x) in the toolbar above the cell; or\n",
        "2. Right-clicking the left gutter of the output area and selecting \"Clear output\" from the context menu."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qM4myQGfQboQ"
      },
      "source": [
        "## Magics\n",
        "Colaboratory shares the notion of magics from Jupyter. There are shorthand annotations that change how a cell's text is executed. To learn more, see [Jupyter's magics page](https://ipython.readthedocs.io/en/stable/interactive/magics.html).\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "P2kDUGLz55xz"
      },
      "outputs": [],
      "source": [
        "%load_ext autoreload\n",
        "%autoreload 2"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GOwlZRXEQSHZ"
      },
      "source": [
        "# Working with Python\n",
        "Colaboratory is built on top of [Jupyter Notebook](https://jupyter.org/). Below are some examples of convenience functions provided."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "91dd8dd8"
      },
      "source": [
        "### First few things to know"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "88e50605"
      },
      "source": [
        "Python is a programming language, while Jupyter Notebook and Google Colab are environments in which you develop and run your code.  In the **notebook environment**, there are two types of **cells**: **Markdown** (like this one) and **code** (cells whose code you can run).  To run a code cell, click on it and press `shift + enter` or `ctrl + enter`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "291093b8"
      },
      "outputs": [],
      "source": [
        "print(\"Press shift + enter to run the cell and move on to the next cell.\")\n",
        "print(\"Press ctrl + enter to run the cell and stay at the current cell.\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ae25bb37"
      },
      "outputs": [],
      "source": [
        "my_name_is_unusually_long = \"my-name\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fd02e89a"
      },
      "source": [
        "If you have a variable with a long name such as `my_name_is_unusually_long`, it is hard to remember and type accurately.  You may type `my_name` and then press `tab` to let the notebook environment **auto-complete** the name.  In Google Colab, instead of pressing `tab`, you may wait for the suggestion to appear or press `ctrl + space`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "a95a7e78"
      },
      "outputs": [],
      "source": [
        "# put your cursor at the end of my_name and wait\n",
        "my_name"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "65fc736c"
      },
      "source": [
        "When you encounter a new function, you may use `?` or `help` to read its documentation."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "a7b563a8"
      },
      "outputs": [],
      "source": [
        "a = \"nsysu\"\n",
        "a.upper?"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "SSTcw_Hw055Y"
      },
      "outputs": [],
      "source": [
        "help(a.upper)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4e0af8de"
      },
      "source": [
        "Whenever you encounter any difficulty or error message, you are encouraged to **ask** [ChatGPT](https://chatgpt.com/)/[Gemini](https://gemini.google.com/)/[Copilot](https://copilot.microsoft.com/).  "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "d1b6f138"
      },
      "source": [
        "### Assign and print"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aa84f745"
      },
      "source": [
        "Use `=` to assign the value on the right-hand side to the variable on the left-hand side.  \n",
        "Then use `print` to print the variable."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "dd98010e"
      },
      "outputs": [],
      "source": [
        "greeting = \"hello\"\n",
        "print(greeting)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "66e8dc19"
      },
      "outputs": [],
      "source": [
        "greeting = greeting + \", how are you?\" # \"+\" concatenates strings\n",
        "print(greeting)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "796f942e"
      },
      "source": [
        "By default, the notebook environment displays the representation of the expression on the last line of a cell.  Expressions on earlier lines are not displayed."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "357286b9"
      },
      "outputs": [],
      "source": [
        "reply = \"fantastic, how about you?\"\n",
        "greeting\n",
        "reply"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "52cd714e"
      },
      "source": [
        "### Data type"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4fba8e28"
      },
      "source": [
        "A data type tells Python how to handle your object.  For example, you may ask Python to upper-case a string, but you cannot upper-case an integer."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "5d730835"
      },
      "outputs": [],
      "source": [
        "a = 235 # integer\n",
        "b = \"two three five\" # string\n",
        "c = True # boolean\n",
        "d = (2, 3, 5) # tuple\n",
        "e = [2, 3, 5] # list\n",
        "f = {\"two\": 2, \"three\": 3, \"five\": 5} # dictionary\n",
        "type(f)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "bb9cc5c4"
      },
      "outputs": [],
      "source": [
        "a.upper()  # raises AttributeError: an int has no upper() method"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "85f29af8"
      },
      "outputs": [],
      "source": [
        "b.upper()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0e1b587a"
      },
      "source": [
        "### String"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "88933773"
      },
      "source": [
        "Python offers different ways to write a string literal."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "e9ded216"
      },
      "outputs": [],
      "source": [
        "a = 'The book \"1984\" is awesome.\\nI enjoyed reading it.'\n",
        "b = \"The book '1984' is awesome.\\nI enjoyed reading it.\"\n",
        "print(a)\n",
        "print(b)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9c2c9861"
      },
      "source": [
        "To avoid a long line of input, you may wrap adjacent string literals in parentheses.  Python joins them into a single string, which keeps your code readable."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6a1d04a8"
      },
      "outputs": [],
      "source": [
        "c = (\"The book '1984' is awesome.\\n\"\n",
        "     \"I enjoyed reading it.\")\n",
        "print(c)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3060c778"
      },
      "source": [
        "The symbol `\\n` is an **escape sequence** that stands for \"new line\".  Prefix the string with `r`, as in `r\"...\"` or `r'...'`, to create a raw string in which backslashes are kept literally instead of being interpreted as escape sequences.  This is particularly useful when your string contains code from another language, such as a regular expression.  "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "9baae238"
      },
      "outputs": [],
      "source": [
        "d = r\"The book '1984' is awesome.\\nI enjoyed reading it.\"\n",
        "print(d)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "F2__OryZ-NGw"
      },
      "source": [
        "##### > Exercise: Please use AI to list some common escape characters in Python and give corresponding examples."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eb59076b"
      },
      "source": [
        "To preserve all the line breaks and spacing exactly as typed, use triple quotes."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "e606422e"
      },
      "outputs": [],
      "source": [
        "e = \"\"\"The book '1984' is   awesome.\n",
        "I enjoyed reading it.\"\"\"\n",
        "print(e)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "431c2e78"
      },
      "source": [
        "Strings come with several useful methods.  \n",
        "\n",
        "- `split` : split the string into a list of words\n",
        "- `join` : merge several strings together\n",
        "- `upper` , `lower` , `capitalize` : make letters upper-case, lower-case, or capitalize the sentence\n",
        "- `startswith` , `endswith` : check if the string starts or ends with some letters\n",
        "- `isascii` : check if the string is ASCII only"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "1de9399e"
      },
      "outputs": [],
      "source": [
        "b = \"The book '1984' is awesome.\\nI enjoyed reading it.\"\n",
        "b.split()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0da52c61"
      },
      "outputs": [],
      "source": [
        "b = \"The book '1984' is awesome.\\nI enjoyed reading it.\"\n",
        "words = b.split()\n",
        "'_'.join(words)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "e664630a"
      },
      "outputs": [],
      "source": [
        "greeting = \"hello, How are you?\"\n",
        "print(greeting.upper())\n",
        "print(greeting.lower())\n",
        "print(greeting.capitalize())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ac3ffb4d"
      },
      "source": [
        "Use an f-string to embed variables in your string."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "281820e2",
        "scrolled": true
      },
      "outputs": [],
      "source": [
        "book_name = \"1984\"\n",
        "description = \"awesome\"\n",
        "text = f\"The book '{book_name}' is {description}.\\nI enjoyed reading it.\"\n",
        "print(text)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "df81a24a"
      },
      "source": [
        "### `for` loop and `if` statement"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "f21b68c3"
      },
      "source": [
        "A `for` loop lets you repeat [routine work](https://pythontutor.com/render.html#code=for%20noun%20in%20%5B'apple',%20'balls',%20'cats',%20'dollar',%20'engine',%20'formulae'%5D%3A%0A%20%20%20%20if%20noun.endswith%28's'%29%3A%0A%20%20%20%20%20%20%20%20print%28f%22%7Bnoun%7D%20is%20plural.%22%29%0A%20%20%20%20else%3A%0A%20%20%20%20%20%20%20%20print%28f%22%7Bnoun%7D%20is%20singular.%22%29&cumulative=false&curInstr=0&heapPrimitives=true&mode=display&origin=opt-frontend.js&py=311&rawInputLstJSON=%5B%5D&textReferences=false) for each item in a collection.  An `if` statement gives you the flexibility to do different things under different conditions.   "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "68699737"
      },
      "outputs": [],
      "source": [
        "for noun in ['apple', 'balls', 'cats', 'dollar', 'engine', 'formulae']:\n",
        "    if noun.endswith('s'):\n",
        "        print(f\"{noun} is plural.\")\n",
        "    else:\n",
        "        print(f\"{noun} is singular.\")\n",
        "\n",
        "# Grammar is not a set of rules; it is something inherent in the language, and\n",
        "# language cannot exist without it. It can be discovered, but not invented.\n",
        "#                                                             --- Charlton Laird"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "839b69e6"
      },
      "source": [
        "The one-line `if`...`else` expression also provides an easy way to set a value based on a condition."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0ac6e6d5"
      },
      "outputs": [],
      "source": [
        "temperature = 20\n",
        "weather = \"hot\" if temperature > 26 else \"cool\"\n",
        "weather"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6b7b0fcd"
      },
      "source": [
        "### 📘 Example program: Discover word relations"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ab5a9eb2"
      },
      "source": [
        "[Pointwise Mutual Information (PMI)](https://en.wikipedia.org/wiki/Pointwise_mutual_information) is a statistical measure used to evaluate the association between two words or events, denoted as $x$ and $y$. The PMI formula is expressed as follows:\n",
        "\n",
        "$$\n",
        "\\operatorname{pmi}(x,y) = \\log_2\\frac{p(x,y)}{p(x)p(y)} = \\log_2\\frac{p(x\\mid y)}{p(x)} = \\log_2\\frac{p(y\\mid x)}{p(y)}\n",
        "$$\n",
        "\n",
        "**Intuition**: PMI calculates the ratio between the joint probability of $x$ and $y$ occurring together and the product of their individual probabilities. Essentially, it measures how much the actual co-occurrence of $x$ and $y$ deviates from what would be expected if they were independent. The logarithm base 2 transforms the ratio such that:\n",
        "- A ratio of 1 (indicating independence) is mapped to 0.\n",
        "- A ratio greater than 1 (indicating positive association) is mapped to a positive value.\n",
        "- A ratio less than 1 (indicating negative association) is mapped to a negative value.\n",
        "\n",
        "**Interpretation**:\n",
        "- **Zero PMI** indicates that the events $x$ and $y$ are **independent**. For example, the words \"city\" and \"full\" typically do not depend on each other.\n",
        "- **Positive PMI** suggests that the events are more likely to occur together than independently. For example, \"New\" and \"York\" often appear together and thus have a positive PMI.\n",
        "- **Negative PMI** implies that the occurrence of one event reduces the likelihood of the other occurring, suggesting a negative association. For example, \"quiet\" and \"bustling\" may have a negative PMI.\n",
        "\n",
        "Using PMI can be particularly useful in natural language processing and information retrieval to identify words that frequently co-occur in meaningful ways, thus revealing patterns in language use."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "c041f5e9"
      },
      "outputs": [],
      "source": [
        "# generated by ChatGPT\n",
        "new_york = (\"New York is a bustling city in the United States. \"\n",
        "        \"It is often called 'The Big Apple' and is full of excitement and opportunities. \"\n",
        "        \"With its towering skyscrapers and busy streets, New York is a hub of new ideas and innovation. \"\n",
        "        \"People from all over the world come to New York to chase their dreams and seek new adventures. \"\n",
        "        \"The city never sleeps, and there is always something happening, from concerts and Broadway shows to festivals and parades. \"\n",
        "        \"The streets are lined with shops, restaurants, and iconic landmarks like the Statue of Liberty and Times Square. \"\n",
        "        \"New York is a melting pot of cultures, where you can taste new foods, hear new languages, and experience a vibrant mix of traditions. \"\n",
        "        \"It's a city where old meets new, where history and modernity blend together, creating a unique and exciting atmosphere that captures the spirit of New York.\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "1421ee75"
      },
      "outputs": [],
      "source": [
        "new_york = new_york.lower()\n",
        "\n",
        "punctuations = r\"\"\"!()-[]{};:'\"\\,<>./?@#$%^&*_~\"\"\"\n",
        "for p in punctuations:\n",
        "    new_york = new_york.replace(p, \"\")\n",
        "\n",
        "new_york"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "f824ff19",
        "scrolled": false
      },
      "outputs": [],
      "source": [
        "tokens = []\n",
        "for word in new_york.split():\n",
        "    if word not in ['a', 'is', 'with', 'and']: # add more\n",
        "        tokens.append(word)\n",
        "\n",
        "tokens"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "a6642101"
      },
      "outputs": [],
      "source": [
        "from math import log2\n",
        "\n",
        "count_new = 0\n",
        "count_york = 0\n",
        "count_new_york = 0\n",
        "\n",
        "for i,tok in enumerate(tokens):\n",
        "    if tok == \"new\":\n",
        "        count_new += 1\n",
        "        try:\n",
        "            next_tok = tokens[i+1]\n",
        "            if next_tok == \"york\":\n",
        "                count_new_york += 1\n",
        "        except IndexError:\n",
        "            pass\n",
        "    if tok == \"york\":\n",
        "        count_york += 1\n",
        "\n",
        "p_new = count_new / len(tokens)\n",
        "p_york = count_york / len(tokens)\n",
        "p_new_york = count_new_york / len(tokens)\n",
        "pmi = log2(p_new_york / (p_new * p_york))\n",
        "print(\"p_new =\", p_new)\n",
        "print(\"p_york =\", p_york)\n",
        "print(\"p_new_york =\", p_new_york)\n",
        "print(\"pmi =\", pmi)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "xHeVczbbqF1l"
      },
      "outputs": [],
      "source": [
        "from math import log2\n",
        "from collections import Counter\n",
        "\n",
        "# Use Counter to count word occurrences, which keeps the code concise\n",
        "#tokens = [\"new\", \"york\", \"new\", \"york\", \"city\"]  # hypothetical tokens list; replace with the real tokens in practice\n",
        "counts = Counter(tokens)\n",
        "\n",
        "# Count the occurrences of \"new\" and \"york\"\n",
        "count_new = counts[\"new\"]\n",
        "count_york = counts[\"york\"]\n",
        "\n",
        "# Count the occurrences of the phrase \"new york\"\n",
        "count_new_york = sum(1 for i in range(len(tokens) - 1) if tokens[i] == \"new\" and tokens[i + 1] == \"york\")\n",
        "\n",
        "# Compute the probabilities of \"new\", \"york\", and \"new york\"\n",
        "total_tokens = len(tokens)\n",
        "p_new = count_new / total_tokens if total_tokens > 0 else 0\n",
        "p_york = count_york / total_tokens if total_tokens > 0 else 0\n",
        "p_new_york = count_new_york / total_tokens if total_tokens > 0 else 0\n",
        "\n",
        "# Guard against division by zero and log2(0), either of which would raise an error\n",
        "pmi = log2(p_new_york / (p_new * p_york)) if p_new > 0 and p_york > 0 and p_new_york > 0 else float('-inf')\n",
        "\n",
        "# Print the results\n",
        "print(\"p_new =\", p_new)\n",
        "print(\"p_york =\", p_york)\n",
        "print(\"p_new_york =\", p_new_york)\n",
        "print(\"pmi =\", pmi)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3lCZ4oQxZ6q0"
      },
      "source": [
        "### Integration with bash commands"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fUyg8ZTRZ1pc"
      },
      "source": [
        "You can execute any other process using `!`, with Python variables interpolated into the command via `{name}`. The output can also be assigned to a variable:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ndmJtQ7m2tsL"
      },
      "outputs": [],
      "source": [
        "message = 'Colaboratory is great!'"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "kXwii9694JHs"
      },
      "outputs": [],
      "source": [
        "!echo '{message}\\n'"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "both",
        "id": "zqGrv0blQSHj"
      },
      "outputs": [],
      "source": [
        "foo = !echo '{message}\\n'\n",
        "foo"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Cg2X9T3uMW1c"
      },
      "outputs": [],
      "source": [
        "!mkdir test"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Pg_Et6ygMVLq"
      },
      "outputs": [],
      "source": [
        "OUT_DIR = './test'\n",
        "!rm -rf {OUT_DIR}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aro-UJgUQSH1"
      },
      "source": [
        "# Integration with Drive\n",
        "\n",
        "Colaboratory is integrated with Google Drive. It allows you to share, comment, and collaborate on the same document with multiple people:\n",
        "\n",
        "* The **SHARE** button (top-right of the toolbar) allows you to share the notebook and control permissions set on it.\n",
        "\n",
        "* **File->Make a Copy** creates a copy of the notebook in Drive.\n",
        "\n",
        "* **File->Save** saves the file to Drive. **File->Save and checkpoint** pins the version so it doesn't get deleted from the revision history.\n",
        "\n",
        "* **File->Revision history** shows the notebook's revision history."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BaCkyg5CV5jF"
      },
      "source": [
        "## Uploading files from your local file system\n",
        "\n",
        "`files.upload` returns a dictionary of the files which were uploaded.\n",
        "The dictionary is keyed by the file name and values are the data which were uploaded."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "vz-jH8T_Uk2c"
      },
      "outputs": [],
      "source": [
        "from google.colab import files\n",
        "\n",
        "uploaded = files.upload()\n",
        "\n",
        "for fn in uploaded.keys():\n",
        "  print('User uploaded file \"{name}\" with length {length} bytes'.format(\n",
        "      name=fn, length=len(uploaded[fn])))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AUEDkNHCYsXW"
      },
      "source": [
        "Files are temporarily stored, and will be removed once you end your session.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hauvGV4hV-Mh"
      },
      "source": [
        "## Downloading files to your local file system\n",
        "\n",
        "`files.download` will invoke a browser download of the file to your local computer.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "p2E4EKhCWEC5"
      },
      "outputs": [],
      "source": [
        "from google.colab import files\n",
        "\n",
        "with open('example.txt', 'w') as f:\n",
        "  f.write('some content')\n",
        "\n",
        "files.download('example.txt')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "u22w3BFiOveA"
      },
      "source": [
        "## Mounting Google Drive locally\n",
        "\n",
        "The example below shows how to mount your Google Drive on your runtime using an authorization code, and how to write and read files there. Once executed, you will be able to see the new file (`foo.txt`) at [https://drive.google.com/](https://drive.google.com/).\n",
        "\n",
        "This only supports reading, writing, and moving files; to programmatically modify sharing settings or other metadata, use one of the other options below.\n",
        "\n",
        "**Note:** When using the 'Mount Drive' button in the file browser, no authentication codes are necessary for notebooks that have only been edited by the current user."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "-92G1g2zLU4C"
      },
      "outputs": [],
      "source": [
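        "from google.colab import drive\n",
        "drive.mount('/content/drive')  # mount Drive; asks for authorization on first use\n",
        "!ls /content/drive"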
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "XDg9OBaYqRMd"
      },
      "outputs": [],
      "source": [
        "with open('/content/drive/My Drive/foo.txt', 'w') as f:\n",
        "  f.write('Hello Google Drive!')\n",
        "!cat /content/drive/My\\ Drive/foo.txt"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "D78AM1fFt2ty"
      },
      "outputs": [],
      "source": [
        "#drive.flush_and_unmount()\n",
        "#print('All changes made in this colab session should now be visible in Drive.')"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "MkFWrbWcI49k"
      },
      "outputs": [],
      "source": [
        "#https://drive.google.com/file/d/1lmFNy9xgi1AqmH9b2A9O-79E0cCCaj5X/view?usp=sharing\n",
        "!gdown 1lmFNy9xgi1AqmH9b2A9O-79E0cCCaj5X"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "5NMY54_CI9Xk"
      },
      "outputs": [],
      "source": [
        "from nsysu import hello"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "wJ69ExcgKU0j"
      },
      "outputs": [],
      "source": [
        "hello()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Vb9u0d1n5To7"
      },
      "source": [
        "## Loading Public Notebooks Directly from GitHub\n",
        "\n",
        "Colab can load public GitHub notebooks directly, with no authorization step required.\n",
        "\n",
        "For example, consider the notebook at this address: <https://github.com/googlecolab/colabtools/blob/master/notebooks/colab-github-demo.ipynb>\n",
        "\n",
        "The direct Colab link to this notebook is: <https://colab.research.google.com/github/googlecolab/colabtools/blob/master/notebooks/colab-github-demo.ipynb>\n",
        "\n",
        "To generate such links in one click, you can use the [Open in Colab](https://chrome.google.com/webstore/detail/open-in-colab/iogfkhleblhcpcekbiedikdehleodpjo) Chrome extension."
      ]
    },
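    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The link pattern above can be sketched in a few lines of Python. This is only an illustration: `github_to_colab` is a hypothetical helper name, not part of Colab, and it assumes the input is a `https://github.com/...` notebook URL like the one in the example."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Hypothetical helper: rewrite a GitHub notebook URL into the matching\n",
        "# Colab URL by swapping the host prefix.\n",
        "GITHUB_PREFIX = 'https://github.com/'\n",
        "COLAB_PREFIX = 'https://colab.research.google.com/github/'\n",
        "\n",
        "\n",
        "def github_to_colab(url):\n",
        "    if not url.startswith(GITHUB_PREFIX):\n",
        "        raise ValueError('expected a URL starting with ' + GITHUB_PREFIX)\n",
        "    return COLAB_PREFIX + url[len(GITHUB_PREFIX):]\n",
        "\n",
        "\n",
        "github_to_colab('https://github.com/googlecolab/colabtools/blob/master/notebooks/colab-github-demo.ipynb')"
      ]
    },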
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "W-SHKQp9jjNN"
      },
      "source": [
        "# Running Flask or other web apps"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "e2YJgdZ7jp-V"
      },
      "outputs": [],
      "source": [
        "!pip install flask -qq\n",
        "!pip install pyngrok -qq"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "NZjucMdBkVVE"
      },
      "outputs": [],
      "source": [
        "from pyngrok import ngrok, conf\n",
        "import getpass"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "UILplti7kWWk"
      },
      "outputs": [],
      "source": [
        "print(\"Enter your authtoken, which can be copied from https://dashboard.ngrok.com/\")\n",
        "conf.get_default().auth_token = getpass.getpass()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "nbmySDrBkhyk"
      },
      "outputs": [],
      "source": [
        "# Set up an ngrok tunnel to port 8050, where the Flask app below listens\n",
        "public_url = ngrok.connect(8050)\n",
        "public_url"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "lx1bbIubjvvO"
      },
      "outputs": [],
      "source": [
        "from flask import Flask\n",
        "\n",
        "app = Flask(__name__)\n",
        "\n",
        "@app.route('/')\n",
        "def hello():\n",
        "    return 'Hello NSYSU!'\n",
        "\n",
        "if __name__ == '__main__':\n",
        "    app.run(port=8050)"
      ]
    },
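    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "`app.run` blocks its cell until it is interrupted, so no other cell can run while the server is up. One common workaround, sketched here under assumptions, is to start the development server in a background thread. The names `demo_app` and `demo_index` are hypothetical, and the sketch assumes port 8050 is free, so interrupt the blocking cell first."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import threading\n",
        "import time\n",
        "import urllib.request\n",
        "\n",
        "from flask import Flask\n",
        "\n",
        "demo_app = Flask(__name__)  # hypothetical app, used only for this sketch\n",
        "\n",
        "\n",
        "@demo_app.route('/')\n",
        "def demo_index():\n",
        "    return 'Hello from a background thread!'\n",
        "\n",
        "\n",
        "# Run the development server in a daemon thread so the notebook stays usable.\n",
        "# use_reloader=False because the reloader only works in the main thread.\n",
        "server = threading.Thread(\n",
        "    target=lambda: demo_app.run(port=8050, use_reloader=False),\n",
        "    daemon=True,\n",
        ")\n",
        "server.start()\n",
        "\n",
        "time.sleep(1)  # give the server a moment to start listening\n",
        "print(urllib.request.urlopen('http://127.0.0.1:8050/').read().decode())"
      ]
    },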
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dnFEHWflQNt-"
      },
      "source": [
        "# Downloading data from Kaggle\n",
        "\n",
        "The Dogs vs. Cats dataset that we will use isn’t packaged with Keras. It was made available by Kaggle as part of a computer vision competition in late 2013, back when convnets weren’t mainstream. You can download the original dataset from [www.kaggle.com/c/dogs-vs-cats/data](https://www.kaggle.com/c/dogs-vs-cats/data).\n",
        "\n",
        "You can also use the Kaggle API. First, create a Kaggle API key and download it to your local machine: open the Kaggle website in a web browser, log in, and go to the My Account page. In your account settings, you’ll find an API section. Clicking the Create New API Token button generates a `kaggle.json` key file and downloads it to your machine."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "HCmvieoXQMmt"
      },
      "outputs": [],
      "source": [
        "# Upload the API key file (kaggle.json) to your Colab session.\n",
        "# Running this cell shows a file picker in the cell output.\n",
        "from google.colab import files\n",
        "files.upload()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zArXzoEhRy2V"
      },
      "source": [
        "Next, create a `~/.kaggle` folder and copy the key file into it. As a security best practice, also make sure that the file is readable only by the current user (yourself):"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "3p_FvdDcSC5I"
      },
      "outputs": [],
      "source": [
        "# -p avoids an error if ~/.kaggle already exists (for example when re-running the cell)\n",
        "!mkdir -p ~/.kaggle\n",
        "!cp kaggle.json ~/.kaggle/\n",
        "!chmod 600 ~/.kaggle/kaggle.json"
      ]
    },
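    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The same three shell steps can also be written in plain Python, which is easier to reuse in a script. This is a sketch under assumptions: `install_kaggle_key` is a hypothetical helper name, and it expects `kaggle.json` in the current working directory, as the `cp` command above does."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import os\n",
        "import shutil\n",
        "from pathlib import Path\n",
        "\n",
        "\n",
        "def install_kaggle_key(src='kaggle.json', dest_dir=None):\n",
        "    # Hypothetical helper: mirrors the mkdir, cp, and chmod 600 shell steps.\n",
        "    dest_dir = Path(dest_dir) if dest_dir else Path.home() / '.kaggle'\n",
        "    dest_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists\n",
        "    dest = dest_dir / 'kaggle.json'\n",
        "    shutil.copyfile(src, dest)\n",
        "    os.chmod(dest, 0o600)  # read and write for the owner only\n",
        "    return dest\n",
        "\n",
        "\n",
        "if Path('kaggle.json').exists():\n",
        "    print(install_kaggle_key())\n",
        "else:\n",
        "    print('kaggle.json not found; upload it first.')"
      ]
    },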
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Ncja_yEuSTBd"
      },
      "outputs": [],
      "source": [
        "# You can now download the data we’re about to use:\n",
        "!kaggle competitions download -c dogs-vs-cats"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jdoal-4GSeN3"
      },
      "source": [
        "The first time you try to download the data, you may get a “403 Forbidden” error. That’s because you need to accept the terms associated with the dataset before you download it: go to [www.kaggle.com/c/dogs-vs-cats/rules](https://www.kaggle.com/c/dogs-vs-cats/rules) (while logged into your Kaggle account) and click the I Understand and Accept button. You only need to do this once.\n",
        "\n",
        "For more information, see the [Kaggle API](https://github.com/Kaggle/kaggle-api) repository."
      ]
    },
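    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A competition download normally arrives as a zip archive in the working directory. A minimal sketch for unpacking it with the standard library follows. The archive name `dogs-vs-cats.zip` is an assumption (check the actual name with `!ls`), and the target folder `dogs-vs-cats` is a hypothetical choice."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import zipfile\n",
        "from pathlib import Path\n",
        "\n",
        "archive = Path('dogs-vs-cats.zip')  # assumed file name; verify with !ls\n",
        "target = Path('dogs-vs-cats')       # hypothetical output folder\n",
        "\n",
        "if archive.exists():\n",
        "    with zipfile.ZipFile(archive) as zf:\n",
        "        zf.extractall(target)\n",
        "    # List what was extracted at the top level of the target folder.\n",
        "    print(sorted(p.name for p in target.iterdir()))\n",
        "else:\n",
        "    print(f'{archive} not found; run the download cell first.')"
      ]
    },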
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "W-Iw6QDGy4qN"
      },
      "outputs": [],
      "source": [
        "# Regular (non-competition) datasets are addressed as <owner>/<dataset-name>\n",
        "!kaggle datasets download -d bricevergnou/spotify-recommendation"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "rq1PiIP9x-rX"
      },
      "outputs": [],
      "source": [
        "import kagglehub\n",
        "\n",
        "# Download the latest version; the call returns the local path of the files.\n",
        "path = kagglehub.dataset_download('bricevergnou/spotify-recommendation')\n",
        "print('Path to dataset files:', path)"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "provenance": []
    },
    "gpuClass": "standard",
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}