Appendix C — Introduction to Kaggle
C.1 Setup
You can look up the available resources first:
import sys

# Is this notebook running on Colab or Kaggle?
IS_COLAB = "google.colab" in sys.modules
IS_KAGGLE = "kaggle_secrets" in sys.modules

gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
    if IS_COLAB:
        print("Go to Runtime > Change runtime type and select a GPU hardware accelerator.")
    if IS_KAGGLE:
        print("Go to Settings > Accelerator and select GPU.")
else:
    from tensorflow.python.client import device_lib
    print(device_lib.list_local_devices())
In terms of performance, V100 > P100 > T4 > K80 (but most of the time you get a K80 or T4 on the free Colab tier).
C.2 Working with Bash
Kaggle provides a simple bash console below the notebook editor.
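The commands you would type in that console can also be run from Python. A minimal, portable sketch using the standard subprocess module (echo stands in for any Kaggle-specific command):

```python
import subprocess

# Run a shell command and capture its output, much like typing it
# into the Kaggle console (echo stands in for a real command).
result = subprocess.run(["echo", "hello from the console"],
                        capture_output=True, text=True)
print(result.stdout.strip())
```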
C.3 Working with Python
Kaggle kernels are built on top of Jupyter Notebook. Below are some examples of the convenience functions they provide.
C.3.1 System aliases
Jupyter includes shortcuts for common operations, such as ls. A leading ! calls out to a shell (in a new process), while % affects the process associated with the notebook itself. You can execute any other process using ! with string interpolation from Python variables, and the result can be assigned to a variable:
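A minimal sketch (IPython syntax, so it runs only inside a notebook cell, not in plain Python):

```python
message = "a string"
# ! calls out to a shell; {message} interpolates the Python variable,
# and the captured output lines are assigned back to a Python variable.
lines = !echo {message}
print(lines)   # a list-like object of output lines
```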
C.3.2 Magics
Kaggle shares the notion of magics from Jupyter. These are shorthand annotations that change how a cell’s text is executed. To learn more, see Jupyter’s magics page.
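For example, two commonly used magics (again IPython syntax, notebook-only):

```python
%timeit sum(range(1000))   # line magic: time a single statement
%lsmagic                   # list every available magic
```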
C.3.3 Automatic completions and exploring code
The notebook editor provides automatic completions to explore attributes of Python objects, as well as to quickly view documentation strings. As an example, first run the following cell to import the numpy module.
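The import cell referred to above is simply:

```python
import numpy as np
```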
If you now insert your cursor after np and press Period (.), you will see the list of available completions within the np module. If you type an open parenthesis after any function or class in the module, you will see a pop-up of its documentation string.
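The same documentation string shown in the pop-up can also be read programmatically, for example:

```python
import numpy as np

# The pop-up shows the function's docstring; you can also access it directly.
doc = np.linspace.__doc__
print(doc[:80])
```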
C.4 Adding Data Sources
One of the advantages of using Notebooks as your data science workbench is that you can easily add data sources from thousands of publicly available Datasets, or even upload your own. You can also use the output files from another Notebook as a data source. You can add multiple data sources to your Notebook’s environment, allowing you to join together interesting datasets.
Navigate to the “Data” pane in a Notebook editor and click the “Add Data” button. This will open a modal that lets you select Datasets to add to your Notebook. The input data will be stored in the /kaggle/input/ directory, and the output data will be stored in the /kaggle/working/ directory. You can also use the kaggle datasets download command to download a dataset to your Notebook’s environment. For more information, see the Kaggle Datasets documentation.
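Once added, datasets can be enumerated from code. A small sketch (the /kaggle/input directory exists only inside a Kaggle session, so the list is simply empty elsewhere):

```python
import os

input_root = "/kaggle/input"   # where Kaggle mounts attached datasets
files = [os.path.join(dirname, name)
         for dirname, _, filenames in os.walk(input_root)
         for name in filenames]
print(files)
```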
You will notice that there is a third option in the “Add Data” modal: Notebook Output Files. Up to 20 GB of output from a Notebook may be saved to disk in /kaggle/working. This data is saved automatically, and you can reuse it in any future Notebook: just navigate to the “Data” pane in a Notebook editor, click “Add Data”, click the “Notebook Output Files” tab, find a Notebook of interest, and click to add it to your current Notebook. By chaining Notebooks as data sources in this way, it is possible to build pipelines and generate more and better content than you could in a single notebook alone. If you need additional temporary scratch space, consider saving to /kaggle/tmp/, where the limits are much more generous. Note that the /kaggle/tmp/ directory is not guaranteed to persist between Notebook runs. For more information, see the Kaggle Notebooks documentation.
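As a sketch of the output side, anything written under /kaggle/working is saved with the notebook version. The filename results.json below is just an illustration, and the code falls back to the current directory when not running on Kaggle:

```python
import json
import os

# /kaggle/working exists only inside a Kaggle session; fall back locally.
out_dir = "/kaggle/working" if os.path.isdir("/kaggle/working") else "."
out_path = os.path.join(out_dir, "results.json")  # hypothetical output file
with open(out_path, "w") as f:
    json.dump({"note": "example output"}, f)
print(out_path)
```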
Finally, you can add your own dataset by uploading your files.
C.4.1 Resources
Kaggle GPU resources:
- GPU: 16 GB NVIDIA Tesla P100
- Limited to 30+ hours/week, depending on usage
- Limited to 12 hours per run
C.4.2 Save version
You can run code in the background with Kaggle. First, make sure your code is bug-free: an error in any code block will stop the run early. Then click the “Save Version” button. A “Version” is a collection consisting of a notebook version, the output it generates, and the associated metadata about the environment. Two options are available:
- Quick Save: Skips the top-to-bottom notebook execution and just takes a snapshot of your notebook exactly as it’s displayed in the editor. This is a great option for taking a bunch of versions while you’re still actively experimenting. You can also choose whether to preserve the current output.
- Save & Run All: Creates a new session with a completely clean state and runs your notebook from top to bottom. In order to save successfully, the entire notebook must execute within 12 hours (9 hours for TPU notebooks). Save & Run All is identical to the “Commit” behavior.