Appendix C — Introduction to Kaggle
C.1 Setup
You can look up the available resources first:
import sys

# Is this notebook running on Colab or Kaggle?
IS_COLAB = "google.colab" in sys.modules
IS_KAGGLE = "kaggle_secrets" in sys.modules

gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
    if IS_COLAB:
        print("Go to Runtime > Change runtime type and select a GPU hardware accelerator.")
    if IS_KAGGLE:
        print("Go to Settings > Accelerator and select GPU.")
else:
    from tensorflow.python.client import device_lib
    print(device_lib.list_local_devices())
In terms of performance, V100 > P100 > T4 > K80 (but most of the time you get a K80 or T4 on the free Colab tier).
C.2 Working with Bash
Kaggle provides a simple bash console below the notebook editor.
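The commands you would type in that console can also be run from Python. A minimal, portable sketch using the standard subprocess module (echo stands in for any Kaggle-specific command):

```python
import subprocess

# Run a shell command and capture its output, much like typing it
# into the Kaggle console (echo stands in for a real command).
result = subprocess.run(["echo", "hello from the console"],
                        capture_output=True, text=True)
print(result.stdout.strip())
```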
C.3 Working with Python
Kaggle kernels are built on top of Jupyter Notebook. Below are some examples of the convenience functions they provide.
C.3.1 System aliases
Jupyter includes shortcuts for common operations, such as ls. A leading ! calls out to a shell (in a new process), while % affects the process associated with the notebook itself. You can execute any other process using ! with string interpolation from Python variables, and the result can be assigned to a variable:
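A minimal sketch (IPython syntax, so it runs only inside a notebook cell, not in plain Python):

```python
message = "a string"
# ! calls out to a shell; {message} interpolates the Python variable,
# and the captured output lines are assigned back to a Python variable.
lines = !echo {message}
print(lines)   # a list-like object of output lines
```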
C.3.2 Magics
Kaggle shares the notion of magics from Jupyter. These are shorthand annotations that change how a cell’s text is executed. To learn more, see Jupyter’s magics page.
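For example, two commonly used magics (again IPython syntax, notebook-only):

```python
%timeit sum(range(1000))   # line magic: time a single statement
%lsmagic                   # list every available magic
```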
C.3.3 Automatic completions and exploring code
The notebook editor provides automatic completions to explore attributes of Python objects, as well as to quickly view documentation strings. As an example, first run the following cell to import the numpy module.
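The import cell referred to above is simply:

```python
import numpy as np
```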
If you now insert your cursor after np and press Period (.), you will see the list of available completions within the np module. If you type an open parenthesis after any function or class in the module, you will see a pop-up of its documentation string.
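The same documentation string shown in the pop-up can also be read programmatically, for example:

```python
import numpy as np

# The pop-up shows the function's docstring; you can also access it directly.
doc = np.linspace.__doc__
print(doc[:80])
```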
C.4 Adding Data Sources
One of the advantages of using Notebooks as your data science workbench is that you can easily add data sources from thousands of publicly available Datasets, or even upload your own. You can also use the output files from another Notebook as a data source. You can add multiple data sources to your Notebook’s environment, allowing you to join together interesting datasets.
Navigate to the “Data” pane in a Notebook editor and click the “Add Data” button. This will open a modal that lets you select Datasets to add to your Notebook. The input data will be stored in the /kaggle/input/ directory, and the output data will be stored in the /kaggle/working/ directory. You can also use the kaggle datasets download command to download a dataset to your Notebook’s environment. For more information, see the Kaggle Datasets documentation.
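Once added, datasets can be enumerated from code. A small sketch (the /kaggle/input directory exists only inside a Kaggle session, so the list is simply empty elsewhere):

```python
import os

input_root = "/kaggle/input"   # where Kaggle mounts attached datasets
files = [os.path.join(dirname, name)
         for dirname, _, filenames in os.walk(input_root)
         for name in filenames]
print(files)
```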
You will notice that there is a third option in the “Add Data” modal: Notebook Output Files. Up to 20 GB of output from a Notebook may be saved to disk in /kaggle/working. This data is saved automatically, and you can reuse it in any future Notebook: just navigate to the “Data” pane in a Notebook editor, click “Add Data”, click the “Notebook Output Files” tab, find a Notebook of interest, and click to add it to your current Notebook. By chaining Notebooks as data sources in this way, it is possible to build pipelines and generate more and better content than you could in a single notebook alone. If you need additional temporary scratch space, consider saving to /kaggle/tmp/, where the limits are much more generous. Note that the /kaggle/tmp/ directory is not guaranteed to persist between Notebook runs. For more information, see the Kaggle Notebooks documentation.
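As a sketch of the output side, anything written under /kaggle/working is saved with the notebook version. The filename results.json below is just an illustration, and the code falls back to the current directory when not running on Kaggle:

```python
import json
import os

# /kaggle/working exists only inside a Kaggle session; fall back locally.
out_dir = "/kaggle/working" if os.path.isdir("/kaggle/working") else "."
out_path = os.path.join(out_dir, "results.json")  # hypothetical output file
with open(out_path, "w") as f:
    json.dump({"note": "example output"}, f)
print(out_path)
```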
Finally, you can add your own dataset by uploading your files.
C.4.1 Resources
Kaggle GPU resources:
- GPU: 16 GB NVIDIA Tesla P100
- Limited to 30+ hours/week, depending on usage
- Limited to 12 hours per run
C.4.2 Save version
You can run code in the background with Kaggle. First, make sure your code is bug-free: an error in any code block will stop the run early. Then click the “Save Version” button. A “Version” is a collection consisting of a notebook version, the output it generates, and the associated metadata about the environment. Two options are available:
- Quick Save: Skips the top-to-bottom notebook execution and just takes a snapshot of your notebook exactly as it’s displayed in the editor. This is a great option for taking a bunch of versions while you’re still actively experimenting. You can also choose whether to preserve the current output.
- Save & Run All: Creates a new session with a completely clean state and runs your notebook from top to bottom. In order to save successfully, the entire notebook must execute within 12 hours (9 hours for TPU notebooks). Save & Run All is identical to the “Commit” behavior.