3  Relational Databases and Data Wrangling

Author

phonchi

Published

March 6, 2023



3.1 Querying data with BigQuery

# For SQL
from google.cloud import bigquery

import numpy as np
import pandas as pd

import matplotlib as mpl
from matplotlib import pyplot as plt
%matplotlib inline

Structured Query Language, or SQL, is the programming language used with databases, and it is an important skill for any data scientist. In this example, you’ll build your SQL skills using BigQuery, a web service that works as a database management system and lets you apply SQL to huge datasets.

3.1.1 Preliminaries for Google Colab (optional)

We want to start exploring the Google BigQuery public datasets. Let’s start by walking through the required setup steps, and then we can load and explore some data.

If you are using Colab, follow this quickstart guide, which explains how to: 1. Create a Cloud Platform project if you don’t have one already. 2. Enable billing for the project (if you are on the free trial, you already satisfy this condition). 3. Enable the BigQuery API. 4. Enable the service account.

Now we need to authenticate to gain access to the BigQuery API. We will create a client, specifying the service account key file (replace ‘lunar-pact-378812-7a28b789bde2.json’ with your key file).

# @title Setup
from google.oauth2 import service_account

# TODO(developer): Set key_path to the path to the service account key file.

key_path = "lunar-pact-378812-7a28b789bde2.json"

credentials = service_account.Credentials.from_service_account_file(
    key_path
)
# @title Alternative Setup
#from google.colab import auth
#from google.cloud import bigquery
#from google.colab import data_table

#project = 'lunar-pact-378812' # Project ID inserted based on the query results selected to explore
#location = 'US' # Location inserted based on the query results selected to explore
#client = bigquery.Client(project=project, location=location)
#data_table.enable_dataframe_formatter()
#auth.authenticate_user()

Now that we’re authenticated, we need to load the BigQuery package, and the google.colab.data_table package that can be used to display large pandas DataFrames as interactive data tables. Loading data_table is optional, but it will be useful for working with data in pandas.

#%load_ext google.cloud.bigquery
#%load_ext google.colab.data_table
client = bigquery.Client(credentials=credentials, project=credentials.project_id,)

3.1.2 Create the reference

You can also work with Kaggle, which provides BigQuery integration so that you do not need to set up a Google Cloud account. Each Kaggle user can scan 5 TB every 30 days for free. Once you hit that limit, you’ll have to wait for it to reset. See https://www.kaggle.com/product-feedback/48573 for more details.

The first step in the workflow is to create a Client object. As you’ll soon see, this Client object will play a central role in retrieving information from BigQuery datasets.

# Uncomment below and create a "Client" object if you are using Kaggle
# client = bigquery.Client()

We’ll work with a dataset of posts on Hacker News, a website focusing on computer science and cybersecurity news. In BigQuery, each dataset is contained in a corresponding project. In this case, our hacker_news dataset is contained in the bigquery-public-data project.

To access the dataset, we begin by constructing a reference to the dataset with the dataset() method. Next, we use the get_dataset() method, along with the reference we just constructed, to fetch the dataset.

See the full list of public datasets or the kaggle bigquery dataset if you want to explore others.

# Construct a reference to the "hacker_news" dataset
dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

Every dataset is just a collection of tables. You can think of a dataset as a spreadsheet file containing multiple tables, all composed of rows and columns. We use the list_tables() method to list the tables in the dataset.

# List all the tables in the "hacker_news" dataset
tables = list(client.list_tables(dataset))

# Print names of all tables in the dataset (there are four!)
for table in tables:  
    print(table.table_id)
comments
full
full_201510
stories

Similar to how we fetched a dataset, we can fetch a table. In the code cell below, we fetch the full table in the hacker_news dataset.

# Construct a reference to the "full" table
table_ref = dataset_ref.table("full")

# API request - fetch the table
table = client.get_table(table_ref)

In the next section, you’ll explore the contents of this table in more detail. For now, take the time to use the image below to consolidate what you’ve learned so far.

[Figure: first_commands — a diagram summarizing the workflow so far: create a Client, construct a dataset reference, fetch the dataset, list its tables, and fetch a table]

3.1.3 Table schema

The structure of a table is called its schema. We need to understand a table’s schema to effectively pull out the data we want.

In this example, we’ll investigate the full table that we fetched above.

# Print information on all the columns in the "full" table in the "hacker_news" dataset
table.schema
[SchemaField('title', 'STRING', 'NULLABLE', None, 'Story title', (), None),
 SchemaField('url', 'STRING', 'NULLABLE', None, 'Story url', (), None),
 SchemaField('text', 'STRING', 'NULLABLE', None, 'Story or comment text', (), None),
 SchemaField('dead', 'BOOLEAN', 'NULLABLE', None, 'Is dead?', (), None),
 SchemaField('by', 'STRING', 'NULLABLE', None, "The username of the item's author.", (), None),
 SchemaField('score', 'INTEGER', 'NULLABLE', None, 'Story score', (), None),
 SchemaField('time', 'INTEGER', 'NULLABLE', None, 'Unix time', (), None),
 SchemaField('timestamp', 'TIMESTAMP', 'NULLABLE', None, 'Timestamp for the unix time', (), None),
 SchemaField('type', 'STRING', 'NULLABLE', None, 'Type of details (comment, comment_ranking, poll, story, job, pollopt)', (), None),
 SchemaField('id', 'INTEGER', 'NULLABLE', None, "The item's unique id.", (), None),
 SchemaField('parent', 'INTEGER', 'NULLABLE', None, 'Parent comment ID', (), None),
 SchemaField('descendants', 'INTEGER', 'NULLABLE', None, 'Number of story or poll descendants', (), None),
 SchemaField('ranking', 'INTEGER', 'NULLABLE', None, 'Comment ranking', (), None),
 SchemaField('deleted', 'BOOLEAN', 'NULLABLE', None, 'Is deleted?', (), None)]

Each SchemaField tells us about a specific column (which we also refer to as a field). In order, the information is:

  • The name of the column
  • The field type (or datatype) in the column
  • The mode of the column ('NULLABLE' means that a column allows NULL values, and is the default)
  • A description of the data in that column

For instance, the by field has the following SchemaField:

SchemaField('by', 'STRING', 'NULLABLE', None, "The username of the item's author.", (), None)

This tells us: - the field (or column) is called by, - the data in this field is strings, - NULL values are allowed, and - it contains the usernames corresponding to each item’s author.

We can use the list_rows() method to check just the first five lines of the full table to make sure this is right. This returns a BigQuery RowIterator object that can quickly be converted to a pandas DataFrame with the to_dataframe() method.

# Preview the first five lines of the "full" table
client.list_rows(table, max_results=5).to_dataframe()
title url text dead by score time timestamp type id parent descendants ranking deleted
0 None None I would rather just have wired earbuds, period... <NA> zeveb <NA> 1591717736 2020-06-09 15:48:56+00:00 comment 23467666 23456782 <NA> <NA> <NA>
1 None None DNS? <NA> nly <NA> 1572810465 2019-11-03 19:47:45+00:00 comment 21436112 21435130 <NA> <NA> <NA>
2 None None These benchmarks seem pretty good. Filterable... <NA> mrkeen <NA> 1591717727 2020-06-09 15:48:47+00:00 comment 23467665 23467426 <NA> <NA> <NA>
3 None None Oh really?<p>* Excel alone uses 86.1MB of priv... <NA> oceanswave <NA> 1462987532 2016-05-11 17:25:32+00:00 comment 11677248 11676886 <NA> <NA> <NA>
4 None None These systems are useless. Of the many flaws:... <NA> nyxxie <NA> 1572810473 2019-11-03 19:47:53+00:00 comment 21436113 21435025 <NA> <NA> <NA>

The list_rows() method will also let us look at just the information in a specific column. If we want to see the first five entries in the by column, for example, we can do that!

# Preview the first five entries in the "by" column of the "full" table
client.list_rows(table, selected_fields=table.schema[4:5], max_results=5).to_dataframe()
by
0 zeveb
1 nly
2 mrkeen
3 oceanswave
4 nyxxie

3.1.4 Select, From & Where

Now that you know how to access and examine a dataset, you’re ready to write your first SQL query! As you’ll soon see, SQL queries will help you sort through a massive dataset, to retrieve only the information that you need. We’ll begin by using the keywords SELECT, FROM, and WHERE to get data from specific columns based on conditions you specify.

We’ll use an OpenAQ dataset about air quality. First, we’ll set up everything we need to run queries and take a quick peek at what tables are in our database.

# Construct a reference to the "openaq" dataset
dataset_ref = client.dataset("openaq", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# List all the tables in the "openaq" dataset
tables = list(client.list_tables(dataset))

# Print names of all tables in the dataset (there's only one!)
for table in tables:  
    print(table.table_id)
global_air_quality

The dataset contains only one table, called global_air_quality. We’ll fetch the table and take a peek at the first few rows to see what sort of data it contains.

# Construct a reference to the "global_air_quality" table
table_ref = dataset_ref.table("global_air_quality")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the "global_air_quality" table
client.list_rows(table, max_results=5).to_dataframe()
location city country pollutant value timestamp unit source_name latitude longitude averaged_over_in_hours location_geom
0 Borówiec, ul. Drapałka Borówiec PL bc 0.85217 2022-04-28 07:00:00+00:00 µg/m³ GIOS 1.0 52.276794 17.074114 POINT(52.276794 1)
1 Kraków, ul. Bulwarowa Kraków PL bc 0.91284 2022-04-27 23:00:00+00:00 µg/m³ GIOS 1.0 50.069308 20.053492 POINT(50.069308 1)
2 Płock, ul. Reja Płock PL bc 1.41000 2022-03-30 04:00:00+00:00 µg/m³ GIOS 1.0 52.550938 19.709791 POINT(52.550938 1)
3 Elbląg, ul. Bażyńskiego Elbląg PL bc 0.33607 2022-05-03 13:00:00+00:00 µg/m³ GIOS 1.0 54.167847 19.410942 POINT(54.167847 1)
4 Piastów, ul. Pułaskiego Piastów PL bc 0.51000 2022-05-11 05:00:00+00:00 µg/m³ GIOS 1.0 52.191728 20.837489 POINT(52.191728 1)

Let’s put together a query. Say we want to select all the values from the city column that are in rows where the country column is 'US' (for “United States”).

# Query to select all the items from the "city" column where the "country" column is 'US'
# SQL is almost completely case and indentation insensitive. The capitalization and
# indentation style used here is the preferred style.
query = """
        SELECT city
        FROM `bigquery-public-data.openaq.global_air_quality`
        WHERE country = 'US'
        """

Notice also that the SQL statement uses single quotes for its strings inside the Python string (which we delimit with triple quotation marks here). We begin by setting up the query with the query() method.

# Set up the query
query_job = client.query(query)

# API request - run the query, and return a pandas DataFrame
us_cities = query_job.to_dataframe()

Now we’ve got a pandas DataFrame called us_cities, which we can use like any other DataFrame.

# What five cities have the most measurements?
us_cities.city.value_counts().head()
Phoenix-Mesa-Scottsdale                     39414
Los Angeles-Long Beach-Santa Ana            27479
Riverside-San Bernardino-Ontario            26887
New York-Northern New Jersey-Long Island    25417
San Francisco-Oakland-Fremont               22710
Name: city, dtype: int64

If you want multiple columns, you can select them with a comma between the names:

query = """
        SELECT city, country
        FROM `bigquery-public-data.openaq.global_air_quality`
        WHERE country = 'US'
        """

You can select all columns with a * like this:

query = """
        SELECT *
        FROM `bigquery-public-data.openaq.global_air_quality`
        WHERE country = 'US'
        """

3.1.5 Querying big datasets

You can estimate the size of any query before running it. Here is an example using the Hacker News dataset. To see how much data a query will scan, we create a QueryJobConfig object and set the dry_run parameter to True.

# Query to get the score column from every row where the type column has value "job"
query = """
        SELECT score, title
        FROM `bigquery-public-data.hacker_news.full`
        WHERE type = "job" 
        """

# Create a QueryJobConfig object to estimate size of query without running it
dry_run_config = bigquery.QueryJobConfig(dry_run=True)

# API request - dry run query to estimate costs
dry_run_query_job = client.query(query, job_config=dry_run_config)

print("This query will process {} bytes.".format(dry_run_query_job.total_bytes_processed))
This query will process 553320240 bytes.

You can also specify a parameter when running the query to limit how much data you are willing to scan. Here’s an example with a low limit.

# Only run the query if it's less than 1 MB
ONE_MB = 1000*1000
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=ONE_MB)

# Set up the query (will only run if it's less than 1 MB)
safe_query_job = client.query(query, job_config=safe_config)

# API request - try to run the query, and return a pandas DataFrame
safe_query_job.to_dataframe()
InternalServerError: ignored

In this case, the query was cancelled because the limit of 1 MB was exceeded. However, we can increase the limit to run the query successfully, as shown below.
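For instance, here is a sketch that reruns the same query with a higher cap (assuming roughly 1 GB of scanning is acceptable for your quota; the dry run above estimated about 550 MB, so this should fit — the variable names here are just illustrative):

# Only run the query if it's less than 1 GB
ONE_GB = 1000*1000*1000
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=ONE_GB)

# Set up the query (will only run if it's less than 1 GB)
safe_query_job = client.query(query, job_config=safe_config)

# API request - try to run the query, and return a pandas DataFrame
job_post_scores = safe_query_job.to_dataframe()
job_post_scores.head()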

3.1.6 Group By, Having & Count

Now that you can select raw data, you’re ready to learn how to group your data and count things within those groups.

The Hacker News dataset contains information on stories and comments from the Hacker News social networking site. We’ll work with the comments table and begin by printing the first few rows.

# Construct a reference to the "hacker_news" dataset
dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "comments" table
table_ref = dataset_ref.table("comments")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the "comments" table
client.list_rows(table, max_results=5).to_dataframe()
id by author time time_ts text parent deleted dead ranking
0 9734136 None None 1434565400 2015-06-17 18:23:20+00:00 None 9733698 True <NA> 0
1 4921158 None None 1355496966 2012-12-14 14:56:06+00:00 None 4921100 True <NA> 0
2 7500568 None None 1396261158 2014-03-31 10:19:18+00:00 None 7499385 True <NA> 0
3 8909635 None None 1421627275 2015-01-19 00:27:55+00:00 None 8901135 True <NA> 0
4 9256463 None None 1427204705 2015-03-24 13:45:05+00:00 None 9256346 True <NA> 0

Let’s use the table to see which comments generated the most replies. Since: - the parent column indicates the comment that was replied to, and - the id column has the unique ID used to identify each comment,

we can GROUP BY the parent column and COUNT() the id column in order to figure out the number of comments that were made as responses to a specific comment.

Furthermore, since we’re only interested in popular comments, we’ll look at comments with more than ten replies. So, we’ll only return groups HAVING more than ten IDs.

# Query to select comments that received more than 10 replies
query_popular = """
                SELECT parent, COUNT(id)
                FROM `bigquery-public-data.hacker_news.comments`
                GROUP BY parent
                HAVING COUNT(id) > 10
                """
# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query_popular, job_config=safe_config)

# API request - run the query, and convert the results to a pandas DataFrame
popular_comments = query_job.to_dataframe()

# Print the first five rows of the DataFrame
popular_comments.head()
parent f0_
0 2385424 55
1 8441979 57
2 7634152 46
3 9062758 49
4 5694173 61
popular_comments
parent f0_
0 2385424 55
1 8441979 57
2 7634152 46
3 9062758 49
4 5694173 61
... ... ...
77363 1748827 37
77364 3657756 37
77365 2873865 37
77366 9395540 37
77367 1772903 37

77368 rows × 2 columns

Each row in the popular_comments DataFrame corresponds to a comment that received more than ten replies.

A couple hints to make your queries even better: - The column resulting from COUNT(id) was called f0_. That’s not a very descriptive name. You can change the name by adding AS NumPosts after you specify the aggregation. This is called aliasing. - If you are ever unsure what to put inside the COUNT() function, you can do COUNT(1) to count the rows in each group. Most people find it especially readable, because we know it’s not focusing on other columns. It also scans less data than if you supplied column names (making it faster and using less of your data access quota).

Using these tricks, we can rewrite our query:

# Improved version of earlier query, now with aliasing & improved readability
query_improved = """
                 SELECT parent, COUNT(1) AS NumPosts
                 FROM `bigquery-public-data.hacker_news.comments`
                 GROUP BY parent
                 HAVING COUNT(1) > 10
                 """

safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query_improved, job_config=safe_config)

# API request - run the query, and convert the results to a pandas DataFrame
improved_df = query_job.to_dataframe()

# Print the first five rows of the DataFrame
improved_df.head()
parent NumPosts
0 2970550 63
1 8254532 40
2 7687784 44
3 4101992 53
4 1632878 39
improved_df
parent NumPosts
0 2970550 63
1 8254532 40
2 7687784 44
3 4101992 53
4 1632878 39
... ... ...
77363 7844298 37
77364 7864644 37
77365 2866332 37
77366 1885594 37
77367 2175321 37

77368 rows × 2 columns

Now you have the data you want, and it has descriptive names.

3.1.6.1 Note on using GROUP BY

Note that because GROUP BY tells SQL how to apply aggregate functions (like COUNT()), it doesn’t make sense to use GROUP BY without an aggregate function. Similarly, if your query has a GROUP BY clause, then every selected column must be passed to either 1. a GROUP BY command, or 2. an aggregate function.

Consider the query below:

query_good = """
             SELECT parent, COUNT(id)
             FROM `bigquery-public-data.hacker_news.comments`
             GROUP BY parent
             """

Note that there are two variables: parent and id. - parent was passed to a GROUP BY command (in GROUP BY parent), and - id was passed to an aggregate function (in COUNT(id)).

And the query below won’t work, because the author column isn’t passed to an aggregate function or a GROUP BY clause:

query_bad = """
            SELECT author, parent, COUNT(id)
            FROM `bigquery-public-data.hacker_news.comments`
            GROUP BY parent
            """
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query_bad, job_config=safe_config)

# API request - run the query, and convert the results to a pandas DataFrame
improved_df = query_job.to_dataframe()

# Print the first five rows of the DataFrame
improved_df.head()
BadRequest: ignored

3.1.7 Order By

Frequently, you’ll want to sort your results. Let’s use the US Traffic Fatality Records database, which contains information on traffic accidents in the US where at least one person died.

We’ll investigate the accident_2015 table. Here is a view of the first few rows.

# Construct a reference to the "nhtsa_traffic_fatalities" dataset
dataset_ref = client.dataset("nhtsa_traffic_fatalities", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "accident_2015" table
table_ref = dataset_ref.table("accident_2015")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the "accident_2015" table
client.list_rows(table, max_results=5).to_dataframe()

Let’s use the table to determine how the number of accidents varies with the day of the week. Since: - the consecutive_number column contains a unique ID for each accident, and - the timestamp_of_crash column contains the date of the accident in DATETIME format,

we can: - EXTRACT the day of the week (as day_of_week in the query below) from the timestamp_of_crash column, and - GROUP BY the day of the week, before we COUNT the consecutive_number column to determine the number of accidents for each day of the week.

Then we sort the table with an ORDER BY clause, so the days with the most accidents are returned first.

# Query to find out the number of accidents for each day of the week
query = """
        SELECT COUNT(consecutive_number) AS num_accidents, 
               EXTRACT(DAYOFWEEK FROM timestamp_of_crash) AS day_of_week
        FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2015`
        GROUP BY day_of_week
        ORDER BY num_accidents DESC
        """
# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 1 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)
query_job = client.query(query, job_config=safe_config)

# API request - run the query, and convert the results to a pandas DataFrame
accidents_by_day = query_job.to_dataframe()

# Print the DataFrame
accidents_by_day
num_accidents day_of_week
0 5659 7
1 5298 1
2 4916 6
3 4460 5
4 4182 4
5 4038 2
6 3985 3

Notice that the data is sorted by the num_accidents column, where the days with more traffic accidents appear first.

To map the numbers returned for the day_of_week column to the actual day, you might consult the BigQuery documentation on the DAYOFWEEK function. It says that it returns “an integer between 1 (Sunday) and 7 (Saturday), inclusively”. So, in 2015, most fatal motor accidents in the US occurred on Sunday and Saturday, while the fewest happened on Tuesday.
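If you prefer readable day names in the DataFrame itself, you can map the numbers in pandas after the query returns. A small post-processing sketch (the day_names dictionary below is our own illustration, not part of the original):

# Map the DAYOFWEEK numbers (1 = Sunday, ..., 7 = Saturday) to day names
day_names = {1: "Sunday", 2: "Monday", 3: "Tuesday", 4: "Wednesday",
             5: "Thursday", 6: "Friday", 7: "Saturday"}
accidents_by_day["day_name"] = accidents_by_day["day_of_week"].map(day_names)
accidents_by_day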

3.1.8 As and With

On its own, AS is a convenient way to clean up the data returned by your query. We’re going to use a common table expression (CTE) to find out how many Bitcoin transactions were made each day for the entire timespan of a bitcoin transaction dataset.

We’ll investigate the transactions table. Here is a view of the first few rows.

# Construct a reference to the "crypto_bitcoin" dataset
dataset_ref = client.dataset("crypto_bitcoin", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "transactions" table
table_ref = dataset_ref.table("transactions")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the "transactions" table
client.list_rows(table, max_results=5).to_dataframe()
hash size virtual_size version lock_time block_hash block_number block_timestamp block_timestamp_month input_count output_count input_value output_value is_coinbase fee inputs outputs
0 a16f3ce4dd5deb92d98ef5cf8afeaf0775ebca408f708b... 275 275 1 0 00000000dc55860c8a29c58d45209318fa9e9dc2c1833a... 181 2009-01-12 06:02:13+00:00 2009-01-01 1 2 4000000000.000000000 4000000000.000000000 False 0E-9 [{'index': 0, 'spent_transaction_hash': 'f4184... [{'index': 0, 'script_asm': '04b5abd412d4341b4...
1 591e91f809d716912ca1d4a9295e70c3e78bab077683f7... 275 275 1 0 0000000054487811fc4ff7a95be738aa5ad9320c394c48... 182 2009-01-12 06:12:16+00:00 2009-01-01 1 2 3000000000.000000000 3000000000.000000000 False 0E-9 [{'index': 0, 'spent_transaction_hash': 'a16f3... [{'index': 0, 'script_asm': '0401518fa1d1e1e3e...
2 12b5633bad1f9c167d523ad1aa1947b2732a865bf5414e... 276 276 1 0 00000000f46e513f038baf6f2d9a95b2a28d8a6c985bcf... 183 2009-01-12 06:34:22+00:00 2009-01-01 1 2 2900000000.000000000 2900000000.000000000 False 0E-9 [{'index': 0, 'spent_transaction_hash': '591e9... [{'index': 0, 'script_asm': '04baa9d3665315562...
3 828ef3b079f9c23829c56fe86e85b4a69d9e06e5b54ea5... 276 276 1 0 00000000fb5b44edc7a1aa105075564a179d65506e2bd2... 248 2009-01-12 20:04:20+00:00 2009-01-01 1 2 2800000000.000000000 2800000000.000000000 False 0E-9 [{'index': 0, 'spent_transaction_hash': '12b56... [{'index': 0, 'script_asm': '04bed827d37474bef...
4 35288d269cee1941eaebb2ea85e32b42cdb2b04284a56d... 277 277 1 0 00000000689051c09ff2cd091cc4c22c10b965eb8db3ad... 545 2009-01-15 05:48:32+00:00 2009-01-01 1 2 2500000000.000000000 2500000000.000000000 False 0E-9 [{'index': 0, 'spent_transaction_hash': 'd71fd... [{'index': 0, 'script_asm': '044a656f065871a35...

Since the block_timestamp column contains the date of each transaction in DATETIME format, we’ll convert these into DATE format using the DATE() command.

We do that using a CTE, and then the next part of the query counts the number of transactions for each date and sorts the table so that earlier dates appear first.

# Query to select the number of transactions per date, sorted by date
query_with_CTE = """ 
                 WITH time AS 
                 (
                     SELECT DATE(block_timestamp) AS trans_date
                     FROM `bigquery-public-data.crypto_bitcoin.transactions`
                 )
                 SELECT COUNT(1) AS transactions,
                        trans_date
                 FROM time
                 GROUP BY trans_date
                 ORDER BY trans_date
                 """

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query_with_CTE, job_config=safe_config)

# API request - run the query, and convert the results to a pandas DataFrame
transactions_by_date = query_job.to_dataframe()

# Print the first five rows
transactions_by_date.head()
transactions trans_date
0 1 2009-01-03
1 14 2009-01-09
2 61 2009-01-10
3 93 2009-01-11
4 101 2009-01-12

Since they’re returned sorted, we can easily plot the raw results to show us the number of Bitcoin transactions per day over the whole timespan of this dataset.

transactions_by_date.set_index('trans_date').plot()
<AxesSubplot:xlabel='trans_date'>

As you can see, common table expressions (CTEs) let you shift a lot of your data cleaning into SQL. That’s an especially good thing in the case of BigQuery, because it is vastly faster than doing the work in Pandas.

3.1.9 Joining data

When our data lives across different tables, how do we analyze it? By JOINing the tables together. A JOIN combines rows in the left table with corresponding rows in the right table, where the meaning of “corresponding” is based on how we specify the join.

GitHub is the most popular place to collaborate on software projects. A GitHub repository (or repo) is a collection of files associated with a specific project. Most repos on GitHub are shared under a specific legal license, which determines the legal restrictions on how they are used. For our example, we’re going to look at how many different files have been released under each license.

We’ll work with two tables in the database. The first table is the licenses table, which provides the name of each GitHub repo (in the repo_name column) and its corresponding license. Here’s a view of the first five rows.

# Construct a reference to the "github_repos" dataset
dataset_ref = client.dataset("github_repos", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "licenses" table
licenses_ref = dataset_ref.table("licenses")

# API request - fetch the table
licenses_table = client.get_table(licenses_ref)

# Preview the first five lines of the "licenses" table
client.list_rows(licenses_table, max_results=5).to_dataframe()
repo_name license
0 autarch/Dist-Zilla-Plugin-Test-TidyAll artistic-2.0
1 thundergnat/Prime-Factor artistic-2.0
2 kusha-b-k/Turabian_Engin_Fan artistic-2.0
3 onlinepremiumoutlet/onlinepremiumoutlet.github.io artistic-2.0
4 huangyuanlove/LiaoBa_Service artistic-2.0

The second table is the sample_files table, which provides, among other information, the GitHub repo that each file belongs to (in the repo_name column). The first several rows of this table are printed below.

# Construct a reference to the "sample_files" table
files_ref = dataset_ref.table("sample_files")

# API request - fetch the table
files_table = client.get_table(files_ref)

# Preview the first five lines of the "sample_files" table
client.list_rows(files_table, max_results=5).to_dataframe()
repo_name ref path mode id symlink_target
0 EOL/eol refs/heads/master generate/vendor/railties 40960 0338c33fb3fda57db9e812ac7de969317cad4959 /usr/share/rails-ruby1.8/railties
1 np/ling refs/heads/master tests/success/merger_seq_inferred.t/merger_seq... 40960 dd4bb3d5ecabe5044d3fa5a36e0a9bf7ca878209 ../../../fixtures/all/merger_seq_inferred.ll
2 np/ling refs/heads/master fixtures/sequence/lettype.ll 40960 8fdf536def2633116d65b92b3b9257bcf06e3e45 ../all/lettype.ll
3 np/ling refs/heads/master fixtures/failure/wrong_order_seq3.ll 40960 c2509ae1196c4bb79d7e60a3d679488ca4a753e9 ../all/wrong_order_seq3.ll
4 np/ling refs/heads/master issues/sequence/keep.t 40960 5721de3488fb32745dfc11ec482e5dd0331fecaf ../keep.t

Next, we write a query that uses information in both tables to determine how many files are released under each license.

# Query to determine the number of files per license, sorted by number of files
query = """
        SELECT L.license, COUNT(1) AS number_of_files
        FROM `bigquery-public-data.github_repos.sample_files` AS sf
        INNER JOIN `bigquery-public-data.github_repos.licenses` AS L 
            ON sf.repo_name = L.repo_name
        GROUP BY L.license
        ORDER BY number_of_files DESC
        """

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query, job_config=safe_config)

# API request - run the query, and convert the results to a pandas DataFrame
file_count_by_license = query_job.to_dataframe()

It’s a big query, and so we’ll investigate each piece separately.

We’ll begin with the JOIN. This specifies the sources of data and how to join them. We use ON to specify that we combine the tables by matching the values in the repo_name columns in the tables.

Next, we’ll talk about SELECT and GROUP BY. The GROUP BY breaks the data into a different group for each license, before we COUNT the number of rows in the sample_files table that corresponds to each license. (Remember that you can count the number of rows with COUNT(1).)

Finally, the ORDER BY sorts the results so that licenses with more files appear first.

It was a big query, but it gave us a nice table summarizing how many files have been committed under each license:

# Print the DataFrame
file_count_by_license
license number_of_files
0 mit 20560894
1 gpl-2.0 16608922
2 apache-2.0 7201141
3 gpl-3.0 5107676
4 bsd-3-clause 3465437
5 agpl-3.0 1372100
6 lgpl-2.1 799664
7 bsd-2-clause 692357
8 lgpl-3.0 582277
9 mpl-2.0 457000
10 cc0-1.0 449149
11 epl-1.0 322255
12 unlicense 208602
13 artistic-2.0 147391
14 isc 118332

There are a few more types of JOINs, and you can also use UNIONs to pull information from multiple tables. We’ll again work with the Hacker News dataset and begin by reviewing the first several rows of the comments table.

# Construct a reference to the "hacker_news" dataset
dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "comments" table
table_ref = dataset_ref.table("comments")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the table
client.list_rows(table, max_results=5).to_dataframe()
id by author time time_ts text parent deleted dead ranking
0 9734136 None None 1434565400 2015-06-17 18:23:20+00:00 None 9733698 True <NA> 0
1 4921158 None None 1355496966 2012-12-14 14:56:06+00:00 None 4921100 True <NA> 0
2 7500568 None None 1396261158 2014-03-31 10:19:18+00:00 None 7499385 True <NA> 0
3 8909635 None None 1421627275 2015-01-19 00:27:55+00:00 None 8901135 True <NA> 0
4 9256463 None None 1427204705 2015-03-24 13:45:05+00:00 None 9256346 True <NA> 0
# Construct a reference to the "stories" table
table_ref = dataset_ref.table("stories")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the table
client.list_rows(table, max_results=5).to_dataframe()
id by score time time_ts title url text deleted dead descendants author
0 6988445 cflick 0 1388454902 2013-12-31 01:55:02+00:00 Appshare http://chadflick.ws/appshare.html Did facebook or angrybirds pay you? We will! <NA> True <NA> cflick
1 7047571 Rd2 1 1389562985 2014-01-12 21:43:05+00:00 Java in startups Hello, hacker news!<p>Have any of you used jav... <NA> True <NA> Rd2
2 9157712 mo0 1 1425657937 2015-03-06 16:05:37+00:00 Show HN: Discover what songs were used in YouT... http://www.mooma.sh/ The user can paste a media url(currently only ... <NA> True <NA> mo0
3 8127403 ad11 1 1407052667 2014-08-03 07:57:47+00:00 My poker project, what do you think? Hi guys, what do you think about my poker proj... <NA> True <NA> ad11
4 6933158 emyy 1 1387432701 2013-12-19 05:58:21+00:00 Christmas Crafts Ideas - Easy and Simple Famil... http://www.winxdvd.com/resource/christmas-craf... There are some free Christmas craft ideas to m... <NA> True <NA> emyy

The query below pulls information from the stories and comments tables to create a table showing all stories posted on January 1, 2012, along with the corresponding number of comments. We use a LEFT JOIN so that the results include stories that didn’t receive any comments.

# Query to select all stories posted on January 1, 2012, with number of comments
join_query = """
             WITH c AS
             (
             SELECT parent, COUNT(*) as num_comments
             FROM `bigquery-public-data.hacker_news.comments` 
             GROUP BY parent
             )
             SELECT s.id as story_id, s.by, s.title, c.num_comments
             FROM `bigquery-public-data.hacker_news.stories` AS s
             LEFT JOIN c
             ON s.id = c.parent
             WHERE EXTRACT(DATE FROM s.time_ts) = '2012-01-01'
             ORDER BY c.num_comments DESC
             """

# Run the query, and return a pandas DataFrame
join_result = client.query(join_query).result().to_dataframe()
join_result.head()
story_id by title num_comments
0 3412900 whoishiring Ask HN: Who is Hiring? (January 2012) 154
1 3412901 whoishiring Ask HN: Freelancer? Seeking freelancer? (Janua... 97
2 3412643 jemeshsu Avoid Apress 30
3 3412891 Brajeshwar There's no shame in code that is simply "good ... 27
4 3414012 ramanujam Impress.js - a Prezi like implementation using... 27

Since the results are ordered by the num_comments column, stories without comments appear at the end of the DataFrame. (Remember that NaN stands for “not a number”.)

# None of these stories received any comments
join_result.tail()
story_id by title num_comments
439 3413342 jonsteiman The List You Can't Miss: Top 5 Blogs of 2011 <NA>
440 3412327 arroyo Can your business card be followed? <NA>
441 3413203 norris_tony44 10 Popular iPhone Games you Must Play <NA>
442 3412940 julelara Washington Redskins vs Philadelphia Eagles liv... <NA>
443 3412632 UsedCarFleetCom Used Car fleet <NA>

As you’ve seen, JOINs horizontally combine results from different tables. If you instead would like to vertically concatenate the results of two queries (stacking the rows of one on top of the other), you can do so with a UNION.

Next, we write a query to select all usernames corresponding to users who wrote stories or comments on January 1, 2014. We use UNION DISTINCT (instead of UNION ALL) to ensure that each user appears in the table at most once.

# Query to select all users who posted stories or comments on January 1, 2014
union_query = """
              SELECT c.by
              FROM `bigquery-public-data.hacker_news.comments` AS c
              WHERE EXTRACT(DATE FROM c.time_ts) = '2014-01-01'
              UNION DISTINCT
              SELECT s.by
              FROM `bigquery-public-data.hacker_news.stories` AS s
              WHERE EXTRACT(DATE FROM s.time_ts) = '2014-01-01'
              """

# Run the query, and return a pandas DataFrame
union_result = client.query(union_query).result().to_dataframe()
union_result.head()
by
0 vidarh
1 tlarkworthy
2 jbl
3 dmgrow
4 maurorm

To get the number of users who posted on January 1, 2014, we need only take the length of the DataFrame.

# Number of users who posted stories or comments on January 1, 2014
len(union_result)
2282

3.1.10 Nested and Repeated data

So far, you’ve worked with many types of data, including numeric types (integers, floating point values), strings, and the DATETIME type. In this tutorial, you’ll learn how to query nested and repeated data. These are the most complex data types that you can find in BigQuery datasets!

We’ll work with the Google Analytics Sample dataset. It contains information tracking the behavior of visitors to the Google Merchandise store, an e-commerce website that sells Google branded items.

We begin by printing the first few rows of the ga_sessions_20170801 table. This table tracks visits to the website on August 1, 2017. As you can see from the table preview below, it has many nested fields:

# Construct a reference to the "google_analytics_sample" dataset
dataset_ref = client.dataset("google_analytics_sample", project="bigquery-public-data")

# Construct a reference to the "ga_sessions_20170801" table
table_ref = dataset_ref.table("ga_sessions_20170801")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the table
client.list_rows(table, max_results=5).to_dataframe()
visitorId visitNumber visitId visitStartTime date totals trafficSource device geoNetwork customDimensions hits fullVisitorId userId clientId channelGrouping socialEngagementType
0 <NA> 1 1501591568 1501591568 20170801 {'visits': 1, 'hits': 1, 'pageviews': 1, 'time... {'referralPath': None, 'campaign': '(not set)'... {'browser': 'Chrome', 'browserVersion': 'not a... {'continent': 'Europe', 'subContinent': 'South... [] [{'hitNumber': 1, 'time': 0, 'hour': 5, 'minut... 3418334011779872055 None None Organic Search Not Socially Engaged
1 <NA> 2 1501589647 1501589647 20170801 {'visits': 1, 'hits': 1, 'pageviews': 1, 'time... {'referralPath': '/analytics/web/', 'campaign'... {'browser': 'Chrome', 'browserVersion': 'not a... {'continent': 'Asia', 'subContinent': 'Souther... [{'index': 4, 'value': 'APAC'}] [{'hitNumber': 1, 'time': 0, 'hour': 5, 'minut... 2474397855041322408 None None Referral Not Socially Engaged
2 <NA> 1 1501616621 1501616621 20170801 {'visits': 1, 'hits': 1, 'pageviews': 1, 'time... {'referralPath': '/analytics/web/', 'campaign'... {'browser': 'Chrome', 'browserVersion': 'not a... {'continent': 'Europe', 'subContinent': 'North... [{'index': 4, 'value': 'EMEA'}] [{'hitNumber': 1, 'time': 0, 'hour': 12, 'minu... 5870462820713110108 None None Referral Not Socially Engaged
3 <NA> 1 1501601200 1501601200 20170801 {'visits': 1, 'hits': 1, 'pageviews': 1, 'time... {'referralPath': '/analytics/web/', 'campaign'... {'browser': 'Firefox', 'browserVersion': 'not ... {'continent': 'Americas', 'subContinent': 'Nor... [{'index': 4, 'value': 'North America'}] [{'hitNumber': 1, 'time': 0, 'hour': 8, 'minut... 9397809171349480379 None None Referral Not Socially Engaged
4 <NA> 1 1501615525 1501615525 20170801 {'visits': 1, 'hits': 1, 'pageviews': 1, 'time... {'referralPath': '/analytics/web/', 'campaign'... {'browser': 'Chrome', 'browserVersion': 'not a... {'continent': 'Americas', 'subContinent': 'Nor... [{'index': 4, 'value': 'North America'}] [{'hitNumber': 1, 'time': 0, 'hour': 12, 'minu... 6089902943184578335 None None Referral Not Socially Engaged

Now we’ll work with the hits column as an example of data that is both nested and repeated. Since:

  • hits is a STRUCT (contains nested data) and is repeated,
  • hitNumber, page, and type are all nested inside the hits column, and
  • pagePath is nested inside the page field,

we can query these fields with the following syntax:

# Query to determine most popular landing point on the website
query = """
        SELECT hits.page.pagePath as path,
            COUNT(hits.page.pagePath) as counts
        FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`, 
            UNNEST(hits) as hits
        WHERE hits.type="PAGE" and hits.hitNumber=1
        GROUP BY path
        ORDER BY counts DESC
        """

# Run the query, and return a pandas DataFrame
result = client.query(query).result().to_dataframe()
result.head()
path counts
0 /home 1257
1 /google+redesign/shop+by+brand/youtube 587
2 /google+redesign/apparel/mens/mens+t+shirts 117
3 /signin.html 78
4 /basket.html 35

3.1.11 Analytic Functions (Optional)

You can also use analytic functions, which operate on a set of rows just like aggregate functions. However, unlike aggregate functions, analytic functions return a (potentially different) value for each row in the original table. Analytic functions allow us to perform complex calculations with relatively straightforward syntax. For instance, we can quickly calculate moving averages and running totals, among other quantities.

We’ll work with the San Francisco Open Data dataset.

# Construct a reference to the "san_francisco" dataset
dataset_ref = client.dataset("san_francisco", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "bikeshare_trips" table
table_ref = dataset_ref.table("bikeshare_trips")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the table
client.list_rows(table, max_results=5).to_dataframe()
trip_id duration_sec start_date start_station_name start_station_id end_date end_station_name end_station_id bike_number zip_code subscriber_type
0 1235850 1540 2016-06-11 08:19:00+00:00 San Jose Diridon Caltrain Station 2 2016-06-11 08:45:00+00:00 San Jose Diridon Caltrain Station 2 124 15206 Customer
1 1219337 6324 2016-05-29 12:49:00+00:00 San Jose Diridon Caltrain Station 2 2016-05-29 14:34:00+00:00 San Jose Diridon Caltrain Station 2 174 55416 Customer
2 793762 115572 2015-06-04 09:22:00+00:00 San Jose Diridon Caltrain Station 2 2015-06-05 17:28:00+00:00 San Jose Diridon Caltrain Station 2 190 95391 Customer
3 453845 54120 2014-09-15 16:53:00+00:00 San Jose Diridon Caltrain Station 2 2014-09-16 07:55:00+00:00 San Jose Diridon Caltrain Station 2 127 81 Customer
4 1245113 5018 2016-06-17 20:08:00+00:00 San Jose Diridon Caltrain Station 2 2016-06-17 21:32:00+00:00 San Jose Diridon Caltrain Station 2 153 95070 Customer

Each row of the table corresponds to a different bike trip, and we can use an analytic function to calculate the cumulative number of trips for each date in 2015.

# Query to count the (cumulative) number of trips per day
num_trips_query = """
                  WITH trips_by_day AS
                  (
                  SELECT DATE(start_date) AS trip_date,
                      COUNT(*) as num_trips
                  FROM `bigquery-public-data.san_francisco.bikeshare_trips`
                  WHERE EXTRACT(YEAR FROM start_date) = 2015
                  GROUP BY trip_date
                  )
                  SELECT *,
                      SUM(num_trips) 
                          OVER (
                               ORDER BY trip_date
                               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
                               ) AS cumulative_trips
                  FROM trips_by_day
                  """

# Run the query, and return a pandas DataFrame
num_trips_result = client.query(num_trips_query).result().to_dataframe()
num_trips_result.head()
trip_date num_trips cumulative_trips
0 2015-01-13 1368 10709
1 2015-04-06 1281 91635
2 2015-04-25 405 111722
3 2015-06-18 1352 166602
4 2015-07-14 1358 192161

The query uses a common table expression (CTE) to first calculate the daily number of trips. Then, we use SUM() as an analytic function (note the OVER clause). - Since there is no PARTITION BY clause, the entire table is treated as a single partition. - The ORDER BY clause orders the rows by date, where earlier dates appear first. - By setting the window frame clause to ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, we ensure that all rows up to and including the current date are used to calculate the (cumulative) sum. See https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts#def_window_frame for more details.
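As another illustration (a sketch, not part of the original notebook), changing the window frame to ROWS BETWEEN 2 PRECEDING AND CURRENT ROW and using AVG() instead of SUM() gives a three-day moving average of the daily trip counts:

# Sketch: 3-day moving average of daily trips in 2015
# (assumes the same client object as above)
moving_avg_query = """
                   WITH trips_by_day AS
                   (
                   SELECT DATE(start_date) AS trip_date,
                       COUNT(*) AS num_trips
                   FROM `bigquery-public-data.san_francisco.bikeshare_trips`
                   WHERE EXTRACT(YEAR FROM start_date) = 2015
                   GROUP BY trip_date
                   )
                   SELECT *,
                       AVG(num_trips)
                           OVER (
                                ORDER BY trip_date
                                ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
                                ) AS moving_avg_3day
                   FROM trips_by_day
                   ORDER BY trip_date
                   """

# Run the query, and return a pandas DataFrame
moving_avg_result = client.query(moving_avg_query).result().to_dataframe()
moving_avg_result.head()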

The next query tracks the stations where each bike began the day (computed from start_station_id) and ended the day (computed from end_station_id) on October 25, 2015.

# Query to track beginning and ending stations on October 25, 2015, for each bike
start_end_query = """
                  SELECT bike_number,
                      TIME(start_date) AS trip_time,
                      FIRST_VALUE(start_station_id)
                          OVER (
                               PARTITION BY bike_number
                               ORDER BY start_date
                               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                               ) AS first_station_id,
                      LAST_VALUE(end_station_id)
                          OVER (
                               PARTITION BY bike_number
                               ORDER BY start_date
                               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                               ) AS last_station_id,
                      start_station_id,
                      end_station_id
                  FROM `bigquery-public-data.san_francisco.bikeshare_trips`
                  WHERE DATE(start_date) = '2015-10-25' 
                  """

# Run the query, and return a pandas DataFrame
start_end_result = client.query(start_end_query).result().to_dataframe()
start_end_result.head()
bike_number trip_time first_station_id last_station_id start_station_id end_station_id
0 230 22:41:00 22 22 22 22
1 601 15:50:00 68 67 68 50
2 601 23:27:00 68 67 50 67
3 604 08:56:00 70 66 70 39
4 604 12:34:00 70 66 39 67

The query uses both FIRST_VALUE() and LAST_VALUE() as analytic functions. - The PARTITION BY clause breaks the data into partitions based on the bike_number column. Since this column holds unique identifiers for the bikes, this ensures the calculations are performed separately for each bike. - The ORDER BY clause puts the rows within each partition in chronological order. - Since the window frame clause is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, for each row, its entire partition is used to perform the calculation. (This ensures the calculated values for rows in the same partition are identical.)

You can check https://cloud.google.com/bigquery/docs/reference/standard-sql/introduction and https://googleapis.dev/python/bigquery/latest/index.html for more details.

3.2 Data Wrangling with Pandas

3.2.1 Series objects

The Pandas library contains these useful data structures: * Series objects, which we will discuss first. A Series object is a 1D array, similar to a column in a spreadsheet (with a column name and row labels). * DataFrame objects. This is a 2D table, similar to a spreadsheet (with column names and row labels).

3.2.1.1 Creating a Series

Let’s start by creating our first Series object!

s = pd.Series([2,-1,3,5])
s
0    2
1   -1
2    3
3    5
dtype: int64

Arithmetic operations on Series are also possible, and they apply elementwise, just like for ndarrays in NumPy:

s + [1000,2000,3000,4000]
0    1002
1    1999
2    3003
3    4005
dtype: int64
s + 1000 # Broadcasting
0    1002
1     999
2    1003
3    1005
dtype: int64

3.2.1.2 Index labels

Each item in a Series object has a unique identifier called the index label. By default, it is simply the rank of the item in the Series (starting at 0) but you can also set the index labels manually:

s2 = pd.Series([68, 83, 112, 68], index=["alice", "bob", "charles", "darwin"])
s2
alice       68
bob         83
charles    112
darwin      68
dtype: int64

You can then use the Series just like a dict:

s2["bob"]
83

You can still access the items by integer location, like in a regular array:

s2[1]
83

To make it clear whether you are accessing by label or by integer location, it is recommended to always use the loc attribute when accessing by label, and the iloc attribute when accessing by integer location:

s2.loc["bob"]
83
s2.iloc[1]
83

Slicing a Series also slices the index labels:

s2.iloc[1:3]
bob         83
charles    112
dtype: int64

3.2.1.3 Initialize from dict

You can create a Series object from a dict. The keys will be used as index labels:

weights = {"alice": 68, "bob": 83, "colin": 86, "darwin": 68}
s3 = pd.Series(weights)
s3
alice     68
bob       83
colin     86
darwin    68
dtype: int64

When an operation involves multiple Series objects, pandas automatically aligns items by matching index labels.

print(s2.keys())
print(s3.keys())

s2 + s3
Index(['alice', 'bob', 'charles', 'darwin'], dtype='object')
Index(['alice', 'bob', 'colin', 'darwin'], dtype='object')
alice      136.0
bob        166.0
charles      NaN
colin        NaN
darwin     136.0
dtype: float64

The resulting Series contains the union of index labels from s2 and s3. Since "colin" is missing from s2 and "charles" is missing from s3, these items have a NaN result value (i.e., Not-a-Number, which here means missing).

Automatic alignment is very handy when working with data that may come from various sources with varying structure and missing items.
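If you would rather treat a label that is missing from one of the Series as zero instead of getting NaN, you can use the add() method with a fill_value (a small sketch, not from the original notebook):

# Missing labels are filled with 0 before the addition, so no NaN appears
s2.add(s3, fill_value=0)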

3.2.1.4 Initialize with a scalar

You can also initialize a Series object using a scalar and a list of index labels: all items will be set to the scalar.

meaning = pd.Series(42, ["life", "universe", "everything"])
meaning
life          42
universe      42
everything    42
dtype: int64

Pandas makes it easy to plot Series data using matplotlib:

temperatures = [4.4,5.1,6.1,6.2,6.1,6.1,5.7,5.2,4.7,4.1,3.9,3.5]
s4 = pd.Series(temperatures, name="Temperature")
s4.plot()
plt.show()

You can easily convert it to a NumPy array by discarding the index.

s4.to_numpy()
array([4.4, 5.1, 6.1, 6.2, 6.1, 6.1, 5.7, 5.2, 4.7, 4.1, 3.9, 3.5])

There are many options for plotting your data. It is not necessary to list them all here: if you need a particular type of plot (histograms, pie charts, etc.), just look for it in the excellent Visualization section of pandas’ documentation, and look at the example code.
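For instance, the other plot kinds use the same interface as the plot() call above (a small sketch, reusing the s4 Series and plt from earlier cells):

# Histogram of the temperature values
s4.plot(kind="hist", bins=5)
plt.show()

# Horizontal bar chart, one bar per entry
s4.plot(kind="barh")
plt.show()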

3.2.2 Handling time

Many datasets have timestamps, and pandas is awesome at manipulating such data: * it can represent periods (such as 2016Q3) and frequencies (such as “monthly”) * it can convert periods to actual timestamps, and vice versa * it can resample data and aggregate values any way you like * it can handle timezones.

3.2.2.1 Time range

Let’s start by creating a time series using pd.date_range(). This returns a DatetimeIndex containing one datetime per hour for 12 hours starting on March 6th 2023 at 5:30pm.

dates = pd.date_range('2023/03/06 5:30pm', periods=12, freq='H')
dates
DatetimeIndex(['2023-03-06 17:30:00', '2023-03-06 18:30:00',
               '2023-03-06 19:30:00', '2023-03-06 20:30:00',
               '2023-03-06 21:30:00', '2023-03-06 22:30:00',
               '2023-03-06 23:30:00', '2023-03-07 00:30:00',
               '2023-03-07 01:30:00', '2023-03-07 02:30:00',
               '2023-03-07 03:30:00', '2023-03-07 04:30:00'],
              dtype='datetime64[ns]', freq='H')

This DatetimeIndex may be used as an index in a Series:

temp_series = pd.Series(temperatures, dates)
temp_series
2023-03-06 17:30:00    4.4
2023-03-06 18:30:00    5.1
2023-03-06 19:30:00    6.1
2023-03-06 20:30:00    6.2
2023-03-06 21:30:00    6.1
2023-03-06 22:30:00    6.1
2023-03-06 23:30:00    5.7
2023-03-07 00:30:00    5.2
2023-03-07 01:30:00    4.7
2023-03-07 02:30:00    4.1
2023-03-07 03:30:00    3.9
2023-03-07 04:30:00    3.5
Freq: H, dtype: float64

Let’s plot this series:

temp_series.plot(kind="bar")

plt.grid(True)
plt.show()
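The resampling and timezone features mentioned at the start of this section can be illustrated with the same series (a small sketch, not part of the original notebook):

# Downsample to one value every 2 hours, averaging the temperatures
temp_series.resample("2H").mean()

# Attach a timezone to the naive timestamps, then convert to another one
temp_series_utc = temp_series.tz_localize("UTC")
temp_series_utc.tz_convert("Asia/Taipei").head()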

3.2.3 Periods

The pd.period_range() function returns a PeriodIndex instead of a DatetimeIndex. For example, let’s get all quarters in 2022 and 2023:

quarters = pd.period_range('2022Q1', periods=8, freq='Q')
quarters
PeriodIndex(['2022Q1', '2022Q2', '2022Q3', '2022Q4', '2023Q1', '2023Q2',
             '2023Q3', '2023Q4'],
            dtype='period[Q-DEC]')

Adding a number N to a PeriodIndex shifts the periods by N times the PeriodIndex’s frequency:

quarters + 3
PeriodIndex(['2022Q4', '2023Q1', '2023Q2', '2023Q3', '2023Q4', '2024Q1',
             '2024Q2', '2024Q3'],
            dtype='period[Q-DEC]')

Pandas also provides many other time-related functions that we recommend you check out in the documentation
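For example, converting periods to timestamps and back (mentioned in the list of time features above) looks like this (a small sketch):

# Convert each period to a timestamp (start of the quarter), and back again
quarter_starts = quarters.to_timestamp()   # DatetimeIndex
quarter_starts.to_period("Q")              # back to a PeriodIndex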

3.2.4 DataFrame objects

A DataFrame object represents a spreadsheet, with cell values, column names and row index labels. You can define expressions to compute columns based on other columns, create pivot tables, group rows, draw graphs, etc. You can think of a DataFrame as a dictionary of Series.

3.2.4.1 Creating a DataFrame

#%unload_ext google.colab.data_table # colab only
#from google.colab import data_table
#data_table.disable_dataframe_formatter()

You can create a DataFrame by passing a dictionary of Series objects:

people_dict = {
    "weight": pd.Series([68, 83, 112], index=["alice", "bob", "charles"]),
    "birthyear": pd.Series([1984, 1985, 1992], index=["bob", "alice", "charles"], name="year"),
    "children": pd.Series([0, 3], index=["charles", "bob"]),
    "hobby": pd.Series(["Biking", "Dancing"], index=["alice", "bob"]),
}
people = pd.DataFrame(people_dict)
people
weight birthyear children hobby
alice 68 1985 NaN Biking
bob 83 1984 3.0 Dancing
charles 112 1992 0.0 NaN

A few things to note: * the Series were automatically aligned based on their index, * missing values are represented as NaN, * Series names are ignored (the name "year" was dropped), * DataFrames are displayed nicely in Jupyter notebooks!

3.2.4.2 Subsets - Accessing columns

You can access columns by using the column name or fancy indexing. They are returned as Series objects:

people["birthyear"]
alice      1985
bob        1984
charles    1992
Name: birthyear, dtype: int64

You can also get multiple columns at once:

people[["birthyear", "hobby"]]
birthyear hobby
alice 1985 Biking
bob 1984 Dancing
charles 1992 NaN

Another convenient way to create a DataFrame is to pass all the values to the constructor as an ndarray, or a list of lists, and specify the column names and row index labels separately:

values = [
            [1985, np.nan, "Biking",   68],
            [1984, 3,      "Dancing",  83],
            [1992, 0,      np.nan,    112]
         ]
d3 = pd.DataFrame(
        values,
        columns=["birthyear", "children", "hobby", "weight"],
        index=["alice", "bob", "charles"]
     )
d3
birthyear children hobby weight
alice 1985 NaN Biking 68
bob 1984 3.0 Dancing 83
charles 1992 0.0 NaN 112

3.2.4.3 Multi-index (optional)

You can also create a multi-index DataFrame as follows:

df = pd.DataFrame(
    {
        "a" : [4 ,5, 6],
        "b" : [7, 8, 9],
        "c" : [10, 11, 12]
     },
    index = pd.MultiIndex.from_tuples(
        [('d',1),('d',2),('e',2)], names=['n','v']
    )
)
df
a b c
n v
d 1 4 7 10
2 5 8 11
e 2 6 9 12

If all columns are tuples of the same size, then they are understood as a multi-index. The same goes for row index labels. For example:

d5 = pd.DataFrame(
  {
    ("public", "birthyear"):
        {("Paris","alice"):1985, ("Paris","bob"): 1984, ("London","charles"): 1992},
    ("public", "hobby"):
        {("Paris","alice"):"Biking", ("Paris","bob"): "Dancing"},
    ("private", "weight"):
        {("Paris","alice"):68, ("Paris","bob"): 83, ("London","charles"): 112},
    ("private", "children"):
        {("Paris", "alice"):np.nan, ("Paris","bob"): 3, ("London","charles"): 0}
  }
)
d5
public private
birthyear hobby weight children
Paris alice 1985 Biking 68 NaN
bob 1984 Dancing 83 3.0
London charles 1992 NaN 112 0.0

You can now get a DataFrame containing all the “public” columns very simply:

d5["public"]
birthyear hobby
Paris alice 1985 Biking
bob 1984 Dancing
London charles 1992 NaN

Note that most pandas methods return modified copies instead of modifying the object in place.
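
For instance, dropping the “private” columns returns a new DataFrame and leaves d5 itself untouched. Here is a minimal sketch (the variable name d5_public_only is just illustrative):

d5_public_only = d5.drop(columns="private", level=0)  # returns a new DataFrame without the "private" columns
"private" in d5.columns.get_level_values(0)           # True: d5 itself is unchanged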

3.2.4.4 Subsets - Accessing rows

Let’s go back to the people DataFrame:

people
weight birthyear children hobby
alice 68 1985 NaN Biking
bob 83 1984 3.0 Dancing
charles 112 1992 0.0 NaN

The loc attribute lets you access rows instead of columns. The result is a Series object in which the DataFrame’s column names are mapped to row index labels:

people.loc["charles"]
weight        112
birthyear    1992
children      0.0
hobby         NaN
Name: charles, dtype: object

You can also access rows by integer location using the iloc attribute:

people.iloc[2]
weight        112
birthyear    1992
children      0.0
hobby         NaN
Name: charles, dtype: object

You can also get a slice of rows, and this returns a DataFrame object:

people.iloc[1:3]
weight birthyear children hobby
bob 83 1984 3.0 Dancing
charles 112 1992 0.0 NaN

Finally, you can pass a boolean array to get the matching rows. This is most useful when combined with boolean expressions:

people[people["birthyear"] < 1990]
weight birthyear children hobby
alice 68 1985 NaN Biking
bob 83 1984 3.0 Dancing

You can also access columns by specifying the second axis:

people.iloc[:,2]
alice      NaN
bob        3.0
charles    0.0
Name: children, dtype: float64

3.2.4.5 Adding and removing columns

You can generally treat DataFrame objects like dictionaries of Series, so the following operations work fine:

people
weight birthyear children hobby
alice 68 1985 NaN Biking
bob 83 1984 3.0 Dancing
charles 112 1992 0.0 NaN
people["age"] = 2023 - people["birthyear"]  # adds a new column "age"
people["over 30"] = people["age"] > 30      # adds another column "over 30"
birthyears = people.pop("birthyear")
people.drop(columns=['children'], inplace=True) # drop a column inplace
people
weight hobby age over 30
alice 68 Biking 38 True
bob 83 Dancing 39 True
charles 112 NaN 31 True
birthyears
alice      1985
bob        1984
charles    1992
Name: birthyear, dtype: int64

When you add a new column from a plain list or array, it must have the same number of rows as the DataFrame. When you assign a Series, however, it is aligned on the index: missing rows are filled with NaN, and extra rows are ignored:

people["pets"] = pd.Series({"bob": 0, "charles": 5, "eugene":1})  # alice is missing, eugene is ignored
people
weight hobby age over 30 pets
alice 68 Biking 38 True NaN
bob 83 Dancing 39 True 0.0
charles 112 NaN 31 True 5.0

When adding a new column, it is added at the end (on the right) by default. You can also insert a column anywhere else using the insert() method:

people.insert(1, "height", [172, 181, 185])
people
weight height hobby age over 30 pets
alice 68 172 Biking 38 True NaN
bob 83 181 Dancing 39 True 0.0
charles 112 185 NaN 31 True 5.0

You can also create new columns by calling the assign() method. Note that this returns a new DataFrame object; the original is not modified:

p2 = people.assign(
    bmi = people["weight"] / (people["height"] / 100) ** 2,
    has_pets = people["pets"] > 0
)
p2
weight height hobby age over 30 pets bmi has_pets
alice 68 172 Biking 38 True NaN 22.985398 False
bob 83 181 Dancing 39 True 0.0 25.335002 False
charles 112 185 NaN 31 True 5.0 32.724617 True

You can also rename a column:

p2.rename(columns={'bmi':'body_mass_index'})
weight height hobby age over 30 pets body_mass_index has_pets
alice 68 172 Biking 38 True NaN 22.985398 False
bob 83 181 Dancing 39 True 0.0 25.335002 False
charles 112 185 NaN 31 True 5.0 32.724617 True

3.2.4.6 Querying a DataFrame

The query() method lets you filter a DataFrame based on a query expression:

people.query("age > 30 and pets == 0")
weight height hobby age over 30 pets
bob 83 181 Dancing 39 True 0.0
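
query() can also reference local Python variables by prefixing them with @. A small sketch (min_age is just an illustrative variable):

min_age = 35
people.query("age > @min_age")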

3.2.4.7 Sorting a DataFrame

You can sort a DataFrame by calling its sort_index method. By default it sorts the rows by their index label, in ascending order, but let’s reverse the order:

people.sort_index(ascending=False)
weight height hobby age over 30 pets
charles 112 185 NaN 31 True 5.0
bob 83 181 Dancing 39 True 0.0
alice 68 172 Biking 38 True NaN

Note that sort_index returned a sorted copy of the DataFrame. To modify people directly, we can set the inplace argument to True. Also, we can sort the columns instead of the rows by setting axis=1:

people.sort_index(axis=1, inplace=True)
people
age height hobby over 30 pets weight
alice 38 172 Biking True NaN 68
bob 39 181 Dancing True 0.0 83
charles 31 185 NaN True 5.0 112

To sort the DataFrame by the values instead of the labels, we can use sort_values and specify the column to sort by:

people.sort_values(by="age", inplace=True)
people
age height hobby over 30 pets weight
charles 31 185 NaN True 5.0 112
alice 38 172 Biking True NaN 68
bob 39 181 Dancing True 0.0 83

3.2.4.8 Plotting a DataFrame

Just like for Series, pandas makes it easy to draw nice graphs based on a DataFrame.

For example, it is trivial to create a line plot from a DataFrame’s data by calling its plot method:

people = people.assign(
    body_mass_index = people["weight"] / (people["height"] / 100) ** 2
)
people.plot(kind = "line", x = "body_mass_index", y = ["height", "weight"])
plt.show()

Again, there are way too many options to list here: the best option is to scroll through the Visualization page in pandas’ documentation, find the plot you are interested in and look at the example code.

3.2.4.9 Operations on DataFrames

Although DataFrames do not try to mimic NumPy arrays, there are a few similarities. Let’s create a DataFrame to demonstrate this:

grades_array = np.array([[8,8,9],[10,9,9],[4, 8, 2], [9, 10, 10]])
grades = pd.DataFrame(grades_array, columns=["sep", "oct", "nov"], index=["alice","bob","charles","darwin"])
grades
sep oct nov
alice 8 8 9
bob 10 9 9
charles 4 8 2
darwin 9 10 10

You can apply NumPy mathematical functions on a DataFrame: the function is applied to all values:

np.sqrt(grades)
sep oct nov
alice 2.828427 2.828427 3.000000
bob 3.162278 3.000000 3.000000
charles 2.000000 2.828427 1.414214
darwin 3.000000 3.162278 3.162278
grades + 1
sep oct nov
alice 9 9 10
bob 11 10 10
charles 5 9 3
darwin 10 11 11

Aggregation operations, such as computing the max(), the sum() or the mean() of a DataFrame, apply to each column, and you get back a Series object:

grades.mean()
sep    7.75
oct    8.75
nov    7.50
dtype: float64

Most of these functions take an optional axis parameter which lets you specify along which axis of the DataFrame you want the operation executed. The default is axis=0, meaning that the operation is executed vertically (on each column). You can set axis=1 to execute the operation horizontally (on each row). For example, let’s find out which students had all grades greater than 5:

(grades > 5).all(axis = 1)
alice       True
bob         True
charles    False
darwin      True
dtype: bool
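
Similarly, an aggregation with axis=1 is computed per row. As a quick sketch, here is each student’s average grade:

grades.mean(axis=1)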

If you add a Series object to a DataFrame (or execute any other binary operation between them), pandas aligns the Series with the DataFrame’s columns and broadcasts the operation to all rows. This works as expected when the Series’ labels match the DataFrame’s column labels. For example, let’s subtract the mean of the DataFrame (a Series holding each column’s mean) from the DataFrame:

grades - grades.mean()  # equivalent to: grades - [7.75, 8.75, 7.50]
sep oct nov
alice 0.25 -0.75 1.5
bob 2.25 0.25 1.5
charles -3.75 -0.75 -5.5
darwin 1.25 1.25 2.5
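
To operate row-wise instead, arithmetic methods such as sub() accept an axis argument. For example, this sketch subtracts each student’s own average from their grades:

grades.sub(grades.mean(axis=1), axis=0)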

If you want to subtract the global mean from every grade, here is one way to do it:

grades - grades.values.mean() # subtracts the global mean (8.00) from all grades
sep oct nov
alice 0.0 0.0 1.0
bob 2.0 1.0 1.0
charles -4.0 0.0 -6.0
darwin 1.0 2.0 2.0

The following shows the behavior of NaN values when adding two DataFrames: cells that are missing from either DataFrame end up as NaN in the result:

bonus_array = np.array([[0,np.nan,2],[np.nan,1,0],[0, 1, 0], [3, 3, 0]])
bonus_points = pd.DataFrame(bonus_array, columns=["oct", "nov", "dec"], index=["bob","colin", "darwin", "charles"])
bonus_points
oct nov dec
bob 0.0 NaN 2.0
colin NaN 1.0 0.0
darwin 0.0 1.0 0.0
charles 3.0 3.0 0.0
grades + bonus_points
dec nov oct sep
alice NaN NaN NaN NaN
bob NaN NaN 9.0 NaN
charles NaN 5.0 11.0 NaN
colin NaN NaN NaN NaN
darwin NaN 11.0 10.0 NaN

3.2.4.10 Handling missing data

Dealing with missing data is a frequent task when working with real life data. Pandas offers a few tools to handle missing data.

Let’s try to fix the problem above. For example, we can decide that missing data should result in a zero instead of NaN. We can replace all NaN values with any value of our choosing using the fillna() method:

(grades + bonus_points).fillna(0)
dec nov oct sep
alice 0.0 0.0 0.0 0.0
bob 0.0 0.0 9.0 0.0
charles 0.0 5.0 11.0 0.0
colin 0.0 0.0 0.0 0.0
darwin 0.0 11.0 10.0 0.0
final_grades = grades + bonus_points
final_grades
dec nov oct sep
alice NaN NaN NaN NaN
bob NaN NaN 9.0 NaN
charles NaN 5.0 11.0 NaN
colin NaN NaN NaN NaN
darwin NaN 11.0 10.0 NaN

We can call the dropna() method to get rid of rows that are full of NaNs:

final_grades_clean = final_grades.dropna(how="all")
final_grades_clean
dec nov oct sep
bob NaN NaN 9.0 NaN
charles NaN 5.0 11.0 NaN
darwin NaN 11.0 10.0 NaN

Now let’s remove columns that are full of NaNs by setting the axis argument to 1:

final_grades_clean = final_grades_clean.dropna(axis=1, how="all")
final_grades_clean
nov oct
bob NaN 9.0
charles 5.0 11.0
darwin 11.0 10.0
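
Another handy tool is isna(), which flags missing cells; combined with sum() it counts the missing values in each column (a quick sketch):

final_grades.isna().sum()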

3.2.4.11 Aggregating with groupby

Similar to the SQL language, pandas lets you split your data into groups and run calculations over each group.

First, let’s add some extra data about each person so we can group them, and let’s go back to the final_grades DataFrame so we can see how NaN values are handled:

final_grades["hobby"] = ["Biking", "Dancing", np.nan, "Dancing", "Biking"]
final_grades
dec nov oct sep hobby
alice NaN NaN NaN NaN Biking
bob NaN NaN 9.0 NaN Dancing
charles NaN 5.0 11.0 NaN NaN
colin NaN NaN NaN NaN Dancing
darwin NaN 11.0 10.0 NaN Biking

Now let’s group data in this DataFrame by hobby:

grouped_grades = final_grades.groupby("hobby")

We are ready to compute the average grade per hobby:

grouped_grades.mean()
dec nov oct sep
hobby
Biking NaN 11.0 10.0 NaN
Dancing NaN NaN 9.0 NaN

That was easy! Note that the NaN values have simply been skipped when computing the means.
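
A grouped object can also compute several aggregates at once via agg(). Here is a quick sketch using the same grouping:

grouped_grades.agg(["mean", "max"])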

3.2.4.12 Pivot tables (Optional)

Pandas supports spreadsheet-like pivot tables that allow quick data summarization.
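
As a minimal sketch (using a small, hypothetical sales table rather than any data defined above), pivot_table() builds a cross-tabulated summary in a single call:

sales = pd.DataFrame({
    "city":    ["Paris", "Paris", "London", "London"],
    "product": ["A", "B", "A", "B"],
    "amount":  [10, 20, 30, 40],
})
# One row per city, one column per product, and the mean amount in each cell
sales.pivot_table(index="city", columns="product", values="amount", aggfunc="mean")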

3.2.4.13 Overview functions

When dealing with large DataFrames, it is useful to get a quick overview of their contents. Pandas offers a few functions for this. First, let’s create a large DataFrame with a mix of numeric values, missing values and text values. Notice how Jupyter displays only the corners of the DataFrame:

much_data = np.fromfunction(lambda x,y: (x+y*y)%17*11, (10000, 26))
large_df = pd.DataFrame(much_data, columns=list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"))
large_df[large_df % 16 == 0] = np.nan
large_df.insert(3,"some_text", "Blabla")
large_df
A B C some_text D E F G H I ... Q R S T U V W X Y Z
0 NaN 11.0 44.0 Blabla 99.0 NaN 88.0 22.0 165.0 143.0 ... 11.0 NaN 11.0 44.0 99.0 NaN 88.0 22.0 165.0 143.0
1 11.0 22.0 55.0 Blabla 110.0 NaN 99.0 33.0 NaN 154.0 ... 22.0 11.0 22.0 55.0 110.0 NaN 99.0 33.0 NaN 154.0
2 22.0 33.0 66.0 Blabla 121.0 11.0 110.0 44.0 NaN 165.0 ... 33.0 22.0 33.0 66.0 121.0 11.0 110.0 44.0 NaN 165.0
3 33.0 44.0 77.0 Blabla 132.0 22.0 121.0 55.0 11.0 NaN ... 44.0 33.0 44.0 77.0 132.0 22.0 121.0 55.0 11.0 NaN
4 44.0 55.0 88.0 Blabla 143.0 33.0 132.0 66.0 22.0 NaN ... 55.0 44.0 55.0 88.0 143.0 33.0 132.0 66.0 22.0 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9995 NaN NaN 33.0 Blabla 88.0 165.0 77.0 11.0 154.0 132.0 ... NaN NaN NaN 33.0 88.0 165.0 77.0 11.0 154.0 132.0
9996 NaN 11.0 44.0 Blabla 99.0 NaN 88.0 22.0 165.0 143.0 ... 11.0 NaN 11.0 44.0 99.0 NaN 88.0 22.0 165.0 143.0
9997 11.0 22.0 55.0 Blabla 110.0 NaN 99.0 33.0 NaN 154.0 ... 22.0 11.0 22.0 55.0 110.0 NaN 99.0 33.0 NaN 154.0
9998 22.0 33.0 66.0 Blabla 121.0 11.0 110.0 44.0 NaN 165.0 ... 33.0 22.0 33.0 66.0 121.0 11.0 110.0 44.0 NaN 165.0
9999 33.0 44.0 77.0 Blabla 132.0 22.0 121.0 55.0 11.0 NaN ... 44.0 33.0 44.0 77.0 132.0 22.0 121.0 55.0 11.0 NaN

10000 rows × 27 columns

The head() method returns the top 5 rows:

large_df.head()
A B C some_text D E F G H I ... Q R S T U V W X Y Z
0 NaN 11.0 44.0 Blabla 99.0 NaN 88.0 22.0 165.0 143.0 ... 11.0 NaN 11.0 44.0 99.0 NaN 88.0 22.0 165.0 143.0
1 11.0 22.0 55.0 Blabla 110.0 NaN 99.0 33.0 NaN 154.0 ... 22.0 11.0 22.0 55.0 110.0 NaN 99.0 33.0 NaN 154.0
2 22.0 33.0 66.0 Blabla 121.0 11.0 110.0 44.0 NaN 165.0 ... 33.0 22.0 33.0 66.0 121.0 11.0 110.0 44.0 NaN 165.0
3 33.0 44.0 77.0 Blabla 132.0 22.0 121.0 55.0 11.0 NaN ... 44.0 33.0 44.0 77.0 132.0 22.0 121.0 55.0 11.0 NaN
4 44.0 55.0 88.0 Blabla 143.0 33.0 132.0 66.0 22.0 NaN ... 55.0 44.0 55.0 88.0 143.0 33.0 132.0 66.0 22.0 NaN

5 rows × 27 columns

Of course there’s also a tail() function to view the bottom 5 rows. You can pass the number of rows you want:

large_df.tail(n=2)
A B C some_text D E F G H I ... Q R S T U V W X Y Z
9998 22.0 33.0 66.0 Blabla 121.0 11.0 110.0 44.0 NaN 165.0 ... 33.0 22.0 33.0 66.0 121.0 11.0 110.0 44.0 NaN 165.0
9999 33.0 44.0 77.0 Blabla 132.0 22.0 121.0 55.0 11.0 NaN ... 44.0 33.0 44.0 77.0 132.0 22.0 121.0 55.0 11.0 NaN

2 rows × 27 columns

The info() method prints out a summary of each column’s contents:

large_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 27 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   A          8823 non-null   float64
 1   B          8824 non-null   float64
 2   C          8824 non-null   float64
 3   some_text  10000 non-null  object 
 4   D          8824 non-null   float64
 5   E          8822 non-null   float64
 6   F          8824 non-null   float64
 7   G          8824 non-null   float64
 8   H          8822 non-null   float64
 9   I          8823 non-null   float64
 10  J          8823 non-null   float64
 11  K          8822 non-null   float64
 12  L          8824 non-null   float64
 13  M          8824 non-null   float64
 14  N          8822 non-null   float64
 15  O          8824 non-null   float64
 16  P          8824 non-null   float64
 17  Q          8824 non-null   float64
 18  R          8823 non-null   float64
 19  S          8824 non-null   float64
 20  T          8824 non-null   float64
 21  U          8824 non-null   float64
 22  V          8822 non-null   float64
 23  W          8824 non-null   float64
 24  X          8824 non-null   float64
 25  Y          8822 non-null   float64
 26  Z          8823 non-null   float64
dtypes: float64(26), object(1)
memory usage: 2.1+ MB

Finally, the describe() method gives a nice overview of the main aggregated values over each column:

* count: number of non-null (not NaN) values
* mean: mean of non-null values
* std: standard deviation of non-null values
* min: minimum of non-null values
* 25%, 50%, 75%: 25th, 50th and 75th percentiles of non-null values
* max: maximum of non-null values

large_df.describe()
A B C D E F G H I J ... Q R S T U V W X Y Z
count 8823.000000 8824.000000 8824.000000 8824.000000 8822.000000 8824.000000 8824.000000 8822.000000 8823.000000 8823.000000 ... 8824.000000 8823.000000 8824.000000 8824.000000 8824.000000 8822.000000 8824.000000 8824.000000 8822.000000 8823.000000
mean 87.977559 87.972575 87.987534 88.012466 87.983791 88.007480 87.977561 88.000000 88.022441 88.022441 ... 87.972575 87.977559 87.972575 87.987534 88.012466 87.983791 88.007480 87.977561 88.000000 88.022441
std 47.535911 47.535523 47.521679 47.521679 47.535001 47.519371 47.529755 47.536879 47.535911 47.535911 ... 47.535523 47.535911 47.535523 47.521679 47.521679 47.535001 47.519371 47.529755 47.536879 47.535911
min 11.000000 11.000000 11.000000 11.000000 11.000000 11.000000 11.000000 11.000000 11.000000 11.000000 ... 11.000000 11.000000 11.000000 11.000000 11.000000 11.000000 11.000000 11.000000 11.000000 11.000000
25% 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 ... 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000 44.000000
50% 88.000000 88.000000 88.000000 88.000000 88.000000 88.000000 88.000000 88.000000 88.000000 88.000000 ... 88.000000 88.000000 88.000000 88.000000 88.000000 88.000000 88.000000 88.000000 88.000000 88.000000
75% 132.000000 132.000000 132.000000 132.000000 132.000000 132.000000 132.000000 132.000000 132.000000 132.000000 ... 132.000000 132.000000 132.000000 132.000000 132.000000 132.000000 132.000000 132.000000 132.000000 132.000000
max 165.000000 165.000000 165.000000 165.000000 165.000000 165.000000 165.000000 165.000000 165.000000 165.000000 ... 165.000000 165.000000 165.000000 165.000000 165.000000 165.000000 165.000000 165.000000 165.000000 165.000000

8 rows × 26 columns

3.2.4.14 Saving & loading

Pandas can save DataFrames to various backends, including file formats such as CSV, Excel, JSON, HTML and HDF5, or to a SQL database. Let’s create a DataFrame to demonstrate this:

my_df = pd.DataFrame(
    [["Biking", 68.5, 1985, np.nan], ["Dancing", 83.1, 1984, 3]], 
    columns=["hobby","weight","birthyear","children"],
    index=["alice", "bob"]
)
my_df
hobby weight birthyear children
alice Biking 68.5 1985 NaN
bob Dancing 83.1 1984 3.0

Let’s save it to CSV, HTML and JSON:

my_df.to_csv("my_df.csv")
my_df.to_html("my_df.html")
my_df.to_json("my_df.json")

Done! Let’s take a peek at what was saved:

for filename in ("my_df.csv", "my_df.html", "my_df.json"):
    print("#", filename)
    with open(filename, "rt") as f:
        print(f.read())
        print()
# my_df.csv
,hobby,weight,birthyear,children
alice,Biking,68.5,1985,
bob,Dancing,83.1,1984,3.0


# my_df.html
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>hobby</th>
      <th>weight</th>
      <th>birthyear</th>
      <th>children</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>alice</th>
      <td>Biking</td>
      <td>68.5</td>
      <td>1985</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>bob</th>
      <td>Dancing</td>
      <td>83.1</td>
      <td>1984</td>
      <td>3.0</td>
    </tr>
  </tbody>
</table>

# my_df.json
{"hobby":{"alice":"Biking","bob":"Dancing"},"weight":{"alice":68.5,"bob":83.1},"birthyear":{"alice":1985,"bob":1984},"children":{"alice":null,"bob":3.0}}

Note that the index is saved as the first column (with no name) in a CSV file, as <th> tags in HTML and as keys in JSON.

Saving to other formats works very similarly, but some formats require extra libraries to be installed. For example, saving to Excel requires the openpyxl library:

try:
    my_df.to_excel("my_df.xlsx", sheet_name='People')
except ImportError as e:
    print(e)

Now let’s load our CSV file back into a DataFrame:

my_df_loaded = pd.read_csv("my_df.csv", index_col=0)
my_df_loaded
hobby weight birthyear children
alice Biking 68.5 1985 NaN
bob Dancing 83.1 1984 3.0

As you might guess, there are similar read_json(), read_html() and read_excel() functions as well. We can also read data straight from the Internet. For example, let’s load the top 1,000 U.S. cities from GitHub:

us_cities = None
try:
    csv_url = "https://raw.githubusercontent.com/plotly/datasets/master/us-cities-top-1k.csv"
    us_cities = pd.read_csv(csv_url, index_col=0)
    us_cities = us_cities.head()
except IOError as e:
    print(e)
us_cities
State Population lat lon
City
Marysville Washington 63269 48.051764 -122.177082
Perris California 72326 33.782519 -117.228648
Cleveland Ohio 390113 41.499320 -81.694361
Worcester Massachusetts 182544 42.262593 -71.802293
Columbia South Carolina 133358 34.000710 -81.034814

There are more options available, in particular regarding datetime format. Check out the documentation for more details.
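
For example, read_csv() can parse date columns directly through its parse_dates argument. A small sketch using a hypothetical file my_ts.csv:

pd.DataFrame({"date": ["2023-01-01", "2023-01-02"], "value": [1, 2]}).to_csv("my_ts.csv", index=False)
ts = pd.read_csv("my_ts.csv", parse_dates=["date"], index_col="date")
ts.index  # a DatetimeIndex, thanks to parse_dates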

3.2.4.15 Combining DataFrames

One powerful feature of pandas is its ability to perform SQL-like joins on DataFrames. Various types of joins are supported: inner joins, left/right outer joins and full joins. To illustrate this, let’s start by creating a couple of simple DataFrames:

city_loc = pd.DataFrame(
    [
        ["CA", "San Francisco", 37.781334, -122.416728],
        ["NY", "New York", 40.705649, -74.008344],
        ["FL", "Miami", 25.791100, -80.320733],
        ["OH", "Cleveland", 41.473508, -81.739791],
        ["UT", "Salt Lake City", 40.755851, -111.896657]
    ], columns=["state", "city", "lat", "lng"])
city_loc
state city lat lng
0 CA San Francisco 37.781334 -122.416728
1 NY New York 40.705649 -74.008344
2 FL Miami 25.791100 -80.320733
3 OH Cleveland 41.473508 -81.739791
4 UT Salt Lake City 40.755851 -111.896657
city_pop = pd.DataFrame(
    [
        [808976, "San Francisco", "California"],
        [8363710, "New York", "New-York"],
        [413201, "Miami", "Florida"],
        [2242193, "Houston", "Texas"]
    ], index=[3,4,5,6], columns=["population", "city", "state"])
city_pop
population city state
3 808976 San Francisco California
4 8363710 New York New-York
5 413201 Miami Florida
6 2242193 Houston Texas

Now let’s join these DataFrames using the merge() function:

pd.merge(left=city_loc, right=city_pop, on="city")
state_x city lat lng population state_y
0 CA San Francisco 37.781334 -122.416728 808976 California
1 NY New York 40.705649 -74.008344 8363710 New-York
2 FL Miami 25.791100 -80.320733 413201 Florida

Note that both DataFrames have a column named state, so in the result they were renamed to state_x and state_y.

Also, note that Cleveland, Salt Lake City and Houston were dropped because they don’t exist in both DataFrames. This is the equivalent of a SQL INNER JOIN. If you want a FULL OUTER JOIN, where no city gets dropped and NaN values are added, you must specify how="outer":

all_cities = pd.merge(left=city_loc, right=city_pop, on="city", how="outer")
all_cities
state_x city lat lng population state_y
0 CA San Francisco 37.781334 -122.416728 808976.0 California
1 NY New York 40.705649 -74.008344 8363710.0 New-York
2 FL Miami 25.791100 -80.320733 413201.0 Florida
3 OH Cleveland 41.473508 -81.739791 NaN NaN
4 UT Salt Lake City 40.755851 -111.896657 NaN NaN
5 NaN Houston NaN NaN 2242193.0 Texas

Of course LEFT OUTER JOIN is also available by setting how="left": only the cities present in the left DataFrame end up in the result. Similarly, with how="right" only cities in the right DataFrame appear in the result. For example:

pd.merge(left=city_loc, right=city_pop, on="city", how="right")
state_x city lat lng population state_y
0 CA San Francisco 37.781334 -122.416728 808976 California
1 NY New York 40.705649 -74.008344 8363710 New-York
2 FL Miami 25.791100 -80.320733 413201 Florida
3 NaN Houston NaN NaN 2242193 Texas

3.2.4.16 Concatenation

Rather than joining DataFrames, we may just want to concatenate them. That’s what concat() is for:

result_concat = pd.concat([city_loc, city_pop])
result_concat
state city lat lng population
0 CA San Francisco 37.781334 -122.416728 NaN
1 NY New York 40.705649 -74.008344 NaN
2 FL Miami 25.791100 -80.320733 NaN
3 OH Cleveland 41.473508 -81.739791 NaN
4 UT Salt Lake City 40.755851 -111.896657 NaN
3 California San Francisco NaN NaN 808976.0
4 New-York New York NaN NaN 8363710.0
5 Florida Miami NaN NaN 413201.0
6 Texas Houston NaN NaN 2242193.0

Note that this operation aligned the data horizontally (matching columns by name) but simply stacked the rows vertically. In this example, we end up with multiple rows having the same index label (e.g., 3). Pandas handles this rather gracefully:

result_concat.loc[3]
state city lat lng population
3 OH Cleveland 41.473508 -81.739791 NaN
3 California San Francisco NaN NaN 808976.0

Or you can tell Pandas to just ignore the index:

pd.concat([city_loc, city_pop], ignore_index=True)
state city lat lng population
0 CA San Francisco 37.781334 -122.416728 NaN
1 NY New York 40.705649 -74.008344 NaN
2 FL Miami 25.791100 -80.320733 NaN
3 OH Cleveland 41.473508 -81.739791 NaN
4 UT Salt Lake City 40.755851 -111.896657 NaN
5 California San Francisco NaN NaN 808976.0
6 New-York New York NaN NaN 8363710.0
7 Florida Miami NaN NaN 413201.0
8 Texas Houston NaN NaN 2242193.0

Notice that when a column does not exist in a DataFrame, it acts as if it was filled with NaN values. If we set join="inner", then only columns that exist in both DataFrames are returned:

pd.concat([city_loc, city_pop], join="inner")
state city
0 CA San Francisco
1 NY New York
2 FL Miami
3 OH Cleveland
4 UT Salt Lake City
3 California San Francisco
4 New-York New York
5 Florida Miami
6 Texas Houston
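
You can also concatenate column-wise by passing axis=1; rows are then aligned on the index, and missing combinations are filled with NaN (a quick sketch):

pd.concat([city_loc, city_pop], axis=1)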

3.2.4.17 Categories

It is quite frequent to have values that represent categories, for example 1 for female and 2 for male, or "A" for Good, "B" for Average, "C" for Bad. These categorical values can be hard to read and cumbersome to handle, but fortunately pandas makes it easy. To illustrate this, let’s take the city_pop DataFrame we created earlier, and add a column that represents a category:

city_eco = city_pop.copy()
city_eco["eco_code"] = [17, 17, 34, 20]
city_eco
population city state eco_code
3 808976 San Francisco California 17
4 8363710 New York New-York 17
5 413201 Miami Florida 34
6 2242193 Houston Texas 20

Right now the eco_code column is full of apparently meaningless codes. Let’s fix that. First, we will create a new categorical column based on the eco_codes:

city_eco["economy"] = city_eco["eco_code"].astype('category')
city_eco["economy"].cat.categories
Int64Index([17, 20, 34], dtype='int64')

Now we can give each category a meaningful name:

city_eco["economy"].cat.categories = ["Finance", "Energy", "Tourism"]
city_eco
population city state eco_code economy
3 808976 San Francisco California 17 Finance
4 8363710 New York New-York 17 Finance
5 413201 Miami Florida 34 Tourism
6 2242193 Houston Texas 20 Energy

Note that categorical values are sorted according to their categorical order, not their alphabetical order:

city_eco.sort_values(by="economy", ascending=False)
population city state eco_code economy
5 413201 Miami Florida 34 Tourism
6 2242193 Houston Texas 20 Energy
3 808976 San Francisco California 17 Finance
4 8363710 New York New-York 17 Finance

3.3 What next?

As you probably noticed by now, pandas is quite a large library with many features. Although we went through the most important features, there is still a lot to discover. Probably the best way to learn more is to get your hands dirty with some real-life data. It is also a good idea to go through pandas’ excellent documentation, in particular the Cookbook.

You can also work with BigQuery in Pandas. Check out https://pandas.pydata.org/docs/reference/api/pandas.read_gbq.html and https://pandas-gbq.readthedocs.io/en/latest/ for more details.
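
As a sketch only (it requires the pandas-gbq package, a Google Cloud project of your own and valid credentials; the project id below is a placeholder and the table name is just an example from the public Hacker News dataset):

# df = pd.read_gbq(
#     "SELECT title FROM `bigquery-public-data.hacker_news.full` LIMIT 5",
#     project_id="your-project-id",
# )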

3.4 References

  1. https://www.kaggle.com/learn/
  2. https://github.com/ageron/handson-ml3