10 Working with Large Datasets

HoloViews handles even high-dimensional datasets easily, and the standard mechanisms discussed so far work well as long as you select a small enough subset of the data to display at any one time. However, some datasets are just inherently large, even for a single frame of data, and cannot safely be transferred for display in any standard web browser. Luckily, HoloViews makes it simple to use the separate Datashader library together with any of the plotting extension libraries, including Bokeh and Matplotlib. Datashader is designed to complement standard plotting libraries by providing faithful visualizations for very large datasets, focusing on revealing the overall distribution rather than individual data points.

Datashader's computations are accelerated using Numba, making it fast to work with datasets of millions or billions of datapoints stored in Dask dataframes. Dask dataframes provide an API that is functionally equivalent to Pandas, but allow working with data out of core and scaling out to many processors across compute clusters. Here we will use Dask to load a large Parquet-format file of taxi coordinates.
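
As a rough sketch of what that loading step looks like (the file path and the use of persist() here are illustrative assumptions, not the exact code used later), Dask mirrors the familiar Pandas API:

import dask.dataframe as dd

# Lazily open a (hypothetical) Parquet file of taxi trip coordinates;
# nothing is read into memory until a computation actually needs it.
df = dd.read_parquet('data/nyc_taxi_wide.parq')

# Optionally keep the loaded partitions in memory so that repeated
# aggregations do not re-read the file each time.
df = df.persist()

print(len(df))    # triggers a computation across all partitions
print(df.head())  # Pandas-style API: head(), columns, dtypes, ...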

How does datashader work?

  • Tools like Bokeh map Data directly into an HTML/JavaScript Plot
  • datashader instead renders Data into a plot-sized Aggregate array, from which an Image can be constructed and then embedded into a Bokeh Plot (see the sketch after this list)
  • Only the fixed-size Image needs to be sent to the browser, allowing millions or billions of datapoints to be used
  • Every step automatically adjusts to the data, but can be customized
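
The low-level datashader API makes these two steps explicit. The following is a minimal sketch using randomly generated points rather than any real dataset (the column names, canvas size, and colormap are arbitrary choices for illustration):

import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

# Hypothetical example data: 100,000 random (x, y) points
df = pd.DataFrame(np.random.randn(100000, 2), columns=['x', 'y'])

# Step 1: aggregate the points onto a fixed, plot-sized grid of counts
canvas = ds.Canvas(plot_width=400, plot_height=400)
agg = canvas.points(df, 'x', 'y', agg=ds.count())   # 400x400 array of counts

# Step 2: shade the aggregate into a fixed-size RGB image
img = tf.shade(agg, cmap=['lightblue', 'darkblue'], how='log')

Only img, a 400x400 image, would ever need to reach the browser; the datashade operation imported below wraps both steps so that they re-run automatically whenever you zoom or pan in an interactive plot.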

When not to use datashader

  • Plotting fewer than 1e5 or 1e6 data points
  • When every datapoint must be resolvable individually; standard Bokeh will render all of them
  • For full interactivity (hover tools) with every datapoint

When to use datashader

  • Actual big data; when Bokeh/Matplotlib have trouble
  • When the distribution matters more than individual points
  • When you find yourself sampling, decimating, or binning to better understand the distribution
In [1]:
import pandas as pd
import holoviews as hv
import dask.dataframe as dd
import datashader as ds
import geoviews as gv

# HoloViews operations that apply datashader to HoloViews elements
from holoviews.operation.datashader import datashade, aggregate

# Use the Bokeh backend so plots support interactive zooming and panning
hv.extension('bokeh')
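
With these imports in place, a typical next step is to wrap the data in a HoloViews element and pass it through datashade, which returns a DynamicMap that re-aggregates and re-shades on every zoom or pan. The snippet below is only a sketch: it assumes a dataframe df with 'dropoff_x' and 'dropoff_y' columns (e.g. Web Mercator coordinates), which has not actually been loaded at this point.

# Assumes a Dask or Pandas dataframe `df` with dropoff_x/dropoff_y columns
points = hv.Points(df, ['dropoff_x', 'dropoff_y'])

# datashade() rasterizes the points to a fixed-size image on the Python side,
# recomputing the aggregation whenever the Bokeh plot is zoomed or panned
shaded = datashade(points)
shaded.opts(width=700, height=500)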