Add data

Data is the heart of any dashboard, so it pays to prepare the table carefully. An optimized DataFrame speeds up calculations and reduces memory consumption.

To optimize performance, you can use Pandas' own methods. Below are a few basic practices for storing and processing large data sets efficiently.

Tip 1: Don't load unnecessary data

Load only the columns that will be used for building visualizations and/or filtering:

pd.read_csv('data.csv', usecols=['only', 'used', 'columns'])
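
As a small self-contained illustration of the effect, the sketch below reads the same CSV text twice, once in full and once with usecols, and compares the memory footprint. The column names and values are made up for the example.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would be a file on disk.
csv_text = (
    "id,city,price,comment\n"
    "1,Oslo,10.5,ok\n"
    "2,Rome,7.0,late\n"
    "3,Oslo,3.2,ok\n"
)

# Read every column, then only the two columns the dashboard would use.
full = pd.read_csv(io.StringIO(csv_text))
slim = pd.read_csv(io.StringIO(csv_text), usecols=["city", "price"])

full_bytes = full.memory_usage(deep=True).sum()
slim_bytes = slim.memory_usage(deep=True).sum()

print(list(slim.columns))
print(slim_bytes < full_bytes)
```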

Tip 2: Use suitable data types

We can optimize the data types to reduce memory usage. The memory_usage() method reports how much memory the data occupies. It returns a Series whose index holds the original column names and whose values are the number of bytes used by each column.

The syntax of memory_usage() is as follows:

DataFrame.memory_usage(index=True, deep=False)
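
To make the two parameters concrete, here is a minimal sketch on a tiny made-up DataFrame. With index=True the result includes an "Index" entry, and with deep=True pandas also counts the Python objects stored in object columns, so the figure for a string column can only grow.

```python
import pandas as pd

# Made-up frame: one 64-bit integer column and one object (string) column.
df = pd.DataFrame({
    "n": pd.Series([1, 2, 3], dtype="int64"),
    "s": pd.Series(["a", "bb", "ccc"], dtype="object"),
})

shallow = df.memory_usage(index=True, deep=False)  # pointers only for "s"
deep = df.memory_usage(index=True, deep=True)      # also counts the strings
no_index = df.memory_usage(index=False)            # omits the "Index" entry

print(shallow)
print(deep)
print(no_index)
```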

For numeric data, use the smallest data type that can hold the values.
In this code, the TMAX and TMIN columns are converted from the int64 data type to the int8 data type using the .astype() method. Comparing the two printouts shows the decrease in memory used by these columns.

import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/toddwschneider/nyc-taxi-data/master/data/central_park_weather.csv')
print("Initially Memory usage:")
print(data[['TMAX', 'TMIN']].memory_usage(index=True, deep=False))
print()
# int8 stores one byte per value instead of the eight bytes used by int64
data[['TMAX', 'TMIN']] = data[['TMAX', 'TMIN']].astype('int8')
print("Memory used after optimization:")
print(data[['TMAX', 'TMIN']].memory_usage(index=True, deep=False))

Initially Memory usage:
Index      128
TMAX     39432
TMIN     39432
dtype: int64

Memory used after optimization:
Index     128
TMAX     4929
TMIN     4929
dtype: int64
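
If you would rather not pick the target type by hand, pandas can choose the smallest integer type that fits the values through pd.to_numeric with the downcast argument. The sketch below uses made-up temperature readings rather than the weather file above.

```python
import pandas as pd

# Made-up readings; every value fits in the int8 range (-128..127).
temps = pd.Series([71, 85, 102, -3, 40], dtype="int64")
small = pd.to_numeric(temps, downcast="integer")
print(small.dtype)

# A value outside the int8 range makes pandas settle on a wider type.
wide = pd.to_numeric(pd.Series([71, 300], dtype="int64"), downcast="integer")
print(wide.dtype)
```

Unlike .astype(), which converts to whatever type you name, downcast only narrows the type when all values still fit.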

Non-numeric columns of a DataFrame are stored with the object data type, which can often be changed to the category data type. Non-numeric feature columns usually hold categorical values that repeat many times, so storing each distinct value only once saves memory.

import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/toddwschneider/nyc-taxi-data/master/data/central_park_weather.csv')

print("Initially Memory usage:")
print(data['NAME'].dtypes)
print(data['NAME'].memory_usage())
print()
data['NAME']=data['NAME'].astype('category')
print("Memory used after optimization:")
print(data['NAME'].dtypes)
print(data['NAME'].memory_usage())

Initially Memory usage:
object
39560

Memory used after optimization:
category
5173
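
The saving depends on how often values repeat: a column of mostly unique strings gains little from the category type. One possible way to guard against that is sketched below. The helper to_category_if_repetitive and its 0.5 threshold are assumptions made for illustration, not part of pandas or Dash Express.

```python
import pandas as pd

def to_category_if_repetitive(s: pd.Series, max_ratio: float = 0.5) -> pd.Series:
    """Convert to category only when distinct values are a small share of rows."""
    if len(s) == 0:
        return s
    ratio = s.nunique(dropna=True) / len(s)
    return s.astype("category") if ratio <= max_ratio else s

# Made-up columns: one with heavy repetition, one with all-unique values.
repetitive = pd.Series(["NY", "NJ", "NY", "NY", "NJ", "NY"], dtype="object")
unique_ids = pd.Series(["a1", "b2", "c3", "d4"], dtype="object")

converted = to_category_if_repetitive(repetitive)
untouched = to_category_if_repetitive(unique_ids)
print(converted.dtype)
print(untouched.dtype)
```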

The Page object needs a function that returns the data. Below is an example that loads the data with these optimizations applied:

import pandas as pd

def get_df():
    # Load only the columns that are actually used
    df = pd.read_csv('titanic.csv', usecols=['survived', 'age', 'class', 'who', 'alone'])

    # Replace missing ages with 0 so the column can be cast to an integer type
    df['age'] = df['age'].fillna(0)

    # Convert to the most compact suitable data types
    df = df.astype(
        {
            'survived': 'int8',
            'age': 'int8',
            'class': 'category',
            'who': 'category',
            'alone': 'bool'
        }
    )
    return df
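
To check what conversions like these save, you can compare memory_usage(deep=True).sum() before and after. The sketch below applies the same steps to a small made-up frame with the same column names. It does not use the real Titanic file, so the numbers it prints are only illustrative.

```python
import pandas as pd

# Made-up rows that mimic the column names above; not the real data set.
raw = pd.DataFrame({
    "survived": pd.Series([0, 1, 1, 0] * 50, dtype="int64"),
    "age": pd.Series([22.0, 38.0, None, 35.0] * 50, dtype="float64"),
    "class": pd.Series(["Third", "First", "Third", "First"] * 50, dtype="object"),
    "who": pd.Series(["man", "woman", "woman", "man"] * 50, dtype="object"),
    "alone": [False, False, True, True] * 50,
})

# Same steps as get_df: fill missing ages, then cast to compact types.
optimized = raw.assign(age=raw["age"].fillna(0)).astype({
    "survived": "int8",
    "age": "int8",
    "class": "category",
    "who": "category",
    "alone": "bool",
})

before = raw.memory_usage(deep=True).sum()
after = optimized.memory_usage(deep=True).sum()
print(before, after)
```

Keep in mind that filling missing ages with 0 makes them indistinguishable from a real age of 0, so a filter on age will treat those rows as valid values.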

The data-loading function must be passed when the Page object is initialized:

page = Page(
    ...
    getdf=get_df,               # Function that returns a pd.DataFrame
    ...
    )

Note

Dash Express caches the DataFrame and does not request the data again for every filtering request.