Working with the dataset

Parquet format

Each file in the datasets is stored in Apache Parquet, an efficient columnar storage format supported by most mainstream data processing and analysis tools. Some suggested tools for processing or viewing the data are listed below.

Pandas

pandas is a Python library that supports the Parquet format. A Parquet file can be read with the following code:

>>> import pandas as pd
>>> pd.read_parquet("Sensors/Device-usage/batterycharge.parquet")

       experimentid  userid               timestamp            source  status
0        wenetItaly      55 2020-10-17 16:32:26.075  charging_unknown    True
1        wenetItaly       5 2020-10-17 20:14:00.549              <NA>   False
2        wenetItaly       3 2020-10-17 20:14:00.550              <NA>   False
3        wenetItaly      34 2020-10-17 21:12:19.403       charging_ac    True
4        wenetItaly      90 2020-10-17 21:13:19.405       charging_ac    True

CLI

DuckDB is an in-process database that can query Parquet files directly (see the DuckDB documentation on reading Parquet files).

Run this command in the terminal

duckdb

Then, load the dataset

select * from 'Sensors/Device-usage/batterycharge.parquet/*.parquet';

And the result looks like

┌──────────────┬────────┬─────────────────────────┬──────────────────┬─────────┐
│ experimentid │ userid │        timestamp        │      source      │ status  │
│   varchar    │ int64  │      timestamp_ns       │     varchar      │ boolean │
├──────────────┼────────┼─────────────────────────┼──────────────────┼─────────┤
│ wenetItaly   │     55 │ 2020-10-15 16:32:26.075 │ charging_unknown │ true    │
│ wenetItaly   │      5 │ 2020-10-16 20:14:00.549 │ NULL             │ false   │
│ wenetItaly   │      3 │ 2020-10-17 20:14:00.55  │ NULL             │ false   │
│ wenetItaly   │     34 │ 2020-10-18 21:12:19.403 │ charging_ac      │ true    │
│ wenetItaly   │     90 │ 2020-10-19 21:13:19.405 │ charging_ac      │ true    │
└──────────────┴────────┴─────────────────────────┴──────────────────┴─────────┘

To get a description of the columns:

describe 'batterycharge.parquet/*.parquet';
┌──────────────┬──────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name  │ column_type  │  null   │   key   │ default │  extra  │
│   varchar    │   varchar    │ varchar │ varchar │ varchar │ varchar │
├──────────────┼──────────────┼─────────┼─────────┼─────────┼─────────┤
│ experimentid │ VARCHAR      │ YES     │ NULL    │ NULL    │ NULL    │
│ userid       │ BIGINT       │ YES     │ NULL    │ NULL    │ NULL    │
│ timestamp    │ TIMESTAMP_NS │ YES     │ NULL    │ NULL    │ NULL    │
│ source       │ VARCHAR      │ YES     │ NULL    │ NULL    │ NULL    │
│ status       │ BOOLEAN      │ YES     │ NULL    │ NULL    │ NULL    │
└──────────────┴──────────────┴─────────┴─────────┴─────────┴─────────┘

Desktop applications

  • Tad, a desktop application to visualize parquet files.

Code to get started

On this page, we collect valuable resources for processing the datasets of the LivePeople catalog. The page will be updated over time.

  • Feature engineering code is available on GitHub. This Python code generates a set of features that can be used to train machine learning models.