Working with the dataset
Parquet format
Each file in the datasets is stored in Apache Parquet, an efficient data storage format supported by most data processing and analysis tools. Some suggested tools for processing or viewing the data are listed below.
Pandas
pandas is a Python library that supports the Parquet format (reading requires a Parquet engine such as pyarrow or fastparquet). A Parquet file can be read with the following code:
>>> import pandas as pd
>>> pd.read_parquet("Sensors/Device-usage/batterycharge.parquet")
   experimentid  userid               timestamp            source  status
0    wenetItaly      55 2020-10-17 16:32:26.075  charging_unknown    True
1    wenetItaly       5 2020-10-17 20:14:00.549              <NA>   False
2    wenetItaly       3 2020-10-17 20:14:00.550              <NA>   False
3    wenetItaly      34 2020-10-17 21:12:19.403       charging_ac    True
4    wenetItaly      90 2020-10-17 21:13:19.405       charging_ac    True
CLI
DuckDB is an in-process database solution (see how to read Parquet files in the DuckDB documentation).
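Because DuckDB runs in-process, the same kind of query can also be issued from Python instead of the CLI. The following is a minimal sketch, assuming the third-party `duckdb` Python package is installed and that each dataset is a directory of `*.parquet` part files, as in the CLI query on this page. The helper names `parquet_glob`, `preview_query`, and `preview` are hypothetical and exist only for illustration.

```python
from pathlib import PurePosixPath


def parquet_glob(dataset_dir: str) -> str:
    """Return a glob matching every Parquet part file in a dataset directory."""
    return str(PurePosixPath(dataset_dir) / "*.parquet")


def preview_query(dataset_dir: str, limit: int = 5) -> str:
    """Build a SELECT statement that previews the first rows of a dataset."""
    # DuckDB reads a quoted file path directly as a table.
    # Embedded single quotes are doubled so the path stays one SQL string.
    path = parquet_glob(dataset_dir).replace("'", "''")
    return f"SELECT * FROM '{path}' LIMIT {int(limit)}"


def preview(dataset_dir: str, limit: int = 5):
    """Run the preview in-process (assumes the duckdb package is installed)."""
    import duckdb  # imported lazily so the query helpers work without it

    return duckdb.sql(preview_query(dataset_dir, limit)).fetchall()


query = preview_query("Sensors/Device-usage/batterycharge.parquet")
print(query)
```

Calling `preview("Sensors/Device-usage/batterycharge.parquet")` from the dataset root would return the first rows as a list of tuples.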
Run this command in the terminal:
duckdb
Then load the dataset:
select * from 'Sensors/Device-usage/batterycharge.parquet/*.parquet';
The result looks like this:
┌──────────────┬────────┬─────────────────────────┬──────────────────┬─────────┐
│ experimentid │ userid │        timestamp        │      source      │ status  │
│   varchar    │ int64  │      timestamp_ns       │     varchar      │ boolean │
├──────────────┼────────┼─────────────────────────┼──────────────────┼─────────┤
│ wenetItaly   │     55 │ 2020-10-15 16:32:26.075 │ charging_unknown │ true    │
│ wenetItaly   │      5 │ 2020-10-16 20:14:00.549 │ NULL             │ false   │
│ wenetItaly   │      3 │ 2020-10-17 20:14:00.55  │ NULL             │ false   │
│ wenetItaly   │     34 │ 2020-10-18 21:12:19.403 │ charging_ac      │ true    │
│ wenetItaly   │     90 │ 2020-10-19 21:13:19.405 │ charging_ac      │ true    │
└──────────────┴────────┴─────────────────────────┴──────────────────┴─────────┘
To get a description of the columns:
describe 'batterycharge.parquet/*.parquet';
┌──────────────┬──────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name  │ column_type  │  null   │   key   │ default │  extra  │
│   varchar    │   varchar    │ varchar │ varchar │ varchar │ varchar │
├──────────────┼──────────────┼─────────┼─────────┼─────────┼─────────┤
│ experimentid │ VARCHAR      │ YES     │ NULL    │ NULL    │ NULL    │
│ userid       │ BIGINT       │ YES     │ NULL    │ NULL    │ NULL    │
│ timestamp    │ TIMESTAMP_NS │ YES     │ NULL    │ NULL    │ NULL    │
│ source       │ VARCHAR      │ YES     │ NULL    │ NULL    │ NULL    │
│ status       │ BOOLEAN      │ YES     │ NULL    │ NULL    │ NULL    │
└──────────────┴──────────────┴─────────┴─────────┴─────────┴─────────┘
Desktop applications
- Tad, a desktop application to visualize parquet files.
Code to get started
On this page, we collect valuable resources for processing the datasets of the LivePeople catalog. The page will be updated over time.
- Feature engineering code (Python) is available on GitHub. It generates a set of features that can be used to train machine learning models.
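To give a rough idea of what feature generation from a sensor table can look like, here is a minimal pandas sketch. It is not the code from the GitHub repository: the function name `charging_features` and the chosen features are assumptions made for illustration only. The sample rows are copied from the pandas output shown earlier on this page. Real use would start from `pd.read_parquet("Sensors/Device-usage/batterycharge.parquet")`.

```python
import pandas as pd


def charging_features(events: pd.DataFrame) -> pd.DataFrame:
    """Aggregate battery-charge events into one feature row per user.

    Hypothetical helper for illustration; not the repository's feature set.
    """
    grouped = events.groupby("userid")
    features = pd.DataFrame(
        {
            # Total number of battery events recorded for the user.
            "n_events": grouped.size(),
            # Number of events where the device was charging (status is True).
            "n_charging": grouped["status"].sum().astype("int64"),
            # Timestamp of the user's earliest event.
            "first_seen": grouped["timestamp"].min(),
        }
    )
    features["share_charging"] = features["n_charging"] / features["n_events"]
    return features


# Sample rows copied from the pandas output shown earlier on this page.
sample = pd.DataFrame(
    {
        "experimentid": ["wenetItaly"] * 5,
        "userid": [55, 5, 3, 34, 90],
        "timestamp": pd.to_datetime(
            [
                "2020-10-17 16:32:26.075",
                "2020-10-17 20:14:00.549",
                "2020-10-17 20:14:00.550",
                "2020-10-17 21:12:19.403",
                "2020-10-17 21:13:19.405",
            ]
        ),
        "source": ["charging_unknown", None, None, "charging_ac", "charging_ac"],
        "status": [True, False, False, True, True],
    }
)

features = charging_features(sample)
print(features)
```

The result is indexed by `userid`, so it can be joined with other per-user tables before training a model.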