{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Example usage\n", "\n", "Here we demonstrate how to use the `stepsel` package for stepwise selection of variables in a regression model and how to prepare categorical data for such a model." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Importing the packages\n", "import stepsel\n", "import numpy as np\n", "import statsmodels.api as sm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the data\n", "First we load the data and take a look at the first few rows. The dataset contains soccer match data from the Czech First League." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(518, 24) Index(['match_id', 'match_datetime', 'team', 'hga', 'team_opp', 'goals',\n", " 'goals_opp', 'yellow', 'yellow_opp', 'red', 'red_opp', 'penalty',\n", " 'penalty_opp', 'fouls', 'fouls_opp', 'attacks', 'attacks_opp',\n", " 'dangerous_attacks', 'dangerous_attacks_opp', 'ref_interv',\n", " 'ref_interv_opp', 'ref_interv_per_attack', 'ref_interv_per_attack_opp',\n", " 'ref_interv_per_attack_diff'],\n", " dtype='object')\n" ] } ], "source": [ "dt = stepsel.datasets.load_soccer_data()\n", "print(dt.shape, dt.columns)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | match_id | match_datetime | team | hga | team_opp | goals | goals_opp | yellow | yellow_opp | red | ... | fouls_opp | attacks | attacks_opp | dangerous_attacks | dangerous_attacks_opp | ref_interv | ref_interv_opp | ref_interv_per_attack | ref_interv_per_attack_opp | ref_interv_per_attack_diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2023-05-14 18:00:00 | Bohemians | H | Slovácko | 0.0 | 0.0 | 2 | 2 | 0 | ... | 12.0 | 117.0 | 142.0 | 71.0 | 88.0 | 20.0 | 18.0 | 0.140845 | 0.153846 | -0.013001 |
| 1 | 2 | 2023-05-14 15:00:00 | Jablonec | H | Baník Ostrava | 1.0 | 1.0 | 1 | 1 | 0 | ... | 11.0 | 90.0 | 116.0 | 74.0 | 58.0 | 14.0 | 22.0 | 0.120690 | 0.244444 | -0.123755 |
| 2 | 3 | 2023-05-14 15:00:00 | Teplice | H | Zlín | 2.0 | 1.0 | 4 | 4 | 0 | ... | 14.0 | 118.0 | 134.0 | 54.0 | 62.0 | 30.0 | 26.0 | 0.223881 | 0.220339 | 0.003542 |
| 3 | 4 | 2023-05-14 15:00:00 | Zbrojovka Brno | H | Pardubice | 0.0 | 2.0 | 1 | 2 | 0 | ... | 14.0 | 99.0 | 126.0 | 57.0 | 76.0 | 17.0 | 36.0 | 0.134921 | 0.363636 | -0.228716 |
| 4 | 5 | 2023-05-13 18:00:00 | Sparta Praha | H | Slavia Praha | 3.0 | 2.0 | 1 | 2 | 0 | ... | 15.0 | 87.0 | 145.0 | 47.0 | 94.0 | 15.0 | 29.0 | 0.103448 | 0.333333 | -0.229885 |
5 rows × 24 columns
\n", "