share/hotel_questions.ipynb

{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"hotel_questions.ipynb","provenance":[],"collapsed_sections":[],"authorship_tag":"ABX9TyN2DvMFlTW8C58CuKlgqViY"},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"}},"cells":[{"cell_type":"code","execution_count":2,"metadata":{"id":"qZElOwJ35A8u","executionInfo":{"status":"ok","timestamp":1642092549357,"user_tz":-60,"elapsed":618,"user":{"displayName":"Pierre-Edouard PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}}},"outputs":[],"source":["import pandas as pd\n","import numpy as np\n","import matplotlib.pyplot as plt\n","import seaborn as sns\n","\n","sns.set_style(\"whitegrid\")"]},{"cell_type":"markdown","source":["# Context\n","\n","We consider a dataset for experimenting with the prediction of hotel room booking cancellations. We want to measure the prediction quality that can be achieved for cancellations. We also want to identify the most discriminating factors behind the automated predictions, in order to guide countermeasures that would reduce the profit lost to cancellations.\n","\n","# Preliminary thoughts\n","\n","*What are the main categories of problems that machine learning algorithms can solve? Which category (or categories) could the problem above belong to?*\n","\n","*How should this problem be approached? What first steps might be needed to start the modeling work and formulate the problem properly? 
What mistakes should be avoided?*\n"],"metadata":{"id":"8YLVbt3H1vqS"}},{"cell_type":"code","source":["df = pd.read_csv('https://git.sdf.org/p6e7p7/share/raw/branch/master/hotel_booking.csv')"],"metadata":{"id":"7WHT0w-F5NwL","executionInfo":{"status":"ok","timestamp":1642092561747,"user_tz":-60,"elapsed":10141,"user":{"displayName":"Pierre-Edouard PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}}},"execution_count":3,"outputs":[]},{"cell_type":"markdown","source":["# Feature engineering"],"metadata":{"id":"a7TgOmBuFwAK"}},{"cell_type":"code","source":["df.head()"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":353},"id":"-iBiZZKkETMR","executionInfo":{"status":"ok","timestamp":1642092573757,"user_tz":-60,"elapsed":236,"user":{"displayName":"Pierre-Edouard PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}},"outputId":"20d3e627-1e36-4a45-9633-0d3b67803359"},"execution_count":4,"outputs":[{"output_type":"execute_result","data":{"text/html":["\n"," <div id=\"df-4cb66c70-03c2-4cbe-99fc-ae24d3041504\">\n"," <div class=\"colab-df-container\">\n"," <div>\n","<style scoped>\n"," .dataframe tbody tr th:only-of-type {\n"," vertical-align: middle;\n"," }\n","\n"," .dataframe tbody tr th {\n"," vertical-align: top;\n"," }\n","\n"," .dataframe thead th {\n"," text-align: right;\n"," }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n"," <thead>\n"," <tr style=\"text-align: right;\">\n"," <th></th>\n"," <th>hotel</th>\n"," <th>is_canceled</th>\n"," <th>lead_time</th>\n"," <th>arrival_date_year</th>\n"," <th>arrival_date_month</th>\n"," <th>arrival_date_week_number</th>\n"," <th>arrival_date_day_of_month</th>\n"," <th>stays_in_weekend_nights</th>\n"," <th>stays_in_week_nights</th>\n"," <th>adults</th>\n"," <th>children</th>\n"," <th>babies</th>\n"," <th>meal</th>\n"," 
<th>country</th>\n"," <th>market_segment</th>\n"," <th>distribution_channel</th>\n"," <th>is_repeated_guest</th>\n"," <th>previous_cancellations</th>\n"," <th>previous_bookings_not_canceled</th>\n"," <th>reserved_room_type</th>\n"," <th>assigned_room_type</th>\n"," <th>booking_changes</th>\n"," <th>deposit_type</th>\n"," <th>agent</th>\n"," <th>company</th>\n"," <th>days_in_waiting_list</th>\n"," <th>customer_type</th>\n"," <th>adr</th>\n"," <th>required_car_parking_spaces</th>\n"," <th>total_of_special_requests</th>\n"," <th>reservation_status</th>\n"," <th>reservation_status_date</th>\n"," </tr>\n"," </thead>\n"," <tbody>\n"," <tr>\n"," <th>0</th>\n"," <td>Resort Hotel</td>\n"," <td>0</td>\n"," <td>342</td>\n"," <td>2015</td>\n"," <td>July</td>\n"," <td>27</td>\n"," <td>1</td>\n"," <td>0</td>\n"," <td>0</td>\n"," <td>2</td>\n"," <td>0.0</td>\n"," <td>0</td>\n"," <td>BB</td>\n"," <td>PRT</td>\n"," <td>Direct</td>\n"," <td>Direct</td>\n"," <td>0</td>\n"," <td>0</td>\n"," <td>0</td>\n"," <td>C</td>\n"," <td>C</td>\n"," <td>3</td>\n"," <td>No Deposit</td>\n"," <td>NaN</td>\n"," <td>NaN</td>\n"," <td>0</td>\n"," <td>Transient</td>\n"," <td>0.0</td>\n"," <td>0</td>\n"," <td>0</td>\n"," <td>Check-Out</td>\n"," <td>2015-07-01</td>\n"," </tr>\n"," <tr>\n"," <th>1</th>\n"," <td>Resort Hotel</td>\n"," <td>0</td>\n"," <td>737</td>\n"," <td>2015</td>\n"," <td>July</td>\n"," <td>27</td>\n"," <td>1</td>\n"," <td>0</td>\n"," <td>0</td>\n"," <td>2</td>\n"," <td>0.0</td>\n"," <td>0</td>\n"," <td>BB</td>\n"," <td>PRT</td>\n"," <td>Direct</td>\n"," <td>Direct</td>\n"," <td>0</td>\n"," <td>0</td>\n"," <td>0</td>\n"," <td>C</td>\n"," <td>C</td>\n"," <td>4</td>\n"," <td>No Deposit</td>\n"," <td>NaN</td>\n"," <td>NaN</td>\n"," <td>0</td>\n"," <td>Transient</td>\n"," <td>0.0</td>\n"," <td>0</td>\n"," <td>0</td>\n"," <td>Check-Out</td>\n"," <td>2015-07-01</td>\n"," </tr>\n"," <tr>\n"," <th>2</th>\n"," <td>Resort Hotel</td>\n"," <td>0</td>\n"," <td>7</td>\n"," <td>2015</td>\n"," 
<td>July</td>\n"," <td>27</td>\n"," <td>1</td>\n"," <td>0</td>\n"," <td>1</td>\n"," <td>1</td>\n"," <td>0.0</td>\n"," <td>0</td>\n"," <td>BB</td>\n"," <td>GBR</td>\n"," <td>Direct</td>\n"," <td>Direct</td>\n"," <td>0</td>\n"," <td>0</td>\n"," <td>0</td>\n"," <td>A</td>\n"," <td>C</td>\n"," <td>0</td>\n"," <td>No Deposit</td>\n"," <td>NaN</td>\n"," <td>NaN</td>\n"," <td>0</td>\n"," <td>Transient</td>\n"," <td>75.0</td>\n"," <td>0</td>\n"," <td>0</td>\n"," <td>Check-Out</td>\n"," <td>2015-07-02</td>\n"," </tr>\n"," <tr>\n"," <th>3</th>\n"," <td>Resort Hotel</td>\n"," <td>0</td>\n"," <td>13</td>\n"," <td>2015</td>\n"," <td>July</td>\n"," <td>27</td>\n"," <td>1</td>\n"," <td>0</td>\n"," <td>1</td>\n"," <td>1</td>\n"," <td>0.0</td>\n"," <td>0</td>\n"," <td>BB</td>\n"," <td>GBR</td>\n"," <td>Corporate</td>\n"," <td>Corporate</td>\n"," <td>0</td>\n"," <td>0</td>\n"," <td>0</td>\n"," <td>A</td>\n"," <td>A</td>\n"," <td>0</td>\n"," <td>No Deposit</td>\n"," <td>304.0</td>\n"," <td>NaN</td>\n"," <td>0</td>\n"," <td>Transient</td>\n"," <td>75.0</td>\n"," <td>0</td>\n"," <td>0</td>\n"," <td>Check-Out</td>\n"," <td>2015-07-02</td>\n"," </tr>\n"," <tr>\n"," <th>4</th>\n"," <td>Resort Hotel</td>\n"," <td>0</td>\n"," <td>14</td>\n"," <td>2015</td>\n"," <td>July</td>\n"," <td>27</td>\n"," <td>1</td>\n"," <td>0</td>\n"," <td>2</td>\n"," <td>2</td>\n"," <td>0.0</td>\n"," <td>0</td>\n"," <td>BB</td>\n"," <td>GBR</td>\n"," <td>Online TA</td>\n"," <td>TA/TO</td>\n"," <td>0</td>\n"," <td>0</td>\n"," <td>0</td>\n"," <td>A</td>\n"," <td>A</td>\n"," <td>0</td>\n"," <td>No Deposit</td>\n"," <td>240.0</td>\n"," <td>NaN</td>\n"," <td>0</td>\n"," <td>Transient</td>\n"," <td>98.0</td>\n"," <td>0</td>\n"," <td>1</td>\n"," <td>Check-Out</td>\n"," <td>2015-07-03</td>\n"," </tr>\n"," </tbody>\n","</table>\n","</div>\n"," <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-4cb66c70-03c2-4cbe-99fc-ae24d3041504')\"\n"," title=\"Convert this dataframe to an interactive table.\"\n"," 
style=\"display:none;\">\n"," \n"," <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n"," width=\"24px\">\n"," <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n"," <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n"," </svg>\n"," </button>\n"," \n"," <style>\n"," .colab-df-container {\n"," display:flex;\n"," flex-wrap:wrap;\n"," gap: 12px;\n"," }\n","\n"," .colab-df-convert {\n"," background-color: #E8F0FE;\n"," border: none;\n"," border-radius: 50%;\n"," cursor: pointer;\n"," display: none;\n"," fill: #1967D2;\n"," height: 32px;\n"," padding: 0 0 0 0;\n"," width: 32px;\n"," }\n","\n"," .colab-df-convert:hover {\n"," background-color: #E2EBFA;\n"," box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n"," fill: #174EA6;\n"," }\n","\n"," [theme=dark] .colab-df-convert {\n"," background-color: #3B4455;\n"," fill: #D2E3FC;\n"," }\n","\n"," [theme=dark] .colab-df-convert:hover {\n"," background-color: #434B5C;\n"," box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n"," filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n"," fill: #FFFFFF;\n"," }\n"," </style>\n","\n"," <script>\n"," const buttonEl =\n"," document.querySelector('#df-4cb66c70-03c2-4cbe-99fc-ae24d3041504 button.colab-df-convert');\n"," buttonEl.style.display =\n"," google.colab.kernel.accessAllowed ? 
'block' : 'none';\n","\n"," async function convertToInteractive(key) {\n"," const element = document.querySelector('#df-4cb66c70-03c2-4cbe-99fc-ae24d3041504');\n"," const dataTable =\n"," await google.colab.kernel.invokeFunction('convertToInteractive',\n"," [key], {});\n"," if (!dataTable) return;\n","\n"," const docLinkHtml = 'Like what you see? Visit the ' +\n"," '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n"," + ' to learn more about interactive tables.';\n"," element.innerHTML = '';\n"," dataTable['output_type'] = 'display_data';\n"," await google.colab.output.renderOutput(dataTable, element);\n"," const docLink = document.createElement('div');\n"," docLink.innerHTML = docLinkHtml;\n"," element.appendChild(docLink);\n"," }\n"," </script>\n"," </div>\n"," </div>\n"," "],"text/plain":[" hotel is_canceled ... reservation_status reservation_status_date\n","0 Resort Hotel 0 ... Check-Out 2015-07-01\n","1 Resort Hotel 0 ... Check-Out 2015-07-01\n","2 Resort Hotel 0 ... Check-Out 2015-07-02\n","3 Resort Hotel 0 ... Check-Out 2015-07-02\n","4 Resort Hotel 0 ... 
Check-Out 2015-07-03\n","\n","[5 rows x 32 columns]"]},"metadata":{},"execution_count":4}]},{"cell_type":"code","source":["df.columns"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"9hBfXq8x5lhw","executionInfo":{"status":"ok","timestamp":1642092576789,"user_tz":-60,"elapsed":251,"user":{"displayName":"Pierre-Edouard PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}},"outputId":"17aee87c-3cf2-4be0-efa5-92049eec923e"},"execution_count":5,"outputs":[{"output_type":"execute_result","data":{"text/plain":["Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',\n"," 'arrival_date_month', 'arrival_date_week_number',\n"," 'arrival_date_day_of_month', 'stays_in_weekend_nights',\n"," 'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',\n"," 'country', 'market_segment', 'distribution_channel',\n"," 'is_repeated_guest', 'previous_cancellations',\n"," 'previous_bookings_not_canceled', 'reserved_room_type',\n"," 'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',\n"," 'company', 'days_in_waiting_list', 'customer_type', 'adr',\n"," 'required_car_parking_spaces', 'total_of_special_requests',\n"," 'reservation_status', 'reservation_status_date'],\n"," dtype='object')"]},"metadata":{},"execution_count":5}]},{"cell_type":"markdown","source":["|variable |class |description |\n","|:------------------------------|:---------|:-----------|\n","|hotel |character | Hotel (H1 = Resort Hotel or H2 = City Hotel) |\n","|is_canceled |double | Value indicating if the booking was canceled (1) or not (0) |\n","|lead_time |double | Number of days that elapsed between the entering date of the booking into the PMS and the arrival date |\n","|arrival_date_year |double | Year of arrival date|\n","|arrival_date_month |character | Month of arrival date|\n","|arrival_date_week_number |double | Week number of year for arrival date|\n","|arrival_date_day_of_month 
|double | Day of arrival date|\n","|stays_in_weekend_nights |double | Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel |\n","|stays_in_week_nights |double | Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel|\n","|adults |double | Number of adults|\n","|children |double | Number of children|\n","|babies |double |Number of babies |\n","|meal |character | Type of meal booked. Categories are presented in standard hospitality meal packages: <br> Undefined/SC no meal package;<br>BB Bed & Breakfast; <br> HB Half board (breakfast and one other meal usually dinner); <br> FB Full board (breakfast, lunch and dinner) |\n","|country |character | Country of origin. Categories are represented in the ISO 31553:2013 format |\n","|market_segment |character | Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators” |\n","|distribution_channel |character | Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators” |\n","|is_repeated_guest |double | Value indicating if the booking name was from a repeated guest (1) or not (0) |\n","|previous_cancellations |double | Number of previous bookings that were cancelled by the customer prior to the current booking |\n","|previous_bookings_not_canceled |double | Number of previous bookings not cancelled by the customer prior to the current booking |\n","|reserved_room_type |character | Code of room type reserved. Code is presented instead of designation for anonymity reasons |\n","|assigned_room_type |character | Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. 
Code is presented instead of designation for anonymity reasons |\n","|booking_changes |double | Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation|\n","|deposit_type |character | Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories:<br>No Deposit no deposit was made;<br>Non Refund a deposit was made in the value of the total stay cost;<br>Refundable a deposit was made with a value under the total cost of stay. |\n","|agent |character | ID of the travel agency that made the booking |\n","|company |character | ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons |\n","|days_in_waiting_list |double | Number of days the booking was in the waiting list before it was confirmed to the customer |\n","|customer_type |character | Type of booking, assuming one of four categories:<br>Contract - when the booking has an allotment or other type of contract associated to it;<br>Group when the booking is associated to a group;<br>Transient when the booking is not part of a group or contract, and is not associated to other transient booking;<br>Transient-party when the booking is transient, but is associated to at least other transient booking|\n","|adr |double | Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights |\n","|required_car_parking_spaces |double | Number of car parking spaces required by the customer |\n","|total_of_special_requests |double | Number of special requests made by the customer (e.g. 
twin bed or high floor)|\n","|reservation_status |character | Reservation last status, assuming one of three categories:<br>Canceled booking was canceled by the customer;<br>Check-Out customer has checked in but already departed;<br>No-Show customer did not check-in and did inform the hotel of the reason why |\n","|reservation_status_date |double | Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel|"],"metadata":{"id":"WZzNMxZ1GSHO"}},{"cell_type":"code","source":["df.dtypes"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"kf3nWixkTJrj","executionInfo":{"status":"ok","timestamp":1642092581709,"user_tz":-60,"elapsed":240,"user":{"displayName":"Pierre-Edouard PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}},"outputId":"18fec81b-2210-4440-bb72-4b7b488343b9"},"execution_count":6,"outputs":[{"output_type":"execute_result","data":{"text/plain":["hotel object\n","is_canceled int64\n","lead_time int64\n","arrival_date_year int64\n","arrival_date_month object\n","arrival_date_week_number int64\n","arrival_date_day_of_month int64\n","stays_in_weekend_nights int64\n","stays_in_week_nights int64\n","adults int64\n","children float64\n","babies int64\n","meal object\n","country object\n","market_segment object\n","distribution_channel object\n","is_repeated_guest int64\n","previous_cancellations int64\n","previous_bookings_not_canceled int64\n","reserved_room_type object\n","assigned_room_type object\n","booking_changes int64\n","deposit_type object\n","agent float64\n","company float64\n","days_in_waiting_list int64\n","customer_type object\n","adr float64\n","required_car_parking_spaces int64\n","total_of_special_requests int64\n","reservation_status object\n","reservation_status_date object\n","dtype: 
object"]},"metadata":{},"execution_count":6}]},{"cell_type":"code","source":["df.shape"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"bIeaffTpTNP-","executionInfo":{"status":"ok","timestamp":1642092584305,"user_tz":-60,"elapsed":210,"user":{"displayName":"Pierre-Edouard PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}},"outputId":"9e78dcf4-04fc-439e-d77f-7efa2d649d3a"},"execution_count":7,"outputs":[{"output_type":"execute_result","data":{"text/plain":["(119390, 32)"]},"metadata":{},"execution_count":7}]},{"cell_type":"markdown","source":["## Adding computed variables\n","\n","Introduce new variables that are transformations of existing ones.\n","\n","* `guests` is the sum of `adults`, `children` and `babies`.\n","* `different_room_assigned` must indicate whether `reserved_room_type` and `assigned_room_type` differ.\n","\n","Drop the variables `adults`, `children`, `babies`, `reserved_room_type` and `assigned_room_type`.\n"],"metadata":{"id":"XTG8RrXYGpSc"}},{"cell_type":"code","source":[""],"metadata":{"id":"W-oSLrHOITK9","executionInfo":{"status":"ok","timestamp":1642092602988,"user_tz":-60,"elapsed":300,"user":{"displayName":"Pierre-Edouard PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}}},"execution_count":7,"outputs":[]},{"cell_type":"markdown","source":["## Handling missing values and selecting variables\n","\n","Count the missing values for each variable."],"metadata":{"id":"2y5fmidJRO0N"}},{"cell_type":"code","source":[""],"metadata":{"id":"6dVGewivRbQV","executionInfo":{"status":"ok","timestamp":1642092612586,"user_tz":-60,"elapsed":204,"user":{"displayName":"Pierre-Edouard 
PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}}},"execution_count":7,"outputs":[]},{"cell_type":"markdown","source":["How many distinct values do the variables ``agent`` and ``company`` take? How are these values distributed by frequency?"],"metadata":{"id":"J8VWyB3KR-fP"}},{"cell_type":"code","source":[""],"metadata":{"id":"tjpR7SdbMJiG","executionInfo":{"status":"ok","timestamp":1642092621700,"user_tz":-60,"elapsed":228,"user":{"displayName":"Pierre-Edouard PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}}},"execution_count":7,"outputs":[]},{"cell_type":"markdown","source":["How are the distinct values of the variable ``country`` distributed?"],"metadata":{"id":"DNJ3ohgETZFh"}},{"cell_type":"code","source":[""],"metadata":{"id":"Wq-ep1H0TgMv","executionInfo":{"status":"ok","timestamp":1642092644441,"user_tz":-60,"elapsed":210,"user":{"displayName":"Pierre-Edouard PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}}},"execution_count":7,"outputs":[]},{"cell_type":"markdown","source":["From an analysis of the business process, we learn that the guests' country of origin defaults to Portugal. 
It may be updated at check-in.\n","\n","Given this information, what can we do with the variable ``country``?"],"metadata":{"id":"Zfb85sz_Tq4y"}},{"cell_type":"code","source":[""],"metadata":{"id":"TyP_cZcOUomS","executionInfo":{"status":"ok","timestamp":1642092661043,"user_tz":-60,"elapsed":199,"user":{"displayName":"Pierre-Edouard PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}}},"execution_count":7,"outputs":[]},{"cell_type":"markdown","source":["Replace the missing values of ``guests`` with the most frequent value of this variable."],"metadata":{"id":"vCBuWrK7VOSj"}},{"cell_type":"code","source":[""],"metadata":{"id":"55zIyqBqVDUB"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Examine how the categories of the variable ``distribution_channel`` are distributed across the categories of the variable ``market_segment``.\n","\n","For example, we want to find that the ``Corporate`` category of the variable ``distribution_channel`` falls 72% in the ``Corporate`` category, 18% in the ``Groups`` category... 
of the variable ``market_segment``."],"metadata":{"id":"E2GBCIBCWDxI"}},{"cell_type":"code","source":[""],"metadata":{"id":"AVpFmsUgPYnm","executionInfo":{"status":"ok","timestamp":1642092684141,"user_tz":-60,"elapsed":388,"user":{"displayName":"Pierre-Edouard PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}}},"execution_count":7,"outputs":[]},{"cell_type":"markdown","source":["Given this strong overlap, we choose to drop the variable ``distribution_channel``."],"metadata":{"id":"GgLeWl-AXxHZ"}},{"cell_type":"code","source":[""],"metadata":{"id":"Sh5PfRvEX576","executionInfo":{"status":"ok","timestamp":1642092704285,"user_tz":-60,"elapsed":204,"user":{"displayName":"Pierre-Edouard PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}}},"execution_count":7,"outputs":[]},{"cell_type":"markdown","source":["For this example, we also suggest dropping the variables ``reservation_status`` and ``reservation_status_date``.\n","\n"],"metadata":{"id":"gk6z-TfTYK-9"}},{"cell_type":"code","source":[""],"metadata":{"id":"7_jics2lvW9V","executionInfo":{"status":"ok","timestamp":1642092714774,"user_tz":-60,"elapsed":223,"user":{"displayName":"Pierre-Edouard PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}}},"execution_count":7,"outputs":[]},{"cell_type":"markdown","source":["In addition, at the time the dataset was collected, bookings whose check-in date lies in the future have a status that is either unknown or canceled. Bookings whose status is still unknown could eventually be canceled. 
Thus, in order not to introduce an artificial bias when building a predictive algorithm, it was decided, for bookings whose check-in date is in the future, to keep only those whose status is already \"canceled\". What impact can this decision have on the time variables: ``arrival_date_year``, ``arrival_date_month``, ``arrival_date_week_number`` and ``arrival_date_day_of_month``? "],"metadata":{"id":"4LAR1LJFvRUE"}},{"cell_type":"code","source":[""],"metadata":{"id":"r6OghP1JYyqj","executionInfo":{"status":"ok","timestamp":1642092728895,"user_tz":-60,"elapsed":271,"user":{"displayName":"Pierre-Edouard PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}}},"execution_count":7,"outputs":[]},{"cell_type":"markdown","source":["# Visualizations for exploring the dataset\n","\n","Plot a histogram of the variable ``lead_time``."],"metadata":{"id":"5YpA8n3YDsD8"}},{"cell_type":"code","source":[""],"metadata":{"id":"g2kPTwDC6hfu","executionInfo":{"status":"ok","timestamp":1642092756382,"user_tz":-60,"elapsed":211,"user":{"displayName":"Pierre-Edouard PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}}},"execution_count":7,"outputs":[]},{"cell_type":"markdown","source":["Discretize the variable ``lead_time`` into the intervals (0,100], (100,200], (200,300], (300,400], (400,500] and (500,800], then draw a bar chart with these new categories on the x-axis and, on the y-axis, the proportion of cancellations for each of the two hotels."],"metadata":{"id":"TP15CzwkEN0g"}},{"cell_type":"code","source":[""],"metadata":{"id":"GRZJ-FYY8y9b","executionInfo":{"status":"ok","timestamp":1642092770556,"user_tz":-60,"elapsed":212,"user":{"displayName":"Pierre-Edouard 
PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}}},"execution_count":7,"outputs":[]},{"cell_type":"code","source":["df = df.drop(['cat_lead_time'], axis=1)"],"metadata":{"id":"CqBWVtq6wEk2"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["# Building a predictive model\n","\n","Build a predictive model based on the random forest algorithm.\n","\n","## Encoding categorical variables\n","\n","Convert the columns that hold categories represented as strings into one-hot encoded variables."],"metadata":{"id":"rnwGd9RT0jUz"}},{"cell_type":"code","source":[""],"metadata":{"id":"mEDJN79oAPC0","executionInfo":{"status":"ok","timestamp":1642092789760,"user_tz":-60,"elapsed":209,"user":{"displayName":"Pierre-Edouard PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}}},"execution_count":7,"outputs":[]},{"cell_type":"markdown","source":["## Splitting the dataset\n","\n","Split the dataset into a dataframe X that does not contain the target variable (``is_canceled``) and a series y made up of the target variable."],"metadata":{"id":"rV0ZSl-lAPm-"}},{"cell_type":"code","source":[""],"metadata":{"id":"68zEJB-41Mbh","executionInfo":{"status":"ok","timestamp":1642092801371,"user_tz":-60,"elapsed":201,"user":{"displayName":"Pierre-Edouard PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}}},"execution_count":7,"outputs":[]},{"cell_type":"markdown","source":["Then split the dataset into a training part (70%) and a test part (30%). 
Why is this necessary?"],"metadata":{"id":"OXhCmEjl1mW9"}},{"cell_type":"code","source":[""],"metadata":{"id":"PEm27mLJ1xir","executionInfo":{"status":"ok","timestamp":1642092808423,"user_tz":-60,"elapsed":225,"user":{"displayName":"Pierre-Edouard PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}}},"execution_count":7,"outputs":[]},{"cell_type":"markdown","source":["## Training the model\n","\n","Build a random forest model, keeping the default values of all hyperparameters. What is a hyperparameter?"],"metadata":{"id":"_XYeFJgX3vXm"}},{"cell_type":"code","source":[""],"metadata":{"id":"EC6jrDXN4KZs","executionInfo":{"status":"ok","timestamp":1642092828986,"user_tz":-60,"elapsed":210,"user":{"displayName":"Pierre-Edouard PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}}},"execution_count":7,"outputs":[]},{"cell_type":"markdown","source":["## Evaluating the model\n","\n","What is the confusion matrix of this model on the test set?"],"metadata":{"id":"_mfKJRIkCcgr"}},{"cell_type":"code","source":[""],"metadata":{"id":"2bjn2auCB4vC","executionInfo":{"status":"ok","timestamp":1642092837530,"user_tz":-60,"elapsed":209,"user":{"displayName":"Pierre-Edouard PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}}},"execution_count":7,"outputs":[]},{"cell_type":"markdown","source":["What are the precision and recall of this model on the test set?"],"metadata":{"id":"GCRi-7xGDGOz"}},{"cell_type":"code","source":[""],"metadata":{"id":"Ej8LIktcFJxM","executionInfo":{"status":"ok","timestamp":1642092848312,"user_tz":-60,"elapsed":231,"user":{"displayName":"Pierre-Edouard 
PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}}},"execution_count":7,"outputs":[]},{"cell_type":"markdown","source":["Plot the precision/recall curve for this predictive model."],"metadata":{"id":"KjdP_WB8H080"}},{"cell_type":"code","source":[""],"metadata":{"id":"tEls3t4EG9_R","executionInfo":{"status":"ok","timestamp":1642092862560,"user_tz":-60,"elapsed":201,"user":{"displayName":"Pierre-Edouard PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}}},"execution_count":7,"outputs":[]},{"cell_type":"markdown","source":["## Hyperparameter optimization\n","\n","Use grid search with k-fold cross-validation to test a few candidate hyperparameter values for the random forest model. Keep the best values and retrain the model on the whole training set. Check its performance on the test set."],"metadata":{"id":"mGelfj9nIMCB"}},{"cell_type":"code","source":[""],"metadata":{"id":"BfOD-WPHJqni","executionInfo":{"status":"ok","timestamp":1642092888862,"user_tz":-60,"elapsed":268,"user":{"displayName":"Pierre-Edouard PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}}},"execution_count":7,"outputs":[]},{"cell_type":"markdown","source":["## Importance of the predictive variables\n","\n","Analyze the relative importance of the different predictive variables according to the random forest model."],"metadata":{"id":"Gr9gNxBPN1Ue"}},{"cell_type":"code","source":[""],"metadata":{"id":"J2wcqUYMOIkW","executionInfo":{"status":"ok","timestamp":1642092911783,"user_tz":-60,"elapsed":240,"user":{"displayName":"Pierre-Edouard 
PORTIER","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjrWdngYDIDAFn2GRDYdLSUlCwKObK25BfBfTXMLw=s64","userId":"05025412540823229047"}}},"execution_count":7,"outputs":[]}]}