A delayed flight, with its last-minute schedule changes and unneeded hours in an airport, is every flier's nightmare. To spare fliers this inconvenience, our team developed a model that predicts whether a flight will be delayed. Both airports and fliers can use this model to adjust their schedules accordingly.
During our first week of development, our team learned more about the problem we decided to face through exploratory data analysis (EDA).
During our second week of development, we tackled the major problem at hand: predicting whether there would actually be a delay.
During our last week of development, our team created the very website you're browsing through!
Before predicting any delay behaviors, we needed to establish an understanding of the data we were working with: what correlations could we find between the airline flown, or the length of the flight, and the flight being delayed? What do these correlations (or lack thereof) mean?
We established these connections through EDA. As the name suggests, we explored our data through plots and visualizations in order to gain a deeper understanding of the meaning behind the numbers on a spreadsheet. We used the Airline Delay Prediction Dataset on Kaggle.
The plots we created compared our independent variables (flight length, departure time, airline, etc.) to our target variable, whether the flight was delayed, in order to find the most relevant trends. The specifics for each plot are listed below, but the general trends are as follows:
Most features have little to no correlation to the occurrence of a delay
The average arrival time for on-time flights is about 100 minutes earlier than that of delayed flights.
Wednesday is the busiest day of the week, followed closely by Tuesday and Thursday
The majority of airports (both origin and destination) appear in very few rows, meaning we would benefit from grouping those airports into an “Other” category
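As a sketch of the kind of aggregation behind these plots: the column names below follow the Kaggle dataset, but the handful of rows are made up purely for illustration.

```python
import pandas as pd

# Toy rows standing in for the Kaggle Airline Delay Prediction data;
# the column names match the dataset, the values are illustrative only.
df = pd.DataFrame({
    "Airline":   ["DL", "DL", "AA", "AA", "UA", "UA"],
    "DayOfWeek": [3, 3, 2, 4, 3, 1],  # 1 = Monday
    "Length":    [120, 300, 90, 250, 60, 180],
    "Delay":     [0, 1, 0, 1, 0, 1],
})

# Delay rate per airline -- the aggregate behind a bar chart of delays
delay_by_airline = df.groupby("Airline")["Delay"].mean()

# Flight counts per day of week -- the aggregate behind the busiest-day plot
flights_by_day = df["DayOfWeek"].value_counts().sort_index()

print(delay_by_airline)
print(flights_by_day)
```

Each of our plots was built on a table like one of these, then rendered as a bar chart or histogram.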
After finding trends in our data through EDA, we knew how we needed to clean our data. A few examples of how we transformed our data included:
Dropping the rows where the flight duration was zero (was not recorded)
Dropping unnecessary columns, in this case the id column
Encoding categorical variables (such as the different airport names) as numbers
Cutting our large dataset of ~500,000 rows down to a random sample of 5,000 for our data visualization, and 20,000 for training and testing the model
Cutting down the number of categories in AirportTo and AirportFrom by grouping any airport with three or fewer samples into an “Other” category
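The cleaning steps above can be sketched in pandas on a few made-up rows. (In the toy data below the “Other” threshold is one sample rather than three, so the grouping is visible; the real dataset used a threshold of three.)

```python
import pandas as pd

# A few made-up rows with the same columns as the Kaggle dataset
df = pd.DataFrame({
    "id":          [1, 2, 3, 4, 5, 6],
    "Airline":     ["DL", "AA", "DL", "UA", "AA", "DL"],
    "AirportFrom": ["ATL", "JFK", "ATL", "ORD", "XNA", "ATL"],
    "AirportTo":   ["JFK", "ATL", "ORD", "ATL", "ATL", "JFK"],
    "Length":      [120, 0, 95, 110, 60, 130],
    "Delay":       [0, 1, 0, 1, 1, 0],
})

# 1. Drop rows where the flight duration was never recorded (Length == 0)
df = df[df["Length"] > 0]

# 2. Drop the unnecessary id column
df = df.drop(columns=["id"])

# 3. Group rare airports into "Other" (threshold of 1 for this toy data)
for col in ["AirportFrom", "AirportTo"]:
    counts = df[col].value_counts()
    rare = counts[counts <= 1].index
    df.loc[df[col].isin(rare), col] = "Other"

# 4. Encode categorical variables as integer codes
for col in ["Airline", "AirportFrom", "AirportTo"]:
    df[col] = df[col].astype("category").cat.codes

# 5. Down-sample (trivial here; in reality ~500,000 rows -> 20,000)
df = df.sample(n=4, random_state=0)
```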
These changes were made for the benefit of the model: models can’t work with raw text categories, but can work with numbers just fine. Similarly, our computers couldn’t process such a large amount of data quickly, so cutting down the sample size immensely helped our processing speed without sacrificing too much information.
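To show where the cleaned, numeric data goes next, here is a sketch of the train/test workflow on synthetic stand-in data. The LogisticRegression classifier here is a placeholder for illustration only, not necessarily the model we ultimately chose.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic numeric features standing in for the cleaned 20,000-row sample:
# one column loosely predictive of delay, one pure noise.
rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)
noise = rng.normal(size=n)
X = np.column_stack([signal, noise])
y = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)  # 1 = delayed

# Hold out 20% of the rows for testing, as in a standard workflow
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```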