D208 - Predictive Modeling
Predictive Modeling builds on initial data preparation, cleaning, and analysis, enabling students to make assertions vital to organizational needs. In this course, students conduct logistic regression and multiple regression to model the phenomena revealed by data. The course covers normality, homoscedasticity, and significance, preparing students to communicate findings and the limitations of those findings accurately to organizational leaders.
Course Analysis
Compared to previous courses, this course was a huge step up in difficulty and possibly took me the longest to pass. The most valuable resources for me were Dr. Middleton’s two webinars. Unfortunately, I didn’t find Dr. Sewell’s lectures to be particularly useful, except for the slide on calculating the Variance Inflation Factor to check for multicollinearity. Additionally, Mark Keith’s succinct videos were instrumental in demonstrating the code for multiple linear regression and standardization. For Task 2, I revisited the webinars and also benefited from Susan Li’s thorough logistic regression tutorial and Proteus’s guidance on calculating odds ratios.
Since both tasks required the use of datasets familiar from previous classes, I was able to repurpose my earlier data cleaning and exploratory analysis code. I limited myself to about 8 explanatory variables for my models. Creating bivariate visualizations was somewhat tedious, but my notes from a Data Visualization class at Udacity proved to be a lifesaver. For Task 2’s categorical y variable, I opted for mosaic plots to visualize categorical data relationships.
Task 1, involving multiple linear regression, went smoothly thanks to Mark Keith’s video. I refined my model by eliminating variables based on their Variance Inflation Factor and p-values. Analyzing the final model was straightforward, although I determined that despite its statistical significance, it lacked practical significance. Crafting residual plots was the only minor challenge, as they didn’t offer much insight.
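For readers unfamiliar with the Variance Inflation Factor check mentioned above, here is a minimal sketch. It is not the project code. It assumes only NumPy, uses synthetic data, and the function name `vif` is made up for illustration. Each predictor is regressed on the others, and its VIF is 1 / (1 - R²) from that regression. The p-value step is left out to keep the example short.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    result = []
    for j in range(k):
        y = X[:, j]                                  # predictor under test
        others = np.delete(X, j, axis=1)             # all remaining predictors
        A = np.column_stack([np.ones(n), others])    # add an intercept column
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        result.append(float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return result

# Synthetic data (an assumption for illustration): c is nearly a copy of a.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)

vifs = vif(np.column_stack([a, b, c]))
for name, value in zip(["a", "b", "c"], vifs):
    print(f"{name}: VIF = {value:.2f}")
```

A common rule of thumb treats a VIF above roughly 5 or 10 as a sign of multicollinearity, though the cutoff is a judgment call. In practice, statsmodels provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence`, which may be more convenient than rolling your own.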
Task 2 proved more challenging. While Susan Li’s tutorial was informative, it was more detailed than necessary for the project, which initially led me astray. A DataCamp unit from a subsequent class, D209, would have been more appropriate. After constructing my initial model with around 12 variables, I pared it down by assessing VIF and p-values. However, when it came to interpreting the confusion matrix, I hit a snag.
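Since odds ratios and the confusion matrix were the sticking points here, a small sketch of both may help. It is plain Python with made-up coefficients and labels (the names `x1`, `x2`, `odds_ratios`, `confusion_counts`, and `summarize` are hypothetical), not output from the actual project model. An odds ratio is the exponential of a logistic regression coefficient, and the confusion matrix is just four counts from which the usual metrics follow.

```python
import math

def odds_ratios(coefs):
    """Convert logistic regression coefficients (log-odds) to odds ratios."""
    return {name: math.exp(b) for name, b in coefs.items()}

def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels coded 0/1."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def summarize(tn, fp, fn, tp):
    """Derive common metrics; a metric is None if its denominator is zero."""
    def ratio(num, den):
        return num / den if den else None
    return {
        "accuracy": ratio(tp + tn, tn + fp + fn + tp),
        "precision": ratio(tp, tp + fp),    # of predicted positives, how many were right
        "recall": ratio(tp, tp + fn),       # of actual positives, how many were found
        "specificity": ratio(tn, tn + fp),  # of actual negatives, how many were found
    }

# Hypothetical coefficients: x1 has no effect, x2 doubles the odds per unit.
ratios = odds_ratios({"x1": 0.0, "x2": math.log(2)})

# Hypothetical true and predicted labels.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
tn, fp, fn, tp = confusion_counts(y_true, y_pred)
metrics = summarize(tn, fp, fn, tp)
print(ratios)
print(metrics)
```

An odds ratio above 1 means the odds of the positive outcome rise as the variable increases, and below 1 means they fall. If you use scikit-learn's `confusion_matrix` with labels 0 and 1, the result is laid out as `[[tn, fp], [fn, tp]]`, which is the same ordering used here.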
Final Thoughts
It seems this is the most extensive and challenging course of the program, encompassing two tasks with an enormous scope. By ‘enormous,’ I’m referring to the level of tedium involved. A significant portion of this work involved creating histograms and bivariate graphs, similar to the previous class, and providing statistical measures such as medians, means, and modes where relevant. If there were one word to describe this course, it would be Tedious. Seriously Tedious. However, I appreciated the difficulty of the course itself and hope the remaining courses are just as challenging.