D206 - Data Cleaning
Data Cleaning continues building proficiency in the data analytics life cycle with data preparation skills. This course addresses exploring, transforming, and imputing data as well as handling outliers. Learners write code to manipulate, structure, and clean data as well as to reduce features in data sets. The following courses are prerequisites: The Data Analytics Journey, and Data Acquisition.
Course Analysis
When I started this course, I neglected the course materials for most of the class, assuming it would focus solely on data cleaning. That turned out to be a mistake: the project also covered principal component analysis (PCA), a topic I was less acquainted with. I also discovered that the class project guide, which I call the “secret” rubric, was much more instructive than the official one, which was too vague and sometimes misleading.
In contrast to previous courses where specific data issues were outlined for correction, this class left the identification of such problems to the students. In the churn dataset I worked with, I didn’t find any outliers that needed removal. Even so, the absence of outliers should be confirmed and documented. The approach to handling missing values was left to my discretion, provided I could justify my methods. My strategy might have been unconventional, but it was effective enough to pass the performance assessment. The primary task was to correct columns with incorrect data types to ensure data consistency.
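As a rough illustration of what "finding the problems yourself" can look like, here is a minimal sketch in Python with pandas. The column names and values are made up for the example (they are not taken from the churn dataset), and the 1.5 × IQR rule is just one common convention for flagging outliers, not a method the course prescribes.

```python
import pandas as pd

# Hypothetical sample data; the column names and values are illustrative only.
df = pd.DataFrame({
    "Income": [42000.0, 51000.0, None, 48000.0, 45000.0, 250000.0],
    "Children": [1.0, None, 2.0, 0.0, 3.0, 1.0],
})

# Count missing values per column so the gaps can be documented.
missing_counts = df.isna().sum()


def iqr_outliers(series):
    """Flag values more than 1.5 * IQR outside the quartiles (one common rule)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)


# Boolean mask of flagged rows; missing values compare as False and are not flagged.
income_flags = iqr_outliers(df["Income"])

print(missing_counts)
print(df[income_flags])
```

Whether a flagged value is actually removed is a separate judgment call. The point of the sketch is only that the check itself leaves evidence that can be cited in the write-up.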
The data cleaning process was straightforward; however, the performance assessment demanded a meticulous breakdown of each step. For instance, it wasn’t enough to simply correct integer-formatted zip codes; I had to write code to pinpoint the problem, describe the solution, implement the fix, and then confirm the correction. This detailed segmentation of the workflow, which typically would be more iterative, felt unnatural and time-consuming, as it required extensive documentation for each fix rather than focusing on the practical aspect of resolving the issues.
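To make that detect, fix, confirm pattern concrete, here is a small sketch in Python with pandas. It assumes a column named `Zip` that was read in as integers and so lost its leading zeros. The column name and sample values are hypothetical, and this is not the code from my submission.

```python
import pandas as pd

# Hypothetical sample: zip codes read as integers lose their leading zeros.
df = pd.DataFrame({"Zip": [2134, 99501, 601]})

# 1. Detect: the column is an integer type and some values have fewer than 5 digits.
is_numeric = pd.api.types.is_integer_dtype(df["Zip"])
short_before = int((df["Zip"].astype(str).str.len() < 5).sum())

# 2. Fix: convert to strings and left-pad with zeros to five characters.
df["Zip"] = df["Zip"].astype(str).str.zfill(5)

# 3. Confirm: no value is shorter than five characters any more.
short_after = int((df["Zip"].str.len() < 5).sum())

print(is_numeric, short_before, short_after)
print(df["Zip"].tolist())
```

Each numbered comment corresponds to one piece of the write-up: the evidence of the problem, the treatment, and the verification.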
The most demanding part of this performance assessment was undoubtedly the Principal Component Analysis (PCA) section at the end. PCA isn’t directly related to Data Cleaning, but it seems the course designers included it to broaden our analytical skills. I was unfamiliar with PCA, so there was a learning curve for me. The course does touch upon PCA in Lesson 7, which is worth reviewing. Dr. Middleton’s last lecture was particularly enlightening, offering a clear explanation of PCA and how to code it within an hour (I skipped the Q&A session at the end). Additionally, Matt Brems’ article on Towards Data Science provided a thorough breakdown of PCA. In hindsight, the process was quite straightforward, especially with the code from Dr. Middleton’s lecture, but the new concept initially slowed my progress.
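For readers who are as new to PCA as I was, here is a minimal sketch of the core steps using only NumPy. To be clear, this is not the code from Dr. Middleton’s lecture. The data is randomly generated for illustration, and the sketch assumes the usual recipe of standardizing the features and then taking the eigendecomposition of their covariance matrix.

```python
import numpy as np

# Hypothetical data: three numeric features, two of them strongly correlated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# Standardize each feature to mean 0 and (sample) standard deviation 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Covariance matrix of the standardized features (equal to the correlation matrix).
cov = np.cov(Z, rowvar=False)

# Eigendecomposition; eigh is intended for symmetric matrices like this one.
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort components from largest to smallest eigenvalue.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Share of total variance explained by each principal component.
explained = eigvals / eigvals.sum()

# Project the standardized data onto the principal components.
scores = Z @ eigvecs

print(explained)
```

Because two of the three features are nearly copies of each other, most of the variance ends up in the first two components. That is the "feature reduction" idea the course description mentions.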
Final Thoughts
The course was fairly decent overall. Dr. Middleton’s videos stood out as a highlight; they were excellent. However, the course falls short when it comes to the datasets provided. Having engaged with the datasets only twice, I already feel that reusing them instead of offering a variety of datasets across courses is a missed opportunity for students. My concern is that by the end of the degree, instead of showcasing a diverse portfolio of projects, we might end up with what feels like one extensive project that, while covering the full analytics lifecycle, could become repetitive and stale.