Over the past two years, artificial intelligence has leaped from the confines of research and development labs, where data scientists crafted powerful yet often unheralded solutions, to the forefront of every product conversation. To excel at building intelligent products, we must critically evaluate the methodologies we use to get there.
One of the traditional methodologies for data analytics is CRISP-DM (the Cross-Industry Standard Process for Data Mining), which has been in wide use for nearly three decades. Combine it with Scrum, a popular agile framework, and you have a blend that can handle the iterative, exploratory nature of data science.
But how do these methodologies blend effectively? What are the steps involved in such a hybrid approach? This article answers those questions with a step-by-step process based on the CRISP-DM methodology.
1. Business Comprehension
The initial phase of blending CRISP-DM and Scrum starts with business comprehension, emphasizing a clear grasp of the project’s goals and requirements from a business standpoint. It’s crucial to align the efforts of data scientists with organizational objectives to lay solid groundwork for the project. Understanding the business context ensures that the project stays relevant and adds value to the organization.
In traditional product development, the business goals are often clearly defined from the get-go. However, when incorporating data science, there’s an added layer of complexity. Data scientists need to understand not just what the business wants to achieve but also the potential limitations and capabilities of the data at hand. This deep comprehension forms the foundation upon which the entire project is built.
Every data science project must start with a series of conversations with stakeholders to clearly define objectives, expected outcomes, and key performance indicators (KPIs). The role of Scrum here is to facilitate these discussions through sprint planning meetings, where roles are defined and initial tasks are allocated. This collaborative effort ensures that the entire team is on the same page before delving into the complexities of data analysis.
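To make these agreements tangible, some teams capture them as a lightweight artifact that the whole team revisits at every sprint review. The sketch below shows one hypothetical way to do that in Python; the objective, KPI names, and targets are purely illustrative, not taken from any specific project.

```python
# A minimal "project charter" the team can keep under version control and
# revisit at each sprint review. Every name and target here is illustrative.
project_charter = {
    "objective": "Reduce monthly customer churn",
    "expected_outcome": "A churn-risk score for every active customer",
    "kpis": {
        # Higher is better for both of these illustrative targets.
        "churn_rate_reduction_pct": 10.0,
        "recall_on_high_risk_cohort": 0.70,
    },
    "stakeholders": ["Product", "Customer Success", "Data Science"],
    "review_cadence": "end of each sprint",
}

def kpis_met(observed: dict) -> bool:
    """Return True when every observed KPI meets or exceeds its target."""
    targets = project_charter["kpis"]
    return all(observed.get(name, 0.0) >= target for name, target in targets.items())

print(kpis_met({"churn_rate_reduction_pct": 12.0, "recall_on_high_risk_cohort": 0.68}))  # False
```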
2. Data Familiarization
The focus then shifts to gathering data and acquainting the team with its intricacies. This phase involves exploratory data analysis (EDA) to uncover initial insights, assess data quality, and identify underlying patterns or anomalies. Data familiarization is crucial for understanding what information is available and how it can be utilized to meet business objectives.
The exploratory nature of data science can sometimes clash with the structured approach of Scrum. However, in this phase, both can work harmoniously. Leveraging Scrum’s iterative cycles, data scientists can conduct EDA in short sprints, presenting their findings at the end of each sprint cycle. This iterative approach ensures continuous feedback and allows for pivoting based on initial insights, which can significantly improve the direction of subsequent data exploration.
During this phase, it’s common to encounter a range of data quality issues like missing values, outliers, and inconsistencies. Addressing these issues early on is crucial, as they can greatly impact the accuracy and reliability of the models developed later. Scrum’s daily stand-ups and sprint reviews can be instrumental in keeping the team aligned and focused on resolving these data issues promptly and efficiently.
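As a sketch of what a first EDA sprint might produce, the snippet below profiles a dataset with pandas: shape, types, summary statistics, missing values, duplicates, and a simple IQR-based outlier check. The file name and the "amount" column are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical raw extract; in practice this is whatever data the team receives.
df = pd.read_csv("customer_transactions.csv")

# Structure and types
print(df.shape)
print(df.dtypes)

# Summary statistics for numeric columns
print(df.describe())

# Data-quality checks worth surfacing at the sprint review
print(df.isna().sum().sort_values(ascending=False))  # missing values per column
print(df.duplicated().sum())                          # duplicate rows

# Simple IQR-based outlier flag for an illustrative numeric column
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)]
print(f"Potential outliers in 'amount': {len(outliers)}")
```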
3. Data Preparation
Data preparation is likely the most time-consuming step, involving the cleaning and transforming of raw data into a suitable format for modeling. Here, we address issues like missing values, outliers, and data normalization, which are critical for the success of subsequent modeling efforts. The goal is to ensure that the data is ready for the complex algorithms and models that will be developed in later stages.
Often consuming up to 80% of total project time, this phase includes data cleaning, transformation, integration, and reduction. Each of these steps is crucial for improving the quality of the data, making it suitable for analysis, and ensuring reliable results.
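A minimal sketch of how these steps can be made repeatable from sprint to sprint, assuming a scikit-learn workflow and hypothetical column names:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; adjust to the actual dataset.
numeric_cols = ["age", "tenure_months", "monthly_spend"]
categorical_cols = ["plan_type", "region"]

# Cleaning and transformation captured as a reusable, testable pipeline.
numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # handle missing values
    ("scale", StandardScaler()),                   # normalize numeric features
])
categorical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("numeric", numeric_pipeline, numeric_cols),
    ("categorical", categorical_pipeline, categorical_cols),
])
# preprocess.fit_transform(df) then yields model-ready features.
```

Encapsulating the work this way means each sprint can refine a single step (say, the imputation strategy) without destabilizing the rest of the preparation logic.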
When integrated with Scrum, this phase can be broken down into manageable chunks, with each sprint focusing on specific tasks within data preparation. Sprint retrospectives are essential in this phase, as they provide an opportunity for the team to reflect on what went well and what needs improvement. This continuous loop of feedback and adjustment ensures that the data preparation phase is as efficient and effective as possible.
4. Modeling
In the modeling phase, the team selects and applies various modeling techniques using the prepared data. This experimental phase may involve trying multiple algorithms, tuning parameters, and iteratively refining models to improve performance. This is where the core of data science lies, as it involves creating the models that will ultimately drive business value.
Modeling is inherently iterative and experimental. Data scientists need the freedom to explore different approaches, test various hypotheses, and refine their models continuously. This can sometimes be at odds with Scrum’s structured approach, but it can be harmonized by framing each experiment or model iteration as a separate sprint.
During each sprint, the team can focus on a specific aspect of the modeling process, such as selecting the right algorithm, tuning hyperparameters, or testing model performance. This structured approach allows for continuous feedback and improvement, ensuring that the final model is robust and reliable. Sprint reviews and retrospectives are particularly valuable here, providing an opportunity for the team to share insights, learn from mistakes, and refine their approach.
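For instance, one sprint might compare candidate algorithms while the next tunes the most promising one. The sketch below illustrates that split with scikit-learn; the synthetic dataset simply stands in for the prepared data from the previous phase.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic data standing in for the prepared dataset.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)

# Sprint A (illustrative): compare candidate algorithms with cross-validation.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42),
}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")

# Sprint B (illustrative): tune hyperparameters of the most promising candidate.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [5, 10, None]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X, y)
print(search.best_params_, f"best AUC = {search.best_score_:.3f}")
```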
5. Assessment
Before deployment, the models must be rigorously evaluated to ensure they meet the business objectives established in the first phase. Here, the team validates model performance, assesses whether all critical business issues have been sufficiently addressed, and decides on next steps. This thorough assessment is crucial for ensuring that the model delivers real, actionable insights.
Model assessment involves several evaluation metrics and techniques to ensure that the model performs well on unseen data. This phase also includes stress-testing the model to understand its limitations and ensure it can handle edge cases effectively.
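A minimal sketch of such an evaluation, assuming a binary classification model and a held-out test split that plays the role of unseen data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared dataset.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Core metrics to report back to stakeholders at the sprint review.
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print("AUC:", round(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]), 3))
```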
Scrum’s iterative approach is particularly beneficial in this phase. Through short, focused sprints, the team can conduct various tests and evaluations, ensuring that the model is robust and reliable. Each sprint can focus on a specific evaluation metric or test, providing a structured approach to model assessment. Continuous feedback from stakeholders during sprint reviews ensures that the model aligns with business objectives and delivers the desired outcomes.
6. Implementation
Finally, the model is deployed into a real-world environment. The deployment could mean integrating it into a software application, using it to inform decision-making, or presenting findings to stakeholders. Implementation is not the end but part of a continuous cycle that may loop back to earlier phases as new data or objectives emerge.
Deployment involves several steps, including integrating the model into existing systems, monitoring its performance in the real world, and providing continuous updates and improvements. This phase also involves training end-users and stakeholders to understand and utilize the model effectively.
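As one possible sketch of the integration step, the trained model can be serialized and exposed behind a small web endpoint. The framework choice (FastAPI), file name, and input fields below are assumptions for illustration, not a prescription.

```python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

# Assumes the trained pipeline from the modeling phase was saved earlier with:
#   joblib.dump(model, "churn_model.joblib")
model = joblib.load("churn_model.joblib")

app = FastAPI()

class Features(BaseModel):
    # Hypothetical input schema; mirror the columns the model was trained on.
    age: float
    tenure_months: float
    monthly_spend: float

@app.post("/score")
def score(features: Features) -> dict:
    """Return a churn-risk score for a single customer."""
    row = [[features.age, features.tenure_months, features.monthly_spend]]
    return {"churn_risk": float(model.predict_proba(row)[0][1])}
```

Run with a server such as uvicorn, this gives downstream applications a single entry point whose behavior can be monitored and improved over subsequent sprints.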
Scrum’s iterative approach ensures that deployment is smooth and efficient. Each sprint can focus on specific deployment tasks, such as integration, testing, and training. Continuous feedback during sprint reviews ensures that the deployment process is aligned with business objectives and that any issues are promptly addressed.
For organizations accustomed to working with iterative frameworks like Scrum, CRISP-DM provides a structured approach that complements them, guiding teams through discovery, data preparation, and model development when clarity and rigor are needed most: at the onset of the process.
As products become more reliant on AI, machine learning, and data science, the challenge of handling the uncertain outcomes of the exploratory work they entail only intensifies. A hybrid approach that allows teams to adapt while adhering to a structured, thorough process for data handling and model validation is critical.