Multi-channel, Multi-granularity Forecasting on High-Velocity Data

Abstract: 

Introduction:
With the advent of real-time delivery, many customer-facing industries have enormous volumes of data flowing into their systems every minute. Businesses are forced to use this data to efficiently manage daily operations, forecast future demand, and make strategic and tactical decisions to maintain a competitive advantage within their industry. Forecasting is one such use case that has continued to have its place in monitoring, managing, and planning in every industry. In large-scale organisations, each stakeholder or business vertical has its own requirements in terms of aggregation level as well as the time granularity of each forecast.
Consider an organisation selling its products or services to customers: the organisation would necessarily have multiple touch points or channels through which it interacts with customers to complete transactions. The business would be interested in a forecast of demand at the overall level to understand the top line, but in reality, to manage, run, or plan the actual sales, it is necessary to have forecasts at a granular level and for different time horizons. For channel expansion, the business needs a forecast over a longer horizon (for example, 12 months) at a coarser granularity such as monthly or quarterly, whereas for monitoring the health of each channel to ensure uninterrupted service, it is better to forecast both the value and the volume of transactions at each channel at a finer granularity such as the minute level. At the raw level, the data used for each of these forecasts might be the same, but the different forecasts, time granularities, and target-variable designs can add up to building and maintaining hundreds of forecasting models. The size of the data, along with building, monitoring, and maintaining the health of hundreds of models, requires a team of data scientists and a streamlined framework to deliver these forecasts within a minimal time frame.
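As a simple illustration of how the same raw feed can serve multiple forecast designs, the sketch below (in Python with pandas; the column names, channels, and values are hypothetical, not from the actual system) aggregates one transaction log into minute-level per-channel volume and value, and into monthly overall value:

```python
# Illustrative sketch: deriving multiple forecasting targets from the same
# raw transaction feed. Column names (timestamp, channel, amount) and the
# sample rows are assumptions for illustration only.

import pandas as pd

raw = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2023-01-01 00:00:30", "2023-01-01 00:00:45", "2023-01-01 00:01:10"]
    ),
    "channel": ["web", "branch", "web"],
    "amount": [120.0, 75.5, 40.0],
})

# Minute-level, per-channel volume and value: for monitoring channel health.
per_channel_minute = (
    raw.groupby([pd.Grouper(key="timestamp", freq="1min"), "channel"])["amount"]
       .agg(["count", "sum"])
       .rename(columns={"count": "volume", "sum": "value"})
)

# Monthly, overall value: for longer-horizon planning and channel expansion.
overall_monthly = raw.set_index("timestamp")["amount"].resample("MS").sum()

print(per_channel_minute)
print(overall_monthly)
```

Each target variable then feeds its own family of models, which is how a single raw feed multiplies into hundreds of forecasting tasks.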
This situation is often met with another problem: no single model family fits all time series, from traditional models such as ARIMA, VAR, or the smoothing family, to ensemble-based approaches including GBM, XGBoost, and LightGBM, to neural networks including RNNs and LSTMs. All of this implies that a significant degree of experimentation, at scale, is needed to achieve the desired results.

Problem statement:
The main goal is to build, test, and deploy the hundreds of models that provide micro-level forecasts of transactional and operational data to the business, at various aggregation levels and across various time granularities, to facilitate both tactical and strategic decision making.
The solution is expected to use a scalable data pipeline, support robust model development with a high degree of experimentation and evaluation, and be delivered within the constraints of the delivery time frame.

Challenges:
1. Each time series is likely to have its own patterns, including sparsity, varied time zones, varied operational settings, and varied frequencies.
2. Traditional time-series models might not be able to capture the volatility and sparsity of the most granular time series when producing minute-level forecasts.
3. With tens of data scientists developing these hundreds of models, there is a risk of significant variance creeping into modelling decisions, including varied but non-exhaustive feature considerations, multiple evaluation schemes, and reliance on a personalised subset of modelling algorithms.
4. The time taken to develop hundreds of models with limited resources, and the quality of those models, can be unpredictable.
5. Different pipelines built by multiple data scientists for each of these models can be difficult to eventually integrate, maintain, scale, and monitor.

Solution design:
The proposed solution and methodology address this problem statement while accounting for the challenges mentioned above. The solution involves the development of an object-oriented framework to facilitate the end-to-end model development life cycle, guided by six key principles:
1. Centralised orchestration: a low-code/no-code framework that leverages a central config file to control the end-to-end model development and experimentation with no manual intervention (see the sketch after this list).
2. Standardised decisioning: the entire pipeline, including data preprocessing, feature engineering, model development, model evaluation, scoring, and post-scoring, is common to all models and streamlined.
3. Flexible solutioning: although the pipeline is standardised across all processes, it allows team members to bring in custom changes without breaking the overall pipeline flow.
4. Scalability in integration and inferencing: the pipeline is designed and developed to scale efficiently to additional channels or higher data volumes.
5. Ease of development: the pipeline is structured to accommodate the standard deployment framework followed within the organisation. Having a common framework for development and deployment reduces post-code-conversion time.
6. Near real-time consumption: forecast outputs can be consumed in near real time.
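The sketch below illustrates what config-driven orchestration of such a pipeline could look like; the config keys, step names, and registry mechanism are illustrative assumptions, not the actual framework:

```python
# Minimal, illustrative sketch of config-driven orchestration.
# Step names, config keys, and functions are assumptions for illustration,
# not the production framework described here.

from typing import Callable, Dict

# Central config: one place to control the end-to-end experiment.
CONFIG = {
    "channels": ["web", "mobile", "branch"],
    "granularities": ["minute", "hour", "day", "month"],
    "model_families": ["arima", "lightgbm", "lstm"],
    "steps": ["preprocess", "feature_engineering", "train", "evaluate", "score"],
}

STEP_REGISTRY: Dict[str, Callable[[dict], None]] = {}

def register(name: str):
    """Register a pipeline step under the name referenced in the config."""
    def decorator(fn: Callable[[dict], None]):
        STEP_REGISTRY[name] = fn
        return fn
    return decorator

@register("preprocess")
def preprocess(cfg: dict) -> None:
    print(f"preprocessing raw transactions for channels: {cfg['channels']}")

@register("feature_engineering")
def feature_engineering(cfg: dict) -> None:
    print(f"building features at granularities: {cfg['granularities']}")

@register("train")
def train(cfg: dict) -> None:
    print(f"training model families: {cfg['model_families']}")

@register("evaluate")
def evaluate(cfg: dict) -> None:
    print("evaluating candidate models per channel/granularity segment")

@register("score")
def score(cfg: dict) -> None:
    print("scoring and publishing forecasts")

def run_pipeline(cfg: dict) -> None:
    """Execute the steps listed in the config, in order, with no manual intervention."""
    for step_name in cfg["steps"]:
        STEP_REGISTRY[step_name](cfg)

if __name__ == "__main__":
    run_pipeline(CONFIG)
```

Changing the experiment then amounts to editing the config rather than touching any pipeline code.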

To achieve the above principles, we built the solution using a Python-PySpark-H2O-Tableau-Kafka stack and developed the following key components:
1. Common utilities: we built separate common utilities for each of the independent processes to make execution standardised and error-free. These utilities also reduce time by eliminating duplicated work. For feature engineering, the utilities are built to cater to all use cases as an on-demand, exhaustive feature store.
2. Process design: the execution of the experimentation, from feature choices to model exploration, is designed to seamlessly capture all the details of each experiment with appropriate version control. The entire process is built in adherence to the organisation's development framework.
3. Flow design: the end-to-end pipeline can be executed without manual intervention once the desired config changes are made. Inter-process dependencies are predefined, which allows parallel computation of independent processes (see the sketch after this list).
4. Tools and technology integration: for seamless modelling in a distributed computing environment, tools such as Sparkling Water and Driverless AI are integrated into the main orchestrated pipeline and can be controlled through the same config file.
5. Continuous feedback: we developed an in-house dashboard to continuously display the performance of the latest model built at each segment level. This dashboard, shared with the business, allowed us to get continuous feedback on points of improvement and any other requirements.
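To illustrate how predefined inter-process dependencies can drive parallel execution of independent steps, here is a small, hypothetical dependency-driven scheduler; the task names and the dependency graph are assumptions for illustration only:

```python
# Illustrative sketch: running independent pipeline stages in parallel once
# their predefined dependencies are satisfied. The task names and graph are
# assumptions for illustration, not the actual production DAG.

from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

# task -> set of tasks it depends on (assumed to be acyclic)
DEPENDENCIES = {
    "load_raw": set(),
    "minute_features": {"load_raw"},
    "daily_features": {"load_raw"},
    "minute_model": {"minute_features"},
    "daily_model": {"daily_features"},
    "publish": {"minute_model", "daily_model"},
}

def run_task(name: str) -> str:
    print(f"running {name}")
    return name

def run_dag(deps: dict) -> None:
    done, running = set(), {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        while len(done) < len(deps):
            # submit every task whose dependencies are already satisfied
            for task, needs in deps.items():
                if task not in done and task not in running and needs <= done:
                    running[task] = pool.submit(run_task, task)
            # wait until at least one running task finishes, then record it
            finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
            for task, fut in list(running.items()):
                if fut in finished:
                    done.add(fut.result())
                    del running[task]

if __name__ == "__main__":
    run_dag(DEPENDENCIES)
```

In this sketch, the two feature-engineering tasks (and later the two model-training tasks) run concurrently because neither depends on the other, mirroring how the predefined dependencies in the flow design allow independent processes to be computed in parallel.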

Bio: 

Naveen Rathani is a Quantitative Analytics Manager and heads the India delivery on AI/ML for COO and HR within Enterprise Analytics and Data Science (EADS). Naveen is an experienced predictive modelling expert and certified ML/AI professional with a passion for solving real-world business challenges using data and analytics. He is proficient in deploying complex machine learning and statistical modelling algorithms. Previously, he worked in retail, telecom, and insurance for the European and North American markets. Naveen holds an MS degree in mathematics and is currently pursuing a management degree from IIM Bangalore.
