Abstract: Our task as data scientists at Transfix is to predict the cost to move a truckload of freight between any two points in the continental US at any time in the next 18 months. Cost prediction is not a unique business problem to solve. What makes it unique for us at Transfix is the data (or lack thereof) available to support the large number of predictions we must make across geographies, seasons, business cycles, and equipment types. There are over a million routes (or lanes, as they’re called in the industry) between 3-digit zip code prefixes that require a prediction from us. However, unlike the stock market, there is no single market price that defines the transaction of moving goods between two points. The unit economics, route preferences, and business goals of different-sized trucking companies and owner-operators result in a wide range of viable rates at which one can book a truck among the hundreds of thousands of carriers in this highly fragmented market.
Our internal data represents such a small percentage of the market that we have to supplement it with data from third-party aggregators. These providers collect and aggregate rates up to the metropolitan-area level, combining reported rates from fleets both large and small into a single time series of average monthly truck cost going back many years. This leaves us with, for example, 60 observations per lane if we have 5 years’ worth of historical data. Unfortunately, receiving aggregates rather than raw data limits our ability to correct for biases arising from sample size, the composition of the companies reporting the rates, and the geographic uniformity of rates reported for each lane. With minimal industry access to more granular time series data at such a wide scale, how can we deploy models that generalize well and are accurate enough to impact our business?
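To make the aggregate-only constraint concrete, here is a minimal sketch (with hypothetical lane IDs and rates, not real Transfix or aggregator data) of how many individual carrier reports collapse into a single monthly average per lane, leaving just 5 × 12 = 60 points over five years:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical reported rates: (lane, month, rate_usd), as an aggregator
# might collect them from many carriers before averaging.
raw_rates = [
    ("100-900", "2023-01", 2150.0),
    ("100-900", "2023-01", 2410.0),
    ("100-900", "2023-02", 2275.0),
]

# Collapse all reports for a lane-month into one average, mirroring the
# aggregate-only view we receive: one number per lane per month.
monthly = defaultdict(list)
for lane, month, rate in raw_rates:
    monthly[(lane, month)].append(rate)
lane_series = {key: mean(rates) for key, rates in monthly.items()}

print(lane_series[("100-900", "2023-01")])  # 2280.0
print(5 * 12)  # 60 monthly observations per lane over five years
```

The wide spread between the two January reports above (a $260 gap on one lane-month) is exactly the variance that vanishes once only the average survives.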
The spirit of this talk is less about predicting the cost of a truck than it is about overcoming data sparsity and an erratic data-generating process with creative approaches to time series modeling. Come learn how we use Facebook’s Prophet algorithm to address the noise in our data, seasonal ARIMA models to forecast our time series, and AWS Fargate to train and deliver models at scale.
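As a flavor of the seasonal forecasting problem the talk covers, here is an illustrative seasonal-naive baseline on a synthetic 60-month series. This is not the Prophet or SARIMA pipeline itself, just the standard baseline such seasonal models must beat, and all data in it is made up:

```python
import math

def seasonal_naive_forecast(history, horizon, period=12):
    """Forecast each future step with the value from one seasonal period
    (e.g. 12 months) earlier -- a common baseline for short monthly series."""
    forecast = []
    for h in range(horizon):
        # Corresponding month in the last fully observed season.
        forecast.append(history[len(history) - period + (h % period)])
    return forecast

# Synthetic 5-year monthly rate series with annual seasonality (illustrative only).
history = [2000 + 150 * math.sin(2 * math.pi * m / 12) for m in range(60)]

# An 18-month horizon, matching the prediction task described above.
fcst = seasonal_naive_forecast(history, horizon=18)
print(len(fcst))                   # 18
print(fcst[0] == history[48])      # True: first month repeats last observed January
print(fcst[12] == fcst[0])         # True: the seasonal pattern repeats annually
```

With only 60 observations per lane, even a baseline this simple can be hard to beat, which is part of why model choice and noise handling matter so much here.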
Bio: Filip Piasevoli is a Senior Data Scientist at Transfix, the leading freight marketplace connecting shippers to a national network of reliable carriers. The tools of his trade are Python, whiteboard sessions with his team, a dusty copy of The Elements of Statistical Learning, a large carafe of pour-over coffee, and StackOverflow. Fortune 500 companies such as Anheuser-Busch, Unilever, and Target rely on Transfix to handle their most important FTL freight needs. With instant pricing tools, guaranteed capacity, data-driven insights, and reliable service, Transfix is changing the world of transportation one load at a time. Transfix was named one of Forbes' "Next Billion-Dollar Companies" in 2018.
Prior to joining Transfix, Piasevoli worked as a Data Scientist at 1010data, where he designed the analysis and user interface of a market intelligence tool reporting on the shopping behavior of 5 million U.S. consumers in more than 170 metro areas. Piasevoli holds a Master’s Degree in Computational Science and Engineering from Harvard University and a Bachelor’s Degree in Mathematics from Boston College.
Senior Data Scientist | Transfix