Designing data for development
Training a model requires data, and in the era of big data, deep learning, and large-scale compute, datasets have become unwieldy and cost-prohibitive to iterate on.
The first step in creating a model is to carve out a workable dataset: one with the same shape and types as the original, but only a fraction of the size. A workable dataset matters because it allows for much quicker development and testing.
But that is only half the problem. The other half is having a representative dataset: one with the same distribution as the original. This matters because the model is then trained on data that behaves like the full dataset, so its accuracy and reliability are more indicative of the results we would get from training on the entire dataset.
In this post we will go over what it means for a dataset to be workable and representative, and how both properties streamline the development process for creating a model.
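Before diving in, here is a rough sketch of the two ideas with pandas. The DataFrame and its `user`, `item`, and `rating` columns are toy placeholders rather than the real Ratebeer schema, and stratifying on the rating value is just one simple way of keeping the rating distribution intact.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real reviews table; column names are placeholders.
rng = np.random.default_rng(0)
reviews = pd.DataFrame({
    "user": rng.integers(0, 500, size=10_000),
    "item": rng.integers(0, 200, size=10_000),
    "rating": rng.choice([5, 10, 13, 15, 17, 20], size=10_000,
                         p=[0.05, 0.10, 0.20, 0.30, 0.25, 0.10]),
})

# Workable: same columns and types as the original, a fraction of the rows.
workable = reviews.sample(frac=0.1, random_state=42)

# Representative: stratify on the rating so the subsample keeps
# (approximately) the same rating distribution as the full data.
representative = reviews.groupby("rating").sample(frac=0.1, random_state=42)

# The representative sample's rating distribution should track the original.
print(reviews["rating"].value_counts(normalize=True).sort_index())
print(representative["rating"].value_counts(normalize=True).sort_index())
```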
Getting started
To get started we will use the Ratebeer dataset, a collection of over 1 million beer reviews with information about the beer, the brewery, and the user who wrote the review. It is used for evaluating recommendation models, which makes it a good candidate for seeing the importance of a workable and representative dataset: it is too large for iterative work, and it has associated papers that can guide us in creating a representative subsample.
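As a first step, here is a minimal sketch of sizing up the data, assuming the raw Ratebeer dump has already been parsed into one row per review; the file path and column names below are placeholders.

```python
import pandas as pd

# Placeholder path and column names; the raw Ratebeer dump needs to be
# parsed into one row per review before this step.
reviews = pd.read_csv("ratebeer_reviews.csv")

n_reviews = len(reviews)
n_users = reviews["user"].nunique()
n_items = reviews["item"].nunique()

print(f"{n_reviews:,} reviews from {n_users:,} users on {n_items:,} beers")
# How sparse the user-item matrix actually is.
print(f"density: {n_reviews / (n_users * n_items):.6f}")
```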
The recommendation model we will fit is an SVD model, a matrix factorization approach used for collaborative filtering. It is a good candidate for this dataset because it is a simple model that can be trained on a large dataset and is easy to interpret.
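Concretely, the matrix factorization behind this kind of SVD model (the formulation that Surprise's `SVD` implements) predicts a rating as a baseline plus a dot product of latent factors:

\[
\hat{r}_{ui} = \mu + b_u + b_i + q_i^\top p_u
\]

where \(\mu\) is the global mean rating, \(b_u\) and \(b_i\) are the user and item biases, and \(p_u\) and \(q_i\) are the latent factor vectors learned by minimizing the regularized squared error over the observed ratings.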
The problem is that its training time grows as the dataset grows. Time complexity is commonly described with Big O notation, which expresses how the running time of an algorithm grows as the size of its input grows. The SVD model has a time complexity of \(O(mn\min(m,n))\), where \(m\) is the number of users and \(n\) is the number of items. But \(mn\) corresponds to a fully dense matrix, and the Ratebeer rating matrix is sparse, so we can approximate the time complexity as \(O(r\min(m,n))\), where \(r\) is the number of ratings, which will be much smaller than \(mn\). Therefore we need to focus on lowering the number of reviews in the dataset to reduce our time complexity.
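As a back-of-the-envelope check, assume \(\min(m,n)\) stays roughly the same after subsampling and take the one million reviews mentioned above as a round figure. Cutting the sample to 100,000 reviews shrinks the dominant term proportionally:

\[
\frac{r'\,\min(m,n)}{r\,\min(m,n)} = \frac{r'}{r} = \frac{100{,}000}{1{,}000{,}000} = \frac{1}{10},
\]

so we can expect each training run on the subsample to be roughly an order of magnitude faster.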
Fitting the model will be done with the Python library Surprise, a Python scikit for building and analyzing recommender systems that deal with explicit rating data. The library is built on top of SciPy and NumPy and is easy to use and understand.
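Here is a minimal sketch of what fitting looks like with Surprise, assuming the subsampled reviews have been saved with hypothetical `user`, `item`, and `rating` columns, and assuming Ratebeer's overall score is on a 0-20 scale.

```python
import pandas as pd
from surprise import SVD, Dataset, Reader
from surprise.model_selection import cross_validate

# Placeholder path and column names for the workable, representative sample.
reviews = pd.read_csv("ratebeer_reviews_sample.csv")

# Assumption: the overall score ranges from 0 to 20.
reader = Reader(rating_scale=(0, 20))
data = Dataset.load_from_df(reviews[["user", "item", "rating"]], reader)

algo = SVD()  # Surprise defaults: 100 latent factors, 20 epochs of SGD
cross_validate(algo, data, measures=["RMSE", "MAE"], cv=3, verbose=True)
```

Because the sample is representative, the cross-validated error on it should track what we would see on the full dataset, while each fit finishes in a fraction of the time.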